# US Health Medicare Payments Exploratory Data Analysis with Spark

### Examine Four Years of US Health Medicare Payments Data Using Spark

Combine four years of data and perform some preliminary data analysis.


In [None]:
#Mount Google Drive so we can retrieve the data
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
#There are four ZIP files of health payments data (2014 to 2017) plus the combined Parquet file. Source  XXXXXXX
!ls "/content/gdrive/My Drive/Health Data/"

Health14to17.parquet  PGYR15_P011819.ZIP  PGYR17_P011819.ZIP
PGYR14_P011819.ZIP    PGYR16_P011819.ZIP


In [None]:
#Unzip the files
#!rm -r *
!unzip -qq "/content/gdrive/My Drive/Health Data/PGYR14_P011819.ZIP"
!unzip -qq "/content/gdrive/My Drive/Health Data/PGYR15_P011819.ZIP"
!unzip -qq "/content/gdrive/My Drive/Health Data/PGYR16_P011819.ZIP"
!unzip -qq "/content/gdrive/My Drive/Health Data/PGYR17_P011819.ZIP"

In [None]:
#Examine the README file OP_PGYR2014_README_P01182019.txt

!cat OP_PGYR2014_README_P01182019.txt

Filename: OP_PGYR2014_README_P01182019.txt
Version: 1.0
Date: January 2019

1. Program Year 2014 Data Files

This data set includes records submitted for the 2014 program year that have been matched with total confidence to a particular covered recipient (i.e., physician or teaching hospital) and displays information about that recipient. This data set includes the most recent attested-to data for Program Year 2014 as of December 31, 2018.

The data set contained in the comma-separated values (CSV) file includes only the data that is eligible for publication. Consult the Open Payments Methodology and Data Dictionary Document for an explanation of the criteria that the Centers for Medicare and Medicaid Services (CMS) used to determine what data to publish. This document can be found on the Resources page of the Open Payments website (https://www.cms.gov/OpenPayments/About/Resources.html). The Methodology and Data Dictionary Document also includes information on the data collecti

In [None]:
!ls

gdrive
OP_DTL_GNRL_PGYR2014_P01182019.csv
OP_DTL_GNRL_PGYR2015_P01182019.csv
OP_DTL_GNRL_PGYR2016_P01182019.csv
OP_DTL_GNRL_PGYR2017_P01182019.csv
OP_DTL_OWNRSHP_PGYR2014_P01182019.csv
OP_DTL_OWNRSHP_PGYR2015_P01182019.csv
OP_DTL_OWNRSHP_PGYR2016_P01182019.csv
OP_DTL_OWNRSHP_PGYR2017_P01182019.csv
OP_DTL_RSRCH_PGYR2014_P01182019.csv
OP_DTL_RSRCH_PGYR2015_P01182019.csv
OP_DTL_RSRCH_PGYR2016_P01182019.csv
OP_DTL_RSRCH_PGYR2017_P01182019.csv
OP_PGYR2014_README_P01182019.txt
OP_PGYR2015_README_P01182019.txt
OP_PGYR2016_README_P01182019.txt
OP_PGYR2017_README_P01182019.txt
OP_REMOVED_DELETED_PGYR2014_P01182019.csv
OP_REMOVED_DELETED_PGYR2015_P01182019.csv
OP_REMOVED_DELETED_PGYR2016_P01182019.csv
OP_REMOVED_DELETED_PGYR2017_P01182019.csv
sample_data


#### ANALYSIS  
Each ZIP file contains four CSV data files and a README. We are interested in the four General Payments files for 2014 to 2017: OP_DTL_GNRL_PGYR2014_P01182019.csv through OP_DTL_GNRL_PGYR2017_P01182019.csv.

Each of these files contains the General Payments reported for one program year. General Payments are defined as payments or other transfers of value made to a covered recipient (physician or teaching hospital) that are not made in connection with a research agreement or research protocol.


# **Install and Load Up Spark**

In [None]:
#Install the latest version of Spark as of the current date: 2.4.3

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
#Import the csv files for 2014 to 2017, examine and aggregate into one file. 


df2014 = spark.read.format("csv").option("inferSchema", True).option("header", True).load('OP_DTL_GNRL_PGYR2014_P01182019.csv')
df2015 = spark.read.format("csv").option("inferSchema", True).option("header", True).load('OP_DTL_GNRL_PGYR2015_P01182019.csv')
df2016 = spark.read.format("csv").option("inferSchema", True).option("header", True).load('OP_DTL_GNRL_PGYR2016_P01182019.csv')
df2017 = spark.read.format("csv").option("inferSchema", True).option("header", True).load('OP_DTL_GNRL_PGYR2017_P01182019.csv')
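The four read calls above differ only in the program year, so the file names can be generated instead of typed out. The sketch below only builds the paths; the `years` and `paths` names are illustrative and not part of the original notebook.

```python
# Program years covered by the four General Payments files.
years = [2014, 2015, 2016, 2017]

# Build each file name from the naming pattern seen in the directory listing.
paths = {year: f"OP_DTL_GNRL_PGYR{year}_P01182019.csv" for year in years}

for year, path in paths.items():
    print(year, path)

# With a live SparkSession named `spark`, each file could then be read with, for example:
#   frames = {y: spark.read.csv(p, header=True, inferSchema=True) for y, p in paths.items()}
```

Keeping the DataFrames in a dictionary keyed by year also makes later per-year steps (selecting columns, counting records) a single loop.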

In [None]:
#Examine the data to see if fields match up.

df2014.limit(5).toPandas()

Unnamed: 0,Change_Type,Covered_Recipient_Type,Teaching_Hospital_CCN,Teaching_Hospital_ID,Teaching_Hospital_Name,Physician_Profile_ID,Physician_First_Name,Physician_Middle_Name,Physician_Last_Name,Physician_Name_Suffix,Recipient_Primary_Business_Street_Address_Line1,Recipient_Primary_Business_Street_Address_Line2,Recipient_City,Recipient_State,Recipient_Zip_Code,Recipient_Country,Recipient_Province,Recipient_Postal_Code,Physician_Primary_Type,Physician_Specialty,Physician_License_State_code1,Physician_License_State_code2,Physician_License_State_code3,Physician_License_State_code4,Physician_License_State_code5,Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country,Total_Amount_of_Payment_USDollars,Date_of_Payment,Number_of_Payments_Included_in_Total_Amount,Form_of_Payment_or_Transfer_of_Value,Nature_of_Payment_or_Transfer_of_Value,City_of_Travel,State_of_Travel,Country_of_Travel,Physician_Ownership_Indicator,Third_Party_Payment_Recipient_Indicator,Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value,Charity_Indicator,Third_Party_Equals_Covered_Recipient_Indicator,Contextual_Information,Delay_in_Publication_Indicator,Record_ID,Dispute_Status_for_Publication,Product_Indicator,Name_of_Associated_Covered_Drug_or_Biological1,Name_of_Associated_Covered_Drug_or_Biological2,Name_of_Associated_Covered_Drug_or_Biological3,Name_of_Associated_Covered_Drug_or_Biological4,Name_of_Associated_Covered_Drug_or_Biological5,NDC_of_Associated_Covered_Drug_or_Biological1,NDC_of_Associated_Covered_Drug_or_Biological2,NDC_of_Associated_Covered_Drug_or_Biological3,NDC_of_Associated_Covered_Drug_or_Biological4,NDC_of_Associated_Covered_Drug_or_Biological5,Name_of_Associated_Covered_Device_or_Medical_Supply1,Name_of_Associated_Covered_Device_or_Medical_Supply2,Name_of_Associated_Covered_Device_or_Medical_Supply3,Name_of_Associated_Covered_Device_or_Medical_Supply4,Name_of_Associated_Covered_Device_or_Medical_Supply5,Program_Year,Payment_Publication_Date
0,UNCHANGED,Covered Recipient Teaching Hospital,360059.0,1574.0,Metro Health Medical Center,,,,,,2500 Metrohealth Dr.,,Cleveland,OH,44109,United States,,,,,,,,,,"Koven Technology, Inc.",100000000191,"Koven Technology, Inc.",MO,United States,365.0,11/25/2014,1,Cash or cash equivalent,Gift,,,,,No Third Party Payment,,,,,No,106273408,No,,,,,,,,,,,,BT5M5S8AS,,,,,2014,01/18/2019
1,UNCHANGED,Covered Recipient Physician,,,,349191.0,Dale,,Buchbinder,,6569 N. Charles St.,,Baltimore,MD,21204,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Surgery|Va...,MD,,,,,"Koven Technology, Inc.",100000000191,"Koven Technology, Inc.",MO,United States,55.64,12/11/2014,1,Cash or cash equivalent,Food and Beverage,,,,No,No Third Party Payment,,No,,,No,106272962,No,,,,,,,,,,,,,,,,,2014,01/18/2019
2,UNCHANGED,Covered Recipient Physician,,,,349191.0,Dale,,Buchbinder,,6569 N. Charles St.,,Baltimore,MD,21204,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Surgery|Va...,MD,,,,,"Koven Technology, Inc.",100000000191,"Koven Technology, Inc.",MO,United States,107.4,12/11/2014,1,Cash or cash equivalent,Royalty or License,,,,No,No Third Party Payment,,No,,,No,106272964,No,,,,,,,,,,,,,,,,,2014,01/18/2019
3,UNCHANGED,Covered Recipient Physician,,,,543097.0,Rajesh,V,Lalla,,145 Webster Hill Blvd.,,West Hartford,CT,6107,United States,,,Doctor of Dentistry,Dental Providers|Dentist,CT,,,,,"FERA PHARMACEUTICALS, LLC",100000010769,"FERA PHARMACEUTICALS, LLC",NY,United States,250.0,05/23/2014,1,Cash or cash equivalent,Consulting Fee,,,,No,No Third Party Payment,,No,,,No,106451934,No,Covered,Moxatag,,,,,58463-0002-3,,,,,,,,,,2014,01/18/2019
4,UNCHANGED,Covered Recipient Physician,,,,61784.0,ERIC,,WALSH,,588 PAWTUCKET AVE,,PAWTUCKET,RI,2860,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Orthopaedi...,RI,,,,,"Surgi-Care, Inc.",100000005671,"Surgi-Care, Inc.",MA,United States,121.63,10/20/2014,1,Cash or cash equivalent,Food and Beverage,,,,No,No Third Party Payment,,,,,No,107350616,No,Covered,,,,,,,,,,,QmedRX,,,,,2014,01/18/2019


In [None]:
df2017.limit(5).toPandas()

Unnamed: 0,Change_Type,Covered_Recipient_Type,Teaching_Hospital_CCN,Teaching_Hospital_ID,Teaching_Hospital_Name,Physician_Profile_ID,Physician_First_Name,Physician_Middle_Name,Physician_Last_Name,Physician_Name_Suffix,Recipient_Primary_Business_Street_Address_Line1,Recipient_Primary_Business_Street_Address_Line2,Recipient_City,Recipient_State,Recipient_Zip_Code,Recipient_Country,Recipient_Province,Recipient_Postal_Code,Physician_Primary_Type,Physician_Specialty,Physician_License_State_code1,Physician_License_State_code2,Physician_License_State_code3,Physician_License_State_code4,Physician_License_State_code5,Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country,Total_Amount_of_Payment_USDollars,Date_of_Payment,Number_of_Payments_Included_in_Total_Amount,Form_of_Payment_or_Transfer_of_Value,Nature_of_Payment_or_Transfer_of_Value,City_of_Travel,State_of_Travel,Country_of_Travel,Physician_Ownership_Indicator,Third_Party_Payment_Recipient_Indicator,Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value,Charity_Indicator,Third_Party_Equals_Covered_Recipient_Indicator,Contextual_Information,Delay_in_Publication_Indicator,Record_ID,Dispute_Status_for_Publication,Related_Product_Indicator,Covered_or_Noncovered_Indicator_1,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_1,Product_Category_or_Therapeutic_Area_1,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_1,Associated_Drug_or_Biological_NDC_1,Covered_or_Noncovered_Indicator_2,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_2,Product_Category_or_Therapeutic_Area_2,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_2,Associated_Drug_or_Biological_NDC_2,Covered_or_Noncovered_Indicator_3,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_3,Product_Category_or_Therapeutic_Area_3,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_3,Associated_Drug_or_Biological_NDC_3,Covered_or_Noncovered_Indicator_4,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_4,Product_Category_or_Therapeutic_Area_4,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_4,Associated_Drug_or_Biological_NDC_4,Covered_or_Noncovered_Indicator_5,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_5,Product_Category_or_Therapeutic_Area_5,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5,Associated_Drug_or_Biological_NDC_5,Program_Year,Payment_Publication_Date
0,UNCHANGED,Covered Recipient Physician,,,,326860,NAZEM,,ABRAHAM,,422 KINETIC DR,SUITE B,HUNTINGTON,WV,25701,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|General Pr...,WV,,,,,Mission Pharmacal Company,100000000186,Mission Pharmacal Company,TX,United States,12.0,11/24/2017,1,In-kind items and services,Food and Beverage,,,,No,No Third Party Payment,,,,,No,421243947,No,Yes,Covered,Drug,Antibacterial (topical),Plexion,57883-402-10,,,,,,,,,,,,,,,,,,,,,2017,01/18/2019
1,UNCHANGED,Covered Recipient Physician,,,,604392,Charles,,Pak,,5323 Harry Hines Blvd,,Dallas,TX,75390,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Internal M...,TX,,,,,Mission Pharmacal Company,100000000186,Mission Pharmacal Company,TX,United States,300.0,12/15/2017,1,"Dividend, profit or other return on investment",Charitable Contribution,,,,No,Entity,Charles Y C Pak Foundation,Yes,Yes,,No,421243939,No,No,,,,,,,,,,,,,,,,,,,,,,,,,,2017,01/18/2019
2,UNCHANGED,Covered Recipient Physician,,,,326860,Nazem,,Abraham,,422 KINETIC DR,SUITE B,HUNTINGTON,WV,25701,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|General Pr...,WV,,,,,Mission Pharmacal Company,100000000186,Mission Pharmacal Company,TX,United States,13.39,06/30/2017,1,In-kind items and services,Food and Beverage,,,,No,No Third Party Payment,,,,,No,421243941,No,Yes,Covered,Drug,Antibacterial (topical),Ovace,0178-0620-02,,,,,,,,,,,,,,,,,,,,,2017,01/18/2019
3,UNCHANGED,Covered Recipient Physician,,,,326860,NAZEM,,ABRAHAM,,422 KINETIC DR,SUITE B,HUNTINGTON,WV,25701,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|General Pr...,WV,,,,,Mission Pharmacal Company,100000000186,Mission Pharmacal Company,TX,United States,13.39,08/18/2017,1,In-kind items and services,Food and Beverage,,,,No,No Third Party Payment,,,,,No,421243943,No,Yes,Covered,Drug,Antibacterial (topical),Avar,0178-0640-30,,,,,,,,,,,,,,,,,,,,,2017,01/18/2019
4,UNCHANGED,Covered Recipient Physician,,,,1307901,HANY,,AHMED,,1919 NORTH LOOP W,SUITE 115,HOUSTON,TX,77008,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|General Pr...,TX,,,,,Mission Pharmacal Company,100000000186,Mission Pharmacal Company,TX,United States,12.03,06/18/2017,1,In-kind items and services,Food and Beverage,,,,No,No Third Party Payment,,,,,No,421243961,No,Yes,Covered,Drug,PRENATAL VITAMIN & MINERAL,CITRANATAL,0178-0796-30,,,,,,,,,,,,,,,,,,,,,2017,01/18/2019


**ANALYSIS**  
Comparing the tables above (only 2014 and 2017 are shown for brevity), it looks like new fields were added for 2016 and 2017. Most fields have the same name across the files, and key fields such as the physician and payment information look the same in every year. For convenience, we drop the fields that do not match the 2014 data and combine the four years into one DataFrame.

In [None]:
#Let's determine which fields are in all files and drop the fields that are not in 2014.
col2014 = df2014.columns
colkeep = [column for column in df2017.columns if column in col2014]


df2014 = df2014.select(colkeep)
df2015 = df2015.select(colkeep)
df2016 = df2016.select(colkeep)
df2017 = df2017.select(colkeep)

print("Fields in respective files...", len(df2014.columns), len(df2015.columns), len(df2016.columns), len(df2017.columns))

Fields in respective files... 49 49 49 49
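The cell above compares only the 2017 columns against 2014 and relies on 2015 and 2016 containing the same fields. A slightly more defensive variant intersects the column lists of all four years. The sketch below uses shortened, hypothetical column lists in place of `df2014.columns` through `df2017.columns`.

```python
# Hypothetical, shortened column lists standing in for df2014.columns ... df2017.columns.
columns_by_year = {
    2014: ["Record_ID", "Program_Year", "Product_Indicator"],
    2015: ["Record_ID", "Program_Year", "Product_Indicator"],
    2016: ["Record_ID", "Program_Year", "Related_Product_Indicator"],
    2017: ["Record_ID", "Program_Year", "Related_Product_Indicator"],
}

# Intersect the column sets of every year, not just two of them.
common = set.intersection(*(set(cols) for cols in columns_by_year.values()))

# Keep the 2014 column order so every year is selected in the same order.
colkeep = [column for column in columns_by_year[2014] if column in common]
print(colkeep)
```

Preserving one fixed column order matters later, because a positional union lines columns up by position rather than by name.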


**ANALYSIS**  
There are 49 common fields among the files. Below we total the record counts of the individual files, combine the four DataFrames into one, confirm that the combined count matches, and save the result as a Parquet file.

In [None]:
#total number of records
print('Total Number of Records....', (df2014.count() + df2015.count() + df2016.count() + df2017.count()))

Total Number of Records.... 43999854


In [None]:
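#DataFrame.union combines only two DataFrames at a time, so fold it over all four with reduce.
#union matches columns by position; every DataFrame was selected with the same colkeep list above.
#(unionAll is a deprecated alias of union since Spark 2.0.)

from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfa):
    return reduce(DataFrame.union, dfa)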

df = unionAll(df2014, df2015, df2016, df2017)
df.cache()
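To see what `reduce` does with a pairwise union, here is a minimal sketch in plain Python. Lists of rows stand in for the yearly DataFrames and list concatenation stands in for the pairwise union method; the row values and the `union_two` helper are illustrative only.

```python
from functools import reduce

# Plain lists stand in for the yearly DataFrames (illustrative rows only).
year_2014 = [("2014", 365.0), ("2014", 55.64)]
year_2015 = [("2015", 12.0)]
year_2016 = [("2016", 13.39)]
year_2017 = [("2017", 300.0)]
parts = [year_2014, year_2015, year_2016, year_2017]

def union_two(left, right):
    """Pairwise 'union': append the rows of right to the rows of left."""
    return left + right

# reduce(f, [a, b, c, d]) evaluates f(f(f(a, b), c), d).
combined = reduce(union_two, parts)

# Same sanity check as in the notebook: the combined count equals the sum of the parts.
print(len(combined), sum(len(part) for part in parts))
```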

In [None]:
print('Number of records in the unified file....', df.count()) #The number of records match the aggregate of the number of records in the individual files.

Number of records in the unified file.... 43999854


In [None]:
#Let's look at the combined dataframe
df.limit(5).toPandas().style.hide_index()

Change_Type,Covered_Recipient_Type,Teaching_Hospital_CCN,Teaching_Hospital_ID,Teaching_Hospital_Name,Physician_Profile_ID,Physician_First_Name,Physician_Middle_Name,Physician_Last_Name,Physician_Name_Suffix,Recipient_Primary_Business_Street_Address_Line1,Recipient_Primary_Business_Street_Address_Line2,Recipient_City,Recipient_State,Recipient_Zip_Code,Recipient_Country,Recipient_Province,Recipient_Postal_Code,Physician_Primary_Type,Physician_Specialty,Physician_License_State_code1,Physician_License_State_code2,Physician_License_State_code3,Physician_License_State_code4,Physician_License_State_code5,Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State,Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country,Total_Amount_of_Payment_USDollars,Date_of_Payment,Number_of_Payments_Included_in_Total_Amount,Form_of_Payment_or_Transfer_of_Value,Nature_of_Payment_or_Transfer_of_Value,City_of_Travel,State_of_Travel,Country_of_Travel,Physician_Ownership_Indicator,Third_Party_Payment_Recipient_Indicator,Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value,Charity_Indicator,Third_Party_Equals_Covered_Recipient_Indicator,Contextual_Information,Delay_in_Publication_Indicator,Record_ID,Dispute_Status_for_Publication,Program_Year,Payment_Publication_Date
UNCHANGED,Covered Recipient Teaching Hospital,360059.0,1574.0,Metro Health Medical Center,,,,,,2500 Metrohealth Dr.,,Cleveland,OH,44109,United States,,,,,,,,,,"Koven Technology, Inc.",100000000191,"Koven Technology, Inc.",MO,United States,365.0,11/25/2014,1,Cash or cash equivalent,Gift,,,,,No Third Party Payment,,,,,No,106273408,No,2014,01/18/2019
UNCHANGED,Covered Recipient Physician,,,,349191.0,Dale,,Buchbinder,,6569 N. Charles St.,,Baltimore,MD,21204,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Surgery|Vascular Surgery,MD,,,,,"Koven Technology, Inc.",100000000191,"Koven Technology, Inc.",MO,United States,55.64,12/11/2014,1,Cash or cash equivalent,Food and Beverage,,,,No,No Third Party Payment,,No,,,No,106272962,No,2014,01/18/2019
UNCHANGED,Covered Recipient Physician,,,,349191.0,Dale,,Buchbinder,,6569 N. Charles St.,,Baltimore,MD,21204,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Surgery|Vascular Surgery,MD,,,,,"Koven Technology, Inc.",100000000191,"Koven Technology, Inc.",MO,United States,107.4,12/11/2014,1,Cash or cash equivalent,Royalty or License,,,,No,No Third Party Payment,,No,,,No,106272964,No,2014,01/18/2019
UNCHANGED,Covered Recipient Physician,,,,543097.0,Rajesh,V,Lalla,,145 Webster Hill Blvd.,,West Hartford,CT,6107,United States,,,Doctor of Dentistry,Dental Providers|Dentist,CT,,,,,"FERA PHARMACEUTICALS, LLC",100000010769,"FERA PHARMACEUTICALS, LLC",NY,United States,250.0,05/23/2014,1,Cash or cash equivalent,Consulting Fee,,,,No,No Third Party Payment,,No,,,No,106451934,No,2014,01/18/2019
UNCHANGED,Covered Recipient Physician,,,,61784.0,ERIC,,WALSH,,588 PAWTUCKET AVE,,PAWTUCKET,RI,2860,United States,,,Medical Doctor,Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Hand Surgery,RI,,,,,"Surgi-Care, Inc.",100000005671,"Surgi-Care, Inc.",MA,United States,121.63,10/20/2014,1,Cash or cash equivalent,Food and Beverage,,,,No,No Third Party Payment,,,,,No,107350616,No,2014,01/18/2019


In [None]:
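#Convert numerical fields to an appropriate datatype
#numerical: Total_Amount_of_Payment_USDollars, Number_of_Payments_Included_in_Total_Amount
#Note: cast('Decimal') with no precision or scale means decimal(10,0), so amounts are rounded to
#whole dollars (the totals below reflect this). Use cast('decimal(18,2)') to keep the cents.

df = df.withColumn('Total_Amount_of_Payment_USDollars', df.Total_Amount_of_Payment_USDollars.cast('Decimal'))
df = df.withColumn('Number_of_Payments_Included_in_Total_Amount', df.Number_of_Payments_Included_in_Total_Amount.cast('Decimal'))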

#Change Payment date to Date Type
from pyspark.sql import functions as Func
df = df.withColumn('Date_of_Payment', Func.to_date('Date_of_Payment', 'MM/dd/yyyy'))
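In Spark SQL, a decimal type given without precision and scale defaults to `decimal(10,0)`, which has no room for cents. The sketch below imitates the effect with Python's `decimal` module, assuming half-up rounding, and also shows the Python equivalent of the `MM/dd/yyyy` date pattern. The sample values come from the 2014 rows shown earlier.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

amount = Decimal("55.64")  # a payment amount from the 2014 sample rows

# Scale 0 (what decimal(10,0) stores): the cents are rounded away.
whole_dollars = amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Scale 2 (what decimal(18,2) would store): the cents survive.
with_cents = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Spark's 'MM/dd/yyyy' pattern corresponds to '%m/%d/%Y' in Python's strptime.
payment_date = datetime.strptime("12/11/2014", "%m/%d/%Y").date()

print(whole_dollars, with_cents, payment_date)
```

This is why the sums reported further down are whole numbers even though the raw amounts carry cents.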

In [None]:
#Save to Parquet

#df.write.parquet("/content/gdrive/My Drive/Health Data/Health14to17.parquet")

# Exploratory Data Analysis on Nearly 44 Million Records!

In [None]:
#Read parquet file
df = spark.read.parquet("/content/gdrive/My Drive/Health Data/Health14to17.parquet")

In [None]:
#Look at the distribution of Total Payments (combined from 2014 to 2017)
import pandas as pd
df.describe('Total_Amount_of_Payment_USDollars').toPandas().style.hide_index()\
.set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "right"})

summary,Total_Amount_of_Payment_USDollars
count,43999839.0
mean,249.9117
stddev,21290.3431381885
min,0.0
max,41414329.0


In [None]:
#Total Payments By Program Year
df.groupby('Program_year').sum('Total_Amount_of_Payment_USDollars').orderBy('Program_year', ascending = False)\
.toPandas().style.hide_index()\
.set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "right"})

#.style.format({'sum(Total_Amount_of_Payment_USDollars)':'${0:,.0f}'})  #breaks formatting in github

Program_year,sum(Total_Amount_of_Payment_USDollars)
2017,2814070049
2016,2817576704
2015,2687032307
2014,2677298935
01/18/2019,18
,97971
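The last two rows of the output are not program years. They most likely come from records whose fields were shifted during CSV parsing, which is an assumption rather than something verified here. A minimal sketch of separating them out, using the totals copied from the table above:

```python
# Rows copied from the output above: (Program_year, summed payments in whole dollars).
totals = [
    ("2017", 2814070049),
    ("2016", 2817576704),
    ("2015", 2687032307),
    ("2014", 2677298935),
    ("01/18/2019", 18),
    (None, 97971),
]

valid_years = {"2014", "2015", "2016", "2017"}

# Split the rows into real program years and stray values.
clean = [(year, total) for year, total in totals if year in valid_years]
stray = [(year, total) for year, total in totals if year not in valid_years]

grand_total = sum(total for _, total in clean)
stray_total = sum(total for _, total in stray)
print(grand_total, stray_total)
```

The stray rows account for a negligible share of the total. In Spark the same idea would be a `filter` on `Program_Year` using `Column.isin` with the four valid years.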


In [None]:
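#Total spending by month
from pyspark.sql import functions as Func

#Use 'yyyy' (calendar year); 'YYYY' is the week-based year and mislabels payments made in the last days of December.
df.select(Func.date_format('Date_of_Payment','MM yyyy').alias('Month'), 'Total_Amount_of_Payment_USDollars')\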
.groupby('Month').sum('Total_Amount_of_Payment_USDollars').sort('Month').toPandas().style.hide_index()\
.set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "right"})

Month,sum(Total_Amount_of_Payment_USDollars)
,6
01 2014,169155805
01 2015,192509866
01 2016,184107334
01 2017,172138967
01 2018,145
01 2101,117
02 2014,262910129
02 2015,284802485
02 2016,304864087


In [None]:
#What are the total payments for 2014 to 2017 by recipient state?
df.groupby('Recipient_State').sum('Total_Amount_of_Payment_USDollars').orderBy('sum(Total_Amount_of_Payment_USDollars)', ascending=False).toPandas()\
.style.hide_index().set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "right"})

Recipient_State,sum(Total_Amount_of_Payment_USDollars)
CA,2585721163.0
NY,876674225.0
TX,744063026.0
FL,561503220.0
MA,552870662.0
PA,531882138.0
OH,375891076.0
IL,343435599.0
NC,288178863.0
TN,280284319.0


#### ANALYSIS  
As you would expect, the largest states receive the most payments. California dwarfs all other states, with roughly 2.6 billion Dollars against 0.9 billion for second-placed New York. Note that there are data entry errors, such as cities coded as states.
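To put a number on how far California stands out, the sketch below works directly from the top five totals in the table above. The variable names are illustrative.

```python
# Top five state totals copied from the output above (whole dollars).
state_totals = {
    "CA": 2585721163,
    "NY": 876674225,
    "TX": 744063026,
    "FL": 561503220,
    "MA": 552870662,
}

# How many times larger California's total is than New York's.
ca_to_ny = state_totals["CA"] / state_totals["NY"]

# Combined total of the four states ranked directly below California.
next_four = sum(total for state, total in state_totals.items() if state != "CA")

print(f"CA / NY: {ca_to_ny:.2f}")
print(f"CA / (NY + TX + FL + MA): {state_totals['CA'] / next_four:.2f}")
```

California alone comes close to the combined total of the next four states.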

In [None]:
# What type of payments are made?
df.groupby('Nature_of_Payment_or_Transfer_of_Value').count().orderBy('count', ascending=False).toPandas()\
.style.hide_index().set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "left"})

Nature_of_Payment_or_Transfer_of_Value,count
Food and Beverage,38385424
Travel and Lodging,2279111
Education,1173001
"Compensation for services other than consulting, including serving as faculty or as a speaker at a venue other than a continuing education program",1002953
Consulting Fee,556955
Gift,265935
Honoraria,97666
Royalty or License,59146
Compensation for serving as faculty or as a speaker for a non-accredited and noncertified continuing education program,40277
Space rental or facility fees(teaching hospital only),39754


In [None]:
#Display the nature of the payment by dollar amount in groups: 0 to $1,000, $1,000 to $10,000, $10,000 to $100,000, $100,000 to $10,000,000, $10,000,000+

splits = [-float("inf"), 0, 1000, 10000, 100000, 10000000, float("inf")]

from pyspark.ml.feature import Bucketizer
bucketizer = Bucketizer(splits = splits, inputCol="Total_Amount_of_Payment_USDollars", outputCol="Total Amount Payment Buckets")
df = bucketizer.transform(df)


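crosspdf = df.stat.crosstab('Nature_of_Payment_or_Transfer_of_Value', 'Total Amount Payment Buckets').toPandas()
crosspdf['TotalRow'] = crosspdf.sum(axis=1, numeric_only=True)  #row totals across the bucket columns
crosspdf.loc[-1] = crosspdf.sum(axis=0)                         #column totals, appended as the last row
crosspdf.iloc[-1, 0] = "TotalCol"                               #label the totals row instead of hard-coding its position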
crosspdf.sort_values('TotalRow', ascending = False).style.hide_index().set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "left"})


############ 1.0 = 0 to $1,000, 2.0 = $1,000 to $10,000, 3.0 = $10,000 to $100,000, 4.0 = $100,000 to $10,000,000, 5.0 = $10,000,000+ ##############

Nature_of_Payment_or_Transfer_of_Value_Total Amount Payment Buckets,1.0,2.0,3.0,4.0,5.0,null,TotalRow
TotalCol,42551031,1376325,64929,7514,40,15,43999854
Food and Beverage,38382946,2403,75,0,0,0,38385424
Travel and Lodging,2174732,102505,1873,1,0,0,2279111
Education,1118094,52965,1896,46,0,0,1173001
"Compensation for services other than consulting, including serving as faculty or as a speaker at a venue other than a continuing education program",261247,733188,8041,470,7,0,1002953
Consulting Fee,232440,300494,23364,657,0,0,556955
Gift,252519,11901,1448,67,0,0,265935
Honoraria,24450,71972,1229,15,0,0,97666
Royalty or License,17433,21103,15300,5281,29,0,59146
Compensation for serving as faculty or as a speaker for a non-accredited and noncertified continuing education program,10116,29560,601,0,0,0,40277


#### ANALYSIS  
The bulk of the payments between 2014 and 2017, some 38 million, are Food and Beverage payments of less than 1,000 Dollars. Travel and Lodging is the second biggest category. Interestingly, there are 40 payments of 10,000,000 Dollars or more, 29 of them in Royalty or License.
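The bucket labels 1.0 to 5.0 in the crosstab follow from the `splits` list: each bucket is half-open, `[lower, upper)`, except the last one, which also includes its upper bound. The sketch below reproduces that assignment in plain Python with `bisect`. The `bucket_index` helper is illustrative and is not part of Spark.

```python
from bisect import bisect_right

# Same split points as in the notebook cell.
splits = [-float("inf"), 0, 1000, 10000, 100000, 10000000, float("inf")]

def bucket_index(amount):
    """Return i such that splits[i] <= amount < splits[i + 1]."""
    index = bisect_right(splits, amount) - 1
    # The last bucket also includes its upper bound, so clamp to the final index.
    return min(index, len(splits) - 2)

for amount in (12.0, 1000.0, 55000.0, 41414329.0):
    print(amount, "->", float(bucket_index(amount)))
```

A payment of exactly 1,000 Dollars therefore lands in bucket 2.0, not 1.0, and bucket 0.0 would only hold negative amounts.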

In [None]:
# What is the Physician Primary Type?
df.groupby('Physician_Primary_Type').count().orderBy('count', ascending=False).toPandas()\
.style.hide_index().set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "left"})

Physician_Primary_Type,count
Medical Doctor,37639654
Doctor of Osteopathy,3681225
Doctor of Dentistry,1203755
Doctor of Optometry,817511
Doctor of Podiatric Medicine,468644
,174606
Chiropractor,14450
DC,3
MO,3
Dental Providers|Dentist,1


In [None]:
# What is the Physician Specialty?
df.groupby('Physician_Specialty').count().orderBy('count', ascending=False).toPandas().head(10)\
.style.hide_index().set_properties(**{'background-color': 'lightgrey', 'color': 'Black','border-color': 'white', "text-align" : "left"})

Physician_Specialty,count
Allopathic & Osteopathic Physicians|Family Medicine,7536387
Allopathic & Osteopathic Physicians|Internal Medicine,6742717
Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease,2581140
Allopathic & Osteopathic Physicians|Psychiatry & Neurology|Neurology,1753926
Allopathic & Osteopathic Physicians|Internal Medicine|Gastroenterology,1414703
Allopathic & Osteopathic Physicians|Psychiatry & Neurology|Psychiatry,1412532
"Allopathic & Osteopathic Physicians|Internal Medicine|Endocrinology, Diabetes & Metabolism",1237311
Allopathic & Osteopathic Physicians|Internal Medicine|Hematology & Oncology,1169231
Allopathic & Osteopathic Physicians|Dermatology,1089112
Allopathic & Osteopathic Physicians|Obstetrics & Gynecology,1010884


#### ANALYSIS  
As expected in a health payments dataset, the largest category by number of payments is Medical Doctor, followed by Doctor of Osteopathy and Doctor of Dentistry. With regard to specialty, Family Medicine is the largest category, followed by Internal Medicine.