# Data Importing and Cleaning

For this project, we'll be working with a set of Public Use Files (PUFs) from the Centers for Medicare & Medicaid Services (CMS). Specifically, we will use the Medicare Physician & Other Practitioners by Provider and Service datasets for every available year, 2013 through 2020.

## Steps

Before we can get to any interesting applications for this project, we need to do the following:
- Import the data into Python
- Get the data in a usable format
- Select the variables of interest for our applications

### Importing the Data

First, we need to get the data we wish to use. We begin by downloading it from the CMS data website ([data.cms.gov](https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service)).

Each year's dataset is about 3 GB, so altogether this is about 24 GB of data. We download the files and save them to our project folder. Let's check that each year's file is there. First, we'll import the `os` module and make sure that our project folder is our current working directory.

In [1]:
# Import the os module (and gc, used later to free memory between reads)
import os
import gc
os.getcwd()

'/geode2/home/u010/ausknies/Carbonate/Desktop/Erdos Fall 2022 Project'

Cool, we're good to go. Now, let's check if our data files are there. The data is split into annual releases of `.csv` files, and I have saved them under the name `DXX_Prov_Svc.csv` where `XX` $\in \{13,14,15,16,17,18,19,20\} \subset \mathbb{N}$ for each year of data. 

To check, let's make a list of each of the files we want to use, and then we can build a simple `for` loop that tells us if each file is there.

In [2]:
file_list = ['D13_Prov_Svc.csv','D14_Prov_Svc.csv','D15_Prov_Svc.csv','D16_Prov_Svc.csv',
             'D17_Prov_Svc.csv', 'D18_Prov_Svc.csv','D19_Prov_Svc.csv','D20_Prov_Svc.csv']

for i in file_list:
    print("True or false:",i,"exists.")
    print(os.path.isfile(i))

True or false: D13_Prov_Svc.csv exists.
True
True or false: D14_Prov_Svc.csv exists.
True
True or false: D15_Prov_Svc.csv exists.
True
True or false: D16_Prov_Svc.csv exists.
True
True or false: D17_Prov_Svc.csv exists.
True
True or false: D18_Prov_Svc.csv exists.
True
True or false: D19_Prov_Svc.csv exists.
True
True or false: D20_Prov_Svc.csv exists.
True
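
As a more compact variant of the check above, the file names can be generated from the years and the missing ones collected in a list. This is only a sketch: `expected_files` and `missing_files` are hypothetical helper names, and the naming pattern `DXX_Prov_Svc.csv` is the one we chose when saving the downloads.

```python
from pathlib import Path

def expected_files(first_year=2013, last_year=2020):
    """Build the names D13_Prov_Svc.csv ... D20_Prov_Svc.csv (our own naming scheme)."""
    return [f"D{year % 100:02d}_Prov_Svc.csv" for year in range(first_year, last_year + 1)]

def missing_files(folder="."):
    """Return the expected file names that are not present in `folder`."""
    return [name for name in expected_files() if not (Path(folder) / name).is_file()]

# An empty list means every annual file was found in the current directory
print(missing_files())
```

An empty list here corresponds to eight `True` lines in the loop above.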


Alright, our files are all there. Now, let's read them in.

In [3]:
import pandas as pd

In [4]:
D13 = pd.read_csv("D13_Prov_Svc.csv")
D13['Data_Year'] = 2013
D14 = pd.read_csv("D14_Prov_Svc.csv")
D14['Data_Year'] = 2014
dfmain = pd.concat([D13,D14], ignore_index=True)
del D13
del D14
gc.collect()
D15 = pd.read_csv("D15_Prov_Svc.csv")
D15['Data_Year'] = 2015
dfmain = pd.concat([dfmain,D15], ignore_index=True)
del D15
gc.collect()
D16 = pd.read_csv("D16_Prov_Svc.csv")
D16['Data_Year'] = 2016
dfmain = pd.concat([dfmain,D16], ignore_index=True)
del D16
gc.collect()
D17 = pd.read_csv("D17_Prov_Svc.csv")
D17['Data_Year'] = 2017
dfmain = pd.concat([dfmain,D17], ignore_index=True)
del D17
gc.collect()
D18 = pd.read_csv("D18_Prov_Svc.csv")
D18['Data_Year'] = 2018
dfmain = pd.concat([dfmain,D18], ignore_index=True)
del D18
gc.collect()
D19 = pd.read_csv("D19_Prov_Svc.csv", encoding_errors = "ignore") # had some encoding errors
D19['Data_Year'] = 2019
dfmain = pd.concat([dfmain,D19], ignore_index=True)
del D19
gc.collect()
D20 = pd.read_csv("D20_Prov_Svc.csv", encoding_errors = "ignore") # had some encoding errors
D20['Data_Year'] = 2020
dfmain = pd.concat([dfmain,D20], ignore_index=True)
del D20
gc.collect()


0
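
The read, tag, and concatenate pattern in the cell above repeats once per year, so it can also be written as a loop. The sketch below assumes the same file naming scheme; `load_years` and its `lenient_years` argument are hypothetical names introduced for illustration, and `encoding_errors` requires pandas 1.3 or later.

```python
import gc
import pandas as pd

def load_years(years, folder=".", lenient_years=(2019, 2020), **read_kwargs):
    """Read one CSV per year, tag each row with Data_Year, and stack the results."""
    dfmain = None
    for year in years:
        path = f"{folder}/D{year % 100:02d}_Prov_Svc.csv"
        kwargs = dict(read_kwargs)
        if year in lenient_years:
            # these files had some encoding errors in our download
            kwargs["encoding_errors"] = "ignore"
        df = pd.read_csv(path, **kwargs)
        df["Data_Year"] = year
        # grow the main frame one year at a time, as in the cell above
        dfmain = df if dfmain is None else pd.concat([dfmain, df], ignore_index=True)
        del df
        gc.collect()  # release the per-year frame before reading the next file
    return dfmain

# Usage (commented out because it reads roughly 24 GB of data):
# dfmain = load_years(range(2013, 2021))
```

Any extra keyword arguments, such as `usecols` or `dtype`, are passed straight through to `pd.read_csv`, which is one way to cut memory use if only a few columns are needed.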

In [5]:
print("Total number of observations:")
print(dfmain['Data_Year'].count())
print("Total number of unique providers:")
print(dfmain['Rndrng_NPI'].nunique())
print("Same number of observations for a different variable?")
print(dfmain['Rndrng_NPI'].count())
print("Difference between count and size from missing values?")
print(dfmain['Rndrng_NPI'].size)

Total number of observations:
77213483
Total number of unique providers:
1490518
Same number of observations for a different variable?
77213483
Difference between count and size from missing values?
77213483
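
The last two checks rely on the difference between `count` and `size`: `count` skips missing values, while `size` counts every row. Since both return 77213483 for `Rndrng_NPI`, that column has no missing values. A toy illustration, with made-up values:

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (not real data)
s = pd.Series([10.0, np.nan, 30.0])

n_non_missing = s.count()   # non-missing entries only
n_total = s.size            # every entry, missing or not
n_missing = s.isna().sum()  # explicit count of missing entries

print(n_non_missing, n_total, n_missing)  # prints: 2 3 1
```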


In [6]:
dfmain.head()

| | Rndrng_NPI | Rndrng_Prvdr_Last_Org_Name | Rndrng_Prvdr_First_Name | Rndrng_Prvdr_MI | Rndrng_Prvdr_Crdntls | Rndrng_Prvdr_Gndr | Rndrng_Prvdr_Ent_Cd | Rndrng_Prvdr_St1 | Rndrng_Prvdr_St2 | Rndrng_Prvdr_City | ... | HCPCS_Drug_Ind | Place_Of_Srvc | Tot_Benes | Tot_Srvcs | Tot_Bene_Day_Srvcs | Avg_Sbmtd_Chrg | Avg_Mdcr_Alowd_Amt | Avg_Mdcr_Pymt_Amt | Avg_Mdcr_Stdzd_Amt | Data_Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | N | F | 138 | 142.0 | 142 | 368.626761 | 132.17007 | 104.299718 | 107.211127 | 2013 |
| 1 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | N | F | 95 | 96.0 | 96 | 524.604167 | 196.932396 | 155.901146 | 157.598854 | 2013 |
| 2 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | N | F | 47 | 61.0 | 61 | 97.0 | 37.688197 | 30.065246 | 30.584918 | 2013 |
| 3 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | N | F | 381 | 777.0 | 777 | 187.594595 | 69.433539 | 55.091351 | 55.957773 | 2013 |
| 4 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | N | F | 106 | 170.0 | 170 | 271.976471 | 101.070353 | 80.641235 | 80.939824 | 2013 |


### Cleaning the Data

Now that we've loaded the data, there are three variables (characteristics) that we want to use in our analysis:

1. Share of the provider's total services (counted per beneficiary per day) accounted for by each procedure
2. Average submitted charge
3. Average paid amount

The latter two are already in our dataset at the provider-HCPCS-year level. However, we will have to do some data transformation to get the first variable. We have `Tot_Bene_Day_Srvcs`, the total number of services provided annually, de-duplicated so that multiple services of the same type delivered to one beneficiary on the same day are counted once.

To get the share, we first need to calculate the total number of services provided annually by each provider across all procedure codes. Using `pandas`, we will create a data frame that sums all services by provider and year, and then we will merge this back into the main dataset.

In [7]:
# group by provider and year, sum across services by HCPCS code
dftotsvcs = dfmain.groupby(['Rndrng_NPI','Data_Year'],as_index=False)['Tot_Bene_Day_Srvcs'].sum()

In [8]:
# rename the aggregated service count so no conflict when merging
dftotsvcs = dftotsvcs.rename(columns={"Tot_Bene_Day_Srvcs":"Agg_Ann_Tot_Svcs"})
dftotsvcs.head()

| | Rndrng_NPI | Data_Year | Agg_Ann_Tot_Svcs |
|---|---|---|---|
| 0 | 1003000126 | 2013 | 1607 |
| 1 | 1003000126 | 2014 | 2728 |
| 2 | 1003000126 | 2015 | 2751 |
| 3 | 1003000126 | 2016 | 1450 |
| 4 | 1003000126 | 2017 | 1637 |


In [9]:
# Merge the summarized file back with the main data frame
dfmain = dfmain.merge(dftotsvcs, how='left', on=['Rndrng_NPI','Data_Year'])

Now that we have each provider's annual total, we create a new variable that divides the services for each HCPCS code by that annual total to get the share.

In [10]:
# generate new column representing share
dfmain = dfmain.assign(Share_Srvcs=dfmain['Tot_Bene_Day_Srvcs']/dfmain['Agg_Ann_Tot_Svcs'])

In [11]:
# and we have our variables
dfmain.head()

| | Rndrng_NPI | Rndrng_Prvdr_Last_Org_Name | Rndrng_Prvdr_First_Name | Rndrng_Prvdr_MI | Rndrng_Prvdr_Crdntls | Rndrng_Prvdr_Gndr | Rndrng_Prvdr_Ent_Cd | Rndrng_Prvdr_St1 | Rndrng_Prvdr_St2 | Rndrng_Prvdr_City | ... | Tot_Benes | Tot_Srvcs | Tot_Bene_Day_Srvcs | Avg_Sbmtd_Chrg | Avg_Mdcr_Alowd_Amt | Avg_Mdcr_Pymt_Amt | Avg_Mdcr_Stdzd_Amt | Data_Year | Agg_Ann_Tot_Svcs | Share_Srvcs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | 138 | 142.0 | 142 | 368.626761 | 132.17007 | 104.299718 | 107.211127 | 2013 | 1607 | 0.088363 |
| 1 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | 95 | 96.0 | 96 | 524.604167 | 196.932396 | 155.901146 | 157.598854 | 2013 | 1607 | 0.059739 |
| 2 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | 47 | 61.0 | 61 | 97.0 | 37.688197 | 30.065246 | 30.584918 | 2013 | 1607 | 0.037959 |
| 3 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | 381 | 777.0 | 777 | 187.594595 | 69.433539 | 55.091351 | 55.957773 | 2013 | 1607 | 0.48351 |
| 4 | 1003000126 | Enkeshafi | Ardalan | | M.D. | M | I | 900 Seton Dr | | Cumberland | ... | 106 | 170.0 | 170 | 271.976471 | 101.070353 | 80.641235 | 80.939824 | 2013 | 1607 | 0.105787 |
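
The group, merge, and divide steps above can also be done in one pass with `groupby(...).transform("sum")`, which returns the group total aligned to every original row. The sketch below is an alternative on toy data (the values are made up), not the approach used in the cells above. It also checks that the shares within each provider-year sum to 1.

```python
import pandas as pd

# Toy data: two providers, services per HCPCS code and year (made-up values)
toy = pd.DataFrame({
    "Rndrng_NPI": [1, 1, 1, 2, 2],
    "Data_Year": [2013, 2013, 2014, 2013, 2013],
    "HCPCS_Cd": ["99213", "99214", "99213", "99213", "G0008"],
    "Tot_Bene_Day_Srvcs": [30, 10, 5, 8, 2],
})

# transform("sum") broadcasts each provider-year total back onto its rows,
# so no separate summary frame or merge is needed
keys = ["Rndrng_NPI", "Data_Year"]
toy["Agg_Ann_Tot_Svcs"] = toy.groupby(keys)["Tot_Bene_Day_Srvcs"].transform("sum")
toy["Share_Srvcs"] = toy["Tot_Bene_Day_Srvcs"] / toy["Agg_Ann_Tot_Svcs"]

# Sanity check: shares within each provider-year should add up to 1
check = toy.groupby(keys)["Share_Srvcs"].sum()
print(toy[["Rndrng_NPI", "Data_Year", "HCPCS_Cd", "Share_Srvcs"]])
```

Note that in our pipeline the shares are computed before any rows are filtered out, so they are relative to all of a provider's reported services in that year.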


Now that we have the characteristics we need, the final step in the cleaning process is limiting our dataset to the 100 most common procedures and then dropping the variables we don't need. This will help with both memory use and relevance. First, we'll find the 100 most common procedures, measured by the number of rows in which each HCPCS code appears (roughly, the number of provider-year pairs reporting it).

In [12]:
dfmain['HCPCS_Cd'].value_counts().nlargest(100)

99213    3600322
99214    3331442
99204    1394313
99232    1384385
99203    1376769
          ...   
76700     156610
J1030     154888
Q2037     153984
G0101     153618
69210     149524
Name: HCPCS_Cd, Length: 100, dtype: int64

In [13]:
# save top 100 most frequent HCPCS codes to list
top100codes = dfmain['HCPCS_Cd'].value_counts().nlargest(100).index.tolist()
top100codes

['99213',
 '99214',
 '99204',
 '99232',
 '99203',
 'G0008',
 '99212',
 '99223',
 '99233',
 '99215',
 '99222',
 '36415',
 '93000',
 '90662',
 'G0009',
 '99231',
 'G0439',
 '96372',
 '99291',
 '99205',
 '99284',
 '93010',
 '97110',
 '99285',
 '99283',
 '99238',
 '99239',
 '99202',
 '90670',
 '97140',
 '20610',
 '81002',
 '81003',
 '90732',
 '83036',
 '73030',
 '71020',
 '99221',
 '93306',
 '85025',
 '99308',
 '92014',
 '73630',
 '72100',
 '99211',
 '99217',
 '99309',
 '85610',
 '97112',
 'J3301',
 '93880',
 '99220',
 '92004',
 '73562',
 '90686',
 '74177',
 '73560',
 '97530',
 '98941',
 '80053',
 '73610',
 'G0180',
 '71250',
 '74176',
 '76942',
 '93971',
 '80061',
 '00740',
 '77080',
 'G0438',
 '92083',
 '71260',
 '70450',
 '93970',
 '73502',
 '71046',
 '92012',
 '76770',
 '73130',
 '72170',
 '00142',
 'J1100',
 '00810',
 '92250',
 '92133',
 '73110',
 '84443',
 '76705',
 '92134',
 '17000',
 '97001',
 '72148',
 'G0283',
 '78452',
 '73564',
 '76700',
 'J1030',
 'Q2037',
 'G0101',
 '69210']

In [14]:
# filter dataframe to top 100 HCPCS codes only
dfmain = dfmain[dfmain['HCPCS_Cd'].isin(top100codes)]

In [15]:
# can compare to before
print("Total number of observations:")
print(dfmain['Data_Year'].count())
print("Total number of unique providers:")
print(dfmain['Rndrng_NPI'].nunique())
print("Same number of observations for a different variable?")
print(dfmain['Rndrng_NPI'].count())
print("Difference between count and size from missing values?")
print(dfmain['Rndrng_NPI'].size)

Total number of observations:
42308976
Total number of unique providers:
1331404
Same number of observations for a different variable?
42308976
Difference between count and size from missing values?
42308976
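
To make the filtering step concrete, here is the same `value_counts` / `nlargest` / `isin` pattern on a toy frame, keeping the top 2 codes instead of the top 100. The data is made up. The `.copy()` call is an optional addition that makes the filtered frame independent of the original, which avoids pandas' chained-assignment warnings if columns are modified later.

```python
import pandas as pd

# Toy data: six rows across three HCPCS codes (made-up)
toy = pd.DataFrame({"HCPCS_Cd": ["99213", "99213", "99213", "99214", "99214", "G0008"]})

# Rank codes by how many rows they appear in and keep the two most frequent
top2 = toy["HCPCS_Cd"].value_counts().nlargest(2).index.tolist()

# Keep only rows whose code is in the top list
filtered = toy[toy["HCPCS_Cd"].isin(top2)].copy()

# Fraction of rows that survive the filter
kept_share = len(filtered) / len(toy)
print(top2, len(filtered), round(kept_share, 3))
```

In our data, the same ratio is 42,308,976 / 77,213,483, so the top 100 codes account for roughly 55% of all rows.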


Now, let's drop the rest of the variables that we don't need. Again, this will help with memory.

In [16]:
dfmain = dfmain[['Rndrng_NPI', 'HCPCS_Cd', 'Data_Year', 'Share_Srvcs', 'Avg_Sbmtd_Chrg','Avg_Mdcr_Pymt_Amt']]

In [17]:
dfmain.to_csv("dfmain.csv") # save cleaned data for easy read for analysis
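
One thing to watch when reading the saved file back: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column, and codes such as `00740` can lose their leading zeros if pandas infers a numeric type. The sketch below uses toy data and an in-memory buffer in place of `dfmain.csv`. Passing `index=False` and `dtype={"HCPCS_Cd": str}` are optional choices, not part of the cell above.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned frame (made-up values)
clean = pd.DataFrame({
    "Rndrng_NPI": [1, 2],
    "HCPCS_Cd": ["00740", "00142"],
    "Data_Year": [2013, 2013],
})

# Write without the row index so no "Unnamed: 0" column appears on re-read
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Default type inference turns all-numeric codes into integers
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the column as str keeps the codes exactly as written
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"HCPCS_Cd": str})

print(naive["HCPCS_Cd"].tolist(), typed["HCPCS_Cd"].tolist())
```

Whether inference actually strips zeros in the full file depends on how pandas parses the mix of numeric-looking and alphanumeric codes, so declaring the type explicitly is the safer route.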

That's it for data importing and cleaning! Now, we can use the cleaned data for the next steps.