## Read in datasets for 2015, 2016, 2017.

 - keep specific columns as discussed in brainstorming
 - add column for year

Columns to keep

- National Provider Identifier

- Last Name/Organization Name of the Provider

- Entity Type of the Provider

- City of the Provider

- Zip Code of the Provider

- State Code of the Provider

- Provider Type

- Place of Service

- Number of Services

- Number of Medicare Beneficiaries

- Number of Distinct Medicare Beneficiary/Per Day Services

- Average Medicare Allowed Amount

Data links:

- https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2017

- https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2016

- https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2015
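
The two steps listed above (keep a fixed set of columns, then add a year column) are the same for every file, so they can be sketched as one small helper. This is only an illustration: `read_payments` and `COLS` are hypothetical names that do not appear elsewhere in this notebook, and `source` is assumed to be anything `pd.read_csv` accepts (a path or a file-like object).

```python
import pandas as pd

# Column names copied from the "Columns to keep" list above.
COLS = [
    'National Provider Identifier',
    'Last Name/Organization Name of the Provider',
    'Entity Type of the Provider',
    'City of the Provider',
    'Zip Code of the Provider',
    'State Code of the Provider',
    'Provider Type',
    'Place of Service',
    'Number of Services',
    'Number of Medicare Beneficiaries',
    'Number of Distinct Medicare Beneficiary/Per Day Services',
    'Average Medicare Allowed Amount',
]


def read_payments(source, year, cols=COLS):
    """Read one year's file, keep only `cols`, and tag every row with `year`.

    Hypothetical helper: `source` is a path or file-like object.
    """
    df = pd.read_csv(source, usecols=cols, low_memory=False)
    df['year'] = year
    return df
```

Calling it once per year with the CSV path and the year would give three DataFrames with identical columns.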

In [1]:
import pandas as pd
import pickle

## Read in using chunks

In [13]:
#%%time

#add year column inside for loop

#cols = ['National Provider Identifier', 
#        'Last Name/Organization Name of the Provider', 
#        'Entity Type of the Provider', 
#        'City of the Provider', 
#        'Zip Code of the Provider', 
#        'State Code of the Provider', 
#        'Provider Type', 
#        'Place of Service', 
#        'Number of Services', 
#        'Number of Medicare Beneficiaries', 
#        'Number of Distinct Medicare Beneficiary/Per Day Services', 
#        'Average Medicare Allowed Amount']

#payment_rows = []
#for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
#                         chunksize = 1000, usecols = cols):
#    chunk['year'] = 2017
#    payment_rows.append(chunk) 
                
#df_payments_2017 = pd.concat(payment_rows, ignore_index=True)

#CPU times: user 1min 10s, sys: 21 s, total: 1min 31s
#Wall time: 1min 32s


In [20]:
#%%time
#add year column outside for loop: one assignment instead of one per chunk (slightly faster, see timings below)

#cols = ['National Provider Identifier',
#        'Last Name/Organization Name of the Provider',
#        'Entity Type of the Provider',
#        'City of the Provider',
#        'Zip Code of the Provider',
#        'State Code of the Provider',
#        'Provider Type',
#        'Place of Service',
#        'Number of Services',
#        'Number of Medicare Beneficiaries',
#        'Number of Distinct Medicare Beneficiary/Per Day Services',
#        'Average Medicare Allowed Amount']

#payment_rows = []
#for chunk in pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
#                         chunksize = 1000, usecols = cols):
#    payment_rows.append(chunk)
    
#df_payments_2017 = pd.concat(payment_rows, ignore_index=True)
#df_payments_2017['year'] = 2017

#CPU times: user 1min 5s, sys: 23.7 s, total: 1min 29s
#Wall time: 1min 29s


## Read in full CSV in a single call (no chunking)

In [19]:
%%time
# read in without using chunk method - preferable since we're not filtering data on the way in

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2017 = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                               usecols = cols, low_memory = False)
df_payments_2017['year'] = 2017

#CPU times: user 25.3 s, sys: 12 s, total: 37.3 s
#Wall time: 40.6 s


CPU times: user 25.3 s, sys: 12 s, total: 37.3 s
Wall time: 40.6 s
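
For contrast, chunked reading becomes worthwhile when rows are filtered on the way in, because only the kept rows stay in memory. The sketch below is an assumption-laden illustration, not something this notebook ran: `read_filtered` and `keep` are hypothetical names, and the default `chunksize` of 100,000 is an arbitrary choice (the commented-out cells above used 1000).

```python
import pandas as pd


def read_filtered(source, cols, keep, chunksize=100_000):
    """Read a CSV in chunks and keep only rows where keep(chunk) is True.

    Hypothetical helper: `keep` takes a DataFrame chunk and returns a
    boolean Series aligned with that chunk.
    """
    kept = []
    for chunk in pd.read_csv(source, usecols=cols, chunksize=chunksize):
        kept.append(chunk[keep(chunk)])  # discard unwanted rows immediately
    return pd.concat(kept, ignore_index=True)
```

A filter such as `lambda c: c['State Code of the Provider'] == 'MD'` would keep only Maryland rows while the rest of the file is discarded chunk by chunk.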


In [21]:
print(df_payments_2017.shape)
df_payments_2017.head()

(9847443, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,100.0,96,100,73.3988,2017
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,26.0,25,26,100.08,2017
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,52.0,51,52,136.38,2017
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,59,59,190.363729,2017
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,16.0,16,16,101.68,2017


In [23]:
df_payments_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9847443 entries, 0 to 9847442
Data columns (total 13 columns):
 #   Column                                                    Dtype  
---  ------                                                    -----  
 0   National Provider Identifier                              int64  
 1   Last Name/Organization Name of the Provider               object 
 2   Entity Type of the Provider                               object 
 3   City of the Provider                                      object 
 4   Zip Code of the Provider                                  object 
 5   State Code of the Provider                                object 
 6   Provider Type                                             object 
 7   Place of Service                                          object 
 8   Number of Services                                        float64
 9   Number of Medicare Beneficiaries                          int64  
 10  Number of Distinct Medicare Be

In [26]:
# make all lowercase, replace spaces and slashes with _
df_payments_2017.columns = map(str.lower, df_payments_2017.columns)
df_payments_2017.columns = df_payments_2017.columns.str.replace(' ', '_')
df_payments_2017.columns = df_payments_2017.columns.str.replace('/', '_')

df_payments_2017.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,100.0,96,100,73.3988,2017
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,26.0,25,26,100.08,2017
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,52.0,51,52,136.38,2017
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,59,59,190.363729,2017
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,16.0,16,16,101.68,2017
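
Since the same three renaming steps are applied to each year's DataFrame, they could be wrapped in one function. A minimal sketch, assuming all column labels are strings; `clean_columns` is a hypothetical name, not something defined in this notebook.

```python
import pandas as pd


def clean_columns(df):
    """Return a new DataFrame with lowercase, underscore-separated column names."""
    return df.rename(
        columns=lambda name: name.lower().replace(' ', '_').replace('/', '_')
    )
```

Unlike assigning to `df.columns`, `rename` returns a new DataFrame, so the result has to be assigned back, for example `df_payments_2017 = clean_columns(df_payments_2017)`.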


In [27]:
df_payments_2017.to_pickle("../data/df_payments_2017.pkl")

In [28]:
df_payments_2017 = pd.read_pickle("../data/df_payments_2017.pkl")
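
One way to confirm that a pickle round trip preserves both values and dtypes is `DataFrame.equals`. The sketch below uses a tiny stand-in DataFrame and a temporary directory rather than the real `../data` files; the file name `df_payments_demo.pkl` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df_payments_2017 (values taken from the head() output above).
df = pd.DataFrame({
    'national_provider_identifier': [1003000126],
    'zip_code_of_the_provider': ['215021854'],
    'year': [2017],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'df_payments_demo.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() requires matching shape, values, and column dtypes.
roundtrip_ok = restored.equals(df)
```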

In [29]:
df_payments_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9847443 entries, 0 to 9847442
Data columns (total 13 columns):
 #   Column                                                    Dtype  
---  ------                                                    -----  
 0   national_provider_identifier                              int64  
 1   last_name_organization_name_of_the_provider               object 
 2   entity_type_of_the_provider                               object 
 3   city_of_the_provider                                      object 
 4   zip_code_of_the_provider                                  object 
 5   state_code_of_the_provider                                object 
 6   provider_type                                             object 
 7   place_of_service                                          object 
 8   number_of_services                                        float64
 9   number_of_medicare_beneficiaries                          int64  
 10  number_of_distinct_medicare_be

In [30]:
df_payments_2017.info(verbose = True, show_counts = True)  # show_counts replaces null_counts (deprecated in pandas 1.2)

#null values in last_name (146), city (2), zip_code (2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9847443 entries, 0 to 9847442
Data columns (total 13 columns):
 #   Column                                                    Non-Null Count    Dtype  
---  ------                                                    --------------    -----  
 0   national_provider_identifier                              9847443 non-null  int64  
 1   last_name_organization_name_of_the_provider               9847297 non-null  object 
 2   entity_type_of_the_provider                               9847443 non-null  object 
 3   city_of_the_provider                                      9847441 non-null  object 
 4   zip_code_of_the_provider                                  9847441 non-null  object 
 5   state_code_of_the_provider                                9847443 non-null  object 
 6   provider_type                                             9847443 non-null  object 
 7   place_of_service                                          9847443 non-null  objec
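
Reading null counts off the `info()` output means subtracting each non-null count from the row total by hand. As an alternative sketch, `isna().sum()` gives the null count per column directly; `null_summary` is a hypothetical helper name.

```python
import pandas as pd


def null_summary(df):
    """Return null counts for columns with at least one null, largest first."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

Calling `null_summary(df_payments_2017)` would list only the columns that contain nulls.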

## Read in, select columns, rename columns, and pickle the 2016 and 2015 CSV files

In [33]:
%%time
# read in without using chunk method - preferable since we're not filtering data on the way in

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2016 = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2016.csv', 
                               usecols = cols, low_memory = False)
df_payments_2016['year'] = 2016

#CPU times: user 25.2 s, sys: 9.3 s, total: 34.5 s
#Wall time: 38.4 s


CPU times: user 25.2 s, sys: 9.3 s, total: 34.5 s
Wall time: 38.4 s


In [35]:
print(df_payments_2016.shape)
df_payments_2016.head()

(9714896, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,57.0,55,57,72.743158,2016
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,38.0,38,38,135.01,2016
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23,23,189.239565,2016
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,20.0,20,20,100.75,2016
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,96.0,87,96,136.25,2016


In [37]:
# make all lowercase, replace spaces and slashes with _
df_payments_2016.columns = map(str.lower, df_payments_2016.columns)
df_payments_2016.columns = df_payments_2016.columns.str.replace(' ', '_')
df_payments_2016.columns = df_payments_2016.columns.str.replace('/', '_')

df_payments_2016.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,57.0,55,57,72.743158,2016
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,38.0,38,38,135.01,2016
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23,23,189.239565,2016
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,20.0,20,20,100.75,2016
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,96.0,87,96,136.25,2016


In [38]:
df_payments_2016.to_pickle('../data/df_payments_2016.pkl')

In [39]:
df_payments_2016 = pd.read_pickle('../data/df_payments_2016.pkl')

In [40]:
df_payments_2016.info(verbose = True, show_counts = True)  # show_counts replaces null_counts (deprecated in pandas 1.2)

#null values in last_name (136), zip_code (2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9714896 entries, 0 to 9714895
Data columns (total 13 columns):
 #   Column                                                    Non-Null Count    Dtype  
---  ------                                                    --------------    -----  
 0   national_provider_identifier                              9714896 non-null  int64  
 1   last_name_organization_name_of_the_provider               9714760 non-null  object 
 2   entity_type_of_the_provider                               9714896 non-null  object 
 3   city_of_the_provider                                      9714896 non-null  object 
 4   zip_code_of_the_provider                                  9714894 non-null  object 
 5   state_code_of_the_provider                                9714896 non-null  object 
 6   provider_type                                             9714896 non-null  object 
 7   place_of_service                                          9714896 non-null  objec

In [41]:
%%time
# read in without using chunk method - preferable since we're not filtering data on the way in

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2015 = pd.read_csv('../data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2015.csv', 
                               usecols = cols, low_memory = False)
df_payments_2015['year'] = 2015

#CPU times: user 25.3 s, sys: 8.15 s, total: 33.4 s
#Wall time: 35.4 s


CPU times: user 25.3 s, sys: 8.15 s, total: 33.4 s
Wall time: 35.4 s


In [42]:
print(df_payments_2015.shape)
df_payments_2015.head()

(9497892, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23.0,23.0,72.68,2015
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,18.0,18.0,18.0,135.85,2015
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,58.0,59.0,101.365085,2015
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,132.0,130.0,132.0,139.010455,2015
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,220.0,215.0,220.0,205.185955,2015


In [43]:
# make all lowercase, replace spaces and slashes with _
df_payments_2015.columns = map(str.lower, df_payments_2015.columns)
df_payments_2015.columns = df_payments_2015.columns.str.replace(' ', '_')
df_payments_2015.columns = df_payments_2015.columns.str.replace('/', '_')

df_payments_2015.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23.0,23.0,72.68,2015
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,18.0,18.0,18.0,135.85,2015
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,58.0,59.0,101.365085,2015
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,132.0,130.0,132.0,139.010455,2015
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,220.0,215.0,220.0,205.185955,2015


In [44]:
df_payments_2015.to_pickle('../data/df_payments_2015.pkl')

In [45]:
df_payments_2015 = pd.read_pickle('../data/df_payments_2015.pkl')

In [46]:
df_payments_2015.info(verbose = True, show_counts = True)  # show_counts replaces null_counts (deprecated in pandas 1.2)

#null values in last_name (145), entity (1), city (4), 1 each in: zip_code, state, provider_type, place, 
# number_of_services, number_of_medicare_beneficiaries, number_of_distinct, average_medicare_allowed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9497892 entries, 0 to 9497891
Data columns (total 13 columns):
 #   Column                                                    Non-Null Count    Dtype  
---  ------                                                    --------------    -----  
 0   national_provider_identifier                              9497892 non-null  int64  
 1   last_name_organization_name_of_the_provider               9497747 non-null  object 
 2   entity_type_of_the_provider                               9497891 non-null  object 
 3   city_of_the_provider                                      9497888 non-null  object 
 4   zip_code_of_the_provider                                  9497891 non-null  object 
 5   state_code_of_the_provider                                9497891 non-null  object 
 6   provider_type                                             9497891 non-null  object 
 7   place_of_service                                          9497891 non-null  objec
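
In the 2015 `head()` output the beneficiary counts display as `23.0`, whereas 2016 and 2017 show integers. A likely explanation (not verified in this notebook) is the single null noted above: `read_csv` falls back to `float64` when an otherwise integer column contains a missing value. The toy CSV below is invented to illustrate the effect, and the conversion to pandas' nullable `Int64` dtype is one possible remedy, not a step the notebook takes.

```python
import io

import pandas as pd

# Invented three-row CSV with one missing count in the second row.
csv_text = (
    "national_provider_identifier,number_of_medicare_beneficiaries\n"
    "1,23\n"
    "2,\n"
    "3,18\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# The missing value forces float64 for the whole column.
inferred_dtype = df['number_of_medicare_beneficiaries'].dtype

# Nullable integer dtype keeps whole numbers and represents the gap as <NA>.
as_nullable = df['number_of_medicare_beneficiaries'].astype('Int64')
```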