## Read in 1 dataset
- Load only the columns needed, to save time and memory
- Add a `year` column to each DataFrame

#### Columns to keep

- National Provider Identifier -- 1000 non-null int64
- Last Name/Organization Name of the Provider -- 1000 non-null object
- Entity Type of the Provider -- 1000 non-null object
- City of the Provider -- 1000 non-null object
- Zip Code of the Provider -- 1000 non-null int64
- State Code of the Provider -- 1000 non-null object
- Provider Type -- 1000 non-null object
- Place of Service -- 1000 non-null object
- HCPCS Code -- 1000 non-null object
- HCPCS Description -- 1000 non-null object
- Number of Services -- 1000 non-null float64
- Number of Medicare Beneficiaries -- 1000 non-null int64
- Number of Distinct Medicare Beneficiary/Per Day Services -- 1000 non-null int64
- Average Medicare Allowed Amount -- 1000 non-null float64
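
Add: `year` (set on each DataFrame at import)

Note: HCPCS Code and HCPCS Description appear in this list but not in the `cols` list used in the cells below, so they are not loaded. The resulting frames have 12 source columns plus `year`, which matches the 13 columns in the shapes printed later.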
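
Before a long read, it can help to confirm that every wanted column name actually exists in the file header. The sketch below does this on a tiny in-memory stand-in for the Medicare CSV; the sample text and the `wanted` list are hypothetical, and only `pd.read_csv(..., nrows=0)` is the point being illustrated.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the Medicare CSV (hypothetical sample, not real data layout).
sample_csv = io.StringIO(
    "National Provider Identifier,Provider Type,Number of Services\n"
    "1003000126,Internal Medicine,100.0\n"
)

# Columns we would like to pass to usecols (shortened for the demo).
wanted = ["National Provider Identifier", "Provider Type", "Zip Code of the Provider"]

# nrows=0 parses only the header row, so the check is cheap even for a very large file.
header = pd.read_csv(sample_csv, nrows=0).columns

present = [c for c in wanted if c in header]
missing = [c for c in wanted if c not in header]
print("present:", present)
print("missing:", missing)
```

Passing a name in `usecols` that is not in the file raises an error, so a check like this surfaces typos in long column names before the full read starts.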


In [2]:
import pandas as pd
import pickle

### Code using chunks
Compare the processing time of chunked reads against reading the full file at once.

In [3]:
%%time
# Option 1: add the year column to each chunk as it is read
cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

payment_rows =[]
for chunk in pd.read_csv('../data/1_medicare_data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000, usecols = cols):
    chunk['year'] = 2017
    payment_rows.append(chunk)
    
df_payments_2017 = pd.concat(payment_rows, ignore_index=True)

Wall time: 2min 20s


Chunking option 1 (adding `year` to each chunk) took 2 min 20 s.

In [4]:
%%time
# Option 2: concatenate the chunks first, then add the year column once

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

payment_rows =[]
for chunk in pd.read_csv('../data/1_medicare_data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                         chunksize = 1000, usecols = cols):
    payment_rows.append(chunk)
    
df_payments_2017 = pd.concat(payment_rows, ignore_index=True)
df_payments_2017['year'] = 2017

Wall time: 1min 57s


Chunking option 2 (adding `year` once after concatenation) took 1 min 57 s. The large DataFrame from option 1 was still in memory during this run, which may have slowed it down.
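
With 9,847,443 rows and `chunksize = 1000`, the loop runs about 9,848 times, so per-chunk overhead is a plausible (unmeasured) contributor to the chunked timings. The sketch below uses a small synthetic CSV, not the Medicare file, to show that the chunk size changes only the number of iterations and not the resulting DataFrame. The helper name `read_in_chunks` is hypothetical.

```python
import io
import pandas as pd

# Small synthetic CSV standing in for the 2017 file (hypothetical data).
csv_text = "npi,services\n" + "\n".join(f"{i},{i * 2.0}" for i in range(10))

def read_in_chunks(text, chunksize):
    """Read the CSV in pieces and concatenate once at the end."""
    chunks = list(pd.read_csv(io.StringIO(text), chunksize=chunksize))
    return pd.concat(chunks, ignore_index=True), len(chunks)

df_small_chunks, n_small = read_in_chunks(csv_text, chunksize=2)
df_large_chunks, n_large = read_in_chunks(csv_text, chunksize=5)
df_full = pd.read_csv(io.StringIO(csv_text))

# Add the year once per DataFrame, as in option 2.
for df in (df_small_chunks, df_large_chunks, df_full):
    df["year"] = 2017

print(n_small, n_large)
print(df_small_chunks.equals(df_full), df_large_chunks.equals(df_full))
```

If chunking is needed for memory reasons, a larger `chunksize` would mean fewer iterations; whether that closes the gap on the real file would need to be timed.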

In [8]:
print(df_payments_2017.shape)
df_payments_2017.head()

(9847443, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,100.0,96,100,73.3988,2017
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,26.0,25,26,100.08,2017
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,52.0,51,52,136.38,2017
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,59,59,190.363729,2017
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,16.0,16,16,101.68,2017


### Code without chunks

In [5]:
%%time
# Option 3: read the whole file in one call, then add the year column

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2017 = pd.read_csv('../data/1_medicare_data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv', 
                               usecols = cols)
df_payments_2017['year'] = 2017



Wall time: 24.6 s


In [6]:
print(df_payments_2017.shape)
df_payments_2017.head()

(9847443, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,100.0,96,100,73.3988,2017
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,26.0,25,26,100.08,2017
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,52.0,51,52,136.38,2017
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,59,59,190.363729,2017
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,16.0,16,16,101.68,2017


Reading without chunks took 24.6 seconds, far faster than either chunked option.

In [10]:
# Lowercase the column names, then replace spaces and '/' with underscores
df_payments_2017.columns = df_payments_2017.columns.str.lower()
df_payments_2017.columns = df_payments_2017.columns.str.replace(" ", "_")
df_payments_2017.columns = df_payments_2017.columns.str.replace("/", "_")

df_payments_2017.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,100.0,96,100,73.3988,2017
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,26.0,25,26,100.08,2017
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,52.0,51,52,136.38,2017
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,59,59,190.363729,2017
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,16.0,16,16,101.68,2017
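
The renaming steps can also be written as one chained expression. The sketch below applies it to an empty DataFrame carrying three of the original column names, so the effect on names containing both spaces and `/` is easy to see. The variable name `df_demo` is hypothetical.

```python
import pandas as pd

# Empty frame with a few of the original column names (no data needed for renaming).
df_demo = pd.DataFrame(columns=[
    "Last Name/Organization Name of the Provider",
    "Number of Distinct Medicare Beneficiary/Per Day Services",
    "year",
])

# Lowercase, then replace spaces and slashes with underscores.
# regex=False treats the patterns as literal strings.
df_demo.columns = (
    df_demo.columns
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("/", "_", regex=False)
)

print(list(df_demo.columns))
```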


In [11]:
df_payments_2017.to_pickle("../data/1_medicare_data/pickled_files/payments_2017.pkl")

#### Option 3 (reading without chunks) is the fastest, so it is used to bring in the other years
#### 2016

In [12]:
%%time

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2016 = pd.read_csv('../data/1_medicare_data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2016.csv', 
                               usecols = cols)
df_payments_2016['year'] = 2016



Wall time: 18.7 s


In [13]:
print(df_payments_2016.shape)
df_payments_2016.head()

(9714896, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,57.0,55,57,72.743158,2016
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,38.0,38,38,135.01,2016
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23,23,189.239565,2016
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,20.0,20,20,100.75,2016
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,96.0,87,96,136.25,2016


In [14]:
# Lowercase the column names, then replace spaces and '/' with underscores
df_payments_2016.columns = df_payments_2016.columns.str.lower()
df_payments_2016.columns = df_payments_2016.columns.str.replace(" ", "_")
df_payments_2016.columns = df_payments_2016.columns.str.replace("/", "_")

df_payments_2016.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,57.0,55,57,72.743158,2016
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,38.0,38,38,135.01,2016
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23,23,189.239565,2016
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,20.0,20,20,100.75,2016
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,96.0,87,96,136.25,2016


In [16]:
df_payments_2016.to_pickle("../data/1_medicare_data/pickled_files/payments_2016.pkl")

#### 2015

In [18]:
%%time

cols = ['National Provider Identifier',
        'Last Name/Organization Name of the Provider',
        'Entity Type of the Provider',
        'City of the Provider',
        'Zip Code of the Provider',
        'State Code of the Provider',
        'Provider Type',
        'Place of Service',
        'Number of Services',
        'Number of Medicare Beneficiaries',
        'Number of Distinct Medicare Beneficiary/Per Day Services',
        'Average Medicare Allowed Amount']

df_payments_2015 = pd.read_csv('../data/1_medicare_data/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2015.csv', 
                               usecols = cols)
df_payments_2015['year'] = 2015

Wall time: 19.4 s


In [19]:
print(df_payments_2015.shape)
df_payments_2015.head()

(9497892, 13)


Unnamed: 0,National Provider Identifier,Last Name/Organization Name of the Provider,Entity Type of the Provider,City of the Provider,Zip Code of the Provider,State Code of the Provider,Provider Type,Place of Service,Number of Services,Number of Medicare Beneficiaries,Number of Distinct Medicare Beneficiary/Per Day Services,Average Medicare Allowed Amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23.0,23.0,72.68,2015
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,18.0,18.0,18.0,135.85,2015
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,58.0,59.0,101.365085,2015
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,132.0,130.0,132.0,139.010455,2015
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,220.0,215.0,220.0,205.185955,2015


In [20]:
# Lowercase the column names, then replace spaces and '/' with underscores
df_payments_2015.columns = df_payments_2015.columns.str.lower()
df_payments_2015.columns = df_payments_2015.columns.str.replace(" ", "_")
df_payments_2015.columns = df_payments_2015.columns.str.replace("/", "_")

df_payments_2015.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23.0,23.0,72.68,2015
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,18.0,18.0,18.0,135.85,2015
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,58.0,59.0,101.365085,2015
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,132.0,130.0,132.0,139.010455,2015
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,220.0,215.0,220.0,205.185955,2015


In [21]:
df_payments_2015.dtypes

national_provider_identifier                                  int64
last_name_organization_name_of_the_provider                  object
entity_type_of_the_provider                                  object
city_of_the_provider                                         object
zip_code_of_the_provider                                     object
state_code_of_the_provider                                   object
provider_type                                                object
place_of_service                                             object
number_of_services                                          float64
number_of_medicare_beneficiaries                            float64
number_of_distinct_medicare_beneficiary_per_day_services    float64
average_medicare_allowed_amount                             float64
year                                                          int64
dtype: object
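
In this 2015 output, `zip_code_of_the_provider` is `object` and the two beneficiary count columns are `float64`, whereas the column list at the top shows `int64` for all three. If the yearly frames are combined later, aligning these dtypes first may avoid mixed-type columns. The sketch below is one possible approach on hypothetical rows; it assumes the float columns hold only whole numbers or missing values, which has not been checked against the real data.

```python
import pandas as pd

# Hypothetical rows mimicking the 2015 dtypes (zip as text, counts as float).
df_2015 = pd.DataFrame({
    "zip_code_of_the_provider": ["215021854", "21502"],
    "number_of_medicare_beneficiaries": [23.0, None],
})
# Hypothetical rows mimicking the int64 dtypes in the column list.
df_2017 = pd.DataFrame({
    "zip_code_of_the_provider": [215021854, 21502],
    "number_of_medicare_beneficiaries": [96, 25],
})

def harmonize(df):
    """Return a copy with zip codes as strings and counts as nullable integers."""
    out = df.copy()
    # Zip codes are identifiers, not quantities, so store them as text.
    out["zip_code_of_the_provider"] = out["zip_code_of_the_provider"].astype(str)
    # The nullable Int64 dtype can hold whole numbers alongside missing values.
    out["number_of_medicare_beneficiaries"] = (
        out["number_of_medicare_beneficiaries"].astype("Int64")
    )
    return out

combined = pd.concat([harmonize(df_2015), harmonize(df_2017)], ignore_index=True)
print(combined.dtypes)
```

Note that `astype(str)` turns a missing zip code into the text `"nan"`, so missing values would need separate handling if the real column contains any.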

In [22]:
df_payments_2015.to_pickle("../data/1_medicare_data/pickled_files/payments_2015.pkl")
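
The 2015, 2016 and 2017 cells repeat the same read, tag, rename and pickle steps. One way to avoid the repetition is a small helper called in a loop, sketched below. The function name `load_payments`, the shortened `COLS` list, and the temporary stand-in files are all hypothetical; the real file names and full column list are the ones used in the cells above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Shortened column list for the demo; the notebook uses twelve columns.
COLS = ["National Provider Identifier", "Provider Type", "Number of Services"]

def load_payments(csv_path, year, cols=COLS):
    """Read selected columns, tag the year, and normalize the column names."""
    df = pd.read_csv(csv_path, usecols=cols)
    df["year"] = year
    # Lowercase and replace spaces and slashes with underscores.
    return df.rename(columns=lambda c: c.lower().replace(" ", "_").replace("/", "_"))

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    frames = {}
    for year in (2015, 2016, 2017):
        # Hypothetical stand-in file; the real PUF file names differ.
        csv_path = tmp / f"payments_{year}.csv"
        csv_path.write_text(
            "National Provider Identifier,Provider Type,Number of Services,Unused\n"
            "1003000126,Internal Medicine,100.0,x\n"
        )
        frames[year] = load_payments(csv_path, year)
        frames[year].to_pickle(tmp / f"payments_{year}.pkl")

    # Round-trip check: the pickle restores the same DataFrame.
    reloaded = pd.read_pickle(tmp / "payments_2016.pkl")

print(list(reloaded.columns))
```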