## Main EDA file for teamwork - Phoenix
- Team members
    - Jack McCann (leader)
    - Nicole Muldowney
    - Teresa Whitesell
    - Ari Khursheed
    - Diego Alvarez
    - Lori Butler
    
**MVP:  Set Goal analysis and presentation:**
- Dashboard showing cost/utilization over the 3 years
    - Chart showing payments over time 
    - Chart showing counts over time 
    - Chart showing procedures with largest change in avg payment
    - OPTIONAL/Stretch: Find the total number of people covered under Medicare each year, so the change can be shown as a percentage. See link below for annual # of beneficiaries
    - Chart showing largest change in utilization
    - OPTIONAL/Stretch: Find the total number of people covered under Medicare each year, so the change in utilization can be shown as a percentage
    - Research into causes of oddities
    - Filter to specific procedure codes, providers, cities/states, etc
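
A minimal sketch of how the "largest change" charts could be prepared, assuming the combined frame has a `year` column and a grouping column. The frames previewed in this notebook do not show a procedure-code column, so `provider_type` is used here as a stand-in; the helper name `change_by_group` and all toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the combined 2015-2017 frame; values are invented for illustration.
toy = pd.DataFrame({
    "provider_type": ["Internal Medicine", "Internal Medicine", "Cardiology", "Cardiology"],
    "year": [2015, 2017, 2015, 2017],
    "number_of_services": [100.0, 150.0, 200.0, 180.0],
})

def change_by_group(df, group_col, value_col, first_year, last_year):
    """Total value_col per group and year, plus absolute and percent change."""
    totals = df.pivot_table(index=group_col, columns="year",
                            values=value_col, aggfunc="sum")
    out = totals[[first_year, last_year]].copy()
    out["change"] = out[last_year] - out[first_year]
    out["pct_change"] = out["change"] / out[first_year] * 100
    # Largest movers first, regardless of direction.
    return out.sort_values("change", key=abs, ascending=False)

utilization_change = change_by_group(toy, "provider_type", "number_of_services", 2015, 2017)
print(utilization_change)
```

The same helper could be pointed at a payment column instead of `number_of_services`; sorting by absolute change keeps large decreases visible alongside large increases.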

**Done by 1pm Tuesday**   
ETL: by year files   
	Extract  
	Transform (requires EDA first; how to handle nulls)  
	Load  

**Done by 1pm Thursday**  
Analysis (make final visualizations - some in Python, others in PowerBI/Tableau) - by goal (payments, utilization counts)   

**Done by 1pm Friday (walkthrough)**  
Presentation (TBD)

In [31]:
import pandas as pd
import pickle
from glob import glob
import matplotlib.pyplot as plt
import seaborn as sns

## Read in 2015

In [22]:
df_payments_2015 = pd.read_pickle('../data/pickled_files/payments_2015.pkl')
print(df_payments_2015.shape)
df_payments_2015.head()

(9497892, 13)


Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23.0,23.0,72.68,2015
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,18.0,18.0,18.0,135.85,2015
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,58.0,59.0,101.365085,2015
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,132.0,130.0,132.0,139.010455,2015
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,220.0,215.0,220.0,205.185955,2015


In [23]:
# Find the row that needs to be dropped from 2015: it has irrelevant text in the last-name field
# and no data in any other column

df_payments_2015[df_payments_2015.national_provider_identifier == 1]

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
7205022,1,CPT copyright 2014 American Medical Associatio...,,,,,,,,,,,2015


In [24]:
# To drop the irrelevant row, index # 7205022

df_payments_2015 = df_payments_2015.drop(labels = 7205022)

In [25]:
# To ensure that the irrelevant row was dropped  (it was)

df_payments_2015[df_payments_2015.national_provider_identifier == 1]

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
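
The footer row above was located by its placeholder NPI of 1. A possible alternative, sketched here on a toy frame, is to flag any row whose data columns are all null, which would also catch similar rows in other years. The list `always_filled` is an assumption based on the one footer row shown above, where only the NPI, name, and year are populated.

```python
import pandas as pd

# Toy frame: one real row plus a footer-like row with only NPI, name, and year filled.
toy = pd.DataFrame({
    "national_provider_identifier": [1003000126, 1],
    "last_name_organization_name_of_the_provider": ["ENKESHAFI", "placeholder footer text"],
    "city_of_the_provider": ["CUMBERLAND", None],
    "number_of_services": [23.0, None],
    "year": [2015, 2015],
})

# Columns that are populated even on the footer row, so they are excluded from the check.
always_filled = ["national_provider_identifier",
                 "last_name_organization_name_of_the_provider", "year"]
data_cols = toy.columns.difference(always_filled)

# A row is suspicious when every remaining column is null.
empty_rows = toy[toy[data_cols].isnull().all(axis=1)]
print(empty_rows.index.tolist())
```

The resulting index labels can be passed to `drop`, which avoids hard-coding a row number.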


In [26]:
# To see the number of non-null values per column
# (show_counts replaces null_counts, which was deprecated in pandas 1.2 and removed in 2.0)

df_payments_2015.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9497891 entries, 0 to 9497891
Data columns (total 13 columns):
national_provider_identifier                                9497891 non-null int64
last_name_organization_name_of_the_provider                 9497746 non-null object
entity_type_of_the_provider                                 9497891 non-null object
city_of_the_provider                                        9497888 non-null object
zip_code_of_the_provider                                    9497891 non-null object
state_code_of_the_provider                                  9497891 non-null object
provider_type                                               9497891 non-null object
place_of_service                                            9497891 non-null object
number_of_services                                          9497891 non-null float64
number_of_medicare_beneficiaries                            9497891 non-null float64
number_of_distinct_medicare_beneficiary_per_da

## Read in 2016

In [28]:
df_payments_2016 = pd.read_pickle('../data/pickled_files/payments_2016.pkl')
print(df_payments_2016.shape)
df_payments_2016.head()

(9714896, 13)


Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,57.0,55,57,72.743158,2016
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,38.0,38,38,135.01,2016
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23,23,189.239565,2016
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,20.0,20,20,100.75,2016
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,96.0,87,96,136.25,2016


## Read in 2017

In [29]:
df_payments_2017 = pd.read_pickle('../data/pickled_files/payments_2017.pkl')
print(df_payments_2017.shape)
df_payments_2017.head()

(9847443, 13)


Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,100.0,96,100,73.3988,2017
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,26.0,25,26,100.08,2017
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,52.0,51,52,136.38,2017
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,59,59,190.363729,2017
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,16.0,16,16,101.68,2017
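
In the `head()` previews, 2015 shows the beneficiary counts as `23.0` while 2016 and 2017 show them as `55` or `96`, which suggests the dtypes may differ between years. A minimal sketch for checking this before concatenating, using toy frames; the helper name `compare_dtypes` is hypothetical.

```python
import pandas as pd

# Toy frames standing in for the yearly DataFrames; only the dtypes matter here.
toy_2015 = pd.DataFrame({"number_of_services": [23.0],
                         "number_of_medicare_beneficiaries": [23.0]})
toy_2016 = pd.DataFrame({"number_of_services": [57.0],
                         "number_of_medicare_beneficiaries": [55]})

def compare_dtypes(frames):
    """One row per column, one column per frame; keep only rows where dtypes differ."""
    dtypes = pd.DataFrame({name: df.dtypes.astype(str) for name, df in frames.items()})
    return dtypes[dtypes.nunique(axis=1) > 1]

mismatches = compare_dtypes({"2015": toy_2015, "2016": toy_2016})
print(mismatches)
```

With the real data, the dictionary would hold `df_payments_2015`, `df_payments_2016`, and `df_payments_2017`.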


## Create DataFrame by concatenating dfs for 2015, 2016, 2017

In [34]:
# STEP 1: Create a sorted list of the yearly .pkl file paths

payments_2015to2017_list = sorted(glob('../data/pickled_files/payments_*.pkl'))   # list of the 3 file paths
payments_2015to2017_list

['../data/pickled_files\\payments_2015.pkl',
 '../data/pickled_files\\payments_2016.pkl',
 '../data/pickled_files\\payments_2017.pkl']

In [37]:
# STEP 2 of concatenating files for 2015, 2016, 2017:
# Use the list of 3 .pkl files to create a concatenated df with all 3 years

# STEPS 1 & 2 could be run in the same cell. Running them separately here to view how it's working

df_payments_2015to2017 = pd.concat(
    (pd.read_pickle(file) for file in payments_2015to2017_list),
    ignore_index=True,
)

In [38]:
df_payments_2015to2017.shape

(29060231, 13)
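
The combined row count equals the sum of the three raw shapes printed earlier (9,497,892 + 9,714,896 + 9,847,443 = 29,060,231). Because the concatenation re-reads the pickles from disk, this suggests the footer row dropped from the in-memory 2015 frame is present again. A minimal sketch of the reconciliation and re-filtering, on toy frames:

```python
import pandas as pd

# Toy yearly frames; the NPI value 1 marks the footer row found in 2015 above.
frames = [
    pd.DataFrame({"national_provider_identifier": [1003000126, 1], "year": [2015, 2015]}),
    pd.DataFrame({"national_provider_identifier": [1003000126], "year": [2016]}),
]

combined = pd.concat(frames, ignore_index=True)

# The concatenated length should equal the sum of the parts.
assert len(combined) == sum(len(f) for f in frames)

# Re-apply the footer filter to the combined frame and renumber the index.
combined_clean = combined[combined["national_provider_identifier"] != 1].reset_index(drop=True)
print(combined.shape, combined_clean.shape)
```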

In [39]:
df_payments_2015to2017.head()

Unnamed: 0,national_provider_identifier,last_name_organization_name_of_the_provider,entity_type_of_the_provider,city_of_the_provider,zip_code_of_the_provider,state_code_of_the_provider,provider_type,place_of_service,number_of_services,number_of_medicare_beneficiaries,number_of_distinct_medicare_beneficiary_per_day_services,average_medicare_allowed_amount,year
0,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,23.0,23.0,23.0,72.68,2015
1,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,18.0,18.0,18.0,135.85,2015
2,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,59.0,58.0,59.0,101.365085,2015
3,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,132.0,130.0,132.0,139.010455,2015
4,1003000126,ENKESHAFI,I,CUMBERLAND,215021854,MD,Internal Medicine,F,220.0,215.0,220.0,205.185955,2015


## Next steps: 
### EDA
- .info() to check nulls, dtypes
    - .isnull().sum() to get count of nulls
- df.column_name.value_counts()  to get list of values and count of each for specific columns
- df.column_name.nunique()  to get number of unique values in a column
- df.column_name.unique()  to get list of unique values in a column
- df.describe()  to get statistics (count, mean, std, quartiles, min/max) for whole df
- df.column_name.describe()  to get statistics (count, mean, std, quartiles, min/max) for single column
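
The checks listed above can be sketched on a small toy frame. The column names match the payments data, but the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "state_code_of_the_provider": ["MD", "MD", "TN", None],
    "number_of_services": [23.0, 18.0, 59.0, 132.0],
})

null_counts = toy.isnull().sum()                                  # nulls per column
state_counts = toy["state_code_of_the_provider"].value_counts()  # rows per state (nulls excluded)
n_states = toy["state_code_of_the_provider"].nunique()           # distinct non-null states
services_stats = toy["number_of_services"].describe()            # count, mean, std, quartiles, min/max

print(null_counts, state_counts, n_states, services_stats, sep="\n\n")
```

Note that `value_counts()` and `nunique()` skip nulls by default; passing `dropna=False` includes them.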

#### Plot and Histogram:
    df.plot()
    df.hist()

Messy, but a good place to start.

#### Seaborn pairplot:
    # set the seaborn theme, style, and color palette
    sns.set(style="ticks", color_codes=True)

    # plot each variable's relationship with every other variable,
    # with the distribution of each variable along the diagonal
    sns.pairplot(df);

The top/right triangle is a mirror image of the bottom/left. This is a good way to step back and look for surprising correlations.

#### Bar plot:
    plt.bar('column_name1', 'column_name2', data=df)
    plt.xticks(rotation=70)
    plt.title('ColumnTitle');

#### Open class_notebook "eda_workflow" and use various plots and charts for exploration.
- Add notes to the new PYTHON EDA note in Evernote



### Begin to look in this direction. Impediments? Cleaning needed?
- Dashboard showing cost/utilization over the 3 years
    - Chart showing payments over time 
    - Chart showing counts over time 
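
A minimal sketch of the yearly aggregation these two charts would need, on a toy frame. It assumes that `number_of_services` multiplied by `average_medicare_allowed_amount` approximates the total allowed dollars for a row; the derived column names are hypothetical.

```python
import pandas as pd

# Toy rows with invented values; column names match the payments data.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2017],
    "number_of_services": [23.0, 18.0, 57.0, 100.0],
    "average_medicare_allowed_amount": [72.68, 135.85, 72.74, 73.40],
})

# Assumption: services x average allowed amount approximates total allowed dollars per row.
toy["estimated_allowed_total"] = (
    toy["number_of_services"] * toy["average_medicare_allowed_amount"]
)

by_year = toy.groupby("year").agg(
    total_services=("number_of_services", "sum"),
    estimated_allowed_total=("estimated_allowed_total", "sum"),
)
# Service-weighted average allowed amount per year.
by_year["weighted_avg_allowed"] = (
    by_year["estimated_allowed_total"] / by_year["total_services"]
)
print(by_year)
```

Weighting by service count keeps low-volume rows from dominating the yearly average; `by_year["total_services"].plot()` would then give a first version of the counts-over-time chart.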
    
    