# Life Expectancy Data Wrangling

In this project, I will look at a few selected health features to see how they vary across the states of the United States and whether they affect life expectancy.

First, I will import the libraries needed and load the CSV file.

In [1]:
import pandas as pd

In [2]:
file = 'rankmd.csv'
# The file is semicolon-delimited
df = pd.read_csv(file, sep=';')
df.head()

Unnamed: 0,FIPS,State,County,Unreliable,premature_deathDeaths,premature_deathYears_of_Potential_Life_Lost_Rate,premature_death_95% CILow,premature_death_95% CI - High,premature_death_Quartile,premature_death_YPLL Rate (AIAN),...,drive_alone_to_work_% Drive Alone (Hispanic) 95% CI - High,drive_alone_to_work_% Drive Alone (White),drive_alone_to_work_% Drive Alone (White) 95% CI - Low,drive_alone_to_work_% Drive Alone (White) 95% CI - High,long_commute_driving_alone_# Workers who Drive Alone,long_commute_driving_alone_% Long Commute - Drives Alone,long_commute_driving_alone_95% CI - Low,long_commute_driving_alone_95% CI - High,long_commute_driving_alone_Quartile,Unnamed: 249
0,1000,Alabama,,,82249.0,9820.0,9718.0,9922.0,,5145.0,...,78.0,87.0,87.0,87.0,2073072,35,34,35,,
1,1001,Alabama,Autauga,,787.0,7830.0,6998.0,8662.0,1.0,,...,,82.0,78.0,87.0,24635,38,34,42,2.0,
2,1003,Alabama,Baldwin,,3147.0,7680.0,7237.0,8124.0,1.0,,...,83.0,82.0,80.0,84.0,93141,40,38,43,3.0,
3,1005,Alabama,Barbour,,515.0,11477.0,9908.0,13045.0,3.0,,...,,86.0,82.0,91.0,8231,31,26,36,2.0,
4,1007,Alabama,Bibb,,476.0,12173.0,10506.0,13839.0,4.0,,...,,,,,8167,52,44,60,4.0,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3193 entries, 0 to 3192
Columns: 250 entries, FIPS to Unnamed: 249
dtypes: float64(194), int64(34), object(22)
memory usage: 6.1+ MB


The life expectancy column is on a separate sheet of the Excel workbook, so I will load that sheet here and isolate the columns I need.

In [4]:
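xls = pd.ExcelFile('rankexcel.xlsx')
df1 = pd.read_excel(xls, sheet_name='Additional Measure Data')
# Keep the first four columns, which include state, county, and life expectancy
data_le = df1.iloc[:, 0:4]
# Give the unnamed state and county columns proper names
life_exp2 = data_le.rename(columns={"Unnamed: 1": "State", "Unnamed: 2": "County"})
# Drop the unnamed first column, which is not needed for the merge
life_exp3 = life_exp2.drop(columns=['Unnamed: 0'])
life_exp3.head()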

Unnamed: 0,State,County,Life expectancy
0,State,County,Life Expectancy
1,Alabama,,75.548075
2,Alabama,Autauga,77.162581
3,Alabama,Baldwin,78.213405
4,Alabama,Barbour,74.054741
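
Row 0 of the output above repeats the column labels, which suggests the sheet carries a second header row. It does no harm in the merge that follows (no county is literally named "County"), but it keeps the `Life expectancy` column from being numeric. The sketch below shows one way to remove such a row, using a small hand-built frame rather than the real workbook. The helper name `drop_label_row` is hypothetical and not part of the original notebook. If the real sheet is laid out this way, passing `header=1` to `pd.read_excel` may achieve the same result at load time.

```python
import pandas as pd

# Toy frame shaped like life_exp3.head(): row 0 repeats the column labels
toy = pd.DataFrame({
    "State": ["State", "Alabama", "Alabama"],
    "County": ["County", None, "Autauga"],
    "Life expectancy": ["Life Expectancy", 75.548075, 77.162581],
})

def drop_label_row(frame):
    """Drop rows that merely repeat the column labels (hypothetical helper)."""
    is_label_row = frame["State"].eq("State") & frame["County"].eq("County")
    cleaned = frame.loc[~is_label_row].reset_index(drop=True)
    # With the stray text gone, the column can be converted to floats
    cleaned["Life expectancy"] = pd.to_numeric(cleaned["Life expectancy"])
    return cleaned

clean = drop_label_row(toy)
print(clean.dtypes)
```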


Now I will merge the life expectancy column onto the first dataframe.

In [5]:
data_merged=df.merge(life_exp3, on=['County','State'])
data_merged.head()

Unnamed: 0,FIPS,State,County,Unreliable,premature_deathDeaths,premature_deathYears_of_Potential_Life_Lost_Rate,premature_death_95% CILow,premature_death_95% CI - High,premature_death_Quartile,premature_death_YPLL Rate (AIAN),...,drive_alone_to_work_% Drive Alone (White),drive_alone_to_work_% Drive Alone (White) 95% CI - Low,drive_alone_to_work_% Drive Alone (White) 95% CI - High,long_commute_driving_alone_# Workers who Drive Alone,long_commute_driving_alone_% Long Commute - Drives Alone,long_commute_driving_alone_95% CI - Low,long_commute_driving_alone_95% CI - High,long_commute_driving_alone_Quartile,Unnamed: 249,Life expectancy
0,1000,Alabama,,,82249.0,9820.0,9718.0,9922.0,,5145.0,...,87.0,87.0,87.0,2073072,35,34,35,,,75.548075
1,1001,Alabama,Autauga,,787.0,7830.0,6998.0,8662.0,1.0,,...,82.0,78.0,87.0,24635,38,34,42,2.0,,77.162581
2,1003,Alabama,Baldwin,,3147.0,7680.0,7237.0,8124.0,1.0,,...,82.0,80.0,84.0,93141,40,38,43,3.0,,78.213405
3,1005,Alabama,Barbour,,515.0,11477.0,9908.0,13045.0,3.0,,...,86.0,82.0,91.0,8231,31,26,36,2.0,,74.054741
4,1007,Alabama,Bibb,,476.0,12173.0,10506.0,13839.0,4.0,,...,,,,8167,52,44,60,4.0,,73.408784
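
By default `merge` performs an inner join, so any row whose State and County do not match exactly in both tables is dropped without warning. A minimal sketch of how one might check for that is shown below. It uses toy frames, and the Alaska row and the smoking values are placeholders for illustration. The `indicator` and `validate` arguments are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for df and life_exp3 (values are illustrative only)
left = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "County": ["Autauga", "Baldwin", "Aleutians East"],
    "Smoking %": [20, 19, 15],
})
right = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "County": ["Autauga", "Baldwin"],
    "Life expectancy": [77.162581, 78.213405],
})

# A left join keeps every row of `left`; `_merge` records where each row was found
checked = left.merge(
    right,
    on=["State", "County"],
    how="left",
    indicator=True,
    validate="one_to_one",  # raises if a State/County pair repeats in either table
)

# Rows with no life expectancy match
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["State", "County"]])
```

An empty `unmatched` frame would indicate that the inner join loses nothing.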


The column names are a bit messy, so I will rename them to make them easier to read.

In [6]:
rename_data = data_merged.rename(columns={"preventable_hospital_stays_Preventable Hospitalization Rate": "Preventable Hospital Stays",
                                  'primary_care_physicians_# Primary Care Physicians': 'PCP Number', 
                              'primary_care_physicians_Primary Care Physicians Rate': 'PCP Rate', 
                              'mental_health_providers_# Mental Health Providers': 'MHP Number', 
                              'mental_health_providers_Mental Health Provider Rate': 'MHP Rate',
                                  'adult_smoking_% Smokers': 'Smoking %',
                                  'adult_obesity_% Adults with Obesity':'Obesity %',
                                  'physical_inactivity_% Physically Inactive':'Physical Inactivy %',
                                  'mammography_screening_% With Annual Mammogram':'Mammogram %',
                                  'flu_vaccinations_% Vaccinated':'Flu Vaccine %',
                                  'income_inequality_80th Percentile Income':'Income 80%', 
                                  'income_inequality_20th Percentile Income':'Income 20%'
                                 })

all_data=rename_data[['State','County','Preventable Hospital Stays','PCP Number', 'PCP Rate','MHP Number', 'MHP Rate','Smoking %','Obesity %','Physical Inactivy %','Mammogram %','Flu Vaccine %','Income 80%', 'Income 20%']]

all_data.head()

Unnamed: 0,State,County,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
0,Alabama,,5466.0,3187.0,65.0,5310.0,108.0,20,36,29,40.0,43.0,103239.0,19709.0
1,Alabama,Autauga,6650.0,26.0,47.0,16.0,29.0,20,33,31,39.0,42.0,109062.0,21425.0
2,Alabama,Baldwin,3471.0,153.0,70.0,220.0,99.0,19,30,25,43.0,46.0,116934.0,26666.0
3,Alabama,Barbour,5314.0,8.0,32.0,3.0,12.0,26,41,28,44.0,39.0,74745.0,12495.0
4,Alabama,Bibb,6690.0,12.0,54.0,6.0,27.0,23,37,33,33.0,40.0,84389.0,16869.0


In [7]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3193 entries, 0 to 3192
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   State                       3193 non-null   object 
 1   County                      3142 non-null   object 
 2   Preventable Hospital Stays  3148 non-null   float64
 3   PCP Number                  3041 non-null   float64
 4   PCP Rate                    3043 non-null   float64
 5   MHP Number                  2972 non-null   float64
 6   MHP Rate                    2972 non-null   float64
 7   Smoking %                   3193 non-null   int64  
 8   Obesity %                   3193 non-null   int64  
 9   Physical Inactivy %         3193 non-null   int64  
 10  Mammogram %                 3173 non-null   float64
 11  Flu Vaccine %               3175 non-null   float64
 12  Income 80%                  3192 non-null   float64
 13  Income 20%                  3192 non-null   float64

Next, I will check for duplicated rows.

In [8]:
all_data[all_data.duplicated()]

Unnamed: 0,State,County,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
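
An empty result here means no two rows are identical in every column. `duplicated()` with no arguments would not flag two rows for the same county that differ in a single value, so it can also be worth checking the key columns on their own. A small sketch on toy data (the rows are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "County": ["Autauga", "Autauga", "Baldwin"],
    "Smoking %": [20, 21, 19],
})

# Rows identical in every column: none in this toy frame
full_dupes = toy[toy.duplicated()]

# Rows sharing a State/County pair; keep=False marks every member of the group
key_dupes = toy[toy.duplicated(subset=["State", "County"], keep=False)]

print(len(full_dupes), len(key_dupes))
```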


In [9]:
missing = pd.concat([all_data.isnull().sum(), 100 * all_data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
State,0,0.0
Smoking %,0,0.0
Obesity %,0,0.0
Physical Inactivy %,0,0.0
Income 80%,1,0.031319
Income 20%,1,0.031319
Flu Vaccine %,18,0.563733
Mammogram %,20,0.62637
Preventable Hospital Stays,45,1.409333
County,51,1.597244


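I noticed that 51 rows have a missing County. That count is what one would expect if each state (plus the District of Columbia) had a single county-less row, so I am curious whether those rows hold the average or the sum for the state.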
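
Before comparing values, one could confirm that the county-less rows really are spread one per state. The sketch below does this on a toy frame, with rows invented for illustration. On the real data the same two lines would run against `all_data`.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
    "County": [None, "Autauga", "Baldwin", None, "Anchorage"],
})

# Count the rows with a missing County within each state
missing_per_state = toy["County"].isna().groupby(toy["State"]).sum()

# True only if every state has exactly one county-less row
one_per_state = bool((missing_per_state == 1).all())
print(missing_per_state)
print(one_per_state)
```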

In [10]:
df_ave = all_data.groupby('State').mean(numeric_only=True)
df_ave.head()

Unnamed: 0_level_0,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Alabama,5847.558824,96.575758,45.742424,158.462687,67.149254,22.823529,38.264706,31.897059,38.455882,40.382353,89184.352941,17422.985294
Alaska,3186.0,53.076923,94.538462,251.965517,466.275862,23.566667,33.933333,22.866667,30.529412,30.058824,123749.8,30594.333333
Arizona,3031.0625,588.5,53.125,1274.8125,101.625,19.0625,30.5625,24.125,35.3125,39.0625,96056.375,21087.875
Arkansas,5205.684211,54.054054,46.094595,195.424658,156.219178,25.092105,36.394737,33.697368,35.539474,42.907895,84665.815789,17991.802632
California,3213.423729,1069.728814,68.983051,5000.423729,372.338983,14.898305,27.152542,21.033898,37.169492,41.084746,133574.525424,28302.254237


In [11]:
df_sum = all_data.groupby('State').sum(numeric_only=True)
df_sum.head()

Unnamed: 0_level_0,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Alabama,397634.0,6374.0,3019.0,10617.0,4499.0,1552,2602,2169,2615.0,2746.0,6064536.0,1184763.0
Alaska,47790.0,1380.0,2458.0,7307.0,13522.0,707,1018,686,519.0,511.0,3712494.0,917830.0
Arizona,48497.0,9416.0,850.0,20397.0,1626.0,305,489,386,565.0,625.0,1536902.0,337406.0
Arkansas,395632.0,4000.0,3411.0,14266.0,11404.0,1907,2766,2561,2701.0,3261.0,6434602.0,1367377.0
California,189592.0,63114.0,4070.0,295025.0,21968.0,879,1602,1241,2193.0,2424.0,7880897.0,1669833.0


In [12]:
all_data.head()

Unnamed: 0,State,County,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
0,Alabama,,5466.0,3187.0,65.0,5310.0,108.0,20,36,29,40.0,43.0,103239.0,19709.0
1,Alabama,Autauga,6650.0,26.0,47.0,16.0,29.0,20,33,31,39.0,42.0,109062.0,21425.0
2,Alabama,Baldwin,3471.0,153.0,70.0,220.0,99.0,19,30,25,43.0,46.0,116934.0,26666.0
3,Alabama,Barbour,5314.0,8.0,32.0,3.0,12.0,26,41,28,44.0,39.0,74745.0,12495.0
4,Alabama,Bibb,6690.0,12.0,54.0,6.0,27.0,23,37,33,33.0,40.0,84389.0,16869.0


The values in the Alabama row with a null County did not match the state averages or sums computed above for any of the features.
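
Part of the mismatch may come from how the comparison was set up. The grouped sums and means above include the county-less row itself. For example, the Alabama `PCP Number` sum (6374.0) is exactly twice that row's value (3187.0), which is what one would see if the row were itself the county total. Rates are a separate matter, since an unweighted mean of county rates generally differs from a rate computed over the whole state. Below is a sketch of a cleaner comparison that excludes the county-less row before aggregating. The toy state-level values are invented so that the count lines up, and are not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "County": [None, "Autauga", "Baldwin"],
    "PCP Number": [179.0, 26.0, 153.0],  # state value invented: 26 + 153
    "PCP Rate": [65.0, 47.0, 70.0],      # state value invented
})

is_state_row = toy["County"].isna()
state_rows = toy[is_state_row].set_index("State").drop(columns="County")
county_rows = toy[~is_state_row]

# Aggregate the county rows only, so the state row is not counted twice
county_sum = county_rows.groupby("State").sum(numeric_only=True)
county_mean = county_rows.groupby("State").mean(numeric_only=True)

# Side-by-side view: state row vs. county aggregates
comparison = pd.concat(
    {"state_row": state_rows, "county_sum": county_sum, "county_mean": county_mean},
    axis=1,
)
print(comparison.loc["Alabama"])
```

In this toy case the count matches the county sum while the rate matches neither aggregate.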

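I decided to drop the rows with null values. Note that `dropna` as called here removes every row with a null in any column, not only the rows with a null County.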

In [13]:
# Drop every row that has a null in any column (the dropna defaults)
dropped = all_data.dropna()
dropped.head()

Unnamed: 0,State,County,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
1,Alabama,Autauga,6650.0,26.0,47.0,16.0,29.0,20,33,31,39.0,42.0,109062.0,21425.0
2,Alabama,Baldwin,3471.0,153.0,70.0,220.0,99.0,19,30,25,43.0,46.0,116934.0,26666.0
3,Alabama,Barbour,5314.0,8.0,32.0,3.0,12.0,26,41,28,44.0,39.0,74745.0,12495.0
4,Alabama,Bibb,6690.0,12.0,54.0,6.0,27.0,23,37,33,33.0,40.0,84389.0,16869.0
5,Alabama,Blount,4440.0,12.0,21.0,10.0,17.0,23,33,33,37.0,40.0,95721.0,21618.0
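
Because `dropna` without a `subset` removes a row when any column is missing, the frame shrinks from 3193 to 2794 rows rather than by the 51 county-less rows alone. If the aim were only to remove the rows without a county, restricting the check with `subset` would keep counties that are missing a single measure. A sketch on toy data (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama", "Alabama"],
    "County": [None, "Autauga", "Baldwin", "Barbour"],
    "PCP Rate": [65.0, 47.0, None, 32.0],
})

# Drops the county-less row and the county with a missing rate
any_null_dropped = toy.dropna()

# Drops only the county-less row
county_only = toy.dropna(subset=["County"])

print(len(toy), len(any_null_dropped), len(county_only))
```

Which variant is preferable depends on whether later analysis steps can tolerate missing values in individual measures.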


In [14]:
dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2794 entries, 1 to 3192
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   State                       2794 non-null   object 
 1   County                      2794 non-null   object 
 2   Preventable Hospital Stays  2794 non-null   float64
 3   PCP Number                  2794 non-null   float64
 4   PCP Rate                    2794 non-null   float64
 5   MHP Number                  2794 non-null   float64
 6   MHP Rate                    2794 non-null   float64
 7   Smoking %                   2794 non-null   int64  
 8   Obesity %                   2794 non-null   int64  
 9   Physical Inactivy %         2794 non-null   int64  
 10  Mammogram %                 2794 non-null   float64
 11  Flu Vaccine %               2794 non-null   float64
 12  Income 80%                  2794 non-null   float64
 13  Income 20%                  2794 non-null   float64

In [15]:
healthdata=dropped
healthdata.head()

Unnamed: 0,State,County,Preventable Hospital Stays,PCP Number,PCP Rate,MHP Number,MHP Rate,Smoking %,Obesity %,Physical Inactivy %,Mammogram %,Flu Vaccine %,Income 80%,Income 20%
1,Alabama,Autauga,6650.0,26.0,47.0,16.0,29.0,20,33,31,39.0,42.0,109062.0,21425.0
2,Alabama,Baldwin,3471.0,153.0,70.0,220.0,99.0,19,30,25,43.0,46.0,116934.0,26666.0
3,Alabama,Barbour,5314.0,8.0,32.0,3.0,12.0,26,41,28,44.0,39.0,74745.0,12495.0
4,Alabama,Bibb,6690.0,12.0,54.0,6.0,27.0,23,37,33,33.0,40.0,84389.0,16869.0
5,Alabama,Blount,4440.0,12.0,21.0,10.0,17.0,23,33,33,37.0,40.0,95721.0,21618.0
