# Data Cleaning Part 1
After acquiring data from the BLS (Bureau of Labor Statistics) and U.S. Census Bureau websites, we clean it so that only the information our model needs remains, in an interpretable form.

#### Importing package(s) and reading in data

In [1]:
import pandas as pd

Median wages of union and non-union members; each table contains two years of data.

In [2]:
mwage2010 = (pd.read_html('../data/data/wage_occ_2010-11.html'))[0]
mwage2012 = (pd.read_html('../data/data/wage_occ_2012-13.html'))[0]
mwage2014 = (pd.read_html('../data/data/wage_occ_2014-15.html'))[0]
mwage2016 = (pd.read_html('../data/data/wage_occ_2016-17.html'))[0]

In [3]:
mwage2010.head()

Unnamed: 0_level_0,Occupation and industry,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0
Unnamed: 0_level_1,Occupation and industry,Total,Membersofunions,Representedby unions(2),Non-union(3),Total,Membersofunions(1),Representedby unions(2),Non-union(3)
0,OCCUPATION,,,,,,,,
1,Total full-time wage and salary workers,$747,$917,$911,$717,$756,$938,$934,$729
2,"Management, professional, and related occupations",1063,1059,1055,1064,1082,1090,1082,1082
3,"Management, business, and financial operations...",1155,1138,1145,1156,1160,1169,1171,1159
4,Management occupations,1230,1161,1187,1231,1237,1287,1300,1232


All of these wage tables share the same format, so we use a single function to apply the necessary changes to each of them.

In [4]:
def wage_dfs(df):
    # drop layer from Multilevel Index
    df.columns = df.columns.droplevel(0)
    # drop extra row & reset index
    df.drop(0, inplace=True)
    df.reset_index(drop=True, inplace=True)
    # drop the "Represented by unions" columns (one per year); not needed for the model
    df.drop(columns=['Representedby unions(2)'], inplace=True)
    # in the first row, values begin with a $; strip it so every row has the same format
    df.iloc[0] = df.iloc[0].map(lambda x: x.lstrip('$'))
    return df
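
As a rough illustration of the steps in `wage_dfs`, the sketch below applies them to a toy two-level table whose values are copied from the preview above. The names `toy` and `clean_wage_table` are hypothetical. The function returns a new frame instead of modifying its input, and the numeric conversion at the end is an added assumption, not something the notebook's function does.

```python
import pandas as pd

# Toy stand-in for one wage table: two header levels, a label-only first row,
# and "$" prefixes on the first data row (values taken from the preview above).
columns = pd.MultiIndex.from_tuples([
    ("Occupation and industry", "Occupation and industry"),
    ("Unnamed: 1_level_0", "Total"),
    ("Unnamed: 2_level_0", "Membersofunions"),
    ("Unnamed: 3_level_0", "Representedby unions(2)"),
    ("Unnamed: 4_level_0", "Non-union(3)"),
])
toy = pd.DataFrame(
    [
        ["OCCUPATION", None, None, None, None],
        ["Total full-time wage and salary workers", "$747", "$917", "$911", "$717"],
        ["Management occupations", "1230", "1161", "1187", "1231"],
    ],
    columns=columns,
)


def clean_wage_table(df):
    """Hypothetical non-mutating variant of the cleaning steps."""
    out = df.copy()
    # keep only the lower header level
    out.columns = out.columns.droplevel(0)
    # drop the label-only row and renumber the remaining rows
    out = out.drop(index=0).reset_index(drop=True)
    # drop the column the model does not use
    out = out.drop(columns=["Representedby unions(2)"])
    # assumption: strip "$" from every wage column and convert to numbers
    for col in out.columns[1:]:
        out[col] = pd.to_numeric(out[col].str.lstrip("$"))
    return out


cleaned = clean_wage_table(toy)
print(cleaned)
```

Converting to numbers here is a design choice for the sketch; it makes later arithmetic on the wage columns possible without a separate casting step.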
    

In [5]:
mwage2010 = wage_dfs(mwage2010)

In [6]:
mwage2010.columns = ['Occupation/Industry','2010_Total $', '2010_Union $','2010_Non Union $','2011_Total $','2011_Union $','2011_Non Union $']
mwage2010.head()

Unnamed: 0,Occupation/Industry,2010_Total $,2010_Union $,2010_Non Union $,2011_Total $,2011_Union $,2011_Non Union $
0,Total full-time wage and salary workers,747,917,717,756,938,729
1,"Management, professional, and related occupations",1063,1059,1064,1082,1090,1082
2,"Management, business, and financial operations...",1155,1138,1156,1160,1169,1159
3,Management occupations,1230,1161,1231,1237,1287,1232
4,Business and financial operations occupations,1036,1082,1035,1038,999,1042


In [7]:
mwage2012 = wage_dfs(mwage2012)

In [8]:
mwage2012.columns = ['Occupation/Industry','2012_Total $', '2012_Union $','2012_Non Union $','2013_Total $','2013_Union $','2013_Non Union $']
mwage2012.head()

Unnamed: 0,Occupation/Industry,2012_Total $,2012_Union $,2012_Non Union $,2013_Total $,2013_Union $,2013_Non Union $
0,Total full-time wage and salary workers,768,943,742,776,950,750
1,"Management, professional, and related occupations",1108,1108,1111,1132,1121,1135
2,"Management, business, and financial operations...",1171,1159,1172,1208,1202,1207
3,Management occupations,1248,1261,1247,1285,1305,1280
4,Business and financial operations occupations,1058,1060,1060,1091,1086,1092


In [9]:
mwage2014 = wage_dfs(mwage2014)

In [10]:
mwage2014.columns = ['Occupation/Industry','2014_Total $', '2014_Union $','2014_Non Union $','2015_Total $','2015_Union $','2015_Non Union $']
mwage2014.head()

Unnamed: 0,Occupation/Industry,2014_Total $,2014_Union $,2014_Non Union $,2015_Total $,2015_Union $,2015_Non Union $
0,Total full-time wage and salary workers,791,970,763,809,980,776
1,"Management, professional, and related occupations",1137,1132,1139,1158,1152,1160
2,"Management, business, and financial operations...",1227,1246,1226,1258,1273,1257
3,Management occupations,1295,1333,1292,1351,1386,1349
4,Business and financial operations occupations,1107,1135,1104,1137,1108,1138


In [11]:
mwage2016 = wage_dfs(mwage2016)

In [12]:
mwage2016.columns = ['Occupation/Industry','2016_Total $', '2016_Union $','2016_Non Union $','2017_Total $','2017_Union $','2017_Non Union $']
mwage2016.head()

Unnamed: 0,Occupation/Industry,2016_Total $,2016_Union $,2016_Non Union $,2017_Total $,2017_Union $,2017_Non Union $
0,Total full-time wage and salary workers,832,1004,802,860,1041,829
1,"Management, professional, and related occupations",1188,1166,1197,1224,1215,1227
2,"Management, business, and financial operations...",1284,1263,1285,1327,1276,1329
3,Management occupations,1370,1389,1368,1392,1349,1395
4,Business and financial operations occupations,1161,1146,1164,1174,1188,1174


In [13]:
# saving cleaned data csvs
mwage2010.to_csv('../data/cleaned_data/wages/median_wage_2010-11.csv')
mwage2012.to_csv('../data/cleaned_data/wages/median_wage_2012-13.csv')
mwage2014.to_csv('../data/cleaned_data/wages/median_wage_2014-15.csv')
mwage2016.to_csv('../data/cleaned_data/wages/median_wage_2016-17.csv')
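
The column lists assigned in the cells above follow one pattern per two-year table. A minimal sketch of generating them, assuming a hypothetical helper named `wage_columns` that is not part of the notebook:

```python
def wage_columns(first_year):
    """Column names for a two-year wage table starting at first_year (hypothetical helper)."""
    names = ["Occupation/Industry"]
    for year in (first_year, first_year + 1):
        names += [f"{year}_Total $", f"{year}_Union $", f"{year}_Non Union $"]
    return names


# One list per table read in above.
for first_year in (2010, 2012, 2014, 2016):
    print(wage_columns(first_year))
```

Each cleaned table could then be renamed with, for example, `mwage2010.columns = wage_columns(2010)`, which avoids retyping the seven names for every pair of years.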

Union membership by state: total membership and construction membership.

In [14]:
union_s10 = pd.read_csv('../data/data/State_U_2010.csv')
union_s11 = pd.read_csv('../data/data/State_U_2011.csv')
union_s12 = pd.read_csv('../data/data/State_U_2012.csv')
union_s13 = pd.read_csv('../data/data/State_U_2013.csv')
union_s14 = pd.read_csv('../data/data/State_U_2014.csv')
union_s15 = pd.read_csv('../data/data/State_U_2015.csv')
union_s16 = pd.read_csv('../data/data/State_U_2016.csv')
union_s17 = pd.read_csv('../data/data/State_U_2017.csv')

In [15]:
union_s10.head()

Unnamed: 0,Code,State,Sector,Obs,Employment,Members,Covered,%Mem,%Cov
0,63,Alabama,Total,1722,1808807,183338,202789,10.1,11.2
1,63,Alabama,Private,1388,1463322,83562,91171,5.7,6.2
2,63,Alabama,Public,334,345485,99776,111617,28.9,32.3
3,63,Alabama,Priv. Construction,92,100593,4785,4785,4.8,4.8
4,63,Alabama,Priv. Manufacturing,236,245221,31989,35585,13.0,14.5


For the purposes of our model, we do not need the figures for workers who are covered by unions but are not members. Again, we use a function to clean these dataframes and then save each one to its own CSV file.

In [16]:
def union_dfs(df):
    # drop columns not needed for the model (returns a new frame; the input is left untouched)
    df = df.drop(['Obs','Covered','%Cov', 'Code'], axis=1)
    # limiting df to total # across industries and #'s in the construction industry;
    # .copy() makes the result an independent frame, which avoids SettingWithCopyWarning
    df = df[(df['Sector'] == 'Priv. Construction') | (df['Sector'] == 'Total')].copy()
    df['Sector'] = df['Sector'].replace('Priv. Construction', 'Construction')
    # setting index back to 0
    df.reset_index(drop=True, inplace=True)
    return df

In [17]:
union_s10 = union_dfs(union_s10)
union_s10.head()


Unnamed: 0,State,Sector,Employment,Members,%Mem
0,Alabama,Total,1808807,183338,10.1
1,Alabama,Construction,100593,4785,4.8
2,Alaska,Total,295063,67624,22.9
3,Alaska,Construction,17440,3731,21.4
4,Arizona,Total,2506723,160989,6.4


In [18]:
union_s11 = union_dfs(union_s11)
union_s12 = union_dfs(union_s12)
union_s13 = union_dfs(union_s13)
union_s14 = union_dfs(union_s14)
union_s15 = union_dfs(union_s15)
union_s16 = union_dfs(union_s16)
union_s17 = union_dfs(union_s17)

In [19]:
# saving cleaned csvs
union_s10.to_csv('../data/cleaned_data/membership/State_UMem_2010.csv')
union_s11.to_csv('../data/cleaned_data/membership/State_UMem_2011.csv')
union_s12.to_csv('../data/cleaned_data/membership/State_UMem_2012.csv')
union_s13.to_csv('../data/cleaned_data/membership/State_UMem_2013.csv')
union_s14.to_csv('../data/cleaned_data/membership/State_UMem_2014.csv')
union_s15.to_csv('../data/cleaned_data/membership/State_UMem_2015.csv')
union_s16.to_csv('../data/cleaned_data/membership/State_UMem_2016.csv')
union_s17.to_csv('../data/cleaned_data/membership/State_UMem_2017.csv')
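
The read, clean, and save steps for the membership files repeat once per year. The sketch below shows one way to express the cleaning on toy rows copied from the preview above and to derive the file paths from the year. `clean_union` and `union_paths` are hypothetical names; unlike `union_dfs`, the sketch selects rows with `isin`, renames only within the `Sector` column, and never modifies its input.

```python
import pandas as pd

# Toy rows copied from the 2010 preview above.
toy = pd.DataFrame(
    {
        "Code": [63, 63, 63, 63],
        "State": ["Alabama"] * 4,
        "Sector": ["Total", "Private", "Public", "Priv. Construction"],
        "Obs": [1722, 1388, 334, 92],
        "Employment": [1808807, 1463322, 345485, 100593],
        "Members": [183338, 83562, 99776, 4785],
        "Covered": [202789, 91171, 111617, 4785],
        "%Mem": [10.1, 5.7, 28.9, 4.8],
        "%Cov": [11.2, 6.2, 32.3, 4.8],
    }
)


def clean_union(df):
    """Keep total and construction rows and the columns used for modeling."""
    keep_rows = df["Sector"].isin(["Total", "Priv. Construction"])
    keep_cols = ["State", "Sector", "Employment", "Members", "%Mem"]
    # .copy() gives an independent frame, so the assignment below is safe
    out = df.loc[keep_rows, keep_cols].copy()
    out["Sector"] = out["Sector"].replace("Priv. Construction", "Construction")
    return out.reset_index(drop=True)


def union_paths(year):
    """Raw and cleaned paths for one year, following the naming used above."""
    raw = f"../data/data/State_U_{year}.csv"
    cleaned = f"../data/cleaned_data/membership/State_UMem_{year}.csv"
    return raw, cleaned


cleaned = clean_union(toy)
print(cleaned)

for year in range(2010, 2018):
    raw_path, cleaned_path = union_paths(year)
    print(raw_path, "->", cleaned_path)
    # With the raw files present, one could run:
    # clean_union(pd.read_csv(raw_path)).to_csv(cleaned_path)
```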

Fatalities in states by industry and year.

In [20]:
AL_fatal = pd.read_html('../data/data/AL_fatal.html')[0]
AL_fatal.head()

Unnamed: 0,Index,2010_Total,2011_Total,2012_Total,2013_Total,2014_Total,2015_Total,2016_Total,2017_Total,Contactwith objectsandequipment,Falls,Exposure toharmfulsubstances orenvironments,Transpor-tationincidents,Firesorexplosions,Assaultsandviolentacts
0,Total,92.0,92.0,92.0,92.0,92.0,92.0,92.0,92.0,16.0,13.0,5.0,34.0,4.0,20.0
1,,,,,,,,,,,,,,,
2,Private industry,80.0,16.0,13.0,4.0,29.0,3.0,15.0,,,,,,,
3,Goods Producing,36.0,11.0,9.0,,11.0,,,,,,,,,
4,Natural resources and mining,12.0,3.0,,,7.0,,,,,,,,,


In [21]:
fatal_dfs = []

AL_fatal = pd.read_html('../data/data/AL_fatal.html')[0]
AK_fatal = pd.read_html('../data/data/AK_fatal.html')[0]
AR_fatal = pd.read_html('../data/data/AR_fatal.html')[0]
AZ_fatal = pd.read_html('../data/data/AZ_fatal.html')[0]
CA_fatal = pd.read_html('../data/data/CA_fatal.html')[0]
CO_fatal = pd.read_html('../data/data/CO_fatal.html')[0]
CT_fatal = pd.read_html('../data/data/CT_fatal.html')[0]
DC_fatal = pd.read_html('../data/data/DC_fatal.html')[0]
DE_fatal = pd.read_html('../data/data/DE_fatal.html')[0]
FL_fatal = pd.read_html('../data/data/FL_fatal.html')[0]
GA_fatal = pd.read_html('../data/data/GA_fatal.html')[0]
HI_fatal = pd.read_html('../data/data/HI_fatal.html')[0]
IA_fatal = pd.read_html('../data/data/IA_fatal.html')[0]
ID_fatal = pd.read_html('../data/data/ID_fatal.html')[0]
IL_fatal = pd.read_html('../data/data/IL_fatal.html')[0]
IN_fatal = pd.read_html('../data/data/IN_fatal.html')[0]
KS_fatal = pd.read_html('../data/data/KS_fatal.html')[0]
KY_fatal = pd.read_html('../data/data/KY_fatal.html')[0]
LA_fatal = pd.read_html('../data/data/LA_fatal.html')[0]
MA_fatal = pd.read_html('../data/data/MA_fatal.html')[0]
ME_fatal = pd.read_html('../data/data/ME_fatal.html')[0]
MD_fatal = pd.read_html('../data/data/MD_fatal.html')[0]
MI_fatal = pd.read_html('../data/data/MI_fatal.html')[0]
MN_fatal = pd.read_html('../data/data/MN_fatal.html')[0]
MO_fatal = pd.read_html('../data/data/MO_fatal.html')[0]
MS_fatal = pd.read_html('../data/data/MS_fatal.html')[0]
MT_fatal = pd.read_html('../data/data/MT_fatal.html')[0]
NC_fatal = pd.read_html('../data/data/NC_fatal.html')[0]
ND_fatal = pd.read_html('../data/data/ND_fatal.html')[0]
NE_fatal = pd.read_html('../data/data/NE_fatal.html')[0]
NH_fatal = pd.read_html('../data/data/NH_fatal.html')[0]
NJ_fatal = pd.read_html('../data/data/NJ_fatal.html')[0]
NM_fatal = pd.read_html('../data/data/NM_fatal.html')[0]
NV_fatal = pd.read_html('../data/data/NV_fatal.html')[0]
NY_fatal = pd.read_html('../data/data/NY_fatal.html')[0]
OH_fatal = pd.read_html('../data/data/OH_fatal.html')[0]
OK_fatal = pd.read_html('../data/data/OK_fatal.html')[0]
OR_fatal = pd.read_html('../data/data/OR_fatal.html')[0]
PA_fatal = pd.read_html('../data/data/PA_fatal.html')[0]
RI_fatal = pd.read_html('../data/data/RI_fatal.html')[0]
SC_fatal = pd.read_html('../data/data/SC_fatal.html')[0]
SD_fatal = pd.read_html('../data/data/SD_fatal.html')[0]
TN_fatal = pd.read_html('../data/data/TN_fatal.html')[0]
TX_fatal = pd.read_html('../data/data/TX_fatal.html')[0]
UT_fatal = pd.read_html('../data/data/UT_fatal.html')[0]
VA_fatal = pd.read_html('../data/data/VA_fatal.html')[0]
VT_fatal = pd.read_html('../data/data/VT_fatal.html')[0]
WA_fatal = pd.read_html('../data/data/WA_fatal.html')[0]
WI_fatal = pd.read_html('../data/data/WI_fatal.html')[0]
WV_fatal = pd.read_html('../data/data/WV_fatal.html')[0]
WY_fatal = pd.read_html('../data/data/WY_fatal.html')[0]
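
The 51 reads above differ only in the two-letter code, so they can also be written as a loop over the codes. This is a sketch under the assumption that a dict keyed by code is acceptable in place of one variable per state; `STATE_CODES`, `fatal_path`, and `load_fatalities` are hypothetical names.

```python
import pandas as pd

# The 51 codes used in the cell above (50 states plus DC).
STATE_CODES = [
    "AL", "AK", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "IA",
    "ID", "IL", "IN", "KS", "KY", "LA", "MA", "ME", "MD", "MI", "MN", "MO", "MS",
    "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA",
    "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY",
]


def fatal_path(code):
    """Path of the raw fatality table for one state code."""
    return f"../data/data/{code}_fatal.html"


def load_fatalities(codes=STATE_CODES):
    """Read the first table of each state's HTML file into a dict keyed by code."""
    return {code: pd.read_html(fatal_path(code))[0] for code in codes}


print(len(STATE_CODES), fatal_path("AL"))
# With the raw files present, one could run: fatal_by_state = load_fatalities()
```

Keeping the frames in a dict means later steps can iterate over `fatal_by_state.items()` instead of naming each state again.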

In [22]:
# function to return only the year total fatalities and fatalities in construction
def fatal_fix(df):
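    # keep the first 8 rows and 9 columns; .copy() so the in-place calls below act on an independent frame
    df = df.iloc[0:8, 0:9].copy()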
    df.drop([1,2,3,4,5,6], axis=0, inplace=True)
    df.set_index('Index', inplace=True)
    df.fillna(0, inplace=True)
    return df
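
To make the row and column selection in `fatal_fix` easier to follow, here is a small sketch on a toy table. The numbers are placeholders, not BLS figures, and the labels `row 5`, `row 6`, `row 8`, and `Construction` in position 7 are assumptions based on the function's comment. `fatal_fix_copy` is a hypothetical variant that returns a new frame.

```python
import pandas as pd

# Placeholder table: an "Index" label column, eight yearly totals, one extra column.
labels = ["Total", None, "Private industry", "Goods Producing",
          "Natural resources and mining", "row 5", "row 6", "Construction", "row 8"]
year_cols = [f"{year}_Total" for year in range(2010, 2018)]

toy = pd.DataFrame({"Index": labels})
for offset, col in enumerate(year_cols):
    # placeholder values only, chosen so every cell is distinct and non-zero
    toy[col] = [float(10 * row + offset + 1) for row in range(len(labels))]
toy.loc[7, "2012_Total"] = float("nan")  # a missing cell, to show fillna
toy["Falls"] = 1.0                       # an extra column that gets trimmed


def fatal_fix_copy(df):
    """Keep the yearly totals for the first row and the row at position 7."""
    out = df.iloc[0:8, 0:9].copy()           # first 8 rows, label + 8 year columns
    out = out.drop(index=[1, 2, 3, 4, 5, 6])  # leaves rows 0 and 7
    out = out.set_index("Index")
    return out.fillna(0)                      # missing counts become 0


cleaned = fatal_fix_copy(toy)
print(cleaned)
```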

In [23]:
AL_fatal = fatal_fix(AL_fatal)
AK_fatal = fatal_fix(AK_fatal)
AR_fatal = fatal_fix(AR_fatal)
AZ_fatal = fatal_fix(AZ_fatal)
CA_fatal = fatal_fix(CA_fatal)
CO_fatal = fatal_fix(CO_fatal)
CT_fatal = fatal_fix(CT_fatal)
DC_fatal = fatal_fix(DC_fatal)
DE_fatal = fatal_fix(DE_fatal)
FL_fatal = fatal_fix(FL_fatal)
GA_fatal = fatal_fix(GA_fatal)
HI_fatal = fatal_fix(HI_fatal)
IA_fatal = fatal_fix(IA_fatal)
ID_fatal = fatal_fix(ID_fatal)
IL_fatal = fatal_fix(IL_fatal)
IN_fatal = fatal_fix(IN_fatal)
KS_fatal = fatal_fix(KS_fatal)
KY_fatal = fatal_fix(KY_fatal)
LA_fatal = fatal_fix(LA_fatal)
MA_fatal = fatal_fix(MA_fatal)
ME_fatal = fatal_fix(ME_fatal)
MD_fatal = fatal_fix(MD_fatal)
MI_fatal = fatal_fix(MI_fatal)
MN_fatal = fatal_fix(MN_fatal)
MO_fatal = fatal_fix(MO_fatal)
MS_fatal = fatal_fix(MS_fatal)
MT_fatal = fatal_fix(MT_fatal)
NC_fatal = fatal_fix(NC_fatal)
ND_fatal = fatal_fix(ND_fatal)
NE_fatal = fatal_fix(NE_fatal)
NH_fatal = fatal_fix(NH_fatal)
NJ_fatal = fatal_fix(NJ_fatal)
NM_fatal = fatal_fix(NM_fatal)
NV_fatal = fatal_fix(NV_fatal)
NY_fatal = fatal_fix(NY_fatal)
OH_fatal = fatal_fix(OH_fatal)
OK_fatal = fatal_fix(OK_fatal)
OR_fatal = fatal_fix(OR_fatal)
PA_fatal = fatal_fix(PA_fatal)
RI_fatal = fatal_fix(RI_fatal)
SC_fatal = fatal_fix(SC_fatal)
SD_fatal = fatal_fix(SD_fatal)
TN_fatal = fatal_fix(TN_fatal)
TX_fatal = fatal_fix(TX_fatal)
UT_fatal = fatal_fix(UT_fatal)
VA_fatal = fatal_fix(VA_fatal)
VT_fatal = fatal_fix(VT_fatal)
WA_fatal = fatal_fix(WA_fatal)
WI_fatal = fatal_fix(WI_fatal)
WV_fatal = fatal_fix(WV_fatal)
WY_fatal = fatal_fix(WY_fatal)

In [24]:
# saving cleaned files to csvs

AL_fatal.to_csv('../data/cleaned_data/fatalities/AL_fatalc.csv')
AK_fatal.to_csv('../data/cleaned_data/fatalities/AK_fatalc.csv')
AR_fatal.to_csv('../data/cleaned_data/fatalities/AR_fatalc.csv')
AZ_fatal.to_csv('../data/cleaned_data/fatalities/AZ_fatalc.csv')
CA_fatal.to_csv('../data/cleaned_data/fatalities/CA_fatalc.csv')
CO_fatal.to_csv('../data/cleaned_data/fatalities/CO_fatalc.csv')
CT_fatal.to_csv('../data/cleaned_data/fatalities/CT_fatalc.csv')
DC_fatal.to_csv('../data/cleaned_data/fatalities/DC_fatalc.csv')
DE_fatal.to_csv('../data/cleaned_data/fatalities/DE_fatalc.csv')
FL_fatal.to_csv('../data/cleaned_data/fatalities/FL_fatalc.csv')
GA_fatal.to_csv('../data/cleaned_data/fatalities/GA_fatalc.csv')
HI_fatal.to_csv('../data/cleaned_data/fatalities/HI_fatalc.csv')
IA_fatal.to_csv('../data/cleaned_data/fatalities/IA_fatalc.csv')
ID_fatal.to_csv('../data/cleaned_data/fatalities/ID_fatalc.csv')
IL_fatal.to_csv('../data/cleaned_data/fatalities/IL_fatalc.csv')
IN_fatal.to_csv('../data/cleaned_data/fatalities/IN_fatalc.csv')
KS_fatal.to_csv('../data/cleaned_data/fatalities/KS_fatalc.csv')
KY_fatal.to_csv('../data/cleaned_data/fatalities/KY_fatalc.csv')
LA_fatal.to_csv('../data/cleaned_data/fatalities/LA_fatalc.csv')
MA_fatal.to_csv('../data/cleaned_data/fatalities/MA_fatalc.csv')
ME_fatal.to_csv('../data/cleaned_data/fatalities/ME_fatalc.csv')
MD_fatal.to_csv('../data/cleaned_data/fatalities/MD_fatalc.csv')
MI_fatal.to_csv('../data/cleaned_data/fatalities/MI_fatalc.csv')
MN_fatal.to_csv('../data/cleaned_data/fatalities/MN_fatalc.csv')
MO_fatal.to_csv('../data/cleaned_data/fatalities/MO_fatalc.csv')
MS_fatal.to_csv('../data/cleaned_data/fatalities/MS_fatalc.csv')
MT_fatal.to_csv('../data/cleaned_data/fatalities/MT_fatalc.csv')
NC_fatal.to_csv('../data/cleaned_data/fatalities/NC_fatalc.csv')
ND_fatal.to_csv('../data/cleaned_data/fatalities/ND_fatalc.csv')
NE_fatal.to_csv('../data/cleaned_data/fatalities/NE_fatalc.csv')
NH_fatal.to_csv('../data/cleaned_data/fatalities/NH_fatalc.csv')
NJ_fatal.to_csv('../data/cleaned_data/fatalities/NJ_fatalc.csv')
NM_fatal.to_csv('../data/cleaned_data/fatalities/NM_fatalc.csv')
NV_fatal.to_csv('../data/cleaned_data/fatalities/NV_fatalc.csv')
NY_fatal.to_csv('../data/cleaned_data/fatalities/NY_fatalc.csv')
OH_fatal.to_csv('../data/cleaned_data/fatalities/OH_fatalc.csv')
OK_fatal.to_csv('../data/cleaned_data/fatalities/OK_fatalc.csv')
OR_fatal.to_csv('../data/cleaned_data/fatalities/OR_fatalc.csv')
PA_fatal.to_csv('../data/cleaned_data/fatalities/PA_fatalc.csv')
RI_fatal.to_csv('../data/cleaned_data/fatalities/RI_fatalc.csv')
SC_fatal.to_csv('../data/cleaned_data/fatalities/SC_fatalc.csv')
SD_fatal.to_csv('../data/cleaned_data/fatalities/SD_fatalc.csv')
TN_fatal.to_csv('../data/cleaned_data/fatalities/TN_fatalc.csv')
TX_fatal.to_csv('../data/cleaned_data/fatalities/TX_fatalc.csv')
UT_fatal.to_csv('../data/cleaned_data/fatalities/UT_fatalc.csv')
VA_fatal.to_csv('../data/cleaned_data/fatalities/VA_fatalc.csv')
VT_fatal.to_csv('../data/cleaned_data/fatalities/VT_fatalc.csv')
WA_fatal.to_csv('../data/cleaned_data/fatalities/WA_fatalc.csv')
WI_fatal.to_csv('../data/cleaned_data/fatalities/WI_fatalc.csv')
WV_fatal.to_csv('../data/cleaned_data/fatalities/WV_fatalc.csv')
WY_fatal.to_csv('../data/cleaned_data/fatalities/WY_fatalc.csv')

Poverty rates by state, 2010-2017.

In [25]:
poverty_10 = pd.read_csv('../data/data/ACS_10.csv', header=1)
poverty_11 = pd.read_csv('../data/data/ACS_11.csv', header=1)
poverty_12 = pd.read_csv('../data/data/ACS_12.csv', header=1)
poverty_13 = pd.read_csv('../data/data/ACS_13.csv', header=1)
poverty_14 = pd.read_csv('../data/data/ACS_14.csv', header=1)
poverty_15 = pd.read_csv('../data/data/ACS_15.csv', header=1)
poverty_16 = pd.read_csv('../data/data/ACS_16.csv', header=1)
poverty_17 = pd.read_csv('../data/data/ACS_17.csv', header=1)

In [26]:
# function to clean for the information necessary to our modeling
def fix_pov(df):
    # set and rename index
    df.set_index('Geography', inplace=True)
    df.rename_axis('State', inplace=True)
    # necessary rows/columns only; .copy() so the in-place drop below acts on an independent frame
    df = df.iloc[0:51, 2:8].copy()
    df.drop(columns=['Total; Margin of Error; Population for whom poverty status is determined',
                     'Below poverty level; Margin of Error; Population for whom poverty status is determined',
                    'Percent below poverty level; Margin of Error; Population for whom poverty status is determined'],
                     inplace=True)
    return df

In [27]:
poverty_10 = fix_pov(poverty_10)
poverty_11 = fix_pov(poverty_11)
poverty_12 = fix_pov(poverty_12)
poverty_13 = fix_pov(poverty_13)
poverty_14 = fix_pov(poverty_14)
poverty_15 = fix_pov(poverty_15)
poverty_16 = fix_pov(poverty_16)
poverty_17 = fix_pov(poverty_17)

In [28]:
# renaming columns
poverty_10.columns=['2010_Total_Pop','2010_Pop_Below_Pov','2010_Pct_Below_Pov']
poverty_10.head(1)

State,2010_Total_Pop,2010_Pop_Below_Pov,2010_Pct_Below_Pov
Alabama,4666970,888290,19.0


In [29]:
poverty_11.columns=['2011_Total_Pop','2011_Pop_Below_Pov','2011_Pct_Below_Pov']
poverty_12.columns=['2012_Total_Pop','2012_Pop_Below_Pov','2012_Pct_Below_Pov']
poverty_13.columns=['2013_Total_Pop','2013_Pop_Below_Pov','2013_Pct_Below_Pov']
poverty_14.columns=['2014_Total_Pop','2014_Pop_Below_Pov','2014_Pct_Below_Pov']
poverty_15.columns=['2015_Total_Pop','2015_Pop_Below_Pov','2015_Pct_Below_Pov']
poverty_16.columns=['2016_Total_Pop','2016_Pop_Below_Pov','2016_Pct_Below_Pov']
poverty_17.columns=['2017_Total_Pop','2017_Pop_Below_Pov','2017_Pct_Below_Pov']
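
The eight column assignments above share one naming pattern, which a small helper can generate from the year. The sketch below also recomputes the percentage for the Alabama row shown in the 2010 preview as a sanity check. `pov_columns` is a hypothetical helper, and the initial column names `total`, `below`, and `pct` are placeholders.

```python
import pandas as pd


def pov_columns(year):
    """Column names for one year's cleaned poverty table (hypothetical helper)."""
    return [f"{year}_Total_Pop", f"{year}_Pop_Below_Pov", f"{year}_Pct_Below_Pov"]


# The Alabama row from the 2010 preview above, with placeholder column names.
toy = pd.DataFrame(
    [[4666970, 888290, 19.0]],
    index=pd.Index(["Alabama"], name="State"),
    columns=["total", "below", "pct"],
)
toy.columns = pov_columns(2010)

# Sanity check: the percentage column should agree with below / total.
recomputed = (100 * toy["2010_Pop_Below_Pov"] / toy["2010_Total_Pop"]).round(1)
print(toy)
print(recomputed)
```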

In [30]:
poverty_10.to_csv('../data/cleaned_data/poverty/state_pov_2010.csv')
poverty_11.to_csv('../data/cleaned_data/poverty/state_pov_2011.csv')
poverty_12.to_csv('../data/cleaned_data/poverty/state_pov_2012.csv')
poverty_13.to_csv('../data/cleaned_data/poverty/state_pov_2013.csv')
poverty_14.to_csv('../data/cleaned_data/poverty/state_pov_2014.csv')
poverty_15.to_csv('../data/cleaned_data/poverty/state_pov_2015.csv')
poverty_16.to_csv('../data/cleaned_data/poverty/state_pov_2016.csv')
poverty_17.to_csv('../data/cleaned_data/poverty/state_pov_2017.csv')

Median hourly and annual wages in states for all occupations versus construction occupations.

In [31]:
medstate_2010 = pd.read_csv('../data/data/state_M2010.csv')
medstate_2011 = pd.read_csv('../data/data/state_M2011.csv')
medstate_2012 = pd.read_csv('../data/data/state_M2012.csv')
medstate_2013 = pd.read_csv('../data/data/state_M2013.csv')
medstate_2014 = pd.read_csv('../data/data/state_M2014.csv')
medstate_2015 = pd.read_csv('../data/data/state_M2015.csv')
medstate_2016 = pd.read_csv('../data/data/state_M2016.csv')
medstate_2017 = pd.read_csv('../data/data/state_M2017.csv')

In [32]:
def clean_median(df):
    # retaining only columns relevant to modeling
    df = df[['STATE','OCC_TITLE','TOT_EMP','H_MEAN','H_MEDIAN','A_MEAN','A_MEDIAN']]
    # only interested in these two occupation groups; .copy() so the in-place calls below act on an independent frame
    df = df[(df['OCC_TITLE'] == 'Construction and Extraction Occupations') | (df['OCC_TITLE'] == 'All Occupations')].copy()
    # renaming
    df.replace(['All Occupations', 'Construction and Extraction Occupations'], ['All', 'Construction/Extraction'], inplace=True)
    df.rename(columns = {'H_MEAN': 'HR_MEAN $', 'H_MEDIAN': 'HR_MEDIAN $', 'A_MEAN': 'ANN_MEAN $','A_MEDIAN': 'ANN_MEDIAN $'}, inplace=True)
    # set index to state names
    df.set_index('STATE', inplace=True)
    df.rename_axis('State', inplace=True)
    # remove Puerto Rico, Guam, Virgin Islands
    df.drop(['Guam','Puerto Rico', 'Virgin Islands'], inplace=True)
    return df
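
For reference, the same steps can be traced on two rows taken from the 2010 preview. This is a sketch, not the notebook's function: `clean_median_copy`, `TITLES`, and `WAGE_NAMES` are hypothetical names, the raw column names are those listed in `clean_median`, and passing `errors="ignore"` to `drop` is an added assumption so that a file without territory rows would not raise a `KeyError`.

```python
import pandas as pd

# Two rows from the 2010 preview, restored to the raw column and title names.
toy = pd.DataFrame({
    "STATE": ["Alabama", "Alabama"],
    "OCC_TITLE": ["All Occupations", "Construction and Extraction Occupations"],
    "TOT_EMP": [1807480, 82970],
    "H_MEAN": [18.55, 16.83],
    "H_MEDIAN": [14.21, 15.26],
    "A_MEAN": [38590, 35010],
    "A_MEDIAN": [29570, 31750],
})

TITLES = {
    "All Occupations": "All",
    "Construction and Extraction Occupations": "Construction/Extraction",
}
WAGE_NAMES = {
    "H_MEAN": "HR_MEAN $",
    "H_MEDIAN": "HR_MEDIAN $",
    "A_MEAN": "ANN_MEAN $",
    "A_MEDIAN": "ANN_MEDIAN $",
}


def clean_median_copy(df):
    """Non-mutating sketch of the wage cleaning steps."""
    cols = ["STATE", "OCC_TITLE", "TOT_EMP", "H_MEAN", "H_MEDIAN", "A_MEAN", "A_MEDIAN"]
    # keep the two occupation groups of interest as an independent frame
    out = df.loc[df["OCC_TITLE"].isin(list(TITLES)), cols].copy()
    out["OCC_TITLE"] = out["OCC_TITLE"].replace(TITLES)
    out = out.rename(columns=WAGE_NAMES).set_index("STATE").rename_axis("State")
    # territories are dropped when present; absent labels are skipped
    return out.drop(["Guam", "Puerto Rico", "Virgin Islands"], errors="ignore")


cleaned = clean_median_copy(toy)
print(cleaned)
```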

In [33]:
medstate_2010 = clean_median(medstate_2010)
medstate_2011 = clean_median(medstate_2011)
medstate_2012 = clean_median(medstate_2012)
medstate_2013 = clean_median(medstate_2013)
medstate_2014 = clean_median(medstate_2014)
medstate_2015 = clean_median(medstate_2015)
medstate_2016 = clean_median(medstate_2016)
medstate_2017 = clean_median(medstate_2017)

In [34]:
medstate_2010.head()

State,OCC_TITLE,TOT_EMP,HR_MEAN $,HR_MEDIAN $,ANN_MEAN $,ANN_MEDIAN $
Alabama,All,1807480,18.55,14.21,38590,29570
Alabama,Construction/Extraction,82970,16.83,15.26,35010,31750
Alaska,All,308050,24.21,20.02,50350,41640
Alaska,Construction/Extraction,20230,28.84,28.68,59980,59660
Arizona,All,2367120,20.38,15.89,42390,33040


In [35]:
# saving cleaned data to csvs
medstate_2010.to_csv('../data/cleaned_data/wages/med_mean_2010.csv')
medstate_2011.to_csv('../data/cleaned_data/wages/med_mean_2011.csv')
medstate_2012.to_csv('../data/cleaned_data/wages/med_mean_2012.csv')
medstate_2013.to_csv('../data/cleaned_data/wages/med_mean_2013.csv')
medstate_2014.to_csv('../data/cleaned_data/wages/med_mean_2014.csv')
medstate_2015.to_csv('../data/cleaned_data/wages/med_mean_2015.csv')
medstate_2016.to_csv('../data/cleaned_data/wages/med_mean_2016.csv')
medstate_2017.to_csv('../data/cleaned_data/wages/med_mean_2017.csv')