# Data Cleaning Part 2

Data on mean and median wages (hourly and annual) for all workers versus construction workers will be combined with data on union membership across all industries versus the construction industry. Only the annual wage figures are kept in the combined tables.
Each final table holds one year of data, with one row per state.

#### Importing package(s) and reading in data.

In [1]:
import pandas as pd

In [2]:
med2010 = pd.read_csv('../data/cleaned_data/wages/med_mean_2010.csv')
med2010.head(2)

Unnamed: 0,State,OCC_TITLE,TOT_EMP,HR_MEAN $,HR_MEDIAN $,ANN_MEAN $,ANN_MEDIAN $
0,Alabama,All,1807480,18.55,14.21,38590,29570
1,Alabama,Construction/Extraction,82970,16.83,15.26,35010,31750


In [3]:
mem2010 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2010.csv', index_col=0).drop(columns=['Members'])
mem2010.head(2)

Unnamed: 0,State,Sector,Employment,%Mem
0,Alabama,Total,1808807,10.1
1,Alabama,Construction,100593,4.8


The function below combines a state-level wage dataframe with the union membership dataframe from the same year.

In [4]:
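def merge_yr(df1,df2):
    # df1: wage data, df2: membership data for the same year
    # joins on state, then drops the occupation label, wage-file employment count, and hourly wage columns
    df = df1.merge(df2, on='State').drop(columns=['OCC_TITLE','TOT_EMP','HR_MEAN $','HR_MEDIAN $'])
    return df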

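Both tables have two rows per state, so merging on `State` alone pairs every wage row with every membership row. Each state ends up with four rows, and every original row appears twice. `drop_duplicates` does not help because the paired rows are not identical, so we will split the merged dataframe into one slice per sector and rejoin the slices.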
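To make the row duplication concrete, here is a minimal sketch using only the Alabama rows from the 2010 previews above. The toy frames `wages` and `membership` are stand-ins for illustration, trimmed to a few columns, and are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for one state's rows (values taken from the 2010 previews)
wages = pd.DataFrame({
    'State': ['Alabama', 'Alabama'],
    'OCC_TITLE': ['All', 'Construction/Extraction'],
    'ANN_MEAN $': [38590, 35010],
})
membership = pd.DataFrame({
    'State': ['Alabama', 'Alabama'],
    'Sector': ['Total', 'Construction'],
    '%Mem': [10.1, 4.8],
})

# Merging on 'State' alone pairs every wage row with every membership row
merged = wages.merge(membership, on='State')
print(merged[['OCC_TITLE', 'Sector', 'ANN_MEAN $', '%Mem']])

# The four rows all differ, so drop_duplicates() removes nothing
print(len(merged), len(merged.drop_duplicates()))
```

Only two of the four pairings are wanted: all-worker wages with total membership, and construction wages with construction membership.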

In [5]:
def join_yr(df):
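    # breaks df into slices by employment sector, taking every other row
    # 'Total' rows alternate all-worker wages, then construction wages: keep the all-worker rows
    sl1 = df[(df['Sector']=='Total')].iloc[::2]
    # 'Construction' rows alternate the same way: keep the construction-wage rows
    # (this relies on each input file listing the all-worker row before the construction row for every state)
    sl2 = df[(df['Sector']=='Construction')].iloc[1::2]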
    # merging slices back into one df by state
    df = sl1.merge(sl2,on='State')
    # giving columns appropriate names
    df = df.rename(columns={'ANN_MEAN $_x':'Total_Avg_$',
                      'ANN_MEDIAN $_x': 'Total_Med_$',
                      'Employment_x':'Total_Employment',
                      '%Mem_x': 'Total_Mem_%',
                      'ANN_MEAN $_y':'Construction_Avg_$',
                      'ANN_MEDIAN $_y': 'Construction_Med_$',
                      'Employment_y':'Construction_Employment',
                      '%Mem_y': 'Construction_Mem_%'}).drop(columns=['Sector_x', 'Sector_y'])
    return df
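The positional slicing above depends on row order. As a possible alternative, and not the approach this notebook uses, the rows could be matched by label instead. The sketch below assumes the only `OCC_TITLE` values are `All` and `Construction/Extraction`, as in the 2010 preview. The helper `combine_by_sector` and the mapping `OCC_TO_SECTOR` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Assumed mapping from the wage file's OCC_TITLE labels to the membership file's Sector labels
OCC_TO_SECTOR = {'All': 'Total', 'Construction/Extraction': 'Construction'}

def combine_by_sector(wages, membership):
    """Hypothetical alternative to merge_yr + join_yr that matches rows by label."""
    wages = wages.assign(Sector=wages['OCC_TITLE'].map(OCC_TO_SECTOR))
    keep = ['State', 'Sector', 'ANN_MEAN $', 'ANN_MEDIAN $']
    # validate raises an error if a (State, Sector) pair is repeated on either side
    long = wages[keep].merge(membership, on=['State', 'Sector'], validate='one_to_one')
    long = long.rename(columns={'ANN_MEAN $': 'Avg_$', 'ANN_MEDIAN $': 'Med_$', '%Mem': 'Mem_%'})
    # one row per state, one column per (measure, sector) pair
    wide = long.pivot(index='State', columns='Sector',
                      values=['Avg_$', 'Med_$', 'Employment', 'Mem_%'])
    wide.columns = [f'{sector}_{measure}' for measure, sector in wide.columns]
    return wide.reset_index()

# Toy input built from the Alabama and Alaska figures shown in this notebook
wages = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'OCC_TITLE': ['All', 'Construction/Extraction'] * 2,
    'ANN_MEAN $': [38590, 35010, 50350, 59980],
    'ANN_MEDIAN $': [29570, 31750, 41640, 59660],
})
membership = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'Sector': ['Total', 'Construction'] * 2,
    'Employment': [1808807, 100593, 295063, 17440],
    '%Mem': [10.1, 4.8, 22.9, 21.4],
})
result = combine_by_sector(wages, membership)
print(result)
```

The column names match those produced by `join_yr`, although the column order differs.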

In [6]:
mm2010 = merge_yr(med2010, mem2010)

In [7]:
mm2010 = join_yr(mm2010)
mm2010.head()

Unnamed: 0,State,Total_Avg_$,Total_Med_$,Total_Employment,Total_Mem_%,Construction_Avg_$,Construction_Med_$,Construction_Employment,Construction_Mem_%
0,Alabama,38590,29570,1808807,10.1,35010,31750,100593,4.8
1,Alaska,50350,41640,295063,22.9,59980,59660,17440,21.4
2,Arizona,42390,33040,2506723,6.4,39060,36380,163760,5.3
3,Arkansas,35460,27860,1081711,4.0,34270,31820,47617,6.2
4,California,50730,37870,13891632,17.5,51880,48980,707158,15.7


The functions work as intended on the 2010 data. We will now apply them to the remaining years.
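Before doing so, a few automated checks could replace reading the preview by eye. The sketch below is illustrative rather than exhaustive. The helper `check_year_table` is a hypothetical name, and the rule that construction employment should not exceed total employment is an assumption about the data, not something stated in the source files.

```python
import pandas as pd

def check_year_table(df):
    """Return a list of problems found in a table shaped like join_yr's output."""
    problems = []
    if df['State'].duplicated().any():
        problems.append('duplicate states')
    if df.isna().any().any():
        problems.append('missing values')
    # assumption: a single sector cannot employ more people than the state total
    if (df['Construction_Employment'] > df['Total_Employment']).any():
        problems.append('construction employment exceeds total employment')
    return problems

# Toy table using the Alabama and Alaska rows from the 2010 preview
sample = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Total_Employment': [1808807, 295063],
    'Construction_Employment': [100593, 17440],
})
print(check_year_table(sample))
```

With the real data, the call would be `check_year_table(mm2010)`. An empty list means none of the listed problems were found.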

In [8]:
# reading in the remaining years of data (2011-2017)
med2011 = pd.read_csv('../data/cleaned_data/wages/med_mean_2011.csv')
mem2011 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2011.csv', index_col=0).drop(columns=['Members'])
med2012 = pd.read_csv('../data/cleaned_data/wages/med_mean_2012.csv')
mem2012 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2012.csv', index_col=0).drop(columns=['Members'])
med2013 = pd.read_csv('../data/cleaned_data/wages/med_mean_2013.csv')
mem2013 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2013.csv', index_col=0).drop(columns=['Members'])
med2014 = pd.read_csv('../data/cleaned_data/wages/med_mean_2014.csv')
mem2014 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2014.csv', index_col=0).drop(columns=['Members'])
med2015 = pd.read_csv('../data/cleaned_data/wages/med_mean_2015.csv')
mem2015 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2015.csv', index_col=0).drop(columns=['Members'])
med2016 = pd.read_csv('../data/cleaned_data/wages/med_mean_2016.csv')
mem2016 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2016.csv', index_col=0).drop(columns=['Members'])
med2017 = pd.read_csv('../data/cleaned_data/wages/med_mean_2017.csv')
mem2017 = pd.read_csv('../data/cleaned_data/membership/State_UMem_2017.csv', index_col=0).drop(columns=['Members'])

In [9]:
# merging tables
mm2011 = merge_yr(med2011,mem2011)
mm2012 = merge_yr(med2012,mem2012)
mm2013 = merge_yr(med2013,mem2013)
mm2014 = merge_yr(med2014,mem2014)
mm2015 = merge_yr(med2015,mem2015)
mm2016 = merge_yr(med2016,mem2016)
mm2017 = merge_yr(med2017,mem2017)

In [10]:
# performing split and rejoin
mm2011 = join_yr(mm2011)
mm2012 = join_yr(mm2012)
mm2013 = join_yr(mm2013)
mm2014 = join_yr(mm2014)
mm2015 = join_yr(mm2015)
mm2016 = join_yr(mm2016)
mm2017 = join_yr(mm2017)
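The read, merge, and rejoin steps above repeat the same three calls for every year, so they could also be written as a loop. The sketch below is one way to do that, assuming the file naming pattern used in the cells above holds for every year. The helper `build_tables` is a hypothetical name, and it takes the merge and rejoin functions as arguments so that the block stands on its own.

```python
from pathlib import Path
import pandas as pd

def build_tables(years, merge_fn, join_fn, data_dir='../data/cleaned_data'):
    """Read, merge, and reshape each year's files; returns a dict of {year: DataFrame}."""
    data_dir = Path(data_dir)
    tables = {}
    for year in years:
        # same read options as the individual cells above
        med = pd.read_csv(data_dir / 'wages' / f'med_mean_{year}.csv')
        mem = pd.read_csv(data_dir / 'membership' / f'State_UMem_{year}.csv',
                          index_col=0).drop(columns=['Members'])
        tables[year] = join_fn(merge_fn(med, mem))
    return tables
```

In this notebook the call would be `tables = build_tables(range(2011, 2018), merge_yr, join_yr)`, after which `tables[2011]` plays the role of `mm2011`.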

In [11]:
# saving each year's combined table to its own folder
mm2010.to_csv('../data/cleaned_data/model_data/2010/mm2010.csv')
mm2011.to_csv('../data/cleaned_data/model_data/2011/mm2011.csv')
mm2012.to_csv('../data/cleaned_data/model_data/2012/mm2012.csv')
mm2013.to_csv('../data/cleaned_data/model_data/2013/mm2013.csv')
mm2014.to_csv('../data/cleaned_data/model_data/2014/mm2014.csv')
mm2015.to_csv('../data/cleaned_data/model_data/2015/mm2015.csv')
mm2016.to_csv('../data/cleaned_data/model_data/2016/mm2016.csv')
mm2017.to_csv('../data/cleaned_data/model_data/2017/mm2017.csv')