# Notebook for Creating High Wage Outputs

#### This notebook is currently written to create high wage outputs for 2019.

In [2]:
import pandas as pd
import numpy as np
import string
import warnings
import os
import re
from jqi_functions import *
warnings.filterwarnings('ignore')

## Creating IPUMS dataframe

#### IPUMS Data
`cleaned_ipums` returns a cleaned pandas dataframe built from IPUMS data, filtered to California and the requested year. Pass the year as a string.

In [3]:
ca_ipums = cleaned_ipums('2019')
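
The body of `cleaned_ipums` lives in `jqi_functions` and is not shown in this notebook. As a rough illustration only, the sketch below shows the kind of state-and-year filter described above. The function name `filter_ipums_sketch` is hypothetical, the toy rows are made up, and it assumes the extract carries the standard IPUMS `STATEFIP` and `YEAR` columns.

```python
import pandas as pd

CALIFORNIA_FIPS = 6  # state FIPS code for California


def filter_ipums_sketch(raw: pd.DataFrame, year: str) -> pd.DataFrame:
    """Hypothetical stand-in for cleaned_ipums: keep California rows for one year."""
    # The year arrives as a string (as in cleaned_ipums('2019')), so cast it for comparison.
    mask = (raw["STATEFIP"] == CALIFORNIA_FIPS) & (raw["YEAR"] == int(year))
    return raw.loc[mask].reset_index(drop=True)


# Toy rows for illustration; not real IPUMS records.
raw = pd.DataFrame({
    "YEAR": [2019, 2019, 2018, 2019],
    "STATEFIP": [6, 48, 6, 6],
    "INCWAGE": [23100, 51000, 40000, 28000],
})
ca_2019 = filter_ipums_sketch(raw, "2019")
```

The real function likely does more cleaning than this; the sketch covers only the filtering step.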

In [4]:
county_info = pd.read_csv('data/county_to_regions_key.csv')

The cost of living file must be updated each year; the file loaded below is for 2019.

In [5]:
cost_of_living = pd.read_csv('data/cost_of_living/united-way-col-1A1PS1C2019.csv')

In [6]:
ca_ipums = pd.merge(ca_ipums, county_info, on = 'COUNTYFIP')

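Trim `ca_ipums` to the columns needed after the `county_info` merge, then merge in `cost_of_living` at each geographic level (region, rural/urban, county, and regional rural/urban).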

In [8]:
ca_ipums = ca_ipums[['INDNAICS', 'PERWT', 'INCWAGE',
       'NAICS Code', 'Industry Title_x', 'Main_Code', 'Sub_1_Code', 'Sub_2_Code', 'Sub_3_Code', 'Sub_4_Code', 'County', 'Rural/Urban', 'CDI Regions']]

In [9]:
ca_ipums = pd.merge(ca_ipums, cost_of_living, left_on = 'CDI Regions', right_on = 'Regions')
ca_ipums = ca_ipums.rename(columns = {'Cost of Living':'Regional COL'})

In [10]:
ca_ipums = pd.merge(ca_ipums, cost_of_living, left_on = 'Rural/Urban', right_on = 'Regions')
ca_ipums = ca_ipums.rename(columns = {'Cost of Living':'Rural/Urban COL'})

In [11]:
ca_ipums = pd.merge(ca_ipums, cost_of_living, left_on = 'County', right_on = 'Regions')
ca_ipums = ca_ipums.rename(columns = {'Cost of Living':'County COL'})

In [12]:
ca_ipums = ca_ipums[['INDNAICS', 'PERWT', 'INCWAGE',
       'NAICS Code', 'Industry Title_x', 'Main_Code', 'Sub_1_Code', 'Sub_2_Code', 'Sub_3_Code', 'Sub_4_Code', 'County', 'Rural/Urban', 'CDI Regions',
                    'Regional COL', 'Rural/Urban COL', 'County COL']]

In [13]:
ca_ipums['Regional Rural/Urban'] = ca_ipums['CDI Regions'] + ' ' + ca_ipums['Rural/Urban']

In [14]:
ca_ipums = pd.merge(ca_ipums, cost_of_living, left_on = 'Regional Rural/Urban', right_on = 'Regions')
ca_ipums = ca_ipums.rename(columns = {'Cost of Living':'Regional Rural/Urban COL'})

In [15]:
ca_ipums = ca_ipums.rename(columns = {'Industry Title_x':'Industry Title'})
ca_ipums = ca_ipums[['INDNAICS', 'PERWT', 'INCWAGE',
       'NAICS Code', 'Industry Title', 'Main_Code', 'Sub_1_Code', 'Sub_2_Code', 'Sub_3_Code',
       'Sub_4_Code', 'County', 'Rural/Urban', 'CDI Regions', 'Regional Rural/Urban',
                    'Regional COL', 'Rural/Urban COL', 'County COL', 'Regional Rural/Urban COL']]
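
The cells above repeat one pattern four times: merge `cost_of_living` on a geography column, then rename `Cost of Living` to a level-specific name. A minimal sketch of that pattern as a helper follows. The name `attach_col` is hypothetical, and the use of a left join with `validate='many_to_one'` is an assumption for illustration (the cells above use the default inner join). The values are taken from the `county_info` preview in this notebook.

```python
import pandas as pd


def attach_col(df, col_table, key, new_name):
    """Merge a cost-of-living lookup onto df by `key`, naming the value column `new_name`."""
    # Rename the lookup columns up front so no leftover 'Regions' column needs dropping.
    lookup = col_table.rename(columns={"Regions": key, "Cost of Living": new_name})
    # validate='many_to_one' raises if the lookup has duplicate keys,
    # which would otherwise silently multiply rows.
    return df.merge(lookup, on=key, how="left", validate="many_to_one")


people = pd.DataFrame({
    "County": ["Alameda", "Solano", "Alameda"],
    "CDI Regions": ["Bay Area", "Bay Area", "Bay Area"],
})
col_table = pd.DataFrame({
    "Regions": ["Alameda", "Solano", "Bay Area"],
    "Cost of Living": [88296, 66751, 93392],
})

people = attach_col(people, col_table, "County", "County COL")
people = attach_col(people, col_table, "CDI Regions", "Regional COL")
```

With a helper like this, the intermediate column-selection cells would not be needed, since no extra `Regions` columns accumulate.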

In [16]:
ca_ipums['Industry Title'] = normalize_titles(ca_ipums['Industry Title'])

View of final `ca_ipums` dataframe.

In [17]:
ca_ipums.head()

| | INDNAICS | PERWT | INCWAGE | NAICS Code | Industry Title | Main_Code | Sub_1_Code | Sub_2_Code | Sub_3_Code | Sub_4_Code | County | Rural/Urban | CDI Regions | Regional Rural/Urban | Regional COL | Rural/Urban COL | County COL | Regional Rural/Urban COL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4853 | 21.0 | 23100 | 4853 | taxi and limousine service | 400 | 430 | 480 | 483 | 483 | Los Angeles | Urban | Los Angeles | Los Angeles Urban | 80216 | 79472 | 80216 | 80216 |
| 1 | 4853 | 21.0 | 23100 | 4853 | taxi and limousine service | 400 | 430 | 480 | 483 | 483 | Los Angeles | Urban | Los Angeles | Los Angeles Urban | 80216 | 79472 | 80216 | 80216 |
| 2 | 4853 | 21.0 | 23100 | 4853 | taxi and limousine service | 400 | 430 | 480 | 483 | 483 | Los Angeles | Urban | Los Angeles | Los Angeles Urban | 80216 | 79472 | 80216 | 80216 |
| 3 | 4853 | 21.0 | 23100 | 4853 | taxi and limousine service | 400 | 430 | 480 | 483 | 483 | Los Angeles | Urban | Los Angeles | Los Angeles Urban | 80216 | 79472 | 80216 | 80216 |
| 4 | 4853 | 11.0 | 28000 | 4853 | taxi and limousine service | 400 | 430 | 480 | 483 | 483 | Los Angeles | Urban | Los Angeles | Los Angeles Urban | 80216 | 79472 | 80216 | 80216 |


## Create county lookup dataframe

Expanding the `county_info` dataframe to include cost of living metrics. This dataframe is used when industry information in a geographic area is too sparse and the next largest geographic area needs to be used instead.

In [18]:
county_info = county_info[['County', 'Rural/Urban', 'CDI Regions']]

In [19]:
county_info['Regional Rural/Urban'] = county_info['CDI Regions'] + ' ' + county_info['Rural/Urban']

In [20]:
county_info = pd.merge(county_info, cost_of_living, left_on = 'County', right_on = 'Regions')

In [21]:
county_info = county_info.rename(columns = {'Cost of Living':'County COL'})
county_info = county_info.drop(columns=['Regions'])

In [22]:
county_info = pd.merge(county_info, cost_of_living, left_on = 'Regional Rural/Urban', right_on = 'Regions')

In [23]:
county_info = county_info.rename(columns = {'Cost of Living':'Regional Rural/Urban COL'})
county_info = county_info.drop(columns=['Regions'])

In [24]:
county_info = pd.merge(county_info, cost_of_living, left_on = 'CDI Regions', right_on = 'Regions')

In [25]:
county_info = county_info.rename(columns = {'Cost of Living':'Regional COL'})
county_info = county_info.drop(columns=['Regions'])

In [26]:
county_info = pd.merge(county_info, cost_of_living, left_on = 'Rural/Urban', right_on = 'Regions')

In [27]:
county_info = county_info.rename(columns = {'Cost of Living':'Rural/Urban COL'})
county_info = county_info.drop(columns=['Regions'])

In [28]:
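# Row 11, column 1 of cost_of_living is assumed to hold the statewide value; re-check the position when the file is updated.
county_info['State COL'] = cost_of_living.iloc[11, 1]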

View of final `county_info` dataframe.

In [29]:
county_info.head()

| | County | Rural/Urban | CDI Regions | Regional Rural/Urban | County COL | Regional Rural/Urban COL | Regional COL | Rural/Urban COL | State COL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Alameda | Urban | Bay Area | Bay Area Urban | 88296 | 94329 | 93392 | 79472 | 74448 |
| 1 | Contra Costa | Urban | Bay Area | Bay Area Urban | 86284 | 94329 | 93392 | 79472 | 74448 |
| 2 | Solano | Urban | Bay Area | Bay Area Urban | 66751 | 94329 | 93392 | 79472 | 74448 |
| 3 | San Mateo | Urban | Bay Area | Bay Area Urban | 112606 | 94329 | 93392 | 79472 | 74448 |
| 4 | Santa Clara | Urban | Bay Area | Bay Area Urban | 107879 | 94329 | 93392 | 79472 | 74448 |
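
The fallback this lookup supports (moving to the next-largest geography when a county's industry sample is too sparse) can be sketched as below. This is an illustration under assumptions, not the logic in `jqi_functions`: the level ordering follows the geographic levels listed later in this notebook, the minimum of 30 records is assumed from the last argument passed to `edd_to_hw`, and `pick_level` is a hypothetical name.

```python
# Smallest to largest geography; the order is an assumption.
LEVELS = ["County", "Regional Rural/Urban", "CDI Regions", "Rural/Urban", "California"]


def pick_level(sample_counts, min_records=30):
    """Return the smallest geography whose sample has at least min_records records."""
    for level in LEVELS:
        if sample_counts.get(level, 0) >= min_records:
            return level
    # Nothing is large enough: fall back to the statewide level.
    return LEVELS[-1]


# A county with only 12 records for an industry falls back one level.
chosen = pick_level({"County": 12, "Regional Rural/Urban": 45, "CDI Regions": 300})
```

Once a level is chosen, the matching column in `county_info` (for example `Regional Rural/Urban COL`) supplies the cost of living threshold for that geography.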


## Create EDD Dataframe

#### EDD Data
The EDD file is year-specific; update the filename below when changing years.

In [30]:
edd = pd.read_csv('data/edd_2019_parsed.csv')

In [31]:
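# Strip the literal ' County' suffix; regex=False keeps this a plain string replacement across pandas versions.
edd['Area Name'] = edd['Area Name'].str.replace(' County', '', regex=False)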
edd = edd.loc[edd['Area Type'] == 'County']
edd = edd.drop(columns=['Industry Title'])
edd = edd.rename(columns={"LMID Industry Title": "Industry Title"})

In [32]:
edd['Sub_1_Code'] = edd['Sub_1_Code'].astype(str)
edd['Main_Code'] = edd['Main_Code'].astype(str)
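
The codes are cast to strings because later lookups pass codes as text (the main loop calls `edd_to_hw` with `str(code)`). A small sketch of why the cast matters, using a toy series rather than the EDD data:

```python
import pandas as pd

raw_codes = pd.Series([900, 940, 930])  # codes as read from a CSV (integers)
as_text = raw_codes.astype(str)         # the same codes cast to text

lookup_key = "900"
found_before = lookup_key in set(raw_codes)  # integer codes never equal the text key
found_after = lookup_key in set(as_text)     # text codes match
```

Casting both sides of a comparison to the same type up front avoids silent non-matches like `found_before`.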

View of final `edd` dataframe.

In [33]:
edd.head()

| | Industry Title | Parsed_Code | Area Type | Area Name | Date | Seasonally Adjusted | Current Employment | Main_EDD | Main_Code | Sub_1 | Sub_1_Code | Sub_2 | Sub_2_Code | Sub_3 | Sub_3_Code | Sub_4 | Sub_4_Code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | county | 939 | County | Madera | 3/1/19 | N | 1600 | government | 900 | state and local government | 940 | local government | 930 | local government excluding education | 932 | county | 939 |
| 1 | county | 939 | County | Fresno | 1/1/19 | N | 7800 | government | 900 | state and local government | 940 | local government | 930 | local government excluding education | 932 | county | 939 |
| 2 | county | 939 | County | Kern | 1/1/19 | N | 9900 | government | 900 | state and local government | 940 | local government | 930 | local government excluding education | 932 | county | 939 |
| 3 | county | 939 | County | Los Angeles | 1/1/19 | N | 106800 | government | 900 | state and local government | 940 | local government | 930 | local government excluding education | 932 | county | 939 |
| 4 | county | 939 | County | Madera | 1/1/19 | N | 1600 | government | 900 | state and local government | 940 | local government | 930 | local government excluding education | 932 | county | 939 |


## Load NAICS Crosswalk

In [34]:
naics = pd.read_csv('data/naics_parsed_crosswalk.csv').drop_duplicates(subset='INDNAICS').reset_index(drop=True)

In [35]:
naics['Industry Title'] = normalize_titles(naics['Industry Title'])
naics['Sub_1_Code'] = naics['Sub_1_Code'].astype(str)
naics['Main_Code'] = naics['Main_Code'].astype(str)

View of final `naics` dataframe.

In [36]:
naics.head()

| | Industry Title | INDNAICS | Main_Code | Sub_1_Code | Sub_2_Code | Sub_3_Code | Sub_4_Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | crop production | 111 | 111 | 111 | 111 | 111 | 111 |
| 1 | support activities for agriculture and forestry | 115 | 111 | 111 | 111 | 111 | 111 |
| 2 | animal production and aquaculture | 112 | 111 | 111 | 111 | 111 | 111 |
| 3 | fishing hunting and trapping | 114 | 111 | 111 | 111 | 111 | 111 |
| 4 | forestry except logging | 113m | 100 | 113 | 113 | 113 | 113 |


## Add High Wage Features

`add_geo_high_wages` adds the following engineered features:
- Above threshold (number of records above the respective cost of living threshold)
- Weighted above threshold (above threshold multiplied by the person weight variable)
- Unweighted industry counts (number of records in that industry)
- Weighted industry counts (sum of person weight values in that industry)
- Weighted high wage percentage (weighted above threshold divided by weighted industry counts, as a percentage)

The features are created for the following geographical levels:
- County
- Regional Rural/Urban
- Region
- Rural/Urban
- California
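
To make the feature definitions above concrete, here is a minimal sketch for a single geographic level. It is not the implementation of `add_geo_high_wages`: the function name, the output column names, and the strict `>` comparison against the threshold are all assumptions, and the three toy rows are for illustration only.

```python
import pandas as pd


def add_high_wage_features(df, geo_col, col_col):
    """Hypothetical sketch: high wage features for one geography column and its COL column."""
    out = df.copy()
    # 1 if the record's wage exceeds the cost of living threshold, else 0.
    out["above"] = (out["INCWAGE"] > out[col_col]).astype(int)
    # Indicator scaled by the person weight.
    out["wt_above"] = out["above"] * out["PERWT"]
    grouped = out.groupby([geo_col, "INDNAICS"])
    # Unweighted and weighted record counts per geography and industry.
    out["count"] = grouped["PERWT"].transform("count")
    out["wt_count"] = grouped["PERWT"].transform("sum")
    # Share of weighted records above the threshold, as a percentage.
    out["wt_hw_pct"] = 100 * grouped["wt_above"].transform("sum") / out["wt_count"]
    return out


toy = pd.DataFrame({
    "INDNAICS": ["4853", "4853", "4853"],
    "County": ["Los Angeles"] * 3,
    "PERWT": [21.0, 11.0, 10.0],
    "INCWAGE": [23100, 28000, 90000],
    "County COL": [80216] * 3,
})
result = add_high_wage_features(toy, "County", "County COL")
```

In the toy data only the third record clears the threshold, so the weighted percentage is 10 / 42 of the industry's weighted count. Repeating the call with each geography column and its matching COL column would cover the levels listed above.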

In [37]:
ca_ipums_hw = add_geo_high_wages(ca_ipums)

In [38]:
ca_ipums_hw['Sub_1_Code'] = [str(x) for x in ca_ipums_hw['Sub_1_Code']]
ca_ipums_hw['Main_Code'] = [str(x) for x in ca_ipums_hw['Main_Code']]

View of final `ca_ipums_hw` dataframe.

In [39]:
ca_ipums_hw.head().T

| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| INDNAICS | 4853 | 4853 | 4853 | 4853 | 4853 |
| PERWT | 21.0 | 21.0 | 21.0 | 21.0 | 11.0 |
| INCWAGE | 23100 | 23100 | 23100 | 23100 | 28000 |
| NAICS Code | 4853 | 4853 | 4853 | 4853 | 4853 |
| Industry Title | taxi and limousine service | taxi and limousine service | taxi and limousine service | taxi and limousine service | taxi and limousine service |
| Main_Code | 400 | 400 | 400 | 400 | 400 |
| Sub_1_Code | 430 | 430 | 430 | 430 | 430 |
| Sub_2_Code | 480 | 480 | 480 | 480 | 480 |
| Sub_3_Code | 483 | 483 | 483 | 483 | 483 |
| Sub_4_Code | 483 | 483 | 483 | 483 | 483 |


## Create High Wage Outputs Dataframe

`edd_to_hw` returns the values needed to build the high wage output dataframe. This section calls it for every unique combination of county, industry code, and date, and collects the results.

Getting unique values for each county, industry (as a parsed code), and date.

In [40]:
counties_edd = edd['Area Name'].unique()

In [41]:
parsed_codes = set(list(edd['Main_Code'].unique()) + list(edd['Sub_1_Code'].unique()) + list(edd['Sub_2_Code'].unique()) + list(edd['Sub_3_Code'].unique()) + list(edd['Sub_4_Code'].unique()))

In [42]:
dates_edd = edd['Date'].unique()

Initializing empty lists for the function's outputs to later be joined in a dataframe.

In [43]:
industries = []
dates = []
counties = []
counts = []
emp_counts = []

In [44]:
total_iterations = len(counties_edd) * len(parsed_codes) * len(dates_edd)
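
The iteration count is the size of the Cartesian product of the three sets of unique values. A small sketch with toy values, using `itertools.product` to enumerate the same combinations the nested loop visits:

```python
from itertools import product

# Toy stand-ins for counties_edd, parsed_codes, and dates_edd.
counties = ["Madera", "Fresno"]
codes = ["900", "939"]
dates = ["1/1/19", "2/1/19", "3/1/19"]

# Every (county, code, date) triple, in the same order as three nested for loops.
combos = list(product(counties, codes, dates))
total = len(counties) * len(codes) * len(dates)
```

Here `len(combos)` equals `total`, which is why the product of the three lengths gives the loop's iteration count.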

The loop below populates the lists with the high wage outputs. It takes some time to run.

In [46]:
progress_count = 0
for county in counties_edd:
    for code in parsed_codes:
        for date in dates_edd:
            output, hw, industry, emp_count = edd_to_hw(edd, ca_ipums_hw, naics, county_info, county, str(code), date, 30)
            industries.append(industry)
            dates.append(date)
            counties.append(county)
            counts.append(hw)
            emp_counts.append(emp_count)
            progress_count += 1
            # Report every 10% of the run instead of hard-coding the step size.
            if progress_count % max(1, total_iterations // 10) == 0:
                percent_done = int((progress_count / total_iterations) * 100)
                print(f'Progress: {percent_done}% Complete')

Progress: 10% Complete
Progress: 20% Complete
Progress: 30% Complete
Progress: 40% Complete
Progress: 50% Complete
Progress: 60% Complete
Progress: 70% Complete
Progress: 80% Complete
Progress: 90% Complete
Progress: 100% Complete
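
The ten progress lines come from checking the counter against a step of one tenth of the total. A minimal sketch of that arithmetic, using integer math so the percentages are exact; `progress_marks` is a hypothetical helper, and 104400 is the total implied by the output above (ten reports, 10440 iterations apart, ending at 100%).

```python
def progress_marks(total, steps=10):
    """Return (iteration, percent) pairs at which a progress line would print."""
    # Guard against a step of zero when total is smaller than steps.
    step = max(1, total // steps)
    return [(i, i * 100 // total) for i in range(step, total + 1, step)]


marks = progress_marks(104400)
percents = [pct for _, pct in marks]
```

Deriving the step from the total keeps the reports at 10% intervals when the number of counties, codes, or dates changes.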


Creating a cleaned dataframe from the output lists.

In [48]:
df_dict = {'Industry':industries, 'Date':dates, 'County':counties, 'High Wage Count':counts, 'Employment Count':emp_counts}
hw_output = pd.DataFrame(df_dict)
hw_output = hw_output[hw_output['Industry'].notna()].copy()
hw_output['Date'] = pd.to_datetime(hw_output['Date'])
hw_output['High Wage Count'] = hw_output['High Wage Count'].astype(int)
hw_output = hw_output.sort_values(by=['Industry', 'County', 'Date'])
hw_output = pd.merge(hw_output, cost_of_living, left_on='County', right_on='Regions')
hw_output = hw_output[['Industry', 'Date', 'County', 'High Wage Count', 'Employment Count', 'Cost of Living']]
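
If a county name appears more than once in the `Regions` column of `cost_of_living`, a merge like the one above repeats every matching row. The sketch below shows one way to check for that with the `validate` argument of `merge`. The toy frames are assumptions for illustration and do not reflect the actual contents of the cost of living file.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy frames; the lookup deliberately lists 'Los Angeles' twice.
left = pd.DataFrame({"County": ["Los Angeles", "Alameda"], "High Wage Count": [8490, 100]})
lookup = pd.DataFrame({
    "Regions": ["Los Angeles", "Los Angeles", "Alameda"],
    "Cost of Living": [80216, 80216, 88296],
})

try:
    # many_to_one requires unique keys on the right side and raises otherwise.
    left.merge(lookup, left_on="County", right_on="Regions", validate="many_to_one")
    duplicated_keys = False
except MergeError:
    duplicated_keys = True

# Dropping duplicate keys first keeps the merge at one output row per input row.
clean_lookup = lookup.drop_duplicates(subset="Regions")
merged = left.merge(clean_lookup, left_on="County", right_on="Regions", validate="many_to_one")
```

Comparing `len(merged)` with `len(left)` after any lookup merge is a quick sanity check that no rows were multiplied.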

View of final `hw_output` dataframe.

In [49]:
hw_output.head()

| | Industry | Date | County | High Wage Count | Employment Count | Cost of Living |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | accounting tax preparation bookkeeping and pay... | 2019-01-01 | Los Angeles | 8490 | 43000.0 | 80216 |
| 1 | accounting tax preparation bookkeeping and pay... | 2019-01-01 | Los Angeles | 8490 | 43000.0 | 80216 |
| 2 | accounting tax preparation bookkeeping and pay... | 2019-01-01 | Los Angeles | 8490 | 43000.0 | 80216 |
| 3 | accounting tax preparation bookkeeping and pay... | 2019-01-01 | Los Angeles | 8490 | 43000.0 | 80216 |
| 4 | accounting tax preparation bookkeeping and pay... | 2019-02-01 | Los Angeles | 9359 | 47400.0 | 80216 |


The cell below exports the dataframe as a CSV file. Change the file path if needed and uncomment the line to run it.

In [50]:
# hw_output.to_csv('hw_output.csv', encoding='utf-8', index=False)