# Notebook for Creating High Wage Outputs

#### This notebook is currently configured to create high wage outputs for EDD year 2022, using 2019 IPUMS and cost of living data.

In [1]:
import pandas as pd
import numpy as np
import string
import warnings
import os
import re
from jqi_functions import *
from tqdm.notebook import tqdm # for progress bar
warnings.filterwarnings('ignore')

## Ensure that IPUMS data is in the proper file location

The IPUMS data for the desired year should be stored in `data/ipums` and named `IPUMS_{YEAR}.csv`, where `{YEAR}` matches the IPUMS year entered below.

A full example for the file path would be `data/ipums/IPUMS_2020.csv`.
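
The naming convention above can be checked before running the rest of the notebook. The following is a minimal sketch; `ipums_path` is a hypothetical helper and is not part of `jqi_functions`.

```python
from pathlib import Path


def ipums_path(year, data_dir="data/ipums"):
    """Build the expected IPUMS file path for a given year (hypothetical helper)."""
    return Path(data_dir) / f"IPUMS_{year}.csv"


path = ipums_path("2019")
if not path.exists():
    # Warn early instead of failing later inside the data-cleaning step.
    print(f"Missing IPUMS file: {path}")
```

Running this check first gives a clearer message than a file-not-found error raised from inside a helper function.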

## Set the desired EDD year, IPUMS year, and corresponding cost of living year.

In [2]:
edd_year = '2022'
ipums_year = '2019'
col_year = '2019'

## Creating IPUMS dataframe

#### IPUMS Data
`cleaned_ipums` is a function to generate a cleaned pandas dataframe using IPUMS data, filtering it down to California only and the desired year. The year needs to be entered in string format as a parameter.

In [3]:
ca_ipums = cleaned_ipums(ipums_year)
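
The internals of `cleaned_ipums` are not shown in this notebook. As a rough illustration of the filtering it describes, the sketch below keeps California rows for one year. It assumes the raw extract carries the IPUMS `STATEFIP` variable (California is FIPS code 6); `filter_ca_year` and the toy data are hypothetical.

```python
import pandas as pd

CALIFORNIA_FIPS = 6  # state FIPS code for California


def filter_ca_year(raw, year):
    """Keep California rows for the requested year (year passed as a string)."""
    mask = (raw["STATEFIP"] == CALIFORNIA_FIPS) & (raw["YEAR"] == int(year))
    return raw.loc[mask].reset_index(drop=True)


# Toy extract; the values are illustrative only.
raw = pd.DataFrame({
    "YEAR": [2019, 2019, 2018, 2019],
    "STATEFIP": [6, 48, 6, 6],
    "COUNTYFIP": [37, 201, 37, 1],
    "INCWAGE": [23100, 40000, 28000, 52000],
})
ca_2019 = filter_ca_year(raw, "2019")
print(ca_2019)
```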

#### Cost of living needs to be updated each year.

In [4]:
cost_of_living = pd.read_csv(f'data/cost_of_living/united-way-col-1A1PS1C{col_year}.csv')

### Create county lookup dataframe

This step expands the `county_info` dataframe to include cost of living metrics. The dataframe is used when industry information in a geographic area is too sparse and the next largest geographic area needs to be used instead.

In [5]:
county_info = pd.read_csv('data/county_to_regions_key.csv')

In [6]:
county_info = county_info[['County', 'COUNTYFIP', 'Rural/Urban', 'CERF Regions', 'Population']]

In [7]:
county_info = pd.merge(county_info, cost_of_living, left_on = 'CERF Regions', right_on = 'Regions')

In [8]:
county_info = county_info.rename(columns = {'Cost of Living':'Regional COL'})
county_info = county_info.drop(columns=['Regions'])

In [9]:
# Statewide cost of living, read by position (row 13, column 1 of the cost of living file)
county_info['State COL'] = cost_of_living.iloc[13, 1]

View of final `county_info` dataframe.

In [10]:
county_info.head()

Unnamed: 0,County,COUNTYFIP,Rural/Urban,CERF Regions,Population,Regional COL,State COL
0,Alameda,1,Urban,Bay Area,1656754,97249,77555
1,Contra Costa,13,Urban,Bay Area,1142251,97249,77555
2,Solano,95,Urban,Bay Area,441829,97249,77555
3,San Mateo,81,Urban,Bay Area,767423,97249,77555
4,Santa Clara,85,Urban,Bay Area,1927470,97249,77555
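
The merge steps above can be sketched on toy data. This version adds `validate="many_to_one"` so that a duplicated region in the cost of living file raises an error instead of silently duplicating counties. It also looks up the statewide value by label rather than by row position; the `California` row label is an assumption about the file layout and may not match the real CSV.

```python
import pandas as pd

# Toy inputs using values that appear in this notebook's outputs.
cost_of_living = pd.DataFrame({
    "Regions": ["Bay Area", "Kern", "California"],  # "California" label is assumed
    "Cost of Living": [97249, 54862, 77555],
})
county_info = pd.DataFrame({
    "County": ["Alameda", "Solano", "Kern"],
    "COUNTYFIP": [1, 95, 29],
    "CERF Regions": ["Bay Area", "Bay Area", "Kern"],
})

# Many counties map to one region row; validate guards against duplicate regions.
merged = pd.merge(
    county_info, cost_of_living,
    left_on="CERF Regions", right_on="Regions",
    how="left", validate="many_to_one",
)
merged = merged.rename(columns={"Cost of Living": "Regional COL"}).drop(columns=["Regions"])

# Label-based lookup of the statewide value instead of a hard-coded row number.
is_state = cost_of_living["Regions"] == "California"
merged["State COL"] = cost_of_living.loc[is_state, "Cost of Living"].iloc[0]
print(merged)
```

A label-based lookup keeps working if rows are reordered when the cost of living file is updated for a new year.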


In [11]:
ca_ipums = pd.merge(ca_ipums, county_info, on = 'COUNTYFIP')

View of final `ca_ipums` dataframe.

In [12]:
ca_ipums.head()

Unnamed: 0,YEAR,COUNTYFIP,INDNAICS,PERWT,INCWAGE,NAICS Code,Industry Title,Industry,Crosswalk Value,County,Rural/Urban,CERF Regions,Population,Regional COL,State COL
0,2019,37,4853,21.0,23100,4853,taxi and limousine service,taxi and limousine service,11,Los Angeles,Urban,Los Angeles,10081570,80216,77555
1,2019,37,4853,21.0,23100,4853,taxi and limousine service,taxi and limousine service,14,Los Angeles,Urban,Los Angeles,10081570,80216,77555
2,2019,37,4853,11.0,28000,4853,taxi and limousine service,taxi and limousine service,11,Los Angeles,Urban,Los Angeles,10081570,80216,77555
3,2019,37,4853,11.0,28000,4853,taxi and limousine service,taxi and limousine service,14,Los Angeles,Urban,Los Angeles,10081570,80216,77555
4,2019,37,4853,35.0,28000,4853,taxi and limousine service,taxi and limousine service,11,Los Angeles,Urban,Los Angeles,10081570,80216,77555


## Create EDD Dataframe

#### EDD Data
The year for EDD data must be specified.

These CSV files are filtered and cleaned versions of the raw EDD Current Employment Statistics dataset. Files for upcoming years can be created with the notebook `multiyear-edd-data-creation.ipynb`.

In [13]:
edd = pd.read_csv(f'data/edd/edd_{edd_year}.csv')

View of final `edd` dataframe.

In [14]:
edd.head()

Unnamed: 0,Area Type,Area Name,Year,Month,Date,Series Code,Seasonally Adjusted,Current Employment,Industry Title,COUNTYFIP,County,Rural/Urban,CERF Regions,Crosswalk Value
0,County,Napa,2022,January,01/01/2022,60560000,N,3800,administrative and support and waste services,55,Napa,Rural,Bay Area,25
1,County,Napa,2022,February,02/01/2022,60560000,N,3900,administrative and support and waste services,55,Napa,Rural,Bay Area,25
2,County,Napa,2022,March,03/01/2022,60560000,N,3900,administrative and support and waste services,55,Napa,Rural,Bay Area,25
3,County,Napa,2022,April,04/01/2022,60560000,N,3900,administrative and support and waste services,55,Napa,Rural,Bay Area,25
4,County,Solano,2022,January,01/01/2022,60560000,N,5200,administrative and support and waste services,95,Solano,Urban,Bay Area,25


## Add High Wage Features

`add_geo_high_wages` is a function that adds the following engineered features:
- Above Threshold (Number of records above respective cost of living threshold)
- Weighted Above Threshold (Above Threshold multiplied by the person weight variable)
- Unweighted industry counts (Number of records in that industry)
- Weighted industry counts (Sum of person weight values in that industry)
- Weighted high wage percentage (Weighted Above Threshold divided by Weighted Industry Counts as a percentage)

The features are created for the following geographical levels:
- Region
- California

In [15]:
ca_ipums_hw = add_geo_high_wages(ca_ipums)
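
To make the listed features concrete, here is a minimal sketch at the region level on toy data. It is not the implementation of `add_geo_high_wages`: it returns one row per region and industry rather than adding columns to each record, and it assumes a record counts as above threshold when `INCWAGE` is strictly greater than the regional cost of living.

```python
import pandas as pd

# Toy records; the industries, weights, and wages are illustrative only.
df = pd.DataFrame({
    "CERF Regions": ["Kern", "Kern", "Kern", "Kern"],
    "Industry": ["retail", "retail", "retail", "mining"],
    "PERWT": [10.0, 30.0, 60.0, 20.0],
    "INCWAGE": [60000, 40000, 70000, 90000],
    "Regional COL": [54862, 54862, 54862, 54862],
})

# 1 if the record's wage exceeds the regional threshold (strict > is an assumption).
df["Above Threshold"] = (df["INCWAGE"] > df["Regional COL"]).astype(int)
df["Weighted Above Threshold"] = df["Above Threshold"] * df["PERWT"]

features = (
    df.groupby(["CERF Regions", "Industry"])
      .agg(**{
          "Above Threshold": ("Above Threshold", "sum"),
          "Weighted Above Threshold": ("Weighted Above Threshold", "sum"),
          "Unweighted Industry Count": ("PERWT", "size"),
          "Weighted Industry Count": ("PERWT", "sum"),
      })
      .reset_index()
)
features["Weighted High Wage Percentage"] = (
    100 * features["Weighted Above Threshold"] / features["Weighted Industry Count"]
)
print(features)
```

For the toy retail rows, two of three records clear the threshold and they carry 70 of the 100 total person weights, so the weighted percentage is higher than the unweighted share.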

View of final `ca_ipums_hw` dataframe.

In [16]:
ca_ipums_hw.head().T

Unnamed: 0,0,1,2,3,4
YEAR,2019,2019,2019,2019,2019
COUNTYFIP,37,37,37,37,37
INDNAICS,4853,4853,4853,4853,4853
PERWT,21.0,11.0,35.0,14.0,19.0
INCWAGE,23100,28000,28000,28000,28000
NAICS Code,4853,4853,4853,4853,4853
Industry Title,taxi and limousine service,taxi and limousine service,taxi and limousine service,taxi and limousine service,taxi and limousine service
Industry,taxi and limousine service,taxi and limousine service,taxi and limousine service,taxi and limousine service,taxi and limousine service
Crosswalk Value,11,11,11,11,11
County,Los Angeles,Los Angeles,Los Angeles,Los Angeles,Los Angeles


## Create High Wage Outputs Dataframe

`edd_to_hw` is the function that outputs the values needed to create the high wage output dataframe. This portion of the notebook runs through every unique combination of region, industry, and date to get the respective output and add it to the dataframe.

Because EDD industries are nested, only a small selection of them can be used; otherwise individuals in nested industries would be counted twice. The selection differs by region, so the series code of each selected industry is documented in the `region_series_codes` global variable in the `jqi_functions.py` library. Each of these codes was then assigned its own crosswalk value, which aligns with the crosswalk value assigned to each IPUMS industry. When generating high wage outputs, we iterate only through the EDD industries that have been selected and have a designated crosswalk value.
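
The double-counting problem can be illustrated with a small sketch. The structure of `region_series_codes` is not shown in this notebook, so the dictionary below is a hypothetical stand-in, and the child series code `60561000` and its employment figure are invented for illustration.

```python
import pandas as pd

# Hypothetical mapping of region -> selected series codes.
# The real mapping is `region_series_codes` in jqi_functions.py.
selected_codes = {"Bay Area": [60560000]}

# Toy EDD rows: the second series is assumed to be nested inside the first,
# so its employment is already included in the parent's 3800.
edd_toy = pd.DataFrame({
    "CERF Regions": ["Bay Area", "Bay Area"],
    "Series Code": [60560000, 60561000],
    "Current Employment": [3800, 1500],
})


def select_region_rows(edd, region, codes_by_region):
    """Keep only the rows whose series code was selected for the region."""
    codes = codes_by_region.get(region, [])
    mask = (edd["CERF Regions"] == region) & edd["Series Code"].isin(codes)
    return edd.loc[mask]


naive_total = edd_toy["Current Employment"].sum()
selected = select_region_rows(edd_toy, "Bay Area", selected_codes)
selected_total = selected["Current Employment"].sum()
print(naive_total, selected_total)
```

Summing every row counts the nested workers twice, while restricting to the selected codes counts them once.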

Getting unique values for each region, industry, and date.

In [17]:
regions_ipums = ca_ipums_hw['CERF Regions'].unique()

In [18]:
crosswalk_vals = sorted(edd['Crosswalk Value'].unique())

In [19]:
dates_edd = edd['Date'].unique()

Initializing empty lists for the function's outputs to later be joined in a dataframe.

In [20]:
industries = []
dates = []
regions = []
counts = []
emp_counts = []

For loop to populate lists for the high wage outputs. This will take some time to finish running.

In [21]:
for region in tqdm(regions_ipums):
    for code in crosswalk_vals:
        for date in dates_edd:
            hw_count, hw_perc, employment_count, industry = edd_to_hw(edd, ca_ipums_hw, region, code, date, 10)
            industries.append(industry)
            dates.append(date)
            regions.append(region)
            counts.append(hw_count)
            emp_counts.append(employment_count)


Creating a cleaned dataframe from the output lists.

In [22]:
df_dict = {'Industry':industries, 'Date':dates, 'Region':regions, 'High Wage Count':counts, 'Employment Count':emp_counts}
hw_output = pd.DataFrame(df_dict)
hw_output = hw_output[hw_output['Industry'].notna()]
hw_output['Date']= pd.to_datetime(hw_output['Date'])
hw_output['High Wage Count'] = hw_output['High Wage Count'].astype(int)
hw_output = hw_output.sort_values(by=['Industry', 'Region', 'Date'])
hw_output = pd.merge(hw_output, cost_of_living, left_on='Region', right_on='Regions')
hw_output = hw_output[['Industry', 'Date', 'Region', 'High Wage Count', 'Employment Count', 'Cost of Living']]
hw_output = hw_output.drop_duplicates()
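
The nested loop and the list-to-dataframe step above can also be written with `itertools.product` and a single list of row dictionaries. The sketch below is an alternative layout, not the notebook's code: `fake_edd_to_hw` is a hypothetical stand-in for `edd_to_hw` with a simplified signature and an invented high wage share.

```python
from itertools import product

import pandas as pd


def fake_edd_to_hw(region, code, date):
    """Hypothetical stand-in for edd_to_hw: returns (hw_count, employment, industry)."""
    table = {
        # (region, crosswalk value): (industry, employment, assumed high wage share)
        ("Kern", 25): ("administrative and support and waste services", 12000.0, 0.29),
    }
    if (region, code) not in table:
        return None, None, None  # no data for this combination
    industry, employment, share = table[(region, code)]
    return round(employment * share), employment, industry


regions = ["Kern", "Bay Area"]
codes = [11, 25]
dates = ["01/01/2022", "02/01/2022"]

rows = []
for region, code, date in product(regions, codes, dates):
    hw_count, employment, industry = fake_edd_to_hw(region, code, date)
    rows.append({
        "Industry": industry,
        "Date": date,
        "Region": region,
        "High Wage Count": hw_count,
        "Employment Count": employment,
    })

hw_toy = pd.DataFrame(rows)
hw_toy = hw_toy[hw_toy["Industry"].notna()].copy()  # drop empty combinations
hw_toy["Date"] = pd.to_datetime(hw_toy["Date"], format="%m/%d/%Y")
hw_toy["High Wage Count"] = hw_toy["High Wage Count"].astype(int)
print(hw_toy)
```

Collecting one dictionary per row keeps the columns aligned by construction, which avoids the risk of the five parallel lists drifting out of step.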

Add Region Population to Dataframe

In [23]:
# Select the Population column before summing so non-numeric columns are not aggregated
reg_pop = county_info.groupby(by='CERF Regions')[['Population']].sum().reset_index()
reg_pop

Unnamed: 0,CERF Regions,Population
0,Bay Area,7710026
1,Central Coast,2342005
2,Central San Joaquin,1752543
3,Eastern Sierra,188734
4,Inland Empire,4560470
5,Kern,887641
6,Los Angeles,10081570
7,North State,713754
8,Northern San Joaquin,1557179
9,Orange,3168044


In [24]:
hw_output = pd.merge(hw_output, reg_pop, left_on='Region', right_on='CERF Regions')
hw_output = hw_output.drop(columns=['CERF Regions'])
hw_output = hw_output.rename(columns={"Population": "Region Population"})

View of final `hw_output` dataframe.

In [25]:
hw_output.head()

Unnamed: 0,Industry,Date,Region,High Wage Count,Employment Count,Cost of Living,Region Population
0,accommodation and food services,2022-01-01,Kern,2171,24200.0,54862,887641
1,accommodation and food services,2022-02-01,Kern,2198,24500.0,54862,887641
2,accommodation and food services,2022-03-01,Kern,2225,24800.0,54862,887641
3,accommodation and food services,2022-04-01,Kern,2225,24800.0,54862,887641
4,administrative and support and waste services,2022-01-01,Kern,3507,12000.0,54862,887641


Code to export the dataframe as a CSV file. Change the file path if needed.

In [26]:
hw_output.to_csv(f'data/outputs/hw_outputs_{edd_year}.csv', encoding='utf-8', index=False)

#### TESTING

In [27]:
# Combine both conditions in a single boolean mask instead of chaining .loc calls
bayarea = hw_output.loc[(hw_output['Date'] == f'{edd_year}-01-01') & (hw_output['Region'] == 'Bay Area')]

In [28]:
bayarea['High Wage Count'].sum()

146991

In [29]:
bayarea['Employment Count'].sum()

503700.0

In [30]:
california = hw_output.loc[hw_output['Date'] == f'{edd_year}-01-01']

In [31]:
california['High Wage Count'].sum()

2699078

In [32]:
california['Employment Count'].sum()

10589900.0
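
The totals above are easier to sanity-check as a share of employment. This sketch uses the four figures printed by the cells above; `high_wage_share` is a hypothetical helper.

```python
def high_wage_share(high_wage_count, employment_count):
    """Return high wage jobs as a percentage of employment."""
    return 100 * high_wage_count / employment_count


# Totals for January of the EDD year, copied from the outputs above.
bay_area_share = high_wage_share(146991, 503700.0)
california_share = high_wage_share(2699078, 10589900.0)

print(f"Bay Area:   {bay_area_share:.2f}%")
print(f"California: {california_share:.2f}%")
```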

https://data.bls.gov/timeseries/LASST060000000000005?amp%253bdata_tool=XGtable&output_view=data&include_graphs=true

## Code for concatenating multiple years (2010-2022)

In [37]:
# hw_2010 = pd.read_csv('data/outputs/hw_outputs_2010.csv')
# hw_2011 = pd.read_csv('data/outputs/hw_outputs_2011.csv')
# hw_2012 = pd.read_csv('data/outputs/hw_outputs_2012.csv')
# hw_2013 = pd.read_csv('data/outputs/hw_outputs_2013.csv')
# hw_2014 = pd.read_csv('data/outputs/hw_outputs_2014.csv')
# hw_2015 = pd.read_csv('data/outputs/hw_outputs_2015.csv')
# hw_2016 = pd.read_csv('data/outputs/hw_outputs_2016.csv')
# hw_2017 = pd.read_csv('data/outputs/hw_outputs_2017.csv')
# hw_2018 = pd.read_csv('data/outputs/hw_outputs_2018.csv')
# hw_2019 = pd.read_csv('data/outputs/hw_outputs_2019.csv')
# hw_2020 = pd.read_csv('data/outputs/hw_outputs_2020.csv')
# hw_2021 = pd.read_csv('data/outputs/hw_outputs_2021.csv')
# hw_2022 = pd.read_csv('data/outputs/hw_outputs_2022.csv')

In [38]:
# hw_2010['Year'] = 2010
# hw_2011['Year'] = 2011
# hw_2012['Year'] = 2012
# hw_2013['Year'] = 2013
# hw_2014['Year'] = 2014
# hw_2015['Year'] = 2015
# hw_2016['Year'] = 2016
# hw_2017['Year'] = 2017
# hw_2018['Year'] = 2018
# hw_2019['Year'] = 2019
# hw_2020['Year'] = 2020
# hw_2021['Year'] = 2021
# hw_2022['Year'] = 2022

In [39]:
# hw_output_concat = pd.concat([hw_2010, hw_2011, hw_2012, hw_2013, 
#                               hw_2014, hw_2015, hw_2016, hw_2017, 
#                               hw_2018, hw_2019, hw_2020, hw_2021, 
#                               hw_2022])

In [40]:
# hw_output_concat.to_csv('data/outputs/hw_outputs_multiyear.csv', encoding='utf-8', index=False)
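
The per-year cells above can be condensed into a loop. The sketch below is one possible way to do it, not the notebook's code: `concat_hw_outputs` is a hypothetical helper, it skips years whose file is missing, and it resets the index with `ignore_index=True`, which the commented-out `pd.concat` call does not do. The demonstration writes toy files to a temporary directory rather than reading `data/outputs`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def concat_hw_outputs(output_dir, years):
    """Read hw_outputs_{year}.csv for each year, tag it with Year, and stack the results."""
    frames = []
    for year in years:
        path = Path(output_dir) / f"hw_outputs_{year}.csv"
        if not path.exists():
            continue  # design choice in this sketch: skip missing years
        frame = pd.read_csv(path)
        frame["Year"] = year
        frames.append(frame)
    # Note: pd.concat raises ValueError if no files were found at all.
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # Write two toy yearly files to demonstrate the helper.
    for year in (2021, 2022):
        toy = pd.DataFrame({"Industry": ["retail"], "High Wage Count": [100]})
        toy.to_csv(Path(tmp) / f"hw_outputs_{year}.csv", index=False)
    combined = concat_hw_outputs(tmp, range(2010, 2023))

print(combined)
```

With the real files in place, the call would be `concat_hw_outputs('data/outputs', range(2010, 2023))`, followed by the same `to_csv` export shown above.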