# Notebook for Creating High Wage Outputs

#### This notebook creates high wage outputs for the year set in the `year` variable below (currently 2014).

In [1]:
import pandas as pd
import numpy as np
import string
import warnings
import os
import re
from jqi_functions import *
warnings.filterwarnings('ignore')

## Set the desired year and the corresponding cost of living year

In [2]:
year = '2014'
col_year = '2014'

## Creating IPUMS dataframe

#### IPUMS Data
`cleaned_ipums` is a function to generate a cleaned pandas dataframe using IPUMS data, filtering it down to California only and the desired year. The year needs to be entered in string format as a parameter.

In [3]:
ca_ipums = cleaned_ipums(year)

#### Cost of living needs to be updated each year.

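If the United Way Real Cost Measure for the desired year has not been published, set `col_year` to the most recent year with available data. For example, when the 2020 measure had not yet been published, I continued using the data from 2019.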

In [4]:
cost_of_living = pd.read_csv(f'data/cost_of_living/united-way-col-1A1PS1C{col_year}.csv')

### Create county lookup dataframe

Expanding the `county_info` dataframe to include cost of living metrics. This dataframe is used when industry information in a geographic area is too sparse and the next largest geographic area needs to be used instead.

In [5]:
county_info = pd.read_csv('data/county_to_regions_key.csv')

In [6]:
county_info = county_info[['County', 'COUNTYFIP', 'Rural/Urban', 'CERF Regions']]

In [7]:
county_info = pd.merge(county_info, cost_of_living, left_on = 'CERF Regions', right_on = 'Regions')

In [8]:
county_info = county_info.rename(columns = {'Cost of Living':'Regional COL'})
county_info = county_info.drop(columns=['Regions'])
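
The merge above is an inner join, so any county whose `CERF Regions` value has no match in the cost of living file would be dropped without warning. A minimal sketch of one way to check for that, assuming toy stand-in dataframes (`county_demo`, `col_demo`, and the "Example County" row are hypothetical and not part of the project data):

```python
import pandas as pd

# Toy stand-ins for county_info and cost_of_living; rows are illustrative only.
county_demo = pd.DataFrame({
    'County': ['Alameda', 'San Diego', 'Example County'],
    'CERF Regions': ['Bay Area', 'San Diego-Imperial', 'Unlisted Region'],
})
col_demo = pd.DataFrame({
    'Regions': ['Bay Area', 'San Diego-Imperial'],
    'Cost of Living': [77661, 66878],
})

# A left join with indicator=True keeps every county and adds a '_merge'
# column that flags rows with no matching region in the cost of living table.
checked = pd.merge(county_demo, col_demo, how='left',
                   left_on='CERF Regions', right_on='Regions', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'County'].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that every county found a regional cost of living value.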

In [9]:
# county_info = pd.merge(county_info, cost_of_living, left_on = 'Rural/Urban', right_on = 'Regions')

In [10]:
# county_info = county_info.rename(columns = {'Cost of Living':'Rural/Urban COL', 'Regions_x':'Regions'})
# county_info = county_info.drop(columns=['Regions_y'])

In [11]:
# Row 13, column 1 of the cost of living file is assumed to hold the statewide value
county_info['State COL'] = cost_of_living.iloc[13, 1]

View of final `county_info` dataframe.

In [12]:
county_info.head()

Unnamed: 0,County,COUNTYFIP,Rural/Urban,CERF Regions,Regional COL,State COL
0,Alameda,1,Urban,Bay Area,77661,67433
1,Contra Costa,13,Urban,Bay Area,77661,67433
2,Solano,95,Urban,Bay Area,77661,67433
3,San Mateo,81,Urban,Bay Area,77661,67433
4,Santa Clara,85,Urban,Bay Area,77661,67433


In [13]:
ca_ipums = pd.merge(ca_ipums, county_info, on = 'COUNTYFIP')

View of final `ca_ipums` dataframe.

In [14]:
ca_ipums.head()

Unnamed: 0,YEAR,COUNTYFIP,INDNAICS,PERWT,INCWAGE,NAICS Code,Industry Title,Industry,Crosswalk Value,County,Rural/Urban,CERF Regions,Regional COL,State COL
0,2014,73,5413,260.0,40000,5413,architectural engineering and related services,architectural engineering and related services,22,San Diego,Urban,San Diego-Imperial,66878,67433
1,2014,73,5413,260.0,40000,5413,architectural engineering and related services,architectural engineering and related services,23,San Diego,Urban,San Diego-Imperial,66878,67433
2,2014,73,5413,60.0,200000,5413,architectural engineering and related services,architectural engineering and related services,22,San Diego,Urban,San Diego-Imperial,66878,67433
3,2014,73,5413,60.0,200000,5413,architectural engineering and related services,architectural engineering and related services,23,San Diego,Urban,San Diego-Imperial,66878,67433
4,2014,73,5413,125.0,82000,5413,architectural engineering and related services,architectural engineering and related services,22,San Diego,Urban,San Diego-Imperial,66878,67433


## Create EDD Dataframe

#### EDD Data
The year for EDD data must be specified.

These CSV files are filtered and cleaned versions of the raw EDD Current Employment Statistics dataset. Files for upcoming years can be created with the notebook `multiyear-edd-data-creation.ipynb`.

In [15]:
edd = pd.read_csv(f'data/edd/edd_{year}.csv')

View of final `edd` dataframe.

In [16]:
edd.head()

Unnamed: 0,Area Type,Area Name,Year,Month,Date,Series Code,Seasonally Adjusted,Current Employment,Industry Title,COUNTYFIP,County,Rural/Urban,CERF Regions,Crosswalk Value
0,County,Alameda,2014,January,01/01/2014,80000000,N,24400,other services,1,Alameda,Urban,Bay Area,32
1,County,Alameda,2014,February,02/01/2014,80000000,N,24700,other services,1,Alameda,Urban,Bay Area,32
2,County,Alameda,2014,March,03/01/2014,80000000,N,24800,other services,1,Alameda,Urban,Bay Area,32
3,County,Alameda,2014,April,04/01/2014,80000000,N,25600,other services,1,Alameda,Urban,Bay Area,32
4,County,Alameda,2014,May,05/01/2014,80000000,N,25200,other services,1,Alameda,Urban,Bay Area,32


## Add High Wage Features

`add_geo_high_wages` is a function that adds the following engineered features:
- Above Threshold (Number of records above respective cost of living threshold)
- Weighted Above Threshold (Above Threshold multiplied by the person weight variable, `PERWT`)
- Unweighted Industry Counts (Number of records in that industry)
- Weighted Industry Counts (Sum of person weight values in that industry)
- Weighted High Wage Percentage (Weighted Above Threshold divided by Weighted Industry Counts, as a percentage)

The features are created for the following geographical levels:
- Region
- California
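
The feature definitions above can be sketched with a pandas `groupby`. This is not the implementation of `add_geo_high_wages`, which lives in `jqi_functions.py` and is not shown here. It is a rough illustration at the region level only, assuming toy data, a strict `>` comparison against the threshold, and hypothetical output column names:

```python
import pandas as pd

# Toy records; column names follow the notebook, values are made up.
demo = pd.DataFrame({
    'CERF Regions': ['Bay Area'] * 4,
    'Industry': ['retail', 'retail', 'retail', 'software'],
    'PERWT': [100.0, 50.0, 50.0, 200.0],
    'INCWAGE': [90000, 30000, 80000, 150000],
    'Regional COL': [77661] * 4,
})

# Above Threshold: 1 if the record's wage exceeds the regional cost of living
# (whether the real function uses > or >= is an assumption here).
demo['Above Threshold'] = (demo['INCWAGE'] > demo['Regional COL']).astype(int)
# Weighted Above Threshold: the flag multiplied by the person weight.
demo['Weighted Above Threshold'] = demo['Above Threshold'] * demo['PERWT']

# Aggregate per region and industry.
features = demo.groupby(['CERF Regions', 'Industry']).agg(
    unweighted_count=('PERWT', 'size'),
    weighted_count=('PERWT', 'sum'),
    weighted_above=('Weighted Above Threshold', 'sum'),
).reset_index()
features['weighted_hw_pct'] = (
    100 * features['weighted_above'] / features['weighted_count']
)
print(features)
```

Weighting by `PERWT` means each survey record counts for the number of people it represents, so the percentage describes the workforce rather than the sample.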

In [17]:
ca_ipums_hw = add_geo_high_wages(ca_ipums)

View of final `ca_ipums_hw` dataframe.

In [18]:
ca_ipums_hw.head().T

Unnamed: 0,0,1,2,3,4
YEAR,2014,2014,2014,2014,2014
COUNTYFIP,73,73,73,73,73
INDNAICS,5413,5413,5413,5413,5413
PERWT,260.0,60.0,125.0,443.0,381.0
INCWAGE,40000,200000,82000,24000,44000
NAICS Code,5413,5413,5413,5413,5413
Industry Title,architectural engineering and related services,architectural engineering and related services,architectural engineering and related services,architectural engineering and related services,architectural engineering and related services
Industry,architectural engineering and related services,architectural engineering and related services,architectural engineering and related services,architectural engineering and related services,architectural engineering and related services
Crosswalk Value,22,22,22,22,22
County,San Diego,San Diego,San Diego,San Diego,San Diego


## Create High Wage Outputs Dataframe

`edd_to_hw` is the function that outputs the values needed to create the high wage output dataframe. This portion of the notebook runs through every unique combination of region, industry, and date, to get that respective output and add it to the dataframe.

Because of the nested structure of the EDD industries, only a small selection of EDD industries can be used, which ensures that individuals in nested industries are not counted twice. The selection of these industries differs by region, so the series code of each industry is documented in the `region_series_codes` global variable in the `jqi_functions.py` library. Each of these codes was then assigned its own crosswalk value, which aligns with the crosswalk value assigned to each IPUMS industry. For generating high wage outputs, we only iterate through the EDD industries that have been selected and have a designated crosswalk value.
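
As a rough illustration of how a high wage count could be derived from an EDD employment count, the sketch below scales employment by a high wage percentage and falls back to the statewide percentage when the regional sample is sparse. The helper `estimate_high_wage_count`, its parameters, and the reading of a minimum sample size are all assumptions made for illustration. The actual logic of `edd_to_hw` is defined in `jqi_functions.py` and may differ.

```python
def estimate_high_wage_count(region_pct, region_n, state_pct, employment, min_n=10):
    """Hypothetical helper: scale an employment count by a high wage percentage.

    Uses the regional percentage when the region has at least min_n survey
    records for the industry, otherwise the statewide percentage.
    """
    pct = region_pct if region_n >= min_n else state_pct
    # Truncate to a whole number of workers.
    return int(employment * pct / 100)


# Enough regional records: the regional percentage is applied.
enough = estimate_high_wage_count(25.0, 40, 20.0, 1000)
# Sparse regional sample: the statewide percentage is applied instead.
sparse = estimate_high_wage_count(25.0, 3, 20.0, 1000)
print(enough, sparse)
```

Falling back to a larger geography trades local detail for a more stable estimate when only a handful of records are available.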

Getting unique values for each region, industry, and date.

In [19]:
regions_ipums = ca_ipums_hw['CERF Regions'].unique()

In [20]:
crosswalk_vals = sorted(edd['Crosswalk Value'].unique())

In [21]:
dates_edd = edd['Date'].unique()

Initializing empty lists for the function's outputs to later be joined in a dataframe.

In [22]:
industries = []
dates = []
regions = []
counts = []
emp_counts = []

In [23]:
total_iterations = len(regions_ipums) * len(crosswalk_vals) * len(dates_edd)

For loop to populate lists for the high wage outputs. This will take some time to finish running.

In [24]:
progress_count = 0
for region in regions_ipums:
    for code in crosswalk_vals:
        for date in dates_edd:
            hw_count, hw_perc, employment_count, industry = edd_to_hw(edd, ca_ipums_hw, region, code, date, 10)
            industries.append(industry)
            dates.append(date)
            regions.append(region)
            counts.append(hw_count)
            emp_counts.append(employment_count)
            progress_count += 1
            # Report progress roughly every 10%; integer division avoids a float modulus
            if progress_count % max(total_iterations // 10, 1) == 0:
                percent_done = int((progress_count / total_iterations) * 100)
                print(f'Progress: {percent_done}% Complete')

Progress: 10% Complete
Progress: 20% Complete
Progress: 30% Complete
Progress: 40% Complete
Progress: 50% Complete
Progress: 60% Complete
Progress: 70% Complete
Progress: 80% Complete
Progress: 90% Complete
Progress: 100% Complete


Creating a cleaned dataframe from the output lists.

In [25]:
df_dict = {'Industry':industries, 'Date':dates, 'Region':regions, 'High Wage Count':counts, 'Employment Count':emp_counts}
hw_output = pd.DataFrame(df_dict)
hw_output = hw_output[hw_output['Industry'].notna()]
hw_output['Date']= pd.to_datetime(hw_output['Date'])
hw_output['High Wage Count'] = hw_output['High Wage Count'].astype(int)
hw_output = hw_output.sort_values(by=['Industry', 'Region', 'Date'])
hw_output = pd.merge(hw_output, cost_of_living, left_on='Region', right_on='Regions')
hw_output = hw_output[['Industry', 'Date', 'Region', 'High Wage Count', 'Employment Count', 'Cost of Living']]
hw_output = hw_output.drop_duplicates()

View of final `hw_output` dataframe.

In [26]:
hw_output.head()

Unnamed: 0,Industry,Date,Region,High Wage Count,Employment Count,Cost of Living
0,accommodation and food services,2014-01-01,Inland Empire,12184,123200.0,59469
1,accommodation and food services,2014-02-01,Inland Empire,12352,124900.0,59469
2,accommodation and food services,2014-03-01,Inland Empire,12490,126300.0,59469
3,accommodation and food services,2014-04-01,Inland Empire,12550,126900.0,59469
4,accommodation and food services,2014-05-01,Inland Empire,12659,128000.0,59469


Export the dataframe as a CSV file. Change the file path if needed.

In [27]:
hw_output.to_csv(f'data/outputs/hw_outputs_{year}.csv', encoding='utf-8', index=False)

#### TESTING

In [77]:
bayarea = hw_output.loc[(hw_output['Date'] == '2014-12-01') & (hw_output['Region'] == 'Bay Area')]

In [78]:
bayarea['High Wage Count'].sum()

1112069

In [79]:
bayarea['Employment Count'].sum()

3445300.0

In [80]:
california = hw_output.loc[hw_output['Date'] == '2014-12-01']

In [81]:
california['High Wage Count'].sum()

4280345

In [82]:
california['Employment Count'].sum()

15338100.0
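
One way to read the totals above is as a high wage share of employment. A minimal sketch using the December 2014 figures printed by the cells above (the `totals` and `shares` names are introduced here for illustration):

```python
# (High Wage Count, Employment Count) totals for December 2014, copied from above.
totals = {
    'Bay Area': (1112069, 3445300.0),
    'California': (4280345, 15338100.0),
}

# High wage share of employment, as a percentage rounded to one decimal place.
shares = {name: round(100 * hw / emp, 1) for name, (hw, emp) in totals.items()}
print(shares)
```

Comparing the regional share with the statewide share is a quick sanity check that the regional outputs are plausible.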

https://data.bls.gov/timeseries/LASST060000000000005?amp%253bdata_tool=XGtable&output_view=data&include_graphs=true