# Flolabs + USF Proof-of-Concept

*February 2024 - August 2024*  
*Collin McCarter*

----------------

## Description
**Question:** Can public data be used to quantify back wages owed by sub-county geographies?

**Answer:** Yes, and in-field research is highly recommended to narrow the gap between the minimum and maximum estimates. At minimum, Pasco County FL shows an estimated **$\$$153,000** in back wages owed in 2022. At maximum, Pasco County FL shows an estimated **$\$$2,124,537** in back wages owed in 2022. The large difference comes from assumptions about **how prevalent** wage theft is among workers and the **financial impact** of wage theft when it occurs. Additionally, this analysis focused on _"where victims live"_, whereas future analysis could explore _"where victims work"_ using US Census ZIP Code Business Patterns data products.

----------------  

## FORMULA to estimate BACK WAGES  
* **MINIMUM:**  
    + COUNT of EMPLOYEES x
        + EMPLOYEES in High Violation Industries of Construction, Retail, Healthcare, Hotel/Food
    + 0.2% PREVALENCE of WAGE THEFT x 
    + $\$$1000 WAGE THEFT IMPACT PER EMPLOYEE
    
* **MAXIMUM:**  
    + COUNT of EMPLOYEES x
    + 5% of EMPLOYEES are "low-wage" in All Industries x
    + 68% PREVALENCE of WAGE THEFT x 
    + 15% of MEDIAN ANNUAL EARNINGS IMPACT of WAGE THEFT PER EMPLOYEE x
    + 5% of Tract median annual earnings for "low-wage" EMPLOYEES 
----------------  
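As a worked example of the two formulas above, the sketch below applies them to a single Tract. The all-industry employee count (1,095) and median annual earnings (32,284 dollars) come from the sample Tract row printed at the end of this notebook; the high violation industry count (600) is a hypothetical value chosen only for illustration, and the function names are not part of the notebook.

```python
def max_back_wages(employees, median_earnings,
                   prevalence=0.68, impact=0.15, low_wage=0.05):
    """MAXIMUM formula: low-wage victims times their assumed annual loss."""
    # 5% of employees are assumed low-wage; 68% of those experience wage theft
    victims = round(employees * prevalence * low_wage, 0)
    # Each victim loses 15% of assumed low-wage earnings (5% of the Tract median)
    return round(victims * median_earnings * impact * low_wage, 0)


def min_back_wages(high_violation_employees, prevalence=0.002, impact=1000):
    """MINIMUM formula: victims in high violation industries times a flat loss."""
    victims = round(high_violation_employees * prevalence, 0)
    return round(victims * impact, 0)


# Sample Tract values from the notebook output: 1,095 employees, 32,284 median earnings
print(max_back_wages(1095, 32284))
# Hypothetical count of employees in high violation industries
print(min_back_wages(600))
```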
## DATA and ASSUMPTIONS  
* Public data used includes
    + US Census, American Community Survey
    + US Census, Community Resilience Estimates
    + US Bureau of Labor Statistics, Employment by Industry
    + US Department of Labor, Wage and Hour Division
    + Pasco County Commission Districts, ArcGIS OpenData shapefile
        + https://data-pascocounty.opendata.arcgis.com/datasets/ef7ff9c226524e7898c4b6e095d4f248_0/explore?location=28.324279%2C-82.543450%2C10.00
        + Districts can cross Tract boundaries, so aggregating Tract results into Districts will be distorted. More sophisticated geospatial methods could approximate a correct disaggregation of ONE Tract into TWO Districts, but that is not currently in scope. Currently, the spatial join matches a Tract to every District it intersects, and the csv output keeps one row per Tract (the first matched District) carrying the full Tract metric. 58 of 131 Tracts are flagged in the csv output file as crossing two different District boundaries.
            + https://shapely.readthedocs.io/en/stable/predicates.html
* Key Python packages include
    + pygris - geography shapefiles
    + folium - interactive maps
    + geopandas - storage and modification of geography statistics
* Key research sources and assumptions
    + Florida International University, Wage Theft Report for Hillsborough County https://labor.fiu.edu/publications/faculty-publications/wage-theft-report-for-hillsborough-county.pdf
        + PREVALENCE: **68%** of low-wage workers had at least 1 wage theft experience
        + IMPACT: **15%** of annual earnings is lost when employees experience at least one wage theft incident
    + Flolabs analysis of US Department of Labor wage theft cases reported and US Bureau of Labor Statistics estimated employees by industry
        + PREVALENCE: **0.2%** of workers in high violation industries had at least 1 wage theft experience
        + IMPACT: **$\$$1000** per employee is lost to wage theft when present
* Supplemental Resources  
    + https://api.census.gov/data/2018/zbp/variables.html
    + https://api.census.gov/data/2022/acs/acs5/subject/variables.html
    + https://www.census.gov/programs-surveys/economic-census/year/2022/guidance/understanding-naics.html
    + https://data.census.gov
    + https://files.epi.org/pdf/125116.pdf
    + https://www.census.gov/programs-surveys/community-resilience-estimates/technical-documentation/methodology.html
    + https://vverde.github.io/blob/interactivechoropleth.html

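The disaggregation of one Tract into several Districts mentioned above is out of scope for this notebook, but the idea can be sketched with simple area weighting. The example assumes the overlap areas between a Tract and each District have already been measured (for instance from a polygon intersection) and that workers are spread evenly across the Tract, which is a strong assumption. The function name, district labels, and numbers are hypothetical.

```python
def apportion_by_area(tract_value, overlap_areas):
    """Split one Tract metric across Districts in proportion to overlap area.

    overlap_areas maps a district label to the area of the Tract that falls
    inside that District (any consistent unit).
    """
    total_area = sum(overlap_areas.values())
    if total_area <= 0:
        raise ValueError("overlap areas must sum to a positive number")
    return {district: tract_value * area / total_area
            for district, area in overlap_areas.items()}


# Hypothetical Tract owing 9,000 dollars, with 3/4 of its area in District 1
shares = apportion_by_area(9000.0, {"District 1": 3.0, "District 2": 1.0})
print(shares)
```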
In [1]:
# %pip install openpyxl
# %pip install pygris
# %pip install folium

import pandas as pd
import geopandas as gpd
import openpyxl
import branca.colormap as cm
import folium
from pygris import counties,tracts
import matplotlib.pyplot as plt

path1    = "Commission_District_-1780918262523526205.zip"
path_out1= "backwagesbytract_2022_final.csv"
path_out2= "Pasco2022_USCensus_emp_cre_race_backwages_final.html"
api_key  = '' # get US Census API Key

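One possible way to keep the API key out of the notebook is to read it from an environment variable and build the query URL in a helper. This is a minimal sketch, not the notebook's method: the variable name `CENSUS_API_KEY` and the function `build_census_url` are hypothetical, and dropping the `key` parameter when no key is set assumes the API tolerates keyless requests at low volume.

```python
import os
from urllib.parse import urlencode


def build_census_url(year, table, variables, state_fips, county_fips, api_key=""):
    """Assemble a tract-level query URL in the same shape used in this notebook."""
    params = [
        ("get", variables),
        ("for", "tract:*"),
        ("in", f"state:{state_fips}"),
        ("in", f"county:{county_fips}"),
    ]
    # Only send the key parameter when a key is actually available
    if api_key:
        params.append(("key", api_key))
    # Keep ':', '*' and ',' unescaped so the URL matches the hand-written form
    query = urlencode(params, safe=":*,")
    return f"https://api.census.gov/data/{year}/{table}?{query}"


# Hypothetical environment variable name; falls back to an empty key
api_key = os.environ.get("CENSUS_API_KEY", "")
print(build_census_url("2022", "cre", "NAME,PRED3_PE", "12", "101", api_key))
```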
In [2]:
## 1 DATA INPUT: Pasco County FL Geographies
# ------------START----------------- #
county_var  = 'Pasco'
state_var   = 'FL'
year_var    = 2022

gpd_tracts_pasco = tracts(county=county_var,state=state_var,year=year_var,cache=False)

# grab state FIPS from gpd_tracts_pasco 
state_fips  = ','.join(str(x) for x in gpd_tracts_pasco.STATEFP.unique())
county_fips = ','.join(str(x) for x in gpd_tracts_pasco.COUNTYFP.unique())
# -------------END---------------- #

## 2 DATA INPUT: US Census CRE API
# ------------START----------------- #
year_str    = '2022'
api_table   = 'cre'
var_strlist = 'NAME,PRED3_PE'
var_names   = {'PRED3_PE':'rate_3plus_vul_ind'}
census_ints = ['rate_3plus_vul_ind']

url = f"https://api.census.gov/data/{year_str}/{api_table}?get={var_strlist}&for=tract:*&in=state:{state_fips}&in=county:{county_fips}&key={api_key}"

df_tracts_cre   = pd.read_json(url)
df_tracts_cre   = df_tracts_cre.rename(columns=df_tracts_cre.iloc[0]).drop(df_tracts_cre.index[0])
df_tracts_cre   = df_tracts_cre.rename(columns=var_names)
df_tracts_cre[census_ints] = df_tracts_cre[census_ints].fillna(0).astype(float)

df_tracts_cre = pd.concat([pd.DataFrame(),df_tracts_cre])
# -------------END---------------- #

## 3 DATA INPUT: US Census ACS API
# ------------START----------------- #
county_var  = 'Pasco'
state_var   = 'FL'
year_var    = 2022
year_str    = '2022'
api_table   = 'acs/acs5/subject'
var_strlist = 'NAME,S2413_C01_001E,S2413_C01_005E,S2413_C01_008E,S2413_C01_022E,S2413_C01_025E,S2404_C01_001E,S2404_C01_005E,S2404_C01_008E,S2404_C01_022E,S2404_C01_025E,S0601_C01_022E' #add total employees & earnings for top 5 NAICS
var_names   = {'S2413_C01_001E':'med_earnings_dollars_allind',
               'S2413_C01_005E':'med_earnings_dollars_construct',
               'S2413_C01_008E':'med_earnings_dollars_retail',
               'S2413_C01_022E':'med_earnings_dollars_health',
               'S2413_C01_025E':'med_earnings_dollars_hotelfood',
               'S2404_C01_001E':'count_emp_allind',
               'S2404_C01_005E':'count_emp_construct',  #S2404
               'S2404_C01_008E':'count_emp_retail',
               'S2404_C01_022E':'count_emp_health',
               'S2404_C01_025E':'count_emp_hotelfood',
               'S0601_C01_022E':'perc_whitealone_nonhisp'}
census_ints = ['med_earnings_dollars_allind',
               'med_earnings_dollars_construct',
               'med_earnings_dollars_retail',
               'med_earnings_dollars_health',
               'med_earnings_dollars_hotelfood',
               'count_emp_allind',
               'count_emp_construct',
               'count_emp_retail',
               'count_emp_health',
               'count_emp_hotelfood']

url = f"https://api.census.gov/data/{year_str}/{api_table}?get={var_strlist}&for=tract:*&in=state:{state_fips}&in=county:{county_fips}&key={api_key}"

df_tracts_acs5   = pd.read_json(url)
df_tracts_acs5   = df_tracts_acs5.rename(columns=df_tracts_acs5.iloc[0]).drop(df_tracts_acs5.index[0])
df_tracts_acs5   = df_tracts_acs5.rename(columns=var_names)
df_tracts_acs5[census_ints] = df_tracts_acs5[census_ints].fillna(0).astype(int)

# negative values are Census sentinel codes for unavailable estimates (e.g. -666666666); treat them as zero
for col in [c for c in census_ints if c.startswith('med_earnings_dollars_')]:
    df_tracts_acs5.loc[df_tracts_acs5[col] < 0, col] = 0

df_tracts_acs5 = pd.concat([pd.DataFrame(),df_tracts_acs5])
# -------------END---------------- #

## 4 DATA INPUT: PASCO COUNTY COMMISSION DISTRICTS
# ------------START----------------- #
df_commiss_districts = gpd.read_file(path1).to_crs(gpd_tracts_pasco.crs)
df_commiss_districts = df_commiss_districts.rename(columns={'NAME': 'pasco_com_district'})
# -------------END---------------- #

## MERGE DATA INPUTS
df_tracts_merged = gpd_tracts_pasco.merge(df_tracts_acs5,left_on='TRACTCE',right_on='tract').merge(df_tracts_cre,left_on='TRACTCE',right_on='tract')
print(len(df_tracts_merged))
df_tracts_merged = df_tracts_merged.sjoin(df_commiss_districts[['geometry','pasco_com_district']], how="left", predicate='intersects')
print(len(df_tracts_merged))
df_tracts_merged.drop(columns=['GEOID','NAME_x','NAMELSAD','MTFCC','FUNCSTAT',
                              'ALAND','AWATER', 'INTPTLAT', 'INTPTLON','NAME_y',
                               'state_x', 'county_x','tract_x','state_y', 'county_y','tract_y','index_right'
                              ],inplace=True)

## DATA CLEANING
# tract id treated as integer
df_tracts_merged['TRACTCE'] = df_tracts_merged['TRACTCE'].astype(int)
# percentage treated as float
df_tracts_merged['perc_whitealone_nonhisp'] = df_tracts_merged['perc_whitealone_nonhisp'].astype(float)
# negative sentinel values (unavailable estimates) treated as zero
df_tracts_merged.loc[df_tracts_merged['perc_whitealone_nonhisp'] < 0, 'perc_whitealone_nonhisp'] = 0
# flag tracts crossing district boundaries; keep=False marks every copy of a duplicated tract,
# so the flag is still True on the row that survives drop_duplicates below
df_tracts_merged['tract_crosses_district'] = df_tracts_merged.duplicated(subset='TRACTCE',keep=False)
# drop 58 rows of duplicate tracts (keep first) which cross district boundaries
df_tracts_merged.drop_duplicates(subset='TRACTCE',keep='first',inplace=True,ignore_index=True)

Using FIPS code '12' for input 'FL'
Using FIPS code '101' for input 'Pasco'
131
189

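The Census API calls above return a list of rows whose first row is the header, which is why the notebook promotes row 0 to column names and then drops it. The sketch below shows the same reshaping with only the standard library. The response body, tract names, and values are hypothetical, and `rows_to_records` is not a function from the notebook.

```python
import json

# Hypothetical response body: first row is the header, later rows are Tracts
raw = '''[["NAME","PRED3_PE","state","county","tract"],
          ["Tract A","21.5","12","101","030101"],
          ["Tract B",null,"12","101","030102"]]'''


def rows_to_records(raw_json, float_cols):
    """Turn a header-plus-rows payload into a list of dicts with numeric columns."""
    header, *rows = json.loads(raw_json)
    records = []
    for row in rows:
        record = dict(zip(header, row))
        for col in float_cols:
            # Mirrors fillna(0).astype(float): missing values become 0.0
            record[col] = float(record[col]) if record[col] is not None else 0.0
        records.append(record)
    return records


records = rows_to_records(raw, ["PRED3_PE"])
print(records[0]["PRED3_PE"], records[1]["PRED3_PE"])
```

Filling missing values with zero keeps the later arithmetic simple, but it also makes a Tract with no data look the same as a Tract with a true zero.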

In [3]:
## FORMULA MAXIMUM
## all employees * 0.05 (low-wage share) * 0.68 * (0.15 * 0.05 * median annual earnings)
# ------------START----------------- #
employees  = 'count_emp_allind' # sourced from US Census ACS
prevalence = 0.68 # sourced from FIU study
impact     = 0.15 # sourced from FIU study, 15% of median annual earnings column 'med_earnings_dollars_allind'; may need to multiply by 0.25 to represent lower wages
low_wage   = 0.05 # assumption that 5% of employees are "low-wage" who make 5% of Tract median annual earnings; applied twice below

df_tracts_merged['max_emp_wagetheft_yes']              = round(df_tracts_merged[employees]*prevalence*low_wage,0)
df_tracts_merged['max_emp_wagetheft_yes_owed_dollars'] = round(df_tracts_merged['max_emp_wagetheft_yes']*df_tracts_merged['med_earnings_dollars_allind']*impact*low_wage,0)
print(df_tracts_merged.max_emp_wagetheft_yes_owed_dollars.sum()
      ,'\n   MAXIMUM: Estimated Back Wages Owed, All Industries, 68% PREVALENCE, 15% Median Earnings IMPACT') 
# -------------END---------------- #

## FORMULA MINIMUM
## high violation industries employees * 0.002 * 1000
# ------------START----------------- #
employees  = 'count_emp_highvioind' # sourced from US Census ACS
prevalence = 0.002 # sourced from analysis of National US Dept of Labor reported divided by US BLS employees
impact     = 1000 # sourced from analysis of National US Dept of Labor reported divided by US BLS employees

df_tracts_merged['count_emp_highvioind']               = df_tracts_merged.loc[:,['count_emp_construct','count_emp_retail','count_emp_health','count_emp_hotelfood']].sum(axis=1)
df_tracts_merged['min_emp_wagetheft_yes']              = round(df_tracts_merged[employees]*prevalence,0)
df_tracts_merged['min_emp_wagetheft_yes_owed_dollars'] = round(df_tracts_merged['min_emp_wagetheft_yes']*impact,0)
print(df_tracts_merged.min_emp_wagetheft_yes_owed_dollars.sum()
      ,'\n   MINIMUM: Estimated Back Wages Owed, High Violation Industries, 0.2% PREVALENCE, $1000 IMPACT') 
# -------------END---------------- #

2124537.0 
   MAXIMUM: Estimated Back Wages Owed, All Industries, 68% PREVALENCE, 15% Median Earnings IMPACT
153000.0 
   MINIMUM: Estimated Back Wages Owed, High Violation Industries, 0.2% PREVALENCE, $1000 IMPACT

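Both formulas round the estimated number of affected employees within each Tract before multiplying by the impact. At 0.2% prevalence, any Tract with fewer than 250 employees in high violation industries rounds to zero affected employees, so the county total depends on where the rounding happens. The sketch below compares the two orders of operation on hypothetical Tract counts; the function names are illustrative only.

```python
def victims_rounded_per_tract(counts, prevalence):
    """Round within each Tract, then sum (the order used in this notebook)."""
    return sum(round(count * prevalence, 0) for count in counts)


def victims_rounded_once(counts, prevalence):
    """Sum employees across Tracts first, then round a single time."""
    return round(sum(counts) * prevalence, 0)


# Hypothetical high violation industry employee counts for four Tracts
counts = [120, 240, 610, 900]
per_tract = victims_rounded_per_tract(counts, 0.002)
pooled = victims_rounded_once(counts, 0.002)

# Multiply by the 1000 dollar impact used in the MINIMUM formula
print(per_tract * 1000, pooled * 1000)
```

With these counts the two small Tracts contribute nothing under per-Tract rounding, so the pooled estimate comes out higher.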

In [4]:
## MAPPING
# ------------START----------------- #

col1 = 'min_emp_wagetheft_yes_owed_dollars'
col2 = 'max_emp_wagetheft_yes_owed_dollars'
col3 = 'count_emp_highvioind'
col4 = 'rate_3plus_vul_ind'
col5 = 'perc_whitealone_nonhisp'
# Future, map index columns combining fields above to better determine patterns across variables
# col6 = 'index_emphighvioind_ratevul_morewhitenonhisp'
# col7 = 'index_emphighvioind_ratevul_lesswhitenonhisp'


# Create a base folium map with desired location and zoom level
m = folium.Map(location=[df_tracts_merged.geometry.iloc[0].centroid.y,df_tracts_merged.geometry.iloc[0].centroid.x], zoom_start=10, tiles='OpenStreetMap')

# Add static tract + district numbers

# compute centroids once rather than on every loop iteration
tract_centroids    = df_tracts_merged.geometry.centroid
district_centroids = df_commiss_districts.geometry.centroid

for tract in df_tracts_merged.index:
    tract_text = df_tracts_merged.loc[tract,'TRACTCE']
    folium.map.Marker(
        (tract_centroids.y.loc[tract],tract_centroids.x.loc[tract]),
        icon=folium.features.DivIcon(
            icon_size=(10,5),
            icon_anchor=(0,0),
            html='<div style="font-size: 5pt">%s</div>' % tract_text,
            )
        ).add_to(m)
for district in df_commiss_districts.index:
    district_text = df_commiss_districts.loc[district,'pasco_com_district']
    folium.map.Marker(
        (district_centroids.y.loc[district],district_centroids.x.loc[district]),
        icon=folium.features.DivIcon(
            icon_size=(100,100),
            icon_anchor=(0,0),
            html='<div style="font-size: 14pt">%s</div>' % district_text,
            )
        ).add_to(m)

# Add layers to the map
folium.Choropleth(
    geo_data=df_tracts_merged[['TRACTCE','geometry']],
    data=df_tracts_merged[col1],
    name=col1,
    columns=[col1,'TRACTCE'],
    key_on="feature.id",
    fill_color="YlGnBu",
    legend_name=col1,
    highlight=True,
).add_to(m)
folium.Choropleth(
    geo_data=df_tracts_merged[['TRACTCE','geometry']],
    data=df_tracts_merged[col2],
    name=col2,
    columns=[col2,'TRACTCE'],
    key_on="feature.id",
    fill_color="YlGnBu",
    legend_name=col2,
    highlight=True,
).add_to(m)
folium.Choropleth(
    geo_data=df_tracts_merged[['TRACTCE','geometry']],
    data=df_tracts_merged[col3],
    name=col3,
    columns=[col3,'TRACTCE'],
    key_on="feature.id",
    fill_color="YlGnBu",
    legend_name=col3,
    highlight=True,
).add_to(m)
folium.Choropleth(
    geo_data=df_tracts_merged[['TRACTCE','geometry']],
    data=df_tracts_merged[col4],
    name=col4,
    columns=[col4,'TRACTCE'],
    key_on="feature.id",
    fill_color="YlGnBu",
    legend_name=col4,
    highlight=True,
).add_to(m)
folium.Choropleth(
    geo_data=df_tracts_merged[['TRACTCE','geometry']],
    data=df_tracts_merged[col5],
    name=col5,
    columns=[col5,'TRACTCE'],
    key_on="feature.id",
    fill_color="YlGnBu",
    legend_name=col5,
    highlight=True,
).add_to(m)
folium.GeoJson(df_commiss_districts[['pasco_com_district','geometry']].to_json(),
               name='pasco_com_district',
               color="black",
               weight=5).add_to(m)
folium.LayerControl().add_to(m)
m
# -------------END---------------- #
# warnings on map come from computing polygon centroids in a geographic CRS to place the tract and district labels; not an issue for label placement.




In [5]:
## EXPORT FILES
# ------------START----------------- #
df_tracts_merged.to_csv(path_out1)
m.save(path_out2)
# -------------END---------------- #

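Once the Tract file is exported, one way to roll it up to Commission Districts is to sum by `pasco_com_district` while tracking how much of each total comes from Tracts flagged in `tract_crosses_district`. This is a sketch on hypothetical rows that reuse the exported column names; the district labels, dollar values, and the `district_totals` helper are assumptions, and the totals inherit the boundary distortion described in the assumptions section.

```python
from collections import defaultdict

# Hypothetical rows using column names from the exported csv
rows = [
    {"TRACTCE": 31101, "pasco_com_district": "District 1",
     "min_emp_wagetheft_yes_owed_dollars": 1000.0, "tract_crosses_district": False},
    {"TRACTCE": 31102, "pasco_com_district": "District 1",
     "min_emp_wagetheft_yes_owed_dollars": 2000.0, "tract_crosses_district": True},
    {"TRACTCE": 31201, "pasco_com_district": "District 2",
     "min_emp_wagetheft_yes_owed_dollars": 3000.0, "tract_crosses_district": False},
]


def district_totals(rows, value_col):
    """Sum a Tract metric per District and track the share from flagged Tracts."""
    totals = defaultdict(lambda: {"total": 0.0, "from_crossing_tracts": 0.0})
    for row in rows:
        bucket = totals[row["pasco_com_district"]]
        bucket["total"] += row[value_col]
        if row["tract_crosses_district"]:
            # This amount may partly belong to a neighboring District
            bucket["from_crossing_tracts"] += row[value_col]
    return dict(totals)


totals = district_totals(rows, "min_emp_wagetheft_yes_owed_dollars")
print(totals)
```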
In [6]:
df_tracts_merged.head(1).T

Unnamed: 0,0
STATEFP,12
COUNTYFP,101
TRACTCE,31101
geometry,"POLYGON ((-82.699345 28.338521, -82.699338 28...."
med_earnings_dollars_allind,32284
med_earnings_dollars_construct,29283
med_earnings_dollars_retail,37353
med_earnings_dollars_health,25132
med_earnings_dollars_hotelfood,19432
count_emp_allind,1095
