# Obtain and Clean LODES Data

> "The LEHD Origin-Destination Employment Statistics (LODES) datasets are released both as
part of the OnTheMap application and in raw form as a set of comma separated variable (CSV)
text files. This document describes the structure of those raw files and provides basic information
for users who want to perform analytical work on the data outside of the OnTheMap application." (U.S. Census, 2021)

U.S. Census Bureau. (2021). LEHD Origin-Destination Employment Statistics Data (2002-2018) [computer file]. Washington, DC: U.S. Census Bureau, Longitudinal-Employer Household Dynamics Program [distributor], accessed on {CURRENT DATE} at https://lehd.ces.census.gov/data/#lodes. LODES 7.5 [version]

The LODES data provide employment characteristics and origin-destination data. This notebook works through three steps:

1. Read in LODES data
2. Select Work Area Characteristics in the study area
3. Select origin-destination data in the study area
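The three steps above can be sketched with pandas. This is a minimal illustration and not the code in the notebook's `lodes` module: the inline CSV text stands in for downloaded LODES files, the rows outside Robeson County are made up, and the helper names `read_lodes_csv` and `select_county` are hypothetical. It assumes that block geocodes are 15-digit strings whose first five characters are the state and county FIPS code, and that the columns `w_geocode`, `h_geocode`, `C000`, and `S000` follow the LODES file layout.

```python
import io
import pandas as pd

# Tiny stand-ins for LODES WAC and OD files (real files have many more columns).
WAC_CSV = """w_geocode,C000
371559601023032,10
371559602013046,5
370510001001000,7
"""
OD_CSV = """w_geocode,h_geocode,S000
371559601023032,371559602013046,3
371559601023032,370510001001000,2
370510001001000,370510001001001,4
"""

def read_lodes_csv(source, geocode_cols):
    """Read a LODES-style CSV, keeping block geocodes as strings."""
    return pd.read_csv(source, dtype={col: str for col in geocode_cols})

def select_county(df, geocode_col, countyfips):
    """Keep rows whose block geocode starts with the 5-digit county FIPS."""
    return df[df[geocode_col].str.startswith(countyfips)].copy()

# 1. Read in LODES data
wac = read_lodes_csv(io.StringIO(WAC_CSV), ["w_geocode"])
od = read_lodes_csv(io.StringIO(OD_CSV), ["w_geocode", "h_geocode"])

# 2. Select Work Area Characteristics in the study area
wac_study = select_county(wac, "w_geocode", "37155")

# 3. Select origin-destination pairs whose work block is in the study area
od_study = select_county(od, "w_geocode", "37155")

print(wac_study)
print(od_study)
```

Reading the geocode columns as strings avoids losing leading zeros for states whose FIPS code starts with 0, which would break the prefix match.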

## Description of Program
- program:    LODES_1av4_CleanLODESdata
- task:       Obtain and read in LODES data
- See GitHub commits for description of program updates
- Current Version:    2022-01-07
- Version 4 description: try to run with fewer characteristics and allow for lower fitness, in order to get a result in less time
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, N. (2021) "Obtain, Clean, and LODES Jobs Data".
Archived on GitHub and ICPSR.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import os # For saving output to path

In [2]:
# Display versions being used - important information for replication
import sys
print("Python Version     ", sys.version)
print("numpy version:     ", np.__version__)
print("pandas version:    ", pd.__version__)

Python Version      3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:22:46) [MSC v.1916 64 bit (AMD64)]
numpy version:      1.22.0
pandas version:     1.3.5


In [3]:
os.getcwd()

'g:\\Shared drives\\HRRC_IN-CORE\\Tasks\\P4.9 Testebeds\\WorkNPR'

In [4]:
# Store Program Name for output files to have the same name
programname = "LODES_1av4_CleanLODESdata"
# Save Outputfolder - due to long folder name paths output saved to folder with shorter name
# files from this program will be saved with the program name - this helps to follow the overall workflow
outputfolder = "lodes_workflow_outputv4"
# Make directory to save output
if not os.path.exists(outputfolder):
    os.mkdir(outputfolder)
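As a small illustration of the naming convention described in the comments above, output paths can be built from the folder and program name. The helper `build_output_path` and the `label` argument are hypothetical and are not part of the notebook or its packages.

```python
import os

def build_output_path(outputfolder, programname, label, ext="csv"):
    """Return a path like <outputfolder>/<programname>_<label>.<ext>."""
    filename = f"{programname}_{label}.{ext}"
    return os.path.join(outputfolder, filename)

example_path = build_output_path(
    "lodes_workflow_outputv4", "LODES_1av4_CleanLODESdata", "example")
print(example_path)
```

Prefixing every file with the program name makes it easier to trace an output file back to the notebook that produced it.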

### Set up notebook environment to access cloned GitHub package
This notebook uses packages that are in development. The packages are available at:

https://github.com/npr99/Labor_Market_Allocation

To replicate this notebook, clone the GitHub package to a folder that is a sibling of this notebook.

To access the sibling package you will need to append the parent directory ('..') to the system path list.

In [5]:
# To access the new package in a sibling folder, the system path list needs to include the parent folder (..).
# Append the path of the directory that contains the GitHub repository.
sys.path.append("..\\github_com\\npr99\\Labor_Market_Allocation")
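The path string above uses Windows-style backslashes. A more portable variant, sketched here under the assumption that the repository sits at the same relative location, builds the path with `pathlib` and avoids adding it twice when the cell is re-run. The variable names `repo_path` and `repo_str` are illustrative.

```python
import sys
from pathlib import Path

# Assumed relative location of the cloned repository, mirroring the
# path used above; adjust it to match your own folder layout.
repo_path = Path("..") / "github_com" / "npr99" / "Labor_Market_Allocation"

# Add the path only once so re-running the cell does not duplicate entries.
repo_str = str(repo_path)
if repo_str not in sys.path:
    sys.path.append(repo_str)
```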

### Set up access to IN-CORE
https://incore.ncsa.illinois.edu/

In [None]:
#from pyincore import IncoreClient, Dataset, FragilityService, MappingSet, DataService
#from pyincore_viz.geoutil import GeoUtil as viz

### IN-CORE addons
This program uses code that is being developed as potential add-ons to pyincore. These functions are in a folder called pyincore_addons, which is located in the same directory as this notebook.
The add-on functions are organized to mirror the folder structure of https://github.com/IN-CORE/pyincore

Each add on function attempts to follow the structure of existing pyincore functions and includes some help information.

In [6]:
# To reload submodules need to use this magic command to set autoreload on
%load_ext autoreload
%autoreload 2
# open, read, and execute python program with reusable commands
# function that loops through lodes data structure
import pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_fullloop as lodes
import pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_mcmcsa_loops as mcmc
import pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_mcmcsa_util as mcmc_util
import pyincoredata_addons.SourceData.lehd_ces_census_gov.lodes_datautil as data_util


# since the geoutil is under construction it might need to be reloaded
#from importlib import reload 
#lodes = reload(lodes) # with auto reload on this command is not needed

# Print list of add on functions
##from inspect import getmembers, isfunction
##print(getmembers(lodes,isfunction))
##print(getmembers(mcmc,isfunction))

## Read In LODES data for a County

In [7]:
# Setup list and dictionaries to loop through
# countylist can only have 1 county at this time
countylist = {'37155' : 'Robeson County, NC'}
# Note: 2010 is the earliest year with Federal job types.
# Including years before 2010 requires modifying the program.
#years = ['2010','2011','2012','2013','2014','2015','2016','2017','2018']
#years = ['2012','2013','2014','2015','2016']
years = ['2010']

In [9]:
stacked_df = lodes.obtain_lodes_county_loop(countylist = countylist, 
                        years = years,
                        outputfoldername = outputfolder)

Importing and cleaning LODES data for the following files in the od group for 2010.
Importing and cleaning LODES data for the following files in the wac group for 2010.
Importing and cleaning LODES data for the following files in the rac group for 2010.


In [10]:
lodes.explore_jobcounts(stacked_df)

| | jobtype | nc_37155_od_2010_na | nc_37155_wac_2010_SE | nc_37155_rac_2010_SE |
|---|---|---|---|---|
| 0 | JT03 | 41503.0 | 26878.0 | 437582.0 |
| 1 | JT05 | 113.0 | 40.0 | 3227.0 |
| 2 | JT07 | 10348.0 | 7914.0 | 91138.0 |
| 3 | JT09 | 3799.0 | 2598.0 | 33601.0 |
| 4 | JT11 | 611.0 | 511.0 | 4769.0 |
| 5 | All | 56375.0 | 37941.0 | 570344.0 |


In [11]:
focus_jobtype = 'JT11'
seed = 133234

stacked_jobtype_df = mcmc.split_stack_df_byjobtype(stacked_df)
for county in countylist:
    countyfips = county
    joblist_df, mcmcsa_filepath = \
        mcmc.outer_mcmc_sa_input("2010", years,
                    focus_jobtype,
                    stacked_jobtype_df,
                    outputfolder,
                    countyfips,
                    seed
                    )

Job list for MCMC SA input already exists.


In [12]:
    
# ## Run Markov Chain Monte Carlo Simulated Annealing (MCMC SA)
"""
The MCMC SA process takes the combined WAC-OD-RAC job list and
further reduces the number of possible jobs.
The process randomly selects jobs and compares their
characteristics to the known job characteristics in the WAC file.
"""

random_accept_threshold = 0.1
# How much to reduce error by
start_reduction_threshold = 0.6
max_reduction_threshold = 0.9
seed = 133234

joblist_mcmcsa_df = mcmc.outer_mcmc_sa_loop(
                years = years,
                focus_jobtype = focus_jobtype,
                stacked_jobtype_df = stacked_jobtype_df,
                joblist_df = joblist_df,
                seed = seed,
                random_accept_threshold = random_accept_threshold,
                mcmcsa_filepath = mcmcsa_filepath,
                start_reduction_threshold = start_reduction_threshold,
                num_procs=8
                )

3 Random seeds set 71749 94221 39830
The JT11 joblist has 707 possible jobs.
64 % of the jobs have a 100% probability of selection.
The MCMC SA will attempt to predict the remaining job.
There are 51 Blocks
Running MCMC SA on block 371559601023032
The starting outer total fitness = 0 and outer combined fitness = 0
#################################
Attempt to reduce size of possible job list by 0.6
#################################
Possible and expected job lists have the same length: 10
The final outer total fitness = 0 and outer combined fitness = 0
Running MCMC SA on block 371559612003011
The starting outer total fitness = 33 and outer combined fitness = 98
#################################
Attempt to reduce size of possible job list by 0.6
#################################
#################################
Attempt to reduce size of possible job list by 0.7
#################################
Removing check_S000 from column list
#################################
Attempt to reduce size 
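The internals of `mcmc.outer_mcmc_sa_loop` are not shown in this notebook, so the following is only a generic toy sketch of simulated annealing applied to a similar selection problem, not the package's algorithm. It picks a subset of candidate jobs so that the category counts of the subset match known totals. The function names, the `steps`, `start_temp`, and `cooling` parameters, and the example data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def fitness(selected, target_counts):
    """Sum of absolute differences between selected and target category counts."""
    counts = Counter(selected)
    categories = set(counts) | set(target_counts)
    return sum(abs(counts.get(c, 0) - target_counts.get(c, 0)) for c in categories)

def anneal_selection(candidates, target_counts, seed, steps=2000,
                     start_temp=1.0, cooling=0.995):
    """Select as many candidates as the targets total, matching category counts."""
    rng = random.Random(seed)
    n_select = sum(target_counts.values())
    indices = list(range(len(candidates)))
    rng.shuffle(indices)
    chosen, pool = indices[:n_select], indices[n_select:]
    current = fitness([candidates[i] for i in chosen], target_counts)
    best_chosen, best = list(chosen), current
    temp = start_temp
    for _ in range(steps):
        if best == 0 or not pool:
            break
        # Propose swapping one chosen job with one job left in the pool.
        a, b = rng.randrange(len(chosen)), rng.randrange(len(pool))
        chosen[a], pool[b] = pool[b], chosen[a]
        proposed = fitness([candidates[i] for i in chosen], target_counts)
        delta = proposed - current
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = proposed
            if current < best:
                best_chosen, best = list(chosen), current
        else:
            chosen[a], pool[b] = pool[b], chosen[a]  # undo the swap
        temp *= cooling  # cool down so worse moves become less likely
    return sorted(best_chosen), best

candidates = ["M", "F", "M", "F", "F", "M", "F", "M"]  # made-up job categories
target = {"M": 3, "F": 1}                              # made-up known totals
chosen, final_fitness = anneal_selection(candidates, target, seed=133234)
print(chosen, final_fitness)
```

A fitness of 0 means the selected jobs reproduce the known totals exactly, which parallels the "fitness = 0" lines in the log above.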

In [13]:
compare_C000, total_fitness, combined_fitness = \
    mcmc.compare_expected_possible(joblist_mcmcsa_df, stacked_jobtype_df)

There are 51 Blocks


In [14]:
compare_C000

| | w_geocode | jobtype | expected count | MCMC SA count | difference |
|---|---|---|---|---|---|
| 0 | 371559601023032 | JT11 | 10.0 | 10 | 0.0 |
| 1 | 371559602013046 | JT11 | 5.0 | 5 | 0.0 |
| 2 | 371559602021017 | JT11 | 1.0 | 1 | 0.0 |
| 3 | 371559602021018 | JT11 | 2.0 | 2 | 0.0 |
| 4 | 371559603002050 | JT11 | 3.0 | 3 | 0.0 |
| 5 | 371559603002051 | JT11 | 3.0 | 3 | 0.0 |
| 6 | 371559603002054 | JT11 | 3.0 | 3 | 0.0 |
| 7 | 371559603002080 | JT11 | 2.0 | 2 | 0.0 |
| 8 | 371559603004017 | JT11 | 2.0 | 2 | 0.0 |
| 9 | 371559605021031 | JT11 | 20.0 | 20 | 0.0 |
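A comparison table with these columns could be assembled with a pandas group-by and merge. This is a sketch under assumptions and not the body of `mcmc.compare_expected_possible`: the input frames are made up apart from the two block codes and counts taken from the first rows of the output, and the sign convention for `difference` (expected minus selected) is assumed, since every difference shown is zero.

```python
import pandas as pd

# Illustrative inputs: expected counts per work block (as in a WAC file)
# and a job list with one row per selected job.
expected = pd.DataFrame({
    "w_geocode": ["371559601023032", "371559602013046"],
    "jobtype": ["JT11", "JT11"],
    "expected count": [10, 5],
})
joblist = pd.DataFrame({
    "w_geocode": ["371559601023032"] * 10 + ["371559602013046"] * 5,
    "jobtype": ["JT11"] * 15,
})

# Count selected jobs per block and job type.
selected = (joblist.groupby(["w_geocode", "jobtype"])
            .size()
            .reset_index(name="MCMC SA count"))

# Join on block and job type; blocks with no selected jobs get a count of 0.
compare = expected.merge(selected, on=["w_geocode", "jobtype"], how="left")
compare["MCMC SA count"] = compare["MCMC SA count"].fillna(0).astype(int)
compare["difference"] = compare["expected count"] - compare["MCMC SA count"]
print(compare)
```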


In [15]:
total_fitness

| | 1 |
|---|---|
| Agejobtype | 0 |
| Earningsjobtype | 0 |
| Educationjobtype | 2 |
| Ethnicityjobtype | 4 |
| IndustryCodejobtype | 0 |
| Racejobtype | 2 |
| Sexjobtype | 0 |
| fitness | 8 |
| iteration | 1 |
| seed | na |
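In this output the `fitness` value equals the sum of the seven per-characteristic values. The short check below reproduces that arithmetic; whether the package always computes `fitness` as this sum is an assumption drawn from this single result.

```python
# Per-characteristic values copied from the total_fitness output.
fitness_by_characteristic = {
    "Agejobtype": 0,
    "Earningsjobtype": 0,
    "Educationjobtype": 2,
    "Ethnicityjobtype": 4,
    "IndustryCodejobtype": 0,
    "Racejobtype": 2,
    "Sexjobtype": 0,
}
total = sum(fitness_by_characteristic.values())
print(total)  # 8, matching the reported fitness value
```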


In [16]:
combined_fitness

| | 1 |
|---|---|
| AgeEarnings | 0 |
| EducationEarnings | 2 |
| EthnicityEarnings | 4 |
| RaceEarnings | 4 |
| SexEarnings | 0 |
| EarningsAge | 0 |
| EducationAge | 2 |
| EthnicityAge | 4 |
| RaceAge | 2 |
| SexAge | 0 |
