# Preprocess Bureau of Labor Statistics Data

## Occupational Requirements Survey

From `or.txt` (readme):
```
The series_id (ORUP1000066700000560) can be broken out into:

Code                                    Value(Example)

Survey abbreviation             =               OR
Seasonal(code)                  =               U
Requirement_code                =               P
Ownership_code                  =               1
Industry_code                   =               0000
Occupation_code                 =               667
Job_characteristic_code         =               000
Estimate_code                   =               00560
```
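
The layout above is fixed-width, so a `series_id` can be split by accumulating field widths. The following is a minimal sketch of that idea; the helper name `split_series_id` and the `OR_FIELDS` table are hypothetical, and the widths are read off the example in the readme rather than from any official schema.

```python
# Field names and widths as read from the or.txt example above (assumed, not official).
OR_FIELDS = [
    ('survey', 2),
    ('seasonal', 1),
    ('requirement_code', 1),
    ('ownership_code', 1),
    ('industry_code', 4),
    ('occupation_code', 3),
    ('job_characteristic_code', 3),
    ('estimate_code', 5),
]


def split_series_id(series_id, fields=OR_FIELDS):
    """Split a fixed-width series_id into a dict of named fields."""
    expected = sum(width for _, width in fields)
    if len(series_id) != expected:
        raise ValueError(
            'expected {} characters, got {}'.format(expected, len(series_id)))
    parts, start = {}, 0
    for name, width in fields:
        parts[name] = series_id[start:start + width]
        start += width
    return parts


parts = split_series_id('ORUP1000066700000560')
print(parts['occupation_code'], parts['estimate_code'])  # 667 00560
```

The offsets this produces (`[9:12]` for `occupation_code`, `[15:20]` for `estimate_code`) are the same slices used in the cells below.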

Goal: extract a "job id" and an `estimate_code` from each `series_id`.

Load all data (`or.data.1.AllData`) into a pandas DataFrame

In [1]:
import matplotlib

In [2]:
import pandas

series = pandas.read_csv('ordata/or.data.1.AllData', delimiter='\t')
series['value'] = pandas.to_numeric(series['value'], errors='coerce')

In [3]:
series.head()

Unnamed: 0,series_id,year,period,value,footnote_codes
0,ORUC1000000000000728,2017,A01,5.1,6.0
1,ORUC1000000000001030,2017,A01,29.0,
2,ORUC1000000000001031,2017,A01,48.2,7.0
3,ORUC1000000000001032,2017,A01,15.6,
4,ORUC1000000000001033,2017,A01,5.8,


Check if `occupation_code` is unique (a viable candidate for a "job id").

In [4]:
occupation_code = series['series_id'].map(lambda a: a[9:12])
occupation_code.head()

0    000
1    000
2    000
3    000
4    000
Name: series_id, dtype: object

In [5]:
series_prefix = series['series_id'].map(lambda a: a[:3] + a[4:15])
series_prefix.head()

0    ORU10000000000
1    ORU10000000000
2    ORU10000000000
3    ORU10000000000
4    ORU10000000000
Name: series_id, dtype: object

In [6]:
len(series_prefix.unique()), len(occupation_code.unique())

(338, 338)

The counts match, so `occupation_code` is a viable "job id" key: there is only one row per metric for each occupation code (job). Map it to `soc_code` (really an O\*NET-SOC 2010 code), then convert that to a proper SOC code by dropping the last two digits, which makes it easier to combine with other BLS datasets.

In [7]:
series['occupation_code'] = series['series_id'].map(lambda a: a[9:12])

In [8]:
occupations = pandas.read_csv('ordata/or.occupation', delimiter='\t', index_col=False, dtype={'occupation_code': str, 'soc_code': str}, usecols=['occupation_code', 'soc_code'])
occupations['soc_code'] = occupations['soc_code'].map(lambda a: a[:6]) # convert from ONETSOC to SOC code
occupations.head()

Unnamed: 0,occupation_code,soc_code
0,0,0
1,1,111011
2,3,111021
3,7,112021
4,8,112022
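
The truncation in the cell above can also be written with the vectorized `.str` accessor. This is a small sketch on made-up values, assuming the `soc_code` column holds 8-digit strings without hyphens (the shape the output above suggests).

```python
import pandas

# Made-up 8-digit codes in the shape the soc_code column is assumed to have.
onet_soc = pandas.Series(['11101100', '11102100', '11202100'])

# .str[:6] is the vectorized equivalent of map(lambda a: a[:6]).
soc = onet_soc.str[:6]
print(soc.tolist())  # ['111011', '111021', '112021']
```

Unlike the `map` version, `.str[:6]` passes missing values through as `NaN` instead of raising.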


In [9]:
series = occupations.merge(series, on='occupation_code')
series.head()

Unnamed: 0,occupation_code,soc_code,series_id,year,period,value,footnote_codes
0,0,0,ORUC1000000000000728,2017,A01,5.1,6.0
1,0,0,ORUC1000000000001030,2017,A01,29.0,
2,0,0,ORUC1000000000001031,2017,A01,48.2,7.0
3,0,0,ORUC1000000000001032,2017,A01,15.6,
4,0,0,ORUC1000000000001033,2017,A01,5.8,


Extract `estimate_code` (the metric id).

In [10]:
series['estimate_code'] = series['series_id'].map(lambda a: a[15:20])
series.head()

Unnamed: 0,occupation_code,soc_code,series_id,year,period,value,footnote_codes,estimate_code
0,0,0,ORUC1000000000000728,2017,A01,5.1,6.0,728
1,0,0,ORUC1000000000001030,2017,A01,29.0,,1030
2,0,0,ORUC1000000000001031,2017,A01,48.2,7.0,1031
3,0,0,ORUC1000000000001032,2017,A01,15.6,,1032
4,0,0,ORUC1000000000001033,2017,A01,5.8,,1033


In [11]:
len(series['estimate_code'].unique())

342

Rows can be uniquely identified by (`occupation_code`, `estimate_code`).

In [12]:
# sanity check
num_occs = len(series['occupation_code'].unique())
num_ests = len(series['estimate_code'].unique())
print('{} * {} = {}'.format(num_occs, num_ests, num_occs * num_ests))
print(len(series))
# fewer rows than the full product because some (occupation, estimate) pairs are 0/missing

338 * 342 = 115596
40352
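
Comparing counts shows that the table is sparse, but it does not by itself prove that no key pair repeats. A more direct check is to look for duplicated pairs. The sketch below uses made-up rows (the fourth row deliberately repeats a key) to show the pattern; on the real `series` frame an empty result would confirm the claim.

```python
import pandas

# Toy rows standing in for the real series; values are made up for illustration.
toy = pandas.DataFrame({
    'occupation_code': ['000', '000', '001', '001'],
    'estimate_code': ['00728', '01030', '00728', '00728'],
    'value': [5.1, 29.0, 3.2, 3.4],
})

# duplicated(keep=False) marks every row whose key pair appears more than once.
key = ['occupation_code', 'estimate_code']
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(len(dupes))  # 2
```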


Merge (average) the data by (`soc_code`, `estimate_code`).
Then convert the rows keyed by (`soc_code`, `estimate_code`) into a matrix with `soc_code` rows and `estimate_code` columns.

In [13]:
# pass the keys as a list: recent pandas versions treat a tuple as a single key
estimate_values_by_soc = series.groupby(by=['soc_code', 'estimate_code'])['value'].mean()
orsdata = estimate_values_by_soc.to_frame().pivot_table(index="soc_code", columns="estimate_code", values="value")
orsdata.head()

estimate_code,00064,00065,00066,00067,00068,00069,00070,00071,00072,00075,...,01076,01077,01080,01081,01084,01085,01087,01088,01090,01091
soc_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,4.0,30.3,5.0,3.5,6.7,13.4,19.1,15.9,2.1,30.6,...,25.8,19.3,77.9,22.1,11.0,22.9,18.2,15.7,1.4,32.5
110000,0.7,,,,1.4,4.8,15.8,57.7,18.4,3.6,...,9.0,24.8,74.6,25.4,8.9,17.6,11.5,15.0,1.2,25.3
111011,,,,,,,,40.7,56.9,,...,,,,,,,,,,
111021,,,,,,6.1,16.7,51.3,20.9,,...,11.0,19.1,70.4,29.6,10.7,8.3,,15.7,,18.4
112021,,,,,,,14.6,57.6,24.9,,...,,21.8,65.3,34.7,,,,,,
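
The long-to-wide step can also be expressed as a group-by followed by `unstack`, which skips the intermediate `to_frame()`/`pivot_table` call. This is a sketch on made-up rows, not a replacement for the cell above; missing (`soc_code`, `estimate_code`) pairs come out as `NaN`, as they do in the real output.

```python
import pandas

# Made-up rows; the first two share a key and should be averaged.
toy = pandas.DataFrame({
    'soc_code': ['111011', '111011', '111021', '111021'],
    'estimate_code': ['00071', '00071', '00071', '00072'],
    'value': [40.0, 42.0, 51.3, 20.9],
})

# Average rows that share a (soc_code, estimate_code) pair, then move
# estimate_code from the row index into the columns.
wide = (
    toy.groupby(['soc_code', 'estimate_code'])['value']
       .mean()
       .unstack('estimate_code')
)
print(wide)
```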


In [26]:
orsdata.to_csv('ordata-processed.csv')

## Occupational Employment Statistics

Add several metrics from OES as columns (national and state-level). TODO: does this format work for state-level data in the choropleth?

From `oe.txt`:
```
The series_id (OEUM000040000000000000001) can be broken out into:

Code					Value(Example)

survey abbreviation =       OE
seasonal(code)      =       U
areatype-code       =       M
area_code           =       0000400
industry_code       =       000000
occupation_code     =       000000 
datatype_code       =       01
```
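
Read as character offsets, the layout above gives one slice per field. The sketch below keeps those offsets in a single table and applies them with the vectorized `.str.slice`; the `OE_SLICES` name is hypothetical and the offsets are derived from the readme example only.

```python
import pandas

# (start, stop) offsets derived from the oe.txt example above (assumed layout).
OE_SLICES = {
    'areatype_code': (3, 4),
    'area_code': (4, 11),
    'industry_code': (11, 17),
    'occupation_code': (17, 23),
    'datatype_code': (23, 25),
}

ids = pandas.Series(['OEUM000040000000000000001'], name='series_id')
fields = pandas.DataFrame(
    {name: ids.str.slice(start, stop) for name, (start, stop) in OE_SLICES.items()}
)
print(fields.iloc[0].to_dict())
```

Keeping the offsets in one place makes it easier to compare them against the readme if the series format ever changes.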

First, load data and extract `occupation_code` (SOC format), `area_code`, `areatype-code`, and `datatype_code`.

In [76]:
oesdata = pandas.read_csv(
    'oedata/oe.data.1.AllData', 
    delimiter='\t',      
    usecols=['series_id', 'value'])
oesdata['value'] = pandas.to_numeric(oesdata['value'], errors='coerce')
oesdata['soc_code'] = oesdata['series_id'].map(lambda s: s[17:23])
oesdata['area_code'] = oesdata['series_id'].map(lambda s: s[4:11])
oesdata['areatype-code'] = oesdata['series_id'].map(lambda s: s[3:4])
oesdata['datatype_code'] = oesdata['series_id'].map(lambda s: s[23:25])
oesdata.head()

Unnamed: 0,series_id,value,soc_code,area_code,areatype-code,datatype_code
0,OEUM001018000000000000001,64450.0,0,10180,M,1
1,OEUM001018000000000000002,2.3,0,10180,M,2
2,OEUM001018000000000000003,19.88,0,10180,M,3
3,OEUM001018000000000000004,41350.0,0,10180,M,4
4,OEUM001018000000000000005,2.2,0,10180,M,5


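Filter out metropolitan-area data. The active filter keeps national rows only; the commented-out alternative would keep every non-metropolitan row, including state-level data.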

In [77]:
len(oesdata)

6253097

In [78]:
oesdata = oesdata[oesdata['areatype-code'] == 'N']
# oesdata = oesdata[oesdata['areatype-code'] != 'M']
len(oesdata)

1954290
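
If state-level rows are wanted alongside the national ones, an explicit allow-list is one option. The sketch below assumes `S` is the area type code for states (worth confirming against `oe.areatype`); the rows are made up.

```python
import pandas

# Made-up rows; 'S' as the state code is an assumption, not taken from this notebook.
toy = pandas.DataFrame({
    'areatype-code': ['M', 'N', 'S', 'M'],
    'value': [64450.0, 142549200.0, 1900000.0, 2.3],
})

# Keep national (N) and state (S) rows; everything else is dropped.
kept = toy[toy['areatype-code'].isin(['N', 'S'])]
print(kept['areatype-code'].tolist())  # ['N', 'S']
```

Compared with `!= 'M'`, an allow-list does not silently admit area types that were not anticipated.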

Add in area names

In [79]:
areas = pandas.read_csv(
    'oedata/oe.area', 
    delimiter='\t',
    usecols=['area_code', 'area_name'],
    converters={'area_code': str},
    index_col=False
)
areas.head()

Unnamed: 0,area_code,area_name
0,0,National
1,11500,"Anniston-Oxford-Jacksonville, AL"
2,12220,"Auburn-Opelika, AL"
3,13820,"Birmingham-Hoover, AL"
4,19300,"Daphne-Fairhope-Foley, AL"


In [80]:
oesdata = oesdata.merge(areas, on='area_code')
oesdata.head()

Unnamed: 0,series_id,value,soc_code,area_code,areatype-code,datatype_code,area_name
0,OEUN000000000000000000001,142549200.0,0,0,N,1,National
1,OEUN000000000000000000002,0.1,0,0,N,2,National
2,OEUN000000000000000000003,24.34,0,0,N,3,National
3,OEUN000000000000000000004,50620.0,0,0,N,4,National
4,OEUN000000000000000000005,0.1,0,0,N,5,National
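
The merge above is an inner join, so any row whose `area_code` has no match in `oe.area` is dropped without warning. A sketch of how that could be checked, using made-up frames: `validate` and `indicator` are standard `DataFrame.merge` parameters, while the `9999999` code is invented to force a miss.

```python
import pandas

# Made-up frames; '9999999' is an invented code with no entry in `areas`.
oes = pandas.DataFrame({
    'area_code': ['0000000', '0000000', '9999999'],
    'value': [1.0, 2.0, 3.0],
})
areas = pandas.DataFrame({'area_code': ['0000000'], 'area_name': ['National']})

# validate raises MergeError if area_code is not unique in `areas`;
# indicator adds a _merge column showing which rows found a match.
merged = oes.merge(
    areas, on='area_code', how='left', validate='many_to_one', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched))  # 1
```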


From `oe.datatype`:
```
datatype_code datatype_name
01	Employment	
02	Employment percent relative standard error	
03	Hourly mean wage	
04	Annual mean wage	
05	Wage percent relative standard error	
06	Hourly 10th percentile wage	
07	Hourly 25th percentile wage	
08	Hourly median wage	
09	Hourly 75th percentile wage	
10	Hourly 90th percentile wage	
11	Annual 10th percentile wage	
12	Annual 25th percentile wage	
13	Annual median wage	
14	Annual 75th percentile wage	
15	Annual 90th percentile wage	
16	Employment per 1,000 jobs	
17	Location Quotient	
```
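
Only two of the codes above are kept in the cells that follow: `01` (employment) and `13` (annual median wage). Renaming pivoted columns by key rather than by position keeps the labels correct even if the set of codes changes. A minimal sketch on made-up values; the `DATATYPE_COLUMNS` name is hypothetical.

```python
import pandas

# Codes from the oe.datatype listing above, mapped to this notebook's column names.
DATATYPE_COLUMNS = {'01': 'num_employed', '13': 'med_annual_wage'}

# Made-up values, deliberately listed out of code order.
toy = pandas.DataFrame({
    'soc_code': ['111011', '111011', '111021', '111021'],
    'datatype_code': ['13', '01', '01', '13'],
    'value': [1000.0, 10.0, 20.0, 2000.0],
})

wide = toy.pivot_table(index='soc_code', columns='datatype_code', values='value')
# Rename by key so each label follows its code regardless of column order.
wide = wide.rename(columns=DATATYPE_COLUMNS)
print(wide)
```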

In [81]:
oesdata = oesdata[oesdata['datatype_code'].isin(['01', '13'])]
len(oesdata)

260572

In [96]:
oesdata_by_soc_area = oesdata.pivot_table(
#     index=('soc_code', 'area_name'),
    index='soc_code',
    columns='datatype_code', 
    values='value'
)
oesdata_by_soc_area.columns = ['num_employed', 'med_annual_wage']
oesdata_by_soc_area.head()

Unnamed: 0_level_0,num_employed,med_annual_wage
soc_code,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1755982.0,43163.622222
110000,88199.13,100105.567929
111000,29087.7,106168.819599
111011,2891.013,169287.192308
111021,25788.15,101717.995546
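
One thing worth checking in the output above: `num_employed` for `soc_code` 0 (1755982.0) is far below the national employment value shown earlier (142549200.0). `pivot_table` aggregates with the mean by default, so this suggests several series per `soc_code` were averaged together, possibly one per industry. The sketch below illustrates the effect on made-up rows and assumes that `industry_code` `000000` denotes the cross-industry total, which should be confirmed against `oe.industry`.

```python
import pandas

# Made-up rows: two industries reporting employment for the same occupation.
toy = pandas.DataFrame({
    'soc_code': ['111011', '111011'],
    'industry_code': ['000000', '111000'],
    'datatype_code': ['01', '01'],
    'value': [200.0, 100.0],
})

# Default aggregation: the mean of all rows sharing (soc_code, datatype_code).
averaged = toy.pivot_table(
    index='soc_code', columns='datatype_code', values='value')

# Keeping one row per key first avoids averaging across industries.
# '000000' as the cross-industry code is an assumption.
cross_industry = toy[toy['industry_code'] == '000000']
single = cross_industry.pivot_table(
    index='soc_code', columns='datatype_code', values='value')

print(averaged.loc['111011', '01'], single.loc['111011', '01'])  # 150.0 200.0
```

In the real data the industry code would come from `series_id[11:17]`, following the `oe.txt` layout.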


In [59]:
# oesdata_by_soc = oesdata_by_soc_area.pivot_table(
#     index='soc_code', 
#     columns='area_name',
#     values=('num_employed', 'med_annual_wage')
# )
# oesdata_by_soc.dropna(how='all', inplace=True)
# oesdata_by_soc.shape

(1067, 110)

In [98]:
oesdata_by_soc_area.to_csv('oedata-processed.csv')

## Join Calculated ORS metrics and OES data

See `transform_or.ipynb`

In [99]:
calculated_ors = pandas.read_csv('calculated_metrics.csv', converters={'soc_code': str})
calculated_ors.set_index('soc_code', inplace=True)
calculated_ors.head()

Unnamed: 0_level_0,occupation_text,communication,danger,experience,interaction_complexity,pace_of_work,physicality,uncertain_decisions,variety
soc_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,All Workers,0.217566,0.230495,1.050002,0.515043,0.58981,0.399198,0.334232,1.402005
110000,Management Occupations,0.215913,-0.299971,1.707406,2.28219,0.283505,-0.449087,1.76954,1.58806
111011,Chief Executives,0.540168,-0.541306,0.398697,2.333767,0.962482,-0.751402,2.237894,0.452748
111021,General and Operations Managers,0.77717,-0.30096,0.440351,1.839416,-0.517994,-0.428323,1.784039,1.616338
112021,Marketing Managers,-0.050012,-0.588087,0.349113,2.418221,-0.517994,-0.911465,1.950008,1.155236


In [102]:
data = oesdata_by_soc_area.join(calculated_ors, how='right')
data.head()

Unnamed: 0_level_0,num_employed,med_annual_wage,occupation_text,communication,danger,experience,interaction_complexity,pace_of_work,physicality,uncertain_decisions,variety
soc_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,1755982.0,43163.622222,All Workers,0.217566,0.230495,1.050002,0.515043,0.58981,0.399198,0.334232,1.402005
110000,88199.13,100105.567929,Management Occupations,0.215913,-0.299971,1.707406,2.28219,0.283505,-0.449087,1.76954,1.58806
111011,2891.013,169287.192308,Chief Executives,0.540168,-0.541306,0.398697,2.333767,0.962482,-0.751402,2.237894,0.452748
111021,25788.15,101717.995546,General and Operations Managers,0.77717,-0.30096,0.440351,1.839416,-0.517994,-0.428323,1.784039,1.616338
112021,3166.102,117437.95977,Marketing Managers,-0.050012,-0.588087,0.349113,2.418221,-0.517994,-0.911465,1.950008,1.155236
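
The `how='right'` join keeps every occupation that has calculated ORS metrics, whether or not OES has a matching `soc_code`; unmatched rows get `NaN` in the OES columns. A small sketch on made-up frames (the `999999` code is invented) showing how such rows could be counted before writing the file:

```python
import pandas

# Made-up frames indexed by soc_code; '999999' has no OES counterpart.
oes = pandas.DataFrame(
    {'num_employed': [10.0, 20.0]},
    index=pandas.Index(['111011', '111021'], name='soc_code'),
)
ors = pandas.DataFrame(
    {'danger': [-0.5, 0.2]},
    index=pandas.Index(['111011', '999999'], name='soc_code'),
)

# how='right' keeps every row of `ors`; OES columns are NaN where no match exists.
joined = oes.join(ors, how='right')
missing_oes = int(joined['num_employed'].isna().sum())
print(missing_oes)  # 1
```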


In [103]:
data.to_csv('job-data.csv')