From `or.txt` (readme):
```
The series_id (ORUP1000066700000560) can be broken out into:

Code                                    Value(Example)

Survey abbreviation             =               OR
Seasonal(code)                  =               U
Requirement_code                =               P
Ownership_code                  =               1
Industry_code                   =               0000
Occupation_code                 =               667
Job_characteristic_code         =               000
Estimate_code                   =               00560
```
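
The layout above can be turned into fixed-width slices. The following is a minimal sketch, assuming every `series_id` is exactly 20 characters and follows the readme example; `SERIES_ID_FIELDS` and `split_series_id` are hypothetical names used only for illustration.

```python
# Field layout taken from the readme example: (name, start, end),
# with 0-based, end-exclusive offsets into the 20-character series_id.
SERIES_ID_FIELDS = [
    ('survey', 0, 2),
    ('seasonal', 2, 3),
    ('requirement_code', 3, 4),
    ('ownership_code', 4, 5),
    ('industry_code', 5, 9),
    ('occupation_code', 9, 12),
    ('job_characteristic_code', 12, 15),
    ('estimate_code', 15, 20),
]


def split_series_id(series_id):
    """Split a series_id into its named fixed-width fields (hypothetical helper)."""
    return {name: series_id[start:end] for name, start, end in SERIES_ID_FIELDS}


parts = split_series_id('ORUP1000066700000560')
print(parts['occupation_code'], parts['estimate_code'])
```

These offsets are the same ones used in the cells below (`a[9:12]` for the occupation code and `a[15:20]` for the estimate code).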

Goal: extract a "job id" and an `estimate_code` from each `series_id`.

Load all data (`or.data.1.AllData`) into a pandas DataFrame

In [1]:
import pandas

series = pandas.read_csv('ordata/or.data.1.AllData', delimiter='\t')
series['value'] = pandas.to_numeric(series['value'], errors='coerce')
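
As a small illustration of what `errors='coerce'` does in the cell above: entries that cannot be parsed as numbers become `NaN` instead of raising an error. The `'-'` placeholder below is made up for the example and is not taken from the data file.

```python
import pandas

# Made-up raw strings; '-' stands in for any non-numeric entry.
raw = pandas.Series(['5.1', '29.0', '-'])

# Unparseable entries are coerced to NaN; the result is a float column.
values = pandas.to_numeric(raw, errors='coerce')
print(values)
```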

In [2]:
series.head()

Unnamed: 0,series_id,year,period,value,footnote_codes
0,ORUC1000000000000728,2017,A01,5.1,6.0
1,ORUC1000000000001030,2017,A01,29.0,
2,ORUC1000000000001031,2017,A01,48.2,7.0
3,ORUC1000000000001032,2017,A01,15.6,
4,ORUC1000000000001033,2017,A01,5.8,


Check whether `occupation_code` is unique (a viable candidate for a "job id").

In [3]:
occupation_code = series['series_id'].map(lambda a: a[9:12])
occupation_code.head()

0    000
1    000
2    000
3    000
4    000
Name: series_id, dtype: object

In [4]:
series_prefix = series['series_id'].map(lambda a: a[:3] + a[4:15])
series_prefix.head()

0    ORU10000000000
1    ORU10000000000
2    ORU10000000000
3    ORU10000000000
4    ORU10000000000
Name: series_id, dtype: object

In [5]:
len(series_prefix.unique()), len(occupation_code.unique())

(338, 338)
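
Comparing the two counts is a quick check. A stricter variant, sketched here on made-up ids that follow the readme layout, counts how many distinct prefixes occur for each occupation code and confirms the answer is always one.

```python
import pandas

# Hypothetical sample ids built to follow the readme layout.
sample = pandas.Series([
    'ORUC1000000000000728',
    'ORUP1000000000001030',
    'ORUC1000066700000560',
    'ORUP1000066700000561',
])

occupation = sample.str[9:12]
# Same prefix as in the notebook: drop the requirement code (index 3)
# and the trailing estimate code.
prefix = sample.str[:3] + sample.str[4:15]

# Number of distinct prefixes seen for each occupation code.
prefixes_per_occupation = prefix.groupby(occupation).nunique()
one_to_one = bool((prefixes_per_occupation == 1).all())
print(prefixes_per_occupation)
print(one_to_one)
```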

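`occupation_code` is a viable "job id" key: the number of distinct series prefixes equals the number of distinct occupation codes, so each occupation code has exactly one prefix (only one row for each metric for each occupation code/job). Map it to `soc_code` (really an O\*NET-SOC 2010 code), then drop the last two digits to get a proper SOC code, which combines better with other BLS datasets.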

In [6]:
series['occupation_code'] = series['series_id'].map(lambda a: a[9:12])

In [7]:
occupations = pandas.read_csv(
    'ordata/or.occupation',
    delimiter='\t',
    index_col=False,
    dtype={'occupation_code': str, 'soc_code': str},
    usecols=['occupation_code', 'soc_code'],
)
# convert from O*NET-SOC to SOC code by keeping the first six digits
occupations['soc_code'] = occupations['soc_code'].map(lambda a: a[:6])
occupations.head()

Unnamed: 0,occupation_code,soc_code
0,0,0
1,1,111011
2,3,111021
3,7,112021
4,8,112022


In [8]:
series = occupations.merge(series, on='occupation_code')
series.head()

Unnamed: 0,occupation_code,soc_code,series_id,year,period,value,footnote_codes
0,0,0,ORUC1000000000000728,2017,A01,5.1,6.0
1,0,0,ORUC1000000000001030,2017,A01,29.0,
2,0,0,ORUC1000000000001031,2017,A01,48.2,7.0
3,0,0,ORUC1000000000001032,2017,A01,15.6,
4,0,0,ORUC1000000000001033,2017,A01,5.8,
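
`merge` defaults to an inner join, so data rows whose `occupation_code` has no entry in the occupation table would be dropped silently. The sketch below shows one way to surface such rows with the `how`, `indicator` and `validate` parameters of `DataFrame.merge`. The toy tables, including the unmatched code `'999'`, are made up for illustration.

```python
import pandas

# Toy stand-ins for the real tables; the codes and values are made up.
occupations = pandas.DataFrame({
    'occupation_code': ['000', '001'],
    'soc_code': ['000000', '111011'],
})
series = pandas.DataFrame({
    'occupation_code': ['000', '001', '999'],
    'value': [5.1, 29.0, 48.2],
})

# A left merge keeps every data row. `indicator` adds a `_merge` column
# saying whether a match was found, and `validate` raises if an
# occupation_code appears more than once in `occupations`.
checked = series.merge(
    occupations,
    on='occupation_code',
    how='left',
    indicator=True,
    validate='many_to_one',
)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'occupation_code'].tolist()
print(unmatched)
```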


Extract `estimate_code` aka metric id

In [9]:
series['estimate_code'] = series['series_id'].map(lambda a: a[15:20])
series.head()

Unnamed: 0,occupation_code,soc_code,series_id,year,period,value,footnote_codes,estimate_code
0,0,0,ORUC1000000000000728,2017,A01,5.1,6.0,728
1,0,0,ORUC1000000000001030,2017,A01,29.0,,1030
2,0,0,ORUC1000000000001031,2017,A01,48.2,7.0,1031
3,0,0,ORUC1000000000001032,2017,A01,15.6,,1032
4,0,0,ORUC1000000000001033,2017,A01,5.8,,1033


In [10]:
len(series['estimate_code'].unique())

342

Rows can be uniquely identified by the pair (`occupation_code`, `estimate_code`).

In [11]:
# sanity check
num_occs = len(series['occupation_code'].unique())
num_ests = len(series['estimate_code'].unique())
print('{} * {} = {}'.format(num_occs, num_ests, num_occs * num_ests))
print(len(series))
# the row count is lower than the full product because some of the data is 0/missing

338 * 342 = 115596
40352
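
The comparison above only shows that there are fewer rows than possible (occupation, estimate) pairs. A direct uniqueness check can be sketched with `DataFrame.duplicated`. The toy rows below are made up and deliberately contain one repeated pair so the check has something to find.

```python
import pandas

# Made-up rows; the last two share the same (occupation_code, estimate_code).
rows = pandas.DataFrame({
    'occupation_code': ['000', '000', '001', '001'],
    'estimate_code': ['00728', '01030', '00728', '00728'],
    'value': [5.1, 29.0, 40.7, 41.3],
})

key = ['occupation_code', 'estimate_code']
# keep=False marks every member of a duplicated group, not just the repeats.
duplicates = rows[rows.duplicated(subset=key, keep=False)]
is_unique_key = duplicates.empty
print(duplicates)
print(is_unique_key)
```

On the real data, an empty `duplicates` frame would confirm that the pair works as a row key.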


Merge the data by (`soc_code`, `estimate_code`), averaging `value` where several rows share the same pair.

In [12]:
# pass the keys as a list: recent pandas treats a tuple as a single key
estimate_values_by_soc = series.groupby(by=['soc_code', 'estimate_code'])['value'].mean()
data = estimate_values_by_soc.to_frame().pivot_table(index="soc_code", columns="estimate_code", values="value")
data.head()

soc_code \ estimate_code,00064,00065,00066,00067,00068,00069,00070,00071,00072,00075,...,01076,01077,01080,01081,01084,01085,01087,01088,01090,01091
0,4.0,30.3,5.0,3.5,6.7,13.4,19.1,15.9,2.1,30.6,...,25.8,19.3,77.9,22.1,11.0,22.9,18.2,15.7,1.4,32.5
110000,0.7,,,,1.4,4.8,15.8,57.7,18.4,3.6,...,9.0,24.8,74.6,25.4,8.9,17.6,11.5,15.0,1.2,25.3
111011,,,,,,,,40.7,56.9,,...,,,,,,,,,,
111021,,,,,,6.1,16.7,51.3,20.9,,...,11.0,19.1,70.4,29.6,10.7,8.3,,15.7,,18.4
112021,,,,,,,14.6,57.6,24.9,,...,,21.8,65.3,34.7,,,,,,
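
To make the averaging and reshaping concrete, here is a toy version of the same step. It uses `unstack` in place of `pivot_table`, which is a swapped-in but equivalent way to move `estimate_code` into the columns for this example. The codes and values are made up.

```python
import pandas

# Two rows share the same (soc_code, estimate_code) and get averaged.
toy = pandas.DataFrame({
    'soc_code': ['111011', '111011', '111021'],
    'estimate_code': ['00071', '00071', '00072'],
    'value': [40.0, 42.0, 20.9],
})

means = toy.groupby(['soc_code', 'estimate_code'])['value'].mean()
# Move the estimate_code index level into columns; missing pairs become NaN.
matrix = means.unstack('estimate_code')
print(matrix)
```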


The data has now been converted from rows keyed by (`soc_code`, `estimate_code`) to a matrix with `soc_code` rows and `estimate_code` columns. Save it as CSV.

In [13]:
data.to_csv('ordata-processed.csv')
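
When the file is read back, `soc_code` is best kept as a string, since `read_csv` would otherwise infer an integer type and drop any leading zeros. A minimal round-trip sketch, using an in-memory buffer and made-up values instead of the real file:

```python
import io

import pandas

# Made-up matrix in the same shape as the processed data.
matrix = pandas.DataFrame(
    {'00064': [4.0, 0.7], '00065': [30.3, None]},
    index=pandas.Index(['000000', '110000'], name='soc_code'),
)

buffer = io.StringIO()
matrix.to_csv(buffer)
buffer.seek(0)

# Read soc_code as text first, then promote it to the index.
restored = pandas.read_csv(buffer, dtype={'soc_code': str}).set_index('soc_code')
print(restored)
```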