From `or.txt` (readme):
```
The series_id (ORUP1000066700000560) can be broken out into:

Code                                    Value(Example)

Survey abbreviation             =               OR
Seasonal(code)                  =               U
Requirement_code                =               P
Ownership_code                  =               1
Industry_code                   =               0000
Occupation_code                 =               667
Job_characteristic_code         =               000
Estimate_code                   =               00560
```

Goal: Extract a "job id" and an `estimate_code` from each `series_id`.
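As a minimal sketch of the breakdown above, the fixed-width layout can be written down once and applied to any `series_id`. The helper name `split_series_id` and the field names are hypothetical (they are not part of `or.txt`); the widths are read off the example values in the readme, and the sketch assumes the id is exactly 20 characters with no padding.

```python
# Field names (hypothetical) and widths, in the order listed in or.txt.
FIELDS = [
    ("survey", 2),
    ("seasonal_code", 1),
    ("requirement_code", 1),
    ("ownership_code", 1),
    ("industry_code", 4),
    ("occupation_code", 3),
    ("job_characteristic_code", 3),
    ("estimate_code", 5),
]


def split_series_id(series_id):
    """Split a fixed-width series_id into a dict of named fields."""
    expected = sum(width for _, width in FIELDS)
    if len(series_id) != expected:
        raise ValueError(
            "expected {} characters, got {}".format(expected, len(series_id))
        )
    parts = {}
    pos = 0
    for name, width in FIELDS:
        parts[name] = series_id[pos:pos + width]
        pos += width
    return parts


parts = split_series_id("ORUP1000066700000560")
print(parts["occupation_code"], parts["estimate_code"])  # 667 00560
```

With this layout, `occupation_code` sits at `series_id[9:12]` and `estimate_code` at `series_id[15:20]`, which are the slices used in the cells below.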

Load all data (`or.data.1.AllData`) into a pandas DataFrame

In [1]:
import pandas
series = pandas.read_csv('ordata/or.data.1.AllData', delimiter='\t')

In [2]:
series.head()

Unnamed: 0,series_id,year,period,value,footnote_codes
0,ORUC1000000000000728,2017,A01,5.1,6.0
1,ORUC1000000000001030,2017,A01,29.0,
2,ORUC1000000000001031,2017,A01,48.2,7.0
3,ORUC1000000000001032,2017,A01,15.6,
4,ORUC1000000000001033,2017,A01,5.8,


Check if `occupation_code` is unique (viable candidate for a "job id")

In [3]:
occupation_code = series['series_id'].map(lambda a: a[9:12])
occupation_code.head()

0    000
1    000
2    000
3    000
4    000
Name: series_id, dtype: object

In [4]:
series_prefix = series['series_id'].map(lambda a: a[:3] + a[4:15])
series_prefix.head()

0    ORU10000000000
1    ORU10000000000
2    ORU10000000000
3    ORU10000000000
4    ORU10000000000
Name: series_id, dtype: object

In [5]:
len(series_prefix.unique()), len(occupation_code.unique())

(338, 338)

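The two counts match, so each occupation code corresponds to exactly one series prefix (the prefix here skips the requirement code). `occupation_code` is therefore a viable "job id" key (only one row for each metric for each occupation code/job).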
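Comparing two unique counts works here because the occupation code is a substring of the prefix. A more direct check, sketched below on a handful of ids built by hand, groups the prefixes by occupation code and confirms each group holds a single prefix. The `make_id` helper and the toy ids are illustrative assumptions, not rows taken from the data file.

```python
import pandas


def make_id(requirement, occupation, estimate):
    """Assemble a 20-character toy series_id from a few varying fields."""
    # survey + seasonal + requirement + ownership + industry
    # + occupation + job characteristic + estimate
    return "OR" + "U" + requirement + "1" + "0000" + occupation + "000" + estimate


toy = pandas.DataFrame({"series_id": [
    make_id("C", "000", "00728"),
    make_id("P", "000", "01030"),
    make_id("C", "001", "00728"),
    make_id("C", "667", "00560"),
]})

occupation = toy["series_id"].str[9:12]
# Same prefix as in the notebook: drop the requirement code at index 3.
prefix = toy["series_id"].str[:3] + toy["series_id"].str[4:15]

# Number of distinct prefixes seen for each occupation code.
prefixes_per_occupation = prefix.groupby(occupation).nunique()
is_one_to_one = bool((prefixes_per_occupation == 1).all())
print(is_one_to_one)  # True
```

On the real DataFrame the same two `.str` slices could replace the `map(lambda ...)` calls, assuming every `series_id` is a plain 20-character string.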

In [6]:
series['occupation_code'] = series['series_id'].map(lambda a: a[9:12])

Extract `estimate_code` aka metric id

In [7]:
series['estimate_code'] = series['series_id'].map(lambda a: a[15:20])
series['estimate_code'].head()

0    00728
1    01030
2    01031
3    01032
4    01033
Name: estimate_code, dtype: object

In [8]:
len(series['estimate_code'].unique())

342

Rows can be uniquely identified by the pair (`occupation_code`, `estimate_code`).
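One way to test this claim directly is to count rows whose key pair repeats. The sketch below uses a small made-up frame (the values are illustrative only); on the real data the same `duplicated` call would be run on `series`.

```python
import pandas

# Toy rows standing in for `series`; values are illustrative only.
toy = pandas.DataFrame({
    "occupation_code": ["000", "000", "001", "001"],
    "estimate_code": ["00728", "01030", "00728", "01030"],
    "value": [5.1, 29.0, 31.4, 40.4],
})

# duplicated() flags every row whose key pair already appeared earlier.
duplicate_keys = int(
    toy.duplicated(subset=["occupation_code", "estimate_code"]).sum()
)
print(duplicate_keys)  # 0 means the pair is a unique row key
```

A result of 0 means the pair can safely be used as a row key; any positive count would point at rows that need to be disambiguated by another field such as `year` or `period`.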

In [9]:
# sanity check
num_occs = len(series['occupation_code'].unique())
num_ests = len(series['estimate_code'].unique())
print('{} * {} = {}'.format(num_occs, num_ests, num_occs * num_ests))
print(len(series))
# fewer rows than the full product: presumably not every occupation has a value for every estimate

338 * 342 = 115596
40352
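To put the gap in numbers, the arithmetic below uses only the three counts printed above. Reading the shortfall as absent (occupation, estimate) pairs is an interpretation, not something the counts alone prove.

```python
# Counts copied from the notebook output above.
num_occs, num_ests, num_rows = 338, 342, 40352

possible = num_occs * num_ests   # cells in a full occupation x estimate grid
missing = possible - num_rows    # grid cells with no row in the file
density = num_rows / possible    # fraction of the grid that is populated

print(possible, missing, round(density, 3))  # 115596 75244 0.349
```

Roughly a third of the grid is populated, so the wide matrix built next will be mostly empty cells.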


Convert the data from rows keyed by (`occupation_code`, `estimate_code`) to a matrix with one row per `occupation_code` and one column per `estimate_code`.

In [12]:
occupation_codes = series['occupation_code'].unique()
estimate_codes = series['estimate_code'].unique()

In [23]:
import numpy as np

In [28]:
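data = pandas.DataFrame({'occupation_code': occupation_codes})
num_rows = len(occupation_codes)
cells_written = 0
for est_code in estimate_codes:
    # start every column as missing (NaN) so absent pairs are not mistaken for 0.0
    data[est_code] = np.nan
    for _, r in series[series['estimate_code']==est_code].iterrows():
        # write this row's value into the (occupation_code, est_code) cell
        data.loc[data['occupation_code']==r['occupation_code'], est_code] = r['value']
        cells_written += 1
    print(cells_written)  # running total, printed once per estimate code
data.head()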

34
186
471
602
664
679
889
1081
1192
1234
1247
1330
1433
1573
1796
1961
2212
2335
2357
2574
2609
2693
2847
3009
3031
3059
3098
3353
3466
3608
3715
3808
4032
4055
4172
4276
4363
4485
4591
4920
5246
5503
5838
6169
6482
6769
6864
7189
7280
7287
7538
7586
7648
7965
8074
8103
8197
8223
8255
8330
8335
8370
8424
8453
8534
8857
9086
9202
9214
9292
9359
9542
9604
9682
9684
9697
9716
9721
9738
9760
9769
9778
9782
9784
9785
9789
9790
9791
9792
9793
9795
10066
10252
10462
10700
10943
11190
11448
11660
11885
12115
12334
12573
12730
12883
13084
13167
13412
13429
13533
13845
13966
14256
14561
14862
15152
15326
15533
15756
15972
16205
16374
16461
16681
16749
16764
17024
17102
17194
17202
17203
17460
17579
17624
17633
17637
17908
18077
18148
18159
18165
18478
18773
19081
19209
19248
19520
19778
19972
20031
20033
20094
20423
20443
20671
20977
21183
21397
21679
21978
22106
22372
22655
22656
22871
23051
23253
23419
23709
23710
23842
23885
24072
24155
24347
24533
24574
24711
24969
24972
25152
25156
25322
2

Unnamed: 0,occupation_code,00728,01030,01031,01032,01033,01034,01036,01037,01038,...,00643,00803,00804,01029,01084,01085,01087,01088,01090,01091
0,0,5.1,29.0,48.2,15.6,5.8,1.4,44.9,28.5,15.0,...,730.0,-,2.1,17.8,11.0,22.9,18.2,15.7,1.4,32.5
1,1,,,,,31.4,40.4,,,,...,,,,61.3,,,,,,
2,3,,,26.9,28.8,30.5,12.8,11.3,26.7,32.9,...,,,,52.8,10.7,8.3,,15.7,,18.4
3,7,,,30.2,24.0,30.7,,,16.7,33.8,...,,,,78.8,,,,,,
4,8,,,23.6,41.3,32.2,,,,45.3,...,,,,61.0,,32.7,,,,33.4


In [30]:
data.shape

(338, 343)
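The nested loop above can likely be replaced by pandas' built-in `DataFrame.pivot`, which performs the same long-to-wide reshape in one call and leaves absent pairs as `NaN`. This is a sketch on a made-up three-row frame, not the notebook's method; `pivot` raises an error if a key pair appears more than once, so it relies on the pair being unique.

```python
import pandas

# Toy long-format rows standing in for `series`; values are illustrative only.
toy = pandas.DataFrame({
    "occupation_code": ["000", "000", "001"],
    "estimate_code": ["00728", "01030", "01030"],
    "value": [5.1, 29.0, 40.4],
})

# One row per occupation_code, one column per estimate_code.
wide = toy.pivot(index="occupation_code", columns="estimate_code", values="value")

# Turn the occupation_code index back into an ordinary column,
# matching the layout of `data` in the notebook.
wide = wide.reset_index()
print(wide.shape)  # (2, 3)
```

Applied to the full `series` frame, the result should have the same layout as `data`: one `occupation_code` column plus one column per estimate code.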