# O*NET Data Update Tracking

* We want to count how many times each O\*NET detailed occupation has been updated since O\*NET replaced the DOT.
* Before version 5.0 there is no structured list of detailed occupations, so we start tracking from that version.
* The O\*NET taxonomy was revised in 2006, 2009, 2010 and 2018 (we exclude 2018). For this reason, we track updates separately for each taxonomy.
* The history of all updates to the O\*NET database is available at [onetcenter.org/db_releases.html](https://www.onetcenter.org/db_releases.html).

# Table of Contents

1. [Read Data](#Read-Data)
2. [O\*NET-SOC-2000 / O\*NET-SOC-2006](#O\*NET-SOC-2000-/-O\*NET-SOC-2006)
3. [O\*NET-SOC-2006 / O\*NET-SOC-2009](#O\*NET-SOC-2006-/-O\*NET-SOC-2009)
4. [O\*NET-SOC-2009 / O\*NET-SOC-2010](#O\*NET-SOC-2009-/-O\*NET-SOC-2010)
5. [O\*NET-SOC-2010 +](#O\*NET-SOC-2010-+)
6. [All Updates Final](#All-Updates-Final)

# Read Data

In [1]:
from IPython.display import display, HTML
display(HTML('<style>.container { width:90% !important; }</style>'))
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", None)

In [2]:
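def read_files(version):
    # read the update list of one database release, e.g. version=50 for O*NET 5.0
    temp = pd.read_csv(f'csv_files/ONET_databases/UPDATES/ONET_{version}_Updates.csv')
    title = f'update_{version}'
    # the column holding the occupation title is named after the release
    temp.columns = ['onetsoccode', title, 'description']
    temp.drop('description', axis=1, inplace=True)
    return temp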

In [3]:
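def merge_files(f1, f2, indicator):
    # outer merge keeps every occupation code found in either file;
    # `indicator` is either a bool or the name of the column that records
    # where each row came from ('left_only', 'right_only' or 'both')
    return f1.merge(f2, how='outer', on='onetsoccode', indicator=indicator)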
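
To make the role of the `indicator` argument concrete, here is a minimal sketch on made-up data. The two small frames are hypothetical stand-ins for update files, not the real O\*NET lists.

```python
import pandas as pd

# Hypothetical toy data standing in for two update files.
left = pd.DataFrame({"onetsoccode": ["11-1011.00", "11-2022.00"],
                     "update_160": ["Chief Executives", "Sales Managers"]})
right = pd.DataFrame({"onetsoccode": ["11-2022.00", "11-3011.00"],
                      "update_170": ["Sales Managers",
                                     "Administrative Services Managers"]})

# An outer merge keeps every code; the indicator column records its origin.
merged = left.merge(right, how="outer", on="onetsoccode", indicator="_17")

# Map each occupation code to the origin label pandas assigned to it.
origin = dict(zip(merged["onetsoccode"], merged["_17"].astype(str)))
print(origin)
```

A code labelled `both` appears in the two files, which is how repeated updates are detected later in the notebook.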

# O\*NET-SOC-2000 / O\*NET-SOC-2006

* The period between 2000 and 2006 covers 6 updates:

| version | date |
| :------ | ---: |
| O\*NET 5.0 | April 2003 |
| O\*NET 5.1 | November 2003 |
| O\*NET 6.0 | July 2004 |
| O\*NET 7.0 | December 2004 |
| O\*NET 8.0 | July 2005 |
| O\*NET 9.0 | December 2005 |

* However, we only work with updates `5.0`, `6.0`, `7.0`, `8.0` and `9.0`; update `5.1` is left out.

In [4]:
update_50 = read_files(50)
update_60 = read_files(60)
update_70 = read_files(70)
update_80 = read_files(80)
update_90 = read_files(90)

In [5]:
df = merge_files(update_50, update_60, indicator=False)
df = merge_files(df, update_70, indicator=False)
df = merge_files(df, update_80, indicator=False)
df = merge_files(df, update_90, indicator=False)
df.head()

Unnamed: 0,onetsoccode,update_50,update_60,update_70,update_80,update_90
0,11-2022.00,Sales Managers,,,,
1,11-3011.00,Administrative Services Managers,,,,
2,11-3051.00,Industrial Production Managers,,,,
3,11-9111.00,Medical and Health Services Managers,,,,
4,13-1022.00,"Wholesale and Retail Buyers, Except Farm Products",,,,


* If a detailed occupation is updated in a particular version, it appears in that version's update file. This is how we track updates. First, we read and merge all update files for a particular taxonomy, then generate a new feature that holds the title of every detailed occupation updated in that period. Finally, we replace the occupation titles in the per-version columns with the version number and clean the dataset.

In [6]:
df.fillna('', inplace=True)
df['title'] = df.iloc[:, 1:].sum(axis=1)
df['update_50'] = np.where(df['update_50'] != '', '5.0', df['update_50'])
df['update_60'] = np.where(df['update_60'] != '', '6.0', df['update_60'])
df['update_70'] = np.where(df['update_70'] != '', '7.0', df['update_70'])
df['update_80'] = np.where(df['update_80'] != '', '8.0', df['update_80'])
df['update_90'] = np.where(df['update_90'] != '', '9.0', df['update_90'])
df['version_2000'] = df.iloc[:, 1:-1].sum(axis=1).map(str)
df.drop(['update_50', 'update_60', 'update_70', 'update_80', 'update_90'],
        axis=1, inplace=True)
df.head()

Unnamed: 0,onetsoccode,title,version_2000
0,11-2022.00,Sales Managers,5.0
1,11-3011.00,Administrative Services Managers,5.0
2,11-3051.00,Industrial Production Managers,5.0
3,11-9111.00,Medical and Health Services Managers,5.0
4,13-1022.00,"Wholesale and Retail Buyers, Except Farm Products",5.0
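
The cell above builds `title` and `version_2000` by summing string columns, so an occupation that appears in two releases of the same period ends up with its title repeated and its version labels run together. The sketch below shows one possible way to avoid that on made-up data. The frame `toy`, the column `versions` and the comma-separated format are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: one column per release, holding the title
# when the occupation was updated in that release and NaN otherwise.
toy = pd.DataFrame({
    "onetsoccode": ["11-2022.00", "11-3011.00", "13-1022.00"],
    "update_50": ["Sales Managers", np.nan, np.nan],
    "update_60": [np.nan, "Administrative Services Managers", np.nan],
    "update_70": ["Sales Managers", np.nan, "Buyers"],
})
update_cols = ["update_50", "update_60", "update_70"]
labels = {"update_50": "5.0", "update_60": "6.0", "update_70": "7.0"}

# Title: the first non-missing title in each row (no concatenation).
toy["title"] = toy[update_cols].apply(lambda row: row.dropna().iloc[0], axis=1)

# Versions: comma-separated labels of every release listing the occupation.
flags = toy[update_cols].notna()
toy["versions"] = flags.apply(
    lambda row: ",".join(labels[c] for c in update_cols if row[c]), axis=1)

print(toy[["onetsoccode", "title", "versions"]])
```

Keeping a separator between version labels makes repeated updates easy to split apart again later.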


# O\*NET-SOC-2006 / O\*NET-SOC-2009

* If a detailed occupation is updated more than once in a given period, we track the additional update in an extra feature such as `version_2006_1`. The feature `version_2006` gives the version of the first update and `version_2006_1` gives the version of the second update for that occupation.

In [7]:
update_100 = read_files(100)
update_110 = read_files(110)
update_120 = read_files(120)
update_130 = read_files(130)

In [8]:
df1 = merge_files(update_100, update_110, indicator=False)
df1 = merge_files(df1, update_120, indicator=False)
df1 = df1.merge(update_130, how='outer', on='onetsoccode', indicator='second_update')
df1.iloc[:, 1:-1] = df1.iloc[:, 1:-1].fillna('') # replace missing values with empty string
df1.head()

Unnamed: 0,onetsoccode,update_100,update_110,update_120,update_130,second_update
0,11-1011.00,Chief Executives,,,,left_only
1,11-3031.01,Treasurers and Controllers,,,,left_only
2,11-3031.02,"Financial Managers, Branch or Department",,,,left_only
3,11-9011.02,Crop and Livestock Managers,,,,left_only
4,13-1041.06,Coroners,,,,left_only


In [9]:
versions = ['update_100', 'update_110', 'update_120', 'update_130']
mask = df1['second_update'] != 'both' # select all but double update
df1['title'] = df1[mask].loc[:, versions].sum(axis=1)
# fill in the title of the detailed occupations that were updated twice
df1['title'] = np.where(df1['title'].isnull(), df1['update_100'], df1['title'])
# generate version information
df1['update_100'] = np.where(df1['update_100'] != '', '10.0', df1['update_100'])
df1['update_110'] = np.where(df1['update_110'] != '', '11.0', df1['update_110'])
df1['update_120'] = np.where(df1['update_120'] != '', '12.0', df1['update_120'])
df1['update_130'] = np.where(df1['update_130'] != '', '13.0', df1['update_130'])
# gathers version information into one column (only the first update)
df1['version_2006'] = df1[mask].loc[:, list(df1.columns)[1:-2]].sum(axis=1).map(str)
df1['version_2006'] = np.where(df1['version_2006'].isnull(), df1['update_100'], df1['version_2006'])
# generate a new column for second update information
df1['version_2006_1'] = ''
df1['version_2006_1'] = np.where(~mask, df1['update_130'], df1['version_2006_1'] )
versions.append('second_update')
df1.drop(versions, axis=1, inplace=True)
df1.head()

Unnamed: 0,onetsoccode,title,version_2006,version_2006_1
0,11-1011.00,Chief Executives,10.0,
1,11-3031.01,Treasurers and Controllers,10.0,
2,11-3031.02,"Financial Managers, Branch or Department",10.0,
3,11-9011.02,Crop and Livestock Managers,10.0,
4,13-1041.06,Coroners,10.0,
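
The first/second update columns can also be derived from a long table, where each row is one (occupation, release) pair. The following is a sketch of that alternative on made-up data, not the approach used in the cell above. The dictionary `releases` and the helper column `nth` are hypothetical names.

```python
import pandas as pd

# Hypothetical update files, already read as (onetsoccode, title) pairs.
releases = {
    "10.0": pd.DataFrame({"onetsoccode": ["11-1011.00", "13-1041.06"],
                          "title": ["Chief Executives", "Coroners"]}),
    "13.0": pd.DataFrame({"onetsoccode": ["11-1011.00", "11-1031.00"],
                          "title": ["Chief Executives", "Legislators"]}),
}

# Stack the releases into one long table, tagging each row with its version.
long = pd.concat(
    [frame.assign(version=version) for version, frame in releases.items()],
    ignore_index=True)

# Number the updates of each occupation in release order: 0 = first, 1 = second.
long["nth"] = long.groupby("onetsoccode").cumcount()

# Pivot to one row per occupation and one column per update rank.
wide = (long.pivot(index="onetsoccode", columns="nth", values="version")
            .rename(columns={0: "version_2006", 1: "version_2006_1"})
            .fillna("")
            .reset_index())
print(wide)
```

With this layout a third or fourth update would simply show up as another column, without hand-written masks per release.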


# O\*NET-SOC-2009 / O\*NET-SOC-2010

In [10]:
update_140 = read_files(140)
update_150 = read_files(150)

In [11]:
df2 = merge_files(update_140, update_150, indicator=False)
df2.fillna('', inplace=True)
df2['title'] = df2.iloc[:, 1:].sum(axis=1)
df2['update_140'] = np.where(df2['update_140'] != '', '14.0', df2['update_140'])
df2['update_150'] = np.where(df2['update_150'] != '', '15.0', df2['update_150'])
df2['version_2009'] = df2.iloc[:, 1:-1].sum(axis=1).map(str)
df2.drop(['update_140', 'update_150'], axis=1, inplace=True)
df2.head()

Unnamed: 0,onetsoccode,title,version_2009
0,11-2031.00,Public Relations Managers,14.0
1,11-3011.00,Administrative Services Managers,14.0
2,11-3061.00,Purchasing Managers,14.0
3,11-3071.01,Transportation Managers,14.0
4,11-9111.01,Clinical Nurse Specialists,14.0


# O\*NET-SOC-2010 +

In [12]:
update_160 = read_files(160)
update_170 = read_files(170)
update_180 = read_files(180)
update_190 = read_files(190)
update_200 = read_files(200)
update_210 = read_files(210)
update_220 = read_files(220)
update_230 = read_files(230)

In [13]:
# there are lots of detailed occupations that are updated more than once after 2010
# for each merge we generate a new variable which shows where each observation came from
# later using those variables we gather update information
df3 = merge_files(update_160, update_170, indicator='_17')
df3 = merge_files(df3, update_180, indicator='_18')
df3 = merge_files(df3, update_190, indicator='_19')
df3 = merge_files(df3, update_200, indicator='_20')
df3 = merge_files(df3, update_210, indicator='_21')
df3 = merge_files(df3, update_220, indicator='_22')
df3 = merge_files(df3, update_230, indicator='_23')

In [14]:
# gather information for first update
updates = ['update_160', 'update_170', 'update_180', 'update_190',
           'update_200', 'update_210',  'update_220', 'update_230']
versions = ['16.0', '17.0', '18.0', '19.0', '20.0', '21.0', '22.0', '23.0']
df3.loc[:, updates] = df3.loc[:, updates].fillna('')
df3['title'] = df3.loc[:, updates].sum(axis=1)
df3['version_2010'] = ''
for update, version in zip(updates, versions):
    condition = (df3['version_2010'] == '') & (df3[update] != '')
    df3['version_2010'] = np.where(condition, version, df3['version_2010'])

In [17]:
df3.head()

Unnamed: 0,onetsoccode,update_160,update_170,_17,update_180,_18,update_190,_19,update_200,_20,update_210,_21,update_220,_22,update_230,_23,title,version_2010,version_2010_1
0,11-3051.02,Geothermal Production Managers,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,Geothermal Production Managers,16.0,
1,11-3071.03,Logistics Managers,,left_only,,left_only,,left_only,,left_only,,left_only,Logistics Managers,both,,left_only,Logistics ManagersLogistics Managers,16.0,22.0
2,11-9121.01,Clinical Research Coordinators,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,Clinical Research Coordinators,16.0,
3,13-1023.00,"Purchasing Agents, Except Wholesale, Retail, a...",,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,"Purchasing Agents, Except Wholesale, Retail, a...",16.0,
4,13-1081.01,Logistics Engineers,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,,left_only,Logistics Engineers,16.0,


In [16]:
# gather information for second update
df3['version_2010_1'] = ''
for merge, version in zip(['_19', '_20', '_21', '_22', '_23'],
                          ['19.0', '20.0', '21.0', '22.0', '23.0']):
    # 'both' means the occupation was already listed in an earlier release
    mask = df3[merge] == 'both'
    df3['version_2010_1'] = np.where(mask, version, df3['version_2010_1'])

In [23]:
updates.extend(['_17', '_18', '_19', '_20', '_21', '_22', '_23'])
df3.drop(updates, axis=1, inplace=True)
df3.head()

Unnamed: 0,onetsoccode,title,version_2010,version_2010_1
0,11-3051.02,Geothermal Production Managers,16.0,
1,11-3071.03,Logistics ManagersLogistics Managers,16.0,22.0
2,11-9121.01,Clinical Research Coordinators,16.0,
3,13-1023.00,"Purchasing Agents, Except Wholesale, Retail, a...",16.0,
4,13-1081.01,Logistics Engineers,16.0,


# All Updates Final

* Merge the tables of all taxonomy periods into one final table.

In [24]:
def final_merge(df1, df2):
    temp = df1.merge(df2, how='outer', on='onetsoccode')
    temp['title_x'] = np.where(temp['title_x'].isnull(),
                               temp['title_y'], temp['title_x'])
    temp.drop(['title_y'], axis=1, inplace=True)
    temp.rename(columns={'title_x' : 'title'}, inplace=True)
    return temp  
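
As a quick check of what `final_merge` does with titles, here is a small sketch on made-up rows. The helper is repeated (in slightly condensed form) so the block runs on its own, and the frames `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

def final_merge(df1, df2):
    # Same logic as the notebook's helper, repeated so the sketch is self-contained.
    temp = df1.merge(df2, how="outer", on="onetsoccode")
    # Prefer the left title and fall back to the right one when it is missing.
    temp["title_x"] = np.where(temp["title_x"].isnull(),
                               temp["title_y"], temp["title_x"])
    return temp.drop(columns="title_y").rename(columns={"title_x": "title"})

# Hypothetical per-period tables.
a = pd.DataFrame({"onetsoccode": ["11-1011.00"],
                  "title": ["Chief Executives"],
                  "version_2006": ["10.0"]})
b = pd.DataFrame({"onetsoccode": ["11-1011.00", "11-1011.03"],
                  "title": ["Chief Executives", "Chief Sustainability Officers"],
                  "version_2010": ["19.0", "18.0"]})

out = final_merge(a, b).fillna("").set_index("onetsoccode")
print(out)
```

An occupation that exists only in the right-hand table keeps its title and gets an empty string in the columns of the left-hand table.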

In [29]:
df_final = final_merge(df, df1)
df_final = final_merge(df_final, df2)
df_final = final_merge(df_final, df3)
df_final.fillna('', inplace=True)
df_final.sort_values('onetsoccode', ascending=True, inplace=True)
df_final.reset_index(drop=True, inplace=True)
df_final.head()

Unnamed: 0,onetsoccode,title,version_2000,version_2006,version_2006_1,version_2009,version_2010,version_2010_1
0,11-1011.00,Chief Executives,,10.0,,,19.0,
1,11-1011.03,Chief Sustainability Officers,,,,,18.0,
2,11-1021.00,General and Operations Managers,6.0,13.0,,,20.0,
3,11-1031.00,Legislators,,13.0,,,,
4,11-2011.00,Advertising and Promotions Managers,7.0,,,15.0,23.0,


* As seen above, __Chief Executives__ became an O\*NET detailed occupation with the 2006 taxonomy update.
* Furthermore, __General and Operations Managers__ was introduced in database update `6.0` and updated two more times until 2016.

* `onetsoccode`: O\*NET Detailed Occupation Code
* `title`: O\*NET Detailed Occupation Title
* `version_2000`: Database version in which the detailed occupation was updated between 2000 and 2006, if any.
* `version_2006`: Database version in which the detailed occupation was updated between 2006 and 2009, if any.
* `version_2006_1`: Database version of the second update, if the detailed occupation was updated more than once between 2006 and 2009.
* `version_2009`: Database version in which the detailed occupation was updated between 2009 and 2010, if any.
* `version_2010`: Database version in which the detailed occupation was first updated after 2010, if any.
* `version_2010_1`: Database version of a later update, if the detailed occupation was updated more than once after 2010.

In [30]:
df_final.to_csv('onet_update_tracking.csv')
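
Since the stated goal is to count updates per occupation, a rough count can be read off the version columns. The sketch below uses a hypothetical frame `toy_final` that mirrors the first rows of `df_final` shown above, and `n_updates` is an assumed column name. Because each period stores at most two columns, the count is a lower bound for occupations updated more often than that.

```python
import pandas as pd

# Hypothetical rows mirroring the layout of df_final.
toy_final = pd.DataFrame({
    "onetsoccode": ["11-1011.00", "11-1021.00", "11-1031.00"],
    "title": ["Chief Executives", "General and Operations Managers",
              "Legislators"],
    "version_2000": ["", "6.0", ""],
    "version_2006": ["10.0", "13.0", "13.0"],
    "version_2006_1": ["", "", ""],
    "version_2009": ["", "", ""],
    "version_2010": ["19.0", "20.0", ""],
    "version_2010_1": ["", "", ""],
})

# Every non-empty version column counts as one recorded update.
version_cols = [c for c in toy_final.columns if c.startswith("version_")]
toy_final["n_updates"] = (toy_final[version_cols] != "").sum(axis=1)
print(toy_final[["onetsoccode", "n_updates"]])
```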