# Identifying New Work in O\*NET

* In this project, we identify new work in O\*NET.
* New work is any occupation that does not exist in the previous O\*NET-SOC Taxonomy but does exist in the current one. In other words, we compare consecutive Taxonomies and find the differences between them.
* However, we do not count occupations that were split off from an existing occupation as new work. To eliminate those, we use the O\*NET-SOC Crosswalks and drop them from the set of new work.

* For example:

| O\*NET-SOC Code 2009 | O\*NET-SOC Title 2009 | O\*NET-SOC Code 2010 | O\*NET-SOC Title 2010 |
| :------------------: | :-------------------: | :------------------: | :-------------------: |
| 29-2099.03 | Ophthalmic Medical Technologists and Technicians | 29-2057.00 | Ophthalmic Medical Technicians |
| 29-2099.03 | Ophthalmic Medical Technologists and Technicians | 29-2099.05 | Ophthalmic Medical Technologists |

* __Ophthalmic Medical Technologists and Technicians__ is identified as new work. However, __Ophthalmic Medical Technicians__ and __Ophthalmic Medical Technologists__ are split from __Ophthalmic Medical Technologists and Technicians__. For this reason, they are not considered new work.

* We follow the methodology proposed by [(Lin, 2011)](https://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00079), which compares consecutive work taxonomies and identifies new work as the jobs that do not exist in the previous taxonomy.
* To apply it, we acquired taxonomies and crosswalks from [O\*NET](https://www.onetcenter.org/taxonomy.html) and compared them to identify new work after 2000.
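
The idea can be sketched on toy data before touching the real files. The frames below are hypothetical stand-ins: the two split occupations come from the example table above, while the code `99-9999.00`, its title, and the column names `code_2009` and `code_2010` are invented for illustration.

```python
import pandas as pd

# Toy 2010 taxonomy: the two split occupations plus one invented entry (assumption)
toy_tax_2010 = pd.DataFrame({
    'onetsoccode': ['29-2057.00', '29-2099.05', '99-9999.00'],
    'title': ['Ophthalmic Medical Technicians',
              'Ophthalmic Medical Technologists',
              'Hypothetical New Occupation'],
})

# Toy 2009-2010 crosswalk: every 2010 code that descends from a 2009 code
toy_cross_2010 = pd.DataFrame({
    'code_2009': ['29-2099.03', '29-2099.03'],
    'code_2010': ['29-2057.00', '29-2099.05'],
})

# A 2010 code counts as new work only if no 2009 code maps to it
is_new = ~toy_tax_2010['onetsoccode'].isin(toy_cross_2010['code_2010'])
toy_new_work = toy_tax_2010[is_new].reset_index(drop=True)
print(toy_new_work)
```

Only the invented occupation survives the filter. The two split occupations are dropped because the crosswalk traces them back to `29-2099.03`.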

In [1]:
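# display and HTML are exposed by IPython.display; importing them from
# IPython.core.display is deprecated in recent IPython releases
from IPython.display import display, HTML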
display(HTML('<style>.container { width:80% !important; }</style>'))
import pandas as pd
import numpy as np

## Import Taxonomies

* `tax_2006`: O\*NET-SOC 2006 Taxonomy
* `tax_2009`: O\*NET-SOC 2009 Taxonomy
* `tax_2010`: O\*NET-SOC 2010 Taxonomy
___
* `cross_2006`: O\*NET-SOC 2000 - 2006 Crosswalk
* `cross_2009`: O\*NET-SOC 2006 - 2009 Crosswalk
* `cross_2010`: O\*NET-SOC 2009 - 2010 Crosswalk

In [2]:
def read_taxonomies(year):
    """Reads O*NET Taxonomy and returns DataFrame"""
    df = pd.read_csv('csv_files/Taxonomies/'+str(year)+'_Occupations.csv')
    df.columns = ['onetsoccode', 'title', 'description']
    df.drop('description', axis=1, inplace=True)
    return df

def read_crosswalks(year1, year2):
    """Reads O*NET Crosswalk and returns DataFrame"""
    df = pd.read_csv('csv_files/Crosswalks/'+str(year1)+'_to_'+str(year2)+'_Crosswalk.csv')
    df.columns = ['temp_code', 'temp_title', 'onetsoccode', 'title']
    df.drop(['temp_code', 'temp_title'], axis=1, inplace=True)
    return df

In [3]:
tax_2006 = read_taxonomies(2006)
tax_2009 = read_taxonomies(2009)
tax_2010 = read_taxonomies(2010)

In [4]:
cross_2006 = read_crosswalks(2000, 2006)
cross_2009 = read_crosswalks(2006, 2009)
cross_2010 = read_crosswalks(2009, 2010)

## Identify New Works

In [5]:
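def get_new_work(df1, df2):
    """Compares Taxonomy and Crosswalks in order to identify New Works"""
    # Left-join the taxonomy on the crosswalk: codes that are missing from
    # the crosswalk get a null title_y and are treated as new work
    temp = df1.merge(df2, how='left', on='onetsoccode')
    # .copy() avoids modifying a slice of the merged frame in place
    temp = temp[temp.title_y.isnull()].copy()
    temp.drop('title_y', axis=1, inplace=True)
    temp.rename(columns={'title_x':'title'}, inplace=True)
    temp.reset_index(drop=True, inplace=True)
    return temp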
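
The same anti-join can also be written with the `indicator` option of `merge`, which labels each row by where it was found. The sketch below is an illustrative variant, not the function used in this notebook: `get_new_work_indicator` and the toy codes and titles are made up.

```python
import pandas as pd

def get_new_work_indicator(tax, cross):
    """Hypothetical variant of get_new_work built on merge(indicator=True)."""
    # Only the codes matter for the anti-join, so de-duplicate them first
    cross_codes = cross[['onetsoccode']].drop_duplicates()
    merged = tax.merge(cross_codes, how='left', on='onetsoccode', indicator=True)
    # 'left_only' marks taxonomy rows that have no match in the crosswalk
    new = merged.loc[merged['_merge'] == 'left_only', ['onetsoccode', 'title']]
    return new.reset_index(drop=True)

# Made-up taxonomy with three codes
toy_tax = pd.DataFrame({'onetsoccode': ['11-0000.01', '11-0000.02', '11-0000.03'],
                        'title': ['A', 'B', 'C']})
# Made-up crosswalk in which one code appears twice (e.g. merged occupations)
toy_cross = pd.DataFrame({'onetsoccode': ['11-0000.01', '11-0000.01', '11-0000.02'],
                          'title': ['A', 'A', 'B']})

result = get_new_work_indicator(toy_tax, toy_cross)
print(result)
```

Because only the code column is merged, no `title_x`/`title_y` suffixes appear and no renaming is needed afterwards.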

### New Work between 2000 - 2006

In [6]:
new_work1 = get_new_work(tax_2006, cross_2006)
new_work1['year'] = 2006
display(new_work1.head())
print('There are {} new works between 2000 and 2006'.format(new_work1.shape[0]))

Unnamed: 0,onetsoccode,title,year
0,15-1099.01,Software Quality Assurance Engineers and Testers,2006
1,15-1099.02,Computer Systems Engineers/Architects,2006
2,15-1099.03,Network Designers,2006
3,15-1099.04,Web Developers,2006
4,15-1099.05,Web Administrators,2006


There are 6 new works between 2000 and 2006


### New Work between 2006 - 2009

In [7]:
new_work2 = get_new_work(tax_2009, cross_2009)
new_work2['year'] = 2009
display(new_work2.head())
print('There are {} new work between 2006 and 2009'.format(new_work2.shape[0]))

Unnamed: 0,onetsoccode,title,year
0,11-1011.03,Chief Sustainability Officers,2009
1,11-2011.01,Green Marketers,2009
2,11-3051.01,Quality Control Systems Managers,2009
3,11-3051.02,Geothermal Production Managers,2009
4,11-3051.03,Biofuels Production Managers,2009


There are 153 new work between 2006 and 2009


### New Work between 2009 - 2010

In [8]:
new_work3 = get_new_work(tax_2010, cross_2010)
new_work3['year'] = 2010
display(new_work3.head())
print('There are {} new work between 2009 and 2010'.format(new_work3.shape[0]))

Unnamed: 0,onetsoccode,title,year
0,11-9061.00,Funeral Service Managers,2010
1,13-1131.00,Fundraisers,2010
2,13-2071.00,Credit Counselors,2010
3,21-1094.00,Community Health Workers,2010
4,25-2059.00,"Special Education Teachers, All Other",2010


There are 14 new work between 2009 and 2010


In [9]:
# Merge all new work into one dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_new_work = pd.concat([new_work1, new_work2, new_work3])
df_new_work.sort_values('onetsoccode', inplace=True)
df_new_work.reset_index(drop=True, inplace=True)
display(df_new_work.head())
print('The total number of new works between 2000 and 2010 is {}'.format(
                                                        df_new_work.shape[0]))

Unnamed: 0,onetsoccode,title,year
0,11-1011.03,Chief Sustainability Officers,2009
1,11-2011.01,Green Marketers,2009
2,11-3051.01,Quality Control Systems Managers,2009
3,11-3051.02,Geothermal Production Managers,2009
4,11-3051.03,Biofuels Production Managers,2009


The total number of new works between 2000 and 2010 is 173


## Summary

* A total of 173 new works were identified between 2000 and 2010
* There are 6 new works from 2006
* There are 153 new works from 2009
* There are 14 new works from 2010
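
A quick consistency check on these numbers, as a sketch: `toy_years` is a stand-in frame rebuilt from the counts reported above, not the real `df_new_work`. On the real data the same `value_counts` call on the `year` column would give the per-year breakdown.

```python
import pandas as pd

# Stand-in for the 'year' column, rebuilt from the reported counts (assumption)
toy_years = pd.DataFrame({'year': [2006] * 6 + [2009] * 153 + [2010] * 14})

# Number of new works per taxonomy year, ordered by year
per_year = toy_years['year'].value_counts().sort_index()
total = int(per_year.sum())
print(per_year)
print(total)
```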

### Update SOC Codes

* After identifying new work by comparing taxonomies and crosswalks, one step remains: updating the O\*NET-SOC occupation codes to their 2010 values.
* New works from 2010 do not need to be updated.
* For split titles, we select the first title's O\*NET-SOC code for the update.

* `onetsoccode_x`: O\*NET-SOC occupational codes from the taxonomy in which the new work was identified (2006, 2009, or 2010)
* `onetsoccode_y`: O\*NET-SOC occupational codes from the 2010 Taxonomy
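
On made-up data, the update rule works as sketched below. The titles and codes are invented for illustration; only the pattern (merge on title, then prefer the 2010 code when it exists and differs) mirrors the steps described above.

```python
import numpy as np
import pandas as pd

# Invented new-work rows with the codes they had when first identified
toy_new = pd.DataFrame({
    'onetsoccode': ['11-0000.01', '11-0000.02', '11-0000.03'],
    'title': ['Title A', 'Title B', 'Title C'],
    'year': [2009, 2009, 2006],
})
# Invented 2010 taxonomy: Title A was recoded, Title C no longer appears
toy_tax_2010 = pd.DataFrame({
    'onetsoccode': ['11-0000.09', '11-0000.02'],
    'title': ['Title A', 'Title B'],
})

merged = toy_new.merge(toy_tax_2010, how='left', on='title', indicator=True)
# Take the 2010 code only where one exists and differs from the old code
changed = merged['onetsoccode_y'].notnull() & \
          (merged['onetsoccode_x'] != merged['onetsoccode_y'])
merged['onetsoccode'] = np.where(changed, merged['onetsoccode_y'],
                                 merged['onetsoccode_x'])
print(merged[['onetsoccode', 'title', 'year', '_merge']])
```

Rows marked `left_only` keep their old code. These are the titles that still need the crosswalk-based update.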

In [10]:
df_final = df_new_work.merge(tax_2010, how='left', on='title', indicator='_merge')
df_final.head()

Unnamed: 0,onetsoccode_x,title,year,onetsoccode_y,_merge
0,11-1011.03,Chief Sustainability Officers,2009,11-1011.03,both
1,11-2011.01,Green Marketers,2009,11-2011.01,both
2,11-3051.01,Quality Control Systems Managers,2009,11-3051.01,both
3,11-3051.02,Geothermal Production Managers,2009,11-3051.02,both
4,11-3051.03,Biofuels Production Managers,2009,11-3051.03,both


In [11]:
condition = (df_final.onetsoccode_x !=
             df_final.onetsoccode_y) & (df_final.onetsoccode_y.notnull())
df_final['onetsoccode'] = np.where(condition, df_final.onetsoccode_y,
                                   df_final.onetsoccode_x)
df_final.drop(['onetsoccode_x', 'onetsoccode_y'], axis=1, inplace=True)
df_final.loc[:,['onetsoccode', 'title', 'year']].head()

Unnamed: 0,onetsoccode,title,year
0,11-1011.03,Chief Sustainability Officers,2009
1,11-2011.01,Green Marketers,2009
2,11-3051.01,Quality Control Systems Managers,2009
3,11-3051.02,Geothermal Production Managers,2009
4,11-3051.03,Biofuels Production Managers,2009


In [12]:
# Occupations that are not updated by O*NET-SOC Taxonomy 2010
df_final[df_final._merge != 'both']

Unnamed: 0,title,year,_merge,onetsoccode
5,Biomass Production Managers,2009,left_only,11-3051.04
41,Telecommunications Specialists,2009,left_only,15-1081.01
44,Network Designers,2006,left_only,15-1099.03
53,Electronic Commerce Specialists,2009,left_only,15-1099.12
105,Adaptive Physical Education Specialists,2009,left_only,25-3099.01
137,Electroneurodiagnostic Technologists,2009,left_only,29-2099.01
139,Ophthalmic Medical Technologists and Technicians,2009,left_only,29-2099.03
149,Loss Prevention Specialists,2009,left_only,33-9099.02


* Next, we keep the 2009 occupational titles and the 2010 occupational codes from the 2009 - 2010 Crosswalk and merge the new work with them on occupational title. Wherever the 2010 code from the Crosswalk differs from the code currently assigned, we replace the code with the 2010 one.
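
The handling of split titles can be sketched with the Ophthalmic example from the table at the top. The row order of the toy crosswalk is an assumption taken from that table, and `fillna` is used here as a compact stand-in for the `np.where` rule: the two agree because the 2010 code wins whenever it is present.

```python
import pandas as pd

# 2009-2010 crosswalk rows for the split occupation (order assumed from the table)
toy_cross = pd.DataFrame({
    'title': ['Ophthalmic Medical Technologists and Technicians'] * 2,  # 2009 title
    'onetsoccode': ['29-2057.00', '29-2099.05'],                         # 2010 codes
})
# One row per 2009 title; drop_duplicates keeps the first occurrence by default
first_codes = toy_cross.drop_duplicates('title')

toy_new = pd.DataFrame({
    'onetsoccode': ['29-2099.03'],
    'title': ['Ophthalmic Medical Technologists and Technicians'],
    'year': [2009],
})
merged = toy_new.merge(first_codes, how='left', on='title')
# Prefer the 2010 code when the crosswalk supplies one, else keep the old code
merged['onetsoccode'] = merged['onetsoccode_y'].fillna(merged['onetsoccode_x'])
toy_final = merged[['onetsoccode', 'title', 'year']]
print(toy_final)
```

With this ordering the 2009 code `29-2099.03` is replaced by `29-2057.00`, the code of the first title produced by the split.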

In [13]:
temp = pd.read_csv('csv_files/Crosswalks/2009_to_2010_Crosswalk.csv')
temp.columns = ['onetsoccode2009', 'title', 'onetsoccode', 'onetsoctitle2010']
temp.drop(['onetsoccode2009', 'onetsoctitle2010'], axis=1, inplace=True)
temp.drop_duplicates('title', inplace=True)

* `onetsoccode_x`: O\*NET-SOC occupational codes currently held in `df_final`
* `onetsoccode_y`: O\*NET-SOC occupational codes for 2010, taken from the 2009 - 2010 Crosswalk

In [14]:
df_final = df_final.merge(temp, how='left', on='title', indicator='_merge2')
df_final.head()

Unnamed: 0,title,year,_merge,onetsoccode_x,onetsoccode_y,_merge2
0,Chief Sustainability Officers,2009,both,11-1011.03,11-1011.03,both
1,Green Marketers,2009,both,11-2011.01,11-2011.01,both
2,Quality Control Systems Managers,2009,both,11-3051.01,11-3051.01,both
3,Geothermal Production Managers,2009,both,11-3051.02,11-3051.02,both
4,Biofuels Production Managers,2009,both,11-3051.03,11-3051.03,both


In [15]:
condition = (df_final.onetsoccode_x != df_final.onetsoccode_y) & \
                (df_final.onetsoccode_y.notnull())
df_final['onetsoccode'] = np.where(condition, df_final.onetsoccode_y,
                                              df_final.onetsoccode_x)
df_final.drop(['onetsoccode_x', 'onetsoccode_y'], axis=1, inplace=True)
df_final.head()

Unnamed: 0,title,year,_merge,_merge2,onetsoccode
0,Chief Sustainability Officers,2009,both,both,11-1011.03
1,Green Marketers,2009,both,both,11-2011.01
2,Quality Control Systems Managers,2009,both,both,11-3051.01
3,Geothermal Production Managers,2009,both,both,11-3051.02
4,Biofuels Production Managers,2009,both,both,11-3051.03


In [16]:
# Occupations that are not updated by Crosswalk 2009-2010
df_final[df_final._merge2 != 'both']

Unnamed: 0,title,year,_merge,_merge2,onetsoccode
11,Funeral Service Managers,2010,both,left_only,11-9061.00
29,Fundraisers,2010,both,left_only,13-1131.00
35,Credit Counselors,2010,both,left_only,13-2071.00
103,Community Health Workers,2010,both,left_only,21-1094.00
104,"Special Education Teachers, All Other",2010,both,left_only,25-2059.00
125,Art Therapists,2010,both,left_only,29-1125.01
126,Music Therapists,2010,both,left_only,29-1125.02
127,Exercise Physiologists,2010,both,left_only,29-1128.00
136,Magnetic Resonance Imaging Technologists,2010,both,left_only,29-2035.00
141,Surgical Assistants,2010,both,left_only,29-2099.07


* Among the occupations above, we ignore new work from 2010, since it is not included in the O\*NET-SOC 2009 - 2010 Crosswalk.
* __Transportation Security Screeners__ first emerges in O\*NET-SOC Taxonomy 2006. In O\*NET-SOC Taxonomy 2009 it was renamed __Transportation Security Officers__, and in O\*NET-SOC Taxonomy 2010 the name changed back to __Transportation Security Screeners__.
* As a result, all new work is updated.

In [17]:
df_final.drop(['_merge', '_merge2'], axis=1, inplace=True)
df_final = df_final[['onetsoccode', 'title', 'year']]
with pd.option_context('display.max_rows', None):
    display(df_final)

Unnamed: 0,onetsoccode,title,year
0,11-1011.03,Chief Sustainability Officers,2009
1,11-2011.01,Green Marketers,2009
2,11-3051.01,Quality Control Systems Managers,2009
3,11-3051.02,Geothermal Production Managers,2009
4,11-3051.03,Biofuels Production Managers,2009
5,11-3051.04,Biomass Production Managers,2009
6,11-3051.05,Methane/Landfill Gas Collection System Operators,2009
7,11-3051.06,Hydroelectric Production Managers,2009
8,11-9039.01,Distance Learning Coordinators,2009
9,11-9039.02,Fitness and Wellness Coordinators,2009


In [18]:
df_final.to_csv('new_work.csv', index=False)