# O\*NET Occupational Data

* We are interested in four categories of occupational data:
    1. Abilities
    2. Knowledge
    3. Skills
    4. Work Activities
___
* __Abilities__ consists of 52 items.
* __Knowledge__ consists of 33 items.
* __Skills__ consists of 35 items.
* __Work Activities__ consists of 41 items.
* The detailed list of items and their encoded values is provided below.
___
* O\*NET provides data for each O\*NET-SOC occupational title. The goal is to merge the four data files into a single dataset for further analysis.
* This project uses O\*NET database version 22.3 (the database files can be downloaded from this [link](https://www.onetcenter.org/db_releases.html)).

In [1]:
from IPython.display import display, HTML
display(HTML('<style>.container { width:80% !important; }</style>'))
import pandas as pd
import numpy as np

# Read Files

In [2]:
# filepaths
abilities = 'csv_files/db_22_3_excel/Abilities.xlsx'
activities = 'csv_files/db_22_3_excel/Work Activities.xlsx'
knowledge = 'csv_files/db_22_3_excel/Knowledge.xlsx'
skills = 'csv_files/db_22_3_excel/Skills.xlsx'

* The `process_df` function reads a tabular file, keeps only the importance scores (`Scale ID == 'IM'`), and encodes each item as an integer (its 1-based position in the alphabetically sorted list of item names). It then reshapes the data from long to wide, so that each O\*NET-SOC occupational title has one row and each item has its own column. The function returns the reshaped data together with the sorted list of item names.

In [3]:
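def process_df(filepath, new_column):
    from string import punctuation
    temp = pd.read_excel(filepath)
    # Normalize column names: strip punctuation, lowercase, spaces to underscores
    # (e.g. 'O*NET-SOC Code' becomes 'onetsoc_code')
    t = [''.join([x for x in column if x not in punctuation])
         for column in temp.columns]
    temp.columns = [c.lower().replace(' ', '_') for c in t]
    # Encode each item as its 1-based position in the sorted list of item names
    elements = sorted(temp['element_name'].unique())
    codes = {name: code for code, name in enumerate(elements, start=1)}
    temp['element_code'] = temp['element_name'].map(codes)
    # Keep importance scores only; copy so the drop below does not act on a view
    temp = temp.loc[temp['scale_id'] == 'IM', :].copy()
    temp.drop(['element_id', 'element_name', 'scale_id', 'scale_name', 'n',
               'standard_error', 'lower_ci_bound', 'upper_ci_bound',
               'recommend_suppress', 'not_relevant', 'date', 'domain_source'],
              axis=1, inplace=True)
    # Reshape from long to wide: one row per occupation, one column per item
    temp = temp.pivot(index=['onetsoc_code', 'title'],
                      columns='element_code',
                      values='data_value')
    # Name each column after the item code it holds, e.g. 'ability_1'
    temp.columns = [new_column + '_' + str(code) for code in temp.columns]
    # Move 'title' back to a regular column; keep the occupation code as index
    temp.reset_index(level=1, inplace=True)
    temp.index.name = 'onetsoccode'
    temp.columns.name = None
    return temp, elements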
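
To make the reshape concrete, here is a minimal sketch of the same filter-encode-pivot steps on a hand-made long table. The numbers and the two-item list are invented for illustration and are not taken from the O\*NET files; the column names assume the normalized names that `process_df` produces internally.

```python
import pandas as pd

# Hypothetical miniature of the long-format input (values are invented)
long_df = pd.DataFrame({
    'onetsoc_code': ['11-1011.00'] * 4 + ['11-1021.00'] * 4,
    'title': ['Chief Executives'] * 4
             + ['General and Operations Managers'] * 4,
    'element_name': ['Oral Comprehension', 'Oral Comprehension',
                     'Written Comprehension', 'Written Comprehension'] * 2,
    'scale_id': ['IM', 'LV'] * 4,
    'data_value': [4.50, 4.88, 4.25, 4.62, 4.12, 4.00, 4.00, 3.88],
})

# Encode item names by their 1-based alphabetical position
elements = sorted(long_df['element_name'].unique())
codes = {name: code for code, name in enumerate(elements, start=1)}
long_df['element_code'] = long_df['element_name'].map(codes)

# Keep importance rows only, then pivot to one row per occupation
im = long_df[long_df['scale_id'] == 'IM']
wide = im.pivot(index=['onetsoc_code', 'title'],
                columns='element_code',
                values='data_value')
wide.columns = ['ability_' + str(code) for code in wide.columns]
wide = wide.reset_index(level='title')
print(wide)
```

Each occupation ends up on a single row, with the `LV` rows discarded and one `ability_<code>` column per item.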

* To demonstrate the difference between raw and processed data, we use the `abilities` data.

In [4]:
# raw
print('Sample of raw data')
df_abilities_raw = pd.read_excel(abilities)
display(df_abilities_raw.head())

Sample of raw data


,O*NET-SOC Code,Title,Element ID,Element Name,Scale ID,Scale Name,Data Value,N,Standard Error,Lower CI Bound,Upper CI Bound,Recommend Suppress,Not Relevant,Date,Domain Source
0,11-1011.00,Chief Executives,1.A.1.a.1,Oral Comprehension,IM,Importance,4.5,8.0,0.19,4.13,4.87,N,,07/2014,Analyst
1,11-1011.00,Chief Executives,1.A.1.a.1,Oral Comprehension,LV,Level,4.88,8.0,0.13,4.63,5.12,N,N,07/2014,Analyst
2,11-1011.00,Chief Executives,1.A.1.a.2,Written Comprehension,IM,Importance,4.25,8.0,0.16,3.93,4.57,N,,07/2014,Analyst
3,11-1011.00,Chief Executives,1.A.1.a.2,Written Comprehension,LV,Level,4.62,8.0,0.18,4.27,4.98,N,N,07/2014,Analyst
4,11-1011.00,Chief Executives,1.A.1.a.3,Oral Expression,IM,Importance,4.38,8.0,0.18,4.02,4.73,N,,07/2014,Analyst


In [5]:
# processed
df_abilities, abilities_items = process_df(abilities, 'ability')
df_activities, activities_items = process_df(activities, 'activity')
df_knowledge, knowledge_items = process_df(knowledge, 'knowledge')
df_skills, skills_items = process_df(skills, 'skill')
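
A quick sanity check at this point is to confirm that each processed table has the number of item columns stated in the overview (52, 33, 35, and 41). The helper below is a hypothetical sketch, not part of the original notebook; it is shown on a toy frame so that it runs without the O\*NET files.

```python
import pandas as pd

# Item counts as stated in the overview of this notebook
EXPECTED_ITEMS = {'ability': 52, 'knowledge': 33, 'skill': 35, 'activity': 41}

def check_item_counts(df, prefix, expected):
    """Return True if `df` has exactly `expected` columns named '<prefix>_<n>'."""
    n_cols = sum(col.startswith(prefix + '_') for col in df.columns)
    return n_cols == expected

# Toy frame standing in for a processed table (hypothetical data)
toy = pd.DataFrame({'title': ['A'], 'skill_1': [1.0], 'skill_2': [2.5]})
print(check_item_counts(toy, 'skill', 2))
```

With the real data, the call would look like `check_item_counts(df_abilities, 'ability', EXPECTED_ITEMS['ability'])`.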

In [6]:
print('Sample of processed data')
display(df_abilities.head())

Sample of processed data


onetsoccode,title,ability_1,ability_2,ability_3,ability_4,ability_5,ability_6,ability_7,ability_8,ability_9,...,ability_43,ability_44,ability_45,ability_46,ability_47,ability_48,ability_49,ability_50,ability_51,ability_52
11-1011.00,Chief Executives,1.0,2.12,3.5,1.75,4.12,1.75,1.0,1.0,1.0,...,1.0,1.0,1.0,3.0,1.0,1.88,3.12,1.0,4.25,4.12
11-1011.03,Chief Sustainability Officers,1.0,1.88,3.38,1.75,4.0,2.0,1.0,1.0,1.0,...,1.0,1.0,1.0,2.62,1.12,2.0,2.75,1.12,4.0,3.88
11-1021.00,General and Operations Managers,2.0,2.12,3.0,1.75,3.75,2.0,1.0,1.62,1.5,...,1.5,2.0,2.0,2.88,2.12,2.0,2.75,1.38,4.0,4.0
11-2011.00,Advertising and Promotions Managers,1.88,1.88,3.38,1.5,3.88,1.88,1.0,1.0,1.0,...,1.0,1.0,1.0,2.75,1.25,2.88,3.0,1.25,3.88,3.88
11-2021.00,Marketing Managers,1.12,1.88,3.25,1.0,3.88,1.75,1.0,1.25,1.0,...,1.0,1.0,1.0,2.75,1.75,2.88,3.0,1.62,4.0,3.88


In [7]:
def add_trailing(t):
    """Pad the item list `t` with empty strings to the length of `abilities_items`.

    Returns the padded item names and their matching 1-based codes.
    """
    n = len(abilities_items)
    difference = n - len(t)
    items = list(t) + ['' for i in range(difference)]
    values = [i + 1 for i in range(len(t))] + ['' for i in range(difference)]
    return items, values
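
As an alternative to padding by hand, pandas can align lists of unequal length itself. The sketch below is an assumption about how the lookup table could be built, not the notebook's method: `items_table` is a hypothetical helper that relies on `pd.concat` aligning Series on their integer index and filling the gaps with missing values.

```python
import pandas as pd

def items_table(named_lists):
    """Build a side-by-side lookup table from lists of unequal length."""
    columns = {}
    for name, items in named_lists.items():
        items = list(items)
        columns[name + '_items'] = pd.Series(items, dtype='object')
        # Nullable integer dtype keeps the codes as integers despite the padding
        columns[name + '_value'] = pd.Series(list(range(1, len(items) + 1)),
                                             dtype='Int64')
    # Series are aligned on their integer index; missing rows become NaN / <NA>
    return pd.concat(columns, axis=1)

# Shortened item lists, taken from the first rows of the table in this notebook
table = items_table({
    'abilities': ['Arm-Hand Steadiness', 'Auditory Attention',
                  'Category Flexibility'],
    'knowledge': ['Administration and Management', 'Biology'],
})
print(table)
```

The trade-off is that missing cells show up as `NaN` or `<NA>` rather than as empty strings, which may or may not be preferable for display.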

In [8]:
knowledge_items, knowledge_values = add_trailing(knowledge_items)
skills_items, skills_values = add_trailing(skills_items)
activities_items, activities_values = add_trailing(activities_items)
table = pd.DataFrame({'abilities_items' : abilities_items,
                      'abilities_value' : [i + 1 for i in range(len(abilities_items))],
                      'knowledge_items' : knowledge_items,
                      'knowledge_value' : knowledge_values,
                      'skills_items' : skills_items,
                      'skills_value' : skills_values,
                      'activities_items' : activities_items,
                      'activities_value' : activities_values})
display(table)

,abilities_items,abilities_value,knowledge_items,knowledge_value,skills_items,skills_value,activities_items,activities_value
0,Arm-Hand Steadiness,1,Administration and Management,1.0,Active Learning,1.0,Analyzing Data or Information,1.0
1,Auditory Attention,2,Biology,2.0,Active Listening,2.0,Assisting and Caring for Others,2.0
2,Category Flexibility,3,Building and Construction,3.0,Complex Problem Solving,3.0,Coaching and Developing Others,3.0
3,Control Precision,4,Chemistry,4.0,Coordination,4.0,Communicating with Persons Outside Organization,4.0
4,Deductive Reasoning,5,Clerical,5.0,Critical Thinking,5.0,"Communicating with Supervisors, Peers, or Subo...",5.0
5,Depth Perception,6,Communications and Media,6.0,Equipment Maintenance,6.0,Controlling Machines and Processes,6.0
6,Dynamic Flexibility,7,Computers and Electronics,7.0,Equipment Selection,7.0,Coordinating the Work and Activities of Others,7.0
7,Dynamic Strength,8,Customer and Personal Service,8.0,Installation,8.0,Developing Objectives and Strategies,8.0
8,Explosive Strength,9,Design,9.0,Instructing,9.0,Developing and Building Teams,9.0
9,Extent Flexibility,10,Economics and Accounting,10.0,Judgment and Decision Making,10.0,Documenting/Recording Information,10.0


# Final Dataset

In [9]:
# Join on the shared O*NET-SOC index; drop 'title' from the right-hand frames
# so that it appears only once in the result
df_onet = df_abilities.join(df_activities.drop(columns='title'), how='inner')
df_onet = df_onet.join(df_knowledge.drop(columns='title'), how='inner')
df_onet = df_onet.join(df_skills.drop(columns='title'), how='inner')
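
The chain of merges can also be written as a fold over a list of frames. The following is a minimal sketch under the assumption that every frame is indexed by the O\*NET-SOC code and carries the same `title` column; `combine_on_index` is a hypothetical helper, and the `skill_1` values in the toy frames are invented.

```python
from functools import reduce

import pandas as pd

def combine_on_index(frames, shared='title'):
    """Inner-join frames on their index, keeping `shared` from the first only."""
    first, rest = frames[0], frames[1:]
    rest = [frame.drop(columns=shared) for frame in rest]
    return reduce(lambda left, right: left.join(right, how='inner'),
                  rest, first)

# Toy frames indexed by occupation code (skill values are invented)
idx = pd.Index(['11-1011.00', '11-1021.00'], name='onetsoccode')
titles = ['Chief Executives', 'General and Operations Managers']
a = pd.DataFrame({'title': titles, 'ability_1': [1.0, 2.0]}, index=idx)
b = pd.DataFrame({'title': titles, 'skill_1': [3.0, 3.5]}, index=idx)

combined = combine_on_index([a, b])
print(combined)
```

An inner join keeps only occupations present in every frame, so comparing `len(combined)` with the length of each input is a cheap way to spot occupations that were dropped.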

In [10]:
df_onet.head()

onetsoccode,title,ability_1,ability_2,ability_3,ability_4,ability_5,ability_6,ability_7,ability_8,ability_9,...,skill_26,skill_27,skill_28,skill_29,skill_30,skill_31,skill_32,skill_33,skill_34,skill_35
11-1011.00,Chief Executives,1.0,2.12,3.5,1.75,4.12,1.75,1.0,1.0,1.0,...,1.88,3.12,4.25,4.38,4.12,4.12,1.75,4.0,1.0,4.0
11-1011.03,Chief Sustainability Officers,1.0,1.88,3.38,1.75,4.0,2.0,1.0,1.0,1.0,...,1.75,3.25,3.75,4.0,3.62,3.62,1.62,3.38,1.12,3.88
11-1021.00,General and Operations Managers,2.0,2.12,3.0,1.75,3.75,2.0,1.0,1.62,1.5,...,1.88,3.25,4.0,4.0,3.0,3.0,1.88,3.75,2.0,3.25
11-2011.00,Advertising and Promotions Managers,1.88,1.88,3.38,1.5,3.88,1.88,1.0,1.0,1.0,...,1.5,3.12,4.0,4.0,3.12,3.0,1.62,3.88,1.12,3.75
11-2021.00,Marketing Managers,1.12,1.88,3.25,1.0,3.88,1.75,1.0,1.25,1.0,...,1.75,3.12,3.88,3.88,3.25,3.5,1.75,3.5,1.0,3.25


In [11]:
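# Turn the occupation code back into a regular column before saving
df_onet.reset_index(inplace=True)
# index=False avoids writing the integer row index as an extra unnamed column
df_onet.to_csv('onet_numeric.csv', index=False)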
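
When the saved file is loaded again for analysis, it helps to read the occupation code explicitly as a string. The round trip below is a sketch on a one-row toy frame that writes to an in-memory buffer instead of `onet_numeric.csv`; it passes `index=False` so that the row index is not written as an extra unnamed column.

```python
import io

import pandas as pd

# One-row stand-in for the final dataset
df = pd.DataFrame({'onetsoccode': ['11-1011.00'],
                   'title': ['Chief Executives'],
                   'ability_1': [1.0]})

# Write without the row index, then read the same text back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'onetsoccode': str})
print(restored)
```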