# Occupations and Skills in Demand

## Objective
This project aims to develop a platform that continuously monitors online job vacancies across the U.S. and predicts the occupations and skills in demand among employers and across industries. Such a platform would enhance existing labor market indicators by providing deeper and higher-frequency monitoring of labor demand. As a result, it would inform labor, education, and immigration policies and activities aimed at developing and maintaining a skilled workforce, which would, in the long run, contribute to income mobility and equality.

Three organizations have developed similar platforms. The Conference Board uses online job postings from various portals and companies to publish monthly national and state-level vacancy indicators at the 2-digit Standard Occupational Classification (SOC) level. The Florida Department of Economic Opportunity is one of the users of these indicators. A technical note that describes the Conference Board platform is available __[here](https://www.conference-board.org/pdf_free/press/2018%20HWOL%20Technical%20Note8.pdf)__. The Center for Urban Research at the City University of New York has designed a real-time labor market information system that scrapes private and public job boards daily and stores the data in searchable databases. A brief description of this system is available __[here](https://gc.cuny.edu/lmis/research/real_time#menu)__. __[The New Jersey state government](https://careerconnections.nj.gov/careerconnections/prepare/skills/demand/demand_occupations_list_methodology.shtml)__ is one of the users of this system. Burning Glass Technologies delivers real-time job vacancy data and planning tools that inform careers, define academic programs, and shape workforces. A description of its offerings is available __[here](https://www.burning-glass.com/research-project/skills-taxonomy/)__. The World Bank and the Government of Malaysia used these offerings to monitor in-demand occupations and skills in Malaysia.

## Methodology
The project will follow a phased approach as outlined below:
1. Occupational classification for the DC metropolitan area using a single job portal
2. Skills identification and clustering for the DC metropolitan area using a single job portal
3. Coverage of other metropolitan areas
4. Coverage of other job portals

Phase 1 will be based on job postings scraped from Indeed. Initially, job titles will be matched to 6-digit SOC titles using rules. For the unmatched job titles, a look-alike algorithm will be implemented using the job descriptions and SOC descriptions. Quality assurance will rely on manually labeling a small random subset of the data. Upon successful completion of this step, a process will be developed to scrape and store the data daily. A set of key indicators will also be designed together with their visualization. Phase 1 will be completed by January 24, 2020.
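
The rule-based first pass described above could be sketched as an exact lookup on normalized titles, with the leftovers passed on to the look-alike step. This is a minimal illustration rather than the project's implementation: `normalize`, `match_titles`, and the sample postings are hypothetical names and data, and the only rule shown is case, punctuation, and whitespace normalization.

```python
import random
import re


def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace runs."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_titles(job_titles, soc_lookup):
    """Split job titles into exact SOC matches and unmatched leftovers."""
    index = {normalize(t): code for t, code in soc_lookup.items()}
    matched, unmatched = {}, []
    for title in job_titles:
        code = index.get(normalize(title))
        if code is None:
            unmatched.append(title)  # candidate for the look-alike step
        else:
            matched[title] = code
    return matched, unmatched


# Small illustrative lookup; a real run would use the full SOC title file.
soc_lookup = {
    "Chief Executive Officer": "11-1011",
    "Chief Operating Officer": "11-1011",
}
postings = ["chief executive officer", "Chief  Operating Officer", "Store Associate"]
matched, unmatched = match_titles(postings, soc_lookup)

# Draw a small reproducible sample of matches for manual review (assumed QA step).
qa_sample = random.Random(0).sample(sorted(matched), k=min(2, len(matched)))
```

Keeping the unmatched titles in a separate list makes it straightforward to measure how much of the data the rules cover before any similarity-based matching is attempted.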

The scope for Phase 2 is still open. One option is to cluster the job descriptions along several dimensions measuring various aspects of knowledge and skills. The second option is to match the occupations identified in Phase 1 to different skills using __[O*NET's existing classification](https://www.onetcenter.org/dataCollection.html)__.

## Data Cleaning
A preliminary round of data collection has already been completed. This data includes job postings from Indeed for Washington, DC. The data fields are date of collection, location, job title, company, job description, and salary (if provided).

In [79]:
%run data_cleaning

In [7]:
soc_titles_df = clean_soc_titles()

In [8]:
soc_titles_df.head()

| | title | soc_6 |
|---|---|---|
| 0 | CEO | 11-1011 |
| 1 | Chief Executive Officer | 11-1011 |
| 2 | Chief Operating Officer | 11-1011 |
| 3 | Commissioner of Internal Revenue | 11-1011 |
| 4 | COO | 11-1011 |


In [9]:
soc_titles_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40221 entries, 0 to 64699
Data columns (total 2 columns):
title    40221 non-null object
soc_6    40221 non-null object
dtypes: object(2)
memory usage: 942.7+ KB


In [10]:
tokenized_soc_titles_list = [word_tokenize(title) for title in soc_titles_df.title]

In [11]:
tokenized_soc_titles_list[:5]

[['CEO'],
 ['Chief', 'Executive', 'Officer'],
 ['Chief', 'Operating', 'Officer'],
 ['Commissioner', 'of', 'Internal', 'Revenue'],
 ['COO']]

In [12]:
stopwords_list = create_stop_words()

In [13]:
stopwords_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [14]:
stopwords_list[-10:]

['naperville',
 'torrance',
 'omaha',
 'aguadilla',
 'hickory',
 'martin',
 'metairie',
 'price',
 'portsmouth',
 'dayton']

In [15]:
stopped_tokenized_soc_titles_list = stop_tokenized_titles(tokenized_soc_titles_list, stopwords_list)

In [16]:
stopped_tokenized_soc_titles_list[:5]

[['ceo'],
 ['chief', 'executive', 'officer'],
 ['chief', 'operating', 'officer'],
 ['commissioner', 'internal', 'revenue'],
 ['coo']]
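
The helper `stop_tokenized_titles` is defined in the `data_cleaning` script and is not shown in this notebook. Judging only from the output above (tokens are lowercased and `of` is dropped), its behavior might resemble the sketch below. The function name `stop_tokenized_titles_sketch` and the short stop-word list are assumptions made for illustration.

```python
def stop_tokenized_titles_sketch(tokenized_titles, stopwords):
    """Lowercase every token and drop those found in the stop-word list."""
    stop_set = set(stopwords)  # set lookup avoids rescanning the list per token
    return [
        [token.lower() for token in title if token.lower() not in stop_set]
        for title in tokenized_titles
    ]


titles = [
    ["Chief", "Executive", "Officer"],
    ["Commissioner", "of", "Internal", "Revenue"],
]
# Illustrative stop words only; the real list also contains place names.
cleaned = stop_tokenized_titles_sketch(titles, ["of", "the", "dayton"])
print(cleaned)
```

Note that the output keeps one list per input title, so the number of titles is unchanged even if a title loses all of its tokens.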

In [17]:
len(stopped_tokenized_soc_titles_list)

40221

In [89]:
indeed_titles_df = pd.read_csv('47900_training.csv').title

In [90]:
indeed_titles_df.shape

(8835,)

In [91]:
indeed_titles_df = indeed_titles_df.dropna()

In [92]:
indeed_titles_df.shape

(8834,)

In [93]:
indeed_titles_df.head()

0             Front Office Coordinator
1      Customer Service Representative
2     Police Communications Specialist
3    Office Services \/ Mail Associate
4            Full-Time Store Associate
Name: title, dtype: object

In [94]:
tokenized_indeed_titles_list = [word_tokenize(title) for title in indeed_titles_df]

In [95]:
tokenized_indeed_titles_list[:5]

[['Front', 'Office', 'Coordinator'],
 ['Customer', 'Service', 'Representative'],
 ['Police', 'Communications', 'Specialist'],
 ['Office', 'Services', '\\/', 'Mail', 'Associate'],
 ['Full-Time', 'Store', 'Associate']]

In [96]:
len(tokenized_indeed_titles_list)

8834

In [97]:
stopped_tokenized_indeed_titles_list = stop_tokenized_titles(tokenized_indeed_titles_list, stopwords_list)

In [98]:
len(stopped_tokenized_indeed_titles_list)

8834

In [99]:
stopped_tokenized_indeed_titles_list = substitute_words(stopped_tokenized_indeed_titles_list)

In [100]:
stopped_tokenized_indeed_titles_list[:5]

[['front', 'office', 'coordinator'],
 ['customer', 'service', 'representative'],
 ['police', 'communications', 'specialist'],
 ['office', 'service', 'mail', 'associate'],
 ['store', 'associate']]
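
`substitute_words` also comes from the `data_cleaning` script. The output above shows `services` becoming `service`, which suggests a token-level replacement table. The sketch below assumes that design; the `SUBSTITUTIONS` mapping and the function name are hypothetical, and the real table is not shown in this notebook.

```python
# Hypothetical substitution table; the real one lives in data_cleaning.
SUBSTITUTIONS = {"services": "service"}


def substitute_words_sketch(tokenized_titles, substitutions=SUBSTITUTIONS):
    """Replace each token that has a canonical form and leave others as-is."""
    return [
        [substitutions.get(token, token) for token in title]
        for title in tokenized_titles
    ]


result = substitute_words_sketch([["office", "services", "mail", "associate"]])
print(result)
```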

In [101]:
len(stopped_tokenized_indeed_titles_list)

8834

In [102]:
# Join each title's tokens back into a single space-separated string
indeed_titles_list = [' '.join(tokenized_title)
                      for tokenized_title in stopped_tokenized_indeed_titles_list]

In [103]:
indeed_titles_list[:5]

['front office coordinator',
 'customer service representative',
 'police communications specialist',
 'office service mail associate',
 'store associate']

In [104]:
len(indeed_titles_list)

8834

In [111]:
indeed_titles_df = pd.DataFrame(indeed_titles_list)

In [112]:
indeed_titles_df = indeed_titles_df.drop_duplicates()

In [113]:
indeed_titles_df.shape

(5499, 1)

In [114]:
indeed_titles_df.head()

| | 0 |
|---|---|
| 0 | front office coordinator |
| 1 | customer service representative |
| 2 | police communications specialist |
| 3 | office service mail associate |
| 4 | store associate |


In [116]:
stopped_tokenized_indeed_titles_list = [word_tokenize(title) for title in indeed_titles_df.iloc[:, 0]]

In [117]:
stopped_tokenized_indeed_titles_list[:5]

[['front', 'office', 'coordinator'],
 ['customer', 'service', 'representative'],
 ['police', 'communications', 'specialist'],
 ['office', 'service', 'mail', 'associate'],
 ['store', 'associate']]

In [118]:
len(stopped_tokenized_indeed_titles_list)

5499

## Modeling


In [120]:
from gensim.models import Word2Vec 
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial import distance
import multiprocessing
import numpy as np

In [119]:
corpus_list = stopped_tokenized_indeed_titles_list + stopped_tokenized_soc_titles_list

In [121]:
len(corpus_list)

45720

In [124]:
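dim = 300   # dimensionality of the word vectors
wsize = 5   # context window, in tokens
# Passing the corpus to the constructor already builds the vocabulary and trains.
# Note: `size` is the gensim 3.x keyword; gensim 4.0 renamed it to `vector_size`.
model = Word2Vec(corpus_list, 
                 size = dim, 
                 window = wsize, 
                 min_count = 5,   # ignore tokens seen fewer than 5 times
                 workers = multiprocessing.cpu_count())
# This explicit call runs an additional training pass over the same corpus.
model.train(corpus_list, total_examples = model.corpus_count, epochs = model.epochs)
wv = model.wv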



In [125]:
wv.most_similar('customer')

[('guest', 0.9594030380249023),
 ('sale', 0.9460951685905457),
 ('retail', 0.9457467794418335),
 ('account', 0.9450162053108215),
 ('patient', 0.9430860280990601),
 ('advisor', 0.9305132031440735),
 ('store', 0.9239559173583984),
 ('service', 0.9238852858543396),
 ('insurance', 0.9203119277954102),
 ('client', 0.9136644601821899)]

In [126]:
wv.most_similar('data')

[('management', 0.9686769247055054),
 ('network', 0.963157594203949),
 ('assurance', 0.9608554840087891),
 ('security', 0.959830641746521),
 ('quality', 0.9597352147102356),
 ('level', 0.9533358216285706),
 ('project', 0.9511696696281433),
 ('software', 0.950702428817749),
 ('senior', 0.9490289688110352),
 ('technology', 0.9487961530685425)]
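
One way vectors like these might be used for title matching is to average the token vectors of each title and pick the SOC title with the highest cosine similarity. The sketch below is an assumption about a possible next step, not the notebook's method. The helper names are hypothetical, the two-dimensional vectors are toy values standing in for `model.wv`, and `00-0000` is a placeholder rather than a real SOC code.

```python
import math


def title_vector(tokens, vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no in-vocabulary token, so no vector
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_soc(job_tokens, soc_titles, vectors):
    """Return (soc_code, similarity) for the most similar SOC title, or None."""
    job_vec = title_vector(job_tokens, vectors)
    if job_vec is None:
        return None
    best = None
    for tokens, code in soc_titles:
        soc_vec = title_vector(tokens, vectors)
        if soc_vec is None:
            continue
        score = cosine(job_vec, soc_vec)
        if best is None or score > best[1]:
            best = (code, score)
    return best


# Toy vectors for illustration only; a real run would pass the trained `wv`.
toy_vectors = {
    "chief": [1.0, 0.0],
    "executive": [0.9, 0.1],
    "head": [0.95, 0.05],
    "clerk": [0.0, 1.0],
}
soc_titles = [(["chief", "executive"], "11-1011"), (["clerk"], "00-0000")]
best = closest_soc(["head", "executive"], soc_titles, toy_vectors)
```

Because `min_count = 5` removes rare tokens from the vocabulary, some titles may have no in-vocabulary tokens at all. Returning `None` in that case keeps such titles visible for manual review instead of assigning them an arbitrary code.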