# Occupations and Skills in Demand

## Objective
This project aims to develop a platform that continuously monitors online job vacancies across the U.S. and predicts the occupations and skills in demand among employers and across industries. Such a platform would complement existing labor market indicators with deeper, higher-frequency monitoring of labor demand. It would thereby inform labor, education, and immigration policies and activities aimed at developing and maintaining a skilled workforce, which would, in the long run, contribute to income mobility and equality.

Three organizations have developed similar platforms. The Conference Board uses online job postings from various portals and companies to publish monthly national and state-level vacancy indicators at the 2-digit Standard Occupational Classification (SOC) level. The Florida Department of Economic Opportunity is one of the users of these indicators. A technical note describing the Conference Board platform is available __[here](https://www.conference-board.org/pdf_free/press/2018%20HWOL%20Technical%20Note8.pdf)__. The Center for Urban Research at the City University of New York has designed a real-time labor market information system that scrapes private and public job boards daily and stores the data in searchable databases. A brief description of this system is available __[here](https://gc.cuny.edu/lmis/research/real_time#menu)__. __[The New Jersey state government](https://careerconnections.nj.gov/careerconnections/prepare/skills/demand/demand_occupations_list_methodology.shtml)__ is one of the users of this system. Burning Glass Technologies delivers real-time job vacancy data and planning tools that inform careers, define academic programs, and shape workforces. A description of its offerings is available __[here](https://www.burning-glass.com/research-project/skills-taxonomy/)__. The World Bank and the Government of Malaysia used these offerings to monitor in-demand occupations and skills in Malaysia.

## Methodology
The project will follow a phased approach as outlined below:
1. Occupational classification for the DC metropolitan area with a single job portal
2. Skills identification and clustering for the DC metropolitan area with a single job portal
3. Coverage of other metropolitan areas
4. Coverage of other job portals

Phase 1 will be based on job postings scraped from Indeed. Initially, job titles will be matched to 6-digit SOC titles using rules. For the job titles that remain unmatched, a look-alike algorithm will compare job descriptions with SOC descriptions. Quality assurance will rely on manually labeling a small random subset of the data. Once this step is complete, a pipeline will be developed to scrape and store the data daily. A set of key indicators and their visualizations will also be designed. Phase 1 will be completed by January 24, 2020.
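
The rule-based first pass described above could look roughly like the following sketch. It assumes that "rules" means normalizing a title (lowercasing, stripping punctuation) and looking it up in a dictionary of SOC-coded titles. The `normalize` and `match_by_rule` helpers and the three-entry lookup table are hypothetical and serve only as an illustration; the project's actual rules may differ.

```python
import re

# Hypothetical lookup table; a real one would hold every SOC-coded title.
SOC_LOOKUP = {
    "chief executive officer": "11-1011",
    "busser": "35-9011",
    "dispatcher": "43-5031",
}

def normalize(title):
    """Lowercase a title and collapse punctuation and extra whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())

def match_by_rule(title, lookup=SOC_LOOKUP):
    """Return the SOC code for an exact normalized match, else None."""
    return lookup.get(normalize(title))

for posting_title in ["Chief Executive Officer", "  BUSSER!", "Spa Concierge"]:
    print(posting_title, "->", match_by_rule(posting_title))
```

Titles for which `match_by_rule` returns `None` would be the ones handed to the look-alike step.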

The scope of Phase 2 is still open. One option is to cluster the job descriptions along several dimensions that measure different aspects of knowledge and skills. The other is to map the occupations identified in Phase 1 to skills using __[O*NET's existing classification](https://www.onetcenter.org/dataCollection.html)__.

## Data Cleaning
A preliminary round of data collection has already been completed. This data includes job postings from Indeed for Washington, DC. The data fields are date of collection, location, job title, company, job description, and salary (if provided).
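
As a rough picture of one record with the fields listed above, the sketch below uses a dataclass. The class name, field names, and sample values are all hypothetical; the scraper's actual column names are not shown in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobPosting:
    """One scraped posting; field names are illustrative, not the scraper's."""
    date_collected: str
    location: str
    title: str
    company: str
    description: str
    salary: Optional[str] = None  # only set when the posting lists a salary

# Made-up sample values, for illustration only.
posting = JobPosting(
    date_collected="2019-12-01",
    location="Washington, DC",
    title="Front Office Coordinator",
    company="Example Co.",
    description="Greets visitors and manages scheduling.",
)
print(posting.salary is None)
```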

In [3]:
%run data_cleaning

In [4]:
soc_titles_df = clean_soc_titles()

In [5]:
soc_titles_df.head()

|   | title | soc_6 |
|---|---|---|
| 0 | CEO | 11-1011 |
| 1 | Chief Executive Officer | 11-1011 |
| 2 | Chief Operating Officer | 11-1011 |
| 3 | Commissioner of Internal Revenue | 11-1011 |
| 4 | COO | 11-1011 |


In [6]:
soc_titles_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40221 entries, 0 to 64699
Data columns (total 2 columns):
title    40221 non-null object
soc_6    40221 non-null object
dtypes: object(2)
memory usage: 942.7+ KB


In [7]:
tokenized_soc_titles_list = [word_tokenize(title) for title in soc_titles_df.title]

In [8]:
tokenized_soc_titles_list[:5]

[['CEO'],
 ['Chief', 'Executive', 'Officer'],
 ['Chief', 'Operating', 'Officer'],
 ['Commissioner', 'of', 'Internal', 'Revenue'],
 ['COO']]

In [9]:
stopwords_list = create_stop_words()

In [10]:
stopwords_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [11]:
stopwords_list[-10:]

['aurora',
 'elmira',
 'sedalia',
 'sterling',
 'albertville',
 'bemidji',
 'menomonie',
 'carroll',
 'indiana',
 'sacramento']

In [12]:
stopped_tokenized_soc_titles_list = stop_tokenized_titles(tokenized_soc_titles_list, stopwords_list)

In [13]:
stopped_tokenized_soc_titles_list[:5]

[['ceo'],
 ['chief', 'executive', 'officer'],
 ['chief', 'operating', 'officer'],
 ['commissioner', 'internal', 'revenue'],
 ['coo']]
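
`stop_tokenized_titles` is defined in `data_cleaning` and is not shown here. Judging only from the output above, it lowercases tokens and drops stop words. The sketch below reproduces just that behavior under this assumption; `stop_tokens_sketch` is a hypothetical name, and the real function may do more (for example, strip punctuation).

```python
def stop_tokens_sketch(tokenized_titles, stopwords):
    """Lowercase every token and drop those found in the stop-word list."""
    stopword_set = set(stopwords)  # set membership is fast for long lists
    return [
        [token.lower() for token in title if token.lower() not in stopword_set]
        for title in tokenized_titles
    ]

sample = [["Commissioner", "of", "Internal", "Revenue"], ["COO"]]
print(stop_tokens_sketch(sample, ["of", "the"]))
# [['commissioner', 'internal', 'revenue'], ['coo']]
```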

In [14]:
len(stopped_tokenized_soc_titles_list)

40221

In [15]:
indeed_titles_df = pd.read_csv('47900_training.csv').title

In [16]:
indeed_titles_df.shape

(8835,)

In [17]:
indeed_titles_df = indeed_titles_df.dropna()

In [18]:
indeed_titles_df.shape

(8834,)

In [19]:
indeed_titles_df.head()

0             Front Office Coordinator
1      Customer Service Representative
2     Police Communications Specialist
3    Office Services \/ Mail Associate
4            Full-Time Store Associate
Name: title, dtype: object

In [20]:
tokenized_indeed_titles_list = [word_tokenize(title) for title in indeed_titles_df]

In [21]:
tokenized_indeed_titles_list[:5]

[['Front', 'Office', 'Coordinator'],
 ['Customer', 'Service', 'Representative'],
 ['Police', 'Communications', 'Specialist'],
 ['Office', 'Services', '\\/', 'Mail', 'Associate'],
 ['Full-Time', 'Store', 'Associate']]

In [22]:
len(tokenized_indeed_titles_list)

8834

In [23]:
stopped_tokenized_indeed_titles_list = stop_tokenized_titles(tokenized_indeed_titles_list, stopwords_list)

In [24]:
len(stopped_tokenized_indeed_titles_list)

8834

In [25]:
stopped_tokenized_indeed_titles_list = substitute_words(stopped_tokenized_indeed_titles_list)

In [26]:
stopped_tokenized_indeed_titles_list[:5]

[['front', 'office', 'coordinator'],
 ['customer', 'service', 'representative'],
 ['police', 'communications', 'specialist'],
 ['office', 'service', 'mail', 'associate'],
 ['store', 'associate']]

In [27]:
len(stopped_tokenized_indeed_titles_list)

8834

In [28]:
# Join each token list back into a single space-separated title
indeed_titles_list = [' '.join(tokenized_title)
                      for tokenized_title in stopped_tokenized_indeed_titles_list]

In [29]:
indeed_titles_list[:5]

['front office coordinator',
 'customer service representative',
 'police communications specialist',
 'office service mail associate',
 'store associate']

In [30]:
len(indeed_titles_list)

8834

In [31]:
indeed_titles_df = pd.DataFrame(indeed_titles_list)

In [32]:
indeed_titles_df = indeed_titles_df.drop_duplicates()

In [33]:
indeed_titles_df.shape

(5499, 1)

In [34]:
indeed_titles_df.head()

|   | 0 |
|---|---|
| 0 | front office coordinator |
| 1 | customer service representative |
| 2 | police communications specialist |
| 3 | office service mail associate |
| 4 | store associate |


In [35]:
stopped_tokenized_indeed_titles_list = [word_tokenize(title) for title in indeed_titles_df.iloc[:, 0]]

In [36]:
stopped_tokenized_indeed_titles_list[:5]

[['front', 'office', 'coordinator'],
 ['customer', 'service', 'representative'],
 ['police', 'communications', 'specialist'],
 ['office', 'service', 'mail', 'associate'],
 ['store', 'associate']]

In [37]:
len(stopped_tokenized_indeed_titles_list)

5499

## Modeling


In [57]:
%run soc_classification.py

In [39]:
corpus_list = stopped_tokenized_indeed_titles_list + stopped_tokenized_soc_titles_list

In [527]:
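dim = 600      # embedding dimensionality
wsize = 2      # context window; job titles are short
mcount = 3     # ignore tokens that occur fewer than 3 times
epoch = 20     # training passes over the corpus
a = 0.0001     # initial learning rate
# Parameter names follow gensim < 4.0; gensim 4.x renamed `size` to
# `vector_size` and `iter` to `epochs`.
model = Word2Vec(corpus_list,
                 size = dim,
                 window = wsize,
                 min_count = mcount,
                 iter = epoch,
                 alpha = a,
                 workers = multiprocessing.cpu_count())
# Passing the corpus to the constructor already trains the model, so this
# call runs a second round of `model.epochs` passes over the same corpus.
model.train(corpus_list, total_examples = model.corpus_count, epochs = model.epochs)
wv = model.wv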

In [355]:
wv.most_similar('customer')

[('retail', 0.9704826474189758),
 ('sale', 0.9648263454437256),
 ('guest', 0.9467342495918274),
 ('account', 0.9463907480239868),
 ('service', 0.9447579383850098),
 ('food', 0.9430411458015442),
 ('tsr', 0.9411400556564331),
 ('representative', 0.9409863948822021),
 ('store', 0.9352807998657227),
 ('sales', 0.9347476363182068)]

In [263]:
wv.most_similar('data')

[('assurance', 0.9761399626731873),
 ('network', 0.971688985824585),
 ('security', 0.9659345149993896),
 ('quality', 0.9646139740943909),
 ('lab', 0.962565541267395),
 ('management', 0.9619563817977905),
 ('intelligence', 0.9584349393844604),
 ('gis', 0.9551427364349365),
 ('protection', 0.9545958042144775),
 ('operations', 0.9539693593978882)]

In [284]:
vectorized_indeed_titles_list = vectorize_title(wv, dim, stopped_tokenized_indeed_titles_list)

In [232]:
len(vectorized_indeed_titles_list)

5499
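
`vectorize_title` comes from `soc_classification.py` and is not shown here. A common way to turn a tokenized title into one fixed-length vector is to average the word vectors of its in-vocabulary tokens. The sketch below assumes that approach, which the notebook does not confirm. It uses a toy 3-dimensional dictionary in place of `wv`, and `average_title_vector` is a hypothetical name.

```python
def average_title_vector(embeddings, dim, tokens):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# Toy embeddings standing in for the trained word vectors.
toy_embeddings = {
    "store": [1.0, 0.0, 2.0],
    "associate": [3.0, 2.0, 0.0],
}
print(average_title_vector(toy_embeddings, 3, ["store", "associate"]))
# [2.0, 1.0, 1.0]
print(average_title_vector(toy_embeddings, 3, ["tsr"]))
# [0.0, 0.0, 0.0]
```

A zero vector has no direction, so its cosine similarity to any other vector is undefined. If the real function behaves like this sketch, titles made up entirely of rare tokens would be one plausible source of NaN similarity scores.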

In [510]:
vectorized_soc_titles_list = vectorize_title(wv, dim, stopped_tokenized_soc_titles_list)

In [234]:
len(vectorized_soc_titles_list)

40221

In [235]:
# cdist returns cosine distances here; they are converted to similarities in a later cell
similarity_matrix = distance.cdist(vectorized_indeed_titles_list, vectorized_soc_titles_list, 'cosine')

In [77]:
similarity_matrix.shape

(5499, 40221)

In [105]:
similarity_matrix[0]

array([0.1826299 , 0.23561255, 0.4054996 , ..., 0.62479005, 0.17248713,
       0.23750251])

In [236]:
similarity_matrix = 1 - similarity_matrix

In [107]:
similarity_matrix[0]

array([0.8173701 , 0.76438745, 0.5945004 , ..., 0.37520995, 0.82751287,
       0.76249749])
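
SciPy's `'cosine'` metric is a distance, defined as 1 minus the cosine similarity, which is why the matrix is subtracted from 1 above. The sketch below computes the similarity for a single pair of vectors in plain Python. `cosine_similarity` is a hypothetical helper, and returning NaN for a zero-length vector is a choice made in this sketch to flag the undefined case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; NaN if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return float("nan")
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
print(cosine_similarity([0.0, 0.0], [1.0, 1.0]))  # undefined -> nan
```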

In [237]:
masked_similarity_matrix = np.ma.masked_invalid(similarity_matrix)

In [109]:
masked_similarity_matrix[0]

masked_array(data=[0.8173701 , 0.76438745, 0.5945004 , ..., 0.37520995,
                   0.82751287, 0.76249749],
             mask=False,
       fill_value=nan)

In [238]:
max_similarity_list = np.amax(masked_similarity_matrix, axis = 1)

In [117]:
max_similarity_list[:10]

masked_array(data=[1.0, 0.9999999999999999, 0.9977303617895377,
                   0.9904795308458327, 0.9958900753024992,
                   0.9975540127829251, 0.9997086807851735,
                   0.9926173237671049, 0.9930601174248639,
                   0.9998682610920837],
             mask=[False, False, False, False, False, False, False, False,
                   False, False],
       fill_value=1e+20)

In [118]:
max_similarity_list.shape

(5499,)

In [239]:
max_similarity_index_list = np.argmax(masked_similarity_matrix, axis = 1)

In [121]:
max_similarity_index_list[:10]

array([20917, 25118, 16233, 20107, 20674, 22249, 12633, 18650,  8556,
       20649])

In [140]:
max_similarity_index_list.shape

(5499,)

In [129]:
a = np.array([1, 2, np.nan, 3])

In [130]:
ma = np.ma.masked_invalid(a)

In [131]:
ma

masked_array(data=[1.0, 2.0, --, 3.0],
             mask=[False, False,  True, False],
       fill_value=1e+20)

In [132]:
i = np.argmax(ma)

In [133]:
i

3
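
The small experiment above confirms that `np.argmax` on a masked array skips the NaN entry. The same row-wise selection (best index and best score, ignoring NaN) can be written in plain Python as follows; `best_match` is a hypothetical helper and the scores are made up.

```python
import math

def best_match(scores):
    """Return (index, score) of the highest non-NaN score, or (None, nan)."""
    best_index, best_score = None, float("nan")
    for index, score in enumerate(scores):
        if math.isnan(score):
            continue  # skip undefined similarities, like a masked entry
        if best_index is None or score > best_score:
            best_index, best_score = index, score
    return best_index, best_score

# Two made-up rows of similarity scores, each with one undefined entry.
similarity_rows = [
    [0.82, 0.76, float("nan"), 0.59],
    [float("nan"), 0.38, 0.83, 0.76],
]
print([best_match(row) for row in similarity_rows])
# [(0, 0.82), (2, 0.83)]
```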

In [134]:
# Rebuild the title strings from the deduplicated token lists
indeed_titles_list = [' '.join(tokenized_title)
                      for tokenized_title in stopped_tokenized_indeed_titles_list]

In [135]:
indeed_titles_list[:5]

['front office coordinator',
 'customer service representative',
 'police communications specialist',
 'office service mail associate',
 'store associate']

In [136]:
len(indeed_titles_list)

5499

In [137]:
# Join each SOC token list back into a single space-separated title
soc_titles_list = [' '.join(tokenized_title)
                   for tokenized_title in stopped_tokenized_soc_titles_list]

In [138]:
soc_titles_list[:5]

['ceo',
 'chief executive officer',
 'chief operating officer',
 'commissioner internal revenue',
 'coo']

In [139]:
len(soc_titles_list)

40221

In [142]:
# SOC title of the best match for each Indeed title
v = [soc_titles_list[i] for i in max_similarity_index_list]

In [170]:
max_similarity_index_list[0]

20917

In [172]:
soc_titles_df.iloc[0]

title        CEO
soc_6    11-1011
Name: 0, dtype: object

In [173]:
# SOC code of the best match; positions in soc_titles_list and soc_titles_df line up
y = [soc_titles_df.iloc[i].soc_6 for i in max_similarity_index_list]

In [174]:
len(y)

5499

In [175]:
df = pd.DataFrame({'indeed title': indeed_titles_list,
                   'soc title': v,
                   'soc code': y,
                   'cosine score': max_similarity_list})

In [176]:
df.head(100)

|   | indeed title | soc title | soc code | cosine score |
|---|---|---|---|---|
| 0 | front office coordinator | front office coordinator | 43-6013 | 1.000000 |
| 1 | customer service representative | customer service representative | 49-9031 | 1.000000 |
| 2 | police communications specialist | radiation protection specialist | 29-9011 | 0.997730 |
| 3 | office service mail associate | library customer service clerk | 43-4121 | 0.990480 |
| 4 | store associate | store marketing associate ism associate | 43-5081 | 0.995890 |
| ... | ... | ... | ... | ... |
| 95 | investigation specialist | activity specialist | 39-9032 | 0.999643 |
| 96 | bookseller textbooks temp howard university bo... | university archivist | 25-4011 | 0.999646 |
| 97 | busser | busser | 35-9011 | 1.000000 |
| 98 | mail courier driver | city route driver | 53-3031 | 0.999209 |


## Testing

In [182]:
test_df = pd.read_csv('47900_test.csv')

In [184]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 22 columns):
Unnamed: 0        200 non-null int64
_id               200 non-null object
jk                200 non-null object
efccid            199 non-null object
srcid             200 non-null object
cmpid             200 non-null object
num               200 non-null int64
srcname           200 non-null object
cmp               185 non-null object
cmpesc            185 non-null object
cmplnk            179 non-null object
loc               179 non-null object
country           178 non-null object
zip               78 non-null float64
city              174 non-null object
title             200 non-null object
locid             176 non-null object
rd                176 non-null object
date              200 non-null object
msa               200 non-null int64
SOC_title_2010    190 non-null object
SOC_2010          190 non-null object
dtypes: float64(1), int64(3), object(18)
memory usage: 34.5+ KB

In [185]:
test_df = test_df.dropna(subset = ['title'])

In [186]:
test_df.shape

(200, 22)

In [187]:
test_titles_df = test_df.title

In [188]:
tokenized_test_titles_list = [word_tokenize(title) for title in test_titles_df]

In [189]:
stopped_tokenized_test_titles_list = stop_tokenized_titles(tokenized_test_titles_list, stopwords_list)

In [190]:
stopped_tokenized_test_titles_list = substitute_words(stopped_tokenized_test_titles_list)

In [191]:
len(stopped_tokenized_test_titles_list)

200

Below is the test pipeline to:

* Vectorize tokenized and stopped job titles in the test data;
* Compute cosine similarity scores against the SOC-coded job titles;
* Match each job title from the test data to the SOC-coded job title with the highest similarity score; and
* Compute accuracy for 2-digit SOC codes and for 6-digit SOC codes with the last digit set to zero (the broad-occupation level).

In [511]:
vectorized_test_titles_list = vectorize_title(wv, dim, stopped_tokenized_test_titles_list)

In [512]:
similarity_matrix = 1 - distance.cdist(vectorized_test_titles_list, vectorized_soc_titles_list, 'cosine')

In [513]:
masked_similarity_matrix = np.ma.masked_invalid(similarity_matrix)

In [514]:
max_similarity_list = np.amax(masked_similarity_matrix, axis = 1)

In [515]:
max_similarity_index_list = np.argmax(masked_similarity_matrix, axis = 1)

In [516]:
# SOC title of the best match for each test title
v = [soc_titles_list[i] for i in max_similarity_index_list]

In [517]:
# SOC code of the best match for each test title
y = [soc_titles_df.iloc[i].soc_6 for i in max_similarity_index_list]

In [518]:
df = pd.DataFrame({'test title': test_df.title,
                   'test soc code': test_df.SOC_2010,
                   'soc title': v,
                   'soc code': y,
                   'cosine score': max_similarity_list})

In [528]:
df[170:200]

|   | test title | test soc code | soc title | soc code | cosine score | label_2 | label_6 | soc_2 | soc_6 |
|---|---|---|---|---|---|---|---|---|---|
| 170 | Spa Concierge at a Luxury Day Spa | 39-6010 | day spa manager | 39-1021 | 0.661977 | 39 | 39-6010 | 39 | 39-1020 |
| 171 | Bookkeeping Assistant | 43-3030 | bookkeeping assistant | 43-3021 | 1.0 | 43 | 43-3030 | 43 | 43-3020 |
| 172 | Busboys and Poets Hyattsville - Food Preparati... | 35-9090 | food preparation worker | 35-2021 | 0.760179 | 35 | 35-9090 | 35 | 35-2020 |
| 173 | Dispatcher | 43-5030 | dispatcher | 43-5031 | 1.0 | 43 | 43-5030 | 43 | 43-5030 |
| 174 | Restaurant Team Member - Crew (1579 - CityVista) | 35-3030 | restaurant team member | 35-3021 | 0.87501 | 35 | 35-3030 | 35 | 35-3020 |
| 175 | Service Desk - Overnight | 43-4170 | desk monitor | 39-3091 | 0.557692 | 43 | 43-4170 | 39 | 39-3090 |
| 176 | Engineering Technician | 17-3020 | engineering technician | 17-3029 | 1.0 | 17 | 17-3020 | 17 | 17-3020 |
| 177 | Spanish FLES Teacher, ES | 25-2020 | barbering teacher | 25-1194 | 0.720723 | 25 | 25-2020 | 25 | 25-1190 |
| 178 | Dish | 35-9020 | dish washer | 35-9021 | 0.713782 | 35 | 35-9020 | 35 | 35-9020 |
| 179 | 16523 - Investigator - GS-15 | 13-1030 | epidemiology investigator | 19-1041 | 1.0 | 13 | 13-1030 | 19 | 19-1040 |


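In [520]:
# 2-digit major group of the hand-assigned label (a missing label becomes 'na')
df['label_2'] = [str(code)[0:2] for code in df['test soc code']]

In [521]:
# Zero the last digit to compare at the broad-occupation level (a missing label becomes 'nan0')
df['label_6'] = [str(code)[0:6]+'0' for code in df['test soc code']]

In [522]:
# 2-digit major group of the predicted SOC code
df['soc_2'] = [str(code)[0:2] for code in df['soc code']]

In [523]:
# Predicted SOC code with the last digit zeroed
df['soc_6'] = [str(code)[0:6]+'0' for code in df['soc code']]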
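
The four cells above truncate each code with string slicing, so a missing label is first turned into the string `'nan'`. The sketch below shows one way the same truncation could treat missing codes explicitly. `soc_levels` is a hypothetical helper, not part of the notebook's modules.

```python
def soc_levels(code):
    """Split a detailed SOC code such as '43-3021' into coarser levels.

    Returns (major_group, broad_occupation), or (None, None) when the
    code is missing (None or NaN).
    """
    if code is None or code != code:  # NaN is the only value unequal to itself
        return None, None
    text = str(code)
    return text[:2], text[:6] + "0"

print(soc_levels("43-3021"))     # ('43', '43-3020')
print(soc_levels(float("nan")))  # (None, None)
```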

In [524]:
# Unlabeled test rows stay in the denominator and count as misses
df.loc[df.label_6 == df.soc_6].count()['test title'] / len(df)

0.265

In [525]:
# Unlabeled test rows stay in the denominator and count as misses
df.loc[df.label_2 == df.soc_2].count()['test title'] / len(df)

0.49
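
Both accuracy figures divide by all 200 test rows, although `test_df.info()` reports only 190 non-null SOC labels, so the unlabeled rows are counted as misses. The sketch below shows how the figure changes when unlabeled rows are excluded instead. `accuracy` is a hypothetical helper, and the five rows are a small illustration (four labeled rows taken from the output earlier in this section, plus one unlabeled row), not the full test set.

```python
def accuracy(labels, predictions, skip_unlabeled=False):
    """Share of rows where the prediction equals the label.

    With skip_unlabeled=True, rows whose label is None are left out of
    both the numerator and the denominator.
    """
    pairs = list(zip(labels, predictions))
    if skip_unlabeled:
        pairs = [(label, pred) for label, pred in pairs if label is not None]
    if not pairs:
        return float("nan")
    return sum(label == pred for label, pred in pairs) / len(pairs)

# Broad-occupation codes for a handful of rows; None marks a missing label.
labels      = ["43-3030", "43-5030", "17-3020", "35-9020", None]
predictions = ["43-3020", "43-5030", "17-3020", "35-9020", "27-2040"]
print(accuracy(labels, predictions))                       # 0.6
print(accuracy(labels, predictions, skip_unlabeled=True))  # 0.75
```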

In [301]:
df

|   | test title | test soc code | soc title | soc code | cosine score | label_2 | label_6 | soc_2 | soc_6 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Bizzelle Group Job Openings |  | choir accompanist | 27-2042 | 0.998337 | na | nan0 | 27 | 27-2040 |
| 1 | Pavement Marking and Sign Operator | 47-2070 | color printer operator | 51-9151 | 0.999906 | 47 | 47-2070 | 51 | 51-9150 |
| 2 | Program Associate, YEAR Program | 13-1190 | environmental business development associate | 19-2041 | 0.996578 | 13 | 13-1190 | 19 | 19-2040 |
| 3 | AMP Host | 39-3030 | event host | 27-3012 | 0.997654 | 39 | 39-3030 | 27 | 27-3010 |
| 4 | Unarmed Security Officer | 33-9030 | campus security officer | 33-9032 | 0.999020 | 33 | 33-9030 | 33 | 33-9030 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | Concierge | 39-6010 | concierge | 39-6012 | 1.000000 | 39 | 39-6010 | 39 | 39-6010 |
| 196 | Client Service Representative | 43-4050 | claim service representative | 43-9041 | 0.999621 | 43 | 43-4050 | 43 | 43-9040 |
| 197 | UNIT SECRETARY - Emergency Room | 43-6010 | emergency room orderly | 31-1015 | 0.993650 | 43 | 43-6010 | 31 | 31-1010 |
| 198 | Warehouse Specialist | 43-5080 | warehouse specialist | 43-5081 | 1.000000 | 43 | 43-5080 | 43 | 43-5080 |


In [390]:
df[:30]

|   | test title | test soc code | soc title | soc code | cosine score | label_2 | label_6 | soc_2 | soc_6 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Bizzelle Group Job Openings |  | deputy attorney general | 23-1011 | 0.99739 | na | nan0 | 23 | 23-1010 |
| 1 | Pavement Marking and Sign Operator | 47-2070 | rod tape operator | 51-9199 | 0.999833 | 47 | 47-2070 | 51 | 51-9190 |
| 2 | Program Associate, YEAR Program | 13-1190 | environmental business development associate | 19-2041 | 0.99508 | 13 | 13-1190 | 19 | 19-2040 |
| 3 | AMP Host | 39-3030 | event host | 27-3012 | 0.996696 | 39 | 39-3030 | 27 | 27-3010 |
| 4 | Unarmed Security Officer | 33-9030 | campus security officer | 33-9032 | 0.999358 | 33 | 33-9030 | 33 | 33-9030 |
| 5 | Customer Relations and Recovery Manager. | 11-2000 | salon customer experience specialist | 39-5012 | 0.996989 | 11 | 11-2000 | 39 | 39-5010 |
| 6 | Temporary Book Store Clerks | 41-2020 | grocery store courtesy clerk | 53-7064 | 0.999179 | 41 | 41-2020 | 53 | 53-7060 |
| 7 | PLANNER III | 19-3051 | facility planner | 17-3022 | 0.999505 | 19 | 19-3050 | 17 | 17-3020 |
| 8 | 075 Sales Associate - Inbound Telephone | 41-9040 | retail commerce sales associate | 13-1199 | 0.99399 | 41 | 41-9040 | 13 | 13-1190 |
| 9 | Sales Assistant\/Office Administrator - Join o... | 43-6010 | staff research associate | 19-4021 | 0.980191 | 43 | 43-6010 | 19 | 19-4020 |
