# Model

This notebook builds a Doc2Vec model for matching course descriptions to job descriptions. It trains the model on a corpus of course descriptions from the Coursera catalog, then evaluates it against a sample of job descriptions for which relevant courses have been pre-selected. Finally, the model is pickled (serialized) for use in an API.

In [101]:
# Import libraries

import numpy as np
import pandas as pd
import gensim
from scipy.spatial import distance
import pickle

## Model Building

In [102]:
# Read in the course data

course_df = pd.read_csv('../EDA/Coursera_Catalog_English.csv')
print(course_df.shape)
course_df.head(2)

(3185, 10)


Unnamed: 0,courseType,description,domainTypes,id,slug,specializations,workload,primaryLanguages,certificates,name
0,v2.ondemand,Gamification is the application of game elemen...,"[{'subdomainId': 'design-and-product', 'domain...",69Bku0KoEeWZtA4u62x6lQ,gamification,[],4-8 hours/week,['en'],['VerifiedCert'],Gamification
1,v2.ondemand,This course will cover the steps used in weigh...,"[{'subdomainId': 'data-analysis', 'domainId': ...",0HiU7Oe4EeWTAQ4yevf_oQ,missing-data,[],"4 weeks of study, 1-2 hours/week",['en'],"['VerifiedCert', 'Specialization']",Dealing With Missing Data


In [103]:
# Preprocess the course data as a training corpus

# Convert the course descriptions to a list
course_descriptions = course_df['description'].to_list()

# Define a function for basic preprocessing:
# tokenize, lowercase, de-punctuate, document tagging
def read_corpus(corpus):
    for i, doc in enumerate(corpus):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Run the function on the list of course descriptions
train_corpus = list(read_corpus(course_descriptions))

In [104]:
# Examine the first document in the corpus (tag at end)

print(train_corpus[0])

TaggedDocument(['gamification', 'is', 'the', 'application', 'of', 'game', 'elements', 'and', 'digital', 'game', 'design', 'techniques', 'to', 'non', 'game', 'problems', 'such', 'as', 'business', 'and', 'social', 'impact', 'challenges', 'this', 'course', 'will', 'teach', 'you', 'the', 'mechanisms', 'of', 'gamification', 'why', 'it', 'has', 'such', 'tremendous', 'potential', 'and', 'how', 'to', 'use', 'it', 'effectively', 'for', 'additional', 'information', 'on', 'the', 'concepts', 'described', 'in', 'the', 'course', 'you', 'can', 'purchase', 'professor', 'werbach', 'book', 'for', 'the', 'win', 'how', 'game', 'thinking', 'can', 'revolutionize', 'your', 'business', 'in', 'print', 'or', 'ebook', 'format', 'in', 'several', 'languages'], [0])
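
For intuition about what the preprocessing step produces, here is a rough standard-library approximation of the tokenization: lowercase the text, keep alphabetic runs, and drop very short or very long tokens. The helper name `rough_preprocess` is an assumption for illustration, and the length bounds mirror the documented defaults of `simple_preprocess` (`min_len=2`, `max_len=15`). It is not gensim's implementation and may differ on edge cases.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenizer (hypothetical helper, not gensim's code)."""
    # Lowercase, then keep runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short or too long
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "Gamification applies game design to non-game problems. See Professor Werbach's book!"
print(rough_preprocess(sample))
```

This shows why the tagged document above contains `'non', 'game'` as separate tokens and `'werbach', 'book'` without the possessive: punctuation splits words, and the single-letter remainder is discarded.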


In [105]:
# Build a model

# Instantiate the model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50,
                                      window=5,
                                      dm=0, # PV-DBOW
                                      dm_concat=1, # only used by PV-DM (dm=1); no effect when dm=0
                                      min_count=2,
                                      epochs=10)

# Create a vocabulary for the model
model.build_vocab(train_corpus)

# Train the model on the corpus
model.train(train_corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)

In [106]:
# Self-assess the model

# Infer a vector for a document in the training corpus itself.
vector = model.infer_vector(train_corpus[0].words)

# Find which documents in the training corpus it is most similar to.
# Note: gensim 4.x renames `docvecs` to `dv`.
sims = model.docvecs.most_similar([vector])
sims

[(0, 0.9417077302932739),
 (2696, 0.9248173236846924),
 (349, 0.9130089282989502),
 (1530, 0.9050692915916443),
 (2956, 0.903048038482666),
 (143, 0.8966079354286194),
 (842, 0.8923130035400391),
 (1961, 0.8914554119110107),
 (1138, 0.8891334533691406),
 (385, 0.8859497904777527)]

### Analysis:
The model ranks the queried document as most similar to itself in the training corpus, with a cosine similarity of about 0.94.
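
One way to make this check explicit is to look up the position of the queried document's tag in the similarity list. The sketch below uses the ten `(tag, similarity)` pairs printed above, rounded to three decimals; `rank_of` is a hypothetical helper name. Inferred vectors can vary between calls to `infer_vector`, so the exact scores are not expected to reproduce.

```python
# (tag, cosine similarity) pairs copied from the output above, rounded
sims = [(0, 0.942), (2696, 0.925), (349, 0.913), (1530, 0.905), (2956, 0.903),
        (143, 0.897), (842, 0.892), (1961, 0.891), (1138, 0.889), (385, 0.886)]

def rank_of(tag, sims):
    """Return the 0-based position of `tag` in a most_similar-style list, or None."""
    for position, (doc_tag, _score) in enumerate(sims):
        if doc_tag == tag:
            return position
    return None

# Rank 0 means the document was matched to itself first
print(rank_of(0, sims))
```

Repeating this for every training document and counting how often the rank is 0 would give a fuller sanity check than a single document.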

## Model Evaluation

In [107]:
# Read in the sample job descriptions

jobs_test_sample = pd.read_csv('../EDA/jobs_test_sample.csv')
jobs_test_sample

Unnamed: 0,Job_title,Job_Desc,Job_id
0,Data Scientist,We are looking for Data Scientists who are int...,901
1,Data Scientist,The world's largest and fastest-growing compan...,910
2,Data Scientist,\nRole: Data Scientist.\n\nLocation: Foster Ci...,916
3,Data Scientist,Upstart is the leading AI lending platform par...,920
4,Data Scientist,"Why Divvy?Over the past decade, millions of Am...",938
5,Data Engineer,About Rocket LawyerWe believe everyone deserve...,935
6,Data Engineer,Our mission is to create a world where mental ...,1068
7,Data Engineer,Data Engineer \nIf you are a Data Engineer wit...,1089
8,Data Engineer,Prabhav Services Inc. is one of the premier pr...,1100
9,Data Engineer,About Skupos\nSkupos is the data platform for ...,1105


In [108]:
# Read in the sample courses relevant to the job descriptions

courses_test_sample = pd.read_csv('../EDA/courses_test_sample.csv')
courses_test_sample

Unnamed: 0,name,description,job_title,course_id
0,Machine Learning,Machine learning is the science of getting com...,Data Scientist,107
1,Databases and SQL for Data Science,Much of the world's data resides in databases....,Data Engineer,413
2,Google Cloud Platform Big Data and Machine Lea...,This 2-week accelerated on-demand course intro...,Data Engineer,764
3,"Data Warehouse Concepts, Design, and Data Inte...",This is the second course in the Data Warehous...,Data Engineer,964
4,Machine Learning with Python,This course dives into the basics of machine l...,Data Scientist,1837
5,Applied Machine Learning in Python,This course will introduce the learner to appl...,Data Scientist,2300
6,Data Visualization with Python,"""A picture is worth a thousand words"". We are ...",Data Scientist,2590
7,Database Management Essentials,Database Management Essentials provides the fo...,Data Engineer,2721
8,The Data Scientist’s Toolbox,In this course you will get an introduction to...,Data Scientist,2766
9,Big Data Modeling and Management Systems,Once you’ve identified a big data issue to ana...,Data Engineer,3051


In [109]:
# Split the sample courses by job title

# Filter by title and reset the index on both subsets for consistency
ds_courses = courses_test_sample.loc[courses_test_sample['job_title'] == "Data Scientist"].reset_index(drop=True)
de_courses = courses_test_sample.loc[courses_test_sample['job_title'] == "Data Engineer"].reset_index(drop=True)
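
The split above can also be expressed as a single lookup from job title to the pre-selected course IDs, which avoids keeping one variable per title. This is a plain-Python sketch using the `course_id` and `job_title` values from the table above; the names `labeled_courses` and `courses_by_title` are assumptions, not something the notebook defines.

```python
from collections import defaultdict

# (course_id, job_title) pairs copied from courses_test_sample
labeled_courses = [
    (107, "Data Scientist"), (413, "Data Engineer"), (764, "Data Engineer"),
    (964, "Data Engineer"), (1837, "Data Scientist"), (2300, "Data Scientist"),
    (2590, "Data Scientist"), (2721, "Data Engineer"), (2766, "Data Scientist"),
    (3051, "Data Engineer"),
]

# Group the course IDs under their job title
courses_by_title = defaultdict(list)
for course_id, job_title in labeled_courses:
    courses_by_title[job_title].append(course_id)

print(dict(courses_by_title))
```

With pandas, a similar mapping could be built with `courses_test_sample.groupby('job_title')['course_id'].apply(list)`.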

In [111]:
# Calculate the model's distance score for this sample data:
# A lower score is better because the model should recognize
# that these jobs and courses are similar to each other.

# Initiate a list of distances
distances = []

# Loop through the labeled jobs
for i in range(len(jobs_test_sample)):
    
    # Extract the information for the job
    job_title, job_desc, job_id = jobs_test_sample.iloc[i]
    
    # Preprocess the job description
    doc = gensim.utils.simple_preprocess(job_desc)
    
    # Convert the job description into a vector
    job_vec = model.infer_vector(doc)
    
    # Check the job title and loop through relevant courses
    if job_title == 'Data Scientist':
        
        for j in range(len(ds_courses)):
            
            # Extract the course id
            course_id = ds_courses.iloc[j,3]
            
            # Find the distance between the job and course
            d = distance.cosine(job_vec, model.docvecs[course_id])
            
            # Add the result to the distance list
            distances.append(d)
            
    if job_title == 'Data Engineer':
        
        for j in range(len(de_courses)):
            
            # Extract the course id
            course_id = de_courses.iloc[j,3]
            
            # Find the distance between the job and course
            d = distance.cosine(job_vec, model.docvecs[course_id])
            
            # Add the result to the distance list
            distances.append(d)

# Summarize the distances across all job-course pairs
avg_dist = round(sum(distances) / len(distances), 3)
min_dist = round(min(distances), 3)
max_dist = round(max(distances), 3)

print(f"Mean distance score: {avg_dist}")
print(f"Min distance score: {min_dist}")
print(f"Max distance score: {max_dist}")

Mean distance score: 0.273
Min distance score: 0.085
Max distance score: 0.423
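
The distance score used here is cosine distance, which is one minus the cosine similarity of two vectors: 0 means the vectors point in the same direction, and 1 means they are orthogonal. The sketch below computes it by hand on small made-up vectors (the vectors and the function name `cosine_distance` are illustrative assumptions); `scipy.spatial.distance.cosine` computes the same quantity for 1-D arrays.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Parallel vectors: distance is (numerically close to) 0
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Orthogonal vectors: distance is 1
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

The same relationship is why a similarity returned by `most_similar` can be turned into a distance with `1 - similarity`.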


## Detailed Results

In [112]:
# Define a function to display the results for each labeled job

def display_recs(job_id):
    
    # Initialize a dictionary for the job
    job_dict = {}
    keys = ['job_id', 'job_title', 'course_id', 'course_name', 'selected', 'distance']
    for key in keys:
        job_dict[key] = []
    
    # Extract info about the job
    i = jobs_test_sample.loc[jobs_test_sample['Job_id'] == job_id].index.values[0]
    job_title = jobs_test_sample.iloc[i]['Job_title']
    job_desc = jobs_test_sample.iloc[i]['Job_Desc']
    
    # Preprocess the job description
    doc = gensim.utils.simple_preprocess(job_desc)
    
    # Convert the job description into a vector
    job_vec = model.infer_vector(doc)
    
    # Get top 5 courses from the model
    top_recs = model.docvecs.most_similar(positive=[job_vec], topn=5)
    
    # Loop through top 5 courses
    for rec in top_recs:
        
        # Add job info to the dictionary
        job_dict['job_id'].append(job_id)
        job_dict['job_title'].append(job_title)
        
        # Add course info to the dictionary
        course_id = rec[0]
        job_dict['course_id'].append(course_id)
        course_name = course_df.loc[course_id, 'name']
        job_dict['course_name'].append(course_name)
        job_dict['selected'].append(0)
        
        # Add the distance score to the dictionary
        d = round((1-rec[1]), 3)
        job_dict['distance'].append(d)
    
    # Now loop through the courses pre-selected as relevant
    if job_title == 'Data Scientist':
        
        for i in range(len(ds_courses)):
            
            # Add job info to the dictionary
            job_dict['job_id'].append(job_id)
            job_dict['job_title'].append(job_title)
            
            # Add course info to the dictionary
            course_id = ds_courses.iloc[i,3]
            job_dict['course_id'].append(course_id)
            course_name = ds_courses.iloc[i,0]
            job_dict['course_name'].append(course_name)
            job_dict['selected'].append(1)
            
            # Find and add the distance score to the dictionary
            d = distance.cosine(job_vec, model.docvecs[course_id])
            d = round(d,3)
            job_dict['distance'].append(d)
            
    if job_title == 'Data Engineer':
        
        for i in range(len(de_courses)):
            
            # Add job info to the dictionary
            job_dict['job_id'].append(job_id)
            job_dict['job_title'].append(job_title)
            
            # Add course info to the dictionary
            course_id = de_courses.iloc[i,3]
            job_dict['course_id'].append(course_id)
            course_name = de_courses.iloc[i,0]
            job_dict['course_name'].append(course_name)
            job_dict['selected'].append(1)
            
            # Find and add the distance score to the dictionary
            d = distance.cosine(job_vec, model.docvecs[course_id])
            d = round(d,3)
            job_dict['distance'].append(d)
    
    # Convert the dictionary into a dataframe
    df = pd.DataFrame(job_dict)
    
    # Rank by distance
    df = df.sort_values('distance')
    
    return df

In [114]:
# Job 1
display_recs(901)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,901,Data Scientist,2032,Social Media Data Analytics,0,0.195
1,901,Data Scientist,1697,"Big Data Essentials: HDFS, MapReduce and Spark...",0,0.198
2,901,Data Scientist,1269,New Technologies for Business Leaders,0,0.206
3,901,Data Scientist,1251,Introduction to Social Media Analytics,0,0.206
4,901,Data Scientist,917,Predictive Analytics and Data Mining,0,0.207
8,901,Data Scientist,2590,Data Visualization with Python,1,0.267
9,901,Data Scientist,2766,The Data Scientist’s Toolbox,1,0.294
7,901,Data Scientist,2300,Applied Machine Learning in Python,1,0.303
5,901,Data Scientist,107,Machine Learning,1,0.321
6,901,Data Scientist,1837,Machine Learning with Python,1,0.375


In [115]:
# Job 2
display_recs(910)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,910,Data Scientist,3052,Business Transformation with Google Cloud,0,0.258
1,910,Data Scientist,2267,Digital Product Management: Modern Fundamentals,0,0.259
2,910,Data Scientist,1378,Executive Data Science Capstone,0,0.266
3,910,Data Scientist,1465,People Analytics,0,0.268
4,910,Data Scientist,2868,Moving to the Cloud,0,0.269
8,910,Data Scientist,2590,Data Visualization with Python,1,0.326
9,910,Data Scientist,2766,The Data Scientist’s Toolbox,1,0.328
7,910,Data Scientist,2300,Applied Machine Learning in Python,1,0.417
6,910,Data Scientist,1837,Machine Learning with Python,1,0.423
5,910,Data Scientist,107,Machine Learning,1,0.426


In [116]:
# Job 3
display_recs(916)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,916,Data Scientist,458,Data Science for Business Innovation,0,0.057
1,916,Data Scientist,1066,Foundations of Data Science: K-Means Clusterin...,0,0.058
2,916,Data Scientist,2270,Pattern Discovery in Data Mining,0,0.065
3,916,Data Scientist,1909,Ordered Data Structures,0,0.067
4,916,Data Scientist,2791,Data Science in Stratified Healthcare and Prec...,0,0.07
8,916,Data Scientist,2590,Data Visualization with Python,1,0.074
9,916,Data Scientist,2766,The Data Scientist’s Toolbox,1,0.086
7,916,Data Scientist,2300,Applied Machine Learning in Python,1,0.104
5,916,Data Scientist,107,Machine Learning,1,0.137
6,916,Data Scientist,1837,Machine Learning with Python,1,0.191


In [117]:
# Job 4
display_recs(920)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,920,Data Scientist,633,Big Data Applications: Machine Learning at Scale,0,0.19
1,920,Data Scientist,1697,"Big Data Essentials: HDFS, MapReduce and Spark...",0,0.194
2,920,Data Scientist,2214,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.206
3,920,Data Scientist,1843,Data Science in Real Life,0,0.234
4,920,Data Scientist,1935,"Big Data, Artificial Intelligence, and Ethics",0,0.235
8,920,Data Scientist,2590,Data Visualization with Python,1,0.269
5,920,Data Scientist,107,Machine Learning,1,0.283
6,920,Data Scientist,1837,Machine Learning with Python,1,0.305
9,920,Data Scientist,2766,The Data Scientist’s Toolbox,1,0.314
7,920,Data Scientist,2300,Applied Machine Learning in Python,1,0.363


In [118]:
# Job 5
display_recs(938)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,938,Data Scientist,2984,Foundations of strategic business analytics,0,0.143
1,938,Data Scientist,2219,Big Data Emerging Technologies,0,0.144
2,938,Data Scientist,298,Business Analytics Executive Overview,0,0.148
3,938,Data Scientist,2874,Business Analytics and Digital Media,0,0.155
4,938,Data Scientist,507,Relational Database Support for Data Warehouses,0,0.155
8,938,Data Scientist,2590,Data Visualization with Python,1,0.181
9,938,Data Scientist,2766,The Data Scientist’s Toolbox,1,0.187
7,938,Data Scientist,2300,Applied Machine Learning in Python,1,0.264
5,938,Data Scientist,107,Machine Learning,1,0.286
6,938,Data Scientist,1837,Machine Learning with Python,1,0.288


In [119]:
# Job 6
display_recs(935)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,935,Data Engineer,964,"Data Warehouse Concepts, Design, and Data Inte...",0,0.191
7,935,Data Engineer,964,"Data Warehouse Concepts, Design, and Data Inte...",1,0.191
1,935,Data Engineer,2841,Wharton Business and Financial Modeling Capstone,0,0.204
2,935,Data Engineer,1804,"Business Intelligence Concepts, Tools, and App...",0,0.205
3,935,Data Engineer,2423,Survey Data Collection and Analytics Project (...,0,0.205
4,935,Data Engineer,917,Predictive Analytics and Data Mining,0,0.209
9,935,Data Engineer,3051,Big Data Modeling and Management Systems,1,0.222
8,935,Data Engineer,2721,Database Management Essentials,1,0.223
6,935,Data Engineer,764,Google Cloud Platform Big Data and Machine Lea...,1,0.244
5,935,Data Engineer,413,Databases and SQL for Data Science,1,0.276


In [120]:
# Job 7
display_recs(1068)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,1068,Data Engineer,2214,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.31
1,1068,Data Engineer,1697,"Big Data Essentials: HDFS, MapReduce and Spark...",0,0.311
2,1068,Data Engineer,1479,Search Advertising,0,0.33
3,1068,Data Engineer,3051,Big Data Modeling and Management Systems,0,0.333
9,1068,Data Engineer,3051,Big Data Modeling and Management Systems,1,0.333
4,1068,Data Engineer,1441,Data Manipulation at Scale: Systems and Algori...,0,0.335
6,1068,Data Engineer,764,Google Cloud Platform Big Data and Machine Lea...,1,0.371
5,1068,Data Engineer,413,Databases and SQL for Data Science,1,0.398
7,1068,Data Engineer,964,"Data Warehouse Concepts, Design, and Data Inte...",1,0.412
8,1068,Data Engineer,2721,Database Management Essentials,1,0.42


In [121]:
# Job 8
display_recs(1089)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,1089,Data Engineer,2214,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.2
1,1089,Data Engineer,1479,Search Advertising,0,0.233
2,1089,Data Engineer,619,Data Analysis and Interpretation Capstone,0,0.234
3,1089,Data Engineer,2428,Agile Analytics,0,0.237
4,1089,Data Engineer,175,"Big Data, Genes, and Medicine",0,0.242
9,1089,Data Engineer,3051,Big Data Modeling and Management Systems,1,0.288
7,1089,Data Engineer,964,"Data Warehouse Concepts, Design, and Data Inte...",1,0.29
8,1089,Data Engineer,2721,Database Management Essentials,1,0.297
5,1089,Data Engineer,413,Databases and SQL for Data Science,1,0.318
6,1089,Data Engineer,764,Google Cloud Platform Big Data and Machine Lea...,1,0.363


In [122]:
# Job 9
display_recs(1100)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,1100,Data Engineer,22,Code Free Data Science,0,0.128
1,1100,Data Engineer,3051,Big Data Modeling and Management Systems,0,0.129
9,1100,Data Engineer,3051,Big Data Modeling and Management Systems,1,0.129
2,1100,Data Engineer,2698,Introduction to FPGA Design for Embedded Systems,0,0.13
3,1100,Data Engineer,3064,Responsive Website Tutorial and Examples,0,0.134
4,1100,Data Engineer,24,Getting Started With Application Development,0,0.137
7,1100,Data Engineer,964,"Data Warehouse Concepts, Design, and Data Inte...",1,0.139
8,1100,Data Engineer,2721,Database Management Essentials,1,0.155
5,1100,Data Engineer,413,Databases and SQL for Data Science,1,0.164
6,1100,Data Engineer,764,Google Cloud Platform Big Data and Machine Lea...,1,0.186


In [123]:
# Job 10
display_recs(1105)

Unnamed: 0,job_id,job_title,course_id,course_name,selected,distance
0,1105,Data Engineer,3052,Business Transformation with Google Cloud,0,0.203
1,1105,Data Engineer,1545,Introduction to Big Data,0,0.213
2,1105,Data Engineer,1378,Executive Data Science Capstone,0,0.214
3,1105,Data Engineer,1479,Search Advertising,0,0.218
4,1105,Data Engineer,71,Google Cloud Product Fundamentals,0,0.222
9,1105,Data Engineer,3051,Big Data Modeling and Management Systems,1,0.225
7,1105,Data Engineer,964,"Data Warehouse Concepts, Design, and Data Inte...",1,0.23
8,1105,Data Engineer,2721,Database Management Essentials,1,0.256
6,1105,Data Engineer,764,Google Cloud Platform Big Data and Machine Lea...,1,0.262
5,1105,Data Engineer,413,Databases and SQL for Data Science,1,0.282


### Analysis:
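The pipeline works end to end, but the model's performance is modest. The top-5 recommendations are often plausibly related to the job, yet the pre-selected courses usually rank below them: a pre-selected course appears in the model's top 5 for only three of the ten jobs (935, 1068, and 1100). Some recommendations are clearly off-target, such as "Search Advertising" or "Introduction to FPGA Design for Embedded Systems" for Data Engineer roles, so the model is almost certainly not surfacing the most relevant courses in the catalog for job seekers.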
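
One way to put a number on this is to count, for each job, how many of the pre-selected courses appear in the model's top 5. The sketch below does that for three of the jobs, using course IDs copied from the result tables above; `hits_at_k` is a hypothetical helper, and this metric is a suggestion rather than part of the evaluation run in this notebook.

```python
# Pre-selected course IDs per job title (from courses_test_sample)
selected = {
    "Data Scientist": {107, 1837, 2300, 2590, 2766},
    "Data Engineer": {413, 764, 964, 2721, 3051},
}

# job_id -> (job_title, top-5 course IDs as listed in the tables above)
top5 = {
    901: ("Data Scientist", [2032, 1697, 1269, 1251, 917]),
    935: ("Data Engineer", [964, 2841, 1804, 2423, 917]),
    1068: ("Data Engineer", [2214, 1697, 1479, 3051, 1441]),
}

def hits_at_k(recommended, relevant):
    """Count how many recommended course IDs are in the relevant set."""
    return sum(1 for course_id in recommended if course_id in relevant)

hits = {job_id: hits_at_k(recs, selected[title])
        for job_id, (title, recs) in top5.items()}
print(hits)
```

Averaging these counts over all labeled jobs would give a single score to compare alternative hyperparameters or preprocessing choices.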

## Pickle the Model

In [124]:
# Save the model for use in the API
# by serializing it with the pickle library.

# Use a context manager so the file is closed after writing
with open('../model.p', 'wb') as f:
    pickle.dump(model, f)
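
On the API side, the model would be restored with `pickle.load`. The round trip below uses a small stand-in dictionary and a temporary file so that it runs without the trained model; `save_pickle`, `load_pickle`, and the stand-in object are assumptions for illustration.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize `obj` to `path`, closing the file afterwards."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load and return the object stored at `path`."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for the trained model (hypothetical, for demonstration only)
stand_in = {'vector_size': 50, 'epochs': 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.p')
    save_pickle(stand_in, path)
    restored = load_pickle(path)

print(restored == stand_in)
```

gensim models also provide their own `save` and `load` methods, which may be worth considering as an alternative to pickling the model directly.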