This notebook builds a Doc2Vec model on Coursera course descriptions and evaluates how well it matches labeled job descriptions to relevant courses.

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import gensim
from scipy.spatial import distance
import pickle

## Model Building

In [2]:
# Read in the course data

course_df = pd.read_csv('../Data/Course_Data/Coursera_Catalog.csv')
print(course_df.shape)
course_df.head(2)

(4416, 10)


Unnamed: 0,courseType,description,domainTypes,id,slug,specializations,workload,primaryLanguages,certificates,name
0,v2.ondemand,Gamification is the application of game elemen...,"[{'subdomainId': 'design-and-product', 'domain...",69Bku0KoEeWZtA4u62x6lQ,gamification,[],4-8 hours/week,['en'],['VerifiedCert'],Gamification
1,v2.ondemand,This course will cover the steps used in weigh...,"[{'subdomainId': 'data-analysis', 'domainId': ...",0HiU7Oe4EeWTAQ4yevf_oQ,missing-data,[],"4 weeks of study, 1-2 hours/week",['en'],"['VerifiedCert', 'Specialization']",Dealing With Missing Data


In [3]:
# Preprocess the course data as a training corpus

# Convert the course descriptions to a list
course_descriptions = course_df['description'].to_list()

# Define a function for basic preprocessing:
# tokenize, lowercase, strip punctuation, and tag each document
def read_corpus(corpus):
    for i, doc in enumerate(corpus):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Run the function on the list of course descriptions
train_corpus = list(read_corpus(course_descriptions))

In [4]:
# Examine the first document in the corpus (tag at end)

print(train_corpus[0])

TaggedDocument(['gamification', 'is', 'the', 'application', 'of', 'game', 'elements', 'and', 'digital', 'game', 'design', 'techniques', 'to', 'non', 'game', 'problems', 'such', 'as', 'business', 'and', 'social', 'impact', 'challenges', 'this', 'course', 'will', 'teach', 'you', 'the', 'mechanisms', 'of', 'gamification', 'why', 'it', 'has', 'such', 'tremendous', 'potential', 'and', 'how', 'to', 'use', 'it', 'effectively', 'for', 'additional', 'information', 'on', 'the', 'concepts', 'described', 'in', 'the', 'course', 'you', 'can', 'purchase', 'professor', 'werbach', 'book', 'for', 'the', 'win', 'how', 'game', 'thinking', 'can', 'revolutionize', 'your', 'business', 'in', 'print', 'or', 'ebook', 'format', 'in', 'several', 'languages'], [0])
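The token list above shows what the preprocessing step does: punctuation and the possessive "s" are gone, and "non-game" is split in two. A rough standard-library approximation of that behavior is sketched below. The function name `approx_simple_preprocess` is hypothetical, and the defaults (`min_len=2`, `max_len=15`) are assumed to mirror those of `gensim.utils.simple_preprocess`; it is meant for illustration, not as a replacement.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Approximate gensim.utils.simple_preprocess with the standard library."""
    # Lowercase, then keep runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short, too long, or start with an underscore
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

sample = "Professor Werbach's book covers non-game problems in 2020!"
tokens = approx_simple_preprocess(sample)
print(tokens)
```

Single-character tokens and numbers are dropped, which is why "Werbach's" becomes just "werbach" in the tagged document.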


In [5]:
# Build a model

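# Instantiate the model. dm=0 selects PV-DBOW; dm_concat only applies
# to PV-DM (dm=1), so it has no effect with this setting.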
model = gensim.models.doc2vec.Doc2Vec(vector_size=50,
                                      window=5,
                                      dm=0,
                                      dm_concat=1,
                                      min_count=2,
                                      epochs=10)

# Create a vocabulary for the model
model.build_vocab(train_corpus)

# Train the model on the corpus
model.train(train_corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)

In [6]:
# Self-assess the model

# Infer a vector for a document in the training corpus itself.
vector = model.infer_vector(train_corpus[0].words)

# Find which documents in the training corpus it is most similar to.
# (In gensim 4.x, model.docvecs was renamed to model.dv.)
sims = model.docvecs.most_similar([vector])
sims

[(0, 0.9621601700782776),
 (3723, 0.9405094385147095),
 (1121, 0.9323824048042297),
 (184, 0.9322988986968994),
 (2326, 0.9320486783981323),
 (688, 0.93101567029953),
 (599, 0.9309282898902893),
 (3169, 0.9299595355987549),
 (1019, 0.9295350313186646),
 (4182, 0.9295138120651245)]

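### Analysis:
The model correctly ranks the queried document as most similar to itself in the training corpus. A similarity score of only .962 may still seem low for two identical documents, but `infer_vector` trains a fresh vector from a random starting point rather than looking up the stored one, so the inferred and trained vectors are close but not identical.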
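The check above looks at a single document. A fuller sanity check would repeat it for every document and record where each document's own tag lands in the ranking. The sketch below shows only the ranking logic, using NumPy and synthetic vectors as a stand-in for the trained and re-inferred document vectors; the function name `self_ranks` and the noise level are assumptions for illustration.

```python
import numpy as np

def self_ranks(trained_vectors, inferred_vectors):
    """Return, for each document, how many other documents outrank it.

    A rank of 0 means the re-inferred vector is most similar to the
    document's own trained vector.
    """
    # Normalize rows so that dot products are cosine similarities
    trained = trained_vectors / np.linalg.norm(trained_vectors, axis=1, keepdims=True)
    inferred = inferred_vectors / np.linalg.norm(inferred_vectors, axis=1, keepdims=True)
    sims = inferred @ trained.T          # sims[i, j]: inferred doc i vs trained doc j
    own = np.diag(sims)                  # similarity of each doc to itself
    return (sims > own[:, None]).sum(axis=1)

# Synthetic stand-in: "inferred" vectors are noisy copies of the trained ones
rng = np.random.default_rng(0)
trained = rng.normal(size=(100, 50))
inferred = trained + rng.normal(scale=0.1, size=trained.shape)

ranks = self_ranks(trained, inferred)
print(f"Share of documents ranked first: {np.mean(ranks == 0):.2f}")
```

With the real model, the two inputs would be the stored document vectors and the output of `infer_vector` for each training document.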

## Model Evaluation

In [7]:
# Read in the labeled job descriptions

jobs_test_sample = pd.read_csv('../EDA/jobs_test_sample.csv')
jobs_test_sample

Unnamed: 0,Job_title,Job_Desc,Job_id
0,Data Scientist,We are looking for Data Scientists who are int...,901
1,Data Scientist,The world's largest and fastest-growing compan...,910
2,Data Scientist,\nRole: Data Scientist.\n\nLocation: Foster Ci...,916
3,Data Scientist,Upstart is the leading AI lending platform par...,920
4,Data Scientist,"Why Divvy?Over the past decade, millions of Am...",938
5,Data Engineer,About Rocket LawyerWe believe everyone deserve...,935
6,Data Engineer,Our mission is to create a world where mental ...,1068
7,Data Engineer,Data Engineer \nIf you are a Data Engineer wit...,1089
8,Data Engineer,Prabhav Services Inc. is one of the premier pr...,1100
9,Data Engineer,About Skupos\nSkupos is the data platform for ...,1105


In [8]:
# Read in the courses for the labeled job descriptions

courses_test_sample = pd.read_csv('../EDA/courses_test_sample.csv')
courses_test_sample

Unnamed: 0,name,description,job_title,course_id
0,The Data Scientist’s Toolbox,In this course you will get an introduction to...,Data Scientist,3823
1,Machine Learning,Machine learning is the science of getting com...,Data Scientist,143
2,Applied Machine Learning in Python,This course will introduce the learner to appl...,Data Scientist,3165
3,Data Visualization with Python,"""A picture is worth a thousand words"". We are ...",Data Scientist,3588
4,Machine Learning with Python,This course dives into the basics of machine l...,Data Scientist,2517
5,Databases and SQL for Data Science,Much of the world's data resides in databases....,Data Engineer,545
6,Google Cloud Platform Big Data and Machine Lea...,This 2-week accelerated on-demand course intro...,Data Engineer,1015
7,Big Data Modeling and Management Systems,Once you’ve identified a big data issue to ana...,Data Engineer,4233
8,Database Management Essentials,Database Management Essentials provides the fo...,Data Engineer,3763
9,"Data Warehouse Concepts, Design, and Data Inte...",This is the second course in the Data Warehous...,Data Engineer,1311


In [9]:
# Split the labeled courses by job title

ds_courses = courses_test_sample.loc[courses_test_sample['job_title'] == "Data Scientist"].reset_index(drop=True)
de_courses = courses_test_sample.loc[courses_test_sample['job_title'] == "Data Engineer"].reset_index(drop=True)

In [10]:
# Calculate the model's distance score for this labeled data:
# A lower score is better because the model should recognize
# that these jobs and courses are similar to each other.

# Initiate a list of distances
distances = []

# Loop through the labeled jobs
for i in range(len(jobs_test_sample)):
    
    # Extract the information for the job
    job_title, job_desc, job_id = jobs_test_sample.iloc[i]
    
    # Preprocess the job description
    doc = gensim.utils.simple_preprocess(job_desc)
    
    # Convert the job description into a vector
    job_vec = model.infer_vector(doc)
    
    # Select the courses labeled for this job title
    if job_title == 'Data Scientist':
        labeled_courses = ds_courses
    elif job_title == 'Data Engineer':
        labeled_courses = de_courses
    else:
        continue

    # Loop through the relevant courses
    for j in range(len(labeled_courses)):

        # Extract the course id
        course_id = labeled_courses.iloc[j, 3]

        # Find the distance between the job and course
        d = distance.cosine(job_vec, model.docvecs[course_id])

        # Add the result to the distance list
        distances.append(d)

# Summarize the distances across all labeled job-course pairs
avg_dist = round(float(np.mean(distances)), 3)
min_dist = round(min(distances), 3)
max_dist = round(max(distances), 3)

print(f"Mean distance score: {avg_dist}")
print(f"Min distance score: {min_dist}")
print(f"Max distance score: {max_dist}")

Mean distance score: 0.165
Min distance score: 0.057
Max distance score: 0.273
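For reference, cosine distance is one minus cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction), and a mean distance of 0.165 corresponds to a mean similarity of 0.835. The sketch below computes it by hand with NumPy. The helper name `cosine_distance` is hypothetical and is meant to illustrate the formula behind `scipy.spatial.distance.cosine`, not to replace it.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - similarity

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # unrelated vectors
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # opposed vectors
print(same_direction, orthogonal, opposite)
```

Because the measure ignores vector length, a long job description and a short course description can still score as close if their vectors point the same way.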


## Detailed Results

In [11]:
# Define a function to display the results for each labeled job

def display_recs(job_id):
    
    # Initialize a dictionary for the job
    job_dict = {}
    keys = ['job_id', 'job_title', 'course_id', 'course_name', 'labeled', 'distance']
    for key in keys:
        job_dict[key] = []
    
    # Extract info about the job (first row matching the job id)
    job_row = jobs_test_sample.loc[jobs_test_sample['Job_id'] == job_id].iloc[0]
    job_title = job_row['Job_title']
    job_desc = job_row['Job_Desc']
    
    # Preprocess the job description
    doc = gensim.utils.simple_preprocess(job_desc)
    
    # Convert the job description into a vector
    job_vec = model.infer_vector(doc)
    
    # Get top 5 courses from the model
    top_recs = model.docvecs.most_similar(positive=[job_vec], topn=5)
    
    # Loop through top 5 courses
    for rec in top_recs:
        
        # Add job info to the dictionary
        job_dict['job_id'].append(job_id)
        job_dict['job_title'].append(job_title)
        
        # Add course info to the dictionary
        course_id = rec[0]
        job_dict['course_id'].append(course_id)
        course_name = course_df.loc[course_id, 'name']
        job_dict['course_name'].append(course_name)
        job_dict['labeled'].append(0)
        
        # Add the distance score to the dictionary
        d = round((1-rec[1]), 3)
        job_dict['distance'].append(d)
    
    # Now loop through the courses I selected as relevant
    if job_title == 'Data Scientist':
        labeled_courses = ds_courses
    elif job_title == 'Data Engineer':
        labeled_courses = de_courses
    else:
        # No labeled courses for other job titles
        labeled_courses = ds_courses.iloc[0:0]

    for i in range(len(labeled_courses)):

        # Add job info to the dictionary
        job_dict['job_id'].append(job_id)
        job_dict['job_title'].append(job_title)

        # Add course info to the dictionary
        course_id = labeled_courses.iloc[i, 3]
        job_dict['course_id'].append(course_id)
        course_name = labeled_courses.iloc[i, 0]
        job_dict['course_name'].append(course_name)
        job_dict['labeled'].append(1)

        # Find and add the distance score to the dictionary
        d = distance.cosine(job_vec, model.docvecs[course_id])
        d = round(d, 3)
        job_dict['distance'].append(d)
    
    # Convert the dictionary into a dataframe
    df = pd.DataFrame(job_dict)
    
    # Rank by distance
    df = df.sort_values('distance')
    
    return df

In [12]:
# Job 1
display_recs(901)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,901,Data Scientist,2783,Social Media Data Analytics,0,0.094
1,901,Data Scientist,1240,Predictive Analytics and Data Mining,0,0.101
2,901,Data Scientist,3238,Building Data Visualization Tools,0,0.102
3,901,Data Scientist,4024,Research Data Management and Sharing,0,0.102
4,901,Data Scientist,3638,Applying Data Analytics in Marketing,0,0.102
8,901,Data Scientist,3588,Data Visualization with Python,1,0.123
5,901,Data Scientist,3823,The Data Scientist’s Toolbox,1,0.139
7,901,Data Scientist,3165,Applied Machine Learning in Python,1,0.147
6,901,Data Scientist,143,Machine Learning,1,0.185
9,901,Data Scientist,2517,Machine Learning with Python,1,0.2


In [13]:
# Job 2
display_recs(910)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,910,Data Scientist,3120,Digital Product Management: Modern Fundamentals,0,0.145
1,910,Data Scientist,3049,Big Data Emerging Technologies,0,0.148
2,910,Data Scientist,1721,New Technologies for Business Leaders,0,0.149
3,910,Data Scientist,4147,Survey analysis to Gain Marketing Insights,0,0.152
4,910,Data Scientist,1289,Marketing analytics: Know your customers,0,0.154
8,910,Data Scientist,3588,Data Visualization with Python,1,0.184
5,910,Data Scientist,3823,The Data Scientist’s Toolbox,1,0.19
9,910,Data Scientist,2517,Machine Learning with Python,1,0.221
6,910,Data Scientist,143,Machine Learning,1,0.225
7,910,Data Scientist,3165,Applied Machine Learning in Python,1,0.236


In [14]:
# Job 3
display_recs(916)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,916,Data Scientist,272,Data Science Methodology,0,0.046
1,916,Data Scientist,1806,AI Workflow: Business Priorities and Data Inge...,0,0.047
2,916,Data Scientist,1553,Addressing Large Hadron Collider Challenges by...,0,0.053
3,916,Data Scientist,602,Data Science for Business Innovation,0,0.053
4,916,Data Scientist,3199,Text Retrieval and Search Engines,0,0.053
8,916,Data Scientist,3588,Data Visualization with Python,1,0.055
5,916,Data Scientist,3823,The Data Scientist’s Toolbox,1,0.065
7,916,Data Scientist,3165,Applied Machine Learning in Python,1,0.077
6,916,Data Scientist,143,Machine Learning,1,0.081
9,916,Data Scientist,2517,Machine Learning with Python,1,0.103


In [15]:
# Job 4
display_recs(920)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,920,Data Scientist,3044,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.096
1,920,Data Scientist,2316,"Big Data Essentials: HDFS, MapReduce and Spark...",0,0.103
2,920,Data Scientist,846,Big Data Applications: Machine Learning at Scale,0,0.106
3,920,Data Scientist,868,Tinkering Fundamentals: Motion and Mechanisms,0,0.112
4,920,Data Scientist,2821,Compassionate Leadership Through Service Learn...,0,0.115
6,920,Data Scientist,143,Machine Learning,1,0.132
8,920,Data Scientist,3588,Data Visualization with Python,1,0.135
9,920,Data Scientist,2517,Machine Learning with Python,1,0.157
5,920,Data Scientist,3823,The Data Scientist’s Toolbox,1,0.164
7,920,Data Scientist,3165,Applied Machine Learning in Python,1,0.192


In [16]:
# Job 5
display_recs(938)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,938,Data Scientist,2945,Business Application of Machine Learning and A...,0,0.076
1,938,Data Scientist,3776,Data Management for Clinical Research,0,0.078
2,938,Data Scientist,272,Data Science Methodology,0,0.079
3,938,Data Scientist,3049,Big Data Emerging Technologies,0,0.082
4,938,Data Scientist,4392,Python and Machine Learning for Asset Management,0,0.085
8,938,Data Scientist,3588,Data Visualization with Python,1,0.094
5,938,Data Scientist,3823,The Data Scientist’s Toolbox,1,0.108
6,938,Data Scientist,143,Machine Learning,1,0.13
9,938,Data Scientist,2517,Machine Learning with Python,1,0.138
7,938,Data Scientist,3165,Applied Machine Learning in Python,1,0.146


In [17]:
# Job 6
display_recs(935)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,935,Data Engineer,2465,"Business Intelligence Concepts, Tools, and App...",0,0.149
1,935,Data Engineer,1721,New Technologies for Business Leaders,0,0.152
2,935,Data Engineer,3049,Big Data Emerging Technologies,0,0.154
3,935,Data Engineer,3928,Wharton Business and Financial Modeling Capstone,0,0.156
4,935,Data Engineer,1235,Data in Database,0,0.162
9,935,Data Engineer,1311,"Data Warehouse Concepts, Design, and Data Inte...",1,0.167
7,935,Data Engineer,4233,Big Data Modeling and Management Systems,1,0.186
6,935,Data Engineer,1015,Google Cloud Platform Big Data and Machine Lea...,1,0.188
8,935,Data Engineer,3763,Database Management Essentials,1,0.195
5,935,Data Engineer,545,Databases and SQL for Data Science,1,0.207


In [18]:
# Job 7
display_recs(1068)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,1068,Data Engineer,3044,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.177
1,1068,Data Engineer,932,Data and Health Indicators in Public Health Pr...,0,0.19
2,1068,Data Engineer,3831,Surveillance Systems: The Building Blocks,0,0.191
3,1068,Data Engineer,4404,Cloud Networking,0,0.193
4,1068,Data Engineer,3776,Data Management for Clinical Research,0,0.2
7,1068,Data Engineer,4233,Big Data Modeling and Management Systems,1,0.229
6,1068,Data Engineer,1015,Google Cloud Platform Big Data and Machine Lea...,1,0.245
5,1068,Data Engineer,545,Databases and SQL for Data Science,1,0.258
8,1068,Data Engineer,3763,Database Management Essentials,1,0.266
9,1068,Data Engineer,1311,"Data Warehouse Concepts, Design, and Data Inte...",1,0.268


In [19]:
# Job 8
display_recs(1089)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,1089,Data Engineer,3044,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.085
1,1089,Data Engineer,225,"Big Data, Genes, and Medicine",0,0.113
2,1089,Data Engineer,3853,SQL for Data Science,0,0.116
3,1089,Data Engineer,2548,Requirements Writing,0,0.12
4,1089,Data Engineer,3776,Data Management for Clinical Research,0,0.122
7,1089,Data Engineer,4233,Big Data Modeling and Management Systems,1,0.156
8,1089,Data Engineer,3763,Database Management Essentials,1,0.157
5,1089,Data Engineer,545,Databases and SQL for Data Science,1,0.165
9,1089,Data Engineer,1311,"Data Warehouse Concepts, Design, and Data Inte...",1,0.178
6,1089,Data Engineer,1015,Google Cloud Platform Big Data and Machine Lea...,1,0.228


In [20]:
# Job 9
display_recs(1100)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,1100,Data Engineer,4233,Big Data Modeling and Management Systems,0,0.093
1,1100,Data Engineer,3976,Big Data Integration and Processing,0,0.093
2,1100,Data Engineer,4253,Responsive Website Tutorial and Examples,0,0.093
7,1100,Data Engineer,4233,Big Data Modeling and Management Systems,1,0.093
3,1100,Data Engineer,2936,Building R Packages,0,0.094
4,1100,Data Engineer,2099,Introduction to Big Data,0,0.095
9,1100,Data Engineer,1311,"Data Warehouse Concepts, Design, and Data Inte...",1,0.11
8,1100,Data Engineer,3763,Database Management Essentials,1,0.112
5,1100,Data Engineer,545,Databases and SQL for Data Science,1,0.132
6,1100,Data Engineer,1015,Google Cloud Platform Big Data and Machine Lea...,1,0.168


In [21]:
# Job 10
display_recs(1105)

Unnamed: 0,job_id,job_title,course_id,course_name,labeled,distance
0,1105,Data Engineer,3044,"Big Data Analysis: Hive, Spark SQL, DataFrames...",0,0.129
1,1105,Data Engineer,3049,Big Data Emerging Technologies,0,0.134
2,1105,Data Engineer,1874,Executive Data Science Capstone,0,0.14
3,1105,Data Engineer,4025,Amazon DynamoDB: Building NoSQL Database-Drive...,0,0.143
4,1105,Data Engineer,3238,Building Data Visualization Tools,0,0.144
9,1105,Data Engineer,1311,"Data Warehouse Concepts, Design, and Data Inte...",1,0.155
7,1105,Data Engineer,4233,Big Data Modeling and Management Systems,1,0.167
8,1105,Data Engineer,3763,Database Management Essentials,1,0.169
5,1105,Data Engineer,545,Databases and SQL for Data Science,1,0.183
6,1105,Data Engineer,1015,Google Cloud Platform Big Data and Machine Lea...,1,0.193


### Analysis:
The architecture works, but the model does not appear to be very effective: Digital Product Management is definitely not the most relevant course on Coursera for the Data Scientist position at BrightIdea. Why does my model think it is? (The reasoning is difficult to see just by looking at the two documents it matched together.) How can I improve the model's performance?
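One way to put a number on this gap, so that later model changes can be compared, is to record where each labeled course ranks among all courses for a given job. The sketch below shows that ranking step with NumPy on a tiny synthetic catalog. The function name `labeled_ranks` and the toy vectors are assumptions for illustration; with the real model, `course_vecs` would hold the trained document vectors and `job_vec` the inferred job vector.

```python
import numpy as np

def labeled_ranks(job_vec, course_vecs, labeled_ids):
    """Return the 1-based rank of each labeled course among all courses,
    ordered by cosine similarity to the job vector (1 = most similar)."""
    course_norm = course_vecs / np.linalg.norm(course_vecs, axis=1, keepdims=True)
    sims = course_norm @ (job_vec / np.linalg.norm(job_vec))
    order = np.argsort(-sims)                      # course indices, best first
    rank_of = np.empty(len(order), dtype=int)
    rank_of[order] = np.arange(1, len(order) + 1)  # invert the ordering
    return {cid: int(rank_of[cid]) for cid in labeled_ids}

# Toy catalog of four courses in a 3-dimensional space
course_vecs = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]])
job_vec = np.array([1.0, 0.2, 0.0])

ranks = labeled_ranks(job_vec, course_vecs, labeled_ids=[1, 3])
print(ranks)
```

Averaging these ranks over all labeled jobs would give a single score that falls as the labeled courses move toward the top of the recommendations.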

## Pickle the Model

In [22]:
# Save the model to the API file
# by serializing it with the pickle library.
# The context manager closes the file once the dump finishes.

with open('../API/model.p', 'wb') as f:
    pickle.dump(model, f)
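The API will need to read the file back with `pickle.load`. The sketch below shows the full round trip using a plain dictionary as a stand-in for the trained model and a temporary directory in place of `../API/`; both are placeholders for illustration only.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model (hypothetical placeholder object)
stand_in_model = {"vector_size": 50, "epochs": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.p")

    # Serialize the object to disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back, as the API would at startup
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

A pickled file should be loaded with the same library versions that wrote it. gensim also offers its own `model.save(path)` and `Doc2Vec.load(path)` methods, which may be worth considering as an alternative.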