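The purpose of this notebook is to build a Doc2Vec model on Coursera course descriptions and use it to find the courses most similar to a given job description.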

In [4]:
# Import libraries

import pandas as pd
import gensim
import pickle

## 1. Model Building

In [5]:
# Read in the course data

course_df = pd.read_csv('../Data/Course_Data/Coursera_Catalog.csv')
print(course_df.shape)
course_df.head(2)

(4416, 10)


Unnamed: 0,courseType,description,domainTypes,id,slug,specializations,workload,primaryLanguages,certificates,name
0,v2.ondemand,Gamification is the application of game elemen...,"[{'subdomainId': 'design-and-product', 'domain...",69Bku0KoEeWZtA4u62x6lQ,gamification,[],4-8 hours/week,['en'],['VerifiedCert'],Gamification
1,v2.ondemand,This course will cover the steps used in weigh...,"[{'subdomainId': 'data-analysis', 'domainId': ...",0HiU7Oe4EeWTAQ4yevf_oQ,missing-data,[],"4 weeks of study, 1-2 hours/week",['en'],"['VerifiedCert', 'Specialization']",Dealing With Missing Data


In [6]:
# Preprocess the course data as a training corpus

# Convert the course descriptions to a list
course_descriptions = course_df['description'].to_list()

# Define a generator for basic preprocessing:
# tokenize, lowercase, strip punctuation, and tag each document with its position
def read_corpus(corpus):
    for i, doc in enumerate(corpus):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Run the function on the list of course descriptions
train_corpus = list(read_corpus(course_descriptions))

In [7]:
# Examine the first document in the corpus (tag at end)

print(train_corpus[0])

TaggedDocument(['gamification', 'is', 'the', 'application', 'of', 'game', 'elements', 'and', 'digital', 'game', 'design', 'techniques', 'to', 'non', 'game', 'problems', 'such', 'as', 'business', 'and', 'social', 'impact', 'challenges', 'this', 'course', 'will', 'teach', 'you', 'the', 'mechanisms', 'of', 'gamification', 'why', 'it', 'has', 'such', 'tremendous', 'potential', 'and', 'how', 'to', 'use', 'it', 'effectively', 'for', 'additional', 'information', 'on', 'the', 'concepts', 'described', 'in', 'the', 'course', 'you', 'can', 'purchase', 'professor', 'werbach', 'book', 'for', 'the', 'win', 'how', 'game', 'thinking', 'can', 'revolutionize', 'your', 'business', 'in', 'print', 'or', 'ebook', 'format', 'in', 'several', 'languages'], [0])


In [8]:
# Build a model

# Instantiate the model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, 
                                      min_count=2,
                                      epochs=20)

# Create a vocabulary for the model
model.build_vocab(train_corpus)

# Train the model on the corpus
model.train(train_corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)

In [9]:
# Assess the model

# Infer a vector for a document in the training corpus itself.
vector = model.infer_vector(train_corpus[0].words)

# Find which documents in the training corpus it is most similar to.
sims = model.docvecs.most_similar([vector])
sims

[(0, 0.9297835230827332),
 (4328, 0.6194338202476501),
 (3514, 0.595689058303833),
 (4125, 0.5856249332427979),
 (4203, 0.5853810906410217),
 (2684, 0.5722717046737671),
 (2627, 0.5592280030250549),
 (2036, 0.5415085554122925),
 (872, 0.5365030765533447),
 (993, 0.5356308221817017)]

### Analysis: 
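The model correctly ranks the queried document as most similar to its own entry in the training corpus. Still, a similarity score of only about .93 seems low for two identical documents. One likely reason is that `infer_vector` computes a fresh vector for the text instead of looking up the one learned during training, so the two vectors will not match exactly. The next cells compare them directly.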
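A single document is a thin basis for judging the model. A broader sanity check would repeat the self-similarity query for every training document and count how often each document ranks first against itself. The sketch below shows only the counting step, using made-up similarity lists in the same `(tag, score)` format that `most_similar` returns. The helper `self_rank` and the sample `results` are hypothetical names introduced for illustration, not part of gensim.

```python
from collections import Counter

def self_rank(doc_id, sims):
    """Return the 0-based position of doc_id in a ranked (tag, score) list, or None if absent."""
    for position, (tag, _score) in enumerate(sims):
        if tag == doc_id:
            return position
    return None

# Hypothetical ranked results for three documents (scores are invented)
results = {
    0: [(0, 0.93), (2, 0.62), (1, 0.60)],
    1: [(1, 0.91), (0, 0.58), (2, 0.55)],
    2: [(0, 0.66), (2, 0.64), (1, 0.51)],
}

# Count how many documents land at each rank when queried against themselves
rank_counts = Counter(self_rank(doc_id, sims) for doc_id, sims in results.items())
print(rank_counts)
```

With the real model, the similarity lists would need to cover the whole corpus rather than the default top ten, since a document that falls outside the returned list has no rank to count.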

In [10]:
# The text of the document in the corpus
train_corpus[0]

TaggedDocument(words=['gamification', 'is', 'the', 'application', 'of', 'game', 'elements', 'and', 'digital', 'game', 'design', 'techniques', 'to', 'non', 'game', 'problems', 'such', 'as', 'business', 'and', 'social', 'impact', 'challenges', 'this', 'course', 'will', 'teach', 'you', 'the', 'mechanisms', 'of', 'gamification', 'why', 'it', 'has', 'such', 'tremendous', 'potential', 'and', 'how', 'to', 'use', 'it', 'effectively', 'for', 'additional', 'information', 'on', 'the', 'concepts', 'described', 'in', 'the', 'course', 'you', 'can', 'purchase', 'professor', 'werbach', 'book', 'for', 'the', 'win', 'how', 'game', 'thinking', 'can', 'revolutionize', 'your', 'business', 'in', 'print', 'or', 'ebook', 'format', 'in', 'several', 'languages'], tags=[0])

In [11]:
# The vector inferred for the query document
vector

array([ 0.38629812,  0.5783848 , -0.02540533,  0.37469834, -0.48154405,
       -0.1593047 , -0.31531   , -0.06316546, -0.8638272 ,  0.5396206 ,
       -0.7799212 ,  0.4568018 , -1.0130543 ,  0.09721771,  0.01842084,
        0.33465704,  0.03045495,  0.2790492 , -0.41015503, -0.38814616,
        0.22531334, -0.00586816, -0.0821557 , -0.54532003,  0.91010016,
       -0.00290382,  0.1918041 ,  0.12977605, -0.14652981,  0.72398365,
       -0.1945694 , -0.09146152, -0.31444156, -0.05626089, -0.2945109 ,
        0.08408266, -0.6675039 ,  0.3900648 ,  0.2890711 ,  0.44858932,
        0.64772874,  0.5251184 , -0.00458518,  0.8088226 ,  0.45003164,
        0.05860854, -0.18295369,  0.03185805,  0.07435168, -0.43666798],
      dtype=float32)

In [12]:
# The trained vector stored for the same document in the corpus
model.docvecs[0]

array([ 0.28258136,  0.30014622, -0.17879154,  0.22064546,  0.06615274,
       -0.13122658, -0.10445565, -0.11441766, -0.6116671 ,  0.47410238,
       -0.5039973 ,  0.29454967, -0.61621284,  0.24525239,  0.0964065 ,
        0.3089098 , -0.10508112,  0.13013092, -0.30781358, -0.1194708 ,
        0.19032517, -0.26953024, -0.02628003, -0.19508152,  0.7917689 ,
        0.03244531,  0.1812464 ,  0.13986608, -0.06453077,  0.4570868 ,
       -0.08253011, -0.16883898, -0.16239563,  0.00398307, -0.2208868 ,
        0.07151165, -0.4613866 ,  0.29482293,  0.38926482,  0.21558928,
        0.31290093,  0.33779612, -0.04056223,  0.6325989 ,  0.11753267,
        0.00493962, -0.21678235,  0.05802134, -0.04052821, -0.32103115],
      dtype=float32)
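The two vectors above are clearly related but not equal, which is consistent with a similarity below 1. The score reported by `most_similar` is a cosine similarity, and it can be reproduced by hand. The following is a minimal pure-Python sketch, assuming only the first five components of each vector (rounded to two decimals), so its result will not match the full 50-dimensional score. The function name `cosine_similarity` is introduced here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First five components only, rounded from the outputs above (illustrative subset)
inferred = [0.39, 0.58, -0.03, 0.37, -0.48]
trained = [0.28, 0.30, -0.18, 0.22, 0.07]

print(round(cosine_similarity(inferred, trained), 3))
```

Passing the full `vector` and `model.docvecs[0]` arrays to the same function should give a value close to the top score in the similarity list.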

## 2. Querying Job Descriptions

In [13]:
# Read in the job data

job_df = pd.read_csv('../Data/Job_Data/Glassdoor_Joblist.csv')
print(job_df.shape)
job_df.head(2)

(3324, 12)


Unnamed: 0,Job_title,Company,State,City,Min_Salary,Max_Salary,Job_Desc,Industry,Rating,Date_Posted,Valid_until,Job_Type
0,Chief Marketing Officer (CMO),National Debt Relief,NY,New York,-1,-1,Who We're Looking For:\n\nThe Chief Marketing ...,Finance,4.0,2020-05-08,2020-06-07,FULL_TIME
1,Registered Nurse,Queens Boulevard Endoscopy Center,NY,Rego Park,-1,-1,"Queens Boulevard Endoscopy Center, an endoscop...",,3.0,2020-04-25,2020-06-07,FULL_TIME


In [14]:
# Define a function that queries a job description

def job_query(job_description):
    doc = gensim.utils.simple_preprocess(job_description)
    vector = model.infer_vector(doc)
    sims = model.docvecs.most_similar([vector])
    return sims

In [15]:
# Select a sample job

display(job_df.loc[953])
jd_1 = job_df['Job_Desc'][953]

Job_title                                         Data Scientist
Company                                               Brightidea
State                                                         CA
City                                               San Francisco
Min_Salary                                                119642
Max_Salary                                                138312
Job_Desc       \nData Scientist\n\nat Brightidea\n\nSan Franc...
Industry                                  Information Technology
Rating                                                       4.3
Date_Posted                                           2020-04-24
Valid_until                                           2020-06-05
Job_Type                                               FULL_TIME
Name: 953, dtype: object

In [16]:
# Call the function on the sample job description
sim_docs = job_query(jd_1)

# Display the doc ids and similarity scores
sim_docs

[(1987, 0.6027417182922363),
 (2316, 0.5830969214439392),
 (3044, 0.573232889175415),
 (3374, 0.5678238272666931),
 (3120, 0.5664616227149963),
 (968, 0.5618869066238403),
 (4342, 0.5590941309928894),
 (4101, 0.5466418862342834),
 (2219, 0.5438205003738403),
 (3388, 0.5413758158683777)]

In [17]:
# Make a list of the ids for the top ten most similar docs
sim_ids = [sim[0] for sim in sim_docs]
sim_ids

[1987, 2316, 3044, 3374, 3120, 968, 4342, 4101, 2219, 3388]

In [18]:
# Which courses did the model identify as most similar?

course_df.loc[sim_ids, 'name']

1987                                     People Analytics
2316    Big Data Essentials: HDFS, MapReduce and Spark...
3044    Big Data Analysis: Hive, Spark SQL, DataFrames...
3374    Developing An Entrepreneurial Mindset: First S...
3120      Digital Product Management: Modern Fundamentals
968                Creating and Developing a Tech Startup
4342           Sales Training: Building Your Sales Career
4101                            Business Model Innovation
2219              Introduction to the Orbital Perspective
3388    Technology Commercialization, Part 1: Setting ...
Name: name, dtype: object
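Seeing each course name next to its similarity score makes the ranking easier to read. The sketch below pairs the two for the top three results, with ids, scores, and (truncated) names copied from the outputs above. The dictionary `course_names` is a hypothetical stand-in for a lookup in `course_df`.

```python
# Top three (doc id, similarity) pairs, copied from the query output
sim_docs = [
    (1987, 0.6027417182922363),
    (2316, 0.5830969214439392),
    (3044, 0.573232889175415),
]

# Stand-in for course_df['name']; names appear as truncated in the notebook output
course_names = {
    1987: "People Analytics",
    2316: "Big Data Essentials: HDFS, MapReduce and Spark...",
    3044: "Big Data Analysis: Hive, Spark SQL, DataFrames...",
}

# Pair each course name with its rounded similarity score
ranked = [(course_names[doc_id], round(score, 3)) for doc_id, score in sim_docs]

for name, score in ranked:
    print(f"{score:.3f}  {name}")
```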

In [19]:
# Display the sample job description
jd_1

'\nData Scientist\n\nat Brightidea\n\nSan Francisco\n\nThe Role\n\nWe are seeking machine learning developers with natural language processing experience.\n\nIn general, we are looking for people who are self-motivated and passionate about the field of machine learning and the vast applications of it. These folks will have the ability to work with / understand / and build on top of an existing code base using their deep knowledge of various machine learning algorithms (e.g. neural networks, bayesian methods, etc).\n\nKey responsibilities include, but not limited to:\n\n\nBuild on top of an existing text processing/classification system\nWrite, maintain, and develop python machine learning modules & repos\nRun hyperparameter optimizations + collect, analyze, visualize, and present results\n\nWhat You Need to Succeed\n\nBS or MS in computer science, mathematics, physics or other hard science/engineering discipline\nProgramming in Python ~ 2+ years\nNumpy, scipy, pandas, Jupyter, and scik

In [20]:
# Display the description of the course identified as most similar

course_df.loc[sim_ids[0], 'description']

'People analytics is a data-driven approach to managing people at work. For the first time in history, business leaders can make decisions about their people based on deep analysis of data rather than the traditional methods of personal relationships, decision making based on experience, and risk avoidance. In this brand new course, three of Wharton’s top professors, all pioneers in the field of people analytics, will explore the state-of-the-art techniques used to recruit and retain great people, and demonstrate how these techniques are used at cutting-edge companies. They’ll explain how data and sophisticated analysis is brought to bear on people-related issues, such as recruiting, performance evaluation, leadership, hiring and promotion, job design, compensation, and collaboration. This course is an introduction to the theory of people analytics, and is not intended to prepare learners to perform complex talent management data analysis. By the end of this course, you’ll understand h

### Analysis:
The architecture works, but the model does not appear to be very effective: People Analytics is definitely not the most relevant course on Coursera for the Data Scientist position at Brightidea. Why does my model think it is? (The reasoning is difficult to see just by looking at the two documents it matched.) How can I improve the model's performance?
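One rough way to probe why two documents were matched is to list the tokens they share. Doc2Vec does not match on raw token overlap, so this is only a diagnostic hint, not an explanation of the model. The sketch below assumes a simple regex tokenizer as a stand-in for `simple_preprocess`, and the two short texts are hypothetical snippets loosely based on the job and course descriptions. `tokenize` and `shared_tokens` are names introduced here.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic runs; a rough stand-in for gensim's simple_preprocess."""
    return re.findall(r"[a-z]+", text.lower())

def shared_tokens(doc_a, doc_b, top_n=10):
    """Return the tokens present in both documents with the smaller of their two counts."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    overlap = counts_a & counts_b  # Counter intersection keeps the minimum count per token
    return overlap.most_common(top_n)

# Hypothetical, shortened snippets for illustration
job_text = "We are seeking machine learning developers with deep analysis of data."
course_text = "People analytics is a data-driven approach based on deep analysis of data."

print(shared_tokens(job_text, course_text))
```

If the shared tokens turn out to be mostly generic words, that would suggest the match rests on common vocabulary rather than subject matter, which could motivate stricter preprocessing.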

## 3. Pickle the Model

In [21]:
# Save the model to file by serializing it with the pickle library.

# Use a context manager so the file is closed once the model is written
with open('model.p', 'wb') as f:
    pickle.dump(model, f)
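To confirm that a pickled object can be restored, it can be loaded back and compared with the original. The sketch below demonstrates the round trip with a plain dictionary standing in for the trained model (an assumption, so that the example runs without gensim) and writes to a temporary directory rather than `model.p`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model
stand_in_model = {"vector_size": 50, "min_count": 2, "epochs": 20}

# Write to a temporary location so nothing in the project is overwritten
path = os.path.join(tempfile.mkdtemp(), "model.p")

with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

with open(path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)
```

gensim models also provide their own `save` and `load` methods, which are an alternative to pickling the model directly.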