This notebook builds a Doc2Vec model from Coursera course descriptions and uses it to find the courses most similar to a given job description.

In [1]:
# Import libraries

import pandas as pd
import gensim

## 1. Model Building

In [24]:
# Read in the course data

course_df = pd.read_csv('../Data/Course_Data/Coursera_Catalog.csv')
print(course_df.shape)
course_df.head(2)

(4416, 10)


Unnamed: 0,courseType,description,domainTypes,id,slug,specializations,workload,primaryLanguages,certificates,name
0,v2.ondemand,Gamification is the application of game elemen...,"[{'subdomainId': 'design-and-product', 'domain...",69Bku0KoEeWZtA4u62x6lQ,gamification,[],4-8 hours/week,['en'],['VerifiedCert'],Gamification
1,v2.ondemand,This course will cover the steps used in weigh...,"[{'subdomainId': 'data-analysis', 'domainId': ...",0HiU7Oe4EeWTAQ4yevf_oQ,missing-data,[],"4 weeks of study, 1-2 hours/week",['en'],"['VerifiedCert', 'Specialization']",Dealing With Missing Data


In [26]:
# Preprocess the course data as a training corpus

# Convert the course descriptions to a list
course_descriptions = course_df['description'].to_list()

# Define a generator for basic preprocessing:
# tokenize, lowercase, strip punctuation, and tag each document with its index
def read_corpus(corpus):
    for i, doc in enumerate(corpus):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Run the function on the list of course descriptions
train_corpus = list(read_corpus(course_descriptions))

In [28]:
# Examine the first document in the corpus (tag at end)

print(train_corpus[0])

TaggedDocument(['gamification', 'is', 'the', 'application', 'of', 'game', 'elements', 'and', 'digital', 'game', 'design', 'techniques', 'to', 'non', 'game', 'problems', 'such', 'as', 'business', 'and', 'social', 'impact', 'challenges', 'this', 'course', 'will', 'teach', 'you', 'the', 'mechanisms', 'of', 'gamification', 'why', 'it', 'has', 'such', 'tremendous', 'potential', 'and', 'how', 'to', 'use', 'it', 'effectively', 'for', 'additional', 'information', 'on', 'the', 'concepts', 'described', 'in', 'the', 'course', 'you', 'can', 'purchase', 'professor', 'werbach', 'book', 'for', 'the', 'win', 'how', 'game', 'thinking', 'can', 'revolutionize', 'your', 'business', 'in', 'print', 'or', 'ebook', 'format', 'in', 'several', 'languages'], [0])
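
The token list above contains only lowercase alphabetic words. A rough approximation of what `gensim.utils.simple_preprocess` does by default may help explain that. The function name `approx_simple_preprocess` is hypothetical, the length limits are the assumed defaults (2 to 15 characters), and the sample sentence is made up; this is a sketch, not gensim's actual implementation.

```python
import re

# Assumed defaults of gensim.utils.simple_preprocess: keep tokens of 2-15 characters
MIN_LEN, MAX_LEN = 2, 15

def approx_simple_preprocess(text):
    """Rough, hypothetical stand-in for gensim.utils.simple_preprocess."""
    # Lowercase, then pull out runs of letters (digits and punctuation act as separators)
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop very short and very long tokens, and tokens starting with an underscore
    return [t for t in tokens
            if MIN_LEN <= len(t) <= MAX_LEN and not t.startswith("_")]

sample = "Game-based design: a 4-week course!"  # made-up sentence, not from the catalog
tokens = approx_simple_preprocess(sample)
print(tokens)
```

Under these assumptions, hyphenated words are split, digits vanish, and one-letter tokens such as "a" or a possessive "s" are dropped.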


In [29]:
# Build a model

# Instantiate the model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, 
                                      min_count=2,
                                      epochs=20)

# Create a vocabulary for the model
model.build_vocab(train_corpus)

# Train the model on the corpus
model.train(train_corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)

In [31]:
# Assess the model

# Infer a vector for a document in the training corpus itself.
vector = model.infer_vector(train_corpus[0].words)

# Find which documents in the training corpus it is most similar to.
# (In gensim 4.0 and later, `docvecs` is renamed to `dv`.)
sims = model.docvecs.most_similar([vector])
sims

[(0, 0.904577374458313),
 (731, 0.5876209735870361),
 (1615, 0.5753642320632935),
 (661, 0.5732396841049194),
 (3514, 0.5497420430183411),
 (2684, 0.543770968914032),
 (2340, 0.5378788709640503),
 (2627, 0.5375131964683533),
 (3661, 0.5327433943748474),
 (2036, 0.532387375831604)]

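### Analysis:
The model correctly ranks the queried document as most similar to its own entry in the training corpus. However, a similarity score of only 0.905 seems low for two identical documents. One likely factor is that `infer_vector` does not look up the stored vector: it starts from a freshly initialized vector and fits it against the frozen model, so the result only approximates the vector learned during training.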

In [35]:
# The text of the document in the corpus
train_corpus[0]

TaggedDocument(words=['gamification', 'is', 'the', 'application', 'of', 'game', 'elements', 'and', 'digital', 'game', 'design', 'techniques', 'to', 'non', 'game', 'problems', 'such', 'as', 'business', 'and', 'social', 'impact', 'challenges', 'this', 'course', 'will', 'teach', 'you', 'the', 'mechanisms', 'of', 'gamification', 'why', 'it', 'has', 'such', 'tremendous', 'potential', 'and', 'how', 'to', 'use', 'it', 'effectively', 'for', 'additional', 'information', 'on', 'the', 'concepts', 'described', 'in', 'the', 'course', 'you', 'can', 'purchase', 'professor', 'werbach', 'book', 'for', 'the', 'win', 'how', 'game', 'thinking', 'can', 'revolutionize', 'your', 'business', 'in', 'print', 'or', 'ebook', 'format', 'in', 'several', 'languages'], tags=[0])

In [33]:
# The vector inferred for the query document
vector

array([ 0.11690026, -0.93670326,  0.41097265,  0.10937741, -0.96882236,
        0.20679808,  0.424446  , -0.09070673,  0.00299759,  0.01188672,
        0.26091468,  0.51995707,  0.04820716,  0.00663995, -0.2585928 ,
        0.62834185, -0.14286816,  0.0248737 ,  0.07393045, -0.2732914 ,
       -0.01604197,  0.36418656,  0.00162364, -0.33485684, -0.27331105,
       -0.2906291 ,  0.618427  ,  0.16012691, -0.19389565,  0.18030727,
       -0.6522208 ,  0.6255185 , -0.36164117, -0.9744435 , -0.11492941,
        0.08528145,  0.11748454, -0.22892228,  0.36189422,  0.5233691 ,
       -0.2823789 , -0.15077887,  0.76997656,  0.4974916 , -0.17532884,
        0.09894331,  0.97362083, -0.67234313, -0.70708084,  0.29495043],
      dtype=float32)

In [34]:
# The vector stored for the same document during training
model.docvecs[0]

array([ 0.00911336, -0.4198732 ,  0.3058561 , -0.08523002, -0.5323205 ,
        0.05346673,  0.43908954,  0.12736237,  0.11748328, -0.03223978,
        0.30294204,  0.43887916, -0.1793193 , -0.19363931, -0.21842101,
        0.50964105, -0.09694078, -0.03430551,  0.18276013, -0.21690501,
        0.04544578,  0.27415705, -0.11679361, -0.25259763,  0.0895469 ,
       -0.38938457,  0.36528096,  0.13421895, -0.1035319 ,  0.03812784,
       -0.3657621 ,  0.43080994, -0.28503206, -0.79827994, -0.04070273,
       -0.05921128,  0.04337315, -0.23427871,  0.27903736,  0.31075197,
       -0.18912442, -0.17384917,  0.27425772,  0.3348448 ,  0.15542182,
        0.04993793,  0.45157406, -0.2595677 , -0.28854147,  0.03226552],
      dtype=float32)
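
The score reported by `most_similar` is a cosine similarity, so the inferred and stored arrays can be compared directly. A minimal sketch using only the standard library; the helper name `cosine_similarity` is hypothetical, and the short vectors are toy values rather than the 50-dimensional ones printed above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors pointing in the same direction but with different lengths
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Toy vectors at right angles to each other
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

In the notebook, `cosine_similarity(vector, model.docvecs[0])` should come out close to the 0.905 reported for document 0, since the measure ignores vector length and compares direction only.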

## 2. Querying Job Descriptions

In [36]:
# Read in the job data

job_df = pd.read_csv('../Data/Job_Data/Glassdoor_Joblist.csv')
print(job_df.shape)
job_df.head(2)

(3324, 12)


Unnamed: 0,Job_title,Company,State,City,Min_Salary,Max_Salary,Job_Desc,Industry,Rating,Date_Posted,Valid_until,Job_Type
0,Chief Marketing Officer (CMO),National Debt Relief,NY,New York,-1,-1,Who We're Looking For:\n\nThe Chief Marketing ...,Finance,4.0,2020-05-08,2020-06-07,FULL_TIME
1,Registered Nurse,Queens Boulevard Endoscopy Center,NY,Rego Park,-1,-1,"Queens Boulevard Endoscopy Center, an endoscop...",,3.0,2020-04-25,2020-06-07,FULL_TIME


In [48]:
# Define a function that queries a job description

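def job_query(job_description):
    """Return the training documents most similar to a job description."""
    # Apply the same preprocessing that was used for the training corpus
    doc = gensim.utils.simple_preprocess(job_description)
    # Infer a vector for the unseen text, then rank the course vectors against it
    vector = model.infer_vector(doc)
    sims = model.docvecs.most_similar([vector])
    return sims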

In [68]:
# Select a sample job

display(job_df.loc[953])
jd_1 = job_df['Job_Desc'][953]

Job_title                                         Data Scientist
Company                                               Brightidea
State                                                         CA
City                                               San Francisco
Min_Salary                                                119642
Max_Salary                                                138312
Job_Desc       \nData Scientist\n\nat Brightidea\n\nSan Franc...
Industry                                  Information Technology
Rating                                                       4.3
Date_Posted                                           2020-04-24
Valid_until                                           2020-06-05
Job_Type                                               FULL_TIME
Name: 953, dtype: object

In [69]:
# Call the function on the sample job description
sim_docs = job_query(jd_1)

# Display the doc ids and similarity scores
sim_docs

[(3120, 0.6513493657112122),
 (2219, 0.6135452389717102),
 (968, 0.5959042310714722),
 (1987, 0.5909996032714844),
 (2316, 0.5609511137008667),
 (253, 0.5444154143333435),
 (3044, 0.5315869450569153),
 (12, 0.5302783846855164),
 (3173, 0.5288372039794922),
 (3374, 0.5286433696746826)]

In [70]:
# Make a list of the ids for the top ten most similar docs
sim_ids = [sim[0] for sim in sim_docs]
sim_ids

[3120, 2219, 968, 1987, 2316, 253, 3044, 12, 3173, 3374]

In [71]:
# Which courses did the model identify as most similar?

course_df.loc[sim_ids, 'name']

3120      Digital Product Management: Modern Fundamentals
2219              Introduction to the Orbital Perspective
968                Creating and Developing a Tech Startup
1987                                     People Analytics
2316    Big Data Essentials: HDFS, MapReduce and Spark...
253     A Business Approach to Sustainable Landscape R...
3044    Big Data Analysis: Hive, Spark SQL, DataFrames...
12          Researcher Management and Leadership Training
3173    Time to Reorganize! Understand Organizations, ...
3374    Developing An Entrepreneurial Mindset: First S...
Name: name, dtype: object
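
The ranking can be easier to judge when each course name sits next to its score. A small sketch in plain Python, using the top four results shown above (the ones whose names are not truncated in the output); the `course_names` dictionary is a hypothetical stand-in for `course_df['name']`.

```python
# Top four (doc id, similarity) pairs copied from the query output above
sim_docs = [(3120, 0.6513493657112122),
            (2219, 0.6135452389717102),
            (968, 0.5959042310714722),
            (1987, 0.5909996032714844)]

# Hypothetical stand-in for course_df['name'], limited to these four ids
course_names = {
    3120: "Digital Product Management: Modern Fundamentals",
    2219: "Introduction to the Orbital Perspective",
    968: "Creating and Developing a Tech Startup",
    1987: "People Analytics",
}

# Pair each course name with its score, rounded for display
ranked = [(course_names[doc_id], round(score, 3)) for doc_id, score in sim_docs]
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

With the full DataFrame available, something like `course_df.loc[sim_ids, 'name'].to_frame().assign(score=[s for _, s in sim_docs])` would likely give the same pairing for all ten results.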

In [72]:
# Display the sample job description
jd_1

'\nData Scientist\n\nat Brightidea\n\nSan Francisco\n\nThe Role\n\nWe are seeking machine learning developers with natural language processing experience.\n\nIn general, we are looking for people who are self-motivated and passionate about the field of machine learning and the vast applications of it. These folks will have the ability to work with / understand / and build on top of an existing code base using their deep knowledge of various machine learning algorithms (e.g. neural networks, bayesian methods, etc).\n\nKey responsibilities include, but not limited to:\n\n\nBuild on top of an existing text processing/classification system\nWrite, maintain, and develop python machine learning modules & repos\nRun hyperparameter optimizations + collect, analyze, visualize, and present results\n\nWhat You Need to Succeed\n\nBS or MS in computer science, mathematics, physics or other hard science/engineering discipline\nProgramming in Python ~ 2+ years\nNumpy, scipy, pandas, Jupyter, and scik

In [73]:
# Display the description of the course identified as most similar

course_df.loc[sim_ids[0], 'description']

"Not so long ago, the job of product manager was about assessing market data, creating requirements, and managing the hand-off to sales/marketing. Maybe you’d talk to a customer somewhere in there and they’d tell you what features they wanted. But companies that manage product that way are dying.\n\nBeing a product person today is a new game, and product managers are at the center of it. Today, particularly if your product is mostly digital, you might update it several times a day. Massive troves of data are available for making decisions and, at the same time, deep insights into customer motivation and experience are more important than ever. The job of the modern product manager is to charter a direction and create a successful working environment for all the actors involved in product success.  It’s not a simple job or an easy job, but it is a meaningful job where you’ll be learning all the time. \n\nThis course will help you along your learning journey and prepare you with the skil

### Analysis:
The pipeline works end to end, but the model does not appear to be very effective: Digital Product Management is definitely not the most relevant course on Coursera for the Data Scientist position at Brightidea. Why does my model think it is? (The reasoning is difficult to see just by reading the two documents it matched.) How can I improve the model's performance?
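
One rough way to start investigating the match is to list the tokens two documents have in common, since shared generic vocabulary may pull otherwise unrelated texts together. This is only a diagnostic sketch, not an account of what Doc2Vec actually learned; the helper `shared_tokens` and the two short token lists are hypothetical stand-ins for the preprocessed job and course descriptions.

```python
from collections import Counter

def shared_tokens(tokens_a, tokens_b, top_n=5):
    """Return the most frequent tokens that appear in both token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Counter intersection keeps the minimum count of each shared token
    overlap = counts_a & counts_b
    return overlap.most_common(top_n)

# Made-up token lists standing in for a job description and a course description
job_tokens = ["data", "product", "team", "data", "learning", "python"]
course_tokens = ["product", "manager", "data", "team", "data", "customer"]

common = shared_tokens(job_tokens, course_tokens)
print(common)
```

In the notebook this could be called as `shared_tokens(gensim.utils.simple_preprocess(jd_1), train_corpus[sim_ids[0]].words, top_n=20)` to see which words the job posting and the top-ranked course share.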