This notebook builds a Doc2Vec model from Glassdoor job descriptions, then uses it to find the job listings most similar to a Coursera course description.

In [67]:
import pandas as pd
import gensim

In [68]:
df = pd.read_csv('../Data/Job_Data/Glassdoor_Joblist.csv')
print(df.shape)
df.head()

(3324, 12)


Unnamed: 0,Job_title,Company,State,City,Min_Salary,Max_Salary,Job_Desc,Industry,Rating,Date_Posted,Valid_until,Job_Type
0,Chief Marketing Officer (CMO),National Debt Relief,NY,New York,-1,-1,Who We're Looking For:\n\nThe Chief Marketing ...,Finance,4.0,2020-05-08,2020-06-07,FULL_TIME
1,Registered Nurse,Queens Boulevard Endoscopy Center,NY,Rego Park,-1,-1,"Queens Boulevard Endoscopy Center, an endoscop...",,3.0,2020-04-25,2020-06-07,FULL_TIME
2,Dental Hygienist,Batista Dental,NJ,West New York,-1,-1,Part-time or Full-timedental hygienist positio...,,,2020-05-02,2020-06-07,PART_TIME
3,Senior Salesforce Developer,National Debt Relief,NY,New York,44587,82162,Principle Duties & Responsibilities:\n\nAnalyz...,Finance,4.0,2020-05-08,2020-06-07,FULL_TIME
4,"DEPUTY EXECUTIVE DIRECTOR, PROGRAM AND LEGAL A...",National Advocates for Pregnant Women,NY,New York,125410,212901,"For FULL Job Announcement, visit our website: ...",,,2020-04-28,2020-06-07,FULL_TIME


In [69]:
# Convert job description series to list
job_descriptions = df['Job_Desc'].to_list()

In [70]:
# Do simple preprocessing on the corpus of job descriptions:
# tokenize, lowercase, remove punctuation.

def read_corpus(corpus):
    for i, doc in enumerate(corpus):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(job_descriptions))

In [71]:
# Examine one document in the corpus (tag at end)

print(train_corpus[1])

TaggedDocument(['queens', 'boulevard', 'endoscopy', 'center', 'an', 'endoscopy', 'asc', 'located', 'in', 'rego', 'park', 'has', 'an', 'exciting', 'opportunity', 'for', 'full', 'time', 'registered', 'nurse', 'successful', 'candidates', 'will', 'provide', 'quality', 'nursing', 'care', 'in', 'all', 'areas', 'of', 'the', 'center', 'including', 'pre', 'assessment', 'pre', 'op', 'and', 'pacu', 'qualified', 'candidates', 'must', 'possess', 'the', 'following', 'current', 'ny', 'state', 'rn', 'license', 'bls', 'certification', 'acls', 'preferred', 'must', 'be', 'team', 'player', 'with', 'excellent', 'multi', 'tasking', 'and', 'interpersonal', 'skills', 'compassion', 'for', 'patient', 'needs', 'and', 'high', 'degree', 'of', 'professionalism', 'chinese', 'speaking', 'and', 'spanish', 'preferred', 'queens', 'boulevard', 'endoscopy', 'center', 'offers', 'pleasant', 'professional', 'work', 'environment', 'and', 'no', 'evening', 'or', 'holiday', 'work', 'hours', 'drug', 'free', 'work', 'environment',
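
To make the preprocessing step concrete, here is a rough standard-library approximation of what `gensim.utils.simple_preprocess` does with its default settings: lowercase the text, keep runs of non-digit word characters, and drop tokens shorter than 2 or longer than 15 characters. The name `approx_simple_preprocess` is hypothetical, and the sketch ignores gensim's accent and encoding options, so treat it as an illustration rather than a drop-in replacement.

```python
import re

# Runs of word characters that are not digits; this approximates the
# alphabetic-token pattern gensim uses (an assumption of this sketch).
_TOKEN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Hypothetical stand-in for gensim.utils.simple_preprocess (defaults only)."""
    tokens = _TOKEN.findall(text.lower())
    # Drop very short or very long tokens and tokens starting with "_".
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]


sample = "Current NY State RN license, BLS certification; multi-tasking skills (2+ years)."
print(approx_simple_preprocess(sample))
```

Punctuation and digits disappear, and hyphenated words such as "multi-tasking" split into two tokens, which matches the token list printed above.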

In [72]:
# Instantiate the model

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, 
                                      min_count=2,
                                      epochs=20)

In [73]:
# Build a vocabulary

model.build_vocab(train_corpus)

In [74]:
# How many occurrences of the word "data" in the corpus?
# (gensim 3.x API; in gensim >= 4.0 use model.wv.get_vecattr('data', 'count'))

model.wv.vocab['data'].count

19259
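
Conceptually, `build_vocab` tallies how often each token occurs across the corpus and drops tokens seen fewer than `min_count` times. The sketch below mimics only that counting and trimming step with `collections.Counter`. `count_vocab` and the toy corpus are hypothetical, and gensim builds additional internal structures that this sketch omits.

```python
from collections import Counter


def count_vocab(token_lists, min_count=2):
    """Count tokens over all documents and keep those seen at least min_count times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok: n for tok, n in counts.items() if n >= min_count}


# Toy corpus (made up for illustration)
toy_corpus = [
    ["data", "scientist", "python"],
    ["data", "analyst"],
    ["registered", "nurse", "data"],
]
vocab = count_vocab(toy_corpus, min_count=2)
print(vocab)  # {'data': 3}
```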

In [75]:
# Train the model on the corpus

model.train(train_corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)

In [76]:
# Assess the model by inferring a vector from a document
# in the training corpus
# (in gensim >= 4.0, model.docvecs is renamed model.dv)

vector = model.infer_vector(train_corpus[1].words)
sims = model.docvecs.most_similar([vector])

In [77]:
# Document ids and similarity scores for the most similar
# documents (these include duplicates of the query posting!)
sims

[(601, 0.9700711965560913),
 (721, 0.9677823185920715),
 (811, 0.9671065211296082),
 (631, 0.9666780829429626),
 (31, 0.9636819362640381),
 (361, 0.9634627103805542),
 (481, 0.9631281495094299),
 (200, 0.961749255657196),
 (2048, 0.9606184363365173),
 (691, 0.9604147672653198)]
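
The scores above are cosine similarities between the inferred vector and each stored document vector. Here is a minimal pure-Python sketch of that ranking, assuming small hypothetical vectors in place of the model's 50-dimensional ones (`cosine` and `top_n` are illustrative names, not gensim functions):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n(query, doc_vectors, n=10):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy document vectors (made up for illustration)
doc_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranked = top_n([1.0, 0.1], doc_vectors, n=2)
print(ranked)
```

Because `infer_vector` re-estimates a vector instead of looking up the stored one, a training document typically scores close to, but not exactly, 1.0 against its own copies.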

In [78]:
# Document 751 is another copy of the posting in document 1
train_corpus[751]

TaggedDocument(words=['queens', 'boulevard', 'endoscopy', 'center', 'an', 'endoscopy', 'asc', 'located', 'in', 'rego', 'park', 'has', 'an', 'exciting', 'opportunity', 'for', 'full', 'time', 'registered', 'nurse', 'successful', 'candidates', 'will', 'provide', 'quality', 'nursing', 'care', 'in', 'all', 'areas', 'of', 'the', 'center', 'including', 'pre', 'assessment', 'pre', 'op', 'and', 'pacu', 'qualified', 'candidates', 'must', 'possess', 'the', 'following', 'current', 'ny', 'state', 'rn', 'license', 'bls', 'certification', 'acls', 'preferred', 'must', 'be', 'team', 'player', 'with', 'excellent', 'multi', 'tasking', 'and', 'interpersonal', 'skills', 'compassion', 'for', 'patient', 'needs', 'and', 'high', 'degree', 'of', 'professionalism', 'chinese', 'speaking', 'and', 'spanish', 'preferred', 'queens', 'boulevard', 'endoscopy', 'center', 'offers', 'pleasant', 'professional', 'work', 'environment', 'and', 'no', 'evening', 'or', 'holiday', 'work', 'hours', 'drug', 'free', 'work', 'environ
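
Since the scraped listings contain repeated postings, one way to find exact duplicates before training is to group document ids by their token sequence. This is a sketch, not part of the original notebook: `find_duplicates` is a hypothetical helper, and it only catches postings whose tokens match exactly.

```python
from collections import defaultdict


def find_duplicates(token_lists):
    """Group document ids that share an identical token sequence."""
    groups = defaultdict(list)
    for doc_id, tokens in enumerate(token_lists):
        groups[tuple(tokens)].append(doc_id)
    # Keep only groups that contain more than one document.
    return [ids for ids in groups.values() if len(ids) > 1]


# Toy token lists (made up for illustration)
toy = [
    ["registered", "nurse"],
    ["data", "scientist"],
    ["registered", "nurse"],
    ["dental", "hygienist"],
    ["registered", "nurse"],
]
print(find_duplicates(toy))  # [[0, 2, 4]]
```

With the notebook's corpus, this could be called as `find_duplicates([doc.words for doc in train_corpus])`.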

In [79]:
# Read in the course data

course_catalog = pd.read_csv('../Data/Course_Data/Coursera_Catalog.csv')
print(course_catalog.shape)
course_catalog.head()

(4416, 10)


Unnamed: 0,courseType,description,domainTypes,id,slug,specializations,workload,primaryLanguages,certificates,name
0,v2.ondemand,Gamification is the application of game elemen...,"[{'subdomainId': 'design-and-product', 'domain...",69Bku0KoEeWZtA4u62x6lQ,gamification,[],4-8 hours/week,['en'],['VerifiedCert'],Gamification
1,v2.ondemand,This course will cover the steps used in weigh...,"[{'subdomainId': 'data-analysis', 'domainId': ...",0HiU7Oe4EeWTAQ4yevf_oQ,missing-data,[],"4 weeks of study, 1-2 hours/week",['en'],"['VerifiedCert', 'Specialization']",Dealing With Missing Data
2,v2.ondemand,The Unordered Data Structures course covers th...,"[{'domainId': 'computer-science', 'subdomainId...",sI_-QEBiEemtDRLx7Ne8jg,cs-fundamentals-3,[],,['en'],"['VerifiedCert', 'Specialization']",Unordered Data Structures
3,v2.ondemand,"The vital signs – heart rate, blood pressure, ...","[{'subdomainId': 'patient-care', 'domainId': '...",5zjIsJq-EeW_wArffOXkOw,vital-signs,[],3-5 hours/week,['en'],['VerifiedCert'],Vital Signs: Understanding What the Body Is Te...
4,v2.ondemand,This course “FinTech Disruptive Innovation: Im...,"[{'subdomainId': 'finance', 'domainId': 'busin...",WFanvtoSEeedbRLwgi9a7A,fintech-disruption,[],"Around 4 hours of videos in total, plus a fina...",['en'],"['VerifiedCert', 'Specialization']",FinTech Disruptive Innovation: Implications fo...


In [80]:
# select a course description as an example

cd = course_catalog['description'][9]
cd

'Machine learning (ML) is one of the fastest growing areas in technology and a highly sought after skillset in today’s job market. The World Economic Forum states the growth of artificial intelligence (AI) could create 58 million net new jobs in the next few years, yet it’s estimated that currently there are 300,000 AI engineers worldwide, but millions are needed. This means there is a unique and immediate opportunity for you to get started with learning the essential ML concepts that are used to build AI applications – no matter what your skill levels are. Learning the foundations of ML now, will help you keep pace with this growth, expand your skills and even help advance your career. \n\nThis course will teach you how to get started with AWS Machine Learning. Key topics include: Machine Learning on AWS, Computer Vision on AWS, and Natural Language Processing (NLP) on AWS. Each topic consists of several modules deep-diving into variety of ML concepts, AWS services as well as insights

In [81]:
# tokenize the course description
new_doc = gensim.utils.simple_preprocess(cd)
type(new_doc)

list

In [82]:
# Infer a vector using the model and compare to training docs
new_vector = model.infer_vector(new_doc)
new_sims = model.docvecs.most_similar([new_vector])
new_sims

[(953, 0.6477341055870056),
 (1730, 0.6368511915206909),
 (2092, 0.5848551988601685),
 (1737, 0.5847705602645874),
 (929, 0.5752161145210266),
 (2263, 0.5674833059310913),
 (2452, 0.5643137097358704),
 (1728, 0.5582506656646729),
 (2074, 0.5567480325698853),
 (1244, 0.5544293522834778)]
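
Raw ids are hard to read, so it can help to pair each score with a job title. The sketch below assumes document ids are row positions, as they are when the tags come from `enumerate`. `label_matches` and the toy data are hypothetical.

```python
def label_matches(sims, titles):
    """Attach a title to each (doc_id, score) pair; ids are assumed to be row positions."""
    return [(doc_id, titles[doc_id], round(score, 3)) for doc_id, score in sims]


# Toy data (made up for illustration)
toy_titles = ["Chief Marketing Officer", "Registered Nurse", "Data Scientist"]
toy_sims = [(2, 0.6477), (0, 0.5848)]
for row in label_matches(toy_sims, toy_titles):
    print(row)
```

In this notebook the call would look like `label_matches(new_sims, df['Job_title'].to_list())`.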

In [87]:
# This is the job listing with the highest similarity score
df.iloc[[953]]

Unnamed: 0,Job_title,Company,State,City,Min_Salary,Max_Salary,Job_Desc,Industry,Rating,Date_Posted,Valid_until,Job_Type
953,Data Scientist,Brightidea,CA,San Francisco,119642,138312,\nData Scientist\n\nat Brightidea\n\nSan Franc...,Information Technology,4.3,2020-04-24,2020-06-05,FULL_TIME


In [88]:
df['Job_Desc'][953]

'\nData Scientist\n\nat Brightidea\n\nSan Francisco\n\nThe Role\n\nWe are seeking machine learning developers with natural language processing experience.\n\nIn general, we are looking for people who are self-motivated and passionate about the field of machine learning and the vast applications of it. These folks will have the ability to work with / understand / and build on top of an existing code base using their deep knowledge of various machine learning algorithms (e.g. neural networks, bayesian methods, etc).\n\nKey responsibilities include, but not limited to:\n\n\nBuild on top of an existing text processing/classification system\nWrite, maintain, and develop python machine learning modules & repos\nRun hyperparameter optimizations + collect, analyze, visualize, and present results\n\nWhat You Need to Succeed\n\nBS or MS in computer science, mathematics, physics or other hard science/engineering discipline\nProgramming in Python ~ 2+ years\nNumpy, scipy, pandas, Jupyter, and scik

In [89]:
cd

'Machine learning (ML) is one of the fastest growing areas in technology and a highly sought after skillset in today’s job market. The World Economic Forum states the growth of artificial intelligence (AI) could create 58 million net new jobs in the next few years, yet it’s estimated that currently there are 300,000 AI engineers worldwide, but millions are needed. This means there is a unique and immediate opportunity for you to get started with learning the essential ML concepts that are used to build AI applications – no matter what your skill levels are. Learning the foundations of ML now, will help you keep pace with this growth, expand your skills and even help advance your career. \n\nThis course will teach you how to get started with AWS Machine Learning. Key topics include: Machine Learning on AWS, Computer Vision on AWS, and Natural Language Processing (NLP) on AWS. Each topic consists of several modules deep-diving into variety of ML concepts, AWS services as well as insights