# Resume Scoring using Doc2Vec model

## Introduction

This notebook builds a resume scorer with the Doc2Vec model. A resume and a job description are each represented as a numerical vector, and the match score is the cosine similarity between the two vectors.
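
As a rough illustration of the scoring step, the sketch below computes cosine similarity in plain Python. The function name `cosine_similarity` and the three-dimensional vectors are assumptions made for this example; real Doc2Vec vectors in this notebook have 50 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for real document embeddings
resume_vec = [0.2, 0.1, 0.7]
job_vec = [0.4, 0.2, 1.4]  # same direction as resume_vec, twice the length

print(cosine_similarity(resume_vec, job_vec))
```

Because the score depends only on direction, the two vectors above score close to 1 even though their lengths differ; orthogonal vectors score 0.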

### Import the libraries

In [1]:
# Import the required libraries
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import pandas as pd

### Load the training dataset

In [2]:
# Load data from "nyc_jobs.csv"
df = pd.read_csv("nyc_jobs.csv")

# Display top 5 rows in data
df.head()

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Code No,Level,Job Category,Full-Time/Part-Time indicator,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
0,87990,DEPARTMENT OF BUSINESS SERV.,Internal,1,Account Manager,CONTRACT REVIEWER (OFFICE OF L,40563,1,,,...,"Salary range for this position is: $42,405 - $...",,,,,New York City residency is generally required ...,2011-06-24T00:00:00,,2011-06-24T00:00:00,2018-07-17T00:00:00
1,97899,DEPARTMENT OF BUSINESS SERV.,Internal,1,"EXECUTIVE DIRECTOR, BUSINESS DEVELOPMENT",ADMINISTRATIVE BUSINESS PROMOT,10009,M3,,F,...,,"In addition to applying through this website, ...",,,,New York City residency is generally required ...,2012-01-26T00:00:00,,2012-01-26T00:00:00,2018-07-17T00:00:00
2,102221,DEPT OF ENVIRONMENT PROTECTION,External,1,Project Specialist,ENVIRONMENTAL ENGINEERING INTE,20616,0,,F,...,Appointments are subject to OMB approval,click the apply now button,35 hours per week/day,,,New York City Residency is not required for th...,2012-06-21T00:00:00,,2012-09-07T00:00:00,2018-07-17T00:00:00
3,102221,DEPT OF ENVIRONMENT PROTECTION,Internal,1,Project Specialist,ENVIRONMENTAL ENGINEERING INTE,20616,0,,F,...,Appointments are subject to OMB approval,click the apply now button,35 hours per week/day,,,New York City Residency is not required for th...,2012-06-21T00:00:00,,2012-09-07T00:00:00,2018-07-17T00:00:00
4,114352,DEPT OF ENVIRONMENT PROTECTION,Internal,5,Deputy Plant Chief,SENIOR STATIONARY ENGINEER (EL,91639,0,,F,...,Appointments are subject to OMB approval Fo...,"Click ""Apply Now"" button",40 per week / day,Various,,New York City residency is generally required ...,2012-12-12T00:00:00,,2012-12-13T00:00:00,2018-07-17T00:00:00


### Prepare the data for training

In [3]:
# Display columns in data
df.columns

Index(['Job ID', 'Agency', 'Posting Type', '# Of Positions', 'Business Title',
       'Civil Service Title', 'Title Code No', 'Level', 'Job Category',
       'Full-Time/Part-Time indicator', 'Salary Range From', 'Salary Range To',
       'Salary Frequency', 'Work Location', 'Division/Work Unit',
       'Job Description', 'Minimum Qual Requirements', 'Preferred Skills',
       'Additional Information', 'To Apply', 'Hours/Shift', 'Work Location 1',
       'Recruitment Contact', 'Residency Requirement', 'Posting Date',
       'Post Until', 'Posting Updated', 'Process Date'],
      dtype='object')

In [4]:
# Select relevant columns from data (copy so later column assignments do not trigger SettingWithCopyWarning)
df = df[['Business Title', 'Job Description', 'Minimum Qual Requirements', 'Preferred Skills']].copy()

# Display top 5 rows in data
df.head()

Unnamed: 0,Business Title,Job Description,Minimum Qual Requirements,Preferred Skills
0,Account Manager,Division of Economic & Financial Opportunity (...,1.\tA baccalaureate degree from an accredited ...,•\tExcellent interpersonal and organizationa...
1,"EXECUTIVE DIRECTOR, BUSINESS DEVELOPMENT",The New York City Department of Small Business...,1. A baccalaureate degree from an accredited c...,
2,Project Specialist,"Under direct supervision, perform elementary e...",A Baccalaureate degree from an accredited coll...,
3,Project Specialist,"Under direct supervision, perform elementary e...",A Baccalaureate degree from an accredited coll...,
4,Deputy Plant Chief,"Under general direction, is in responsible cha...",1. Six years of full-time satisfactory experie...,


In [5]:
# Create a new column 'data' & merge the values of the selected relevant columns into it
df['data'] = df[['Business Title', 'Job Description', 'Minimum Qual Requirements', 'Preferred Skills']].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

# Drop the individual selected columns as they are no longer needed
df.drop(['Business Title', 'Job Description', 'Minimum Qual Requirements', 'Preferred Skills'], axis=1, inplace=True)

# Display top 5 rows in data
df.head()

Unnamed: 0,data
0,Account Manager Division of Economic & Financi...
1,"EXECUTIVE DIRECTOR, BUSINESS DEVELOPMENT The N..."
2,"Project Specialist Under direct supervision, p..."
3,"Project Specialist Under direct supervision, p..."
4,"Deputy Plant Chief Under general direction, is..."
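
The merge step relies on `dropna()` to skip missing fields before joining, so rows with an empty 'Preferred Skills' value do not end up containing the text "nan". A minimal sketch on a two-row toy frame (the values are invented for illustration):

```python
import pandas as pd

# Toy frame with one missing value, standing in for the job postings
toy = pd.DataFrame({
    "Business Title": ["Analyst", "Engineer"],
    "Preferred Skills": ["SQL", None],
})

# Join the non-missing fields of each row into a single string
merged = toy.apply(lambda row: " ".join(row.dropna().astype(str)), axis=1)

print(merged.tolist())  # ['Analyst SQL', 'Engineer']
```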


### Tokenize data

In [6]:
# Tokenize the words in the 'data' column & tag them with unique identifiers using the TaggedDocument class
data = list(df['data'])
tagged_data = [TaggedDocument(words = word_tokenize(_d.lower()), tags = [str(i)]) for i, _d in enumerate(data)]

### Model initialisation

In [7]:
# Initialise the Doc2Vec model
model = Doc2Vec(vector_size = 50, min_count = 5, epochs = 100, alpha = 0.001)

### Vocabulary building

In [8]:
# Build the vocabulary
model.build_vocab(tagged_data)

# Get the vocabulary keys
keys = model.wv.key_to_index.keys()

# Display the length of the vocabulary keys
print(len(keys))

8599
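
The vocabulary is smaller than the number of distinct tokens in the corpus because `min_count = 5` tells gensim to ignore words whose total frequency is below 5. The sketch below imitates that pruning rule with a `Counter`; the helper `prune_vocab` is hypothetical and is not gensim's implementation.

```python
from collections import Counter

def prune_vocab(token_lists, min_count):
    """Keep only tokens whose total corpus frequency is at least min_count."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}

# Toy tokenized corpus (invented for illustration)
toy_corpus = [
    ["project", "manager", "budget"],
    ["project", "engineer", "budget"],
    ["project", "analyst"],
]

vocab = prune_vocab(toy_corpus, min_count=2)
print(sorted(vocab))  # ['budget', 'project']
```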


### Model training

In [9]:
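# Train the model one epoch per call so that progress can be printed.
# Passing epochs=model.epochs inside this loop would run 100 x 100 passes over the data.
for epoch in range(model.epochs):
    print(f"Training epoch {epoch + 1}/{model.epochs}")
    model.train(tagged_data, total_examples=model.corpus_count, epochs=1)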

Training epoch 1/100
Training epoch 2/100
Training epoch 3/100
Training epoch 4/100
Training epoch 5/100
Training epoch 6/100
Training epoch 7/100
Training epoch 8/100
Training epoch 9/100
Training epoch 10/100
Training epoch 11/100
Training epoch 12/100
Training epoch 13/100
Training epoch 14/100
Training epoch 15/100
Training epoch 16/100
Training epoch 17/100
Training epoch 18/100
Training epoch 19/100
Training epoch 20/100
Training epoch 21/100
Training epoch 22/100
Training epoch 23/100
Training epoch 24/100
Training epoch 25/100
Training epoch 26/100
Training epoch 27/100
Training epoch 28/100
Training epoch 29/100
Training epoch 30/100
Training epoch 31/100
Training epoch 32/100
Training epoch 33/100
Training epoch 34/100
Training epoch 35/100
Training epoch 36/100
Training epoch 37/100
Training epoch 38/100
Training epoch 39/100
Training epoch 40/100
Training epoch 41/100
Training epoch 42/100
Training epoch 43/100
Training epoch 44/100
Training epoch 45/100
Training epoch 46/100
...

### Save the model

In [10]:
# Save the model as "resumeScorer.model"
model.save("resumeScorer.model")
print("Model saved as resumeScorer.model")

Model saved as resumeScorer.model
