## CV Job Matching using Doc2Vec

### Introduction
CV Job Matching using Doc2Vec is a technique that aims to match job descriptions with resumes by representing them as numerical vectors using the Doc2Vec model. This approach allows for efficient comparison and similarity calculation between textual documents.

In the field of machine learning, representing text documents numerically is a challenging task. However, it is essential for various applications, such as document retrieval, web search, spam filtering, and topic modeling. Doc2Vec, a variation of the Word2Vec algorithm, provides a solution by generating vector representations from words.

**Word2Vec** algorithms, such as **Continuous Bag-of-Words (CBOW)** and **Skip-Gram**, are used to create Word2Vec representations. CBOW predicts the current word based on the surrounding words in a sliding window context. Each word is then converted into a feature vector, and these vectors become the word vectors after training. On the other hand, Skip-Gram predicts the surrounding words given the current word. It is slower than CBOW but is known for its accuracy with infrequent words.

### Implementation

To implement CV Job Matching using Doc2Vec, we start by importing the necessary libraries and loading the job data from a CSV file. We preprocess the data, keeping only the relevant columns, and merge them into a new column called 'data.' Then, we tokenize the words in the 'data' column and tag them with unique identifiers using the TaggedDocument class.

Next, we initialize the Doc2Vec model with specific parameters, such as the vector size, minimum count, and number of epochs. We build the vocabulary by feeding the tagged data to the model, and then train the model on the tagged data.

After training, we save the model for future use. To match a resume with a job description, we load the saved model and preprocess the resume and job description text. We convert them to lowercase, remove punctuation and numerical values.

Using the trained model, we infer the document vectors for the resume and job description. Then, we calculate the cosine similarity between the two vectors to determine the match between the resume and the job description. The cosine similarity score ranges from -1 to 1, with 1 indicating a perfect match and -1 indicating no similarity.

By employing Doc2Vec and cosine similarity, this approach enables efficient and effective matching between job descriptions and resumes, helping to streamline the job application process and enhance the chances of finding the right candidates for specific positions.

Finally, the author also employs Gauge chart from Plotly to show the matching percentage with threshold that users could consider modifying thier CV to pass Application Tracking System (TSA) from the majority of employers.

### Coding
#### 1. Set up

In [None]:
## Install all dependencies
# !pip install gensim
# !pip install nltk
# !pip install pandas
# !pip install numpy
# !pip install requests
#!pip install PyPDF2

In [5]:
# Import libraries
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from numpy.linalg import norm
from termcolor import colored
import pandas as pd
import numpy as np
import requests
import PyPDF2
import re
import plotly.graph_objects as go
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### 2. Prepare data
This dataset that we trained our model contains current job postings available on the City of New York’s official jobs site in 2020. You can follow this link to download: 
[New York Job Posting Dataset](https://data.world/city-of-ny/kpav-sd4t)

In [6]:
# Load data
df = pd.read_csv('jobs.csv')
# Check data
df.head()

Unnamed: 0,intitule_poste,entreprise,secteur_entreprise,description,salaire,experience,education,competences,ville,adresse_complete,...,salaire_max,salaire_moyen,latitude,longitude,duree_min,duree_max,duree_moyenne,recherche_effectuee,site_source,description_company
0,Ingénieur QA (Equipe Solution) (H/F),Datanumia,"Logiciels, Energie, SocialTech / GreenTech","En adoptant la démarche DORA, Datanumia favori...",Non spécifié,,Non spécifié,,Courbevoie,"4 Place des Vosges, 92400 Courbevoie, France",...,,,48.892859,2.248062,,,,data,Welcome to the Jungle,
1,Consultant senior Data (H/F),Klint,"Logiciels, Digital Marketing / Data Marketing,...",Rejoignez notre équipe de spécialistes Data en...,55K à 70K €,5.0,Bac +5,"collaboration et travail déquipe, power bi",Levallois-Perret,"74 Rue Anatole France, 92300 Levallois-Perret,...",...,70000.0,62500.0,48.892646,2.284642,,,,data,Welcome to the Jungle,
2,Consultant Data Engineer / Big Data (H/F),MP DATA,"Intelligence artificielle / Machine Learning, ...",Dans un contexte de croissance continue de nos...,40K à 58K €,,Non spécifié,"communication, langages de programmation",Balma,"3 Rue de Vidailhan, 31130 Balma, France",...,58000.0,49000.0,43.626219,1.488673,,,,data,Welcome to the Jungle,
3,Machine Learning Engineer (H/F) | Stage,Datascientest,"SaaS / Cloud Services, EdTech, Formation",Le MLOps est aujourd’hui incontournable pour i...,Non spécifié,,Non spécifié,"machine learning, tensorflow, kubernetes, pytorch",Puteaux,"1 Terrasse Bellini, 92800 Puteaux, France",...,,,48.886942,2.251445,3.0,6.0,4.5,data,Welcome to the Jungle,
4,Analytics Engineer - Confirmé·e,JAKALA,"Digital Marketing / Data Marketing, Big Data, ...","Au sein de notre Practice Data & AI , tu trava...",Non spécifié,3.0,Bac +5,"rédaction technique, travail déquipe, visualis...",Paris,Découvrir,...,,,48.844696,2.43665,,,,data,Welcome to the Jungle,


Since head() fuction does not show all data, we check column names to retain only necessary variables.

In [7]:
# Show column name
df['experience'] = "Experiences requises : " + df['experience'].astype(str)
df['competences'] = "Compétences requises : " + df['competences'].astype(str)
df['education'] = "Formation requise : " + df['education'].astype(str)

df = df[['intitule_poste','description','education','competences','experience']]

df.head()

Unnamed: 0,intitule_poste,description,education,competences,experience
0,Ingénieur QA (Equipe Solution) (H/F),"En adoptant la démarche DORA, Datanumia favori...",Formation requise : Non spécifié,Compétences requises : nan,Experiences requises : nan
1,Consultant senior Data (H/F),Rejoignez notre équipe de spécialistes Data en...,Formation requise : Bac +5,Compétences requises : collaboration et travai...,Experiences requises : 5.0
2,Consultant Data Engineer / Big Data (H/F),Dans un contexte de croissance continue de nos...,Formation requise : Non spécifié,"Compétences requises : communication, langages...",Experiences requises : nan
3,Machine Learning Engineer (H/F) | Stage,Le MLOps est aujourd’hui incontournable pour i...,Formation requise : Non spécifié,"Compétences requises : machine learning, tenso...",Experiences requises : nan
4,Analytics Engineer - Confirmé·e,"Au sein de notre Practice Data & AI , tu trava...",Formation requise : Bac +5,"Compétences requises : rédaction technique, tr...",Experiences requises : 3.0


Keep only some columns to train the model

In [8]:
# Create a new column called 'data' and merge the values of the other columns into it
df['data'] = df[['intitule_poste','education','competences','experience']].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
# Drop the individual columns if you no longer need them
df.drop(['intitule_poste','education','competences','experience'], axis=1, inplace=True)
# Preview the updated dataframe
print(df.head())

                                         description  \
0  En adoptant la démarche DORA, Datanumia favori...   
1  Rejoignez notre équipe de spécialistes Data en...   
2  Dans un contexte de croissance continue de nos...   
3  Le MLOps est aujourd’hui incontournable pour i...   
4  Au sein de notre Practice Data & AI , tu trava...   

                                                data  
0  Ingénieur QA (Equipe Solution) (H/F) Formation...  
1  Consultant senior Data (H/F) Formation requise...  
2  Consultant Data Engineer / Big Data (H/F) Form...  
3  Machine Learning Engineer  (H/F) | Stage Forma...  
4  Analytics Engineer - Confirmé·e Formation requ...  


#### 3. Tokenize data
We tokenize the words in the 'data' column and tag them with unique identifiers using the TaggedDocument class.

In [9]:
# Tag data
data = list(df['data'])
tagged_data = [TaggedDocument(words = word_tokenize(_d.lower()), tags = [str(i)]) for i, _d in enumerate(data)]

#### 4. Model initialization and vocabulary buiding
Next, we initialize the Doc2Vec model with specific parameters.

**Parameters** of Doc2Vec are as follows: 

- `vector_size`: Dimensionality of the feature vectors. Default: 100.
- `window`: The window refers to the maximum distance between the target word and its context words within a sentence. Default: 5.
- `min_count`: Ignores all words with a total frequency lower than this. Default: 5.
- `epochs`: Number of iterations (epochs) over the corpus. Defaults to 5 for PV-DBOW and 10 for PV-DM.
- `dm`: Defines the training algorithm. If `dm = 1`, the Distributed Memory (PV-DM) model is used. If `dm = 0`, the Distributed Bag of Words (PV-DBOW) model is used. Default: 1 (PV-DM).
- `dbow_words`: If set to 1, trains word vectors (in addition to document vectors) using the PV-DBOW algorithm. Default: 0 (False).
- `dm_mean`: If set to 1, uses the mean of the context word vectors instead of concatenation when inferring vectors in the PV-DM model. Default: 0 (False).
- `dm_concat`: If set to 1, concatenates the document and context word vectors when inferring vectors in the PV-DM model. Default: 0 (False).
- `dm_tag_count`: Expected number of document tags per document, when using the PV-DM algorithm. Default: 1.
- `dbow_tag_count`: Expected number of document tags per document, when using the PV-DBOW algorithm. Default: 1.
- `alpha`: The initial learning rate. Default: 0.025.
- `min_alpha`: The learning rate will linearly drop to `min_alpha` as training progresses. Default: 0.0001.
- `hs`: If set to 1, hierarchical softmax activation function will be used. Default: 0 (Negative Sampling).
- `negative`: If > 0, negative sampling will be used, the int for negative specifies how many "noise words" should be drawn. Default: 5.
- `ns_exponent`: The exponent used to shape the negative sampling distribution. Default: 0.75.


In [19]:
# Model initialization
model = Doc2Vec(vector_size = 50,
min_count = 5,
epochs = 100,
alpha = 0.001
)

In [20]:
# Vocabulary building
model.build_vocab(tagged_data)
# Get the vocabulary keys
keys = model.wv.key_to_index.keys()
# Print the length of the vocabulary keys
print(len(keys))

675


#### 5. Train and save the model
Train the model on tagged data.

In [21]:
# Train the model
for epoch in range(model.epochs):
    print(f"Training epoch {epoch+1}/{model.epochs}")
    model.train(tagged_data, 
                total_examples=model.corpus_count, 
                epochs=model.epochs)

model.save('cv_job_maching.model')
print("Model saved")

Training epoch 1/100
Training epoch 2/100
Training epoch 3/100
Training epoch 4/100
Training epoch 5/100
Training epoch 6/100
Training epoch 7/100
Training epoch 8/100
Training epoch 9/100
Training epoch 10/100
Training epoch 11/100
Training epoch 12/100
Training epoch 13/100
Training epoch 14/100
Training epoch 15/100
Training epoch 16/100
Training epoch 17/100
Training epoch 18/100
Training epoch 19/100
Training epoch 20/100
Training epoch 21/100
Training epoch 22/100
Training epoch 23/100
Training epoch 24/100
Training epoch 25/100
Training epoch 26/100
Training epoch 27/100
Training epoch 28/100
Training epoch 29/100
Training epoch 30/100
Training epoch 31/100
Training epoch 32/100
Training epoch 33/100
Training epoch 34/100
Training epoch 35/100
Training epoch 36/100
Training epoch 37/100
Training epoch 38/100
Training epoch 39/100
Training epoch 40/100
Training epoch 41/100
Training epoch 42/100
Training epoch 43/100
Training epoch 44/100
Training epoch 45/100
Training epoch 46/1

#### 6. Inputs of CV and JD

- **Resume**:

In this case, I assume that we upload our CV in PDF file, so I use PyPDF2 to extract data. You can also change how to read inputs appropreately.

In [10]:
pdf = PyPDF2.PdfReader('./cv/CV_Lucas_Coussy_english_version.pdf')
resume = ""
for i in range(len(pdf.pages)):
    pageObj = pdf.pages[i]
    resume += pageObj.extract_text()

In [11]:
# import fitz  # pymupdf

# doc = fitz.open("./cv/CV_Lucas_Coussy_english_version.pdf")

# resume_ = ""
# for page in doc:
#     resume_ += page.get_text("text") + "\n"

# print(resume_)

- **Job Description**:

From my perspective, I believe candidates will copy and paste the JD into textbox to check the matching percentage, so I will have JD Input in text as below.

In [12]:
jd = """
We are looking for a highly motivated individual to join our AI Special Forces team. 
A person who is passionate about delivering fast, effective, and high-quality support to clients, and is driven by the potential of technology and AI. 
This role is perfect for someone who loves solving problems, is highly organized, and has a strong inteZrest in technology and AI.
\nAs an AI Special Forces Specialist, you will play a critical role, acting as the first line of defense when clients encounter issues with their AI agents or 
need to integrate them with external systems. 
You\u2019ll work directly with customers to resolve questions, troubleshoot technical problems, and collaborate with internal teams (CS, Onboarding, Product, and Engineering) 
to ensure issues are resolved promptly and thoroughly. 
Your work is key to maintaining strong client relationships and ensuring satisfaction with the Darwin AI experience\n.
\nIn this role, you will:\nRespond to customer inquiries via WhatsApp, email, and Slack, ensuring fast responses and high customer satisfaction.
\nTroubleshoot and resolve technical problems, especially those related to AI behavior, configuration, and API integrations
\nMonitor and act on alerts from internal tools like Slack channels and customer feedback submitted in the Darwin platform
\nWork closely with Product and Engineering teams, escalating complex issues and contributing to product improvements.
\nDocument support activity in the appropriate platform, maintaining accurate logs of issues and resolutions.
\nIdentify recurring issues and contribute to internal documentation and FAQs.\nCollaborate with the Customer Success and Onboarding teams to ensure a seamless customer experience.
\nAudit AI conversations to detect bugs or opportunities for improvement.
\nEnsure that all critical feedback and issues are resolved within the SLA.
\nRequirements
\nExperience in Customer Support, Technical Support, or Helpdesk roles, ideally in SaaS or tech environments.
\nStrong troubleshooting skills and ability to resolve issues efficiently.
\nFamiliarity with AI behavior, JSON structures, and state machines (training provided).
\nExperience with AI configuration, WhatsApp, APIs, and third-party integrations.\nKnowledge of process automation; experience with Zapier is a plus.
\nProgramming knowledge, especially in Python, is a plus.\nAbility to explain technical concepts clearly to both technical and non-technical audiences
\nHighly organized, with the ability to manage multiple support cases at once.\nStrong written and verbal communication skills.
\nA customer-first mindset with a genuine desire to help clients succeed.\nA team player with adaptability in fast-paced environments.
\nPassion for technology, AI, and continuous learning.\nBenefits\n\u25cf\nLanguage Classes:
\nAccess to language classes (English, Portuguese, Spanish) to enhance communication skills.
\n\u25cf\nOpenAI or Gemini Premium License:\nComplimentary access to an OpenAI premium license for personal or professional use.
\n\u25cf\nPaid Time Off:\nEnjoy 25 days\/year of paid vacations and holidays to recharge and maintain a healthy work-life balance.
\n\u25cf\nSoft Hybrid Work:\nWe meet 3 days\/month in our Co Work offices, the rest of the time you can work remotely from wherever you like!
"""

In [13]:
#jd = "We are seeking a Computer Vision Engineer with strong software and AI fundamentals to build and deploy high-performance AI models. You will handle the full pipeline\u2014from training detection and segmentation models to optimizing them for production using NVIDIA TensorRT and Docker.\nCore Responsibilities\nModel Training: Train and fine-tune models for Detection, Classification, and Segmentation (e.g., YOLO, ResNet, U-Net).\nTracking: Implement Multi-Object Tracking (MOT) algorithms for complex video streams.\nEngineering: Write production-grade Python code with a focus on modularity and scalability.\nDeployment: Containerize applications using Docker for consistent deployment.\nRequirements\n3+ years in CV\/Deep Learning.\nPython, PyTorch, OpenCV.\nStrong preference for experience with NVIDIA TensorRT and model optimization (quantization\/pruning).\nSolid grasp of software engineering principles (Git, testing, CI\/CD).\nCan work on other non-vision AI implementations"


In [14]:
jd = "We are expanding rapidly and looking to hire four passionate\nComputer Vision Engineers\nto join our growing team. Ideal candidates should have at least\n2 years of professional experience\nin the industry or as an academic postgraduate researchers, having practical and theoretical knowledge in Machine Learning, Computer\/Machine Vision and Visual-Language Models (VLMs). Skills on embedded programming will be acknowledged, to explore the most efficient and practical algorithmic implementations in embedded platforms.\nIn this role, you should be able to work with an agile team of experienced engineers, solving complex vision AI problems by developing cutting edge technology. You will be involved in various products and product development phases working alongside some of the most talented people in the industry.\nMandatory: Fulfilled army obligations (for male candidates) \u2013 Please report it in your application CV\nRequirements\nCandidates should have a BSc degree in Electrical & Computer Engineering \/ Computer Science, and in addition:\nProven work experience as a SW Engineer (>2yrs of working experience), especially:\nProgramming experience with Python packages such as Scikit-learn PyTorch.\nExperience in object-oriented programming in C++ and Python.\nMust have proven knowledge of computer vision and machine learning principles and algorithms (e.g. MSc or PhD in computer vision or machine learning) or relevant proven experience.\nExperience with image segmentation, image classification and object detection deep learning models as well as CNNs, RNNs\/LSTMs, VLMs, Zero shot and open-set architectures, Vision Transformers etc.\nAbility to work with cross functional teams.\nAbility to learn new programming languages and technologies.\nDesired (but not mandatory) Skills:\nFamiliarity with Diffusion Models for image generation and enhancement.\n3D Computer Vision and Spatial Understanding.\nDepth Estimation & 3D Reconstruction: Working with point clouds, LiDAR data and stereo imaging.\nSimultaneous Localization and Mapping (SLAM): Developing algorithms for real-time mapping and navigation in robotics.\nEmbedded software background and understanding of embedded system architectures.\nHands on experience with Docker and microservice oriented development.\nCode versioning (Git) and MLOps.\nDemonstrated proactiveness and enthusiasm for technology, with a commitment to delivering high-quality results within an evolving environment.\nBenefits\nWork in a dynamic and pleasant environment at a fast-paced company\nDiscuss\/interact with tech-leaders at global scale, using cutting-edge tech and driving new markets\nCompetitive remuneration package\nHuge room for creativity and innovation\nPrivate medical insurance"


- **Develop a function to pre-process input text**:

In [15]:
def preprocess_text(text):
    # Convert the text to lowercase
    text = text.lower()
    
    # Remove punctuation from the text
    text = re.sub('[^a-z]', ' ', text)
    
    # Remove numerical values from the text
    text = re.sub(r'\d+', '', text)
    
    # Remove extra whitespaces
    text = ' '.join(text.split())
    
    return text

In [16]:
preprocess_text(resume)

'education education master iref erds university of bordeaux bordeaux bachelor s degree in mathematics university of clerm ont auvergne clerm ont ferrand preparatory class mpsi mathematics physics engineering lyc e lafayette clermont ferrand exp riences exp riences observation internship in data science beys clerm ont ferrand national mathematics competition concours g n ral lyc e jeanne d arc clerm ont ferrandjanuary march about me about mestudent in statistics and economics with a strong interest in artificial intelligence and machine learning i design projects combining data analysis and machine learning models which i regularly publish on my github lucas coussy data science internship contact coussy lucas gmail com rue blaise pascale talence linkedin com in lucas coussy b driver s license b native b toeiclanguages french english skills python pandas tensorflow keras seaborn r vba sql excel pow er bi interests ai street w orkout m athem atics m usic https github com lucas coussy pro

In [17]:
# Apply to CV and JD
input_CV = preprocess_text(resume)
input_JD = preprocess_text(jd)

#### 7. Matching
Using the trained model, we infer the document vectors for the resume and job description. Then, we calculate the cosine similarity between the two vectors to determine the match between the resume and the job description.

In [18]:
# Model evaluation
model = Doc2Vec.load('cv_job_maching.model')
v1 = model.infer_vector(input_CV.split())
v2 = model.infer_vector(input_JD.split())
similarity = 100*(np.dot(np.array(v1), np.array(v2))) / (norm(np.array(v1)) * norm(np.array(v2)))
print(round(similarity, 2))

73.69


#### 8. Visualization and Notification

In [19]:
# Visualization
fig = go.Figure(go.Indicator(
    domain = {'x': [0, 1], 'y': [0, 1]},
    value = similarity,
    mode = "gauge+number",
    title = {'text': "Matching percentage (%)"},
    #delta = {'reference': 100},
    gauge = {
        'axis': {'range': [0, 100]},
        'steps' : [
            {'range': [0, 50], 'color': "#FFB6C1"},
            {'range': [50, 70], 'color': "#FFFFE0"},
            {'range': [70, 100], 'color': "#90EE90"}
        ],
             'threshold' : {'line': {'color': "red", 'width': 4}, 'thickness': 0.75, 'value': 100}}))

fig.update_layout(width=600, height=400)  # Adjust the width and height as desired
fig.show()

# Print notification
if similarity < 50:
    print(colored("Low chance, need to modify your CV!", "red", attrs=["bold"]))
elif similarity >= 50 and similarity < 70:
    print(colored("Good chance but you can improve further!", "yellow", attrs=["bold"]))
else:
    print(colored("Excellent! You can submit your CV.", "green", attrs=["bold"]))

[1m[32mExcellent! You can submit your CV.[0m


## Acknowledgment

This implementation is a **modified version of the Doc2Vec notebook** originally provided by the authors.  
Their original work served as the foundation for this implementation, which has been adapted and extended to fit the specific needs of this project.

Original notebook to find [here](https://github.com/kirudang/CV-Job-matching).  
Original model architecture to find [here](https://github.com/jhlau/doc2vec).\
All credit for the original approach and implementation goes to the original authors.
