# **Building Resume Analysis Using Named Entity Recognition (NER)**

The steps detailed in this [DataCamp article by Abid Ali Awan](https://www.datacamp.com/blog/what-is-named-entity-recognition-ner) were used to guide ChatGPT in generating the code for this model.


We will create a system for analyzing resumes that helps hiring managers filter candidates based on their skills and attributes.

## **Install and Import Libraries**

We install PyPDF2 and import spaCy along with the NLTK stop word list and WordNet lemmatizer. The spaCy model and the lemmatizer are initialized in later cells.

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [2]:
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## **Convert PDF to CSV**

Resumes usually arrive as PDFs, so we first extract their text with PyPDF2 and store it in a CSV file for later processing:
* We define a function `extract_text_from_pdf` that takes a PDF file path and returns the text extracted from it.
* We iterate over the paths in the `pdf_files` list, extract the text of each PDF with `extract_text_from_pdf`, and collect the results in a list.
* We create a DataFrame with the columns 'ID' (a unique identifier for each resume) and 'resume_text' (the extracted text).
* Finally, we save the DataFrame to a CSV file named 'resumes.csv'.

In [4]:
import PyPDF2
import pandas as pd

def extract_text_from_pdf(pdf_path):
    text = ''
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# List of PDF file paths containing resumes
pdf_files = ['IME_INYANG_CV_.pdf',
             'IME INYANG CV_bitnine.pdf',
             'IME INYANG JR_CV.pdf',
             'IME_INYANG_CV_BUA FOODS.pdf',
             'IME_INYANG_IOM_CV (graphic design_data viz).pdf']

# Extract text from each PDF resume and store it in a list
resumes_text = [extract_text_from_pdf(pdf_path) for pdf_path in pdf_files]

# Create a DataFrame with columns 'ID' and 'resume_text'
data = pd.DataFrame({'ID': range(1, len(pdf_files)+1), 'resume_text': resumes_text})

# Save the DataFrame to a CSV file
data.to_csv('resumes.csv', index=False)
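
One detail worth guarding against: pages without a text layer (for example, scanned resumes) usually yield no text, and concatenating pages directly can glue the last word of one page to the first word of the next. The following is a minimal sketch of a more defensive join. `join_page_text` and `FakePage` are hypothetical names used only for illustration; the stand-in class merely mimics the `extract_text()` method of a page object so the snippet runs without a PDF file.

```python
def join_page_text(pages):
    """Join the text of each page with newlines, skipping pages with no text."""
    parts = []
    for page in pages:
        # Treat a missing or empty extraction result as "no text on this page"
        page_text = page.extract_text() or ""
        if page_text.strip():
            parts.append(page_text.strip())
    return "\n".join(parts)


class FakePage:
    """Hypothetical stand-in for a PDF page object (illustration only)."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("Data Analyst"), FakePage(""), FakePage("Skills: Python")]
combined = join_page_text(pages)
print(repr(combined))
```

With a real file, `reader.pages` from `PyPDF2.PdfReader` could be passed in place of the fake pages.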

## **Loading the Data and NER model**

Here, the CSV file has two columns: `'ID'` and `'resume_text'`.

In [5]:
# Load data from CSV file
data = pd.read_csv('resumes.csv')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

## **Entity Ruler**
Let's add an entity ruler to the spaCy pipeline and give it patterns for skills and for the candidate's name:

* We add an entity ruler to the pipeline using the `add_pipe` method.
* We pass the `before` parameter so the ruler runs before the statistical Named Entity Recognition (NER) component.
* We define `patterns` inline as a list of dictionaries, each with a label and a token pattern, for skills such as "python" and "pandas", rather than loading them from a JSON file such as `'skills_patterns.json'`.
* Each dictionary inside a token pattern matches exactly one token, so a multi-word name needs one dictionary per word.
* We add the patterns to the entity ruler using the `add_patterns` method.

In [6]:
# Add entity ruler pipeline to spaCy model, ahead of the statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define patterns as dictionaries (one inner dictionary per token)
patterns = [
    {"label": "SKILL", "pattern": [{"LOWER": "matplotlib"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "python"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "pandas"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "seaborn"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "ime"}, {"LOWER": "okon"}, {"LOWER": "inyang"}, {"LOWER": "jnr"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "ime"}, {"LOWER": "okon"}, {"LOWER": "inyang"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "ime"}, {"LOWER": "inyang"}]}
]

# Add patterns to entity ruler
ruler.add_patterns(patterns)
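
To check that patterns fire as intended, it can help to try them on a short sentence first. The sketch below does this on a blank English pipeline, so no trained model is required. `nlp_demo`, `ruler_demo`, and the sample sentence are illustrative assumptions, not part of the notebook's pipeline. Note that each dictionary in a token pattern matches a single token, so a two-word name needs two dictionaries.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")

ruler_demo.add_patterns([
    {"label": "SKILL", "pattern": [{"LOWER": "python"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "pandas"}]},
    # One dictionary per token: "ime" followed by "inyang"
    {"label": "PERSON", "pattern": [{"LOWER": "ime"}, {"LOWER": "inyang"}]},
])

doc = nlp_demo("Ime Inyang analyses data with Python and pandas.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Because the patterns use the `LOWER` attribute, they match regardless of the casing in the text.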

## **Text Cleaning**
Let's clean the text data using NLTK following the steps below:

* We define a function `clean_text` that takes a text input and performs the cleaning.
* We use regular expressions to remove hyperlinks, HTML tags, special characters, and punctuation.
* We convert the text to lowercase and tokenize it into words.
* We lemmatize each word to its base form using the WordNet Lemmatizer.
* We remove English stop words using NLTK's stopwords corpus.
* Finally, we apply this cleaning function to the 'resume_text' column in the DataFrame and store the cleaned text in a new column called 'cleaned_resume'.

In [7]:
import nltk

# Download NLTK resources
nltk.download('punkt')  # Download the 'punkt' tokenizer resource
nltk.download('stopwords')
nltk.download('wordnet')

import re
from nltk.tokenize import word_tokenize

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove hyperlinks, special characters, and punctuations using regex
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^\w\s\n]', '', text)

    # Convert the text to lowercase
    text = text.lower()

    # Tokenize the text using nltk's word_tokenize
    words = word_tokenize(text)

    # Lemmatize the text to its base form for normalization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Remove English stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = ' '.join([word for word in lemmatized_words if word not in stop_words])

    return filtered_words

# Clean the 'resume_text' column in the DataFrame
data['cleaned_resume'] = data['resume_text'].apply(clean_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
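
To see what the regular-expression steps do on their own, here is a small sketch that needs only the standard library. `strip_noise` and the sample string are hypothetical, and the function reproduces only the regex and lowercasing steps, not the tokenization, lemmatization, or stop word removal.

```python
import re


def strip_noise(text):
    """Apply the regex cleaning steps: drop URLs, HTML tags, and punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # hyperlinks
    text = re.sub(r"<.*?>", "", text)                   # HTML tags
    text = re.sub(r"[^\w\s]", "", text)                 # punctuation and symbols
    # Lowercase and collapse the runs of whitespace left by the removals
    return " ".join(text.lower().split())


sample = "Visit https://example.com <b>Python</b> & C++ dev!"
cleaned = strip_noise(sample)
print(cleaned)
```

Because symbols are stripped, skills such as `C++`, `C#`, or `.NET` collapse to `c` or `net`. Patterns for such skills would need to target the cleaned form, or the entity ruler would need to run on the uncleaned text.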


## **Entity Recognition: Visualizing Named Entities in Text with `spaCy`**
Next:

* We import the displacy module from spaCy.
* We define options for visualization, specifying the entity labels we want to display and their corresponding colors.
* We loop through each cleaned resume text in the DataFrame.
* We process each resume text with the spaCy model to obtain a Doc object.
* We use displacy.render to visualize the named entities in the text with their labels highlighted. We set `jupyter=True` to display the visualization in a Jupyter notebook.

This will display the named entities for each resume text with their respective labels highlighted.
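
Outside a notebook, the same visualization can be obtained as an HTML string by passing `jupyter=False`. The sketch below is a minimal illustration on a blank pipeline with a single skill pattern; `nlp_demo`, `ruler_demo`, `demo_options`, and the sample sentence are assumptions made for the example.

```python
import spacy
from spacy import displacy

# Blank pipeline with one illustrative skill pattern (no trained model needed)
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns([{"label": "SKILL", "pattern": [{"LOWER": "python"}]}])

doc = nlp_demo("data analyst with python experience")
demo_options = {"ents": ["SKILL"], "colors": {"SKILL": "lightblue"}}

# With jupyter=False, render returns the markup instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False, options=demo_options)
print(html[:80])
```

The returned string can then be written to an `.html` file and opened in a browser.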

In [8]:
from spacy import displacy

# Define options for visualization
options = {'ents': ['PERSON', 'GPE', 'SKILL'],
           'colors': {'PERSON': 'orange',
                      'GPE': 'lightgreen',
                      'SKILL': 'lightblue'}}

# Visualize named entities in each resume
for resume_text in data['cleaned_resume']:
    doc = nlp(resume_text)
    displacy.render(doc, style="ent", jupyter=True, options=options)
    print('\n\n')

## **Match Score**
To match resumes against company requirements and compute a similarity score, we could use TF-IDF, word embeddings (e.g., Word2Vec, GloVe), or BERT embeddings. Here, we use TF-IDF (Term Frequency-Inverse Document Frequency) vectors compared with cosine similarity.

First, let's define the requirements of the company, and then we'll calculate the similarity score for each resume based on these requirements:

* We define the company requirements as a string.
* We clean the company requirements using the clean_text function we defined earlier.
* We calculate the TF-IDF vectors for the company requirements and each resume text.
* We calculate the cosine similarity between the TF-IDF vector of the company requirements and each resume.
* We sort the indices of resumes based on the similarity scores in descending order.
* We display the top N most similar resumes along with their similarity scores.

You can adjust the value of top_n to display more or fewer similar resumes. Also, you can explore other similarity calculation methods and embeddings based on your preference and requirements.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Define the company requirements
company_requirements = """Data Analyst with experience using Python for data cleaning, data analysis, exploratory data analysis (EDA).
                          We are also looking for someone with the ability to explain complex mathematical concepts to non-mathematicians."""

# Clean the company requirements with the same function used for the resumes
cleaned_company_requirements = clean_text(company_requirements)

# Calculate TF-IDF vectors for the company requirements and resume texts
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['cleaned_resume'])
company_tfidf = tfidf_vectorizer.transform([cleaned_company_requirements])

# Calculate cosine similarity between the company requirements and each resume
similarity_scores = cosine_similarity(company_tfidf, tfidf_matrix).flatten()

# Get the indices of resumes sorted by similarity score
sorted_indices = similarity_scores.argsort()[::-1]

# Display the top N most similar resumes (capped at the number of resumes)
top_n = min(5, len(sorted_indices))
for index in sorted_indices[:top_n]:
    print(f"Resume ID: {data['ID'][index]}")
    print(f"Similarity Score: {similarity_scores[index]}")
    print()

Resume ID: 1
Similarity Score: 0.42973824001170297

Resume ID: 2
Similarity Score: 0.31343639880770646

Resume ID: 5
Similarity Score: 0.3095427060390511

Resume ID: 4
Similarity Score: 0.29207240758415465

Resume ID: 3
Similarity Score: 0.03680589209560854
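
One property of this setup is easier to see on a toy example: because the vectorizer is fitted on the resumes only, any requirement word that appears in no resume is ignored when the requirements are transformed. The sketch below uses made-up one-line "resumes"; `toy_resumes` and `toy_requirement` are illustrative assumptions, not data from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_resumes = [
    "python pandas data analysis",
    "graphic design illustration",
]
toy_requirement = "python data analyst"

vectorizer = TfidfVectorizer()
# The vocabulary is learned from the resumes only
resume_matrix = vectorizer.fit_transform(toy_resumes)
# Words never seen during fitting (here "analyst") are dropped
requirement_vec = vectorizer.transform([toy_requirement])

scores = cosine_similarity(requirement_vec, resume_matrix).flatten()
print(scores)
```

The second toy resume shares no vocabulary with the requirement, so its score is zero, while the first is scored only on "python" and "data".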



## **Skill Extractor Function**
Let's create a Python function that extracts skills from a resume using the entity ruler, matches them with required skills, and generates a similarity score:
* We define a function calculate_similarity that takes the resume text and required skills as input.
* We process the resume text with the spaCy model.
* We extract skills from the resume by filtering entities with the label "SKILL" using list comprehension.
* We calculate the number of matching skills between the resume and required skills.
* We calculate the similarity score by dividing the number of matching skills by the maximum of the lengths of required skills and extracted skills.
* Finally, we return the similarity score.

This function allows hiring managers to input a resume text and required skills, and it outputs a similarity score based on the matching skills. You can use this function in a loop to process multiple resumes and filter candidates based on their skills.

In [11]:
def calculate_similarity(resume_text, required_skills):
    # Process the resume text with the spaCy model
    doc = nlp(resume_text)

    # Extract skills from the resume using the entity ruler
    skills = [ent.text.lower() for ent in doc.ents if ent.label_ == "SKILL"]

    # Calculate the number of matching skills with required skills
    matching_skills = [skill for skill in skills if skill in required_skills]
    num_matching_skills = len(matching_skills)

    # Calculate the similarity score (0.0 when there is nothing to compare)
    denominator = max(len(required_skills), len(skills))
    similarity_score = num_matching_skills / denominator if denominator else 0.0

    return similarity_score
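
Note that the function above counts every mention, so a skill repeated in a resume contributes several times to both the match count and the denominator. If a score over distinct skills is preferred, one possible variant works on sets. This is a sketch rather than the notebook's method: `skill_overlap` is a hypothetical name, and it takes the already-extracted skill strings as input so that it runs without a spaCy model.

```python
def skill_overlap(extracted_skills, required_skills):
    """Share of distinct skills matched, relative to the larger skill set."""
    extracted = {skill.lower() for skill in extracted_skills}
    required = {skill.lower() for skill in required_skills}
    denominator = max(len(extracted), len(required))
    if denominator == 0:
        return 0.0  # nothing to compare
    return len(extracted & required) / denominator


required = ["matplotlib", "python", "pandas", "seaborn"]
# "python" appears twice but is counted once
score = skill_overlap(["python", "Python", "pandas"], required)
print(score)
```

Here two of the four required skills are present, so the score is 0.5, whereas counting every mention would give 0.75 for the same input.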

In [15]:
required_skills = ["matplotlib", "python", "pandas", "seaborn"]

for resume_id, resume_text in zip(data['ID'], data['cleaned_resume']):
    print(f"Resume ID: {resume_id}")
    similarity_score = calculate_similarity(resume_text, required_skills)
    print("Similarity Score:", similarity_score)

Resume ID: 1
Similarity Score: 1.0
Resume ID: 2
Similarity Score: 0.75
Resume ID: 3
Similarity Score: 0.0
Resume ID: 4
Similarity Score: 1.0
Resume ID: 5
Similarity Score: 1.0
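
To turn these scores into a shortlist, the resumes can be filtered by a cut-off and ranked. The sketch below reuses the scores printed above. The `threshold` value is an assumption chosen only for illustration, and `skill_scores` and `shortlist` are hypothetical names.

```python
# Scores copied from the output above, keyed by resume ID
skill_scores = {1: 1.0, 2: 0.75, 3: 0.0, 4: 1.0, 5: 1.0}

threshold = 0.75  # assumed cut-off, chosen only for illustration

# Keep resumes at or above the threshold, best score first, ties by ID
shortlist = sorted(
    (resume_id for resume_id, score in skill_scores.items() if score >= threshold),
    key=lambda resume_id: (-skill_scores[resume_id], resume_id),
)
print(shortlist)
```

With this cut-off, resume 3 is excluded and resume 2 ranks last among the remaining four.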
