<a href="https://colab.research.google.com/github/chewzzz1014/fyp/blob/master/job_resume_score/src/job_resume_score_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
resume_text = '''
Zi Qing Chew
chewziqing@gmail.com | 016-2892475 | Kuala Lumpur, Malaysia | linkedin.com/in/ziqingchew | github.com/chewzzz1014
EDUCATION

Universiti Putra Malaysia					                                                   Oct 2021 - Current
Bachelor in Computer Science with Honours
Expected to graduate in July 2025. CGPA: 3.99

WORK EXPERIENCE

Ant International 									          	July 2024 – Oct 2024
Java Engineer Intern							                               Kuala Lumpur, Malaysia
Collaborated in developing an audit logging feature for Ant Group’s internal Foreign Exchange (FX) trade strategy system that records changes made by business users to trade strategies.
Conducted comprehensive system analysis and project planning, delivering presentations to project stakeholders and QA teams prior to the development phase.
Utilised Ant Group’s internal frameworks, middleware, and tools to implement the audit logging feature.
Skills: Java, Spring, Sofaboot, Ant Group internal middlewares (ZDAL, DRM, Ant Scheduler, Msg Broker)
Howuku  									          	             Feb 2023 – Sep 2023
Software Developer Intern							                    Kuala Lumpur, Malaysia
Developed and optimized A/B testing features, including code editor and previewer for CSS and JavaScript modifications for experiment variations.
Expanded A/B testing targeting rule by incorporating website visitor's OS, device, and browser rules.
Automated experiment-stopping criteria and email notifications based on user-defined experiment termination conditions.
Collaborated with cross-functional teams to debug, troubleshoot, and enhance Howuku platform features based on user feedback and performance data.
Skills: JavaScript, Bootstrap, Vue.js, Express.js, MySQL

PROJECTS

Personal Portfolio Website (chewzzz1014.github.io/portfolio-website)
Designed, developed and deployed personalised portfolio website featuring skills, selected projects, and downloadable resume.
Skills: JavaScript, React.js, CSS, Bootstrap
Depression Level Detection Chatbot (https://github.com/chewzzz1014/health-ease-project)
Developed machine learning application that evaluates a message's depression level and provided tailored mental health advice and information based on the depression severity.
Skills: Python, pandas, scikit-learn, Keras, FastAPI, Gradio
Clothing Store Website (https://github.com/chewzzz1014/CSC3402-MVC-Project)
Worked in team to build a CRUD Spring Boot application with attractive interfaces, data persistence, authentication and authorisation.
Developed the backend of the application that involves querying the database, building REST endpoints and implementing Thymeleaf in HTML for dynamic contents.
Skills: Spring Boot, Spring MVC, Thymeleaf, Hibernate, Bootstrap

SKILLS
Programming Languages: Java, Python, HTML, CSS, JavaScript, MySQL, OracleSQL
Frameworks and Libraries: Spring, Spring Boot, TypeScript, Node.js, Express.js, React.js, Vue.js, Bootstrap, Tailwind CSS
Tools: Git, Github, Jira, Tableau, Excel, Jupyter Notebook, Google Colab, VSCode, IntelliJ
'''

In [4]:
# load job descriptions from excel
import pandas as pd
job_desc_df = pd.read_excel("/content/drive/MyDrive/FYP/Implementation/Resume Dataset/job_desc.xlsx")
job_desc_df

Unnamed: 0,Job Title,Job Desc
0,Java Developer,This is a technical role as part of an applica...
1,Front End Developer (React),Your roles & responsibilities:\n\nDesign and d...
2,Junior Backend Developer (Golang),Job description\nCompany Description\n\nAbout ...
3,Digital Marketing Manager,Job description\nCompany Description\n\nNuraz ...
4,C&S Design Engineer (Sibu),ROLES & RESPONSIBILITIES:\n\nDesign and prepar...


In [5]:
# make prediction using trained NER model

import spacy
import string
from spacy import displacy

# convert text to lowercase, then remove punctuation
resume_text = resume_text.lower()
resume_text = resume_text.translate(str.maketrans('', '', string.punctuation))

# load trained model
trained_model = spacy.load("/content/drive/MyDrive/FYP/Implementation/spacy_output/model-best")

# run the trained model on the text to produce a spaCy Doc
doc = trained_model(resume_text)

# extract entities into a dictionary
entities_dict = {}
for ent in doc.ents:
    if ent.label_ in entities_dict:
        entities_dict[ent.label_].append(ent.text)
    else:
        entities_dict[ent.label_] = [ent.text]

# Print the dictionary
print(entities_dict)

# visualize predicted entities using displacy
colors = {
    "NAME": "lightblue",
    "LOC": "yellow",
    "PHONE": "pink",
    "EMAIL": "lightgreen",
    "JOB": "orange",
    "SKILL": "aqua",
    "COMPANY": "violet",
    "WORK PER": "salmon",
    "DEG": "lightcoral",
    "UNI": "lightgrey",
    "STUDY PER": "peachpuff",
}
options = {"ents": list(colors.keys()), "colors": colors}
displacy.render(doc, style="ent", jupyter=True, options=options)

{'SKILL': ['qing', 'chew', 'java', 'spring', 'sofaboot', 'javascript', 'bootstrap', 'vuejs', 'mysql', 'javascript', 'reactjs', 'css', 'bootstrap', 'python', 'pandas', 'scikitlearn', 'spring boot', 'spring mvc', 'hibernate', 'bootstrap', 'java', 'python', 'html', 'css', 'javascript', 'mysql', 'oraclesql', 'nodejs', 'vuejs', 'bootstrap', 'tailwind', 'css', 'git', 'github', 'jira', 'tableau', 'excel'], 'WORK PER': ['0162892475', 'july 2024', 'oct 2024', 'feb 2023', 'sep 2023'], 'STUDY PER': ['oct 2021', 'july 2025'], 'DEG': ['bachelor in computer science'], 'JOB': ['java engineer intern', 'software developer']}
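
The `SKILL` list above repeats several entries (`bootstrap` appears four times, `javascript` three). A minimal sketch of collapsing such repeats while keeping first-seen order is shown below. The helper name `dedupe_entities` and the small sample dictionary are illustrative and are not part of the original notebook.

```python
def dedupe_entities(entities):
    """Return a copy of a label -> list-of-strings dict with duplicates removed.

    The order of first appearance is preserved for each label.
    """
    # dict.fromkeys keeps insertion order and silently drops repeated keys
    return {label: list(dict.fromkeys(texts)) for label, texts in entities.items()}

# Small sample in the same shape as entities_dict (values borrowed from the output above)
sample = {
    "SKILL": ["java", "spring", "bootstrap", "java", "bootstrap"],
    "JOB": ["java engineer intern", "software developer"],
}
deduped = dedupe_entities(sample)
print(deduped["SKILL"])
```

Unlike converting to a `set`, this keeps the skills in the order the model found them, which can make side-by-side inspection easier.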


# Text Preprocessing
1. Text Cleaning
2. Tokenization
3. Preparation
    *   Stop Word Removal
    *   Stemming
    *   Lemmatization


# Feature Extraction Using TF-IDF
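
TF-IDF weights a term by how often it occurs in a document (term frequency) and down-weights terms that occur in many documents (inverse document frequency). The sketch below computes it by hand on two toy token lists. It assumes the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is the default behaviour of scikit-learn's `TfidfVectorizer`; the function name `tfidf_vectors` and the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalised TF-IDF vectors for a list of token lists."""
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # document frequency: number of documents that contain each term
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        # guard against an all-zero vector (empty document)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

docs = [["java", "spring", "java"], ["java", "react"]]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(term, round(weight, 3))
```

In the first toy document, `java` still receives the largest weight because it occurs twice, even though its IDF is lower than that of `spring`, which appears in only one document.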

# Cosine Similarity
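
Cosine similarity measures the angle between two vectors: cos(a, b) = (a · b) / (|a| |b|). It is 1 for vectors pointing in the same direction and 0 when they share no terms. A minimal standard-library sketch on raw term counts follows; the function name `cosine` and the toy skill lists are illustrative, and the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # an empty vector has no direction; treat similarity as 0
        return 0.0
    return dot / (norm_a * norm_b)

resume_vec = Counter(["java", "spring", "mysql"])
job_vec = Counter(["java", "spring", "maven", "jsp"])
score = cosine(resume_vec, job_vec)
print(round(score, 3))
```

Here the two vectors share two of their terms, so the dot product is 2 and the norms are sqrt(3) and sqrt(4).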

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
# tokenization
import string
from collections import Counter

def tokenize(text):
    text = text.lower()
    no_punc_text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(no_punc_text)
    return tokens

resume_tokens = tokenize(resume_text)
resume_tokens_count = Counter(resume_tokens)
print('10 top tokens in resume:')
print(resume_tokens_count.most_common(10))

job_tokens = tokenize(job_desc_df.loc[0, 'Job Desc'])
job_tokens_count = Counter(job_tokens)
print('10 top tokens in job description:')
print(job_tokens_count.most_common(10))

10 top tokens in resume:
[('and', 17), ('to', 7), ('skills', 7), ('the', 6), ('spring', 6), ('in', 5), ('ant', 5), ('malaysia', 4), ('for', 4), ('developed', 4)]
10 top tokens in job description:
[('and', 17), ('to', 16), ('in', 8), ('design', 6), ('with', 5), ('requirements', 5), ('a', 4), ('technical', 4), ('of', 4), ('application', 4)]


In [8]:
# stop word removal
from nltk.corpus import stopwords

# build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

resume_tokens_filtered = [w for w in resume_tokens if w not in stop_words]
resume_tokens_filtered_count = Counter(resume_tokens_filtered)
print('50 top tokens in resume after stop word removal:')
print(resume_tokens_filtered_count.most_common(50))

job_tokens_filtered = [w for w in job_tokens if w not in stop_words]
job_tokens_filtered_count = Counter(job_tokens_filtered)
print('50 top tokens in job after stop word removal:')
print(job_tokens_filtered_count.most_common(50))

50 top tokens in resume after stop word removal:
[('skills', 7), ('spring', 6), ('ant', 5), ('malaysia', 4), ('developed', 4), ('css', 4), ('javascript', 4), ('website', 4), ('bootstrap', 4), ('kuala', 3), ('lumpur', 3), ('java', 3), ('group', 3), ('internal', 3), ('based', 3), ('depression', 3), ('application', 3), ('boot', 3), ('oct', 2), ('july', 2), ('2024', 2), ('–', 2), ('intern', 2), ('collaborated', 2), ('audit', 2), ('logging', 2), ('feature', 2), ('’', 2), ('trade', 2), ('system', 2), ('project', 2), ('teams', 2), ('frameworks', 2), ('tools', 2), ('howuku', 2), ('2023', 2), ('ab', 2), ('testing', 2), ('features', 2), ('experiment', 2), ('data', 2), ('vuejs', 2), ('expressjs', 2), ('mysql', 2), ('projects', 2), ('portfolio', 2), ('reactjs', 2), ('level', 2), ('python', 2), ('thymeleaf', 2)]
50 top tokens in job after stop word removal:
[('design', 6), ('requirements', 5), ('technical', 4), ('application', 4), ('software', 4), ('analysis', 4), ('develop', 4), ('team', 3), ('wor

In [9]:
# stemming
from nltk.stem.porter import PorterStemmer

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

stemmer = PorterStemmer()

resume_tokens_stemmed = stem_tokens(resume_tokens_filtered, stemmer)
resume_tokens_stemmed_count = Counter(resume_tokens_stemmed)
print('50 top tokens in resume after stemming:')
print(resume_tokens_stemmed_count.most_common(50))

job_tokens_stemmed = stem_tokens(job_tokens_filtered, stemmer)
job_tokens_stemmed_count = Counter(job_tokens_stemmed)
print('50 top tokens in job after stemming:')
print(job_tokens_stemmed_count.most_common(50))

50 top tokens in resume after stemming:
[('develop', 7), ('skill', 7), ('intern', 6), ('spring', 6), ('ant', 5), ('featur', 5), ('malaysia', 4), ('project', 4), ('css', 4), ('javascript', 4), ('websit', 4), ('bootstrap', 4), ('kuala', 3), ('lumpur', 3), ('experi', 3), ('java', 3), ('group', 3), ('team', 3), ('base', 3), ('depress', 3), ('applic', 3), ('boot', 3), ('oct', 2), ('juli', 2), ('work', 2), ('2024', 2), ('–', 2), ('collabor', 2), ('audit', 2), ('log', 2), ('’', 2), ('trade', 2), ('strategi', 2), ('system', 2), ('user', 2), ('framework', 2), ('middlewar', 2), ('tool', 2), ('implement', 2), ('howuku', 2), ('2023', 2), ('ab', 2), ('test', 2), ('rule', 2), ('data', 2), ('vuej', 2), ('expressj', 2), ('mysql', 2), ('portfolio', 2), ('reactj', 2)]
50 top tokens in job after stemming:
[('develop', 7), ('requir', 7), ('work', 6), ('design', 6), ('technic', 4), ('applic', 4), ('softwar', 4), ('perform', 4), ('analysi', 4), ('team', 3), ('abl', 3), ('technolog', 3), ('project', 3), ('pr

In [10]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [41]:
# for text preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    return word_tokenize(text)

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def tokenize_text(text):
    return [word for word in tokenize(text) if word not in stop_words]

def stem_text(tokens):
    return [stemmer.stem(word) for word in tokens]

def lemmatize_text(tokens, nlp):
    # doc = nlp(" ".join(tokens))
    # return [token.lemma_ for token in doc]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

def preprocess_text(text, nlp):
    text = clean_text(text)
    tokens = tokenize_text(text)
    lemmatized_tokens = lemmatize_text(tokens, nlp)
    # return lemmatized_tokens
    return ' '.join(lemmatized_tokens)

In [42]:
# tf-idf

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

trained_ner_model = spacy.load("/content/drive/MyDrive/FYP/Implementation/spacy_output/model-best")
pretrained_model = spacy.load("en_core_web_lg")

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

def compute_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

def compute_similarity_skill(skills1, skills2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([' '.join(skills1), ' '.join(skills2)])
    return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

def compute_semantic_similarity(skills1, skills2, nlp):
    if not skills1 or not skills2:
        return 0.0
    # use the model passed in as nlp rather than the global pretrained_model
    skill_vectors1 = [nlp(skill).vector for skill in skills1]
    skill_vectors2 = [nlp(skill).vector for skill in skills2]
    similarity_matrix = cosine_similarity(skill_vectors1, skill_vectors2)
    return similarity_matrix.mean()

def normalize_skills(skills):
    # lowercase, trim, and deduplicate skills
    return {skill.lower().strip() for skill in skills}

cleaned_resume = preprocess_text(resume_text, pretrained_model)
print(cleaned_resume)
resume_doc = trained_ner_model(cleaned_resume)
resume_skills = [ent.text for ent in resume_doc.ents if ent.label_ == 'SKILL']
resume_skills = normalize_skills(resume_skills)

score_df = pd.DataFrame(columns=['Resume Skill', 'Job Title', 'Job Skill', 'Similarity Score', 'Skill Similarity Score',
                                 'Semantic Skill Similarity Score', 'Final Similarity Score', 'Final Similarity Score (%)'])

for index, row in job_desc_df.iterrows():
    job_title = row['Job Title']
    job_desc = preprocess_text(row['Job Desc'], pretrained_model)

    job_doc = trained_ner_model(job_desc)
    job_skills = [ent.text for ent in job_doc.ents if ent.label_ == 'SKILL']
    job_skills = normalize_skills(job_skills)

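    # between texts (resume_text is the lowercased, punctuation-stripped string
    # from the NER cell, not cleaned_resume)
    similarity_score = compute_similarity(resume_text, job_desc)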

    # between skills extracted from text
    skill_similarity_score = compute_similarity_skill(resume_skills, job_skills)

    # between skills extracted from text, semantic comparison
    semantic_skill_similarity_score = compute_semantic_similarity(resume_skills, job_skills, pretrained_model)

    final_similarity_score = 0.7 * skill_similarity_score + 0.3 * semantic_skill_similarity_score

    result_dict = {
        'Resume Skill': resume_skills,
        'Job Title': job_title,
        'Job Skill': job_skills,
        'Similarity Score': similarity_score,
        'Skill Similarity Score': skill_similarity_score,
        'Semantic Skill Similarity Score': semantic_skill_similarity_score,
        'Final Similarity Score': final_similarity_score,
        'Final Similarity Score (%)': final_similarity_score * 100
    }
    score_df = pd.concat([score_df, pd.DataFrame([result_dict])], ignore_index=True)
score_df

zi qing chew chewziqinggmailcom kuala lumpur malaysia linkedincominziqingchew githubcomchewzzz education universiti putra malaysia oct current bachelor computer science honour expected graduate july cgpa work experience ant international july oct java engineer intern kuala lumpur malaysia collaborated developing audit logging feature ant group internal foreign exchange fx trade strategy system record change made business user trade strategy conducted comprehensive system analysis project planning delivering presentation project stakeholder qa team prior development phase utilised ant group internal framework middleware tool implement audit logging feature skill java spring sofaboot ant group internal middlewares zdal drm ant scheduler msg broker howuku feb sep software developer intern kuala lumpur malaysia developed optimized ab testing feature including code editor previewer cs javascript modification experiment variation expanded ab testing targeting rule incorporating website visit



Unnamed: 0,Resume Skill,Job Title,Job Skill,Similarity Score,Skill Similarity Score,Semantic Skill Similarity Score,Final Similarity Score,Final Similarity Score (%)
0,"{tableau, hibernate, nodejs, github, mysql, sp...",Java Developer,"{jsp, maven, communication, spring, java, plsq...",0.054195,0.135254,0.178767,0.148307,14.830743
1,"{tableau, hibernate, nodejs, github, mysql, sp...",Front End Developer (React),"{restful, apis, postgresql, dynamodb, aws, mon...",0.047363,0.061565,0.188061,0.099514,9.951421
2,"{tableau, hibernate, nodejs, github, mysql, sp...",Junior Backend Developer (Golang),"{communication, oop, sdlc, objectoriented, uni...",0.051505,0.0,0.061737,0.018521,1.8521
3,"{tableau, hibernate, nodejs, github, mysql, sp...",Digital Marketing Manager,{communication},0.01802,0.0,0.159898,0.047969,4.796928
4,"{tableau, hibernate, nodejs, github, mysql, sp...",C&S Design Engineer (Sibu),"{civil engineering, team player, interpersonal...",0.019033,0.0,0.15682,0.047046,4.704601
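
The `Final Similarity Score` column is the weighted sum 0.7 × skill similarity + 0.3 × semantic skill similarity. As a quick arithmetic check, the sketch below recomputes it for the first and third rows from the values displayed in the table. The helper name `final_score` is illustrative; because the displayed inputs are rounded to six places, the results are expected to agree with the table only to within that rounding.

```python
def final_score(skill_sim, semantic_sim, w_skill=0.7, w_semantic=0.3):
    """Weighted combination used for the Final Similarity Score column."""
    return w_skill * skill_sim + w_semantic * semantic_sim

# Row 0 (Java Developer): displayed final score is 0.148307
row0 = final_score(0.135254, 0.178767)

# Row 2 (Junior Backend Developer (Golang)): displayed final score is 0.018521
row2 = final_score(0.0, 0.061737)

print(round(row0, 6), round(row2, 6))
```

With no overlapping skill terms (skill similarity 0.0), the final score falls back to 30% of the semantic score, which is why rows 2 to 4 stay below 5%.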


In [41]:
import spacy
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load the SpaCy model (adjust path to your model)
nlp = spacy.load("/content/drive/MyDrive/FYP/Implementation/spacy_output/model-best")

# Initialize text processing components
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define preprocessing function
def preprocess_text(text):
    # Text cleaning: lowercase, remove punctuation and extra whitespace
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = ' '.join(text.split())

    # Tokenization
    tokens = text.split()

    # Stop word removal
    tokens = [word for word in tokens if word not in stop_words]

    # Stemming and lemmatization
    tokens = [stemmer.stem(word) for word in tokens]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Rejoin tokens to a single string
    return ' '.join(tokens)

# Preprocess resume and job description
resume_text_processed = preprocess_text(resume_text)
job_description_processed = preprocess_text(job_desc_df.loc[0, 'Job Desc'])

# Extract skills from resume using NER
resume_doc = nlp(resume_text)
skills = [ent.text for ent in resume_doc.ents if ent.label_ == 'SKILL']

# Define function to compute cosine similarity
def compute_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

# Similarity calculation
# Whole-text similarity is always available
similarity_score_overall = compute_similarity(resume_text_processed, job_description_processed)

if len(skills) >= 5:  # if sufficient skills are detected
    skills_text = ' '.join(skills)
    print(skills_text)
    # Blend skill-only and whole-text similarity (30% / 70%)
    similarity_score_skill = compute_similarity(skills_text, job_description_processed)
    similarity_score = similarity_score_skill * 0.3 + similarity_score_overall * 0.7
else:
    # Use the entire resume and job description text if skills are insufficient
    similarity_score = similarity_score_overall

print(f"Cosine Similarity Score: {similarity_score * 100:.2f}%")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


qing chew java spring sofaboot javascript bootstrap vuejs mysql javascript reactjs css bootstrap python pandas scikitlearn spring boot spring mvc hibernate bootstrap java python html css javascript mysql oraclesql nodejs vuejs bootstrap tailwind css git github jira tableau excel
Cosine Similarity Score: 14.66%
