## __Step 1: Extracting Text from Resume (PDF)__

This step comes first because NLP models operate on plain text, not on PDF files, so the resume has to be converted to text before anything else.

In [4]:
# install required library
! pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.7


In [5]:
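# Read a resume PDF and return its text
import fitz         # PyMuPDF

def read_resume(pdf_file_name):
    """Return the text of every page in the PDF, concatenated in page order."""
    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_file_name) as pdf:
        return "".join(page.get_text() for page in pdf)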


In [6]:
resume_text = read_resume('/content/vanshika CV.pdf')
print(resume_text[:500])


VANSHIKA
ACADEMIC DETAILS
Year
Degree / Board
Institute
GPA / Marks(%)
---
M.Sc in Mathematics
Indian Institute of Technology Delhi
---
2025
B.Sc.(Hons),  Maths and Computing
Panjab University, Chandigarh
9.1
2022
CBSE
R.B.D.A.V.Sr.Sec.Public School
93.4
2020
CBSE
.R.B.D.A.V.Sr.Sec.Public School
94
SCHOLASTIC ACHIEVEMENTS
• JAM (Mathematics): AIR 152 out of 13000
• TIFR: Qualified TIFR 2025
• Scholarships: Recieved Gargi Scholarship(merit based one from each batch) in Bachelors
• University Topp


## __Step 2: Text Cleaning__

In [8]:
import re

def clean_text(text):
    text = text.lower()                          # make everything lowercase
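    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace (including newlines) into one space
    # Keep only letters, digits and spaces. This also strips '.', '-' and '/',
    # so "scikit-learn" becomes "scikitlearn" and "9.1" becomes "91".
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)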
    return text.strip()


In [10]:
# cleaning the resume content
cleaned_resume_text = clean_text(resume_text)
print(cleaned_resume_text[:500])

vanshika academic details year degree  board institute gpa  marks  msc in mathematics indian institute of technology delhi  2025 bschons maths and computing panjab university chandigarh 91 2022 cbse rbdavsrsecpublic school 934 2020 cbse rbdavsrsecpublic school 94 scholastic achievements  jam mathematics air 152 out of 13000  tifr qualified tifr 2025  scholarships recieved gargi scholarshipmerit based one from each batch in bachelors  university topper in bachelors  nimcet qualified nimcet 2025 e
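The output above shows a side effect of stripping every special character: tokens such as `9.1`, `B.Sc.` and `R.B.D.A.V.Sr.Sec.Public` are fused into `91`, `bschons` and `rbdavsrsecpublic`, and a skill written as `scikit-learn` would become `scikitlearn`. A minimal sketch of a gentler cleaner follows. The function name `clean_text_keep_tokens` and the choice of which characters to keep (`.`, `+`, `#`, `-`) are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_text_keep_tokens(text):
    """Hypothetical variant of clean_text that keeps '.', '+', '#' and '-' inside tokens."""
    text = text.lower()
    # Replace anything that is not a letter, digit, space or . + # - with a space
    text = re.sub(r"[^a-z0-9.+#\- ]", " ", text)
    # Trim dots and hyphens from token edges, e.g. "b.sc." -> "b.sc", "---" -> ""
    tokens = (tok.strip(".-") for tok in text.split())
    return " ".join(tok for tok in tokens if tok)

print(clean_text_keep_tokens("B.Sc.(Hons), GPA 9.1\n• Scikit-learn, C++ --- "))
# b.sc hons gpa 9.1 scikit-learn c++
```

With a cleaner like this, skill names containing hyphens or symbols stay intact for the skill-matching step later on.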


## __Step 3: Understanding Text Meaning__

In [11]:
# install the required library if it is not already installed
!pip install sentence-transformers



In [12]:
# Load the Embedding Model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')





## __Step 4: Create Job Description Text__

In [13]:
job_description = """
We are looking for a Data Science Intern with strong skills in
Python, SQL, Machine Learning, NLP, and data analysis.
Experience with pandas and scikit-learn is a plus.
"""

In [14]:
# Generate embeddings: the resume is passed in cleaned form,
# the job description is passed as written
resume_embedding = model.encode(cleaned_resume_text)
jd_embedding = model.encode(job_description)
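One caveat when embedding a whole resume in a single call: the model card for `all-MiniLM-L6-v2` states that input longer than 256 word pieces is truncated by default, so the later part of a long resume may not influence the embedding (the limit can be checked with `model.max_seq_length`). A rough sketch of one workaround is to split the text into chunks and average the chunk embeddings. The helper names `chunk_words` and `embed_long_text`, and the budget of 150 words per chunk, are assumptions for illustration; word counts only approximate the tokenizer's word-piece counts.

```python
def chunk_words(text, max_words=150):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_long_text(text, encode, max_words=150):
    """Average the embeddings of all chunks.

    `encode` is any callable mapping a string to a sequence of floats,
    for example `model.encode` from the cell above.
    """
    chunks = chunk_words(text, max_words) or [""]
    vectors = [list(encode(chunk)) for chunk in chunks]
    dim = len(vectors[0])
    # Element-wise mean across chunk vectors
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Possible usage with the notebook's model (not executed here):
# resume_embedding = embed_long_text(cleaned_resume_text, model.encode)
```

Averaging is only one option; comparing the job description against each chunk and taking the best score is another, and which works better depends on the data.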


## __Step 5: Measure Similarity__

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_score = cosine_similarity(
    [resume_embedding],
    [jd_embedding]
)

print(similarity_score)

[[0.32501936]]
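For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths, so it measures the angle between the embeddings rather than their magnitude. The plain-Python sketch below illustrates the formula on toy vectors; the function name `cosine` is hypothetical, and it is not the scikit-learn implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0], [1.0, 1.0]), 3))   # 0.707 (vectors 45 degrees apart)
print(cosine([1.0, 0.0], [0.0, 1.0]))             # 0.0 (orthogonal vectors)
```

A score of about 0.33, as above, therefore indicates embeddings that point in loosely related directions, not that one third of the content matches.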


In [16]:
# converting to percentage
match_percentage = similarity_score[0][0] * 100
print(f"Resume Match Score: {match_percentage:.2f}%")


Resume Match Score: 32.50%


## __Step 6: Skill Extraction & Missing Skill Analysis__

__Step 6.1: Create a Skill List__

In [17]:
skills_list = [
    "python", "sql", "machine learning", "deep learning",
    "nlp", "data analysis", "pandas", "numpy",
    "scikit-learn", "tensorflow", "pytorch",
    "excel", "power bi", "tableau"
]

__Step 6.2: Extract Skills from Text__

In [18]:
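def extract_skills(text, skills):
    """Return the skills that occur in text, in the order given by skills."""
    found_skills = []

    for skill in skills:
        # Plain substring test: "sql" also matches inside "mysql"
        if skill in text:
            found_skills.append(skill)

    return found_skills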
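Because `skill in text` is a substring test, short skill names can match inside longer words: `excel` is found in `excellent` and `sql` in `mysql`. A minimal sketch of a stricter matcher is shown below, assuming the text is already lowercased. The name `extract_skills_whole_word` is hypothetical and not part of the original notebook.

```python
import re

def extract_skills_whole_word(text, skills):
    """Return skills that appear in text as whole terms, not inside other words."""
    found = []
    for skill in skills:
        # The skill must not be preceded or followed by a letter or digit
        pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            found.append(skill)
    return found

sample = "excellent mysql background, plus python and scikit-learn"
print(extract_skills_whole_word(sample, ["excel", "sql", "python", "scikit-learn"]))
# ['python', 'scikit-learn']
```

Note that this only helps if the cleaning step leaves hyphens in place; text cleaned with `clean_text` above no longer contains `scikit-learn` in that form.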

__Step 6.3: Extract Resume & Job Description Skills__

In [19]:
resume_skills = extract_skills(cleaned_resume_text, skills_list)
jd_skills = extract_skills(job_description.lower(), skills_list)

print("Skills in Resume:", resume_skills)
print("Skills in Job Description:", jd_skills)

Skills in Resume: ['python', 'sql', 'machine learning', 'pandas']
Skills in Job Description: ['python', 'sql', 'machine learning', 'nlp', 'data analysis', 'pandas', 'scikit-learn']


__Step 6.4: Find Missing Skills__

In [20]:
# Keep the job-description order instead of relying on arbitrary set ordering
missing_skills = [skill for skill in jd_skills if skill not in resume_skills]
print("Missing Skills:", missing_skills)


Missing Skills: ['nlp', 'data analysis', 'scikit-learn']


## __Step 7: Making Results Human-Friendly (Interpretation Layer)__

__Step 7.1: Interpret the Match Score__

In [22]:
# Map a match percentage (0-100) to a human-readable label
def interpret_score(score):
    if score >= 75:
        return "Strong Match"
    elif score >= 50:
        return "Moderate Match"
    else:
        return "Low Match"

__Step 7.2: Display a Clean Final Output__

In [25]:
print("=== *RESUME MATCH ANALYSIS* ===\n")

print(f"Match Score: {match_percentage:.2f}%")
print("Match Level:", interpret_score(match_percentage))

print("\nSkills Found in Resume:")
print(", ".join(resume_skills) if resume_skills else "None")

print("\nSkills Required in Job Description:")
print(", ".join(jd_skills) if jd_skills else "None")

print("\nMissing Skills:")
print(", ".join(missing_skills) if missing_skills else "None")

=== *RESUME MATCH ANALYSIS* ===

Match Score: 32.50%
Match Level: Low Match

Skills Found in Resume:
python, sql, machine learning, pandas

Skills Required in Job Description:
python, sql, machine learning, nlp, data analysis, pandas, scikit-learn

Missing Skills:
nlp, data analysis, scikit-learn


## __Step 8: Improving my model__


__Step 8.1: Creating Skill Categories__

In [28]:
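# Set operations do not preserve order, so matching and extra skills may print in any order
matching_skills = list(set(resume_skills) & set(jd_skills))
missing_skills = [skill for skill in jd_skills if skill not in resume_skills]   # keeps job-description order
extra_skills = list(set(resume_skills) - set(jd_skills))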

__Step 8.2: Displaying the Skill Gap Analysis__

In [29]:
print("=== SKILL GAP ANALYSIS ===\n")

print("Matching Skills:")
print(", ".join(matching_skills) if matching_skills else "None")

print("\nMissing Skills (Important to Learn):")
print(", ".join(missing_skills) if missing_skills else "None")

print("\nExtra Skills (Nice to Have):")
print(", ".join(extra_skills) if extra_skills else "None")

=== SKILL GAP ANALYSIS ===

Matching Skills:
sql, python, pandas, machine learning

Missing Skills (Important to Learn):
nlp, data analysis, scikit-learn

Extra Skills (Nice to Have):
None
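The skill lists can also be summarised as a single number: the share of job-description skills that the resume covers. The sketch below is one possible way to compute it; `skill_coverage` is a hypothetical helper, and the two lists are copied from the outputs of Step 6.3 so the example runs on its own.

```python
def skill_coverage(resume_skills, jd_skills):
    """Percentage of job-description skills that also appear in the resume."""
    if not jd_skills:
        return 0.0
    matched = [skill for skill in jd_skills if skill in resume_skills]
    return 100 * len(matched) / len(jd_skills)

# Values copied from the Step 6.3 output
resume_skills = ["python", "sql", "machine learning", "pandas"]
jd_skills = ["python", "sql", "machine learning", "nlp",
             "data analysis", "pandas", "scikit-learn"]

print(f"Skill coverage: {skill_coverage(resume_skills, jd_skills):.1f}%")
# Skill coverage: 57.1%
```

Here 4 of the 7 required skills are present, which gives a different picture from the 32.50% embedding score and may be worth reporting alongside it.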


__Step 8.3: Resume Improvement Suggestions__

In [30]:
# Build a list of suggestions from the match percentage and the missing skills
def generate_suggestions(match_score, missing_skills):
    suggestions = []

    if match_score < 75:
        suggestions.append("Consider adding projects or experience relevant to the job description.")

    if missing_skills:
        suggestions.append("Add or highlight these skills in your resume: " + ", ".join(missing_skills))

    if match_score < 50:
        suggestions.append("Consider improving sections like Education or Work Experience for clarity and detail.")

    return suggestions


In [31]:
# apply the function
suggestions = generate_suggestions(match_percentage, missing_skills)

print("=== RESUME IMPROVEMENT SUGGESTIONS ===")
for idx, s in enumerate(suggestions, 1):
    print(f"{idx}. {s}")


=== RESUME IMPROVEMENT SUGGESTIONS ===
1. Consider adding projects or experience relevant to the job description.
2. Add or highlight these skills in your resume: nlp, data analysis, scikit-learn
3. Consider improving sections like Education or Work Experience for clarity and detail.
