
# Resume Screening
**Goal:** Build a resume screening system that ranks resumes against job descriptions using embeddings and NLP techniques.  

**Features included:**
- Parse resumes from **.pdf**, **.docx**, **.txt** files.
- Preprocess text and extract entities/skills (spaCy + PhraseMatcher + regex).
- Compute embeddings with **sentence-transformers** and rank resumes by **cosine similarity** to job descriptions.
- Provide interpretable justifications: matched skills, years of experience, and top sentences that matched.
- Optional classifier (train on labeled matches if you have them).
- Simple **Streamlit** front-end example to upload a resume and view top matches.
- Save results and export top candidates.


## 1) Install dependencies

In [None]:

# Uncomment to install required packages in your environment
# %pip install sentence-transformers pandas scikit-learn spacy pdfplumber python-docx streamlit
# %pip install fuzzywuzzy[speedup]  # optional for fuzzy skill matching
# !python -m spacy download en_core_web_sm


## 2) Imports & Setup

In [2]:

import os, re, glob, json, math, warnings
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional

import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

import spacy
from spacy.matcher import PhraseMatcher

# Embeddings
from sentence_transformers import SentenceTransformer

# Resume file parsing
import pdfplumber
import docx

# Visualization
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8,4)

# Device
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print('Device:', DEVICE)

# spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except Exception:
    nlp = spacy.blank("en")
    warnings.warn("spaCy model not found; using blank English model. Run `python -m spacy download en_core_web_sm` if needed.")




Device: cpu


## 3) Load Resumes & Jobs Dataset

In [None]:

def strip_html(html: str) -> str:
    if not isinstance(html, str) or not html:
        return ""
    # Remove script/style blocks
    html = re.sub(r"<script[\s\S]*?</script>", " ", html, flags=re.IGNORECASE)
    html = re.sub(r"<style[\s\S]*?</style>", " ", html, flags=re.IGNORECASE)
    # Replace <br> and block tags with newlines
    html = re.sub(r"<(?:br|BR)\s*/?>", "\n", html)
    html = re.sub(r"</(?:p|div|section|li|h[1-6]|tr)>", "\n", html)
    # Strip remaining tags
    text = re.sub(r"<[^>]+>", " ", html)
    # Unescape basic entities
    text = (text
        .replace("&nbsp;", " ")
        .replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
    )
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Use the provided Resume CSV instead of scanning folders
resumejobs_base = os.path.join("datasets", "resumejobs")
resume_csv = os.path.join(resumejobs_base, "Resume", "Resume.csv")
resumes_df = pd.DataFrame()

if os.path.exists(resume_csv):
    rdf = pd.read_csv(resume_csv)
    # Expect columns: ID, Resume_str, Resume_html, Category
    lower_map = {c.lower(): c for c in rdf.columns}
    id_col = lower_map.get('id')
    txt_col = lower_map.get('resume_str')
    html_col = lower_map.get('resume_html')
    cat_col = lower_map.get('category')

    if txt_col is None and html_col is None:
        raise ValueError("Resume.csv must contain Resume_str or Resume_html column.")

    # Prefer text; fallback to stripped HTML
    if txt_col is None:
        rdf['__resume_text__'] = rdf[html_col].fillna('').astype(str).apply(strip_html)
    else:
        # fillna('') first: astype(str) alone would turn NaN into the literal string 'nan'
        rdf['__resume_text__'] = rdf[txt_col].fillna('').astype(str)
        # If text is missing/empty but HTML exists, fill from HTML
        if html_col is not None:
            empty_mask = rdf['__resume_text__'].str.strip() == ''
            rdf.loc[empty_mask, '__resume_text__'] = rdf.loc[empty_mask, html_col].fillna('').astype(str).apply(strip_html)

    # Build unified DataFrame expected by downstream cells
    resumes_df = pd.DataFrame({
        'path': rdf[id_col] if id_col else rdf.index.astype(str),
        'id': rdf[id_col].astype(str) if id_col else rdf.index.astype(str),
        'text': rdf['__resume_text__'].astype(str),
        'category': rdf[cat_col] if cat_col else ''
    })
    print(f"Loaded resumes from Resume.csv: {len(resumes_df)}")
else:
    # Fallback to previous folder-based logic if CSV is not present
    print(f"Resume CSV not found at {resume_csv}. Falling back to folder scan.")

    # Helper: read resume text from .txt, .pdf, .docx
    def read_txt(path):
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            return f.read()

    def read_pdf(path):
        text = []
        try:
            with pdfplumber.open(path) as pdf:
                for page in pdf.pages:
                    text.append(page.extract_text() or "")
        except Exception as e:
            warnings.warn(f"pdfplumber failed for {path}: {e}")
        return "\n".join(text)

    def read_docx(path):
        try:
            doc = docx.Document(path)
            return "\n".join([p.text for p in doc.paragraphs])
        except Exception as e:
            warnings.warn(f"python-docx failed for {path}: {e}")
            return ""

    def read_resume(path):
        ext = path.lower().split('.')[-1]
        if ext == 'txt':
            return read_txt(path)
        elif ext == 'pdf':
            return read_pdf(path)
        elif ext == 'docx':
            return read_docx(path)
        else:
            raise ValueError("Unsupported resume file type: " + ext)

    # Prefer resumes from datasets/resumejobs/data/data/<CATEGORY>/*.pdf (per attached structure)
    resume_root = os.path.join(resumejobs_base, "data", "data")
    resume_paths = []
    if os.path.exists(resume_root):
        # Collect PDFs first, then optional DOCX/TXT
        resume_paths = sorted(glob.glob(os.path.join(resume_root, "*", "*.pdf")))
        resume_paths += sorted(glob.glob(os.path.join(resume_root, "*", "*.docx")))
        resume_paths += sorted(glob.glob(os.path.join(resume_root, "*", "*.txt")))
    else:
        # Fallback to local 'resumes/' folder
        resume_folder = "resumes"
        if not os.path.exists(resume_folder):
            os.makedirs(resume_folder)
            print("Created 'resumes/' folder - place your resume files there and re-run the cell.")
        resume_paths = sorted(glob.glob(os.path.join(resume_folder, "*.*")))

    print(f"Found {len(resume_paths)} resumes")

    resumes = []
    for p in resume_paths:
        try:
            text = read_resume(p)
            category = os.path.basename(os.path.dirname(p)) if os.path.sep in p else ""
            resumes.append({
                "path": p,
                "text": text,
                "id": os.path.basename(p),
                "category": category
            })
        except Exception as e:
            warnings.warn(f"Failed to read {p}: {e}")

    resumes_df = pd.DataFrame(resumes)
    print("Loaded resumes:", len(resumes_df))

# Load jobs dataset: prefer datasets/resumejobs/job_descriptions.csv, else fallback to local jobs.csv
jobs_df = None
job_csv_candidates = [
    os.path.join(resumejobs_base, "job_descriptions.csv"),
    "jobs.csv",
]
job_csv = next((p for p in job_csv_candidates if os.path.exists(p)), None)

if job_csv is not None:
    # If using the large attached file, limit to the first 100,000 rows
    if os.path.basename(job_csv).lower() == 'job_descriptions.csv':
        df = pd.read_csv(job_csv, nrows=100000)
        loaded_note = " (first 100000 rows)"
    else:
        df = pd.read_csv(job_csv)
        loaded_note = ""

    # Normalize columns -> 'title', 'description', 'job_id'
    lower_map = {c.lower(): c for c in df.columns}

    desc_keys = [
        'job description','description','job_description','jd','desc','text',
        'qualifications','requirements','responsibilities','summary'
    ]
    title_keys = ['job title','title','job_title','position','role']
    id_keys = ['job id','job_id','jobid','id']

    def pick(keys):
        for k in keys:
            if k in lower_map:
                return lower_map[k]
        return None

    desc_col = pick(desc_keys)
    title_col = pick(title_keys)
    id_col = pick(id_keys)

    if desc_col is None:
        raise ValueError(f"Could not find a description-like column in {job_csv}. Include one (e.g., 'Job Description' or 'Description').")

    rename_map = {desc_col: 'description'}
    if title_col: rename_map[title_col] = 'title'
    if id_col: rename_map[id_col] = 'job_id'

    df = df.rename(columns=rename_map)

    if 'title' not in df.columns:
        df['title'] = df.index.astype(str)
    if 'job_id' not in df.columns:
        df['job_id'] = df.index.astype(str)

    jobs_df = df
    print(f"Loaded jobs from {os.path.basename(job_csv)}{loaded_note}: {len(jobs_df)}")
else:
    print("No job CSV found. Place datasets/resumejobs/job_descriptions.csv or a local jobs.csv with 'title' and 'description'.")


Loaded resumes from Resume.csv: 2484
Loaded jobs from job_descriptions.csv (first 100000 rows): 100000


## 4) Preprocess & Extract Entities / Skills

In [3]:

# Basic cleaning
def normalize_text(text):
    text = text.replace('\r', ' ').replace('\n', ' ').strip()
    text = re.sub(r'\s+', ' ', text)
    return text

# Ensure resumes_df has the expected columns
required_resume_cols = {'text', 'id'}
missing_cols = [c for c in required_resume_cols if c not in resumes_df.columns]
if missing_cols:
    raise ValueError(f"resumes_df missing required columns: {missing_cols}")

resumes_df['text_clean'] = resumes_df['text'].astype(str).apply(normalize_text)
if 'jobs_df' in globals() and jobs_df is not None:
    # Ensure description_clean exists for later steps
    base_desc_col = 'description' if 'description' in jobs_df.columns else jobs_df.columns[0]
    jobs_df['description_clean'] = jobs_df[base_desc_col].astype(str).apply(normalize_text)

# Regex extractors (email, phone, years)
EMAIL_RE = re.compile(r'[\w\.\-+%]+@[\w\-]+\.[\w\.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\-\s()]{6,}\d')
YEXP_RE = re.compile(r'(\d+)\+?\s*(?:years|yrs|y)')

def extract_basic_entities(text: str):
    emails = EMAIL_RE.findall(text)
    phones = PHONE_RE.findall(text)
    years = [int(m) for m in YEXP_RE.findall(text)]
    return {"emails": emails, "phones": phones, "years_mentioned": years}

resumes_df['basic_entities'] = resumes_df['text_clean'].apply(extract_basic_entities)

# Skill extraction using PhraseMatcher with a skills list
# 1) Load base skills from skills.txt (if present) or fall back to a default list
base_skills = []
if os.path.exists("skills.txt"):
    with open("skills.txt","r",encoding="utf-8") as f:
        base_skills = [line.strip() for line in f if line.strip()]
else:
    # default common skills (extend this file for better matching)
    base_skills = [
        "python","pandas","numpy","sql","excel","machine learning","deep learning",
        "tensorflow","pytorch","scikit-learn","nlp","natural language processing",
        "data analysis","tableau","power bi","spark","aws","azure","docker","kubernetes",
        "git","jira","hadoop","bash","linux","java","c++","c#","scala","node.js","react",
        "flask","django","fastapi","airflow","looker","snowflake","redshift","gcp","bigquery"
    ]

# 2) Derive additional candidate skills from job descriptions (if available)
derived_skills = []
if 'jobs_df' in globals() and jobs_df is not None and len(jobs_df) > 0:
    try:
        from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP
    except Exception:
        SPACY_STOP = set()
    # Custom stopwords to filter generic HR words
    EXTRA_STOP = {
        'experience','experiences','experienced','responsible','responsibilities','skills','skill',
        'ability','abilities','work','working','knowledge','requirements','requirement','role','roles',
        'team','teams','strong','excellent','good','great','years','year','including','include','includes',
        'etc','must','plus','preferred','nice','fit','position','job','candidate','candidates','company',
        'business','clients','customer','support','using','use','used','within','across','based','make','well'
    }
    STOP = set(w.lower() for w in (SPACY_STOP or set())) | EXTRA_STOP

    # Build a corpus from title + description
    text_parts = []
    if 'title' in jobs_df.columns:
        text_parts.append(jobs_df['title'].astype(str).tolist())
    desc_col = 'description_clean' if 'description_clean' in jobs_df.columns else ('description' if 'description' in jobs_df.columns else None)
    if desc_col:
        text_parts.append(jobs_df[desc_col].astype(str).tolist())
    corpus = "\n".join([t for sub in text_parts for t in (sub or [])])

    # Tokenize with a regex to keep tech strings like c++, c#, node.js
    TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.#-]{1,}")
    raw_tokens = [t.lower() for t in TOKEN_RE.findall(corpus)]
    tokens = [t for t in raw_tokens if len(t) >= 2 and t not in STOP and not t.isdigit()]

    # Unigram and bigram frequencies
    uni_counts = Counter(tokens)
    bigrams = [f"{tokens[i]} {tokens[i+1]}" for i in range(len(tokens)-1)]
    bi_counts = Counter(bigrams)

    # Uppercase acronyms (kept lower for matcher attr=LOWER)
    ACRO_RE = re.compile(r"\b[A-Z]{2,}(?:\.[A-Z]{2,})?\b")
    acronyms = [a.lower() for a in ACRO_RE.findall(corpus)]

    # Keep frequent and skill-like terms
    top_unigrams = [w for w, c in uni_counts.most_common(200) if c >= 3]
    top_bigrams = [w for w, c in bi_counts.most_common(200) if c >= 2]

    # Simple filters to remove overly-generic bigrams
    def bigram_ok(bg: str) -> bool:
        a, b = bg.split(' ', 1)
        return (a not in STOP) and (b not in STOP) and (len(a) > 2 or len(b) > 2)

    top_bigrams = [bg for bg in top_bigrams if bigram_ok(bg)]

    derived_skills = list(dict.fromkeys(top_unigrams + top_bigrams + acronyms))

# 3) Merge and deduplicate while preserving order
skills_list = list(dict.fromkeys([*base_skills, *derived_skills]))
print(f"Skills in matcher: {len(skills_list)}")

# Build PhraseMatcher (case-insensitive)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(s) for s in skills_list]
if patterns:
    matcher.add("SKILL", patterns)


# Optimize: use tokenizer-only doc creation and cap text length
MAX_TEXT_CHARS_FOR_SKILLS = 200_000

def extract_skills(text: str):
    text = text[:MAX_TEXT_CHARS_FOR_SKILLS]
    doc = nlp.make_doc(text)
    spans = []
    for match_id, start, end in matcher(doc):
        spans.append(doc[start:end].text)
    # frequency and normalized unique list
    freq = Counter([s.lower() for s in spans])
    return {"skills": list(dict.fromkeys([s for s in spans])), "skills_freq": dict(freq)}

resumes_df['skills'] = resumes_df['text_clean'].apply(extract_skills)


Skills in matcher: 452
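
One detail of the regex extractors above is worth checking on a toy string: the `YEXP_RE` pattern accepts a bare `y` as a unit and is case-sensitive, so a four-digit year followed by a word starting with "y" is counted as experience. The sketch below compares it with a stricter variant. `STRICT_YEXP_RE` and the sample sentence are hypothetical, not part of the notebook, and the 1-2 digit limit is an assumption about plausible experience values.

```python
import re

# Pattern copied from the cell above: the bare "y" alternative matches
# the first letter of any following word that starts with "y".
LOOSE_YEXP_RE = re.compile(r'(\d+)\+?\s*(?:years|yrs|y)')

# Hypothetical stricter variant: 1-2 digit counts, whole-word units,
# case-insensitive, and no match in the middle of a longer number.
STRICT_YEXP_RE = re.compile(r'(?<!\d)(\d{1,2})\+?\s*(?:years?|yrs?)\b', re.IGNORECASE)

sample = "5+ years of Python, 3 yrs SQL, since 2015 yielding results, 10 Years total"

loose_years = [int(m) for m in LOOSE_YEXP_RE.findall(sample)]
strict_years = [int(m) for m in STRICT_YEXP_RE.findall(sample)]

print("loose: ", loose_years)
print("strict:", strict_years)
```

On this sample the loose pattern picks up `2015` and misses the capitalized "10 Years", while the stricter one does the opposite. Whether the stricter behavior is preferable depends on how resumes in the dataset phrase experience.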


## 5) Compute Embeddings & Rank Resumes by Similarity

In [4]:

# Load sentence-transformers model
EMB_MODEL_NAME = "all-MiniLM-L6-v2"
print("Loading embedding model:", EMB_MODEL_NAME)
embedder = SentenceTransformer(EMB_MODEL_NAME, device=DEVICE)

# Adaptive batch sizes
RESUME_BATCH = 128 if DEVICE == "cuda" else 16
JOB_BATCH = 64 if DEVICE == "cuda" else 8

# Compute embeddings for resumes
if len(resumes_df) > 0:
    resume_texts = resumes_df['text_clean'].astype(str).tolist()
    print(f"Encoding resumes: {len(resume_texts)} (batch={RESUME_BATCH})")
    resume_embeddings = embedder.encode(
        resume_texts,
        batch_size=RESUME_BATCH,
        show_progress_bar=True,
        convert_to_numpy=True,
    )
else:
    resume_embeddings = np.zeros((0, embedder.get_sentence_embedding_dimension()))

# Compute embeddings for job descriptions (if available)
if 'jobs_df' in globals() and jobs_df is not None and len(jobs_df) > 0:
    job_texts = jobs_df['description_clean'].astype(str).tolist()
    print(f"Encoding jobs: {len(job_texts)} (batch={JOB_BATCH})")
    job_embeddings = embedder.encode(
        job_texts,
        batch_size=JOB_BATCH,
        show_progress_bar=True,
        convert_to_numpy=True,
    )
else:
    job_embeddings = np.zeros((0, embedder.get_sentence_embedding_dimension()))

print("Resume embeddings shape:", resume_embeddings.shape)
print("Job embeddings shape:", job_embeddings.shape)

# Similarity function
def rank_resumes_for_job(job_idx: int, top_k: int = 5):
    if job_embeddings.shape[0] == 0:
        raise ValueError("No job embeddings available. Provide a jobs dataset.")
    job_emb = job_embeddings[job_idx].reshape(1, -1)
    sims = cosine_similarity(job_emb, resume_embeddings).squeeze()  # shape (n_resumes,)
    idxs = np.argsort(sims)[::-1][:top_k]

    # Precompute job words once
    job_words = set(
        w.lower() for w in re.findall(r'\w+', jobs_df.iloc[job_idx]['description_clean']) if len(w) > 2
    )

    rows = []
    for i in idxs:
        score = float(sims[i])
        row = resumes_df.iloc[i].to_dict()
        row['match_score'] = round(float(score)*100, 2)
        # Get matched skills intersection
        rskills = set([s.lower() for s in row.get('skills', {}).get('skills', [])])
        matched_skills = list(rskills & job_words)
        # Fallback: token-wise overlap between resume skills and job words
        if not matched_skills:
            def tokens_in_job(s: str) -> bool:
                toks = [t.lower() for t in re.findall(r'\w+', s) if len(t) > 2]
                return any(t in job_words for t in toks)
            matched_skills = [s for s in rskills if tokens_in_job(s)]
        row['matched_skills'] = matched_skills
        # years mentioned
        row['years_mentioned'] = row.get('basic_entities', {}).get('years_mentioned', [])
        rows.append(row)
    return pd.DataFrame(rows)

# Example: rank resumes for the first job
if 'jobs_df' in globals() and jobs_df is not None and len(jobs_df)>0 and len(resume_embeddings)>0:
    top_matches = rank_resumes_for_job(0, top_k=5)
    display(top_matches[['id','match_score','matched_skills','years_mentioned']])
else:
    print("Provide Resume.csv and job_descriptions.csv (or jobs.csv) to run the ranking example.")


Loading embedding model: all-MiniLM-L6-v2
Encoding resumes: 2484 (batch=16)


Batches: 100%|██████████| 156/156 [01:11<00:00,  2.19it/s]



Encoding jobs: 100000 (batch=8)


Batches: 100%|██████████| 12500/12500 [13:29<00:00, 15.44it/s]



Resume embeddings shape: (2484, 384)
Job embeddings shape: (100000, 384)


| | id | match_score | matched_skills | years_mentioned |
|---|---|---|---|---|
| 0 | 75329822 | 69.76 | [analyze, social, create, content, media, driv...] | [] |
| 1 | 15479281 | 65.71 | [analyze, social, create, content, media, metr...] | [] |
| 2 | 24677466 | 65.21 | [social, content, media, organizations, brand,...] | [17] |
| 3 | 11677012 | 64.92 | [analyze, social, media] | [] |
| 4 | 94492380 | 63.13 | [social, create, content, media, drive, brand] | [20, 8, 2] |
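
The `match_score` column is the cosine similarity between the job embedding and each resume embedding, multiplied by 100 and rounded to two decimals. A minimal pure-Python sketch of that arithmetic on toy 2-dimensional vectors follows. The helper names `cosine` and `rank_by_cosine` are illustrative only; the notebook itself uses scikit-learn's `cosine_similarity` on 384-dimensional embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_cosine(job_vec, resume_vecs, top_k=5):
    """Return (resume_index, match_score) pairs, best first; score = cosine * 100."""
    scored = [(i, round(cosine(job_vec, r) * 100, 2)) for i, r in enumerate(resume_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data: resume 0 points the same way as the job, resume 1 is orthogonal,
# resume 2 sits at 45 degrees.
toy_job = [1.0, 0.0]
toy_resumes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_ranking = rank_by_cosine(toy_job, toy_resumes, top_k=2)
print(toy_ranking)
```

Because cosine similarity ignores vector length, a long resume and a short one pointing in the same direction receive the same score; the value reflects topical overlap rather than seniority or quality.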


## 6) Train a classifier on labeled pairs

In [None]:

labeled_path = "labeled_pairs.csv"  # expected columns: job_id, resume_id, label
if os.path.exists(labeled_path):
    pairs = pd.read_csv(labeled_path)
    
    # Map ids to indices (robust to presence/absence of 'job_id' column)
    job_ids_series = jobs_df.get('job_id', jobs_df.index.astype(str))
    job_id_to_idx = {str(jid): i for i, jid in enumerate(job_ids_series.astype(str).tolist())}
    
    # Build multiple resume-id lookup strategies to support both Resume.csv IDs and file-based IDs
    res_ids = resumes_df['id'].astype(str).tolist()
    resume_by_id = {rid: i for i, rid in enumerate(res_ids)}
    resume_by_id_basename = {os.path.basename(rid): i for i, rid in enumerate(res_ids)}
    resume_by_path_basename = {}
    if 'path' in resumes_df.columns:
        paths = resumes_df['path'].astype(str).tolist()
        resume_by_path_basename = {os.path.basename(p): i for i, p in enumerate(paths)}
    
    def map_resume_index(rid_raw) -> Optional[int]:
        rid_raw = str(rid_raw).strip()
        # Try exact ID, basename of ID (if it looked like a path), and basename of stored path
        for candidate in (
            resume_by_id.get(rid_raw),
            resume_by_id.get(os.path.basename(rid_raw)),
            resume_by_id_basename.get(os.path.basename(rid_raw)),
            resume_by_path_basename.get(os.path.basename(rid_raw)),
        ):
            if candidate is not None:
                return candidate
        return None
    
    X = []
    y = []
    used = 0
    for _, row in pairs.iterrows():
        jid = str(row['job_id'])
        rid_raw = row['resume_id']
        jidx = job_id_to_idx.get(jid)
        ridx = map_resume_index(rid_raw)
        if jidx is None or ridx is None:
            continue
        je = job_embeddings[jidx]
        re_ = resume_embeddings[ridx]
        # Include cosine similarity as an explicit scalar feature
        cos = float(cosine_similarity(je.reshape(1, -1), re_.reshape(1, -1)).squeeze())
        # Feature vector: concat embeddings, elementwise product, abs diff, and cosine similarity
        feat = np.concatenate([je, re_, je * re_, np.abs(je - re_), [cos]])
        X.append(feat)
        y.append(int(row['label']))
        used += 1
    
    if not X:
        print("No matching (job_id, resume_id) pairs found between labeled_pairs.csv and current datasets - skipping classifier training.")
    else:
        X = np.vstack(X)
        y = np.array(y)
        scaler = StandardScaler()
        Xs = scaler.fit_transform(X)
        clf = LogisticRegression(max_iter=500)
        clf.fit(Xs, y)
        print(f"Trained classifier on {len(y)} labeled pairs. You can now predict match probabilities for new pairs.")
else:
    print("No labeled_pairs.csv found - skipping classifier training.")
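
The feature vector built for each labeled pair concatenates the job embedding, the resume embedding, their elementwise product, their absolute difference, and the cosine similarity. The sketch below reproduces that layout on toy 2-dimensional vectors, with plain Python lists standing in for the NumPy arrays. `pair_features` is a hypothetical helper name, not a function defined in the notebook.

```python
import math

def pair_features(job_vec, resume_vec):
    """Toy version of the pair features: [job, resume, product, |diff|, cosine]."""
    product = [j * r for j, r in zip(job_vec, resume_vec)]
    abs_diff = [abs(j - r) for j, r in zip(job_vec, resume_vec)]
    dot = sum(product)
    norms = (math.sqrt(sum(j * j for j in job_vec))
             * math.sqrt(sum(r * r for r in resume_vec)))
    cos = dot / norms if norms else 0.0
    return list(job_vec) + list(resume_vec) + product + abs_diff + [cos]

toy_features = pair_features([1.0, 2.0], [2.0, 0.0])
print(len(toy_features))  # 4 * dim + 1 entries
print(toy_features)
```

With the 384-dimensional embeddings reported above, this layout yields 4 × 384 + 1 = 1537 features per pair. Scoring a new pair would presumably reuse the fitted objects from the training cell, for example `clf.predict_proba(scaler.transform([feat]))[:, 1]`.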


## 7) Present Top-ranked Resumes with Justification

In [8]:

def justify_match(job_idx: int, resume_row: pd.Series, top_sentences: int = 3):
    # Find sentences in resume that have highest similarity to job description sentences
    job_text = jobs_df.iloc[job_idx].get('description_clean', jobs_df.iloc[job_idx].get('description',''))
    # split into sentences
    job_sents = [s for s in re.split(r'[\n\.!?]+', job_text) if s.strip()][:200]  # cap for speed
    res_sents = [s for s in re.split(r'[\n\.!?]+', resume_row['text_clean']) if s.strip()][:200]
    if not job_sents or not res_sents:
        return {"top_sentences": [], "matched_skills": resume_row.get('matched_skills', [])}
    # embed sentences (small number) and compute similarity
    emb_job_sents = embedder.encode(job_sents, convert_to_numpy=True)
    emb_res_sents = embedder.encode(res_sents, convert_to_numpy=True)
    sim = cosine_similarity(emb_res_sents, emb_job_sents)  # res x job
    # For each resume sentence, pick max similarity to any job sentence
    res_scores = sim.max(axis=1)
    top_idx = np.argsort(res_scores)[-top_sentences:][::-1]
    top_sents = [res_sents[i].strip() for i in top_idx]
    return {"top_sentences": top_sents, "matched_skills": resume_row.get('matched_skills', [])}

# Display nicely for a job
def present_top_for_job(job_idx:int, top_k:int=5):
    job = jobs_df.iloc[job_idx]
    print("JOB:", job.get('title','(no title)'))
    desc_preview = job.get('description_clean', job.get('description',''))
    print("DESCRIPTION (first 250 chars):", (desc_preview[:250] if isinstance(desc_preview,str) else str(desc_preview)) , "...\n")
    matches = rank_resumes_for_job(job_idx, top_k=top_k)
    for i, r in matches.iterrows():
        print(f"Rank {i+1}: Resume {r['id']} - Match Score: {r['match_score']}%\n")
        justification = justify_match(job_idx, r, top_sentences=3)
        print("Matched skills:", justification['matched_skills'])
        print("Top matching sentences from resume:")
        for s in justification['top_sentences']:
            print("-", s)
        print("\n---\n")

# Example presentation (if jobs present)
if jobs_df is not None and len(jobs_df)>0:
    present_top_for_job(0, top_k=3)
else:
    print("No jobs available to present matches.")


JOB: Digital Marketing Specialist
DESCRIPTION (first 250 chars): Social Media Managers oversee an organizations social media presence. They create and schedule content, engage with followers, and analyze social media metrics to drive brand awareness and engagement. ...

Rank 1: Resume 75329822 - Match Score: 69.76%

Matched skills: ['analyze', 'social', 'create', 'content', 'media', 'drive', 'brand', 'engage']
Top matching sentences from resume:
- Social Media Management
- Public Relations and Social Media Manager 11/2012 to 06/2014 Company Name Responsible for the execution & management of strategies supporting content development, influencer marketing, events, strategic partnerships, cause marketing and social media campaigns
- PUBLIC RELATIONS/SOCIAL MEDIA MANAGEMENT Summary Public Relations Manager with strong communications, event planning, media relations and social media experience within consumer brands

---

Rank 2: Resume 15479281 - Match Score: 65.71%

Matched skills: ['
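
The sentence selection in `justify_match` scores every resume sentence by its best similarity to any job sentence and keeps the highest-scoring ones. The sketch below shows that max-then-top-k step on a toy example. To stay dependency-free it swaps the sentence-transformers embeddings for a simple bag-of-words cosine, so the scores are not comparable to the notebook's; `bow_cosine` and `top_resume_sentences` are hypothetical names.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Same naive splitter as the notebook: break on newlines and . ! ?"""
    return [s.strip() for s in re.split(r'[\n\.!?]+', text) if s.strip()]

def bow_cosine(a, b):
    """Bag-of-words cosine between two sentences (toy stand-in for embeddings)."""
    ca = Counter(re.findall(r'\w+', a.lower()))
    cb = Counter(re.findall(r'\w+', b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def top_resume_sentences(job_text, resume_text, k=2):
    """Score each resume sentence by its best-matching job sentence; keep the top k."""
    job_sents, res_sents = split_sentences(job_text), split_sentences(resume_text)
    if not job_sents or not res_sents:
        return []
    scored = [(max(bow_cosine(r, j) for j in job_sents), r) for r in res_sents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:k]]

toy_job = "Manage social media content. Analyze engagement metrics."
toy_resume = ("Managed social media campaigns. Enjoys hiking on weekends. "
              "Built dashboards to analyze metrics.")
toy_top = top_resume_sentences(toy_job, toy_resume, k=2)
print(toy_top)
```

Taking the maximum over job sentences means a resume sentence only needs to match one requirement well to be surfaced, which suits justification but would overstate fit if used as an overall score.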

## 8) Streamlit Front-end to test models

In [None]:
# Writes the app to resume_screening.py; run it with: streamlit run resume_screening.py
streamlit_app = r"""
import streamlit as st
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import os, re, io
import numpy as np

st.set_page_config(page_title="Resume Screening App", page_icon="📄", layout="wide")
st.title('Resume Screening')

# Model loading (adjust path if running in separate env)
@st.cache_resource
def load_model():
    return SentenceTransformer('%s')

model = load_model()

st.sidebar.header('Job selector')
# Prefer datasets/resumejobs/job_descriptions.csv, else fallback to jobs.csv
jobs = None
job_text = ''

def normalize_text(text: str):
    if not isinstance(text, str):
        text = str(text) if text is not None else ''
    return re.sub(r'\s+', ' ', text.replace('\r',' ').replace('\n',' ')).strip()

def load_jobs_df():
    # Try dataset path first
    ds_path = os.path.join('datasets','resumejobs','job_descriptions.csv')
    if os.path.exists(ds_path):
        df = pd.read_csv(ds_path, nrows=5000)  # cap for snappy UI
    elif os.path.exists('jobs.csv'):
        df = pd.read_csv('jobs.csv')
    else:
        return None
    lower_map = {c.lower(): c for c in df.columns}
    def pick(keys):
        for k in keys:
            if k in lower_map:
                return lower_map[k]
        return None
    desc_col = pick(['job description','description','job_description','jd','desc','text','summary','requirements','qualifications']) or df.columns[0]
    title_col = pick(['job title','title','job_title','position','role'])
    id_col = pick(['job id','job_id','jobid','id'])
    rename_map = {desc_col: 'description'}
    if title_col: rename_map[title_col] = 'title'
    if id_col: rename_map[id_col] = 'job_id'
    df = df.rename(columns=rename_map)
    if 'title' not in df.columns:
        df['title'] = df.index.astype(str)
    if 'job_id' not in df.columns:
        df['job_id'] = df.index.astype(str)
    df['description_clean'] = df['description'].astype(str).apply(normalize_text)
    return df

# Lightweight skill matching
@st.cache_resource
def load_skills():
    skills = []
    if os.path.exists('skills.txt'):
        with open('skills.txt','r',encoding='utf-8',errors='ignore') as f:
            skills = [l.strip() for l in f if l.strip()]
    return [s.lower() for s in skills]

skills_list = load_skills()

jobs = load_jobs_df()
if jobs is not None and len(jobs) > 0:
    job_idx = st.sidebar.selectbox(
        'Choose job',
        jobs.index.tolist(),
        format_func=lambda i: f"{jobs.loc[i,'title']} (id={jobs.loc[i,'job_id']})" if 'job_id' in jobs.columns else f"{jobs.loc[i,'title']} (row {i})"
    )
    job_text = jobs.loc[job_idx, 'description_clean']
else:
    st.sidebar.info('Place datasets/resumejobs/job_descriptions.csv or jobs.csv in the app folder.')
    job_text = st.text_area('Or paste a job description here:')

# Sidebar options
st.sidebar.header('Options')
top_k = st.sidebar.slider('Top-K resumes to display', 1, 50, 10)
show_just = st.sidebar.checkbox('Show justifications for top matches', value=False)
just_n = st.sidebar.slider('Sentences per resume (when showing justification)', 1, 5, 3)

# Readers for uploaded files
def read_txt_file(uploaded):
    data = uploaded.read()
    try:
        return data.decode('utf-8', errors='ignore') if isinstance(data, (bytes, bytearray)) else str(data)
    finally:
        try:
            uploaded.seek(0)
        except Exception:
            pass

def read_pdf_file(uploaded):
    text = ''
    try:
        import pdfplumber
        with pdfplumber.open(io.BytesIO(uploaded.read())) as pdf:
            for p in pdf.pages:
                text += p.extract_text() or ''
    except Exception:
        try:
            uploaded.seek(0)
        except Exception:
            pass
    finally:
        try:
            uploaded.seek(0)
        except Exception:
            pass
    return text

def read_docx_file(uploaded):
    text = ''
    try:
        import docx
        doc = docx.Document(io.BytesIO(uploaded.read()))
        text = '\n'.join([p.text for p in doc.paragraphs])
    except Exception:
        pass
    finally:
        try:
            uploaded.seek(0)
        except Exception:
            pass
    return text

def extract_skills_simple(text_lower: str):
    if not skills_list:
        return []
    found = []
    for s in skills_list:
        if s and s in text_lower:
            found.append(s)
    # dedupe preserve order
    seen = set()
    out = []
    for s in found:
        if s not in seen:
            out.append(s)
            seen.add(s)
    return out

# Inference helpers
def get_score(job_text: str, resume_text: str) -> float:
    if not job_text or not resume_text:
        return 0.0
    job_emb = model.encode([job_text], convert_to_numpy=True)
    resume_emb = model.encode([resume_text], convert_to_numpy=True)
    return float(cosine_similarity(job_emb, resume_emb).squeeze())

def justify(job_text: str, resume_text: str, k: int = 3):
    job_sents = [s.strip() for s in re.split(r'[\n\.!?]+', job_text) if s.strip()][:200]
    res_sents = [s.strip() for s in re.split(r'[\n\.!?]+', resume_text) if s.strip()][:200]
    if not job_sents or not res_sents:
        return []
    ej = model.encode(job_sents, convert_to_numpy=True)
    er = model.encode(res_sents, convert_to_numpy=True)
    sim = cosine_similarity(er, ej).max(axis=1)
    idx = np.argsort(sim)[-k:][::-1]
    return [res_sents[i] for i in idx]

uploaded_files = st.file_uploader('Upload resumes (.pdf, .docx, .txt) - multiple allowed', type=['pdf','docx','txt'], accept_multiple_files=True)

if not job_text:
    st.warning('Provide or select a job description first.')
else:
    if uploaded_files:
        rows = []
        for uf in uploaded_files:
            name = uf.name
            ext = name.split('.')[-1].lower()
            if ext == 'txt':
                resume_text = read_txt_file(uf)
            elif ext == 'pdf':
                resume_text = read_pdf_file(uf)
            elif ext == 'docx':
                resume_text = read_docx_file(uf)
            else:
                resume_text = ''
            text_clean = normalize_text(resume_text)
            score = get_score(job_text, text_clean)
            matched = extract_skills_simple(text_clean.lower()) if text_clean else []
            rows.append({
                'file': name,
                'chars': len(text_clean),
                'matching score': round(score*100, 2),
                # Surface the skills found above; previously computed but never shown
                'matched skills': ', '.join(matched),
                'text': text_clean,
            })
        res = pd.DataFrame(rows).sort_values('matching score', ascending=False).reset_index(drop=True)
        st.subheader('Top matches')
        st.dataframe(res[['file','chars','matching score','matched skills']].head(top_k), use_container_width=True)
        st.download_button('Download results CSV', res.to_csv(index=False).encode('utf-8'), file_name='resume_screening_results.csv', mime='text/csv')

        if show_just:
            st.markdown('---')
            st.subheader('Justifications')
            top_subset = res.head(top_k)
            for i, row in top_subset.iterrows():
                with st.expander(f"{row['file']} - Matching Score: {row['matching score']}%%"):
                    sents = justify(job_text, row['text'], k=just_n)
                    if sents:
                        for s in sents:
                            st.write('- ', s)
                    else:
                        st.write('No sentences found.')
    else:
        st.info('Upload one or more resumes to score against the selected job.')

""" % EMB_MODEL_NAME

with open("resume_screening.py","w",encoding="utf-8") as f:
    f.write(streamlit_app)

print('Wrote Streamlit app to resume_screening.py. Run it with: streamlit run resume_screening.py')

Wrote Streamlit app to resume_screening.py. Run it with: streamlit run resume_screening.py
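
The app's `extract_skills_simple` uses a plain substring test, which is fast but can report skills that only appear inside longer words. The sketch below contrasts it with a boundary-aware variant on a toy sentence. `match_skills_bounded` is a hypothetical alternative, not part of the generated app, and its boundary rule (no word character, `+`, or `#` on either side) is an assumption chosen so that skills such as `c++` still match.

```python
import re

def match_skills_substring(skills, text):
    """Substring test, as in `extract_skills_simple`: fast but prone to false hits."""
    text_lower = text.lower()
    return [s for s in skills if s in text_lower]

def match_skills_bounded(skills, text):
    """Hypothetical variant: require a non-word neighbor (or text edge) on both sides."""
    text_lower = text.lower()
    found = []
    for s in skills:
        # Custom lookarounds instead of \b, which would fail after "+" or "#".
        pattern = r'(?<![\w+#])' + re.escape(s) + r'(?![\w+#])'
        if re.search(pattern, text_lower):
            found.append(s)
    return found

toy_skills = ["java", "sql", "c++"]
toy_text = "JavaScript and MySQL developer with C++ experience."

substring_hits = match_skills_substring(toy_skills, toy_text)
bounded_hits = match_skills_bounded(toy_skills, toy_text)
print(substring_hits)
print(bounded_hits)
```

Here the substring test reports `java` and `sql` because of "JavaScript" and "MySQL", while the bounded variant keeps only `c++`. The notebook's PhraseMatcher path works on tokens and is less exposed to this, so the difference mainly affects the Streamlit app.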


## 9) Summary

This notebook built an end-to-end resume screening workflow:

- Parsed resumes from PDF/DOCX/TXT/CSV.
- Normalized and cleaned text, extracted basic entities (emails, phones, years), and matched a skills list via spaCy PhraseMatcher.
- Loaded jobs and normalized title/description fields.
- Computed sentence embeddings with a sentence-transformers model (`all-MiniLM-L6-v2`), then ranked resumes by cosine similarity to job descriptions.
- Displayed top matches, with matched skills and example sentences as lightweight justification.
- Optional: trained a simple classifier if labeled pairs exist.
- Generated a Streamlit app for interactive screening.

### Suggested next steps

- Add OCR for scanned PDFs (e.g., pytesseract) and handle images inside PDFs.
- Weight skills by recency or section (e.g., experience > summary > hobbies).
- Multi-signal scoring: combine semantic similarity, skill overlap, years, and keyword boosts.
- Cache embeddings to disk; store in a vector index (FAISS/Annoy) for fast retrieval.
- Package with Docker; externalize models and data paths via env vars.
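
As one way to picture the multi-signal scoring idea above, the sketch below blends semantic similarity, skill coverage, and years of experience into a single 0-100 score. Everything in it is an assumption for illustration: the function name `combined_score`, the three signals chosen, and the weights (0.6, 0.3, 0.1), which are not tuned on any data.

```python
def combined_score(semantic_sim, resume_skills, job_skills, years, required_years,
                   weights=(0.6, 0.3, 0.1)):
    """Blend three signals into a 0-100 score. Weights are illustrative, not tuned."""
    w_sem, w_skill, w_years = weights
    job_set = {s.lower() for s in job_skills}
    resume_set = {s.lower() for s in resume_skills}
    # Share of the job's skills that the resume covers.
    skill_coverage = len(job_set & resume_set) / len(job_set) if job_set else 0.0
    # Experience saturates at 1.0 once the requirement is met.
    years_ratio = min(years / required_years, 1.0) if required_years > 0 else 1.0
    # Clamp cosine similarity to [0, 1] before weighting.
    sem = max(0.0, min(semantic_sim, 1.0))
    return round(100 * (w_sem * sem + w_skill * skill_coverage + w_years * years_ratio), 2)

toy_score = combined_score(
    semantic_sim=0.70,
    resume_skills=["Python", "SQL"],
    job_skills=["python", "sql", "docker", "aws"],
    years=3,
    required_years=6,
)
print(toy_score)
```

If labeled pairs are available, fitting the weights (for instance with the logistic regression from section 6 on these three signals) would likely be more defensible than hand-picked values.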
  

