# AI-Assisted Recruitment System

This notebook implements an AI-powered recruitment system that matches job postings with resumes using natural language processing and machine learning techniques.

## Features:
- Job posting analysis and cleaning
- Resume parsing and skill extraction
- Intelligent job-resume matching
- Scoring and ranking system
- Interactive matching interface


In [70]:
"""
Dependencies and configuration for the AI recruitment system.
"""
import pandas as pd
import numpy as np
import re
import warnings
from collections import Counter
from typing import List, Dict, Tuple, Optional, Union

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder

try:
    from transformers import AutoTokenizer, AutoModel
    from sentence_transformers import SentenceTransformer
    import torch
    BERT_AVAILABLE = True
except ImportError:
    BERT_AVAILABLE = False

try:
    from flask import Flask, request, jsonify
    from flask_cors import CORS
    FLASK_AVAILABLE = True
except ImportError:
    FLASK_AVAILABLE = False

import json
import pickle
from datetime import datetime
import os
from pathlib import Path

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

try:
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    nltk.download('wordnet', quiet=True)
except:
    pass


In [71]:
try:
    DATABASE_AVAILABLE
except NameError:
    DATABASE_AVAILABLE = False


In [72]:
"""
Load job postings, resumes, and cleaned datasets.
"""
df_jobs = pd.read_csv('Dataset/data job posts.csv')
df_resumes = pd.read_csv('Dataset/Resume.csv')
df_cleaned = pd.read_csv('Dataset/updated_data_final_cleaned.csv')

print(f"Datasets loaded - Jobs: {df_jobs.shape}, Resumes: {df_resumes.shape}, Cleaned: {df_cleaned.shape}")


Datasets loaded - Jobs: (19001, 24), Resumes: (2484, 4), Cleaned: (32481, 3)


In [73]:
"""
Explore job postings dataset structure and sample records.
"""
if df_jobs is not None:
    print(f"Job Posts: {df_jobs.shape}")
    print(f"Missing values: {df_jobs.isnull().sum().sum()}")
    print("\nSample records:")
    print(df_jobs[['Title', 'Company', 'Location']].head(3))


Job Posts: (19001, 24)
Missing values: 137017

Sample records:
                                                      Title  \
0                                   Chief Financial Officer   
1  Full-time Community Connections Intern (paid internship)   
2                                       Country Coordinator   

                                           Company  \
0             AMERIA Investment Consulting Company   
1  International Research & Exchanges Board (IREX)   
2        Caucasus Environmental NGO Network (CENN)   

                                                                                              Location  
0                                                                                     Yerevan, Armenia  
1  IREX Armenia Main Office; Yerevan, Armenia \r\nDESCRIPTION:   IREX currently seeks to fill the p...  
2                                                                                     Yerevan, Armenia  


In [74]:
"""
Explore resume dataset structure, categories, and sample records.
"""
print(f"Resume Dataset: {df_resumes.shape}")
print(f"\nMissing values: {df_resumes.isnull().sum().sum()}")
print(f"\nResume categories ({df_resumes['Category'].nunique()}):")
print(df_resumes['Category'].value_counts().head(10))
df_resumes.head(3)


Resume Dataset: (2484, 4)

Missing values: 0

Resume categories (24):
Category
INFORMATION-TECHNOLOGY    120
BUSINESS-DEVELOPMENT      120
FINANCE                   118
ADVOCATE                  118
ACCOUNTANT                118
ENGINEERING               118
CHEF                      118
AVIATION                  117
FITNESS                   117
SALES                     116
Name: count, dtype: int64


Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\n\nHR ADMINISTRATOR Summary Dedicated Cu...,"<div class=""fontsize fontface vmargins hmargins linespacing pagesize"" id=""document""> <div class=...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS Summary Versatile media professional with ba...","<div class=""fontsize fontface vmargins hmargins linespacing pagesize"" id=""document""> <div class=...",HR
2,33176873,"HR DIRECTOR Summary Over 20 years experience in recruiting, 15 plus years ...","<div class=""fontsize fontface vmargins hmargins linespacing pagesize"" id=""document""> <div class=...",HR


In [75]:
"""
Initialize spaCy NLP model for text processing and skill extraction.
"""
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("spaCy model not found. Install with: python -m spacy download en_core_web_sm")
    nlp = None


In [76]:
"""
Initialize BERT and RoBERTa models for semantic text matching and similarity computation.
"""
if BERT_AVAILABLE:
    try:
        bert_model = SentenceTransformer('all-MiniLM-L6-v2')
        bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        bert_model_direct = AutoModel.from_pretrained('bert-base-uncased')
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        bert_model_direct.to(device)
        print(f"BERT models initialized on {device}")
    except Exception as e:
        print(f"BERT initialization failed: {e}")
        bert_model = bert_tokenizer = bert_model_direct = None
else:
    bert_model = bert_tokenizer = bert_model_direct = None

if BERT_AVAILABLE:
    try:
        roberta_model = SentenceTransformer('all-distilroberta-v1')
        roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
        roberta_model_direct = AutoModel.from_pretrained('roberta-base')
        roberta_model_direct.to(device)
        print(f"RoBERTa models initialized on {device}")
    except Exception as e:
        print(f"RoBERTa initialization failed: {e}")
        roberta_model = roberta_tokenizer = roberta_model_direct = None
else:
    roberta_model = roberta_tokenizer = roberta_model_direct = None

def get_bert_embeddings(texts, model_type='sentence_transformer'):
    """
    Generate BERT embeddings for input texts.
    
    Args:
        texts: List of text strings or single text string
        model_type: Type of model to use ('sentence_transformer')
    
    Returns:
        numpy array of embeddings
    """
    if not BERT_AVAILABLE or not texts:
        return np.array([])
    
    try:
        if model_type == 'sentence_transformer' and bert_model:
            return bert_model.encode(texts, convert_to_tensor=False)
        return np.array([])
    except Exception as e:
        print(f"BERT embedding error: {e}")
        return np.array([])

def calculate_bert_similarity(text1, text2):
    """
    Compute semantic similarity between two text strings using BERT embeddings.
    
    Args:
        text1: First text string
        text2: Second text string
    
    Returns:
        Similarity score between 0 and 1
    """
    if not BERT_AVAILABLE or not bert_model:
        return 0.0
    
    try:
        embeddings = bert_model.encode([text1, text2])
        return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    except:
        return 0.0

def calculate_roberta_similarity(text1, text2):
    """
    Compute semantic similarity between two text strings using RoBERTa embeddings.
    
    Args:
        text1: First text string
        text2: Second text string
    
    Returns:
        Similarity score between 0 and 1
    """
    if not BERT_AVAILABLE or not roberta_model:
        return 0.0
    
    try:
        embeddings = roberta_model.encode([text1, text2])
        return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    except:
        return 0.0

def extract_skills_bert(text, threshold=0.7):
    """
    Extract relevant skills from text using BERT semantic matching.
    
    Args:
        text: Input text to analyze
        threshold: Minimum similarity threshold for skill matching
    
    Returns:
        List of extracted skill keywords
    """
    if not BERT_AVAILABLE or not text:
        return []
    
    skill_keywords = [
        'python', 'java', 'javascript', 'react', 'angular', 'vue', 'node.js',
        'machine learning', 'deep learning', 'artificial intelligence', 'ai',
        'data science', 'data analysis', 'statistics', 'sql', 'database',
        'aws', 'azure', 'docker', 'kubernetes', 'git', 'github',
        'project management', 'agile', 'scrum', 'leadership', 'communication',
        'marketing', 'sales', 'finance', 'accounting', 'human resources',
        'design', 'ui', 'ux', 'photoshop', 'illustrator', 'figma',
        'mobile development', 'ios', 'android', 'swift', 'kotlin',
        'web development', 'frontend', 'backend', 'full stack', 'devops'
    ]
    
    try:
        extracted_skills = []
        text_lower = text.lower()
        
        for skill in skill_keywords:
            similarity = calculate_bert_similarity(text_lower, skill)
            if similarity >= threshold:
                extracted_skills.append(skill)
        
        return list(set(extracted_skills))
    except Exception as e:
        print(f"BERT skill extraction error: {e}")
        return []


BERT models initialized on cpu


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RoBERTa models initialized on cpu


In [77]:
"""
Configure caching system for preprocessed data to improve performance.
"""
CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

CACHE_JOBS = CACHE_DIR / "df_jobs_clean.pkl"
CACHE_RESUMES = CACHE_DIR / "df_resumes_clean.pkl"


In [78]:
"""
BERT and RoBERTa-based matching functions for semantic job-resume matching.
"""
def find_best_matches_bert(job_index, df_jobs, df_resumes, top_n=5):
    """
    Find top matching resumes for a job using BERT semantic embeddings.
    
    Args:
        job_index: Index of the job in df_jobs
        df_jobs: DataFrame of job postings with 'CleanText' column
        df_resumes: DataFrame of resumes with 'CleanText' column
        top_n: Number of top matches to return
    
    Returns:
        DataFrame with ranked matches and similarity scores
    """
    if not BERT_AVAILABLE or bert_model is None:
        return pd.DataFrame()
    
    try:
        job_text = df_jobs.iloc[job_index]['CleanText']
        resume_texts = df_resumes['CleanText'].tolist()
        
        job_embedding = bert_model.encode([job_text])
        resume_embeddings = bert_model.encode(resume_texts)
        
        similarities = cosine_similarity(job_embedding, resume_embeddings).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            results.append({
                'Rank': i + 1,
                'Resume_ID': df_resumes.iloc[idx]['ID'],
                'Category': df_resumes.iloc[idx]['Category'],
                'BERT_Similarity_Score': similarities[idx],
                'Resume_Text': df_resumes.iloc[idx]['Resume_str'][:200] + '...'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"BERT matching error: {e}")
        return pd.DataFrame()

def enhanced_matching_bert(job_index, df_jobs, df_resumes, top_n=5, skill_weight=0.3, bert_weight=0.7):
    """
    Enhanced matching combining BERT semantic similarity with skill overlap.
    
    Args:
        job_index: Index of the job in df_jobs
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        top_n: Number of top matches to return
        skill_weight: Weight for skill overlap score (default 0.3)
        bert_weight: Weight for BERT similarity score (default 0.7)
    
    Returns:
        DataFrame with ranked matches and combined scores
    """
    if not BERT_AVAILABLE or bert_model is None:
        return pd.DataFrame()
    
    try:
        job = df_jobs.iloc[job_index]
        job_text = job['CleanText']
        job_skills = set(job['Skills']) if job['Skills'] else set()
        
        job_embedding = bert_model.encode([job_text])
        resume_texts = df_resumes['CleanText'].tolist()
        resume_embeddings = bert_model.encode(resume_texts)
        
        bert_similarities = cosine_similarity(job_embedding, resume_embeddings).flatten()
        
        skill_scores = []
        for idx, resume in df_resumes.iterrows():
            resume_skills = set(resume['Skills']) if resume['Skills'] else set()
            
            if job_skills and resume_skills:
                overlap = len(job_skills.intersection(resume_skills))
                skill_score = overlap / len(job_skills) if job_skills else 0
            else:
                skill_score = 0
            
            skill_scores.append(skill_score)
        
        combined_scores = bert_weight * bert_similarities + skill_weight * np.array(skill_scores)
        top_indices = combined_scores.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            resume = df_resumes.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Resume_ID': resume['ID'],
                'Category': resume['Category'],
                'BERT_Similarity_Score': bert_similarities[idx],
                'Skill_Overlap_Score': skill_scores[idx],
                'Combined_Score': combined_scores[idx],
                'Resume_Text': resume['Resume_str'][:200] + '...'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"Enhanced BERT matching error: {e}")
        return pd.DataFrame()

def search_jobs_by_keywords_bert(keywords, df_jobs, top_n=5):
    """
    Search for jobs using BERT semantic matching on keyword queries.
    
    Args:
        keywords: Search query string
        df_jobs: DataFrame of job postings
        top_n: Number of top results to return
    
    Returns:
        DataFrame with ranked job matches
    """
    if not BERT_AVAILABLE or bert_model is None:
        return pd.DataFrame()
    
    try:
        clean_keywords = clean_text(keywords)
        keyword_embedding = bert_model.encode([clean_keywords])
        job_texts = df_jobs['CleanText'].tolist()
        job_embeddings = bert_model.encode(job_texts)
        
        similarities = cosine_similarity(keyword_embedding, job_embeddings).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            job = df_jobs.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Title': job['Title'],
                'Company': job['Company'],
                'Location': job['Location'],
                'BERT_Similarity_Score': similarities[idx],
                'Description': job['JobDescription'][:200] + '...' if pd.notna(job['JobDescription']) else 'N/A'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"BERT job search error: {e}")
        return pd.DataFrame()

def find_best_matches_roberta(job_index, df_jobs, df_resumes, top_n=5):
    """
    Find top matching resumes for a job using RoBERTa semantic embeddings.
    
    Args:
        job_index: Index of the job in df_jobs
        df_jobs: DataFrame of job postings with 'CleanText' column
        df_resumes: DataFrame of resumes with 'CleanText' column
        top_n: Number of top matches to return
    
    Returns:
        DataFrame with ranked matches and similarity scores
    """
    if not BERT_AVAILABLE or roberta_model is None:
        return pd.DataFrame()
    
    try:
        job_text = df_jobs.iloc[job_index]['CleanText']
        resume_texts = df_resumes['CleanText'].tolist()
        
        job_embedding = roberta_model.encode([job_text])
        resume_embeddings = roberta_model.encode(resume_texts)
        
        similarities = cosine_similarity(job_embedding, resume_embeddings).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            results.append({
                'Rank': i + 1,
                'Resume_ID': df_resumes.iloc[idx]['ID'],
                'Category': df_resumes.iloc[idx]['Category'],
                'RoBERTa_Similarity_Score': similarities[idx],
                'Resume_Text': df_resumes.iloc[idx]['Resume_str'][:200] + '...'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"RoBERTa matching error: {e}")
        return pd.DataFrame()

def enhanced_matching_roberta(job_index, df_jobs, df_resumes, top_n=5, skill_weight=0.3, roberta_weight=0.7):
    """
    Enhanced matching combining RoBERTa semantic similarity with skill overlap.
    
    Args:
        job_index: Index of the job in df_jobs
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        top_n: Number of top matches to return
        skill_weight: Weight for skill overlap score (default 0.3)
        roberta_weight: Weight for RoBERTa similarity score (default 0.7)
    
    Returns:
        DataFrame with ranked matches and combined scores
    """
    if not BERT_AVAILABLE or roberta_model is None:
        return pd.DataFrame()
    
    try:
        job = df_jobs.iloc[job_index]
        job_text = job['CleanText']
        job_skills = set(job['Skills']) if job['Skills'] else set()
        
        job_embedding = roberta_model.encode([job_text])
        resume_texts = df_resumes['CleanText'].tolist()
        resume_embeddings = roberta_model.encode(resume_texts)
        
        roberta_similarities = cosine_similarity(job_embedding, resume_embeddings).flatten()
        
        skill_scores = []
        for idx, resume in df_resumes.iterrows():
            resume_skills = set(resume['Skills']) if resume['Skills'] else set()
            
            if job_skills and resume_skills:
                overlap = len(job_skills.intersection(resume_skills))
                skill_score = overlap / len(job_skills) if job_skills else 0
            else:
                skill_score = 0
            
            skill_scores.append(skill_score)
        
        combined_scores = roberta_weight * roberta_similarities + skill_weight * np.array(skill_scores)
        top_indices = combined_scores.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            resume = df_resumes.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Resume_ID': resume['ID'],
                'Category': resume['Category'],
                'RoBERTa_Similarity_Score': roberta_similarities[idx],
                'Skill_Overlap_Score': skill_scores[idx],
                'Combined_Score': combined_scores[idx],
                'Resume_Text': resume['Resume_str'][:200] + '...'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"Enhanced RoBERTa matching error: {e}")
        return pd.DataFrame()


In [79]:
"""
Text preprocessing and TF-IDF matching functions.
"""
def clean_text(text):
    """
    Clean and normalize text by removing HTML, special characters, and normalizing whitespace.
    
    Args:
        text: Raw text string
    
    Returns:
        Cleaned lowercase text string
    """
    if pd.isna(text):
        return ""
    
    text = str(text).lower()
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def extract_skills(text, nlp_model):
    """
    Extract skill keywords from text using spaCy NLP model.
    
    Args:
        text: Input text to analyze
        nlp_model: spaCy language model
    
    Returns:
        List of unique skill keywords
    """
    if not nlp_model or not text:
        return []
    
    doc = nlp_model(text)
    skills = []
    
    for token in doc:
        if (token.pos_ in ['NOUN', 'PROPN'] and 
            not token.is_stop and 
            len(token.text) > 2 and
            token.text.isalpha()):
            skills.append(token.lemma_.lower())
    
    return list(set(skills))

def lemmatize_text(text, nlp_model):
    """
    Lemmatize text and remove stopwords for improved matching.
    
    Args:
        text: Input text string
        nlp_model: spaCy language model
    
    Returns:
        Lemmatized text string
    """
    if not nlp_model or not text:
        return ""
    
    doc = nlp_model(text)
    return " ".join([token.lemma_ for token in doc if not token.is_stop and token.is_alpha])

def find_best_matches(job_index, resume_tfidf, job_tfidf, df_resumes, top_n=5):
    """
    Find top matching resumes for a job using TF-IDF cosine similarity.
    
    Args:
        job_index: Index of the job in the TF-IDF matrix
        resume_tfidf: TF-IDF matrix for resumes
        job_tfidf: TF-IDF matrix for jobs
        df_resumes: DataFrame of resumes
        top_n: Number of top matches to return
    
    Returns:
        DataFrame with ranked matches and similarity scores
    """
    try:
        job_vector = job_tfidf[job_index]
        similarities = cosine_similarity(job_vector, resume_tfidf).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            results.append({
                'Rank': i + 1,
                'Resume_ID': df_resumes.iloc[idx]['ID'],
                'Category': df_resumes.iloc[idx]['Category'],
                'Similarity_Score': similarities[idx],
                'Resume_Text': df_resumes.iloc[idx]['Resume_str'][:200] + '...'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"TF-IDF matching error: {e}")
        return pd.DataFrame()


In [80]:
"""
Preprocess job postings and resumes with caching for performance.
Combines job fields, cleans text, extracts skills, and lemmatizes for matching.
"""
if CACHE_JOBS.exists() and CACHE_RESUMES.exists():
    try:
        df_jobs_clean = pd.read_pickle(CACHE_JOBS)
        df_resumes_clean = pd.read_pickle(CACHE_RESUMES)
        print(f"Loaded from cache: {len(df_jobs_clean)} jobs, {len(df_resumes_clean)} resumes")
        cache_available = True
    except Exception as e:
        print(f"Cache load error: {e}. Reprocessing data...")
        cache_available = False
    else:
        cache_available = True
else:
    cache_available = False

if not cache_available and df_jobs is not None and df_resumes is not None:
    df_jobs_subset = df_jobs.head(1000)
    df_resumes_subset = df_resumes.head(5000)
    
    print(f"Processing: {len(df_jobs_subset)} jobs, {len(df_resumes_subset)} resumes")
    
    job_columns = ['Title', 'Company', 'Location', 'JobDescription', 'JobRequirment', 'RequiredQual']
    df_jobs_clean = df_jobs_subset[job_columns].copy()
    df_jobs_clean = df_jobs_clean.dropna(subset=['Title', 'JobDescription'])
    df_jobs_clean = df_jobs_clean.reset_index(drop=True)
    
    df_jobs_clean['CombinedText'] = (
        df_jobs_clean['Title'].fillna('') + ' ' +
        df_jobs_clean['JobDescription'].fillna('') + ' ' +
        df_jobs_clean['JobRequirment'].fillna('') + ' ' +
        df_jobs_clean['RequiredQual'].fillna('')
    )
    
    df_jobs_clean['CleanText'] = df_jobs_clean['CombinedText'].apply(clean_text)
    
    if nlp:
        df_jobs_clean['Skills'] = df_jobs_clean['CleanText'].apply(lambda x: extract_skills(x, nlp))
        df_jobs_clean['LemmatizedText'] = df_jobs_clean['CleanText'].apply(lambda x: lemmatize_text(x, nlp))
    else:
        df_jobs_clean['Skills'] = [[] for _ in range(len(df_jobs_clean))]
        df_jobs_clean['LemmatizedText'] = df_jobs_clean['CleanText']
    
    df_resumes_clean = df_resumes_subset.copy()
    df_resumes_clean['CleanText'] = df_resumes_clean['Resume_str'].apply(clean_text)
    
    if nlp:
        df_resumes_clean['Skills'] = df_resumes_clean['CleanText'].apply(lambda x: extract_skills(x, nlp))
        df_resumes_clean['LemmatizedText'] = df_resumes_clean['CleanText'].apply(lambda x: lemmatize_text(x, nlp))
    else:
        df_resumes_clean['Skills'] = [[] for _ in range(len(df_resumes_clean))]
        df_resumes_clean['LemmatizedText'] = df_resumes_clean['CleanText']
    
    df_jobs_clean.to_pickle(CACHE_JOBS)
    df_resumes_clean.to_pickle(CACHE_RESUMES)
    print(f"Saved to cache: {len(df_jobs_clean)} jobs, {len(df_resumes_clean)} resumes")
elif df_jobs is None or df_resumes is None:
    print("No data available for processing")

Loaded from cache: 822 jobs, 2484 resumes


In [81]:
"""
Create TF-IDF vector representations for jobs and resumes.
Combines all text to build a shared vocabulary for consistent feature space.
"""
if 'df_jobs_clean' in locals() and 'df_resumes_clean' in locals():
    all_texts = list(df_jobs_clean['LemmatizedText']) + list(df_resumes_clean['LemmatizedText'])
    
    vectorizer = TfidfVectorizer(
        max_features=5000,
        stop_words='english',
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.8
    )
    
    tfidf_matrix = vectorizer.fit_transform(all_texts)
    
    n_jobs = len(df_jobs_clean)
    job_tfidf = tfidf_matrix[:n_jobs]
    resume_tfidf = tfidf_matrix[n_jobs:]
    
    print(f"TF-IDF matrix created: {tfidf_matrix.shape}")
else:
    print("Cleaned data not available")


TF-IDF matrix created: (3306, 5000)


In [82]:
"""
Compare TF-IDF, BERT, and RoBERTa matching methods for performance evaluation.
"""
def compare_matching_methods(job_index=0, top_n=5):
    """
    Compare TF-IDF, BERT, and RoBERTa matching methods on a sample job.
    
    Args:
        job_index: Index of job to test
        top_n: Number of top matches to return
    
    Returns:
        Dictionary containing results from all methods
    """
    if df_jobs is None or df_resumes is None:
        print("Datasets not loaded. Run data loading cell first.")
        return
    
    print(f"Comparing methods for job: {df_jobs.iloc[job_index]['Title']}")
    print("=" * 80)
    
    try:
        tfidf_matches = find_best_matches(job_index, resume_tfidf, job_tfidf, df_resumes_clean, top_n)
    except Exception as e:
        print(f"TF-IDF method failed: {str(e)}")
        tfidf_matches = pd.DataFrame()
    
    try:
        bert_matches = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n)
    except Exception as e:
        print(f"BERT method failed: {str(e)}")
        bert_matches = pd.DataFrame()
    
    try:
        roberta_matches = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n)
    except Exception as e:
        print(f"RoBERTa method failed: {str(e)}")
        roberta_matches = pd.DataFrame()
    
    try:
        enhanced_bert_matches = enhanced_matching_bert(job_index, df_jobs_clean, df_resumes_clean, top_n)
    except Exception as e:
        print(f"Enhanced BERT method failed: {str(e)}")
        enhanced_bert_matches = pd.DataFrame()
    
    try:
        enhanced_roberta_matches = enhanced_matching_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n)
    except Exception as e:
        print(f"Enhanced RoBERTa method failed: {str(e)}")
        enhanced_roberta_matches = pd.DataFrame()
    
    print("\nRESULTS COMPARISON:")
    print("=" * 80)
    
    if not tfidf_matches.empty:
        print("\nTF-IDF Results:")
        print(tfidf_matches[['Rank', 'Resume_ID', 'Category', 'Similarity_Score']].to_string(index=False))
    
    if not bert_matches.empty:
        print("\nBERT Results:")
        print(bert_matches[['Rank', 'Resume_ID', 'Category', 'BERT_Similarity_Score']].to_string(index=False))
    
    if not roberta_matches.empty:
        print("\nRoBERTa Results:")
        print(roberta_matches[['Rank', 'Resume_ID', 'Category', 'RoBERTa_Similarity_Score']].to_string(index=False))
    
    if not enhanced_bert_matches.empty:
        print("\nEnhanced BERT Results:")
        print(enhanced_bert_matches[['Rank', 'Resume_ID', 'Category', 'BERT_Similarity_Score', 'Skill_Overlap_Score', 'Combined_Score']].to_string(index=False))
    
    if not enhanced_roberta_matches.empty:
        print("\nEnhanced RoBERTa Results:")
        print(enhanced_roberta_matches[['Rank', 'Resume_ID', 'Category', 'RoBERTa_Similarity_Score', 'Skill_Overlap_Score', 'Combined_Score']].to_string(index=False))
    
    return {
        'tfidf_matches': tfidf_matches,
        'bert_matches': bert_matches,
        'roberta_matches': roberta_matches,
        'enhanced_bert_matches': enhanced_bert_matches,
        'enhanced_roberta_matches': enhanced_roberta_matches
    }

def test_skill_extraction_comparison():
    """
    Compare skill extraction methods between spaCy and BERT.
    """
    if df_jobs is None or not BERT_AVAILABLE:
        print("Datasets not loaded or BERT not available.")
        return
    
    if 'df_jobs_clean' in globals() and len(globals()['df_jobs_clean']) > 0:
        sample_job = df_jobs_clean.iloc[0]
        job_text = sample_job['CleanText']
        
        print(f"Sample Job: {sample_job['Title']}")
        
        if nlp:
            spacy_skills = extract_skills(job_text, nlp)
            print(f"\nspaCy Skills ({len(spacy_skills)}): {spacy_skills[:10]}")
        
        bert_skills = extract_skills_bert(job_text)
        print(f"\nBERT Skills ({len(bert_skills)}): {bert_skills}")
        
        if nlp and spacy_skills:
            spacy_set = set(spacy_skills)
            bert_set = set(bert_skills)
            overlap = len(spacy_set.intersection(bert_set))
            union = len(spacy_set.union(bert_set))
            
            print(f"\nSkill Extraction Comparison:")
            print(f"  spaCy skills: {len(spacy_skills)}")
            print(f"  BERT skills: {len(bert_skills)}")
            print(f"  Overlap: {overlap}")
            print(f"  Jaccard similarity: {overlap/union:.3f}")

comparison_results = compare_matching_methods(job_index=0, top_n=5)
test_skill_extraction_comparison()


Comparing methods for job: Chief Financial Officer

RESULTS COMPARISON:

TF-IDF Results:
 Rank  Resume_ID Category  Similarity_Score
    1   12071138  FINANCE          0.438454
    2   19234823 ADVOCATE          0.431891
    3   18636651  FINANCE          0.426520
    4   17392859  FINANCE          0.401932
    5   84356308  FINANCE          0.398916

BERT Results:
 Rank  Resume_ID Category  BERT_Similarity_Score
    1   17392859  FINANCE               0.752344
    2   15891494  FINANCE               0.743289
    3   14722634  FINANCE               0.737250
    4   26767199  FINANCE               0.730880
    5   38441665  FINANCE               0.723948

RoBERTa Results:
 Rank  Resume_ID    Category  RoBERTa_Similarity_Score
    1   16507693 AGRICULTURE                  0.842645
    2   19243556     FINANCE                  0.820709
    3   19234823    ADVOCATE                  0.816237
    4   38441665     FINANCE                  0.808682
    5   26767199     FINANCE                 

In [83]:
"""
System validation tests for BERT, RoBERTa similarity and skill extraction.
"""
if BERT_AVAILABLE:
    if 'bert_model' in globals() and bert_model is not None:
        similarity = calculate_bert_similarity("python developer", "software engineer")
        print(f"BERT similarity test: {similarity:.3f}")
    
    if 'roberta_model' in globals() and roberta_model is not None:
        similarity = calculate_roberta_similarity("python developer", "software engineer")
        print(f"RoBERTa similarity test: {similarity:.3f}")

if nlp:
    skills = extract_skills("I know Python and machine learning", nlp)
    print(f"Skills extracted: {skills}")


BERT similarity test: 0.494
RoBERTa similarity test: 0.365
Skills extracted: ['learning', 'machine', 'python']


In [84]:
"""
Simple validation test for TF-IDF, BERT, and RoBERTa matching methods.
"""
def simple_test():
    """
    Validate that all matching methods work correctly.
    """
    if df_jobs is None or df_resumes is None:
        print("Datasets not loaded")
        return
    
    try:
        tfidf_results = find_best_matches(0, resume_tfidf, job_tfidf, df_resumes_clean, 3)
        print(f"TF-IDF: Found {len(tfidf_results)} matches")
        if not tfidf_results.empty:
            print(f"Top TF-IDF match: {tfidf_results.iloc[0]['Resume_ID']}")
    except Exception as e:
        print(f"TF-IDF error: {e}")
    
    try:
        bert_results = find_best_matches_bert(0, df_jobs_clean, df_resumes_clean, 3)
        print(f"BERT: Found {len(bert_results)} matches")
        if not bert_results.empty:
            print(f"Top BERT match: {bert_results.iloc[0]['Resume_ID']}")
    except Exception as e:
        print(f"BERT error: {e}")
    
    try:
        roberta_results = find_best_matches_roberta(0, df_jobs_clean, df_resumes_clean, 3)
        print(f"RoBERTa: Found {len(roberta_results)} matches")
        if not roberta_results.empty:
            print(f"Top RoBERTa match: {roberta_results.iloc[0]['Resume_ID']}")
    except Exception as e:
        print(f"RoBERTa error: {e}")

simple_test()


TF-IDF: Found 3 matches
Top TF-IDF match: 12071138
BERT: Found 3 matches
Top BERT match: 17392859
RoBERTa: Found 3 matches
Top RoBERTa match: 16507693


In [85]:
"""
Check availability of datasets and processed data structures.
"""
def check_data_availability():
    """
    Display status of all required data structures.
    """
    status = {
        'df_jobs': 'df_jobs' in locals() and df_jobs is not None,
        'df_resumes': 'df_resumes' in locals() and df_resumes is not None,
        'df_jobs_clean': 'df_jobs_clean' in locals(),
        'df_resumes_clean': 'df_resumes_clean' in locals(),
        'job_tfidf': 'job_tfidf' in locals(),
        'resume_tfidf': 'resume_tfidf' in locals()
    }
    
    for key, value in status.items():
        print(f"{key}: {'Available' if value else 'Not available'}")
    
    if status['df_jobs']:
        print(f"\nOriginal data - Jobs: {df_jobs.shape}, Resumes: {df_resumes.shape if status['df_resumes'] else 'N/A'}")
    
    if status['df_jobs_clean']:
        print(f"Cleaned data - Jobs: {df_jobs_clean.shape}, Resumes: {df_resumes_clean.shape if status['df_resumes_clean'] else 'N/A'}")

check_data_availability()


df_jobs: Not available
df_resumes: Not available
df_jobs_clean: Not available
df_resumes_clean: Not available
job_tfidf: Not available
resume_tfidf: Not available


In [86]:
"""
BERT integration demonstration and system status overview.
"""
def demo_bert_features():
    """
    Demonstrate BERT and RoBERTa semantic similarity capabilities.
    """
    if not BERT_AVAILABLE:
        print("BERT/RoBERTa is not available. Install transformers and torch.")
        return
    
    print(f"Device: {device if 'device' in globals() else 'CPU'}")
    
    test_pairs = [
        ("machine learning engineer", "data scientist"),
        ("python developer", "software engineer"),
        ("project manager", "team lead"),
        ("marketing specialist", "sales representative")
    ]
    
    if 'bert_model' in globals() and bert_model is not None:
        print("\nBERT Semantic Similarity Demo:")
        print("-" * 40)
        for text1, text2 in test_pairs:
            similarity = calculate_bert_similarity(text1, text2)
            print(f"'{text1}' ↔ '{text2}': {similarity:.3f}")
    
    if 'roberta_model' in globals() and roberta_model is not None:
        print("\nRoBERTa Semantic Similarity Demo:")
        print("-" * 40)
        for text1, text2 in test_pairs:
            similarity = calculate_roberta_similarity(text1, text2)
            print(f"'{text1}' ↔ '{text2}': {similarity:.3f}")

def system_status():
    """
    Display current system status and available capabilities.
    """
    print("\nSYSTEM STATUS")
    print("=" * 50)
    
    print(f"Datasets loaded: {df_jobs is not None and df_resumes is not None}")
    print(f"BERT available: {BERT_AVAILABLE}")
    print(f"spaCy available: {nlp is not None}")
    print(f"Flask available: {FLASK_AVAILABLE}")
    print(f"Database available: {DATABASE_AVAILABLE}")
    
    if df_jobs is not None and df_resumes is not None:
        print(f"\nDataset Statistics:")
        print(f"  Job posts: {len(df_jobs):,}")
        print(f"  Resumes: {len(df_resumes):,}")
        print(f"  Resume categories: {df_resumes['Category'].nunique()}")
    
    print(f"\nMatching Methods Available:")
    print(f"  TF-IDF + Cosine Similarity: Available")
    print(f"  BERT Semantic Matching: {'Available' if BERT_AVAILABLE and 'bert_model' in globals() and globals().get('bert_model') is not None else 'Not available'}")
    print(f"  RoBERTa Semantic Matching: {'Available' if BERT_AVAILABLE and 'roberta_model' in globals() and globals().get('roberta_model') is not None else 'Not available'}")
    print(f"  Enhanced BERT + Skills: {'Available' if BERT_AVAILABLE and 'bert_model' in globals() and globals().get('bert_model') is not None else 'Not available'}")
    print(f"  Enhanced RoBERTa + Skills: {'Available' if BERT_AVAILABLE and 'roberta_model' in globals() and globals().get('roberta_model') is not None else 'Not available'}")

demo_bert_features()
system_status()


Device: cpu

BERT Semantic Similarity Demo:
----------------------------------------
'machine learning engineer' ↔ 'data scientist': 0.608
'python developer' ↔ 'software engineer': 0.494
'project manager' ↔ 'team lead': 0.309
'marketing specialist' ↔ 'sales representative': 0.598

RoBERTa Semantic Similarity Demo:
----------------------------------------
'machine learning engineer' ↔ 'data scientist': 0.597
'python developer' ↔ 'software engineer': 0.365
'project manager' ↔ 'team lead': 0.184
'marketing specialist' ↔ 'sales representative': 0.384

SYSTEM STATUS
Datasets loaded: True
BERT available: True
spaCy available: True
Flask available: True
Database available: False

Dataset Statistics:
  Job posts: 19,001
  Resumes: 2,484
  Resume categories: 24

Matching Methods Available:
  TF-IDF + Cosine Similarity: Available
  BERT Semantic Matching: Available
  RoBERTa Semantic Matching: Available
  Enhanced BERT + Skills: Available
  Enhanced RoBERTa + Skills: Available


In [87]:
# Duplicate cell - functions already defined in cell 10
# This cell can be removed or kept as reference


In [88]:
"""
TF-IDF based job search by keywords.
"""
def search_jobs_by_keywords(keywords, df_jobs, job_tfidf, vectorizer, top_n=5):
    """
    Search for jobs matching keyword query using TF-IDF similarity.
    
    Args:
        keywords: Search query string
        df_jobs: DataFrame of job postings
        job_tfidf: TF-IDF matrix for jobs
        vectorizer: Fitted TfidfVectorizer instance
        top_n: Number of top results to return
    
    Returns:
        DataFrame with ranked job matches
    """
    clean_keywords = clean_text(keywords)
    keyword_vector = vectorizer.transform([clean_keywords])
    similarities = cosine_similarity(keyword_vector, job_tfidf).flatten()
    top_indices = similarities.argsort()[-top_n:][::-1]
    
    results = []
    for i, idx in enumerate(top_indices):
        job = df_jobs.iloc[idx]
        results.append({
            'Rank': i + 1,
            'Title': job['Title'],
            'Company': job['Company'],
            'Location': job['Location'],
            'Similarity_Score': similarities[idx],
            'Description': job['JobDescription'][:200] + '...' if pd.notna(job['JobDescription']) else 'N/A'
        })
    
    return pd.DataFrame(results)

search_results = search_jobs_by_keywords("software developer python", df_jobs_clean, job_tfidf, vectorizer)
print("Job search results for 'software developer python':")
print(search_results)


Job search results for 'software developer python':
   Rank                                          Title  \
0     1                  Software Developer/Programmer   
1     2                             Software developer   
2     3  Senior Software Developer (several positions)   
3     4                             Software Developer   
4     5                         Developers Team Leader   

                                       Company          Location  \
0                                      IIG LLC  Yerevan, Armenia   
1                                     Xalt LLC  Yerevan, Armenia   
2                                    ZenteX.AM  Yerevan, Armenia   
3  Synergy International Systems, Inc./Armenia  Yerevan, Armenia   
4                                    Zenteq.am  Yerevan, Armenia   

   Similarity_Score  \
0          0.436298   
1          0.419878   
2          0.412981   
3          0.345645   
4          0.313760   

                                                   

In [89]:
"""
Export processed datasets and results to CSV files.
"""
df_jobs_clean.to_csv('cleaned_job_posts.csv', index=False)
df_resumes_clean.to_csv('cleaned_resumes.csv', index=False)

if 'enhanced_matches' in locals():
    enhanced_matches.to_csv('job_resume_matches.csv', index=False)

if 'search_results' in locals():
    search_results.to_csv('job_search_results.csv', index=False)

print("Exported files:")
print("  - cleaned_job_posts.csv")
print("  - cleaned_resumes.csv")
if 'enhanced_matches' in locals():
    print("  - job_resume_matches.csv")
if 'search_results' in locals():
    print("  - job_search_results.csv")


Exported files:
  - cleaned_job_posts.csv
  - cleaned_resumes.csv
  - job_search_results.csv


In [90]:
"""
System summary and statistics.
"""
print("AI RECRUITMENT SYSTEM SUMMARY")
print("=" * 50)
print(f"\nDataset Statistics:")
print(f"  Total Job Posts: {len(df_jobs_clean)}")
print(f"  Total Resumes: {len(df_resumes_clean)}")
print(f"  Resume Categories: {df_resumes_clean['Category'].nunique()}")
print(f"  Unique Companies: {df_jobs_clean['Company'].nunique()}")

print(f"\nSystem Features:")
print("  - Job posting analysis and cleaning")
print("  - Resume parsing and skill extraction")
print("  - TF-IDF based text similarity matching")
print("  - BERT semantic matching")
print("  - Enhanced matching with skill overlap")
print("  - Interactive job search by keywords")
print("  - Results export to CSV")


AI RECRUITMENT SYSTEM SUMMARY

Dataset Statistics:
  Total Job Posts: 822
  Total Resumes: 2484
  Resume Categories: 24
  Unique Companies: 416

System Features:
  - Job posting analysis and cleaning
  - Resume parsing and skill extraction
  - TF-IDF based text similarity matching
  - BERT semantic matching
  - Enhanced matching with skill overlap
  - Interactive job search by keywords
  - Results export to CSV


In [91]:
"""
Find top candidates for a job posting using BERT (preferred) or TF-IDF matching.
"""
def top_candidates_for_job(job_index=None, title_contains=None, top_n=10):
    """
    Find top matching candidates for a job by index or title search.
    
    Args:
        job_index: Index of job in df_jobs_clean (optional)
        title_contains: Substring to search in job titles (optional)
        top_n: Number of top candidates to return
    
    Returns:
        DataFrame with ranked candidates
    """
    required = ['df_jobs_clean', 'df_resumes_clean']
    for v in required:
        if v not in globals():
            raise RuntimeError(f"{v} not found. Run preprocessing cells first.")
    
    if job_index is None and not title_contains:
        raise ValueError("Provide either job_index or title_contains.")

    if job_index is None:
        mask = df_jobs_clean['Title'].fillna('').str.contains(title_contains, case=False, na=False)
        if not mask.any():
            raise ValueError(f"No job found with title containing: {title_contains}")
        job_index = mask.idxmax()

    job_row = df_jobs_clean.iloc[job_index]
    print(f"Job [{job_index}] — {job_row['Title']} | {job_row.get('Company','N/A')} | {job_row.get('Location','N/A')}")

    use_roberta = 'find_best_matches_roberta' in globals() and BERT_AVAILABLE and (roberta_model is not None)
    use_bert = 'find_best_matches_bert' in globals() and BERT_AVAILABLE and (bert_model is not None)
    results = None

    if use_roberta:
        try:
            results = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n=top_n)
            results = results[['Rank','Resume_ID','Category','RoBERTa_Similarity_Score','Resume_Text']]
            results = results.rename(columns={'RoBERTa_Similarity_Score':'Score'})
            method = "RoBERTa"
        except Exception as e:
            print(f"RoBERTa matching unavailable: {e}")
            results = None

    if results is None and use_bert:
        try:
            results = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n=top_n)
            results = results[['Rank','Resume_ID','Category','BERT_Similarity_Score','Resume_Text']]
            results = results.rename(columns={'BERT_Similarity_Score':'Score'})
            method = "BERT"
        except Exception as e:
            print(f"BERT matching unavailable: {e}")
            results = None

    if results is None:
        required_tfidf = ['job_tfidf','resume_tfidf','find_best_matches']
        if not all(r in globals() for r in required_tfidf):
            raise RuntimeError("TF-IDF artifacts missing. Run TF-IDF vectorization cell first.")
        results = find_best_matches(job_index, resume_tfidf, job_tfidf, df_resumes_clean, top_n=top_n)
        results = results[['Rank','Resume_ID','Category','Similarity_Score','Resume_Text']]
        results = results.rename(columns={'Similarity_Score':'Score'})
        method = "TF-IDF"

    print(f"\nTop {top_n} candidates ({method}):")
    display(results)
    return results

top10 = top_candidates_for_job(job_index=0, top_n=10)

Job [0] — Chief Financial Officer | AMERIA Investment Consulting Company | Yerevan, Armenia

Top 10 candidates (RoBERTa):


Unnamed: 0,Rank,Resume_ID,Category,Score,Resume_Text
0,1,16507693,AGRICULTURE,0.842645,BUDGET ANALYST SERIES 0560 Summary Accounting Skills Knowledge of automa...
1,2,19243556,FINANCE,0.820709,"DIRECTOR OF FINANCE Executive Profile Dynamic, results-oriented Controller wi..."
2,3,19234823,ADVOCATE,0.816237,FINANCE DIRECTOR Professional Summary To find a new and challenging position t...
3,4,38441665,FINANCE,0.808682,"FINANCE DIRECTOR Professional Summary Results oriented, dependable and mo..."
4,5,26767199,FINANCE,0.803197,FINANCE MANAGER Summary Flexible Financial Manager with the ability to mult...
5,6,27409087,FINANCE,0.802374,HEAD OF ACCOUNTS AND FINANCE Summary Flexible Accountant who adapts seaml...
6,7,12802330,ACCOUNTANT,0.802321,"LEAD ACCOUNTANT Highlights QuickBooks, Peachtree, In-house Account..."
7,8,14722634,FINANCE,0.80147,FINANCE DIRECTOR Summary Remarkably astute and analytical professional with ov...
8,9,23139819,ACCOUNTANT,0.79229,ACCOUNTANT Summary \n\nAccomplished professional with exceptional skills\ndeve...
9,10,20918464,FINANCE,0.791255,SENIOR ACCOUNTANT / FINANCE CONTROLLER Summary Aim to work for...


In [92]:
"""
Find best matching jobs for a candidate resume using BERT (preferred) or TF-IDF matching.
"""
def find_best_jobs_for_resume(resume_index, df_jobs, df_resumes, job_tfidf, resume_tfidf, top_n=10):
    """
    Find top matching jobs for a resume using TF-IDF cosine similarity.
    
    Args:
        resume_index: Index of the resume in df_resumes
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        job_tfidf: TF-IDF matrix for jobs
        resume_tfidf: TF-IDF matrix for resumes
        top_n: Number of top matches to return
    
    Returns:
        DataFrame with ranked job matches and similarity scores
    """
    try:
        resume_vector = resume_tfidf[resume_index]
        similarities = cosine_similarity(resume_vector, job_tfidf).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            job = df_jobs.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Title': job['Title'],
                'Company': job.get('Company', 'N/A'),
                'Location': job.get('Location', 'N/A'),
                'Similarity_Score': similarities[idx],
                'Description': job.get('JobDescription', 'N/A')[:200] + '...' if pd.notna(job.get('JobDescription', '')) else 'N/A'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"TF-IDF matching error: {e}")
        return pd.DataFrame()

def find_best_jobs_for_resume_bert(resume_index, df_jobs, df_resumes, top_n=10):
    """
    Find top matching jobs for a resume using BERT semantic embeddings.
    
    Args:
        resume_index: Index of the resume in df_resumes
        df_jobs: DataFrame of job postings with 'CleanText' column
        df_resumes: DataFrame of resumes with 'CleanText' column
        top_n: Number of top matches to return
    
    Returns:
        DataFrame with ranked job matches and similarity scores
    """
    if not BERT_AVAILABLE or bert_model is None:
        return pd.DataFrame()
    
    try:
        resume_text = df_resumes.iloc[resume_index]['CleanText']
        job_texts = df_jobs['CleanText'].tolist()
        
        resume_embedding = bert_model.encode([resume_text])
        job_embeddings = bert_model.encode(job_texts)
        
        similarities = cosine_similarity(resume_embedding, job_embeddings).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            job = df_jobs.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Title': job['Title'],
                'Company': job.get('Company', 'N/A'),
                'Location': job.get('Location', 'N/A'),
                'BERT_Similarity_Score': similarities[idx],
                'Description': job.get('JobDescription', 'N/A')[:200] + '...' if pd.notna(job.get('JobDescription', '')) else 'N/A'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"BERT matching error: {e}")
        return pd.DataFrame()

def find_best_jobs_for_resume_roberta(resume_index, df_jobs, df_resumes, top_n=10):
    """
    Find top matching jobs for a resume using RoBERTa semantic embeddings.
    
    Args:
        resume_index: Index of the resume in df_resumes
        df_jobs: DataFrame of job postings with 'CleanText' column
        df_resumes: DataFrame of resumes with 'CleanText' column
        top_n: Number of top matches to return
    
    Returns:
        DataFrame with ranked job matches and similarity scores
    """
    if not BERT_AVAILABLE or roberta_model is None:
        return pd.DataFrame()
    
    try:
        resume_text = df_resumes.iloc[resume_index]['CleanText']
        job_texts = df_jobs['CleanText'].tolist()
        
        resume_embedding = roberta_model.encode([resume_text])
        job_embeddings = roberta_model.encode(job_texts)
        
        similarities = cosine_similarity(resume_embedding, job_embeddings).flatten()
        top_indices = similarities.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            job = df_jobs.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Title': job['Title'],
                'Company': job.get('Company', 'N/A'),
                'Location': job.get('Location', 'N/A'),
                'RoBERTa_Similarity_Score': similarities[idx],
                'Description': job.get('JobDescription', 'N/A')[:200] + '...' if pd.notna(job.get('JobDescription', '')) else 'N/A'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"RoBERTa matching error: {e}")
        return pd.DataFrame()

def enhanced_jobs_for_resume_bert(resume_index, df_jobs, df_resumes, top_n=10, skill_weight=0.3, bert_weight=0.7):
    """
    Enhanced matching combining BERT semantic similarity with skill overlap.
    
    Args:
        resume_index: Index of the resume in df_resumes
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        top_n: Number of top matches to return
        skill_weight: Weight for skill overlap score (default 0.3)
        bert_weight: Weight for BERT similarity score (default 0.7)
    
    Returns:
        DataFrame with ranked job matches and combined scores
    """
    if not BERT_AVAILABLE or bert_model is None:
        return pd.DataFrame()
    
    try:
        resume = df_resumes.iloc[resume_index]
        resume_text = resume['CleanText']
        resume_skills = set(resume['Skills']) if resume['Skills'] else set()
        
        resume_embedding = bert_model.encode([resume_text])
        job_texts = df_jobs['CleanText'].tolist()
        job_embeddings = bert_model.encode(job_texts)
        
        bert_similarities = cosine_similarity(resume_embedding, job_embeddings).flatten()
        
        skill_scores = []
        for idx, job in df_jobs.iterrows():
            job_skills = set(job['Skills']) if job['Skills'] else set()
            
            if resume_skills and job_skills:
                overlap = len(resume_skills.intersection(job_skills))
                skill_score = overlap / len(resume_skills) if resume_skills else 0
            else:
                skill_score = 0
            
            skill_scores.append(skill_score)
        
        combined_scores = bert_weight * bert_similarities + skill_weight * np.array(skill_scores)
        top_indices = combined_scores.argsort()[-top_n:][::-1]
        
        results = []
        for i, idx in enumerate(top_indices):
            job = df_jobs.iloc[idx]
            results.append({
                'Rank': i + 1,
                'Title': job['Title'],
                'Company': job.get('Company', 'N/A'),
                'Location': job.get('Location', 'N/A'),
                'BERT_Similarity_Score': bert_similarities[idx],
                'Skill_Overlap_Score': skill_scores[idx],
                'Combined_Score': combined_scores[idx],
                'Description': job.get('JobDescription', 'N/A')[:200] + '...' if pd.notna(job.get('JobDescription', '')) else 'N/A'
            })
        
        return pd.DataFrame(results)
    except Exception as e:
        print(f"Enhanced BERT matching error: {e}")
        return pd.DataFrame()

def find_jobs_for_candidate(resume_index=None, resume_id=None, top_n=10):
    """
    Find top matching jobs for a candidate by resume index or ID.
    
    Args:
        resume_index: Index of resume in df_resumes_clean (optional)
        resume_id: ID of the resume (optional)
        top_n: Number of top job matches to return
    
    Returns:
        DataFrame with ranked job matches
    """
    # Access variables from global scope (Jupyter notebook)
    g = globals()
    
    if 'df_jobs_clean' not in g:
        raise RuntimeError("df_jobs_clean not found. Please run the data preprocessing cell (Cell 11) first.")
    if 'df_resumes_clean' not in g:
        raise RuntimeError("df_resumes_clean not found. Please run the data preprocessing cell (Cell 11) first.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    if resume_index is None and resume_id is None:
        raise ValueError("Provide either resume_index or resume_id.")
    
    if resume_index is None:
        mask = df_resumes_clean['ID'] == resume_id
        if not mask.any():
            raise ValueError(f"No resume found with ID: {resume_id}")
        resume_index = mask.idxmax()
    
    resume_row = df_resumes_clean.iloc[resume_index]
    print(f"Resume [{resume_index}] — ID: {resume_row['ID']} | Category: {resume_row.get('Category','N/A')}")
    print(f"Resume preview: {resume_row['Resume_str'][:150]}...")
    
    use_roberta = ('find_best_jobs_for_resume_roberta' in g and 
                   'BERT_AVAILABLE' in g and g.get('BERT_AVAILABLE', False) and 
                   g.get('roberta_model') is not None)
    use_bert = ('find_best_jobs_for_resume_bert' in g and 
                'BERT_AVAILABLE' in g and g.get('BERT_AVAILABLE', False) and 
                g.get('bert_model') is not None)
    results = None
    
    if use_roberta:
        try:
            results = g['find_best_jobs_for_resume_roberta'](resume_index, df_jobs_clean, df_resumes_clean, top_n=top_n)
            results = results[['Rank','Title','Company','Location','RoBERTa_Similarity_Score','Description']]
            results = results.rename(columns={'RoBERTa_Similarity_Score':'Score'})
            method = "RoBERTa"
        except Exception as e:
            print(f"RoBERTa matching unavailable: {e}")
            results = None
    
    if results is None and use_bert:
        try:
            results = g['find_best_jobs_for_resume_bert'](resume_index, df_jobs_clean, df_resumes_clean, top_n=top_n)
            results = results[['Rank','Title','Company','Location','BERT_Similarity_Score','Description']]
            results = results.rename(columns={'BERT_Similarity_Score':'Score'})
            method = "BERT"
        except Exception as e:
            print(f"BERT matching unavailable: {e}")
            results = None
    
    if results is None:
        if 'job_tfidf' not in g:
            raise RuntimeError("job_tfidf not found. Please run the TF-IDF vectorization cell (Cell 12) first.")
        if 'resume_tfidf' not in g:
            raise RuntimeError("resume_tfidf not found. Please run the TF-IDF vectorization cell (Cell 12) first.")
        
        results = g['find_best_jobs_for_resume'](resume_index, df_jobs_clean, df_resumes_clean, 
                                                 g['job_tfidf'], g['resume_tfidf'], top_n=top_n)
        results = results[['Rank','Title','Company','Location','Similarity_Score','Description']]
        results = results.rename(columns={'Similarity_Score':'Score'})
        method = "TF-IDF"
    
    print(f"\nTop {top_n} matching jobs ({method}):")
    display(results)
    return results

In [93]:
"""
Batch processing functions for multiple jobs/resumes matching.
"""
def batch_find_jobs_for_resume(resume_indices, df_jobs, df_resumes, job_tfidf, resume_tfidf, 
                                model_type='bert', top_n=10):
    """
    Find top N jobs for multiple resumes (batch processing).
    
    Args:
        resume_indices: List of resume indices or single index
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        job_tfidf: TF-IDF matrix for jobs
        resume_tfidf: TF-IDF matrix for resumes
        model_type: 'bert', 'roberta', or 'tfidf'
        top_n: Number of top jobs to return per resume
    
    Returns:
        Dictionary mapping resume_index to DataFrame of top jobs
    """
    if isinstance(resume_indices, int):
        resume_indices = [resume_indices]
    
    results = {}
    g = globals()
    
    for resume_idx in resume_indices:
        if model_type.lower() == 'bert':
            if 'find_best_jobs_for_resume_bert' in g:
                results[resume_idx] = find_best_jobs_for_resume_bert(
                    resume_idx, df_jobs, df_resumes, top_n=top_n
                )
        elif model_type.lower() == 'roberta':
            if 'find_best_jobs_for_resume_roberta' in g:
                results[resume_idx] = find_best_jobs_for_resume_roberta(
                    resume_idx, df_jobs, df_resumes, top_n=top_n
                )
        else:  # TF-IDF
            results[resume_idx] = find_best_jobs_for_resume(
                resume_idx, df_jobs, df_resumes, job_tfidf, resume_tfidf, top_n=top_n
            )
    
    return results

def batch_find_resumes_for_job(job_indices, df_jobs, df_resumes, job_tfidf, resume_tfidf,
                                model_type='bert', top_n=10):
    """
    Find top N resumes for multiple jobs (batch processing).
    
    Args:
        job_indices: List of job indices or single index
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        job_tfidf: TF-IDF matrix for jobs
        resume_tfidf: TF-IDF matrix for resumes
        model_type: 'bert', 'roberta', or 'tfidf'
        top_n: Number of top resumes to return per job
    
    Returns:
        Dictionary mapping job_index to DataFrame of top resumes
    """
    if isinstance(job_indices, int):
        job_indices = [job_indices]
    
    results = {}
    g = globals()
    
    for job_idx in job_indices:
        if model_type.lower() == 'bert':
            if 'find_best_matches_bert' in g:
                results[job_idx] = find_best_matches_bert(
                    job_idx, df_jobs, df_resumes, top_n=top_n
                )
        elif model_type.lower() == 'roberta':
            if 'find_best_matches_roberta' in g:
                results[job_idx] = find_best_matches_roberta(
                    job_idx, df_jobs, df_resumes, top_n=top_n
                )
        else:  # TF-IDF
            results[job_idx] = find_best_matches(
                job_idx, resume_tfidf, job_tfidf, df_resumes, top_n=top_n
            )
    
    return results

def get_top_n_jobs_for_resume(resume_index, n=10, model='bert'):
    """
    Get top N jobs for a single resume.
    
    Args:
        resume_index: Index of resume in df_resumes_clean
        n: Number of top jobs to return
        model: 'bert', 'roberta', or 'tfidf'
    
    Returns:
        DataFrame with top N matching jobs
    """
    g = globals()
    
    if 'df_jobs_clean' not in g or 'df_resumes_clean' not in g:
        raise RuntimeError("df_jobs_clean or df_resumes_clean not found. Run preprocessing cells first.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    resume_row = df_resumes_clean.iloc[resume_index]
    print(f"Resume [{resume_index}] — ID: {resume_row['ID']} | Category: {resume_row.get('Category','N/A')}")
    
    if model.lower() == 'bert' and 'find_best_jobs_for_resume_bert' in g:
        results = find_best_jobs_for_resume_bert(resume_index, df_jobs_clean, df_resumes_clean, top_n=n)
        results = results.rename(columns={'BERT_Similarity_Score': 'Score'})
    elif model.lower() == 'roberta' and 'find_best_jobs_for_resume_roberta' in g:
        results = find_best_jobs_for_resume_roberta(resume_index, df_jobs_clean, df_resumes_clean, top_n=n)
        results = results.rename(columns={'RoBERTa_Similarity_Score': 'Score'})
    else:
        if 'job_tfidf' not in g or 'resume_tfidf' not in g:
            raise RuntimeError("TF-IDF artifacts missing. Run TF-IDF vectorization cell first.")
        results = find_best_jobs_for_resume(resume_index, df_jobs_clean, df_resumes_clean, 
                                           g['job_tfidf'], g['resume_tfidf'], top_n=n)
        results = results.rename(columns={'Similarity_Score': 'Score'})
    
    print(f"\nTop {n} matching jobs ({model.upper()}):")
    display(results)
    return results

def get_top_n_resumes_for_job(job_index, n=10, model='bert'):
    """
    Get top N resumes for a single job.
    
    Args:
        job_index: Index of job in df_jobs_clean
        n: Number of top resumes to return
        model: 'bert', 'roberta', or 'tfidf'
    
    Returns:
        DataFrame with top N matching resumes
    """
    g = globals()
    
    if 'df_jobs_clean' not in g or 'df_resumes_clean' not in g:
        raise RuntimeError("df_jobs_clean or df_resumes_clean not found. Run preprocessing cells first.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    job_row = df_jobs_clean.iloc[job_index]
    print(f"Job [{job_index}] — {job_row['Title']} | {job_row.get('Company','N/A')}")
    
    if model.lower() == 'bert' and 'find_best_matches_bert' in g:
        results = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n=n)
        results = results.rename(columns={'BERT_Similarity_Score': 'Score'})
    elif model.lower() == 'roberta' and 'find_best_matches_roberta' in g:
        results = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n=n)
        results = results.rename(columns={'RoBERTa_Similarity_Score': 'Score'})
    else:
        if 'job_tfidf' not in g or 'resume_tfidf' not in g:
            raise RuntimeError("TF-IDF artifacts missing. Run TF-IDF vectorization cell first.")
        results = find_best_matches(job_index, g['resume_tfidf'], g['job_tfidf'], df_resumes_clean, top_n=n)
        results = results.rename(columns={'Similarity_Score': 'Score'})
    
    print(f"\nTop {n} matching resumes ({model.upper()}):")
    display(results)
    return results

# Example usage:
# Get top 10 jobs for resume index 0 using BERT
# jobs = get_top_n_jobs_for_resume(resume_index=0, n=10, model='bert')

# Get top 10 resumes for job index 0 using RoBERTa
# resumes = get_top_n_resumes_for_job(job_index=0, n=10, model='roberta')

# Batch processing: Get top 5 jobs for multiple resumes
# batch_results = batch_find_jobs_for_resume([0, 1, 2], df_jobs_clean, df_resumes_clean, 
#                                            job_tfidf, resume_tfidf, model_type='bert', top_n=5)


In [94]:
"""
Generate Match Summary for hiring managers.
Provides detailed analysis of candidate-job fit including skills match, gaps, and alignment level.
"""
def generate_match_summary(job_index, resume_index, df_jobs, df_resumes, 
                           similarity_score, model_type='bert', threshold=0.6):
    """
    Generate a comprehensive match summary for hiring managers.
    
    Args:
        job_index: Index of job in df_jobs
        resume_index: Index of resume in df_resumes
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        similarity_score: Calculated similarity score
        model_type: Type of model used ('bert', 'roberta', 'tfidf')
        threshold: Minimum score threshold for match (default 0.6)
    
    Returns:
        Dictionary containing match summary components
    """
    job = df_jobs.iloc[job_index]
    resume = df_resumes.iloc[resume_index]
    
    job_skills = set(job.get('Skills', [])) if job.get('Skills') else set()
    resume_skills = set(resume.get('Skills', [])) if resume.get('Skills') else set()
    
    # Calculate skill overlap
    matching_skills = job_skills.intersection(resume_skills)
    missing_skills = job_skills - resume_skills
    extra_skills = resume_skills - job_skills
    
    skill_match_ratio = len(matching_skills) / len(job_skills) if job_skills else 0
    
    # Determine alignment level
    if similarity_score >= 0.75 and skill_match_ratio >= 0.7:
        alignment_level = "Excellent Match"
        recommendation = "Strongly Recommended"
    elif similarity_score >= 0.65 and skill_match_ratio >= 0.5:
        alignment_level = "Good Match"
        recommendation = "Recommended"
    elif similarity_score >= threshold and skill_match_ratio >= 0.3:
        alignment_level = "Moderate Match"
        recommendation = "Consider with Training"
    else:
        alignment_level = "Weak Match"
        recommendation = "Not Recommended"
    
    # Generate summary text
    summary = {
        'candidate_id': resume.get('ID', 'N/A'),
        'candidate_category': resume.get('Category', 'N/A'),
        'job_title': job.get('Title', 'N/A'),
        'company': job.get('Company', 'N/A'),
        'similarity_score': round(similarity_score, 3),
        'alignment_level': alignment_level,
        'recommendation': recommendation,
        'why_fit': f"Candidate demonstrates {similarity_score*100:.1f}% semantic alignment with the job requirements. "
                  f"Skill overlap of {skill_match_ratio*100:.1f}% indicates relevant experience.",
        'matching_skills': list(matching_skills)[:10],
        'missing_skills': list(missing_skills)[:10],
        'extra_skills': list(extra_skills)[:10],
        'skill_match_ratio': round(skill_match_ratio, 3),
        'gaps': f"Missing {len(missing_skills)} key skills: {', '.join(list(missing_skills)[:5])}" if missing_skills else "No significant skill gaps identified."
    }
    
    return summary

def format_match_summary(summary_dict):
    """
    Format match summary as a readable report for hiring managers.
    
    Args:
        summary_dict: Dictionary from generate_match_summary()
    
    Returns:
        Formatted string report
    """
    report = f"""
{'='*70}
MATCH SUMMARY REPORT
{'='*70}

Job Position: {summary_dict['job_title']}
Company: {summary_dict['company']}
Candidate ID: {summary_dict['candidate_id']}
Candidate Category: {summary_dict['candidate_category']}

OVERALL ASSESSMENT
{'-'*70}
Similarity Score: {summary_dict['similarity_score']:.3f}
Alignment Level: {summary_dict['alignment_level']}
Recommendation: {summary_dict['recommendation']}

WHY THIS CANDIDATE IS A FIT
{'-'*70}
{summary_dict['why_fit']}

SKILLS ANALYSIS
{'-'*70}
Matching Skills ({len(summary_dict['matching_skills'])}): {', '.join(summary_dict['matching_skills'][:10])}
Skill Match Ratio: {summary_dict['skill_match_ratio']*100:.1f}%

GAPS IDENTIFIED
{'-'*70}
{summary_dict['gaps']}

ADDITIONAL SKILLS
{'-'*70}
Candidate has {len(summary_dict['extra_skills'])} additional skills: {', '.join(summary_dict['extra_skills'][:5])}

{'='*70}
"""
    return report

def get_match_summary_for_candidate(job_index, resume_index, model='bert'):
    """
    Get formatted match summary for a specific job-resume pair.
    
    Args:
        job_index: Index of job
        resume_index: Index of resume
        model: 'bert', 'roberta', or 'tfidf'
    
    Returns:
        Formatted match summary string
    """
    g = globals()
    
    if 'df_jobs_clean' not in g or 'df_resumes_clean' not in g:
        raise RuntimeError("df_jobs_clean or df_resumes_clean not found. Run preprocessing cells first.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    # Calculate similarity score
    if model.lower() == 'bert' and 'find_best_matches_bert' in g:
        results = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n=1000)
        if not results.empty:
            candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
            if not candidate_results.empty:
                score = candidate_results.iloc[0]['BERT_Similarity_Score']
            else:
                score = 0.0
        else:
            score = 0.0
    elif model.lower() == 'roberta' and 'find_best_matches_roberta' in g:
        results = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n=1000)
        if not results.empty:
            candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
            if not candidate_results.empty:
                score = candidate_results.iloc[0]['RoBERTa_Similarity_Score']
            else:
                score = 0.0
        else:
            score = 0.0
    else:
        if 'job_tfidf' not in g or 'resume_tfidf' not in g:
            raise RuntimeError("TF-IDF artifacts missing. Run TF-IDF vectorization cell first.")
        results = find_best_matches(job_index, g['resume_tfidf'], g['job_tfidf'], df_resumes_clean, top_n=1000)
        if not results.empty:
            candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
            if not candidate_results.empty:
                score = candidate_results.iloc[0]['Similarity_Score']
            else:
                score = 0.0
        else:
            score = 0.0
    
    summary = generate_match_summary(job_index, resume_index, df_jobs_clean, df_resumes_clean, 
                                     score, model_type=model)
    return format_match_summary(summary)

# Example usage:
# summary = get_match_summary_for_candidate(job_index=0, resume_index=0, model='bert')
# print(summary)


In [95]:
"""
Generate Feedback Report for rejected candidates.
Provides actionable insights on missing skills, learning paths, and improvements.
"""
def generate_candidate_feedback(job_index, resume_index, df_jobs, df_resumes, 
                               similarity_score, model_type='bert', threshold=0.6):
    """
    Generate detailed feedback report for candidates who don't meet the threshold.
    
    Args:
        job_index: Index of job in df_jobs
        resume_index: Index of resume in df_resumes
        df_jobs: DataFrame of job postings
        df_resumes: DataFrame of resumes
        similarity_score: Calculated similarity score
        model_type: Type of model used
        threshold: Minimum score threshold (default 0.6)
    
    Returns:
        Dictionary containing feedback components
    """
    job = df_jobs.iloc[job_index]
    resume = df_resumes.iloc[resume_index]
    
    job_skills = set(job.get('Skills', [])) if job.get('Skills') else set()
    resume_skills = set(resume.get('Skills', [])) if resume.get('Skills') else set()
    
    missing_skills = job_skills - resume_skills
    matching_skills = job_skills.intersection(resume_skills)
    
    skill_gap = len(missing_skills) / len(job_skills) if job_skills else 1.0
    
    # Determine improvement areas
    improvement_areas = []
    if similarity_score < threshold:
        improvement_areas.append("Overall profile alignment needs improvement")
    if skill_gap > 0.3:
        improvement_areas.append(f"Missing {len(missing_skills)} critical skills")
    if len(matching_skills) < 3:
        improvement_areas.append("Limited overlap with required skills")
    
    # Generate learning path suggestions
    learning_paths = []
    skill_categories = {
        'technical': ['python', 'java', 'javascript', 'sql', 'database', 'aws', 'docker'],
        'soft_skills': ['leadership', 'communication', 'project management', 'agile', 'scrum'],
        'domain': ['machine learning', 'data science', 'web development', 'mobile development']
    }
    
    for skill in list(missing_skills)[:5]:
        category = 'technical'
        for cat, skills in skill_categories.items():
            if any(s in skill.lower() for s in skills):
                category = cat
                break
        
        if category == 'technical':
            learning_paths.append(f"{skill}: Consider online courses (Coursera, Udemy) or certification programs")
        elif category == 'soft_skills':
            learning_paths.append(f"{skill}: Practice through projects, mentorship, or workshops")
        else:
            learning_paths.append(f"{skill}: Build portfolio projects and gain hands-on experience")
    
    # Actionable improvements
    improvements = []
    if similarity_score < 0.5:
        improvements.append("Enhance resume keywords to better match job description terminology")
    if len(matching_skills) < len(job_skills) * 0.5:
        improvements.append(f"Focus on acquiring top {min(5, len(missing_skills))} missing skills: {', '.join(list(missing_skills)[:5])}")
    improvements.append("Highlight relevant projects and experiences more prominently")
    improvements.append("Consider obtaining relevant certifications to strengthen profile")
    
    feedback = {
        'candidate_id': resume.get('ID', 'N/A'),
        'job_title': job.get('Title', 'N/A'),
        'company': job.get('Company', 'N/A'),
        'current_score': round(similarity_score, 3),
        'threshold': threshold,
        'meets_threshold': similarity_score >= threshold,
        'missing_skills': list(missing_skills),
        'matching_skills': list(matching_skills),
        'skill_gap_percentage': round(skill_gap * 100, 1),
        'improvement_areas': improvement_areas,
        'learning_paths': learning_paths[:5],
        'actionable_improvements': improvements
    }
    
    return feedback

def format_candidate_feedback(feedback_dict):
    """
    Format candidate feedback as a readable report.
    
    Args:
        feedback_dict: Dictionary from generate_candidate_feedback()
    
    Returns:
        Formatted string report
    """
    report = f"""
{'='*70}
CANDIDATE FEEDBACK REPORT
{'='*70}

Job Position: {feedback_dict['job_title']}
Company: {feedback_dict['company']}
Candidate ID: {feedback_dict['candidate_id']}

CURRENT STATUS
{'-'*70}
Match Score: {feedback_dict['current_score']:.3f}
Threshold: {feedback_dict['threshold']}
Status: {'Meets Requirements' if feedback_dict['meets_threshold'] else 'Below Threshold'}

SKILLS ANALYSIS
{'-'*70}
Matching Skills: {len(feedback_dict['matching_skills'])}
Missing Skills: {len(feedback_dict['missing_skills'])}
Skill Gap: {feedback_dict['skill_gap_percentage']}%

WHAT YOU'RE MISSING
{'-'*70}
"""
    
    if feedback_dict['missing_skills']:
        report += f"Critical missing skills:\n"
        for i, skill in enumerate(feedback_dict['missing_skills'][:10], 1):
            report += f"  {i}. {skill}\n"
    else:
        report += "No significant missing skills identified.\n"
    
    report += f"""
IMPROVEMENT AREAS
{'-'*70}
"""
    for area in feedback_dict['improvement_areas']:
        report += f"  • {area}\n"
    
    report += f"""
SUGGESTED LEARNING PATHS
{'-'*70}
"""
    for path in feedback_dict['learning_paths']:
        report += f"  • {path}\n"
    
    report += f"""
ACTIONABLE IMPROVEMENTS
{'-'*70}
"""
    for improvement in feedback_dict['actionable_improvements']:
        report += f"  • {improvement}\n"
    
    report += f"""
NEXT STEPS
{'-'*70}
1. Focus on acquiring the top 3-5 missing skills
2. Update resume with relevant keywords from job description
3. Build portfolio projects demonstrating required skills
4. Consider relevant certifications or courses
5. Re-apply once improvements are made

{'='*70}
"""
    return report

def get_feedback_for_candidate(job_index, resume_index, model='bert', threshold=0.6):
    """
    Get formatted feedback report for a candidate.
    
    Args:
        job_index: Index of job
        resume_index: Index of resume
        model: 'bert', 'roberta', or 'tfidf'
        threshold: Minimum score threshold
    
    Returns:
        Formatted feedback report string
    """
    g = globals()
    
    if 'df_jobs_clean' not in g or 'df_resumes_clean' not in g:
        raise RuntimeError("df_jobs_clean or df_resumes_clean not found. Run preprocessing cells first.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    # Calculate similarity score
    if model.lower() == 'bert' and 'find_best_matches_bert' in g:
        results = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n=1000)
        if not results.empty:
            candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
            if not candidate_results.empty:
                score = candidate_results.iloc[0]['BERT_Similarity_Score']
            else:
                score = 0.0
        else:
            score = 0.0
    elif model.lower() == 'roberta' and 'find_best_matches_roberta' in g:
        results = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n=1000)
        if not results.empty:
            candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
            if not candidate_results.empty:
                score = candidate_results.iloc[0]['RoBERTa_Similarity_Score']
            else:
                score = 0.0
        else:
            score = 0.0
    else:
        if 'job_tfidf' not in g or 'resume_tfidf' not in g:
            raise RuntimeError("TF-IDF artifacts missing. Run TF-IDF vectorization cell first.")
        results = find_best_matches(job_index, g['resume_tfidf'], g['job_tfidf'], df_resumes_clean, top_n=1000)
        if not results.empty:
            candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
            if not candidate_results.empty:
                score = candidate_results.iloc[0]['Similarity_Score']
            else:
                score = 0.0
        else:
            score = 0.0
    
    feedback = generate_candidate_feedback(job_index, resume_index, df_jobs_clean, df_resumes_clean, 
                                          score, model_type=model, threshold=threshold)
    return format_candidate_feedback(feedback)

# Example usage:
# feedback = get_feedback_for_candidate(job_index=0, resume_index=100, model='bert', threshold=0.6)
# print(feedback)


In [96]:
"""
Evaluation Metrics for matching system performance.
Includes Accuracy, Precision/Recall, and text generation metrics (ROUGE/BLEU/METEOR).
"""
try:
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
    from sklearn.metrics import confusion_matrix
    METRICS_AVAILABLE = True
except ImportError:
    METRICS_AVAILABLE = False

try:
    from rouge_score import rouge_scorer
    ROUGE_AVAILABLE = True
except ImportError:
    ROUGE_AVAILABLE = False
    try:
        import nltk
        nltk.download('punkt', quiet=True)
        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
        from nltk.translate.meteor_score import meteor_score
        BLEU_METEOR_AVAILABLE = True
    except:
        BLEU_METEOR_AVAILABLE = False

def evaluate_matching_accuracy(true_matches, predicted_matches, top_k=10):
    """
    Evaluate matching accuracy using ground truth labels.
    
    Args:
        true_matches: Dictionary {job_index: [list of true resume IDs]}
        predicted_matches: Dictionary {job_index: DataFrame with predicted matches}
        top_k: Number of top predictions to consider
    
    Returns:
        Dictionary with accuracy metrics
    """
    if not METRICS_AVAILABLE:
        return {"error": "sklearn not available"}
    
    all_true = []
    all_pred = []
    
    for job_idx in true_matches.keys():
        if job_idx in predicted_matches:
            true_resume_ids = set(true_matches[job_idx])
            pred_df = predicted_matches[job_idx]
            
            if not pred_df.empty and 'Resume_ID' in pred_df.columns:
                pred_resume_ids = set(pred_df.head(top_k)['Resume_ID'].tolist())
                
                # Binary classification: 1 if in top-k, 0 otherwise
                for resume_id in true_resume_ids:
                    all_true.append(1)
                    all_pred.append(1 if resume_id in pred_resume_ids else 0)
    
    if len(all_true) == 0:
        return {"error": "No matching data available"}
    
    accuracy = accuracy_score(all_true, all_pred)
    precision = precision_score(all_true, all_pred, zero_division=0)
    recall = recall_score(all_true, all_pred, zero_division=0)
    f1 = f1_score(all_true, all_pred, zero_division=0)
    
    return {
        'accuracy': round(accuracy, 3),
        'precision': round(precision, 3),
        'recall': round(recall, 3),
        'f1_score': round(f1, 3),
        'total_samples': len(all_true)
    }

def calculate_rouge_score(reference, generated):
    """
    Calculate ROUGE scores for generated text evaluation.
    
    Args:
        reference: Reference text string
        generated: Generated text string
    
    Returns:
        Dictionary with ROUGE-1, ROUGE-2, ROUGE-L scores
    """
    if not ROUGE_AVAILABLE:
        return {"error": "rouge_score library not available. Install with: pip install rouge-score"}
    
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, generated)
    
    return {
        'rouge1': {
            'precision': round(scores['rouge1'].precision, 3),
            'recall': round(scores['rouge1'].recall, 3),
            'fmeasure': round(scores['rouge1'].fmeasure, 3)
        },
        'rouge2': {
            'precision': round(scores['rouge2'].precision, 3),
            'recall': round(scores['rouge2'].recall, 3),
            'fmeasure': round(scores['rouge2'].fmeasure, 3)
        },
        'rougeL': {
            'precision': round(scores['rougeL'].precision, 3),
            'recall': round(scores['rougeL'].recall, 3),
            'fmeasure': round(scores['rougeL'].fmeasure, 3)
        }
    }

def calculate_bleu_meteor(reference, generated):
    """
    Calculate BLEU and METEOR scores for text evaluation.
    
    Args:
        reference: Reference text (string or list of tokens)
        generated: Generated text (string or list of tokens)
    
    Returns:
        Dictionary with BLEU and METEOR scores
    """
    if not BLEU_METEOR_AVAILABLE:
        return {"error": "NLTK not properly configured"}
    
    try:
        from nltk.tokenize import word_tokenize
        
        if isinstance(reference, str):
            ref_tokens = word_tokenize(reference.lower())
        else:
            ref_tokens = reference
        
        if isinstance(generated, str):
            gen_tokens = word_tokenize(generated.lower())
        else:
            gen_tokens = generated
        
        smoothing = SmoothingFunction().method1
        bleu = sentence_bleu([ref_tokens], gen_tokens, smoothing_function=smoothing)
        
        meteor = meteor_score([ref_tokens], gen_tokens)
        
        return {
            'bleu': round(bleu, 3),
            'meteor': round(meteor, 3)
        }
    except Exception as e:
        return {"error": str(e)}

def evaluate_summary_quality(summary_text, reference_summary=None):
    """
    Evaluate quality of generated match summaries using ROUGE, BLEU, METEOR.
    
    Args:
        summary_text: Generated summary text
        reference_summary: Reference summary for comparison (optional)
    
    Returns:
        Dictionary with evaluation metrics
    """
    results = {}
    
    if reference_summary:
        rouge_scores = calculate_rouge_score(reference_summary, summary_text)
        results['rouge'] = rouge_scores
        
        bleu_meteor = calculate_bleu_meteor(reference_summary, summary_text)
        results['bleu_meteor'] = bleu_meteor
    else:
        results['note'] = "Reference summary not provided. Cannot calculate ROUGE/BLEU/METEOR."
    
    return results

def comprehensive_evaluation(job_index, top_n=10, model='bert'):
    """
    Comprehensive evaluation of matching system for a specific job.
    
    Args:
        job_index: Index of job to evaluate
        top_n: Number of top matches to evaluate
        model: 'bert', 'roberta', or 'tfidf'
    
    Returns:
        Dictionary with comprehensive evaluation metrics
    """
    g = globals()
    
    if 'df_jobs_clean' not in g or 'df_resumes_clean' not in g:
        raise RuntimeError("df_jobs_clean or df_resumes_clean not found.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    job = df_jobs_clean.iloc[job_index]
    job_category_keywords = job.get('Category', '').lower() if 'Category' in job else ''
    
    # Get predictions
    if model.lower() == 'bert' and 'find_best_matches_bert' in g:
        predictions = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n=top_n)
    elif model.lower() == 'roberta' and 'find_best_matches_roberta' in g:
        predictions = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n=top_n)
    else:
        if 'job_tfidf' not in g or 'resume_tfidf' not in g:
            raise RuntimeError("TF-IDF artifacts missing.")
        predictions = find_best_matches(job_index, g['resume_tfidf'], g['job_tfidf'], df_resumes_clean, top_n=top_n)
    
    if predictions.empty:
        return {"error": "No predictions generated"}
    
    # Category-based evaluation (assuming category match is a proxy for correctness)
    if 'Category' in predictions.columns and 'Category' in df_resumes_clean.columns:
        predicted_categories = predictions['Category'].tolist()
        job_category = job.get('Category', '')
        
        # Calculate category match rate
        category_matches = sum(1 for cat in predicted_categories if cat == job_category)
        category_accuracy = category_matches / len(predicted_categories) if predicted_categories else 0
        
        # Calculate average similarity score
        score_col = 'BERT_Similarity_Score' if 'BERT_Similarity_Score' in predictions.columns else \
                   'RoBERTa_Similarity_Score' if 'RoBERTa_Similarity_Score' in predictions.columns else \
                   'Similarity_Score'
        avg_score = predictions[score_col].mean()
        max_score = predictions[score_col].max()
        min_score = predictions[score_col].min()
    else:
        category_accuracy = 0
        score_col = 'Similarity_Score'
        avg_score = predictions[score_col].mean() if score_col in predictions.columns else 0
        max_score = predictions[score_col].max() if score_col in predictions.columns else 0
        min_score = predictions[score_col].min() if score_col in predictions.columns else 0
    
    # Generate sample summary and evaluate
    sample_resume_idx = 0
    try:
        sample_summary = get_match_summary_for_candidate(job_index, sample_resume_idx, model=model)
        summary_length = len(sample_summary.split())
    except:
        sample_summary = ""
        summary_length = 0
    
    evaluation = {
        'job_index': job_index,
        'job_title': job.get('Title', 'N/A'),
        'model_used': model.upper(),
        'top_n': top_n,
        'category_accuracy': round(category_accuracy, 3),
        'average_similarity': round(avg_score, 3),
        'max_similarity': round(max_score, 3),
        'min_similarity': round(min_score, 3),
        'summary_length': summary_length,
        'predictions_count': len(predictions)
    }
    
    return evaluation

# Example usage:
# eval_results = comprehensive_evaluation(job_index=0, top_n=10, model='bert')
# print(eval_results)


In [97]:
"""
Ethical Considerations and Responsible AI Discussion.
Covers bias, fairness, explainability, and responsible AI practices.
"""
def ethical_considerations_report():
    """
    Generate comprehensive ethical considerations report for the AI recruitment system.
    
    Returns:
        Formatted report on ethical considerations
    """
    report = f"""
{'='*70}
ETHICAL CONSIDERATIONS & RESPONSIBLE AI
{'='*70}

1. BIAS MITIGATION
{'-'*70}
Potential Biases Identified:
  • Gender bias: Model may favor certain gender-associated terms
  • Cultural bias: Language patterns may favor certain cultural backgrounds
  •Educational bias: Over-reliance on specific institutions or credentials
  • Experience bias: May favor candidates with more keywords over actual experience

Mitigation Strategies:
  • Use multiple models (BERT, RoBERTa, TF-IDF) to reduce single-model bias
  • Skill-based matching reduces reliance on demographic indicators
  • Transparent scoring allows human review of automated decisions
  • Regular model auditing and bias testing recommended
  • Diverse training data representation

2. FAIRNESS
{'-'*70}
Fairness Measures Implemented:
  • Objective similarity scoring based on content, not demographics
  • Multiple evaluation criteria (semantic + skills) for balanced assessment
  • Threshold-based filtering allows consistent application
  • Category-agnostic matching (does not explicitly filter by candidate category)

Fairness Concerns:
  • Model may still encode societal biases from training data
  • Skill extraction may favor certain terminology styles
  • No explicit demographic fairness checks implemented

Recommendations:
  • Implement demographic parity monitoring
  • Add fairness metrics (equalized odds, demographic parity)
  • Regular bias audits on matched candidates
  • Human-in-the-loop for final hiring decisions

3. EXPLAINABILITY
{'-'*70}
Explainability Features:
  • Match summaries provide clear reasoning for recommendations
  • Skill overlap analysis shows specific matching criteria
  • Similarity scores are transparent and interpretable
  • Feedback reports explain why candidates don't match

Limitations:
  • Deep learning models (BERT/RoBERTa) are black boxes
  • Feature importance not explicitly shown
  • Decision boundaries not clearly defined

Improvements Needed:
  • Add SHAP/LIME explanations for model decisions
  • Highlight most influential words/phrases in matching
  • Provide confidence intervals for scores
  • Visualize decision process flow

4. RESPONSIBLE AI PRACTICES
{'-'*70}
Current Practices:
  • Human oversight: System provides recommendations, not final decisions
  • Transparency: All scores and criteria are visible
  • Accountability: Clear documentation of matching logic
  • Privacy: No personal information stored beyond resume content

Best Practices Implemented:
  • Multiple model ensemble reduces single-point-of-failure
  • Caching system ensures consistent results
  • Error handling prevents silent failures
  • Comprehensive logging for audit trails

Areas for Improvement:
  • Add consent mechanisms for data usage
  • Implement data retention policies
  • Regular model performance monitoring
  • Bias testing framework
  • User feedback collection mechanism

5. DATA PRIVACY & SECURITY
{'-'*70}
Privacy Measures:
  • Resume data processed locally (no external API calls for matching)
  • Cached data stored securely in local files
  • No personally identifiable information (PII) in logs
  • Optional anonymization before processing

Security Considerations:
  • Input validation prevents injection attacks
  • File access restricted to necessary operations
  • No network transmission of sensitive data during matching

6. ACCOUNTABILITY & GOVERNANCE
{'-'*70}
Accountability Framework:
  • Clear documentation of system limitations
  • Transparent scoring methodology
  • Audit trail through match summaries
  • Human review required for final decisions

Governance Recommendations:
  • Regular model performance reviews
  • Bias audit schedule (quarterly recommended)
  • Stakeholder feedback integration
  • Continuous improvement process
  • Compliance with employment regulations

7. LIMITATIONS & DISCLAIMERS
{'-'*70}
System Limitations:
  • AI matching is a tool, not a replacement for human judgment
  • Scores may not reflect all relevant factors (personality, culture fit)
  • Model trained on historical data may perpetuate existing biases
  • Language model limitations may miss nuanced qualifications
  • No guarantee of perfect matches or hiring success

Recommended Usage:
  • Use as initial screening tool to reduce manual review time
  • Always include human review for final hiring decisions
  • Consider multiple factors beyond AI scores
  • Regularly update and retrain models with new data
  • Monitor for bias and adjust thresholds as needed

{'='*70}
"""
    return report

def print_ethical_considerations():
    """Print the ethical considerations report."""
    print(ethical_considerations_report())

# Generate and display ethical considerations
print_ethical_considerations()



ETHICAL CONSIDERATIONS & RESPONSIBLE AI

1. BIAS MITIGATION
----------------------------------------------------------------------
Potential Biases Identified:
  • Gender bias: Model may favor certain gender-associated terms
  • Cultural bias: Language patterns may favor certain cultural backgrounds
  •Educational bias: Over-reliance on specific institutions or credentials
  • Experience bias: May favor candidates with more keywords over actual experience

Mitigation Strategies:
  • Use multiple models (BERT, RoBERTa, TF-IDF) to reduce single-model bias
  • Skill-based matching reduces reliance on demographic indicators
  • Transparent scoring allows human review of automated decisions
  • Regular model auditing and bias testing recommended
  • Diverse training data representation

2. FAIRNESS
----------------------------------------------------------------------
Fairness Measures Implemented:
  • Objective similarity scoring based on content, not demographics
  • Multiple evaluation 

In [98]:
"""
Complete Job-Resume Evaluation System
Generates all three required outputs: similarity score, match summary, and feedback report.
"""
def complete_job_resume_evaluation(job_index, resume_index, model='bert', threshold=0.6):
    """
    Complete evaluation system that generates all three required outputs:
    1. Similarity Score
    2. Match Summary for hiring managers
    3. Feedback report for candidates
    
    Args:
        job_index: Index of job in df_jobs_clean
        resume_index: Index of resume in df_resumes_clean
        model: 'bert', 'roberta', or 'tfidf'
        threshold: Minimum score threshold
    
    Returns:
        Dictionary containing all three outputs
    """
    g = globals()
    
    if 'df_jobs_clean' not in g or 'df_resumes_clean' not in g:
        raise RuntimeError("df_jobs_clean or df_resumes_clean not found. Run preprocessing cells first.")
    
    df_jobs_clean = g['df_jobs_clean']
    df_resumes_clean = g['df_resumes_clean']
    
    # Calculate similarity score
    if model.lower() == 'bert' and 'find_best_matches_bert' in g:
        results = find_best_matches_bert(job_index, df_jobs_clean, df_resumes_clean, top_n=1000)
        score_col = 'BERT_Similarity_Score'
    elif model.lower() == 'roberta' and 'find_best_matches_roberta' in g:
        results = find_best_matches_roberta(job_index, df_jobs_clean, df_resumes_clean, top_n=1000)
        score_col = 'RoBERTa_Similarity_Score'
    else:
        if 'job_tfidf' not in g or 'resume_tfidf' not in g:
            raise RuntimeError("TF-IDF artifacts missing.")
        results = find_best_matches(job_index, g['resume_tfidf'], g['job_tfidf'], df_resumes_clean, top_n=1000)
        score_col = 'Similarity_Score'
    
    if not results.empty:
        candidate_results = results[results['Resume_ID'] == df_resumes_clean.iloc[resume_index]['ID']]
        if not candidate_results.empty:
            similarity_score = candidate_results.iloc[0][score_col]
        else:
            similarity_score = 0.0
    else:
        similarity_score = 0.0
    
    # Generate match summary
    summary_dict = generate_match_summary(job_index, resume_index, df_jobs_clean, df_resumes_clean, 
                                         similarity_score, model_type=model, threshold=threshold)
    match_summary = format_match_summary(summary_dict)
    
    # Generate feedback report
    feedback_dict = generate_candidate_feedback(job_index, resume_index, df_jobs_clean, df_resumes_clean, 
                                                similarity_score, model_type=model, threshold=threshold)
    feedback_report = format_candidate_feedback(feedback_dict)
    
    return {
        'similarity_score': round(similarity_score, 3),
        'match_summary': match_summary,
        'feedback_report': feedback_report,
        'meets_threshold': similarity_score >= threshold,
        'summary_dict': summary_dict,
        'feedback_dict': feedback_dict
    }

def display_complete_evaluation(job_index, resume_index, model='bert', threshold=0.6):
    """
    Display all three outputs in a formatted way.
    
    Args:
        job_index: Index of job
        resume_index: Index of resume
        model: 'bert', 'roberta', or 'tfidf'
        threshold: Minimum score threshold
    """
    results = complete_job_resume_evaluation(job_index, resume_index, model=model, threshold=threshold)
    
    print("="*70)
    print("COMPLETE JOB-RESUME EVALUATION")
    print("="*70)
    print(f"\n1. SIMILARITY SCORE: {results['similarity_score']:.3f}")
    print(f"   Threshold: {threshold}")
    print(f"   Status: {'PASS' if results['meets_threshold'] else 'FAIL'}")
    
    print("\n" + "="*70)
    print("2. MATCH SUMMARY (For Hiring Managers)")
    print("="*70)
    print(results['match_summary'])
    
    if not results['meets_threshold']:
        print("\n" + "="*70)
        print("3. FEEDBACK REPORT (For Candidate)")
        print("="*70)
        print(results['feedback_report'])
    else:
        print("\n" + "="*70)
        print("3. FEEDBACK REPORT (For Candidate)")
        print("="*70)
        print("Candidate meets threshold. Feedback report available upon request.")
        print(results['feedback_report'])

# Example usage:
# display_complete_evaluation(job_index=0, resume_index=0, model='bert', threshold=0.6)


In [99]:
"""
RoBERTa Model Verification and Testing
"""
print("=" * 60)
print("RoBERTa MODEL VERIFICATION")
print("=" * 60)

if 'roberta_model' in globals() and roberta_model is not None:
    print("RoBERTa model: Available")
    print(f"Model type: {type(roberta_model)}")
    
    # Test similarity calculation
    print("\nTesting RoBERTa similarity:")
    test_sim = calculate_roberta_similarity("software engineer", "developer")
    print(f"  'software engineer' ↔ 'developer': {test_sim:.3f}")
    
    # Test matching if data is available
    if 'df_jobs_clean' in globals() and 'df_resumes_clean' in globals():
        print("\nTesting RoBERTa matching:")
        try:
            roberta_test = find_best_matches_roberta(0, df_jobs_clean, df_resumes_clean, top_n=3)
            if not roberta_test.empty:
                print(f"  Found {len(roberta_test)} matches")
                print(f"  Top match score: {roberta_test.iloc[0]['RoBERTa_Similarity_Score']:.3f}")
            else:
                print("  No matches found")
        except Exception as e:
            print(f"  Matching test failed: {e}")
    
    print("\nRoBERTa is ready to use!")
    print("\nExample usage:")
    print("  get_top_n_jobs_for_resume(resume_index=0, n=10, model='roberta')")
    print("  get_top_n_resumes_for_job(job_index=0, n=10, model='roberta')")
    print("  compare_matching_methods(job_index=0, top_n=5)  # Includes RoBERTa")
else:
    print("RoBERTa model: NOT AVAILABLE")
    print("Possible reasons:")
    print("  1. Model initialization failed (check error messages above)")
    print("  2. transformers library not installed")
    print("  3. Model download failed")
    print("\nTo fix: Re-run Cell 7 to initialize RoBERTa model")


RoBERTa MODEL VERIFICATION
RoBERTa model: Available
Model type: <class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>

Testing RoBERTa similarity:
  'software engineer' ↔ 'developer': 0.526

Testing RoBERTa matching:
  Found 3 matches
  Top match score: 0.843

RoBERTa is ready to use!

Example usage:
  get_top_n_jobs_for_resume(resume_index=0, n=10, model='roberta')
  get_top_n_resumes_for_job(job_index=0, n=10, model='roberta')
  compare_matching_methods(job_index=0, top_n=5)  # Includes RoBERTa


In [100]:
jobs_for_candidate = find_jobs_for_candidate(resume_index=1000, top_n=10)

Resume [1000] — ID: 40987524 | Category: SALES
Resume preview:          SALES       Summary    Over 17 years of sales and operations management experience in specialty and big-box retail and 4 years sales experien...

Top 10 matching jobs (RoBERTa):


Unnamed: 0,Rank,Title,Company,Location,Score,Description
0,1,Sales Manager,Ameria CJSC,"Yerevan, Armenia",0.641577,"On behalf of its partner, Ameria CJSC is seeking\r\napplicants for the position of Sales Manager..."
1,2,Customer Care Manager,Lycos Europe,"Yerevan, Armenia",0.637438,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."
2,3,Customer Care Manager,Lycos Europe,"Yerevan, Armenia",0.632793,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."
3,4,Customer Care Manager,Lycos Europe,"Yerevan, Armenia",0.632793,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."
4,5,Customer Care Manager,Lycos Europe,"Yerevan, Armenia",0.632793,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."
5,6,Sales & Marketing Manager,Ard Style,"Yerevan, Armenia",0.624279,Ard Style is looking for an experienced Sales and\r\nMarketing Manager to be responsible for dev...
6,7,Marketing Advisor,ACDI/VOCA,"Tbilisi, Georgia",0.623824,The Marketing Advisor will lead the project in\r\nidentifying and developing market opportunitie...
7,8,Customer Care Co-ordinator,Lycos Europe,"Yerevan, Armenia",0.612038,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."
8,9,Customer Care Co-ordinator,Lycos Europe,"Yerevan, Armenia",0.612038,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."
9,10,Customer Care Co-ordinator,Lycos Europe,"Yerevan, Armenia",0.612038,"To build up our European Sales Support Team in\r\nArmenia, we are currently looking to recruit s..."


In [101]:
# Complete evaluation with all three outputs
display_complete_evaluation(job_index=0, resume_index=0, model='bert', threshold=0.6)

# Just match summary
summary = get_match_summary_for_candidate(job_index=0, resume_index=0, model='bert')
print(summary)

# Just feedback report
feedback = get_feedback_for_candidate(job_index=0, resume_index=100, model='bert', threshold=0.6)
print(feedback)

# Evaluation metrics
eval_results = comprehensive_evaluation(job_index=0, top_n=10, model='bert')
print(eval_results)

COMPLETE JOB-RESUME EVALUATION

1. SIMILARITY SCORE: 0.489
   Threshold: 0.6
   Status: FAIL

2. MATCH SUMMARY (For Hiring Managers)

MATCH SUMMARY REPORT

Job Position: Chief Financial Officer
Company: AMERIA Investment Consulting Company
Candidate ID: 16852973
Candidate Category: HR

OVERALL ASSESSMENT
----------------------------------------------------------------------
Similarity Score: 0.489
Alignment Level: Weak Match
Recommendation: Not Recommended

WHY THIS CANDIDATE IS A FIT
----------------------------------------------------------------------
Candidate demonstrates 48.9% semantic alignment with the job requirements. Skill overlap of 32.6% indicates relevant experience.

SKILLS ANALYSIS
----------------------------------------------------------------------
Matching Skills (10): skill, administration, procedure, leadership, year, law, analysis, budgeting, activity, office
Skill Match Ratio: 32.6%

GAPS IDENTIFIED
---------------------------------------------------------------