# Word Embedding from ChatGPT

Project Overview and Setup
This project aims to reduce recruitment bias by automatically parsing resumes (PDF and DOCX), ranking them using both traditional TF‑IDF and semantic similarity (Word2Vec), mitigating bias by removing demographic indicators, and providing detailed feedback for both recruiters and job seekers.

Features:

* Improved Resume Parsing: Uses pdfplumber for PDFs and python‑docx with regex cleanup for DOCX to extract text with better formatting.
* Word Embeddings (Word2Vec): Trains a Word2Vec model to capture semantic meaning in resumes and job descriptions.
* Bias Mitigation: Detects and removes demographic indicators (such as names and locations) using spaCy’s NER.
* Enhanced Feedback System: Generates detailed feedback, explaining the ranking rationale to recruiters and suggesting improvements to candidates.

# 1: Improved Resume Parsing and Cleaning

In [1]:
# Import necessary libraries for text extraction and regex cleanup
import re
import pdfplumber   # Advanced PDF extraction library
from docx import Document

def extract_text_from_file(file_path):
    """
    Extracts and cleans text from a PDF or DOCX file.
    For PDFs, uses pdfplumber to capture formatted text.
    For DOCX, uses python-docx and regex for cleanup.
    
    Parameters:
        file_path (str): Path to the resume file.
    
    Returns:
        text (str): The extracted and cleaned text.
    """
    text = ""
    if file_path.endswith(".pdf"):
        try:
            # Open the PDF with pdfplumber and extract text from each page.
            with pdfplumber.open(file_path) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text() or ""
                    text += page_text + "\n"
        except Exception as e:
            print(f"Error processing PDF {file_path}: {e}")
    elif file_path.endswith(".docx"):
        try:
            # Open DOCX file and extract paragraphs.
            doc = Document(file_path)
            # Join paragraphs and apply regex cleanup to remove extra spaces/newlines.
            raw_text = "\n".join([para.text for para in doc.paragraphs])
            # Regex cleanup: remove multiple spaces and trim whitespace.
            text = re.sub(r'\s+', ' ', raw_text).strip()
        except Exception as e:
            print(f"Error processing DOCX {file_path}: {e}")
    return text
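
The two branches clean text differently: the PDF branch keeps a line break after each page, while the DOCX branch collapses every run of whitespace, newlines included, into a single space. A minimal sketch of that regex step on a made-up string (the sample text is an assumption, not taken from the dataset):

```python
import re

# Hypothetical raw text, roughly what joining python-docx paragraphs could produce.
raw_text = "  Jane   Roe\n\nSkills:\tPython,   SQL  \n"

# Collapse every run of whitespace (spaces, tabs, newlines) into one space, then trim.
cleaned = re.sub(r"\s+", " ", raw_text).strip()

print(cleaned)  # Jane Roe Skills: Python, SQL
```

Because paragraph boundaries are lost in this step, section headers and body text end up on one line for DOCX resumes.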


# 2: Reading Resumes from Folder

In [3]:
import os

def read_resumes_from_folder(root_folder):
    """
    Recursively reads all PDF and DOCX files from a root folder and its subfolders.
    
    Parameters:
        root_folder (str): The directory containing resume files.
    
    Returns:
        resumes (list): List of extracted resume texts.
    """
    resumes = []
    total_files = 0
    # Walk through directory and subdirectories
    for dirpath, _, filenames in os.walk(root_folder):
        for filename in filenames:
            if filename.endswith(".pdf") or filename.endswith(".docx"):
                total_files += 1
                file_path = os.path.join(dirpath, filename)
                try:
                    text = extract_text_from_file(file_path)
                    resumes.append(text)  # Append the extracted text
                except Exception as e:
                    print(f"Error processing file {file_path}: {e}")
    print(f"Total files found: {total_files}")
    print(f"Total resumes processed: {len(resumes)}")
    return resumes

# Example usage: set dataset_path to your actual resume folder.
dataset_path = "Datasets/data"
all_resumes = read_resumes_from_folder(dataset_path)


Total files found: 2484
Total resumes processed: 2484
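
The folder scan above matches lowercase extensions only, so a file named `CV.PDF` would be skipped. The sketch below shows one possible case-insensitive variant on a throwaway directory tree. The helper name `find_resume_files` and the sample file names are assumptions made for illustration; they are not part of the notebook.

```python
import os
import tempfile

def find_resume_files(root_folder, extensions=(".pdf", ".docx")):
    """Return sorted paths under root_folder whose extension matches, ignoring case."""
    matches = []
    for dirpath, _, filenames in os.walk(root_folder):
        for filename in filenames:
            # str.endswith accepts a tuple; lower() also catches names such as CV.PDF.
            if filename.lower().endswith(extensions):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)

# Build a temporary folder tree with mixed-case extensions and one unrelated file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "HR"))
    for name in ("a.pdf", "b.DOCX", "notes.txt", os.path.join("HR", "c.PDF")):
        open(os.path.join(root, name), "w").close()
    found = [os.path.relpath(path, root) for path in find_resume_files(root)]

print(found)  # three resume files; notes.txt is skipped
```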


# 3: Training the Word2Vec Model

In [4]:
# Import gensim for Word2Vec training and a simple preprocessor for tokenization
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def tokenize_texts(texts):
    """
    Tokenizes a list of texts into lists of words using gensim's simple_preprocess.
    
    Parameters:
        texts (list): List of string texts.
    
    Returns:
        tokenized_texts (list): List of lists, where each sublist contains tokens.
    """
    return [simple_preprocess(text) for text in texts]

# Tokenize all resume texts
tokenized_resumes = tokenize_texts(all_resumes)

# Train the Word2Vec model on the tokenized resumes
# Parameters: vector_size=100 dimensions, window=5 for context, min_count=2 to ignore rare words, and workers=4 for parallelism.
w2v_model = Word2Vec(sentences=tokenized_resumes, vector_size=100, window=5, min_count=2, workers=4)
print("Word2Vec model trained on resume data.")

# Function to compute average Word2Vec similarity between two texts (used later by the feedback step)
import numpy as np

def compute_avg_w2v_similarity(text1, text2, model):
    """
    Computes the average cosine similarity between the word vectors of two texts.
    
    Parameters:
        text1 (str): First text.
        text2 (str): Second text.
        model (Word2Vec): Trained Word2Vec model.
    
    Returns:
        avg_similarity (float): Average cosine similarity between words of the two texts.
    """
    tokens1 = simple_preprocess(text1)
    tokens2 = simple_preprocess(text2)
    # Filter tokens that are in the model's vocabulary.
    tokens1 = [token for token in tokens1 if token in model.wv]
    tokens2 = [token for token in tokens2 if token in model.wv]
    if not tokens1 or not tokens2:
        return 0.0
    similarities = []
    # Compute similarity for each pair and average them.
    for token1 in tokens1:
        sim_scores = [model.wv.similarity(token1, token2) for token2 in tokens2]
        if sim_scores:
            similarities.append(np.mean(sim_scores))
    return np.mean(similarities) if similarities else 0.0


Word2Vec model trained on resume data.
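
To make the scoring in `compute_avg_w2v_similarity` concrete, here is a small sketch of the same idea in plain Python: drop out-of-vocabulary tokens, compute the cosine similarity for every token pair, and average the results. The 2-D vectors and the names `toy_vectors` and `avg_pairwise_similarity` are assumptions chosen for illustration; a trained model would supply 100-dimensional vectors instead.

```python
import math

# Hypothetical 2-D embeddings standing in for model.wv lookups.
toy_vectors = {
    "python": (1.0, 0.0),
    "java": (0.8, 0.6),
    "nurse": (0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_pairwise_similarity(tokens1, tokens2, vectors):
    """Average cosine similarity over all in-vocabulary token pairs."""
    known1 = [t for t in tokens1 if t in vectors]
    known2 = [t for t in tokens2 if t in vectors]
    if not known1 or not known2:
        return 0.0
    scores = [cosine(vectors[a], vectors[b]) for a in known1 for b in known2]
    return sum(scores) / len(scores)

# "unknown" is not in the toy vocabulary, so only "python" is compared.
score = avg_pairwise_similarity(["python", "unknown"], ["java", "nurse"], toy_vectors)
print(round(score, 2))  # 0.4
```

Every token pair carries equal weight here, so unrelated pairs pull the average down even when a few words match closely.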


# **Bias Mitigation Functions**

This section mitigates bias by detecting and removing demographic indicators (e.g., names and locations) from the resume texts. We use spaCy’s Named Entity Recognition (NER) to identify PERSON and GPE (geopolitical entity) spans and remove their tokens from the text. This is a basic debiasing approach that can help reduce the influence of information unrelated to skills. Identifiers outside these two labels, such as email addresses and phone numbers, are not targeted.

# 4: Bias Mitigation – Removing Demographic Indicators

In [5]:
import spacy

# Load the small English model in spaCy for NER.
# (Make sure to install spaCy and the model: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

def remove_demographic_indicators(text):
    """
    Removes demographic indicators such as names and geographical locations from the text.
    Uses spaCy's NER to detect entities labeled as PERSON and GPE.
    
    Parameters:
        text (str): The input text to clean.
    
    Returns:
        cleaned_text (str): The text after removing demographic indicators.
    """
    doc = nlp(text)
    # Build a list of tokens that are not demographic indicators.
    tokens = [token.text for token in doc if token.ent_type_ not in ["PERSON", "GPE"]]
    # Reconstruct text; simple join may not preserve punctuation perfectly.
    cleaned_text = " ".join(tokens)
    return cleaned_text

def debiased_text(text):
    """
    Applies bias mitigation techniques to the text.
    Currently, it removes demographic indicators.
    
    Parameters:
        text (str): Original text.
    
    Returns:
        text (str): Debiased text.
    """
    return remove_demographic_indicators(text)

# Example: Clean a sample resume text (uncomment to test)
# sample_text = "John Doe from New York has 5 years of experience in Python."
# print(debiased_text(sample_text))
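
Joining the kept tokens with single spaces puts a space before punctuation, as the comment in `remove_demographic_indicators` warns. spaCy exposes each token's trailing whitespace (for example through `token.text_with_ws`), which could be used to rebuild the text more faithfully. The sketch below mimics that with hand-written tuples so it runs without spaCy; the token list and its entity labels are assumptions, not real model output.

```python
# Each tuple mimics one spaCy token: (text, trailing whitespace, entity label).
# Labels are written by hand here; a real run would read them from token.ent_type_.
tokens = [
    ("John", " ", "PERSON"),
    ("Doe", " ", "PERSON"),
    ("from", " ", ""),
    ("New", " ", "GPE"),
    ("York", " ", "GPE"),
    ("has", " ", ""),
    ("5", " ", ""),
    ("years", " ", ""),
    ("of", " ", ""),
    ("Python", "", ""),
    (".", "", ""),
]
REMOVED_LABELS = {"PERSON", "GPE"}

# Keep only tokens that are not tagged as a person or geopolitical entity.
kept = [(text, ws) for text, ws, label in tokens if label not in REMOVED_LABELS]

# Approach used in the notebook: join token texts with single spaces.
space_joined = " ".join(text for text, _ in kept)

# Alternative: re-attach each token's own trailing whitespace.
ws_joined = "".join(text + ws for text, ws in kept).strip()

print(space_joined)  # from has 5 years of Python .
print(ws_joined)     # from has 5 years of Python.
```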




# **Enhanced Feedback System**

Here we improve the feedback mechanism to provide detailed, twofold feedback:

* For Recruiters: Explains why a resume ranks high or low (e.g., matching skills, semantic similarity, and any missing key competencies).
* For Job Seekers: Personalized suggestions for improvement based on missing skills or areas of weakness.

We combine insights from both TF‑IDF (for keyword matching) and Word2Vec (for semantic similarity) to create comprehensive feedback.

# 5: Enhanced Feedback Generation

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_detailed_feedback(job_description, resume, vectorizer, w2v_model):
    """
    Generates detailed feedback for both recruiters and job seekers.
    
    Parameters:
        job_description (str): The job description text.
        resume (str): The resume text.
        vectorizer (TfidfVectorizer): Pre-fitted TF-IDF vectorizer.
        w2v_model (Word2Vec): Trained Word2Vec model.
    
    Returns:
        feedback (dict): Dictionary containing 'Recruiter Feedback' and 'Candidate Feedback'.
    """
    # Transform texts into TF-IDF vectors.
    job_tfidf = vectorizer.transform([job_description])
    resume_tfidf = vectorizer.transform([resume])
    
    # Get feature names and extract keywords present in the job description and resume.
    feature_names = vectorizer.get_feature_names_out()
    job_keywords = set([feature_names[i] for i in job_tfidf.indices])
    resume_keywords = set([feature_names[i] for i in resume_tfidf.indices])
    
    # Identify missing skills/keywords.
    missing_skills = job_keywords - resume_keywords
    matching_skills = job_keywords.intersection(resume_keywords)
    
    # Calculate semantic similarity using Word2Vec.
    semantic_similarity = compute_avg_w2v_similarity(job_description, resume, w2v_model)
    
    # Build Recruiter Feedback.
    recruiter_feedback = (
        f"Semantic Similarity Score: {semantic_similarity:.2f}. "
        f"Matching Keywords: {', '.join(matching_skills) if matching_skills else 'None'}. "
        f"Missing Keywords: {', '.join(missing_skills) if missing_skills else 'None'}."
    )
    
    # Build Candidate Feedback with suggestions for improvement.
    candidate_feedback = (
        "Consider adding details to showcase the following skills/keywords: "
        f"{', '.join(missing_skills)}." if missing_skills else "Excellent match with the job description!"
    )
    
    return {
        "Recruiter Feedback": recruiter_feedback,
        "Candidate Feedback": candidate_feedback
    }

# Example usage (uncomment to test). The vectorizer must be fitted before it is passed in:
# job_desc = "Looking for a software engineer proficient in Python, machine learning, and data analysis."
# fitted_vectorizer = TfidfVectorizer(stop_words="english").fit(all_resumes)
# print(generate_detailed_feedback(job_desc, all_resumes[0], fitted_vectorizer, w2v_model))
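
The keyword comparison inside `generate_detailed_feedback` reduces to two set operations. A minimal sketch with made-up keyword sets (both sets are assumptions standing in for the non-zero TF‑IDF features of each text):

```python
# Hypothetical keyword sets for a job description and a resume.
job_keywords = {"python", "machine", "learning", "data", "analysis"}
resume_keywords = {"python", "data", "excel", "reporting"}

# Keywords present in both texts, and keywords the resume lacks.
# sorted() gives a stable order; plain set iteration order can vary between runs.
matching = sorted(job_keywords & resume_keywords)
missing = sorted(job_keywords - resume_keywords)

print(matching)  # ['data', 'python']
print(missing)   # ['analysis', 'learning', 'machine']
```

Note that `vectorizer.transform` ignores terms outside the fitted vocabulary, so a job-description word that appears in none of the resumes used for fitting will not show up as a missing keyword.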


# **Main Execution – Ranking and Saving Results**

In this final section, we tie together all components:

* Read and clean resumes from a folder.
* Apply bias mitigation to remove demographic details.
* Compute TF‑IDF ranking against a given job description.
* Generate enhanced feedback using both TF‑IDF and Word2Vec similarity.
* Save the ranked resumes along with detailed feedback to a CSV file.
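
The ranking step in this list can be sketched without any libraries. The example below scores three made-up resumes against a made-up job description and sorts them by cosine similarity, highest first. It uses raw term counts rather than TF‑IDF weights to stay short, so the scores will differ from what `TfidfVectorizer` produces; the texts and the helper `cosine_counts` are assumptions for illustration.

```python
import math
from collections import Counter

def cosine_counts(a, b):
    """Cosine similarity between two bags of words held as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical job description and resumes, already lowercased and cleaned.
job = Counter("python machine learning data analysis".split())
resumes = [
    "nurse patient care scheduling",
    "python data analysis reporting",
    "python machine learning data pipelines",
]

# Score every resume, then order resume indices from best to worst match.
scores = [cosine_counts(job, Counter(text.split())) for text in resumes]
ranked = sorted(range(len(resumes)), key=lambda i: scores[i], reverse=True)

for rank, idx in enumerate(ranked, start=1):
    print(rank, idx, round(scores[idx], 2))
```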


# **6: Main Execution Code**

In [8]:
import pandas as pd

# Define the job description for which resumes will be ranked.
job_description = """
We are seeking a software engineer with expertise in Python, machine learning, and data analysis.
The candidate should demonstrate strong problem-solving abilities and hands-on experience with real-world projects.
"""

# Initialize a TF-IDF Vectorizer (ensure consistency across resume processing)
tfidf_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)

# Apply bias mitigation to all resumes (the steps below rank the debiased text).
debiased_resumes = [debiased_text(resume) for resume in all_resumes]

# Fit the TF-IDF vectorizer on the debiased resumes.
tfidf_matrix = tfidf_vectorizer.fit_transform(debiased_resumes)

# Transform the job description into a TF-IDF vector.
job_tfidf = tfidf_vectorizer.transform([job_description])

# Compute cosine similarity between the job description and each resume.
similarities = cosine_similarity(job_tfidf, tfidf_matrix)[0]

# Rank resumes by similarity score (highest first)
ranked_indices = similarities.argsort()[::-1]

# Prepare data for saving results.
ranked_data = []
for rank, idx in enumerate(ranked_indices, start=1):
    # Look up the debiased resume text at this position in the ranking.
    resume_text = debiased_resumes[idx]
    # Generate detailed feedback using our enhanced feedback system.
    feedback = generate_detailed_feedback(job_description, resume_text, tfidf_vectorizer, w2v_model)
    ranked_data.append({
        "Rank": rank,
        "Resume Index": idx + 1,
        "Cosine Similarity": similarities[idx],
        "Recruiter Feedback": feedback["Recruiter Feedback"],
        "Candidate Feedback": feedback["Candidate Feedback"],
        # Optionally, include a snippet of the resume text.
        "Resume Snippet": resume_text[:500] + "..."
    })

# Convert to DataFrame and save to CSV.
df = pd.DataFrame(ranked_data)
output_csv = "CHATGPT_WORD_EMBERDDING_ranked_resumes_with_enhanced_feedback.csv"
df.to_csv(output_csv, index=False, encoding="utf-8")
print(f"Results saved to {output_csv}")


Results saved to CHATGPT_WORD_EMBERDDING_ranked_resumes_with_enhanced_feedback.csv


In [10]:
df.head()

|   | Rank | Resume Index | Cosine Similarity | Recruiter Feedback | Candidate Feedback | Resume Snippet |
|---|------|--------------|-------------------|--------------------|--------------------|----------------|
| 0 | 1 | 1465 | 0.145224 | Semantic Similarity Score: 0.11. Matching Keyw... | Consider adding details to showcase the follow... | ENGINEERING AND QUALITY TECHNICIAN \n Career O... |
| 1 | 2 | 1935 | 0.130395 | Semantic Similarity Score: 0.09. Matching Keyw... | Consider adding details to showcase the follow... | HR REPRESENTATIVE \n Summary \n A motivated bu... |
| 2 | 3 | 513 | 0.124345 | Semantic Similarity Score: 0.10. Matching Keyw... | Consider adding details to showcase the follow... | DATA ANALYST \n Professional Summary \n Indust... |
| 3 | 4 | 517 | 0.124105 | Semantic Similarity Score: 0.11. Matching Keyw... | Consider adding details to showcase the follow... | Highlights \n Prog . Languages : C ( 5 + yrs )... |
| 4 | 5 | 1178 | 0.117274 | Semantic Similarity Score: 0.10. Matching Keyw... | Consider adding details to showcase the follow... | \n Summary \n Customer - oriented Principal Co... |


In [14]:
df.iloc[3, 3]  # Recruiter feedback for the fourth-ranked resume

'Semantic Similarity Score: 0.11. Matching Keywords: python, expertise, real, hands, projects, software, experience, analysis, engineer, data, machine, learning. Missing Keywords: candidate, solving, strong, problem, demonstrate, abilities, world, seeking.'

In [15]:
df.iloc[3, 4]  # Candidate feedback for the fourth-ranked resume

'Consider adding details to showcase the following skills/keywords: candidate, solving, strong, problem, demonstrate, abilities, world, seeking.'