**Job Recommendation System**

**Step 1: Install Required Packages**
This system requires several Python packages to function properly. We install:

sentence-transformers: For generating text embeddings that capture semantic meaning of resumes and job descriptions

umap-learn: For dimensionality reduction to visualize high-dimensional embedding spaces

kagglehub: For downloading the resume dataset directly from Kaggle

In [None]:
# JOB RECOMMENDATION SYSTEM

# STEP 1: INSTALL REQUIRED PACKAGES

!pip install sentence-transformers --quiet
!pip install umap-learn --quiet
!pip install kagglehub --quiet

**Step 2: Import All Libraries**
We import the Python libraries needed for data processing, machine learning, and visualization in one place, so that a missing dependency surfaces before any processing starts. The `google.colab` import is available only when the notebook runs in Google Colab.

In [None]:

# STEP 2: IMPORT ALL LIBRARIES

import pandas as pd
import numpy as np
import os
import re
from collections import Counter

# ML/NLP Libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import umap
from sklearn.manifold import TSNE

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Google Colab integration
from google.colab import drive

**Step 3: Load Resume Dataset**
We download and load the resume dataset from Kaggle, which contains resumes in multiple formats (text and HTML) along with categories. Since Kaggle datasets can be stored in different locations depending on the environment, we try multiple possible file paths to ensure successful loading across different platforms.

In [None]:
# STEP 3: LOAD RESUME DATASET
# Description: Load resume dataset from Kaggle using kagglehub

import kagglehub

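# Download the resume dataset; kagglehub returns the local directory it was saved to
path = kagglehub.dataset_download("snehaanbhawal/resume-dataset")

# Candidate locations: the downloaded directory first, then known fallbacks
paths = [
    os.path.join(path, "Resume", "Resume.csv"),
    "/kaggle/input/resume-dataset/Resume/Resume.csv",
    "/root/.cache/kagglehub/datasets/snehaanbhawal/resume-dataset/versions/1/Resume/Resume.csv"
]

df = None
for p in paths:
    try:
        df = pd.read_csv(p, encoding="latin-1")
        print("Successfully loaded:", p)
        break
    except FileNotFoundError:
        continue

# Fail loudly instead of leaving df undefined for later steps
if df is None:
    raise FileNotFoundError("Resume.csv was not found in any of the expected locations.")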
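
Because the cache location varies between environments, an alternative to a fixed list of paths is to search the directory that `kagglehub.dataset_download` returns. The sketch below is a minimal illustration using only the standard library. `find_file` is a hypothetical helper (not part of kagglehub), and the temporary directory merely stands in for a real download with an assumed `Resume/Resume.csv` layout.

```python
import os
import tempfile

def find_file(root, filename):
    """Return the first path under `root` whose basename equals `filename`, or None."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None

# Demo on a throwaway directory that mimics the assumed <root>/Resume/Resume.csv layout
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Resume"))
    target = os.path.join(root, "Resume", "Resume.csv")
    with open(target, "w") as f:
        f.write("ID,Resume_str\n1,example\n")
    found_matches = (find_file(root, "Resume.csv") == target)
    missing = find_file(root, "Nope.csv")

print(found_matches, missing)
```

In the notebook, the returned path could be passed straight to `pd.read_csv`, with a `None` result treated as a hard error.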

**Step 4: Mount Google Drive**
To access the O*NET dataset (which contains comprehensive occupational information), we mount Google Drive. The dataset should be stored in the drive under an 'ONET' directory. We then list the drive's contents to verify the directory structure.

In [None]:
# STEP 4: MOUNT GOOGLE DRIVE
# Description: Mount Google Drive to access O*NET dataset

drive.mount('/content/drive')
print(os.listdir('/content/drive/MyDrive'))


**Step 5: Load O*NET Dataset**

The O*NET (Occupational Information Network) dataset provides detailed information about occupations, including required skills, tasks, work activities, and more. We load multiple files from this dataset, each providing different dimensions of job data. The dataset uses tab-separated values and Latin-1 encoding.

In [None]:
# STEP 5: LOAD O*NET DATASET
# Description: Load various O*NET files for occupation data

# Check if ONET directory exists
try:
    print(os.listdir('/content/drive/MyDrive/ONET'))
except FileNotFoundError:
    print("ONET directory not found. Please ensure O*NET dataset is in your Google Drive.")

# Load O*NET files
skills = pd.read_csv('/content/drive/MyDrive/ONET/Skills.txt', sep='\t', encoding='latin-1')
abilities = pd.read_csv('/content/drive/MyDrive/ONET/Abilities.txt', sep='\t', encoding='latin-1')
tasks = pd.read_csv('/content/drive/MyDrive/ONET/Task Statements.txt', sep='\t', encoding='latin-1')
occupation = pd.read_csv('/content/drive/MyDrive/ONET/Occupation Data.txt', sep='\t', encoding='latin-1')
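
Checking the directory listing by eye is easy to get wrong, so it may help to verify the required files programmatically before calling `pd.read_csv`. The following is a small sketch under that assumption. `missing_files` and `REQUIRED_FILES` are hypothetical names introduced here, and the temporary directory stands in for the Drive folder.

```python
import os
import tempfile

# File names taken from the loading code above
REQUIRED_FILES = ["Skills.txt", "Abilities.txt", "Task Statements.txt", "Occupation Data.txt"]

def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in required if name not in present]

# Demo: a folder that contains only one of the four files
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Skills.txt"), "w").close()
    result = missing_files(d)

print(result)
```

An empty list means every file is in place; anything else names exactly what still needs to be uploaded.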

**Step 6: Create Occupation Profiles**

We combine multiple O*NET files into comprehensive occupation profiles. Each profile includes basic information (title and description), plus aggregated lists of required skills, typical tasks, and work activities. This structured format makes it easier to compare occupations and match them with resumes.

In [None]:
# STEP 6: CREATE OCCUPATION PROFILES
# Description: Combine multiple O*NET files into comprehensive occupation profiles

# Loading key ONET files
occ = pd.read_csv('/content/drive/MyDrive/ONET/Occupation Data.txt', sep='\t', encoding='latin-1')
skills = pd.read_csv('/content/drive/MyDrive/ONET/Skills.txt', sep='\t', encoding='latin-1')
tasks = pd.read_csv('/content/drive/MyDrive/ONET/Task Statements.txt', sep='\t', encoding='latin-1')
activities = pd.read_csv('/content/drive/MyDrive/ONET/Work Activities.txt', sep='\t', encoding='latin-1')

# Group aggregation (join multiple rows per occupation)
occ_skills = skills.groupby('O*NET-SOC Code')['Element Name'].apply(list).reset_index()
occ_tasks = tasks.groupby('O*NET-SOC Code')['Task'].apply(list).reset_index()
occ_activities = activities.groupby('O*NET-SOC Code')['Element Name'].apply(list).reset_index()

# Merge them all
occupation_profile = occ[['O*NET-SOC Code', 'Title', 'Description']] \
    .merge(occ_skills, on='O*NET-SOC Code', how='left') \
    .merge(occ_tasks, on='O*NET-SOC Code', how='left') \
    .merge(occ_activities, on='O*NET-SOC Code', how='left')

occupation_profile.head()
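
Two of the aggregated tables share the column name `Element Name`, so pandas disambiguates them with `_x` and `_y` suffixes during the merge. One way to avoid relying on those suffixes is to name each aggregated column before merging. The sketch below shows the idea on invented toy data; the `code` column, the `collect` helper, and the rows themselves are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for the O*NET tables (values invented for illustration)
occ = pd.DataFrame({"code": ["11-1011.00", "15-1252.00"],
                    "Title": ["Chief Executives", "Software Developers"]})
skills = pd.DataFrame({"code": ["11-1011.00", "15-1252.00", "15-1252.00"],
                       "Element Name": ["Speaking", "Programming", "Programming"]})
activities = pd.DataFrame({"code": ["11-1011.00"],
                           "Element Name": ["Making Decisions and Solving Problems"]})

def collect(df, source_col, new_name):
    """One row per occupation holding the distinct values of `source_col` as a list."""
    return (df.groupby("code")[source_col]
              .apply(lambda s: list(dict.fromkeys(s)))  # de-duplicate, keep order
              .rename(new_name)
              .reset_index())

profile = (occ
           .merge(collect(skills, "Element Name", "Skills"), on="code", how="left")
           .merge(collect(activities, "Element Name", "Work_Activities"), on="code", how="left"))

print(profile.columns.tolist())
```

With `how="left"`, an occupation that has no rows in one of the tables keeps a missing value in that column rather than being dropped.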

**Step 7: Prepare Embeddings**

Text embeddings convert textual information into numerical vectors that capture semantic meaning. We clean the data, combine multiple text fields from each occupation profile, and use a pre-trained language model (all-MiniLM-L6-v2) to generate embeddings. These embeddings will allow us to compute similarity between resumes and job descriptions.

In [None]:
# STEP 7: PREPARE EMBEDDINGS
# Description: Clean and prepare text data for embedding generation

print(occupation_profile.columns)
# Skills and work activities both arrive as 'Element Name', so the merges in Step 6
# suffix them with _x (skills) and _y (work activities)
occupation_profile = occupation_profile.rename(columns={
    'Element Name_x': 'Skills',
    'Task': 'Tasks',
    'Element Name_y': 'Work_Activities'
})

# Remove duplicates in lists while preserving their original order
for col in ['Skills', 'Tasks', 'Work_Activities']:
    occupation_profile[col] = occupation_profile[col].apply(
        lambda x: list(dict.fromkeys(x)) if isinstance(x, list) else x
    )

# Combine text fields, separating every field with a space
def combine_text(row):
    parts = [str(row['Title']), str(row['Description'])]
    for col in ['Skills', 'Tasks', 'Work_Activities']:
        if isinstance(row[col], list):
            parts.append(' '.join(row[col]))
    return ' '.join(parts)

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
occupation_profile['combined_text'] = occupation_profile.apply(combine_text, axis=1)
# Encode all texts in one batched call instead of one call per row
embeddings = model.encode(occupation_profile['combined_text'].tolist(), show_progress_bar=True)
occupation_profile['embedding'] = list(embeddings)

**Step 8: Match Explanation Function**

This helper function explains why specific job matches were recommended by identifying overlapping skills between resumes and job descriptions. It provides transparency to the recommendation process and helps users understand the basis for each match.

In [None]:
# STEP 8: MATCH EXPLANATION FUNCTION

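def explain_match(resume_skills, job_skills):
    # Lowercase both sides so 'Python' and 'python' count as the same skill
    if isinstance(job_skills, list):
        resume_set = {s.lower() for s in resume_skills}
        return sorted({s.lower() for s in job_skills} & resume_set)
    return []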
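
A short usage sketch may make the behavior concrete. It assumes both inputs are lists of skill strings and uses a variant of the helper that lowercases both sides before intersecting; the example skills are invented.

```python
def explain_match(resume_skills, job_skills):
    """Return the skills shared by a resume and a job, compared case-insensitively."""
    if not isinstance(job_skills, list):
        return []  # e.g. NaN left behind by a left merge
    resume_set = {s.lower() for s in resume_skills}
    return sorted({s.lower() for s in job_skills} & resume_set)

resume_skills = ["Python", "SQL", "Critical Thinking"]
job_skills = ["critical thinking", "Programming", "sql"]

shared = explain_match(resume_skills, job_skills)
no_skills = explain_match(resume_skills, float("nan"))

print(shared)     # ['critical thinking', 'sql']
print(no_skills)  # []
```

Exact string matching like this misses paraphrases ("coding" versus "Programming"), which is one reason the recommendations themselves rely on embeddings.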

**Step 9: Load and Preprocess Resume Data**

We load the resume data and prepare it for the recommendation system. This involves identifying the correct columns containing resume text and handling any necessary text preprocessing. The resume data needs to be in a clean, consistent format for embedding generation.

In [None]:
# STEP 9: LOAD AND PREPROCESS RESUME DATA
# Description: Load resume data and prepare it for recommendation

# Resume dataset: reuse the DataFrame loaded in Step 3 instead of re-reading a hard-coded path
resumes = df.copy()
print("Resume columns:", resumes.columns)

# O*NET dataset
onet_path = '/content/drive/MyDrive/ONET/Task Statements.txt'
tasks = pd.read_csv(onet_path, sep='\t', encoding='latin-1')
print("O*NET columns:", tasks.columns)

**Step 10: Basic Recommendation System**

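This section implements a baseline recommender using sentence embeddings and cosine similarity. For simplicity it compares each resume against O*NET skill names rather than full occupation profiles, so the "Job Title" column in its output actually lists skill names. It serves as a starting point for understanding the basic mechanics of our approach before implementing more sophisticated techniques.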

In [None]:
# STEP 10: BASIC RECOMMENDATION SYSTEM
# Description: Initial implementation of job recommendation

# Use the plain-text resume column; fall back to the first column only if it is missing
resume_text_col = 'Resume_str' if 'Resume_str' in resumes.columns else resumes.columns[0]

# For demonstration, let's load Skills.txt as well
skills = pd.read_csv('/content/drive/MyDrive/ONET/Skills.txt', sep='\t', encoding='latin-1')

# Simplified profile: one row per distinct skill name
# (Skills.txt repeats each skill for every occupation and rating scale)
occupation_profile = skills[['Element Name']].drop_duplicates().reset_index(drop=True)
occupation_profile['combined_text'] = occupation_profile['Element Name'].astype(str)

# Generate embeddings in batched calls
model = SentenceTransformer('all-MiniLM-L6-v2')
resumes['embedding'] = list(model.encode(resumes[resume_text_col].astype(str).tolist()))
occupation_profile['embedding'] = list(model.encode(occupation_profile['combined_text'].tolist()))

# Recommendation function
def recommend_jobs(resume_embedding, occupation_embeddings, occupation_titles, top_k=5):
    # Stack the per-row vectors into one (n_occupations, dim) matrix
    sims = cosine_similarity([resume_embedding], np.vstack(list(occupation_embeddings)))[0]
    top_idx = sims.argsort()[::-1][:top_k]
    return pd.DataFrame({
        'Job Title': occupation_titles.iloc[top_idx].values,
        'Similarity': sims[top_idx]
    })

# Example: Recommend for first resume
first_resume_embedding = resumes['embedding'][0]
recommendations = recommend_jobs(first_resume_embedding,
                                 occupation_profile['embedding'],
                                 occupation_profile['Element Name'],
                                 top_k=5)

print("Top 5 job recommendations for first resume:")
print(recommendations)
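
To see what the ranking step does without downloading a model, the sketch below applies the same idea to tiny hand-made vectors using NumPy only. `top_k_cosine` is a hypothetical stand-in for `recommend_jobs`, and the 2-D vectors are invented; real embeddings have hundreds of dimensions.

```python
import numpy as np

def top_k_cosine(query, candidates, k=2):
    """Return (indices, scores) of the k candidates most similar to `query`."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each row with the query
    order = np.argsort(-sims)[:k]     # indices of the highest scores first
    return order, sims[order]

candidates = np.array([[1.0, 0.0],    # same direction as the query
                       [0.0, 1.0],    # orthogonal to the query
                       [1.0, 1.0]])   # 45 degrees from the query
query = np.array([2.0, 0.0])          # length differs, direction matches row 0

idx, scores = top_k_cosine(query, candidates, k=2)
print(idx, scores)
```

The query is twice as long as the first candidate yet still scores 1.0 against it, because cosine similarity depends only on direction.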

**Step 11: Advanced Recommendation System**

This enhanced version incorporates more O*NET data fields and weights them by repeating the more important fields in the combined text. It employs a larger embedding model (all-mpnet-base-v2). A UMAP projection of the occupation embeddings is also computed, although the recommendations themselves are still scored on the full-dimensional embeddings.

In [None]:
# STEP 11: ADVANCED RECOMMENDATION SYSTEM
# Description: Enhanced version with more O*NET data and better text processing

# Load additional O*NET datasets
knowledge = pd.read_csv('/content/drive/MyDrive/ONET/Knowledge.txt', sep='\t', encoding='latin-1')
work_context = pd.read_csv('/content/drive/MyDrive/ONET/Work Context.txt', sep='\t', encoding='latin-1')

# Clean resume text
def clean_resume(text):
    text = re.sub(r'<.*?>', ' ', str(text))  # remove HTML
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

resume_text_col = 'Resume_str'
resumes['cleaned_resume'] = resumes[resume_text_col].apply(clean_resume)

# Aggregate text per occupation
def aggregate_text(df, group_col='O*NET-SOC Code', text_col='Element Name', new_col_name=None):
    # Join distinct values only; several O*NET files repeat each element once per rating scale
    agg = df.groupby(group_col)[text_col].apply(lambda x: ' '.join(x.astype(str).unique())).reset_index()
    if new_col_name:
        agg = agg.rename(columns={text_col: new_col_name})
    return agg

skills_agg = aggregate_text(skills, text_col='Element Name', new_col_name='Skills')
work_agg = aggregate_text(activities, text_col='Element Name', new_col_name='Work_Activities')
tasks_agg = aggregate_text(tasks, text_col='Task', new_col_name='Tasks')
knowledge_agg = aggregate_text(knowledge, text_col='Element Name', new_col_name='Knowledge')
context_agg = aggregate_text(work_context, text_col='Element Name', new_col_name='Work_Context')

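# Merge into occupation profile, starting from the occupation table loaded in Step 6
occupation_data = occ[['O*NET-SOC Code', 'Title', 'Description']]
occupation_profile = occupation_data.merge(skills_agg, on='O*NET-SOC Code', how='left')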
occupation_profile = occupation_profile.merge(work_agg, on='O*NET-SOC Code', how='left')
occupation_profile = occupation_profile.merge(tasks_agg, on='O*NET-SOC Code', how='left')
occupation_profile = occupation_profile.merge(knowledge_agg, on='O*NET-SOC Code', how='left')
occupation_profile = occupation_profile.merge(context_agg, on='O*NET-SOC Code', how='left')

# Combine text fields (weighted by repetition)
def weighted(col, weight=1):
    # Repeat the field text `weight` times, space-separated; missing values become ''
    return (occupation_profile[col].fillna('').astype(str) + ' ').str.repeat(weight)

# Note: the encoder truncates inputs beyond its maximum sequence length,
# so fields placed late in the text may be cut off
occupation_profile['combined_text'] = (
    weighted('Title') + weighted('Description') +
    weighted('Skills', 3) +       # weight skills higher
    weighted('Tasks', 2) +        # weight tasks moderately
    weighted('Work_Activities') + weighted('Knowledge') + weighted('Work_Context')
).str.strip()

# Initialize embedding model (higher quality)
model = SentenceTransformer('all-mpnet-base-v2')

# Generate embeddings
print("Generating occupation embeddings...")
occupation_profile['embedding'] = occupation_profile['combined_text'].apply(lambda x: model.encode(str(x)))

print("Generating resume embeddings...")
resumes['embedding'] = resumes['cleaned_resume'].apply(lambda x: model.encode(str(x)))

# Dimensionality reduction (UMAP); the reduced vectors are stored for exploration,
# while the similarity search below still uses the full embeddings
embeddings_matrix = np.vstack(occupation_profile['embedding'].values)
umap_model = umap.UMAP(n_components=256, random_state=42)
occupation_profile['embedding_reduced'] = list(umap_model.fit_transform(embeddings_matrix))

# Example: top 5 recommendations for first resume
first_resume_embedding = resumes['embedding'][0]
recommendations = recommend_jobs(first_resume_embedding,
                                 occupation_profile['embedding'],
                                 occupation_profile['Title'],
                                 top_k=5)
print("Top 5 job recommendations for first resume:")
print(recommendations)

# Loop over all resumes and save CSV
all_recommendations = []
for idx, row in resumes.iterrows():
    recs = recommend_jobs(row['embedding'], occupation_profile['embedding'], occupation_profile['Title'], top_k=5)
    recs['Resume_ID'] = row['ID']
    all_recommendations.append(recs)

final_recs = pd.concat(all_recommendations, ignore_index=True)
final_recs.to_csv('resume_job_recommendations_final.csv', index=False)
print("Saved all top 5 job recommendations to 'resume_job_recommendations_final.csv'.")
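
Weighting by repetition is easy to get subtly wrong: multiplying a string by an integer concatenates the copies with no separator, which fuses the last word of one copy with the first word of the next. The sketch below contrasts that with a separator-aware version. `weighted_text` and the sample fields are hypothetical and only illustrate the idea.

```python
def weighted_text(fields, weights):
    """Join non-empty fields with spaces, repeating each one `weights[name]` times."""
    parts = []
    for name, text in fields.items():
        if text:  # skip empty or missing fields
            parts.extend([text] * weights.get(name, 1))
    return " ".join(parts)

fields = {
    "Title": "Data Scientist",
    "Skills": "Programming Mathematics",
    "Tasks": "Analyze data",
    "Knowledge": "",
}
weights = {"Skills": 3, "Tasks": 2}

combined = weighted_text(fields, weights)
naive = 3 * fields["Skills"]  # plain string multiplication, no separator

print(combined)
print(naive)
```

The naive result contains the fused token `MathematicsProgramming`, which a tokenizer will not treat as either original word.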

**Step 12: Visualization Section**

Visualizations help analyze and interpret the recommendation results. These plots provide insights into the distribution of similarity scores, identify the most frequently recommended jobs, and show how matches are distributed across different resumes.

In [None]:
# STEP 12: VISUALIZATION SECTION
# Description: Create visualizations to analyze results

# Histogram of similarity scores
plt.figure(figsize=(8, 5))
sns.histplot(final_recs['Similarity'], bins=20, kde=True)
plt.title("Distribution of Job Recommendation Similarity Scores")
plt.xlabel("Similarity Score")
plt.ylabel("Frequency")
plt.show()

# Bar chart: Top 10 most frequently recommended job titles
top_jobs = final_recs['Job Title'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_jobs.values, y=top_jobs.index)
plt.title("Top 10 Most Frequently Recommended Job Titles")
plt.xlabel("Count")
plt.ylabel("Job Title")
plt.show()

# Scatter plot: Resume index vs. similarity score
plt.figure(figsize=(10, 6))
plt.scatter(final_recs['Resume_ID'], final_recs['Similarity'], alpha=0.7)
plt.title("Similarity Score Distribution per Resume")
plt.xlabel("Resume ID")
plt.ylabel("Similarity")
plt.show()

# Heatmap of top job titles vs average similarity
top_jobs_sim = final_recs.groupby('Job Title')['Similarity'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
sns.heatmap(top_jobs_sim.to_frame(), annot=True, fmt=".3f", cmap="rocket")
plt.title("Average Similarity of Top 10 Recommended Job Titles")
plt.ylabel("Job Title")
plt.show()

# Save visual-ready summary
summary_stats = final_recs.groupby('Job Title')['Similarity'].agg(['count', 'mean', 'max']).sort_values(by='mean', ascending=False)
summary_stats.to_csv('job_similarity_summary.csv')
print("Visualization data saved to 'job_similarity_summary.csv'.")

**Step 13: Skills Distribution Analysis**

These visualizations analyze the distribution of skills and tasks across different occupations. They help us understand the complexity of various jobs and identify patterns in occupational requirements.

In [None]:
# STEP 13: SKILLS DISTRIBUTION ANALYSIS
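# Distribution of number of skills per occupation
# Count distinct skill names in the raw Skills table, keyed by occupation code
skills_per_occ = skills.groupby('O*NET-SOC Code')['Element Name'].nunique()
occupation_profile['num_skills'] = occupation_profile['O*NET-SOC Code'].map(skills_per_occ).fillna(0).astype(int)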
plt.figure(figsize=(8,5))
sns.histplot(occupation_profile['num_skills'], bins=20, kde=True)
plt.title("Distribution of Number of Skills per Occupation")
plt.xlabel("Number of Skills")
plt.ylabel("Count of Occupations")
plt.show()

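# Distribution of number of tasks per occupation
# Count distinct task statements in the raw Task Statements table, keyed by occupation code
tasks_per_occ = tasks.groupby('O*NET-SOC Code')['Task'].nunique()
occupation_profile['num_tasks'] = occupation_profile['O*NET-SOC Code'].map(tasks_per_occ).fillna(0).astype(int)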
plt.figure(figsize=(8,5))
sns.histplot(occupation_profile['num_tasks'], bins=20, kde=True)
plt.title("Distribution of Number of Tasks per Occupation")
plt.xlabel("Number of Tasks")
plt.ylabel("Count of Occupations")
plt.show()
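
Counting items per occupation is most reliable on the long-format table, where each row is one rating of one element. The toy example below (invented rows, with a `Scale ID` column mirroring O*NET's importance and level scales) shows why distinct counts and raw row counts differ.

```python
import pandas as pd

# Each skill appears once per rating scale, so raw row counts overstate the number of skills
skills_long = pd.DataFrame({
    "O*NET-SOC Code": ["A", "A", "A", "A", "B", "B"],
    "Element Name": ["Speaking", "Speaking", "Writing", "Writing", "Speaking", "Speaking"],
    "Scale ID": ["IM", "LV", "IM", "LV", "IM", "LV"],
})

row_counts = skills_long.groupby("O*NET-SOC Code")["Element Name"].size()
distinct_counts = skills_long.groupby("O*NET-SOC Code")["Element Name"].nunique()

print(row_counts.to_dict())       # {'A': 4, 'B': 2}
print(distinct_counts.to_dict())  # {'A': 2, 'B': 1}
```

The resulting Series is indexed by occupation code, so it can be attached to a profile table with `Series.map`.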

**Step 14: Embedding Visualization (t-SNE)**

t-SNE (t-Distributed Stochastic Neighbor Embedding) reduces high-dimensional embeddings to 2D for visualization. This helps identify clusters of similar occupations and provides insight into how the embedding space organizes different types of jobs.

In [None]:
# STEP 14: EMBEDDING VISUALIZATION (t-SNE)
# Extract embeddings and titles
X = np.vstack(occupation_profile['embedding'].values)
labels = occupation_profile['Title'].values

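# Dimensionality reduction
# scikit-learn 1.5 renamed TSNE's `n_iter` to `max_iter`; use whichever this version accepts
import inspect
iter_kw = 'max_iter' if 'max_iter' in inspect.signature(TSNE.__init__).parameters else 'n_iter'
tsne = TSNE(n_components=2, random_state=42, perplexity=40, **{iter_kw: 3000})
X_reduced = tsne.fit_transform(X)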

# Create dataframe
tsne_df = pd.DataFrame()
tsne_df['X'] = X_reduced[:, 0]
tsne_df['Y'] = X_reduced[:, 1]
tsne_df['Job Title'] = labels

# Plot
plt.figure(figsize=(12, 8))
sns.scatterplot(data=tsne_df, x='X', y='Y', s=70)
plt.title("t-SNE Visualization of Occupation Embedding Clusters")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")

# Annotate the first 50 occupations (an arbitrary subset that keeps the plot legible)
for i, title in enumerate(tsne_df['Job Title'].head(50)):  # annotate first 50
    plt.text(tsne_df['X'][i]+0.5, tsne_df['Y'][i]+0.5, title, fontsize=8)

plt.show()
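
One practical constraint when reusing this cell on a subset of occupations: scikit-learn rejects a perplexity that is not smaller than the number of samples. A small guard such as the one sketched below can pick a valid value. `safe_perplexity` is a hypothetical helper, and clamping to `n_samples - 1` is just one simple choice rather than a tuned recommendation.

```python
def safe_perplexity(n_samples, requested=40.0):
    """Clamp the requested perplexity so it stays strictly below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return float(min(requested, n_samples - 1))

full_set = safe_perplexity(1000)    # plenty of samples: keep the requested value
small_subset = safe_perplexity(25)  # fewer samples than requested: clamp

print(full_set, small_subset)  # 40.0 24.0
```

The result could then be passed as `perplexity=safe_perplexity(len(X))` when constructing `TSNE`.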

**Step 15: Comprehensive EDA Visualizations**

This final section provides comprehensive exploratory data analysis with multiple visualization types. These plots offer insights into dataset composition, skill distributions, recommendation patterns, and hypothetical model performance comparisons.

In [None]:
# STEP 15: COMPREHENSIVE EDA VISUALIZATIONS
sns.set(style="whitegrid", palette="Set2")   # nice colorful palette

# Categories of resumes (if provided)
plt.figure(figsize=(8,5))
sns.countplot(y='Category', data=resumes,
              order=resumes['Category'].value_counts().index)
plt.title("Resume Categories Distribution")
plt.xlabel("Count")
plt.ylabel("Category")
plt.tight_layout()
plt.show()

# Resume length distribution
resumes['resume_length'] = resumes['cleaned_resume'].apply(lambda x: len(str(x).split()))
plt.figure(figsize=(8,5))
sns.histplot(resumes['resume_length'], bins=25, kde=True)
plt.title("Distribution of Resume Lengths (Words)")
plt.xlabel("Words")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

# Top skills in O*NET: number of occupations that list each skill
skill_counts = skills.drop_duplicates(['O*NET-SOC Code', 'Element Name'])['Element Name'].value_counts()
top_skills = skill_counts.head(20).rename_axis('Skill').reset_index(name='Count')

plt.figure(figsize=(10,6))
sns.barplot(x='Count', y='Skill', data=top_skills)
plt.title("Top 20 Skills in O*NET Database")
plt.tight_layout()
plt.show()

# Most common words in resumes (a rough proxy for skills; stop words are not filtered)
resume_skills = ' '.join(resumes['cleaned_resume'].tolist()).lower().split()
resume_skill_counts = Counter(resume_skills)
top_resume_skills = pd.DataFrame(resume_skill_counts.most_common(20),
                                 columns=['Skill','Count'])

plt.figure(figsize=(10,6))
sns.barplot(x='Count', y='Skill', data=top_resume_skills)
plt.title("Top 20 Most Frequent Words/Skills in Resumes")
plt.tight_layout()
plt.show()

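# Skill Overlap Plot: O*NET skills that also appear among the top resume words
resume_top = {w.lower() for w in top_resume_skills['Skill']}
overlap = [s for s in top_skills['Skill'] if s.lower() in resume_top]
overlap_df = pd.DataFrame({'Skill': overlap})

# Only single-word skill names can match single resume words, so the overlap may be empty
if overlap_df.empty:
    print("No overlap between top resume words and top O*NET skills.")
else:
    plt.figure(figsize=(6,4))
    sns.stripplot(y='Skill', data=overlap_df, size=12)
    plt.title("Overlap Between Resume Skills and O*NET Skills")
    plt.tight_layout()
    plt.show()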

# Recommendation frequency
top_jobs = final_recs['Job Title'].value_counts().head(15)

plt.figure(figsize=(10,6))
sns.barplot(x=top_jobs.values, y=top_jobs.index)
plt.title("Top 15 Most Recommended Jobs")
plt.xlabel("Recommendations")
plt.ylabel("Job Title")
plt.tight_layout()
plt.show()

# Model Performance Visuals
# NOTE: the metric values below are hard-coded hypothetical placeholders, not measured results
model_metrics = pd.DataFrame({
    'Model': ['Baseline','Weighted Skills','Weighted Skills+Tasks','Weighted+MPNet'],
    'Precision': [0.52,0.57,0.60,0.63],
    'Recall': [0.48,0.55,0.59,0.61],
    'F1': [0.50,0.56,0.59,0.62]
})

# F1 Score bar chart
plt.figure(figsize=(8,5))
sns.barplot(x='Model', y='F1', data=model_metrics)
plt.title("F1 Score by Model")
plt.tight_layout()
plt.show()

# Precision vs Recall scatter
plt.figure(figsize=(8,5))
sns.scatterplot(x='Precision', y='Recall', hue='Model', s=150, data=model_metrics)
plt.title("Precision vs Recall Across Models")
plt.tight_layout()
plt.show()

# Multi-metric grouped bar chart
model_metrics_melt = model_metrics.melt(id_vars='Model',
                                        value_vars=['Precision','Recall','F1'],
                                        var_name='Metric',
                                        value_name='Value')

plt.figure(figsize=(10,6))
sns.barplot(x='Model', y='Value', hue='Metric', data=model_metrics_melt)
plt.title("Comparison of Precision, Recall, and F1 Across Models")
plt.tight_layout()
plt.show()
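
Since the metric values plotted above are hypothetical, measuring real ones would require ground-truth labels saying which occupations are actually relevant for each resume. Assuming such labels existed, precision, recall, and F1 for a single resume could be computed as sketched below. The function name, the recommended list, and the relevant set are all invented for illustration.

```python
def precision_recall_f1(recommended, relevant):
    """Set-based precision, recall, and F1 for one resume's recommendations."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical top-4 recommendations and hypothetical ground truth
recommended = ["Data Scientists", "Statisticians", "Actuaries", "Economists"]
relevant = ["Data Scientists", "Statisticians", "Software Developers"]

p, r, f1 = precision_recall_f1(recommended, relevant)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Averaging these per-resume scores over a labeled evaluation set would yield figures that could replace the placeholders in `model_metrics`.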