# Job Recommendation System: Uncovering Opportunities

This notebook presents the development of a job recommendation system designed to connect users with relevant career opportunities based on their skills and experience. Leveraging a dataset of diverse job descriptions, the system processes key features such as job titles, required skills, and years of experience to provide tailored recommendations.

The core of this system relies on **TF-IDF (Term Frequency-Inverse Document Frequency)** for text vectorization and **Cosine Similarity** to measure the resemblance between a user's query and available job descriptions. Through careful data preprocessing and feature engineering, we aim to build an effective tool for navigating the job market.

## Setup and Data Loading

We begin by importing the necessary libraries and configuring display options so that all content, especially lengthy text fields, is fully visible for review. The dataset itself is loaded once we have confirmed it is present in the input directory.

In [None]:
import numpy as np # For numerical operations
import pandas as pd # For data manipulation and analysis
import os # For interacting with the operating system (e.g., listing files)
import warnings # To manage warning messages

from sklearn.feature_extraction.text import TfidfVectorizer # For text feature extraction
from sklearn.metrics.pairwise import cosine_similarity # For calculating similarity between vectors

# Suppress all warnings to maintain a clean output
warnings.filterwarnings('ignore')

# Set pandas display options to show full content in columns and all columns without truncation
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100) # Show more rows if needed

### Data Availability

First, let's inspect the files available in the input directory to ensure our dataset is present.

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Loading the Dataset

The `job_dataset.csv` file contains the job descriptions we will use. We load it into a pandas DataFrame.

In [None]:
df = pd.read_csv('/kaggle/input/job-descriptions-2025-tech-and-non-tech-roles/job_dataset.csv')

## Initial Data Exploration

Before diving into preprocessing, let's get a preliminary understanding of our dataset's structure, content, and key characteristics.

### Dataset Overview

A quick look at the first few rows helps us understand the data types and column content.

In [None]:
df.head()

### Dataset Dimensions

Checking the shape of the DataFrame reveals the total number of entries and features available.

In [None]:
df.shape

With over 20,000 entries and 8 features, this dataset provides a substantial foundation for building a robust recommendation system.

### Column Names

We first convert the column names to lowercase so they can be referenced consistently in the cells that follow, then list them.

In [None]:
df.columns = df.columns.str.lower()

In [None]:
df.columns

### Job Title Distribution

Let's examine the distribution of job titles to see the variety of roles present in the dataset.

In [None]:
df.title.value_counts()

### Experience Level Distribution (Before Cleaning)

We inspect the raw values for `experiencelevel` to identify inconsistencies that require cleaning.

In [None]:
df.experiencelevel.value_counts()

### Cross-Tabulation of Experience Levels and Years of Experience

This cross-tabulation helps us understand the relationship between categorical experience levels and the specified years of experience.

In [None]:
pd.crosstab(df['experiencelevel'], df['yearsofexperience'])
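
As a reminder of how to read such a table, here is a small sketch on invented data. The values below are hypothetical and are not taken from the dataset; each cell of the result counts the rows that share a given level and a given range.

```python
import pandas as pd

# Invented values, used only to illustrate the shape of a cross-tabulation
toy = pd.DataFrame({
    "experiencelevel": ["entry", "entry", "senior", "senior", "senior"],
    "yearsofexperience": ["0-1", "0-1", "5-8", "5-8", "8-10"],
})

# Rows are experience levels, columns are year ranges, cells are counts
table = pd.crosstab(toy["experiencelevel"], toy["yearsofexperience"])
print(table)
```

A level whose counts are spread over many unrelated ranges is a hint that the two fields are not consistent with each other.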

## Data Preprocessing and Feature Engineering

To prepare our data for the recommendation system, we perform several preprocessing steps, including standardizing column names, cleaning experience fields, and engineering new features from existing ones.

### Column Name Standardization

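Converting all column names to lowercase ensures consistency and easier access. This was already done during exploration; repeating it is harmless and lets this section run on its own.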

In [None]:
df.columns = df.columns.str.lower()

### Cleaning `yearsofexperience`

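The `yearsofexperience` column contains various formats. We standardize it in three steps: drop the trailing "years" text, replace the en dash in ranges with a plain hyphen, and convert every value to a `low-high` range. A single number `n` becomes `n-(n+3)`, and an open-ended value `n+` becomes `n-(n+1)`.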

In [None]:
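# Keep only the leading token when the value mentions 'year' (e.g. '2–4 years' -> '2–4');
# otherwise just lowercase and trim. Assumes the column has no missing values.
df['yearsofexperience'] = df['yearsofexperience'].apply(
    lambda x: x.split(' ')[0].strip() if 'year' in x.lower() else x.lower().strip()
)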

In [None]:
# Replace the en dash in ranges such as '2–4' with a plain hyphen
df['yearsofexperience'] = df['yearsofexperience'].str.replace('–', '-', regex=False)

In [None]:
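def binning(x):
    """Convert a cleaned experience value into a 'low-high' range string."""
    if '+' not in x and '-' not in x:
        # Single number, e.g. '3' -> '3-6'
        return str(x) + '-' + str(int(x) + 3)
    elif '+' in x:
        # Open-ended value, e.g. '5+' -> '5-6'
        x = x.split('+')[0].strip()
        return x + '-' + str(int(x) + 1)
    # Already a range, e.g. '2-4'
    return x

df['yearsofexperience'] = df['yearsofexperience'].apply(binning)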
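
To make the effect of the three cleaning cells above concrete, here is a minimal, self-contained sketch that applies the same logic to a handful of made-up strings. The helper name `normalize_years` and the sample values are hypothetical; the real column may contain formats this sketch does not cover (for example non-numeric text, which would make `int()` raise).

```python
def normalize_years(raw: str) -> str:
    """Mirror the three cleaning steps on a single value (illustrative only)."""
    # Step 1: keep the first token if the text mentions 'year', else lowercase and trim
    value = raw.split(' ')[0].strip() if 'year' in raw.lower() else raw.lower().strip()
    # Step 2: replace an en dash with a plain hyphen
    value = value.replace('–', '-')
    # Step 3: turn open-ended values and single numbers into ranges
    if '+' in value:
        lower = value.split('+')[0].strip()
        return f"{lower}-{int(lower) + 1}"
    if '-' not in value:
        return f"{value}-{int(value) + 3}"
    return value

# Made-up inputs covering each branch
samples = ["2–4 years", "5+ years", "3", "1-2"]
cleaned = [normalize_years(s) for s in samples]
print(cleaned)  # ['2-4', '5-6', '3-6', '1-2']
```

Note that the chosen widths (`+3` for a single number, `+1` for an open-ended value) are this notebook's convention rather than a standard.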

### Cleaning `experiencelevel`

We standardize the `experiencelevel` column by removing redundant suffixes and converting values to lowercase.

In [None]:
df['experiencelevel'] = df['experiencelevel'].apply(lambda x : x.strip().split('-Level')[0].strip().split('Level')[0].lower().strip())
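
The chained `split` calls are easier to follow on a few examples. The sketch below runs the same expression on invented labels; the sample strings are assumptions about what the raw column might look like, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical raw labels with inconsistent suffixes and whitespace
levels = pd.Series(["Entry-Level", "Mid Level", " Senior ", "Senior-Level"])

# Same expression as the cell above: drop '-Level' / 'Level', lowercase, trim
cleaned_levels = levels.apply(
    lambda x: x.strip().split('-Level')[0].strip().split('Level')[0].lower().strip()
)
print(cleaned_levels.tolist())  # ['entry', 'mid', 'senior', 'senior']
```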

### Experience Level Distribution (After Cleaning)

After cleaning, the `experiencelevel` values are more consistent.

In [None]:
df.experiencelevel.value_counts()

### Removing `keywords` Column

The `keywords` column is not directly used in our current recommendation approach, so we drop it to simplify the DataFrame.

In [None]:
df.drop(columns='keywords', inplace=True)

### Preparing `skills` and `responsibilities`

We split the `skills` column into a list of individual skills and replace the semicolon separators in `responsibilities` with commas, the delimiter that the `recommend` function later splits on.

In [None]:
df["skills_split"] = df["skills"].str.split(";").apply(lambda x: [s.strip() for s in x])

In [None]:
df["responsibilities"] = df["responsibilities"].str.replace(";", ",")

### Enriching Skills with Experience Level

To make the recommendation more context-aware, we append the `experiencelevel` to the list of `skills` for each job. This integrates experience as a critical feature for similarity calculations.

In [None]:
def titleappend(x):
    # Append the experience level to the list of skills
    return x['skills_split'] + [x['experiencelevel']]

df['skills'] = df.apply(titleappend, axis=1)

In [None]:
df = df.drop(columns=['skills_split'])
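
The split-and-append steps above can be checked on a toy frame. This is a sketch with invented rows; only the column names match the notebook.

```python
import pandas as pd

# Invented rows; note the uneven whitespace around the ';' separators
toy = pd.DataFrame({
    "skills": ["Python; SQL ; Pandas", "Excel;Communication"],
    "experiencelevel": ["entry", "senior"],
})

# Split on ';' and trim whitespace around each skill
toy["skills_split"] = toy["skills"].str.split(";").apply(lambda x: [s.strip() for s in x])

# Append the experience level as one extra list element
toy["skills"] = toy.apply(lambda row: row["skills_split"] + [row["experiencelevel"]], axis=1)
toy = toy.drop(columns=["skills_split"])

print(toy["skills"].tolist())
# [['Python', 'SQL', 'Pandas', 'entry'], ['Excel', 'Communication', 'senior']]
```

After this step the last element of every `skills` list is the experience level, which matters for any code that later formats the list for display.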

### Reviewing Enriched Skills

Let's check the `skills` column for an entry to confirm the `experiencelevel` has been successfully appended.

In [None]:
df['skills'].head(1)

### Combining Features for the Recommender

For the TF-IDF vectorization, we create a single `combined` text field by concatenating `title` and the processed `skills`. This unified text representation will be used to calculate job similarity.

In [None]:
df["skills_text"] = df["skills"].apply(lambda x: " ".join(x))

In [None]:
df["combined"] = df["title"] + " " + df["skills_text"]

In [None]:
df['combined'] = df['combined'].astype(str)
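
For a single invented row, the resulting `combined` string looks like this. The row is hypothetical and only illustrates the concatenation.

```python
import pandas as pd

# One made-up job whose skills list already ends with the experience level
toy = pd.DataFrame({
    "title": ["Data Analyst"],
    "skills": [["Python", "machine learning", "entry"]],
})

toy["skills_text"] = toy["skills"].apply(lambda x: " ".join(x))
toy["combined"] = (toy["title"] + " " + toy["skills_text"]).astype(str)

print(toy.loc[0, "combined"])  # Data Analyst Python machine learning entry
```

Joining with spaces removes the boundaries between skills, so a multi-word skill such as "machine learning" is treated as two separate tokens by a default `TfidfVectorizer`.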

## Job Recommendation System Implementation

This section outlines the core logic of our recommendation system, which uses TF-IDF to convert text data into numerical vectors and cosine similarity to find the most similar jobs.

### TF-IDF Vectorization and Cosine Similarity

We initialize a `TfidfVectorizer` to transform the `combined` text feature into a matrix of TF-IDF features. This matrix quantifies the importance of words in each job description. Subsequently, `cosine_similarity` is used to compute the pairwise similarity between all job descriptions based on their TF-IDF vectors.
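
For reference, the quantities involved can be written out as follows. This assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), where $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$, and $\mathbf{q}$, $\mathbf{d}$ are TF-IDF vectors.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q}\cdot\mathbf{d}}{\lVert\mathbf{q}\rVert_2\,\lVert\mathbf{d}\rVert_2}
$$

Because each TF-IDF row is L2-normalized under these defaults, the cosine similarity between two rows reduces to their dot product.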

In [None]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the 'combined' text data and transform it into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(df["combined"])

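# Compute the pairwise job-to-job cosine similarity matrix.
# Note: this dense n x n matrix can be large, and `recommend` below does not use it;
# it scores each query against `tfidf_matrix` directly.
cosine_sim = cosine_similarity(tfidf_matrix)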
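
A three-document toy corpus shows what these two calls produce. The documents below are invented and the variable names are prefixed with `toy_` to avoid clashing with the notebook's own objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented 'combined' strings: two related roles and one unrelated role
docs = [
    "data analyst python sql",
    "data scientist python machine learning",
    "graphic designer photoshop illustrator",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)  # sparse, one row per document
toy_sim = cosine_similarity(toy_matrix)          # dense, shape (3, 3)

print(toy_matrix.shape)   # (3, 11): 3 documents, 11 distinct terms
print(toy_sim.round(2))
```

Each document has similarity 1 with itself, the two data roles share the terms "data" and "python" and therefore score above zero, and the designer role shares no terms with either and scores zero.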

### Recommendation Function

The `recommend` function takes a user query (e.g., a set of skills) and prints the `top_n` most relevant jobs. It vectorizes the query with the fitted vectorizer, calculates its similarity to every job, and prints the details of the most similar jobs in a structured format.

In [None]:
def recommend(query, top_n=3):
    # Vectorize the query using the pre-trained TF-IDF vectorizer
    query_vec = vectorizer.transform([query])
    
    # Compute similarity between the query vector and all job TF-IDF vectors
    sim_scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    
    # Get the indices of the top_n most similar jobs
    top_indices = sim_scores.argsort()[-top_n:][::-1]
    
    # Print the details of the recommended jobs in a formatted manner
    for i, idx in enumerate(top_indices, 1):
        job = df.iloc[idx]
        print(f"{i}. JobID: {job['jobid']}")
        print('-' * len(f"{i}. JobID: {job['jobid']}"))  # Underline matching the header length
        print(f"   ▲ Title: {job['title']}")
        # Format skills for display: comma-separate all but the final skill, joined with '&'.
        # The last list element is the appended experience level, which is shown separately below.
        print(f"   ▲ Skills: {', '.join(job['skills'][:-2]) + ' & ' + job['skills'][-2]}")
        print(f"   ▲ Experience: {job['experiencelevel']} with an experience of {job['yearsofexperience']} years")
        print(f"   ▲ Responsibilities:")
        print('   '+len("▲ Responsibilities:")*'-')
        j = 1
        for r in job['responsibilities'].split(','):
            if r.strip(): # Ensure no empty responsibility points are printed
                print(str(j) + '] ' + r.strip())
                j+=1
        print()
        print()
        print("-" * 146)
        print("*" * 146)
        print("-" * 146)
        print()
        print()
    # Printed once, after all recommendations have been listed
    print(' '*65+'Thank you! ☺♫')
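
If the recommendations need to be consumed by other code rather than read from printed output, one option is to return a DataFrame with the scores attached. The following is a sketch of that alternative, not the notebook's method: `recommend_df`, `jobs`, `vec`, and `matrix` are hypothetical names, and the three jobs are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the notebook's DataFrame (values are invented)
jobs = pd.DataFrame({
    "jobid": ["J1", "J2", "J3"],
    "title": ["Data Analyst", "ML Engineer", "Graphic Designer"],
    "combined": [
        "Data Analyst python sql pandas entry",
        "ML Engineer python machine learning numpy senior",
        "Graphic Designer photoshop illustrator mid",
    ],
})
vec = TfidfVectorizer()
matrix = vec.fit_transform(jobs["combined"])

def recommend_df(query, top_n=3):
    """Return the top_n rows most similar to the query, with their scores."""
    scores = cosine_similarity(vec.transform([query]), matrix).flatten()
    top = scores.argsort()[::-1][:top_n]  # indices sorted by descending score
    return jobs.iloc[top][["jobid", "title"]].assign(score=scores[top])

result = recommend_df("python machine learning numpy", top_n=2)
print(result)
```

Returning the scores alongside the rows also makes it possible to filter out weak matches, for instance rows whose score is zero because they share no terms with the query.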

## Demonstrating the Recommender

Let's put our recommendation system to the test with an example query. We'll simulate a user looking for jobs requiring specific skills and observe the top recommendations.

In [None]:
# Define a sample query representing a user's skills
skills = "numpy, python, pandas, machine learning, data analysis"

print(f"{' '*50}🔹 Recommendations for query = {skills}")
print(' '*52 + len(f"🔹 Recommendations for query = {skills}")*'=')

# Call the recommend function to get top 3 job recommendations
recommend(skills, top_n=3)

## Conclusion

This notebook walks through the creation of a job recommendation system, from data loading and preprocessing to TF-IDF vectorization and cosine-similarity ranking. Given a free-text query, the system surfaces the jobs whose titles, skills, and experience level overlap most with the query terms. The quality of the recommendations is demonstrated on an example query only and is not evaluated quantitatively here.

This approach provides a solid foundation for job seekers to discover relevant roles and for recruiters to identify suitable candidates, streamlining the connection between talent and opportunity. The structured output of the `recommend` function ensures that users receive detailed and actionable insights into each suggested role.