# Introduction

A talent management company matches talented individuals with tech companies. Finding talented candidates is difficult because it requires the talent management company to understand:

1. What role the tech company wants to fill, as well as what the tech company wants in a candidate for that role.
2. What makes a candidate shine in the role.
3. Where to find talented candidates.

## Current Process

The talent management company uses a manual process to find talented candidates:

1. Determine role to recruit for.
2. Search LinkedIn-like system for potential candidates based on role-related keywords (HR Manager, Software Engineer etc.).
3. List and rank candidates.
4. Manually review and rerank candidates until point of satisfaction.
5. Send list to tech company.

This process is time- and labor-intensive.

## Improvements

To remove these bottlenecks, the talent management company wants two separate automated systems to:

1. Find potential candidates for a particular role.
2. Find and rank talented candidates based on their fitness, agnostic to a particular role.

## Scope

* Sourcing candidates is out of scope. The talent management company provides the list of candidates (data) to analyze and match for a particular role.
* The solution should automate steps 1, 2, and 3 of the current process. It should also handle the reranking of candidates in step 4, assuming that we have results from manual review.
* The keywords that generated the list of candidates are “Aspiring human resources” or “seeking human resources”. Assume that the tech company is hiring for HR positions (entry and non-entry level). The keywords indicate that the tech company is looking for candidates who are actively seeking work.




# Understanding the Data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

# modify csv root if necessary
root = '/content/drive/MyDrive/Apziva/Project3'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
data = pd.read_csv(f'{root}/potential-talents.csv')
print(data.shape)
data.head()

(104, 5)


Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


## Duplicates and Search Terms

The dataset is the result of two search queries: "Aspiring human resources" and "seeking human resources". Each row has a unique id, but once the id is ignored, many rows describe the same candidate and are duplicates of one another.

In [3]:
# Drop the 'id' and 'fit' columns (fit is NaN for each row, which will mess up group by)
data_without_id = data.drop(columns=['id', 'fit'])

# Group by the remaining columns and count the occurrences
row_counts = data_without_id.groupby(['job_title', 'location', 'connection']).size().reset_index(name='count')
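# Note: sort_values returns a new DataFrame and the result is not assigned here,
# so head() below still shows the groupby (alphabetical) order.
row_counts.sort_values(by='count', ascending=False)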
row_counts.head()

Unnamed: 0,job_title,location,connection,count
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,7
1,Admissions Representative at Community medical...,"Long Beach, California",9,1
2,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,4
3,Always set them up for Success,Greater Los Angeles Area,500+,1
4,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,2


Since we don't have more information distinguishing the candidates, we should drop all duplicates for the sake of training models. The final model should provide a ranking of candidates, and duplicate candidates don't help.

In [4]:
# prompt: Reduce data_without_id to just all the unique rows.
# id and fit features add no value to our potential solution

data = data_without_id.drop_duplicates()
print(data.shape)
data.head()


(53, 3)


Unnamed: 0,job_title,location,connection
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85
1,Native English Teacher at EPIK (English Progra...,Kanada,500+
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
3,People Development Coordinator at Ryan,"Denton, Texas",500+
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+


## Features


* job_title: current job title of candidate (text)
* location: current geographic location of candidate (text)
* connection: number of connections the candidate has; 500+ means over 500 (text)




## Data Cleanup

We can drop the location and connection columns, because the company's recruiting objective doesn't specify a location or a number of connections. Dropping them may produce duplicate job titles again, so we remove those as well.

In [5]:
data = data[['job_title']]
data = data.drop_duplicates()
print(data.shape)
data.head()

(52, 1)


Unnamed: 0,job_title
0,2019 C.T. Bauer College of Business Graduate (...
1,Native English Teacher at EPIK (English Progra...
2,Aspiring Human Resources Professional
3,People Development Coordinator at Ryan
4,Advisory Board Member at Celal Bayar University


Only one job title was dropped, so we aren't losing too much data.

In [6]:
# make copy for future use
data_original = data.copy()

## Basic Preprocessing

Lowercase the text, then remove punctuation, numbers, parentheses, and stop words.

In [7]:
import re
import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text, lemmatize=False):
    """
    Preprocesses the input text by:
    - Converting to lowercase
    - Removing numbers
    - Removing parentheses, punctuation, and the pipe character '|'
    - Removing extra spaces
    - Tokenizing the text
    - Removing stopwords and single-character words
    - Optionally lemmatizing the tokens

    Parameters:
    - text: The input text to preprocess.
    - lemmatize: A boolean flag to indicate whether to lemmatize the tokens (default is False).

    Returns:
    - A string of preprocessed text.
    """
    # Convert to lowercase
    text = text.lower()

    # Remove numbers
    text = re.sub(r'\d+', '', text)

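    # Remove parentheses, punctuation, and the pipe character '|'
    # Caution: the unescaped ']' that follows '\\' closes the character class early,
    # so this pattern matches far less than intended. Most punctuation is instead
    # dropped later by the spaCy `is_punct` filter.
    text = re.sub(r'[(){}[\]!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~|]', ' ', text)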

    # Remove the pipe character separately if regex doesn't work
    text = text.replace('|', ' ')

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize the text using spaCy
    doc = nlp(text)

    if lemmatize:
        # Lemmatize tokens, remove stopwords, and filter out single-character words
        tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and len(token.lemma_) > 1]
    else:
        # Only remove stopwords and single-character words, no lemmatization
        tokens = [token.text for token in doc if not token.is_stop and not token.is_punct and len(token.text) > 1]

    return ' '.join(tokens)

In [8]:
data['job_title'] = data['job_title'].apply(preprocess_text)
data.head()

Unnamed: 0,job_title
0,c.t bauer college business graduate magna cum ...
1,native english teacher epik english program korea
2,aspiring human resources professional
3,people development coordinator ryan
4,advisory board member celal bayar university


In [9]:
# define search terms
aspiring = 'aspiring human resources'
seeking = 'seeking human resources'

# Unigram Vectorizations

A unigram is a single token. We want to see how similar each job title is to each search term. We can mathematically calculate similarity by comparing the unigram vectorizations (unigram vector representations) of the tokens in the search term and job title.

We will use the cosine similarity measure between the search terms and job titles, and rank the job titles by cosine similarity in descending order.
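As a small worked example of the measure, the sketch below computes cosine similarity by hand for count vectors over a toy four-word vocabulary. The vocabulary and the helper name `cosine` are assumptions made for illustration only; they are not part of the notebook's pipeline.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vocabulary (assumed for illustration): [aspiring, human, resources, professional]
query = np.array([1, 1, 1, 0])    # "aspiring human resources"
title_a = np.array([1, 1, 1, 1])  # "aspiring human resources professional"
title_b = np.array([0, 1, 1, 1])  # "human resources professional"

sim_a = cosine(query, title_a)  # 3 / (sqrt(3) * sqrt(4))
sim_b = cosine(query, title_b)  # 2 / (sqrt(3) * sqrt(3))
print(round(sim_a, 6), round(sim_b, 6))  # prints 0.866025 0.666667
```

A title that contains every query word plus one extra word scores about 0.87, while a title that misses one query word scores about 0.67, so extra or missing words both pull the score below 1.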


## Count Vectorization

Each job title is represented by a vector with one entry per unique word in the corpus vocabulary; each entry counts how many times that word appears in the title.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the job titles
unigram_vectors = vectorizer.fit_transform(data['job_title'])

# Convert the sparse matrix to a DataFrame
unigram_df = pd.DataFrame(unigram_vectors.toarray(), columns=vectorizer.get_feature_names_out())

unigram_df.head()

Unnamed: 0,administration,administrative,admissions,advisory,america,analyst,analytics,army,arts,aspiring,...,travelers,undergraduate,university,victoria,wellington,western,westfield,woodlands,work,world
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


We should also get the count vectorization for each search term.

In [11]:
# prompt: get the count vectorization of each search term

# Get the count vectorization for each search term.
aspiring_vector = vectorizer.transform([preprocess_text(aspiring)])
seeking_vector = vectorizer.transform([preprocess_text(seeking)])

Now we can calculate the cosine similarity between the job titles and search terms, and rank the job titles for each search term.

In [12]:
# prompt: calculate the cosine similarity between the job title vectors and the aspiring search term vector. rank the job titles by cosine similarity descending order, and print the top 5

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Calculate cosine similarity between job titles and aspiring search term
aspiring_similarity = cosine_similarity(unigram_vectors, aspiring_vector)

# Create DataFrames for similarity scores
aspiring_sim_df = pd.DataFrame({'job_title': data['job_title'], 'similarity': aspiring_similarity.flatten()})

# Rank job titles by cosine similarity in descending order
aspiring_sim_df = aspiring_sim_df.sort_values(by='similarity', ascending=False)

print("Top 5 job titles for 'aspiring human resources':")
print(aspiring_sim_df.head(5))

Top 5 job titles for 'aspiring human resources':
                                            job_title  similarity
2               aspiring human resources professional    0.866025
5                 aspiring human resources specialist    0.866025
72  aspiring human resources manager seeking inter...    0.833333
99  aspiring human resources manager graduating se...    0.700140
73                       human resources professional    0.666667


In [13]:
# Calculate cosine similarity between job titles and seeking search term
seeking_similarity = cosine_similarity(unigram_vectors, seeking_vector)

seeking_sim_df = pd.DataFrame({'job_title': data['job_title'], 'similarity': seeking_similarity.flatten()})

seeking_sim_df = seeking_sim_df.sort_values(by='similarity', ascending=False)

# Print the top 5 job titles
print("\nTop 5 job titles for 'seeking human resources':")
print(seeking_sim_df.head(5))


Top 5 job titles for 'seeking human resources':
                                            job_title  similarity
27              seeking human resources opportunities    0.866025
98                   seeking human resources position    0.866025
72  aspiring human resources manager seeking inter...    0.833333
9   seeking human resources hris generalist positions    0.707107
99  aspiring human resources manager graduating se...    0.700140


## TF-IDF Vectorization

TF-IDF (Term Frequency-Inverse Document Frequency) often works better than count vectorization because it considers not only how frequently a word occurs in a document but also how common that word is across all documents. Words that appear almost everywhere (e.g., "the" and "is" in general text) are down-weighted because they add little discriminative value, while rare but meaningful terms are emphasized. As a result, TF-IDF often leads to improved performance in tasks like text classification and similarity measurement.

In [14]:
# prompt: Modify the count vectorization section to use tfidf

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the job titles
tfidf_vectors = tfidf_vectorizer.fit_transform(data['job_title'])

# Convert the sparse matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

tfidf_df.head()

Unnamed: 0,administration,administrative,admissions,advisory,america,analyst,analytics,army,arts,aspiring,...,travelers,undergraduate,university,victoria,wellington,western,westfield,woodlands,work,world
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.207353,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.566898,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.426402,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.30151,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# Get the tfidf vectorization for each search term.
aspiring_vector = tfidf_vectorizer.transform([preprocess_text(aspiring)])
seeking_vector = tfidf_vectorizer.transform([preprocess_text(seeking)])

In [16]:
# Calculate cosine similarity between job titles and aspiring search term
aspiring_similarity = cosine_similarity(tfidf_vectors, aspiring_vector)

# Create DataFrames for similarity scores
aspiring_sim_df = pd.DataFrame({'job_title': data['job_title'], 'similarity': aspiring_similarity.flatten()})

# Rank job titles by cosine similarity in descending order
aspiring_sim_df = aspiring_sim_df.sort_values(by='similarity', ascending=False)

print("Top 5 job titles for 'aspiring human resources':")
print(aspiring_sim_df.head(5))

Top 5 job titles for 'aspiring human resources':
                                            job_title  similarity
2               aspiring human resources professional    0.756871
5                 aspiring human resources specialist    0.690848
72  aspiring human resources manager seeking inter...    0.629684
26  aspiring human resources management student se...    0.448937
71  business management major aspiring human resou...    0.446539


In [17]:
# Calculate cosine similarity between job titles and seeking search term
seeking_similarity = cosine_similarity(tfidf_vectors, seeking_vector)

seeking_sim_df = pd.DataFrame({'job_title': data['job_title'], 'similarity': seeking_similarity.flatten()})

seeking_sim_df = seeking_sim_df.sort_values(by='similarity', ascending=False)

# Print the top 5 job titles
print("\nTop 5 job titles for 'seeking human resources':")
print(seeking_sim_df.head(5))


Top 5 job titles for 'seeking human resources':
                                            job_title  similarity
98                   seeking human resources position    0.681586
27              seeking human resources opportunities    0.681586
72  aspiring human resources manager seeking inter...    0.639678
26  aspiring human resources management student se...    0.466685
9   seeking human resources hris generalist positions    0.438970


## Word Embeddings

Word embeddings, like Word2Vec or GloVe, can improve on TF-IDF because they capture the semantic meaning of words by representing them as dense vectors in a continuous space, where similar words are placed closer together. Unlike TF-IDF, which treats words as independent dimensions, embeddings encode relationships learned from the contexts in which words co-occur, making them more powerful for capturing meaning in text. This lets embeddings generalize better across tasks such as sentiment analysis or text similarity, including for words that are rare in our small dataset, as long as those words appear in the pre-trained vocabulary. Words outside that vocabulary have no vector.

We will use GloVe embeddings. GloVe (Global Vectors for Word Representation) generates word embeddings by factorizing a word co-occurrence matrix, efficiently capturing semantic relationships based on global statistical information from the entire corpus. Its pre-trained embeddings can be used directly, making it ideal for scenarios with limited data and compute resources, as it eliminates the need for expensive training while leveraging knowledge from large-scale corpora.
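The sketch below illustrates the idea with made-up three-dimensional vectors (real GloVe vectors have 50 or more dimensions). The dictionary `toy_embeddings` and the helpers `mean_pool` and `cosine` are hypothetical names for illustration. Note that `mean_pool` averages over in-vocabulary words only, which is one possible pooling choice among several.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented purely for illustration
toy_embeddings = {
    "human": np.array([0.9, 0.1, 0.0]),
    "resources": np.array([0.8, 0.2, 0.1]),
    "hr": np.array([0.8, 0.1, 0.1]),
    "teacher": np.array([0.0, 0.9, 0.4]),
}

def mean_pool(text, embeddings, dim=3):
    """Average the vectors of in-vocabulary words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

query = mean_pool("human resources", toy_embeddings)
sim_hr = cosine(query, mean_pool("hr", toy_embeddings))
sim_teacher = cosine(query, mean_pool("teacher", toy_embeddings))
print(sim_hr > sim_teacher)
```

"hr" shares no token with "human resources", so a count-based or TF-IDF vector would give a similarity of zero, whereas nearby embedding vectors can still produce a high score.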

In [18]:
# prompt: Modify the tfidf vectorization section to use glove word embeddings. The embedding file is at f'{root}/glove.6B.50d.txt'

import numpy as np

# Load GloVe embeddings
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# use the 50 dimensional embedding, since we have limited data and want to prevent overfitting
glove_embeddings = load_glove_embeddings(f'{root}/glove.6B.50d.txt')

# Function to generate sentence embeddings using GloVe
def sentence_embedding(sentence, embeddings, dim=50):
    words = sentence.split()
    sentence_vec = np.zeros(dim)
    for word in words:
        if word in embeddings:
            sentence_vec += embeddings[word]
    return sentence_vec / len(words) if len(words) > 0 else sentence_vec

In [19]:
# Generate embeddings for job titles and search terms
data['job_title_embedding'] = data['job_title'].apply(lambda x: sentence_embedding(x, glove_embeddings))
aspiring_embedding = sentence_embedding(preprocess_text(aspiring), glove_embeddings)
seeking_embedding = sentence_embedding(preprocess_text(seeking), glove_embeddings)

# Calculate cosine similarity using sentence embeddings
def cosine_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

In [20]:
data['aspiring_similarity'] = data['job_title_embedding'].apply(lambda x: cosine_sim(x, aspiring_embedding))
# Rank job titles based on similarity
print("Top 5 job titles for 'aspiring human resources':")
print(data.sort_values(by='aspiring_similarity', ascending=False).head(5))

Top 5 job titles for 'aspiring human resources':
                                            job_title  \
5                 aspiring human resources specialist   
2               aspiring human resources professional   
72  aspiring human resources manager seeking inter...   
73                       human resources professional   
27              seeking human resources opportunities   

                                  job_title_embedding  aspiring_similarity  
5   [0.01979249343276024, 0.278462503105402, -0.01...             0.968802  
2   [-0.3007250092923641, 0.3483925051987171, -0.1...             0.958368  
72  [0.20360249280929565, 0.21167600248008966, -0....             0.940007  
73  [-0.2550933410724004, 0.18706999719142914, -0....             0.936910  
27  [0.38822999224066734, 0.12474699877202511, 0.0...             0.922567  


In [21]:
data['seeking_similarity'] = data['job_title_embedding'].apply(lambda x: cosine_sim(x, seeking_embedding))
print("\nTop 5 job titles for 'seeking human resources':")
print(data.sort_values(by='seeking_similarity', ascending=False).head(5))


Top 5 job titles for 'seeking human resources':
                                            job_title  \
27              seeking human resources opportunities   
98                   seeking human resources position   
72  aspiring human resources manager seeking inter...   
9   seeking human resources hris generalist positions   
73                       human resources professional   

                                  job_title_embedding  aspiring_similarity  \
27  [0.38822999224066734, 0.12474699877202511, 0.0...             0.922567   
98  [0.1772174946963787, 0.025101998820900917, 0.1...             0.888088   
72  [0.20360249280929565, 0.21167600248008966, -0....             0.940007   
9   [0.3031488296886285, -0.18419200802842775, 0.1...             0.891157   
73  [-0.2550933410724004, 0.18706999719142914, -0....             0.936910   

    seeking_similarity  
27            0.980290  
98            0.970924  
72            0.945909  
9             0.935886  
73            0.922303  

# Learning to Rank (LTR) Models

Learning to Rank (LTR) can improve on simply comparing cosine scores because it considers multiple features and their interactions and optimizes directly for a ranking objective. While cosine similarity provides a single measure of closeness in vector space, LTR models learn from labeled data to capture nuanced relationships and adapt to specific ranking tasks.



## Simulating Manual Selection

The cosine scores give a natural ranking for the job titles. However, the job titles with the highest cosine scores aren't necessarily the job titles that a recruiter would deem to be the best fit for a role. For example, for the 'aspiring human resources' search term, the titles that indicate entry level candidates should be at the top.

Recruiters can manually select the job titles they find most relevant. With that selection, we want to automatically rerank the candidates with the newly added information.

We will simulate manual selection (supervisory labels) by picking the job titles we deem to be the best match and moving them to the top. We can do this because we only have 52 job titles to look through. With a much larger dataset, we would need some heuristic to pick the top job titles.
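As a toy illustration of this promote-then-rescore idea, the sketch below moves matching titles to the top and assigns rank-based scores. The helper name `promote_and_score` and the similarity values are hypothetical; the logic mirrors the steps described above under the assumption that "best match" means "contains the search phrase".

```python
import pandas as pd

def promote_and_score(df, phrase, sim_col="cosine_similarity"):
    """Move titles containing `phrase` to the top, then assign rank-based scores.

    Within each group (match / no match) rows stay ordered by `sim_col`.
    Scores run from 1.0 (first row) down to 1/len(df) (last row).
    """
    is_match = df["job_title"].str.contains(phrase, case=False, na=False)
    ranked = (
        df.assign(_match=is_match)  # temporary helper column
        .sort_values(by=["_match", sim_col], ascending=[False, False])
        .drop(columns="_match")
        .copy()
    )
    n = len(ranked)
    ranked["score"] = [(n - i) / n for i in range(n)]
    return ranked

toy = pd.DataFrame({
    "job_title": [
        "human resources professional",
        "aspiring human resources specialist",
        "english teacher",
        "aspiring human resources manager",
    ],
    "cosine_similarity": [0.95, 0.90, 0.40, 0.85],  # made-up values
})

ranked = promote_and_score(toy, "aspiring human resources")
print(ranked[["job_title", "score"]])
```

The two "aspiring" titles end up first even though another title has a higher cosine similarity, which is exactly the kind of recruiter preference the supervisory labels are meant to encode.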

In [22]:
# current order for aspiring
ahr_data = data[['job_title', 'job_title_embedding', 'aspiring_similarity']]
ahr_data = ahr_data.sort_values(by='aspiring_similarity', ascending=False)
ahr_data = ahr_data.rename(columns={'job_title_embedding': 'glove_embedding', 'aspiring_similarity': 'cosine_similarity'})
ahr_data.head()

Unnamed: 0,job_title,glove_embedding,cosine_similarity
5,aspiring human resources specialist,"[0.01979249343276024, 0.278462503105402, -0.01...",0.968802
2,aspiring human resources professional,"[-0.3007250092923641, 0.3483925051987171, -0.1...",0.958368
72,aspiring human resources manager seeking inter...,"[0.20360249280929565, 0.21167600248008966, -0....",0.940007
73,human resources professional,"[-0.2550933410724004, 0.18706999719142914, -0....",0.93691
27,seeking human resources opportunities,"[0.38822999224066734, 0.12474699877202511, 0.0...",0.922567


In [23]:
# Step 1: Create a condition for 'aspiring human resources' in the job_title column
condition = ahr_data['job_title'].str.contains('aspiring human resources', case=False, na=False)

# Step 2: Sort the DataFrame based on the condition to bring 'aspiring human resources' entries to the top
# Then sort those entries by 'cosine_similarity' in descending order
ahr_data = (
    ahr_data.assign(is_aspiring_hr=condition)  # Create a helper column to track the condition
    .sort_values(by=['is_aspiring_hr', 'cosine_similarity'], ascending=[False, False])  # Sort based on the condition and cosine_similarity
    .drop(columns='is_aspiring_hr')  # Remove the helper column
)

# Step 3: Assign new scores based on the order they appear in the sorted DataFrame
# Scores decrease linearly from 1.0 (first row) to 1/len(ahr_data) (last row)
ahr_data['score'] = (len(ahr_data) - ahr_data.reset_index().index) / len(ahr_data)
ahr_data.head()

Unnamed: 0,job_title,glove_embedding,cosine_similarity,score
5,aspiring human resources specialist,"[0.01979249343276024, 0.278462503105402, -0.01...",0.968802,1.0
2,aspiring human resources professional,"[-0.3007250092923641, 0.3483925051987171, -0.1...",0.958368,0.980769
72,aspiring human resources manager seeking inter...,"[0.20360249280929565, 0.21167600248008966, -0....",0.940007,0.961538
75,aspiring human resources professional passiona...,"[0.12243726578625766, 0.1708956394683231, -0.3...",0.920825,0.942308
26,aspiring human resources management student se...,"[0.06284284165927342, 0.33454572249736103, -0....",0.884618,0.923077


In [24]:
# current order of shr documents
shr_data = data[['job_title', 'job_title_embedding', 'seeking_similarity']]
shr_data = shr_data.sort_values(by='seeking_similarity', ascending=False)
shr_data = shr_data.rename(columns={'job_title_embedding': 'glove_embedding', 'seeking_similarity': 'cosine_similarity'})
shr_data.head()

Unnamed: 0,job_title,glove_embedding,cosine_similarity
27,seeking human resources opportunities,"[0.38822999224066734, 0.12474699877202511, 0.0...",0.98029
98,seeking human resources position,"[0.1772174946963787, 0.025101998820900917, 0.1...",0.970924
72,aspiring human resources manager seeking inter...,"[0.20360249280929565, 0.21167600248008966, -0....",0.945909
9,seeking human resources hris generalist positions,"[0.3031488296886285, -0.18419200802842775, 0.1...",0.935886
73,human resources professional,"[-0.2550933410724004, 0.18706999719142914, -0....",0.922303


In [25]:
# Step 1: Create a condition for 'seeking human resources' in the job_title column
condition = shr_data['job_title'].str.contains('seeking human resources', case=False, na=False)

# Step 2: Sort the DataFrame based on the condition to bring 'seeking human resources' entries to the top
# Then sort those entries by 'cosine_similarity' in descending order
shr_data = (
    shr_data.assign(is_seeking_hr=condition)  # Create a helper column to track the condition
    .sort_values(by=['is_seeking_hr', 'cosine_similarity'], ascending=[False, False])  # Sort based on the condition and cosine_similarity
    .drop(columns='is_seeking_hr')  # Remove the helper column
)

# Step 3: Assign new scores based on the order they appear in the sorted DataFrame
# Scores decrease linearly from 1.0 (first row) to 1/len(shr_data) (last row)
shr_data['score'] = (len(shr_data) - shr_data.reset_index().index) / len(shr_data)
shr_data.head()

Unnamed: 0,job_title,glove_embedding,cosine_similarity,score
27,seeking human resources opportunities,"[0.38822999224066734, 0.12474699877202511, 0.0...",0.98029,1.0
98,seeking human resources position,"[0.1772174946963787, 0.025101998820900917, 0.1...",0.970924,0.980769
9,seeking human resources hris generalist positions,"[0.3031488296886285, -0.18419200802842775, 0.1...",0.935886,0.961538
93,seeking human resources opportunities open tra...,"[0.4780298460807119, 0.36307972030980246, 0.03...",0.914999,0.942308
74,nortia staffing seeking human resources payrol...,"[0.09531862242147326, -0.0778837502002716, 0.1...",0.864533,0.923077


## LTR Reranking Procedure -- RankNet

Now that we have supervisory labels, we can use LTR to rerank the documents while capturing the relationships between them.

The first LTR model we'll use is RankNet, a neural network designed for **ranking problems**, where the goal is to order items based on some criterion (in our case, the simulated manual selection).

RankNet uses a **pairwise learning approach**. It takes two items at a time (e.g., two documents in a search result) and learns to predict which item should be ranked higher. The network minimizes a loss function that measures the inconsistency between predicted and actual pairwise rankings.

RankNet uses a **cross-entropy loss for pairwise comparisons**. It aims to minimize the probability of incorrect pairwise ordering between two items by using a sigmoid function to model the probability that one item should be ranked higher than the other.
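Written out, the pairwise objective looks roughly as follows. The symbols are notation assumed here for illustration: $s_i$ and $s_j$ are the model's scores for the two items in a pair, and $y_{ij}$ is the target (1 if item $i$ should rank higher, 0 if lower, 0.5 for a tie).

```latex
P_{ij} = \sigma(s_i - s_j) = \frac{1}{1 + e^{-(s_i - s_j)}},
\qquad
L_{ij} = -\,y_{ij}\,\log P_{ij} \;-\; (1 - y_{ij})\,\log\bigl(1 - P_{ij}\bigr)
```

For example, with $s_i = 2$, $s_j = 1$, and $y_{ij} = 1$, the predicted probability is $\sigma(1) \approx 0.731$ and the loss is $-\log 0.731 \approx 0.313$. Swapping the scores gives $\sigma(-1) \approx 0.269$ and a larger loss of about $1.313$, so incorrectly ordered pairs are penalized more heavily.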

The **output** is not a direct ranking but a score for each item. We run a forward pass twice (once per document in a pair) and then compute the pairwise loss for training.

The final ranking is derived by sorting the items based on the output scores for each doc.
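A rough, self-contained sketch of that train-then-sort flow is shown below. It uses synthetic data and a small stand-in scorer, both of which are assumptions for illustration rather than the notebook's actual model or data. For brevity it scores all items in one forward pass and then indexes the scores per pair, which is mathematically equivalent to scoring each document of a pair separately.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic data (assumed): 8 items with 4 features each;
# the "true" relevance is simply the first feature.
features = torch.randn(8, 4)
relevance = features[:, 0]

# Small stand-in scorer (hypothetical, not the notebook's class)
scorer = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.Adam(scorer.parameters(), lr=0.05)

# All ordered index pairs (i, j) with i != j
pairs = torch.tensor(list(itertools.permutations(range(len(features)), 2)))
i_idx, j_idx = pairs[:, 0], pairs[:, 1]

# Pairwise targets: 1 if item i should outrank item j, else 0
targets = (relevance[i_idx] > relevance[j_idx]).float()

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    scores = scorer(features).squeeze(1)    # one score per item
    logits = scores[i_idx] - scores[j_idx]  # score difference for each pair
    # Sigmoid followed by cross-entropy, computed in a numerically stable way
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Final ranking: sort items by predicted score, highest first
with torch.no_grad():
    predicted_order = torch.argsort(scorer(features).squeeze(1), descending=True)
true_order = torch.argsort(relevance, descending=True)
print(predicted_order.tolist(), true_order.tolist())
```

Using `binary_cross_entropy_with_logits` avoids taking the log of a probability that has saturated to 0 or 1, which is one design option for keeping the pairwise loss stable.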

The following PyTorch NN implements these concepts, and allows NN customization (layers, units, activation function).

In [26]:
import torch
import torch.nn as nn

class RankNet(nn.Module):
    def __init__(self, input_size, units_per_layer, activation):
        super(RankNet, self).__init__()

        self.layers = nn.ModuleList()
        previous_size = input_size
        for units in units_per_layer:
            self.layers.append(nn.Linear(previous_size, units))
            previous_size = units

        self.output_layer = nn.Linear(previous_size, 1)  # Final output layer
        self.activation = activation()  # Activation function (e.g., nn.ReLU)

    def forward(self, docs):
        scores = docs
        for layer in self.layers:
            scores = self.activation(layer(scores))
        scores = self.output_layer(scores)  # Raw output (before applying any sigmoid)
        return scores

    def pairwise_loss(self, doc_1_pred, doc_2_pred, doc_1_true, doc_2_true):
        """
        Compute the pairwise loss using the predicted scores and true relevance scores.
        This is the standard (unweighted) RankNet cross-entropy loss over document pairs.

        Parameters:
        - doc_1_pred: Tensor of predicted scores for document 1.
        - doc_2_pred: Tensor of predicted scores for document 2.
        - doc_1_true: Tensor of true relevance scores for document 1.
        - doc_2_true: Tensor of true relevance scores for document 2.

        Returns:
        - loss: The mean pairwise cross-entropy loss for RankNet.
        """
        # Score difference between document 1 and document 2
        S_ij = doc_1_pred - doc_2_pred

        # Target label: 1 if doc_1_true > doc_2_true, 0 if doc_1_true < doc_2_true, 0.5 if they are equal
        y_ij = torch.where(doc_1_true > doc_2_true, torch.ones_like(doc_1_true),
                        torch.where(doc_1_true < doc_2_true, torch.zeros_like(doc_1_true),
                                    0.5 * torch.ones_like(doc_1_true)))

        # Sigmoid of score difference to get predicted probability
        P_ij = torch.sigmoid(S_ij)

        # Standard pairwise cross-entropy loss
        loss = - (y_ij * torch.log(P_ij + 1e-10) + (1 - y_ij) * torch.log(1 - P_ij + 1e-10))

        return loss.mean()  # Return the mean loss over the batch

### Embeddings

First, we need to get the data we'll train the model on. The features are the embeddings, and the label is the reassigned score ('score' column). This will maintain the ideal ranking in the recruiter's eyes.



In [27]:
ahr_data.head()

Unnamed: 0,job_title,glove_embedding,cosine_similarity,score
5,aspiring human resources specialist,"[0.01979249343276024, 0.278462503105402, -0.01...",0.968802,1.0
2,aspiring human resources professional,"[-0.3007250092923641, 0.3483925051987171, -0.1...",0.958368,0.980769
72,aspiring human resources manager seeking inter...,"[0.20360249280929565, 0.21167600248008966, -0....",0.940007,0.961538
75,aspiring human resources professional passiona...,"[0.12243726578625766, 0.1708956394683231, -0.3...",0.920825,0.942308
26,aspiring human resources management student se...,"[0.06284284165927342, 0.33454572249736103, -0....",0.884618,0.923077


In [33]:
# Expand the 'glove_embedding' column into multiple columns
glove_df = pd.DataFrame(ahr_data['glove_embedding'].tolist(), columns=[i for i in range(50)], index=ahr_data.index)

# Merge the expanded glove dimensions with the original DataFrame's 'score' column and retain the original index
ahr_embeddings = glove_df.merge(ahr_data[['score']], left_index=True, right_index=True)

# Display the result
ahr_embeddings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,score
5,0.019792,0.278463,-0.01335,0.204988,0.38181,0.196559,-0.146536,-0.423013,0.960185,0.117878,...,0.154264,0.603717,0.205988,0.4145,-0.06475,-0.104065,0.260287,-0.185917,0.124617,1.0
2,-0.300725,0.348393,-0.141115,0.161565,0.331378,0.173288,-0.270891,-0.23961,0.522808,0.310945,...,-0.087221,0.530722,0.300133,0.11922,-0.203755,-0.331287,0.281927,-0.16596,0.03567,0.980769
72,0.203602,0.211676,-0.048018,0.200427,0.395601,0.030184,-0.31714,-0.186384,0.650971,0.104414,...,-0.079076,0.276711,0.245048,0.051786,-0.23706,-0.288665,0.409448,-0.104936,0.1933,0.961538
75,0.122437,0.170896,-0.329673,-0.054715,0.273016,0.107945,-0.183598,-0.320755,0.213775,0.471265,...,-0.249843,0.320254,0.226864,0.149753,-0.070939,-0.258641,0.186002,-0.032392,0.268043,0.942308
26,0.062843,0.334546,-0.053827,-0.154298,0.181093,0.047464,-0.359962,-0.185856,0.465821,0.136992,...,-0.177543,0.271987,0.382987,0.009203,-0.47771,0.005384,0.456917,-0.153242,0.276551,0.923077


In [36]:
shr_data.head()

Unnamed: 0,job_title,glove_embedding,cosine_similarity,score
27,seeking human resources opportunities,"[0.38822999224066734, 0.12474699877202511, 0.0...",0.98029,1.0
98,seeking human resources position,"[0.1772174946963787, 0.025101998820900917, 0.1...",0.970924,0.980769
9,seeking human resources hris generalist positions,"[0.3031488296886285, -0.18419200802842775, 0.1...",0.935886,0.961538
93,seeking human resources opportunities open tra...,"[0.4780298460807119, 0.36307972030980246, 0.03...",0.914999,0.942308
74,nortia staffing seeking human resources payrol...,"[0.09531862242147326, -0.0778837502002716, 0.1...",0.864533,0.923077


In [34]:
# Expand the 'glove_embedding' column into multiple columns
glove_df = pd.DataFrame(shr_data['glove_embedding'].tolist(), columns=[i for i in range(50)], index=shr_data.index)

# Merge the expanded glove dimensions with the original DataFrame's 'score' column and retain the original index
shr_embeddings = glove_df.merge(shr_data[['score']], left_index=True, right_index=True)

# Display the result
shr_embeddings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,score
27,0.38823,0.124747,0.0787,0.385716,0.454788,0.069058,-0.105649,-0.222258,0.944758,0.337063,...,-0.20882,0.557427,0.35601,0.063675,-0.07568,-0.428491,0.496057,0.020828,-0.174535,1.0
98,0.177217,0.025102,0.11712,0.233994,0.62374,0.10827,0.107764,-0.0452,0.615033,-0.052917,...,-0.211957,0.353107,0.28006,-0.110782,0.000413,-0.42436,0.399099,0.045445,-0.157837,0.980769
9,0.303149,-0.184192,0.109257,0.018943,0.342281,0.043075,0.026244,-0.169713,0.444097,-0.049288,...,-0.173817,0.40613,0.509242,0.120688,-0.226563,-0.238679,0.504551,-0.053783,0.03778,0.961538
93,0.47803,0.36308,0.034663,0.374165,-0.023134,-0.186733,-0.405984,-0.064433,0.64747,-0.038168,...,-0.228993,0.586353,0.37271,0.034099,-0.069994,-0.230132,0.344218,-0.061116,-0.061074,0.942308
74,0.095319,-0.077884,0.164566,0.008481,0.156781,-0.14171,0.09536,-0.107096,0.592375,-0.322256,...,-0.109641,0.525044,0.483515,-0.265417,-0.33266,-0.280527,0.591073,-0.071581,0.284186,0.923077


### Training

The second step is to train the model on pairs of documents. The features and scores must first be converted to tensors so that the PyTorch model can process them.

In [37]:
def train_ranknet(model, doc_features, doc_scores, optimizer, epochs):
    """
    Efficiently train the RankNet model using full-batch training with vectorized pairwise loss computation.

    Parameters:
    - model: PyTorch RankNet model
    - doc_features: DataFrame of document features (num docs x features)
    - doc_scores: Series of true relevance scores for the documents
    - optimizer: Chosen optimizer for training
    - epochs: Number of epochs for training
    """
    model.train()  # Set the model to training mode

    # Convert data to tensors
    doc_tensors = torch.tensor(doc_features.values, dtype=torch.float32)
    score_tensor = torch.tensor(doc_scores.values, dtype=torch.float32).unsqueeze(1)  # True scores

    # Track the previous loss to detect if the loss increases
    previous_loss = float('inf')  # Initialize to a large number

    # Iterate over epochs
    for epoch in range(epochs):
        # Reset gradients before each epoch
        optimizer.zero_grad()

        # Forward pass through the model to get predictions for all documents
        docs_pred = model(doc_tensors)

        # Compute the pairwise loss using the model's pairwise_loss function
        loss = model.pairwise_loss(docs_pred.unsqueeze(1), docs_pred.unsqueeze(0),
                                   score_tensor.unsqueeze(1), score_tensor.unsqueeze(0))

        # Check if the loss has increased from the previous epoch
        if loss.item() > previous_loss:
            print(f'Warning: Loss increased from {previous_loss:.4f} to {loss.item():.4f} at epoch {epoch + 1}')

        # Backward pass to compute gradients
        loss.backward()

        # Update model weights
        optimizer.step()

        # Print epoch statistics
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}')

        # Update the previous loss for the next epoch
        previous_loss = loss.item()
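
`model.pairwise_loss` is defined elsewhere in the notebook, so its exact form is not shown here. As a rough sketch of what a RankNet-style pairwise loss typically computes, assuming binary cross-entropy on the sigmoid of each score difference, the pure-Python version below loops over every ordered pair instead of broadcasting. The names `bce_with_logit` and `pairwise_loss_sketch` are hypothetical, and the scores are made up for illustration.

```python
import math

def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def pairwise_loss_sketch(pred_scores, true_scores):
    """Hypothetical stand-in for a RankNet pairwise loss (not the notebook's code)."""
    n = len(pred_scores)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a document is not compared with itself
            true_diff = true_scores[i] - true_scores[j]
            # Target probability that document i should rank above document j
            target = 1.0 if true_diff > 0 else 0.0 if true_diff < 0 else 0.5
            # The predicted score difference acts as the logit for "i beats j"
            total += bce_with_logit(pred_scores[i] - pred_scores[j], target)
    return total / (n * (n - 1))

true_scores = [1.0, 0.5, 0.0]  # made-up relevance scores
untrained_loss = pairwise_loss_sketch([0.0, 0.0, 0.0], true_scores)
ordered_loss = pairwise_loss_sketch([5.0, 0.0, -5.0], true_scores)
```

Under this assumption, identical predictions for all documents give a loss of ln 2 (about 0.693) per pair, so a loss near that value at the start of training suggests the model is not yet separating documents.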

### Testing

The third step is to test the model. We will use cross-validation to check the robustness of our model. We will evaluate the model with three metrics:

#### Kendall's Tau

- evaluates how well pairwise orderings agree: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
- 1: all pairs are concordant
- 0: No correlation between rankings
- -1: all pairs are discordant

We want a Tau close to 1, meaning that the model maintains a good relative order of document pairs.
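
As a minimal sketch of the definition, the function below counts concordant and discordant pairs directly. It assumes there are no tied scores (the tau-a variant); the name `kendall_tau` and the example scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau(true_scores, pred_scores):
    """Tau-a: (concordant - discordant) / total pairs; assumes no tied scores."""
    concordant = discordant = 0
    for i, j in combinations(range(len(true_scores)), 2):
        # Positive product: both lists order items i and j the same way
        agreement = (true_scores[i] - true_scores[j]) * (pred_scores[i] - pred_scores[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(true_scores) * (len(true_scores) - 1) // 2
    return (concordant - discordant) / n_pairs

true_scores = [1.0, 0.75, 0.5, 0.25]  # made-up recruiter scores
pred_scores = [0.9, 0.4, 0.6, 0.1]    # made-up model outputs; second and third items swapped
tau = kendall_tau(true_scores, pred_scores)  # 5 concordant, 1 discordant -> 4/6
```

Note that the function takes the scores themselves; no explicit conversion to rankings is needed because only the relative order within each pair matters.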

#### Spearman's Rank Correlation

- Pearson correlation between the rank values of two variables
- 1: rankings are identical
- 0: no rank correlation
- -1: rankings are reversed

We want a correlation close to 1, meaning that both the true and predicted rankings have a similar monotonic relationship. This metric is more sensitive to large rank discrepancies between items than tau.
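
A small sketch can make the rank-based computation concrete. It assumes no tied scores, in which case the Pearson correlation of the ranks reduces to the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference per item. The helper names `ranks` and `spearman_rho` and the example scores are hypothetical.

```python
def ranks(scores):
    """Rank of each item (1 = highest score); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_rho(true_scores, pred_scores):
    """Spearman correlation via the no-ties closed form."""
    n = len(true_scores)
    d_squared = sum((a - b) ** 2
                    for a, b in zip(ranks(true_scores), ranks(pred_scores)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

true_scores = [1.0, 0.75, 0.5, 0.25]  # made-up recruiter scores
pred_scores = [0.9, 0.4, 0.6, 0.1]    # made-up model outputs
rho = spearman_rho(true_scores, pred_scores)  # ranks [1,2,3,4] vs [1,3,2,4] -> 0.8
```

Because the rank differences are squared, one item misplaced by many positions lowers the correlation more than several items misplaced by one position each.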

#### Normalized Discounted Cumulative Gain (NDCG)

- how well are the top results ranked (i.e. do the most relevant items appear at the top?)
- 1: perfect match
- 0: complete mismatch

We want an NDCG close to 1.
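
The sketch below shows one common way to compute NDCG, assuming linear gain and a log2 position discount (other variants use exponential gain). The optional cutoff `k`, the function names, and the example scores are assumptions made for illustration.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain over the first k positions (all if k is None)."""
    top = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(top))

def ndcg(true_scores, pred_scores, k=None):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    # True relevance values, arranged in the order the model would present them
    by_pred = [t for _, t in sorted(zip(pred_scores, true_scores),
                                    key=lambda pair: pair[0], reverse=True)]
    ideal_dcg = dcg(sorted(true_scores, reverse=True), k)
    return dcg(by_pred, k) / ideal_dcg if ideal_dcg > 0 else 0.0

true_scores = [1.0, 0.75, 0.5, 0.25]  # made-up recruiter scores
pred_scores = [0.9, 0.4, 0.6, 0.1]    # made-up model outputs
full_ndcg = ndcg(true_scores, pred_scores)
top1_ndcg = ndcg(true_scores, pred_scores, k=1)
```

Because of the position discount, swapping two mid-list candidates costs little as long as the most relevant candidates stay at the top.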


#### Metric Prioritization
1. High NDCG. The recruiter will only reach out to the top candidates (how many is at their discretion).
2. High Spearman Correlation. It's important for the predicted ranking to have a similar order to the ideal ranking, but it's more important that the top candidates in the predicted ranking match the top candidates in the ideal ranking.
3. High Tau. This is nice to have, but not as crucial as the other two metrics. Pairwise orderings generally don't matter because the recruiter will reach out to the top candidates as a group, rather than to an individual candidate.

In [38]:
import random

import numpy as np
import torch
import torch.nn as nn
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import ndcg_score
from sklearn.model_selection import KFold

def cross_val_ranknet(docs, epochs=100, lr=0.001, optimizer_name='adam',
                      n_splits=5, debug=False, seed=42):
    """
    Perform cross-validation and evaluate using a RankNet model.

    Parameters:
    - docs: DataFrame containing document features and a 'score' column.
    - epochs: Number of epochs for training the model.
    - lr: Learning rate for the optimizer.
    - optimizer_name: Name of the optimizer ('adam', 'adamw', 'sgd', 'adagrad').
    - n_splits: Number of folds for cross-validation (default: 5).
    - debug: If True, evaluate on the training set instead of the test set.
    - seed: for replicability
    """

    # set seeds for replicability
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.use_deterministic_algorithms(True)  # Make sure algorithms are deterministic
    torch.backends.cudnn.benchmark = False    # Disable auto-tuning
    torch.set_num_threads(1)                  # Ensure single-threaded execution

    # Initialize KFold cross-validation
    skf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Separate the features and labels within the DataFrame
    X = docs.drop('score', axis=1)  # Features
    y = docs['score']               # Labels (the score column)

    # Store Kendall's Tau, Spearman's Rank Correlation Coefficient, and NDCG
    kts, sprccs, ndcgs = [], [], []

    for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
        print(f"Fold {fold + 1}/{n_splits}")

        # Create training and testing sets using the indices provided by KFold
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Initialize the RankNet model
        input_size = X_train.shape[1]  # Number of features
        units_per_layer = [128 for _ in range(7)]  # Example architecture, modify as needed
        model = RankNet(input_size=input_size, units_per_layer=units_per_layer, activation=nn.ReLU)

        # Prepare optimizer for the model
        optimizer = {
            'adam': torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.001),
            'adamw': torch.optim.AdamW(model.parameters(), lr=lr),
            'sgd': torch.optim.SGD(model.parameters(), lr=lr),
            'adagrad': torch.optim.Adagrad(model.parameters(), lr=lr)
        }.get(optimizer_name, None)

        if optimizer is None:
            raise ValueError(f"Unsupported optimizer: {optimizer_name}")

        # Train the model on the training set
        train_ranknet(model, X_train, y_train, optimizer, epochs)

        # Evaluate the model (on training set if debug is True, otherwise test set)
        model.eval()
        with torch.no_grad():
            if debug:
                # Debug mode: Evaluate on the training set
                y_pred = model(torch.tensor(X_train.values, dtype=torch.float32)).squeeze()  # Get predicted scores
                y_true = torch.tensor(y_train.values)  # True relevance scores
                print(f"Evaluating on the Training Set (Debug Mode)")
            else:
                # Default: Evaluate on the test set
                y_pred = model(torch.tensor(X_test.values, dtype=torch.float32)).squeeze()  # Get predicted scores
                y_true = torch.tensor(y_test.values)  # True relevance scores
                print(f"Evaluating on the Test Set")

            # Kendall's Tau and Spearman's correlation are rank-based, so they take the
            # raw scores directly. (Correlating argsort outputs would compare item
            # indices at each position rather than each item's rank.)
            y_true_np, y_pred_np = y_true.numpy(), y_pred.numpy()

            # Compute Kendall's Tau between true and predicted scores
            tau, _ = kendalltau(y_true_np, y_pred_np)
            print(f"Kendall's Tau: {tau:.4f}")
            kts.append(tau)

            # Compute Spearman's Rank Correlation between true and predicted scores
            correlation, _ = spearmanr(y_true_np, y_pred_np)
            print(f"Spearman's Rank Correlation: {correlation:.4f}")
            sprccs.append(correlation)

            # Compute NDCG using actual relevance scores and predicted scores
            ndcg = ndcg_score([y_true.numpy()], [y_pred.numpy()])
            print(f"NDCG: {ndcg:.4f}")
            ndcgs.append(ndcg)

    # Print average of the metrics across all folds
    avg_tau = sum(kts) / len(kts)
    avg_correlation = sum(sprccs) / len(sprccs)
    avg_ndcg = sum(ndcgs) / len(ndcgs)

    print(f"Average NDCG: {avg_ndcg:.4f}")
    print(f"Average Spearman's Rank Correlation: {avg_correlation:.4f}")
    print(f"Average Kendall's Tau: {avg_tau:.4f}")

### Final Ranking

The final step is to generate the final ranking of the documents. RankNet outputs a score for each document, so we can simply sort the documents by score in descending order.

In production, we would have new document data to order. For the purposes of this project, we will run RankNet on the existing document data we trained/tested on, and output a ranking.
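
As a minimal sketch of this ordering step, the snippet below sorts job titles by score. The helper name `rank_documents` and the score values are hypothetical; in the notebook the scores would come from the trained model.

```python
def rank_documents(titles, scores):
    """Return (rank, title, score) tuples ordered by descending score."""
    ordered = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return [(rank, title, score)
            for rank, (title, score) in enumerate(ordered, start=1)]

titles = [
    "seeking human resources position",
    "aspiring human resources specialist",
    "aspiring human resources professional",
]
scores = [0.12, 1.87, 0.95]  # made-up model outputs, not real results

final_ranking = rank_documents(titles, scores)
for rank, title, score in final_ranking:
    print(f"{rank}. {title} ({score:.2f})")
```

Only the relative order of the scores matters here, so the scores do not need to be rescaled before sorting.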

## Putting it all together



1. Train/test on the aspiring GloVe dataset.
2. Train on the entire aspiring GloVe dataset for the final model.
3. Test the final model on the seeking GloVe dataset (should yield similar results since aspiring/seeking are contextual synonyms).
4. Generate the final ranking on the original dataset.



In [39]:
cross_val_ranknet(ahr_embeddings, debug=False)

Fold 1/5
Epoch [1/100], Loss: 0.6932
Epoch [2/100], Loss: 0.6931
Epoch [3/100], Loss: 0.6930
Epoch [4/100], Loss: 0.6929
Epoch [5/100], Loss: 0.6928
Epoch [6/100], Loss: 0.6927
Epoch [7/100], Loss: 0.6926
Epoch [8/100], Loss: 0.6925
Epoch [9/100], Loss: 0.6923
Epoch [10/100], Loss: 0.6920
Epoch [11/100], Loss: 0.6917
Epoch [12/100], Loss: 0.6912
Epoch [13/100], Loss: 0.6906
Epoch [14/100], Loss: 0.6896
Epoch [15/100], Loss: 0.6883
Epoch [16/100], Loss: 0.6864
Epoch [17/100], Loss: 0.6837
Epoch [18/100], Loss: 0.6800
Epoch [19/100], Loss: 0.6750
Epoch [20/100], Loss: 0.6686
Epoch [21/100], Loss: 0.6607
Epoch [22/100], Loss: 0.6514
Epoch [23/100], Loss: 0.6407
Epoch [24/100], Loss: 0.6281
Epoch [25/100], Loss: 0.6131
Epoch [26/100], Loss: 0.5949
Epoch [27/100], Loss: 0.5733
Epoch [28/100], Loss: 0.5480
Epoch [29/100], Loss: 0.5169
Epoch [30/100], Loss: 0.4792
Epoch [31/100], Loss: 0.4377
Epoch [32/100], Loss: 0.3945
Epoch [33/100], Loss: 0.3549
Epoch [34/100], Loss: 0.3202
Epoch [35/100]

In [40]:
cross_val_ranknet(shr_embeddings, debug=False)

Fold 1/5
Epoch [1/100], Loss: 0.6931
Epoch [2/100], Loss: 0.6929
Epoch [3/100], Loss: 0.6928
Epoch [4/100], Loss: 0.6926
Epoch [5/100], Loss: 0.6923
Epoch [6/100], Loss: 0.6919
Epoch [7/100], Loss: 0.6914
Epoch [8/100], Loss: 0.6907
Epoch [9/100], Loss: 0.6897
Epoch [10/100], Loss: 0.6883
Epoch [11/100], Loss: 0.6863
Epoch [12/100], Loss: 0.6834
Epoch [13/100], Loss: 0.6793
Epoch [14/100], Loss: 0.6739
Epoch [15/100], Loss: 0.6670
Epoch [16/100], Loss: 0.6586
Epoch [17/100], Loss: 0.6484
Epoch [18/100], Loss: 0.6367
Epoch [19/100], Loss: 0.6231
Epoch [20/100], Loss: 0.6070
Epoch [21/100], Loss: 0.5875
Epoch [22/100], Loss: 0.5636
Epoch [23/100], Loss: 0.5344
Epoch [24/100], Loss: 0.4993
Epoch [25/100], Loss: 0.4579
Epoch [26/100], Loss: 0.4113
Epoch [27/100], Loss: 0.3624
Epoch [28/100], Loss: 0.3157
Epoch [29/100], Loss: 0.2760
Epoch [30/100], Loss: 0.2452
Epoch [31/100], Loss: 0.2222
Epoch [32/100], Loss: 0.2062
Epoch [33/100], Loss: 0.1966
Epoch [34/100], Loss: 0.1850
Epoch [35/100]

Observations:
- Excellent results for all metrics for both terms (NDCG is 98%+ for both terms)
- Model limitation: separate models are needed for "aspiring" and "seeking", because the two search terms imply two separate recruiting objectives (i.e. two target variables). Covering both with a single model would require fine-tuning an LLM, which is beyond the scope of this project because we don't have enough compute resources or data.



# Conclusion

The final deliverable will be the LTR neural network trained on GloVe job title embeddings (50-dimensional) and their cosine similarities to each search term. Inherently, this means we will have to train a separate model per search term.

The solution does not scale when the recruiter wants to hire for multiple search terms (the recruiter can't run 100 different models for 100 search terms). To solve this, we would need to fine-tune an LLM.

Furthermore, the recruiter will likely want to find the best candidates based on longer job descriptions, not just short search terms. We would need more data on each candidate (job title alone will likely lead to overfitting). As we expand the data we have for the candidates and get more compute resources, we can deliver a solution that will be even more useful to the recruiter in the long run. For now, this is out of the scope of the project.