# Potential Talents – Candidate Ranking Demo

**Objective.** Rank semi-sourced human-resources candidates so recruiters can focus their manual review on the most promising profiles.

**Data.** Google Sheet export with `id`, `job_title`, `location`, and `connection` counts; the real `fit` label is hidden to mimic a cold-start search.

**Success criteria.** Compare fast lexical and semantic ranking approaches, then demonstrate how starring a candidate reshapes the list.

## Notebook roadmap
1. Setup & data access (download, clean & preview)
2. Keyword ranking with TF-IDF
3. Semantic ranking with MPNet embeddings
4. Re-ranking after a recruiter stars a candidate
5. Findings & next steps

## 1. Setup & data access
All work runs in standard Python 3, using pandas for data handling, scikit-learn for TF-IDF, and `sentence-transformers` for pretrained embeddings.

### Challenge overview
Talent partners supply keywords such as "Aspiring Human Resources" or "Seeking Human Resources" and expect a ranked list of promising candidates. This notebook walks through the current experiment setup so it is easy to present or iterate on later.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

### Data download
The source sheet is exported to CSV on the fly so the notebook always reflects the current sourcing list.

In [9]:
sheet_id = "117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU"
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
df = pd.read_csv(csv_url)

### Clean & preview
Drop the provided `fit` label (reserved for evaluation) and inspect the first few records.

In [11]:
df = df.drop(columns=['fit'])
df.head()

|  | id | job_title | location | connection |
|---|---|---|---|---|
| 0 | 1 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 |
| 1 | 2 | Native English Teacher at EPIK (English Progra... | Kanada | 500+ |
| 2 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 |
| 3 | 4 | People Development Coordinator at Ryan | Denton, Texas | 500+ |
| 4 | 5 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ |


In [12]:
df.shape

(104, 4)

The dataset is compact (104 rows and 4 columns once `fit` is dropped), so re-ranking and experimentation remain interactive.
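
Before ranking, a quick health check can flag missing titles, repeated profiles, and the string-valued `connection` column (the preview shows values such as `500+`). The sketch below is illustrative only: the `sample` rows are hypothetical stand-ins modelled on the preview, and `health_report` and `connection_floor` are helper names introduced here, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows modelled on the preview above; not the real sheet.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job_title": [
        "Aspiring Human Resources Professional",
        "Aspiring Human Resources Professional",
        "People Development Coordinator at Ryan",
        None,
    ],
    "location": [
        "Raleigh-Durham, North Carolina Area",
        "Raleigh-Durham, North Carolina Area",
        "Denton, Texas",
        "Kanada",
    ],
    "connection": ["44", "44", "500+", "500+"],
})


def health_report(frame: pd.DataFrame) -> dict:
    """Count rows, missing titles, and profiles repeated under different ids."""
    profile_cols = [col for col in frame.columns if col != "id"]
    return {
        "rows": len(frame),
        "missing_titles": int(frame["job_title"].isna().sum()),
        "duplicate_profiles": int(frame.duplicated(subset=profile_cols).sum()),
    }


def connection_floor(series: pd.Series) -> pd.Series:
    """Turn '44' into 44 and '500+' into 500 (a lower bound, by assumption)."""
    return series.astype(str).str.strip().str.rstrip("+").astype(int)


report = health_report(sample)
sample["connection_min"] = connection_floor(sample["connection"])
print(report)
print(sample[["id", "connection", "connection_min"]])
```

Treating `500+` as 500 keeps the column numeric for later threshold filters, at the cost of understating the largest networks.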

## 2. Keyword ranking with TF-IDF

TF-IDF offers a lightweight lexical baseline that highlights distinctive tokens within job titles. We score each record against two provided keywords to understand how well exact phrasing performs.

In [13]:
tfidf = TfidfVectorizer(
    token_pattern=r'(?u)\b\w+\b',
    strip_accents='unicode',
    stop_words='english',
    lowercase=True,
    # ngram_range=(1,2)
)

combined_sparse = tfidf.fit_transform(df['job_title'])

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

keyword = "Aspiring Human Resources"

query_vec = tfidf.transform([keyword])

sims = cosine_similarity(query_vec, combined_sparse).flatten()

df_tfidf_k1 = df.copy()
df_tfidf_k1['fit'] = sims

df_sorted = df_tfidf_k1.sort_values(by='fit', ascending=False).reset_index(drop=True)

df_sorted.head(10)

|  | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 46 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.753591 |
| 1 | 97 | Aspiring Human Resources Professional | Kokomo, Indiana Area | 71 | 0.753591 |
| 2 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.753591 |
| 3 | 21 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.753591 |
| 4 | 17 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.753591 |
| 5 | 33 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.753591 |
| 6 | 58 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.753591 |
| 7 | 36 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.695679 |
| 8 | 24 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.695679 |
| 9 | 49 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.695679 |


In [15]:
keyword = "Seeking Human Resources"

query_vec = tfidf.transform([keyword])

sims = cosine_similarity(query_vec, combined_sparse).flatten()

df_tfidf_k2 = df.copy()
df_tfidf_k2['fit'] = sims

df_sorted = df_tfidf_k2.sort_values(by='fit', ascending=False).reset_index(drop=True)

df_sorted.head(10)

|  | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 28 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 0.664933 |
| 1 | 30 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 0.664933 |
| 2 | 99 | Seeking Human Resources Position | Las Vegas, Nevada Area | 48 | 0.644851 |
| 3 | 73 | Aspiring Human Resources Manager, seeking inte... | Houston, Texas Area | 7 | 0.620589 |
| 4 | 62 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.488364 |
| 5 | 40 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.488364 |
| 6 | 10 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.488364 |
| 7 | 53 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.488364 |
| 8 | 29 | Aspiring Human Resources Management student se... | Houston, Texas Area | 500+ | 0.476206 |
| 9 | 27 | Aspiring Human Resources Management student se... | Houston, Texas Area | 500+ | 0.476206 |
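
The two TF-IDF cells differ only in the keyword, so the steps could be wrapped in one function. The following is a minimal sketch, not the notebook's own code: `rank_by_tfidf` is a hypothetical helper, the vectorizer settings are copied from the cell above, and the four `toy` titles are illustrative stand-ins for `df['job_title']`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_tfidf(frame: pd.DataFrame, keyword: str,
                  text_col: str = "job_title") -> pd.DataFrame:
    """Return a copy of `frame` sorted by TF-IDF cosine similarity to `keyword`."""
    vectorizer = TfidfVectorizer(
        token_pattern=r"(?u)\b\w+\b",
        strip_accents="unicode",
        stop_words="english",
        lowercase=True,
    )
    # Fit on the titles, then project the keyword into the same vocabulary.
    doc_matrix = vectorizer.fit_transform(frame[text_col].fillna(""))
    query_vec = vectorizer.transform([keyword])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    return (
        frame.assign(fit=scores)
        .sort_values("fit", ascending=False)
        .reset_index(drop=True)
    )


# Illustrative titles only; the notebook would pass `df` instead.
toy = pd.DataFrame({
    "job_title": [
        "Aspiring Human Resources Professional",
        "Seeking Human Resources Opportunities",
        "Native English Teacher",
        "People Development Coordinator",
    ]
})
ranked = rank_by_tfidf(toy, "Seeking Human Resources")
print(ranked)
```

Titles that share no token with the keyword score exactly 0, which is the main limitation the semantic approach is meant to address.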


## 3. Semantic ranking with MPNet embeddings

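A pretrained MPNet encoder captures richer context (e.g., "talent partner" vs "HR specialist") than token overlap alone. We again rank by cosine similarity, so the two pipelines differ only in how the text is represented. The absolute scores come from different vector spaces, so compare the methods by rank order rather than by raw value.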

In [16]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

df_mpnet_k1 = df.copy()

titles = df_mpnet_k1['job_title'].fillna("").tolist()
keyword = "Aspiring Human Resources"

title_embeddings = model.encode(titles)  
keyword_embedding = model.encode([keyword])

similarities = model.similarity(keyword_embedding, title_embeddings)[0]

df_mpnet_k1['fit'] = similarities
df_ranked = df_mpnet_k1.sort_values('fit', ascending=False).reset_index(drop=True)

df_ranked.head(20)

|  | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 46 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.879234 |
| 1 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.879234 |
| 2 | 21 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.879234 |
| 3 | 97 | Aspiring Human Resources Professional | Kokomo, Indiana Area | 71 | 0.879234 |
| 4 | 33 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.879234 |
| 5 | 58 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.879234 |
| 6 | 17 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.879234 |
| 7 | 49 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.864964 |
| 8 | 6 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.864964 |
| 9 | 60 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.864964 |


In [19]:
keyword = "Seeking Human Resources"

keyword_embedding = model.encode([keyword])

similarities = model.similarity(keyword_embedding, title_embeddings)[0]

df_mpnet_k2 = df.copy()
df_mpnet_k2['fit'] = similarities
df_ranked = df_mpnet_k2.sort_values('fit', ascending=False).reset_index(drop=True)

df_ranked.head(20)

|  | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 30 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 0.880791 |
| 1 | 28 | Seeking Human Resources Opportunities | Chicago, Illinois | 390 | 0.880791 |
| 2 | 99 | Seeking Human Resources Position | Las Vegas, Nevada Area | 48 | 0.820258 |
| 3 | 67 | Human Resources, Staffing and Recruiting Profe... | Jackson, Mississippi Area | 500+ | 0.745163 |
| 4 | 10 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.735197 |
| 5 | 40 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.735197 |
| 6 | 62 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.735197 |
| 7 | 53 | Seeking Human Resources HRIS and Generalist Po... | Greater Philadelphia Area | 500+ | 0.735197 |
| 8 | 60 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.726518 |
| 9 | 24 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.726518 |


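MPNet is scored with the same cosine similarity, yet it rewards related phrasing rather than exact token overlap. For "Aspiring Human Resources", the titles "Aspiring Human Resources Professional" (0.879) and "Aspiring Human Resources Specialist" (0.865) score almost identically even though the closing noun differs. For "Seeking Human Resources", the title "Human Resources, Staffing and Recruiting Profe..." (id 67) ranks fourth without containing the word "seeking"; it does not appear in the TF-IDF top 10.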
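
To make the scoring step concrete, here is a minimal NumPy sketch of cosine similarity between one query vector and a matrix of title vectors. It assumes `model.similarity` is left at its default cosine scoring; `cosine_scores` and the 3-dimensional `toy_titles` vectors are hypothetical stand-ins for the 768-dimensional MPNet embeddings.

```python
import numpy as np


def cosine_scores(query, matrix) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `matrix`."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Normalise both sides to unit length; the dot product is then the cosine.
    unit_query = query / np.linalg.norm(query)
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit_rows @ unit_query


# Toy vectors: row 0 matches the query, row 1 is close, row 2 is orthogonal.
toy_titles = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
scores = cosine_scores([1.0, 0.0, 0.0], toy_titles)
order = np.argsort(-scores)  # indices from best to worst match
print(scores, order)
```

Because only the direction of each vector matters, a long title and a short keyword can still score highly when they point the same way in embedding space.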

## 4. Re-ranking after starring a candidate

Once a reviewer stars a candidate, we embed that candidate's job title as the new query and recompute similarities against the existing title embeddings. This simulates a human-in-the-loop feedback step without rebuilding the ranking stack.

In [22]:
starred = df.iloc[73]
starred

id                                      74
job_title     Human Resources Professional
location               Greater Boston Area
connection                              16
Name: 73, dtype: object

In [23]:
starred.job_title

'Human Resources Professional'

In [24]:
search_keyword = starred.job_title

keyword_embedding = model.encode([search_keyword])

similarities = model.similarity(keyword_embedding, title_embeddings)[0]

df_starred = df.copy()
df_starred['fit'] = similarities
df_ranked = df_starred.sort_values('fit', ascending=False).reset_index(drop=True)

df_ranked.head(20)

|  | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 74 | Human Resources Professional | Greater Boston Area | 16 | 1.0 |
| 1 | 21 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.84059 |
| 2 | 97 | Aspiring Human Resources Professional | Kokomo, Indiana Area | 71 | 0.84059 |
| 3 | 46 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.84059 |
| 4 | 58 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.84059 |
| 5 | 17 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.84059 |
| 6 | 33 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.84059 |
| 7 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | 0.84059 |
| 8 | 24 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.82942 |
| 9 | 36 | Aspiring Human Resources Specialist | Greater New York City Area | 1 | 0.82942 |
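
The starred profile (id 74) comes back at position 0 with a similarity of 1.0 because it is compared with its own embedding. One possible refinement, sketched here under the assumption that the title embeddings are available as a NumPy array, is to leave the starred row out of the results. `rerank_from_starred` and the 2-dimensional `toy_embeddings` are hypothetical; in the notebook the input would be `title_embeddings`.

```python
import numpy as np


def rerank_from_starred(embeddings, starred_pos: int, top_k: int = 5):
    """Rank rows by cosine similarity to the starred row, excluding that row."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = unit @ unit[starred_pos]
    # Sort best-first, then drop the starred row so it cannot rank itself.
    order = [int(i) for i in np.argsort(-scores) if i != starred_pos]
    return [(i, float(scores[i])) for i in order[:top_k]]


# Toy embeddings: row 0 is the starred profile.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.7, 0.7],
])
neighbours = rerank_from_starred(toy_embeddings, starred_pos=0, top_k=2)
print(neighbours)  # row positions with their similarity to the starred row
```

The returned positions are row offsets, matching the `df.iloc[73]` lookup used to pick the starred candidate.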


## 5. Findings & next steps
- TF-IDF is fast and transparent for literal keyword searches, making it a solid screening baseline.
- MPNet embeddings surface related HR roles that literal matching ranks lower (for example, id 67 for "Seeking Human Resources"), which should reduce manual scanning.
- Re-ranking from a starred profile leverages the same embedding space, hinting at a lightweight feedback loop for recruiters.
- Next steps: log ranking metrics such as precision@k, add filters (location, connection thresholds), and export the flow to a lightweight UI.
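
As a starting point for the metric logging mentioned above, precision@k can be computed from a ranked list of candidate ids and a set of ids judged relevant. This is a sketch under stated assumptions: `precision_at_k` is a hypothetical helper, and the `relevant` set is made up for illustration. How the held-out `fit` label would be turned into a relevant set is left open.

```python
def precision_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of the top-k ranked ids that appear in `relevant_ids`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    hits = sum(1 for candidate_id in ranked_ids[:k] if candidate_id in relevant)
    # Divide by k, so a ranking shorter than k is penalised for the empty slots.
    return hits / k


# Illustrative ids only; the relevance judgements are invented for this example.
ranking = [46, 97, 3, 74, 2]
relevant = {46, 3, 74}
p_at_3 = precision_at_k(ranking, relevant, k=3)  # 2 of the top 3 are relevant
p_at_5 = precision_at_k(ranking, relevant, k=5)  # 3 of the top 5 are relevant
print(p_at_3, p_at_5)
```

Logging the same metric for the TF-IDF and MPNet rankings at a fixed k would give a like-for-like comparison that does not depend on the raw similarity scales.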