##### **Background:**

As a talent sourcing and management company, we find talented individuals and place them with technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client’s needs and what they are looking for in a potential candidate. Second, we need to understand what makes a candidate shine for the role we are trying to fill. Third, knowing where to find talented individuals is a challenge of its own.

Our work requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who fit the roles we are trying to fill. Beyond that, for a specific role we are interested in developing a machine-learning-powered pipeline that spots talented individuals and ranks them by their fitness.

We currently source a few candidates semi-automatically, so sourcing is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how well they fit a given role. We generally run these searches using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once we have listed and ranked fitting candidates, we employ a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.

##### **Data Description:**

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

##### **Attributes:**
**id:** unique identifier for candidate (numeric)

**job_title:** job title for candidate (text)

**location:** geographical location for candidate (text)

**connections:** number of connections candidate has, 500+ means over 500 (text)

Output (desired target):
**fit:** how fit the candidate is for the role (numeric, probability between 0 and 1)

**Keywords:** “Aspiring human resources” or “seeking human resources”

##### **Goal(s):**

Predict how fit the candidate is based on their available information (variable fit)

##### **Success Metric(s):**

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

##### **Bonus(es):**

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking improves with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high-potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Diego\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
df = pd.read_excel('potential-talents.xlsx').set_index('id')
df.head()

| id | job_title | location | connection | fit |
|---|---|---|---|---|
| 1 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | NaN |
| 2 | Native English Teacher at EPIK (English Progra... | Kanada | 500+ | NaN |
| 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | NaN |
| 4 | People Development Coordinator at Ryan | Denton, Texas | 500+ | NaN |
| 5 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | NaN |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 104 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   104 non-null    object 
 1   location    104 non-null    object 
 2   connection  104 non-null    object 
 3   fit         0 non-null      float64
dtypes: float64(1), object(3)
memory usage: 4.1+ KB


In [4]:
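# The raw value is '500+ ' (note the trailing space); map it to '501' to mean "over 500".
df['connection'] = df['connection'].replace('500+ ', '501')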

In [5]:
df['connection'] = pd.to_numeric(df['connection'])
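
The two cells above rely on the raw value matching `'500+ '` exactly, trailing space included. A minimal sketch of a more tolerant parser follows. `parse_connections` is a hypothetical helper, not part of the notebook, and it assumes that any value ending in `+` means "more than N" and should become N + 1.

```python
def parse_connections(value):
    """Convert a raw connection count such as 85, '44' or '500+ ' to an int.

    Assumption: a trailing '+' means "more than N", which is mapped to
    N + 1 to match the notebook's choice of 501 for '500+'.
    """
    text = str(value).strip()
    if text.endswith('+'):
        return int(text[:-1]) + 1
    return int(text)


print([parse_connections(v) for v in [85, '44', '500+ ', '500+']])
```

If this helper were adopted, it could be applied with `df['connection'].map(parse_connections)` in place of the replace-then-convert steps.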

In [6]:
df.value_counts('job_title')

job_title
2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 7
Student at Humber College and Aspiring Human Resources Generalist                                                        7
Aspiring Human Resources Professional                                                                                    7
People Development Coordinator at Ryan                                                                                   6
Native English Teacher at EPIK (English Program in Korea)                                                                5
HR Senior Specialist                                                                                                     5
Aspiring Human Resources Specialist                                                                                      5
Seeking Human Resources HRIS and Generalist Positions                                                                    4
Adviso

In [7]:
df.drop_duplicates(inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 53 entries, 1 to 104
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   job_title   53 non-null     object 
 1   location    53 non-null     object 
 2   connection  53 non-null     int64  
 3   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 2.1+ KB


#### Remove Acronyms and Lemmatize the Job Titles

In [9]:
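lemmatizer = WordNetLemmatizer()

def remove_acronyms(text):
    # Replace runs of two or more capital letters or periods (e.g. "HR", "EPIK") with a space.
    return re.sub(r'\b[A-Z\.]{2,}\b', ' ', text)

def lemmatize_sentence(sentence):
    # Lemmatize each whitespace-separated token (WordNet's default part of speech is noun).
    words = sentence.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

df['job_title'] = df['job_title'].apply(remove_acronyms)
df['job_title'] = df['job_title'].apply(lemmatize_sentence)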
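
To see what the acronym pattern in the cell above does, here is a small standalone check on titles from the dataset. `strip_acronyms` is an illustrative name; it reuses the same regular expression and additionally collapses whitespace so the results are easy to compare.

```python
import re

# Same pattern as the cell above: two or more capital letters or periods.
ACRONYM_PATTERN = re.compile(r'\b[A-Z\.]{2,}\b')


def strip_acronyms(text):
    # Replace each match with a space, then collapse repeated whitespace.
    return ' '.join(ACRONYM_PATTERN.sub(' ', text).split())


samples = [
    'HR Senior Specialist',
    'Native English Teacher at EPIK (English Program in Korea)',
    'Aspiring Human Resources Professional',
]
for title in samples:
    print(repr(title), '->', repr(strip_acronyms(title)))
```

Note that "HR" is removed along with other acronyms, so a title such as "HR Senior Specialist" keeps no human-resources term at all. Expanding "HR" to "Human Resources" before stripping acronyms may be worth considering.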

##### TF-IDF

In [10]:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (1, 2))
docs_tfidf = vectorizer.fit_transform(df["job_title"])


In [11]:
query1 = "Aspiring human resources"
query2 = "seeking human resources"
query1_tfidf = vectorizer.transform([query1])
query2_tfidf = vectorizer.transform([query2])

##### Assign a fitness score for Candidates based on Cosine Similarity to the Queries

In [16]:
cosine_similarities1 = cosine_similarity(query1_tfidf, docs_tfidf).flatten()
cosine_similarities2 = cosine_similarity(query2_tfidf, docs_tfidf).flatten()
combined_similarities = (cosine_similarities1 + cosine_similarities2) / 2
df['fit'] = combined_similarities
df_sorted = df.sort_values('fit', ascending=False)
df_sorted[['job_title', 'fit']].head(10)

| id | job_title | fit |
|---|---|---|
| 97 | Aspiring Human Resources Professional | 0.488087 |
| 3 | Aspiring Human Resources Professional | 0.488087 |
| 99 | Seeking Human Resources Position | 0.448175 |
| 28 | Seeking Human Resources Opportunities | 0.43912 |
| 73 | Aspiring Human Resources Manager, seeking inte... | 0.435764 |
| 6 | Aspiring Human Resources Specialist | 0.419663 |
| 10 | Seeking Human Resources and Generalist Positions | 0.358232 |
| 74 | Human Resources Professional | 0.317996 |
| 27 | Aspiring Human Resources Management student se... | 0.310035 |
| 100 | Aspiring Human Resources Manager \| Graduating ... | 0.264456 |
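
To make the scoring explicit: each candidate's fit is the mean of two cosine similarities, one per query. The sketch below reproduces that arithmetic with NumPy on made-up vectors. The values are toy data, not the notebook's TF-IDF matrix, and `cosine` and `combined_fit` are illustrative names.

```python
import numpy as np


def cosine(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    # Rows (or a query) with zero norm get similarity 0 instead of NaN.
    return np.divide(docs @ query, norms, out=np.zeros(len(docs)), where=norms > 0)


def combined_fit(query_a, query_b, docs):
    """Average the similarities to two queries."""
    return (cosine(query_a, docs) + cosine(query_b, docs)) / 2


# Toy 3-term vocabulary: [aspiring, seeking, human resources]
docs = np.array([
    [1.0, 0.0, 1.0],   # "aspiring human resources"
    [0.0, 1.0, 1.0],   # "seeking human resources"
    [0.0, 0.0, 0.0],   # no vocabulary overlap
])
fit = combined_fit([1, 0, 1], [0, 1, 1], docs)
print(fit.round(3))
```

In this toy setup a title that matches one query exactly still scores 0.75 rather than 1, because the two similarities are averaged. Taking the maximum of the two instead would be one alternative if either keyword alone should count as a full match.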


##### Assign a Fitness Score based on the Cosine Similarity of the Embedded Queries and Job Title Embeddings Using BERT Model

In [15]:
model = SentenceTransformer('bert-base-nli-mean-tokens')
job_title_embeddings = model.encode(df['job_title'].tolist())
query1_embedding = model.encode([query1])[0]
query2_embedding = model.encode([query2])[0]
cosine_similarities1 = cosine_similarity([query1_embedding], job_title_embeddings)[0]
cosine_similarities2 = cosine_similarity([query2_embedding], job_title_embeddings)[0]
combined_similarities = (cosine_similarities1 + cosine_similarities2) / 2
df['fit'] = combined_similarities
df_sorted = df.sort_values('fit', ascending=False)
df_sorted[['job_title', 'fit']].head(10)

| id | job_title | fit |
|---|---|---|
| 99 | Seeking Human Resources Position | 0.905945 |
| 28 | Seeking Human Resources Opportunities | 0.901761 |
| 10 | Seeking Human Resources and Generalist Positions | 0.843888 |
| 6 | Aspiring Human Resources Specialist | 0.82919 |
| 3 | Aspiring Human Resources Professional | 0.818317 |
| 97 | Aspiring Human Resources Professional | 0.818317 |
| 74 | Human Resources Professional | 0.788866 |
| 89 | Director Human Resources at | 0.744817 |
| 67 | Human Resources, Staffing and Recruiting Profe... | 0.732581 |
| 94 | Seeking Human Resources Opportunities. Open to... | 0.71306 |
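
The notebook stops at the initial ranking; the starring step described in the brief is not implemented here. One possible approach, sketched under stated assumptions, is to pull the query embedding toward the mean embedding of the starred candidates and re-score everyone by cosine similarity (a Rocchio-style update). The function `rerank_with_stars`, the `star_weight` parameter, and the toy vectors are all hypothetical.

```python
import numpy as np


def rerank_with_stars(query_vec, doc_vecs, starred, star_weight=0.8):
    """Return (order, scores) after pulling the query toward starred rows.

    Assumption: the updated query is a convex combination of the original
    query and the mean embedding of the starred candidates.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    if len(starred) > 0:
        centroid = doc_vecs[list(starred)].mean(axis=0)
        query_vec = (1 - star_weight) * query_vec + star_weight * centroid
    # Cosine similarity of every candidate to the (possibly updated) query.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = np.divide(doc_vecs @ query_vec, norms,
                       out=np.zeros(len(doc_vecs)), where=norms > 0)
    order = np.argsort(-scores, kind='stable')
    return order, scores


# Toy 2-D "embeddings"; real ones would come from model.encode(...).
docs = np.array([[1.0, 0.1], [0.8, 0.6], [0.2, 1.0]])
query = np.array([1.0, 0.0])

before, _ = rerank_with_stars(query, docs, starred=[])
after, _ = rerank_with_stars(query, docs, starred=[2])
print(before, after)
```

With the job-title embeddings from the cell above, `starred` would hold the row positions of the starred candidates, and the function would be called again after each new star. A larger `star_weight` makes the starred candidates dominate the keywords more strongly; the value 0.8 is an arbitrary choice for illustration.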
