# Background:

As a talent sourcing and management company, we find talented individuals and source them to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is a challenge of its own.

Our work requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them by fitness.

We currently source a few candidates semi-automatically, so sourcing is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally search using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once we have listed and ranked fitting candidates, we apply a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.

# Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and assigned a unique identifier to each candidate.

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections the candidate has; 500+ means over 500 (text)
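
The `connection` field arrives as text because of the `500+` bucket. Below is a minimal sketch of turning it into a number, assuming `500+` can be treated as a lower bound of 500. The helper name `parse_connections` is hypothetical and not part of the notebook.

```python
def parse_connections(value):
    """Convert a raw connection field such as '85' or '500+' to an integer.

    Assumption: '500+' is mapped to 500 because the exact count is unknown.
    """
    text = str(value).strip()
    if text.endswith('+'):
        text = text[:-1]  # drop the trailing plus sign
    return int(text)


print([parse_connections(v) for v in ['85', '500+', 44]])
```

With pandas, the same helper could be applied column-wise, for example with `df['connection'].map(parse_connections)`.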

Output (desired target):
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

# Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

# Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

# Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates which in the first place should not be in this list?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

# Dependencies

In [9]:
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer, util

import requests
import urllib.parse

import geopy.distance
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

# Reading the data

In [2]:
df = pd.read_csv('potential-talents.csv')
df

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,
101,102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,
102,103,Always set them up for Success,Greater Los Angeles Area,500+,


In [4]:
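#using the OpenStreetMap Nominatim API to get the latitude and longitude of a location - used later for ranking
def location(city_name):
    #the free-form search text goes in the q parameter; the User-Agent value is a placeholder identifying this notebook
    url = 'https://nominatim.openstreetmap.org/search?q=' + urllib.parse.quote(city_name) + '&format=json'
    response = requests.get(url, headers={'User-Agent': 'potential-talents-notebook'}).json()
    #fail with a readable message when the API returns no match
    if not response:
        raise ValueError('No geocoding result for ' + city_name)
    lat = float(response[0]['lat'])
    lon = float(response[0]['lon'])
    return lat,lon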
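
Many candidates share the same location, and the public Nominatim service asks clients to stay at or below one request per second. The sketch below wraps any geocoding function with an in-memory cache and a minimum delay between real lookups. The names `make_cached_geocoder` and `fake_geocode` are hypothetical, and the coordinates in the demo are made up so the example runs without network access.

```python
import time


def make_cached_geocoder(geocode, min_interval=1.0):
    """Wrap `geocode` with a cache and a minimum delay between uncached lookups."""
    cache = {}
    last_lookup = [None]  # monotonic time of the last uncached lookup

    def cached(name):
        if name not in cache:
            if last_lookup[0] is not None:
                wait = min_interval - (time.monotonic() - last_lookup[0])
                if wait > 0:
                    time.sleep(wait)  # throttle real requests only
            cache[name] = geocode(name)
            last_lookup[0] = time.monotonic()
        return cache[name]

    return cached


# Demo with a stand-in geocoder that records how often it is called.
calls = []

def fake_geocode(name):
    calls.append(name)
    return (30.27, -97.74)  # made-up coordinates for the demo

cached = make_cached_geocoder(fake_geocode, min_interval=0.0)
cached('Austin, Texas Area')
cached('Austin, Texas Area')
print(len(calls))  # the second lookup is served from the cache
```

In the notebook, the wrapper could be applied to the real function, for example `cached_location = make_cached_geocoder(location)`, so that the initial ranking and every re-ranking reuse earlier lookups.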

In [5]:
#changing 3 location names as they weren't being identified by the API
df['location'] = df['location'].replace({
    'Greater New York City Area': 'New York',
    'Greater Grand Rapids, Michigan Area': 'Michigan',
    'Greater Los Angeles Area': 'Los Angeles',
})

In [17]:
#ranking function
def ranking(df, job_query, location_query = None):
    
    #loading the pretrained sentence-embedding model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    #compute embeddings of job_title
    input_embeddings = model.encode(df['job_title'].tolist(), convert_to_tensor=True)
    
    #compute embeddings for query
    query_embedding = model.encode(job_query, convert_to_tensor=True)

    #cosine similarity between the query and every job title
    df['job_title_ranking'] = util.cos_sim(query_embedding, input_embeddings)[0].cpu().numpy()
    
    if location_query is not None:
        #geocoding the query once and each distinct candidate location once using the openstreetmap api
        query_coords = location(location_query)
        coords = {loc: location(loc) for loc in df['location'].unique()}
        
        #geographical distance in km between the query location and each candidate location
        df['location_ranking'] = [geopy.distance.distance(query_coords, coords[loc]).km for loc in df['location']]
        
        #scaling the distance between 0 and 1 and inverting it so that the closest location scores 1
        min_max_scaler = MinMaxScaler()
        df['location_ranking'] = 1 - min_max_scaler.fit_transform(df[['location_ranking']]).ravel()
        
        #ranking logic
        #if job title ranking is greater than 0.5, job title and location contribute equally to the fit
        #otherwise, job title accounts for 90% of the fit and location for 10%
        df['fit'] = np.where(df['job_title_ranking'] > 0.5,
                             0.5 * df['job_title_ranking'] + 0.5 * df['location_ranking'],
                             0.9 * df['job_title_ranking'] + 0.1 * df['location_ranking'])
    else:
        df['fit'] = df['job_title_ranking'] #if location is not included in query
        
    return df
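
The scoring rule inside `ranking` can be checked by hand. The sketch below restates the two steps in plain Python: distances are min-max scaled and inverted, and the job-title similarity and location score are then blended with the 0.5 threshold rule. The helper names `distance_to_score` and `fit_score` are hypothetical. The inputs in the demo are the `job_title_ranking` and `location_ranking` values reported for ids 99 and 1 in the `In [19]` and `In [18]` outputs.

```python
def distance_to_score(distances_km):
    """Min-max scale distances to [0, 1] and invert so the nearest location scores 1."""
    lo, hi = min(distances_km), max(distances_km)
    if hi == lo:
        return [1.0 for _ in distances_km]  # all candidates are equally far away
    return [1 - (d - lo) / (hi - lo) for d in distances_km]


def fit_score(title_sim, location_score):
    """Blend title similarity and location score using the 0.5 threshold rule."""
    if title_sim > 0.5:
        return 0.5 * title_sim + 0.5 * location_score  # equal weights
    return 0.9 * title_sim + 0.1 * location_score      # title dominates


print(distance_to_score([0, 50, 100]))           # nearest first, farthest last
print(round(fit_score(0.904125, 0.871787), 6))   # id 99, reported fit 0.887956
print(round(fit_score(0.427339, 0.977519), 6))   # id 1, reported fit 0.482357
```

One consequence of the rule is a jump at the threshold: a candidate just above 0.5 title similarity gets half of its fit from location, while one just below gets only a tenth.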

In [18]:
#initial ranking
job_query = 'seeking human resources'
location_query = 'texas'
initial_rank = ranking(df, job_query, location_query)
initial_rank

Unnamed: 0,id,job_title,location,connection,fit,job_title_ranking,location_ranking
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,0.482357,0.427339,9.775191e-01
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.271673,0.224822,6.933263e-01
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.799572,0.772701,8.264443e-01
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,0.438381,0.377238,9.886667e-01
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.208569,0.231743,1.110223e-16
...,...,...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.792245,0.676011,9.084785e-01
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,0.727830,0.629215,8.264443e-01
101,102,Business Intelligence and Analytics at Travelers,New York,49,0.209035,0.146150,7.750032e-01
102,103,Always set them up for Success,Los Angeles,500+,0.215943,0.145582,8.491880e-01


In [19]:
#top 50
initial_rank.sort_values(by='fit', ascending=False)[0:50]

Unnamed: 0,id,job_title,location,connection,fit,job_title_ranking,location_ranking
98,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.887956,0.904125,0.871787
29,30,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.882224,0.899172,0.865276
27,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.882224,0.899172,0.865276
66,67,"Human Resources, Staffing and Recruiting Profe...","Jackson, Mississippi Area",500+,0.822713,0.728724,0.916702
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.81923,0.772701,0.86576
81,82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,0.810313,0.620626,1.0
93,94,Seeking Human Resources Opportunities. Open t...,Amerika Birleşik Devletleri,415,0.803369,0.679118,0.92762
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.799572,0.772701,0.826444
32,33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.799572,0.772701,0.826444
45,46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.799572,0.772701,0.826444


In [20]:
#rank 50-100
initial_rank.sort_values(by='fit', ascending=False)[50:100]

Unnamed: 0,id,job_title,location,connection,fit,job_title_ranking,location_ranking
82,83,HR Manager at Endemol Shine North America,"Los Angeles, California",268,0.516759,0.479823,0.849188
80,81,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,0.515512,0.474647,0.8832888
74,75,"Nortia Staffing is seeking Human Resources, Pa...","San Jose, California",500+,0.508964,0.475273,0.8121889
7,8,HR Senior Specialist,San Francisco Bay Area,500+,0.505659,0.472147,0.8072641
37,38,HR Senior Specialist,San Francisco Bay Area,500+,0.505659,0.472147,0.8072641
25,26,HR Senior Specialist,San Francisco Bay Area,500+,0.505659,0.472147,0.8072641
60,61,HR Senior Specialist,San Francisco Bay Area,500+,0.505659,0.472147,0.8072641
50,51,HR Senior Specialist,San Francisco Bay Area,500+,0.505659,0.472147,0.8072641
73,74,Human Resources Professional,Greater Boston Area,16,0.488397,0.727106,0.2496882
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,0.482357,0.427339,0.9775191


In [21]:
#suppose from the above we notice that id 82 is the ideal candidate (6th in top 50)
job_query = initial_rank['job_title'].loc[81]
location_query = initial_rank['location'].loc[81]
re_rank = ranking(initial_rank, job_query, location_query)
re_rank

Unnamed: 0,id,job_title,location,connection,fit,job_title_ranking,location_ranking
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,0.762975,0.548467,9.774830e-01
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.272326,0.228571,6.661207e-01
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,8.217797e-01
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,0.438696,0.379807,9.686977e-01
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.227434,0.252705,1.110223e-16
...,...,...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,0.801518,0.706828,8.962081e-01
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,0.667025,0.512269,8.217797e-01
101,102,Business Intelligence and Analytics at Travelers,New York,49,0.147973,0.079071,7.680892e-01
102,103,Always set them up for Success,Los Angeles,500+,0.208359,0.141295,8.119406e-01


In [22]:
#top 50 after re-ranking
re_rank.sort_values(by='fit', ascending=False)[0:50]

Unnamed: 0,id,job_title,location,connection,fit,job_title_ranking,location_ranking
81,82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,1.0,1.0,1.0
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.85093,0.848949,0.852911
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,0.82178
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,0.82178
32,33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,0.82178
57,58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,0.82178
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,0.82178
45,46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.835364,0.848949,0.82178
65,66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,0.825092,0.650185,1.0
66,67,"Human Resources, Staffing and Recruiting Profe...","Jackson, Mississippi Area",500+,0.807625,0.699659,0.91559


In [23]:
#rank 50-100 after re-ranking
re_rank.sort_values(by='fit', ascending=False)[50:100]

Unnamed: 0,id,job_title,location,connection,fit,job_title_ranking,location_ranking
25,26,HR Senior Specialist,San Francisco Bay Area,500+,0.67048,0.570681,0.7702797
50,51,HR Senior Specialist,San Francisco Bay Area,500+,0.67048,0.570681,0.7702797
69,70,"Retired Army National Guard Recruiter, office ...","Virginia Beach, Virginia",82,0.669077,0.542304,0.7958493
68,69,"Director of Human Resources North America, Gro...",Michigan,500+,0.6676,0.514202,0.8209985
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,0.667025,0.512269,0.8217797
24,25,Student at Humber College and Aspiring Human R...,Kanada,61,0.661357,0.656592,0.6661207
51,52,Student at Humber College and Aspiring Human R...,Kanada,61,0.661357,0.656592,0.6661207
36,37,Student at Humber College and Aspiring Human R...,Kanada,61,0.661356,0.656592,0.6661207
6,7,Student at Humber College and Aspiring Human R...,Kanada,61,0.661356,0.656592,0.6661207
49,50,Student at Humber College and Aspiring Human R...,Kanada,61,0.661356,0.656592,0.6661207


Success Metrics:

Rank candidates based on a fitness score - Done using cosine similarity and location ranking

Re-rank candidates when a candidate is starred - Done (same approach as above, with the starred candidate's job title and location used as the query)

Overall, my recommendation is to go with the BERT-based ranking algorithm, because its results are more relevant (judged by manual inspection) than those of GloVe.

In the case of GloVe, certain candidates who should not be in the list are included in the top 50, for example candidates with the job titles 'People Development Coordinator at Ryan' and 'Director Of Administration at Excellence Logging'. In the case of BERT, almost all of the human-resources-related candidates are ranked higher than the others. This might be because BERT is better suited than GloVe to capturing semantic relationships within sentences.

Bonus:

1. We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action?
- I used sentence embeddings and a geocoding API to rank the candidates against the search query. Each time a candidate is starred, the list is re-ranked using the starred candidate's job title and location as the new search query.
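
The approach above replaces the query with the most recently starred candidate. One possible extension, not what the notebook implements, is to keep every starred candidate and rank against the mean of their embeddings. The sketch below illustrates that idea with made-up 2-dimensional vectors standing in for real sentence embeddings. The names `cosine`, `mean_vector` and `rerank` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rerank(embeddings, starred_ids):
    """Sort candidate ids by similarity to the centroid of the starred candidates."""
    target = mean_vector([embeddings[i] for i in starred_ids])
    return sorted(embeddings, key=lambda i: cosine(embeddings[i], target), reverse=True)


# Toy embeddings (assumption: real ones would come from the sentence-embedding model).
embeddings = {'a': [1.0, 0.0], 'b': [0.8, 0.6], 'c': [0.0, 1.0], 'd': [0.6, 0.8]}

print(rerank(embeddings, ['a']))       # one star: identical to using 'a' as the query
print(rerank(embeddings, ['a', 'd']))  # two stars: candidates between 'a' and 'd' rise
```

With a single starred candidate the target equals that candidate's embedding, so the result matches the notebook's behaviour, where the starred candidate scores 1.0 on job title.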

2. How can we filter out candidates which in the first place should not be in this list?
- We can pick a ranking threshold of 0.5 to filter out the candidates who should not be in the list. This method works better with BERT than with GloVe.

3. Can we determine a cut-off point that would work for other roles without losing high potential candidates? 
- See the answer to question 2.
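
The threshold filter from the two answers above can be sketched as follows. This is a minimal illustration that assumes the cut-off is applied to the job-title similarity. The function name `filter_by_threshold` is hypothetical, and the scores are the `job_title_ranking` values reported for five candidates in the initial ranking.

```python
def filter_by_threshold(scores, threshold=0.5):
    """Keep only the candidate ids whose score meets the threshold."""
    return {cid: s for cid, s in scores.items() if s >= threshold}


# job_title_ranking values for ids 99, 3, 1, 2 and 5 from the initial ranking
scores = {99: 0.904125, 3: 0.772701, 1: 0.427339, 2: 0.224822, 5: 0.231743}

kept = filter_by_threshold(scores)
print(sorted(kept))  # ids that survive the 0.5 cut-off
```

Whether 0.5 carries over to other roles is something to verify per query, since similarity scores depend on the wording of the search keywords.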

4. Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?
- Currently, we rely on human intervention to find ideal candidates. In the future, we could use human expertise to create a labelled dataset and train a neural network with a 'Learning to Rank' algorithm. The trained model could then rank candidates in production, limiting human bias (note: labelling should be done by a diverse group of subject matter experts). Creating a labelled dataset could be as simple as adding an option to mark each result as relevant, somewhat relevant or not relevant. A similar system is explained in this article - https://embracingtherandom.com/machine-learning/tensorflow/ranking/deep-learning/learning-to-rank-part-1/
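
As a rough sketch of how the three-level labels could feed a pairwise 'Learning to Rank' setup, the code below maps each label to a grade and emits (preferred, other) pairs for one search query. The grade mapping, the function name `preference_pairs` and the example labels are all assumptions made for illustration, and no such labels exist in the dataset yet.

```python
from itertools import combinations

# Assumed mapping from reviewer labels to ordinal grades.
GRADES = {'not relevant': 0, 'somewhat relevant': 1, 'relevant': 2}


def preference_pairs(labels):
    """Turn {candidate_id: label} for one query into (preferred, other) pairs."""
    pairs = []
    for a, b in combinations(labels, 2):
        grade_a, grade_b = GRADES[labels[a]], GRADES[labels[b]]
        if grade_a > grade_b:
            pairs.append((a, b))
        elif grade_b > grade_a:
            pairs.append((b, a))
        # equal grades carry no preference and are skipped
    return pairs


# Hypothetical reviewer labels for three candidate ids.
labels = {99: 'relevant', 4: 'somewhat relevant', 102: 'not relevant'}
print(preference_pairs(labels))
```

Each pair says that the first candidate should rank above the second for that query, which is the training signal a pairwise ranking model consumes.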