### Background:

As a talent sourcing and management company, we are interested in finding talented individuals and sourcing them to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role being filled. Third, knowing where to find talented individuals is a challenge of its own.

The nature of our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them based on their fitness.

We are currently sourcing a few candidates semi-automatically, so the sourcing part is not a concern at this time. What we expect first is to determine the best-matching candidates based on how fit they are for a given role. We generally run these searches using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once fitting candidates have been listed and ranked, we employ a review procedure in which each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role, and we expect the list to be re-ranked each time a candidate is starred.
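
The starring step can be illustrated with a small, dependency-free sketch. It assumes one simple option, averaging the search phrase with the starred title so the query moves toward it; this is not necessarily the approach used later in this notebook. The toy titles and the helpers `bag_of_words`, `cosine_sim`, `rank` and `star` are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase a title and count its whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_sim(a, b):
    """Cosine similarity between two sparse word-weight mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(titles, query_vec):
    """Return titles sorted from most to least similar to the query vector."""
    return sorted(titles, key=lambda t: cosine_sim(bag_of_words(t), query_vec),
                  reverse=True)


def star(query_vec, starred_title):
    """Blend the starred title into the query by averaging word weights."""
    starred_vec = bag_of_words(starred_title)
    words = set(query_vec) | set(starred_vec)
    return {w: (query_vec.get(w, 0) + starred_vec.get(w, 0)) / 2 for w in words}


titles = [
    'aspiring human resources professional',
    'human resources generalist',
    'people development coordinator',
    'native english teacher',
]
query = dict(bag_of_words('aspiring human resources'))
before = rank(titles, query)

# A reviewer stars the third-ranked title; the query moves toward it.
query = star(query, 'people development coordinator')
after = rank(titles, query)
print(before)
print(after)
```

In this toy run the starred title moves to the top of the list, while titles that still match the original keywords stay ahead of unrelated ones.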

### Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

### Attributes:

id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connections: number of connections candidate has, 500+ means over 500 (text)
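
Because the connection count arrives as text, it has to be converted before it can be used numerically. A minimal sketch, assuming that `500+` is capped at 500 (the only lower bound available, since the true count is unknown); `parse_connections` is a hypothetical helper, not part of the dataset or of any library.

```python
def parse_connections(value):
    """Convert a connection string such as '85' or '500+' to (count, capped).

    Assumption: '500+' is treated as 500, and the flag records that the
    real number is higher than the stored one.
    """
    text = str(value).strip()
    capped = text.endswith('+')
    number = int(text.rstrip('+').strip())
    return number, capped


print(parse_connections('85'))    # (85, False)
print(parse_connections('500+'))  # (500, True)
```

Keeping the `capped` flag separate avoids treating 500 as an exact value when the column is later used as a feature.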

### Output (desired target):

fit - how fit the candidate is for the role? (numeric, probability between 0-1)

## Keywords: “Aspiring human resources” or “seeking human resources”


### Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

### Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

### Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?


In [1]:
# import modules
import random
import os
import warnings
# ignore warnings
warnings.filterwarnings('ignore')

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
nlp = spacy.load('en_core_web_sm')
# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.stem import WordNetLemmatizer
# scipy
from scipy.spatial.distance import cosine

%matplotlib inline 



In [2]:
# load the dataset
df = pd.read_csv('potential-talents.csv')

In [3]:
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [5]:
df.drop(['fit','id'], axis = 1, inplace = True)             


In [6]:
df = df.drop_duplicates(ignore_index=True) 

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   job_title   53 non-null     object
 1   location    53 non-null     object
 2   connection  53 non-null     object
dtypes: object(3)
memory usage: 1.4+ KB


In [8]:
df.head()

Unnamed: 0,job_title,location,connection
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85
1,Native English Teacher at EPIK (English Progra...,Kanada,500+
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
3,People Development Coordinator at Ryan,"Denton, Texas",500+
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+


In [9]:
def is_token_allowed(token):
    """Only allow valid tokens which are not stop words or punctuation symbols."""
    if (not token or not token.text.strip() or
            token.is_stop or token.is_punct):
        return False
    return True

def preprocess_token(token):
    # Reduce token to its lowercase lemma form
    return token.lemma_.strip().lower()

#complete_filtered_tokens = [preprocess_token(token)
#for token in df_doc if is_token_allowed(token)]

In [10]:
df['nlp_job_title'] = df ['job_title']

In [11]:
for i in range(len(df)):
    df_doc = nlp(df.job_title[i])
    # .at avoids chained assignment and can store a list in a single cell
    df.at[i, 'nlp_job_title'] = [preprocess_token(token) for token in df_doc if is_token_allowed(token)]
    


In [12]:
complete_filtered_tokens = [preprocess_token(token)
for token in df_doc if is_token_allowed(token)]
complete_filtered_tokens

['director', 'administration', 'excellence', 'logging']

In [13]:
#df['nlp_job_title'][3] = complete_filtered_tokens

In [14]:
df.head()

Unnamed: 0,job_title,location,connection,nlp_job_title
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,"[2019, c.t., bauer, college, business, graduat..."
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,"[native, english, teacher, epik, english, prog..."
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,"[aspiring, human, resources, professional]"
3,People Development Coordinator at Ryan,"Denton, Texas",500+,"[people, development, coordinator, ryan]"
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,"[advisory, board, member, celal, bayar, univer..."


## Vectorize with TF-IDF

In [16]:
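# Convert the nlp_job_title column into a list of token lists
nlp_title_list = list(df['nlp_job_title'])

# TfidfVectorizer expects one string per document, not a list of tokens,
# so join each token list back into a single space-separated string
nlp_title_docs = [' '.join(tokens) for tokens in nlp_title_list]

# Vectorize the joined titles
vectorizer = TfidfVectorizer()
vectorizer.fit(nlp_title_docs)
X = vectorizer.transform(nlp_title_docs)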

In [17]:
nlp_title_list

[['2019',
  'c.t.',
  'bauer',
  'college',
  'business',
  'graduate',
  'magna',
  'cum',
  'laude',
  'aspire',
  'human',
  'resources',
  'professional'],
 ['native', 'english', 'teacher', 'epik', 'english', 'program', 'korea'],
 ['aspiring', 'human', 'resources', 'professional'],
 ['people', 'development', 'coordinator', 'ryan'],
 ['advisory', 'board', 'member', 'celal', 'bayar', 'university'],
 ['aspiring', 'human', 'resources', 'specialist'],
 ['student',
  'humber',
  'college',
  'aspiring',
  'human',
  'resources',
  'generalist'],
 ['hr', 'senior', 'specialist'],
 ['seek', 'human', 'resources', 'hris', 'generalist', 'positions'],
 ['student', 'chapman', 'university'],
 ['svp',
  'chro',
  'marketing',
  'communications',
  'csr',
  'officer',
  '|',
  'engie',
  '|',
  'houston',
  '|',
  'woodlands',
  '|',
  'energy',
  '|',
  'gphr',
  '|',
  'sphr'],
 ['human',
  'resources',
  'coordinator',
  'intercontinental',
  'buckhead',
  'atlanta'],
 ['aspiring',
  'human',
  

In [18]:
df['cleaned_job_title'] = df['job_title']

In [19]:
df.head()

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,"[2019, c.t., bauer, college, business, graduat...",2019 C.T. Bauer College of Business Graduate (...
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,"[native, english, teacher, epik, english, prog...",Native English Teacher at EPIK (English Progra...
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,"[aspiring, human, resources, professional]",Aspiring Human Resources Professional
3,People Development Coordinator at Ryan,"Denton, Texas",500+,"[people, development, coordinator, ryan]",People Development Coordinator at Ryan
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,"[advisory, board, member, celal, bayar, univer...",Advisory Board Member at Celal Bayar University


In [20]:
# Remove unnecessary characters such as pipes, commas and dots to clean the data.
# regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching.
df['cleaned_job_title'] = df['cleaned_job_title'].str.replace(r"[^a-zA-Z0-9' ]", "", regex=True)
df['cleaned_job_title'] = df['cleaned_job_title'].str.lower()
df = df.reset_index(drop = True)
df.head()

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,"[2019, c.t., bauer, college, business, graduat...",2019 ct bauer college of business graduate mag...
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,"[native, english, teacher, epik, english, prog...",native english teacher at epik english program...
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,"[aspiring, human, resources, professional]",aspiring human resources professional
3,People Development Coordinator at Ryan,"Denton, Texas",500+,"[people, development, coordinator, ryan]",people development coordinator at ryan
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,"[advisory, board, member, celal, bayar, univer...",advisory board member at celal bayar university


In [21]:
# Remove stop words and lemmatize words in cleaned job title
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

for i in range(len(df)):
    word_tokens = word_tokenize(df['cleaned_job_title'][i])
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    lemmatized_sentence = [lemmatizer.lemmatize(word) for word in filtered_sentence]
    # .loc avoids chained assignment, which may write to a temporary copy
    df.loc[i, 'cleaned_job_title'] = TreebankWordDetokenizer().detokenize(lemmatized_sentence)

df.head()

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,"[2019, c.t., bauer, college, business, graduat...",2019 ct bauer college business graduate magna ...
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,"[native, english, teacher, epik, english, prog...",native english teacher epik english program korea
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,"[aspiring, human, resources, professional]",aspiring human resource professional
3,People Development Coordinator at Ryan,"Denton, Texas",500+,"[people, development, coordinator, ryan]",people development coordinator ryan
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,"[advisory, board, member, celal, bayar, univer...",advisory board member celal bayar university


In [22]:
# Convert job_title column into a list
cleaned_title_list = list(df['cleaned_job_title'])                    

# Vectorize job_title_list
vectorizer = TfidfVectorizer()                                 
vectorizer.fit(cleaned_title_list)                    
X = vectorizer.transform(cleaned_title_list)                   

In [23]:
# Get feature names in all the documents
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
print("Number of unique features: ", len(feature_names))        

Number of unique features:  180


In [24]:
# Convert job titles into arrays
tfidf_vector = X.toarray()                                  
print("Shape of Tfidf vector: ", tfidf_vector.shape)        

Shape of Tfidf vector:  (53, 180)
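
To make the numbers behind `tfidf_vector` less opaque, the following dependency-free sketch reproduces the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The toy titles and the `tfidf_rows` helper are illustrative assumptions, and tokenisation here is a plain whitespace split, which is simpler than the vectorizer's own token pattern.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm if norm else 0.0 for x in row])
    return vocab, rows


docs = [
    'aspiring human resource professional',
    'human resource specialist',
    'native english teacher',
]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
# 'aspiring' appears in one title, 'human' in two, so 'aspiring' weighs more
print(round(first['aspiring'], 3), round(first['human'], 3))
```

Each row has one entry per vocabulary word, which is why the real matrix above has 180 columns, and rarer words receive larger weights than words shared by many titles.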


## Cosine Similarity

In [25]:
# Searched phrase
phrase = 'Aspiring human resources'

In [26]:
# Define cosine similarity function (cosine gives cosine distance)
def cosine_similarity(vector_1, vector_2):
    cos_sim = 1 - cosine(vector_1, vector_2)        
    return cos_sim
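
`scipy.spatial.distance.cosine` returns a distance, which is why the function above subtracts it from 1. The same quantity can be written out directly, as in this small dependency-free sketch. `cosine_similarity_plain` is a hypothetical name, and returning 0.0 for an all-zero vector (for instance a search phrase with no words in the vocabulary) is an assumption made here to avoid dividing by zero.

```python
import math


def cosine_similarity_plain(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat "no overlap information" as zero similarity
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity_plain([1, 0], [1, 0]))  # identical direction
print(cosine_similarity_plain([1, 0], [0, 1]))  # orthogonal
print(cosine_similarity_plain([0, 0], [1, 1]))  # zero vector, guarded
```

Since TF-IDF weights are never negative, the similarity between two titles always falls between 0 and 1, which matches the range requested for `fit`.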

In [27]:
# Lemmatize searched phrase
lemmatized_phrase = []
tokenized_phrase = word_tokenize(phrase.lower())
for word in tokenized_phrase:
    lemmatized_phrase.append(lemmatizer.lemmatize(word))
lemmatized_phrase = TreebankWordDetokenizer().detokenize(lemmatized_phrase)

# Convert search phrase into a vector
X1 = vectorizer.transform([lemmatized_phrase])                    
X1_vector = X1.toarray()
print("Shape of search phrase vector:", X1_vector.shape)

Shape of search phrase vector: (1, 180)


In [28]:
# Calculate the TF-IDF cosine similarity and add it to the dataframe
# X1_vector has shape (1, 180); scipy's cosine expects 1-D inputs, so use row 0
sim_score_list = []
for x in range(len(df)):
    sim_score_list.append(cosine_similarity(tfidf_vector[x], X1_vector[0]))

df['tfidf_sim_score'] = sim_score_list
df.sort_values(by ='tfidf_sim_score', ascending = False).head(20)

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,"[aspiring, human, resources, professional]",aspiring human resource professional,0.774586
45,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,"[aspiring, human, resources, professional]",aspiring human resource professional,0.774586
5,Aspiring Human Resources Specialist,Greater New York City Area,1,"[aspiring, human, resources, specialist]",aspiring human resource specialist,0.684514
21,"Aspiring Human Resources Manager, seeking inte...","Houston, Texas Area",7,"[aspiring, human, resources, manager, seek, in...",aspiring human resource manager seeking intern...,0.626699
12,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,"[aspiring, human, resources, management, stude...",aspiring human resource management student see...,0.437541
20,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,"[business, management, major, aspiring, human,...",business management major aspiring human resou...,0.435239
22,Human Resources Professional,Greater Boston Area,16,"[human, resources, professional]",human resource professional,0.427058
14,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,"[experienced, retail, manager, aspire, human, ...",experienced retail manager aspiring human reso...,0.413698
6,Student at Humber College and Aspiring Human R...,Kanada,61,"[student, humber, college, aspiring, human, re...",student humber college aspiring human resource...,0.399351
30,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,"[aspiring, human, resources, professional, |, ...",aspiring human resource professional energetic...,0.382502
