## Background:

As a talent sourcing and management company, we find talented individuals and source them to technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client's needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is a challenge of its own.

Our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are searching for. Beyond that, for a specific role we want to fill, we are interested in developing a machine-learning-powered pipeline that spots talented individuals and ranks them by fitness.

We currently source a few candidates semi-automatically, so the sourcing part is not a concern at this time. Instead, we expect to first determine the best-matching candidates based on how fit they are for a given role. We generally search using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once fitting candidates are listed and ranked, we employ a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.

#### Goal(s):
Predict how fit the candidate is based on their available information (variable fit)

#### Success Metric(s):
* Rank candidates based on a fitness score.
* Re-rank candidates when a candidate is starred.

#### Bonus(es):
* We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.
* How can we filter out candidates who should not be in this list in the first place?
* Can we determine a cut-off point that would work for other roles without losing high potential candidates?
* Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import re
import string
from collections import Counter

import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import spacy

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer


[nltk_data] Downloading package stopwords to C:\Users\Dorcas
[nltk_data]     Taiwo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Dorcas
[nltk_data]     Taiwo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Dorcas
[nltk_data]     Taiwo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Load and Explore Dataset

In [2]:
# Read in Dataset

Data = pd.read_csv('C:\\Users\\Public\\potential-talents - Aspiring human resources - seeking human resources.csv')
Dataset = Data.copy()

Dataset.head()

,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [3]:
Dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [4]:
#some basic stats for categorical columns

Dataset.describe(include=object)

,job_title,location,connection
count,104,104,104
unique,52,41,33
top,2019 C.T. Bauer College of Business Graduate (...,Kanada,500+
freq,7,12,44


In [5]:
Dataset.shape

(104, 5)

In [6]:
len(Dataset['id'].unique())

104

In [7]:
Dataset = Dataset.set_index('id')
Dataset.head()

id,job_title,location,connection,fit
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
4,People Development Coordinator at Ryan,"Denton, Texas",500+,
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [8]:
#drop fit column

Dataset.drop('fit',axis=1,inplace=True)
Dataset.head()

id,job_title,location,connection
1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85
2,Native English Teacher at EPIK (English Progra...,Kanada,500+
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
4,People Development Coordinator at Ryan,"Denton, Texas",500+
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+


In [9]:
#count duplicated rows in the total dataset

Dataset.loc[Dataset.duplicated(), :]
print(Dataset.duplicated().sum())

51


In [10]:
#drop duplicated rows in the total dataset
Dataset.drop_duplicates(inplace=True)
Dataset.shape

(53, 3)

In [11]:
# Verify whether there are any null values

Dataset.isnull().sum()

job_title     0
location      0
connection    0
dtype: int64

In [12]:
Dataset.job_title.value_counts()

Aspiring Human Resources Professional                                                                                    2
2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 1
Lead Official at Western Illinois University                                                                             1
Senior Human Resources Business Partner at Heil Environmental                                                            1
Aspiring Human Resources Professional | An energetic and Team-Focused Leader                                             1
HR Manager at Endemol Shine North America                                                                                1
Human Resources professional for the world leader in GIS software                                                        1
RRP Brand Portfolio Executive at JTI (Japan Tobacco International)                                                       1
Information Syst

## Data Cleansing / Text Preprocessing

* Define preprocessing function to clean the text columns

In [13]:
# Load spacy
nlp = spacy.load("en_core_web_sm")

def clean_string(text, stem="None"):

    final_string = ""
    
    # Remove all the special characters
    text = re.sub(r'\W', ' ', str(text))

    # remove all single characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)

    # Remove a single character at the start of the text
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)

    # Removing prefixed 'b'
    text = re.sub(r'^b\s+', '', text)
    
    # Make text lower case
    text = text.lower()
        
    # Remove stop words
    text = text.split()
    useless_words = nltk.corpus.stopwords.words("english")
   
    text_filtered = [word for word in text if word not in useless_words]

    # Remove numbers
    text_filtered = [re.sub(r'\w*\d\w*', '', w) for w in text_filtered]
    
    # Stem or Lemmatize
    if stem == 'Stem':
        stemmer = PorterStemmer() 
        text_stemmed = [stemmer.stem(y) for y in text_filtered]
    elif stem == 'Lem':
        lem = WordNetLemmatizer()
        text_stemmed = [lem.lemmatize(y) for y in text_filtered]
    elif stem == 'Spacy':
        text_filtered = nlp(' '.join(text_filtered))
        text_stemmed = [y.lemma_ for y in text_filtered]
    else:
        text_stemmed = text_filtered

    final_string = ' '.join(text_stemmed)
    
    # Removing whitespace from the start of the string
    final_string = final_string.lstrip()
    
    # Substituting multiple spaces with single space
    final_string = re.sub(r'\s+', ' ', final_string, flags=re.I)
    
    return final_string

In [14]:
Dataset['job_title'] = Dataset['job_title'].apply(lambda x: clean_string(x, stem='Spacy'))

In [15]:
#Example after text preprocessing

print(Data.iloc[0]['job_title'])
Dataset.iloc[0]['job_title']

2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional


'bauer college business graduate magna cum laude aspire human resource professional'

In [16]:
Dataset.job_title.value_counts()

aspire human resource professional                                                                2
bauer college business graduate magna cum laude aspire human resource professional                1
lead official western illinois university                                                         1
senior human resource business partner heil environmental                                         1
aspire human resource professional energetic team focus leader                                    1
hr manager endemol shine north america                                                            1
human resource professional world leader gis software                                             1
rrp brand portfolio executive jti japan tobacco international                                     1
information system specialist programmer love datum organization                                  1
bachelor science biology victoria university wellington                                           1


### Observations
* After cleansing, the job_title field still contains some abbreviations to expand and some spacing to adjust.

In [17]:
# Replace abbreviations with their full text
Dataset.replace({'job_title' : { 'chro' : 'chief human resources officer', 'svp' : 'senior vice president'
        ,'gphr' : 'global professional in human resources','hris' : 'human resources information system'
        , 'csr' : 'corporate social responsibility', 'sphr' : 'Senior Professional in Human Resources'
        , 'hr' : 'human resources', 'rrp': 'recommended retail price'}}, regex=True, inplace=True)
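
Because the replacement above runs with `regex=True`, each key is treated as a pattern that can also match inside longer words, so a short key such as `hr` may fire in the middle of an unrelated token. A minimal sketch of a stricter variant, assuming whole-word matching is what is wanted, is shown below. The names `ABBREVIATIONS` and `expand_abbreviations` are hypothetical, and the map is only a subset used for illustration.

```python
import re

# Hypothetical subset of the abbreviation map, for illustration only.
ABBREVIATIONS = {
    "chro": "chief human resources officer",
    "hr": "human resources",
    "svp": "senior vice president",
}

# One pattern with word boundaries, so "hr" matches only as a whole token.
_pattern = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")

def expand_abbreviations(text):
    """Replace whole-word abbreviations with their expansions in a single pass."""
    return _pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("chro and hr manager"))
# chief human resources officer and human resources manager
print(expand_abbreviations("christopher"))
# christopher  (the inner "hr" is left alone)
```

Since the substitution is done in one pass, the text produced by an expansion is not scanned again, which avoids one expansion triggering another.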

In [18]:
Dataset.job_title.value_counts()

aspire human resource professional                                                                                                                                                                                                 2
bauer college business graduate magna cum laude aspire human resource professional                                                                                                                                                 1
lead official western illinois university                                                                                                                                                                                          1
senior human resource business partner heil environmental                                                                                                                                                                          1
aspire human resource professional energetic team focus leader                      

In [19]:
words_counts = Counter()
for i in Dataset.job_title:
    for word in i.split(" "):
        words_counts[word] += 1

print('There are', len(words_counts) ,'words in the job title column')
words_counts.most_common()

There are 185 words in the job title column


[('human', 39),
 ('resource', 34),
 ('aspire', 13),
 ('professional', 11),
 ('seek', 10),
 ('manager', 7),
 ('university', 6),
 ('student', 6),
 ('management', 6),
 ('business', 5),
 ('generalist', 5),
 ('resources', 5),
 ('specialist', 4),
 ('position', 4),
 ('senior', 3),
 ('information', 3),
 ('system', 3),
 ('opportunity', 3),
 ('director', 3),
 ('major', 3),
 ('college', 2),
 ('english', 2),
 ('coordinator', 2),
 ('officer', 2),
 ('in', 2),
 ('internship', 2),
 ('retail', 2),
 ('staff', 2),
 ('north', 2),
 ('america', 2),
 ('leader', 2),
 ('bauer', 1),
 ('graduate', 1),
 ('magna', 1),
 ('cum', 1),
 ('laude', 1),
 ('native', 1),
 ('teacher', 1),
 ('epik', 1),
 ('program', 1),
 ('korea', 1),
 ('people', 1),
 ('development', 1),
 ('ryan', 1),
 ('advisory', 1),
 ('board', 1),
 ('member', 1),
 ('celal', 1),
 ('bayar', 1),
 ('humber', 1),
 ('chapman', 1),
 ('vice', 1),
 ('president', 1),
 ('chief', 1),
 ('marketing', 1),
 ('communication', 1),
 ('corporate', 1),
 ('social', 1),
 ('respo

### Observation:
The most frequent words in the job title column are: 'human', 'resource', 'aspire', 'professional' and 'seek'.

## Ranking/Score candidates based on keywords.

### Apply TF-IDF
* TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF reflects how important a word is to a document (here, a job title) by down-weighting words that occur frequently across the whole collection of documents.

* Term Frequency (TF): the number of times a term appears in a document. It measures how common a word is within a given sentence.

* Inverse Document Frequency (IDF): a measure of how rare a word is across the corpus. Words like “the” and “a” show up in almost all documents, whereas rare words occur in only a few.

If a word appears in almost every document, it carries little information for telling documents apart, so it receives a low weight.
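
To make the definitions above concrete, here is a small hand-rolled sketch using the textbook formula `tf * log(N / df)` on a toy corpus. The three titles are illustrative and not taken from the dataset, and the helper names `idf` and `tfidf` are hypothetical. Note that scikit-learn's `TfidfVectorizer`, used in the cells below, applies a smoothed IDF and L2-normalizes each row by default, so its numbers differ from this sketch even though the idea is the same.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned job titles (illustrative, not the real dataset).
docs = [
    "aspire human resource professional",
    "human resource manager",
    "native english teacher",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq[term])

def tfidf(tokens):
    """Map each term of one document to tf * idf."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {w:.3f}")
```

In the first title, "aspire" and "professional" appear in only one document and therefore receive a larger weight than "human" and "resource", which appear in two.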

In [20]:
df=Dataset.copy()

In [21]:
def rank_candidates_tfidf(keyword,df,feature_name):
    #store cleaned rows in a list
    corpus=df[feature_name].tolist()
    #apply the same preprocessing to the keyword
    keyword=clean_string(keyword,stem='Spacy')
    #create an instance of TfidfVectorizer
    tfvectoriser = TfidfVectorizer()
    #append the cleaned keyword to the corpus list so it shares the vocabulary
    corpus.append(keyword)
    #fit the TF-IDF vectorizer on the cleaned data plus the keyword
    key_tf=tfvectoriser.fit_transform(corpus)
    #create a dataframe that contains the TF-IDF value of each token for each row
    #(get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    key_df=pd.DataFrame(key_tf.toarray(), columns=tfvectoriser.get_feature_names_out())
    shape=key_df.shape
    #calculate the cosine similarity between the keyword (last row) and the data's rows
    cosine_sim = cosine_similarity(key_tf.toarray()[:shape[0]-1], key_tf.toarray()[shape[0]-1].reshape(1, -1))
    #add the cosine results as a new column fit_tfidf (flattened to one value per row)
    df['fit_tfidf']=cosine_sim.ravel()
    #store the ranked candidates based on the fit score
    final_df=df.sort_values('fit_tfidf',ascending=False)
    return final_df

In [22]:
keyword='Aspiring human resources'
df=rank_candidates_tfidf(keyword,df,"job_title")



In [24]:
#number of ranked candidates based on the keyword
print('There are',df[df['fit_tfidf']!=0].shape[0],'ranked candidates for the job',keyword)

There are 35 ranked candidates for the job Aspiring human resources


In [25]:
#Explore the new column 
df

id,job_title,location,connection,fit_tfidf
3,aspire human resource professional,"Raleigh-Durham, North Carolina Area",44,0.772731
97,aspire human resource professional,"Kokomo, Indiana Area",71,0.772731
6,aspire human resource specialist,Greater New York City Area,1,0.670497
73,aspire human resource manager seek internship ...,"Houston, Texas Area",7,0.609912
27,aspire human resource management student seek ...,"Houston, Texas Area",500+,0.429037
72,business management major aspire human resourc...,"Monroe, Louisiana Area",5,0.426692
74,human resource professional,Greater Boston Area,16,0.416265
66,experience retail manager aspire human resourc...,"Austin, Texas Area",57,0.402334
7,student humber college aspire human resource g...,Kanada,61,0.386788
100,aspire human resource manager graduating may s...,"Cape Girardeau, Missouri",103,0.348935


### Apply BERT
BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks. The cells below use the `sentence-transformers` library with the `all-MiniLM-L6-v2` model, a compact transformer-based sentence encoder that maps each text to a 384-dimensional vector.

In [26]:
#Load the pretrained model 
bert = SentenceTransformer('all-MiniLM-L6-v2')

In [27]:
def rank_candidates_bert(keyword,df,feature_name):
    #reuse the model loaded in the previous cell instead of reloading it on every call
    #store cleaned rows in a list
    corpus=df[feature_name].tolist()
    #apply the same preprocessing to the keyword
    keyword=clean_string(keyword, stem='Spacy')
    #append the cleaned keyword to the corpus list
    corpus.append(keyword)
    #encode the sentences
    sentence_embeddings = bert.encode(corpus)
    print("the shape of our encoded sentences is",sentence_embeddings.shape)
    shape=sentence_embeddings.shape
    #calculate the cosine similarity between the keyword (last row) and the data's rows
    cosine_sim = cosine_similarity(sentence_embeddings[:shape[0]-1], sentence_embeddings[shape[0]-1].reshape(1, -1))
    #add the cosine results as a new column fit_bert (flattened to one value per row)
    df['fit_bert']=cosine_sim.ravel()
    #store the ranked candidates based on the fit score
    final_df=df.sort_values('fit_bert',ascending=False)
    return final_df

In [28]:
keyword='Aspiring human resources'
df=rank_candidates_bert(keyword,df,"job_title")

the shape of our encoded sentences is (54, 384)


In [29]:
#number of ranked candidates based on the keyword
print('There are',df[df['fit_bert']!=0].shape[0],'ranked candidates for the job',keyword)

There are 53 ranked candidates for the job Aspiring human resources
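
Cosine similarities between dense sentence embeddings are rarely exactly 0, which is why the `!= 0` filter above keeps all 53 candidates. One possible way to shorten the list is a cut-off on the score. The sketch below is only an illustration under that assumption: `filter_by_cutoff` is a hypothetical helper, the five `fit_bert` values are copied from the tables in this notebook, and the threshold of 0.5 is arbitrary rather than tuned.

```python
def filter_by_cutoff(scores, cutoff):
    """Keep (id, score) pairs with score >= cutoff, best first."""
    kept = [(cid, s) for cid, s in scores.items() if s >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# fit_bert scores copied from the tables in this notebook (candidate id -> score).
fit_bert = {3: 0.924379, 6: 0.913294, 74: 0.725905, 99: 0.661434, 94: 0.382453}

# 0.5 is an arbitrary illustrative threshold, not a tuned value.
shortlist = filter_by_cutoff(fit_bert, cutoff=0.5)
print([cid for cid, _ in shortlist])
# [3, 6, 74, 99]
```

A cut-off that transfers to other roles would need to be validated against reviewed candidates for those roles, since absolute similarity values depend on the keyword and on the embedding model.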


In [30]:
#Explore the new column 
df

id,job_title,location,connection,fit_tfidf,fit_bert
3,aspire human resource professional,"Raleigh-Durham, North Carolina Area",44,0.772731,0.924379
97,aspire human resource professional,"Kokomo, Indiana Area",71,0.772731,0.924379
6,aspire human resource specialist,Greater New York City Area,1,0.670497,0.913294
74,human resource professional,Greater Boston Area,16,0.416265,0.725905
66,experience retail manager aspire human resourc...,"Austin, Texas Area",57,0.402334,0.707049
73,aspire human resource manager seek internship ...,"Houston, Texas Area",7,0.609912,0.689495
72,business management major aspire human resourc...,"Monroe, Louisiana Area",5,0.426692,0.683848
100,aspire human resource manager graduating may s...,"Cape Girardeau, Missouri",103,0.348935,0.663966
7,student humber college aspire human resource g...,Kanada,61,0.386788,0.663035
99,seek human resource position,"Las Vegas, Nevada Area",48,0.284455,0.661434


### Observation: 
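From the results above, the BERT-based embeddings appear more effective for ranking the talents and are adopted for the next step: they assign a nonzero score to all 53 candidates, whereas TF-IDF gives a nonzero score to only 35, because TF-IDF can only match titles that share exact tokens with the keyword. Since no labeled `fit` values are available, this comparison is qualitative.
It is important to note that BERT's performance comes from pre-training on a large corpus of unlabelled text, including the entire English Wikipedia (2,500 million words) and the BooksCorpus (800 million words).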

## Re-ranking Candidates after starring
Here, after a candidate is starred by entering their id, that candidate's job title becomes the new keyword, and cosine similarity to this title is used to re-rank the talent list. The code below handles a single starred id.

* Re-rank using the starred job title as the keyword and computing its cosine similarity to all the other job titles.
* BERT embedding vectors are used for this task.
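
The cells in this section re-rank against one starred title. As a sketch of how several starred candidates could be combined, one simple option (an assumption here, not the method used in this notebook) is to average the starred embeddings and rank every candidate by cosine similarity to that mean. The helpers `cosine` and `rerank` are hypothetical, and the 2-D vectors are toy stand-ins for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(embeddings, starred_ids):
    """Rank candidate ids by similarity to the mean of the starred embeddings."""
    starred = [embeddings[cid] for cid in starred_ids]
    centroid = [sum(dim) / len(starred) for dim in zip(*starred)]
    scores = {cid: cosine(vec, centroid) for cid, vec in embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D vectors standing in for real sentence embeddings (hypothetical values).
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}

print(rerank(embeddings, starred_ids=["c"]))
# ['c', 'd', 'b', 'a']
print(rerank(embeddings, starred_ids=["a", "d"]))
# ['b', 'a', 'd', 'c']
```

With a single starred id this reduces to the approach of this section: the starred candidate scores 1.0 and comes first. With two dissimilar starred candidates, the mean can sit closest to a third candidate, which is a known trade-off of averaging.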

In [31]:
#df["starred"]=df["fit_bert"]

In [32]:
index = df.index
print('Candidate Ids list')
for i in index:
    print(i,end=' ')

Candidate Ids list
3 97 6 74 66 73 72 100 7 99 76 78 82 28 67 13 27 8 89 83 79 88 10 71 68 101 69 12 84 81 70 4 75 1 77 94 90 86 104 80 102 91 11 92 96 98 5 87 93 2 95 85 103 

In [48]:
starred_id = input("Please, enter id of starred candidate(s): ")

Please, enter id of starred candidate(s): 99


In [49]:
# Build BERT_base model and get Bert vectors

bert_model = SentenceTransformer('bert-base-nli-mean-tokens')
rerank_key = df[df.index.isin([int(starred_id)])]['job_title'].tolist()
bert_vectors_rerank = bert_model.encode(list(df['job_title']))
rerank_vectors = bert_model.encode(rerank_key)

In [50]:
# Cosine similarity between every job title and the starred job title
RerankScore = cosine_similarity(bert_vectors_rerank, rerank_vectors[:1]).ravel()

df['Rerank Score'] = RerankScore

In [51]:
df.sort_values(by=['Rerank Score'],ascending=False)

id,job_title,location,connection,fit_tfidf,fit_bert,Rerank Score
99,seek human resource position,"Las Vegas, Nevada Area",48,0.284455,0.661434,1.0
28,seek human resource opportunity,"Chicago, Illinois",390,0.275043,0.638006,0.976697
10,seek human resource human resources informatio...,Greater Philadelphia Area,500+,0.226925,0.598351,0.89431
74,human resource professional,Greater Boston Area,16,0.416265,0.725905,0.866465
6,aspire human resource specialist,Greater New York City Area,1,0.670497,0.913294,0.863791
97,aspire human resource professional,"Kokomo, Indiana Area",71,0.772731,0.924379,0.843415
3,aspire human resource professional,"Raleigh-Durham, North Carolina Area",44,0.772731,0.924379,0.843415
89,director human resource ey,Greater Atlanta Area,349,0.225287,0.61377,0.802768
68,human resource specialist luxottica,Greater New York City Area,500+,0.230374,0.583683,0.790024
94,seek human resource opportunity open travel re...,Amerika Birleşik Devletleri,415,0.150942,0.382453,0.776818


### Conclusion: 
The result above shows that the list ranked with the BERT model is re-ranked successfully. The starred candidate with id 99 receives a score of 1.0, as expected, since its job title is compared with itself, and the candidates with the most similar titles follow. This meets the two success metrics of the project.

The goal of this project is to predict how fit the candidates are based on their available information. The project is divided into two steps: rank candidates based on a fitness score, then re-rank them when a candidate is starred.