# Objective - Rank potential talent using search keywords, and re-rank when a candidate is starred

## Background

As a talent sourcing and management company, we find talented individuals and source them to technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client’s needs and what they are looking for in a potential candidate. Second, we need to understand what makes a candidate shine for the role we are trying to fill. Third, knowing where to find talented individuals is a challenge of its own.

Our work requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them based on their fitness.

We currently source a few candidates semi-automatically, so sourcing is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally search using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once we have listed and ranked fitting candidates, we employ a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets them as an ideal candidate for the given role, and we expect the list to be re-ranked each time a candidate is starred.
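
The re-ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used later in this notebook: it assumes every candidate already has a vector representation, and the helper name `rerank_with_star` and the blending weight `alpha` are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def rerank_with_star(candidate_vectors, query_vector, starred_index=None, alpha=0.5):
    """Return candidate indices ordered from best to worst fit.

    Hypothetical helper: before any starring, candidates are ranked by
    similarity to the query. After a star, each score is a blend of
    similarity to the query and similarity to the starred candidate.
    """
    query_scores = np.array([cosine_sim(v, query_vector) for v in candidate_vectors])
    if starred_index is None:
        scores = query_scores
    else:
        star_vector = candidate_vectors[starred_index]
        star_scores = np.array([cosine_sim(v, star_vector) for v in candidate_vectors])
        scores = (1 - alpha) * query_scores + alpha * star_scores
    # argsort is ascending, so negate the scores to sort in descending order
    return np.argsort(-scores, kind="stable")

# Toy 2-D vectors standing in for real job-title embeddings
candidates = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query = np.array([1.0, 0.0])

initial_ranking = rerank_with_star(candidates, query)
starred_ranking = rerank_with_star(candidates, query, starred_index=2, alpha=0.8)
```

With a larger `alpha`, the starred candidate dominates the ordering; with `alpha = 0`, the original keyword ranking is kept.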

Import relevant libraries

In [6]:
import warnings
import os
import pandas as pd
import numpy as np
from numpy import dot
from numpy.linalg import norm
# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
# nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.stem import WordNetLemmatizer
# scipy
from scipy.spatial.distance import cosine
# gensim
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors, Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec
# BERT
from sentence_transformers import SentenceTransformer
# Ignore warnings
warnings.filterwarnings('ignore')

In [2]:
# Initial settings
os.chdir(r"C:\Users\Anant Gupta\Apziva\Potential Talents")
pd.set_option('display.max_colwidth', None)

# Exploratory Data Analysis

Exploring dataset to understand the data

In [3]:
pt = pd.read_csv('potential-talents - Aspiring human resources - seeking human resources.csv')
print("Shape of dataframe:", pt.shape)
pt.head()

Shape of dataframe: (104, 5)


Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


Check for duplicate rows

In [4]:
# Drop fit column and explore duplicate rows
pt.drop('fit', axis = 1, inplace = True)             
pt_dup = pt.drop(['id'], axis = 1)                    
print("Number of duplicate entries:", pt_dup.duplicated().sum())                            

Number of duplicate entries: 51


In [5]:
# See duplicate entries
pt[pt_dup.duplicated(keep=False)]                     

Unnamed: 0,id,job_title,location,connection
0,1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85
1,2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+
...,...,...,...,...
60,61,HR Senior Specialist,San Francisco Bay Area,500+
61,62,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+
62,63,Student at Chapman University,"Lake Forest, California",2
63,64,"SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR","Houston, Texas Area",500+


51 rows are duplicates of an earlier entry. We'll now remove them.

In [6]:
# Drop duplicate rows and rebuild dataframe
pt_dup = pt_dup.drop_duplicates()                                    
pt = pd.concat([pt['id'], pt_dup], axis = 1).dropna(axis = 0)   
print("Shape of non-duplicated dataframe:", pt.shape)                                                         

Shape of non-duplicated dataframe: (53, 4)


We have now removed duplicate rows from the dataframe.
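
The same de-duplication can probably be written more directly with the `subset` argument of `drop_duplicates`, which compares rows on the listed columns only. The sketch below uses a made-up toy frame with the same column names; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

# Toy frame with the same column layout as the dataset (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job_title": ["HR Specialist", "Student", "HR Specialist", "Student"],
    "location": ["Houston", "Boston", "Houston", "Chicago"],
    "connection": ["85", "2", "85", "2"],
})

# Ignore `id` when comparing rows and keep the first occurrence of each profile
deduped = toy.drop_duplicates(subset=["job_title", "location", "connection"], keep="first")
```

Row 3 repeats row 1 on every compared column and is dropped; row 4 differs in `location` and is kept.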

In [7]:
# Information about the dataframe
pt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 0 to 103
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          53 non-null     int64 
 1   job_title   53 non-null     object
 2   location    53 non-null     object
 3   connection  53 non-null     object
dtypes: int64(1), object(3)
memory usage: 2.1+ KB


In [8]:
# Summary of the dataframe
pt.describe(include = 'all')

Unnamed: 0,id,job_title,location,connection
count,53.0,53,53,53
unique,,52,41,33
top,,Aspiring Human Resources Professional,"Houston, Texas Area",500+
freq,,2,4,18
mean,65.132075,,,
std,35.117954,,,
min,1.0,,,
25%,28.0,,,
50%,78.0,,,
75%,91.0,,,


There are no null values in any column, and the de-duplicated dataset has 53 rows.

# Data cleaning

Clean the data to make it easier to process and to make matching more robust.

In [9]:
# Text normalization - replace 'HR' with 'Human Resources' so the abbreviation matches the spelled-out term
pt['cleaned_job_title'] = pt['job_title'].str.replace("HR", " Human Resources ", regex=False)

# Remove unnecessary characters such as pipes, commas and dots.
# regex=True is passed explicitly because pandas 2.0+ defaults to regex=False.
pt['cleaned_job_title'] = pt['cleaned_job_title'].str.replace(r"[^a-zA-Z0-9' ]", "", regex=True)
pt['cleaned_job_title'] = pt['cleaned_job_title'].str.lower()
pt = pt.reset_index(drop = True)
pt.head()

Unnamed: 0,id,job_title,location,connection,cleaned_job_title
0,1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85,2019 ct bauer college of business graduate magna cum laude and aspiring human resources professional
1,2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+,native english teacher at epik english program in korea
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,people development coordinator at ryan
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,advisory board member at celal bayar university
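
One caveat with the plain substring replacement above is that it also rewrites "HR" inside longer tokens such as "CHRO" or "SPHR". A possible alternative, sketched here with Python's `re` module and a hypothetical helper `expand_hr`, expands "HR" only when it stands alone as a word. This is an assumption about the desired behaviour: it would, for example, leave "HRIS" untouched, so the scores would differ from those shown in this notebook.

```python
import re

def expand_hr(title):
    """Expand the standalone abbreviation 'HR' and normalize the text."""
    # \b matches a word boundary, so 'CHRO' and 'SPHR' are not affected
    expanded = re.sub(r"\bHR\b", "Human Resources", title)
    # Drop everything except letters, digits, apostrophes and spaces
    cleaned = re.sub(r"[^a-zA-Z0-9' ]", "", expanded)
    # Collapse repeated spaces left behind by removed characters
    return re.sub(r" +", " ", cleaned).strip().lower()

example_1 = expand_hr("HR Senior Specialist")
example_2 = expand_hr("SVP, CHRO | GPHR | SPHR")
```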


In [10]:
# Remove stop words and lemmatize words in cleaned job title
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
detokenizer = TreebankWordDetokenizer()

def remove_stopwords_and_lemmatize(text):
    # Tokenize, drop stop words, lemmatize each remaining token, then rejoin
    tokens = [w for w in word_tokenize(text) if w not in stop_words]
    return detokenizer.detokenize([lemmatizer.lemmatize(w) for w in tokens])

# apply() avoids chained assignment (pt[col][i] = ...), which can silently fail to update pt
pt['cleaned_job_title'] = pt['cleaned_job_title'].apply(remove_stopwords_and_lemmatize)

pt.head()

Unnamed: 0,id,job_title,location,connection,cleaned_job_title
0,1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85,2019 ct bauer college business graduate magna cum laude aspiring human resource professional
1,2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+,native english teacher epik english program korea
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resource professional
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,people development coordinator ryan
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,advisory board member celal bayar university


# Vectorize job title using TF-IDF

Convert cleaned job titles into vectors 
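
To make the weighting concrete, here is a minimal pure-Python sketch of how a TF-IDF vector is computed, assuming scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization). The helper name `tfidf_vectors` and the three toy titles are illustrative, and whitespace splitting is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)

    # Document frequency: number of documents containing each term
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        length = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / length if length else 0.0 for x in raw])
    return vocabulary, vectors

titles = [
    "aspiring human resource professional",
    "human resource generalist",
    "native english teacher",
]
vocabulary, vectors = tfidf_vectors(titles)

# Map each term to its weight in the first title
first = dict(zip(vocabulary, vectors[0]))
```

In the first title, "human" appears in two of the three documents and therefore gets a lower weight than "aspiring", which appears in only one.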

In [11]:
# Convert job_title column into a list
job_title_list = list(pt['cleaned_job_title'])                    

# Vectorize job_title_list
vectorizer = TfidfVectorizer()                                 
vectorizer.fit(job_title_list)                    
X = vectorizer.transform(job_title_list)                   

In [12]:
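# Get feature names across all the documents
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = vectorizer.get_feature_names_out()
print("Number of unique features: ", len(feature_names))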

Number of unique features:  177


In [13]:
# Convert job titles into arrays
tfidf_vector = X.toarray()                                  
print("Shape of Tfidf vector: ", tfidf_vector.shape)        
tfidf_vector[0]

Shape of Tfidf vector:  (53, 177)


array([0.32869998, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.17980684,
       0.        , 0.        , 0.        , 0.32869998, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.24463863, 0.        , 0.        , 0.        ,
       0.        , 0.29767544, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.32869998, 0.32869998,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.32869998,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.10754048, 0.        , 0.     

# Cosine Similarity

Calculate the cosine similarity between each document (job title) and the searched phrase.
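
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it directly with NumPy; the function name `cosine_sim_np` is illustrative, and for non-zero vectors it should agree with `1 - scipy.spatial.distance.cosine(u, v)`.

```python
import numpy as np

def cosine_sim_np(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||) for 1-D vectors."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_sim_np([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine_sim_np([1, 0], [0, 1])             # perpendicular vectors
```

Parallel vectors score close to 1 and perpendicular vectors score 0, regardless of vector length.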

In [14]:
# Searched phrase
phrase = 'Aspiring human resources'

In [15]:
# Define cosine similarity function (scipy's cosine() returns cosine distance, so subtract it from 1)
def cosine_similarity(vector_1, vector_2):
    cos_sim = 1 - cosine(vector_1, vector_2)
    return cos_sim

In [16]:
# Lemmatize searched phrase
lemmatized_phrase = []
tokenized_phrase = word_tokenize(phrase.lower())
for word in tokenized_phrase:
    lemmatized_phrase.append(lemmatizer.lemmatize(word))
lemmatized_phrase = TreebankWordDetokenizer().detokenize(lemmatized_phrase)

# Convert search phrase into a vector
X1 = vectorizer.transform([lemmatized_phrase])                    
X1_vector = X1.toarray()
print("Shape of search phrase vector:", X1_vector.shape)

Shape of search phrase vector: (1, 177)


In [17]:
# Calculate TF-IDF cosine similarity and add it to the pt dataframe
# Flatten the (1, 177) phrase vector to 1-D, since some SciPy versions reject 2-D inputs
search_vector = X1_vector.ravel()
sim_score_list = []
for x in range (0, len(pt)):
    sim_score_list.append(cosine_similarity(tfidf_vector[x], search_vector))

pt['tfidf_sim_score'] = sim_score_list
pt.sort_values(by ='tfidf_sim_score', ascending = False)

Unnamed: 0,id,job_title,location,connection,cleaned_job_title,tfidf_sim_score
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resource professional,0.766333
45,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resource professional,0.766333
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resource specialist,0.674875
21,73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resource manager seeking internship human resource,0.609698
12,27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",500+,aspiring human resource management student seeking internship,0.428269
20,72,Business Management Major and Aspiring Human Resources Manager,"Monroe, Louisiana Area",5,business management major aspiring human resource manager,0.425993
14,66,Experienced Retail Manager and aspiring Human Resources Professional,"Austin, Texas Area",57,experienced retail manager aspiring human resource professional,0.404718
22,74,Human Resources Professional,Greater Boston Area,16,human resource professional,0.397849
8,10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,seeking human resource human resource generalist position,0.391072
6,7,Student at Humber College and Aspiring Human Resources Generalist,Kanada,61,student humber college aspiring human resource generalist,0.390565


# Word Embedding

A word embedding maps each word to a dense numeric vector so that words used in similar contexts lie close together in the vector space. We'll look at two word embedding models (GloVe and Word2Vec) and compute similarity scores between the searched phrase and the documents.

## a. GloVe (Global Vectors)

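GloVe learns word vectors from global word co-occurrence counts: it is trained so that the dot product of two word vectors approximates the logarithm of how often the two words co-occur. I am using the pre-trained `glove.6B.50d` vectors from the https://nlp.stanford.edu/projects/glove/ website.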
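
For reference, the weighted least-squares objective that GloVe minimizes can be written as follows, where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that limits the influence of very frequent pairs:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$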

In [18]:
# Create temp file and save converted embedding into it
target_file = get_tmpfile('word2vec.6B.50d.txt')
glove2word2vec('glove.6B.50d.txt', target_file) 

# Load the converted embedding into memory
glove_model = KeyedVectors.load_word2vec_format(target_file)

# Save as binary data
glove_model.save_word2vec_format('word2vec.6B.50d.bin.gz', binary=True)

In [19]:
def doc_token_vectors(sentence, model, sentence_vector_list, vector_dimensions):
    """Append one vector per token in `sentence` to `sentence_vector_list`.

    Tokens missing from the model's vocabulary get a zero vector.
    """
    for token in word_tokenize(sentence):
        if token in model.key_to_index:
            token_vector = model[token]
        else:
            token_vector = np.zeros(vector_dimensions)
        sentence_vector_list.append(token_vector)

    return sentence_vector_list
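
The function above inserts a zero vector for every out-of-vocabulary token. When the token vectors are later averaged, those zeros shrink the mean's length but not its direction, so cosine similarity is unaffected. The one edge case is a title in which no token is known: its mean is the zero vector, for which cosine similarity is undefined. The sketch below shows one way to pool tokens and flag that case; `mean_sentence_vector` is a hypothetical helper, and a plain dict stands in for the gensim model.

```python
import numpy as np

def mean_sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens; return None if no token is known.

    `embeddings` is any mapping from token to vector (a plain dict here,
    standing in for a gensim KeyedVectors object).
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # cosine similarity is undefined for an all-zero vector
    return np.mean(known, axis=0)

# Toy 3-dimensional embeddings (made-up values, not real GloVe vectors)
toy_embeddings = {
    "human": np.array([1.0, 0.0, 1.0]),
    "resource": np.array([0.0, 2.0, 1.0]),
}

pooled = mean_sentence_vector(["aspiring", "human", "resource"], toy_embeddings)
unknown = mean_sentence_vector(["loparex"], toy_embeddings)
```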

In [20]:
# Vectorize job title using GloVe model
glove_vectors = []
for i in range(0, len(pt)):
    glove_sentence_vector = []
    doc_token_vectors(pt['cleaned_job_title'][i], glove_model, glove_sentence_vector, 50)
    glove_vectors.append(glove_sentence_vector)
    
glove_vectors[0:2]

[[array([ 0.29147 ,  0.034791,  0.12969 , -0.20702 , -1.1551  , -0.37773 ,
          0.048003, -0.22573 , -0.15784 ,  0.17262 ,  0.36289 ,  0.13456 ,
          0.51093 ,  0.56072 ,  1.3861  ,  0.62435 ,  0.18149 ,  0.73681 ,
         -0.79169 ,  0.45962 ,  0.72966 , -0.86147 ,  0.58159 , -0.053018,
          0.32392 ,  0.89441 ,  0.86256 , -0.26343 , -0.47104 ,  0.70734 ,
          0.20118 ,  0.89692 , -0.32302 ,  0.21936 ,  0.2381  , -0.15296 ,
          0.21473 ,  0.81703 ,  0.07275 , -1.1414  , -1.2233  , -1.377   ,
          0.80457 ,  0.4554  , -1.3315  ,  0.26122 ,  1.0741  ,  0.55246 ,
          0.074761,  0.072005], dtype=float32),
  array([ 1.0561   , -0.18373  ,  0.7577   ,  0.60339  , -0.41714  ,
         -0.34824  ,  0.68804  , -1.2084   ,  0.18644  ,  0.12866  ,
          0.962    ,  0.3861   ,  0.61741  ,  0.23326  , -1.2275   ,
          1.6706   , -1.4296   , -0.34089  , -1.7715   ,  0.47763  ,
          0.38193  , -0.070358 ,  0.98955  ,  0.02929  , -0.58777  ,
       

In [21]:
# Vectorize searched phrase using GloVe model
glove_search_phrase_vector = []
doc_token_vectors(lemmatized_phrase, glove_model, glove_search_phrase_vector, 50)

[array([-4.3762e-01,  8.3236e-01,  2.9638e-01, -7.8180e-01,  2.4432e-01,
         3.2892e-01, -8.2750e-01,  1.2692e-01, -3.9246e-01,  8.6643e-01,
         2.4300e-01, -6.6664e-04,  5.2488e-01,  1.5963e+00,  3.8362e-01,
        -3.2021e-01,  1.0324e+00,  7.0235e-01,  4.1679e-01, -2.0262e-01,
         2.8378e-01,  9.4354e-01, -3.4941e-01,  1.4981e+00,  3.2488e-01,
        -6.6963e-01, -4.2229e-01, -1.3033e+00, -5.9413e-01, -7.0466e-02,
         7.0193e-01,  3.0914e-01, -4.0663e-01, -6.4972e-01,  7.7279e-01,
         3.3412e-01, -8.0891e-01,  2.4861e-02, -5.1049e-01, -8.1757e-01,
        -1.0251e-01,  5.6904e-01,  6.7411e-01, -1.9030e-01,  4.0583e-01,
        -6.1531e-01,  2.2660e-01, -1.1493e-03, -6.5260e-01,  1.8631e-01],
       dtype=float32),
 array([ 0.61854 ,  0.11915 , -0.46786 ,  0.31368 ,  1.0334  ,  0.95964 ,
         0.87803 , -1.0346  ,  1.6322  ,  0.29347 ,  0.80844 , -0.058903,
         0.021251,  0.40986 ,  0.54443 , -0.33311 ,  0.53712 , -0.35823 ,
         0.29374 ,  0.09

In [22]:
# Calculate cosine similarity between searched phrase and job title
glove_similarity =[]
for i in range(0, len(pt)):
    sim_score = cosine_similarity(np.mean(glove_vectors[i], axis = 0), np.mean(glove_search_phrase_vector, axis =0))
    glove_similarity.append(sim_score)

# Add GloVe similarity score to the pt dataframe
pt['GloVe_sim_score'] = glove_similarity
pt.sort_values(by ='GloVe_sim_score', ascending = False)

Unnamed: 0,id,job_title,location,connection,cleaned_job_title,tfidf_sim_score,GloVe_sim_score
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resource specialist,0.674875,0.969614
45,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resource professional,0.766333,0.958512
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resource professional,0.766333,0.958512
21,73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resource manager seeking internship human resource,0.609698,0.934672
22,74,Human Resources Professional,Greater Boston Area,16,human resource professional,0.397849,0.933199
8,10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,seeking human resource human resource generalist position,0.391072,0.922696
24,76,Aspiring Human Resources Professional | Passionate about helping to create an inclusive and engaging work environment,"New York, New York",212,aspiring human resource professional passionate helping create inclusive engaging work environment,0.25628,0.906919
30,82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,aspiring human resource professional energetic teamfocused leader,0.373958,0.895589
13,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resource opportunity,0.267743,0.894528
49,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,human resource generalist loparex,0.228028,0.891505


## b. Word2vec

Word2Vec is a popular word embedding technique that uses a shallow two-layer neural network. A well-trained set of word vectors places similar words close to each other in the vector space.

There are two main training algorithms within Word2Vec: Continuous Bag of Words (CBOW) and Skip-Gram. The major difference is that CBOW uses the surrounding context to predict a target word, while skip-gram uses a word to predict its surrounding context. Skip-gram generally tends to perform better than CBOW, particularly for infrequent words, while CBOW is faster to train.
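
The difference between the two training schemes is easiest to see in the training examples each one builds from a sentence. The sketch below is illustrative only (the notebook uses a pre-trained model and does no training); `training_pairs` and the `window` parameter are hypothetical names.

```python
def training_pairs(tokens, window=1):
    """Build (input, target) examples for CBOW and skip-gram from one sentence."""
    cbow, skip_gram = [], []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        # CBOW: the surrounding context words jointly predict the center word
        cbow.append((context, center))
        # Skip-gram: the center word predicts each context word separately
        skip_gram.extend((center, c) for c in context)
    return cbow, skip_gram

cbow_pairs, skip_gram_pairs = training_pairs(["aspiring", "human", "resource"], window=1)
```

CBOW produces one example per position, while skip-gram produces one example per (word, neighbour) pair.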

In [23]:
# Use a pre-trained word2vec model
w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary = True)

In [24]:
# Vectorize job title using Word2Vec model
w2v_vectors = []
for i in range(0, len(pt)):
    w2v_sentence_vector = []
    doc_token_vectors(pt['cleaned_job_title'][i], w2v_model, w2v_sentence_vector, 300)
    w2v_vectors.append(w2v_sentence_vector)
    
w2v_vectors[0:2]

[[array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

In [25]:
# Vectorize search phrase using Word2Vec model
w2v_search_phrase = []
doc_token_vectors(lemmatized_phrase, w2v_model, w2v_search_phrase, 300)

[array([-1.40625000e-01,  6.83593750e-02,  1.09252930e-02, -1.72851562e-01,
         1.35742188e-01,  2.55859375e-01,  3.78417969e-02,  4.22363281e-02,
        -4.27734375e-01,  1.05468750e-01,  4.24804688e-02,  6.78710938e-02,
         1.92382812e-01,  1.25732422e-02, -1.43554688e-01,  2.73437500e-01,
         1.30859375e-01,  2.48046875e-01,  4.73632812e-02, -2.97851562e-02,
         1.32812500e-01,  2.77343750e-01, -3.18359375e-01, -1.91406250e-01,
        -4.00390625e-02, -1.38549805e-02, -2.62451172e-03,  9.03320312e-02,
         9.13085938e-02,  2.08740234e-02,  1.93359375e-01, -1.52343750e-01,
        -3.83300781e-02, -7.12890625e-02,  3.78417969e-02, -8.25195312e-02,
         1.29882812e-01, -4.85839844e-02,  1.61132812e-01, -9.96093750e-02,
         2.49023438e-02, -1.13769531e-01, -9.33837891e-03, -1.46484375e-01,
        -1.22070312e-01,  6.88476562e-02, -1.98242188e-01,  2.26562500e-01,
         8.83789062e-02,  2.78320312e-02, -1.68945312e-01,  1.44531250e-01,
         1.6

In [26]:
# Cosine Similarity score for Word2Vec model
w2v_similarity_score = []
for i in range(0, len(pt)):
    sim_score = cosine_similarity(np.mean(w2v_search_phrase, axis = 0), 
                                  np.mean(w2v_vectors[i], axis = 0))
    w2v_similarity_score.append(sim_score)

# Add Word2Vec similarity score
pt['Word2Vec_sim_score'] = w2v_similarity_score
pt.sort_values(by = 'Word2Vec_sim_score', ascending = False)

Unnamed: 0,id,job_title,location,connection,cleaned_job_title,tfidf_sim_score,GloVe_sim_score,Word2Vec_sim_score
45,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resource professional,0.766333,0.958512,0.950395
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resource professional,0.766333,0.958512,0.950395
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resource specialist,0.674875,0.969614,0.912262
21,73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resource manager seeking internship human resource,0.609698,0.934672,0.875945
22,74,Human Resources Professional,Greater Boston Area,16,human resource professional,0.397849,0.933199,0.874494
6,7,Student at Humber College and Aspiring Human Resources Generalist,Kanada,61,student humber college aspiring human resource generalist,0.390565,0.835481,0.829068
30,82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,aspiring human resource professional energetic teamfocused leader,0.373958,0.895589,0.827266
8,10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,seeking human resource human resource generalist position,0.391072,0.922696,0.815463
48,100,Aspiring Human Resources Manager | Graduating May 2020 | Seeking an Entry-Level Human Resources Position in St. Louis,"Cape Girardeau, Missouri",103,aspiring human resource manager graduating may 2020 seeking entrylevel human resource position st louis,0.348567,0.79941,0.79129
12,27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",500+,aspiring human resource management student seeking internship,0.428269,0.881137,0.783973


# BERT (Bidirectional Encoder Representations from Transformers)

BERT is a language model introduced by Google in 2018. It is used to solve a variety of NLP problems, including sentence similarity. We'll use a BERT-based sentence encoder to get similarity scores for our job titles.

In [27]:
# Load a pre-trained BERT-base sentence encoder
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')
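
The `mean-tokens` part of the model name refers to mean pooling: the sentence vector is the average of the token vectors, with padding positions ignored. A minimal NumPy sketch of that pooling step follows; `masked_mean_pool` and the toy values are illustrative and are not the library's internal code.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (embeddings * mask).sum(axis=0) / max(mask.sum(), 1.0)

# Toy example: 3 token positions (the last is padding), 2 dimensions
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = masked_mean_pool(tokens, mask)
```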

In [28]:
# Convert job titles into BERT embedded vectors
bert_job_title_embeddings = bert_model.encode(job_title_list)
bert_job_title_embeddings.shape

(53, 768)

In [29]:
# Convert search phrase into a BERT embedded vector
bert_search_phrase_embedding = bert_model.encode(lemmatized_phrase)
bert_search_phrase_embedding.shape

(768,)

In [31]:
# Calculate cosine similarity between job title and search phrase vectors
bert_cosine_similarity = []
for i in range(0, len(pt)):
    cos_sim = cosine_similarity(bert_job_title_embeddings[i], bert_search_phrase_embedding)
    bert_cosine_similarity.append(cos_sim)
    
# Add BERT_cosine_similarity column in the pt dataframe
pt['BERT_sim_score'] = bert_cosine_similarity
pt.sort_values(by = 'BERT_sim_score', ascending = False)

Unnamed: 0,id,job_title,location,connection,cleaned_job_title,tfidf_sim_score,GloVe_sim_score,Word2Vec_sim_score,BERT_sim_score
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resource specialist,0.674875,0.969614,0.912262,0.955138
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resource professional,0.766333,0.958512,0.950395,0.948828
45,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resource professional,0.766333,0.958512,0.950395,0.948828
30,82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,aspiring human resource professional energetic teamfocused leader,0.373958,0.895589,0.827266,0.86791
47,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,seeking human resource position,0.276993,0.869983,0.728513,0.849294
13,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resource opportunity,0.267743,0.894528,0.756767,0.841447
22,74,Human Resources Professional,Greater Boston Area,16,human resource professional,0.397849,0.933199,0.874494,0.809261
8,10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,seeking human resource human resource generalist position,0.391072,0.922696,0.815463,0.775801
24,76,Aspiring Human Resources Professional | Passionate about helping to create an inclusive and engaging work environment,"New York, New York",212,aspiring human resource professional passionate helping create inclusive engaging work environment,0.25628,0.906919,0.7109,0.75489
21,73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resource manager seeking internship human resource,0.609698,0.934672,0.875945,0.752413


In [32]:
# Calculate mean of Tfidf, Glove, Word2Vec and BERT similarity scores and add as a column in pt dataframe
pt['mean_sim_score'] = pt.loc[:, 'tfidf_sim_score': 'BERT_sim_score'].mean(axis=1)
pt.sort_values(by = 'mean_sim_score', ascending = False)

Unnamed: 0,id,job_title,location,connection,cleaned_job_title,tfidf_sim_score,GloVe_sim_score,Word2Vec_sim_score,BERT_sim_score,mean_sim_score
45,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,aspiring human resource professional,0.766333,0.958512,0.950395,0.948828,0.906017
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resource professional,0.766333,0.958512,0.950395,0.948828,0.906017
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resource specialist,0.674875,0.969614,0.912262,0.955138,0.877972
21,73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,aspiring human resource manager seeking internship human resource,0.609698,0.934672,0.875945,0.752413,0.793182
22,74,Human Resources Professional,Greater Boston Area,16,human resource professional,0.397849,0.933199,0.874494,0.809261,0.753701
30,82,Aspiring Human Resources Professional | An energetic and Team-Focused Leader,"Austin, Texas Area",174,aspiring human resource professional energetic teamfocused leader,0.373958,0.895589,0.827266,0.86791,0.741181
8,10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,seeking human resource human resource generalist position,0.391072,0.922696,0.815463,0.775801,0.726258
12,27,Aspiring Human Resources Management student seeking an internship,"Houston, Texas Area",500+,aspiring human resource management student seeking internship,0.428269,0.881137,0.783973,0.702358,0.698934
6,7,Student at Humber College and Aspiring Human Resources Generalist,Kanada,61,student humber college aspiring human resource generalist,0.390565,0.835481,0.829068,0.712709,0.691956
13,28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,seeking human resource opportunity,0.267743,0.894528,0.756767,0.841447,0.690121
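
The four scores sit on different ranges (in the rows shown, the TF-IDF similarities are much lower than the embedding-based ones), so a plain mean weights the methods unevenly. One possible alternative, sketched below as an assumption rather than something done in this notebook, is to min-max scale each score column before averaging. The function name `normalized_mean` and the toy scores are illustrative.

```python
import numpy as np

def normalized_mean(score_matrix):
    """Min-max scale each column to [0, 1], then average across columns.

    `score_matrix` has one row per candidate and one column per method.
    """
    scores = np.asarray(score_matrix, dtype=float)
    col_min = scores.min(axis=0)
    col_range = scores.max(axis=0) - col_min
    # Avoid division by zero when a method gives every candidate the same score
    col_range[col_range == 0] = 1.0
    return ((scores - col_min) / col_range).mean(axis=1)

# Toy scores for three candidates from two methods with different ranges
toy_scores = [
    [0.80, 0.97],
    [0.40, 0.93],
    [0.20, 0.89],
]
combined = normalized_mean(toy_scores)
```

After scaling, each method contributes on the same 0 to 1 range, so a method with compressed scores is not drowned out by one with a wider spread.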
