### Background:

As a talent sourcing and management company, we are interested in finding talented individuals and sourcing these candidates to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is another challenge.

The nature of our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that could save us time and help us spot potential candidates who fit the roles we are trying to fill. Beyond that, for a specific role, we are interested in developing a machine-learning-powered pipeline that could spot talented individuals and rank them based on their fitness.

We currently source a few candidates semi-automatically, so the sourcing part is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally run these searches with keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we are able to list and rank fitting candidates, we then employ a review procedure, as each candidate needs to be reviewed manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role. We then expect the list to be re-ranked each time a candidate is starred.
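
One way to read the starring step is as relevance feedback: the starred candidate's vector is blended into the query vector, and every candidate is scored again against the updated query. The sketch below is a minimal, dependency-free illustration of that idea (a Rocchio-style update) on toy vectors. The names `cosine_sim`, `blend`, `rerank` and the weight `alpha` are assumptions made for illustration; they are not part of the brief.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def blend(query, starred, alpha=0.5):
    """Move the query vector toward the starred candidate (alpha is an assumed weight)."""
    return [(1 - alpha) * q + alpha * s for q, s in zip(query, starred)]

def rerank(query, candidates):
    """Return candidate ids sorted by similarity to the query, best first."""
    scores = {cid: cosine_sim(query, vec) for cid, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 3-dimensional vectors standing in for real job-title embeddings
candidates = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.6, 0.8, 0.0],
    "c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.1, 0.0]

before = rerank(query, candidates)       # initial ranking
query = blend(query, candidates["c"])    # the reviewer stars candidate "c"
after = rerank(query, candidates)        # ranking after the star
print(before, after)
```

With several stars, the same update can be applied once per starred candidate, so each star pulls the query a little further toward the profiles the reviewer prefers.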

### Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

### Attributes:

id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connection : number of connections the candidate has, 500+ means over 500 (text)
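
Because the connection count is stored as text, it cannot be used as a numeric feature until values like `500+` are handled. A minimal sketch, assuming that `500+` should be kept as 500 together with a flag marking it as a lower bound; the helper `parse_connections` is hypothetical.

```python
def parse_connections(raw):
    """Convert a connection string such as '85' or '500+' to (count, is_capped)."""
    text = str(raw).strip()
    capped = text.endswith("+")              # '500+' means "over 500"
    count = int(text.rstrip("+").strip())
    return count, capped

examples = ["85", "500+", "44", " 500+ "]
parsed = [parse_connections(value) for value in examples]
print(parsed)
```

Keeping the flag separate avoids treating a capped value as an exact count.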

### Output (desired target):

fit - how fit the candidate is for the role? (numeric, probability between 0-1)

## Keywords: “Aspiring human resources” or “seeking human resources”


### Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

### Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

### Bonus(es):

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?


In [1]:
# import modules
import random
import os
import warnings
# ignore warnings
warnings.filterwarnings('ignore')

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
nlp = spacy.load('en_core_web_sm')
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.stem import WordNetLemmatizer
# scipy
from scipy.spatial.distance import cosine
# gensim
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors, Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec

from tqdm import tqdm

# BERT
from sentence_transformers import SentenceTransformer


%matplotlib inline 



In [2]:
# load the dataset
df = pd.read_csv('potential-talents.csv')

In [3]:
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [5]:
df.drop(['fit','id'], axis = 1, inplace = True)             


In [6]:
df = df.drop_duplicates(ignore_index=True) 

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   job_title   53 non-null     object
 1   location    53 non-null     object
 2   connection  53 non-null     object
dtypes: object(3)
memory usage: 1.4+ KB


In [8]:
df.head()

Unnamed: 0,job_title,location,connection
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85
1,Native English Teacher at EPIK (English Progra...,Kanada,500+
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
3,People Development Coordinator at Ryan,"Denton, Texas",500+
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+


## Create a function for preprocessing with "spaCy" NLP library

In [9]:
def is_token_allowed(token):
    """Allow only valid tokens that are neither stop words nor punctuation."""
    if (not token or not token.text.strip() or
            token.is_stop or token.is_punct):
        return False
    return True

def preprocess_token(token):
    # Reduce token to its lowercase lemma form
    return token.lemma_.strip().lower()

In [10]:
# Create a new column as a copy of job_title, to be updated with preprocessing
df['nlp_job_title'] = df['job_title']

In [11]:
for i in range(len(df)):
    df_doc = nlp(df.loc[i, 'job_title'])
    preproc_tokens = [preprocess_token(token) for token in df_doc if is_token_allowed(token)]
    # Use .loc to avoid chained assignment, which may silently write to a copy
    df.loc[i, 'nlp_job_title'] = " ".join(preproc_tokens)


In [12]:
complete_filtered_tokens = [preprocess_token(token)
for token in df_doc if is_token_allowed(token)]
complete_filtered_tokens

['director', 'administration', 'excellence', 'logging']

In [13]:
df.head(10)

Unnamed: 0,job_title,location,connection,nlp_job_title
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,2019 c.t. bauer college business graduate magn...
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,native english teacher epik english program korea
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional
3,People Development Coordinator at Ryan,"Denton, Texas",500+,people development coordinator ryan
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,advisory board member celal bayar university
5,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist
6,Student at Humber College and Aspiring Human R...,Kanada,61,student humber college aspiring human resource...
7,HR Senior Specialist,San Francisco Bay Area,500+,hr senior specialist
8,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,seek human resources hris generalist positions
9,Student at Chapman University,"Lake Forest, California",2,student chapman university


## Create a similar procedure for preprocessing with NLTK

In [14]:
#Create another job title column for comparing preprocessing using the "nltk" NLP library
df['cleaned_job_title'] = df['job_title']

In [15]:
# Remove unnecessary characters such as pipes, commas and dots to clean the data
# (regex=True is required on recent pandas versions, where it no longer defaults to True)
df['cleaned_job_title'] = df['cleaned_job_title'].str.replace(r"[^a-zA-Z0-9\' ']", "", regex=True)
df['cleaned_job_title'] = df['cleaned_job_title'].str.lower()
df = df.reset_index(drop = True)

# Remove stop words and lemmatize words in cleaned job title
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

for i in range(len(df)):
    word_tokens = word_tokenize(df.loc[i, 'cleaned_job_title'])
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    lemmatized_sentence = [lemmatizer.lemmatize(word) for word in filtered_sentence]
    # Use .loc to avoid chained assignment, which may silently write to a copy
    df.loc[i, 'cleaned_job_title'] = TreebankWordDetokenizer().detokenize(lemmatized_sentence)

df.head(10)

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title
0,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,2019 c.t. bauer college business graduate magn...,2019 ct bauer college business graduate magna ...
1,Native English Teacher at EPIK (English Progra...,Kanada,500+,native english teacher epik english program korea,native english teacher epik english program korea
2,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,aspiring human resources professional,aspiring human resource professional
3,People Development Coordinator at Ryan,"Denton, Texas",500+,people development coordinator ryan,people development coordinator ryan
4,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,advisory board member celal bayar university,advisory board member celal bayar university
5,Aspiring Human Resources Specialist,Greater New York City Area,1,aspiring human resources specialist,aspiring human resource specialist
6,Student at Humber College and Aspiring Human R...,Kanada,61,student humber college aspiring human resource...,student humber college aspiring human resource...
7,HR Senior Specialist,San Francisco Bay Area,500+,hr senior specialist,hr senior specialist
8,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500+,seek human resources hris generalist positions,seeking human resource hris generalist position
9,Student at Chapman University,"Lake Forest, California",2,student chapman university,student chapman university


## The two preprocessing methods produce similar but not identical results. spaCy preserves the periods (c.t. bauer vs ct bauer, inc. vs inc) and lemmatizes some words differently (resources vs resource, seek vs seeking). When converting to a list for vectorization, the number of features is 183 vs 180.

## Let's check the version that keeps more features and try to identify whether the periods are causing the 3 extra features

In [16]:
# Convert job_title column into a list
nlp_title_list = list(df['nlp_job_title'])                    
cleaned_title_list = list(df['cleaned_job_title'])                    

# Vectorize job_title_list
vectorizer = TfidfVectorizer()                                 
# Specify which list to vectorize
vectorizer.fit(nlp_title_list)                    
X = vectorizer.transform(nlp_title_list)    

#vectorizer.fit(cleaned_title_list)                    
#XX = vectorizer.transform(cleaned_title_list)    


In [17]:
nlp_title_list

['2019 c.t. bauer college business graduate magna cum laude aspire human resources professional',
 'native english teacher epik english program korea',
 'aspiring human resources professional',
 'people development coordinator ryan',
 'advisory board member celal bayar university',
 'aspiring human resources specialist',
 'student humber college aspiring human resources generalist',
 'hr senior specialist',
 'seek human resources hris generalist positions',
 'student chapman university',
 'svp chro marketing communications csr officer | engie | houston | woodlands | energy | gphr | sphr',
 'human resources coordinator intercontinental buckhead atlanta',
 'aspiring human resources management student seek internship',
 'seek human resources opportunities',
 'experienced retail manager aspire human resources professional',
 'human resources staffing recruiting professional',
 'human resources specialist luxottica',
 'director human resources north america groupe beneteau',
 'retired army 

In [18]:
# Get feature names in all the documents
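# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out().tolist()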
print("Number of unique features: ", len(feature_names))        

Number of unique features:  183


In [19]:
# Convert job titles into arrays
tfidf_vector = X.toarray()                                  
print("Shape of Tfidf vector: ", tfidf_vector.shape)        

Shape of Tfidf vector:  (53, 183)


## Cosine Similarity

In [20]:
feature_names

['2019',
 '2020',
 '2621',
 '408',
 '709',
 'administration',
 'administrative',
 'admissions',
 'advisory',
 'america',
 'analyst',
 'analytics',
 'army',
 'arts',
 'aspire',
 'aspiring',
 'assistant',
 'atlanta',
 'bachelor',
 'bauer',
 'bayar',
 'beach',
 'benefits',
 'beneteau',
 'biology',
 'board',
 'brand',
 'buckhead',
 'business',
 'care',
 'celal',
 'center',
 'chapman',
 'chro',
 'college',
 'communications',
 'community',
 'compensation',
 'conflict',
 'coordinator',
 'create',
 'csr',
 'cum',
 'customer',
 'datum',
 'delphi',
 'development',
 'director',
 'employment',
 'endemol',
 'energetic',
 'energy',
 'engaging',
 'engie',
 'engineer',
 'english',
 'entry',
 'environment',
 'environmental',
 'epik',
 'excellence',
 'executive',
 'experienced',
 'ey',
 'focus',
 'generalist',
 'gis',
 'gphr',
 'graduate',
 'groupe',
 'guard',
 'hardware',
 'heil',
 'help',
 'houston',
 'hr',
 'hris',
 'human',
 'humber',
 'illinois',
 'inc',
 'inclusive',
 'indiana',
 'information',
 '

In [21]:
tfidf_vector[0]

array([0.34239355, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.28714724,
       0.        , 0.        , 0.        , 0.        , 0.34239355,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.25483022, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.31007653,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.34239355, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.31007653, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

## Use alternative suggested search phrases to compare against the vector

In [22]:
# Searched phrase
phrase = 'full stack software engineer'

In [23]:
# Define cosine similarity function (cosine gives cosine distance)
def cosine_similarity(vector_1, vector_2):
    cos_sim = 1 - cosine(vector_1, vector_2)        
    return cos_sim
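
SciPy's `cosine` returns the cosine distance, so the similarity is one minus that value. The dependency-free sketch below spells out the same quantity from its definition, the dot product divided by the product of the two vector norms, on toy vectors. `cosine_similarity_manual` is an illustrative name and is not used elsewhere in this notebook.

```python
import math

def cosine_similarity_manual(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity_manual([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 3.0])      # no shared terms
partial = cosine_similarity_manual([1.0, 1.0], [1.0, 0.0])         # one shared term

print(round(same_direction, 6), round(orthogonal, 6), round(partial, 6))
```

The orthogonal case explains the many 0.0 scores in a TF-IDF ranking: a job title that shares no vocabulary term with the search phrase has a zero dot product with it.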

In [24]:
phrase

'full stack software engineer'

In [25]:
# Lemmatize searched phrase
lemmatized_phrase = []
tokenized_phrase = word_tokenize(phrase.lower())
for word in tokenized_phrase:
    lemmatized_phrase.append(lemmatizer.lemmatize(word))
lemmatized_phrase = TreebankWordDetokenizer().detokenize(lemmatized_phrase)

# Convert search phrase into a vector
X1 = vectorizer.transform([lemmatized_phrase])                    
X1_vector = X1.toarray()
print("Shape of search phrase vector:", X1_vector.shape)

Shape of search phrase vector: (1, 183)


In [26]:
# Calculate TF-IDF cosine similarity and add it to the dataframe
sim_score_list = []
for x in range(len(df)):
    # X1_vector has shape (1, 183); pass its single row so both inputs are 1-D
    sim_score_list.append(cosine_similarity(tfidf_vector[x], X1_vector[0]))

df['tfidf_sim_score'] = sim_score_list
df.sort_values(by ='tfidf_sim_score', ascending = False).head(20)

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score
32,Human Resources professional for the world lea...,"Highland, California",50,human resources professional world leader gis ...,human resource professional world leader gi so...,0.335103
28,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer| information systems,junior me engineer information system,0.328256
27,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,liberal arts major aspire human resources analyst,liberal art major aspiring human resource analyst,0.0
29,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,senior human resources business partner heil e...,senior human resource business partner heil en...,0.0
30,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,aspiring human resources professional | energe...,aspiring human resource professional energetic...,0.0
31,HR Manager at Endemol Shine North America,"Los Angeles, California",268,hr manager endemol shine north america,hr manager endemol shine north america,0.0
33,RRP Brand Portfolio Executive at JTI (Japan To...,Greater Philadelphia Area,500+,rrp brand portfolio executive jti japan tobacc...,rrp brand portfolio executive jti japan tobacc...,0.0
34,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,information systems specialist programmer love...,information system specialist programmer love ...,0.0
35,Bachelor of Science in Biology from Victoria U...,"Baltimore, Maryland",40,bachelor science biology victoria university w...,bachelor science biology victoria university w...,0.0
36,Human Resources Management Major,"Milpitas, California",18,human resources management major,human resource management major,0.0


In [27]:
# Convert job_title column into a list
#nlp_title_list = list(df['nlp_job_title'])                    
#cleaned_title_list = list(df['cleaned_job_title'])                    

# Vectorize job_title_list
vectorizer = TfidfVectorizer()                                 
# Specify which list to vectorize
#vectorizer.fit(nlp_title_list)                    
#X = vectorizer.transform(nlp_title_list)    

vectorizer.fit(cleaned_title_list)                    
XX = vectorizer.transform(cleaned_title_list)    


In [28]:
cleaned_title_list

['2019 ct bauer college business graduate magna cum laude aspiring human resource professional',
 'native english teacher epik english program korea',
 'aspiring human resource professional',
 'people development coordinator ryan',
 'advisory board member celal bayar university',
 'aspiring human resource specialist',
 'student humber college aspiring human resource generalist',
 'hr senior specialist',
 'seeking human resource hris generalist position',
 'student chapman university',
 'svp chro marketing communication csr officer engie houston woodland energy gphr sphr',
 'human resource coordinator intercontinental buckhead atlanta',
 'aspiring human resource management student seeking internship',
 'seeking human resource opportunity',
 'experienced retail manager aspiring human resource professional',
 'human resource staffing recruiting professional',
 'human resource specialist luxottica',
 'director human resource north america groupe beneteau',
 'retired army national guard rec

In [29]:
# Get feature names in all the documents
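# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out().tolist()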
print("Number of unique features: ", len(feature_names))        

Number of unique features:  180


In [30]:
# Convert job titles into arrays
tfidf_vector2 = XX.toarray()                                  
print("Shape of Tfidf vector: ", tfidf_vector2.shape)        

Shape of Tfidf vector:  (53, 180)
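
To see where the three extra features come from, the two vocabularies can be compared directly. The sketch below is an illustration on three titles copied from the lists above rather than on the full data. It mimics the default token pattern of `TfidfVectorizer` (`\b\w\w+\b`, runs of two or more word characters) with the standard `re` module; the helper `vocabulary` is a hypothetical name.

```python
import re

# Default token pattern of scikit-learn's text vectorizers: 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def vocabulary(titles):
    """Collect the set of tokens the vectorizer would treat as features."""
    vocab = set()
    for title in titles:
        vocab.update(TOKEN_PATTERN.findall(title.lower()))
    return vocab

# A few preprocessed titles from each pipeline
spacy_titles = [
    "2019 c.t. bauer college business graduate magna cum laude aspire human resources professional",
    "aspiring human resources professional",
    "seek human resources opportunities",
]
nltk_titles = [
    "2019 ct bauer college business graduate magna cum laude aspiring human resource professional",
    "aspiring human resource professional",
    "seeking human resource opportunity",
]

only_spacy = sorted(vocabulary(spacy_titles) - vocabulary(nltk_titles))
only_nltk = sorted(vocabulary(nltk_titles) - vocabulary(spacy_titles))
print("Only in spaCy vocabulary:", only_spacy)
print("Only in NLTK vocabulary:", only_nltk)
```

In this sample the periods add no features: `c.t.` yields no token at all, because `c` and `t` are single characters. The differences come from lemmatization instead. spaCy keeps both `aspire` and `aspiring`, which matches the feature list printed earlier.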


In [31]:
# Reuse the search phrase and the cosine_similarity function defined above
# Lemmatize searched phrase
lemmatized_phrase = []
tokenized_phrase = word_tokenize(phrase.lower())
for word in tokenized_phrase:
    lemmatized_phrase.append(lemmatizer.lemmatize(word))
lemmatized_phrase = TreebankWordDetokenizer().detokenize(lemmatized_phrase)

# Convert search phrase into a vector
X2 = vectorizer.transform([lemmatized_phrase])                    
X2_vector = X2.toarray()
print("Shape of search phrase vector:", X2_vector.shape)

Shape of search phrase vector: (1, 180)


In [32]:
# Calculate TF-IDF cosine similarity and add it to the dataframe
sim_score_list = []
for x in range(len(df)):
    # X2_vector has shape (1, 180); pass its single row so both inputs are 1-D
    sim_score_list.append(cosine_similarity(tfidf_vector2[x], X2_vector[0]))

df['tfidf_sim_score2'] = sim_score_list
df.sort_values(by ='tfidf_sim_score2', ascending = False).head(20)

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score,tfidf_sim_score2
32,Human Resources professional for the world lea...,"Highland, California",50,human resources professional world leader gis ...,human resource professional world leader gi so...,0.335103,0.335942
28,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer| information systems,junior me engineer information system,0.328256,0.328256
27,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,liberal arts major aspire human resources analyst,liberal art major aspiring human resource analyst,0.0,0.0
29,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,senior human resources business partner heil e...,senior human resource business partner heil en...,0.0,0.0
30,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,aspiring human resources professional | energe...,aspiring human resource professional energetic...,0.0,0.0
31,HR Manager at Endemol Shine North America,"Los Angeles, California",268,hr manager endemol shine north america,hr manager endemol shine north america,0.0,0.0
33,RRP Brand Portfolio Executive at JTI (Japan To...,Greater Philadelphia Area,500+,rrp brand portfolio executive jti japan tobacc...,rrp brand portfolio executive jti japan tobacc...,0.0,0.0
34,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,information systems specialist programmer love...,information system specialist programmer love ...,0.0,0.0
35,Bachelor of Science in Biology from Victoria U...,"Baltimore, Maryland",40,bachelor science biology victoria university w...,bachelor science biology victoria university w...,0.0,0.0
36,Human Resources Management Major,"Milpitas, California",18,human resources management major,human resource management major,0.0,0.0


## Comparing the scores also reveals preprocessing differences in lemmatization, for example "[13] opportunities vs opportunity" and "[27] aspire vs aspiring". These differences change some of the scores, although in the rows shown above the change is small (0.335103 vs 0.335942 for row 32).

# Pre-trained word embedding - various sources

## GloVe (Global Vectors) - Select a publicly available corpus to vectorize and compare with our search phrase.

In [33]:
%%time
# Create temp file and save converted embedding into it
#target_file = get_tmpfile('word2vec.6B.300d.txt')
#target_file = get_tmpfile('word2vec.42B.300d.txt')
#glove2word2vec('glove.6B.300d.txt', target_file) 
#glove2word2vec('glove.42B.300d.txt', target_file) 


# Load the converted embedding into memory
#glove_model = KeyedVectors.load_word2vec_format(target_file)
glove_model = KeyedVectors.load_word2vec_format('word2vec.6B.300d.bin.gz', binary=True)

# Save as binary data
#glove_model.save_word2vec_format('word2vec.6B.300d.bin.gz', binary=True)
#glove_model.save_word2vec_format('word2vec.42B.300d.bin.gz', binary=True)

CPU times: user 13.8 s, sys: 1.41 s, total: 15.2 s
Wall time: 17.3 s


In [34]:
# Define a function to vectorize a sentence's words
def doc_token_vectors(sentence, model, sentence_vector_list, vector_dimensions):
    """Append one vector per token to sentence_vector_list and return it.

    Tokens missing from the model's vocabulary get a zero vector.
    """
    for token in word_tokenize(sentence):
        if token in model.key_to_index:
            token_vector = model[token]
        else:
            token_vector = np.zeros(vector_dimensions)
        sentence_vector_list.append(token_vector)

    return sentence_vector_list
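
Each title is later reduced to a single vector by averaging its token vectors. Because out-of-vocabulary tokens are represented as zero vectors here, they still count in the average and shrink its length. The direction, and therefore the cosine similarity, stays the same unless every token is unknown. The dependency-free sketch below illustrates this with a toy embedding table; `toy_model`, `mean_vector` and the `skip_unknown` flag are hypothetical names.

```python
toy_model = {                      # stand-in for a KeyedVectors lookup
    "software": [1.0, 3.0],
    "engineer": [3.0, 1.0],
}

def mean_vector(tokens, model, dims, skip_unknown=False):
    """Average token vectors; unknown tokens are zeros unless skipped."""
    vectors = []
    for token in tokens:
        if token in model:
            vectors.append(model[token])
        elif not skip_unknown:
            vectors.append([0.0] * dims)
    if not vectors:                # every token was unknown and skipped
        return [0.0] * dims
    return [sum(column) / len(vectors) for column in zip(*vectors)]

tokens = ["software", "engineer", "mes", "xyz"]   # last two are out of vocabulary
with_zeros = mean_vector(tokens, toy_model, dims=2)
known_only = mean_vector(tokens, toy_model, dims=2, skip_unknown=True)
print(with_zeros, known_only)
```

Both results point in the same direction, so the ranking by cosine similarity is unaffected. A title made up entirely of unknown tokens is the case to watch, since its mean is the zero vector and its cosine similarity is undefined.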

In [35]:
glove_model['king']

array([ 0.0033901, -0.34614  ,  0.28144  ,  0.48382  ,  0.59469  ,
        0.012965 ,  0.53982  ,  0.48233  ,  0.21463  , -1.0249   ,
       -0.34788  , -0.79001  , -0.15084  ,  0.61374  ,  0.042811 ,
        0.19323  ,  0.25462  ,  0.32528  ,  0.05698  ,  0.063253 ,
       -0.49439  ,  0.47337  , -0.16761  ,  0.045594 ,  0.30451  ,
       -0.35416  , -0.34583  , -0.20118  ,  0.25511  ,  0.091111 ,
        0.014651 , -0.017541 , -0.23854  ,  0.48215  , -0.9145   ,
       -0.36235  ,  0.34736  ,  0.028639 , -0.027065 , -0.036481 ,
       -0.067391 , -0.23452  , -0.13772  ,  0.33951  ,  0.13415  ,
       -0.1342   ,  0.47856  , -0.1842   ,  0.10705  , -0.45834  ,
       -0.36085  , -0.22595  ,  0.32881  , -0.13643  ,  0.23128  ,
        0.34269  ,  0.42344  ,  0.47057  ,  0.479    ,  0.074639 ,
        0.3344   ,  0.10714  , -0.13289  ,  0.58734  ,  0.38616  ,
       -0.52238  , -0.22028  , -0.072322 ,  0.32269  ,  0.44226  ,
       -0.037382 ,  0.18324  ,  0.058082 ,  0.26938  ,  0.3620

In [36]:
word_tokens

['director', 'of', 'administration', 'at', 'excellence', 'logging']

In [37]:
# Vectorize job title using GloVe model
glove_vectors = []
for i in range(0, len(df)):
    glove_sentence_vector = []
    doc_token_vectors(df['cleaned_job_title'][i], glove_model, glove_sentence_vector, 300)
    glove_vectors.append(glove_sentence_vector)

In [38]:
lemmatized_phrase

'full stack software engineer'

In [39]:
# Vectorize searched phrase using GloVe model
glove_search_phrase_vector = []
doc_token_vectors(lemmatized_phrase, glove_model, glove_search_phrase_vector, 300)

[array([-3.1241e-01, -1.3576e-02, -4.0825e-02, -1.6829e-01, -1.1099e-01,
         4.2393e-01,  9.3001e-02, -3.0863e-01,  1.8309e-01, -1.5829e+00,
        -9.7536e-02,  3.8782e-01, -1.0337e-01,  1.4471e-02, -5.3984e-01,
         2.4077e-01,  1.8719e-01,  1.4655e-01,  8.1469e-02,  1.3690e-01,
         2.0060e-01, -3.1688e-02,  6.8364e-04,  1.1491e-01,  2.7969e-02,
         2.7588e-01, -1.4344e-01, -2.7636e-01,  3.3841e-01,  1.0303e-01,
        -6.8495e-02, -1.0502e-01, -4.9064e-02,  5.4676e-02, -8.3859e-01,
         2.8603e-01,  4.4684e-02, -4.4537e-02, -3.0361e-01,  1.2041e-01,
        -1.9970e-01, -1.3480e-01, -1.4405e-01, -1.5962e-01,  4.3908e-01,
         4.4993e-02,  2.2916e-01,  2.6439e-01,  2.5709e-02, -4.7517e-01,
        -9.0170e-02,  2.1946e-01,  4.3330e-02,  1.0401e-01, -2.6996e-01,
         3.8791e-01,  2.3419e-01,  1.0629e-01,  2.3689e-01,  2.4975e-02,
        -5.0640e-03, -2.5399e-01,  1.1019e-01, -3.1195e-01, -4.8532e-01,
        -3.8653e-01,  4.5221e-01, -3.3440e-01,  2.5

In [40]:
# Calculate cosine similarity between searched phrase and job title
glove_similarity =[]
for i in range(0, len(df)):
    sim_score = cosine_similarity(np.mean(glove_vectors[i], axis = 0), np.mean(glove_search_phrase_vector, axis =0))
    glove_similarity.append(sim_score)

# Add GloVe similarity score to the dataframe
df['GloVe_sim_score'] = glove_similarity
df.sort_values(by ='GloVe_sim_score', ascending = False).head(15)

Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score,tfidf_sim_score2,GloVe_sim_score
28,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer| information systems,junior me engineer information system,0.328256,0.328256,0.658723
34,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,information systems specialist programmer love...,information system specialist programmer love ...,0.0,0.0,0.648725
32,Human Resources professional for the world lea...,"Highland, California",50,human resources professional world leader gis ...,human resource professional world leader gi so...,0.335103,0.335942,0.569041
44,Student at Indiana University Kokomo - Busines...,"Lafayette, Indiana",19,student indiana university kokomo business man...,student indiana university kokomo business man...,0.0,0.0,0.52863
20,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,business management major aspiring human resou...,business management major aspiring human resou...,0.0,0.0,0.525942
14,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,experienced retail manager aspire human resour...,experienced retail manager aspiring human reso...,0.0,0.0,0.492002
48,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,aspiring human resources manager | graduate 20...,aspiring human resource manager graduating may...,0.0,0.0,0.47484
18,"Retired Army National Guard Recruiter, office ...","Virginia Beach, Virginia",82,retired army national guard recruiter office m...,retired army national guard recruiter office m...,0.0,0.0,0.465067
40,Seeking employment opportunities within Custom...,"Torrance, California",64,seek employment opportunity customer service p...,seeking employment opportunity within customer...,0.0,0.0,0.462334
36,Human Resources Management Major,"Milpitas, California",18,human resources management major,human resource management major,0.0,0.0,0.450756


## Word2vec - This uses gensim model with Google News corpus

In [41]:
%%time
# Using a pre-trained word2vec model
w2v_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary = True)


CPU times: user 1min 37s, sys: 7.53 s, total: 1min 44s
Wall time: 1min 55s


In [42]:
# Vectorize job title using Word2Vec model
w2v_vectors = []
for i in range(0, len(df)):
    w2v_sentence_vector = []
    doc_token_vectors(df['cleaned_job_title'][i], w2v_model, w2v_sentence_vector, 300)
    w2v_vectors.append(w2v_sentence_vector)

In [43]:
# Vectorize search phrase using Word2Vec model
w2v_search_phrase = []
doc_token_vectors(lemmatized_phrase, w2v_model, w2v_search_phrase, 300)

[array([ 0.1328125 , -0.10986328, -0.19726562,  0.14941406, -0.11376953,
        ...
(output truncated)

In [44]:
# Cosine Similarity score for Word2Vec model
w2v_similarity_score = []
for i in range(0, len(df)):
    sim_score = cosine_similarity(np.mean(w2v_search_phrase, axis = 0), 
                                  np.mean(w2v_vectors[i], axis = 0))
    w2v_similarity_score.append(sim_score)

# Add Word2Vec similarity score
df['Word2Vec_sim_score'] = w2v_similarity_score
df['mean_sim_score'] = df.loc[:, 'tfidf_sim_score': 'Word2Vec_sim_score'].mean(axis=1)
df.sort_values(by = 'mean_sim_score', ascending = False).head()


Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score,tfidf_sim_score2,GloVe_sim_score,Word2Vec_sim_score,mean_sim_score
28,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer| information systems,junior me engineer information system,0.328256,0.328256,0.658723,0.596684,0.47798
32,Human Resources professional for the world lea...,"Highland, California",50,human resources professional world leader gis ...,human resource professional world leader gi so...,0.335103,0.335942,0.569041,0.393964,0.408512
34,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,information systems specialist programmer love...,information system specialist programmer love ...,0.0,0.0,0.648725,0.495401,0.286032
44,Student at Indiana University Kokomo - Busines...,"Lafayette, Indiana",19,student indiana university kokomo business man...,student indiana university kokomo business man...,0.0,0.0,0.52863,0.439216,0.241961
20,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,business management major aspiring human resou...,business management major aspiring human resou...,0.0,0.0,0.525942,0.368448,0.223597
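
The Word2Vec score above compares the mean of the search phrase's token vectors with the mean of each job title's token vectors. The following is a minimal sketch of that idea on made-up 3-dimensional vectors. The names `toy_vectors`, `mean_pool` and `cosine` are hypothetical and are not the notebook's own `doc_token_vectors` or `cosine_similarity` helpers, which are defined elsewhere; skipping out-of-vocabulary tokens is an assumption of this sketch.

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values, for illustration only)
toy_vectors = {
    "human": np.array([1.0, 0.0, 0.0]),
    "resource": np.array([0.0, 1.0, 0.0]),
    "engineer": np.array([0.0, 0.0, 1.0]),
}

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

def cosine(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

query = mean_pool("human resource".split(), toy_vectors, 3)
# "manager" is not in the toy vocabulary, so it is skipped (assumption)
title_a = mean_pool("human resource manager".split(), toy_vectors, 3)
title_b = mean_pool("engineer".split(), toy_vectors, 3)

print(cosine(query, title_a), cosine(query, title_b))
```

Because the mean discards word order and weights every token equally, long titles with many unrelated words can dilute the score of the few words that match the search phrase.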


## fastText - pre-trained vectors loaded with gensim KeyedVectors (word2vec text format)

In [45]:
%%time
# Load pre-trained fastText vectors stored in word2vec text format
fasttext_model = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary = False)


CPU times: user 6min 43s, sys: 9.05 s, total: 6min 52s
Wall time: 7min 8s


In [46]:
# Vectorize job titles using the pre-trained fastText vectors
fasttext_vectors = []
for i in range(0, len(df)):
    fasttext_sentence_vector = []
    doc_token_vectors(df['cleaned_job_title'][i], fasttext_model, fasttext_sentence_vector, 300)
    fasttext_vectors.append(fasttext_sentence_vector)

In [47]:
# Vectorize search phrase using the pre-trained fastText vectors
fasttext_search_phrase = []
doc_token_vectors(lemmatized_phrase, fasttext_model, fasttext_search_phrase, 300)

[array([-1.023e-01, -9.040e-02, -2.010e-02,  8.100e-03,  1.229e-01,
        ...
(output truncated)

In [48]:
# Cosine similarity score for the fastText vectors
fasttext_similarity_score = []
for i in range(0, len(df)):
    sim_score = cosine_similarity(np.mean(fasttext_search_phrase, axis = 0), 
                                  np.mean(fasttext_vectors[i], axis = 0))
    fasttext_similarity_score.append(sim_score)

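# Add fastText similarity score
df['fasttext_sim_score'] = fasttext_similarity_score
# Note: this label slice also spans the existing 'mean_sim_score' column,
# so the previous mean is averaged in along with the individual scores
df['mean_sim_score'] = df.loc[:, 'tfidf_sim_score': 'fasttext_sim_score'].mean(axis=1)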
df.sort_values(by = 'mean_sim_score', ascending = False).head(20)


Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score,tfidf_sim_score2,GloVe_sim_score,Word2Vec_sim_score,mean_sim_score,fasttext_sim_score
28,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer| information systems,junior me engineer information system,0.328256,0.328256,0.658723,0.596684,0.532624,0.805846
32,Human Resources professional for the world lea...,"Highland, California",50,human resources professional world leader gis ...,human resource professional world leader gi so...,0.335103,0.335942,0.569041,0.393964,0.465235,0.748846
34,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,information systems specialist programmer love...,information system specialist programmer love ...,0.0,0.0,0.648725,0.495401,0.367604,0.775469
44,Student at Indiana University Kokomo - Busines...,"Lafayette, Indiana",19,student indiana university kokomo business man...,student indiana university kokomo business man...,0.0,0.0,0.52863,0.439216,0.329633,0.767994
20,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,business management major aspiring human resou...,business management major aspiring human resou...,0.0,0.0,0.525942,0.368448,0.308324,0.731955
18,"Retired Army National Guard Recruiter, office ...","Virginia Beach, Virginia",82,retired army national guard recruiter office m...,retired army national guard recruiter office m...,0.0,0.0,0.465067,0.325941,0.286324,0.729181
14,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,experienced retail manager aspire human resour...,experienced retail manager aspiring human reso...,0.0,0.0,0.492002,0.304303,0.284664,0.712602
48,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,aspiring human resources manager | graduate 20...,aspiring human resource manager graduating may...,0.0,0.0,0.47484,0.295622,0.279571,0.71435
27,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,liberal arts major aspire human resources analyst,liberal art major aspiring human resource analyst,0.0,0.0,0.408712,0.319425,0.272865,0.727017
29,Senior Human Resources Business Partner at Hei...,"Chattanooga, Tennessee Area",455,senior human resources business partner heil e...,senior human resource business partner heil en...,0.0,0.0,0.435209,0.297116,0.272541,0.719842
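
The label-based slice `'tfidf_sim_score': 'fasttext_sim_score'` covers every column between those two labels, and in the table above that range includes `mean_sim_score` itself. The sketch below uses two rows copied from that table to compare the slice with an average over an explicitly named list of score columns. The `score_columns` list is an assumption about which scores are meant to be averaged, not something the notebook states.

```python
import pandas as pd

# Rows 28 and 32 copied from the table above
scores = pd.DataFrame(
    {
        "tfidf_sim_score": [0.328256, 0.335103],
        "tfidf_sim_score2": [0.328256, 0.335942],
        "GloVe_sim_score": [0.658723, 0.569041],
        "Word2Vec_sim_score": [0.596684, 0.393964],
        "mean_sim_score": [0.47798, 0.408512],  # mean from the previous step
        "fasttext_sim_score": [0.805846, 0.748846],
    },
    index=[28, 32],
)

# Label slice: also picks up the old 'mean_sim_score' column
slice_mean = scores.loc[:, "tfidf_sim_score":"fasttext_sim_score"].mean(axis=1)

# Explicit list (assumed): only the individual model scores
score_columns = [
    "tfidf_sim_score",
    "tfidf_sim_score2",
    "GloVe_sim_score",
    "Word2Vec_sim_score",
    "fasttext_sim_score",
]
explicit_mean = scores[score_columns].mean(axis=1)

print(pd.DataFrame({"slice_mean": slice_mean, "explicit_mean": explicit_mean}))
```

With an explicit column list the result does not depend on column order or on how many times the cell has been run.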


In [49]:
#import io

#def load_vectors(fname):
#    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
#    n, d = map(int, fin.readline().split())
#    data = {}
#    for line in fin:
#        tokens = line.rstrip().split(' ')
#        data[tokens[0]] = map(float, tokens[1:])
#    return data

In [50]:
#load_vectors("wiki-news-300d-1M.vec")
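
The commented-out loader above stores `map(float, ...)` objects, which in Python 3 are one-shot iterators rather than lists. Below is a hedged sketch of a variant that materializes each vector and can stop after a fixed number of entries. The name `load_vectors_limited` and its `limit` parameter are hypothetical, and the in-memory sample stands in for a real `.vec` file.

```python
import io

def load_vectors_limited(lines, limit=None):
    """Parse word2vec/fastText text format from an iterable of lines.

    The first line holds the vocabulary size and the vector dimension;
    each following line holds a token followed by its vector components.
    """
    lines = iter(lines)
    n, d = map(int, next(lines).split())
    data = {}
    for count, line in enumerate(lines):
        if limit is not None and count >= limit:
            break
        tokens = line.rstrip().split(' ')
        # list(...) materializes the values so they can be read more than once
        data[tokens[0]] = list(map(float, tokens[1:]))
    return d, data

# Tiny in-memory stand-in for a .vec file (made-up values)
sample = io.StringIO("3 2\nking 0.1 0.2\nqueen 0.3 0.4\nman 0.5 0.6\n")
dim, vectors = load_vectors_limited(sample, limit=2)
print(dim, vectors)
```

Reading only the first entries trades vocabulary coverage for load time, so rarer words in job titles may end up without a vector.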

In [51]:
#import fasttext

In [52]:
#model_load = fasttext.load_model("fasttext_cbow.bin")

In [53]:
#model_load

In [54]:
#model_load['king']

In [55]:
#fasttext_model['king']

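## BERT - pre-trained sentence embeddings from sentence-transformers

In [56]:
# Load the pre-trained BERT-base sentence embedding model
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')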

In [58]:
# Convert job titles into BERT embedded vectors
bert_job_title_embeddings = bert_model.encode(cleaned_title_list)
bert_job_title_embeddings.shape

(53, 768)

In [59]:
bert_job_title_embeddings[0].shape

(768,)

In [61]:
# Convert search phrase into a BERT embedded vector
bert_search_phrase_embedding = bert_model.encode(lemmatized_phrase)
bert_search_phrase_embedding.shape

(768,)

In [64]:
# Calculate cosine similarity between job title and search phrase vectors
bert_cosine_similarity = []
for i in range(0, len(df)):
    cos_sim = cosine_similarity(bert_job_title_embeddings[i], bert_search_phrase_embedding)
    bert_cosine_similarity.append(cos_sim)
    
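# Add the BERT cosine similarity column to the dataframe
df['BERT_sim_score'] = bert_cosine_similarity
# Note: this label slice also spans the existing 'mean_sim_score' column,
# so the result changes each time the cell is re-run
df['mean_sim_score'] = df.loc[:, 'tfidf_sim_score': 'BERT_sim_score'].mean(axis=1)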
df.sort_values(by = 'mean_sim_score', ascending = False).head()


Unnamed: 0,job_title,location,connection,nlp_job_title,cleaned_job_title,tfidf_sim_score,tfidf_sim_score2,GloVe_sim_score,Word2Vec_sim_score,mean_sim_score,fasttext_sim_score,BERT_sim_score
28,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,junior mes engineer| information systems,junior me engineer information system,0.328256,0.328256,0.658723,0.596684,0.545556,0.805846,0.557185
32,Human Resources professional for the world lea...,"Highland, California",50,human resources professional world leader gis ...,human resource professional world leader gi so...,0.335103,0.335942,0.569041,0.393964,0.496613,0.748846,0.600703
34,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,information systems specialist programmer love...,information system specialist programmer love ...,0.0,0.0,0.648725,0.495401,0.422359,0.775469,0.621406
20,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5,business management major aspiring human resou...,business management major aspiring human resou...,0.0,0.0,0.525942,0.368448,0.342497,0.731955,0.432905
44,Student at Indiana University Kokomo - Busines...,"Lafayette, Indiana",19,student indiana university kokomo business man...,student indiana university kokomo business man...,0.0,0.0,0.52863,0.439216,0.337368,0.767994,0.289337
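
The per-row loop that scores each BERT job title embedding against the search phrase can also be written as a single matrix operation. This is a minimal sketch on random stand-in data with the shapes printed above, `(53, 768)` and `(768,)`. The function `cosine_to_query` is a hypothetical name and is not the notebook's `cosine_similarity` helper, which is defined elsewhere.

```python
import numpy as np

def cosine_to_query(matrix, query):
    """Cosine similarity between each row of `matrix` and the 1-D `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Guard against zero vectors: their similarity is reported as 0.0
    safe = np.where(norms == 0, 1.0, norms)
    return np.where(norms == 0, 0.0, matrix @ query / safe)

# Random stand-ins with the same shapes as the embeddings above
rng = np.random.default_rng(0)
titles = rng.normal(size=(53, 768))
phrase = rng.normal(size=768)

scores = cosine_to_query(titles, phrase)
ranking = np.argsort(-scores)  # row positions, best match first
print(scores.shape, ranking[:5])
```

The result is one score per row, so it can be assigned directly to a dataframe column in place of the list built by the loop.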
