# Text to Softskills Profile with BERT

# Step 1: Creating Softskills Dictionary

One dictionary per candidate, with soft skills as keys and similarity scores as values.

### Imports

In [30]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

### Open Softskills CSV

The soft skills CSV provides 8 worker personality types. Each type has its own list of words, and the lists vary in length.

In [31]:
softskillsDataFrame = pd.read_csv("Soft Skills.csv")
softskillsDataFrame.head()

Unnamed: 0,Dominance,Unnamed: 1,Unnamed: 2,Extraversion,Unnamed: 4,Unnamed: 5,Patience,Unnamed: 7,Unnamed: 8,Conscientious,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,Team Leader,Team Player,,Thinker,Talker,,Steady,Fast Paced,,Detail Oriented,...,,,,,,,,,,
1,Independent,Collaborative,,Detail Oriented,Connects with others,,Stability,Variety,,Meticulous,...,,,,,,,,,,
2,Assertive,Cooperative,,Factual,Influencer,,Stable work,Handle multiple priorities,,Thorough,...,,,,,,,,,,
3,Drive,Harmony-seeking,,Analytical,Sales,,Organized,Demanding,,Punctual,...,,,,,,,,,,
4,Determined,Team commitment,,Matter-of-fact,Customer service,,Steady pace,target driven,,Reliable,...,,,,,,,,,,


### Drop Empty Columns

In [32]:
softskillsDataFrame.drop(['Unnamed: 2', 'Unnamed: 5','Unnamed: 8'], inplace=True, axis=1)
softskillsDataFrame.head()

Unnamed: 0,Dominance,Unnamed: 1,Extraversion,Unnamed: 4,Patience,Unnamed: 7,Conscientious,Unnamed: 10,Unnamed: 11,Unnamed: 12,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,Team Leader,Team Player,Thinker,Talker,Steady,Fast Paced,Detail Oriented,Innovative,,,...,,,,,,,,,,
1,Independent,Collaborative,Detail Oriented,Connects with others,Stability,Variety,Meticulous,Freedom from rules,,,...,,,,,,,,,,
2,Assertive,Cooperative,Factual,Influencer,Stable work,Handle multiple priorities,Thorough,Ingenuity,,,...,,,,,,,,,,
3,Drive,Harmony-seeking,Analytical,Sales,Organized,Demanding,Punctual,Creative,,,...,,,,,,,,,,
4,Determined,Team commitment,Matter-of-fact,Customer service,Steady pace,target driven,Reliable,Ideator,,,...,,,,,,,,,,


### Rename Columns

In [57]:
softskillsDataFrame.rename(columns = {'Unnamed: 1': 'Dominance', 'Unnamed: 4': 'Extraversion', 'Unnamed: 7': 'Patience', 'Unnamed: 10': "Conscientious"}, inplace=True)
softskillsDataFrame.head()

Unnamed: 0,Dominance,Dominance.1,Extraversion,Extraversion.1,Patience,Patience.1,Conscientious,Conscientious.1
0,Team Leader,Team Player,Thinker,Talker,Steady,Fast Paced,Detail Oriented,Innovative
1,Independent,Collaborative,Detail Oriented,Connects with others,Stability,Variety,Meticulous,Freedom from rules
2,Assertive,Cooperative,Factual,Influencer,Stable work,Handle multiple priorities,Thorough,Ingenuity
3,Drive,Harmony-seeking,Analytical,Sales,Organized,Demanding,Punctual,Creative
4,Determined,Team commitment,Matter-of-fact,Customer service,Steady pace,target driven,Reliable,Ideator


### Select the 8 soft skill columns by integer position

In [58]:
softskillsDataFrame = softskillsDataFrame.iloc[0:54, 0:8]
softskillsDataFrame.head()

Unnamed: 0,Dominance,Dominance.1,Extraversion,Extraversion.1,Patience,Patience.1,Conscientious,Conscientious.1
0,Team Leader,Team Player,Thinker,Talker,Steady,Fast Paced,Detail Oriented,Innovative
1,Independent,Collaborative,Detail Oriented,Connects with others,Stability,Variety,Meticulous,Freedom from rules
2,Assertive,Cooperative,Factual,Influencer,Stable work,Handle multiple priorities,Thorough,Ingenuity
3,Drive,Harmony-seeking,Analytical,Sales,Organized,Demanding,Punctual,Creative
4,Determined,Team commitment,Matter-of-fact,Customer service,Steady pace,target driven,Reliable,Ideator


### Create a soft skills dictionary with the non-null soft skill words as keys

In [35]:
softskillDictionary = {}
# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
for (columnName, columnData) in softskillsDataFrame.items():
    boolSeries = pd.notnull(columnData)
    newSeries = columnData[boolSeries]  # keep only the non-null cells
    for string in newSeries:
        softskillDictionary[string] = 0
# The columns hold 262 non-null cells but the dictionary has 248 keys:
# strings that occur more than once (e.g. 'Detail Oriented') share one key.
print(softskillDictionary.keys())

dict_keys(['Team Leader', 'Independent', 'Assertive', 'Drive', 'Determined', 'Self Moviated', 'Management', 'Leadership', 'Initiative', 'Self-confident', 'Autonomy', 'Control', 'Coaching', 'Teaching', 'Instructor', 'Motivated', 'Spearhead', 'Develop others', 'Mentor', 'Fast-paced', 'Strong-willed', 'Inspirational', 'Questioning', 'Goal-oriented', 'Blunt', 'Bottom-line', 'Skeptical', 'Focused', 'Forceful', 'Risk-taker', 'Competitive', 'Concrete', 'Outspoken', 'Straightforward', 'Action-oriented', 'Swift', 'Freedom', 'Strong', 'Decisive ', 'Success-oriented', 'Delegates', 'Empowers', 'Stands up', 'Monitor', 'Supervising', 'Team Player', 'Collaborative', 'Cooperative', 'Harmony-seeking', 'Team commitment', 'Open Minded', 'Cross-functional teams', 'Faclitating cooperation', 'Mediation', 'Team setting', 'Team environment', 'Coordianting efforts', 'Catalyze the best for others', 'Collaborative Groups', 'Working closely with users', 'Managing conflict', 'Recognize team success', 'Teammwork', 
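The gap between the number of non-null cells and the number of dictionary keys can only come from repeated strings, because a dictionary keeps one entry per key. The sketch below shows the effect on a toy stand-in for two columns; the column contents are illustrative, not the full CSV.

```python
from collections import Counter

# Toy stand-in for two soft skill columns (illustrative values only).
columns = {
    "Extraversion": ["Thinker", "Detail Oriented", "Factual"],
    "Conscientious": ["Detail Oriented", "Meticulous", "Thorough"],
}

all_entries = [word for words in columns.values() for word in words]
counts = Counter(all_entries)

total_entries = len(all_entries)                # one per non-null cell
unique_keys = len(dict.fromkeys(all_entries))   # what a dictionary would keep
duplicates = sorted(word for word, n in counts.items() if n > 1)

print(total_entries, unique_keys, duplicates)
```

Near-duplicates such as 'Fast-paced' and 'Fast Paced' are different strings, so they stay separate keys.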

# Start of BERT Application
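A sentence transformer maps a word, phrase, or sentence to a fixed-length vector. The model used below, `bert-base-nli-mean-tokens`, is a BERT-base model fine-tuned on natural language inference (NLI) data; it averages the token vectors to produce one 768-dimensional embedding per input. Because BERT attends to the context on both sides of every token, it captures a word's meaning and role in the sentence better than a one-directional model. For single words, a static model such as Word2Vec might work as well or better; that has not been tested here.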

### Import SentenceTransformers

Convert every key in the soft skills dictionary to a vector. This is a key embedding!

In [36]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
#Encoding:
keyList = list(softskillDictionary.keys())
key_embeddings = model.encode(keyList)
print(key_embeddings.shape)

(248, 768)
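The `(248, 768)` shape means one 768-dimensional vector per key. The `mean-tokens` part of the model name refers to mean pooling: the token vectors of a text are averaged into a single vector. A minimal sketch of that pooling step on toy 4-dimensional vectors follows; `mean_pool` and the numbers are hypothetical, and the real model does this internally.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one text, ignoring padded positions."""
    token_vectors = np.asarray(token_vectors, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (token_vectors * mask).sum(axis=0) / mask.sum()

# Three token slots; the last one is padding and must not affect the result.
tokens = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```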


# Candidate Data Extraction

### Read candidate data
These columns hold each candidate's strengths, skills, experience, and job details.

In [37]:
candidates = pd.read_csv("./candidate_data.csv")
candidates = candidates.iloc[:, [38, 39, 42, 74]]
# Name the four selected columns by position
candidates.columns = ["Strengths", "Skills", "Experience", "Job Details"]
candidates.head()

Unnamed: 0,Strengths,Skills,Experience,Job Details
0,"I am good at developing ""Big Picture"" thinkin...","I am a non conventional thinker, and a creativ...",,I'v consulted Harvard university in respect of...
1,I found the initiative to understand how to sp...,Perhaps because of my passion for writing and ...,,"As the name implies, I am tasked with sacking ..."
2,I am resourceful and determined to catalyze th...,I am a versatile design specialist with a back...,,Optimized clients' business objectives with be...
3,I am good at improving my work quality so that...,Mathematics and story writing. I went through ...,I've done good work at the Tibetan center and ...,Stocked selves. Priced inventory. Checked item...
4,My biggest strengths would be my ability to hy...,I excel at community building and bridging com...,I am and have been a full stack engineer at Wa...,I built and maintained complex tools and data ...


### Candidate Blobs
The strengths, the skills, the experience, and job details are combined into one string per candidate.

In [38]:
candidateBlobs = []
for index, row in candidates.iterrows():
    candidateBlobs.append(row.str.cat())
candidateBlobs[0]

'I am good at developing  "Big Picture" thinking about complex technology trends and markets, highly useful in my research, I am a team player and often find at ease in coordinating team efforts though a project.I am a non conventional thinker, and a creative person, I am also a good motivator and mentor to my peers, I enjoy exercising design and lateral thinking to develop effective solutions to complex challenges. I enjoy designing urban innovation strategies.I\'v consulted Harvard university in respect of current built environment digital technologies and alternative financing models over 3,5,10 years horizons and prepared by mentoring , lecturing and participating in workshops, master degrees students for future practices.'
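In the blob above the fields run together without a space ("project.I am"), which can merge the last word of one field with the first word of the next before tokenizing. A minimal sketch of joining with a separator while skipping missing values follows; `join_fields` is a hypothetical helper and the row is made up. In pandas, passing `sep=" "` to `str.cat` should have a similar effect.

```python
import math

def join_fields(fields, sep=" "):
    """Join text fields with a separator, skipping None and NaN values."""
    kept = []
    for field in fields:
        if field is None:
            continue
        if isinstance(field, float) and math.isnan(field):
            continue
        kept.append(str(field).strip())
    return sep.join(kept)

# Made-up row: strengths, a missing experience field, skills.
row = ["I coordinate team efforts through a project.",
       float("nan"),
       "I am a creative person."]
blob = join_fields(row)
print(blob)
```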

### Find the softskills profile for first candidate blob

In [39]:
testCandidate = candidateBlobs[0]

### Clean this blob of text
Remove stop words and words shorter than 3 characters, then lemmatize the rest using NLTK. Part-of-speech tagging (for example, keeping only verbs or adjectives) could narrow the candidate's key words further.

In [59]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

# The tokenizer, stop word list, and WordNet data must be fetched once with
# nltk.download(...); the exact resource names depend on the NLTK version.

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
# Replace punctuation with spaces before tokenizing
testCandidate = re.sub(r'[^\w\s]',' ', testCandidate)
word_tokens = word_tokenize(testCandidate)

filteredParagraph = []

for w in word_tokens:
    # Drop stop words; the length check also removes leftover punctuation and very short words
    if w not in stop_words and len(w) > 2:
        filteredParagraph.append(lemmatizer.lemmatize(w))

print(filteredParagraph)

['good', 'developing', 'Big', 'Picture', 'thinking', 'complex', 'technology', 'trend', 'market', 'highly', 'useful', 'research', 'team', 'player', 'often', 'find', 'ease', 'coordinating', 'team', 'effort', 'though', 'project', 'non', 'conventional', 'thinker', 'creative', 'person', 'also', 'good', 'motivator', 'mentor', 'peer', 'enjoy', 'exercising', 'design', 'lateral', 'thinking', 'develop', 'effective', 'solution', 'complex', 'challenge', 'enjoy', 'designing', 'urban', 'innovation', 'strategy', 'consulted', 'Harvard', 'university', 'respect', 'current', 'built', 'environment', 'digital', 'technology', 'alternative', 'financing', 'model', 'year', 'horizon', 'prepared', 'mentoring', 'lecturing', 'participating', 'workshop', 'master', 'degree', 'student', 'future', 'practice']
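The filter above compares tokens to the stop word list as-is. NLTK's English stop words are lowercase, so a capitalized stop word such as "The" would only be dropped if the length check happens to catch it. The sketch below shows the same filtering idea with lowercasing added first. It is a simplified variant under stated assumptions: `STOP_WORDS` is a tiny illustrative set, not NLTK's list, and lemmatization is left out so the block runs without any downloads.

```python
import re

# Tiny illustrative stop word set (assumption; NLTK's list is much larger).
STOP_WORDS = {"i", "am", "a", "at", "and", "the", "in", "my", "about"}

def filter_tokens(text, stop_words=STOP_WORDS, min_length=3):
    """Lowercase, strip punctuation, and drop stop words and very short tokens."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in cleaned.split()
            if tok not in stop_words and len(tok) >= min_length]

sample = 'I am good at developing "Big Picture" thinking about complex technology trends'
filtered = filter_tokens(sample)
print(filtered)
```

Whether lowercasing helps here is an open question, since the BERT embedding of "Harvard" may differ from that of "harvard".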


### Encode the test candidate blob
Encode every word of the filtered paragraph (`filteredEmbed`), then print the cosine similarity between the first key (a soft skill such as "Team Leader") and each encoded word.

In [41]:
filteredEmbed = model.encode(filteredParagraph)
cosResult = cosine_similarity([key_embeddings[0]], filteredEmbed)
print(cosResult)

[[0.62626487 0.6886418  0.5721537  0.5995661  0.64255106 0.5192255
  0.6472932  0.64143527 0.4505285  0.6249796  0.6420294  0.6134052
  0.70776844 0.69671685 0.6254381  0.59740454 0.5476612  0.7529969
  0.7077684  0.69679976 0.50594187 0.6856246  0.3649695  0.455686
  0.6347934  0.7154338  0.70346665 0.59265095 0.62626487 0.7502358
  0.77227485 0.67053133 0.6055478  0.6887841  0.65282255 0.54020005
  0.64255106 0.66573113 0.6849568  0.64469486 0.5192254  0.6720971
  0.6055478  0.6788463  0.45199347 0.67177486 0.7173019  0.637907
  0.37137687 0.44803134 0.64391685 0.61405426 0.6503183  0.5859797
  0.47889125 0.6472932  0.5665082  0.6583452  0.6675593  0.49904186
  0.5767889  0.7271948  0.7662436  0.5769757  0.7346028  0.6561968
  0.69754356 0.6031715  0.6241054  0.53362453 0.70732653]]
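Each number above is a cosine similarity: the dot product of two vectors divided by the product of their lengths, so 1 means the same direction and 0 means orthogonal. A minimal sketch on 2-dimensional toy vectors follows; the `cosine` helper and the vectors are illustrative, not the notebook's 768-dimensional embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

key_vector = [1.0, 0.0]
word_vectors = [[1.0, 0.0],   # same direction as the key
                [1.0, 1.0],   # 45 degrees away
                [0.0, 1.0]]   # orthogonal

scores = [cosine(key_vector, w) for w in word_vectors]
print(scores)
```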


### Update Soft Skills Dictionary
For each key, find its list of similarity scores with the test candidate blob and set this as its value.

In [42]:
# Pair each key with its embedding instead of tracking an index by hand
for key, key_embedding in zip(keyList, key_embeddings):
    value = cosine_similarity([key_embedding], filteredEmbed)
    softskillDictionary[key] = value[0] #removes list of lists
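The loop computes one row of similarities per key. The same numbers can be obtained in a single step as a keys-by-words matrix; scikit-learn's `cosine_similarity` accepts two 2-D arrays for this purpose. The sketch below shows the idea with plain NumPy on toy vectors; `cosine_matrix` and all values are hypothetical.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity: rows of a against rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

toy_keys = ["Team Leader", "Independent"]
toy_key_vectors = [[1.0, 0.0], [0.0, 2.0]]
toy_word_vectors = [[3.0, 0.0], [1.0, 1.0], [0.0, 5.0]]

matrix = cosine_matrix(toy_key_vectors, toy_word_vectors)
scores_by_key = dict(zip(toy_keys, matrix))  # key -> row of word similarities
print(matrix.shape)
```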

In [43]:
print(softskillDictionary[keyList[0]])
print(softskillDictionary[keyList[1]])
print(softskillDictionary[keyList[2]])

[0.62626487 0.6886418  0.5721537  0.5995661  0.64255106 0.5192255
 0.6472932  0.64143527 0.4505285  0.6249796  0.6420294  0.6134052
 0.70776844 0.69671685 0.6254381  0.59740454 0.5476612  0.7529969
 0.7077684  0.69679976 0.50594187 0.6856246  0.3649695  0.455686
 0.6347934  0.7154338  0.70346665 0.59265095 0.62626487 0.7502358
 0.77227485 0.67053133 0.6055478  0.6887841  0.65282255 0.54020005
 0.64255106 0.66573113 0.6849568  0.64469486 0.5192254  0.6720971
 0.6055478  0.6788463  0.45199347 0.67177486 0.7173019  0.637907
 0.37137687 0.44803134 0.64391685 0.61405426 0.6503183  0.5859797
 0.47889125 0.6472932  0.5665082  0.6583452  0.6675593  0.49904186
 0.5767889  0.7271948  0.7662436  0.5769757  0.7346028  0.6561968
 0.69754356 0.6031715  0.6241054  0.53362453 0.70732653]
[0.6438643  0.49032176 0.4374673  0.5388757  0.5869322  0.54367304
 0.54381704 0.54239345 0.437105   0.55687165 0.6223043  0.48066062
 0.6249034  0.59112746 0.49678397 0.5917739  0.7081559  0.59055084
 0.6249034  0.60

### Create new avgskillsDictionary
Here, every key maps to the mean of the list it maps to in `softskillDictionary`. This acts as an overall similarity score between that soft skill and the `testCandidate` text. Another statistic, such as the median, may turn out to be more accurate.

In [44]:
from statistics import mean
avgskillsDict = {}
for key in keyList:
    previousList = softskillDictionary[key]
    avgskillsDict[key] = mean(previousList)
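To see how the choice of statistic changes a key's score, the sketch below summarizes one toy list of word similarities three ways. The `summarize` helper and the scores are hypothetical; the maximum is included only as one more option to compare, not as something the notebook uses.

```python
from statistics import mean, median

def summarize(scores):
    """Collapse a list of word-level similarities into candidate summary scores."""
    return {"mean": mean(scores), "median": median(scores), "max": max(scores)}

# Toy scores: most words are weakly related, one word matches the key closely.
scores = [0.40, 0.45, 0.50, 0.55, 0.95]
summary = summarize(scores)
print(summary)
```

With these numbers the single strong match pulls the mean above the median, while the maximum reflects only that one word.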

### Sort the avgskillsDictionary by value
Sorting shows which words best describe the person. The sort is ascending, so the best matches come last. From here the program can output, for example, the top 10 words from the soft skills CSV for this test candidate.

In [45]:
sorted_dict = {}
sorted_keys = sorted(avgskillsDict, key=avgskillsDict.get)

for w in sorted_keys:
    sorted_dict[w] = avgskillsDict[w]

print(sorted_dict)

{'Building models': 0.42194155, 'Talking to people': 0.42468357, 'Quiet environment': 0.45760763, 'Fast-paced': 0.48418954, 'Unemotional': 0.49138308, 'Fast Paced': 0.49703798, 'Long term relationships': 0.50031364, 'Changes in work environment': 0.5006228, 'Stands up': 0.5055034, 'Catalyze the best for others': 0.5088219, 'Concrete': 0.51196057, 'Spearhead': 0.51239276, 'Sociable': 0.51417965, 'Recognize team success': 0.5144537, 'Impulsive': 0.515151, 'Steady pace': 0.5177279, 'Extraversion': 0.51913977, 'Fast learner': 0.5198285, 'Willing to try new things': 0.5207214, 'Fast speed': 0.52304995, 'Helping people': 0.5240566, 'Good arranger': 0.5243644, 'Working closely with users': 0.5246269, 'Curious about people': 0.52530134, 'Freedom from rules': 0.5290068, 'Strong-willed': 0.53029734, 'Working with people': 0.53480434, 'Public speaking': 0.53913194, 'Out of the box thinking': 0.5413408, 'Patient': 0.5415085, 'Suggestions of others': 0.5433813, 'Bottom-line': 0.54402745, 'Logical t
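Since the sorted dictionary is ascending, the best-matching words sit at the end of the output. A small sketch of pulling the top "n" directly follows, using `heapq.nlargest` from the standard library; `top_n` and the toy scores are illustrative values, not the notebook's results.

```python
import heapq

def top_n(score_by_key, n=3):
    """Return the n (key, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, score_by_key.items(), key=lambda item: item[1])

# Toy scores (illustrative only).
toy_scores = {"Building models": 0.42, "Mentor": 0.71,
              "Sociable": 0.51, "Creative": 0.69}

best = top_n(toy_scores, n=2)
print(best)
```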

### Find the similarity score for each of the 8 personalities from soft skills csv

In [46]:
# One Series per personality column, with the empty (NaN) cells dropped
teamLeader = softskillsDataFrame.iloc[:, 0].dropna()
teamPlayer = softskillsDataFrame.iloc[:, 1].dropna()
thinker = softskillsDataFrame.iloc[:, 2].dropna()
talker = softskillsDataFrame.iloc[:, 3].dropna()
steady = softskillsDataFrame.iloc[:, 4].dropna()
fastpaced = softskillsDataFrame.iloc[:, 5].dropna()
detailOriented = softskillsDataFrame.iloc[:, 6].dropna()
innovative = softskillsDataFrame.iloc[:, 7].dropna()

In [47]:
teamLeaderSum = 0
for key in teamLeader:
    teamLeaderSum += avgskillsDict[key]
teamLeaderAvg = float(teamLeaderSum / len(teamLeader))
print(teamLeaderAvg)

0.6353276119284008


In [48]:
teamPlayerSum = 0
for key in teamPlayer:
    teamPlayerSum += avgskillsDict[key]
teamPlayerAvg = float(teamPlayerSum / len(teamPlayer))
print(teamPlayerAvg)

0.6173269425829252


In [49]:
thinkerValues = [float(avgskillsDict[key]) for key in thinker]
thinkerAvg = mean(thinkerValues)
print(thinkerAvg)

0.6297001764178276


In [50]:
talkerValues = [float(avgskillsDict[key]) for key in talker]
talkerAvg = mean(talkerValues)
print(talkerAvg)

0.6144353443262528


In [51]:
steadyValues = [float(avgskillsDict[key]) for key in steady]
steadyAvg = mean(steadyValues)
print(steadyAvg)

0.6482259035110474


In [52]:
fastValues = [float(avgskillsDict[key]) for key in fastpaced]
fastAvg = mean(fastValues)
print(fastAvg)

0.6177546765123095


In [53]:
detailValues = [float(avgskillsDict[key]) for key in detailOriented]
detailAvg = mean(detailValues)
print(detailAvg)

0.6512603814955111


In [54]:
innovativeValues = [float(avgskillsDict[key]) for key in innovative]
innovativeAvg = mean(innovativeValues)
print(innovativeAvg)

0.642262856165568


In [55]:
categories = {"Team Leader" : teamLeaderAvg, "Team Player" : teamPlayerAvg, "Thinker" : thinkerAvg, "Talker" : talkerAvg, "Steady" : steadyAvg, "Fast-Paced" : fastAvg, "Detail-Oriented" : detailAvg, "Innovative" : innovativeAvg}
categories

{'Team Leader': 0.6353276119284008,
 'Team Player': 0.6173269425829252,
 'Thinker': 0.6297001764178276,
 'Talker': 0.6144353443262528,
 'Steady': 0.6482259035110474,
 'Fast-Paced': 0.6177546765123095,
 'Detail-Oriented': 0.6512603814955111,
 'Innovative': 0.642262856165568}
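The eight cells above repeat one calculation: average the per-word scores of the words in a category. A minimal sketch of the same step as a single function follows; `category_scores`, the category word lists, and the scores are hypothetical toy data.

```python
from statistics import mean

def category_scores(words_by_category, score_by_word):
    """Average the per-word scores of each category's words."""
    return {
        category: mean([score_by_word[word] for word in words])
        for category, words in words_by_category.items()
    }

# Toy data (illustrative only).
words_by_category = {
    "Team Leader": ["Assertive", "Drive"],
    "Team Player": ["Cooperative"],
}
score_by_word = {"Assertive": 0.6, "Drive": 0.7, "Cooperative": 0.5}

toy_categories = category_scores(words_by_category, score_by_word)
print(toy_categories)
```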

### Sort the category scores to clarify which category best fits this candidate blob of text

In [56]:
categoriesSorted = {}
sortedCategoryKeys = sorted(categories, key=categories.get)  # ascending by score
for w in sortedCategoryKeys:
    categoriesSorted[w] = categories[w]

print(categoriesSorted)  # the best-fitting category is last

{'Talker': 0.6144353443262528, 'Team Player': 0.6173269425829252, 'Fast-Paced': 0.6177546765123095, 'Thinker': 0.6297001764178276, 'Team Leader': 0.6353276119284008, 'Innovative': 0.642262856165568, 'Steady': 0.6482259035110474, 'Detail-Oriented': 0.6512603814955111}
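The best and worst categories, and the gap between them, can also be read off without sorting. The sketch below reuses the eight scores printed above.

```python
# Category scores copied from the output above.
categories = {
    "Team Leader": 0.6353276119284008,
    "Team Player": 0.6173269425829252,
    "Thinker": 0.6297001764178276,
    "Talker": 0.6144353443262528,
    "Steady": 0.6482259035110474,
    "Fast-Paced": 0.6177546765123095,
    "Detail-Oriented": 0.6512603814955111,
    "Innovative": 0.642262856165568,
}

best = max(categories, key=categories.get)
worst = min(categories, key=categories.get)
spread = categories[best] - categories[worst]
print(best, worst, round(spread, 4))
```

The spread between the highest and lowest category is under 0.04, which is small compared with the scores themselves.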


# 2 Independent Processes
## Candidate data extraction
1. For each candidate, combine skills, experience, strengths, job details into one string. 
2. Remove stopwords and lemmatize this string using NLTK.
3. Encode this candidate text string using BERT to a vector.

## Softskills CSV data extraction

1. Use pandas to collect the soft skills (the non-null entries in Soft Skills.csv) as the keys of `softskillDictionary`.
2. Encode every key using BERT to a vector (key embedding).

# Comparison
1. For every encoded key, map it to a list of cosine similarity values with each of the words in the candidate text blob.
2. Create a new dictionary (avgskillsDict) which maps these keys to the averages of those lists above. Now each key has a similarity score with the candidate blob of text.

Now we can either:

1. Sort the keys by ascending similarity value and take the last "n" to get the top words describing this person, or
2. Compute the average similarity score for each category (an average of averages) and sort these to determine which category best describes this person.

# Flexibility
## Results
The category scores are nearly identical: they range from about 0.614 (Talker) to 0.651 (Detail-Oriented), so the program cannot reliably distinguish between the soft skill categories ("Team Player", "Thinker", etc.). A related drawback is that the program is not tailored to any particular type of input.
## Reflection
1. The program can be updated to use part-of-speech tagging when filtering the candidate data or a job description (in the end, only a string is needed).
2. The cosine similarity measure and the BERT model can be swapped out to experiment with other semantic similarity methods (GloVe and Word2Vec are being considered).
3. The math can be done differently, for example with the median instead of the mean. Also, the columns in the soft skills CSV do not all have the same number of entries, which affects the mean similarity score per category: in a category with fewer entries, each entry carries more weight in the average.
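The last point can be illustrated with toy numbers. The sketch below puts the same unusually high score into a short category and a long one, then compares mean and median; all values are made up for illustration.

```python
from statistics import mean, median

# The same outlier (0.90) placed in a 3-entry and a 10-entry category.
small = [0.60, 0.62, 0.90]
large = [0.60, 0.62, 0.61, 0.59, 0.63, 0.60, 0.62, 0.61, 0.59, 0.90]

small_stats = {"mean": mean(small), "median": median(small)}
large_stats = {"mean": mean(large), "median": median(large)}
print(small_stats)
print(large_stats)
```

In this toy case the outlier lifts the short category's mean well above the long one's, while the two medians stay close together.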