# Building a job recommendation system with embeddings

## Agenda

1. Data exploration
2. Data cleaning
3. Embeddings generation
4. Similarity search with Faiss
5. Recommendations based on text

## Import data

In [1]:
import pandas as pd
df_raw = pd.read_csv("data/monster_com-job_sample.csv")
# Work on a copy so that later column assignments do not raise SettingWithCopyWarning
df = df_raw[["job_title", "job_description"]].copy()

In [2]:
df

Unnamed: 0,job_title,job_description
0,IT Support Technician Job in Madison,TeamSoft is seeing an IT Support Specialist to...
1,Business Reporter/Editor Job in Madison,The Wisconsin State Journal is seeking a flexi...
2,Johnson & Johnson Family of Companies Job Appl...,Report this job About the Job DePuy Synthes Co...
3,Engineer - Quality Job in Dixon,Why Join Altec? If you’re considering a career...
4,Shift Supervisor - Part-Time Job in Camphill,Position ID# 76162 # Positions 1 State CT C...
...,...,...
21995,Assistant Vice President - Controller Job in C...,This is a major premier Cincinnati based finan...
21996,Accountant Job in Cincinnati,Luxury homebuilder in Cincinnati seeking multi...
21997,AEM/CQ developer Job in Chicago,RE: Adobe AEM- Client - Loca...
21998,Electrician - Experienced Forging Electrician ...,Jernberg Industries was established in 1937 an...


## Dataframe exploration
What can you find out about the data?
(Null-values? Unique values?)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   job_title        22000 non-null  object
 1   job_description  22000 non-null  object
dtypes: object(2)
memory usage: 343.9+ KB


In [4]:
df.describe()

Unnamed: 0,job_title,job_description
count,22000,22000
unique,18759,18744
top,Monster,12N Horizontal Construction Engineers Job Desc...
freq,318,104


In [5]:
df['job_title'].unique()

array(['IT Support Technician Job in Madison',
       'Business Reporter/Editor Job in Madison',
       'Johnson & Johnson Family of Companies Job Application for Senior Training Leader | Monster.com var MONS_LOG_VARS = {"JobID":',
       ..., 'AEM/CQ developer Job in Chicago',
       'Electrician - Experienced Forging Electrician Job in Chicago',
       'Contract Administrator Job in Cincinnati'], dtype=object)
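
The questions about null and unique values can also be answered directly, without reading them off `info()` and `describe()`. The following is a minimal sketch on a small made-up frame (`toy` and its rows are illustrative, not taken from the dataset); the same three calls should work unchanged on `df`.

```python
import pandas as pd

# Toy stand-in for the job postings (hypothetical rows).
toy = pd.DataFrame({
    "job_title": ["Monster", "RN", "Monster", "Accountant Job in Cincinnati"],
    "job_description": ["Posting A", "Posting B", "Posting A", None],
})

null_counts = toy.isna().sum()            # missing values per column
unique_counts = toy.nunique()             # distinct non-null values per column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row

print(null_counts, unique_counts, duplicate_rows, sep="\n")
```

On the real data, `describe()` already hints at duplicates: only 18744 of the 22000 descriptions are unique.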

## Exploration and cleaning of the job title column

In [6]:
df.job_title.value_counts()[:10]

Monster                                         318
Shift Supervisor Job in Camphill                256
RN                                               70
Shift Supervisor - Part-Time Job in Camphill     56
Manager                                          50
Please apply only if you are qualified.          31
ASST STORE MGR Job in Columbus                   26
LEAD SALES ASSOCIATE-FT Job in Columbus          26
LEAD SALES ASSOCIATE-PT Job in Columbus          24
SALES ASSOCIATE Job in Columbus                  24
Name: job_title, dtype: int64

### Remove jobs with Monster as title

In [7]:
df[df["job_title"] == "Monster"].sample(5).style.set_properties(**{'min-width': '100px'})

Unnamed: 0,job_title,job_description
842,Monster,"Alliance Engineering, Ltd., is actively seeking a Civil Engineering Draftsman with some experience in Planning, Civil Engineering, Landscape Design, Water Resources, Land Development, Project Management, and/or Construction Management. A four (4) year civil engineering degree is required, as well as, EIT certification.The candidate must have a minimum of three (3) years experience in civil design, and be proficient with Autodesk Civil 3D 2014 or later software on Windows 8. The candidate must also be proficient in the tracking, development, and writing/pricing of proposals. The candidate will work out of AEL’s Westminster, CO office; however, since AEL provides services in various states, the candidate must be willing to travel if a particular job requires the candidate to do so.· Four (4) year civil engineering degree; · Very experienced and proficient with Autodesk Civil 3D 2014or later software;· EIT certification; · Three (3) years experience in civil design; · Proficient in the tracking, development, and writing/pricing of proposals; and, · Willing to travel if a particular job requires it. Location: Westminster, ColoradoSalary: OpenType: Full Time - ExperiencedCategories: Civil - Design, Civil Engineering Alliance Engineering, Ltd. is a civil engineering, design, construction, and land development firm offering services for projects from conception through completion. With over 20+ years of experience in the market, we are able to provide a broad range of quality services, delivered consistently on time at competitive prices. We offer excellent plans and programs for employees. Employees are rewarded with a competitive salary and comprehensive benefits package which may include: health benefits with coverage for families and domestic partners, vacation, retirement plans, paid holidays, tuition reimbursement, generous bonuses, and profit sharing. 
You can experience the excitement of our company-it's the difference between taking a job and starting a career."
3865,Monster,Report this job About the Job Janitorial Positions ServiceMaster is now accepting applications for Full Time evening commercial janitorial positions in the health care field. Call 865-281-0220 for more information. Posting provided by: Report
16231,Monster,"Report this job About the Job Connexion Systems & Engineering, a Boston based IT and Engineering Solutions Company immediately seeks individuals with the following skills: Job# bh6076 Staff AccountantJob Description:The Staff Accountant II, under the direction of the SVP/Director of Accounting Operations, performs a variety of accounting and operational duties in support of the Finance department and other business lines.Position Requirements:Candidate should have BS degree in accounting/finance and 3+ years accounting experience in a banking environment. Good problem solver with strong accounting/financial fundamentals. A working knowledge of Excel and Word, good communication and people skills and must be a team player.Specific Job Functions:Performs daily, weekly, and monthly operational functionsInteracts with a variety of personnel in resolving various balancing and related issuesDevelops full understanding of our financial systems and operationsPerforms the reconciliation of correspondent bank statements and general ledger accountsSupports routine accounting processes as directed, including all calculations, record keeping, proof and reconciliation functionsAssists with the administration and execution of wire transfersAssists with department record retention requirementsManages the accounts payable, fixed assets, and prepaid items functionsPerforms a variety of daily processing and reconciling activitiesDevelops analytical methods to review income and expense activityPrepares internal reporting for monthly Board packagePrepares budget variance analysis at department levelAssists Director in support of oversight of Vendor Management programProvides operational support to other business lines Works well in a team environment. When responding to this job posting you MUST include the Job# and Job Title in your subject line. 
Duration: Permanent Rate 50-57K Locations: Belmont, MA Contact Info: ConneXion Systems & Engineering490 Boston Post RoadSudbury, MA 01776jobpostings@csetalent.com Report"
3955,Monster,Report this job About the Job Scott's Hotrods is moving to eastern Tennessee. Scott's is the premier manufacturer of quality suspension kits and complete chassis for American made vehicles. We also build complete ground-up award winning custom vehicles. YEARS of Experience and Natural Skill. Must be able to work as a team to achieve a common goal: CNC Programmers $21-$28/hr Custom Sheet Metal Fabricators Chassis Fabricators Office Manager Competitive Pay! www.scottshotrods.com email resume: cam@scottshotrods.com 805-485-0382 Posting provided by: Report
3938,Monster,Report this job About the Job Local distributor taking applications for F/T delivery person. Clean driving record & valid TN driver's lic. Send resume to: tracy@xpaperchem.com or call 865-688-5757.Posting provided by: Report


In [8]:
df = df.drop(df[df.job_title == "Monster"].index)
# or df = df[df['job_title']!="Monster"] 

### Replace the job title RN with Registered nurse

In [9]:
df[df["job_title"] == "RN"]

Unnamed: 0,job_title,job_description
2668,RN,Job Description Registered professional nurse ...
2672,RN,Job Description Registered professional nurse ...
2763,RN,"Job Description Up to $5,000 Sign On Bonus for..."
2794,RN,Provides professional nursing care for the com...
2821,RN,Job Description As part of the application pro...
...,...,...
19635,RN,"Up to $5,000 Sign On Bonus for experienced Eme..."
19743,RN,"Registered professional nurse who assesses, im..."
19773,RN,"Registered professional nurse who assesses, im..."
19927,RN,Davis Hospital and Medical Center is a 225-bed...


In [10]:
df["job_title"] = ["Registered nurse" if title=="RN" else title for title in df.job_title]
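
An alternative to the list comprehension is `Series.replace` with a dictionary, which only matches whole cell values. This is a sketch on a made-up series (`titles` is illustrative), not the cell the notebook actually ran.

```python
import pandas as pd

titles = pd.Series(["RN", "RN Supervisor", "Manager"])

# A dict passed to Series.replace matches entire values only,
# so "RN Supervisor" is left untouched.
cleaned = titles.replace({"RN": "Registered nurse"})
print(cleaned.tolist())
```

Applied to the notebook's frame, this would read `df["job_title"] = df["job_title"].replace({"RN": "Registered nurse"})`.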

### Count the most frequently occurring words in the job_title column

In [11]:
from collections import Counter
Counter(" ".join(df["job_title"]).split()).most_common(100)

[('Job', 19779),
 ('in', 19166),
 ('-', 4726),
 ('Manager', 3026),
 ('Sales', 1475),
 ('Dallas', 1429),
 ('Specialist', 1247),
 ('Engineer', 1136),
 ('Technician', 1054),
 ('{', 959),
 ('/', 959),
 ('Assistant', 947),
 ('Project', 930),
 ('Cincinnati', 900),
 ('for', 844),
 ('Service', 828),
 ('|', 809),
 ('Application', 797),
 ('Monster.com', 794),
 ('Supervisor', 782),
 ('Columbus', 776),
 ('San', 760),
 ('Analyst', 740),
 ('Construction', 723),
 ('Senior', 722),
 ('Quality', 707),
 ('=', 654),
 ('var', 653),
 ('MONS_LOG_VARS', 637),
 ('{"JobID":', 637),
 ('and', 611),
 ('Associate', 589),
 ('Nurse', 581),
 ('Shift', 545),
 ('City', 530),
 ('Coordinator', 525),
 ('&', 515),
 ('Representative', 514),
 ('Time', 498),
 ('}', 483),
 ('body', 475),
 ('margin:px;', 475),
 ('overflow:', 475),
 ('visible', 475),
 ('!important;', 475),
 ('#ejb_header', 475),
 ('color:', 475),
 ('#;', 475),
 ('font-family:', 475),
 ('Verdana', 475),
 ('Level', 472),
 ('Chicago', 469),
 ('Entry', 458),
 ('Regis

### Remove unwanted text from the job_title column (everything from the word "Job" onwards)

In [12]:
jobtitle=df['job_title'].str.split('Job')
df['job_title']=jobtitle.str[0]
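
To see what the split does, here is a small sketch on three titles (two copied from the output above, plus a title without "Job"). It keeps the text before the first "Job", which also discards the location and any trailing page residue. The extra `str.strip()` step is an addition that is not in the original cell; it removes the space left in front of "Job".

```python
import pandas as pd

titles = pd.Series([
    "IT Support Technician Job in Madison",
    "Engineer - Quality Job in Dixon",
    "Manager",
])

# Keep only the text before the first occurrence of "Job".
before_job = titles.str.split("Job").str[0]
# Optional extra step (not in the original cell): trim the leftover space.
trimmed = before_job.str.strip()

print(before_job.tolist())
print(trimmed.tolist())
```

Note that a title beginning with "Job" would be reduced to an empty string by this approach.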

### Remove punctuation from job_title column

In [13]:
import string
punc = string.punctuation
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [14]:
# Iterating over a string yields characters, so drop each character that is punctuation
df['job_title'] = df['job_title'].apply(lambda x: "".join([char for char in x if char not in punc]))
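
The same character filter can be written with `str.translate`, which avoids the per-character Python loop. This is a sketch; `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
import string

# Translation table that deletes every character in string.punctuation.
remove_punct = str.maketrans("", "", string.punctuation)

def strip_punctuation(title: str) -> str:
    """Return the title with all ASCII punctuation characters removed."""
    return title.translate(remove_punct)

print(strip_punctuation("Business Reporter/Editor"))
print(strip_punctuation("Engineer - Quality"))
print(strip_punctuation("Engineer – Python"))
```

Two side effects are worth knowing: removing "/" merges the neighbouring words, and non-ASCII characters such as the en dash "–" are not in `string.punctuation`, so they survive.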

In [15]:
df['job_title'].value_counts()[:10]

CyberCoders                                                    264
Shift Supervisor                                               260
Project Manager                                                154
N Horizontal Construction Engineers                            138
B Combat Engineer  Construction and Engineering Specialist      99
L Construction Vehicle Repairer                                 98
Restaurant Manager                                              95
Security Officer                                                88
Registered nurse                                                70
Maintenance Technician                                          66
Name: job_title, dtype: int64

In [16]:
# Reset index of dataframe for later mapping
df = df.reset_index()
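
Why the reset matters: the search index built later identifies vectors by their position (0, 1, 2, ...), while the frame still carries the original row labels with gaps where rows were dropped. A minimal sketch on a made-up frame (`toy` is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"job_title": ["Monster", "Test Lead", "Monster", "Java"]})
toy = toy[toy["job_title"] != "Monster"]   # remaining labels are 1 and 3
toy = toy.reset_index()                     # old labels move to an "index" column

print(toy)
print(toy.iloc[1]["job_title"])             # position 1 is the second remaining row
```

After the reset, labels and positions agree again, and the old labels are kept in the `index` column.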

## Convert job titles into vectors with the Universal Sentence Encoder

In [17]:
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [18]:
embeddings = embed(df.job_title.to_list())

In [19]:
embeddings[0]

<tf.Tensor: shape=(512,), dtype=float32, numpy=
array([ 5.12974672e-02, -2.62291506e-02, -5.80487065e-02,  3.83210694e-03,
       -2.07152199e-02, -2.42365990e-02,  4.04172726e-02, -2.02521868e-02,
        7.80966654e-02, -6.32373020e-02,  4.99930941e-02, -6.03388958e-02,
        1.89307071e-02, -2.51962431e-02,  7.05621988e-02,  1.12409843e-03,
       -5.51056862e-03,  1.13683625e-03,  5.51899783e-02, -7.56627843e-02,
        6.14884421e-02,  1.87566422e-03,  4.42417450e-02, -9.26245842e-03,
        2.25352440e-02,  6.17089728e-03, -8.32106546e-02,  2.31221858e-02,
       -3.00848112e-02, -2.78400648e-02, -7.13586658e-02,  6.72420710e-02,
       -3.51649150e-02, -1.12885246e-02, -5.77206612e-02, -1.82857178e-02,
        6.03349172e-02,  6.27986109e-03, -7.15801194e-02,  5.21528572e-02,
        6.00093603e-02, -3.55952494e-02, -4.34854738e-02, -2.40718853e-02,
        6.70997873e-02, -1.11188358e-02,  1.43688172e-02, -2.32283622e-02,
       -2.39833910e-02,  1.82318769e-03, -7.39896372

In [20]:
import numpy as np
np.shape(embeddings)

TensorShape([21682, 512])
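
Embedding all titles in one call worked here, but for larger inputs it may be safer to embed in batches. The sketch below is an assumption-laden illustration: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for the real encoder so the example runs without TensorFlow. With the real model one could pass something like `lambda batch: embed(batch).numpy()`.

```python
from typing import Callable, List

import numpy as np

def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], np.ndarray],
    batch_size: int = 1024,
) -> np.ndarray:
    """Embed texts chunk by chunk and stack the results into one float32 array."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch), dtype=np.float32))
    return np.vstack(chunks)

def fake_embed(batch: List[str]) -> np.ndarray:
    # Stand-in encoder (hypothetical): a 3-dimensional vector per text.
    return np.array([[len(text), 0.0, 1.0] for text in batch])

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors.shape)
```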

### Iterate over all job titles and create list of embeddings

In [21]:
from tqdm.notebook import tqdm

embeddings_array = [item.numpy() for item in tqdm(embeddings)]

## Building a similarity search index with Faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. The `IndexFlatL2` index used below performs an exact, brute-force search over L2 distances; Faiss also offers approximate index types for larger collections.

https://github.com/facebookresearch/faiss

### Import faiss and create an index with our embeddings

In [22]:
import faiss
import numpy as np
dimensions = len(embeddings_array[0])
faiss_index = faiss.IndexFlatL2(dimensions)
faiss_index.add(np.array(embeddings_array))
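
To make clear what a flat L2 index computes, here is a NumPy sketch of the same idea: compare the query with every stored vector and keep the `k` smallest squared L2 distances. `flat_l2_search` is a hypothetical helper for illustration, not the Faiss implementation, and it would be far too slow and memory-hungry for large collections.

```python
import numpy as np

def flat_l2_search(index_vectors: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force search: return (squared distances, ids) of the k nearest vectors."""
    # Pairwise differences, shape (n_queries, n_vectors, dim).
    diffs = queries[:, None, :] - index_vectors[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    # Sort each row by distance and keep the first k columns.
    order = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order

vectors = np.array([[0, 0], [1, 0], [0, 2], [3, 3]], dtype=np.float32)
query = np.array([[0.9, 0.1]], dtype=np.float32)

distances, ids = flat_l2_search(vectors, query, k=2)
print(ids)
print(distances)
```

Like `faiss_index.search`, the sketch returns one row per query, with the ids ordered from nearest to farthest.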

### Search for nearest neighbours in Faiss

In [23]:
embedding = embed(["this is a test"])
D, nearest_items = faiss_index.search(embedding.numpy(), 5)
nearest_items

array([[13770,  5037, 10526, 13780,  1556]])

In [24]:
# Look up the rows by position, using the ids returned by the search
df.iloc[nearest_items[0]]

Unnamed: 0,index,job_title,job_description
13770,13960,Test Lead,RESPONSIBILITIES:Kforce has a client that is s...
5037,5144,Test Lead,Test Lead Qualification: Bachelors in scienc...
10526,10688,USIT Tester Tier,Tester Role Bachelors degree or equivalent exp...
13780,13970,Test Engineer,Mastech is a growing company dedicated to inno...
1556,1581,Test Engineer,Randstad is looking for a Test Engineer in Dal...
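
The first return value, `D`, holds the distances that go with these ids; `IndexFlatL2` reports them as squared L2 distances. If the embeddings are unit length (an assumption worth checking with `np.linalg.norm` on the real vectors), squared L2 distance and cosine similarity are tied by $\lVert a - b \rVert^2 = 2 - 2\,a \cdot b$, so ranking by L2 distance gives the same order as ranking by cosine similarity. A quick numerical check on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 512))

# Normalise both vectors to unit length.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors the two quantities satisfy squared_l2 = 2 - 2 * cosine.
print(np.isclose(squared_l2, 2 - 2 * cosine))
```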


In [25]:
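def get_most_similar_jobs(text: str, num_recos: int = 5) -> pd.DataFrame:
    """Return the num_recos jobs whose title embeddings are closest to the text."""
    embedding = embed([text])
    # D holds the distances, nearest_items the row positions in df
    D, nearest_items = faiss_index.search(embedding.numpy(), num_recos)
    return df.iloc[nearest_items[0]]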

In [26]:
get_most_similar_jobs("I want to develop in python", 10)

Unnamed: 0,index,job_title,job_description
14986,15197,Python Developer,Job Description:-Client is looking for a Pytho...
3500,3533,Python Developer,Job Description:-Client is looking for a Pytho...
11023,11194,Python Developer II,RESPONSIBILITIES:Kforce has a client that is s...
17246,17533,Sr Python Developer,"Experis IT, a Manpower Company, is seeking a S..."
16475,16731,API Tester w Python Scripting experience,• To test the different services and API’s nee...
15529,15745,Senior Software Engineer Python,Senior Software Engineer (Python)Senior Softwa...
17933,18220,Java Groovy Developer,RESPONSIBILITIES:Kforce has a client in Westla...
4735,4841,Software Automation Engineer – Python,"Software Automation Engineer – Python, Ruby, J..."
3484,3517,Java SDET,Quickly growing company on the East side is lo...
21044,21362,Java,Java DeveloperJava Developer Description: The ...
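
The last rows of this result are Java jobs, which suggests that the tail of a top-10 list can drift away from the query. One option is to keep the distances next to the results and cut off weak matches. The sketch below uses made-up values: `attach_distances` is a hypothetical helper, the arrays mimic the shape of what `faiss_index.search` returns, and the cutoff of 1.0 is arbitrary and would need tuning on real data.

```python
import numpy as np
import pandas as pd

def attach_distances(jobs: pd.DataFrame, distances: np.ndarray, indices: np.ndarray) -> pd.DataFrame:
    """Return the matched rows with their distance to the query as an extra column."""
    result = jobs.iloc[indices[0]].copy()
    result["distance"] = distances[0]
    return result

# Made-up stand-ins for df and for the two arrays returned by the search.
jobs = pd.DataFrame({"job_title": ["Python Developer", "Java", "Test Lead"]})
D = np.array([[0.4, 1.1]], dtype=np.float32)
nearest_items = np.array([[0, 1]])

ranked = attach_distances(jobs, D, nearest_items)
close_enough = ranked[ranked["distance"] < 1.0]   # arbitrary cutoff for illustration

print(ranked)
print(close_enough)
```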
