# Building a job recommendation system with embeddings

## Agenda

1. Data exploration
2. Data cleaning
3. Embeddings generation (https://tfhub.dev/google/universal-sentence-encoder/4)
4. Similarity search with Faiss (https://github.com/facebookresearch/faiss)
5. Recommendations based on text

## Import data

In [1]:
import pandas as pd
df_raw = pd.read_csv("data/monster_com-job_sample.csv")
# Copy the selection so later in-place cleaning does not raise SettingWithCopyWarning
df = df_raw[["job_title", "job_description"]].copy()

In [2]:
df

| | job_title | job_description |
|---|---|---|
| 0 | IT Support Technician Job in Madison | TeamSoft is seeing an IT Support Specialist to... |
| 1 | Business Reporter/Editor Job in Madison | The Wisconsin State Journal is seeking a flexi... |
| 2 | Johnson & Johnson Family of Companies Job Appl... | Report this job About the Job DePuy Synthes Co... |
| 3 | Engineer - Quality Job in Dixon | Why Join Altec? If you’re considering a career... |
| 4 | Shift Supervisor - Part-Time Job in Camphill | Position ID# 76162 # Positions 1 State CT C... |
| ... | ... | ... |
| 21995 | Assistant Vice President - Controller Job in C... | This is a major premier Cincinnati based finan... |
| 21996 | Accountant Job in Cincinnati | Luxury homebuilder in Cincinnati seeking multi... |
| 21997 | AEM/CQ developer Job in Chicago | RE: Adobe AEM- Client - Loca... |
| 21998 | Electrician - Experienced Forging Electrician ... | Jernberg Industries was established in 1937 an... |


## Dataframe exploration
What can you find out about the data?
(Are there null values? How many unique values does each column have?)

In [3]:
# Your code here
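
One possible starting point, sketched on a tiny made-up frame (`sample` is hypothetical, not the Monster data) so the numbers can be checked by eye. On the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, small enough to verify by hand.
sample = pd.DataFrame(
    {
        "job_title": ["Accountant", "Accountant", None, "Nurse"],
        "job_description": ["Books", "Ledgers", "Unknown role", "Care"],
    }
)

# Missing values per column.
null_counts = sample.isna().sum()

# Number of distinct non-null values per column.
unique_counts = sample.nunique()

# Rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(null_counts, unique_counts, duplicate_rows, sep="\n")
```

`df.info()` and `df.describe()` give a quick overview of the same kind of information.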

## Exploration and cleaning of the job title column

### What are the most common job titles?
Hint: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html

In [4]:
# Your code here
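
A minimal sketch of the hinted `value_counts` call, using a short made-up Series (`titles` is a placeholder for `df["job_title"]`):

```python
import pandas as pd

# Hypothetical titles standing in for df["job_title"].
titles = pd.Series(
    ["Accountant", "Nurse", "Accountant", "Engineer", "Accountant", "Nurse"]
)

# value_counts sorts by frequency, most common first.
top_titles = titles.value_counts().head(2)
print(top_titles)
```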

### Count the most frequently occurring words in the job_title column
Hint: https://docs.python.org/3/library/collections.html#collections.Counter.most_common

In [5]:
# Your code here
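
One way to follow the hint, shown on a plain list of made-up titles (a placeholder for `df["job_title"]`). Lowercasing and whitespace splitting are assumptions about how "word" should be defined here.

```python
from collections import Counter

# Hypothetical titles standing in for df["job_title"].
titles = [
    "IT Support Technician Job in Madison",
    "Accountant Job in Cincinnati",
    "Support Engineer Job in Madison",
]

# Lowercase and split on whitespace, then count every word across all titles.
word_counts = Counter(word for title in titles for word in title.lower().split())

print(word_counts.most_common(3))
```

Words such as "job" and "in" dominate the counts, which motivates the cleaning steps that follow.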

### Further ideas for exploration?

In [6]:
# Your code here

### Clean job_title column
Ideas:
- Remove jobs with Monster as title
- Replace the job title RN with Registered Nurse
- Other?

In [7]:
# Your code here
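
A sketch of the first two ideas on a made-up frame (`jobs` is a placeholder for `df`). Matching "Monster" case-insensitively and treating `RN` as a standalone word are assumptions, not rules taken from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for df.
jobs = pd.DataFrame(
    {"job_title": ["RN Job in Dallas", "Monster", "Senior RN", "HVAC INTERN"]}
)

# Drop rows whose whole title is just "Monster" (case-insensitive).
is_monster = jobs["job_title"].str.strip().str.lower() == "monster"
cleaned = jobs[~is_monster].copy()

# Expand the abbreviation only where "RN" is a standalone word,
# so that a title such as "HVAC INTERN" is left untouched.
cleaned["job_title"] = cleaned["job_title"].str.replace(
    r"\bRN\b", "Registered Nurse", regex=True
)

print(cleaned["job_title"].tolist())
```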

### Remove the unwanted words from job_title column
Ideas:
- Job
- Monster.com
- Other?

In [8]:
# Your code here
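
A possible approach with a regular expression. The `UNWANTED` list and the helper `remove_unwanted_words` are illustrative names, and the choice to match whole words only is an assumption.

```python
import re

# Assumed stop list; extend it after inspecting the word counts.
UNWANTED = ["Monster.com", "Job"]

# re.escape keeps the dot in "Monster.com" literal, and the lookarounds act as
# word boundaries so that "Jobs" or "Jobsite" are not truncated.
_pattern = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(w) for w in sorted(UNWANTED, key=len, reverse=True))
    + r")(?!\w)",
    flags=re.IGNORECASE,
)


def remove_unwanted_words(title: str) -> str:
    """Delete the unwanted words and collapse the leftover whitespace."""
    without_words = _pattern.sub(" ", title)
    return " ".join(without_words.split())


print(remove_unwanted_words("Accountant Job in Cincinnati"))
```

On the dataframe this could be applied with `df["job_title"].map(remove_unwanted_words)`.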

### Remove punctuation from job_title column

In [9]:
# Your code here
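
One option that needs only the standard library. The helper name `remove_punctuation` is illustrative. Note that `string.punctuation` covers ASCII punctuation only, so typographic characters such as `’` would need to be handled separately.

```python
import string

# Map every ASCII punctuation character to a space so that
# "Reporter/Editor" becomes two words rather than "ReporterEditor".
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def remove_punctuation(title: str) -> str:
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(title.translate(_PUNCT_TO_SPACE).split())


print(remove_punctuation("Business Reporter/Editor Job in Madison"))
```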

### Reset index of dataframe
The similarity search returns row positions, so to map them back to the dataframe we need to reset the index here, after all rows have been filtered.

In [10]:
# drop=True discards the old index instead of keeping it as an extra column
df = df.reset_index(drop=True)

# Universal Sentence Encoder

## Convert text into vectors with the Universal Sentence Encoder

In [11]:
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

### Iterate over all job titles and create list of embeddings

In [12]:
# Your code here
embeddings = embed(["test"])
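
With roughly 22,000 titles, encoding in batches may be gentler on memory than one large call. The sketch below shows the batching logic only: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder so the loop can be tried without downloading the model.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], Sequence[Sequence[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the rows."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start : start + batch_size])
        # embed_fn is expected to return one vector per input text.
        vectors.extend([list(row) for row in embed_fn(batch)])
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Placeholder for the real encoder: a 2-d vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


vectors = embed_in_batches(["nurse", "data engineer", "cook"], fake_embed, batch_size=2)
print(vectors)
```

With the real model, `embed_fn` could be something like `lambda batch: embed(batch).numpy()`. The result is a list of plain Python lists, so it would be converted with `np.array(vectors, dtype="float32")` before being handed to Faiss.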

In [13]:
print(embeddings[0])

tf.Tensor(
[-6.80234432e-02 -6.85853809e-02 -1.30280964e-02  1.88582931e-02
  2.57420018e-02 -4.86867055e-02  2.86325533e-03 -3.84496488e-02
  5.93511090e-02  4.94558886e-02  4.37081642e-02  3.49676684e-02
  8.89310762e-02  7.06300065e-02 -2.95681506e-02  7.23352507e-02
  2.85208728e-02  5.63974008e-02  9.06294733e-02  1.48673793e-02
 -6.72313869e-02 -1.69373080e-02 -3.11953388e-02  4.28184494e-02
  5.73504856e-03 -1.10957902e-02 -6.67554364e-02 -1.41196316e-02
  1.79266073e-02 -4.48298194e-02  2.98095383e-02 -4.44913581e-02
 -2.33741663e-02  3.66820097e-02  9.42519959e-03 -4.14091796e-02
 -6.11785203e-02 -4.18652408e-02  5.44798970e-02  8.00584853e-02
 -5.27739041e-02 -3.06957774e-02 -8.25393945e-02  7.32660946e-03
  2.29071057e-03 -8.70237220e-03 -9.87577531e-03 -3.66706476e-02
  6.82314411e-02 -1.26337688e-02  2.74885651e-02  6.64538369e-02
  8.37582722e-02  7.10683987e-02  1.49990693e-02 -1.93336774e-02
  6.04777224e-02 -5.57912663e-02  1.51881687e-02  8.50801021e-02
 -3.13722603e-

In [14]:
import numpy as np
np.shape(embeddings)

TensorShape([1, 512])

In [15]:
from tqdm.notebook import tqdm
embeddings_array = [item.numpy() for item in tqdm(embeddings)]


## Building a similarity search index with Faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM.    
https://github.com/facebookresearch/faiss

### Import faiss and create an index with our embeddings

In [16]:
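import faiss
import numpy as np

# Faiss expects a contiguous float32 matrix of shape (n_vectors, dimensions).
embedding_matrix = np.ascontiguousarray(embeddings_array, dtype="float32")
dimensions = embedding_matrix.shape[1]

# IndexFlatL2 performs exact (brute-force) search using L2 distance.
faiss_index = faiss.IndexFlatL2(dimensions)
faiss_index.add(embedding_matrix)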

### Search for nearest neighbour in faiss
Write code that takes a text, transforms it into a vector, and retrieves the most similar vectors from faiss_index.    
Hint (searching the index): https://github.com/facebookresearch/faiss/wiki/Getting-started

In [17]:
# Your code here
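
In Faiss the lookup is `distances, ids = faiss_index.search(query, k)`, where `query` is a float32 array with one row per query text. To illustrate what a flat L2 index computes without needing Faiss or the encoder, here is a NumPy-only sketch. `flat_l2_search` is a hypothetical helper, and the 2-d vectors are made up.

```python
import numpy as np


def flat_l2_search(vectors: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force k-nearest-neighbour search under squared L2 distance.

    Returns (distances, ids), both of shape (n_queries, k), sorted by
    increasing distance, which is the same layout faiss_index.search uses.
    """
    # Pairwise differences via broadcasting: (n_queries, n_vectors, dim).
    diffs = queries[:, None, :] - vectors[None, :, :]
    squared = (diffs ** 2).sum(axis=-1)
    ids = np.argsort(squared, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(squared, ids, axis=1)
    return distances, ids


# Hypothetical 2-d "embeddings"; the real ones have 512 dimensions.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")

distances, ids = flat_l2_search(vectors, query, k=2)
print(ids, distances)
```

With the notebook's objects, the query would come from `embed([text]).numpy()` and be passed to `faiss_index.search` in place of `flat_l2_search`.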

### Write a function that returns similar jobs for an input text

For each query vector, the search operation returns the ids (row numbers in the vector store) of the k most similar vectors, along with their respective distances. Use these ids to return the matching rows of our dataframe.

In [19]:
def get_most_similar_jobs(text: str, num_recos: int = 5) -> pd.DataFrame:
    # Your code here
    pass
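
A sketch of the id-to-row mapping, with the encoder and the index passed in as functions so it can be tried without TensorFlow or Faiss. `most_similar_rows`, `toy_embed` and `toy_search` are all hypothetical; the toy embedding (title length) is only there to make the example runnable.

```python
from typing import Callable, Sequence, Tuple

import pandas as pd


def most_similar_rows(
    text: str,
    jobs: pd.DataFrame,
    embed_fn: Callable[[str], Sequence[float]],
    search_fn: Callable[[Sequence[float], int], Tuple[Sequence[float], Sequence[int]]],
    num_recos: int = 5,
) -> pd.DataFrame:
    """Embed text, look up its nearest neighbours, and return those rows."""
    query_vector = embed_fn(text)
    distances, ids = search_fn(query_vector, num_recos)
    # ids are positions in the vector store, so select rows by position.
    result = jobs.iloc[list(ids)].copy()
    result["distance"] = list(distances)
    return result


# Toy stand-ins: a 1-d "embedding" equal to the text length.
jobs = pd.DataFrame({"job_title": ["Cook", "Python Developer", "Nurse"]})
stored = [[float(len(title))] for title in jobs["job_title"]]


def toy_embed(text: str) -> Sequence[float]:
    return [float(len(text))]


def toy_search(query: Sequence[float], k: int):
    scored = sorted(((vec[0] - query[0]) ** 2, i) for i, vec in enumerate(stored))
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]


recos = most_similar_rows("Python engineer", jobs, toy_embed, toy_search, num_recos=2)
print(recos)
```

In the notebook, `embed_fn` and `search_fn` would wrap `embed([text]).numpy()` and `faiss_index.search(...)`. Both return 2-d results with one row per query, so the first row (`ids[0]`, `distances[0]`) would be taken before indexing into `df`.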

In [21]:
get_most_similar_jobs("I want to develop in python")