### Job Recommender

The recruitment industry generates a high volume of data, so data-driven approaches can help a company reach its OKRs. <br />
This demo gives a big-picture view of a job recommendation system. The concepts are illustrated with data and basic examples. <br />
It is assumed that only a rudimentary infrastructure is in place to capture the data, so no labels are available and the problem is treated as unsupervised. In an enterprise setting, one could boost the recommender's accuracy by feeding back the outcomes of applicants' applications. <br />
There are 3 sources of data:
* Jobs
* Applications
* Interactions

The project uses a two-phase architecture: phase 1 applies a search engine to find probable positions for an applicant, and phase 2 deploys a user-based collaborative filter to imitate the actions of similar applicants.

First, let's retrieve the data and extract the useful features.

In [1]:
import pandas as pd

raw_jobs = pd.read_csv('data/Combined_Jobs_Final.csv', index_col='Job.ID')
# Select useful features
raw_jobs = raw_jobs[['Title', 'Position', 'Company', 'Job.Description']]
# Replace NAN values with empty string
raw_jobs.fillna('', inplace=True)
# Sort data by job IDs
raw_jobs.sort_index(axis = 0, inplace=True) 

raw_jobs.head(5)

Job.ID,Title,Position,Company,Job.Description
3,Customer Service @ Bayer healthcare,Customer Service,Bayer healthcare,Candidates should be familiar with Microsoft O...
28,Kitchen Staff/Chef @ Pacific Catch,Kitchen Staff/Chef,Pacific Catch,"OVERVIEW\r\nPacific Catch, the Bay Area's hott..."
30,Bartender @ Dave's American Bistro,Bartender,Dave's American Bistro,Work and maintain fast pace bar. Knowledge of ...
33,Server @ Haven,Server,Haven,"Located in Oaklandâ€™s Jack London Square, Ha..."
35,Kitchen Staff @ Skool,Kitchen Staff,Skool,Featuring a wide variety of seafood dishes and...


The raw data contains many noisy characters and words, so it must be cleaned before use. The function below strips punctuation, removes stop words, and applies stemming.

In [2]:
# Importing the required libraries and tools
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')


def text_processing(df, col_name):
    clean_col = 'clean_{}'.format(col_name)

    # punctuation cleaning (regex=True must be explicit since pandas 2.0)
    df[clean_col] = df[col_name].str.replace(r'[^\w\s]', '', regex=True)

    # tokenizing phase
    df[clean_col] = df[clean_col].apply(nltk.word_tokenize)

    # stopwords cleaning (a set makes the membership test fast)
    stop = set(stopwords.words('english'))
    df[clean_col] = df[clean_col]. \
        apply(lambda words: [word for word in words if word not in stop])

    # stemming (PorterStemmer also lowercases each token by default)
    ps = PorterStemmer()
    df[clean_col] = df[clean_col]. \
        apply(lambda words: [ps.stem(word) for word in words])

    # join the tokens back into a single string
    df[clean_col] = df[clean_col].apply(lambda words: ' '.join(words))

    return df[clean_col]

[nltk_data] Downloading package punkt to /home/amin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/amin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Concatenate columns
jobs = raw_jobs['Title'] + ' ' + raw_jobs['Position'] + ' ' +  raw_jobs['Job.Description']
# Convert object into Pandas data-frame
jobs = pd.DataFrame(jobs, columns=['aggregated'])
jobs['aggregated'] = text_processing(jobs.copy(), 'aggregated')
jobs.head(5)

Job.ID,aggregated
3,custom servic bayer healthcar custom servic ca...
28,kitchen staffchef pacif catch kitchen staffche...
30,bartend dave american bistro bartend work main...
33,server haven server locat oaklandâ jack london...
35,kitchen staff skool kitchen staff featur wide ...


As the result above shows, the data is textual, so we need to convert it into numeric features that can be fed into machine-learning/NLP algorithms and tools. This procedure is called vectorization, and TF-IDF is one of its most prominent approaches.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
tfidf_matrix = v.fit_transform(jobs['aggregated'])
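
To make the weighting concrete, here is a minimal sketch on a toy corpus (the three `docs` strings and the `toy_` names are made up for illustration, not taken from the dataset). It assumes scikit-learn's default `TfidfVectorizer` settings, under which the inverse document frequency is smoothed as `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-cleaned text (illustrative only).
docs = ["data entri clerk", "kitchen staff chef", "data analyst team"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs).toarray()

# Document frequency: in how many documents each term appears.
doc_freq = (toy_matrix > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
n_docs = len(docs)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Each row is L2-normalised, so every document vector has length 1.
row_norms = np.linalg.norm(toy_matrix, axis=1)

# Terms ordered by their column position in the matrix.
vocab = toy_vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)
print(dict(zip(terms, manual_idf.round(3))))
print(row_norms)
```

A term that appears in many documents (here `data`) gets a lower IDF than a term that appears in only one, which is what lets rare, descriptive words dominate the similarity score.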

In [5]:
from sklearn.metrics.pairwise import linear_kernel

def retrieve_similar_items(tfidf_matrix, data, _text, k = 3):
    query_vector = v.transform([_text])
    similarities = linear_kernel(query_vector, tfidf_matrix).flatten()
    scores = pd.DataFrame(list(zip(similarities, data.index)), columns=['score', 'Job.ID'])
    result = pd.merge(data, scores, how='inner', on='Job.ID').sort_values('score', axis=0, ascending=False)
    return result.head(k)
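
The function scores jobs with `linear_kernel`, i.e. a plain dot product. Since `TfidfVectorizer` L2-normalises its rows by default, that dot product should coincide with cosine similarity. The sketch below checks this on a made-up toy corpus (the `docs` strings and the `toy_` names are illustrative assumptions).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus of already-cleaned text (illustrative only).
docs = ["data entri clerk", "kitchen staff chef", "data analyst team"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)
query_vector = toy_vectorizer.transform(["data team"])

# Dot product vs. cosine similarity between the query and every document.
dot_scores = linear_kernel(query_vector, toy_matrix).flatten()
cos_scores = cosine_similarity(query_vector, toy_matrix).flatten()

# Index of the best-matching document.
best = int(np.argmax(dot_scores))
print(dot_scores, best)
```

Using the dot product directly skips the extra normalisation step that `cosine_similarity` performs, which matters when the job matrix is large.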

The next cell demonstrates the model with a query that describes an applicant's attributes. <br />
The `score` column holds the similarity between each result and the query.

In [6]:
query = "data entry team"
results = retrieve_similar_items(tfidf_matrix, raw_jobs, query)
results

Unnamed: 0,Job.ID,Title,Position,Company,Job.Description,score
78527,313610,Data Entry Clerk @ Accountemps,Data Entry Clerk,Accountemps,Ref ID:00070-109384Classification:Data Entry C...,0.56142
56245,285660,Data Entry Clerk @ OfficeTeam,Data Entry Clerk,OfficeTeam,Ref ID: 01500-9738660Classification: Data Entr...,0.537745
73492,307144,Data Entry Clerk / Data Entry Specialist (Data...,Data Entry Clerk / Data Entry Specialist (Data...,OfficeTeam,Ref ID: 00110-9742612Classification: Data Entr...,0.529848
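
Because the vectorizer was fitted on stemmed text, a raw query token only contributes to the score when it happens to equal its own stem (for instance, `entry` is stored as `entri`). One possible refinement, sketched below under the assumption that queries should receive the same cleaning as the job texts, is a small helper (the name `clean_query` is hypothetical) that strips punctuation and stems each token. Stop-word removal and NLTK tokenization are left out to keep the sketch free of corpus downloads.

```python
import re

from nltk.stem import PorterStemmer


def clean_query(text):
    """Strip punctuation and stem each whitespace-separated token (hypothetical helper)."""
    ps = PorterStemmer()
    no_punct = re.sub(r'[^\w\s]', '', text)
    # PorterStemmer.stem lowercases tokens by default.
    return ' '.join(ps.stem(token) for token in no_punct.split())


cleaned = clean_query("data entry team")
print(cleaned)
```

The cleaned string could then be passed in place of the raw query, e.g. `retrieve_similar_items(tfidf_matrix, raw_jobs, clean_query(query))`; the scores would differ from the ones shown above.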


Let's investigate the nature of applicants' data and interactions.

In [7]:
candidates = pd.read_csv('data/Experience.csv', index_col='Applicant.ID')
candidates = candidates[['Position.Name', 'Employer.Name', 'Job.Description']]
candidates.fillna('', inplace=True)
candidates.sort_index(axis = 0, inplace=True) 

candidates.head(5)

Applicant.ID,Position.Name,Employer.Name,Job.Description
2,Volunteer,School for Self-Healing,* Read aloud Meir Schneider's books and record...
2,Writer for the Uloop Blog,Cecilia Abate,"* Wrote articles for the ""Uloop Blog,"" which i..."
3,Marketing Intern,Honda,
3,Server,Aloha Beach Resort,
3,Prep Cook,Moscone Center,


In [8]:
actions = pd.read_csv('data/Job_Views.csv')
actions = actions[['Applicant.ID', 'Job.ID']]
actions.sort_index(axis = 0, inplace=True) 
actions.head(5)

Unnamed: 0,Applicant.ID,Job.ID
0,10000,73666
1,10000,96655
2,10001,84141
3,10002,77989
4,10002,69568


Let's check how many applicants are active, i.e. have at least one job view.

In [9]:
import numpy as np

all_candidates_idxs = candidates.index.values
action_idxs = actions['Applicant.ID'].values
active_indexes = list(set(all_candidates_idxs).intersection(action_idxs))

active_candidates = candidates[candidates.index.isin(active_indexes)].sort_index(axis = 0, inplace=False) 
print('Size of all candidates matrix:    {}'.format(candidates.shape))
print('Size of active candidates matrix: {}'.format(active_candidates.shape))
active_candidates.head(5)

Size of all candidates matrix:    (8653, 3)
Size of active candidates matrix: (2178, 3)


Applicant.ID,Position.Name,Employer.Name,Job.Description
42,Courtesy Clerk,Safeway,
42,Street Marketer,Media Nation,
96,Cashiet/Waiter,Miss Saigon,"Place table for customers, provide water or te..."
96,Receptionist,CCSF Duplicatiing,Greeting professors and faculty staff.
96,Cashier,Honey Berry,Greeting people and introducing/recommend food...
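
The shapes above count experience rows rather than distinct applicants, since one applicant can have several rows: 2178 of 8653 rows (about 25%) belong to applicants with job views. A minimal sketch of computing the share of distinct active applicants follows; the two toy frames are made-up stand-ins for `candidates` and `actions`.

```python
import pandas as pd

# Toy stand-in for `candidates`: indexed by Applicant.ID, one row per past position.
toy_candidates = pd.DataFrame(
    {'Position.Name': ['Server', 'Cook', 'Cashier', 'Clerk']},
    index=pd.Index([1, 1, 2, 3], name='Applicant.ID'),
)
# Toy stand-in for `actions`: one row per job view.
toy_actions = pd.DataFrame(
    {'Applicant.ID': [1, 3, 3, 9], 'Job.ID': [10, 11, 12, 13]}
)

# Distinct applicants, and those among them with at least one job view.
all_ids = set(toy_candidates.index)
active_ids = all_ids & set(toy_actions['Applicant.ID'])

active_share = 100 * len(active_ids) / len(all_ids)
print('Active applicants: {:.1f}%'.format(active_share))
```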


The second phase is the collaborative-filtering (CF) step: it finds similar applicants so that the job positions they interacted with can be offered to other users.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

candidates['aggr_data'] = candidates['Position.Name'] + \
                            ' ' + candidates['Employer.Name'] + \
                            ' ' +  candidates['Job.Description']

candidates['aggr_data'] = text_processing(candidates.copy(), 'aggr_data')
candidates.head(5)

v2 = TfidfVectorizer()
tfidf_matrix_user = v2.fit_transform(candidates['aggr_data'])

In [11]:
from sklearn.metrics.pairwise import linear_kernel

def retrieve_similar_users(tfidf_matrix, data, _text, candid_id, k = 3):
    query_vector = v2.transform([_text])
    similarities = linear_kernel(query_vector, tfidf_matrix).flatten()
    scores = pd.DataFrame(list(zip(similarities, data.index)), columns=['score', 'Applicant.ID'])
    result = pd.merge(data, scores, how='inner', on='Applicant.ID').sort_values('score', axis=0, ascending=False)
    result = result[result['Applicant.ID'] != candid_id]
    return result.head(k)
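
`Applicant.ID` is not unique in `candidates` (there is one row per past position), so merging the scores on that key pairs every score of an applicant with every row of that applicant. One alternative, sketched here on made-up toy data with a hypothetical helper `rank_similar_applicants`, attaches the scores positionally and keeps the best score per applicant.

```python
import numpy as np
import pandas as pd


def rank_similar_applicants(data, similarities, candid_id, k=3):
    """Return the k most similar applicants as a Series of best scores (hypothetical helper)."""
    # Row i of `data` produced similarities[i], so assign by position.
    scored = data.assign(score=np.asarray(similarities))
    # Drop the query applicant's own rows.
    scored = scored[scored.index != candid_id]
    # Collapse multiple experience rows into one score per applicant.
    best = scored.groupby(level=0)['score'].max()
    return best.sort_values(ascending=False).head(k)


# Toy data: applicant 5304 has two experience rows.
toy = pd.DataFrame(
    {'Position.Name': ['Sales Associate', 'Cashier', 'Server', 'Sales Associate']},
    index=pd.Index([1001, 5304, 5304, 9944], name='Applicant.ID'),
)
toy_scores = [1.0, 0.1, 0.4, 0.3]

top = rank_similar_applicants(toy, toy_scores, candid_id=1001, k=2)
print(top)
```

Taking the maximum per applicant is only one design choice; a mean or a sum over the applicant's rows would reward breadth of matching experience instead of the single best match.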

To illustrate the second phase, let's look up an applicant by ID and then find similar applicants.

In [12]:
applicant_id = 1001
applicant = candidates.query('index == {}'.format(applicant_id))
applicant

Applicant.ID,Position.Name,Employer.Name,Job.Description,aggr_data
1001,Sales Associate,Athleta,Assist customers on sales floor and in fitting...,sale associ athleta assist custom sale floor f...


In [13]:
dfs = []
for index, row in applicant.iterrows():
    r = retrieve_similar_users(tfidf_matrix_user, candidates, row['aggr_data'], applicant_id, k = 3)
    dfs.append(r)
df = pd.concat(dfs).drop(['aggr_data'], axis=1).sort_values('score', axis=0, ascending=False)
df.head(10)

Unnamed: 0,Applicant.ID,Position.Name,Employer.Name,Job.Description,score
12201,5304,Sales Associate,Ross,"Fitting room, customer service, sales floor, a...",0.403909
23800,9944,Team Member,Flyers Energy,"Receive payment from customers, ensure that re...",0.336311
23805,9944,Seasonal Sales Associate,Macy's,Greet and make a connection with each customer...,0.336311


As a result, the jobs that applicant #5304, the most similar candidate, interacted with can be recommended to applicant #1001.
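
This last step can be sketched as a lookup in the job-view data. The helper name `jobs_viewed_by` and the toy views below are assumptions made for illustration; the frames mimic the layout of `actions` and `raw_jobs`.

```python
import pandas as pd


def jobs_viewed_by(actions, jobs, similar_ids, exclude_id):
    """Jobs viewed by similar applicants that the target applicant has not viewed (hypothetical helper)."""
    # Jobs the target applicant has already seen.
    seen = set(actions.loc[actions['Applicant.ID'] == exclude_id, 'Job.ID'])
    # Jobs viewed by the similar applicants.
    viewed = actions.loc[actions['Applicant.ID'].isin(similar_ids), 'Job.ID']
    new_ids = [job_id for job_id in viewed.drop_duplicates() if job_id not in seen]
    return jobs[jobs.index.isin(new_ids)]


# Toy job views (made up) and a few job titles in the layout of `raw_jobs`.
toy_actions = pd.DataFrame(
    {'Applicant.ID': [1001, 5304, 5304, 9944], 'Job.ID': [3, 3, 28, 30]}
)
toy_jobs = pd.DataFrame(
    {'Title': ['Customer Service @ Bayer healthcare',
               'Kitchen Staff/Chef @ Pacific Catch',
               "Bartender @ Dave's American Bistro",
               'Server @ Haven']},
    index=pd.Index([3, 28, 30, 33], name='Job.ID'),
)

recommended = jobs_viewed_by(toy_actions, toy_jobs, similar_ids=[5304], exclude_id=1001)
print(recommended)
```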