<a href="https://colab.research.google.com/github/Zantorym/Aidi-capstone-I/blob/main/Final%20code/02_Similarity_Finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RecSys


## Code Setup

In [1]:
# Timer to measure code execution time
!pip install ipython-autotime
%load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 2.61 ms (started: 2021-12-11 21:49:32 +00:00)


In [13]:
# Importing libraries
import pickle
import nltk
import pandas as pd
import numpy as np
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import ngrams
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

time: 17.5 ms (started: 2021-12-11 21:52:51 +00:00)


In [3]:
'''
Constants used throughout the program

These constants include file paths as well as options for whether to apply
certain pre-processing and processing techniques. This was implemented so that
the program could be as modular as possible, allowing us to test various
combinations of techniques in order to find the most effective ones.
'''

# Pickle Input
JD_FILES_PICKLE_OUTPATH='/content/drive/MyDrive/Durham College/Capstone - I/data/Datasets/jds.pickle'
RESUME_FILES_PICKLE_OUTPATH='/content/drive/MyDrive/Durham College/Capstone - I/data/Datasets/resumes.pickle'

# Test file path
EVAL_MATRIX_FILE_PATH = '/content/drive/MyDrive/Durham College/Capstone - I/Evaluation_Matrix.xlsx'

# Tokenization
'''
0 - string split
1 - NLTK TreebankWordTokenizer
Note: any other value (including 0) falls back to string split
'''
TOKENIZATION_ALGORITHM=0


# NGrams
NGRAM_COUNT=1 # Size n of the n-grams to generate (1 = unigrams only)

# Stop Words
FILTER_STOP_WORDS=1 # 1 to filter stop words, 0 to not filter stop words

# The Source for stop words
'''
1 - Use NLTK stop words
2 - Use Scikit Learn stop words
Note: any other value (including 0) uses the intersection of the NLTK and Scikit-learn lists
'''
STOP_WORDS_SOURCE=0

# Case Folding
# Note: case folding is always performed, since job descriptions and resumes
#       rarely rely on capitalization to distinguish proper nouns from
#       common words.

# Stemming
STEMMER_ALGORITHM=0
'''
1 = Use Porter stemmer
2 = Use Snowball stemmer
Note: any other value (including 0) means no stemming is performed
'''

# Lemmatization
# Note: Cannot perform lemmatization as punctuation is removed from source text.
#       Lemmatization requires parts of speech to work properly.

# Filtering non-alphanumeric tokens (and tokens that contain no letters)
'''
0 = Don't filter
1 = Filter
'''
FILTER_NON_ALPHANUMERIC_TOKENS = 1

PREPROCESS_METHOD = 1 # 0 if we don't want to pre-process, 1 if we want to pre-process
VECTORIZATION_METHOD = 1 # 0 for TF-IDF, 1 for bag of words
SIMILARITY_METHOD = 0 # 0 for cosine similarity, 1 for euclidean distance

time: 17.1 ms (started: 2021-12-11 21:49:37 +00:00)


In [4]:
# Attaching google drive to colab instance
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
time: 34.7 s (started: 2021-12-11 21:49:40 +00:00)


## Pre-processing

In [5]:
# Loading the dataset from pickled files
jd_files_dict, resume_files_dict = {}, {}  # Two separate dicts rather than two names for one dict
with open(JD_FILES_PICKLE_OUTPATH, 'rb') as fh:
  jd_files_dict = pickle.load(fh)
with open(RESUME_FILES_PICKLE_OUTPATH, 'rb') as fh:
  resume_files_dict = pickle.load(fh)

time: 8.08 s (started: 2021-12-11 21:50:28 +00:00)


In [6]:
print('Count of JDs:', len(jd_files_dict))
print('Count of Resumes:', len(resume_files_dict))

Count of JDs: 151210
Count of Resumes: 50023
time: 1.96 ms (started: 2021-12-11 21:50:36 +00:00)


In [15]:
'''
Tokenization function used in vectorization

args:
  * text - the input text to be tokenized

returns:
  * A single space-separated string of the processed tokens
'''
def tokenize(text):
  # Choosing the list of stop words to use
  if FILTER_STOP_WORDS == 1:
    if STOP_WORDS_SOURCE != 2:
      nltk_stop_words = nltk.corpus.stopwords.words('english')
    
    if STOP_WORDS_SOURCE != 1:
      from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words

    if STOP_WORDS_SOURCE == 1:
      stop_words = nltk_stop_words
    elif STOP_WORDS_SOURCE == 2:
      stop_words = sklearn_stop_words
    else:
      stop_words = sklearn_stop_words.intersection(nltk_stop_words)

  # Choosing the stemmer algorithm to use
  if STEMMER_ALGORITHM == 1 or STEMMER_ALGORITHM == 2:
    from nltk.stem.snowball import SnowballStemmer
    if STEMMER_ALGORITHM == 1:
      stemmer = SnowballStemmer(language='porter')
    elif STEMMER_ALGORITHM == 2:
      stemmer = SnowballStemmer(language='english')

  tokenized = [] # Variable to store the tokenized version of input text
  
  # Tokenizing the words and converting them to lowercase
  if TOKENIZATION_ALGORITHM == 1:
    tokenized = TreebankWordTokenizer().tokenize(text.lower())
  else:
    tokenized = text.lower().split()

  # Filtering stop words and words of length less than 2
  if FILTER_STOP_WORDS == 1:
    tokenized = [token for token in tokenized if (token not in stop_words and len(token)>1)]

  # Applying the stemmer algorithm
  if STEMMER_ALGORITHM == 1 or STEMMER_ALGORITHM == 2:
    tokenized = [stemmer.stem(token) for token in tokenized]

  # Keeping only alphanumeric tokens that contain at least one letter.
  # This must run before n-gram generation: n-gram tokens contain spaces,
  # so filtering afterwards would discard every n-gram.
  if FILTER_NON_ALPHANUMERIC_TOKENS == 1:
    tokenized = [
      token for token in tokenized
      if all(char.isalpha() or char.isdigit() for char in token)
      and any(char.isalpha() for char in token)
    ]

  # Applying ngrams
  if NGRAM_COUNT > 1:
    # Handle files that are "empty", i.e. contain only spaces
    if len(tokenized) == 0:
      ngram_tokens = []
    else:
      ngram_tokens = [' '.join(t) for t in ngrams(tokenized, NGRAM_COUNT)]
    tokenized += ngram_tokens

  string = ' '.join(tokenized) # Convert tokenized list to string

  return string

time: 39.5 ms (started: 2021-12-11 21:53:06 +00:00)
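
To make the effect of the default settings concrete (string split, stop-word filtering, no stemming, alphanumeric filter), here is a minimal standalone sketch of the same steps. The name `simple_tokenize` and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins for the notebook's `tokenize` and the NLTK/scikit-learn stop-word lists, chosen so that the block runs without any downloads.

```python
# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook itself uses the NLTK and scikit-learn lists.
DEMO_STOP_WORDS = {"the", "a", "an", "and", "in", "of", "with", "is"}


def simple_tokenize(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, split on whitespace, drop stop words and 1-character tokens,
    then keep only alphanumeric tokens that contain at least one letter."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    tokens = [
        t for t in tokens
        if all(c.isalpha() or c.isdigit() for c in t)
        and any(c.isalpha() for c in t)
    ]
    return " ".join(tokens)


demo = simple_tokenize("Experience with Python3 and SQL, 5 years in the field.")
print(demo)  # experience python3 years
```

Note that `sql,` and `field.` are dropped entirely: with plain string splitting, punctuation stays attached to the word, and the alphanumeric filter then rejects the whole token.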


In [11]:
# Combining both corpora into one dictionary
combined_files_dict = {}
for filename in jd_files_dict:
  modified_fn = 'jd:' + filename
  if len(jd_files_dict[filename].strip()) == 0:
    continue
  combined_files_dict[modified_fn] = jd_files_dict[filename]
for filename in resume_files_dict:
  modified_fn = 'rs:' + filename
  if len(resume_files_dict[filename].strip()) == 0:
    continue
  combined_files_dict[modified_fn] = resume_files_dict[filename]

# Storing the filenames in a list
jd_filenames = [key for key in combined_files_dict.keys() if key.startswith('jd:')]
resume_filenames = [key for key in combined_files_dict.keys() if key.startswith('rs:')]

# Converting corpus from dictionary to dataframe
corpus_raw = pd.DataFrame.from_dict(combined_files_dict, orient='index', columns=['text'])

time: 415 ms (started: 2021-12-11 21:51:43 +00:00)


In [16]:
# Applying some pre-processing and tokenization to the dataset
if PREPROCESS_METHOD == 1:
  corpus_raw['text'] = corpus_raw['text'].apply(tokenize)

# Applying vectorization
if VECTORIZATION_METHOD == 0:
  vectorizer = TfidfVectorizer()
  corpus_vectors = vectorizer.fit_transform(corpus_raw['text'])
else:
  vectorizer = CountVectorizer()
  corpus_vectors = vectorizer.fit_transform(corpus_raw['text'])

corpus_filenames = corpus_raw.index.values # List of file names in the corpus

time: 2min 28s (started: 2021-12-11 21:53:09 +00:00)
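
The difference between the two vectorization options can be seen on a toy corpus. The sketch below is pure Python and only illustrative. The three documents are made up, and the smoothed IDF formula with L2 normalization is our understanding of scikit-learn's default `TfidfVectorizer` behaviour, not something taken from this notebook.

```python
import math
from collections import Counter

# Made-up, already tokenized documents (illustration only).
docs = [
    "python developer sql",
    "java developer",
    "python data analyst sql sql",
]

# Sorted vocabulary gives a stable column order for the vectors.
vocab = sorted({term for doc in docs for term in doc.split()})

# Bag of words: raw term counts per document.
counts = [Counter(doc.split()) for doc in docs]
bow = [[c[term] for term in vocab] for c in counts]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
df = {term: sum(1 for c in counts if term in c) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# TF-IDF: counts weighted by IDF, then normalized to unit length.
tfidf = [
    l2_normalize([row[i] * idf[term] for i, term in enumerate(vocab)])
    for row in bow
]

print(vocab)   # ['analyst', 'data', 'developer', 'java', 'python', 'sql']
print(bow[2])  # [1, 1, 0, 0, 1, 2]
```

Terms that occur in fewer documents (here `java`) receive a larger IDF weight than terms shared by several documents (here `developer`), which is the main thing TF-IDF adds over raw counts.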


## Loading test files for evaluation of models

In [17]:
eval_matrix = pd.ExcelFile(EVAL_MATRIX_FILE_PATH)
jd2jd = pd.read_excel(eval_matrix, 'JD_2_JD') # JD_2_JD testing dataset
r2r = pd.read_excel(eval_matrix, 'Resume_2_Resume') # Resume_2_Resume testing dataset
jd2r = pd.read_excel(eval_matrix, 'JD_2_Resume') # JD_2_Resume testing dataset
r2jd = pd.read_excel(eval_matrix, 'Resume_2_JD') # Resume_2_JD testing dataset

jd2jd = jd2jd.drop('Contributer', axis=1) # Removing the contributor column (spelled 'Contributer' in this sheet)
jd2jd.set_index('Query_File_ID', inplace=True) # Making Query_File_ID the index

r2r = r2r.drop('Contributor', axis=1) # Removing the contributor column
r2r.set_index('Query_File_ID', inplace=True) # Making Query_File_ID the index

jd2r = jd2r.drop('Contributor', axis=1) # Removing the contributor column
jd2r.set_index('Query_File_ID', inplace=True) # Making Query_File_ID the index

r2jd = r2jd.drop('Contributor', axis=1) # Removing the contributor column
r2jd.set_index('Query_File_ID', inplace=True) # Making Query_File_ID the index

time: 444 ms (started: 2021-12-11 21:55:38 +00:00)


## Similarity score

In [18]:
def find_similarity(test_file, jd_or_resume = 0):
  """
  Given a test file, it returns the similarity score against that file for all JDs/Resumes

  params:
    test_file: One entry from the corpus representing a JD or Resume
    jd_or_resume: Whether to compare against JDs or to compare against resumes
                  0 - JDs
                  1 - Resumes
  
  returns: A pandas dataframe containing the similarity scores of all the required files
  """

  sample_index = np.where(corpus_filenames == test_file)[0][0]
  sample = corpus_vectors[sample_index]
  
  test_output = []
  if SIMILARITY_METHOD == 0:
    test_output = cosine_similarity(corpus_vectors, sample)
  else:
    test_output = euclidean_distances(corpus_vectors, sample)

  test = pd.DataFrame(test_output, index = corpus_filenames, columns = ['similarity'])

  if jd_or_resume == 0:
    test = test.loc[ jd_filenames, : ]
  else:
    test = test.loc[ resume_filenames, : ]

  if SIMILARITY_METHOD == 0:
    test.sort_values(by=['similarity'], ascending = False, inplace = True) # Cosine
  else:
    test.sort_values(by=['similarity'], ascending = True, inplace = True) # Euclidean

  return test

time: 16.1 ms (started: 2021-12-11 21:55:38 +00:00)
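
The two similarity options can rank the same candidates differently, which is why `find_similarity` sorts in descending order for cosine similarity and ascending order for Euclidean distance. The following pure-Python sketch uses made-up three-dimensional count vectors. The helper names `cosine` and `euclidean` are hypothetical and stand in for the scikit-learn functions used above.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean(a, b):
    """Euclidean distance: 0.0 means identical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1, 1, 0]
candidates = {
    "doc_long": [10, 10, 0],  # same term proportions as the query, much longer
    "doc_near": [1, 0, 0],    # close in space, but different proportions
}

# Higher cosine similarity is better, so sort descending.
by_cosine = sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)
# Lower Euclidean distance is better, so sort ascending.
by_euclid = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))

print(by_cosine)  # ['doc_long', 'doc_near']
print(by_euclid)  # ['doc_near', 'doc_long']
```

On raw counts, Euclidean distance penalizes documents that are simply longer than the query, whereas cosine similarity only compares the direction of the vectors.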


## MAP Score

In [19]:
def generate_MAP_input(test_file_type = 0):
  """
  Generates the two inputs required for calculating the MAP score

  params:
          test_file_type: which of the 4 test files it is (default = 0)
                          0 - JD_2_JD
                          1 - Resume_2_Resume
                          2 - JD_2_Resume
                          3 - Resume_2_JD

  returns:
          the two inputs for the MAP score function
  """
  actual = []
  predicted = []

  test_file = []
  ind_prefix = ''
  res_prefix = ''
  jd_or_resume = 0

  if test_file_type == 0:
    test_file = jd2jd
    ind_prefix = 'jd:'
    res_prefix = 'jd:'
  elif test_file_type == 1:
    test_file = r2r
    ind_prefix = 'rs:'
    res_prefix = 'rs:'
    jd_or_resume = 1
  elif test_file_type == 2:
    test_file = jd2r
    ind_prefix = 'jd:'
    res_prefix = 'rs:'
    jd_or_resume = 1
  else:
    test_file = r2jd
    ind_prefix = 'rs:'
    res_prefix = 'jd:'

  for index, row in test_file.iterrows():
    # List of files relevant to the query file in the testing document.
    # Some query files appear more than once in the sheet (e.g. lines 141 and
    # 142 of JD_2_JD are the same). In that case .loc returns a DataFrame,
    # which has no tolist(), so we fall back to the first matching row.
    try:
      relevant_files = test_file.loc[index].tolist()
    except AttributeError:
      relevant_files = test_file.loc[index].iloc[0].tolist()
    relevant_files = [res_prefix + file for file in relevant_files if not(pd.isnull(file))]

    
    # Finding files relevant to the query file using our code
    test = find_similarity(ind_prefix + index, jd_or_resume)

    # Removing top result if it is the same as the query file
    if test.index[0] == ind_prefix + index:
      test = test.iloc[1: , :]

    predicted_files = test.head(len(relevant_files)).index # Getting the top predicted files

    actual.append(relevant_files)
    predicted.append(predicted_files)

  return predicted, actual

time: 33.5 ms (started: 2021-12-11 21:55:38 +00:00)


In [20]:
"""
A function to calculate the precision@k.

Input: Two lists and a number.
      - 'predicted' is the list of file names that our algorithm generates in response to a specific query
      - 'actual' is the list of file names that our AI algorithm is supposed to return
      - 'k' is the k-index for which we're supposed to calculate the precision@k

Output: A number denoting the precision@k
"""

def precision_at_k(predicted, actual, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    result = len(act_set & pred_set) / float(k)
    return result

time: 4.17 ms (started: 2021-12-11 21:55:38 +00:00)
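
As a quick worked example of precision@k, the sketch below repeats the function so that the block is self-contained and evaluates it on made-up file names (the `rs:` identifiers are hypothetical).

```python
def precision_at_k(predicted, actual, k):
    """Fraction of the top-k predictions that appear in the relevant set."""
    act_set = set(actual)
    pred_set = set(predicted[:k])
    return len(act_set & pred_set) / float(k)


# Made-up ranking and ground truth for a single query.
predicted = ["rs:a", "rs:b", "rs:c", "rs:d"]
actual = ["rs:b", "rs:d", "rs:x"]

p1 = precision_at_k(predicted, actual, 1)  # top-1 is 'rs:a', not relevant
p2 = precision_at_k(predicted, actual, 2)  # 'rs:b' is relevant: 1 of 2
p4 = precision_at_k(predicted, actual, 4)  # 'rs:b' and 'rs:d': 2 of 4

print(p1, p2, p4)  # 0.0 0.5 0.5
```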


In [21]:
"""
A function to calculate the average precision for a specific query.

Input: Two lists.
      - 'predicted' is the list of file names that our algorithm generates in response to a specific query
      - 'actual' is the list of file names that our AI algorithm is supposed to return

Output: A number denoting the average precision for a query.

Things to check for: 'predicted' must be at least as long as 'actual'; otherwise indexing predicted[i] raises an IndexError. Ideally this is checked before calling the MAP score function.
"""

def avg_precision(predicted, actual):
  avg_prec = 0
  n = 0

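  for i in range(len(actual)):
    # A hit is counted only when the i-th prediction equals the i-th entry of
    # 'actual' (positional match); a relevant file at another rank scores nothing.
    if predicted[i] == actual[i]: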
      avg_prec += precision_at_k(predicted, actual, i+1)
      n += 1
  
  if n>0:
    avg_prec /= n
  
  return avg_prec

time: 10.7 ms (started: 2021-12-11 21:55:38 +00:00)
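
For comparison, the textbook definition of average precision counts a hit whenever the file at rank i is relevant, regardless of its position in the ground-truth list, and divides by the number of relevant files. The function above instead requires `predicted[i] == actual[i]` and divides by the number of hits. The sketch below shows the textbook variant. `average_precision` is a hypothetical name, it assumes `predicted` contains no duplicates, and it is not the function that produced the scores reported in this notebook.

```python
def average_precision(predicted, actual):
    """Textbook average precision for one query.

    Sums precision@rank at every rank that holds a relevant item and
    divides by the total number of relevant items.
    """
    relevant = set(actual)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Made-up example: two of the three relevant files are retrieved,
# but neither sits at the same position as in 'actual'.
predicted = ["rs:b", "rs:a", "rs:d"]
actual = ["rs:d", "rs:b", "rs:x"]

ap = average_precision(predicted, actual)
print(round(ap, 4))  # 0.5556
```

With the positional rule, this same query would score 0, because no prediction matches the ground-truth entry at the same index.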


In [22]:
"""
A function to calculate the Mean Average Precision (MAP) Score for the entire testing dataset.

Input: Two 2D Lists. 
      - 'predicted' is the list of lists of file names that our algorithm generates. Each inner list corresponds to one query
      - 'actual' is the list of lists of file names that we're supposed to get. Each inner list corresponds to one query

Output: A number denoting the map_score
"""

def score(predicted, actual):
  map_score = 0
  n = 0

  for i in range(len(actual)):
    map_score += avg_precision(predicted[i], actual[i])
    n += 1

  if n>0:
    map_score /= n
  
  return map_score

time: 8.51 ms (started: 2021-12-11 21:55:38 +00:00)
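
To see how the three functions combine, the sketch below reproduces them in compact form (same logic, restated so the block runs on its own) and scores two made-up queries. The `jd:` file names are hypothetical.

```python
def precision_at_k(predicted, actual, k):
    return len(set(actual) & set(predicted[:k])) / float(k)


def avg_precision(predicted, actual):
    # Same positional-match rule as the notebook's function.
    total, n = 0.0, 0
    for i in range(len(actual)):
        if predicted[i] == actual[i]:
            total += precision_at_k(predicted, actual, i + 1)
            n += 1
    return total / n if n else 0.0


def score(predicted, actual):
    # Mean of the per-query average precision values.
    aps = [avg_precision(p, a) for p, a in zip(predicted, actual)]
    return sum(aps) / len(aps) if aps else 0.0


# Query 1: first prediction matches the first ground-truth entry.
# Query 2: both relevant files are retrieved, but in swapped order.
predicted = [["jd:1", "jd:2"], ["jd:7", "jd:8"]]
actual = [["jd:1", "jd:3"], ["jd:8", "jd:7"]]

result = score(predicted, actual)
print(result)  # 0.5
```

Query 1 contributes 1.0 and query 2 contributes 0.0, so the mean is 0.5. Query 2 scores zero even though both of its predictions are relevant, which shows how sensitive this metric is to the order of the ground-truth list.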


## Getting the MAP Score on the test set

Based on the options set above (in particular the variables PREPROCESS_METHOD, VECTORIZATION_METHOD, and SIMILARITY_METHOD), this section reports the MAP scores for the four test sets. For the best combination of these methods, refer to the final report.
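
Since each of the three options takes two values, there are eight combinations to compare. The sketch below merely enumerates them. The `OPTIONS` mapping is a hypothetical helper whose labels are copied from the comments next to each constant. It does not rerun the pipeline, which would require re-executing the vectorization and scoring cells for each combination.

```python
from itertools import product

# Option values and labels, taken from the comments next to each constant.
OPTIONS = {
    "PREPROCESS_METHOD": {0: "no pre-processing", 1: "pre-processing"},
    "VECTORIZATION_METHOD": {0: "TF-IDF", 1: "bag of words"},
    "SIMILARITY_METHOD": {0: "cosine similarity", 1: "euclidean distance"},
}

names = list(OPTIONS)
# Cartesian product over the allowed values of each option.
combinations = [
    dict(zip(names, values))
    for values in product(*(OPTIONS[name] for name in names))
]

for combo in combinations:
    print(", ".join(f"{n}={v} ({OPTIONS[n][v]})" for n, v in combo.items()))
```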

In [None]:
# For JD_2_JD
predicted, actual = generate_MAP_input(0)
print("JD-2-JD score: ", score(predicted, actual))

JD-2-JD score:  0.05444659776055125
time: 2min 49s (started: 2021-12-09 05:23:13 +00:00)


In [None]:
# For Resume_2_Resume
predicted, actual = generate_MAP_input(1)
print("Resume-2-Resume score: ", score(predicted, actual))

Resume-2-Resume score:  0.021722846441947566
time: 1min 49s (started: 2021-12-09 05:26:02 +00:00)


In [None]:
# For JD_2_Resume
predicted, actual = generate_MAP_input(2)
print("JD-2-Resume score: ", score(predicted, actual))

JD-2-Resume score:  0.0
time: 23.6 s (started: 2021-12-09 05:27:52 +00:00)


In [None]:
# For Resume_2_JD
predicted, actual = generate_MAP_input(3)
print("Resume-2-JD score: ", score(predicted, actual))

Resume-2-JD score:  0.0
time: 55.5 s (started: 2021-12-09 05:28:16 +00:00)
