# Recommender Systems, NLP
#### Author: Atanu Choudhury

Environment: Python 3.6.5 and Jupyter Notebook

Libraries used: 
* re (for regular expression) 
* json (to write json file)
* nltk 3.2.2 (Natural Language Toolkit)
* nltk.tokenize (for tokenization)
* nltk.collocations (for finding bigrams)
* nltk.stem (for stemming)
* collections (for defaultdict and Counter)
* itertools (for chaining iterables)
* TfidfVectorizer (to create the Tf-Idf vectors)
* pandas (to process the job and resume data)
* cosine_similarity (to calculate the cosine similarity between two vectors)



## Task 1: Parsing Job Postings
This task consists of extracting the job-related information from each job posting. The extracted data is then written to a JSON file and an XML file with similar structures.

The task was achieved in the following steps:
- Read the job posting file job_postings_raw.dat.
- Analysed the data to identify its patterns.
- Designed regexes and other logic to extract most of the data efficiently.
- Wrote the extracted data to JSON and XML files.

More details for each task will be given in the following sections.

## Step 1 - Importing libraries

In [1]:
import re
import json

## Step 2 - Reading job postings file

- The file is read and split on the separator between job postings, so 'data' is a list of all the job postings.


In [2]:
job_file=open('./input/job_postings_raw.dat', "r")
data=job_file.read().split("------------------------------")
job_file.close()

## Step 3 XML writer

#### Workflow
- XML String creator
  - Takes a dictionary as input
    - Checks whether the input is a dict or a list
    - Replaces the characters that must be escaped in XML with their equivalent entities, according to the specification
    - Recursively calls itself to handle children at any depth
  - Returns a string with XML-like indentation

In [3]:
def xml_writer(json_object, indent=""):
    result_list = list()
    json_object_type = type(json_object)
    
    xml_escape_char={'&':'&amp;',
                    '<':'&lt;',
                    '>':'&gt;',
                    '"':'&quot;',
                    "'":'&apos;'}
    
    
    if json_object_type is list:
        for elem in json_object:
            result_list.append(xml_writer(elem, indent))
        return "\n".join(result_list)

    if json_object_type is dict:
        for tag_name in json_object:
            child = json_object[tag_name]
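            if isinstance(child, str):
                # escaping special characters; '&' is listed first so that the
                # entities inserted afterwards are not escaped a second time
                for key, value in xml_escape_char.items():
                    child=child.replace(key,value)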
            result_list.append("%s<%s>" % (indent, tag_name))
            result_list.append(xml_writer(child, "\t" + indent))
            result_list.append("%s</%s>" % (indent, tag_name))


        return "\n".join(result_list)

    return "%s%s" % (indent, json_object)
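
As a quick illustration of the workflow above, the following minimal sketch serialises a small nested dictionary in the same way. It is not the function used in this notebook: `dict_to_xml` and the sample record are hypothetical, and escaping is delegated to the standard library's `xml.sax.saxutils.escape` instead of a hand-written table.

```python
from xml.sax.saxutils import escape

# Extra entities beyond the defaults (&, <, >) that escape() already handles
QUOTE_ENTITIES = {'"': '&quot;', "'": '&apos;'}

def dict_to_xml(node, indent=""):
    """Recursively render dicts and lists of strings as indented XML text."""
    if isinstance(node, list):
        return "\n".join(dict_to_xml(item, indent) for item in node)
    if isinstance(node, dict):
        lines = []
        for tag, child in node.items():
            lines.append("%s<%s>" % (indent, tag))
            lines.append(dict_to_xml(child, indent + "\t"))
            lines.append("%s</%s>" % (indent, tag))
        return "\n".join(lines)
    # Leaf value: escape the characters that are special in XML
    return indent + escape(str(node), QUOTE_ENTITIES)

# Hypothetical record, used only for illustration
sample = {"title": "R&D Engineer", "job_descriptions": [{"description": "C++ <expert>"}]}
xml_text = dict_to_xml(sample)
print(xml_text)
```

Each nesting level adds one tab, so children of a list share the indentation of their parent tag's content.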

## Step 4 - Parsing the file

#### Workflow
- Created a dictionary of regex replacements
  - Replacing all the found variations of each tag with a normalised version of the form '^key\r', i.e. left padding it with '^' and right padding it with '\r'
  - Variations found
    - "^start_date\r" -> "(DATE_START|START DATE|START_DA|start_date|DATES)"
    - "^required_qualifications\r" -> "(REQ_QUALS|QUALIFICATION|qualifications|QUALIFS|REQUIRED QUALIFICATIONS)"
    - "^job_responsibilities\r" -> "(JOB RESPONSIBILITIES|RESPONSIBILITY|JOB_RESPS|RESP|responsibilities)"
    - "^title\r" -> "(JOB TITLE|JOB_T|\_TTL|TITLES|title)"
    - "^location\r" -> "(JOB_LOC|\_LOC|\_LOCS|LOCATION|LOCATIONS)"
    - "^job_descriptions\r" -> "(JOB DESCRIPTION|JOB_DESC|job_desc|\_description|DESCRIPTION)"
    - "^salary\r" -> "(JOB_SAL|REMUNERATION|SALARY|remuneration|salary)"
    - "^application_procedure\r" -> "(JOB_PROC|PROCEDURE|PROCEDURES|procedures|JOB_PROCS)"
    - "^application_deadline\r" -> "(APPLICATION_DEADL|APPLICATION_DL|DEAD_LINE|DEADLINES|deadline)"
    - "^about_company\r" -> "(ABOUT COMPANY|COMPANYS_INFO|about_company|\_info|ABOUT)"

- Removing the unnecessary tags such as "(REMUNERATION\/|OPEN TO\/|START DATE\/|ABOUT PROGRAM\/)" which make the data dirty
- Iterating every job posting
  - Checking for headers with children and applying regex and child creation logic and performing extraction
      - Job Description:
            The separator of each description child was identified as a double quote. The data has to be cleaned before the children can be extracted: three double quotes are replaced with one double quote followed by two single quotes, then two double quotes are replaced with two single quotes, and finally the data is split on a single double quote. Multiple occurrences of 'NA' are replaced with a single 'N/A' child. The result is stored in the dictionaries maintained for XML and JSON writing.
      - Job Responsibilities & Required Qualifications:
            These two headers were found to be separated mainly by '-' as the bullet, with ';' or '.' marking the end of each child. In a few variations the child ends with a new line or a comma, which is handled conditionally. The regex used for extracting the children is '(?:(?!- ).)\*(?=\\n|$)'. Here '(?:)' is a non-capturing group and '(?!)' is a negative lookahead, so the pattern consumes characters one at a time (including line breaks, because of the DOTALL flag) as long as the next two characters are not '- ', and stops where the positive lookahead finds a new line or the end of a line. The other variations of the regex differ only in the end-of-child marker and extract the same kind of data for the observed variations.
  - Extracting id separately
  - Extracting all other tags
  - Write each record to xml file for every iteration
  - Populate each listing into the dictionary for the json writer
- Write the populated dictionary for json writer to the json file
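
The header normalisation and splitting described above can be illustrated on a toy record. This is a simplified sketch: the sample posting and the reduced `header_patterns` table are made up for illustration, and only three of the header families are included.

```python
import re

# Reduced, illustrative version of the header table (pattern -> normalised marker)
header_patterns = {
    r"\n(JOB TITLE|JOB_T|_TTL|TITLES|title):": "^title\r",
    r"\n(JOB_LOC|_LOC|_LOCS|LOCATION|LOCATIONS):": "^location\r",
    r"(ID):": "id\r",
}

# Hypothetical raw posting with irregular header names
raw = "ID: 42\nJOB_T: Data Analyst\nLOCATIONS: Melbourne\n"

text = raw
for pattern, marker in header_patterns.items():
    text = re.sub(pattern, marker, text)

# Each chunk now has the form 'key\rvalue', and '^' separates the chunks
record = {}
for chunk in text.split("^"):
    key, value = chunk.split("\r", 1)
    record[key.strip()] = value.strip()

print(record)
```

Because every header is rewritten to the same `^key\r` shape, a single split on '^' followed by a split on '\r' is enough to recover the key and value pairs.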
    
                   


In [6]:
#Initialising a listing dictionary to store the values for writing the json file
listings = {'listings': {"listing":[]}}

# Creating a headers regex dict to replace the irregular headers with the normalised ones
headers_regex_dict = {"\\n(DATE_START|START DATE|START_DA|start_date|DATES):": "^start_date\r", 
                      "\\n(REQ_QUALS|QUALIFICATION|qualifications|QUALIFS|REQUIRED QUALIFICATIONS):": "^required_qualifications\r",
                      "\\n(JOB RESPONSIBILITIES|RESPONSIBILITY|JOB_RESPS|RESP|responsibilities):":"^job_responsibilities\r",
                     "\\n(JOB TITLE|JOB_T|_TTL|TITLES|title):":"^title\r",
                     "\\n(JOB_LOC|_LOC|_LOCS|LOCATION|LOCATIONS):":"^location\r",
                     "\\n(JOB DESCRIPTION|JOB_DESC|job_desc|_description|DESCRIPTION):":"^job_descriptions\r",
                     "\\n(JOB_SAL|REMUNERATION|SALARY|remuneration|salary):":"^salary\r",
                     "\\n(JOB_PROC|PROCEDURE|PROCEDURES|procedures|JOB_PROCS):":"^application_procedure\r",
                     "\\n(APPLICATION_DEADL|APPLICATION_DL|DEAD_LINE|DEADLINES|deadline):":"^application_deadline\r",
                     "\\n(ABOUT COMPANY|COMPANYS_INFO|about_company|_info|ABOUT):":"^about_company\r",
                     "(ID):":"id\r"}

# Initialising the list of tags to be used in further dictionaries
list_of_xml_tags=["title","location","job_descriptions","job_responsibilities","required_qualifications","salary",
              "application_procedure","start_date","application_deadline","about_company"]
list_of_nones = [None] * 10
list_of_json_tags=['_id']+list_of_xml_tags

# opening the xml file to write data
file_obj=open('./output/job_postings_extracted.xml', 'w')
# writing initial lines according to the xml specs
file_obj.write('<?xml version="1.0" encoding="UTF-8" ?>\n')
file_obj.write('<listings>\n')

n=len(data)-1
# iterating each job posting
for i in range(0,n):
    # Replacing the dirty tags, cleansing the data
    text=re.sub('\\n(REMUNERATION\/|OPEN TO\/|START DATE\/|ABOUT PROGRAM\/)\\n','\n',data[i])
    
    # Normalising the headers using the dict defined before
    for key, value in headers_regex_dict.items():
        text=re.sub(key,value,text)
    
    # Splitting each key from other keys
    tag_split=text.split('^')
    
    # Initialising the dictionaries to be used to maintain the xml and the json formats
    dictionary_json = dict(zip(list_of_json_tags, list_of_nones))
    dictionary_xml = dict(zip(list_of_xml_tags, list_of_nones))
    
    # Iterating the list of keys along with their values
    for element_tag in tag_split:
        # Extracting the key
        header=element_tag.split('\r')[0].strip()
        # Extracting the value
        value=element_tag.split('\r')[1].strip()
                
        # Checking for key which is of child form
        if header in ['job_descriptions','required_qualifications','job_responsibilities']:
            
            if header == 'job_descriptions':
              # Cleaning the data in the job descriptions
                cleaned_value=re.sub('"""',"\"''",value)
                # Replacing two double quotes with two single quotes to not lose the data
                cleaned_value=re.sub('""',"''",cleaned_value)
                # Cleansing the data: flatten line breaks, then split on the remaining double quotes
                description_list=re.sub('\n'," ",cleaned_value).split('"')
                # extracting the list of probable job descriptions after splitting
                description_list=[x for x in description_list if x.strip() if x!=',']

                # iterating each job description and re-inserting 'N/A' for multiple occurrences of 'NA'
                desc_list=[]
                for d in description_list:
                    if d.count(',NA,')>0:
                        desc_list.append('N/A')
                    else:
                      # using the similar logic above to maintain integrity of the data 
                        desc_list.append(re.sub("''",'"',d))
                
                # initialising the json form of the current key and populating it
                json_desc={'description':None}
                
                # Proceed only if list not empty
                if desc_list:
                
                    json_desc['description']=desc_list
                    dictionary_json[header]=json_desc

                    # initialising the xml form of the current key and populating it
                    list_of_desc_dict=[]
                    for d in desc_list:
                        desc_dict={'description':None}
                        desc_dict['description']=d
                        list_of_desc_dict.append(desc_dict)
                    dictionary_xml[header]=list_of_desc_dict
            
            else:
                # Check if header is required qual
                if header=='required_qualifications':
                    header_key='qualification'
                    
                else:
                    # Otherwise the header is job responsibility
                    header_key='responsibility'
                
                
                # Using regex to extract the text starting with '-' until ';' or '.'
                #cleansing the data by removing tabs and extra new line breaks
                value=re.sub('\\t',' ',value)
                value=re.sub('\\n[\w\/ ]+:[ ]*\\n','\\n',value)
                
                # checking whether a semicolon is present in the data; if it is not,
                # some other separator marks the end of each child
                if value.find(';')==-1:
                    # two variations were also observed
                    # one not ending with any special character but with a new line break
                    # (str.find takes a literal string, so '\n' is needed to match a real line break)
                    if value.find(',\n')==-1:
                        qualifications=re.findall('(?:(?!- ).)*(?=\\n|$)',value,re.M|re.S)
                    # one ending with a comma followed by a new line
                    else:
                        qualifications=re.findall('(?:(?!- ).)*(?=[,\.])',value,re.M|re.S)
                        
                # child ending with semi colon
                else:
                    qualifications=re.findall('(?:(?!- ).)*(?=[;\.])',value,re.M|re.S)
        
                # removing new line breaks from extracted value list
                qualifications=[re.sub('\n',' ',q.strip()) for q in qualifications if q]
                
                # initialising the json form of the current key and populating it
                json_req_qual={header_key:None}
                
                # Proceed only if list not empty
                if qualifications:
                    json_req_qual[header_key]=qualifications
                    dictionary_json[header]=json_req_qual
                
                    # initialising the xml form of the current key and populating it
                    list_of_qual_dict=[]
                    for qual in qualifications:
                        qual_dict={header_key:None}
                        qual_dict[header_key]=qual
                        list_of_qual_dict.append(qual_dict)
                    dictionary_xml[header]=list_of_qual_dict
                
        elif header == 'id':
            # writing 'id' value to xml file
            file_obj.write("\t<listing id='"+value.strip()+"'>\n")
            # storing id to json dict for processing later
            dictionary_json['_id']=re.sub('\n',' ',value).strip()
        
        else:
            # extracting all other tags using same logic
            dictionary_json[header]=dictionary_xml[header]=re.sub('\n',' ',value).strip()
     
    
    # replacing the empty or None values with N/A in json dict
    for k, v in dictionary_json.items():
        if v is None or not v:
            dictionary_json[k] = "N/A"
    
    # replacing the empty or None values with N/A in xml dict
    for k, v in dictionary_xml.items():
        if v is None or not v:
            dictionary_xml[k] = "N/A"
    
    # Creating the json object of the xml_dictionary and passing it to xml_writer to prepare the xml string
    j = json.loads(json.dumps(dictionary_xml))
    file_obj.write(xml_writer(j,'\t\t')) # writing the returned xml formatted string
    file_obj.write("\n\t</listing>\n") # ending current listing
    
    # appending the current listing to json dictionary to be used later
    listings['listings']["listing"].append(dictionary_json)
    
# ending the root of the xml
file_obj.write("</listings>")
# closing the xml file
file_obj.close()
# opening the json file
json_file_obj=open('./output/job_postings_extracted.json', 'w')
#dumping the json dictionary to the file using indent and pretty print
json.dump(listings, json_file_obj, sort_keys=False, indent=4)
#closing the json file
json_file_obj.close()
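
To see how the bullet-extraction patterns behave in isolation, the sketch below applies them to two invented snippets. `extract_children` is a hypothetical helper that loosely follows the intent of the branching in the loop above; it is not part of the original notebook.

```python
import re

def extract_children(value):
    """Split a '- ' bulleted block into items (illustrative helper)."""
    if ";" in value:
        pattern = r"(?:(?!- ).)*(?=[;.])"    # items end with ';' or '.'
    elif ",\n" in value:
        pattern = r"(?:(?!- ).)*(?=[,.])"    # items end with ',' or '.'
    else:
        pattern = r"(?:(?!- ).)*(?=\n|$)"    # items end at a line break
    matches = re.findall(pattern, value, re.M | re.S)
    # Drop empty matches and flatten any line breaks inside an item
    return [re.sub(r"\n", " ", m.strip()) for m in matches if m.strip()]

newline_items = extract_children("- Degree in CS\n- 3 years experience")
semicolon_items = extract_children("- Degree in CS;\n- 3 years experience.")
print(newline_items)
print(semicolon_items)
```

The negative lookahead only blocks a hyphen followed by a space, so hyphenated words inside an item are not treated as new bullets.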

## Task 1 Summary

This task measured the understanding of parsing a large volume of dirty data using Python, without external packages. The main outcomes were:

- **Reading file** - Reading a single file of that many lines in one go had to be made efficient, as the buffer memory could not allocate enough to read the whole data at once.
- **Designing Regex** - Scanned through the raw data file to find variations and patterns which could help form the regexes to efficiently extract most chunks of data.
- **Extracting data** - Processing the records and extracting them one by one using the regexes designed.
- **Processing 30K+ records** - Processed the records in a very short time, approximately 1K records per second.
- **Writing structured output files** - Writing the extracted data to the JSON and XML structured format, using well defined tags and child separators.

## Task 2: Parsing Resume Files
This task consists of extracting candidate-related information from resume files. The extracted data is used to create a vocab and a count vector.

The task was achieved in the following steps:
- Read the resume dataset file containing the resumes to process
- Read the relevant resumes
- Apply tokenization, word removal and stemming to produce a good vocab
- Write the extracted vocab to one file, and each resume with its term counts to another



## Step 1 - Importing libraries

In [8]:
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from collections import defaultdict,Counter
import math
import string
from itertools import chain
from nltk.collocations import *



## Step 2 - Read Files

- Reading resume_dataset file to extract my relevant resume nos
- Creating a raw resume dictionary to store all the data related to the resume nos provided to me
- Checking whether any resume is empty and removing its key from the dictionary
- Reading the stopwords from the file provided

In [9]:
#reading the common file to extract the resume nos related to my student id
resume_dataset_file=open ('./input/resume_dataset.txt', "r")
resume_dataset=resume_dataset_file.read().split("\n\n")
resume_dataset_file.close()
my_resume_nos=list(set(re.findall('\d+',[x for x in resume_dataset if '29262909' in x][0])[1:]))

#extracting the information related to one resume in a dictionary format
resume_raw={}
for res_no in my_resume_nos: 
    file = open('./input/resumeTxt/resume_('+res_no+').txt','r',encoding='UTF-8')
    resume_raw[res_no]=file.read()
    file.close()

# Removing all the keys which have their values empty
empty_keys=[]
for k,v in resume_raw.items():
    if v.strip() =='':
        empty_keys.append(k)
for x in empty_keys:
    resume_raw.pop(x, None)
# Reading the stopwords from the given file
file = open('./input/stopwords_en.txt','r')
stopwords_list=file.read().split()
file.close()

## Step 3 - Defining the functions

- lower_repl() function is used to convert the word after a new line or full stop to lower
- generate_200_bigrams() is used to generate the required bigrams for the vocab
- clean_data() is used to clean the original resume content by removing non-printable characters, extra spaces and extra lines


In [10]:
# function to convert the second group of the match to lower, used for lower casing the words after '\n' or '.'
def lower_repl(match):
     return match.group(1) + match.group(2).lower()
 
# function to generate the 200 meaningful bigrams from an input dictionary of values containing tokens
def generate_200_bigrams(input_dict):
    # making one list of tokens from the multiple token list for each key
    all_resume_tokenised_words = list(chain.from_iterable(input_dict.values()))
    # Creating an object for the BigramAssocMeasures 
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    # used to find the bigram collocations from a list of words
    bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_resume_tokenised_words)
    # rejects bigrams which occur fewer than 10 times
    bigram_finder.apply_freq_filter(10)
    # filter out words which are shorter than 3 characters or are stopwords
    bigram_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords_list)
    #finds the top n bigrams using the Pointwise Mutual Information association measure
    top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200)
    #returns the top 200 bigrams
    return top_200_bigrams

# function to clean the data in the resume, given the resume no
def clean_data(res_num):
    res_content=resume_raw[res_num]
    #clean the data for the tokens which cannot be identified or some bullets
    content_cleaned=re.sub('[–”“‘’_]','',res_content)
    # clean the data by replacing multiple '\n' with a single '\n'
    content_cleaned=re.sub('\\n( *(\\n)+)+','\\n',content_cleaned)
    # clean the data by replacing more than one space with a single space
    content_cleaned=re.sub(' +',' ',content_cleaned)
    # cleaning the data by replacing space or '\n' after a full stop by a single space after it
    content_cleaned=re.sub("\. +\\n *",'. ',content_cleaned)
    # extracting the word after a '\n' or a full stop to normalise the case
    content_cleaned=re.sub('(\\n *|\. +)(\w+)',lower_repl,content_cleaned)
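    # replacing every whitespace character (tabs, line breaks, etc.) with a single space;
    # the characters are wrapped in [] so that each one is matched individually
    content_cleaned=re.sub('['+string.whitespace+']',' ',content_cleaned)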
    return content_cleaned
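
As a small illustration of the case normalisation, the snippet below applies the same substitution to a made-up string. Only the word that follows a line break or a full stop is lowercased; the sample text is hypothetical.

```python
import re

def lower_repl(match):
    # Keep the separator (group 1) and lowercase the word after it (group 2)
    return match.group(1) + match.group(2).lower()

# Invented sample text
sample = "Managed a team. Built Python tools\nLed migration to AWS"
normalised = re.sub(r"(\n *|\. +)(\w+)", lower_repl, sample)
print(repr(normalised))
```

Note that the very first word of the text is left untouched, because it is preceded by neither a line break nor a full stop.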

## Step 4 - Creating the Vocab

1. Normalising the case of the tokens which are at the beginning of a sentence or line, using the dictionary data structure {res_no:res_tokens}
2. Generating the top 200 meaningful bigrams
3. Using MWETokenizer to retokenise the original tokens again to remove the split words which are now in the bigrams
4. Removing the context independent stop words and short words
5. Removing the context dependent stop words and rare tokens based on threshold document frequency
6. Stemming the remaining vocab using PorterStemmer
7. Write the vocab into the vocab file with the following dictionary data structure {word:word_index}
8. Count the term frequency of the vocab items and write to the count vector file with the dictionary {res_no: {word1:count1, word2:count2,...}}

### Step 4.1 - Cleaning and Normalising
The cleaned data is tokenised using the given word regex

In [11]:
#the word tokenizer provided in the specs
word_tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")

# using the tokenizer on each resume content after cleaning with lowercasing
resume_tokenised={}
for res_num in resume_raw.keys():
    resume_tokenised[res_num]=word_tokenizer.tokenize(clean_data(res_num).strip())




### Step 4.2 - Finding the top 200 Meaningful bigrams
The 200 meaningful bigrams are generated

In [12]:
# generating the top 200 meaningful bigrams  
top_200_bigrams=generate_200_bigrams(resume_tokenised)
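
For intuition about how the ranking works, here is a pure-Python sketch of PMI scoring for bigrams. It is a simplification and not NLTK's implementation: `top_pmi_bigrams` is a hypothetical helper, the token list is invented, and the score is computed as log2(count(w1, w2) * N / (count(w1) * count(w2))), with N the number of tokens.

```python
import math
from collections import Counter

def top_pmi_bigrams(tokens, min_freq=2, n=5):
    """Return up to n bigrams with the highest PMI among those seen at least min_freq times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # frequency filter, in the spirit of apply_freq_filter
        scores[(w1, w2)] = math.log2(count * total / (unigram_counts[w1] * unigram_counts[w2]))
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Invented token stream
tokens = "machine learning is fun and machine learning is useful".split()
best = top_pmi_bigrams(tokens)
print(best)
```

PMI favours pairs whose words rarely occur apart, which is why a minimum frequency is applied first: without it, pairs seen only once would dominate the ranking.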



### Step 4.3 - Retokenizing the original tokens using MWETokenizer
Retokenising to ensure that the split words are excluded from the tokens using the multi word tokenizer

In [13]:
#using the MWETokenizer to retokenise the list of tokens before generating the bigrams, so as to exclude the split words
mwe_tokenizer = MWETokenizer(top_200_bigrams)
resume_c={}
for key in resume_tokenised.keys():
    resume_c[key]=mwe_tokenizer.tokenize(resume_tokenised[key])
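
The effect of the retokenisation can be sketched without NLTK. The hypothetical `merge_bigrams` helper below joins adjacent tokens that form a known bigram with an underscore, which is also the default separator of `MWETokenizer`; it handles only two-word expressions and is meant purely as an illustration.

```python
def merge_bigrams(tokens, bigrams):
    """Join adjacent token pairs found in `bigrams` into single 'w1_w2' tokens."""
    known = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known:
            merged.append("_".join(pair))
            i += 2  # skip both words of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Invented tokens and bigram list
tokens = ["experienced", "in", "machine", "learning", "and", "sql"]
merged = merge_bigrams(tokens, [("machine", "learning")])
print(merged)
```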
    



### Step 4.4 - Filtering out the stop words and short words
Removing the stop words and the short words from the list of tokens for each resume

In [14]:
#removing context independent stop words and short words
resume_stopped={}
for key in resume_c.keys():
    resume_stopped[key]=[x for x in resume_c[key] if len(x)>=3 if x.lower() not in stopwords_list]




### Step 4.5 - Removing the rare and common tokens
Calculating the threshold frequencies of the word and removing the ones which appear in less than 2% or more than 98% of the documents.

In [15]:
#Removing tokens which are not between 2% and 98% of document frequency
#Initialising an integer dictionary to count the document frequency
doc_freq=defaultdict(int) 
for res_no in resume_stopped.keys():
    for word in set(resume_stopped[res_no]):
        doc_freq[word.lower()]+=1
#creating a list of tokens not within that document frequency
rare_common_tokens=[]
for k in doc_freq.keys():
    if doc_freq[k]>0.98*len(resume_raw.keys()) or doc_freq[k]<0.02*len(resume_raw.keys()):
        rare_common_tokens.append(k)
# filtering out those tokens which are not there within that range of document frequency
resume_frequent={}
for key in resume_stopped.keys():
    resume_frequent[key]=[x for x in resume_stopped[key] if x.lower() not in rare_common_tokens]
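
The same document-frequency filter can be written compactly with `collections.Counter`. The sketch below uses a tiny invented corpus and wider thresholds (the `low` and `high` parameters are illustrative) so that the effect is visible on four documents.

```python
from collections import Counter

def filter_by_doc_freq(docs, low=0.02, high=0.98):
    """Drop tokens whose document-frequency ratio is below `low` or above `high`."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update({t.lower() for t in tokens})  # count each token once per document
    keep = {t for t, df in doc_freq.items() if low * n_docs <= df <= high * n_docs}
    return {doc_id: [t for t in tokens if t.lower() in keep] for doc_id, tokens in docs.items()}

# Invented corpus: 'resume' appears everywhere, 'sql', 'java' and 'excel' only once
docs = {
    "1": ["python", "resume", "sql"],
    "2": ["java", "resume"],
    "3": ["python", "resume"],
    "4": ["resume", "excel"],
}
filtered = filter_by_doc_freq(docs, low=0.3, high=0.9)
print(filtered)
```

Keeping the surviving tokens in a set makes each membership test constant time, which matters once the vocabulary grows to thousands of entries.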



### Step 4.6 - Stemming the tokens using Porter Stemmer
Stemming the tokens before combining all the words

In [16]:
                    
#Stemming the words using the porter stemmer
stemmer=PorterStemmer()
resume_stemmed={}
for key in resume_frequent.keys():
    resume_stemmed[key]=[stemmer.stem(x) for x in resume_frequent[key]]



### Step 4.7 - Finding out the term frequency for each resume
Calculating the term frequency for each term in a document

In [17]:
# Counting the term frequency of each term in a document using the collections.Counter class
resume_words = []
resume_term_freq={}
for key in resume_stemmed.keys():
    resume_words+=resume_stemmed[key]
    resume_term_freq[key]=dict(Counter(resume_stemmed[key]))



### Step 4.8 - Creating a dictionary of words and their indices
Making a vocab with words and their index positions after being sorted

In [18]:
# Create a dictionary of the sorted set of all the words with their index to be written in the vocab
vocab_dict={word:index for index, word in enumerate(sorted(set(resume_words)))}    



### Step 4.9 - Writing the vocab of word:word_index form
Writing the vocab file

In [19]:
#open the vocab file and write the vocab_dict in the specified format
vocab_file=open('./output/resume_vocab.txt','w')
for word,index in vocab_dict.items():
    vocab_file.write("%s:%s\n" % (word, index))
vocab_file.close()


### Step 4.10 - Writing the countVec file 
Writing the countVec file

In [20]:

#open the count vector file and write the resume_term_freq in the specified format
countVec_file=open('./output/resume_countVec.txt','w')
for key,tfd in resume_term_freq.items():
    countVec_file.write("%s" % 'resume_('+key+')')
    for term, frequency in tfd.items():
        countVec_file.write(", %s:%s" % (vocab_dict[term], frequency))
    countVec_file.write("\n")
countVec_file.close()


## Task 2 Summary

This task measured the understanding of extracting meaningful information from a large number of files using Python and NLTK. The main outcomes achieved were:

- **Pre-processing the data** - The data should be cleaned before processing so that meaningful information can be extracted more effectively.
- **Generating Bigrams and MWETokenizer** - Generating bigrams based on the corpus and using the multi-word tokenizer to tokenise them.
- **Filtering certain types of words** - Filtering out words which are not useful for analysis, such as stop words, common tokens, rare tokens and short tokens.
- **Calculating term frequency and document frequency** - The calculation of term frequency and document frequency can be further used to provide recommendations.
- **Stemming the words** - Reducing words to their root is important, as variations in the language can make two contextually similar words appear different although their essence is the same.

## Task 3: Ranking Resumes with respect to job advertisements
The task consists of recommending the top 10 resumes that best fit each of the first 500 job advertisements from task 1, with respect to their required qualifications.

The steps to achieve the task are as follows:
- From resume_vocab.txt and resume_countVec.txt, read the resume ids and their content, and convert them to a dataframe with columns ['ID','Corpus']
- From job_postings_extracted.json, read the first 500 job postings which have the 'required qualifications' field, and similarly convert them into a dataframe of job ids and their content with columns ['ID','Corpus']
- After creating the job and resume dataframes, add each job to a copy of the resume dataframe and compute the tf-idf vectors of the combined dataframe. Then compute the cosine similarity of the job vector with respect to all the resume vectors, and use it to determine the top 10 resumes with the highest cosine similarity.
- Write the recommended resumes for each job to the file.
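
The ranking idea can be sketched in plain Python. The code below is an illustration only and does not use scikit-learn: `tfidf_vectors`, `cosine` and `rank_resumes` are hypothetical helpers, the weighting is tf * (ln((1 + n) / (1 + df)) + 1) (the smoothed form that scikit-learn documents as its default), and the resumes and job text are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a string of space-separated tokens) to a {term: weight} dict."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def rank_resumes(job_text, resumes, top_k=10):
    """Return the ids of the top_k resumes most similar to the job text."""
    ids = list(resumes)
    # The job is appended to the resumes so that all share one tf-idf space
    vectors = tfidf_vectors([resumes[i] for i in ids] + [job_text])
    job_vec = vectors[-1]
    scored = [(cosine(job_vec, vec), rid) for rid, vec in zip(ids, vectors)]
    return [rid for _, rid in sorted(scored, reverse=True)[:top_k]]

# Invented, already-stemmed corpora
resumes = {"r1": "python sql python", "r2": "java spring", "r3": "python java"}
top = rank_resumes("python sql", resumes, top_k=2)
print(top)
```

Because cosine similarity normalises by vector length, a long resume does not outrank a short one merely by repeating terms that the job does not mention.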




## Step 1 - Importing libraries

In [21]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from collections import defaultdict,Counter
import math
import string
from itertools import chain
from nltk.collocations import *
import itertools
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import json


## Step 2 - Read Files

- Reading stopwords_en.txt file to extract the stopwords
- Reading the resume_vocab.txt file to extract all the resume vocab from task 2
- Reading the resume_countVec.txt file to extract the related resume and the word index of the vocab from task 2
- Reading the job_postings_extracted.json file to extract all the job postings from task 1

In [34]:
#initialising the tokeniser for words as specified in task2
word_tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")

# reading the stopwords file
file = open('./input/stopwords_en.txt','r')
stopwords_list=file.read().split()
file.close()

# reading the vocab file
read_vocab_file=open("./output/resume_vocab.txt",'r')
vocab_data=read_vocab_file.read().split('\n')
read_vocab_file.close()


# reading the countVec file
read_countVec_file=open("./output/resume_countVec.txt",'r')
countVec_data=read_countVec_file.read().split('\n')
read_countVec_file.close()

# reading the job posting output json file
read_json=open("./output/job_postings_extracted.json",'r')
jobs_json=json.loads(read_json.read())
read_json.close()

## Step 3 - Creating the resume dataframe

- Iterate the vocab file data to make a vocab dictionary
- Iterate the countVec file data to make a dictionary of the form {resumeid:resumecontent} using the vocab dictionary
- Use the dictionary from the above step to create a resume dataframe of the columns ['ID','Corpus']

In [30]:
# a placeholder for all the vocab and their indices
vocab_dict={}
for v in vocab_data[:len(vocab_data)-1]:
    vocab_dict[v.split(':')[1]]=v.split(':')[0]

# a dictionary for all the resume and its content in key value pairs
resume_dict={}
for res in countVec_data[:len(countVec_data)-1]:
    resume_string=''
    #splitting the content on comma
    count_list=res.split(',')
    #extracting the resume id from the data using group
    key=re.search('\d+',count_list[0]).group(0)
    #iterating each word index and using the vocab dict to find the actual word
    for c in count_list[1:]:
        word_index=c.split(':')[0]
        word_count=int(c.split(':')[1])
        #performing the lookup by using the word index
        word_text=vocab_dict[word_index.strip()]
        # using the frequency of the word to form the string of words for each resume
        for i in range(0,word_count):
            resume_string+=' '+ word_text
    #setting the content value for a resume key
    resume_dict[key]=resume_string.strip()

#Transforming the above formed dictionary to a dataframe
resume_df=pd.DataFrame.from_dict(resume_dict, orient='index').reset_index()
#Renaming the columns of the dataframe
resume_df.rename(columns={'index': 'ID', 0: 'Corpus'}, inplace=True)
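
The reconstruction step can be tried on a single line. The sketch below is an illustration with an invented vocabulary and count-vector line in the `resume_(id), index:count, ...` layout written in task 2; `expand_count_line` is a hypothetical helper.

```python
import re

def expand_count_line(line, index_to_word):
    """Turn 'resume_(7), 0:2, 1:1' into ('7', 'word0 word0 word1')."""
    head, *pairs = line.split(",")
    resume_id = re.search(r"\d+", head).group(0)
    words = []
    for pair in pairs:
        index, count = pair.strip().split(":")
        words.extend([index_to_word[index]] * int(count))  # repeat the word by its frequency
    return resume_id, " ".join(words)

# Invented vocabulary, keyed by the index as a string, as read from the vocab file
index_to_word = {"0": "python", "1": "sql"}
expanded = expand_count_line("resume_(7), 0:2, 1:1", index_to_word)
print(expanded)
```

Repeating each word by its count restores the term frequencies, so a tf-idf vectoriser applied later sees the same counts that were stored in the count vector.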

## Step 4 - Creating the job dataframe

- Extracting the first 500 job postings which have the necessary field for recommendation, i.e. the required qualification field
- Tokenise the job content with the same regex used in task 2, find bigrams, remove short words and stop words, and stem the tokens, thus making the tokens similar to those of the resumes
- Create a job dataframe with columns ['ID','Corpus'] from the dictionary {jobid:jobcontent}

In [36]:
# bigram function to generate the bigrams for the job postings similar to that of the resumes in task2
def generate_200_bigrams(input_dict):
    # making one list of tokens from the multiple token list for each key
    all_resume_tokenised_words = list(chain.from_iterable(input_dict.values()))
    # Creating an object for the BigramAssocMeasures 
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    # used to find the bigram collocations from a list of words
    bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_resume_tokenised_words)
    # rejects bigrams which occur fewer than 10 times
    bigram_finder.apply_freq_filter(10)
    # filter out words which are shorter than 3 characters or are stopwords
    bigram_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords_list)
    #finds the top n bigrams using the Pointwise Mutual Information association measure
    top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200)
    #returns the top 200 bigrams
    return top_200_bigrams
  
# using the json input to extract out all the job postings
list_of_jobs=jobs_json['listings']['listing']
job_data={}
ctr=0
# iterating each job posting to check whether it has the header needed for performing recommendation
for j in list_of_jobs:
    if j['required_qualifications']!='N/A':
        job_data[j['_id']]=j['required_qualifications']['qualification']
        ctr+=1
    if ctr==500:#till we find 500 job postings
        break
    
#Performing the same operations on the job posting as done in task 2 for resumes

#tokenise the job content
job_tokenised={}
for k in job_data.keys():
    job_tokenised[k]=list(chain.from_iterable([word_tokenizer.tokenize(x) for x in job_data[k]]))

#find the bigrams for the job data
job_top_200_bigrams=generate_200_bigrams(job_tokenised)
#use the MWETokenizer to merge each top bigram into a single token, so that its two words do not appear separately in the corpus
job_mwe_tokenizer = MWETokenizer(job_top_200_bigrams)
job_c={}
for key in job_tokenised.keys():
    job_c[key]=job_mwe_tokenizer.tokenize(job_tokenised[key])

#Removing the stopwords and short words
job_stopped={}
for key in job_c.keys():
    job_stopped[key]=[x for x in job_c[key] if len(x)>=3 and x.lower() not in stopwords_list]

# Stemming the stopped tokens to be similar to the ones after task2 for resumes    
stemmer=PorterStemmer()
job_stemmed={}
for key in job_stopped.keys():
    job_stemmed[key]=[stemmer.stem(x) for x in job_stopped[key]]

#Combining the set of tokens into a single string for each job     
jids=[]
job_words = []
job_term_freq={}
for key in job_stemmed.keys():
    jids.append(key)
    job_words.append(' '.join(job_stemmed[key]))
    job_term_freq[key]=dict(Counter(job_stemmed[key]))

# Making the job ids and their content into the dataframe similar to that of the previous step for the resume
job_df=pd.DataFrame.from_dict(dict(zip(jids,job_words)), orient='index').reset_index()
job_df.rename(columns={'index': 'ID', 0: 'Corpus'}, inplace=True)


## Step 5 - Finding the best fit resumes related to each job

- Iterating over each job in the job dataframe
  - build tf-idf vectors after combining that job with the resume corpus
  - find the cosine similarity of that job with every resume
  - extract the ten most similar resumes for each job and write them to a dictionary of the form {jobid:[resumeid1,resumeid2,...]}
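
The steps above can be sketched end to end on toy data using only the standard library. This is an illustrative approximation, not the notebook's implementation: the idf formula and L2 normalisation are intended to mirror scikit-learn's defaults, tokenisation is simplified to whitespace splitting, and the names `tfidf_vectors`, `cosine`, `rank_resumes` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """L2-normalised tf-idf vectors (as dicts), using a smoothed idf."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    """For unit-length sparse vectors the cosine similarity is the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def rank_resumes(job_text, resumes, top_n=10):
    """Return the ids of the top_n resumes most similar to the job text."""
    ids = list(resumes)
    # the job is placed first so that it shares one vocabulary with the resumes
    vectors = tfidf_vectors([job_text] + [resumes[i] for i in ids])
    job_vec, resume_vecs = vectors[0], vectors[1:]
    scored = sorted(zip(ids, resume_vecs),
                    key=lambda pair: cosine(job_vec, pair[1]), reverse=True)
    return [rid for rid, _ in scored[:top_n]]

# toy, made-up corpus of already stemmed text
resumes = {
    'r1': 'python data analysi python',
    'r2': 'java spring develop',
    'r3': 'python machin learn',
}
ranking = rank_resumes('python data scientist', resumes, top_n=2)
print(ranking)
```

In this sketch the job vector is kept apart from the resume vectors, so there is no self-match to discard. The notebook cell instead ranks the job against the whole combined corpus and skips the first result.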
  

In [37]:
# creating a placeholder for the recommended resumes for each job
rec_dict={}
# initialising the tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(input = 'content', analyzer = 'word')
#Iterating each job in the job dataframe
for index, row in job_df.iterrows():
    
    #combining the single job record with the resumes, working on a copy so that resume_df itself is never modified
    job_resume_df=resume_df.copy()
    job_resume_df.loc[-1]=row
    job_resume_df.index+=1
    job_resume_df.sort_index(inplace=True)
    
    #creating tf-idf vectors for the combined corpus
    tfidf_vectors = tfidf_vectorizer.fit_transform(job_resume_df['Corpus'])
    #finding the cosine similarities for the job with respect to the other resumes
    cosine_similarities = cosine_similarity(tfidf_vectors[0:1],tfidf_vectors)
    # sorting the similarities and extracting the indices of the 11 highest scores (the job itself plus ten resumes)
    res_indices=cosine_similarities[0].argsort()[:-12:-1]
    # storing the top ten recommended resume ids for the job, skipping the first entry (the job itself)
    rec_dict[row['ID']]=[job_resume_df['ID'][i] for i in res_indices][1:]
    #dropping the current job record for further iterations
    job_resume_df.drop(job_resume_df.index[0],inplace=True)
    job_resume_df.reset_index(drop=True,inplace=True)


## Step 6 - Writing the bonus output file

- Using the dictionary from the above step to write the bonus output file, one line per job in the form `jobid:resumeid1,resumeid2,...`, as required by the specifications

In [38]:
# writing the recommended resumes for each job to the file
with open('./output/ranked_resumes.txt','w') as bonus_file:
    for key in rec_dict.keys():
        bonus_file.write(key+':'+','.join(rec_dict[key])+'\n')
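
As a quick sanity check on the output format, the file can be read back into a dictionary. The following is a minimal sketch, assuming one `jobid:resumeid1,resumeid2,...` line per job. The helper `parse_ranked_resumes` and the sample ids are hypothetical, and an in-memory buffer stands in for the real file.

```python
import io

def parse_ranked_resumes(lines):
    """Parse 'jobid:resume1,resume2,...' lines into {jobid: [resume ids]}."""
    ranked = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # split on the first colon only, in case later fields contain colons
        job_id, _, resume_part = line.partition(':')
        ranked[job_id] = resume_part.split(',') if resume_part else []
    return ranked

# made-up ids standing in for the contents of ./output/ranked_resumes.txt
sample = io.StringIO("job_1:res_4,res_9\njob_2:res_2,res_7\n")
parsed = parse_ranked_resumes(sample)
print(parsed)
```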

## Task 3 Summary

This task measured the understanding of how a recommender system can suggest the top resumes for a job. The main outcomes achieved were:

- **Generating Tf-Idf vectors** - Each job was combined with the resume corpus and converted to tf-idf vectors, so that jobs and resumes are profiled in the same vector space.
- **Cosine Similarity** - The similarity between a job vector and each resume vector was measured by the cosine of the angle between them, and the ten most similar resumes were recommended for each job.
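
For reference, the cosine similarity between a job vector $\mathbf{j}$ and a resume vector $\mathbf{r}$ (symbols introduced here for illustration only) can be written as:

$$
\cos\theta = \frac{\mathbf{j}\cdot\mathbf{r}}{\lVert\mathbf{j}\rVert\,\lVert\mathbf{r}\rVert} = \frac{\sum_i j_i r_i}{\sqrt{\sum_i j_i^2}\,\sqrt{\sum_i r_i^2}}
$$

Because tf-idf weights are non-negative, the value lies between 0 (no shared terms) and 1 (vectors pointing in the same direction).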

# References

- Python Software Foundation. (2018). *Python Documentation*. Retrieved from https://docs.python.org/3.6/
- Regex101. (2018). *Quick Reference*. Retrieved from https://regex101.com/
- NLTK Project. (2018). *Collocations*. Retrieved from http://www.nltk.org/howto/collocations.html
- NLTK 3.3 documentation. (2018). *Association Measures*. Retrieved from http://www.nltk.org/_modules/nltk/metrics/association.html
- The `pandas` Project. (2018). *pandas 0.23.4 documentation: pandas.DataFrame*. Retrieved from https://pandas.pydata.org/pandas-docs/stable/
- Wikipedia. (2018). *Cosine similarity*. Retrieved from https://en.wikipedia.org/wiki/Cosine_similarity
- Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. (2018). *sklearn.metrics.pairwise.cosine_similarity*. Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
- Perone, Christian. (2013, September 12). *Machine Learning :: Cosine Similarity for Vector Space Models (Part III)*. Retrieved from http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
- Raman Venkat. (2017, October 23). *How To Build a Simple Content Based Book Recommender System*. Retrieved from https://www.linkedin.com/pulse/content-based-recommender-engine-under-hood-venkat-raman/