# Recommender System: Ranking resumes with respect to the given job advertisements
#### Name: Arunava Munshi
#### Date:03-Sep-2018
#### Environment: Python 3 and Jupyter notebook
#### Libraries used: 
* String (for String Processng)
* nltk - natural language toolkit (tokenizer, lemmatizer, stopwords, collocations and probabilities)
* re (for regular expression, included in Anaconda Python 2.7) 
* itertools (for iterations)
* collections (for Collocations)

# Introduction
The purpose of this project is to demonstrate how wrangled data from different sources can be used to help the organizations to make informed decisions. In this project, the system recommends the top 10 resumes that are the best fit for the first 500 job advertisements in 'raw_data.dat' their “required qualifications” section. The resume files are given in 'resume_dataset.txt', from which resumes are picked and recommended for suitable job profiles.

Output files: Recommender_System.ipynb and Recommended_Resume.txt which contains the recommended resumes for the first 500 job advertisements. The txt file contains 500 lines and each line of the txt file has the following format: Job_advertisment_id: first_ranked_resume_id, second_ranked_resume_id, …., tenth_ranked_resume_id

# Step 1. Importing Packages
All the required packages are imported before being started.

In [1]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk.data
import string
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup as bsoup
import re
import os
import nltk
from nltk.collocations import *
from itertools import chain
import itertools
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from itertools import islice

# Step 2: Creating Separate Lists of Job Ids and Job Responsibilities after Stopword Removal and Sremming
The following steps need to be performed in order to do the above 
## Reading the given raw_data.dat file and Taking 1st 1000 Records
The below code reads the content of **'raw_data.dat'** file. Because the file is too big, a partial reading would be the best for the performance improvement. In the given file, it can be seen that the data regarding the job postings are given and each and every job posting is separated by a '------------------------------'. So this string became separator between two jobs and it is used in as a separator in reading in the below code. the variable **'readfile'** contains a list of individual job postings. There are also some exception handling done for the file processing. If the file is not found, then the FileNotFoundError is thrown, when there is permission error, PermissionError exception is raised etc.

In [2]:
try:                    #Reading file raw_data.dat and splitting on the below eparator
    readfile = open('raw_data.dat').read().split('------------------------------')
    readfile.remove(readfile[len(readfile) - 1]) #Removing the last item from the list as it is null
    all_job_postings_list = readfile[:1000] #Taking out 1st 1000 Occurrences
except FileNotFoundError:
    raise
except PermissionError:
    raise
except OSError:
    raise
except:
    raise

## Reading the Stop Word File: stopwords_en.txt
This file is read with a split on '\n' and the stopwords are stored into a list named **'stop_word_list'** which is kept for later use. 

In [3]:
stop_word_file = open('stopwords_en.txt', 'r') #Opening and read each stopwords_en.txt file in read mode
stop_word_list = stop_word_file.read().split('\n') #Splitting items on '\n'
stop_word_file.close() #Closing the file

## Taking Out 1st 500 Job Responsibilities
The below portion of code extracts the 1st 500 Job reponsibilities and put them into the dictionary all_job_postings_dict.

In [4]:
current_counter = 1
maximum_items_no = len(all_job_postings_list)  #Length of readfile
all_job_postings_dict = {}
for items in all_job_postings_list:
    
    temp_item = '' 
    if current_counter <= 500:  #Cheking in current counter is <= Maximum items in the list or not
        #The below Regular Expressions find the specific patterns and substitute them with a constant one
        replaced_str = re.sub('(ID:)', '^_id^', items)
        replaced_str = re.sub('(DATE_START:|DATES:|start_date:|START DATE:|START_DA:)', '^start_date^', replaced_str)
        replaced_str = re.sub('(job_desc:|_description:|JOB_DESC:|DESCRIPTION:|JOB DESCRIPTION:)', '^job_descriptions^', replaced_str)
        replaced_str = re.sub('(RESPONSIBILITY:|JOB RESPONSIBILITIES:|responsibilities:|JOB_RESPS:|RESP:)', '^job_responsibilities^', replaced_str)
        replaced_str = re.sub('(title:|JOB TITLE:|JOB_T:|TITLES:|_TTL:)', '^title^', replaced_str)
        replaced_str = re.sub('(ABOUT COMPANY:|about_company:|COMPANYS_INFO:|ABOUT:|_info:)', '^about_company^', replaced_str)
        replaced_str = re.sub('(QUALIFS:|QUALIFICATION:|REQUIRED QUALIFICATIONS:|REQ_QUALS:|qualifications:)', '^required_qualifications^', replaced_str)
        replaced_str = re.sub('(APPLICATION_DL:|APPLICATION_DEADL:|DEAD_LINE:|DEADLINES:|deadline:)', '^application_deadline^', replaced_str)
        replaced_str = re.sub('(JOB_PROCS:|PROCEDURES:|JOB_PROC:|PROCEDURE:|procedures:)', '^application_procedure^', replaced_str)
        replaced_str = re.sub('(LOCATION:|JOB_LOC:|LOCATIONS:|_LOC:|_LOCS:)', '^location^', replaced_str)
        replaced_str = re.sub('(JOB_SAL:|REMUNERATION:|SALARY:|remuneration:|salary:)', '^salary^', replaced_str)
        #Avoiding garbage records by substituting them with null character
        replaced_str = re.sub('(REMUNERATION\/|START DATE\/|ABOUT PROGRAM\/|OPEN TO\/)', '', replaced_str)
        
        replaced_list = replaced_str.split('^')  #Splitting the substituted string on '^'
        replaced_dict = {}
        punctuation = '!"#$%&\'*+,-./:;<=>?@\\^_`|~ \n' #Storing the list of punctuation
        temp_qualification = ''
        temp_id = ''
        if 'required_qualifications' in replaced_list:
            for items in replaced_list:        #Iterating each items in replaced_list
                if items == '_id':
                    item_index = replaced_list.index(items)
                    temp_id = replaced_list[item_index + 1].strip(punctuation).replace('\n',' ')
                                    #Populating dictionary items after replacing '\n' with ' and python striping
                if items == 'required_qualifications':
                    item_index = replaced_list.index(items)
                    temp_qualification = replaced_list[item_index + 1].strip(punctuation).replace('\n',' ')
                                    #Populating dictionary items after replacing '\n' with ' and python striping 

                all_job_postings_dict[temp_id] = temp_qualification

            current_counter +=1
all_job_postings_dict.pop('', None)

''

## Removing the Stopwords and the Tokens with length lss than 3
The Below code removes the Stopwords and the Tokens with length less than 3

In [5]:
stopwords_list = stopwords.words('english') #Get the English Stopwords
for keys, values in all_job_postings_dict.items(): #Traverse for all job responsibilities
    tokenizer = RegexpTokenizer(r"\w+(?:[-.]\w+)?") #Word tokenize using Regular Expression
    unigram_tokens = tokenizer.tokenize(values) #Creating Unigram tokens
    #Joining back the tokens into strings after stop word and token less than length 3 removal
    all_job_postings_dict[keys] = ' '.join(set([ items for items in unigram_tokens if items.lower() not in stopwords_list or len(items) >= 3]))

## Creating Tokenized Dictionary and All Vocab List
In this step two different objects are formed:
> - all_all_job_postings_dict_tokenized dictinary that consists of each document id as key and list of unigram tokens for each document as value. This dictionary is created by invoking a method named tokenizePatent() which actually generates unigram word tokens from the string of each job responsibility content.
> - From all_job_postings_dict_tokenized dictionary, the list of all vocabularies is created by chain.from_iterable() method which takes the list of lists as input and returns a single list of all items across this list of lists.

In [6]:
def tokenizePatent(pid):
    """
        the tokenization function is used to tokenize each patent.
        The one argument is patent_id.
        First, normalize the case.
        Then, use the regular expression tokenizer to tokenize the patent with the specified id
    """
    each_dict_item = all_job_postings_dict[pid]
    each_dict_item_tokenized = tokenizer.tokenize(each_dict_item)
    return (pid, each_dict_item_tokenized) # return a tupel of patent_id and a list of tokens

#Getting the tokenized version of the dictionary where each word token is separated
all_job_postings_dict_tokenized = dict(tokenizePatent(pid) for pid in all_job_postings_dict.keys())
#Getting the whole word lists
all_job_postings_words = list(chain.from_iterable(all_job_postings_dict_tokenized.values()))

## Generating Top 200 Bigrams
A bigram is an asociation of two meaningful words. The below code creates top 200 meaningful bigrams in form of a list of tuples with the help of the full vacab list all_job_postings_words, created in the previous step. This program also set frequency filter at the level of 10 and also filters out those bigrams with length less than 3.

In [7]:
bigram_measures = nltk.collocations.BigramAssocMeasures() #Creating bigram_measures object
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_job_postings_words)
                                                #Creating bigram_finder object using all word list obtained from previous step
bigram_finder.apply_freq_filter(10) #Filter out the bigrams which occur less than 10 times in the enntire vocabulary
bigram_finder.apply_word_filter(lambda w: len(w) < 3)#Ignoring bigrams less than length 3
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # making list of top 200 bigrams
top_200_bigrams

[('SQL', 'Experience'),
 ('fields', 'University'),
 ('preferably', 'Economics'),
 ('team', 'within'),
 ('information', 'Ability'),
 ('Office', 'Analytical'),
 ('Office', 'Proficiency'),
 ('equivalent', 'degree'),
 ('service', 'Excellent'),
 ('Higher', 'Armenian'),
 ('professional', 'least'),
 ('written', 'both'),
 ('Excellent', 'desirable'),
 ('Java', 'knowledge'),
 ('Ability', 'independently'),
 ('data', 'experience'),
 ('the', 'public'),
 ('both', 'communication'),
 ('related', 'field'),
 ('experience', 'International'),
 ('NET', 'and'),
 ('working', 'Good'),
 ('solving', 'work'),
 ('years', 'Proven'),
 ('experience', 'Finance'),
 ('Bachelor', 'degree'),
 ('international', 'with'),
 ('years', 'Bank'),
 ('work', 'office'),
 ('languages', 'financial'),
 ('Ability', 'attention'),
 ('management', 'working'),
 ('personality', 'knowledge'),
 ('Team', 'Ability'),
 ('Good', 'education'),
 ('plus', 'years'),
 ('preferable', 'Knowledge'),
 ('thinking', 'Good'),
 ('and', 'deadlines'),
 ('Excel'

## Generating Tokenized Dictionary and All Vocab List with Bigrams
In this step the function MWETokenizer() merges each the top_200_bigrams into unigrams with a separator '\_'. The  mwetokenizer is an iterative object which is further used to create a dictionary named all_job_postings_dict_tokenized_with_bigram that consists of all unigram version of bigram tokens with the separate words of each bigram being removed. This is done with the help of mwetokenizer.tokenize() function.

In [8]:
mwetokenizer = MWETokenizer(top_200_bigrams) #Merging bigrams with '_' through multiword tokenizer
all_job_postings_dict_tokenized_with_bigram =  dict((pid, mwetokenizer.tokenize(patent)) for pid,patent in all_job_postings_dict_tokenized.items())
                                    #Inclusion of biagrams into the tokenized dictionary, ingnoring their individual occurrences

## Stemming on Word Tokens at each Resume Level
This step finally does the stemming on all_job_postings_dict_tokenized_with_bigram. After this step, all_job_postings_dict_tokenized_with_bigram will consist of document id as the key and stemmed tokens of each job responsibility content. The input for stemmer program is each word from the list(dictionary value) and the output dictionary from this step is **all_job_postings_dict_tokenized_with_bigram**. 

In [9]:
stemmer = PorterStemmer()  #Creating stemmer object
final_tokens =[]
### Taking each list of word tokens with unigram and bigrams for individual documents
for keys, values in all_job_postings_dict_tokenized_with_bigram.items():
    final_tokens = final_tokens + [stemmer.stem(w) for w in values ] #Stemming the token list
    all_job_postings_dict_tokenized_with_bigram[keys] = final_tokens #Replacing the older token list with updated one
    final_tokens = []

## Creating Separate Lists of Job Ids and Job Responsibilities
In order to do this, first, two separate lists each for job id and job content is initialized. Then the all_job_postings_dict_tokenized_with_bigram is traversed to populate the job ids into id list and job responsibilities in job content list. Join() method is used to convert the list of string into a string of pace separator.


In [10]:
job_id_list = []    #Initializing two lists
job_content_list = []
#Iterating for each tokenized list after garbage tokens ramoval
for pid, tokens in all_job_postings_dict_tokenized_with_bigram.items():
    job_id_list.append(pid)  #Appending job ids to job is list
    text = ' '.join(tokens)   #Joining back tokens into strings
    job_content_list.append(text)  #Appending job responsibilities into job content list
    text = ''

# Step 3: Creating Separate Lists of Resume Ids and Resume Details after Stopword Removal, Context Dependent and Rare Token Roval and Stemming
The following steps need to be performed in order to do the above 
## Reading Input Files:
### Reading Resume numbers from 'resume_dataset.txt'
In this step, the 'resume_dataset.txt', provided in the datasets, is searched for 29453232 to get the following list of resume number. Below is the format of the document:

29453232:\[0 12 23 ------]

So in order to extract the document ids, first the correponding number(29453232 in this case) is found in the document list and then the following string enclosed by a '[]' is extracted through the regular expression: ('29453232:\[(.*?)\]') Once the string is found, all the '\n' are replaced with a ' ' and the replaced string is split on ' ' to get the document numbers. The whole list is converted into a set and then again into a list in order to remove the duplicates.

In [11]:
readfile = open('resume_dataset.txt').read() #Reading the file
readfile_29453232_data = re.findall('29453232:\[(.*?)\]',readfile,re.S)[0] #Finding the resume numbers for '29453232'
readfile_29453232_data  = readfile_29453232_data.replace('\n', ' ') #Replacing all '\n' with ' '
readfile_29453232_list = readfile_29453232_data.split(' ') #Splitting the whole string into the list of resume numbers
readfile_29453232_list = list(set([ items for items in readfile_29453232_list if items != '' ]))
                        #Removing the null items, removing duplicate resume numbers and finally putting them into final list

### Reading the Contents of the Resumes
After the resume numbers are extracted, the proper file names are built with the following format:
**resume(123).txt**
For this, the resume number is sandwiched inside the string **'resume()'**. When resume id is found zero, then 1 is passed within the string. Once the resume ids are generated, resumes are read and entered into into the list **'all_resume_list'**.

In [12]:
all_resume_list = []
#Building resume names for each resume numbers obtained from previous step
for items in readfile_29453232_list:  
    if items == '0':  #If resume number is '0' then the resume name will be resume_(1).txt
        build_resume_str = 'resume_(' + '1' + ').txt'
    else: #For others the resume name will be resume_(<Resume No>).txt
        build_resume_str = 'resume_(' + items + ').txt'
    read_resume = open(build_resume_str,'r',encoding='UTF-8') #Opening and read each resume file in read mode
    all_resume_list.append((items, read_resume.read()))
    read_resume.close() #Closing resume file

## Sentence Tokenizing, Word Tokenizing, Stop Word Removal and Token of Length Less Than Three Removal
In this step many accomplishments such as Sentence Tokenization, Word Tokenization, Stop Word Removal and removal of Token length less than three take place. 
Firstly, the list **all_resume_list** is iterated. For each resume in that list the following activities are performed:
> - Each resume is tokenized into sentences through sentence tokenizer.
> - Now for each sentence in the sentence list the below operations take place:
    > - Each sentence is further tokenized into word tokens with the use of regular expression r"\w+(?:[-.]\w+)?"
    > - The first token of each sentence is lowercased
    > - The tokens with length less than three are removed
    > - The word tokens are joined back to a complete sentence
    > - Each parsed sentence are again joined back to the resume content
> - The Document Number and Parsed Resume ontent are placed into a dictionary name 'all_resume_dict'

In [13]:
resume_profile = []
#Iterating each resume from the resume list
for each_resume_tuple in all_resume_list:
    item_index = all_resume_list.index(each_resume_tuple) #Taking down the index number
    sentences = nltk.data.load('tokenizers/punkt/english.pickle') #Sentence Tokenizing
    sentences_list = sentences.tokenize(each_resume_tuple[1].strip()) #Stripping spaces from the ends
    each_sentence_wo_sw_lc = ''
    all_sentence_wo_sw_lc = ''
    #Iterating each sentence from the sentence list
    for each_sentence in sentences_list:
        tokenizer = RegexpTokenizer(r"\w+(?:[-.]\w+)?")  #Word Tokenization is done on this Regular Expression
        unigram_tokens = tokenizer.tokenize(each_sentence) #Unigram tokens are obtained
        if len(unigram_tokens) > 0:  #Ignoring the null sentences
            unigram_tokens[0] = unigram_tokens[0].lower()  #Lowercasing the 1st word of each sentence
            
            #Removing the stopwords
            unique_unigram_tokens_wo_sw = [ items for items in unigram_tokens if items.lower() not in stop_word_list ]
            #Removing tokens of length less than 3
            unique_unigram_tokens_wo_sw = [ items for items in unique_unigram_tokens_wo_sw if len(items) >= 3 ]
            #Joining back the word tokens into complete sentences
            each_sentence_wo_sw_lc = ' '.join(unique_unigram_tokens_wo_sw)
            all_sentence_wo_sw_lc += each_sentence_wo_sw_lc + ' '  #Joining back all the sentences
            
    all_resume_list[item_index] = (each_resume_tuple[0], all_sentence_wo_sw_lc) 
                            #Putting resume number and resume contents into each tuple
all_resume_dict = dict(all_resume_list)  #Converting the tuple into dictionary

## Creating Tokenized Dictionary and All Vocab List
In this step two different objects are formed:
> - all_resume_dict_tokenized dictinary that consists of each document id as key and list of unigram tokens for each document as value. This dictionary is created by invoking a method named tokenizePatent() which actually generates unigram word tokens from the string of each resume content.
> - From all_resume_dict_tokenized dictionary, the list of all vocabularies is created by chain.from_iterable() method which takes the list of lists as input and returns a single list of all items across this list of lists.

In [14]:
def tokenizePatent(pid):
    """
        the tokenization function is used to tokenize each patent.
        The one argument is patent_id.
        First, normalize the case.
        Then, use the regular expression tokenizer to tokenize the patent with the specified id
    """
    each_dict_item = all_resume_dict[pid]
    each_dict_item_tokenized = tokenizer.tokenize(each_dict_item)
    return (pid, each_dict_item_tokenized) # return a tupel of patent_id and a list of tokens

#Getting the tokenized version of the dictionary where each word token is separated
all_resume_dict_tokenized = dict(tokenizePatent(pid) for pid in all_resume_dict.keys())
#Getting the whole word lists
all_resume_words = list(chain.from_iterable(all_resume_dict_tokenized.values()))

## Generating Top 200 Bigrams
A bigram is an asociation of two meaningful words. The below code creates top 200 meaningful bigrams in form of a list of tuples with the help of the full vacab list all_resume_words, created in the previous step. This program also set frequency filter at the level of 10 and also filters out those bigrams with length less than 3.

In [15]:
bigram_measures = nltk.collocations.BigramAssocMeasures() #Creating bigram_measures object
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_resume_words)
                                    #Creating bigram_finder object using all word list obtained from previous step
bigram_finder.apply_freq_filter(10) #Filter out the bigrams which occur less than 10 times in the enntire vocabulary
bigram_finder.apply_word_filter(lambda w: len(w) < 3) #Ignoring bigrams less than length 3
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # making list of top 200 bigrams


## Generating Tokenized Dictionary and All Vocab List with Bigrams
In this step the function MWETokenizer() merges each the top_200_bigrams into unigrams with a separator '\_'. The  mwetokenizer is an iterative object which is further used to create a dictionary named all_resume_dict_with_bigram that consists of all unigram version of bigram tokens with the separate words of each bigram being removed. This is done with the help of mwetokenizer.tokenize() function. Finally, the all_resume_vocabs list is created which also consists of the entire resume vocabulary with unigram version of bigrams included and separate words for each bigram removed.

In [16]:
mwetokenizer = MWETokenizer(top_200_bigrams)  #Merging bigrams with '_' through multiword tokenizer
all_resume_dict_with_bigram =  dict((pid, mwetokenizer.tokenize(patent)) for pid,patent in all_resume_dict_tokenized.items())
                    #Inclusion of biagrams into the tokenized dictionary, ingnoring their individual occurrences
all_resume_vocabs = list(chain.from_iterable(all_resume_dict_with_bigram.values()))
                    #Inclusion of biagrams, ingnoring their individual occurrences
all_resume_vocabs = list(set(all_resume_vocabs)) #Removing duplicate words

## Identification of Context Dependent Tokens and Rare Tokens
The word tokens that appear in 98% or more documents are called Context Dependent Tokens, whereas word tokens that appear in 2% or less documents are called Rare Tokens. Removal of these tokens is necessary in any text procesing technique. The below code identifies words of these two categories and place them into appropriate lists. In order to identify such words a dictionary of words with corresponding frequencies is created. Now each word of this dictionary(in thi case, all_words_document_freq) is travarsed, and depending on its frequency value in respect to 98% or 2%, it is placed either in context_dependent_token_list or in rare_token_list. Now these two lists are merged to get the complete list of the tokens to be removed.

In [17]:
context_dependent_token_list = []
rare_token_list = []

all_words_document_freq = {}
#Iterating the dictionary with unigram and bigram eord tokens
for keys,lists in all_resume_dict_with_bigram.items():
    for items in set(lists): #Taking each unique unigrams and bigrams
        if all_words_document_freq.get(items) == None: #If the word is not yet discovered then making its freq 1
            all_words_document_freq[items] = 1
        else:                   #If the word is discovered then increasing its freq by 1
            all_words_document_freq[items] += 1

for keys, values in all_words_document_freq.items(): #Traversing this dictionary containing all vocabs and their document frequencies
    all_words_document_freq[keys] = (values/len(all_resume_dict_with_bigram)) * 100
                                    #Calculating the document frequency of each token
    if all_words_document_freq[keys] >= 98: #If the document frequency is more than 98%, placing it into context dependent list
        context_dependent_token_list.append(keys)
    elif all_words_document_freq[keys] <= 2: #If the document frequency is less than 2%, placing it into rare token list
        rare_token_list.append(keys)
#Mergng context_dependent_token_list and rare_token_list into one list total_list
if len(context_dependent_token_list) > 0:
    total_list = context_dependent_token_list + rare_token_list
else:
    total_list = rare_token_list

## Removing Context Dependent and Rare Tokens from each Resume Level
This step removes the Context Dependent and Rare Tokens from the all_resume_dict_with_bigram dictionary, the dictionary with key as Resume Number and value as list of Unigrams and Bigrams in each Resume, making it ready for Stemming at each resume level.

In [18]:
#Removing the context dependent and raretokens from the list of word tokens for each resume
for key, value in all_resume_dict_with_bigram.items():
    all_resume_dict_with_bigram[key] = [ items for items in value if items not in total_list]

## Stemming on Word Tokens at each Resume Level
This step finally does the stemming on all_resume_dict_with_bigram. After this step, all_resume_dict_with_bigram will consist of document id as the key and stemmed tokens of each resume content. The input for stemmer program is each word from the list(dictionary value) and the output dictionary from this step is **all_resume_dict_with_bigram**. 

In [19]:
stemmer = PorterStemmer()  #Creating stemmer object
final_tokens =[]
### Taking each list of word tokens with unigram and bigrams for individual documents
for keys, values in all_resume_dict_with_bigram.items():
    final_tokens = final_tokens + [stemmer.stem(w) for w in values ] #Stemming the token list
    all_resume_dict_with_bigram[keys] = final_tokens #Replacing the older token list with updated one
    final_tokens = []

## Creating Separate Lists of Resume Ids and Resume Details
In order to do this, first, two separate lists each for resume id and resume details is initialized. Then the all_resume_dict_with_bigram is traversed to populate the resume ids into id list and resume details in job content list. Join() method is used to convert the list of string into a string of pace separator.

In [20]:
resume_id_list = []  #Initializing two lists
reume_content_list = []
#Iterating for each tokenized list after garbage tokens ramoval
for pid, tokens in all_resume_dict_with_bigram.items(): 
    resume_id_list.append(pid) #Appending job ids to job is list
    text = ' '.join(tokens)    #Joining back tokens into strings
    reume_content_list.append(text)  #Appending job responsibilities into job content list
    text = ''

# Step 4: Building the Recommender System and Place the Whole Recommended Content Output File 
In the below activity the, a final output file is generated with top 10 resumes against each job postings. Here, fistly job_content_list, generated in previous step is 1st traversed. For each job content, the following tasks are done: 
> - A temporary list temp_reume_content_list is created and whole resume content is copied
> - The Job Content is then appended at first of this temp_reume_content_list to compare all resumes with that particular job content.
> - The tfidf_vectorizer vector object is created and all resumes are matched with the job content through 
cosine_similarity() function
> - The cosine similarity of each resume and the resume numbers are combines and sorted on the similarity values.
> - The top ten resume numbers are selected depending on the higher similarity values
> - The top 10 resumes are placed in the output file in the given format.

In [21]:
reume_content_list
count = 1
job_recommend_dict = {}  #Defining recommender dictionary
for items in job_content_list:   #Iterating on each job content
    index = job_content_list.index(items)  #Taking down the list index
    temp_reume_content_list = []
    
    temp_reume_content_list = [item for item in reume_content_list]  #Copying list data into temporary list
    temp_reume_content_list.insert(0, items) #Insert the job details into the 1st position of the list
    tfidf_vectorizer = TfidfVectorizer() #Creating tfidf_vectorizer object
    tfidf_matrix = tfidf_vectorizer.fit_transform(temp_reume_content_list) #Creating tfidf_matrix
    cosine_similarity_list = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).tolist()
                    #Generating the Cosine Similarity Matrix
    cosine_similarity_list = cosine_similarity_list[0] #Converting into 1st level of list
    del(cosine_similarity_list[0])  #Delete the job content
    resume_recommend_zip = zip(cosine_similarity_list, resume_id_list) #Creating the list of tuple on Cosine similatity
    resume_recommend_list = list(resume_recommend_zip)
    resume_recommend_list.sort(key=lambda tup: tup[0], reverse=True) #Sorting the list on reverse
    resume_recommend_dict = dict(resume_recommend_list) #Creating the recommendation dictionary
    resume_recommend_list_of_10 = list(islice(resume_recommend_dict.items(), 10)) #Giving Top 10 recommendation
    resume_recommend_list_of_10 = [items[1] for items in resume_recommend_list_of_10]
    job_recommend_dict[job_id_list[index]] = resume_recommend_list_of_10 #Creating Final dictionary

#Writing the output file Recommended_Resume.txt
bonus_29453232 = open('Recommended_Resume.txt', 'w')
for keys, values in job_recommend_dict.items():
    file_write_str = ''
    file_write_str = keys + ': '
    count = 1
    for items in values:
        if count != 10:
            file_write_str += items + ', '
        else:
            file_write_str += items + '\n'
        count += 1
    file_write_str = file_write_str.strip(',')
    bonus_29453232.write(file_write_str)