## Task 2: Text Preprocessing
Environment: Python 3.6.5 and Jupyter notebook


### 1. Introduction

In this task, we are given a dataset of around 250 CVs of students,from which we need to generate a sparse representation of the resumes by using the given regular expression for tokenization and vocablary generation. We also need to generate a sparse matrix to keep a track of all  the words in the vocabulary in each resume.

Steps followed for text preprocessing:

1. Loading the resumes and storing it in the form of a dictionary, with key as the resume number and its data as its value.      Normalisation of the first character in the string is also done by senetence tokenising the string.

2. After normalising I use the given regular expression and word tokenise each of the resumes, maintaining the dictionary    structure to avoid loosing the resume number of the resume.

3. After that, I remove context independent stopwords from the tokens of each resume. 

4. It is followed by finding the top 200 bigrams that occur in the text and create a collocation vocabulary, which contains a combination of the unigrams as the bigrams and storing it in a new dictionary with the help of MWE Tokenizer. Still the resume number is preserved in the key  of the dictionary.

5. Porter stemming is used to stem all the suffixes from the tokens and reduce the size of the vocabulary

6. Followed by porterstemmer(), context dependent stopwords, i.e. the words that occur in too many or too few documents are removed from the vocabulary

7. Words that are less than 3 letters are removed

8. A final vocabulary is formed and a sparse matrix is created that keeps a track of each word in the vocabulary to the number of times it occurs in the resume

### 2. Importing the required packages for task 2

In [71]:
import os
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem.porter import *
from nltk.collocations import *
from sklearn.feature_extraction.text import TfidfVectorizer
from itertools import chain
import itertools
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize
from nltk.stem import PorterStemmer
import pandas as pd
import re
from collections import Counter
from nltk.tokenize import sent_tokenize
from nltk.probability import *
nltk.download('punkt')
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.tokenize import MWETokenizer
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sidha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 3. Dictionary of resume numbers allocated with its data 

In this step we make a dictionary of all the resumes, with key as the resume numbers present in 'resume_dataset.txt' that are assigned to us, with the data in the resume as its value. We split the data in a resume by a '.' to normalize the first character of the string. After normalising the data we join the string again and assign it to its repective key in the dictionary.

In [2]:
#Filepath at which the resumes are present
file_path = 'C:/Users/sidha/Desktop/resumeTxt'

#Opening the given regualar expression 
fp = open('resume_dataset.txt') #opeing the file
resume_dataset = fp.read() #reading from the file
resume_dataset = resume_dataset.replace('\n',' ') #removing the line breaker 
resume_dataset = resume_dataset.replace(',',' ') #removing ',' 
resume_dataset = list(resume_dataset[1:-1].split(" ")) #forming a list by spliting the string at the whiteapce, after removing the square brackets at both ends

#A empty list to store all the resumes  
resume_dict={}

for each in resume_dataset: #looping through each file in the folder
    file_name = 'resume_('+each+').txt' #forming the file name
    xfile = os.path.join(file_path, file_name) #forming the file path to file name
    if os.path.isfile(xfile): #if the file exist
        file_pointer = open(xfile,encoding="utf8") #open the file to read
        data = file_pointer.read() #read the full content of the file to a variable
        if data: #if the file is not empty
            data = data.replace('\n','') #remove the line breakers to form a single line
            data_list=str.split(data,'. ') 
            normalized_data_list= list(map(lambda w: w[: w.find(' ')].lower() + w[w.find(' '):] if w else '', data_list)) #normalizing the data to lower case
            normalized_data = ' '.join(normalized_data_list)
            resume_dict[each] = normalized_data #saving the normalized data from each document to a dictionary with resume name as the key

### 4. Tokenizing words using the given regular expression

In this step, we tokenize each resume in the dictionary with the help of the given regular expression. Then we form a dictionary of all the resumes with tokenized data in its values 

In [3]:
#The given regular expression
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?") 


#To tokenize the data read from the reumes
def tokenize(resume):
    tokenized_resumes = tokenizer.tokenize(resume_dict[resume]) #tokenizing the string
    return (resume, tokenized_resumes) # return a tupel of patent_id and a list of tokens

#A dictionary with the resume numbers as the key and its value in the form of tokens
tokenized_resumes = dict(tokenize(resume) for resume in resume_dict.keys()) #calling the tokenize method in a loop for all the elements in the dictionary


### 5. Removing  tokens that are from  the given context independent stopwords from the vocab 

In this step, we remove all the tokens from the dictionary that are present in the given 'stopwords_en.txt' file. The purpose of this step is to reduce the vocab size as these stopwords do not reveal much information about the text and occur too often in the text, which makes other words in the vocabulary less significant. 


In [4]:
#An empty list to store all the given stopwords
stopwords=[]

#Opening the given stopwords file and storing the words in the stopwords list
with open('stopwords_en.txt') as f:
    stopwords = f.read().splitlines()    

#Looping to remove all the context independent words from the dictionary
for resume in tokenized_resumes.values():  
    for word in resume:  
        if word.lower() in stopwords:
            resume.remove(word)


###  6. Finding the top 200 Bigrams in the resumes

In this step we find out the top 200 bigrams present in the text. Bigrams is a sequence of words that make completely different meaning when they are considered individually rather than together.

Here, we use nltk.collocations.BigramAssocMeasures() to get the Bigram measure and nltk.collocations.BigramCollocationFinder.from_words() with nbest to find the top 200 bigrams usning liklihood_ratio


In [65]:
#Concatenating all the tokenized values using the chain.frome_iterable function to create a list of all the words 
total_tokens = list(chain.from_iterable(tokenized_resumes.values())) 

#Finding the top 200 bigrams
finder=BigramCollocationFinder.from_words(total_tokens)
bigrams=finder.nbest(BigramAssocMeasures.likelihood_ratio, 208)

#Eliminating numbers from bigrams
bigrams_list=[x for x in bigrams if not any(c.isdigit() for c in x)] 

#Preserving these bigrams and putting it back in the dictionary, along with the unigrams
mwetokenizer = MWETokenizer(bigrams_list)

#colloc_resumes is a dictionary that contains both the bigrams as well as the unigrams
colloc_resumes =  dict((resume, mwetokenizer.tokenize(data)) for resume,data in tokenized_resumes.items())

### 7. Porter stemmmer

In this step, we use the porter stemmer package to stem all the words by removing the suffixes from the tokens. This helps us in reducing the vocabulary size as the words with similar meaning are stemmed to a single word.


In [66]:
#Using the porterstemmer method
ps=PorterStemmer()
#An empty string to store the content of a particular resume
strcontent=''
#An empty dictionary to append the stemmed data back 
stemmed_dict=dict()

#Looping to stem each value in the dictionary
for key,resume in colloc_resumes.items():  
    for word in resume:  
        #Temporarily storing the data in an empty string
        strcontent=strcontent+ ' ' + ps.stem(word)
    
    #Assigning the string to the respective key
    stemmed_dict[key]=strcontent
    #Again emptying the string to store the next resume
    strcontent=''

#Loop to again word tokenize each resume in the dictionary and assigning it back to its resume number 
for key,resume in stemmed_dict.items():
    stemmed_dict[key]=word_tokenize(resume)
    
        

### 8.  Removing context independent words

Even after removing the context independent stopwords, there are some words that add little perspective to text preprocessing as they may be occuring in too many or too few documents to consider them in your vocabulary. Hence, in this step we remove words that occur only in 2% or over 98% of the total number of documents.

In [67]:
#An empty list to store the resume ids
resumes = []
#An empty list to store all the resume content in a singe list
resume_words = []
#An empty string to assign all the content
txt = ''
#Looping to append all the resume content in the empty sring
for resume, tokens in stemmed_dict.items():
    resumes.append(resume)
    txt = ' '.join(tokens)
    resume_words.append(txt)

In [68]:
# Using count vectorizer to remove words that occur in more than 98% and less than 2% of the resumes
vectorizer = CountVectorizer(input = 'content', analyzer = 'word',max_df=0.98, min_df=0.02)
vectorizerobject = vectorizer.fit_transform(resume_words)

#Vocab contains a list of words after removing the context dependent stopwords
vocab = vectorizer.get_feature_names()

#Checking the shape of the vectorizer obejct
print(vectorizerobject.shape)


(216, 2160)


### 9. Removing words that are less than 3 letters from the vocabulary

A subset of the vocabulary is created which contains the words that are greater than 2 letters.

In [76]:
#Using a list comprehension to elimenate 1 and 2 letter words
vocab2=[word for word in vocab if len(word)>2]

#Getting the length of the resultant vocabulary
print('LENGTH OF THE VOCABULARY: ' + str(len(vocab2)))
vocab_file = open("29330750_vocab.txt", 'w')
for word in vocab2:
    vocab_file.write(word +  '\n')

LENGTH OF THE VOCABULARY: 2020


### 10. Creating a sparse matrix to keep a track of occurance of all the words in vocabulary in each resume

The sparse matrix keeps a count of all the word occurances in vocabulary in each of the given resumes. It helps in keeping a track of diverity of the resultant vocabulary. 

In [58]:
#Initalising a file to write the sparse matrix
save_file = open("29330750_countVec.txt", 'w')

In [59]:
#Return the coordinate representation of a sparse matrix
cx = vectorizerobject.tocoo() 
for i,j,v in itertools.zip_longest(cx.row, cx.col, cx.data):
    save_file.write(resumes[i] + ',' + vocab[j] + ',' + str(v) + '\n')