#  Text2XML
####  Name: Girish Bhatta

Date: 31/08/2018

Environment: Python 3.6 and Jupyter notebook
Libraries used: 
* nltk - natural language toolkit (tokenizer, stopwords, collocations and probabilities)
* re (for regular expression, included in Anaconda Python 3.6) 


## 1. Introduction
This analysis extracts data from a text-based corpus of 250 resume documents. Data was extracted by reading the input files allotted to each student in the resume_student.txt files. The names of the allotted files were kept in a list and read one after the other by iterating over the list items, capturing the file data in a dictionary where the key is the filename and the value is the data of that file.

The list of allotted files had duplicate entries. These duplicate entries were omitted while reading the files.

Text pre-processing was performed with the objective of producing a lexical vocabulary for the resumes and the associated sparse count vector for each resume. The pre-processing included tokenisation, removal of stop words and stemming. Additionally, tokens were filtered by document frequency (greater than 98% or less than 2%), and meaningful bigrams were identified. The initial tokenised vocabulary of the corpus was 16916 words, which was reduced to 2489 words following pre-processing.



## 2.  Import libraries 

In [2]:
# import the Natural Language Toolkit
import nltk

# import ngrams from nltk.util
from nltk.util import ngrams

# import RegexpTokenizer from nltk.tokenize
from nltk.tokenize import RegexpTokenizer

# import FreqDist from nltk.probability to calculate frequency distributions
from nltk.probability import FreqDist
import nltk.data

# import the regular expression library
import re

## 3. Reading the Allotted Files

The folder "resumeTxt" contains 866 files, where each file represents a resume. Each resume is an unstructured text file. In order to process the data, each file is read and stored in a list of dictionaries, where each dictionary represents a resume document.
  

### 3.1 Initial loading of all files

In [3]:
#list of all the files that needs to be read
files = ["446 857 219 751 750 784 548  84 234 775 807 355 106 119 198 243 519 813 476 155 341 274 256 308 401 743 640 312  29 626 474 178 467 789 358 593 818 236 661 352 571 382 860  35  84 411 772 144 336 525 705 341 537 656 294 286 247 773 807 224 394 612 480 844 586 170 583 784 616 662 349 366 644 799 828 556   6 643 458 492 660 757 572 606 659 810 311 293 729 472 845   4 529 401 352 291 208 519 686 687 844 682 707 539  72 164 619 496 168  90 299 661 635 670 125 765 623 403 540  58 665 284 801 342 767 810 44 434  78 393 838 731 309 167 265 778 609 769 681 771 220 301 829 131 358 759 519 362  13 536 349 576  95 466 665 413 858 659 839 540 732 132 773   3 324 346 849 604 307 548 564 479 351 279 728 270 539 549 590 771 13 397 629 300 525 775  72 597 608 288 278 842  62 283 677 381 265 417 736 827 667 216  99 107 225 820 554 811 347 161 535 469 383 111 206 279 569 644 136 550  10 139 629 139 625 499 463 369 607 379 557 372 824 638 505 289 794 619 356  19 240  77   3  59 457 294 509 238 579 531"]

# split on spaces to get each file entry, then drop the empty entries in the list
list_of_files = files[0].split(" ")
list_of_files = list(filter(None, list_of_files))

# remove duplicate entries and sort the names (string sort, so '10' precedes '3')
list_of_files = list(set(list_of_files))
list_of_files.sort()

# For the files to be read successfully, this .ipynb file must be placed in the same
# directory as the text files. Otherwise open() raises an error and execution stops.

# list which will store dictionaries where each dictionary represents a resume document
file_data_list = []

count = 0
for each_file in list_of_files:
    count += 1
    file_data = {}
    filename = "resume_("+each_file+").txt"
    # the with-block closes the file even if reading fails
    with open(filename,"r",encoding="utf8") as fh:
        file_data[each_file] = fh.read()
    file_data_list.append(file_data)

print("Total number of files read :"+str(count))


Total number of files read :218


### 3.2 Inspection of the structure of Data in the files

In [4]:
#checking the content of the first file
file_data_list[0:1]

[{'10': 'Laurent Lapaire \n\nDate of birth: \nNationality: \nAddress: \nMobile: \nMail: \n\nEducation  \n2017 \n \n2013 \n\n \n2012 -13 \n\n \n2010 \n \n\n24 April 1990 \nSwiss \n26 Jalan Elok, Singapore 229064 \n+65 8399 9433  \nlapaire.laurent@gmail.com  \n\n \n\nChartered Accountant (Singapore) - ISCA \n--end? Bachelor thesis, Geneva University) \nBachelor in Business Administration, Geneva University, faculty of SES (Social and Economic \nScience), Switzerland –Thesis on the future of the Chinese currency. (Lapaire, LJ 2013, Renminbi dead-\nTwo semesters scholarship at Yonsei University, Seoul, South Korea to finalize my bachelor in Business \nAdministration (international exchange program). \n\nHigh school degree at Collège Calvin, Geneva, Switzerland (specialization in law and economics) \n\nProfessiona l  Experience  \n12/15 – present \n\n  Corporate Services Manager at Alpadis (Singapore) Pte. Ltd. In Singapore \n-  Monthly preparation of consolidated financial reports, budget,

The data looks rather unstructured. Before initial preprocessing such as sentence segmentation and case normalization, we have to get rid of the special characters that appear as part of the text.

### 3.3 Removal of special characters from the data of each file

In [5]:
# removal of special characters as part of preprocessing
for each_res in file_data_list:
    for k in each_res:
        # raw string so that the backslash escapes reach the regex engine unchanged
        all_spl = list(set(re.findall(r'^[^A-Za-z0-9\,\(\)\{\}\[\]\#\@\$\%\^\&\s\+\|\.\,\:\*]+',each_res[k],re.MULTILINE)))
        for each_spl in all_spl:   
            each_res[k] = each_res[k].replace(each_spl,'')

We iterate through the list of dictionaries and replace each special-character occurrence with an empty string. The regex <b>"^[^A-Za-z0-9\,\(\)\{\}\[\]\#\@\$\%\^\&\s\+\|\.\,\:\*]+"</b> can be dissected in the following way. The special characters appeared at the start of new lines, hence the caret outside the character class; with re.MULTILINE it anchors the match at the start of every line. Inside the character class there is a second caret, which negates the class so that it matches anything except the characters listed. All matches are collected with findall, and every occurrence of each matched run, wherever it appears in the text, is then replaced with an empty string.
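The behaviour described above can be tried on a small string. The following is a minimal sketch, not part of the original notebook: the helper name `strip_special` and the sample text are assumptions made for illustration, while the pattern is the one used in the cell above.

```python
import re

# Same pattern as the cell above, written as a raw string.
SPECIAL_AT_LINE_START = r'^[^A-Za-z0-9\,\(\)\{\}\[\]\#\@\$\%\^\&\s\+\|\.\,\:\*]+'

def strip_special(text):
    """Find special-character runs at line starts, then remove them everywhere."""
    runs = set(re.findall(SPECIAL_AT_LINE_START, text, re.MULTILINE))
    for run in runs:
        # str.replace removes every occurrence, not only the one at the line start
        text = text.replace(run, '')
    return text

# hypothetical sample resembling the resume text
sample = "Education\n- Monthly reports\n2012 -13\n--end? thesis"
cleaned = strip_special(sample)
print(repr(cleaned))
```

Note that the hyphen in "2012 -13" is removed as well, even though it is not at a line start, because the replacement is applied to the whole text. This matches the "2012 13" seen in the notebook output.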

## 4 Parsing the Data for further processing

### 4.1 Sentence detection

In [6]:
#download the required packages to carry out sentence tokenization
nltk.download('punkt')
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

#sentence detection and making changes to the file_data_list
for each_res in file_data_list:
    for k in each_res:        
        each_res[k] = sent_detector.tokenize(each_res[k].strip())
        
file_data_list[0:1]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[{'10': ['Laurent Lapaire \n\nDate of birth: \nNationality: \nAddress: \nMobile: \nMail: \n\nEducation  \n2017 \n \n2013 \n\n \n2012 13 \n\n \n2010 \n \n\n24 April 1990 \nSwiss \n26 Jalan Elok, Singapore 229064 \n+65 8399 9433  \nlapaire.laurent@gmail.com  \n\n \n\nChartered Accountant (Singapore)  ISCA \nend?',
   'Bachelor thesis, Geneva University) \nBachelor in Business Administration, Geneva University, faculty of SES (Social and Economic \nScience), Switzerland –Thesis on the future of the Chinese currency.',
   '(Lapaire, LJ 2013, Renminbi dead\nTwo semesters scholarship at Yonsei University, Seoul, South Korea to finalize my bachelor in Business \nAdministration (international exchange program).',
   'High school degree at Collège Calvin, Geneva, Switzerland (specialization in law and economics) \n\nProfessiona l  Experience  \n12/15 – present \n\n  Corporate Services Manager at Alpadis (Singapore) Pte.',
   'Ltd.',
   'In Singapore \n  Monthly preparation of consolidated finan

As part of data parsing, we first perform sentence tokenization, which is useful for case normalization. As a result of sentence tokenization, the content of each file is <b>divided into sentences and returned as a list of strings where each string represents a sentence.</b> We then case normalize the first token/word of each sentence and concatenate the sentences back together for further processing.

### 4.2 Case normalization and Concatenation

In [8]:
#This code snippet iterates through each file's data, which is currently list of strings, case normalises the first word of the sentence 
# and concatenates it back in the same format.
for each_res in file_data_list:
    for k in each_res:
        sentence = ""
        for each_sentence in each_res[k]:
            sentence += each_sentence.replace(each_sentence[0:each_sentence.find(' ')],each_sentence[0:each_sentence.find(' ')].lower())
        each_res[k] = sentence
        
file_data_list[0:1]

[{'10': 'laurent Lapaire \n\nDate of birth: \nNationality: \nAddress: \nMobile: \nMail: \n\nEducation  \n2017 \n \n2013 \n\n \n2012 13 \n\n \n2010 \n \n\n24 April 1990 \nSwiss \n26 Jalan Elok, Singapore 229064 \n+65 8399 9433  \nlapaire.laurent@gmail.com  \n\n \n\nChartered Accountant (Singapore)  ISCA \nend?bachelor thesis, Geneva University) \nbachelor in Business Administration, Geneva University, faculty of SES (Social and Economic \nScience), Switzerland –Thesis on the future of the Chinese currency.(lapaire, LJ 2013, Renminbi dead\nTwo semesters scholarship at Yonsei University, Seoul, South Korea to finalize my bachelor in Business \nAdministration (international exchange program).high school degree at Collège Calvin, Geneva, Switzerland (specialization in law and economics) \n\nProfessiona l  Experience  \n12/15 – present \n\n  Corporate Services Manager at Alpadis (Singapore) Pte.ltd.in Singapore \n  Monthly preparation of consolidated financial reports, budget, cost reports f

We can see the result of case normalization in the above output. The first word of each sentence has been case normalized and the sentences have been put back together as a single string for tokenization.
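In the output above, `str.replace` lowercases every occurrence of a sentence's first word (both occurrences of "Bachelor" become "bachelor"), and the sentences are joined without a separator (for example "end?bachelor"). If only the leading word should change, one possible variant is sketched below. The function names are hypothetical, and joining with a single space is an assumption rather than the notebook's behaviour.

```python
def lowercase_first_word(sentence):
    """Lowercase only the leading word of a sentence, leaving the rest untouched."""
    head, sep, tail = sentence.partition(' ')
    return head.lower() + sep + tail

def normalise_sentences(sentences):
    """Case-normalise the first word of each sentence and re-join with a space."""
    return ' '.join(lowercase_first_word(s) for s in sentences)

# hypothetical sentences in the shape returned by the sentence tokenizer
sentences = ['Bachelor thesis, Geneva University) \nBachelor in Business.', 'Ltd.', 'In Singapore']
result = normalise_sentences(sentences)
print(repr(result))
```

Because `str.partition` returns the whole string as the head when no space is present, a one-word sentence such as 'Ltd.' is handled without special cases.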

### 4.3 Tokenization of file contents

In [9]:
# list to hold one dictionary per file, mapping the filename (key) to its list of tokens (value)
token_list = []

#using the RegexpTokenizer to tokenize the file text
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")    
for each_resume in file_data_list:
    token_dict = {}
    for k in each_resume:
        token_dict[k] = tokenizer.tokenize(each_resume[k])
    token_list.append(token_dict)

token_list[0:1]

[{'10': ['laurent',
   'Lapaire',
   'Date',
   'of',
   'birth',
   'Nationality',
   'Address',
   'Mobile',
   'Mail',
   'Education',
   '2017',
   '2013',
   '2012',
   '13',
   '2010',
   '24',
   'April',
   '1990',
   'Swiss',
   '26',
   'Jalan',
   'Elok',
   'Singapore',
   '229064',
   '65',
   '8399',
   '9433',
   'lapaire',
   'laurent',
   'gmail',
   'com',
   'Chartered',
   'Accountant',
   'Singapore',
   'ISCA',
   'end',
   'bachelor',
   'thesis',
   'Geneva',
   'University',
   'bachelor',
   'in',
   'Business',
   'Administration',
   'Geneva',
   'University',
   'faculty',
   'of',
   'SES',
   'Social',
   'and',
   'Economic',
   'Science',
   'Switzerland',
   'Thesis',
   'on',
   'the',
   'future',
   'of',
   'the',
   'Chinese',
   'currency',
   'lapaire',
   'LJ',
   '2013',
   'Renminbi',
   'dead',
   'Two',
   'semesters',
   'scholarship',
   'at',
   'Yonsei',
   'University',
   'Seoul',
   'South',
   'Korea',
   'to',
   'finalize',
   'my

We can see the tokens of the file as a list of strings. The same format holds for every dictionary: each dictionary has the file name as the key and the list of tokens as the value.
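To see what the token pattern accepts, it can be applied directly with `re.findall`. This is a sketch under the assumption that `RegexpTokenizer`, in its default matching mode, returns the regex matches for a pattern without capturing groups; the sample string is made up for illustration.

```python
import re

# the pattern passed to RegexpTokenizer in the cell above
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

def tokenize(text):
    """Return word tokens, keeping one internal hyphen or apostrophe per token."""
    return re.findall(TOKEN_PATTERN, text)

tokens = tokenize("lapaire.laurent@gmail.com multi-task 12/15 client's")
print(tokens)
```

Punctuation such as '.', '@' and '/' acts as a separator, which is why the e-mail address in the sample resume is split into 'lapaire', 'laurent', 'gmail' and 'com'.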

### 4.4 Functions to find the number of tokens and the vocabulary size for all the files

In [10]:
# function to compute the total number of tokens across all the files
def token_size(token_list):
    all_tokens = []
    for each_res in token_list:
        for key in each_res:
            all_tokens += each_res[key]
    return len(all_tokens)

#function to determine the length of the vocabulary
def get_vocab_size(token_list):
    all_tokens = []
    for each_res in token_list:
        for key in each_res:
            all_tokens += each_res[key]
    return len(set(all_tokens))

In [11]:
print("The length of the initial token list is : " + str(token_size(token_list)))
print("The length of the vocabulary is : " + str(get_vocab_size(token_list)))
print("The lexical diversity is : " + str(token_size(token_list)/get_vocab_size(token_list)))

The length of the initial token list is : 145413
The length of the vocabulary is : 16916
The lexical diversity is : 8.59618113029085


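These functions give the total number of tokens and the vocabulary size. The token count alone does not indicate anything about the vocabulary size; it serves as a reference, a checkpoint to confirm that the number of tokens decreases after each of the following steps. The lexical diversity reported here is the number of tokens divided by the vocabulary size, i.e. the average number of times each distinct token occurs.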
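The three figures can also be computed in a single pass. The sketch below assumes the same list-of-dictionaries structure as `token_list`; the function name `corpus_stats` and the toy data are hypothetical.

```python
def corpus_stats(token_list):
    """Return (token count, vocabulary size, tokens per distinct token)."""
    all_tokens = [tok for doc in token_list for tokens in doc.values() for tok in tokens]
    n_tokens = len(all_tokens)
    n_types = len(set(all_tokens))
    return n_tokens, n_types, n_tokens / n_types

# toy corpus: two documents keyed by file name
toy = [{'1': ['fund', 'audit', 'fund']}, {'2': ['audit', 'risk', 'fund']}]
stats = corpus_stats(toy)
print(stats)
```

Here six tokens over three distinct tokens give a ratio of 2.0, meaning each distinct token occurs twice on average.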

### 4.5 Removing the Context independent/Stop words

In [12]:
# read the contents of the stopwords_en.txt file; the path has to be changed to the directory where the file resides
path = "E:/all study resources/Semester2/Data Wrangling/Assignment 1/assign1/task2/"
with open(path+"stopwords_en/stopwords_en.txt","r") as fh:
    # a set gives constant-time membership tests during filtering
    stop_word = set(fh.read().split("\n"))

# iterate over token_list and filter the stop words out of each file's tokens (case-sensitive match)
for each_token_list in token_list:
    for k in each_token_list:
        filtered_tokens = [token for token in each_token_list[k] if token not in stop_word]
        each_token_list[k] = filtered_tokens
        
print("The length of the token list after stop word removal : " + str(token_size(token_list)))
print("The length of the vocabulary after stop word removal is : " + str(get_vocab_size(token_list)))
print("The lexical diversity after stop word removal is : " + str(token_size(token_list)/get_vocab_size(token_list)))

The length of the token list after stop word removal : 108404
The length of the vocabulary after stop word removal is : 16573
The lexical diversity after stop word removal is : 6.541000422373741


### 4.6 Identification of the first 200 meaningful bigrams

### 4.6.1 Initial selection of 250 bigrams

In [14]:
# This snippet collects the tokens of all the documents into one list and counts the bigrams in it.
# FreqDist.most_common(250) returns the 250 most frequent bigrams as a list of (bigram, count) tuples.
ngram_list = []
for each_res in token_list:
    for key in each_res:
        ngram_list += each_res[key]

bigrams = ngrams(ngram_list,n=2)
bigrams = list(FreqDist(bigrams).most_common(250))
bigrams
        

[(('Hong', 'Kong'), 390),
 (('cid', '1'), 146),
 (('due', 'diligence'), 103),
 (('financial', 'statements'), 99),
 (('private', 'equity'), 79),
 (('Pte', 'Ltd'), 69),
 (('M', 'A'), 62),
 (('Microsoft', 'Office'), 61),
 (('WORK', 'EXPERIENCE'), 61),
 (('University', 'Hong'), 55),
 (('Bachelor', 'Business'), 54),
 (('Business', 'Administration'), 52),
 (('English', 'Mandarin'), 52),
 (('Financial', 'Services'), 50),
 (('Asset', 'Management'), 48),
 (('Private', 'Equity'), 46),
 (('asset', 'management'), 45),
 (('real', 'estate'), 45),
 (('financial', 'models'), 43),
 (('Word', 'Excel'), 41),
 (('internal', 'external'), 40),
 (('Accounting', 'Finance'), 40),
 (('Fluent', 'English'), 38),
 (('fund', 'managers'), 37),
 (('internal', 'control'), 37),
 (('Business', 'School'), 37),
 (('2016', 'Present'), 36),
 (('cash', 'flow'), 36),
 (('Senior', 'Associate'), 35),
 (('Certified', 'Public'), 35),
 (('P', 'L'), 34),
 (('2015', 'Present'), 33),
 (('Risk', 'Management'), 33),
 (('Fund', 'Service

We initially select 250 bigrams rather than 200 because some bigrams contain numerical values, and we assume that such bigrams are of little assistance to us. We take the 250 most frequent bigrams, remove those containing numerical tokens, and then keep only the first 200 of the remainder. This way we collect the 200 most frequent meaningful bigrams; starting from 250 leaves enough candidates once the numerical bigrams are omitted.

### 4.6.2 Filtering the numerical Bigrams

In [15]:
# this snippet looks through the 250 most frequent bigrams and filters out those with numerical values in them;
# the first 200 bigrams of the resulting list are then kept

#variable to count number of non-numeral bigrams
count  = 0

#List to hold filtered bigrams
filtered_bigrams = []
for each_tup in bigrams:
    if each_tup[0][0].isalpha() and each_tup[0][1].isalpha():
        count += 1
        filtered_bigrams.append(each_tup)
print("The total number of only alphabetic bigrams are  :" ,count)

filtered_bigrams = filtered_bigrams[0:200]
filtered_bigrams

The total number of only alphabetic bigrams are  : 221


[(('Hong', 'Kong'), 390),
 (('due', 'diligence'), 103),
 (('financial', 'statements'), 99),
 (('private', 'equity'), 79),
 (('Pte', 'Ltd'), 69),
 (('M', 'A'), 62),
 (('Microsoft', 'Office'), 61),
 (('WORK', 'EXPERIENCE'), 61),
 (('University', 'Hong'), 55),
 (('Bachelor', 'Business'), 54),
 (('Business', 'Administration'), 52),
 (('English', 'Mandarin'), 52),
 (('Financial', 'Services'), 50),
 (('Asset', 'Management'), 48),
 (('Private', 'Equity'), 46),
 (('asset', 'management'), 45),
 (('real', 'estate'), 45),
 (('financial', 'models'), 43),
 (('Word', 'Excel'), 41),
 (('internal', 'external'), 40),
 (('Accounting', 'Finance'), 40),
 (('Fluent', 'English'), 38),
 (('fund', 'managers'), 37),
 (('internal', 'control'), 37),
 (('Business', 'School'), 37),
 (('cash', 'flow'), 36),
 (('Senior', 'Associate'), 35),
 (('Certified', 'Public'), 35),
 (('P', 'L'), 34),
 (('Risk', 'Management'), 33),
 (('Fund', 'Services'), 32),
 (('Junior', 'College'), 32),
 (('Kuala', 'Lumpur'), 31),
 (('Class'

This step gives us a list of the most commonly appearing meaningful bigrams. These are retokenized into the token list of every document so that they become part of the vocabulary. A snapshot of the filtered bigrams is shown above.
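The select-then-filter logic of Sections 4.6.1 and 4.6.2 can be sketched with the standard library alone. This is an illustrative stand-in, not the notebook's code: it swaps `collections.Counter` in for `FreqDist`, and the function name, parameter names and toy tokens are assumptions.

```python
from collections import Counter

def top_alpha_bigrams(tokens, candidates=250, keep=200):
    """Take the most frequent bigrams, drop non-alphabetic ones, keep the first `keep`."""
    # adjacent pairs of tokens are the bigrams
    counts = Counter(zip(tokens, tokens[1:]))
    common = counts.most_common(candidates)
    alpha = [(bg, n) for bg, n in common if bg[0].isalpha() and bg[1].isalpha()]
    return alpha[:keep]

# toy token stream
tokens = ['Hong', 'Kong', '2016', 'Present', 'Hong', 'Kong', 'due', 'diligence']
top = top_alpha_bigrams(tokens, candidates=5, keep=2)
print(top)
```

Drawing more candidates than are finally kept is what leaves room for discarding bigrams such as ('2016', 'Present').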

### 4.7 Consolidation of token by including bigrams in the vocabulary using the MWE tokenizer

In [16]:
from nltk.tokenize import MWETokenizer

# collect the bigrams as tuples; these are passed to the MWE tokenizer so that
# each bigram is retokenized as a single token and becomes part of the vocabulary
formalised_bigrams = [bigram for bigram, _ in filtered_bigrams]

# build the tokenizer once, passing formalised_bigrams as the parameter for MWETokenizer
mwe_tokenizer = MWETokenizer(formalised_bigrams)

# iterate through the parent token list and merge the top 200 bigrams in each file's tokens
for each_res in token_list:
    for key in each_res:
        each_res[key] = mwe_tokenizer.tokenize(each_res[key])

print("The length of the token list after retokenization with the MWE tokenizer : " + str(token_size(token_list)))
print("The length of the vocabulary after retokenization with the MWE tokenizer : " + str(get_vocab_size(token_list)))
print("The lexical diversity after retokenization with the MWE tokenizer : " + str(token_size(token_list)/get_vocab_size(token_list)))
token_list[0:1]

The length of the token list after retokenization with the MWE tokenizer : 103785
The length of the vocabulary after retokenization with the MWE tokenizer : 16759
The lexical diversity after retokenization with the MWE tokenizer : 6.192791932692882


[{'10': ['laurent',
   'Lapaire',
   'Date',
   'birth',
   'Nationality',
   'Address',
   'Mobile',
   'Mail',
   'Education',
   '2017',
   '2013',
   '2012',
   '13',
   '2010',
   '24',
   'April',
   '1990',
   'Swiss',
   '26',
   'Jalan',
   'Elok',
   'Singapore',
   '229064',
   '65',
   '8399',
   '9433',
   'lapaire',
   'laurent',
   'gmail',
   'Chartered_Accountant',
   'Singapore',
   'ISCA',
   'end',
   'bachelor',
   'thesis',
   'Geneva',
   'University',
   'bachelor',
   'Business_Administration',
   'Geneva',
   'University',
   'faculty',
   'SES',
   'Social',
   'Economic',
   'Science',
   'Switzerland',
   'Thesis',
   'future',
   'Chinese',
   'currency',
   'lapaire',
   'LJ',
   '2013',
   'Renminbi',
   'dead',
   'Two',
   'semesters',
   'scholarship',
   'Yonsei',
   'University',
   'Seoul',
   'South',
   'Korea',
   'finalize',
   'bachelor',
   'Business_Administration',
   'international',
   'exchange',
   'program',
   'high',
   'school',
   

This step is performed after the removal of stop words and before stemming. Stop words are context independent and often appear consecutively, so bigrams found before their removal would be dominated by stop-word combinations; removing the stop words first and then finding the bigrams therefore makes sense. Furthermore, it is essential to do this before stemming, because once the tokens are stemmed they are truncated and the bigrams as a whole would no longer convey their meaning. Hence the identification of bigrams at this stage is justified.
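What the retokenization does to a token list can be illustrated with a simplified stand-in. The sketch below is not `MWETokenizer` itself: it is a hypothetical greedy left-to-right merge that handles bigrams only, using the same underscore separator seen in the output above.

```python
def merge_bigrams(tokens, bigrams, separator='_'):
    """Greedily merge adjacent token pairs that appear in `bigrams` into one token."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2  # both tokens of the pair are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['bachelor', 'Business', 'Administration', 'Hong', 'Kong']
merged = merge_bigrams(tokens, [('Business', 'Administration'), ('Hong', 'Kong')])
print(merged)
```

Each merge replaces two tokens with one, which is consistent with the drop in the total token count reported after this step.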

### 4.8 Stemming of lower case tokens

In [17]:
#require the PorterStemmer function available in the nltk.stem package
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# iterate through the token_list and stem only the lower case tokens, leaving the merged bigrams (tokens with an underscore) untouched
for each_res in token_list:
    for key in each_res:
        stemmed_tokens = [stemmer.stem(token) if token.islower() and not re.search(r'[A-Za-z]+_[A-Za-z]+',token) else token for  token in  each_res[key]]
        each_res[key] = stemmed_tokens
        
print("The length of the token list after stemming is: " + str(token_size(token_list)))
print("The length of the vocabulary after stemming is : " + str(get_vocab_size(token_list)))
print("The lexical diversity after stemming is: " + str(token_size(token_list)/get_vocab_size(token_list)))

token_list[0:1]

The length of the token list after stemming is: 103785
The length of the vocabulary after stemming is : 14398
The lexical diversity after stemming is: 7.208292818447006


[{'10': ['laurent',
   'Lapaire',
   'Date',
   'birth',
   'Nationality',
   'Address',
   'Mobile',
   'Mail',
   'Education',
   '2017',
   '2013',
   '2012',
   '13',
   '2010',
   '24',
   'April',
   '1990',
   'Swiss',
   '26',
   'Jalan',
   'Elok',
   'Singapore',
   '229064',
   '65',
   '8399',
   '9433',
   'lapair',
   'laurent',
   'gmail',
   'Chartered_Accountant',
   'Singapore',
   'ISCA',
   'end',
   'bachelor',
   'thesi',
   'Geneva',
   'University',
   'bachelor',
   'Business_Administration',
   'Geneva',
   'University',
   'faculti',
   'SES',
   'Social',
   'Economic',
   'Science',
   'Switzerland',
   'Thesis',
   'futur',
   'Chinese',
   'currenc',
   'lapair',
   'LJ',
   '2013',
   'Renminbi',
   'dead',
   'Two',
   'semest',
   'scholarship',
   'Yonsei',
   'University',
   'Seoul',
   'South',
   'Korea',
   'final',
   'bachelor',
   'Business_Administration',
   'intern',
   'exchang',
   'program',
   'high',
   'school',
   'degre',
   'Collèg

We perform stemming on the lower case tokens only, because the Porter stemmer converts tokens to lower case by default. To preserve the upper case tokens, on the assumption that they convey specific meaning, we apply stemming only to the lowercase tokens; the merged bigrams are also left unstemmed. Stemming leaves the number of tokens unchanged at 103785 but reduces the vocabulary size from 16759 to 14398, which raises the lexical diversity from 6.19 to 7.21.

### 4.9 Removing tokens with length less than 3

In [18]:
#iterates through the token list and filter the tokens with length less than 3
for each_res in token_list:
    for key in each_res:
        proper_len_tokens = [token for token in each_res[key] if len(token) >= 3]
        each_res[key] = proper_len_tokens

print("The length of the token list after removing short tokens is: " + str(token_size(token_list)))
print("The length of the vocabulary after removing short tokens is : " + str(get_vocab_size(token_list)))
print("The lexical diversity after removing short tokens is: " + str(token_size(token_list)/get_vocab_size(token_list)))

The length of the token list after removing short tokens is: 96114
The length of the vocabulary after removing short tokens is : 13693
The lexical diversity after removing short tokens is: 7.019206894033448


This step removes tokens shorter than 3 characters, on the assumption that such tokens contribute little to the vocabulary and do not help the analysis.

### 4.10 Removal of context dependent tokens with document frequency greater than 98%

In [19]:
# FreqDist from nltk.probability is used to find the tokens with the highest document frequency
from nltk.probability import FreqDist
uniq_tokens = []        
for each in token_list:
    for k in each:
        # take the set of each file's tokens so that a token is counted at most once per file;
        # the count of a token then equals the number of files the token is present in
        uniq_tokens += list(set(each[k]))
        
freq = FreqDist(uniq_tokens)
freq.most_common(200)

[('client', 173),
 ('manag', 171),
 ('includ', 158),
 ('report', 155),
 ('team', 154),
 ('compani', 151),
 ('work', 151),
 ('2013', 140),
 ('account', 139),
 ('perform', 135),
 ('2015', 134),
 ('2011', 133),
 ('Singapore', 132),
 ('provid', 131),
 ('invest', 130),
 ('process', 130),
 ('fund', 130),
 ('financi', 128),
 ('2012', 127),
 ('2014', 127),
 ('busi', 126),
 ('University', 123),
 ('prepar', 118),
 ('review', 117),
 ('servic', 117),
 ('Management', 117),
 ('oper', 114),
 ('2016', 114),
 ('gmail', 113),
 ('project', 113),
 ('Present', 111),
 ('market', 110),
 ('2010', 108),
 ('Email', 108),
 ('analysi', 105),
 ('bank', 104),
 ('Finance', 103),
 ('audit', 102),
 ('Hong_Kong', 100),
 ('Business', 99),
 ('industri', 98),
 ('ensur', 95),
 ('skill', 95),
 ('intern', 94),
 ('present', 94),
 ('portfolio', 93),
 ('product', 92),
 ('develop', 92),
 ('support', 92),
 ('system', 92),
 ('valuat', 91),
 ('May', 91),
 ('Mandarin', 90),
 ('issu', 90),
 ('2008', 90),
 ('trade', 89),
 ('meet', 88)

Instead of directly writing code that removes all words with a document frequency greater than 98%, we first find the document frequency of the most common token. If it is greater than 98%, we go ahead with removing such words from the list; if it is not, no token can exceed the threshold and nothing needs to be removed. This saves unnecessary processing.

In [20]:
uniq_tokens.count('client')/218*100

79.35779816513761

We see that the most frequent token, with a document frequency of about 79%, does not pass the 98% threshold. Hence we do not remove any tokens: since they are arranged in decreasing order of document frequency, no other token can exceed the threshold either.

### 4.11 Removal of tokens with document frequency less than 2%

In [21]:
# set of all the tokens that appear in at most 2% of the documents, taken from the document
# frequencies already stored in freq; this set is used below to filter the parent list
total_docs = len(token_list)
less_freq = {token for token, doc_count in freq.items() if doc_count/total_docs*100 <= 2}

# filter each file's tokens against less_freq
for each_res in token_list:
    for key in each_res:
        less_app_tokens = [token for token in each_res[key] if token not in less_freq]
        each_res[key] = less_app_tokens

print("The length of the token list after removing tokens with less than 2% doc frequency: " + str(token_size(token_list)))
print("The length of the vocabulary after removing tokens with less than 2% doc frequency : " + str(get_vocab_size(token_list)))
print("The lexical diversity after removing tokens with less than 2% doc frequency: " + str(token_size(token_list)/get_vocab_size(token_list)))


The length of the token list after removing tokens with less than 2% doc frequency: 77029
The length of the vocabulary after removing tokens with less than 2% doc frequency : 2489
The lexical diversity after removing tokens with less than 2% doc frequency: 30.947770188830855


After the removal of tokens with less than 2% document frequency, the number of tokens has fallen from 96114 to 77029 and the vocabulary size has come down considerably, from 13693 to 2489 tokens.
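The document-frequency filtering of the last two sections can be sketched compactly with the standard library. This is an illustration under stated assumptions: it uses `collections.Counter` in place of `FreqDist`, represents the corpus as a plain list of token lists rather than the notebook's list of dictionaries, and the function names are hypothetical.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each token, the number of documents it appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # a set counts each token once per document
    return df

def drop_rare(docs, min_pct=2.0):
    """Remove tokens whose document frequency is at most `min_pct` percent."""
    df = document_frequency(docs)
    n_docs = len(docs)
    rare = {tok for tok, n in df.items() if n / n_docs * 100 <= min_pct}
    return [[tok for tok in tokens if tok not in rare] for tokens in docs]

# toy corpus of four documents
docs = [['fund', 'audit', 'fund'], ['fund', 'risk'], ['fund', 'audit'], ['fund']]
df = document_frequency(docs)
filtered = drop_rare(docs, min_pct=25.0)
print(df['fund'], filtered)
```

Counting once per document and looking the counts up in a dictionary avoids rescanning the whole token list for every token.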

## 5 Creating sparsevector and vocab dictionary file
**CountVector**: converts a collection of text documents to a matrix of token counts.

### 5.1 Generate sparse count vectors as 29270863_countVec.txt file

In [26]:
# collate the tokens of all the files into a single list; its distinct tokens are indexed below
# to build the sparse vector representation
ngram_list = []
for each_res in token_list:
    for key in each_res:
        ngram_list += each_res[key]

# CountVectorizer is imported and instantiated here, but the indices are assigned manually below
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word")    

vocab = list(set(ngram_list))

# create a dictionary to hold the index value of each token
vocab_dict = {}
i = 0
for w in vocab:
    vocab_dict[w] = i
    i = i + 1

# for each resume, write its sparse vector as "resume_(name),index:count, index:count, ..." on one line
with open("./29270863_countVec.txt", 'w') as out_file:
    for each_res in token_list:
        for key in each_res:
            d_idx = [vocab_dict[w] for w in each_res[key]]
            out_file.write("resume_("+key+"),")
            for k, v in FreqDist(d_idx).items():            
                out_file.write("{}:{}, ".format(k,v))
            out_file.write('\n')

print("The sparse vector representation has been written to 29270863_countVec.txt.")

The sparse vector representation has been written to 29270863_countVec.txt.
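The line format written to the file can be illustrated on toy data. The sketch below is a variant rather than the notebook's exact output: it assumes a sorted vocabulary so that the indices are the same on every run, orders the pairs by index, and omits the trailing separator. The function names are hypothetical.

```python
from collections import Counter

def build_vocab_index(docs):
    """Map each distinct token to an index, in sorted order for reproducibility."""
    vocab = sorted({tok for tokens in docs.values() for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def sparse_line(name, tokens, vocab_index):
    """Format one document as 'resume_(name),index:count, index:count'."""
    counts = Counter(vocab_index[tok] for tok in tokens)
    pairs = ", ".join("{}:{}".format(i, counts[i]) for i in sorted(counts))
    return "resume_({}),{}".format(name, pairs)

# toy corpus keyed by file name
docs = {'10': ['fund', 'audit', 'fund'], '3': ['risk', 'fund']}
index = build_vocab_index(docs)
lines = [sparse_line(name, tokens, index) for name, tokens in docs.items()]
print("\n".join(lines))
```

Only the indices of tokens that actually occur in a document are written, which is what makes the representation sparse.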


### 5.2 Creating the Vocab Dictionary as  29270863_vocab.txt 

In [25]:
# dictionary holding the vocabulary entries sorted by token
sorted_vocab_dict = {}
for key in sorted(vocab_dict.keys()):
    sorted_vocab_dict[key] = vocab_dict[key]
    
#printing the Dictionary to the 29270863_vocab.txt file
vocab_str = str(sorted_vocab_dict)
out_file = open("./29270863_vocab.txt", 'w')
out_file.write(vocab_str)
out_file.close()

print("The vocabulary dictionary has been written to 29270863_vocab.txt")

The vocabulary dictionary has been written to 29270863_vocab.txt


## 6 Final Vocabulary
The final vocabulary at the end of pre-processing is the set of tokens remaining. Its size should be the same as the number of indices used to create the sparse count vectors above, since both are derived from the same set of tokens. The final vocabulary contains 2489 tokens.

In [24]:
# Code to showcase the final vocabulary
print(vocab[0:100])


['Audit_Associate', 'honour', 'sensit', 'set', '1000', 'transit', 'Revenue', 'Fund', 'Profile', 'Financial_Analyst', 'suffici', 'email', 'pursu', 'doubl', 'Skill', 'Clients', 'school', 'unusu', 'Outlook', 'ledger', 'identifi', 'collect', 'ADDITIONAL', 'impair', 'Life', 'activ', 'purpos', 'References', 'Identify', 'mortgag', 'Conference', 'speak', 'explain', 'Advanced', 'matter', 'notabl', 'Factset', 'High_School', 'sens', 'form', 'Air', 'master', 'leverag', 'Service', 'expatri', 'ownership', 'LLP_Singapore', 'GPA', 'reconcili', 'charg', 'nation', 'dividend', 'CSR', 'Investment_Management', 'Programme', 'Written', 'appropri', 'intellig', 'Initiated', 'salari', 'websit', 'institutional_investors', 'valuation_reports', 'verifi', 'join', 'digit', 'P_L', 'led', 'traine', 'Honour', 'assur', 'Institute', 'adher', 'Others', 'ventur', 'Employment', 'upgrad', 'India', 'Spring', 'Date', 'rais', 'seek', 'adequaci', 'vers', 'multi-task', 'This', 'Cultural', 'deliv', 'subsidiari', 'social', 'strateg

## 7 Final Checks for Sparse Count Vector and Pre-processing of text
To complete the pre-processing, the tokenisation, stemming, removal of stop words, replacement of bigrams, and the final creation of sparse count vectors can be checked by examining an individual resume.

The output for the sample resume shown after each step above illustrates the following:
  1.   Extraction of document data
  2.   Removal of special characters, sentence segmentation and case normalization
  3.   Tokenisation, and removal of stop words
  4.   Extraction of the top 200 bigrams, e.g. ('Hong', 'Kong')
  5.   Retokenization of bigrams using MWE tokenization, e.g. 'Business_Administration'
  6.   Applying the Porter stemmer on lower case tokens, e.g. 'semesters' -> 'semest'
  7.   Removing tokens with length less than 3
  8.   Removal of the most and least frequent words, with thresholds of greater than 98% and less than 2% document frequency (no token exceeded the 98% threshold)
  9.   Reduction to sparse vector and writing the sparse count vector to a file
  10.   Check of cross-reference to vocabulary

We have performed all the necessary steps as part of wrangling the data, and the output of each step is shown after it is performed. The progress of the wrangling process is gauged by three parameters: total number of tokens, vocabulary size, and lexical diversity. Over the whole process the number of tokens fell from 145413 to 77029 and the vocabulary size from 16916 to 2489, while the lexical diversity rose from 8.60 to 30.95. Individual steps did not all move in the same direction: the lexical diversity dropped after stop word removal and MWE retokenization before rising again. Overall, this could be a reaffirmation that we are moving in the right direction as far as wrangling is concerned.

## 8 Summary
1. Section 3 showed the extraction of data into a list of dictionaries, where each dictionary represents a resume document. It also showed some preprocessing.
2. Section 4 showed the main parts of parsing: tokenization, removing stop words, extracting bigrams, stemming, removal of tokens with length less than 3, and removing the most frequent and least frequent tokens against a set threshold of document frequency.
3. Section 5 demonstrated the creation of the sparse count vectors and the vocabulary dictionary, and the writing of each to its respective file.

The wrangling cycle illustrated here shows how a text starting with the statistics below was reduced to a sparser text while conserving its main features. The end output should serve as a good baseline for further topic modelling and feature analysis, as well as a basis for an efficient retrieval system.


######################### Text Statistics Before Wrangling ##################################

Total number of words: <b>145413</b><br>
Total number of vocabs: <b>16916</b><br>
Lexical diversity is :  <b>8.59618113029085</b><br>

######################### Final Text Statistics After Wrangling ##################################

Total number of words: <b>77029</b><br>
Total number of vocabs: <b>2489</b><br>
Lexical diversity is : <b>30.947770188830855</b><br>

## 9. References

* https://docs.python.org/3/library/re.html
* https://www.nltk.org/
* http://www.nltk.org/howto/stem.html
* http://www.nltk.org/_modules/nltk/util.html
* https://www.nltk.org/_modules/nltk/probability.html

