# FIT5196 Assessment 2 - Text Pre-Processing & Feature Generation

Date: 15/09/2019

Version: 2.0

Environment: Python 3.7.3 and Anaconda 4.3.0 (64-bit)

#### Libraries Used

- **os**:
The `os` module is part of Python's standard library and provides a portable way of using operating-system-dependent functionality, such as running shell commands.

- **requests**:
The `requests` library sends HTTP requests. A request can carry headers, form data, and multipart files, and the response data is read from the returned response object.

- **re**:
Python's built-in regular expression module, used to match patterns of characters in a given text.

- **pandas**:
A third-party library widely used in data science applications. This assignment uses a pandas `Series`, a one-dimensional labelled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. The `value_counts` method is used to get the frequency of each distinct value in a series.

- **nltk**:
The Natural Language Toolkit is one of the best-known and most-used NLP libraries in the Python ecosystem, useful for tasks ranging from word segmentation and tokenization to stop word removal, stemming, and part-of-speech tagging.

- **RegexpTokenizer**:
A `RegexpTokenizer` divides a string into substrings using a regular expression that matches either the tokens themselves or the separators between them.

- **MWETokenizer**:
An `MWETokenizer` takes text that has already been divided into tokens and retokenizes it, merging multi-word expressions (MWEs) into single tokens using a lexicon of MWEs.

## 1. Introduction

This assignment deals with the second step of data wrangling: analysing textual data, i.e., extracting data from an unstructured format and converting it into a form suitable for a downstream modelling task.

In the first stage, we download the PDF files whose URLs are listed in a table by sending requests to those URLs, and then convert the downloaded PDF files to text files with pdf2txt.

The conversion introduces some artifacts into the .txt files, so we first do the required preprocessing and then carry out the following steps:

1. Sentence segmentation
2. Tokenization and case normalization
3. Generating bigrams
4. Removal of stop words
5. Removing context-dependent words, i.e. tokens whose document frequency is greater than 95%
6. Removing rare tokens, i.e. tokens whose document frequency is less than 3%
7. Removing tokens with length less than 3
8. Stemming the unigrams with PorterStemmer

These steps produce suitable input for NLP systems, recommender systems, information-retrieval algorithms, etc.

More details for each task will be given in the following sections.

## 2. Importing Libraries

In [1]:
import os
import re
import requests
import nltk.data
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
import itertools
import multiprocessing
import pandas as pd

## 3. Downloading files listed in the text file

We were given a PDF file containing the filename and download link for 200 PDF documents. Our first task is to convert Group060.pdf to a text file using pdfminer. To do this we use the os library: os.system() takes a string as an argument and runs it as a command in the terminal. Once this is done, we open Group060.txt and read the link and name of each of the 200 documents. The multiprocessing package (which allows the programmer to leverage multiple processors on a given machine) can speed up the download by running it in 4 processes, although the gain depends on the system's hardware. We provide the multiprocessing code in a comment below, but for this assignment we did not use it and downloaded the files with the requests package only.
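The parsing and download steps described above can be sketched as follows. This is a minimal sketch, not the code used in this assignment: the helper names `parse_links`, `download` and `download_all` are hypothetical, the sample lines and URLs are made up, and it swaps in `urllib.request` and a thread pool from the standard library in place of `requests` and `multiprocessing`, on the assumption that the work is mostly waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def parse_links(lines):
    """Return (filename, url) pairs from lines shaped like 'PP0001.pdf https://...'."""
    pairs = []
    for line in lines:
        parts = line.split()
        # keep only lines that hold a .pdf filename followed by a link
        if len(parts) == 2 and parts[0].endswith('.pdf'):
            pairs.append((parts[0], parts[1]))
    return pairs


def download(pair):
    """Fetch one URL and write the response bytes to the given filename."""
    filename, url = pair
    with urlopen(url) as response, open(filename, 'wb') as out_file:
        out_file.write(response.read())
    return filename


def download_all(pairs, workers=4):
    """Download every (filename, url) pair using a small pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download, pairs))


# Hypothetical sample lines; the real Group060.txt may be laid out differently.
sample = ['filename url',
          'PP0001.pdf https://example.org/a.pdf',
          '',
          'PP0002.pdf https://example.org/b.pdf']
pairs = parse_links(sample)
print(pairs)
```

Handing each worker its own (filename, url) pair avoids relying on a shared counter to pick the output name, which separate worker processes would not share.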

In [2]:
%%time
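#converting the pdf to text file
os.system('pdf2txt.py -o Group060.txt Group060.pdf')
File_list=[]
with open('Group060.txt','r') as inFile:
    for line in inFile:
        #only lines that contain a .pdf filename carry a download link
        if re.search(r"\.pdf",line):
            l=line.split(" ")
            #l[0] is the filename and l[1] is the url
            r=requests.get(l[1].rstrip())
            #keep the first six characters of the filename as the document name
            File_list.append(l[0][:6])
            with open(l[0],'wb') as outFile:
                outFile.write(r.content)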

CPU times: user 20 s, sys: 999 ms, total: 21 s
Wall time: 6min 35s


In [3]:
'''
This download can be made faster with multiprocessing (the gain depends on the hardware specification of your
system), so be careful while implementing it.

%%time
#converting the pdf to text file
os.system('pdf2txt.py -o Group060.txt Group060.pdf')
File_list=[]
l1=[]
with open('Group060.txt','r') as inFile:
    for line in inFile:
        if re.search(r"(.*).pdf",line):
            l=line.split(" ")
            l1.append(l[1].strip())
            File_list.append(l[0][:6])


def file_download(url,name):
    #each worker receives its own (url, name) pair; a global counter would not be shared between processes
    r=requests.get(url)
    with open(f'{name}.pdf','wb') as writer:
        writer.write(r.content)
    return 'Done Downloading'

pool = multiprocessing.Pool(processes=4)
pool.starmap(file_download, zip(l1, File_list))

'''


## 4. Converting pdf to text files

This block of code converts the PDF files to text files. It uses pdf2txt.py (a script provided by pdfminer) and os.system() to run the command 'pdf2txt.py -o filename.txt filename.pdf', where -o specifies the output file name; without it, the extracted contents are printed to stdout in text format. Here filename.txt is the output file and filename.pdf is the input file.
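The same command can also be issued through `subprocess`, which reports a failed conversion as an exception instead of a return code that is easy to ignore. The following is a sketch under that assumption, not the approach taken in this notebook; `pdf2txt_command` and `convert` are hypothetical helper names, and `convert` is defined but not called here.

```python
import subprocess


def pdf2txt_command(stem):
    """Build the argument list that converts <stem>.pdf to <stem>.txt."""
    return ['pdf2txt.py', '-o', f'{stem}.txt', f'{stem}.pdf']


def convert(stem):
    """Run the conversion; raises CalledProcessError if pdf2txt.py exits with an error."""
    subprocess.run(pdf2txt_command(stem), check=True)


print(pdf2txt_command('Group060'))
```

Passing the arguments as a list also avoids shell quoting problems if a filename ever contains a space.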

In [4]:
%%time
for word in File_list:
    os.system(f'pdf2txt.py -o {word}.txt {word}.pdf')  #using string formatting to convert all 200 files

CPU times: user 0 ns, sys: 427 ms, total: 427 ms
Wall time: 2min 5s


## 5. Sentence Segmentation

Sentence segmentation breaks a text into individual sentences, typically at the punctuation marks ".", "?" and "!".

To segment the sentences, we use NLTK's Punkt package.

The Punkt sentence tokenizer splits text into sentences by using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences.
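Before Punkt is applied, the `file_segmentation` function below undoes the line-break artifacts left by the PDF conversion. A minimal sketch of that cleaning step on its own, with a hypothetical helper name and a made-up input string:

```python
import re


def clean_extracted_text(text):
    """Undo line-break artifacts left by PDF-to-text conversion."""
    # rejoin words that were hyphenated across a line break: 'detec-\ntors' -> 'detectors'
    text = text.replace('-\n', '')
    # treat every remaining newline as a word separator
    text = re.sub(r'\n', ' ', text)
    return text.strip()


raw = 'Pedestrian detec-\ntors are an important\ntechnical goal.\n'
print(clean_extracted_text(raw))
```

One trade-off of this rule is that a genuinely hyphenated compound that happens to break at the end of a line loses its hyphen as well.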

In [5]:
def file_segmentation(string_sub_refer):
    '''
    When a pdf file is converted to .txt, a word that was hyphenated across a line break appears as (.*)-\n(.*).
    Search for this pattern and replace '-\n' with an empty string so that the word is rejoined.
    '''
    if re.search(r'(.*)-\n',string_sub_refer):
        string_sub_refer = string_sub_refer.replace('-\n','')
   
    '''
    When a pdf file is converted to .txt, two different words may be joined by a '\n'. Wherever a '\n' remains,
    replace it with a space so that they become two separate words.
    '''
    if re.search(r'(.*)\n(.*)',string_sub_refer):
        string_sub_refer = re.sub(r'\n'," ",string_sub_refer)
        
        
    l=[]
    '''
    When a pdf file is converted to .txt, some apostrophes come out as '?', e.g. doesn't appears as doesn?t.
    Split the complete string on spaces and replace '?' with "'" in every word except a standalone '?'.
    A '?' at the end of a word is a real question mark and strictly should not be replaced, but it is replaced
    as well, on the reasoning that the punctuation is discarded in the later steps anyway.
    '''
    for word in (string_sub_refer.split(' ')):
        if word!='?':
            word = word.replace('?',"'")
        l.append(word)
    string_sub_refer = ' '.join(l)
    
    '''
    Load the Punkt model and tokenize the complete string, so that each sentence is returned as one string.
    '''
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(string_sub_refer.strip())
    return sentences

## 6. Word Tokenization and Case Normalization

Tokenization is the process of breaking a stream of text into tokens. Computers usually represent text as a sequence of characters, whereas natural language processing (NLP) and text mining algorithms operate on tokens. We therefore generate the tokens that downstream NLP systems will use.

The file_tokenziation method below takes the output of sentence segmentation: a list of strings, where each string is one sentence. We traverse the list, tokenize each sentence with the given regular expression using RegexpTokenizer, and append the resulting tokens to a list. Inspecting the generated tokens shows that some words that start with "fi" do not match the words in the segmented sentences, i.e. the "fi" is missing from the token. So, before appending a token, we look it up in a dictionary of such damaged words and, if it is found, replace it with the intended word. Case normalization is also done at this point, before the token is appended.

- **Case Normalization**:
Case normalization converts words to a single case, either uppercase or lowercase, so that variants of the same word that differ only in capitalization are treated as one token.

Here, we convert only the first word of each sentence to lower case, regardless of whether it is a common noun or a proper noun.

Finally, the unigrams obtained from each sentence are appended to another list.
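To see what the regular expression in the cell below keeps and drops, here is a minimal sketch that applies the same pattern with `re.findall`, which should give the same tokens as `RegexpTokenizer` in its default matching mode. The function name `tokenize_sentence` and the example sentence are hypothetical.

```python
import re

# same pattern as the notebook: a letter, one or more word characters,
# then optionally one '-', "'" or '?' followed by more word characters
TOKEN_PATTERN = re.compile(r"[A-Za-z]\w+(?:[-'?]\w+)?")


def tokenize_sentence(sentence):
    """Return the tokens of one sentence, lowercasing only the first token."""
    tokens = TOKEN_PATTERN.findall(sentence)
    if tokens:
        tokens[0] = tokens[0].lower()
    return tokens


print(tokenize_sentence("The state-of-the-art model doesn't need a GPU."))
```

The pattern requires at least two characters, so single-letter words such as "a" are dropped, and it joins at most one internal hyphen or apostrophe, so "state-of-the-art" is split into 'state-of' and 'the-art'.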

In [29]:
def file_tokenziation(string1):
    #creating an empty list to add individual tokens of all sentences for one particular document
    l1 = []
    #traversing through each sentence and obtaining the unigram tokens
    for sent in string1:
        #using the RegexpTokenizer tokenizing the sentence with the below regular expression
        tokenizer = RegexpTokenizer( r"[A-Za-z]\w+(?:[-'?]\w+)?")
        unigram_tokens = tokenizer.tokenize(sent)
        l=[]
        missing_dict={'fty':'fifty',
                      'rst':'first',
                      'eld':'field',
                      'elds':'fields',
                      'll':'fill',
                      'nd':'find',
                      'nding':'finding',
                      'ndings':'findings',
                      'xation':'fixation',
                      'nal':'final',
                      'gure':'figure',
                      'nely':'finely',
                      'ne':'fine',
                      'ner':'finer',
                      'nance':'finance',
                      'nancial':'financial',
                      'xed':'fixed',
                      'nally':'finally',
                      'citious':'ficitious',
                      'negrained':'finegrained',
                      'nergrained':'finergrained',
                      'nergained':'finergained',
                      'negained':'finegained',
                      'lter':'filter',
                      'lters':'filters',
                      'le':'file',
                      'les':'files',
                      'nger':'finger',
                      'nite':'finite',
                      'fth':'fifth',
                      'ive':'five',
                      'eficient':'efficient',
                      'dificult':'difficult',
                      'ofline':'offline'
                      }
        
        #repair the tokens that lost their "fi", then lowercase only the first token of the
        #sentence (case normalization)
        for i, x in enumerate(unigram_tokens):
            x = missing_dict.get(x, x)
            if i == 0:
                x = x.lower()
            l.append(x)
        l1.extend(l)
    return l1

## 7. Reading stopwords from stopwords_en.txt file

In [7]:
'''
Read the stopwords from the stopwords_en.txt file, split on newlines and store them in a set.
A set is used rather than a list because membership tests on a set are faster than on a list.
'''
with open('stopwords_en.txt','r') as stop:
    stopwords_set = set(stop.read().split('\n'))


## 8.  Removal of stopwords

Stop words are words that are extremely common and carry little lexical content. In English they are often function words, for example articles, pronouns, and particles. In NLP and IR, stop words are usually excluded from the vocabulary; otherwise they add to the dimensionality of the feature space without adding much information. There are exceptions: syntactic analysis such as parsing keeps these function words. Here, we remove all stop words from the token list using the stop word list in the 'stopwords_en.txt' file.
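A minimal sketch of the filtering step, assuming an all-lowercase stop word list. The miniature set and the function name `remove_stopwords` are made up for illustration; the real list is read from stopwords_en.txt.

```python
# Hypothetical miniature stop word list; the real one is read from stopwords_en.txt.
stopwords_sample = {'the', 'are', 'an', 'of', 'in'}


def remove_stopwords(tokens, stopword_set):
    """Keep the tokens that are not in the stop word set (the test is case-sensitive)."""
    return [token for token in tokens if token not in stopword_set]


tokens = ['the', 'detectors', 'are', 'an', 'important', 'goal', 'The']
print(remove_stopwords(tokens, stopwords_sample))
```

Because the comparison is case-sensitive, a capitalised 'The' in the middle of a sentence survives the filter, which is consistent with 'The' appearing in the document frequency output later in this notebook.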

In [8]:
def stopwords(stopwords_listt):
    stopped_tokens = [w for w in stopwords_listt if w not in stopwords_set]
    return stopped_tokens

## 9. MWE Tokenizer

An MWETokenizer takes text that has already been divided into tokens and retokenizes it, merging multi-word expressions (MWEs) into single tokens using a lexicon of MWEs. Here, the MWEs are the bigrams generated from the combined tokens of all 200 files.
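A small usage sketch of `MWETokenizer` with the same `'__'` separator as the cell below. The two-entry lexicon and the token list are made up for illustration; the notebook builds its lexicon from the corpus bigrams.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical two-entry lexicon; the notebook derives its lexicon from the corpus.
lexicon = [('neural', 'network'), ('ground', 'truth')]
tokenizer = MWETokenizer(lexicon, separator='__')

tokens = ['the', 'neural', 'network', 'matches', 'ground', 'truth', 'labels']
merged = tokenizer.tokenize(tokens)
print(merged)
```

Tokens that are not part of any lexicon entry pass through unchanged, so the merged list is never longer than the input.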

In [9]:
def mwe_tokenizer(bigram_listt,unigram_listt):
    '''
    We will add generated bigram list separated by delimiter '__' to MWETokenizer and then retokenize our unigram lists
    with mwe_tokenizer, so that if any bigrams are present in our unigrams list, they will be replaced. 
    '''
    mwe_tokenizer = MWETokenizer(bigram_listt, separator = '__')
    mwe_tokens = mwe_tokenizer.tokenize(unigram_listt)
    return mwe_tokens


## 10. Method for getting input for stat-generation

In task 2 we are asked for the top 10 most frequent words in the titles, the top 10 most frequent words in the abstracts, and the top 10 authors (ordered by the number of articles they produced, with ties ordered by name). To get the most frequent words from the titles and abstracts, we segment them, tokenise them, and remove stop words, reusing the code from task 1.
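The splitting and the author ordering can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the function names are hypothetical, and the sample text is made up in the layout that the split strings 'Authored by:' and 'Abstract' imply.

```python
from collections import Counter


def split_front_matter(text):
    """Split the text before the paper body into title, authors and abstract."""
    title, _, rest = text.partition('Authored by:')
    authors_block, _, abstract = rest.partition('Abstract')
    authors = [name.strip() for name in authors_block.splitlines() if name.strip()]
    return title.strip(), authors, abstract.strip()


def top_authors(author_names, n=10):
    """Rank authors by article count (descending), then by name (ascending)."""
    counts = Counter(author_names)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up front matter in the assumed layout.
sample = 'A Toy Title\nAuthored by:\nB. Author\nA. Author\nAbstract\nWe study toys.\n'
title, authors, abstract = split_front_matter(sample)
print(title, authors, abstract)
print(top_authors(authors + ['B. Author', 'C. Author'], n=2))
```

Sorting on the key `(-count, name)` gives both orderings in one pass: higher counts first, and alphabetical order among authors with equal counts.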

In [10]:
#lists to store our intermediate results 
author_list=[]
title_words=[]
abstract_words=[]
def Task2(string2):
    title=string2.split('Authored by:')[0].lower()#title of the current file
    author=string2.split('Authored by:')[1].split('Abstract')[0].strip()#author names of the current file
    abstract=string2.split('Authored by:')[1].split('Abstract')[1]#abstract of the current file
    #storing all the intermediate results
    author_list.extend(author.split('\n'))
    #segment and tokenise the title, then remove the stop words
    title_words.extend(stopwords(file_tokenziation(file_segmentation(title))))
    #segment and tokenise the abstract, then remove the stop words
    abstract_words.extend(stopwords(file_tokenziation(file_segmentation(abstract))))


## 11. Parsing through each text file, Extracting paper body and doing the required pre-processing steps

In [11]:
%%time
#creating a dictionary to store each text file name and its tokens
file_dict={}
#creating an empty list to store all the tokens of 200 documents
unigram_list=[]
#traversing through each file
for word in File_list:
    #open and read each text file as one string, then cut out the paper body, i.e. the text between
    #'1 Paper Body' and '2 References', and pass it to the segmentation and tokenization methods
    with open(f'{word}.txt','r') as reader:
        file_read=reader.read()
        string_list = file_read.split('1 Paper Body')
        string_sub_refer = string_list[1].split('2 References')
        string_sub_refer = string_sub_refer[0].strip()
        file_token=file_tokenziation(file_segmentation(string_sub_refer))
        #store the tokens returned by segmentation and tokenization as the dictionary value
        file_dict[word]=file_token
        #add this document's unigram tokens to unigram_list, which collects the tokens of all 200 documents
        unigram_list.extend(file_token)
        #string_list[0] holds everything before the paper body (title, authors, abstract); pass it to Task2
        Task2(string_list[0])


CPU times: user 3.68 s, sys: 22.8 ms, total: 3.7 s
Wall time: 3.71 s


In [12]:
#generated unigram list for all documents
unigram_list

['very',
 'accurate',
 'pedestrian',
 'detectors',
 'are',
 'an',
 'important',
 'technical',
 'goal',
 'approximately',
 'half-a',
 'million',
 'pedestrians',
 'are',
 'killed',
 'by',
 'cars',
 'each',
 'year',
 'gures',
 'in',
 'at',
 'relatively',
 'low',
 'resolution',
 'pedestrians',
 'tend',
 'to',
 'have',
 'characteristic',
 'appearance',
 'generally',
 'one',
 'must',
 'cope',
 'with',
 'lateral',
 'or',
 'frontal',
 'views',
 'of',
 'walk',
 'in',
 'these',
 'cases',
 'one',
 'will',
 'see',
 'either',
 'lollipop',
 'shape',
 'the',
 'torso',
 'is',
 'wider',
 'than',
 'the',
 'legs',
 'which',
 'are',
 'together',
 'in',
 'the',
 'stance',
 'phase',
 'of',
 'the',
 'walk',
 'or',
 'scissor',
 'shape',
 'where',
 'the',
 'legs',
 'are',
 'swinging',
 'in',
 'the',
 'walk',
 'this',
 'encourages',
 'the',
 'use',
 'of',
 'template',
 'matching',
 'early',
 'template',
 'matchers',
 'include',
 'support',
 'vector',
 'machines',
 'applied',
 'to',
 'wavelet',
 'expansion',
 'a

## 12. Generating the bigrams 

Bigrams are pairs of words that appear side by side within a sentence, i.e. all possible word pairs formed from neighbouring words. A bigram is an n-gram with n=2.
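The cell below ranks bigrams by PMI (pointwise mutual information) after a frequency filter. To illustrate why a PMI ranking differs from a plain frequency ranking, here is a simplified standard-library sketch. It is not NLTK's implementation: the function name, the toy token list, and the formula log2(count(w1, w2) * N / (count(w1) * count(w2))) with N as the number of tokens are assumptions made for illustration.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_freq=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    total = len(tokens)
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_freq:
            continue  # frequency filter: skip pairs that are too rare to trust
        pmi = math.log2(count * total / (word_counts[first] * word_counts[second]))
        scored.append(((first, second), pmi))
    # highest PMI first; ties are broken alphabetically so the order is deterministic
    return sorted(scored, key=lambda item: (-item[1], item[0]))


tokens = ['monte', 'carlo', 'the', 'model', 'the', 'model', 'the', 'data',
          'monte', 'carlo', 'the', 'model']
for bigram, pmi in rank_bigrams_by_pmi(tokens):
    print(bigram, round(pmi, 3))
```

In this toy list ('the', 'model') is the most frequent pair, yet ('monte', 'carlo') ranks first because its two words almost never occur apart. PMI rewards exclusivity rather than raw frequency, which is why the frequency filter is applied first.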

In [13]:
'''
Using the nltk package, we find collocations in the token list with nltk.collocations.BigramCollocationFinder.
By default, a collocation finder considers all ngrams in a text as candidate collocations.
'''

bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(unigram_list)

'''
The bigrams with the highest association scores are often also very infrequent.
It is therefore useful to apply filters, such as ignoring all bigrams that occur fewer than 20 times in the corpus.
'''
bigram_finder.apply_freq_filter(20)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)# ignoring bigrams that contain a word shorter than 3 characters

'''
bigram_finder.nbest scores every remaining bigram with the PMI (Pointwise Mutual Information) measure and returns
the bigrams in descending order of score. Passing len(unigram_list) as the limit returns all of them.
'''
total_bigrams = bigram_finder.nbest(bigram_measures.pmi, len(unigram_list)) 
total_bigrams

[('spike-and', 'slab'),
 ('Barcelona', 'Spain'),
 ('DFC-P', 'ROJ'),
 ('Long', 'Beach'),
 ('https', 'github'),
 ('ENSOR', 'KETCH'),
 ('ELECT', 'ROC'),
 ('symbolic', 'mismatch'),
 ('github', 'com'),
 ('minwise', 'hashing'),
 ('Monte', 'Carlo'),
 ('Processing', 'Systems'),
 ('Kac-Rice', 'formula'),
 ('dynamical', 'isometry'),
 ('NIPS', 'Barcelona'),
 ('diﬀerentially', 'private'),
 ('nearest', 'neighbor'),
 ('Systems', 'NIPS'),
 ('blood', 'pressure'),
 ('Science', 'Foundation'),
 ('NIPS', 'Long'),
 ('National', 'Science'),
 ('Information', 'Processing'),
 ('solid', 'harmonic'),
 ('http', 'www'),
 ('Gradient', 'Descent'),
 ('computer', 'vision'),
 ('state-of', 'the-art'),
 ('false', 'positives'),
 ("Hotelling's", 'deﬂation'),
 ('Neural', 'Information'),
 ('multi-view', 'anomaly'),
 ('ground', 'truth'),
 ('fused', 'lasso'),
 ('approval', 'voting'),
 ('diﬀerential', 'privacy'),
 ('gene', 'expression'),
 ('IDS', 'games'),
 ('Accuracy', 'Accuracy'),
 ('visual', 'cortex'),
 ('bounding', 'boxes')

## 13. Removal of context-independent stopwords from generated bigrams

In [14]:
'''
Our bigram list must not include context-independent stopwords, so we traverse total_bigrams and keep a bigram
only if neither of its two words is in stopwords_set.
'''
string1 = []
for bigram in total_bigrams:
    if bigram[0] not in stopwords_set and bigram[1] not in stopwords_set:
        string1.append(bigram)
string1

[('spike-and', 'slab'),
 ('Barcelona', 'Spain'),
 ('DFC-P', 'ROJ'),
 ('Long', 'Beach'),
 ('https', 'github'),
 ('ENSOR', 'KETCH'),
 ('ELECT', 'ROC'),
 ('symbolic', 'mismatch'),
 ('minwise', 'hashing'),
 ('Monte', 'Carlo'),
 ('Processing', 'Systems'),
 ('Kac-Rice', 'formula'),
 ('dynamical', 'isometry'),
 ('NIPS', 'Barcelona'),
 ('diﬀerentially', 'private'),
 ('nearest', 'neighbor'),
 ('Systems', 'NIPS'),
 ('blood', 'pressure'),
 ('Science', 'Foundation'),
 ('NIPS', 'Long'),
 ('National', 'Science'),
 ('Information', 'Processing'),
 ('solid', 'harmonic'),
 ('http', 'www'),
 ('Gradient', 'Descent'),
 ('computer', 'vision'),
 ('state-of', 'the-art'),
 ('false', 'positives'),
 ("Hotelling's", 'deﬂation'),
 ('Neural', 'Information'),
 ('multi-view', 'anomaly'),
 ('ground', 'truth'),
 ('fused', 'lasso'),
 ('approval', 'voting'),
 ('diﬀerential', 'privacy'),
 ('gene', 'expression'),
 ('IDS', 'games'),
 ('Accuracy', 'Accuracy'),
 ('visual', 'cortex'),
 ('bounding', 'boxes'),
 ('belief', 'propa

## 14. Getting the top 200 bigrams

In [15]:
'''
We need only 200 bigrams, so we take the first 200 entries of string1. As string1 keeps the PMI ranking produced
by nbest, bigram_list holds the 200 highest-ranked bigrams.
'''
bigram_list = string1[:200]
bigram_list

[('spike-and', 'slab'),
 ('Barcelona', 'Spain'),
 ('DFC-P', 'ROJ'),
 ('Long', 'Beach'),
 ('https', 'github'),
 ('ENSOR', 'KETCH'),
 ('ELECT', 'ROC'),
 ('symbolic', 'mismatch'),
 ('minwise', 'hashing'),
 ('Monte', 'Carlo'),
 ('Processing', 'Systems'),
 ('Kac-Rice', 'formula'),
 ('dynamical', 'isometry'),
 ('NIPS', 'Barcelona'),
 ('diﬀerentially', 'private'),
 ('nearest', 'neighbor'),
 ('Systems', 'NIPS'),
 ('blood', 'pressure'),
 ('Science', 'Foundation'),
 ('NIPS', 'Long'),
 ('National', 'Science'),
 ('Information', 'Processing'),
 ('solid', 'harmonic'),
 ('http', 'www'),
 ('Gradient', 'Descent'),
 ('computer', 'vision'),
 ('state-of', 'the-art'),
 ('false', 'positives'),
 ("Hotelling's", 'deﬂation'),
 ('Neural', 'Information'),
 ('multi-view', 'anomaly'),
 ('ground', 'truth'),
 ('fused', 'lasso'),
 ('approval', 'voting'),
 ('diﬀerential', 'privacy'),
 ('gene', 'expression'),
 ('IDS', 'games'),
 ('Accuracy', 'Accuracy'),
 ('visual', 'cortex'),
 ('bounding', 'boxes'),
 ('belief', 'propa

## 15. Retokenizing the unigrams of each document with bigrams and removal of stopwords

In [16]:
'''
Retokenize the unigrams of each document with mwe_tokenizer so that the selected bigrams are merged, remove the
stopwords, and append the result to a list. Finally, store the new token list as the value for each document.
'''
txt_token_list = []
for key,value in file_dict.items():
    token_listt = stopwords(mwe_tokenizer(bigram_list,value))
    txt_token_list.append(token_listt)
    file_dict[key] = token_listt
txt_token_list

[['accurate',
  'pedestrian',
  'detectors',
  'important',
  'technical',
  'goal',
  'approximately',
  'half-a',
  'million',
  'pedestrians',
  'killed',
  'cars',
  'year',
  'gures',
  'low',
  'resolution',
  'pedestrians',
  'tend',
  'characteristic',
  'appearance',
  'generally',
  'cope',
  'lateral',
  'frontal',
  'views',
  'walk',
  'cases',
  'lollipop',
  'shape',
  'torso',
  'wider',
  'legs',
  'stance',
  'phase',
  'walk',
  'scissor',
  'shape',
  'legs',
  'swinging',
  'walk',
  'encourages',
  'template',
  'matching',
  'early',
  'template',
  'matchers',
  'include',
  'support',
  'vector',
  'machines',
  'applied',
  'wavelet',
  'expansion',
  'variants',
  'neural__network',
  'applied',
  'stereoscopic',
  'reconstructions',
  'chamfer',
  'matching',
  'hierachy',
  'contour',
  'templates',
  'likelihood',
  'threshold',
  'applied',
  'random',
  'field',
  'model',
  'SVM',
  'applied',
  'spatial',
  'wavelets',
  'stacked',
  'frames',
  'give'

## 16. Getting the document frequency of each word 
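Document frequency is the number of documents in which a token occurs, not its total number of occurrences. A minimal sketch of that idea, using `collections.Counter` from the standard library rather than the `pd.Series(...).value_counts()` call in the cell below; the function name and the sample documents are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for every token, the number of documents that contain it."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))  # a set, so each document counts a token once
    return counts


docs = [['model', 'data', 'model'], ['data', 'error'], ['data']]
df = document_frequency(docs)
print(df['data'], df['model'], df['error'])
```

Although 'model' occurs twice in the first document, its document frequency is 1, because only the number of documents matters.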

In [17]:
'''
Traverse txt_token_list, take the unique tokens of each document and add them to set_txt_token_list.
Because each document then contributes a token at most once, counting occurrences gives the document frequency.
'''
set_txt_token_list = []

for listt in txt_token_list:
    set_txt_token_list.extend(set(listt))

#getting the frequency using pd.Series
document_freq = pd.Series(set_txt_token_list).value_counts()
#converting the Series into a list of (token, document frequency) tuples
doc_freq = list(zip(document_freq.index,document_freq))
doc_freq

[('set', 199),
 ('results', 199),
 ('number', 191),
 ('The', 190),
 ('We', 188),
 ('function', 187),
 ('problem', 185),
 ('work', 184),
 ('show', 182),
 ('diﬀerent', 181),
 ('data', 180),
 ('model', 176),
 ('note', 176),
 ('Figure', 176),
 ('based', 175),
 ('shown', 175),
 ('paper', 175),
 ('algorithm', 172),
 ('similar', 171),
 ('small', 170),
 ('large', 169),
 ('performance', 167),
 ('case', 167),
 ('In', 164),
 ('methods', 164),
 ('approach', 164),
 ('distribution', 163),
 ('method', 162),
 ('deﬁned', 158),
 ('result', 157),
 ('random', 157),
 ('linear', 157),
 ('form', 157),
 ('experiments', 157),
 ('learning', 156),
 ('time', 155),
 ('general', 155),
 ('order', 155),
 ('parameters', 155),
 ('assume', 155),
 ('probability', 155),
 ('standard', 153),
 ('parameter', 152),
 ('section', 152),
 ('space', 151),
 ('proposed', 151),
 ('values', 151),
 ('shows', 150),
 ('point', 150),
 ('models', 150),
 ('problems', 150),
 ('simple', 150),
 ('fixed', 150),
 ('vector', 148),
 ('setting', 146

## 17. Removal of tokens based on threshold frequency and length
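The thresholds translate into document counts as follows: with 200 documents, 3% is 6 documents and 95% is 190 documents. The sketch below implements the rule as stated in the introduction (tokens with length less than 3 are removed) and treats both bounds as inclusive, which is an assumption. The function, its parameter names, and the sample frequencies are hypothetical, and the cell below uses `len(...) > 3`, so its results differ for three-character tokens.

```python
def filter_vocabulary(doc_freq, n_docs, low_pct=3, high_pct=95, min_length=3):
    """Keep tokens that are neither too rare, too common, nor too short."""
    min_count = n_docs * low_pct / 100    # 6 documents when n_docs is 200
    max_count = n_docs * high_pct / 100   # 190 documents when n_docs is 200
    return [token for token, count in doc_freq.items()
            if min_count <= count <= max_count and len(token) >= min_length]


# Made-up document frequencies for a 200-document corpus.
sample = {'set': 199, 'function': 187, 'The': 190, 'We': 188, 'lollipop': 2, 'data': 180}
print(filter_vocabulary(sample, n_docs=200))
```

Working with percentages of `n_docs` rather than hard-coded counts keeps the rule valid if the number of documents changes.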

In [18]:
'''
For each word in the document frequency list obtained above, the word is kept only if its document frequency is
between 6 (3% of the 200 documents) and 190 (95%), inclusive, and its length is greater than 3. Words outside that
frequency range are rare tokens or context-dependent words. Note that len(...) > 3 also discards three-character
tokens such as 'The', which is stricter than removing tokens with length less than 3.
'''
final_doc_freq = []
for tuplee in doc_freq:
    if 6<=tuplee[1]<=190 and len(tuplee[0])>3:
        final_doc_freq.append(tuplee[0])
final_doc_freq

['function',
 'problem',
 'work',
 'show',
 'diﬀerent',
 'data',
 'model',
 'note',
 'Figure',
 'based',
 'shown',
 'paper',
 'algorithm',
 'similar',
 'small',
 'large',
 'performance',
 'case',
 'methods',
 'approach',
 'distribution',
 'method',
 'deﬁned',
 'result',
 'random',
 'linear',
 'form',
 'experiments',
 'learning',
 'time',
 'general',
 'order',
 'parameters',
 'assume',
 'probability',
 'standard',
 'parameter',
 'section',
 'space',
 'proposed',
 'values',
 'shows',
 'point',
 'models',
 'problems',
 'simple',
 'fixed',
 'vector',
 'setting',
 'error',
 'high',
 'terms',
 'size',
 'analysis',
 'algorithms',
 'obtained',
 'obtain',
 'find',
 'single',
 'matrix',
 'deﬁne',
 'important',
 'step',
 'denote',
 'figure',
 'previous',
 'average',
 'compared',
 'optimization',
 'provide',
 'optimal',
 'true',
 'Section',
 'functions',
 'fact',
 'compute',
 'finally',
 'present',
 'sample',
 'structure',
 'respect',
 'applications',
 'make',
 'found',
 'compare',
 'cases',
 'set

## 18. Stemming the remaining unigram_tokens using PorterStemmer()

Stemming is the process of identifying and removing word endings, such as suffixes and plural forms, which leaves the stem of the word. We use PorterStemmer for this.

Stemming is not performed for tokens that:

1. start with a capital letter,
2. are entirely in capitals, or
3. are bigrams.

Stemming is therefore applied only to lower-case unigrams.
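A compact sketch of this routing on individual tokens, using NLTK's `PorterStemmer`. The function name `stem_token` and the sample tokens are made up for illustration, and the checks use plain string methods in place of the regular expressions in the cell below.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_token(token):
    """Stem plain lower-case tokens; leave the exempt categories untouched."""
    if token.isupper():       # acronyms such as 'NIPS'
        return token
    if '__' in token:         # merged bigrams such as 'neural__network'
        return token
    if token[0].isupper():    # capitalised words such as 'Figure'
        return token
    if "'" in token:          # tokens with an apostrophe are also left as they are
        return token
    return stemmer.stem(token)


print([stem_token(t) for t in ['NIPS', 'neural__network', 'Figure', 'models', 'running']])
```

Only the last two tokens reach the stemmer; the first three are returned unchanged.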

In [19]:
'''
Identify the token types discussed in the above cell using str.isupper and re.search. Tokens containing an
apostrophe are also left unstemmed. Any other token, i.e. a lower-case unigram, is stemmed with the Porter stemmer.
'''
def stemming(final_doc_freq):
    stemmer = PorterStemmer()
    final_tokens = []
    for wordd in final_doc_freq:
        if wordd.isupper():
            final_tokens.append(wordd)
        elif re.search(r'[A-Za-z]+__[A-Za-z]+',wordd):
            final_tokens.append(wordd)
        elif re.search(r'[A-Z]',wordd[0]):
            final_tokens.append(wordd)
        elif re.search(r"(.*)'(.*)",wordd):
            final_tokens.append(wordd)
        else:
            final_tokens.append(stemmer.stem(wordd))
    return final_tokens
    
'''
Call the stemming method as the last step of text pre-processing. Its argument, final_doc_freq, is the result of
the earlier steps: sentence segmentation, tokenization, merging bigrams, removing stopwords, and removing
context-dependent words, rare tokens and short tokens.
'''
final_tokens_set = set(stemming(final_doc_freq))
final_tokens_set

{'theorem',
 'bottom__row',
 'mechan',
 'denois',
 'highli',
 'forward',
 'semant',
 'contigu',
 'index',
 'discov',
 'patch',
 'theoret',
 'mixtur',
 'present',
 'truth',
 'induct',
 'fulli',
 'large-scal',
 'salient',
 'downstream',
 'act',
 'bottom',
 'figure__shows',
 'Detection',
 'submodular',
 'kind',
 'mult',
 'appendix',
 'stop',
 'attent',
 'chain',
 'pick',
 'bound',
 'algo',
 'tradit',
 'ring',
 'decision__making',
 'dualiti',
 'one-hot',
 'good',
 'singular',
 'great',
 'celebr',
 'code',
 'total__variation',
 'uncertainti',
 'Decomposition',
 'succeed',
 'tightli',
 'popul',
 'slice',
 'comput',
 'indistinguish',
 'addit',
 'locat',
 'decreas',
 'While',
 'distinct',
 'nuclear',
 'Institute',
 'experi',
 'vari',
 'unrealist',
 'corrupt',
 'run',
 'switch',
 'remov',
 'unsupervis',
 'Stochastic',
 'term',
 'desir',
 'alloc',
 'recogn',
 'proven',
 'consist',
 'critic',
 'previous',
 'Mean',
 'proceed',
 'faster',
 'non-parametr',
 'Input',
 'violat',
 'uninform',
 'follow-

In [20]:
'''
Split the words into three lists by type, sort each list, and concatenate them into a single list. The final list
starts with the all-capital words, followed by the words that begin with a capital letter, and ends with the
lower-case words. Because each list is sorted, the words within each group are in ascending order.
'''
capital_words = []
capital_starting_words = []
small_words = []
for word in final_tokens_set:
    if word.isupper():
        capital_words.append(word)
    elif word[0].isupper():
        capital_starting_words.append(word)
    else:
        small_words.append(word)

final_unique_words = sorted(capital_words) + sorted(capital_starting_words) + sorted(small_words)
final_unique_words

['ADMM',
 'AFOSR',
 'CAREER',
 'CIFAR-10',
 'DARPA',
 'LIBSVM',
 'MCMC',
 'MNIST',
 'MURI',
 'NIPS',
 'RKHS',
 'RMSE',
 'SIFT',
 'Accuracy',
 'Action',
 'Adam',
 'Adaptive',
 'After',
 'Algorithm',
 'Algorithms',
 'Allocation',
 'Also',
 'Although',
 'Amazon',
 'Analysis',
 'Appendix',
 'Approach',
 'Approximate',
 'Approximation',
 'Army',
 'Arora',
 'Assume',
 'Assumption',
 'Assumptions',
 'Average',
 'Award',
 'Based',
 'Baseline',
 'Bayes',
 'Bayesian',
 'Beach',
 'Because',
 'Before',
 'Belief',
 'Benchmark',
 'Bernoulli',
 'Beta',
 'Binary',
 'Block',
 'Blum',
 'Boltzmann',
 'Both',
 'Bottom',
 'Bound',
 'Bounds',
 'Bregman',
 'CNNs',
 'Center',
 'Cesa-Bianchi',
 'Chapter',
 'Chen',
 'China',
 'Chinese',
 'Classiﬁcation',
 'Classiﬁer',
 'Clustering',
 'Coding',
 'Comparison',
 'Complexity',
 'Computation',
 'Computational',
 'Compute',
 'Computer',
 'Conclusion',
 'Condition',
 'Conditional',
 'Conference',
 'Consider',
 'Convergence',
 'Convex',
 'Convolutional',
 'Corollary',


## 19. Getting the index value for each distinct word after doing all the text pre-processing

In [21]:
'''
Create a dictionary called vocab_txt by iterating over final_unique_words and storing each word with its position
as a key-value pair: the key is the word and the value is its index. The indexes are in ascending order, from 0 to
len(final_unique_words) - 1.
'''
vocab_txt = {}
for index, word in enumerate(final_unique_words):
    vocab_txt[word] = index
vocab_txt

{'ADMM': 0,
 'AFOSR': 1,
 'CAREER': 2,
 'CIFAR-10': 3,
 'DARPA': 4,
 'LIBSVM': 5,
 'MCMC': 6,
 'MNIST': 7,
 'MURI': 8,
 'NIPS': 9,
 'RKHS': 10,
 'RMSE': 11,
 'SIFT': 12,
 'Accuracy': 13,
 'Action': 14,
 'Adam': 15,
 'Adaptive': 16,
 'After': 17,
 'Algorithm': 18,
 'Algorithms': 19,
 'Allocation': 20,
 'Also': 21,
 'Although': 22,
 'Amazon': 23,
 'Analysis': 24,
 'Appendix': 25,
 'Approach': 26,
 'Approximate': 27,
 'Approximation': 28,
 'Army': 29,
 'Arora': 30,
 'Assume': 31,
 'Assumption': 32,
 'Assumptions': 33,
 'Average': 34,
 'Award': 35,
 'Based': 36,
 'Baseline': 37,
 'Bayes': 38,
 'Bayesian': 39,
 'Beach': 40,
 'Because': 41,
 'Before': 42,
 'Belief': 43,
 'Benchmark': 44,
 'Bernoulli': 45,
 'Beta': 46,
 'Binary': 47,
 'Block': 48,
 'Blum': 49,
 'Boltzmann': 50,
 'Both': 51,
 'Bottom': 52,
 'Bound': 53,
 'Bounds': 54,
 'Bregman': 55,
 'CNNs': 56,
 'Center': 57,
 'Cesa-Bianchi': 58,
 'Chapter': 59,
 'Chen': 60,
 'China': 61,
 'Chinese': 62,
 'Classiﬁcation': 63,
 'Classiﬁer': 6

## 20. Writing the vocab dictionary into Group060_vocab.txt file

In [22]:
'''
Open the Group060_vocab.txt file and write the dictionary to it line by line, one key-value pair per line in the form
word:index.
'''
with open('Group060_vocab.txt','w+') as vocab_writer:
    for k, v in vocab_txt.items():
        vocab_writer.write(str(k) + ':'+ str(v) + '\n')
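
One way to check the written file is to read it back and compare it with the dictionary. The sketch below is illustrative only: it writes a small hypothetical vocabulary to a temporary file rather than touching `Group060_vocab.txt`, and the explicit UTF-8 encoding is an assumption added here because some tokens contain non-ASCII characters.

```python
import os
import tempfile

# Hypothetical miniature vocabulary in the same word -> index shape as vocab_txt.
vocab_example = {'ADMM': 0, 'CIFAR-10': 1, 'Bayes': 2, 'bottom__row': 3}

# Write to a temporary file, one "word:index" pair per line.
path = os.path.join(tempfile.mkdtemp(), 'vocab_example.txt')
with open(path, 'w', encoding='utf-8') as writer:
    for token, index in vocab_example.items():
        writer.write(f'{token}:{index}\n')

# Read the file back; split on the last colon in case a token contains ':'.
loaded = {}
with open(path, encoding='utf-8') as reader:
    for line in reader:
        token, index = line.rstrip('\n').rsplit(':', 1)
        loaded[token] = int(index)

print(loaded == vocab_example)
```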

## 21. Calling the stemming function for tokens of each document

In [23]:
'''
Call the stemming function on each value of file_dict (the unigram tokens of a document) and replace that value with
the stemmed tokens, so file_dict still maps each document name to its token list.
'''
for key, value in file_dict.items():
    file_dict[key] = stemming(value)

## 22. Getting the vector count for each word in each document in the form of dictionary

In [24]:
'''
Create an empty dictionary and iterate over the tokens of each document in file_dict. If a token is present in the
vocab_txt dictionary, look up its index and append it to a list. Then count how often each index occurs and store the
result in vectors_dict, with the document name as key and an index-to-count dictionary as value.
'''
vectors_dict = {}
for key,value in file_dict.items():
    individual_vectors = []
    for val in file_dict[key]:
        if val in vocab_txt.keys():
            individual_vectors.append(vocab_txt[val])
    individual_vectors_freq = pd.Series(individual_vectors).value_counts()
    individual_vectors_freq1 = list(zip(individual_vectors_freq.index,individual_vectors_freq))
    vectors_dict[key] = dict(individual_vectors_freq1)
    
vectors_dict

{'PP3210': {1050: 48,
  758: 47,
  2408: 42,
  1233: 41,
  853: 33,
  1823: 32,
  975: 31,
  854: 31,
  1984: 22,
  2159: 21,
  1355: 19,
  1458: 19,
  2003: 17,
  984: 17,
  1485: 16,
  1967: 13,
  1648: 13,
  976: 13,
  1059: 13,
  1716: 12,
  462: 12,
  2221: 12,
  567: 11,
  1532: 10,
  1915: 9,
  1025: 9,
  1064: 9,
  645: 9,
  1618: 9,
  2274: 9,
  1701: 9,
  617: 8,
  2244: 8,
  2374: 8,
  1735: 8,
  2366: 8,
  2021: 8,
  1199: 8,
  1125: 8,
  1974: 8,
  2132: 7,
  2236: 7,
  1944: 7,
  464: 7,
  1761: 6,
  855: 6,
  1242: 6,
  1678: 6,
  420: 6,
  897: 6,
  1225: 6,
  697: 6,
  680: 6,
  466: 6,
  1056: 6,
  2285: 6,
  2087: 6,
  1128: 6,
  974: 5,
  1949: 5,
  2202: 5,
  872: 5,
  679: 5,
  632: 5,
  1444: 5,
  712: 5,
  1664: 5,
  1898: 5,
  947: 5,
  1116: 5,
  1758: 5,
  1102: 5,
  1592: 5,
  2411: 5,
  841: 5,
  1360: 5,
  1859: 5,
  760: 5,
  1383: 5,
  1728: 5,
  900: 4,
  1380: 4,
  979: 4,
  520: 4,
  921: 4,
  565: 4,
  532: 4,
  1344: 4,
  1409: 4,
  481: 4,
  1089: 
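
The same index-to-count mapping can be built with the standard library alone. This is a sketch of an alternative, not what the notebook does: it swaps `pd.Series(...).value_counts()` for `collections.Counter`, and `count_vector` and the sample data are hypothetical.

```python
from collections import Counter


def count_vector(tokens, vocab):
    """Map each in-vocabulary token to its index and count the occurrences."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    # most_common() orders the pairs by descending count, like value_counts().
    return dict(counts.most_common())


# Hypothetical miniature vocabulary and document.
vocab_example = {'Bayes': 0, 'bound': 1, 'comput': 2}
doc_tokens = ['bound', 'comput', 'bound', 'unknown', 'Bayes', 'bound']
vector = count_vector(doc_tokens, vocab_example)
print(vector)
```

Tokens that are missing from the vocabulary, such as `'unknown'` here, are skipped, as in the loop above.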

## 23. Writing the vector data into Group060_count_vectors.txt file

In [25]:
'''
Write the data to Group060_count_vectors.txt, where each line consists of a document name followed by the count of each
vocabulary index in that document.
'''
with open('Group060_count_vectors.txt','w+') as vocab_vector_writer:
    for k, v in vectors_dict.items():
        # Strip the surrounding braces from the dictionary's string form to match
        # the format of the given sample_count_vectors.txt file.
        v1 = str(v).replace('{','').replace('}','')
        print((str(k) + ','+ v1 + '\n'))
        vocab_vector_writer.write(str(k) + ','+ v1 + '\n')
        

PP3210,1050: 48, 758: 47, 2408: 42, 1233: 41, 853: 33, 1823: 32, 975: 31, 854: 31, 1984: 22, 2159: 21, 1355: 19, 1458: 19, 2003: 17, 984: 17, 1485: 16, 1967: 13, 1648: 13, 976: 13, 1059: 13, 1716: 12, 462: 12, 2221: 12, 567: 11, 1532: 10, 1915: 9, 1025: 9, 1064: 9, 645: 9, 1618: 9, 2274: 9, 1701: 9, 617: 8, 2244: 8, 2374: 8, 1735: 8, 2366: 8, 2021: 8, 1199: 8, 1125: 8, 1974: 8, 2132: 7, 2236: 7, 1944: 7, 464: 7, 1761: 6, 855: 6, 1242: 6, 1678: 6, 420: 6, 897: 6, 1225: 6, 697: 6, 680: 6, 466: 6, 1056: 6, 2285: 6, 2087: 6, 1128: 6, 974: 5, 1949: 5, 2202: 5, 872: 5, 679: 5, 632: 5, 1444: 5, 712: 5, 1664: 5, 1898: 5, 947: 5, 1116: 5, 1758: 5, 1102: 5, 1592: 5, 2411: 5, 841: 5, 1360: 5, 1859: 5, 760: 5, 1383: 5, 1728: 5, 900: 4, 1380: 4, 979: 4, 520: 4, 921: 4, 565: 4, 532: 4, 1344: 4, 1409: 4, 481: 4, 1089: 4, 800: 4, 669: 4, 1892: 4, 1873: 4, 1691: 4, 1985: 4, 1993: 4, 2081: 4, 2154: 4, 1841: 4, 1121: 4, 731: 4, 811: 4, 1170: 4, 456: 4, 1763: 4, 2377: 4, 849: 4, 1063: 4, 1395: 3, 1403: 3,
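
Instead of removing the braces from `str(v)`, the line can be assembled directly from the index-count pairs. The sketch below assumes the output format shown above (`name,index: count, index: count, ...`); `format_vector_line` is a hypothetical helper.

```python
def format_vector_line(doc_name, vector):
    """Build 'name,index: count, index: count, ...' from an index -> count dict."""
    pairs = ', '.join(f'{index}: {count}' for index, count in vector.items())
    return f'{doc_name},{pairs}'


# First three entries of the PP3210 vector shown above.
line = format_vector_line('PP3210', {1050: 48, 758: 47, 2408: 42})
print(line)  # PP3210,1050: 48, 758: 47, 2408: 42
```

Building the string explicitly keeps the format independent of how Python prints dictionaries.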

## 24. Stat generation of authors, titles and abstract

Once the task2 function has returned the list of authors and the tokenised words of the abstracts and titles, we count how often each author and each word occurs, using the value_counts() method of pandas.

In [26]:
# convert to a Series (skipping empty entries) and count the documents per author
auth_series=pd.Series(filter(None, author_list)).value_counts(ascending=False)
# take the 10 most frequent authors with their counts as a list of tuples
top_authors=list(zip(auth_series.index,auth_series))[:10]
# sort by number of documents (descending); break ties by author name
top_authors=sorted(top_authors, key=lambda tup:(-tup[1], tup[0]))
# keep only the names of the top 10 authors, in that order
final_auth=[name for name, count in top_authors]
# get the 10 most frequent words in the titles and abstracts
final_title=list(pd.Series(title_words).value_counts().index)[:10]
final_abs=list(pd.Series(abstract_words).value_counts().index)[:10]
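
When several authors share the same count near the tenth position, the order in which the list is sorted and truncated can change which names are kept. The following sketch ranks the full list by count and then by name before taking the first `n` entries. It is an illustration under stated assumptions, not the notebook's method: `top_n` is a hypothetical helper, the author names are placeholders, and `collections.Counter` stands in for `value_counts()`.

```python
from collections import Counter


def top_n(items, n=10):
    """Return the n most frequent items, breaking ties by name."""
    counts = Counter(item for item in items if item)  # skip empty entries
    # Rank everything first (descending count, then ascending name), then truncate.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:n]]


# Placeholder names; the real input would be author_list.
authors = ['B. Author', 'A. Author', 'C. Author', 'A. Author', 'B. Author', '']
best_two = top_n(authors, n=2)
print(best_two)  # ['A. Author', 'B. Author']
```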

## 25. Writing the data into a csv file

In [27]:
df=pd.DataFrame(list(zip(final_abs, final_title, final_auth)),columns=['top10_terms_in_abstracts','top10_terms_in_titles', 'top10_authors'])
df.to_csv('stats.csv',index=False)

In [28]:
df # display the top 10 abstract terms, title terms and authors in three columns

| | top10_terms_in_abstracts | top10_terms_in_titles | top10_authors |
|---|---|---|---|
| 0 | learning | learning | Michael I. Jordan |
| 1 | model | models | Martin J. Wainwright |
| 2 | data | networks | Erik B. Sudderth |
| 3 | algorithm | inference | Lawrence Carin |
| 4 | show | sparse | Maria-Florina F. Balcan |
| 5 | problem | neural | Pradeep K. Ravikumar |
| 6 | models | optimization | Raquel Urtasun |
| 7 | algorithms | gaussian | Tianbao Yang |
| 8 | method | deep | David Woodruﬀ |
| 9 | results | stochastic | Novi Quadrianto |


## 26. Summary


This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:

- **Downloading pdf files into text files**

The 200 PDF files were downloaded from the given URLs in parallel, using the `requests` library.

- **Segmentation, Tokenization, collocation extraction**. 

The `punkt` sentence tokenizer was used to break the text of each document into sentences, and the `re` package was used to tokenize the segmented sentences.
From the combined unigram tokens of all documents, the top 200 bigrams were extracted and then used to re-tokenize the text. The PMI measure was used to detect pairs of words that appear together more often than chance would predict, and bigram filters were applied to refine the selection further.

- **Vocabulary and sparse vector generation**.

A vocabulary covering the words of all documents was obtained by removing stop words and the tokens whose document frequency is above 95% or below 3%. To get the frequency count of each word, a pandas `Series` was used with the `value_counts` method. Each sparse vector maps the index of a word in the vocabulary to the frequency of that word in the particular document.
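
The threshold filtering mentioned in the last point can be sketched as follows. This is a minimal illustration, assuming the 95% and 3% thresholds refer to the share of documents that contain a token. `filter_by_document_frequency`, the inclusive bounds and the sample documents are hypothetical and not taken from the notebook.

```python
from collections import Counter


def filter_by_document_frequency(docs, low=0.03, high=0.95):
    """Keep tokens whose share of documents lies within [low, high]."""
    doc_count = len(docs)
    # Count each token once per document, however often it repeats there.
    doc_freq = Counter(token for tokens in docs.values() for token in set(tokens))
    return {token for token, freq in doc_freq.items()
            if low <= freq / doc_count <= high}


# Hypothetical documents in the same name -> token list shape as file_dict.
docs = {
    'doc1': ['model', 'learn', 'bound'],
    'doc2': ['model', 'learn'],
    'doc3': ['model', 'kernel'],
    'doc4': ['model', 'learn', 'learn'],
}
# A higher lower bound than 3% is used so the effect is visible on four documents.
kept = filter_by_document_frequency(docs, low=0.3, high=0.95)
print(kept)  # {'learn'}
```

Here `'model'` occurs in every document and is dropped as context-dependent, while `'bound'` and `'kernel'` occur in only one document each and are dropped as rare.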

## 27. References

- Roman Podlinov (May 22, 2013). *Download-large-file-in-python-with-requests?* [Response to]. Retrieved from https://stackoverflow.com/questions/16694907/

- *Using the Requests Library in Python?* Retrieved from https://www.pythonforbeginners.com/requests/using-requests-in-python

- MaxU (May 5, 2018). *How-do-i-get-the-number-of-occurrences-of-a-list-of-words-substrings-in-a-pandas-dataframe?* [Response to]. Retrieved from https://stackoverflow.com/questions/50187849

- KernelPanic (Feb 15, 2017). *How-to-sort-list-tuple-of-lists-tuples-by-the-element-at-a-given-index?* [Response to]. Retrieved from https://stackoverflow.com/questions/3121979
