## FIT5196-S2-2019 assessment 2


Date: 12/09/2019

Environment: Python 3.6.5 and Anaconda 4.3.0 (64-bit)

Libraries used:
* pandas 0.19.2 (for data frame, included in Anaconda Python 3.6) 
* re 2.2.1 (for regular expression, included in Anaconda Python 3.6) 
* pdfminer (used to extract information from the pdf documents)
* nltk (Natural Language Toolkit, used for processing natural (human) language)
* io (used for handling various I/O operations)
* os (operating system library used to access the folder and find the pdfs)
* requests (used to download the pdf files from the given link)
* itertools (used for handling iterators)

## Introduction

In this assessment we are given a pdf document containing the download links for 200 other pdf files, on which we need to perform text pre-processing and feature extraction. In the task below we perform the various [Text preprocessing and Feature Extraction](#sec_1) steps, including tokenization, stemming, normalization and stopword removal.

## Text preprocessing and Feature Extraction <a class="anchor" id="sec_1"></a>

In the text analysis section we perform various pre-processing steps after reading the data from the 200 pdf files. After the pre-processing we extract the features from the documents. The steps are carried out in the following order:
1. Extract the data from the pdf files.
2. Sentence Normalization 
3. Word Tokenization
4. Stopwords Removal
5. Unigram creation
6. Bigram creation
7. Retokenization
8. Removing Context Dependent and Rare Tokens
9. Stemming
10. Generating Vocabulary Index File 
11. Generating Sparse Count Vector 

### Extract the data from the pdf files.

1. Importing the required libraries
2. Converting the pdf file to text and downloading the pdfs
3. Extracting the information from pdf

#### Importing all the required libraries

In [12]:
import pdfminer
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt
import io
import os
import glob
import requests
from nltk.collocations import *
from itertools import chain
import itertools
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.probability import *
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/srikarmanthatti/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### Converting the pdf file to text and downloading the pdfs

Here we create an empty list named `main_data`. pdfminer puts many newlines between the elements of the converted
text file, so while reading the file we replace each newline with an empty string and append the cleaned line
to the `main_data` list.

In [2]:
!pdf2txt.py -o group_053.txt Group053.pdf

In [3]:

main_data = []  
with open('group_053.txt') as f:
    for line in f:
        main_data.append(line.replace('\n','')) #clearing the newlines and appending to the main_data

In the code below we read each line from the `main_data` list, extract the pdf name and the url provided for that
particular pdf, and store them as key-value pairs in the `dict_main` dictionary.

In [4]:
pdf_name = re.compile(r'([\w]*(\.pdf))')  # regex for getting the pdf file name
url = re.compile(r'((http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)') # regex for extracting the pdf url
dict_main = {}
# searching through each item in the main_data list and storing the pdf name and url as key-value pairs in dict_main
for line in main_data:
    match1 = pdf_name.search(line)
    match2 = url.search(line)
    if match1 and match2:
        dict_main[match1.group(1)] = match2.group(1)
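
To make the two patterns concrete, the sketch below applies them to a single made-up line. The line content, the file name `PP0001.pdf` and the `example.org` url are assumptions for illustration only; the real lines come from `group_053.txt`.

```python
import re

# Same two patterns as in the cell above.
pdf_name = re.compile(r'([\w]*(\.pdf))')
url = re.compile(r'((http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)')

# Hypothetical line: a file name followed by its download link.
sample_line = "PP0001.pdf https://example.org/papers/PP0001.pdf"

name_match = pdf_name.search(sample_line)
url_match = url.search(sample_line)

sample_dict = {}
if name_match and url_match:
    # group(1) is the whole file name and the whole url respectively
    sample_dict[name_match.group(1)] = url_match.group(1)

print(sample_dict)
```

A line that contains only one of the two pieces is skipped, which is why both matches are checked before the dictionary is updated.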


We now have all the pdf file names and their urls in a dictionary, so we use that dictionary to download each pdf document
and store it under its pdf name.

In [5]:
for file_name, pdf_url in dict_main.items():
    data = requests.get(pdf_url, allow_redirects=True)
    with open(file_name, 'wb') as pdf_file:  # the with-block closes the file after writing
        pdf_file.write(data.content)

#### Extracting the information from pdf

In [6]:
"""Converting the pdf data into text for all downloaded pdf files"""

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

This code block goes through the current working directory, opens each pdf file, converts its contents to text,
and stores the file text and file name in `file_list` and `file_names_list` respectively.

In [7]:
directory = os.getcwd()  # getting the current working directory

In [8]:
file_list = [] #creating an empty list to store the data of each pdf file
file_names_list = [] # creating an empty list to store the name of each pdf file 

for file in os.listdir(directory):
    if file.endswith(".pdf"):
        file_re = re.search(r'(PP[\w]+)\.pdf',file) # regex for the paper pdfs; group(1) is the file name without the extension
        if file_re:
            # appending the text and the name together keeps the two lists aligned
            file_list.append(convert_pdf_to_txt(file_re.group(0))) # converting the pdf to the text format
            file_names_list.append(file_re.group(1)) # appending the file name to the file_names_list
            
            

Extracting paper body from the file_list using `1 Paper Body(.*?)2\sReferences` regular expression and storing into a list called `paper_body`

In [9]:
paper_body = []  #creating an empty list for storing the body of the pdf
for each in file_list:
    paper_regex = re.search(r'1 Paper Body(.*?)2\sReferences', each, re.DOTALL)  ## regex for extracting the body of each pdf file
    if paper_regex:  # only call group() when the pattern matched
        paper_body.append(paper_regex.group(1).strip())
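
As a small illustration of why the pattern needs `re.DOTALL` and a lazy `(.*?)`, the sketch below runs it on a made-up text. The sample text is an assumption; the real text comes from `convert_pdf_to_txt`.

```python
import re

# Hypothetical extracted text laid out like the papers described above.
sample_text = (
    "A Sample Title\n"
    "Authored by:\nJane Doe\n"
    "Abstract\nA short abstract.\n"
    "1 Paper Body\n"
    "First paragraph of the body.\n"
    "Second paragraph of the body.\n"
    "2 References\n"
    "[1] Some reference."
)

# re.DOTALL lets '.' match newlines, so the body may span several lines;
# the lazy '.*?' stops at the first '2 References' marker.
match = re.search(r'1 Paper Body(.*?)2\sReferences', sample_text, re.DOTALL)
body = match.group(1).strip() if match else None

# Without re.DOTALL the same pattern cannot cross a line break.
no_dotall = re.search(r'1 Paper Body(.*?)2\sReferences', sample_text)

print(body)
print(no_dotall)
```

Because `re.search` returns `None` when the markers are missing, the match should be checked before `group(1)` is called.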
        

## Sentence Normalization

In this section we convert the first word of every sentence to lowercase. Words that occur in the middle of a sentence keep their original case.

In [10]:
def sent_tokenizer(paragraphs):
    """This function is used to extract the sentences from the paragraphs.
    Number of arguments for this function is 1"""
    return sent_tokenize(paragraphs)

In [13]:
sentence_normalise = []
for body in paper_body:  ## looping through the body of each pdf
    sentences = sent_tokenizer(body) # converting paragraphs to sentences and storing them in a list
    sent_list=[]
    for sentence in sentences:
        each_list=sentence.split()
        each_list[0] = each_list[0].lower()  # case normalization: only the first word of the sentence is lower-cased
        joined = ' '.join(each_list)
        sent_list.append(joined)
    paperbody_sent = ' '.join(sent_list)
    sentence_normalise.append(paperbody_sent) #storing the normalized sentences in `sentence_normalise` list
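
The effect of this normalization can be seen on a two-sentence example. The sketch below is only an illustration: `naive_sent_split` is a hypothetical, much simpler stand-in for nltk's `sent_tokenize`, used here so that the example runs without the `punkt` data.

```python
import re

def naive_sent_split(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # This is a simplified stand-in for nltk's sent_tokenize, not a replacement.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def normalise_sentence_starts(text):
    normalised = []
    for sentence in naive_sent_split(text):
        words = sentence.split()
        words[0] = words[0].lower()   # only the first word is lower-cased
        normalised.append(' '.join(words))
    return ' '.join(normalised)

example = "The model is trained on NIPS papers. Gaussian processes are used."
result = normalise_sentence_starts(example)
print(result)
```

Note that `NIPS` keeps its capitals because it is in the middle of a sentence, while `Gaussian` is lower-cased because it starts one.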

## Word Tokenization

In the task below we perform tokenization, that is, breaking a character sequence into pieces (tokens).
The tokens created here are all unigrams, but they still contain many stopwords and identifiers. In other words, tokenization gives us the unigram vocabulary together with a lot of unwanted features.

From the sentences (`sentence_normalise`) generated in the previous code block, we extract each word as a token and store the tokens of each paper in `main_list`.

In [14]:
main_list = []
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")  # one tokenizer is enough for all papers
for each in sentence_normalise:
    unigram_tokens = tokenizer.tokenize(each) #extracting tokens by using the regex tokenizer
    main_list.append(unigram_tokens) #appending the tokens to the main_list
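
To see what the tokenizer pattern keeps and drops, the sketch below applies it to a made-up sentence with `re.findall`. Using `re.findall` instead of `RegexpTokenizer` is an assumption made to keep the example free of dependencies; for this pattern the two should return the same matches.

```python
import re

TOKEN_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

# Hypothetical sentence chosen to show the edge cases of the pattern.
sample = "The state-of-the-art model isn't a 2D CNN."
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
# ['The', 'state-of', 'the-art', 'model', "isn't", 'CNN']
```

Three things follow from the pattern: a token must start with a letter and have at least two characters (so `a` and `2D` are dropped), at most one hyphen, apostrophe or question mark can join two parts (so `state-of-the-art` is split into `state-of` and `the-art`), and trailing punctuation is never part of a token.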
    

### Creating the `raw_text` dictionary

We create a dictionary named `raw_text` that maps each pdf file name to the list of tokens extracted from that file.

In [15]:
raw_text = {}
for name,text in zip(file_names_list,main_list): #zipping through the pdf file names and the list of tokens that has been extracted
    raw_text[name] = text

## Stopwords removal

Stop words are the words that carry little lexical content. They are often functional words in English, for example, articles, pronouns, particles, and so on. In NLP and IR, we usually exclude stop words from the vocabulary. We are removing the stopwords which have been provided in the `stopwords_en.txt` text file.

In [17]:
# loading the stopwords_en.txt file 
stopwords = []
with open('./stopwords_en.txt') as f:
    stopwords = f.read().splitlines()
    
stopwordsSet = set(stopwords) 

After loading `stopwords_en.txt`, we remove the stopwords from the 200 paper bodies stored in the dictionary `raw_text`. 

In [18]:
for k,v in raw_text.items():   # removal of stopwords from each paper body in dictionary called raw_text
    raw_text[k] =[w for w in v if w not in stopwordsSet]
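
The filter is a plain set-membership test, so it is case-sensitive. The sketch below shows this on made-up data; `sample_stopwords` is a hypothetical three-word set, whereas the notebook loads the real list from `stopwords_en.txt`.

```python
# Hypothetical stopword set, assumed to be lower-case like a typical stopword list.
sample_stopwords = {"the", "is", "of"}

# Tokens after sentence normalization: a capitalised "The" can only
# survive if it occurred in the middle of a sentence.
sample_tokens = ["the", "kernel", "is", "The", "sum", "of", "Gaussians"]

filtered = [w for w in sample_tokens if w not in sample_stopwords]
print(filtered)
# ['kernel', 'The', 'sum', 'Gaussians']
```

Using a `set` rather than a `list` for the stopwords keeps each lookup at roughly constant cost, which matters when several hundred thousand tokens are filtered.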

## Unigram Creation


We concatenate all the tokenized papers using the `chain.from_iterable` function. The resulting list `words`
is a single flat list containing every token from every paper.

In [19]:
from itertools import chain

words = list(chain.from_iterable(raw_text.values())) # looping through the values, combining them and storing them as one list
vocab = set(words) #getting only the unique values from the words list
lexical_diversity = len(words)/len(vocab)  # average number of occurrences per distinct token
print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(words), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  28611 
Total number of tokens:  430844 
Lexical diversity:  15.058683723043584


## Bigrams Creation

The next task is to generate the bigram collocations from the tokenized papers.

After removing the stopwords from the `raw_text` dictionary, we create bigrams from the unigram tokens stored in its values.

From the list of all tokens we generate the top 200 bigram collocations. The functions needed are:
* BigramAssocMeasures()
* BigramCollocationFinder.from_words()
* apply_word_filter(lambda w: len(w) < 3)
* nbest(bigram_measures.pmi, 200)

In [20]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(words)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)  # ignore words shorter than 3 characters
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-200 bigrams
len(top_200_bigrams)

200
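
The bigrams are ranked by pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur by chance: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ). The sketch below computes this score from raw counts on a toy token list. It is a simplified illustration of the idea, not NLTK's implementation, and `bigram_pmi` and the toy tokens are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_len=3):
    """PMI score for every adjacent word pair whose words have >= min_len characters."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in pair_counts.items():
        if len(w1) < min_len or len(w2) < min_len:
            continue  # same idea as apply_word_filter(lambda w: len(w) < 3)
        # count-based form of log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(pair_count * n / (word_counts[w1] * word_counts[w2]))
    return scores

toy_tokens = ["neural", "network", "model", "neural",
              "network", "layer", "model", "layer"]
scores = bigram_pmi(toy_tokens)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Here `neural network` occurs twice while each word occurs twice overall, so it scores higher than pairs such as `network model` that occur once. PMI tends to favour pairs of rare words, which is one reason a frequency or length filter is usually applied before ranking.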

## Retokenize 

So far the paper bodies have been tokenised into unigrams only. Now that we introduce the 200 collocations, we need to make sure they are not split into two individual words.
After creating the `mwetokenizer` we build a dictionary `colloc_tokens` that holds the unigram and bigram tokens of each pdf.

In [21]:
mwetokenizer = MWETokenizer(top_200_bigrams,separator = '__')  ## creating a MWETokenizer with the bigrams created
colloc_tokens =  dict((pdf, mwetokenizer.tokenize(tokens)) for pdf,tokens in raw_text.items()) # re-tokenizing each pdf so that the collocations present in it become single tokens
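
What the re-tokenization does can be sketched in a few lines of plain Python. `merge_collocations` below is a hypothetical, simplified stand-in for `MWETokenizer` that only handles two-word expressions; the toy bigram and tokens are assumptions.

```python
def merge_collocations(tokens, bigrams, separator="__"):
    """Join adjacent tokens that form a known bigram, scanning left to right."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))
            i += 2   # both words are consumed by the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_bigrams = [("neural", "network")]
toy_tokens = ["deep", "neural", "network", "training", "network"]
merged_tokens = merge_collocations(toy_tokens, toy_bigrams)
print(merged_tokens)
# ['deep', 'neural__network', 'training', 'network']
```

The second `network` stays a unigram because it is not preceded by `neural`, so a word can appear in the vocabulary both on its own and as part of a collocation.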


In [22]:
all_words_colloc = list(chain.from_iterable(colloc_tokens.values()))
colloc_voc = list(set(all_words_colloc))
print(len(colloc_voc))

28431


## Removing Context Dependent and Rare Tokens

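Rare tokens, which appear in fewer than 3% of the documents (fewer than 6 of the 200 papers), are removed, together with context-dependent tokens, which appear in more than 95% of the documents (more than 190 papers).

In [23]:
# finding the document frequency of unigrams and bigrams:
# set(tokens) counts every token at most once per paper
fd_3 = FreqDist(chain.from_iterable(set(tokens) for tokens in colloc_tokens.values()))
less_freq_words = set([ k for k,v in fd_3.items() if v < 6 or v > 190])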
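
The 3% and 95% thresholds refer to the share of documents that contain a token, so with 200 papers they correspond to 6 and 190 documents. The sketch below shows the difference between document frequency and total occurrences on a toy collection. The helper names, the toy documents and the 50% lower threshold used in the demo are assumptions chosen so that a four-document example stays readable.

```python
from collections import Counter

def document_frequency(docs):
    """Number of documents in which each token appears at least once."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))   # set() ignores repeats within one document
    return df

def tokens_to_remove(docs, min_ratio=0.03, max_ratio=0.95):
    n_docs = len(docs)
    df = document_frequency(docs)
    return {t for t, count in df.items()
            if count < min_ratio * n_docs or count > max_ratio * n_docs}

toy_docs = {
    "d1": ["model", "data", "data", "kernel"],
    "d2": ["model", "data"],
    "d3": ["model", "sparse"],
    "d4": ["model", "data"],
}
df = document_frequency(toy_docs)
removed = tokens_to_remove(toy_docs, min_ratio=0.5, max_ratio=0.95)
print(df["data"], sorted(removed))
```

`data` occurs four times in total but only in three documents, so its document frequency is 3 and it is kept, while `model` (in every document) and the two rare tokens are removed.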
    

Now let's remove the context dependent and rare tokens 

In [24]:
for k,v in colloc_tokens.items():
    colloc_tokens[k] = [ w for w in v if w not in less_freq_words]
    


### Removing tokens shorter than 3 characters

In [25]:
for k,v in colloc_tokens.items():
    colloc_tokens[k] = [ w for w in v if len(w) >= 3]
    

## Stemming

The NLTK Porter stemmer is used to reduce inflected forms, such as plural nouns, to a common stem. This shrinks the vocabulary without affecting the overall token count. We apply stemming so that the different forms of a word are counted as one feature in the text analysis.

In [26]:
from nltk.stem import PorterStemmer
stemmed_tokens = {}
stemmer = PorterStemmer()  # stemming using PorterStemmer()
for key,value in colloc_tokens.items():
    # every token is stemmed, whatever its case
    stemmed_tokens[key] = [stemmer.stem(token) for token in value]
    

## Generating Vocabulary Index File 

We generate the vocabulary index from the vocabulary created by the steps above. The file lists the words (features) in sorted order, one `word:index` pair per line.

In [27]:
vocabulary = list(chain.from_iterable(stemmed_tokens.values()))
vocabulary_set = sorted(set(vocabulary))

with open('group053_vocab.txt','w', encoding="utf-8") as file:    # opening the file here
    for index, value in enumerate(vocabulary_set):  # enumerate yields the position in the sorted vocabulary
        file.write(f'{value}:{index}\n')

## Generating Sparse Count Vector 

We create a sparse count vector for each pdf file from the vocabulary index. Each vector contains the counts of the vocabulary terms present in that file. Since the vector is sparse, only terms that actually occur in the document are written, as `index:count` pairs after the file name.

In [28]:
vocab_index = {token: i for i, token in enumerate(vocabulary_set)}  # token -> index lookup
with open('group053_count_vectors.txt','w',encoding="utf-8") as vec:
    for name, tokens in stemmed_tokens.items():
        vec.write(f'{name}')
        # FreqDist holds each distinct token once, so every index:count pair is written only once per document
        for token, count in FreqDist(tokens).items():
            vec.write(f',{vocab_index[token]}:{count}')
        vec.write('\n')
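
The line format of the count-vector file can be illustrated on a tiny vocabulary. In the sketch below the helper `count_vector_line`, the file name `PP0001` and the toy vocabulary are assumptions, and sorting the pairs by index is a choice made for readability rather than something the notebook does.

```python
from collections import Counter

def count_vector_line(name, tokens, vocab_index):
    """Build 'name,index:count,...' with one pair per distinct token."""
    counts = Counter(t for t in tokens if t in vocab_index)
    pairs = sorted((vocab_index[t], c) for t, c in counts.items())
    return name + ''.join(f',{idx}:{cnt}' for idx, cnt in pairs)

toy_vocab = sorted({"data", "kernel", "model"})              # ['data', 'kernel', 'model']
toy_index = {token: i for i, token in enumerate(toy_vocab)}  # token -> position

line = count_vector_line("PP0001", ["model", "data", "model"], toy_index)
print(line)
# PP0001,0:1,2:2
```

`kernel` (index 1) does not occur in the document, so it is left out of the line; this omission of zero counts is what makes the representation sparse.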

## Task 2

In Task 2 we are asked to compute statistics on the authors, titles and abstracts. Before computing them, the titles and abstracts need to be pre-processed, which in turn requires first extracting the titles, authors and abstracts from the text of each pdf.

### Methodology for Task 2:
The following steps have been performed in Task 2:
1. Extracting the required Title, Abstract and Author names from the pdf data
2. Normalizing the extracted data
3. Tokenizing Title and Abstract
4. Removing stopwords
5. Calculating the frequencies and getting the required metrics
6. Generating a CSV file.

A function called `title_extract` is defined to extract the title of each pdf file. It uses the regular expression `(.*?)Authored by:` to capture everything before the `Authored by:` marker.

In [29]:
# function to extract titles of the papers
def title_extract(text):
    """
    Function name: title_extract
    Number of arguments: 1
    Arguments: text
    Description: This function is used to extract the titles from the pdf file
    Return value: Title of the pdf file
    """
    titles_re = re.search(r'(.*?)Authored by:', text,re.DOTALL) # re expression to extract title of the paper
    if titles_re:  # only call group() when the pattern matched
        return titles_re.group(1).strip() # if match we return the title name

A function called `abstract_extract` is defined to extract the abstracts of the pdf file. In this we use `Abstract(.*?)1 Paper Body` regular expression to extract the abstracts.

In [30]:
# function to extract the abstract
def abstract_extract(text):
    """
    Function name: abstract_extract
    Number of arguments: 1
    Arguments: text
    Description: This function is used to extract the abstract from the pdf file
    Return value: Abstract of the pdf file
    """
    abstract_re = re.search(r'Abstract(.*?)1 Paper Body', text,re.DOTALL)  # regular expression to extract abstract
    if abstract_re:      # only call group() when the pattern matched
        return abstract_re.group(1).strip()  # if match we return the abstract of the paper

We then extract the author names from the raw text using the regular expression `Authored by:(.*?)Abstract`, which gives us a list of author names. 

In [31]:
author_list = []
author_list_names = []
for each in file_list:
    author_re = re.search(r'Authored by:(.*?)Abstract', each,re.DOTALL)  ## regex for extracting the author names
    if author_re:  # skip files where the author block is not found
        author_list.append(author_re.group(1).strip().split('\n'))  # one author per line
for each in author_list:  #looping through the list of author names and appending them to a list
    for item in each:
        if len(item) != 0:
            author_list_names.append(item) 

Here we call the two functions defined above, `title_extract` and `abstract_extract`, to extract the titles and abstracts from the raw text.

In [32]:
extracted_title = []
extracted_abstract = []
for each in file_list:
    extracted_title.append(title_extract(each))  #function call to extract title
    extracted_abstract.append(abstract_extract(each)) #function call to extract abstract

### Normalizing Titles and Abstract

We need to normalise the titles and abstracts before tokenization. Titles are converted to lowercase with the `lower()` method. For abstracts, only the first word of each sentence is converted to lowercase, as was done for the paper bodies.

In [33]:
normalised_title = []
for each in extracted_title:
    each = each.lower()
    normalised_title.append(each)

In [34]:
abstract_normalise = []
for abstract in extracted_abstract:
    sentences = sent_tokenizer(abstract)
    sent_list=[]
    for sentence in sentences:
        each_list=sentence.split()
        each_list[0] = each_list[0].lower()  # case normalization: only the first word of the sentence is lower-cased
        joined = ' '.join(each_list)
        sent_list.append(joined)
    abstract_sent = ' '.join(sent_list)
    abstract_normalise.append(abstract_sent)

## Word Tokenization


The word tokenization must use the following regular expression: `[A-Za-z]\w+(?:[-'?]\w+)?`

In [35]:
def regex_tokenization(text,a):
    # a is 0 for titles and 1 for abstracts; both are tokenized with the same pattern
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    return tokenizer.tokenize(text)

In [36]:
paper_title = []
paper_abstract = []
# tokenizing title
for each in normalised_title:
    tokenized_title = regex_tokenization(each,0)
    paper_title.append(tokenized_title)
        
# tokenizing abstract
for text in abstract_normalise:
    tokenized_abstract = regex_tokenization(text,1)
    paper_abstract.append(tokenized_abstract)
    

## Stopwords Removal

The context-independent stop words (i.e. those listed in `stopwords_en.txt`) are removed from the titles and abstracts.

In [37]:
title_stopwords = []
for each in paper_title:
    title_stopwords.append([w for w in each if w not in stopwordsSet])
    
abstract_stopwords = []
for each in paper_abstract:
    abstract_stopwords.append([w for w in each if w not in stopwordsSet])

## Top 10 Most Frequent Authors

In [38]:
fd_authors = FreqDist(author_list_names)
most_comm_authors = fd_authors.most_common(10)

## Top 10 Most Frequent Words in Titles

In [39]:
title_tokens = list(chain.from_iterable(title_stopwords))
fd_titles = FreqDist(title_tokens)
most_comm_titles = fd_titles.most_common(10)

## Top 10 Most Frequent Words in Abstracts

In [40]:
abstract_tokens = list(chain.from_iterable(abstract_stopwords))
fd_abstract = FreqDist(abstract_tokens)
most_comm_abstract = fd_abstract.most_common(10)

In [41]:
def extract_freq_words(my_list):
    k,v = zip(*my_list)
    my_freq_words = list(k)
    return my_freq_words

In [42]:
freq_words_abstract = extract_freq_words(most_comm_abstract)
freq_words_titles = extract_freq_words(most_comm_titles)
freq_words_authors = extract_freq_words(most_comm_authors)
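
The two steps above (counting, then keeping only the names) can be reproduced with the standard library, since nltk's `FreqDist` is a subclass of `collections.Counter`. The sketch below is a minimal illustration on made-up author names; `top_k_items` is a hypothetical helper that also guards against an empty input, a case in which unpacking `zip(*my_list)` into two names would raise an error.

```python
from collections import Counter

def top_k_items(items, k):
    """Return the k most frequent items, most frequent first, without their counts."""
    most_common = Counter(items).most_common(k)   # list of (item, count) pairs
    if not most_common:
        return []                                  # nothing to unzip
    names, _counts = zip(*most_common)             # split the pairs into two tuples
    return list(names)

toy_authors = ["Ada", "Bo", "Ada", "Cy", "Bo", "Ada"]
top_two = top_k_items(toy_authors, 2)
print(top_two)
# ['Ada', 'Bo']
```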

## Converting to csv

In [43]:
# creating a dataframe to append all the top 10 frequent words in titles, abstract and authors.
freq_df = pd.DataFrame()
freq_df['top10_terms_in_abstracts'] = freq_words_abstract  # appending list of top 10 abstracts to column called top10_terms_in_abstracts
freq_df['top10_terms_in_titles'] = freq_words_titles   # appending list of  top 10 titles to column called top10_terms_in_titles
freq_df['top10_authors'] = freq_words_authors       # appending list of  top 10 authors to column called top10_authors

In [44]:
freq_df

| | top10_terms_in_abstracts | top10_terms_in_titles | top10_authors |
|---|---|---|---|
| 0 | learning | learning | Xi Chen |
| 1 | data | models | Han Liu |
| 2 | algorithm | regression | Martin J. Wainwright |
| 3 | model | networks | Ambuj Tewari |
| 4 | problem | inference | Eric P. Xing |
| 5 | show | stochastic | Trevor Darrell |
| 6 | models | modeling | Bernhard Schölkopf |
| 7 | approach | gradient | Jonathan W. Pillow |
| 8 | method | sparse | Larry Wasserman |
| 9 | algorithms | estimation | Robert C. Williamson |


In [45]:
# writing the dataframe to csv file 
freq_df.to_csv(r'group053_stats.csv',index=False)  