# FIT5196 Assessment 2
<a id="FIT5196"></a>

#### Student Name: Jiawei Su
#### Student ID: 29590183


#### Student Name: Weiwei Jin
#### Student ID: 28106946

Date: 09/15/2019

Version: 4.8

Environment: Python 3.7.1 and Jupyter notebook

Libraries used:
* pandas (for dataframe, included in Anaconda Python 3.7.1) 
* re (for regular expression, included in Anaconda Python 3.7.1) 
* etc. 

# Table of contents

* [Student Information](#FIT5196)
* [1. Import libraries](#imp)
* [2. Approach](#appr)
   * [2.1. Patterns](#patt)
   * [2.2. Functions](#func)
* [3. Preparation](#prep)
   * [3.1. Read file](#pdfs)
   * [3.2. Build dictionary](#build)
* [4. Sparse Feature Generation](#sfg)
   * [4.1. Tokenization using provided Regex](#regex)
   * [4.2. Re-tokenization by biagrams](#bia)
   * [4.3. Remove context-independent stop words](#indep)  
   * [4.4. Remove context-dependent stop words](#dep) 
   * [4.5. Remove rare tokens](#rare)
   * [4.6. Remove tokens with length less than 3](#three)
   * [4.7. Stemming via the Porter Stemmer](#porter)
* [5. Statistics Generation](#sg)
   * [5.1. Prepare dictionary for Statistics Generation](#prepsg)
   * [5.2. Process titles](#ptitles)
   * [5.3. Process abstract](#pab)
   * [5.4. Process authors](#pau)
* [6. File export](#exp)
* [7. Summary](#sum)

## 1.  Import libraries 
<a id="imp"></a>

In [1]:
# import libraries used in this assessment

# For PDF download
import urllib.request
import importlib
import sys
importlib.reload(sys)

# For PDF reading and NLTK + text analyzing purposes
from pdfminer.pdfparser import PDFParser,PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal,LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import itertools
from itertools import chain
import nltk.data
from nltk import ngrams, FreqDist
from nltk.stem.porter import PorterStemmer

# Re library provides regex related functions
import re

# Pandas library provides high-performance, easy-to-use data structures 
# and data analysis tools
import pandas as pd

## 2. Approach
<a id="appr"></a>

We take the general approach as follows:
* 1 We define critical functions
* 2 We download file programmatically
* 3 We read PDF into dictionary of data iteratively
* 4 We process the dictionary as required
* 5 We export required data in correct format
* 6 We wrote retrospectives and summary for this assignment

### 2.1 Patterns
<a id="patt"></a>
Here are useful `PATTERNS` used later in this assignment.

In [2]:
PDF_SOURCE_PATH = r'Group067.pdf'
TEXT_ORIGINAL_PATH = 'links_orig.txt'
BODY_SECTION_PATTERN = re.compile(r'\n[1-9]\n')
TOKENIZATION_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"
INDEX_FILE_PATH = "Group067_vocab.txt"
VECTOR_FILE_PATH = "Group067_count_vectors.txt"
STAT_FILE_PATH = "Group067_stats.csv"

### 2.2 Functions
<a id="func"></a>
In this secition, we define a couple of functions prior to processing the file. The goal is to make it simpler when we use the them to extract the data later on.

In [3]:
# Function to parse a PDF file, preprocess it, add the raw content of each line to a text file.
def parse_to_txt(path, outpath):
    # Open file in binary read mode
    fp = open(path, 'rb')
    
    # Use file to breate a PDF parser 
    praser = PDFParser(fp)
    
    # Generate a PDF document
    doc = PDFDocument()
    
    # pass document to parser
    praser.set_document(doc)
    doc.set_parser(praser)

    doc.initialize()
    
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create PDF resource manager
        rsrcmgr = PDFResourceManager()
        
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # traverse page list obtained from get_pages()
        for page in doc.get_pages(): 
            interpreter.process_page(page)
            
            layout = device.get_result()
            for x in layout:
                if (isinstance(x, LTTextBoxHorizontal)):
                    with open(outpath, 'a') as f:
                        results = x.get_text()
                        f.write(results + '\n')



# Function to parse a PDF file, preprocess it, prepare it for processing.
def parse_to_proc(path):
    # Open file in binary read mode
    fp = open(path, 'rb')
    
    # Use file to breate a PDF parser 
    praser = PDFParser(fp)
    
    # Generate a PDF document
    doc = PDFDocument()
    
    # pass document to parser
    praser.set_document(doc)
    doc.set_parser(praser)

    doc.initialize()
    
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create PDF resource manager
        rsrcmgr = PDFResourceManager()
        
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        
        # Define a dictionary to return abstract, titles and body of every PDF file.
        essay_dict = {}
        essay_dict["abstract"] = ""
        essay_dict["titles"] = ""
        essay_dict["authors"] = ""
        essay_dict["body"] = []
        
        return_string = ""

        # Traverse page list obtained from get_pages()
        for page in doc.get_pages(): 
            interpreter.process_page(page)
            
            layout = device.get_result()
            for x in layout:
                if (isinstance(x, LTTextBoxHorizontal)):
                    results = x.get_text()
                    if (len(results) <= 3):
                        pass
                    else:
                        if "Authored by:" in results:
                            temp_str = return_string.strip("\n").replace("\n", " ")
                            final_str = temp_str.replace("- ", "")
                            essay_dict["titles"] = final_str
                            return_string = ""
                            continue
                        elif "Abstract\n" in results:
                            temp_str = return_string.strip("\n")
                            final_str = temp_str.replace("- ", "")
                            essay_dict["authors"] = final_str
                            return_string = ""
                            continue
                        elif "1 Paper Body" in results:
                            temp_str = return_string.strip("\n").replace("\n", " ")
                            final_str = temp_str.replace("- ", "")
                            essay_dict["abstract"] = final_str
                            return_string = ""
                            continue
                        elif "2 References" in results:
                            res = re.split(BODY_SECTION_PATTERN, return_string)
                            for ree in res:
                                temp_str = ree.replace("\n", " ")
                                final_str = temp_str.replace("- ", "")
                                essay_dict["body"].append(final_str)
                            continue
                        return_string += results              
        return essay_dict

                        
# Function to read the a text file, preprocess it, add the content of the text file to a list and return the list.
def read_file(path):        
    content = []
    try:
        with open(path, 'r', encoding = 'utf-8') as fp: 
            for line in fp:
                if 'url' in line:
                    pass
                elif 'filename' in line:
                    pass
                else:
                    if line:
                        content.append(line.strip('\n'))
    finally:
        fp.close()
    return content

# Function to download PDFs and return a list of filenames downloaded
def catch_pdf(file_list):
    COUNT = 1
    filenames = []
    for entry in file_list:
        url = entry.split()[1]
        print('downloading file No. ' + str(COUNT) + ' with urllib...')
        COUNT += 1
        filenames.append(entry.split()[0])
        OUTPUT_PATH = "pdf/" + entry.split()[0]
        urllib.request.urlretrieve(url, OUTPUT_PATH)
    return filenames

# Function to convert words to lowercase except the capital tokens appearing in the middle of a sentence/line
def to_lower(sentence):
    new_sen = [word if word.isupper() else word.lower() for word in sentence.split()]

    return " ".join(word for word in new_sen)

# Function to read the stopwords_en.txt, add the content of the text file to a list and return the list of stop words.
def read_independent_stop_words():        
    content = []                                         
    try:
        with open('stopwords_en.txt', 'r', encoding = 'utf-8') as fp: 
            for line in fp:
                content.append(line.strip("\n"))
    finally:
        fp.close()
    return content

# Function to catch a list of word which overpass the threshold of 0.95
def get_dependent_tokens(dic, num):
    a_list = []
    word_list = []
    for key, value in dic.items():
        a_list += list(set(value))
    fd = FreqDist(a_list)
    for word, value in fd.items():
        if value > (num * 200):
            word_list.append(word)
    return word_list

# Function to catch a list of rare word token which has literal frequency of less than 0.03
def get_rare_tokens(dic, num):
    a_list = []
    word_list = []
    for key, value in dic.items():
        a_list += list(set(value))
    fd = FreqDist(a_list)
    for word, value in fd.items():
        if value < (num * 200):
            word_list.append(word)
    return word_list

# Function to take a list of tokens and a number \num\ an return top \num\ bigram collocations
def generate_bigram_col(tokens, num):
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
    bigram_finder.apply_freq_filter(20)
    bigram_finder.apply_word_filter(lambda w: len(w) < 5)
    
    return bigram_finder.nbest(bigram_measures.pmi, num)

## 3. Preparation
<a id="prep"></a>

Here in this section, we prepare some data for this assignment. We started from the file we are provided which is `Group067.pdf`, consists of a list of 200 essays in PDF format.

### 3.1 Download PDFs
<a id="pdfs"></a>
Use the given URLs to programmatically download the PDF files, files are saved to the directory `/pdf`, also a list consists of all file names called `filenames` is also generated.

In [4]:
# Read provided file to text format locally
parse_to_txt(PDF_SOURCE_PATH, TEXT_ORIGINAL_PATH)

# Use the text file above and start downloading
str_list = list(filter(None, read_file(TEXT_ORIGINAL_PATH))) 

In [5]:
filenames = catch_pdf(str_list)

downloading file No. 1 with urllib...
downloading file No. 2 with urllib...
downloading file No. 3 with urllib...
downloading file No. 4 with urllib...
downloading file No. 5 with urllib...
downloading file No. 6 with urllib...
downloading file No. 7 with urllib...
downloading file No. 8 with urllib...
downloading file No. 9 with urllib...
downloading file No. 10 with urllib...
downloading file No. 11 with urllib...
downloading file No. 12 with urllib...
downloading file No. 13 with urllib...
downloading file No. 14 with urllib...
downloading file No. 15 with urllib...
downloading file No. 16 with urllib...
downloading file No. 17 with urllib...
downloading file No. 18 with urllib...
downloading file No. 19 with urllib...
downloading file No. 20 with urllib...
downloading file No. 21 with urllib...
downloading file No. 22 with urllib...
downloading file No. 23 with urllib...
downloading file No. 24 with urllib...
downloading file No. 25 with urllib...
downloading file No. 26 with urlli

### 3.2 Build dictionary
<a id="build"></a>
Now, we have the 200 essays in PDF we are ready to use `PDFMINER3K` to process them. We now loop `filenames`, and tokenize each essay's body. We also save the essay's title, author and abstract in a dictionary and return at the same time

In [6]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer = RegexpTokenizer(TOKENIZATION_PATTERN)
token_dict = {}
#for_TF_IDF = []

# Process files
for name in filenames:
    FULL_PATH = 'pdf/' + name
    uni_sents = []
    
    for section in parse_to_proc(FULL_PATH)["body"]:
        sentences = sent_detector.tokenize(section.strip())
        for sent in sentences:
            uni_sent = tokenizer.tokenize(to_lower(sent))
            uni_sents.append(uni_sent)
            
    #for entry in uni_sents:
        #for_TF_IDF.append(' '.join(token for token in entry))
    
    word_tokens = list(chain.from_iterable(li for li in uni_sents))
    token_dict[name.strip(".pdf")] = word_tokens

# Returned dictionary
token_dict

{'PP3206': ['machine',
  'learning',
  'has',
  'been',
  'applied',
  'to',
  'number',
  'of',
  'tasks',
  'involving',
  'an',
  'input',
  'domain',
  'with',
  'special',
  'topology',
  'one-dimensional',
  'for',
  'sequences',
  'two-dimensional',
  'for',
  'images',
  'three-dimensional',
  'for',
  'videos',
  'and',
  'for',
  'capture',
  'some',
  'learning',
  'algorithms',
  'are',
  'generic',
  'working',
  'on',
  'arbitrary',
  'unstructured',
  'vectors',
  'in',
  'such',
  'as',
  'ordinary',
  'svms',
  'decision',
  'trees',
  'neural',
  'networks',
  'and',
  'boosting',
  'applied',
  'to',
  'generic',
  'learning',
  'algorithms',
  'on',
  'the',
  'other',
  'hand',
  'other',
  'learning',
  'algorithms',
  'successfully',
  'exploit',
  'the',
  'speciﬁc',
  'topology',
  'of',
  'their',
  'input',
  'sift-based',
  'machine',
  'vision',
  'convolutional',
  'neural',
  'networks',
  'time-delay',
  'neural',
  'networks',
  'it',
  'has',
  'been',

## 4. Sparse Feature Generation
<a id="sfg"></a>

In this section we need to retrieve the sparse representation, but before that there are a number of pre-processing steps we need to perform. Includes but not limited to:
* Tokenization
* Strip stop words
* Stemming

### 4.1 Tokenization using provided Regex
<a id="regex"></a>
First as we have already have the dictionary `token_dict` of filename to tokens from Section 3, I have also identified that there are words that contains a `?` which is not usable. So after finishing tokenization, I will do a minor cleanup of bad tokens which consist `?`.

In [7]:
# Generate refine_token_dict which exclude tokens which consist a question mark.
refine_token_dict = {}
for key, value in token_dict.items():
    refine_token_dict[key] = [word for word in value if "?" not in word]

# a list of all tokens and a set of unique tokens for further usage
all_tokens = list(chain.from_iterable(refine_token_dict.values()))
uni_all_tokens = list(set(all_tokens))

### 4.2 Re-tokenization by biagrams
<a id="bia"></a>
The next step is that we use the privous dictionary `token_dict` and apply re-tokenization to it using Top 200 biagrams, generate a new dictionary `bia_dict`. Notice that since biagrams should not include context-independent stopwords as part of them, here I pre-calculate the `stopwords_set`.

<div class="alert alert-block alert-warning">

**`biagram_tokenize`**: The function takes `a dictionary` and `a list of biagrams` as variables, retokenize using `MWETokenizer`.

In [10]:
# Biagrams should be separated using double underscore i.e. “__” when retokenization happens
def biagram_tokenize(dic, biagram_list):
    new_dic = {}
    mwe_tokenizer = MWETokenizer(biagram_list, separator='__')
    for key, value in dic.items():
        new_dic[key] = mwe_tokenizer.tokenize(value)
    return new_dic

In [12]:
# Generate the top 500 bigram collocations 
# Note that with 5 set as the word filter, any bigram appears less than 5 times will not be considered
bia_list = []
top_500_bigrams = generate_bigram_col(all_tokens, 500)

# Generate the list of context-independent stop words
stop_list = read_independent_stop_words()
stopwords_set = set(stop_list)

# Bigrams should not include context-independent stopwords as part of them
stopped_bia_list = [bia for bia in top_500_bigrams if bia[0] not in stopwords_set and bia[1] not in stopwords_set]

# Generate new dictionary
bia_dict = biagram_tokenize(refine_token_dict, stopped_bia_list[:200])

In [13]:
#bia_dict

### 4.3 Remove context-independent stop words
<a id="indep"></a>
Next, after tokenization process we then need to remove context-independent words which comes from the text file `stopwords_en.txt` provided from the teaching team. A new dictionary `stopped_tokens_dict` will be generated as a result.

<div class="alert alert-block alert-warning">

**`content_in_stop`**: The function takes `a dictionary` as a variable, removes all context-independent stop words from it and returns a new dictionary.

In [14]:
def content_in_stop(dic):
    new_dic = {}
    for key, value in dic.items():
        new_dic[key] = [word for word in value if word not in stopwords_set]
    return new_dic

In [15]:
# Apply content_in_stop to remove context-independent stop words, note that stopwords_set is generated in section 4.2
stopped_tokens_dict = content_in_stop(bia_dict)
#stopped_tokens_dict

### 4.4 Remove context-dependent stop words
<a id="dep"></a>
Next we continue to get rid of stop words, this time we aim at context-dependent we calculate using the following step:
1. `get_dependent_tokens` function as mentioned above, takes the dictionary `stopped_tokens_dict` from last step and make sure each word only appears 1 time in the list of each file.
2. The second argument of `get_dependent_tokens` function makes it return all words which shows up in equal or more than `95%` of the 200 documents in a list format
3. Lastly we remove these words from the token dictionary and returns a new dictionary `full_stopped_tokens_dict`

<div class="alert alert-block alert-warning">

**`content_de_stop`**: The function takes `a dictionary` and `a word list` as a variable, removes all context-dependent stop words in that list from the dictionary and returns a new dictionary.

In [16]:
def content_de_stop(dic, word_list):
    new_dic = {}
    for key, value in dic.items():
        new_dic[key] = [word for word in value if word not in word_list]
    return new_dic

In [17]:
# Generate list of context-dependent stop words
dependent_stop_list = get_dependent_tokens(stopped_tokens_dict, 0.95)

full_stopped_tokens_dict = content_de_stop(stopped_tokens_dict, dependent_stop_list)
#full_stopped_tokens_dict

### 4.5 Remove rare tokens
<a id="rare"></a>
Next we will strip rare tokens from the token dictionary `full_stopped_tokens_dict`. The definition of rare tokens are the token which does not surpass the threshold of 0.03. The step to get word frequency is the same as section 4.4. The return dictionary is called `final_stopped_tokens_dict`.

<div class="alert alert-block alert-warning">

**`rare_token_stop`**: The function takes `a dictionary` and `a word set` as a variable, removes all rare tokens in that list from the dictionary and returns a new dictionary.

In [18]:
def rare_token_stop(dic, word_list):
    new_dic = {}
    for key, value in dic.items():
        new_dic[key] = [word for word in value if word not in word_list]
    return new_dic

In [19]:
# Generate list of rare tokens
rare_token_list = get_rare_tokens(full_stopped_tokens_dict, 0.03)
rare_token_set = set(rare_token_list)

final_stopped_tokens_dict = rare_token_stop(full_stopped_tokens_dict, rare_token_set)
#final_stopped_tokens_dict

### 4.6 Remove tokens with length less than 3
<a id="three"></a>
Next, we will scan the tokens and simply remove any token with the length less than 3. We will return a dictionary called `long_final_stopped_tokens_dict`.

<div class="alert alert-block alert-warning">

**`short_token_stop`**: The function takes `a dictionary` the variable, removes all tokens with a length of 2 or less from the dictionary and returns a new dictionary.

In [20]:
def short_token_stop(dic):
    new_dic = {}
    for key, value in dic.items():
        new_dic[key] = [word for word in value if len(word) > 3]
    return new_dic

In [21]:
long_final_stopped_tokens_dict = short_token_stop(final_stopped_tokens_dict)
#long_final_stopped_tokens_dict

### 4.7 Stemming via the Porter Stemmer
<a id="porter"></a>
Finally, we will perform Porter Stemmer. The steamming process helps us to eliminate the words which are in different form. A dictionary called `stemmed_long_final_stopped_tokens_dict` will be returned as a result.

<div class="alert alert-block alert-warning">

**`stemming`**: The function takes `a dictionary` the variable, perform `stemming` for words in the dictionary and returns a new dictionary. Notice that stemming performs lower casing by default, so there is a condition that is a word is in `UpperCASE`, this means that this is a special term and should be kept the same without going through port stemmer.

In [22]:
def stemming(dic):
    new_dic = {}
    for key, value in dic.items():
        porter_stemmer = PorterStemmer()
        new_dic[key] = [word if word.isupper() else porter_stemmer.stem(word) for word in value]
    return new_dic

In [23]:
stemmed_long_final_stopped_tokens_dict = stemming(long_final_stopped_tokens_dict)
stemmed_long_final_stopped_tokens_dict

{'PP3206': ['machine__learn',
  'appli',
  'task',
  'involv',
  'input',
  'domain',
  'special',
  'topolog',
  'one-dimension',
  'sequenc',
  'two-dimension',
  'imag',
  'video',
  'captur',
  'learn',
  'algorithm',
  'gener',
  'work',
  'arbitrari',
  'unstructur',
  'vector',
  'ordinari',
  'svm',
  'decis',
  'tree',
  'neural__network',
  'boost',
  'appli',
  'gener',
  'learn',
  'algorithm',
  'hand',
  'learn',
  'algorithm',
  'success',
  'exploit',
  'speciﬁc',
  'topolog',
  'input',
  'machin',
  'vision',
  'convolutional__neur',
  'network',
  'neural__network',
  'two-dimension',
  'structur',
  'natur',
  'imag',
  'strong',
  'prior',
  'requir',
  'huge',
  'bit',
  'start',
  'complet',
  'uniform',
  'prior',
  'permut',
  'question',
  'studi',
  'two-dimension',
  'structur',
  'natur',
  'imag',
  'strong',
  'prior',
  'learn',
  'exampl',
  'small',
  'exampl',
  'discov',
  'structur',
  'conjectur',
  'imag',
  'topolog',
  'incorrect',
  'answer',
 

## 5. Statistics Generation
<a id="sg"></a>

Now the second task of this assignment is to analyzi the `title`, `abstract` and the `authors` of all the essays. The approach we use is similar to the previous section.

### 5.1 Prepare dictionary for Statistics Generation
<a id="prepsg"></a>
First we prepare the three dictionaries, each stores key of `file name` and `tokens` or `names` as value correspondant to that file name key.

In [24]:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer = RegexpTokenizer(TOKENIZATION_PATTERN)
title_dict = {}
abstract_dict = {}
author_dict = {}

# Process files
for name in filenames:
    FULL_PATH = 'pdf/' + name
    target_dic = parse_to_proc(FULL_PATH)
    
    title_dict[name.strip(".pdf")] = tokenizer.tokenize(target_dic["titles"].lower())
    
    sentences = sent_detector.tokenize(target_dic["abstract"].strip())
    for sent in sentences:
        uni_sent = tokenizer.tokenize(to_lower(sent))
        uni_sents.append(uni_sent)
            
    #for entry in uni_sents:
        #for_TF_IDF.append(' '.join(token for token in entry))
    
    temp_tokens = list(chain.from_iterable(li for li in uni_sents))
    abstract_dict[name.strip(".pdf")] = temp_tokens
    
    author_dict[name.strip(".pdf")] = target_dic["authors"].split("\n")

### 5.2 Process titles
<a id="ptitles"></a>
Now we need to process titles of these essays in the dictionary `title_dict`. We use the same function as we did in sectin `4.3`.

In [25]:
# Simply apply content_in_stop we defined and used before for Section 4.3
stopped_title_dict = content_in_stop(title_dict)
sum_list = list(chain.from_iterable(stopped_title_dict.values()))
fd_title = FreqDist(sum_list)

# Print 10 most common tokens using FreqDist()
print(fd_title.most_common(10))

[('learning', 47), ('models', 13), ('optimization', 12), ('data', 11), ('neural', 11), ('inference', 11), ('eﬃcient', 10), ('bayesian', 9), ('gaussian', 9), ('hierarchical', 9)]


### 5.3 Process abstract
<a id="pad"></a>
Then, we need to process abstract of each these essays in the dictionary `abstract_dict`. Again we can use the same technique as `Section 4.3` and `Section 5.2`

In [26]:
stopped_abstract_dict = content_in_stop(abstract_dict)
sum_abs_list = list(chain.from_iterable(stopped_abstract_dict.values()))
fd_abs = FreqDist(sum_abs_list)

# Print 10 most common tokens using FreqDist()
print(fd_abs.most_common(10))

[('matrix', 22296), ('learning', 19761), ('algorithm', 18966), ('problem', 16007), ('CCD', 14600), ('method', 13284), ('rate', 12579), ('RCD', 12200), ('show', 12129), ('problems', 11850)]


### 5.4 Process authors
<a id="pau"></a>
Lastly, we process author list of these essays in the dictionary `author_dict`. 

In [27]:
sum_au_list = list(chain.from_iterable(author_dict.values()))
fd_au = FreqDist(sum_au_list)

# Print 10 most common tokens using FreqDist()
print(fd_au.most_common(10))

[('R?mi Munos', 3), ('Lawrence Carin', 3), ('David M. Blei', 3), ('Han Liu', 3), ('Remi Munos', 3), ('Yoshua Bengio', 2), ('Bhiksha Ra j', 2), ('Phil Blunsom', 2), ('Alain Rakotomamonjy', 2), ('Constantine Caramanis', 2)]


## 6. File export
<a id="exp"></a>
This section, we export our result to out_put files

<div class="alert alert-block alert-warning">

**`dict_to_file`**: The function takes `a dictionary` and `a string` the variable, export the content of the dictionary in the correct form as per the assignment's requirement.

In [28]:
def dict_to_file(dic, NAME):
    a_list = list(chain.from_iterable(dic.values()))
    final_list = sorted(list(set(a_list)), key=lambda s: s.casefold())
    
    save_file = open(NAME, 'w')
    for index, word in enumerate(final_list):
        save_file.write(word + ':' + str(index) + '\n')
    save_file.close()

<div class="alert alert-block alert-info">
    Export Group067_vocab.txt

In [29]:
dict_to_file(stemmed_long_final_stopped_tokens_dict, INDEX_FILE_PATH)

<div class="alert alert-block alert-warning">

**`to_count_vectors`**: The function takes `a dictionary` and `a string` the variable, export the vector content in the correct form as per the assignment's requirement.

In [30]:
def to_count_vectors(namelist, PATH):
    a_list = list(chain.from_iterable(stemmed_long_final_stopped_tokens_dict.values()))
    final_list = sorted(list(set(a_list)), key=lambda s: s.casefold())
    
    save_file = open(PATH, 'w')
    for name in namelist:
        fd = FreqDist(token_dict[name.strip(".pdf")])
        a_str = name.strip(".pdf")
        for word, value in fd.items():
            for index, w in enumerate(final_list):
                if word == w:
                    a_str += ","
                    temp_str = str(index) + ":" + str(value)
                    a_str += temp_str
        save_file.write(a_str + "\n")
    save_file.close()

<div class="alert alert-block alert-info">
    Export Group067_count_vectors.txt

In [31]:
to_count_vectors(filenames, VECTOR_FILE_PATH)

<div class="alert alert-block alert-info">
    Export Group067_stats.csv

In [32]:
df_new = pd.DataFrame()

# Add column for terms in abstract
freq_abs = []
for term, value in fd_abs.most_common(10):
    freq_abs.append(term)
df_new["top10_terms_in_abstracts"] = freq_abs

# Add column for terms in titles
freq_titles = []
for term, value in fd_title.most_common(10):
    freq_titles.append(term)
df_new["top10_terms_in_titles"] = freq_titles

# Add column for trending authors
freq_au = []
for term, value in fd_au.most_common(10):
    freq_au.append(term)
df_new["top10_authors"] = freq_au

# Display the table
df_new

# Export to .csv format
df_new.to_csv(STAT_FILE_PATH, index=False)

## 7. Summary
<a id="sum"></a>
Give a short summary of your work done above, such as your findings.

<div class="alert alert-block alert-info">

In this assessment, we are asked to extract document content from multiple pdf documents and `pre-process` the content and then convert the raw data into more-structured data for further processing like ML. 

After getting the raw data, we analyze the specific part of the content `differently`.

To pre-process the data, the main flow of content is: 
* First, use regular expressions to mark some part of content, divide the file into different parts. 
* Then, during this process, data is also treated for the purpose of easy-read. 
* Next, tokens are extracted. There are several rules we followed during this process
* Next, we need to delete the "stop words" - such as English prepositions as there is no practical meaning for those words. Also we strip words which are merely in the different forms of another word. 
* Finally, tokens are adjusted based on the word frequency, and to top that up word order are adjusted. Sometimes, we also need to consider whether we need to treat Uppercase words differently, then sentence segmentation is also needed to satisfy that requirement.