# Plan
- Preprocessing
    - do data analysis on tokens

- Test which preprocessing is best w/ simple model
    - pick best by accuracy on task using simple model (Naive Bayes)
    - do data analysis on the vectors

- Try out a bunch of different models; pick best three
- Model 1 section
    - discuss
    - train
    - hyperparameter optimization (using F1 averaged over classes)
    - evaluate in detail
- Model 2 section
- Model 3 section

- Try ensembling (evaluate using F1 again, decide if it's worth it)

- Apply final model to test set

# Intro to Text Analytics - Homework 2
### Group 1: Boluwade Alabi, Elizabeth Burke, Lilah Koudelka, Michael Mullen, Cathal Weakliam

## Table of Contents
- [1.0 - Preprocessing](#preprocess)
     - [1.1 - Token Analysis](#token_analy)
- [2.0 - Comparing Preprocessing Methods](#comp_methods)
     - [2.1 - Highest Accuracy Produced Using a Simple Model](#best_model)
     - [2.2 - Vector Data Analysis](#vect_analy)
- [3.0 - Initial Model Building](#initial_models)
     - [3.1 - Selecting the Top 3 Models](#top3)
- [4.0 - Model #1](#m1)
     - [4.1 - Discussion](#m1_discuss)
     - [4.2 - Training](#m1_train)
     - [4.3 - Hyperparameter Optimization](#m1_optimize)
     - [4.4 - Evaluation](#m1_eval)
- [5.0 - Model #2](#m2)
     - [5.1 - Discussion](#m2_discuss)
     - [5.2 - Training](#m2_train)
     - [5.3 - Hyperparameter Optimization](#m2_optimize)
     - [5.4 - Evaluation](#m2_eval)
- [6.0 - Model #3](#m3)
     - [6.1 - Discussion](#m3_discuss)
     - [6.2 - Training](#m3_train)
     - [6.3 - Hyperparameter Optimization](#m3_optimize)
     - [6.4 - Evaluation](#m3_eval)
- [7.0 - Ensembling](#ensemble)
- [8.0 - Applying the Final Model to the Test Set](#apply_final)

<a id='preprocess'></a>

## 1.0 - Preprocessing

In [None]:
import bs4
import collections
import math
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import wordnet
import numpy as np
import os
import pandas as pd
import requests
import sklearn
import string
import wordcloud as wc
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

## Gather Data into Data Frame

The very first thing we need to do before we can analyse our data is to gather the data from the input files and enter it into a Pandas DataFrame.

### Read Documents Into a Data Frame

We read the training set, a `^`-delimited text file, into a Pandas DataFrame object so we can easily analyse the data.

In [None]:
# encoding=utf-8
df_handle = pd.read_csv("trainingset.csv", sep="^", header=0)

df_handle.head()

### Save Data Frame to CSV File

Now that we have read the data into our dataframe, we will need a way to export the data. To do this, we write the dataframe to a delimited text file in the CSV style, using `^` rather than a comma as the separator.

In [None]:
df_handle.to_csv("h2_tokens.csv", sep='^', index=None)

We then read our data from the file we just wrote to, in order to ensure that it was written to correctly.

In [None]:
reread_data = pd.read_csv("h2_tokens.csv", sep="^")

assert(reread_data.equals(df_handle))
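
As a minimal sketch of the same write-then-reread check, the cell below uses only the standard-library `csv` module and an in-memory buffer instead of pandas and a file on disk. The rows and the `category` column are hypothetical; only the `^` delimiter mirrors the cells above. It also suggests why a non-comma delimiter is convenient when the text itself contains commas.

```python
import csv
import io

# Hypothetical rows; the 'content' values deliberately contain commas.
rows = [
    {"content": "Phones, tablets and laptops", "category": "tech"},
    {"content": "Goals, cards and penalties", "category": "sport"},
]

# Write with '^' as the delimiter, mirroring sep="^" in the cells above.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["content", "category"], delimiter="^")
writer.writeheader()
writer.writerows(rows)

# Read the same text back and confirm that nothing changed on the way.
buffer.seek(0)
reread_rows = [dict(row) for row in csv.DictReader(buffer, delimiter="^")]

assert reread_rows == rows
print(reread_rows[0]["content"])
```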

We have now successfully loaded our corpus into a data frame. 
Now we can do some cleaning and pre-processing steps on the data contained within.

## Extract Tokens From Raw Text

Now that we have our data, we want to process it. However, textual data is unstructured. This means that it can come in many different forms. We will have to develop a process by which to normalize the text so that the algorithms that we apply to each document will treat each document equally.


To do this, we will split each document's textual content into a list of words, called tokens. This process is called *tokenization*. After tokenizing, we will perform some data cleaning and transformation, such as removing special characters and converting every word to lowercase.

We define each of our transformation functions individually. We also define a function, `normalize_text()`, that performs all of the transformation functions on a dataframe series.

We have the individual functions so that we can go through and discuss them one-by-one in this section. In later sections, however, we will use the conglomerate function for brevity.

### Transformation Functions

Firstly, we define some utility functions that are needed by the transformation functions.

In [None]:
def get_stop_words():
    """
    Returns a list of stopwords that should be removed when preprocessing text.
    Included are nltk stopwords, salutations and a list of stopwords shown in the labs.
    """
    stop_words = nltk.corpus.stopwords.words('english')
    
    # Add salutations to the stop words list.
    salutations = ['mr','mrs','mss','dr','phd','prof','rev']
    stop_words.extend(salutations)
        
    additional_stop_words = ["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead",
        "into","inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly",
        "try","trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]
    stop_words.extend(additional_stop_words)
    
    return stop_words


def pos_to_wordnet_tag(pos_tag):
    """
    Convert an NLTK.pos_tag to a WordNet tag that the lemmatizer uses.
    """
    if pos_tag.lower().startswith('j'):
        return 'a'
    elif pos_tag.lower().startswith('v'):
        return 'v'
    elif pos_tag.lower().startswith('n'):
        return 'n'
    elif pos_tag.lower().startswith('r'):
        return 'r'
    else:
        # Default POS for lemmatization is noun.
        return 'n'
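
To illustrate what the tag conversion does, here is a small dictionary-based sketch that should behave like `pos_to_wordnet_tag()` for typical Penn Treebank tags. The names `WORDNET_TAG_BY_INITIAL` and `to_wordnet_tag` are hypothetical and are not used elsewhere in this notebook.

```python
# First letter of a Penn Treebank tag -> single-letter WordNet POS code.
WORDNET_TAG_BY_INITIAL = {"j": "a", "v": "v", "n": "n", "r": "r"}


def to_wordnet_tag(pos_tag, default="n"):
    """Hypothetical dictionary-based variant of pos_to_wordnet_tag()."""
    # Any tag that is not an adjective, verb, noun or adverb falls back to noun.
    return WORDNET_TAG_BY_INITIAL.get(pos_tag[:1].lower(), default)


examples = ["JJ", "VBD", "NNS", "RB", "IN", "CD"]
mapping = {tag: to_wordnet_tag(tag) for tag in examples}
print(mapping)
```

Tags such as `IN` (preposition) and `CD` (cardinal number) have no WordNet counterpart here, so they take the noun default, just as in the `else` branch above.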

Now we define the transformation functions themselves.

In [None]:
def remove_hyphens(series):
    """
    Takes a string of text and returns a string of text with hyphens deleted.
    We do this so words like "e-mail", "wi-fi", etc. are kept together.
    """
    return series.replace("-", "")


def tokenize(series):
    """
    Takes a string of text and returns a list of string tokens.
    """
    # Pattern matches one or more alphanumeric characters (or underscores).
    TOKENIZER_REGEX = r'\w+'
    tokenizer = nltk.tokenize.RegexpTokenizer(TOKENIZER_REGEX)  
    return tokenizer.tokenize(series)


def decapitalize(series):
    """
    Takes a list of tokens and returns a list of tokens, with every token in lowercase form.
    """
    # Map the string to-lowercase method to every value in the series.
    return [word.lower() for word in series]


def remove_numbers(series):
    """
    Takes a list of tokens and returns a list of tokens that do not consist solely of numeric characters.
    For example, "3" is removed, but not "3G" or "Three".
    """
    return [word for word in series if not word.isnumeric()]


def remove_special_chars(series):
    r"""
    Takes a list of tokens and returns that list, without special characters.
    
    The only special characters that are left in the text are underscores since we used
    the regex '\w+'. Tokens that become empty once underscores are stripped are dropped.
    """
    cleaned = [word.replace("_", "") for word in series]
    return [word for word in cleaned if word]


def remove_punctuation(series):
    """ 
    Takes a list of tokens and returns a list of tokens, with any punctuation stripped.
    
    This is not included in the normalize_text() function since it is not needed - the
    punctuation is removed in the tokenize() function anyway. However, it is included 
    here for completeness and is used in the n-gram analysis later.
    """
    results = []
    
    for word in series:
        new_word = "".join(character for character in word if character not in string.punctuation)
        
        if new_word != "":
            results.append(new_word)
    
    return results

stop_words = get_stop_words()

def remove_stop_words(series):
    """
    Takes a list of tokens and returns a list of those tokens that are not contained 
    in the stop words list.
    """
    return [word for word in series if word not in stop_words]
    
    
def lemmatize(series):   
    """
    Takes a list of tokens and returns a list of the same tokens, transformed using the 
    WordNet Lemmatizer. This lemmatization process converts words into a common root.
    For example, it converts plural words into singular form, or past-tense verbs into their root.
    """
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    lemmatized_words = []
    for word, tag in nltk.pos_tag(series):
        wordnet_tag = pos_to_wordnet_tag(tag)
        lemmatized_words.append(lemmatizer.lemmatize(word, wordnet_tag))
        
    return lemmatized_words


def normalize_text(series, keep_stop_words=False, lemmatization=True):
    """ 
    Takes a pandas Series of raw text and returns a Series holding a list of tokens per document.
    Stop words are removed unless keep_stop_words=True (default=False).
    Lemmatization is applied unless lemmatization=False (default=True).
    """
    newseries = (series.apply(remove_hyphens)
                  .apply(tokenize)
                  .apply(remove_special_chars)
                  .apply(decapitalize)
                  .apply(remove_numbers))
    
    # Repeat removal of numbers after lemmatization. This is discussed below.
    if lemmatization:
        newseries = newseries.apply(lemmatize)
        newseries = newseries.apply(remove_numbers)
    
    if not keep_stop_words:
        newseries = newseries.apply(remove_stop_words)

    return newseries

We now go through applying each transformation function step-by-step.

#### Tokenization

Our first step is to tokenize the words in the text of each article. We will create a new column in our dataframe and call it `tokens`, filling it with the result of applying `remove_hyphens()` and then `tokenize()` to the `content` column.

We define a token using the regular expression `\w+`, which matches a run of one or more alphanumeric characters or underscores, "_".

In [None]:
df_handle['tokens'] = df_handle['content'].apply(remove_hyphens).apply(tokenize)
df_handle.head()

We see that we have successfully split the `content` column into its component words.
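
The effect of removing hyphens before tokenizing can be seen on a made-up sentence. This sketch uses the standard-library `re.findall`, which should behave like the `RegexpTokenizer` above for this pattern; `sketch_tokenize` and the sample text are hypothetical.

```python
import re


def sketch_tokenize(text):
    # Delete hyphens first, then keep every run of word characters.
    return re.findall(r"\w+", text.replace("-", ""))


sample = "Wi-Fi e-mail costs $5, isn't it?"

without_hyphen_step = re.findall(r"\w+", sample)
with_hyphen_step = sketch_tokenize(sample)

print(without_hyphen_step)  # "Wi-Fi" and "e-mail" are split in two
print(with_hyphen_step)     # "WiFi" and "email" stay whole
```

Note that the apostrophe in "isn't" is also a token boundary, so the contraction is split into "isn" and "t". One consequence, assuming the same behaviour in NLTK, is that stop-word entries containing an apostrophe can never match a token produced this way.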

#### Remove Punctuation and Special Characters

Because `\w+` matches only letters, digits and underscores, tokenization has already discarded punctuation and most special characters. The one special character that can survive is the underscore, so we strip it from every token here. We also remove any tokens that are just empty strings.

In [None]:
df_handle['tokens'] = df_handle['tokens'].apply(remove_special_chars)
df_handle['tokens'].head()
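
A short sketch of why underscores need their own step, using a made-up string and the standard-library `re` module in place of the NLTK tokenizer. It also shows how a token made only of underscores turns into an empty string that has to be filtered out.

```python
import re

# \w+ treats underscores as word characters, so they survive tokenization.
tokens = re.findall(r"\w+", "snake_case __init__ _ plain")

# Stripping underscores can leave an empty string behind.
stripped = [token.replace("_", "") for token in tokens]

# Dropping empty strings keeps only real tokens.
non_empty = [token for token in stripped if token]

print(tokens)
print(stripped)
print(non_empty)
```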

#### Decapitalization

Our data is still not fully normalized though. We want all tokens to have the same form, but some of the tokens are uppercase, some lowercase, and some a mixture of both. To ensure consistency in our data, we are going to convert every token into its lowercase form.

In [None]:
df_handle['tokens'] = df_handle['tokens'].apply(decapitalize)
df_handle['tokens'].head()

#### Remove Numbers

We can see that some of our tokens are just numeric strings, which we want to remove.

Note that we will not remove tokens that contain both letters and numbers. For example, "3g" is considered a word, and it would lose its meaning if it were changed to just "g".

In [None]:
df_handle['tokens'] = df_handle['tokens'].apply(remove_numbers)
df_handle['tokens'].head()
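
The filter relies on `str.isnumeric()`, which is true only when every character in the token is numeric. The sketch below applies the same test to a hypothetical list of candidates to show which ones survive.

```python
candidates = ["3", "2019", "3g", "three", "½", "1.5", "-7"]

# Same test as remove_numbers(): drop tokens made up solely of numeric characters.
kept = [token for token in candidates if not token.isnumeric()]
dropped = [token for token in candidates if token.isnumeric()]

print(kept)
print(dropped)
```

Strings such as "1.5" and "-7" are not numeric by this definition because of the "." and "-" characters. In this notebook they would not reach the filter in that form anyway, since the `\w+` tokenizer has already split on those characters.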

#### Lemmatization

Consider the second document. We have tokens like "games" and "pixels" - they are plural nouns. However, if we want to count the occurrences of these tokens, they should not be considered as different words to their singular counterparts. 

In [None]:
second_article = df_handle.tokens.iloc[1].copy()

In [None]:
count_games = second_article.count("games")
count_game = second_article.count("game")

print("Count of 'games' = {}".format(count_games))
print("Count of 'game'  = {}".format(count_game))

We perform an operation known as *lemmatization*, which takes a word and reduces it to its base form, or lemma. For example, it will convert plural nouns to singular nouns, or convert past-tense verbs to their base form.

In [None]:
df_handle['tokens'] = df_handle['tokens'].apply(lemmatize)
df_handle['tokens'].head()

Let's look at some of the lemmatizer results.

In [None]:
" ".join(df_handle.tokens.iloc[1])

We look to see if this has amalgamated the words 'game' and 'games':

In [None]:
count_game = df_handle['tokens'].iloc[1].count("game")
print("Count of 'game'  = {}".format(count_game))

It has, so our lemmatization has been successful.

However, we notice that numbers have reappeared in our token set. Our `remove_numbers()` function did not remove a token like "20s" because it is not entirely numeric, but the lemmatizer then reduced it to "20".

In [None]:
df_handle['tokens'].iloc[1].count("20")

In [None]:
lemmatize(['20s'])

To combat this, we are going to rerun our remove_numbers function after having applied lemmatization.

In [None]:
df_handle['tokens'] = df_handle['tokens'].apply(remove_numbers)

df_handle['tokens'].iloc[1].count("20")
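
The need for a second pass can be reproduced without NLTK. In this sketch `toy_lemmatize` is a deliberately crude stand-in for the WordNet lemmatizer (it only strips one trailing "s"), and `drop_numeric` mirrors `remove_numbers()`. Both names and the token list are hypothetical.

```python
def toy_lemmatize(tokens):
    # Toy stand-in for the WordNet lemmatizer: strip a single trailing "s".
    return [t[:-1] if t.endswith("s") and len(t) > 1 else t for t in tokens]


def drop_numeric(tokens):
    # Same idea as remove_numbers(): drop purely numeric tokens.
    return [t for t in tokens if not t.isnumeric()]


tokens = ["20s", "games", "3g", "1990"]

first_pass = drop_numeric(tokens)        # "20s" survives, "1990" is dropped
lemmatized = toy_lemmatize(first_pass)   # "20s" becomes "20"
second_pass = drop_numeric(lemmatized)   # "20" is dropped this time

print(first_pass)
print(lemmatized)
print(second_pass)
```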

#### Remove Stop Words

We have now cleaned up our tokens so that they only contain real words in lemmatized form. However, some words are not as valuable as others. For example, if we want to find the most common word in an article, it is likely to be a word like 'a' or 'the'. These words are called *stop words*. They do not give us valuable information because they are so prevalent in the English language, so we remove them from our list of tokens.

In [None]:
df_handle['tokens'] = df_handle['tokens'].apply(remove_stop_words)
df_handle['tokens'].head()
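
As a possible refinement rather than a change to the cell above, stop-word lookup can be done against a set instead of a list. Membership tests on a set take roughly constant time on average, whereas a list is scanned element by element, which may matter with a stop list of several hundred entries. The tiny stop list and the name `remove_stop_words_fast` are hypothetical.

```python
# A tiny hypothetical stop list; the real one is built by get_stop_words().
stop_word_list = ["a", "the", "of", "and", "is"]

# Converting once to a frozenset also removes duplicate entries.
stop_word_set = frozenset(stop_word_list)


def remove_stop_words_fast(tokens, stop_set=stop_word_set):
    # Keep only tokens that are not in the stop-word set.
    return [token for token in tokens if token not in stop_set]


tokens = ["the", "price", "of", "a", "phone", "is", "rising"]
filtered = remove_stop_words_fast(tokens)
print(filtered)
```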

### Tests
#### Test Cleaned Data

It is very important that we run tests to ensure our data is as we expect it to be. Failing to carry out tests can lead to problems in analysis further on.

We define and run a test now to ensure that our data has been tokenized successfully and those tokens are cleaned and normalized as we expect.

In [None]:
def columnIsClean(series):
    return all([tokensAreClean(tokens) for tokens in series])
   

def tokensAreClean(tokens):    
    if any([token != token.lower() for token in tokens]):
        print("Some tokens are not lowercase")
        return False
    
    elif any([not token.isalnum() for token in tokens]):
        print("Some tokens contain special (non-alphanumeric) characters")
        return False
    
    elif any([token.isnumeric() for token in tokens]):
        print("Some tokens are numeric")
        return False
    
    elif any(token in stop_words for token in tokens):
        print("Some stop words were not removed")
        return False
    
    else:
        # Data has been cleaned successfully.
        return True

In [None]:
data_cleaned_successfully = columnIsClean(df_handle['tokens'])

# Throw an error if test fails.
assert(data_cleaned_successfully)

if data_cleaned_successfully:
    print("Data Cleaned and Tokenized Successfully!")
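
When a check like this fails, it can help to see which tokens caused the failure rather than only a True/False result. The following is a sketch of one way to do that. `find_dirty_tokens` is a hypothetical helper, not part of the notebook, and it applies the same four kinds of check described above to a made-up token list.

```python
def find_dirty_tokens(tokens, stop_words):
    """Return a dict mapping each failed check to the offending tokens."""
    problems = {
        "not_lowercase": [t for t in tokens if t != t.lower()],
        "not_alphanumeric": [t for t in tokens if not t.isalnum()],
        "numeric": [t for t in tokens if t.isnumeric()],
        "stop_word": [t for t in tokens if t in stop_words],
    }
    # Report only the checks that actually found something.
    return {name: found for name, found in problems.items() if found}


report = find_dirty_tokens(["phone", "Game", "wi_fi", "20", "the"], {"the", "a"})
clean_report = find_dirty_tokens(["phone", "game"], {"the", "a"})

print(report)
print(clean_report)
```

An empty dictionary means that the token list passed every check.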

#### Do Tokenization in One Go

Now that we have shown that our data has been successfully cleaned, we also need to test that our function to do the normalization in one go (which we will use later) is equivalent to what we have done step by step. 

In [None]:
oldtokens = df_handle['tokens'].copy()

df_handle['tokens'] = normalize_text(df_handle['content'])

series_equal = oldtokens.equals(df_handle['tokens'])

# Throw an error if not equal.
assert(series_equal)

if series_equal:
    print("The one-by-one and all-at-once methods are equal!")
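
The idea behind this equivalence test can be sketched in plain Python: applying each step to every document in turn should give the same result as running the whole chain on each document. The steps below are simplified stand-ins for the transformation functions (whitespace splitting instead of the regex tokenizer, no lemmatization), and `run_all` is a hypothetical name.

```python
from functools import reduce

# Simplified stand-ins for the transformation functions, in pipeline order.
steps = [
    lambda text: text.replace("-", ""),                       # remove hyphens
    str.split,                                                # naive tokenizer
    lambda tokens: [t.lower() for t in tokens],               # decapitalize
    lambda tokens: [t for t in tokens if not t.isnumeric()],  # remove numbers
]


def run_all(text, steps=steps):
    # Feed the output of each step into the next one.
    return reduce(lambda value, step: step(value), steps, text)


docs = ["Wi-Fi in 2019", "The E-mail Era"]

# Step by step: apply one step to every document before moving on.
step_by_step = docs
for step in steps:
    step_by_step = [step(doc) for doc in step_by_step]

# All at once: run the whole chain on each document.
all_at_once = [run_all(doc) for doc in docs]

assert step_by_step == all_at_once
print(all_at_once)
```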

<a id='token_analy'></a>

### 1.1 - Token Analysis

<a id='comp_methods'></a>

## 2.0 - Comparing Preprocessing Methods

<a id='best_model'></a>

### 2.1 - Highest Accuracy Produced Using a Simple Model

<a id='vect_analy'></a>

### 2.2 - Vector Data Analysis

<a id='initial_models'></a>

## 3.0 - Initial Model Building

<a id='top3'></a>

### 3.1 - Selecting the Top 3 Models

<a id='m1'></a>

## 4.0 - Model #1

<a id='m1_discuss'></a>

### 4.1 - Discussion

<a id='m1_train'></a>

### 4.2 - Training

<a id='m1_optimize'></a>

### 4.3 - Hyperparameter Optimization

<a id='m1_eval'></a>

### 4.4 - Evaluation

<a id='m2'></a>

## 5.0 - Model #2

<a id='m2_discuss'></a>

### 5.1 - Discussion

<a id='m2_train'></a>

### 5.2 - Training

<a id='m2_optimize'></a>

### 5.3 - Hyperparameter Optimization

<a id='m2_eval'></a>

### 5.4 - Evaluation

<a id='m3'></a>

## 6.0 - Model #3

<a id='m3_discuss'></a>

### 6.1 - Discussion

<a id='m3_train'></a>

### 6.2 - Training

<a id='m3_optimize'></a>

### 6.3 - Hyperparameter Optimization

<a id='m3_eval'></a>

### 6.4 - Evaluation

<a id='ensemble'></a>

## 7.0 - Ensembling

<a id='apply_final'></a>

## 8.0 - Applying the Final Model to the Test Set