## Neural Network Pipeline: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and is available as a public Kaggle data set. 

Our goal is to apply the machine learning techniques covered in class to build high-quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. Deep learning (TensorFlow)
2. Use stacking/ensembling methods to improve prediction metrics (K-Means, anomaly detection)
3. Use unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [12]:
%matplotlib inline
import numpy as np
import pandas as pd
import string
import time
import os.path
import pickle

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import label_binarize


from sklearn import metrics

from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

#NLTK imports

import nltk
from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk.tokenize import punkt as punkt
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

# These imports enable the use of NLTKPreprocessor in an sklearn Pipeline
from sklearn.base import BaseEstimator, TransformerMixin


#scipy imports
from scipy.sparse import hstack
from scipy.stats import ttest_ind

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

#General imports
import pprint

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [13]:
# read frames locally from csv
train_df = pd.read_csv("../data/new_train.csv")
test_df = pd.read_csv("../data/new_test.csv")

np.random.seed(455)

# Random boolean mask (about 70% True) for an optional train/dev split of the training data
# Note: np.random.seed(455) above makes the mask reproducible when the whole cell is rerun.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#Split up data
train_data, train_labels = train_df["comment_text"], train_df[target_names]
test_data, test_labels = test_df["comment_text"], test_df[target_names]

print 'total training observations:', train_df.shape[0]
print 'training data shape:', train_data.shape
print 'training label shape:', train_labels.shape

print 'test data shape:', test_data.shape
print 'test labels shape:', test_labels.shape
print 'labels names:', target_names

total training observations: 111828
training data shape: (111828L,)
training label shape: (111828, 6)
test data shape: (47743L,)
test labels shape: (47743, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Text Processing

The following cell contains the class-based text processing methods and functions that make up the final ETL pipeline for this case study.  

This is the global text processing step for the final submission.  The processed data sets are saved to the data folder prior to training so that other models can reuse them, which allows a more apples-to-apples comparison. 
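
As a rough illustration of how a tokenizing transformer and a vectorizer fit together in an sklearn Pipeline, here is a minimal Python 3 sketch. `WhitespaceTokenizer` is a hypothetical stand-in for the NLTKPreprocessor defined below (it only lowercases and splits on whitespace), and the three documents are made up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceTokenizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for NLTKPreprocessor: lowercases and splits on whitespace."""

    def fit(self, X, y=None):
        # Nothing to learn: tokenization does not depend on the training data.
        return self

    def transform(self, X):
        return [doc.lower().split() for doc in X]


def identity(tokens):
    """Pass an already-tokenized document straight through the vectorizer."""
    return tokens


pipeline = Pipeline([
    ("tokenize", WhitespaceTokenizer()),
    # tokenizer=identity and lowercase=False keep the vectorizer from re-tokenizing.
    ("tfidf", TfidfVectorizer(tokenizer=identity, lowercase=False)),
])

docs = ["You are great", "you are awful", "Great work"]  # made-up examples
matrix = pipeline.fit_transform(docs)
vocabulary = sorted(pipeline.named_steps["tfidf"].vocabulary_)
print(matrix.shape, vocabulary)
```

Because the transformer exposes `fit` and `transform`, the pipeline can treat it like any other step; scikit-learn may warn that `token_pattern` is unused when a custom tokenizer is supplied.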

In [14]:
nltk.download('stopwords')

class NLTKPreprocessor(BaseEstimator, TransformerMixin):
    """Text preprocessor using NLTK tokenization and Lemmatization

    This class is to be used in an sklearn Pipeline, prior to other processors like PCA/LSA/classification
    Attributes:
        lower: A boolean indicating whether text should be lowercased by preprocessor
                default: True
        strip: A boolean indicating whether text should be stripped of surrounding whitespace, underscores and '*'
                default: True
        stopwords: A set of words to be used as stop words and thus ignored during tokenization
                default: built-in English stop words
        punct: A set of punctuation characters that should be ignored
                default: all characters in string.punctuation
        lemmatizer: An object that should be used to lemmatize tokens
    """

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        """Initialize method for NLTKPreprocessor instance

        Simple initialization of specified instance variables:

        Args:
            self 
            stopwords: set of words to ignore as stop words, or a default set for English will be used
            punct: set of punctuation characters to strip, or a default set will be used
            lower: indicator of whether to convert all characters to lowercase, defaults to True
            strip: indicator of whether to strip whitespace, defaults to True

        Returns:
            N/A: instance initializer
        """
        
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()
        

    def fit(self, X, y=None):
        """Fit model with X and optional y

        This function does nothing but return self, since as a processor in the sklearn Pipeline this preprocessor
        has nothing analogous to "fit" logic. The tokenization logic is independent of specific dataset training, 
        and is fully realized in the transform() function. 
        This function exists as implementation of sklearn.BaseEstimator, for use in Pipeline.

        Args:
            self 
            X (array-like): independent variable
            y (array-like): dependent variable
            
        Returns:
            NLTKPreprocessor: self
        """
        return self

    def inverse_transform(self, X):
        """Function exists as implementation of sklearn.BaseEstimator, for use in Pipeline.
        This is simply for complying with interface.

        Args:
            self 
            X (array-like): input documents
            
        Returns:
            string: joined documents
        """
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        """Transform input X to produce output to be processed by next element in sklearn Pipeline

        This triggers the tokenization/lemmatization of the source documents.
        This is invoked by the sklearn Pipeline.

        Args:
            self 
            X: input documents to be tokenized
            
        Returns:
            list: tokenized documents reduced to simplest lemma form
        """
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    
    def tokenize(self, document):
        """Tokenize an input document, converting from a block of text into sentences, into tagged tokens,
        generating a set of lemmas.

        This method does the preprocessing work of sentence-based tokenization and then reduces words to lemmas

        Args:
            self 
            document (str): raw text of a single document
            
        Returns:
            Iterator[str]: an iterator over the tokens produced from the input documents
        """
        # Break the document into sentences. This is necessary for part-of-speech tagging.
        for sent in sent_tokenize(unicode(document,'utf-8')):

            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                yield lemma

                
    def lemmatize(self, token, tag):
        """Convert a token into the appropriate lemma

        Method uses the NLTK WordNetLemmatizer for part-of-speech tag-based lemmatization of words.

        Args:
            self 
            token: input word
            tag: part-of-speech tag
            
        Returns:
            string: lemma
        """
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

def identity(arg):
    """ Simple identity function works as a passthrough.

        This function will be used with the Vectorizer classes, when tokenization will have been performed already.
        In this scenario, the Vectorizer class will call this function in the place of its normal tokenization feature
        and this function will simply return the tokens it is given.
        
        Args:
            arg (list): tokens of one document, as passed in by CountVectorizer or TfidfVectorizer
            
        Returns:
            list: the input tokens, unchanged (they were processed earlier by NLTK)
    """
    return arg

[nltk_data] Downloading package stopwords to C:\Users\Joseph
[nltk_data]     Lee\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Final Text Preprocessing - training data

### Text Preprocessing
This block uses the NLTKPreprocessor to tokenize the input data and then the TfidfVectorizer to vectorize it. The NLTKPreprocessor ignores English stop words and lemmatizes where possible. The vectorizer ignores terms occurring in fewer than 5 documents (`min_df=5`) or in more than 70% of documents (`max_df=.7`), which sufficed to reduce the size of the vocabulary significantly. The vectorizer can also cap the total number of features (unigrams and bigrams) through `max_features`, in which case it keeps the terms with the highest frequency across the corpus; the caps tested below range from 3,000 to 10,000, plus an uncapped run.
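
To make the effect of `min_df` and `max_features` concrete, here is a small Python 3 sketch on a made-up, already-tokenized corpus. The helper `vocabulary_for` is hypothetical and exists only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Pass an already-tokenized document straight through."""
    return tokens


# Toy, already-tokenized corpus (illustrative only, not project data).
docs = [["bad", "idea"], ["bad", "bad", "person"], ["good", "idea"], ["bad", "idea"]]


def vocabulary_for(**kwargs):
    """Fit a TfidfVectorizer with the given options and return its sorted vocabulary."""
    vectorizer = TfidfVectorizer(tokenizer=identity, lowercase=False, **kwargs)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


full_vocab = vocabulary_for()                             # every term is kept
frequent_vocab = vocabulary_for(min_df=2)                 # drops terms seen in only one document
capped_vocab = vocabulary_for(min_df=2, max_features=1)   # keeps the most frequent surviving term
print(full_vocab, frequent_vocab, capped_vocab)
```

Here "good" and "person" each appear in a single document, so `min_df=2` removes them, and the cap then keeps "bad" because it occurs most often across the corpus.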

Note that in this case the tokenization available by default in TfidfVectorizer is disabled, since that is handled by the NLTKPreprocessor. Timing the two steps separately made it clear that tokenization is far more expensive (in time) than vectorization.
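
Since tokenization dominates the run time, the function below caches each intermediate result with pickle. That repeated check-then-load pattern could be factored into a helper along these lines; this is a Python 3 sketch, and `load_or_compute` and `expensive_tokenize` are hypothetical names that do not appear in the notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it first if it is missing."""
    if os.path.exists(path):
        with open(path, "rb") as handle:   # binary mode is required for pickle in Python 3
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_tokenize():
    """Stand-in for a slow preprocessing step; records how often it runs."""
    calls.append(1)
    return [["some", "tokens"], ["more", "tokens"]]


cache_path = os.path.join(tempfile.mkdtemp(), "train_preproc_data.pickle")
first = load_or_compute(cache_path, expensive_tokenize)    # computes and writes the cache
second = load_or_compute(cache_path, expensive_tokenize)   # served from the cache
print(len(calls))
```

One caveat with any such cache: a stale file is reused silently, so the file name should encode every setting that affects the result.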

In [15]:
np.random.seed(455)


In [16]:
def get_preprocessed_vectors(train_data, test_data, max_features=None):
    """Preprocess (tokenize, vectorize) train and test datasets
    
    Datasets will be tokenized using NLTK and vectorized using TfidfVectorizer

    Args:
        train_data (array-like): training data to preprocess
        test_data (array-like): test data to preprocess
        max_features (int): maximum number of features to be produced by process

    Returns:
        tuple: (train_tfidf_counts, test_tfidf_counts), the sparse TF-IDF matrices for the train and test data
    """

    pp = pprint.PrettyPrinter(indent=4)

    # This preprocessor will be used to process data prior to vectorization
    nltkPreprocessor = NLTKPreprocessor()
    
    # Note that this vectorizer is created with a passthru tokenizer(identity), no preprocessor and no lowercasing
    # This is to account for the NLTKPreprocessor already taking care of tokenization.
    # Stop words are already removed by the NLTKPreprocessor, so none are specified here.
    tfidfVector = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=.7, max_features=max_features,
                                  tokenizer=identity, preprocessor=None, lowercase=False, stop_words=None)

    print("\nget_preprocessed_vectors() : Preprocessing input data, generating output of "+ str(max_features) + 
          " maximum features.")

    # Check if there is a serialized copy of the preprocessed training data, and if not then perform text preprocessing and
    # save the serialized result for reuse.
    pickle_file_name = 'train_preproc_data.pickle'
    if (not os.path.exists(pickle_file_name)):
        print "Starting preprocessing of training data..."
        start_train_preproc = time.time()
        nltkPreprocessor.fit(train_data)
        train_preproc_data = nltkPreprocessor.transform(train_data)
        finish_train_preproc = time.time()
        print "Completed tokenization/preprocessing of training data in {:.2f} seconds".format(finish_train_preproc-start_train_preproc)
    
        with open(pickle_file_name,'wb') as pickle_file:
            pickle.dump(train_preproc_data,pickle_file)
    else:
        # If the serialized file already exists, simply load it for the next step of the process.
        with open(pickle_file_name,'rb') as pickle_file:
            train_preproc_data = pickle.load(pickle_file)

    tfidfVector.fit(train_preproc_data)
        
    # Check if there is a serialized copy of the vectorized counts, and if not regenerate the matrix and save the
    # serialized result for reuse.        
    pickle_file_name = 'train_tfidf_counts.'+str(max_features)+'.pickle'
    if (not os.path.exists(pickle_file_name)):
    
        print "Starting vectorization of training data..."
        start_train_vectors = time.time()
        train_tfidf_counts = tfidfVector.transform(train_preproc_data)
        finish_train_vectors = time.time()
        print "Completed vectorization of training data in {:.2f} seconds".format(finish_train_vectors-start_train_vectors)
    
        with open(pickle_file_name,'wb') as pickle_file:
            pickle.dump(train_tfidf_counts,pickle_file)
    else:
        # If the serialized file already exists, simply load it for the next step of the process.
        with open(pickle_file_name,'rb') as pickle_file:
            train_tfidf_counts = pickle.load(pickle_file)
    
    # Check if there is a serialized copy of the preprocessed test data, and if not then perform text preprocessing and
    # save the serialized result for reuse.
    pickle_file_name = 'test_preproc_data.pickle'
    if (not os.path.exists(pickle_file_name)):
        print "\nStarting preprocessing of test data..."
        start_test_preproc = time.time()
        nltkPreprocessor.fit(test_data)
        test_preproc_data = nltkPreprocessor.transform(test_data)
        finish_test_preproc = time.time()
        print "Completed tokenization/preprocessing of test data in {:.2f} seconds".format(finish_test_preproc-start_test_preproc)

        with open(pickle_file_name,'wb') as pickle_file:
            pickle.dump(test_preproc_data,pickle_file)
    else:
        # If the serialized file already exists, simply load it for the next step of the process.
        with open(pickle_file_name,'rb') as pickle_file:
            test_preproc_data = pickle.load(pickle_file)
    
    pickle_file_name = 'test_tfidf_counts.'+str(max_features)+'.pickle'
    if (not os.path.exists(pickle_file_name)):

        print "Starting vectorization of test data..."
        start_test_vectors = time.time()
        test_tfidf_counts = tfidfVector.transform(test_preproc_data)
        finish_test_vectors = time.time()
        print "Completed vectorization of test data in {:.2f} seconds".format(finish_test_vectors-start_test_vectors)
        
        print("\nVocabulary (tfidf) size is: {}".format(len(tfidfVector.vocabulary_)))
        # vocabulary_ maps each term to its column index in the TF-IDF matrix
        vocab_entries = pd.Series(tfidfVector.vocabulary_).to_frame()
        vocab_entries.columns = ['index']
        vocab_entries = vocab_entries.sort_values(by='index')

        print("Sample vocabulary from TfidfVectorizer:")
        pp.pprint(vocab_entries.head(10))
        print("...")
        pp.pprint(vocab_entries.tail(10))
        print("Number of nonzero entries in matrix: {}".format(train_tfidf_counts.nnz))

        with open(pickle_file_name,'wb') as pickle_file:
            pickle.dump(test_tfidf_counts,pickle_file)
    else:
        # If the serialized file already exists, simply load it for the next step of the process.
        with open(pickle_file_name,'rb') as pickle_file:
            test_tfidf_counts = pickle.load(pickle_file)

    # Print a sample of row-wise label sums; we can see that an observation can have multiple classes.
    count_df = train_labels.sum(axis=1).to_frame("counts")
    count_df = count_df[count_df["counts"] >= 1]
    print(count_df.head(10))
    
    return train_tfidf_counts, test_tfidf_counts


### Final MLPClassifier Training and Submission

### Text Classification with Neural Net (sklearn.MLPClassifier)
When choosing a neural net architecture for text classification, the output layer should have as many nodes as there are classification labels. In this case there are 6 labels, so both the output layer and the final hidden layer have 6 nodes. The input layer normally has one node per feature, and ideally the size of the initial hidden layer falls between the number of features and the number of classes.
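
The layer layout described above can be sketched as a plain NumPy forward pass (Python 3). This is not how MLPClassifier is implemented internally; it only mirrors the shapes: ReLU hidden layers of 12 and 6 nodes and 6 independent sigmoid outputs, one per label. The feature count of 20 and the random weights are assumptions chosen to keep the example small.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, weights, biases):
    """Forward pass: ReLU hidden layers, then one independent sigmoid output per label."""
    activation = X
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)
    return sigmoid(activation @ weights[-1] + biases[-1])


rng = np.random.default_rng(0)
n_features = 20                       # assumed small for illustration
sizes = [n_features, 12, 6, 6]        # input, hidden (12, 6), 6 output labels
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

X = rng.normal(size=(4, n_features))  # 4 made-up observations
probs = forward(X, weights, biases)
print(probs.shape)
```

Each row of `probs` holds six separate probabilities that need not sum to 1, which is what allows a comment to carry several labels at once.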

In this case, we're limiting our feature set to 5,000 principal components, and running this process on a MacBook made it impractical to use an initial hidden layer anywhere near that size. Setting the initial hidden layer to 12 nodes at least keeps it below the number of features and above the number of output classes. This (12,6) model is the one that ended up producing the best (most accurate) results.
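
A quick parameter count shows why a wide first hidden layer is expensive. The sketch below (Python 3) counts weights and biases for fully connected layers; `count_mlp_parameters` is a hypothetical helper, and the 2,500-node layer is an arbitrary comparison point rather than a configuration that was actually run.

```python
def count_mlp_parameters(n_features, hidden_layer_sizes, n_outputs):
    """Count weights plus biases for a fully connected network."""
    sizes = [n_features] + list(hidden_layer_sizes) + [n_outputs]
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


small = count_mlp_parameters(5000, (12, 6), 6)     # the (12,6) architecture
wide = count_mlp_parameters(5000, (2500, 6), 6)    # hypothetical wide first layer
print(small, wide)
```

With 5,000 inputs the (12,6) network has roughly 60 thousand parameters, while the hypothetical wide variant has about 12.5 million.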

Note that, as a nod toward deeper learning, a (10,8,6) model was also tested, but it ended up demonstrating overfitting, with a significantly higher accuracy score on test data than on dev data.

In [17]:
def getClassifier():
    """Get an MLPClassifier configured as per the results of best parameter analysis
    This classifier will be used by testing logic in this notebook.

    Args:
        N/A 
            
    Returns:
        MLPClassifier: configured classifier for testing
    """
    return MLPClassifier(hidden_layer_sizes=(12,6), solver='adam', early_stopping=False, activation='relu',
                         tol=1e-13, alpha=1, learning_rate='adaptive', learning_rate_init=0.01 )

In [18]:
def get_roc_auc(classifier, train_roc_data, train_roc_labels, test_roc_data, test_roc_labels):
    """Get ROC AUC scores for multi-label data by binarizing labels

    Args:
        classifier (sklearn classifier): model for which to report ROC AUC scores
        train_roc_data (array-like): array of training data
        train_roc_labels (array-like): array of labels associated with training data
        test_roc_data (array-like): array of test data
        test_roc_labels (array-like): array of labels associated with test data
            
    Returns:
        list: per-label ROC AUC scores
    """
    
    # Fitting again with binarized labels and predicting again to support per-label roc_auc scores
    binarized_train_labels = label_binarize(train_roc_labels, classes=[0, 1, 2, 3, 4, 5])
    binarized_test_labels = label_binarize(test_roc_labels, classes=[0, 1, 2, 3, 4, 5])
    
    print "\nget_roc_auc(): train_roc_data size {}, test_roc_data size {}".format(train_roc_data.shape,
                                                                                      test_roc_data.shape)

    # While this is multilabel data, metrics.roc_curve scores one binary label at a time.
    # So, instead the model will be re-trained with binarized training labels and the predicted probabilities used
    # to derive ROC AUC for each label. This is mainly for comparison with other models, since these numbers won't
    # be directly related to the multilabel classification.
    print("Re-fitting and scoring for per-label roc_auc scores...")
    y_score = classifier.fit(train_roc_data, binarized_train_labels).predict_proba(test_roc_data)
    fpr = dict()
    tpr = dict()
    roc_auc = []
    for ind, label in enumerate(target_names):
        fpr[ind], tpr[ind], _ = metrics.roc_curve(binarized_test_labels[:, ind], y_score[:, ind])
        roc_auc.append(metrics.auc(fpr[ind], tpr[ind]))

    print "per-label ROC AUC scores: ", roc_auc
    return roc_auc



In [19]:
def test_overfitting_MLP(classifier, alpha=.1):
    """Evaluate a classifier and return the significance of the difference between limited scoring and full scoring
    This significance is calculated using an independent two-sample t-test (scipy.stats.ttest_ind)

    Args:
        classifier (sklearn classifier): Pre-configured classifier to be evaluated for over/underfitting.
        alpha (float): error value as threshold for significance
            
    Returns:
        t-test result: object returned by ttest_ind, holding the t statistic (statistic) and the p-value (pvalue)
    """    
    
    # First hold out 30% of the input data and keep the remaining 70% for limited focus
    X_limited, X_ignore, y_limited, y_ignore = train_test_split(train_tfidf_counts, train_labels, 
                                                                test_size=0.3, random_state=455)
    print "test_overfitting_MLP(): Shape of focused data and remaining data: ", X_limited.shape, X_ignore.shape
    
    # Then, take a 80%/20% train/test split of the limited focus data
    X_train, X_test, y_train, y_test = train_test_split(X_limited, y_limited, test_size=0.2, random_state=455)
    print "Shape of limited train and test data for focused testing: ",X_train.shape, y_train.shape, X_test.shape
    
    # Perform cross-validation with the focused train data subset
    train_scores = cross_val_score(classifier, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=10)
    

    # deal with any nans and negative values, get mean for simple comparison
    train_scores = np.nan_to_num(train_scores)
    train_scores = np.absolute(train_scores)
    mean_train_score = np.mean(train_scores)
    
    print "train mean squared error score: {:.2f}".format(mean_train_score)

    # Perform cross-validation with the full post-preprocessed train_data
    test_scores = cross_val_score(classifier, train_tfidf_counts, train_labels,
                             scoring="neg_mean_squared_error", cv=10)    
    
    # deal with any nans and negative values, get mean for simple comparison
    test_scores = np.nan_to_num(test_scores)
    test_scores = np.absolute(test_scores)
    mean_test_score = np.mean(test_scores)
    
    print "test mean squared error score: {:.2f}".format(mean_test_score)
    
    if mean_test_score < mean_train_score:
        error_case = "underfitting"
    else:
        error_case = "overfitting"
    
    t_stat = ttest_ind(train_scores, test_scores)
    print "t-test result for the two sets of per-fold scores: ", t_stat
    
    
    if t_stat.pvalue < alpha:
        print "p-value {:.2f} <= alpha {:.2f}, potentially indicating {}.".format(t_stat.pvalue, alpha, error_case)
            
    return t_stat
    


In [20]:
def generate_MLP_scores(train_input_data, train_input_labels, test_input_data, test_input_labels):
    """Fit the MLPClassifier, score it against the test data and save scores and predictions as CSV files."""

    print "\ngenerate_MLP_scores()"
    # This MLPClassifier will be fit using training data and subsequently used to predict labels using test
    # data, for scoring.
    classifier = getClassifier()

    # Fit using the training data and time the process for reference
    full_train_start = time.time()
    classifier.fit(train_input_data, train_input_labels)
    full_train_stop = time.time()

    duration = (full_train_stop-full_train_start)/60
    print('Fitting train data completed, after {:.2f} minutes.'.format(duration))

    # Generate predictions using the test TFIDF data and collect a series of scores
    test_pred = classifier.predict(test_input_data)
    acc_score = metrics.accuracy_score(test_input_labels, test_pred)

    # Note that, since this is multilabel data, average=None is used so that precision, recall and F-score are
    # reported separately for each label.
    precision_recall_fscore = metrics.precision_recall_fscore_support(test_input_labels, test_pred, average=None)
    precision = metrics.precision_score(test_input_labels, test_pred, average=None)
    recall = metrics.recall_score(test_input_labels, test_pred, average=None)

    # Prediction probabilities will be saved for comparison with other models and processing by ensembles
    predict_probs = classifier.predict_proba(test_input_data)

    print("Accuracy score from test predict: {}".format(acc_score))
    print("Precision score from test predict: {}".format(precision))
    print("Recall score from test predict: {}".format(recall))

    roc_auc = get_roc_auc(classifier, train_input_data, train_input_labels, test_input_data, test_input_labels)

    print "ROC AUC score from test predict: ", roc_auc

    # In order to Save the complete collection of scores, a pandas.DataFrame will be created and used to create
    # "scoring.csv".
    scoring_arr = np.asarray(precision_recall_fscore)
    scoring_arr = np.vstack([scoring_arr,roc_auc])
    scoring_submission = pd.DataFrame(data=scoring_arr, columns=target_names, index=['precision', 'recall', 
                                                                                     'fbeta_score', 'support',
                                                                                     'roc_auc'])
    print("Precision, recall, fbeta_score, support and ROC AUC:")
    print(scoring_submission)
    scoring_submission.to_csv("../data/NN.scoring."+str(max_features)+".csv")

    # The predicted probabilities from the initial version of the model will be saved in CSV file "submission.csv"
    prediction_submission = pd.DataFrame(data=predict_probs,columns=target_names)
    print(prediction_submission[0:10]) # print frame output 
    prediction_submission.to_csv("../data/NN.submission."+str(max_features)+".csv")


### Final LSA Feature Selection - training data

### PCA/LSA
Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA) both use Singular Value Decomposition to reduce the dimensionality of a dataset. PCA is applied to a term-covariance matrix, whereas LSA is applied to a term-document matrix. As such, LSA (available in scikit-learn as TruncatedSVD) is appropriate for machine learning algorithms using the scikit-learn TfidfVectorizer. Additionally, PCA as implemented in scikit-learn at the time of this project could not handle the sparse matrices that are produced by such vectorization tools.
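
The mechanics of LSA can be sketched with a plain NumPy SVD (Python 3). The tiny dense matrices below are made up; in the notebook the same idea is applied to large sparse TF-IDF matrices through TruncatedSVD. The point of the sketch is that the projection is learned from the training matrix only and then reused unchanged for test rows.

```python
import numpy as np

# Toy document-term matrices: rows are documents, columns are terms (illustrative only).
train = np.array([[2.0, 1.0, 0.0, 0.0],
                  [1.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 2.0],
                  [0.0, 0.0, 2.0, 1.0]])
test = np.array([[1.0, 1.0, 0.0, 0.0]])

k = 2
U, S, Vt = np.linalg.svd(train, full_matrices=False)
components = Vt[:k]                     # top-k right singular vectors, in term space

train_reduced = train @ components.T    # same values as U[:, :k] * S[:k]
test_reduced = test @ components.T      # test rows reuse the projection learned on train
print(train_reduced.shape, test_reduced.shape)
```

Re-fitting the decomposition on the test matrix would place test documents in a different coordinate system from the one the classifier was trained on.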

In [21]:
# Set the number of principal components to identify for use in classification processes
target_components = 4000

def get_LSA(target_components, train_data, test_data):
    """Reduce TF-IDF matrices to target_components dimensions using LSA (TruncatedSVD)

    Args:
        target_components (int): number of components to keep
        train_data (sparse matrix): TF-IDF matrix for the training data
        test_data (sparse matrix): TF-IDF matrix for the test data

    Returns:
        tuple: (lsa_train_counts, lsa_test_counts)
    """

    print("")
    # Cache file names are keyed on the global max_features used to build the TF-IDF matrices.
    train_pickle_name = 'lsa_train_counts.'+str(max_features)+'.pickle'
    test_pickle_name = 'lsa_test_counts.'+str(max_features)+'.pickle'

    # If serialized copies of both reduced datasets already exist, simply load them for the next step.
    if os.path.exists(train_pickle_name) and os.path.exists(test_pickle_name):
        with open(train_pickle_name,'rb') as pickle_file:
            lsa_train_counts = pickle.load(pickle_file)
        with open(test_pickle_name,'rb') as pickle_file:
            lsa_test_counts = pickle.load(pickle_file)
        return lsa_train_counts, lsa_test_counts

    # Otherwise fit the SVD on the training data only and save the serialized result for reuse.
    svd = TruncatedSVD(n_components=target_components, algorithm='arpack')
    print("Starting LSA on train counts with {} components...".format(target_components))
    train_start=time.time()
    lsa_train_counts = svd.fit_transform(train_data)
    train_stop=time.time()
    print("Train counts transform took {:.2f} minutes.".format((train_stop-train_start)/60))

    with open(train_pickle_name,'wb') as pickle_file:
        pickle.dump(lsa_train_counts,pickle_file)

    # Project the test data with the components learned from the training data (no re-fitting).
    print("Starting LSA on test counts with {} components...".format(target_components))
    test_start=time.time()
    lsa_test_counts = svd.transform(test_data)
    test_stop=time.time()
    print("Test counts transform took {:.2f} minutes.".format((test_stop-test_start)/60))

    with open(test_pickle_name,'wb') as pickle_file:
        pickle.dump(lsa_test_counts,pickle_file)

    return lsa_train_counts, lsa_test_counts
        
#lsa_train_counts, lsa_test_counts = get_LSA(target_components, train_tfidf_counts, test_tfidf_counts)

**Overfit Testing**

In [None]:
#
# The following logic will iterate through a cycle of preprocessing input data for a specific set of maximum features,
# then test the MLPClassifier with the preprocessed data and produce score and probability output as CSV files in
# ../data. The maximum feature values to be tested will include 3000, 4000, 5000, 6000, 10000, and unlimited (None).
#
# After this loop is complete, a test will be executed to determine whether there is overfitting with the MLPClassifier.

for max_features in [3000, 4000, 5000, 6000, 10000, None]:
    train_tfidf_counts, test_tfidf_counts = get_preprocessed_vectors(train_data, test_data, 
                                                                     max_features=max_features)
    generate_MLP_scores(train_tfidf_counts, train_labels, test_tfidf_counts, test_labels)
    
t_stat = test_overfitting_MLP(getClassifier(), alpha=.1)

print "classifier overfitting t-test result: ", t_stat


    

## Model Conclusion 

Overall, neural networks show significant promise for classifying toxic language.  Both the train and test sets show high average AUC scores that fall in a similar range, so we can assume overfitting is relatively minimal.
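
For reference, per-label AUC can be read as the fraction of (positive, negative) pairs that the model ranks correctly. The Python 3 sketch below computes it that way for each label column; the labels and scores are made up, and `roc_auc` and `per_label_auc` are hypothetical helpers. In scikit-learn, `metrics.roc_auc_score` with `average=None` should give the same per-label values for multilabel indicator input.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def per_label_auc(label_rows, score_rows):
    """Apply roc_auc to each label column of a multilabel problem."""
    n_labels = len(label_rows[0])
    return [roc_auc([row[j] for row in label_rows], [row[j] for row in score_rows])
            for j in range(n_labels)]


# Made-up labels and predicted probabilities for two labels.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.1], [0.3, 0.7]]
aucs = per_label_auc(labels, scores)
print(aucs)
```

The first label is ranked perfectly, while the second is no better than chance, which shows how an average across labels can hide a weak class.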

However, despite the high AUC scores, recall is decent while precision is fairly low in comparison (~0.50 average precision). This means the model produces a high number of false positives and may be struggling with target classes that intersect or are subsets of one another.
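
The link between low precision and false positives follows directly from the definitions. A short Python 3 sketch with made-up counts (not results from this project):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Illustrative counts only: as many false alarms as correct detections.
precision, recall = precision_recall(true_positives=80, false_positives=80, false_negatives=20)
print(precision, recall)
```

A precision of 0.5 means that half of the comments flagged for a label do not actually carry it, even when most of the truly toxic comments are found.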

It is interesting to note that there were signs of overfitting when using LSA methods: the t-test function in the cell above showed significant evidence of overfit.  The compression through LSA may have removed some of the variance that is critical for distinguishing between certain target classes.
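
The comparison behind that t-test can be sketched as follows in Python 3, using `scipy.stats.ttest_ind` as the notebook does. The per-fold error values are made up for illustration and the 0.1 threshold mirrors the `alpha` used above.

```python
from scipy.stats import ttest_ind

# Made-up per-fold mean squared errors for two cross-validation runs.
subset_errors = [0.020, 0.022, 0.019, 0.021, 0.020]
full_errors = [0.030, 0.031, 0.029, 0.032, 0.030]

alpha = 0.1
result = ttest_ind(subset_errors, full_errors)
significant = result.pvalue < alpha

# Comparing a set of scores with itself gives no evidence of a difference.
same = ttest_ind(subset_errors, subset_errors)
print(significant, same.pvalue > alpha)
```

A small p-value only says the two sets of fold scores differ; deciding whether that reflects overfitting still depends on which data each set was computed on.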

Another clear source of overfitting was deepening the architecture with an additional layer, which caused the deviations between train and test scores to grow in magnitude.  The additional layer likely increased the model's fit to noise that is unique to the train set and not seen in the test set. 

The best model that we came up with given the time constraint had a node architecture of (12,6) and used the `adam` solver and the `relu` activation function.  

Rectifiers like `relu` gave a performance lift compared to `sigmoid` activation functions for this case study.   Rectifiers can often speed up training and, due to their one-sided nature, can suit sparse data sets better than sigmoids, which are two-sided and squash every value onto a probability scale.  
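
The one-sided versus two-sided behaviour is easy to see numerically. A minimal NumPy sketch (Python 3) on a handful of made-up pre-activation values:

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


pre_activations = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # illustrative values
relu_out = relu(pre_activations)
sigmoid_out = sigmoid(pre_activations)
relu_zero_fraction = float(np.mean(relu_out == 0.0))
print(relu_out, sigmoid_out, relu_zero_fraction)
```

ReLU returns exact zeros for every non-positive input and passes positive values through unchanged, whereas the sigmoid maps every input to a value strictly between 0 and 1.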

Additionally, n-grams between 1 and 2 seem optimal for this case, while any n > 3 decreases performance. This is interesting because it may mean that future research should consider using character n-grams.  This would explode the feature space, but with deep learning and a convolutional architecture, it may offer additional performance lift.  
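
As a small illustration of what character n-grams look like, here is a Python 3 sketch; `char_ngrams` is a hypothetical helper and the two strings are made up. In scikit-learn, the vectorizers accept `analyzer='char'` for the same purpose.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


plain = char_ngrams("stupid", 3)
obfuscated = char_ngrams("stup1d", 3)   # made-up obfuscated spelling
shared = set(plain) & set(obfuscated)
print(plain, obfuscated, shared)
```

The two spellings are different word tokens but still share some character trigrams, which is one possible reason character features could help; the cost is many more distinct features per comment.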
