## Baseline Submission: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



### Project Description 

Our challenge is to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate.  The toxic language data set is sourced from Wikipedia and available as a public kaggle data set. 

Our goal is to use various machine learning techniques used in class to develop high quality ML models and pipelines.  

1. Exercise and build upon concepts covered in class and test out at least 3 kinds of supervised models:
    a. Regression (LASSO, Logistic)
    b. Trees (RF, XGBoost)
    c. DeepLearning (Tensorflow)
2. Using stacking/ensembling methods for improving prediction metrics (K-Means, anomaly detection)
3. Using unsupervised methods for feature engineering/selection

For the baseline proposal, this file contains a first pass run through from data preprocessing to model evaluation using a regression model pipeline. 

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge




### Data Ingestion

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import time
import os.path
import pickle

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import auc, roc_auc_score, accuracy_score, f1_score

from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 


#NLTK imports
import string

from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk.tokenize import punkt as punkt
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

# These imports enable the use of NLTKPreprocessor in an sklearn Pipeline
from sklearn.base import BaseEstimator, TransformerMixin


#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

#General imports
import pprint

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']



In [11]:
# read frames localy through csv
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

np.random.seed(455)

# Random index generator for splitting training data
# Note: Each rerun of cell will create new splits.
randIndexCut = np.random.rand(len(train_df)) < 0.7

#S plit up data
test_data = test_df["comment_text"]
dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

print 'total training observations:', train_df.shape[0]
print 'training data shape:', train_data.shape

print 'training label shape:', train_labels.shape

print 'dev label shape:', dev_labels.shape

print 'labels names:', target_names

total training observations: 159571
training data shape: (111906,)
training label shape: (111906, 6)
dev label shape: (47665, 6)
labels names: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


### Exploratory Data Analysis

#### Class Imbalance

Let's see how imblanced the label set is in order to have a better understanding with the label quality of the given data set. 

In [4]:
from bokeh.io import push_notebook
from bokeh.plotting import figure, show, output_file, output_notebook

target_counts = dev_labels.apply(np.sum,0)
target_counts

output_notebook()


p = figure(x_range=target_names)
p.vbar(x=target_names, top = target_counts, width=0.9)

show(p)

train_labels.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
6,1,1,1,0,1,0


The data is fairly imbalanced when counting label occurrences. 

Ideas to consider
- Sampling methods
- Custom Cross Validation

### Feature Engineering/Selection (WIP)
....

### Modeling

### Text Processing

In [5]:
import nltk

nltk.download('stopwords')

class NLTKPreprocessor(BaseEstimator, TransformerMixin):
    """Text preprocessor using NLTK tokenization and Lemmatization

    This class is to be used in an sklean Pipeline, prior to other processers like PCA/LSA/classification
    Attributes:
        lower: A boolean indicating whether text should be lowercased by preprocessor
                default: True
        strip: A boolean indicating whether text should be stripped of surrounding whitespace, underscores and '*'
                default: True
        stopwords: A set of words to be used as stop words and thus ignored during tokenization
                default: built-in English stop words
        punct: A set of punctuation characters that should be ignored
                default: None
        lemmatizer: An object that should be used to lemmatize tokens
    """

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):

        # Break the document into sentences
        for sent in sent_tokenize(unicode(document,'utf-8')):

            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                
                # S
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

def identity(arg):
    """
    Simple identity function works as a passthrough.
    """
    return arg

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/burgew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Text Preprocessing
This block uses the NLTKPreprocessor to tokenize the input data and then the TfidfVectorizer to vectorize it. The NLTKPreprocessor will ignore English stop words and will lemmatize where possible. The vectorizer ignores words occuring in fewer than 5 documents, which sufficed to reduce the size of the words vector significantly. Also, the vectorizer will limit the total features (words) to 15000, prioritizing the most valuable ones with highest TF-IDF score.

Note that in this case the tokenization available by default in TfidfVectorizer is disabled, since that is handled by the NLTKPreprocessor. This made it clear that tokenization is by far more expensive (time) than vectorization.

In [6]:
import os, errno

def remove_file(filename):

    try:
        os.remove(filename)
    except OSError as e:
        pass

In [7]:
# Uncomment these these statements to generate new Preprocessing/Vectorization results each time        
# Leaving them commented will result in pickling of these results and reusing them for the next run
# 
#remove_file('train_preproc_data.pickle')
#remove_file('train_tfidf_counts.pickle')
#remove_file('dev_preproc_data.pickle')
#remove_file('dev_tfidf_counts.pickle')


In [8]:
import nltk

pp = pprint.PrettyPrinter(indent=4)

np.random.seed(455)

# This preprocessor will be used to process data prior to vectorization
nltkPreprocessor = NLTKPreprocessor()
    
# Note that this vectorizer is created with a passthru tokenizer(identity), no preprocessor and no lowercasing
# This is to account for the NLTKPreprocessor already taking care of these.
tfidfVector = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=.7, max_features=10000,
                              tokenizer=identity, preprocessor=None, lowercase=False, stop_words={'english'})

pickle_file_name = 'train_preproc_data.pickle'
if (not os.path.exists(pickle_file_name)):
    print "Starting preprocessing of training data..."
    start_train_preproc = time.time()
    nltkPreprocessor.fit(train_data)
    train_preproc_data = nltkPreprocessor.transform(train_data)
    finish_train_preproc = time.time()
    print "Completed tokenization/preprocessing of training data in {:.2f} seconds".format(finish_train_preproc-start_train_preproc)
    
    with open(pickle_file_name,'w') as pickle_file:
        pickle.dump(train_preproc_data,pickle_file)
else:
    with open(pickle_file_name,'r') as pickle_file:
        train_preproc_data = pickle.load(pickle_file)

pickle_file_name = 'train_tfidf_counts.10000.pickle'
if (not os.path.exists(pickle_file_name)):
    
    # Generating new TF-IDF train counts means we need to then re-apply LSA to the results, so remove the LSA results
    #remove_file('lsa_train_counts.pickle')
    
    print "Starting vectorization of training data..."
    start_train_vectors = time.time()
    train_tfidf_counts = tfidfVector.fit_transform(train_preproc_data)
    finish_train_vectors = time.time()
    print "Completed vectorization of training data in {:.2f} seconds".format(finish_train_vectors-start_train_vectors)
    
    with open(pickle_file_name,'w') as pickle_file:
        pickle.dump(train_tfidf_counts,pickle_file)
else:
    with open(pickle_file_name,'r') as pickle_file:
        train_tfidf_counts = pickle.load(pickle_file)
    
pickle_file_name = 'dev_preproc_data.pickle'
if (not os.path.exists(pickle_file_name)):
    print "\nStarting preprocessing of dev data..."
    start_dev_preproc = time.time()
    nltkPreprocessor.fit(dev_data)
    dev_preproc_data = nltkPreprocessor.transform(dev_data)
    finish_dev_preproc = time.time()
    print "Completed tokenization/preprocessing of dev data in {:.2f} seconds".format(finish_dev_preproc-start_dev_preproc)

    with open(pickle_file_name,'w') as pickle_file:
        pickle.dump(dev_preproc_data,pickle_file)
else:
    with open(pickle_file_name,'r') as pickle_file:
        dev_preproc_data = pickle.load(pickle_file)
    
pickle_file_name = 'dev_tfidf_counts.10000.pickle'
if (not os.path.exists(pickle_file_name)):
    
    
    # Generating new TF-IDF dev counts means we need to then re-apply LSA to the results, so remove the LSA results
    #remove_file('lsa_dev_counts.pickle')
    
    print "Starting vectorization of dev data..."
    start_dev_vectors = time.time()
    dev_tfidf_counts = tfidfVector.transform(dev_preproc_data)
    finish_dev_vectors = time.time()
    print "Completed vectorization of dev data in {:.2f} seconds".format(finish_dev_vectors-start_dev_vectors)


    print("\nVocabulary (tfidf) size is: {}").format(len(tfidfVector.vocabulary_))
    vocab_entries = {k: tfidfVector.vocabulary_[k] for k in tfidfVector.vocabulary_.keys()}
    vocab_entries = pd.Series(vocab_entries).to_frame()
    vocab_entries.columns = ['count']
    vocab_entries = vocab_entries.sort_values(by='count')

    print("Sample vocabulary from TfidfVectorizer:")
    print(pp.pprint(vocab_entries.head(10)))
    print("...")
    print(pp.pprint(vocab_entries.tail(10)))
    print("Number of nonzero entries in matrix: {}").format(train_tfidf_counts.nnz)

    with open(pickle_file_name,'w') as pickle_file:
        pickle.dump(dev_tfidf_counts,pickle_file)
else:
    with open(pickle_file_name,'r') as pickle_file:
        dev_tfidf_counts = pickle.load(pickle_file)


# sample column wise sum, we can see that an observation can have multiple classes.
count_df = pd.DataFrame(train_labels.apply(np.sum,1), columns = ["counts"])
count_df = count_df[((count_df["counts"] >= 1))]
count_df.head(10)


Unnamed: 0,counts
6,4
12,1
16,1
42,4
43,3
44,1
51,2
55,4
56,3
58,2


### PCA/LSA
    Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA) are both operations that use Singular Value Decomposition to reduce the dimensionality of a dataset. PCA is applied to a term-covariance matrix, whereas LSA is applied to a term-document matrix. As such, LSA is appropriate for machine learning algorithms using scikit-learn TfidfVectorizer. Additionally PCA, as implemented in scikit-learn, cannot handle the sparse matrices that are produced by such vectorization tools.

In [23]:
# Uncomment these these statements to generate new LSA Feature Reduction results each time        
# Leaving them commented will result in pickling of these results and reusing them for the next run
# 
#remove_file('lsa_train_counts.pickle')
#remove_file('lsa_dev_counts.pickle')


In [9]:
target_components = 5000

pickle_file_name = 'lsa_train_counts.5000.pickle'
if (not os.path.exists(pickle_file_name)):
    svd = TruncatedSVD(n_components=target_components, algorithm='arpack')
    print "Starting LSA on train counts with {} components...".format(target_components)
    train_start=time.time()
    lsa_train_counts = svd.fit_transform(train_tfidf_counts)
    train_stop=time.time()
    print "Train counts transform took {:.2f} minutes.".format((train_stop-train_start)/60)
    
    with open(pickle_file_name,'w') as pickle_file:
        pickle.dump(lsa_train_counts,pickle_file)
else:
    with open(pickle_file_name,'r') as pickle_file:
        lsa_train_counts = pickle.load(pickle_file)
    
pickle_file_name = 'lsa_dev_counts.5000.pickle'
if (not os.path.exists(pickle_file_name)):
    print "Starting LSA on dev counts with {} components...".format(target_components)
    dev_start=time.time()
    lsa_dev_counts = svd.fit_transform(dev_tfidf_counts)
    dev_stop=time.time()
    print "Dev counts transform took {:.2f} minutes.".format((dev_stop-dev_start)/60)
    
    with open(pickle_file_name,'w') as pickle_file:
        pickle.dump(lsa_dev_counts,pickle_file)
else:
    with open(pickle_file_name,'r') as pickle_file:
        lsa_dev_counts = pickle.load(pickle_file)       

### MLPClassifier (Neural Net) - shallow - both Train and Dev

### Text Classification with Neural Net (sklearn.MLPClassifier)
In choosing a neural net model for text classification, the output layer should have the same number of nodes as the number of classification labels. In this case, there are 6 labels and as such not only will the output layer have 6 nodes, but the final hidden layer as well. The input layer will have the same number of nodes as features, normally, and ideally the initial hidden layer will be between that and the number of classes.

In this case, we're limiting our feature set to 15,000 features (words), and it was not possible to use a number of initial hidden layer nodes at all close to that, running this process on a Macbook. So, setting the initial hidden layer to 12 gave at least some benefit of being less than the number of features and greater than the number of output classes. This (12,6) model is the one that ended up producing best (most accurate) results.

Note that, nod toward deeper learning, a (10,8,6) model was also tested, but this ended up demonstrating overfitting, with a signficantly higher accuracy score on test data than on dev data.

In [12]:
import sklearn
from sklearn import preprocessing

       
# These are the parameters and options being optimized
parameters = [{'solver': ['sgd'], 'learning_rate': ['invscaling'], 'momentum': [.9],
               'nesterovs_momentum': [True], 'learning_rate_init': [0.2]},
              {'solver': ['sgd'], 'learning_rate': ['invscaling'], 'momentum': [.9],
               'nesterovs_momentum': [False], 'learning_rate_init': [0.2]},
              {'solver': ['adam'], 'alpha': [1,10], 'tol': [0.0000000000001],
               'hidden_layer_sizes' : [(12,6)],
                #'early_stopping': [True, False]
               'early_stopping': [False],'learning_rate_init': [0.01]}]


#scoring = { 'AUC' : 'roc_auc', 'F1': 'f1_weighted'}
scoring = 'f1_weighted'
    

print("Modelling with MLPClassifier (shallow/wide Net)")
print(sklearn.__version__)

Train = False

if Train:
    # Testing testing/cross-val with shallow/wide Neural Net for both train and dev dataprediction_output = []
    scores_output = []
    full_CV_start = time.time()
    for name in target_names:
        label_CV_start = time.time()

        # This Multi-Layer Perceptron classifier will be setup with hidden layers of 6 and 6 each, with tanh activation
        # Running a 3-way cross-validation for a single label takes between 10 and 20 minutes, dependenging on the machine.
        # The mean AUC for train and dev was 93%.
    
        # Changing the Net to (12,6) hidden layers gave an AUC of 94%. This was likely aided by the LSA that wasn't in place
        # for the earlier 93% test.
    
        # Changing to try (18,6) hidden layers resulted in 93% again, for both Train and Dev
    
        # Changed back to (12,6) for both
    
        classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='relu', learning_rate='adaptive')
        classifier.fit(lsa_train_counts, train_labels[name])
        cv_score = np.mean(cross_val_score(
            classifier, lsa_train_counts, train_labels[name], cv=3, scoring=scoring))
        scores_output.append(cv_score)
        label_CV_finish = time.time()
    print('Train data CV score for class {} is {:.2f}, after {:.2f} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
    full_CV_finish = time.time()
    print("Full shallow/wide Train Neural Net cross-val across all labels with train data took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))

    print("Mean shallow/wide Train ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))
    
    
gridSearch = False
AllTogether = True

if (gridSearch):
    
    print("LSA dev counts shape: ", lsa_dev_counts.shape)
        
    # Create a GridSearchCV pipeline for Tfidf vectorizing and evaluating LR classifier
    # at different param values.
    pipeline = Pipeline([
        ('clf', MLPClassifier(activation='relu', learning_rate='adaptive'))
    ])

    # Create a GridSearchCV with the above defined pipeline
    gsCV = GridSearchCV(pipeline, param_grid=parameters,
                        error_score=0, scoring=scoring)

    prediction_output = []
    scores_output = []
    full_CV_start = time.time()

    for name in target_names:
        label_CV_start = time.time()
        classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='relu', learning_rate='adaptive')
        classifier.fit(lsa_dev_counts, dev_labels[name])
        label_CV_finish = time.time()
        
    cv_score = np.mean(cross_val_score(classifier, lsa_dev_counts, dev_labels[name], cv=3, scoring=scoring))
    scores_output.append(cv_score)

    print('DEV data CV score is {:.2f}, after {:.2f} minutes.'.format(cv_score,(label_CV_finish-label_CV_start)/60))
    full_CV_finish = time.time()
    print("Full Neural Net cross-val across all labels took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))
    print("Mean DEV ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))

elif AllTogether:
    
    prediction_output = []
    scores_output = []
    full_CV_start = time.time()
    classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='relu',
                               learning_rate='adaptive', learning_rate_init=0.01)
    clf = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=scoring)
    clf.fit(lsa_train_counts, train_labels)
    print("Best Train GridSearchCV score: " + str(clf.best_score_))
    print("Best params: " + str(clf.best_params_))
    
#    cv_score = np.mean(cross_val_score(classifier, lsa_train_counts, train_labels, 
#                                       cv=3, scoring=scoring))
#    full_CV_finish = time.time()
#    print("Full shallow/wide Neural Net cross-val across all labels with train data took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))
#    print("Mean shallow/wide train ROC_AUC for MLPClassifier: {:.2f}".format(cv_score))
    
    print("Now testing with train-fit and dev-predict...")
#    classifier = MLPClassifier(nesterovs_momentum=True, learning_rate='invscaling', learning_rate_init=0.2,
#                               solver='sgd', hidden_layer_sizes=(12,6), activation='relu')
#    classifier.fit(lsa_train_counts, train_labels)
    dev_pred = clf.best_estimator_.predict(lsa_dev_counts)
    acc_score = accuracy_score(dev_labels, dev_pred)
    f_one_score_w = f1_score(dev_labels, dev_pred, average='weighted')
    f_one_score_s = f1_score(dev_labels, dev_pred, average='samples')
    print("Accuracy Score from dev predict: {}".format(acc_score))
    print("F1 score (weighted) from dev predict: {}".format(f_one_score_w))
    print("F1 score (samples) from dev predict: {}".format(f_one_score_s))    

#    prediction_output = []
#    scores_output = []
#    full_CV_start = time.time()
#    classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='relu', learning_rate='adaptive')
#    clf = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=scoring)
#    clf.fit(lsa_dev_counts, dev_labels)
#    print("Best Dev GridSearchCV score: " + str(clf.best_score_))
#    print("Best params: " + str(clf.best_params_))
    
#    cv_score = np.mean(cross_val_score(classifier, lsa_dev_counts, dev_labels,
#                                       cv=3, scoring=scoring))
#    full_CV_finish = time.time()
#    print("Full shallow/wide Neural Net cross-val across all labels with dev data took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))
#    print("Mean shallow/wide DEV ROC_AUC for MLPClassifier: {:.2f}".format(cv_score))
    
else:

    prediction_output = []
    scores_output = []
    full_CV_start = time.time()
    for name in target_names:
        label_CV_start = time.time()
        classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='relu', learning_rate='adaptive',
                                   learning_rate_init=0.01, tol=1e-13, alpha=1)
        classifier.fit(lsa_dev_counts, dev_labels[name])
        cv_score = np.mean(cross_val_score(
            classifier, lsa_dev_counts, dev_labels[name], cv=3, scoring=scoring))
        scores_output.append(cv_score)
        label_CV_finish = time.time()
        print('DEV data CV score for class {} is {:.2f}, after {:.2f} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
    full_CV_finish = time.time()
    print("Full shallow/wide Neural Net cross-val across all labels with dev data took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))
    print("Mean shallow/wide DEV ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))



Modelling with MLPClassifier (shallow/wide Net)
0.19.1


  'precision', 'predicted', average, warn_for)


Best Train GridSearchCV score: 0.598915161061
Best params: {'solver': 'adam', 'hidden_layer_sizes': (12, 6), 'early_stopping': False, 'tol': 1e-13, 'alpha': 1, 'learning_rate_init': 0.01}
Now testing with train-fit and dev-predict...
Accuracy Score from dev predict: 0.899087380678
F1 score (weighted) from dev predict: 0.344605470343
F1 score (samples) from dev predict: 0.0178802955148


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### Evaluating spark-sklearn

from sklearn import svm, grid_search, datasets
from spark-sklearn import GridSearchCV

prediction_output = []
scores_output = []
full_CV_start = time.time()
classifier = spark_sklearn.MLPClassifier(hidden_layer_sizes=(12,6), activation='relu', 
                                         learning_rate='adaptive', learning_rate_init=0.01)
clf = spark_sklearn.GridSearchCV(estimator=classifier, param_grid=parameters, scoring=scoring)
clf.fit(lsa_train_counts, train_labels)
print("Best Train GridSearchCV score: " + str(clf.best_score_))
print("Best params: " + str(clf.best_params_))
        
print("Now testing with train-fit and dev-predict...")

dev_pred = clf.best_estimator_.predict(lsa_dev_counts)
acc_score = spark_sklearn.metrics.accuracy_score(dev_labels, dev_pred)
f_one_score_w = spark_sklearn.metrics.f1_score(dev_labels, dev_pred, average='weighted')
f_one_score_s = spark_sklearn.metrics.f1_score(dev_labels, dev_pred, average='samples')
print("Accuracy Score from dev predict: {}".format(acc_score))
print("F1 score (weighted) from dev predict: {}".format(f_one_score_w))
print("F1 score (samples) from dev predict: {}".format(f_one_score_s))  

### MLPClassifier (Neural Net) - shallow - just Train

import time
from sklearn.metrics import auc
# SK-learn libraries for cross validation
#from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split 



print("Training with MLPClassifier (shallow/wide Net)")
classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='tanh', learning_rate='adaptive')

# Training with shallow/wide Neural Net 
for name in target_names:
    label_CV_start = time.time()
    classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(lsa_train_counts, train_labels[name])
    label_CV_finish = time.time()
    print('Train data for class {} completed, after {:.2f} minutes.'.format(name, cv_score,
                                                                            (label_CV_finish-label_CV_start)/60))

### MLPClassifier (Neural Net) - deep

import time
from sklearn.metrics import auc
# SK-learn libraries for cross validation
#from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split 



print("Modelling with MLPClassifier (deep/thinner)")

# Testing testing/cross-val with deep/thinner Neural Net for both train and dev data
prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    label_CV_start = time.time()
    
    classifier = MLPClassifier(hidden_layer_sizes=(10,8,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(lsa_train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, lsa_train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    label_CV_finish = time.time()
    print('Train data CV score for class {} is {:.2f}, after {:.2f} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
full_CV_finish = time.time()
print("Full deep/thin Neural Net Train cross-val across all labels took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))

print("Mean deep/thin Train ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))

prediction_output = []
scores_output = []
full_CV_start = time.time()
for name in target_names:
    label_CV_start = time.time()
    classifier = MLPClassifier(hidden_layer_sizes=(10,8,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(lsa_dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, lsa_dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    label_CV_finish = time.time()
    print('DEV data CV score for class {} is {:.2f}, after {:.2f} minutes.'.format(name, cv_score, 
                                                                                (label_CV_finish-label_CV_start)/60))
full_CV_finish = time.time()
print("Full deep/thin Neural Net DEV cross-val across all labels took {:.2f} minutes.".format((full_CV_finish-full_CV_start)/60))
print("Mean neep/thin DEV ROC_AUC for MLPClassifier: {:.2f}".format(np.mean(scores_output)))


### First Pass Logistic Regression with sag

from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'sag'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))
    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))
        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))


### First Pass Logistic Regression with saga

from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))
    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))
        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))


### Here's the same using tfidf and saga

from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver)
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver) 
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))


### Original counts with saga and L1

from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1')
    classifier.fit(train_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1') 
    classifier.fit(dev_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

### Tfidf with saga and L1

from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

solver = 'saga'

print("Modelling with {} solver".format(solver))
prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1')
    classifier.fit(train_tfidf_counts, train_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, train_tfidf_counts, train_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Training data CV score for class {} is {}'.format(name, cv_score))

    
print("Mean Training ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

prediction_output = []
scores_output = []
for name in target_names:
    classifier = LogisticRegression(solver=solver,penalty='l1') 
    classifier.fit(dev_tfidf_counts, dev_labels[name])
    cv_score = np.mean(cross_val_score(
        classifier, dev_tfidf_counts, dev_labels[name], cv=3, scoring='roc_auc'))
    scores_output.append(cv_score)
    print('Dev data CV score for class {} is {}'.format(name, cv_score))

        
print("Mean Dev ROC_AUC for {} solver: {}").format(solver, np.mean(scores_output))

#### Testing on Dev Data

from sklearn.metrics import auc, roc_curve
from sklearn import metrics

dev_Vector = CountVectorizer(ngram_range=(1,1))
dev_counts = countVector.fit_transform(dev_data)

pred_dt = pd.DataFrame()
scores_dev = []
for name in target_names:
    classifier = LogisticRegression(solver='sag') 
    classifier.fit(dev_counts, dev_labels[name])
    scores_dev.append(cv_score)
    output = classifier.predict(dev_counts)
    fpr, tpr, thresholds = metrics.roc_curve(dev_labels[name], output)
    print('Dev score for class {} is {}'.format(name, metrics.auc(fpr,tpr)))
    pred_dt[name] = classifier.predict_proba(dev_counts)[:, 1]
    
    
print("Mean(dev) ROC_AUC: {}").format(np.mean(scores_dev))

Score on dev set is worse than training set, thus evidence of overfitting and a need for performance improvement.

The target is multi-label since each observation can be classified as multiple fields.  This is an important distinction from multi-class where each prediction can only be one label.  

## Evaluation

count_df
train_labels["toxic"]

### Final Text Preprocessing - training data

import nltk

pp = pprint.PrettyPrinter(indent=4)

np.random.seed(455)

# This preprocessor will be used to process data prior to vectorization
nltkPreprocessor = NLTKPreprocessor()
    
# Note that this vectorizer is created with a passthru tokenizer(identity), no preprocessor and no lowercasing
# This is to account for the NLTKPreprocessor already taking care of these.
tfidfVector = TfidfVectorizer(ngram_range=(1,1), min_df=5, max_features=15000,
                              tokenizer=identity, preprocessor=None, lowercase=False)

print "Starting final preprocessing of training data..."
start_train_preproc = time.time()
trainPreprocData = nltkPreprocessor.fit_transform(train_df["comment_text"])
finish_train_preproc = time.time()
print "Completed tokenization/preprocessing of training data in {:.2f} seconds".format((finish_train_preproc-start_train_preproc))

print "Starting final preprocessing of test data..."
start_test_preproc = time.time()
testPreprocData = nltkPreprocessor.transform(test_df["comment_text"])
finish_test_preproc = time.time()
print "Completed tokenization/preprocessing of test data in {:.2f} seconds".format((finish_test_preproc-start_test_preproc))

print "Starting vectorization of training data..."
start_train_vectors = time.time()
finalTrainCounts = tfidfVector.fit_transform(trainPreprocData)
finish_train_vectors = time.time()
print "Completed vectorization of training data in {:.2f} seconds".format((finish_train_vectors-start_train_vectors))

print "Starting vectorization of test data..."
start_test_vectors = time.time()
finalTestCounts = tfidfVector.transform(testPreprocData)
finish_test_vectors = time.time()
print "Completed vectorization of test data in {:.2f} seconds".format((finish_test_vectors-start_test_vectors))

### Final LSA Feature Selection - training data

target_components = len(tfidfVector.vocabulary_)/10
svd = TruncatedSVD(n_components=target_components, algorithm='arpack')
print "Starting LSA on train counts with {} components...".format(target_components)
train_start=time.time()
lsaTrainCounts = svd.fit_transform(finalTraincounts)
train_stop=time.time()
print "Train counts transform took {:.2f} seconds.".format(train_stop-train_start)

target_components = len(tfidfVector.vocabulary_)/10
svd = TruncatedSVD(n_components=target_components, algorithm='arpack')
print "Starting LSA on test counts with {} components...".format(target_components)
train_start=time.time()
lsaTestCounts = svd.fit_transform(finalTestCounts)
train_stop=time.time()
print "Test counts transform took {:.2f} seconds.".format(train_stop-train_start)

### Final MLPClassifier Training and Submission

import time
from sklearn.metrics import auc
# SK-learn libraries for cross validation
#from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split 

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

print("Training with MLPClassifier (shallow/wide Net)")
classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='tanh', learning_rate='adaptive')

# Training with shallow/wide Neural Net 
for name in target_names:
    
    label_train_start = time.time()
    classifier = MLPClassifier(hidden_layer_sizes=(12,6), activation='tanh', learning_rate='adaptive')
    classifier.fit(lsaTrainCounts, train_df[name])
    label_train_finish = time.time()
    print('Training for class {} completed, after {:.2f} minutes.'.format(name, 
                                                                          (label_train_finish-label_train_start)/60))
    label_predict_start = time.time()
    prediction_submission[name] = classifier.predict_proba(lsaTestCounts)[:, 1]
    label_predict_finish = time.time()
    print('Prediction for class {} completed, after {:.2f} minutes.'.format(name,
                                                                    (label_predict_finish-label_predict_start)/60))


    

    
print(prediction_submission.head(10)) # print frame output 
prediction_submission.to_csv("submission.csv")

### Submission - based on test preprocessing, LSA feature selection and MLPClassifier training

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repetitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df[name])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
prediction_submission.to_csv("submission.csv")

### Submission

from sklearn.metrics import auc
# SK-learn libraries for cross validation
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split 
# Basic Logistic Regression Model/MultiLabel Edition

prediction_submission = pd.DataFrame()
prediction_submission["id"] = test_df["id"]

# new vector object for all train data for submission
finalTrainVector = CountVectorizer()
finalTrainCount = finalTrainVector.fit_transform(train_df["comment_text"])

# TODO: Using pipelines can clean up repetitive processes
# test set up
#testVector = CountVectorizer()
testCount = finalTrainVector.transform(test_df["comment_text"])

for name in target_names:
    classifier = LogisticRegression(solver='sag') #sag is one kind of solver optimize for multi-label
    clf = classifier.fit(finalTrainCount, train_df[name])
    prediction_submission[name] = clf.predict_proba(testCount)[:, 1]
    #print(prediction_submission)

    
print(prediction_submission.head(10)) # print frame output 
prediction_submission.to_csv("submission.csv")

The frame contains the output for each class and is saved in a pandas data frame.  