# Naive Bayes Exploration: Toxic Language Classification 
**w207 Spring 2018 - Final Project Baseline**

**Team: Paul, Walt, Yisang, Joe**



## Project Description 

Our challenge is to build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate. The toxic language data set is sourced from Wikipedia and is available as a public Kaggle data set.

Our goal is to apply the machine learning techniques covered in class to develop high-quality ML models and pipelines.

Our research deliverables will be the following:
1. Exercise and build upon concepts covered in class and test out the following supervised models:
    a. Regression (LASSO, Logistic)
    b. Naive Bayes
    c. Trees (XGBoost)
    d. Neural Networks MPI
    e. KNN
2. Use stacking/ensembling methods (simple blending)


For the baseline proposal, this file contains a first-pass run-through, from data preprocessing to model evaluation, using a regression model pipeline.

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge



### Environment Setup

In [1]:
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
# GridSearchCV lives in sklearn.model_selection (scikit-learn >= 0.18); sklearn.grid_search was removed in 0.20
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

#scipy imports
from scipy.sparse import hstack

#Visualization imports
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import bokeh
#! pip install bokeh

import datetime
import time
import string
from sklearn import metrics
import ast

# target classes
target_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']



In [2]:
# read frames locally from csv
train_df = pd.read_csv("../data/new_train.csv")
test_df = pd.read_csv("../data/new_test.csv")

# # Random index generator for splitting training data
# # Note: Each rerun of cell will create new splits.
# randIndexCut = np.random.rand(len(train_df)) < 0.7

# # Split up data
# test_data = test_df["comment_text"]
# dev_data, dev_labels = train_df[~randIndexCut]["comment_text"], train_df[~randIndexCut][target_names]
# train_data, train_labels = train_df[randIndexCut]["comment_text"], train_df[randIndexCut][target_names]

dev_data, dev_labels = test_df["comment_text"], test_df[target_names]
train_data, train_labels = train_df["comment_text"], train_df[target_names]

print('total training observations:', train_df.shape[0])
print('training data shape:', train_data.shape)
print('training label shape:', train_labels.shape)
print('dev label shape:', dev_labels.shape)
print('labels names:', target_names)

('total training observations:', 111828)
('training data shape:', (111828L,))
('training label shape:', (111828, 6))
('dev label shape:', (47743, 6))
('labels names:', ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'])
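The commented-out split in the cell above draws a fresh random mask on every rerun. If that split is ever re-enabled, seeding the generator would keep the train/dev partition stable between runs. A minimal sketch, assuming a hypothetical helper `split_train_dev` and a toy frame standing in for `train_df` (neither is part of the original notebook):

```python
import numpy as np
import pandas as pd

def split_train_dev(df, train_fraction=0.7, seed=0):
    """Hypothetical helper: seeded random train/dev split of a DataFrame."""
    rng = np.random.RandomState(seed)            # fixed seed -> same mask on every run
    mask = rng.rand(len(df)) < train_fraction    # True marks a training row
    return df[mask], df[~mask]

# Toy stand-in for train_df; the real frame is read from new_train.csv
toy_df = pd.DataFrame({"comment_text": list("abcdefghij"),
                       "toxic": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]})

train_part, dev_part = split_train_dev(toy_df, seed=42)
again_train, again_dev = split_train_dev(toy_df, seed=42)  # same seed, same split
print(len(train_part), len(dev_part))
```

The same seed always reproduces the same rows in each partition, which makes scores comparable across reruns.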


## Text Pre-Processing

The following cell contains the text processing used for the final NB model. 
Many of the functions were drawn from Walt's Neural Network code found here. 

A future improvement would be to import Walt's Neural Network classes as a package to minimize code redundancy. 
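The token-level filtering that the preprocessor applies (lowercasing, stripping, dropping stop words and pure-punctuation tokens) can be illustrated without NLTK. This is only a sketch: `filter_tokens` is a hypothetical helper, the stop-word set is a tiny stand-in for NLTK's English list, and lemmatization is left out.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration)
STOPWORDS = {"the", "is", "a"}
PUNCT = set(string.punctuation)

def filter_tokens(tokens, stopwords=STOPWORDS, punct=PUNCT):
    """Hypothetical helper mirroring the filtering steps of the preprocessor."""
    kept = []
    for token in tokens:
        # Lowercase, then strip whitespace, underscores and asterisks
        token = token.lower().strip().strip('_').strip('*')
        if token in stopwords:
            continue  # drop stop words
        if all(ch in punct for ch in token):
            continue  # drop punctuation-only (and empty) tokens
        kept.append(token)
    return kept

result = filter_tokens(["The", "_weather_", "is", "***", "GREAT", "!!"])
print(result)
```

Note that a token made only of stripped characters, such as `***`, becomes an empty string and is discarded by the punctuation check.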

In [3]:
# Courtesy of Walt

import nltk
# These imports enable the use of NLTKPreprocessor in an sklearn Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk.tokenize import punkt as punkt
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

# nltk.download('stopwords')
# nltk.download('punkt')


class NLTKPreprocessor(BaseEstimator, TransformerMixin):
    """Text preprocessor using NLTK tokenization and Lemmatization

    This class is to be used in an sklearn Pipeline, prior to other processors like PCA/LSA/classification
    Attributes:
        lower: A boolean indicating whether text should be lowercased by preprocessor
                default: True
        strip: A boolean indicating whether text should be stripped of surrounding whitespace, underscores and '*'
                default: True
        stopwords: A set of words to be used as stop words and thus ignored during tokenization
                default: built-in English stop words
        punct: A set of punctuation characters that should be ignored
                default: string.punctuation
        lemmatizer: An object that should be used to lemmatize tokens
    """

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):

        # Break the document into sentences
        for sent in sent_tokenize(unicode(document, 'utf8')):

            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

def identity(arg):
    """
    Simple identity function works as a passthrough.
    """
    return arg
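The `identity` passthrough exists so that documents already tokenized by the preprocessor can be handed to a vectorizer without being re-tokenized. A minimal sketch of that pattern on toy token lists; `token_pattern=None` is an addition here, meant only to avoid the unused-pattern warning in recent scikit-learn versions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(arg):
    """Passthrough, same as the notebook's identity function."""
    return arg

# Each document is already a list of tokens, as NLTKPreprocessor.transform returns
token_docs = [["cat", "sat"], ["dog", "sat", "down"]]

# lowercase=False is required: the default preprocessor would call .lower() on a list
vect = TfidfVectorizer(tokenizer=identity, lowercase=False, token_pattern=None)
X = vect.fit_transform(token_docs)

vocab = sorted(vect.vocabulary_)
print(vocab)
print(X.shape)
```

Because the tokens pass through unchanged, any lowercasing or accent stripping has to happen in the preprocessing step.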

In [4]:
# Calculation of scores on dev set and training set
def score_classifier_train_on_dev(dev_vector, train_vector, dev_labels, train_labels, label, ctype, pscoring):
    """This function takes two vectors, one for training and one for dev, trains them
    on the selected Naive Bayes model, then depending on the scoring required it
    finds the optimal alpha for the particular scoring and calculates that score from
    predictions on the dev set.
    
    Args:
        dev_vector: the processed vector of dev data
        train_vector: the processed vector of training data
        dev_labels: the vector of each of the 6 labels for the dev set
        train_labels: the vector of labels for the training set
        label (string) : the label name to test
        ctype: multi, gaus or bern, chooses between multinomial, gaussian or bernoulli Naive Bayes
        pscoring: should be one of roc_auc, precision, recall, or f1
        
    Returns:
        alpha: the best alpha value for this classifier
        score: the score when using this classifier to predict dev
    """
    alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 15.0, 20.0, 50.0, 100.0]}

    if pscoring != 'precision' and pscoring != 'recall' and pscoring != 'roc_auc' and pscoring != 'f1':
        print('score_classifier_train_on_dev: Invalid input parameter %s' %(pscoring))
        return
    
    if ctype == 'multi':
        nb_class = MultinomialNB().fit(train_vector, train_labels[label])
    elif ctype == 'bern':
        nb_class = BernoulliNB().fit(train_vector, train_labels[label])
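    elif ctype == 'gaus':
        # NOTE: GaussianNB has no alpha parameter and needs dense input, so the alpha grid search below does not apply to it
        nb_class = GaussianNB().fit(train_vector, train_labels[label])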
    else:
        print('ctype = %s, error' % (ctype))
        return
    
    # use this to generate the best fitting model for the requested scoring
    clf = GridSearchCV(nb_class, param_grid = alphas, scoring=pscoring)
    clf.fit(train_vector, train_labels[label])
    
    # Predict the dev vector
    predicted_labels_dev = clf.predict(dev_vector)
    
    rscore = 0 # return score
    # now calculate the score of interest based on the function parameter pscoring
    if pscoring == 'precision':
        rscore = metrics.precision_score(dev_labels[label], predicted_labels_dev)
    elif pscoring == 'recall':
        rscore = metrics.recall_score(dev_labels[label], predicted_labels_dev)
    elif pscoring == 'f1':
        rscore = metrics.f1_score(dev_labels[label], predicted_labels_dev)
    else:
        # NOTE: ROC AUC is computed here from hard 0/1 predictions, not from predicted probabilities
        rscore = metrics.roc_auc_score(dev_labels[label], predicted_labels_dev)

    return clf.best_params_, rscore
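The function above passes the output of `clf.predict` to `metrics.roc_auc_score`. ROC AUC is a ranking metric, so it is usually computed from continuous scores such as `clf.predict_proba(dev_vector)[:, 1]`; thresholding first can lower the value. The toy numbers below are made up purely to illustrate the difference and are not results from this project:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])
# Made-up positive-class probabilities, e.g. what predict_proba(...)[:, 1] might return
scores = np.array([0.1, 0.4, 0.45, 0.8])
# Hard labels at the default 0.5 threshold, i.e. what predict(...) would return
hard = (scores >= 0.5).astype(int)

auc_scores = metrics.roc_auc_score(y_true, scores)  # both positives rank above both negatives
auc_hard = metrics.roc_auc_score(y_true, hard)      # one positive is now tied with the negatives
print(auc_scores, auc_hard)
```

Here the probabilities rank every positive above every negative, while the thresholded labels lose that ordering for the positive scored at 0.45.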

In [5]:
def create_all_vectors(my_feature_sizes = [None, 3000, 4000, 5000, 6000, 10000],
                       my_stop_words = [None, 'english'],
                       my_strip_accents = [None, 'ascii', 'unicode'],
                       my_lowercase = [True, False],
                       basetrain_data=[],
                       basedev_data=[],
                       preprocessedtrain_data=[],
                       preprocesseddev_data=[],
                       verbose=False):
    """This loops through the lists in the parameters creating 2 vector sets for each combination, 
    one CountVectorizer and one for TfidfVectorizer.  It allows for both preprocessed and unprocessed
    input data, and in the case of pre-processed it does not use the options to strip_accents or
    lowercase the data, those options are assumed to have occurred when the data was preprocessed.
    
    Args:
        my_feature_sizes (list of sizes): Non-empty list of feature sizes to use in vectors
        my_stop_words (list of stop_words): Provided to the vectorizers
        my_strip_accents (list of options): Provided to the vectorizers
        my_lowercase (list of bools): Provided to the vectorizers
        basetrain_data (Opt: list of input data): this is the base training data, no preprocessing
        basedev_data (Opt: list of input data): this is the base dev data, no preprocessing
        preprocessedtrain_data (Opt: list of input data): this is the base training data that received preprocessing
        preprocesseddev_data= (Opt: list of input data): this is the base dev data that received preprocessing
        verbose (bool): to write progress outputs
        
    Returns: 
        vectors_all (Pandas DataFrame) : a dataframe where each line contains the unique count or tfidf vector
            along with the set of parameters that were used to create it.
    """
    
    vectors_all=pd.DataFrame(columns=['vectortrain', 'vectordev','type','preprocessor', 'tokenizer',
                                      'max_features', 'stop_words', 'lowercase', 'strip_accents' ])

    index=1
    if len(basetrain_data) != 0 and len(basedev_data) != 0:
        # we have unprocessed data so create vectors for it
        for i in my_feature_sizes:
            for x in my_stop_words:
                for y in my_strip_accents:
                    for z in my_lowercase:
                        if (verbose == True):
                            print("%s: Processing the next vector from base data, index %d" % (str(datetime.datetime.now().time()),index))
                            index +=1
#                        ### Count Vectorizer removed as the results are identical (to millionths of a %) for count and tfidf
#                        ### vectorizer.  We stick with just tfidf as this works best for the other models
#                        # Create a count vectorizer with the provided parameters
#                        vect = CountVectorizer(max_features=i, stop_words=x, strip_accents=y, lowercase=z)
#                        # Train the unpreprocessed training set
#                        vect_train = vect.fit_transform(train_data)
#                        # Transform the dev set for future predictions
#                        vect_dev = vect.transform(dev_data)
#                        # add into the output data frame with the list of vectors chosen
#                        vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'count', 0, 0, i, x, z, y]

                        # Now create tfidf vectorizer for the set of parameters
                        vect = TfidfVectorizer(max_features=i, stop_words=x, strip_accents=y, analyzer='word',lowercase=z)
                        # Fit on the unpreprocessed training set passed in as basetrain_data
                        vect_train = vect.fit_transform(basetrain_data)
                        # Transform the dev set for future predictions
                        vect_dev = vect.transform(basedev_data)
                        # add into the output data frame with the list of vectors chosen
                        vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'tfidf', 0, 0, i, x, z, y]

    index=1
    if len(preprocessedtrain_data) != 0 and len(preprocesseddev_data) != 0:
        # Separate loop for the preprocessed data as we cannot set the lowercase or strip accents parameters on these
        for i in my_feature_sizes:
            for x in my_stop_words:
                if verbose == True:
                    print("%s: Processing the next vector from preprocessed data, index %d" % (str(datetime.datetime.now().time()),index))
                    index +=1
#               ### Count Vectorizer removed as the results are identical (to millionths of a %) for count and tfidf
#               ### vectorizer.  We stick with just tfidf as this works best for the other models
#                # Same but with the preprocessed data
#                vect = CountVectorizer(tokenizer=identity, max_features=i, stop_words=x,strip_accents=None, lowercase=False)
#                vect_train= vect.fit_transform(train_preproc_data)
#                vect_dev= vect.transform(dev_preproc_data)
#                vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'count', 1, 0, i, x, False, None]

                # Create a tfidf but fit with the preprocessed data passed in as arguments
                vect = TfidfVectorizer(tokenizer=identity,max_features=i, stop_words=x,strip_accents=None, lowercase=False)
                vect_train = vect.fit_transform(preprocessedtrain_data)
                vect_dev = vect.transform(preprocesseddev_data)
                vectors_all.loc[vectors_all.shape[0]] = [vect_train, vect_dev, 'tfidf', 1, 0, i, x, False, None]

    if verbose == True:
        print('%s: Completed create_all_vectors' % (str(datetime.datetime.now().time())))
        
    return vectors_all
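With the default arguments, the nested loops above enumerate a full Cartesian product of vectorizer settings. A small sketch using `itertools.product` shows how many vector pairs that produces; the variable names here are illustrative and not taken from the notebook:

```python
from itertools import product

# Default option lists from create_all_vectors
feature_sizes = [None, 3000, 4000, 5000, 6000, 10000]
stop_words = [None, 'english']
strip_accents = [None, 'ascii', 'unicode']
lowercase = [True, False]

# Raw text: every combination of all four options
base_grid = list(product(feature_sizes, stop_words, strip_accents, lowercase))
# Preprocessed text: strip_accents and lowercase are fixed, so only two options vary
preproc_grid = list(product(feature_sizes, stop_words))

total_vectors = len(base_grid) + len(preproc_grid)
print(len(base_grid), len(preproc_grid), total_vectors)
```

Each of these combinations is fitted once on the training text, which is why the vector-building step accounts for a noticeable share of the run time.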


## Modeling

The code below contains the wrapper functions that train a variety of Naive Bayes models and report the performance metrics that were established in the main notebook. 

In [6]:
def calculate_score_all_models (vectors_all, score_types = ['precision', 'recall', 'roc_auc'], 
                                model_types = ['multi', 'bern'], verbose=False):
    """This function takes a vector of type vectors_all (defined above) acts as a wrapper
    to send each vector to the score_classifier_train_on_dev function for scoring.  The
    resulting scores are stored in a dataframe and returned
    
    Args:
        vectors_all (dataframe) : a dataframe defined above that stores the vector data in each row
        score_types (list) : a list of scoring types to be passed to the scoring
        model_types (list) : a list of the Naive Bayes model types to create when scoring these vectors
        verbose (bool): print out progress when true
    Returns
        dataframe: A dataframe of all the resulting scores and the details for each model
    """
    data_all=pd.DataFrame(columns=['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                                   'max_features', 'stop_words', 'lowercase', 'strip_accents',
                                   'score_type', 'score'])
#     score_types = ['precision', 'recall', 'roc_auc']
#     model_types = ['multi', 'bern']
    
    if verbose == True:
        print('%s: Starting calculate_score_all_models' % (str(datetime.datetime.now().time())))        
    for index,row in vectors_all.iterrows():
        if verbose == True:
            print('%s: checking row %d' % (str(datetime.datetime.now().time()),index))
        for name in target_names:
            for score_type in score_types:
                for model_type in model_types:
                    # Calculate the score for this pair of vectors with a variety of scoring
                    # parameters and types of NB classifier
                    alpha, score = score_classifier_train_on_dev(train_vector=row['vectortrain'], 
                                        dev_vector=row['vectordev'], dev_labels=dev_labels,
                                        train_labels=train_labels, label=name, ctype=model_type, pscoring=score_type )
                    
                    # Store all the results in the dataframe
                    data_all.loc[data_all.shape[0]] = [index,name,model_type,alpha,row['type'], 
                                        row['preprocessor'], row['tokenizer'], row['max_features'], 
                                        row['stop_words'], row['lowercase'], row['strip_accents'],
                                        score_type, score]
    if verbose == True:
        print('%s: finished calculate_score_all_models' % (str(datetime.datetime.now().time())))               
    return data_all
    

In [7]:
def write_out_predictions(vector_all_vectors,vectors_to_predict, output_file, verbose=True):
    """This function takes a list of all the vectors and a subset of results.  Using the subset
    it extracts the training and dev vectors then creates the models, trains them and writes the predictions
    to the output file.  This does a prediction of each label and writes them out to the file.
    
    Args:
        vector_all_vectors (dataframe of vector information): This is the dataframe used to store the 
            vectors.  It has the following fields:
                ['vectortrain', 'vectordev','type','preprocessor', 'tokenizer',
                    'max_features', 'stop_words', 'lowercase', 'strip_accents' ]
        vectors_to_predict (dataframe of results): This is the dataframe used to store parameters and
            results.  It has the following fields:
                ['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                    'max_features', 'stop_words', 'lowercase', 'strip_accents',
                    'score_type', 'score'])
        output_file (string): output file name
        verbose (bool): extra printing of information
    Returns: None
    """
    
    df_store = pd.DataFrame()
    if verbose == True:
        print("%s: Starting write_out_predictions to %s" % (str(datetime.datetime.now().time()),output_file))
    for index,row in vectors_to_predict.iterrows():
        stored_vector_entry = vector_all_vectors.loc[row['vectorno']]
        if row['model'] == 'multi':
            # Fit on the label for this row (previously hard-coded to 'toxic')
            nb_class = MultinomialNB(alpha=row['alpha'].get('alpha')).fit(stored_vector_entry['vectortrain'], train_labels[row['label']])
        elif row['model'] == 'bern':
            nb_class = BernoulliNB(alpha=row['alpha'].get('alpha')).fit(stored_vector_entry['vectortrain'], train_labels[row['label']])
        else:
            print('%s: Error, row is %s' % (str(datetime.datetime.now().time()),row['model'] ))
            continue  # skip unknown model types instead of reusing a stale classifier
        result_tmp=nb_class.predict(stored_vector_entry['vectordev'])
        df_store[row['label']] = result_tmp
    if verbose == True:
        print("%s: Predictions done, writing out" % (str(datetime.datetime.now().time())))
    df_store.to_csv(output_file, index=False)
        


In [8]:
def write_out_results(result_df):
    """ This is a wrapper function around the writing out of results.  It writes out a large number
    of csv and parm files.  These files are the prediction files and the parameters for each of the
    classifiers in the predictions.  The idea here is to write out the following sets of results
    1) Absolute best scores
    2) Best scores with TFIDF (with the CountVectorizer removed this is the same as 1)
    3) Top scores for the NLTK preprocessed data
    4) Top scores for TFIDF sorted by feature sizes
    5) Top scores for TFIDF with NLTK preprocessing sorted by feature size
    With each of these output sets we also write out a parm file which can be reread later using 
    the function read_create_classifiers to recreate the vectors and classifiers to match these results.
    
    Args: 
        result_df (dataframe of results)
    
    Returns: None
    """

    score_types = ['precision', 'recall', 'roc_auc']

    top_score_results = pd.DataFrame(columns=['vectorno', 'label', 'model','alpha', 'type', 'preprocessor', 'tokenizer', 
                                       'max_features', 'stop_words', 'lowercase', 'strip_accents',
                                       'score_type', 'score'])
    filename_prefix = 'NB_predictions_'
    # Top scores irrespective
    for score_type in score_types:
        top_score_results_tmp = top_score_results[0:0].copy()  # empty dataframe in each loop
        for label in target_names:
            df_tmp = result_df[(result_df['label'] == label) & (result_df['score_type'] == score_type)]
            top_score_results_tmp.loc[top_score_results_tmp.shape[0]] = df_tmp.loc[df_tmp['score'].idxmax()]
        filename_csv = filename_prefix + 'top_scores_nofilter_' + score_type + '.csv'
        filename_parm = filename_prefix + 'top_scores_nofilter_' + score_type + '.parm'
        write_out_predictions(vectors_all,top_score_results_tmp, filename_csv, verbose)
        top_score_results_tmp.to_csv(filename_parm, index=False)
        print("Output written to %s" %(filename_csv))

    # Top Scores for TFIDF - Should be identical to the unfiltered top scores
    for score_type in score_types:
        top_score_results_tmp = top_score_results[0:0].copy()  # empty dataframe in each loop
        for label in target_names:
            df_tmp = result_df[(result_df['label'] == label) & (result_df['score_type'] == score_type) &
                              (result_df['type'] == 'tfidf')]
            top_score_results_tmp.loc[top_score_results_tmp.shape[0]] = df_tmp.loc[df_tmp['score'].idxmax()]
        filename_csv = filename_prefix + 'top_scores_tfidf_' + score_type + '.csv'
        filename_parm = filename_prefix + 'top_scores_tfidf_' + score_type + '.parm'
        write_out_predictions(vectors_all,top_score_results_tmp, filename_csv, verbose)
        top_score_results_tmp.to_csv(filename_parm, index=False)
        print("Output written to %s" %(filename_csv))

    # Top Scores for TFIDF with NLTK preprocessed data
    for score_type in score_types:
        top_score_results_tmp = top_score_results[0:0].copy()  # empty dataframe in each loop
        for label in target_names:
            df_tmp = result_df[(result_df['label'] == label) & (result_df['score_type'] == score_type) &
                              (result_df['type'] == 'tfidf') & (result_df['preprocessor'] == 1)]
            top_score_results_tmp.loc[top_score_results_tmp.shape[0]] = df_tmp.loc[df_tmp['score'].idxmax()]
        filename_csv = filename_prefix + 'top_scores_tfidf_preproc_' + score_type + '.csv'
        filename_parm = filename_prefix + 'top_scores_tfidf_preproc_' + score_type + '.parm'
        write_out_predictions(vectors_all,top_score_results_tmp, filename_csv, verbose)
        top_score_results_tmp.to_csv(filename_parm, index=False)
        print("Output written to %s" %(filename_csv))

    sizes=[3000, 4000, 5000, 6000, 10000]
    # Top Scores for TFIDF filtered by size
    for size in sizes:
        for score_type in score_types:
            top_score_results_tmp = top_score_results[0:0].copy()  # empty dataframe in each loop
            for label in target_names:
                df_tmp = result_df[(result_df['label'] == label) & (result_df['score_type'] == score_type) &
                                  (result_df['type'] == 'tfidf') & (result_df['max_features'] == size)]
                top_score_results_tmp.loc[top_score_results_tmp.shape[0]] = df_tmp.loc[df_tmp['score'].idxmax()]
            filename_csv = filename_prefix + 'top_scores_tfidf_' + score_type + '_' + str(size) + '.csv'
            filename_parm = filename_prefix + 'top_scores_tfidf_' + score_type + '_' + str(size) + '.parm'
            write_out_predictions(vectors_all,top_score_results_tmp, filename_csv, verbose)
            top_score_results_tmp.to_csv(filename_parm, index=False)
            print("Output written to %s" %(filename_csv))

    # Top Scores for TFIDF with NLTK preprocessed data, filtered by size
    for size in sizes:
        for score_type in score_types:
            top_score_results_tmp = top_score_results[0:0].copy()  # empty dataframe in each loop
            for label in target_names:
                df_tmp = result_df[(result_df['label'] == label) & (result_df['score_type'] == score_type) &
                                  (result_df['type'] == 'tfidf') & (result_df['preprocessor'] == 1) &
                                  (result_df['max_features'] == size)]
                top_score_results_tmp.loc[top_score_results_tmp.shape[0]] = df_tmp.loc[df_tmp['score'].idxmax()]
            filename_csv = filename_prefix + 'top_scores_tfidf_preproc_' + score_type + '_' + str(size) + '.csv'
            filename_parm = filename_prefix + 'top_scores_tfidf_preproc_' + score_type + '_' + str(size) + '.parm'
            write_out_predictions(vectors_all,top_score_results_tmp, filename_csv, verbose)
            top_score_results_tmp.to_csv(filename_parm, index=False)
            print("Output written to %s" %(filename_csv))
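The five loops in `write_out_results` repeat one pattern: filter the results, then keep the best-scoring row per label. One possible way to factor that out is a small helper built on `groupby(...).idxmax()`. This is a sketch only; `best_rows` and the toy scores are hypothetical and not part of the notebook:

```python
import pandas as pd

def best_rows(result_df, score_type, **filters):
    """Hypothetical helper: best-scoring row per label for one score type."""
    subset = result_df[result_df['score_type'] == score_type]
    for column, value in filters.items():
        subset = subset[subset[column] == value]   # optional equality filters
    best_index = subset.groupby('label')['score'].idxmax()
    return subset.loc[best_index].reset_index(drop=True)

# Made-up scores, only to exercise the helper
toy_results = pd.DataFrame({
    'label':        ['toxic', 'toxic', 'threat', 'threat'],
    'score_type':   ['recall', 'recall', 'recall', 'recall'],
    'type':         ['tfidf', 'tfidf', 'tfidf', 'tfidf'],
    'max_features': [3000, 5000, 3000, 5000],
    'score':        [0.61, 0.70, 0.35, 0.20],
})

top = best_rows(toy_results, 'recall', type='tfidf')
top_3000 = best_rows(toy_results, 'recall', max_features=3000)
print(top)
```

Rows come back ordered by label name, so callers that rely on the order of `target_names` would need to reindex.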

In [9]:
#######################################################################################
# To be used after a full run through
#######################################################################################

def read_create_classifiers(parm_file, train_data, preproc_train_data, train_labels):
    """Once the notebook has been run once and the results finalized this is the
    only necessary function.  It can be used with any of the parm files written out by the
    write_out_results function to recreate the classifiers and return them for use.
    
    Args:
        parm_file (string filename): a stored dataframe of parameters to create the classifier
        train_data (unprocessed training data): the unprocessed training data
        preproc_train_data: preprocessed training data (some parameters in call to vectorizer are
            different when dealing with preprocessed data)
        train_labels: a set of training labels that match the training data
    Returns:
        dictionary of classifiers: the index is the label of the classifier.  These are fitted classifiers
            and can be used for predictions.
    """
    input_parameters = pd.read_csv(parm_file)
    return_classifiers = {}  # maps label name -> fitted classifier
    
#     vectorno,label,model,alpha,type,preprocessor,tokenizer,max_features,stop_words,lowercase,strip_accents,score_type,score
    for index,row in input_parameters.iterrows():
        if pd.isnull(row['stop_words']):  # None is read back from CSV as NaN
            stop_words = None
        else:
            stop_words = row['stop_words']
        if row['preprocessor'] == 1:
            vect = TfidfVectorizer(tokenizer=identity, max_features=row['max_features'],
                                   stop_words=stop_words,
                                   lowercase=False,
                                   strip_accents=None)
            vect_train = vect.fit_transform(preproc_train_data)

        else:
            # Raw text: use the default word analyzer, as in create_all_vectors (no identity tokenizer)
            vect = TfidfVectorizer(analyzer='word', max_features=row['max_features'],
                                   stop_words=stop_words,
                                   strip_accents=row['strip_accents'],
                                   lowercase=row['lowercase'])
            vect_train = vect.fit_transform(train_data)

        alpha_loc = ast.literal_eval(row['alpha']).get('alpha')
        if row['model'] == 'bern':
            nb_class = BernoulliNB(alpha=alpha_loc).fit(vect_train, train_labels[row['label']])
        elif row['model'] == 'multi':
            nb_class = MultinomialNB(alpha=alpha_loc).fit(vect_train, train_labels[row['label']])
        # add the fitted classifier to the return dictionary, keyed by label
        return_classifiers[row['label']] = nb_class
    return return_classifiers
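Parameters that pass through a CSV parm file change type on the way back: the alpha dictionary becomes a string, and `None` values become NaN (which also turns an integer column into floats). The sketch below round-trips a toy parm table through an in-memory buffer and restores the original types; `clean_row` is a hypothetical helper and the values are made up:

```python
import ast
import io
import pandas as pd

# Toy parm table with the same kinds of values the notebook writes out
params = pd.DataFrame({
    'label': ['toxic', 'threat'],
    'alpha': [{'alpha': 0.1}, {'alpha': 2.0}],
    'max_features': [None, 3000],
    'stop_words': [None, 'english'],
})

buffer = io.StringIO()
params.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)   # alpha is now a string, None is now NaN

def clean_row(row):
    """Hypothetical helper: restore Python types after the CSV round trip."""
    alpha = ast.literal_eval(row['alpha'])['alpha']
    max_features = None if pd.isnull(row['max_features']) else int(row['max_features'])
    stop_words = None if pd.isnull(row['stop_words']) else row['stop_words']
    return alpha, max_features, stop_words

cleaned = [clean_row(row) for _, row in loaded.iterrows()]
print(cleaned)
```

The same NaN handling would likely be needed for `strip_accents` and `max_features` before they are passed to a vectorizer.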

In [10]:
#############################################
# MAIN
#############################################
# This block does the following:
# 1. Create the NLTK Preprocessed data
# 2. Create the set of vectors with the unprocessed and preprocessed data
# 3. Calculates the scores on all models

# More progress printing when this is True
verbose=True


# Create the NLTK preprocessed data
if verbose == True:
    print('%s: transforming training data with NLTK preprocessor' %(str(datetime.datetime.now().time())))
train_preproc_data = NLTKPreprocessor().fit(train_data).transform(train_data)
if verbose == True:
    print('%s: transforming dev data with NLTK preprocessor' %(str(datetime.datetime.now().time())))
dev_preproc_data = NLTKPreprocessor().fit(dev_data).transform(dev_data)
if verbose == True:
    print('%s: completed NLTK preprocessor' %(str(datetime.datetime.now().time())))

# Create the set of vectors:
vectors_all = create_all_vectors(basetrain_data=train_data, basedev_data=dev_data,
                        preprocessedtrain_data=train_preproc_data, preprocesseddev_data=dev_preproc_data,
                        verbose=verbose)

# calculate the scores for all the models
result_df = calculate_score_all_models(vectors_all,verbose=verbose)

# Write out all the results
result_df.to_csv('all_results_' + str(datetime.date.today()) + ".csv", index=False)

# this stores the results of the predictions and parameters to a set of output files
write_out_results(result_df)
if verbose == True:
    print('%s: completed main' %(str(datetime.datetime.now().time())))

19:28:32.479261: transforming training data with NLTK preprocessor
19:41:00.173452: transforming dev data with NLTK preprocessor
19:46:14.309278: completed NLTK preprocessor
19:46:14.310868: Processing the next vector from base data, index 1
19:46:28.044005: Processing the next vector from base data, index 2
19:46:41.774401: Processing the next vector from base data, index 3
19:46:56.332996: Processing the next vector from base data, index 4
19:47:11.109611: Processing the next vector from base data, index 5
19:47:26.406526: Processing the next vector from base data, index 6
19:47:41.756125: Processing the next vector from base data, index 7
19:47:54.912967: Processing the next vector from base data, index 8
19:48:08.327294: Processing the next vector from base data, index 9
19:48:22.463954: Processing the next vector from base data, index 10
19:48:36.896665: Processing the next vector from base data, index 11
19:48:51.550509: Processing the next vector from base data, index 12
19:49:0


20:08:20.712151: checking row 1
20:13:57.323610: checking row 2
20:18:15.929765: checking row 3
20:23:52.200339: checking row 4
20:28:10.851198: checking row 5
20:33:47.516424: checking row 6
20:36:58.029656: checking row 7
20:40:35.683043: checking row 8
20:43:46.497611: checking row 9
20:47:23.706024: checking row 10
20:50:34.406519: checking row 11
20:54:11.858759: checking row 12
20:56:24.494422: checking row 13
20:58:36.292511: checking row 14
21:00:48.932721: checking row 15
21:03:00.922222: checking row 16
21:05:13.528664: checking row 17
21:07:25.381345: checking row 18
21:08:54.423134: checking row 19
21:10:25.935328: checking row 20
21:11:54.979336: checking row 21
21:13:26.591130: checking row 22
21:14:55.738207: checking row 23
21:16:27.322204: checking row 24
21:18:43.489297: checking row 25
21:20:59.462087: checking row 26
21:23:15.726494: checking row 27
21:25:31.468703: checking row 28
21:27:47.665754: checking row 29
21:30:03.290140: checking row 30
21:31:35.610025: ch

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


23:14:26.172047: Starting write_out_predictions to NB_predictions_top_scores_nofilter_precision.csv
23:14:26.616365: Predictions done, writing out
Output written to NB_predictions_top_scores_nofilter_precision.csv
23:14:28.227745: Starting write_out_predictions to NB_predictions_top_scores_nofilter_recall.csv
23:14:28.631949: Predictions done, writing out
Output written to NB_predictions_top_scores_nofilter_recall.csv
23:14:30.232677: Starting write_out_predictions to NB_predictions_top_scores_nofilter_roc_auc.csv
23:14:30.579803: Predictions done, writing out
Output written to NB_predictions_top_scores_nofilter_roc_auc.csv


23:14:32.190522: Starting write_out_predictions to NB_predictions_top_scores_tfidf_precision.csv
23:14:32.634026: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_precision.csv
23:14:34.239227: Starting write_out_predictions to NB_predictions_top_scores_tfidf_recall.csv
23:14:34.642737: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_recall.csv
23:14:36.246106: Starting write_out_predictions to NB_predictions_top_scores_tfidf_roc_auc.csv
23:14:36.592876: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_roc_auc.csv


23:14:38.207255: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_precision.csv
23:14:38.525341: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_precision.csv
23:14:40.134936: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_recall.csv
23:14:40.511821: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_recall.csv
23:14:42.120809: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_roc_auc.csv
23:14:42.466789: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_roc_auc.csv


23:14:44.077492: Starting write_out_predictions to NB_predictions_top_scores_tfidf_precision_3000.csv
23:14:44.335300: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_precision_3000.csv
23:14:45.944104: Starting write_out_predictions to NB_predictions_top_scores_tfidf_recall_3000.csv
23:14:46.333994: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_recall_3000.csv
23:14:47.944812: Starting write_out_predictions to NB_predictions_top_scores_tfidf_roc_auc_3000.csv
23:14:48.326525: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_roc_auc_3000.csv
23:14:49.945189: Starting write_out_predictions to NB_predictions_top_scores_tfidf_precision_4000.csv
23:14:50.234221: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_precision_4000.csv
23:14:51.843286: Starting write_out_predictions to NB_predictions_top_scores_tfidf_recall_4000.csv
23:14:52.238531: Predictions done, writ

23:15:13.579723: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_precision_3000.csv
23:15:13.815960: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_precision_3000.csv
23:15:15.435596: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_recall_3000.csv
23:15:15.804047: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_recall_3000.csv
23:15:17.433851: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_roc_auc_3000.csv
23:15:17.782386: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_roc_auc_3000.csv
23:15:19.408235: Starting write_out_predictions to NB_predictions_top_scores_tfidf_preproc_precision_4000.csv
23:15:19.647390: Predictions done, writing out
Output written to NB_predictions_top_scores_tfidf_preproc_precision_4000.csv
23:15:21.272249: Starting write_out_predictions to NB_predictions_top_scor

In [11]:
# A quick test that we can reread and create classifiers with the read_create_classifiers function
if verbose == True:
    print('%s: start test read_create_classifiers' %(str(datetime.datetime.now().time())))
result_classifiers = read_create_classifiers('NB_predictions_top_scores_tfidf_preproc_roc_auc.parm', 
                                             train_data, train_preproc_data, train_labels)

if verbose == True:
    print('%s: completed test read_create_classifiers' %(str(datetime.datetime.now().time())))
    
result_classifiers

23:15:41.534500: start test read_create_classifiers
23:16:04.147927: completed test read_create_classifiers


{'identity_hate': BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True),
 'insult': BernoulliNB(alpha=10.0, binarize=0.0, class_prior=None, fit_prior=True),
 'obscene': BernoulliNB(alpha=10.0, binarize=0.0, class_prior=None, fit_prior=True),
 'severe_toxic': BernoulliNB(alpha=2.0, binarize=0.0, class_prior=None, fit_prior=True),
 'threat': BernoulliNB(alpha=0.5, binarize=0.0, class_prior=None, fit_prior=True),
 'toxic': BernoulliNB(alpha=15.0, binarize=0.0, class_prior=None, fit_prior=True)}
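
The dictionary above holds one fitted `BernoulliNB` per label. As a minimal sketch of how such a per-label dictionary could be used to score new comments, the example below trains one classifier per label on a tiny made-up corpus. The toy comments, the two-label subset, the single shared vectorizer, and the `predict_all` helper are all assumptions for illustration; they are not part of `read_create_classifiers`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels, invented purely for illustration.
toy_comments = [
    "you are an idiot",
    "thanks for the helpful edit",
    "i will hurt you",
    "great article, well sourced",
]
toy_labels = {
    "insult": np.array([1, 0, 0, 0]),
    "threat": np.array([0, 0, 1, 0]),
}

# One shared vectorizer here (an assumption); the notebook's results tables
# suggest each label may use its own vectorizer settings.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy_comments)

# One binary classifier per label, mirroring the dictionary shown above.
classifiers = {
    label: BernoulliNB(alpha=1.0).fit(X, y) for label, y in toy_labels.items()
}


def predict_all(texts):
    """Return {label: P(label = 1)} for each text (hypothetical helper)."""
    X_new = vectorizer.transform(texts)
    return {
        label: clf.predict_proba(X_new)[:, 1]
        for label, clf in classifiers.items()
    }


probs = predict_all(["you idiot", "nice edit"])
for label, p in probs.items():
    print(label, p)
```

Keeping the classifiers in a dictionary keyed by label makes it straightforward to assemble one probability column per target class.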




## Evaluation & Model Conclusion
We tested the following for this model:

* A variety of parameters and vectorizers
  * Count and Tfidf vectors
  * A range of feature sizes (`max_features`)
  * Data with and without the NLTK preprocessor
  * Removing stop words and accents
* Three scoring metrics: precision, recall, and ROC AUC
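
The options above multiply quickly. As a rough sketch of how large an exhaustive search becomes, the snippet below enumerates every combination of a hypothetical grid. The values are the ones that appear in the results tables later in this section (plus `count`, from the list above); the actual grid searched by the notebook may differ.

```python
from itertools import product

# Hypothetical search space: values taken from the results tables,
# not necessarily the full set the notebook searched.
grid = {
    "model": ["multi", "bern"],
    "alpha": [0.5, 1, 2, 10, 15],
    "type": ["count", "tfidf"],
    "preprocessor": [0, 1],
    "max_features": [None, 3000, 4000, 6000, 10000],
    "stop_words": [None, "english"],
    "lowercase": [True, False],
    "strip_accents": [None, "unicode"],
}

# Every combination of settings is one model fit per label.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_labels = 6  # toxic, severe_toxic, obscene, threat, insult, identity_hate
total_fits = len(combos) * n_labels

print(len(combos), "combinations,", total_fits, "fits across all labels")
```

Under these assumptions the grid has 1600 combinations, or 9600 fits across the six labels, before any cross-validation folds are counted.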

Lessons learned:
* Experience is key: being novices made this process take much longer than it otherwise would have.

* Interpreted code is slow, and this notebook currently runs single-threaded, which makes it slower still. The run logged above took roughly four hours from start (19:28) to finish (23:15). As a consequence, hardware matters, particularly fast cores and plenty of memory.

* While we can in some cases brute-force the best parameters, there are situations where, without more time or a cluster for processing power, we must instead rely on educated guesswork and compromises. Research and experience make that guesswork better.

* Gaussian Naive Bayes is not suitable for the very sparse inputs we are using, so we did not test it.
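
To illustrate the last point, here is a small sketch on made-up documents. `MultinomialNB` accepts the sparse matrix produced by a vectorizer, while scikit-learn's `GaussianNB` requires dense input and raises a `TypeError` otherwise. The corpus size used for the memory estimate is an assumption, not a measurement from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up documents and labels, for illustration only.
docs = ["you are an idiot", "thanks for the edit", "lovely work", "idiot idiot"]
y = [1, 0, 0, 1]

X = CountVectorizer().fit_transform(docs)  # scipy sparse matrix

MultinomialNB().fit(X, y)  # sparse input is accepted directly

try:
    GaussianNB().fit(X, y)
    gaussian_accepts_sparse = True
except TypeError:
    # GaussianNB needs a dense array, e.g. X.toarray()
    gaussian_accepts_sparse = False

# Rough size of a dense float64 copy of a corpus-sized matrix.
n_rows, n_features = 100_000, 10_000  # assumed sizes, for illustration only
dense_gib = n_rows * n_features * 8 / 2**30

print("GaussianNB accepts sparse input:", gaussian_accepts_sparse)
print("Dense copy would need about %.1f GiB" % dense_gib)
```

Converting with `X.toarray()` works for toy data, but at the assumed sizes a dense copy would need several gibibytes of memory.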


Results (top-scoring configuration per label, one table per score type):

<TABLE>
<TR><TH> label</TH><TH> model</TH><TH> alpha</TH><TH> type</TH><TH> preprocessor</TH><TH> tokenizer</TH><TH> max_features</TH><TH> stop_words</TH><TH> lowercase</TH><TH> strip_accents</TH><TH> score_type</TH><TH> score </TH></TR>
<TR><TD> toxic</TD><TD> multi</TD><TD> 10</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> None</TD><TD> None</TD><TD> TRUE</TD><TD> None</TD><TD> precision</TD><TD> 1 </TD></TR>
<TR><TD> severe_toxic</TD><TD> multi</TD><TD> 10</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> 6000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> precision</TD><TD> 0.8 </TD></TR>
<TR><TD> obscene</TD><TD> multi</TD><TD> 2</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> None</TD><TD> None</TD><TD> TRUE</TD><TD> None</TD><TD> precision</TD><TD> 1 </TD></TR>
<TR><TD> threat</TD><TD> multi</TD><TD> 0.5</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> None</TD><TD> None</TD><TD> FALSE</TD><TD> None</TD><TD> precision</TD><TD> 1 </TD></TR>
<TR><TD> insult</TD><TD> multi</TD><TD> 2</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> None</TD><TD> None</TD><TD> FALSE</TD><TD> None</TD><TD> precision</TD><TD> 1 </TD></TR>
<TR><TD> identity_hate</TD><TD> multi</TD><TD> 2</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 3000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> precision</TD><TD> 0.884615 </TD></TR>
<TR><TH> Average</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> 0.947435833 </TH></TR>
</TABLE>
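
Several rows above report a precision of 1. Precision only counts how many of the flagged comments were correct, so it is worth reading alongside recall. The sketch below uses invented labels and scores (not the notebook's predictions) to show the general property that a cautious classifier can reach a precision of 1 while missing most positives.

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Made-up ground truth: 4 toxic comments out of 10.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Made-up scores from a cautious classifier.
y_score = [0.9, 0.4, 0.3, 0.2, 0.35, 0.1, 0.1, 0.05, 0.05, 0.01]
# Threshold at 0.5: only the first comment is flagged.
y_pred = [int(s >= 0.5) for s in y_score]

precision = precision_score(y_true, y_pred)  # the single flag is correct
recall = recall_score(y_true, y_pred)        # three of four positives are missed
auc = roc_auc_score(y_true, y_score)         # ranking quality, threshold-free

print("precision=%.2f recall=%.2f roc_auc=%.3f" % (precision, recall, auc))
```

ROC AUC is computed from the scores rather than the thresholded predictions, which is why it can look reasonable even when recall at a fixed threshold is low.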

<TABLE>
<TR><TH> label</TH><TH> model</TH><TH> alpha</TH><TH> type</TH><TH> preprocessor</TH><TH> tokenizer</TH><TH> max_features</TH><TH> stop_words</TH><TH> lowercase</TH><TH> strip_accents</TH><TH> score_type</TH><TH> score </TH></TR>
<TR><TD> toxic</TD><TD> bern</TD><TD> 1</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> 10000</TD><TD> english</TD><TD> TRUE</TD><TD> unicode</TD><TD> recall</TD><TD> 0.886398 </TD></TR>
<TR><TD> severe_toxic</TD><TD> bern</TD><TD> 0.5</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> 10000</TD><TD> english</TD><TD> TRUE</TD><TD> None</TD><TD> recall</TD><TD> 0.953782 </TD></TR>
<TR><TD> obscene</TD><TD> bern</TD><TD> 1</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 10000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> recall</TD><TD> 0.897975 </TD></TR>
<TR><TD> threat</TD><TD> bern</TD><TD> 0.5</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> 3000</TD><TD> None</TD><TD> TRUE</TD><TD> None</TD><TD> recall</TD><TD> 0.868421 </TD></TR>
<TR><TD> insult</TD><TD> bern</TD><TD> 1</TD><TD> tfidf</TD><TD> 0</TD><TD> 0</TD><TD> 10000</TD><TD> english</TD><TD> TRUE</TD><TD> unicode</TD><TD> recall</TD><TD> 0.885738 </TD></TR>
<TR><TD> identity_hate</TD><TD> bern</TD><TD> 0.5</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 4000</TD><TD> None</TD><TD> FALSE</TD><TD> None</TD><TD> recall</TD><TD> 0.88785 </TD></TR>
<TR><TH> Average</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> 0.896694 </TH></TR>
</TABLE>

<TABLE>
<TR><TH> label</TH><TH> model</TH><TH> alpha</TH><TH> type</TH><TH> preprocessor</TH><TH> tokenizer</TH><TH> max_features</TH><TH> stop_words</TH><TH> lowercase</TH><TH> strip_accents</TH><TH> score_type</TH><TH> score </TH></TR>
<TR><TD> toxic</TD><TD> bern</TD><TD> 15</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 4000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> roc_auc</TD><TD> 0.858937 </TD></TR>
<TR><TD> severe_toxic</TD><TD> bern</TD><TD> 2</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 4000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> roc_auc</TD><TD> 0.93949 </TD></TR>
<TR><TD> obscene</TD><TD> bern</TD><TD> 10</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 4000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> roc_auc</TD><TD> 0.888569 </TD></TR>
<TR><TD> threat</TD><TD> bern</TD><TD> 0.5</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 4000</TD><TD> None</TD><TD> FALSE</TD><TD> None</TD><TD> roc_auc</TD><TD> 0.899946 </TD></TR>
<TR><TD> insult</TD><TD> bern</TD><TD> 10</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 3000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> roc_auc</TD><TD> 0.872004 </TD></TR>
<TR><TD> identity_hate</TD><TD> bern</TD><TD> 1</TD><TD> tfidf</TD><TD> 1</TD><TD> 0</TD><TD> 4000</TD><TD> english</TD><TD> FALSE</TD><TD> None</TD><TD> roc_auc</TD><TD> 0.886601 </TD></TR>
<TR><TH> Average</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> &nbsp;</TH><TH> 0.8909245 </TH></TR>
</TABLE>
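
The Average rows are the unweighted mean of the six per-label scores. As a quick check, the snippet below recomputes them from the values copied out of the three tables.

```python
from statistics import mean

# Per-label scores copied from the three results tables, in table order:
# toxic, severe_toxic, obscene, threat, insult, identity_hate.
scores = {
    "precision": [1, 0.8, 1, 1, 1, 0.884615],
    "recall": [0.886398, 0.953782, 0.897975, 0.868421, 0.885738, 0.88785],
    "roc_auc": [0.858937, 0.93949, 0.888569, 0.899946, 0.872004, 0.886601],
}

# Unweighted mean across labels; rare labels count as much as common ones.
averages = {name: mean(values) for name, values in scores.items()}

for name, value in averages.items():
    print("%s: %.6f" % (name, value))
```

Because the mean is unweighted, a rare label such as `threat` influences the average as much as `toxic` does.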
