# You're Toxic, I'm Slippin' Under: Toxic Comment Classification Challenge

#### STINTSY S13 Group 8
- VICENTE, Francheska Josefa
- VISTA, Sophia Danielle S.

## Import Libraries
Before starting, the libraries and files needed to build and train the models are loaded into the notebook.

#### Basic Libraries 
- `numpy` contains a large collection of mathematical functions
- `pandas` contains functions that are designed for data manipulation and data analysis

In [1]:
import numpy as np
import pandas as pd

#### Natural Language Processing Libraries 
- `re` is a module that allows the use of regular expressions
- `nltk` provides functions for processing text data
- `stopwords` is a corpus from NLTK, which includes a compiled list of stopwords
- `Counter` is from Python's `collections` module, which counts hashable items and is helpful for counting tokens
- `string` contains functions for string operations
- `TfidfVectorizer` converts the given text documents into a matrix of TF-IDF features
- `CountVectorizer` converts the given text documents into a matrix of token counts

In [2]:
import re
import nltk
import string

from nltk.corpus import stopwords
from collections import Counter
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from nltk.tokenize.casual import TweetTokenizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#### Machine Learning Libraries
The following code block installs **scikit-multilearn** without restarting Jupyter Notebook. The `sys` module exposes `sys.executable`, the path of the Python interpreter running the notebook, so that `pip` installs the package into the same environment.

In [3]:
import sys
!{sys.executable} -m pip install scikit-multilearn



The following multi-label classification classes make it possible to use a single model that assigns more than one class to an instance.
- `ClassifierChain` chains binary classifiers so that each prediction depends on the predictions for the earlier classes
- `BinaryRelevance` uses binary classifiers to classify the classes independently
- `MultiOutputClassifier` fits one classifier per target class 
- `OneVsRestClassifier` fits one binary classifier per class, treating that class as positive and all other classes as negative

In [4]:
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier

The following classes are classifiers that implement different methods of classification.
- `RandomForestClassifier` is a class under the ensemble module that fits a number of decision trees on sub-samples of the data and combines their predictions
- `GradientBoostingClassifier` is a class under the ensemble module that optimizes arbitrary differentiable loss functions
- `AdaBoostClassifier` is a class under the ensemble module that implements AdaBoost-SAMME
- `MultinomialNB` is a class under the Naive Bayes module that allows the classification of discrete features
- `LogisticRegression` is a class under the linear models module that implements regularized logistic regression
- `SGDClassifier` is a class under the linear models module that implements regularized linear models with stochastic gradient descent (SGD) learning

In [5]:
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

Meanwhile, the following classes are used for hyperparameter tuning.
- `ParameterGrid` is a class that allows the iteration over different combinations of parameter values 
- `GridSearchCV` is a cross-validation class that allows the exhaustive search over all possible combinations of hyperparameter values
- `RandomizedSearchCV` is a cross-validation class that allows a random search over some possible combinations of hyperparameter values
- `train_test_split` divides the dataset into two subsets

In [6]:
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

And lastly, these functions compute different scores that measure how well a model works.
- `log_loss` computes the Logistic loss given the true values and the predicted values
- `f1_score` computes the balanced F-score by comparing the actual classes and the predicted classes
- `accuracy_score` computes the accuracy by determining how many classes were correctly predicted

In [7]:
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

The `warnings` module is used to suppress any `ConvergenceWarning` raised during hyperparameter tuning. Parameter combinations that fail to converge are not expected to be chosen because of their low accuracy scores, so the warnings would only clutter the output.

In [8]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category = ConvergenceWarning)

### Load Files
The CSV files loaded here contain the datasets that have already gone through the data cleaning and preprocessing techniques discussed in the main notebook.

In [9]:
train = pd.read_csv('cleaned_data/cleaned_train.csv')
test = pd.read_csv('cleaned_data/cleaned_test.csv')

## Initialize Datasets
Before using these datasets, we would need to convert the values in the `comment_text` column into either "str, unicode or file objects", according to the documentation of TF-IDF vectorizer and Count vectorizer.

In [10]:
test ['comment_text'] = test ['comment_text'].apply(lambda x: np.str_(x))
train ['comment_text'] = train ['comment_text'].apply(lambda x: np.str_(x))

Then, we would be declaring our **X_train**, **y_train**, and **X_test**.

In [11]:
X_train = train ['comment_text']
y_train = train.loc [ : , 'toxic' : ]

X_test = test ['comment_text']

Afterwards, we would be declaring the different classes that our model would need to predict. These can be found in the **train** data's column names.

In [12]:
classes = train.columns [2:]

## Vectorizing Data
As explained in the **Feature Engineering** part of the main notebook, three types of vectorizers would be used: (1) Count Vectorizer, (2) TF-IDF Vectorizer, and (3) Average Word2Vec Vectors.

Two versions of the Count and TF-IDF vectorizers were created with the more complex estimators in mind: one with no **max_features** limit, and one with **max_features** set to 5000. Limiting the number of features reduces the time and memory needed to train the estimators, which lessens the burden on our machines.

#### Count Vectorizer

In [13]:
count_vectorizer = CountVectorizer()                     # creating the Vectorizer with no max features
count_train = count_vectorizer.fit_transform(X_train)    # fitting the vectorizer according to the train data, and then
                                                         # returning the transformed train data
count_test = count_vectorizer.transform(X_test)          # returning the transformed test data

In [14]:
count_vectorizer_5000 = CountVectorizer(max_features = 5000)     # creating the Vectorizer with max features = 5000
count_train_5000 = count_vectorizer_5000.fit_transform(X_train)  # fitting the vectorizer according to the train data, and then
                                                                 # returning the transformed train data
count_test_5000 = count_vectorizer_5000.transform(X_test)        # returning the transformed test data

#### TF-IDF Vectorizer

In [15]:
tfidf_vectorizer = TfidfVectorizer()                    # creating the Vectorizer with no max features
tfidf_train = tfidf_vectorizer.fit_transform(X_train)   # fitting the vectorizer according to the train data, and then
                                                        # returning the transformed train data
tfidf_test = tfidf_vectorizer.transform(X_test)         # returning the transformed test data

In [16]:
tfidf_vectorizer_5000 = TfidfVectorizer(max_features = 5000)    # creating the Vectorizer with max features = 5000
tfidf_train_5000 = tfidf_vectorizer_5000.fit_transform(X_train) # fitting the vectorizer according to the train data, and then
                                                                # returning the transformed train data
tfidf_test_5000 = tfidf_vectorizer_5000.transform(X_test)       # returning the transformed test data
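
To make the effect of **max_features** concrete, the following is a minimal sketch on a hypothetical three-document corpus (the `docs` list is made up for illustration and is not part of the dataset). It shows that the capped vectorizer keeps only the most frequent tokens, and that Count and TF-IDF matrices share the same shape while holding different values.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned comments
docs = [
    "you are great",
    "you are awful",
    "you are great article",
]

# Unlimited vocabulary: one column per distinct token
count_all = CountVectorizer().fit(docs)
# Capped vocabulary: keep only the 3 most frequent tokens across the corpus
count_top3 = CountVectorizer(max_features=3).fit(docs)

vocab_all = sorted(count_all.vocabulary_)
vocab_top3 = sorted(count_top3.vocabulary_)

# Raw counts versus TF-IDF weights for the same documents
counts = count_all.transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(vocab_all)
print(vocab_top3)
print(counts.shape, tfidf.shape)
```

Tokens outside the capped vocabulary are simply ignored at transform time, which is why the 5000-feature matrices are cheaper to train on.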

#### Average Word2Vec Vectors

Before building a Word2Vec model, the data must be tokenized to produce a list of lists of tokens as indicated in Gensim's documentation. 

In [19]:
def tokenize(data):
    t = TweetTokenizer()                     # initialize tokenizer
    tokens_list = []                         # initialize empty list
    
    for text in data:
        tokens_list += [t.tokenize(text)]    # add tokenized sentence to list
        
    return tokens_list

The train set is tokenized using NLTK's `TweetTokenizer`.

In [20]:
tokens_train = tokenize(X_train)

The Word2Vec model is trained using this tokenized list, which would transform these words into word vectors.

In [21]:
wrd2v_model = Word2Vec(tokens_train, epochs=30, sg=0, workers=4)

To transform the word vectors into usable features, the vectors of the words in each comment that appear in the model's vocabulary are averaged, giving one fixed-length vector per comment.

In [22]:
def vectorize_word2vec(model, tokens_list):
    vectors = []
    
    for tokens in tokens_list:                    # iterate through each sentence
        feat = np.zeros(model.wv.vector_size)     # initializes the array that will hold the summed vectors
        count = 0                                 # initializes the word count for a sentence
        
        for token in tokens:                      # iterate through each word in the sentence
            if token in model.wv.key_to_index:    # if the word is in the model's vocabulary (dict lookup)...
                feat += model.wv[token]           # ...add its word vector to the running sum and...
                count += 1                        # ...update the word count
        
        if count > 0:                             # if the sentence contains at least 1 word in the model...
            feat /= count                         # ...divide the summed vectors by the word count to get the average
            
        vectors.append(feat)                      # add the averaged vector to the list
        
    return vectors

Using the defined function, the train data can be vectorized using the word vectors.

In [23]:
wrd2v_train = vectorize_word2vec(wrd2v_model, tokens_train)

The test data is also vectorized as follows:

In [24]:
tokens_test = tokenize(X_test)
wrd2v_test = vectorize_word2vec(wrd2v_model, tokens_test)
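
The averaging step can be checked by hand on a small case. The sketch below uses a plain dictionary of hypothetical 3-dimensional vectors (`toy_vectors`) in place of a trained Word2Vec model, and a hypothetical helper `average_vector`, so it only illustrates the arithmetic and not Gensim's API.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model's vocabulary
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "bad": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size):
    """Average the vectors of the tokens found in `vectors`; zeros if none are found."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# "unseen" is not in the vocabulary, so only "good" and "bad" are averaged
doc_vec = average_vector(["good", "bad", "unseen"], toy_vectors, size=3)
# A comment with no known words falls back to a zero vector
empty_vec = average_vector(["unseen"], toy_vectors, size=3)

print(doc_vec)
print(empty_vec)
```

Out-of-vocabulary words do not contribute to the sum or the count, so test comments made up entirely of unseen words end up as all-zero feature vectors.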

## Training and Tuning Different Models <a class="anchor" id="toc"></a>
As an experiment to find the model with the highest ROC AUC score in the test set, we train and predict with different models. Two approaches were tested: creating an array of models (i.e., fitting one classifier for each of the labels), and utilizing scikit-learn's pre-made multi-label classifiers to create one classifier for the whole task. 

Logistic Regression, Multinomial Naive Bayes, and Random Forest Classifiers are mostly used as base models, as these are common models for text classification. The experiment was expanded further by testing how the type of feature fed to the model affects the accuracy score. This was done by training each model on different vectorized inputs, such as a TF-IDF vector for one classifier and a Count vector for another. 

#### Variable and Function Declarations
* [**Helper Functions**](#functs)
* [**Hyperparameters**](#params)

### Six Single-Label Classifiers
* [**Logistic Regression**](#lr)
* [**Multinomial Naive Bayes**](#mn)
* [**Random Forest Classifier**](#rf)
* [**Gradient Boosting Classifier**](#gbc)
* [**eXtreme Gradient Boosting Classifier**](#xgb)
* [**AdaBoost Classifier**](#adb)
* [**Stochastic Gradient Descent Classifier**](#sgd)

### Multi-Label Classifiers
* [**OneVsRest Classifier: Logistic Regression**](#oc_lr)
* [**OneVsRest Classifier: Multinomial Naive Bayes**](#oc_mn)
* [**MultiOutput Classifier: Logistic Regression**](#mo_lr)
* [**MultiOutput Classifier: Multinomial Naive Bayes**](#mo_mn)

### Declaring Helper Functions <a class="anchor" id="functs"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
Helper functions that are used repeatedly throughout the notebook are declared and discussed here.

#### Submission Template Functions
The following `to_submission_csv` functions are used to create CSV files with the correct submission template. The first function is used by almost all models, while a modified version is used for the MultiOutput Classifier.

In [18]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id']

In [19]:
def to_submission_csv(predictions, filename):
    for i in range (6):
        sample_submission[classes [i]] = predictions[:, i : i + 1]

    sample_submission.to_csv(f'results/{filename}.csv', index = False)

In [20]:
def to_submission_csv_multiclass(predictions, filename):
    for i in range (6):
        temp = list(zip(*predictions[i]))
        sample_submission[classes [i]] = temp[1]

    sample_submission.to_csv(f'results/{filename}.csv', index = False)
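
The modified function exists because scikit-learn's `MultiOutputClassifier.predict_proba` returns a list with one two-column probability array per label, rather than a single matrix. The following sketch, using hypothetical probabilities (`per_label_probs`), shows what the `zip` call extracts and an equivalent NumPy formulation; it is an illustration, not a replacement for the function above.

```python
import numpy as np

# Hypothetical output shaped like a list of per-label probability arrays:
# one (n_samples, 2) array per label, with columns [P(label=0), P(label=1)]
per_label_probs = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),   # label 0
    np.array([[0.6, 0.4], [0.7, 0.3]]),   # label 1
]

# The zip approach: transpose the rows of one array and take the second column
zipped_first_label = list(zip(*per_label_probs[0]))[1]

# Equivalent NumPy approach: keep the positive-class column of every array
# and place the columns side by side, giving an (n_samples, n_labels) matrix
positive_probs = np.column_stack([p[:, 1] for p in per_label_probs])

print(zipped_first_label)
print(positive_probs)
```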

#### Display Functions
The `update_results` function computes the final test accuracy of each submission and adds the results to the running `all_results` DataFrame for display.

In [21]:
index = ['model', 'vector', 'tuned', 'private', 'public', 'test accuracy']
all_results = pd.DataFrame(index=index)

def update_results(all_results, results):
    for result in results:
        # private score accounts for 90% of the test data, while the remaining 10% is the public score
        results[result] += [round((results[result][3]*9 + results[result][4]) / 10, 5)]

    return pd.concat([pd.DataFrame(results, index=index), all_results], axis=1)
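
As a worked example of the weighting used in `update_results`, the sketch below applies the same formula to two pairs of private and public scores taken from the Logistic Regression results in this notebook. The helper name `weighted_test_score` is hypothetical and only isolates the arithmetic.

```python
def weighted_test_score(private, public):
    """Combine the scores as in update_results: private counts for 90%, public for 10%."""
    return round((private * 9 + public) / 10, 5)

# Untuned Logistic Regression with TF-IDF vectors
score_tfidf = weighted_test_score(0.97558, 0.97621)
# Tuned Logistic Regression with Count vectors
score_count_tuned = weighted_test_score(0.91450, 0.91797)

print(score_tfidf)
print(score_count_tuned)
```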

#### Training and Tuning Functions

As the task requires us to give predictions for six classes, the `train_models` function trains one classifier per class. For each class, the function fits the classifier using the passed train set, computes the training accuracy, and then predicts the probabilities for the passed test set.

The function will return the trained models and their predictions.

In [22]:
from sklearn.base import clone

def train_models(model, X_train, X_test):
    """Trains one model per class using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted

    Returns
    -------
    models
        a list of fitted estimator objects, one per class
    test_predictions
        an array of predicted probabilities with one column per class
    """
    
    test_predictions = np.zeros((len(test), len(classes)))                  # array that will hold the test predictions
    models = []                                                             # list that will hold the fitted models
    train_accuracy = []                                                     # list that will hold the training accuracies
    
    print('Fitting', str(model) + '...')
    
    for i in range(6):                                                      # loop for each of six classes
        
        fitted = clone(model)                                               # fresh copy, so each class keeps its own model
        fitted.fit(X_train, y_train[classes[i]])                            # fit the model
        
        train_predictions = fitted.predict(X_train)                         # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], train_predictions)   # get training accuracy 
        print(classes[i] + ':', accuracy)
        
        test_predictions[:,i] = fitted.predict_proba(X_test)[:,1]           # predict using test data
        
        models += [fitted]
        train_accuracy += [accuracy]
    
    print('\nOverall training accuracy:', np.mean(train_accuracy))
    
    return models, test_predictions
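
One detail worth checking in any per-class training loop is whether the returned list holds six distinct fitted models or six references to one estimator that was refit each time. The sketch below illustrates the difference on hypothetical toy data (`X` and `labels` are made up), using `sklearn.base.clone` to create an unfitted copy with the same parameters.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two label columns with opposite trends
X = np.array([[0.0], [1.0], [2.0], [3.0]])
labels = {
    "label_a": np.array([0, 0, 1, 1]),
    "label_b": np.array([1, 1, 0, 0]),
}

base = LogisticRegression()

shared = []      # stores the same object, refit on every pass
separate = []    # stores an independent fitted copy per label

for name, y in labels.items():
    base.fit(X, y)
    shared.append(base)

    fitted = clone(base).fit(X, y)   # clone copies the parameters, not the fitted state
    separate.append(fitted)

# Every entry in `shared` is the same object, holding only the last fit
same_object = shared[0] is shared[1]
# The cloned models keep their own coefficients (opposite signs here)
coef_signs = [float(np.sign(m.coef_[0, 0])) for m in separate]

print(same_object)
print(coef_signs)
```

The test predictions are unaffected either way, since they are computed right after each fit; the distinction only matters when the stored models are reused later.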

Similarly, the `tune_and_train_models` function will train multiple classifiers with the addition of hyperparameter tuning to achieve a better training accuracy. Hyperparameter tuning will be done using a `GridSearchCV` for a more comprehensive search.

In [23]:
def tune_and_train_models(model, hyperparameters, X_train, X_test, scoring='accuracy', cv=2):
    """Tunes six models using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    hyperparameters : dict or list of dicts
        the grid of hyperparameter values used for tuning the model  
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted
    scoring : 
        the metric for deciding the best combination of parameters
    cv : 
        the number of cross-validation folds used by the grid search

    Returns
    -------
    models
        a list of fitted estimator objects
    test_predictions
        a list of prediction probabilities by the fitted model
    """
    
    test_predictions = np.zeros((len(test), len(classes)))                  # initialize empty list for predictions
    models = []                                                             # initialize empty list for models
    train_accuracy = []
    
    print('Tuning', str(model) + '...')
    
    for i in range(6):                                                              # loop for each of six classes
        model_cv = GridSearchCV(model, hyperparameters, 
                                cv=cv, scoring=scoring)
        model_cv.fit(X_train, y_train[classes[i]])
        
        train_predictions = model_cv.predict(X_train)                               # predict using train data
        accuracy = accuracy_score(train_predictions, y_train[classes[i]])           # get training accuracy 
        print(classes[i] + ':', accuracy, model_cv.best_params_)
        
        test_predictions[:,i] = model_cv.predict_proba(X_test)[:,1]                 # predict using test data
        
        models += [model_cv.best_estimator_]
        train_accuracy += [accuracy]
    
    print('\nOverall training accuracy:', np.mean(train_accuracy))
    
    return models, test_predictions

Multi-label classifiers would follow the same pipeline of training the model, getting the training predictions, and predicting the classes of the test data. As such, the `train_model` and `tune_and_train_model` functions will forgo the loop and proceed to train and/or tune the model using the whole **y_train**.

In [24]:
def train_model(model, X_train, X_test, dense=False):
    """Trains a model using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted
    dense : 
        a boolean value indicating if the predictions need to be converted to dense

    Returns
    -------
    model
        a fitted estimator object
    test_predictions
        a list of prediction probabilities by the fitted model
    """
    
    print('Fitting', str(model) + '...')
    
    model.fit(X_train, y_train)                                               # fit the model
    train_predictions = model.predict(X_train)                                # predict using train data
    
    if dense:                                           
        train_predictions = train_predictions.todense()                       # convert predictions to dense
        
    accuracy = accuracy_score(train_predictions, y_train)                     # get training accuracy 
    print(accuracy)                                                        
    
    test_predictions = model.predict_proba(X_test)                            # predict using test data
    
    return model, test_predictions
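
When `accuracy_score` is given a two-dimensional multi-label target, as in `train_model`, scikit-learn computes subset accuracy: a comment counts as correct only if all of its labels are predicted correctly. This is stricter than the per-class accuracies printed by the single-label functions. The sketch below reproduces both quantities with NumPy on hypothetical label matrices, so the numbers from the two approaches are not directly comparable.

```python
import numpy as np

# Hypothetical true and predicted label matrices: 4 comments, 3 labels
y_true = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [1, 0, 0],   # one label missed in this row
                   [0, 0, 1]])

# Subset accuracy: a row counts only if every label matches
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Per-label accuracy: each label column scored on its own, then averaged
per_label_accuracy = np.mean(y_true == y_pred, axis=0)
mean_label_accuracy = per_label_accuracy.mean()

print(subset_accuracy)
print(per_label_accuracy, mean_label_accuracy)
```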

As with the previous functions, the `tune_and_train_model` function will tune a single multi-label classifier using a `GridSearchCV` to increase the training accuracy.

In [25]:
def tune_and_train_model(model, hyperparameters, X_train, X_test, scoring='accuracy', dense=False, cv=2):
    """Tunes a model using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    hyperparameters : dict or list of dicts
        the grid of hyperparameter values used for tuning the model  
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted
    scoring : 
        the metric for deciding the best combination of parameters
    dense : 
        a boolean value indicating if the predictions need to be converted to dense

    Returns
    -------
    model_cv
        the fitted GridSearchCV object
    test_predictions
        a list of prediction probabilities by the fitted model
    """
    
    print('Tuning', str(model) + '...')

    model_cv = GridSearchCV(model, hyperparameters, 
                            cv=cv, scoring=scoring)
    model_cv.fit(X_train, y_train)

    train_predictions = model_cv.predict(X_train)                                # predict using train data
    
    if dense:                                           
        train_predictions = train_predictions.todense()                          # convert predictions to dense
        
    accuracy = accuracy_score(train_predictions, y_train)                        # get training accuracy 
    print(accuracy, model_cv.best_params_)                                                        
    
    test_predictions = model_cv.predict_proba(X_test)                            # predict using test data
    
    return model_cv, test_predictions

### Declaring Hyperparameter Values <a class="anchor" id="params"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
As hyperparameters for each base estimator will remain constant, these will be declared here.

#### Logistic Regression Hyperparameters <a class="anchor" id="param_lr"></a>
Tuning Logistic Regression models mostly involves altering `C`, the inverse of the regularization strength, and the maximum number of iterations. 

For the `C` value, powers of 10 around the default value (1) were tested to see if stronger regularization (smaller `C`) or weaker regularization (larger `C`) affects the results.

During earlier testing stages, the default maximum number of iterations (100) resulted in a ConvergenceWarning, so higher values were also considered.

In [26]:
parameters_lr = [{
    'C' : [0.01, 0.1, 1, 10],
    'max_iter' : [50, 100, 300, 600, 900] 
}]
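
To get a feel for the cost of an exhaustive search over this grid, the number of model fits can be counted directly. The sketch below enumerates the combinations with `itertools.product` and assumes the settings used by `tune_and_train_models` in this notebook: six labels, `cv=2`, and `GridSearchCV`'s default behavior of refitting the best combination once at the end.

```python
from itertools import product

# Same values as the `parameters_lr` grid
grid = {
    'C': [0.01, 0.1, 1, 10],
    'max_iter': [50, 100, 300, 600, 900],
}

# Every combination an exhaustive grid search would try
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

n_labels = 6   # one search per label
n_folds = 2    # default `cv` in `tune_and_train_models`

# Cross-validation fits, plus one final refit of the best combination per label
cv_fits = len(combinations) * n_folds * n_labels
total_fits = cv_fits + n_labels

print(len(combinations), cv_fits, total_fits)
```

This count helps explain why the tuning cells take far longer than the plain training cells.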

The OneVsRest and MultiOutput classifiers wrap the base estimator, so the same grid is declared again as a separate variable with the `estimator__` prefix on each parameter name.

In [27]:
parameters_lr_mo = [{
    'estimator__C': [0.01, 0.1, 1, 10],           
    'estimator__max_iter': [50, 100, 300, 600, 900] 
}]

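The scikit-multilearn classifiers, such as `BinaryRelevance`, also require their own format: the base estimator is passed under `classifier` and its parameters take the `classifier__` prefix, as seen below.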

In [28]:
parameters_lr_multi = [{
    'classifier': [LogisticRegression()],
    'classifier__C': [0.01, 0.1, 1, 10],            
    'classifier__max_iter': [50, 100, 300, 600, 900] 
}]
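
The prefixed grids differ from the base grid only in their key names, so they could also be derived programmatically. The following is a minimal sketch with a hypothetical helper, `prefix_grid`, which is not part of scikit-learn or scikit-multilearn; it covers the renaming only and does not add the extra `'classifier'` entry that the scikit-multilearn format needs.

```python
def prefix_grid(param_grid, prefix):
    """Return a copy of a list-of-dicts grid with `prefix__` added to every key."""
    return [
        {f'{prefix}__{name}': values for name, values in grid.items()}
        for grid in param_grid
    ]

parameters_lr = [{
    'C': [0.01, 0.1, 1, 10],
    'max_iter': [50, 100, 300, 600, 900],
}]

# `estimator` is the name under which the scikit-learn wrappers hold the base model
derived_lr_mo = prefix_grid(parameters_lr, 'estimator')

print(derived_lr_mo)
```

The double underscore is scikit-learn's convention for addressing a parameter of a nested estimator, which is why `GridSearchCV` accepts these names for wrapped models.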

#### Multinomial Naive Bayes Hyperparameters <a class="anchor" id="param_mnb"></a>
For Multinomial Naive Bayes hyperparameters, we would be tuning the alpha and fit_prior hyperparameters. The value of alpha indicates the value that would be used as the additive smoothing, while the fit_prior determines if the class prior probabilities would be learned. 

As we would be experimenting with an n-classifier and 1-classifier approach, we would need to declare different sets of hyperparameters to take into account the needs of the classifiers. First, we would be declaring the hyperparameters that would be used by the n-classifier approach.

In [29]:
parameters_mnb = [{
    'alpha' : [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
    'fit_prior' : [True, False]
}]

Next, we would be declaring the hyperparameters to be used by the `MultiOutputClassifier` and `OneVsRestClassifier`.

In [30]:
parameters_mn_mo = [{
    'estimator__alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000], 
    'estimator__fit_prior': [True, False]
}]

Last are the hyperparameters used by the `ClassifierChain` and `BinaryRelevance` classifiers.

In [31]:
parameters_mn_multi = [{
    'classifier': [MultinomialNB()],
    'classifier__alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],  
    'classifier__fit_prior': [True, False]
}]

#### Random Forest Classifier Hyperparameters <a class="anchor" id="param_rf"></a>
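For Random Forest Classifier hyperparameters, the `n_estimators` parameter refers to the number of trees in the forest while `max_features` is the size of the random subsets of features to consider when splitting a node. The `max_depth` and `max_leaf_nodes` parameters are held at a single value each, which caps the depth and the number of leaves of every tree.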

In [201]:
parameters_rf = [{
    'n_estimators' : [100, 200],
    'max_features' : ['sqrt', 'log2'], 
    'max_depth' : [1000],
    'max_leaf_nodes' : [100]
}]

#### Gradient Boosting Classifier Hyperparameters <a class="anchor" id="param_gbc"></a>
For Gradient Boosting Classifier hyperparameters, the learning rate and the number of estimators are tuned.

In [33]:
parameters_gbc = [{
    'n_estimators' : [50, 100, 250],
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2],
}]

#### XGBoost Classifier Hyperparameters <a class="anchor" id="param_xgb"></a>
For XGBoost Classifier hyperparameters, only the learning rate was tuned to keep the tuning time manageable.

In [34]:
parameters_xgb = [{
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2],
}]

#### AdaBoost Classifier Hyperparameters <a class="anchor" id="param_adb"></a>
For AdaBoost Classifier hyperparameters, the learning rate and the number of estimators are tuned.

In [165]:
parameters_adb = {
    'n_estimators' : [10, 25, 50, 100],
    'learning_rate' : [0.01, 0.1, 1, 1.2]
}

#### SGDClassifier Hyperparameters <a class="anchor" id="param_sgd"></a>
For SGDClassifier hyperparameters, the loss function and the alpha are tuned.

In [36]:
parameters_sgd = [{
    'loss' : ['log', 'modified_huber'],    # note: 'log' is named 'log_loss' in scikit-learn 1.1 and later
    'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
}]

## Model Experimentation
It should be noted that all models used below will have these constant values for parameters, if applicable:
* `n_jobs = -1`, which would ensure that all CPU cores will be used for faster processing,
* `class_weight='balanced'`, which would take into account the imbalance between the classes, and
* `random_state=8`, which would ensure that the output can be reproduced
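
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count)`, so rarer classes receive proportionally larger weights. The sketch below works through that formula on a hypothetical label vector with eight negative and two positive examples.

```python
import numpy as np

# Hypothetical imbalanced binary label: 8 clean comments, 2 toxic ones
y = np.array([0] * 8 + [1] * 2)

n_samples = len(y)
n_classes = len(np.unique(y))

# Weight per class: n_samples / (n_classes * count of that class)
class_weights = n_samples / (n_classes * np.bincount(y))

print(class_weights)
```

In this example the minority class is weighted four times as heavily as the majority class, mirroring the 8-to-2 imbalance.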

### Logistic Regression <a class="anchor" id="lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
A `LogisticRegression()` object is first initialized to be used as the base classifier.

In [211]:
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')

The model is then trained using the Count vectorized train data.

In [212]:
%%time
lr_models_count, predictions_lr_count = train_models(lr, count_train, count_test)
to_submission_csv(predictions_lr_count, 'submission_lr_count')

Fitting LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9616847672822756
severe_toxic: 0.9756534708687669
obscene: 0.9744627783243822
threat: 0.9962963195066773
insult: 0.9615468976192416
identity_hate: 0.9769820330761855

Overall training accuracy: 0.9744377111129215
Wall time: 1min 2s


Next, the model will be trained using the TF-IDF vectorized data as shown below.

In [213]:
%%time
lr_models_tfidf, predictions_lr_tfidf = train_models(lr, tfidf_train, tfidf_test)
to_submission_csv(predictions_lr_tfidf, 'submission_lr_tfidf')

Fitting LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9574233413339517
severe_toxic: 0.9794260861936066
obscene: 0.9807421147952949
threat: 0.9937519975434133
insult: 0.9681019734162223
identity_hate: 0.9810241209242281

Overall training accuracy: 0.9767449390344529
Wall time: 52.8 s


Lastly, the Word2Vec vectorized data will be used to train the model.

In [214]:
%%time
lr_models_wrd2v, predictions_lr_wrd2v = train_models(lr, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_lr_wrd2v, 'submission_lr_wrd2v')

Fitting LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.8984088587525302
severe_toxic: 0.948117139079156
obscene: 0.92264258543219
threat: 0.9272236183266382
insult: 0.9160687092266139
identity_hate: 0.9084670773511477

Overall training accuracy: 0.9201546646947126
Wall time: 4min 6s


#### Hyperparameter Tuning
A `LogisticRegression()` object with the same constant parameters as before will serve as the base estimator, which will be tuned using the [`parameters_lr`](#param_lr) hyperparameters. This will first be tuned using Count vectors.

In [70]:
%%time
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_count_tuned, predictions_lr_count_tuned = tune_and_train_models(lr, parameters_lr, count_train, count_test)
to_submission_csv(predictions_lr_count_tuned, 'submission_lr_count_tuned')

Tuning LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9836123105075484 {'C': 10, 'max_iter': 300}
severe_toxic: 0.9899292477956521 {'C': 10, 'max_iter': 600}
obscene: 0.9879802721045804 {'C': 1, 'max_iter': 600}
threat: 0.9976499489255567 {'C': 10, 'max_iter': 100}
insult: 0.9784860657638292 {'C': 1, 'max_iter': 600}
identity_hate: 0.9898540461612699 {'C': 10, 'max_iter': 300}

Overall training accuracy: 0.9879186485430728
Wall time: 1h 17min 13s


Then, TF-IDF vectors will be used in training and tuning the model.

In [69]:
%%time
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_tfidf_tuned, predictions_lr_tfidf_tuned = tune_and_train_models(lr, parameters_lr, tfidf_train, tfidf_test)
to_submission_csv(predictions_lr_tfidf_tuned, 'submission_lr_tfidf_tuned')

Tuning LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9787116706669758 {'C': 10, 'max_iter': 100}
severe_toxic: 0.989340168326325 {'C': 10, 'max_iter': 300}
obscene: 0.990167386304529 {'C': 10, 'max_iter': 100}
threat: 0.998145026351906 {'C': 10, 'max_iter': 100}
insult: 0.9829918970238953 {'C': 10, 'max_iter': 300}
identity_hate: 0.9926177062248153 {'C': 10, 'max_iter': 100}

Overall training accuracy: 0.9886623091497412
Wall time: 28min 23s


Lastly, the model using Word2Vec vectors will be tuned.

In [71]:
%%time
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_wrd2v_tuned, predictions_lr_wrd2v_tuned = tune_and_train_models(lr, parameters_lr, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_lr_wrd2v_tuned, 'submission_lr_wrd2v_tuned')

Tuning LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.8991358078848913 {'C': 0.01, 'max_iter': 100}
severe_toxic: 0.9492012959748325 {'C': 0.01, 'max_iter': 100}
obscene: 0.9227365874751678 {'C': 10, 'max_iter': 100}
threat: 0.9283579096452362 {'C': 10, 'max_iter': 300}
insult: 0.9174724730684147 {'C': 0.01, 'max_iter': 100}
identity_hate: 0.9094948330210376 {'C': 0.01, 'max_iter': 100}

Overall training accuracy: 0.9210664845115968
Wall time: 1h 14min 5s


#### Test Accuracy Score
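As seen from the scores returned by Kaggle, all six Logistic Regression models scored above 0.91 on the test set. The untuned model trained with TF-IDF vectors gave the highest test score (0.97564). Tuning raised the training accuracy for every vector type, but the test score dropped for the Count and TF-IDF models (most sharply for Count, from 0.94785 to 0.91485) and rose only marginally for Word2Vec, which suggests that the tuned models overfit the training data.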

In [187]:
results = {
    "submission_lr_count": ['Logistic Regression', 'Count Vectors', 'Not Tuned', 0.94845, 0.94248],
    "submission_lr_tfidf": ['Logistic Regression', 'TF-IDF Vectors', 'Not Tuned', 0.97558, 0.97621], 
    "submission_lr_wrd2v": ['Logistic Regression', 'Word2Vec Vectors', 'Not Tuned', 0.95233, 0.94982],
    "submission_lr_count_tuned": ['Logistic Regression', 'Count Vectors', 'Tuned', 0.91450, 0.91797],
    "submission_lr_tfidf_tuned": ['Logistic Regression', 'TF-IDF Vectors', 'Tuned', 0.97135, 0.97227], 
    "submission_lr_wrd2v_tuned": ['Logistic Regression', 'Word2Vec Vectors', 'Tuned', 0.95240, 0.94993]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_lr_tfidf,Logistic Regression,TF-IDF Vectors,Not Tuned,0.97558,0.97621,0.97564
submission_lr_tfidf_tuned,Logistic Regression,TF-IDF Vectors,Tuned,0.97135,0.97227,0.97144
submission_lr_wrd2v_tuned,Logistic Regression,Word2Vec Vectors,Tuned,0.9524,0.94993,0.95215
submission_lr_wrd2v,Logistic Regression,Word2Vec Vectors,Not Tuned,0.95233,0.94982,0.95208
submission_lr_count,Logistic Regression,Count Vectors,Not Tuned,0.94845,0.94248,0.94785
submission_lr_count_tuned,Logistic Regression,Count Vectors,Tuned,0.9145,0.91797,0.91485
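The `test accuracy` column is not among the five values entered by hand for each submission. Judging only from the figures in the table above, it appears to match a 90/10 weighting of the private and public leaderboard scores. The sketch below reproduces that arithmetic; the function name `weighted_score` and the 0.9 weight are assumptions inferred from the numbers, not the notebook's own helper.

```python
def weighted_score(private, public, private_weight=0.9):
    """Combine leaderboard scores; the 0.9 weight is an assumption inferred from the table."""
    combined = private_weight * private + (1 - private_weight) * public
    return round(combined, 5)

# Private and public scores copied from the results dictionary above.
print(weighted_score(0.97558, 0.97621))  # untuned TF-IDF model
print(weighted_score(0.94845, 0.94248))  # untuned Count model
```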


### Multinomial Naive Bayes <a class="anchor" id="mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
A `MultinomialNB()` object with default parameters is initialized to serve as the base classifier.

In [219]:
mn = MultinomialNB()

The model will first be trained using the Count vectorized train data.

In [224]:
%%time
mn_models_count, predictions_mn_count = train_models(mn, count_train, count_test)
to_submission_csv(predictions_mn_count, 'submission_mn_count')

Fitting MultinomialNB()...
toxic: 0.9513696097661856
severe_toxic: 0.98641983819115
obscene: 0.9670867513520627
threat: 0.9955505699657206
insult: 0.9646301646289113
identity_hate: 0.9877233331871079

Overall training accuracy: 0.975463377848523
Wall time: 3.81 s
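The overall figure printed above is consistent with an unweighted mean of the six per-label scores. A quick check using only the numbers from this output (how `train_models` actually computes it is defined earlier in the notebook, so treat this as a consistency check rather than its implementation):

```python
from statistics import mean

# Per-label training accuracies copied from the output above.
label_scores = {
    "toxic": 0.9513696097661856,
    "severe_toxic": 0.98641983819115,
    "obscene": 0.9670867513520627,
    "threat": 0.9955505699657206,
    "insult": 0.9646301646289113,
    "identity_hate": 0.9877233331871079,
}

overall = mean(list(label_scores.values()))
print(f"Overall training accuracy: {overall:.6f}")
```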


Then, the model will be trained using the TF-IDF vectorized data.

In [225]:
%%time
mn_models_tfidf, predictions_mn_tfidf = train_models(mn, tfidf_train, tfidf_test)
to_submission_csv(predictions_mn_tfidf, 'submission_mn_tfidf')

Fitting MultinomialNB()...
toxic: 0.9236828747078103
severe_toxic: 0.9899104473870566
obscene: 0.9538449968979326
threat: 0.996973134216117
insult: 0.9535629907689994
identity_hate: 0.9911074067343063

Overall training accuracy: 0.968180308452037
Wall time: 3.37 s


#### Hyperparameter Tuning
A `MultinomialNB()` object with default parameters is declared for use as the base estimator, to be tuned by the defined [`parameters_mnb`](#param_mn) hyperparameters.

The first model to be tuned will be trained on Count vectors as follows:

In [63]:
%%time
mn = MultinomialNB()
mn_models_count_tuned, predictions_mn_count_tuned = tune_and_train_models(mn, parameters_mnb, count_train, count_test)
to_submission_csv(predictions_mn_count_tuned, 'submission_mn_count_tuned')

Tuning MultinomialNB()...
toxic: 0.9513696097661856 {'alpha': 1, 'fit_prior': True}
severe_toxic: 0.9901297854873379 {'alpha': 1000, 'fit_prior': True}
obscene: 0.9670867513520627 {'alpha': 1, 'fit_prior': True}
threat: 0.9970044682304429 {'alpha': 1000, 'fit_prior': True}
insult: 0.9646301646289113 {'alpha': 1, 'fit_prior': True}
identity_hate: 0.99123274279161 {'alpha': 1000, 'fit_prior': True}

Overall training accuracy: 0.9769089203760917
Wall time: 36.9 s
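The `alpha` values selected above control additive (Laplace/Lidstone) smoothing in Multinomial Naive Bayes: the smoothed likelihood of a word is (count + alpha) / (total + alpha * vocabulary size). As a rough illustration with made-up word counts (the vocabulary and counts are hypothetical), a very large `alpha` such as 1000 pushes every word toward the same probability, leaving the class prior to dominate the prediction.

```python
def smoothed_likelihoods(counts, alpha):
    """Return P(word | class) with additive smoothing for one class."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

# Hypothetical word counts for a single class.
toxic_counts = {"idiot": 40, "hello": 5, "thanks": 1}

for alpha in (0.1, 1, 1000):
    probs = smoothed_likelihoods(toxic_counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```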


Lastly, the model will be tuned and trained using TF-IDF vectors.

In [79]:
%%time
mn = MultinomialNB()
mn_models_tfidf_tuned, predictions_mn_tfidf_tuned = tune_and_train_models(mn, parameters_mnb, tfidf_train, tfidf_test)
to_submission_csv(predictions_mn_tfidf_tuned, 'submission_mn_tfidf_tuned')

Tuning MultinomialNB()...
toxic: 0.9617975697338489 {'alpha': 0.1, 'fit_prior': True}
severe_toxic: 0.9900044494300343 {'alpha': 10, 'fit_prior': True}
obscene: 0.9767501613701738 {'alpha': 0.1, 'fit_prior': True}
threat: 0.9970044682304429 {'alpha': 10, 'fit_prior': True}
insult: 0.9738611652493248 {'alpha': 0.1, 'fit_prior': True}
identity_hate: 0.9911951419744189 {'alpha': 10, 'fit_prior': True}

Overall training accuracy: 0.9817688259980407
Wall time: 23.4 s


#### Test Accuracy Score
As seen from the scores returned by Kaggle, the Naive Bayes models yielded lower scores than the Logistic Regression models. Of the four models, the untuned Count vector model scored the best at 0.84654.

In [188]:
results = {
    "submission_mn_count": ['Multinomial Naive Bayes', 'Count Vectors', 'Not Tuned', 0.84551, 0.85581],
    "submission_mn_tfidf": ['Multinomial Naive Bayes', 'TF-IDF Vectors', 'Not Tuned', 0.82510, 0.83586],
    "submission_mn_count_tuned": ['Multinomial Naive Bayes', 'Count Vectors', 'Tuned', 0.75966, 0.76208],
    "submission_mn_tfidf_tuned": ['Multinomial Naive Bayes', 'TF-IDF Vectors', 'Tuned', 0.82930, 0.83952]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_mn_count,Multinomial Naive Bayes,Count Vectors,Not Tuned,0.84551,0.85581,0.84654
submission_mn_tfidf_tuned,Multinomial Naive Bayes,TF-IDF Vectors,Tuned,0.8293,0.83952,0.83032
submission_mn_tfidf,Multinomial Naive Bayes,TF-IDF Vectors,Not Tuned,0.8251,0.83586,0.82618
submission_mn_count_tuned,Multinomial Naive Bayes,Count Vectors,Tuned,0.75966,0.76208,0.7599
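The gap between the training accuracies above and the Kaggle scores may be partly explained by the metric: the competition scores submissions with ROC AUC, whereas the training figures are accuracies. A small sketch with synthetic labels (not the competition data) shows how a model that gives every comment the same score can be highly accurate on a rare label while reaching an AUC of only 0.5. The `roc_auc` helper is a hand-written, rank-based implementation for illustration only.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# Synthetic rare label: 3 positives among 1000 comments.
y_true = [1] * 3 + [0] * 997

constant_scores = [0.01] * 1000          # same low score for every comment
ranked_scores = [0.4] * 3 + [0.1] * 997  # positives ranked higher, still below 0.5

for name, scores in [("constant", constant_scores), ("ranked", ranked_scores)]:
    print(name, accuracy(y_true, scores), roc_auc(y_true, scores))
```

Both score lists give the same accuracy because neither crosses the 0.5 threshold, yet only the second ranks the positives correctly.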


### RandomForestClassifier <a class="anchor" id="rf"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
The `RandomForestClassifier()` object uses default parameters except for `max_depth`, which is set to 1000, and `max_leaf_nodes`, which is set to 100. These limits keep the execution time manageable, since this type of classifier is slow to train, especially during hyperparameter tuning.

In [189]:
rf = RandomForestClassifier(max_depth=1000, max_leaf_nodes=100, n_jobs=-1, class_weight='balanced', random_state=8)

The models will first be trained using Count vectors.

In [190]:
%%time
rf_models_count, predictions_rf_count = train_models(rf, count_train, count_test)
to_submission_csv(predictions_rf_count, 'submission_rf_count')

Fitting RandomForestClassifier(class_weight='balanced', max_depth=1000,
                       max_leaf_nodes=100, n_jobs=-1, random_state=8)...
toxic: 0.7749403087027091
severe_toxic: 0.8696442336013436
obscene: 0.7857192096308226
threat: 0.9530303125254589
insult: 0.7856565416021708
identity_hate: 0.8484373727055669

Overall training accuracy: 0.8362379964613452
Wall time: 1min 8s


Afterwards, the models will be trained using TF-IDF vectors.

In [191]:
%%time
rf_models_tfidf, predictions_rf_tfidf = train_models(rf, tfidf_train, tfidf_test)
to_submission_csv(predictions_rf_tfidf, 'submission_rf_tfidf')

Fitting RandomForestClassifier(class_weight='balanced', max_depth=1000,
                       max_leaf_nodes=100, n_jobs=-1, random_state=8)...
toxic: 0.7776726347519286
severe_toxic: 0.881933434019966
obscene: 0.7912715969693741
threat: 0.9599049952685639
insult: 0.7929824341515689
identity_hate: 0.8813944889735603

Overall training accuracy: 0.8475265973558269
Wall time: 1min 12s


Lastly, the Word2Vec vectors will also be used to train the models.

In [192]:
%%time
rf_models_wrd2v, predictions_rf_wrd2v = train_models(rf, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_rf_wrd2v, 'submission_rf_wrd2v')

Fitting RandomForestClassifier(class_weight='balanced', max_depth=1000,
                       max_leaf_nodes=100, n_jobs=-1, random_state=8)...
toxic: 0.8827293179838441
severe_toxic: 0.965889792004813
obscene: 0.9096264358812065
threat: 0.9884440155166039
insult: 0.9103345846049721
identity_hate: 0.9449336032236434

Overall training accuracy: 0.9336596248691804
Wall time: 5min 38s


#### Hyperparameter Tuning
A `RandomForestClassifier()` object with a `max_depth` of 1000 and a `max_leaf_nodes` of 100 is initialized as the base estimator, to be tuned by the `parameters_rf` hyperparameters. This model will first be tuned using Count vectors.

In [202]:
%%time
rf = RandomForestClassifier(max_depth=1000, max_leaf_nodes=100, n_jobs=-1, class_weight='balanced', random_state=8)
rf_models_count_tuned, predictions_rf_count_tuned = tune_and_train_models(rf, parameters_rf, count_train, count_test)
to_submission_csv(predictions_rf_count_tuned, 'submission_rf_count_tuned')

Tuning RandomForestClassifier(class_weight='balanced', max_depth=1000,
                       max_leaf_nodes=100, n_jobs=-1, random_state=8)...
toxic: 0.7749403087027091 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
severe_toxic: 0.8686352783400493 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}
obscene: 0.7857192096308226 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
threat: 0.9531305813713018 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}
insult: 0.7856565416021708 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
identity_hate: 0.8497346008986595 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}

Overall training accuracy: 0.8363027534242855
Wall time: 26min 23s


After this, the model using TF-IDF vectors will be tuned.

In [203]:
%%time
rf = RandomForestClassifier(max_depth=1000, max_leaf_nodes=100, n_jobs=-1, class_weight='balanced', random_state=8)
rf_models_tfidf_tuned, predictions_rf_tfidf_tuned = tune_and_train_models(rf, parameters_rf, tfidf_train, tfidf_test)
to_submission_csv(predictions_rf_tfidf_tuned, 'submission_rf_tfidf_tuned')

Tuning RandomForestClassifier(class_weight='balanced', max_depth=1000,
                       max_leaf_nodes=100, n_jobs=-1, random_state=8)...
toxic: 0.7771399565083881 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}
severe_toxic: 0.8836004035821046 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}
obscene: 0.7912715969693741 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
threat: 0.9599049952685639 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
insult: 0.7929824341515689 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
identity_hate: 0.8796335173684442 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}

Overall training accuracy: 0.8474221506414072
Wall time: 26min 53s


Lastly, the Word2Vec-trained model will be tuned.

In [204]:
%%time
rf = RandomForestClassifier(max_depth=1000, max_leaf_nodes=100, n_jobs=-1, class_weight='balanced', random_state=8)
rf_models_wrd2v_tuned, predictions_rf_wrd2v_tuned = tune_and_train_models(rf, parameters_rf, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_rf_wrd2v_tuned, 'submission_rf_wrd2v_tuned')

Tuning RandomForestClassifier(class_weight='balanced', max_depth=1000,
                       max_leaf_nodes=100, n_jobs=-1, random_state=8)...
toxic: 0.8827293179838441 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
severe_toxic: 0.965889792004813 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
obscene: 0.9096264358812065 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 100}
threat: 0.9886132191939638 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}
insult: 0.9104285866479498 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}
identity_hate: 0.9453284118041498 {'max_depth': 1000, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'n_estimators': 200}

Overall training accuracy: 0.9337692939193212
Wall time: 28min 52s


#### Test Accuracy Score
As seen from the scores returned by Kaggle, the Random Forest models returned better results than Naive Bayes, but are still inferior to Logistic Regression models.

In [213]:
results = {
    "submission_rf_count": ['Random Forest Classifier', 'Count Vectors', 'Not Tuned', 0.90227, 0.90004],
    "submission_rf_tfidf": ['Random Forest Classifier', 'TF-IDF Vectors', 'Not Tuned', 0.89711, 0.89752],
    "submission_rf_wrd2v": ['Random Forest Classifier', 'Word2Vec Vectors', 'Not Tuned', 0.93870, 0.94112],
    "submission_rf_count_tuned": ['Random Forest Classifier', 'Count Vectors', 'Tuned', 0.90325, 0.90111],
    "submission_rf_tfidf_tuned": ['Random Forest Classifier', 'TF-IDF Vectors', 'Tuned', 0.89725, 0.89771],
    "submission_rf_wrd2v_tuned": ['Random Forest Classifier', 'Word2Vec Vectors', 'Tuned', 0.94065, 0.94117]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_rf_wrd2v_tuned,Random Forest Classifier,Word2Vec Vectors,Tuned,0.94065,0.94117,0.9407
submission_rf_wrd2v,Random Forest Classifier,Word2Vec Vectors,Not Tuned,0.9387,0.94112,0.93894
submission_rf_count_tuned,Random Forest Classifier,Count Vectors,Tuned,0.90325,0.90111,0.90304
submission_rf_count,Random Forest Classifier,Count Vectors,Not Tuned,0.90227,0.90004,0.90205
submission_rf_tfidf_tuned,Random Forest Classifier,TF-IDF Vectors,Tuned,0.89725,0.89771,0.8973
submission_rf_tfidf,Random Forest Classifier,TF-IDF Vectors,Not Tuned,0.89711,0.89752,0.89715


### GradientBoostingClassifier <a class="anchor" id="gbc"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
The `GradientBoostingClassifier()` object was initialized with default parameters.

In [173]:
gbc = GradientBoostingClassifier(random_state=8)

The model will first be trained using the Count vectorized train data.

In [174]:
%%time
gbc_models_count, predictions_gbc_count = train_models(gbc, count_train, count_test)
to_submission_csv(predictions_gbc_count, 'submission_gbc_count')

Fitting GradientBoostingClassifier(random_state=8)...
toxic: 0.943373169310213
severe_toxic: 0.9916902194007683
obscene: 0.9760106786320822
threat: 0.9974118104166797
insult: 0.968609584448302
identity_hate: 0.9942094741525715

Overall training accuracy: 0.9785508227267693
Wall time: 17min 33s


Next, the model will be trained using the TF-IDF vectorized data as shown below.

In [175]:
%%time
gbc_models_tfidf, predictions_gbc_tfidf = train_models(gbc, tfidf_train, tfidf_test)
to_submission_csv(predictions_gbc_tfidf, 'submission_gbc_tfidf')

Fitting GradientBoostingClassifier(random_state=8)...
toxic: 0.9443382569514511
severe_toxic: 0.9923106328844213
obscene: 0.9768378966102863
threat: 0.9978191526029165
insult: 0.9696498737239223
identity_hate: 0.9950366921307756

Overall training accuracy: 0.9793320841506289
Wall time: 38min 23s


#### Test Accuracy Score
As seen from the scores returned by Kaggle, the GradientBoostingClassifier models performed decently well, though inferior to some of the previous models.

In [211]:
results = {
    "submission_gbc_count": ['GradientBoostingClassifier', 'Count Vectors', 'Not Tuned', 0.92562, 0.93158],
    "submission_gbc_tfidf": ['GradientBoostingClassifier', 'TF-IDF Vectors', 'Not Tuned', 0.90490, 0.91988]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_gbc_count,GradientBoostingClassifier,Count Vectors,Not Tuned,0.92562,0.93158,0.92622
submission_gbc_tfidf,GradientBoostingClassifier,TF-IDF Vectors,Not Tuned,0.9049,0.91988,0.9064


### XGBClassifier <a class="anchor" id="xgb"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
For the `XGBClassifier`, the `objective` parameter was set to `binary:logistic`, a learning objective for binary classification that uses logistic regression, and the `eval_metric` was set to `auc`, which refers to the area under the ROC curve. This model was first trained on the Count-vectorized dataset.

In [168]:
%%time
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_count, predictions_xgb_count = train_models(xgb, count_train, count_test)
to_submission_csv(predictions_xgb_count, 'submission_xgb_count')

Fitting XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None,
              enable_categorical=False, eval_metric='auc', gamma=None,
              gpu_id=None, importance_type=None, interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, use_label_encoder=False,
              validate_parameters=None, verbosity=0)...
toxic: 0.9643732257114388
severe_toxic: 0.9943536106184708
obscene: 0.9863947709796893
threat: 0.9989910447387057
insult: 0.9798772959998997
identity_hate: 0.9955568367685858

Overall training accuracy: 0.9865911308027983
Wall time: 2min 33s
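With the `binary:logistic` objective used above, the boosted trees produce a raw margin that is passed through the logistic (sigmoid) function to give a probability, and training minimizes the log loss of that probability. A minimal sketch of the two formulas, using made-up margin values rather than anything produced by the model:

```python
import math

def sigmoid(margin):
    """Map a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def log_loss(y, p):
    """Negative log-likelihood of label y (0 or 1) under predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Illustrative margins only.
for margin in (-2.0, 0.0, 3.0):
    p = sigmoid(margin)
    print(margin, round(p, 4), round(log_loss(1, p), 4))
```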


Next, the model was trained on the TF-IDF vectorized data.

In [171]:
%%time
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_tfidf, predictions_xgb_tfidf = train_models(xgb, tfidf_train, tfidf_test)
to_submission_csv(predictions_xgb_tfidf, 'submission_xgb_tfidf')

Fitting XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None,
              enable_categorical=False, eval_metric='auc', gamma=None,
              gpu_id=None, importance_type=None, interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, use_label_encoder=False,
              validate_parameters=None, verbosity=0)...
toxic: 0.9667358103916125
severe_toxic: 0.9954127003026866
obscene: 0.9876167975383998
threat: 0.9993482525020211
insult: 0.9813186606588916
identity_hate: 0.9959391117433619

Overall training accuracy: 0.9877285555228289
Wall time: 8min 9s


#### Hyperparameter Tuning
An `XGBClassifier()` object with the parameters stated earlier was initialized to be tuned with the `parameters_xgb` hyperparameters. As with the previous models, the model will first be tuned and trained with the Count vectors.

In [None]:
%%time
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_count_tuned, predictions_xgb_count_tuned = tune_and_train_models(xgb, parameters_xgb, count_train, count_test)
to_submission_csv(predictions_xgb_count_tuned, 'submission_xgb_count_tuned')

Lastly, the model will be tuned and trained on the TF-IDF vectors.

In [None]:
%%time
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_tfidf_tuned, predictions_xgb_tfidf_tuned = tune_and_train_models(xgb, parameters_xgb, tfidf_train, tfidf_test)
to_submission_csv(predictions_xgb_tfidf_tuned, 'submission_xgb_tfidf_tuned')

#### Test Accuracy Score
As seen from the scores returned by Kaggle, the XGBClassifier models yielded high accuracy scores. While not the model with the highest test accuracy score, this model has the potential to perform better if a more thorough hyperparameter tuning is conducted.

In [180]:
results = {
    "submission_xgb_count": ['XGBClassifier', 'Count Vectors', 'Not Tuned', 0.96468, 0.96783],
    "submission_xgb_tfidf": ['XGBClassifier', 'TF-IDF Vectors', 'Not Tuned', 0.96502, 0.96803],
    "submission_xgb_count_tuned": ['XGBClassifier', 'Count Vectors', 'Tuned', 0.96282, 0.96673],
    "submission_xgb_tfidf_tuned": ['XGBClassifier', 'TF-IDF Vectors', 'Tuned', 0.96396, 0.96693],
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_xgb_tfidf,XGBClassifier,TF-IDF Vectors,Not Tuned,0.96502,0.96803,0.96532
submission_xgb_count,XGBClassifier,Count Vectors,Not Tuned,0.96468,0.96783,0.96499
submission_xgb_tfidf_tuned,XGBClassifier,TF-IDF Vectors,Tuned,0.96396,0.96693,0.96426
submission_xgb_count_tuned,XGBClassifier,Count Vectors,Tuned,0.96282,0.96673,0.96321


### AdaBoostClassifier <a class="anchor" id="adb"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
An `AdaBoostClassifier` is initialized with the default parameters.

In [162]:
adb = AdaBoostClassifier(random_state=8)

This model is trained with the Count vectors.

In [164]:
%%time
adb_models_count, predictions_adb_count = train_models(adb, count_train, count_test)
to_submission_csv(predictions_adb_count, 'submission_adb_count')

Fitting AdaBoostClassifier(random_state=8)...
toxic: 0.9454349474528579
severe_toxic: 0.9896159076523929
obscene: 0.9711977740316223
threat: 0.9969480670046562
insult: 0.9659211260191388
identity_hate: 0.9915084821176781

Overall training accuracy: 0.9767710507130577
Wall time: 4min 43s


Then, the model is trained on the TF-IDF vectors.

In [166]:
%%time
adb_models_tfidf, predictions_adb_tfidf = train_models(adb, tfidf_train, tfidf_test)
to_submission_csv(predictions_adb_tfidf, 'submission_adb_tfidf')

Fitting AdaBoostClassifier(random_state=8)...
toxic: 0.9480544710505041
severe_toxic: 0.9899981826271691
obscene: 0.9749139881306754
threat: 0.9969606006103866
insult: 0.9679327697388623
identity_hate: 0.9917278202179594

Overall training accuracy: 0.9782646387292595
Wall time: 9min 50s


#### Test Accuracy Score
As seen from the scores returned by Kaggle, the `AdaBoostClassifier` models yielded scores of about 0.94 (0.93607 and 0.93862), placing them among the higher-performing models on this dataset.

In [170]:
results = {
    "submission_adb_count": ['AdaBoostClassifier', 'Count Vectors', 'Not Tuned', 0.93539, 0.94218],
    "submission_adb_tfidf": ['AdaBoostClassifier', 'TF-IDF Vectors', 'Not Tuned', 0.93830, 0.94145],
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_adb_tfidf,AdaBoostClassifier,TF-IDF Vectors,Not Tuned,0.9383,0.94145,0.93862
submission_adb_count,AdaBoostClassifier,Count Vectors,Not Tuned,0.93539,0.94218,0.93607


### Stochastic Gradient Descent Classifier <a class="anchor" id="sgd"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

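#### Model Training
An `SGDClassifier` is initialized with the `modified_huber` loss and balanced class weights. It is then trained on the Count, TF-IDF, and Word2Vec vectors in turn.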


In [154]:
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
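The `class_weight='balanced'` setting used above reweights each class inversely to its frequency; scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python to a synthetic label column (the 5% positive rate is an assumption chosen for illustration).

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Synthetic labels: 50 positives among 1000 samples.
y = [1] * 50 + [0] * 950
print(balanced_class_weights(y))
```

The rare positive class receives a much larger weight, so each misclassified positive costs more during training.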

In [155]:
%%time
sgd_models_count, predictions_sgd_count = train_models(sgd, count_train, count_test)
to_submission_csv(predictions_sgd_count, 'submission_sgd_count')

Fitting SGDClassifier(class_weight='balanced', loss='modified_huber', n_jobs=-1,
              random_state=8)...
toxic: 0.9740930369553364
severe_toxic: 0.9773141736280402
obscene: 0.9366551566387377
threat: 0.9831673675041204
insult: 0.8818895663999098
identity_hate: 0.961321292716095

Overall training accuracy: 0.9524067656403732
Wall time: 11 s


In [156]:
%%time
sgd_models_tfidf, predictions_sgd_tfidf = train_models(sgd, tfidf_train, tfidf_test)
to_submission_csv(predictions_sgd_tfidf, 'submission_sgd_tfidf')

Fitting SGDClassifier(class_weight='balanced', loss='modified_huber', n_jobs=-1,
              random_state=8)...
toxic: 0.9546032800446196
severe_toxic: 0.9773705748538268
obscene: 0.9798020943655176
threat: 0.9951808285966748
insult: 0.9667295435887473
identity_hate: 0.9792819497277074

Overall training accuracy: 0.9754947118628489
Wall time: 5.58 s


In [182]:
%%time
sgd_models_wrd2v, predictions_sgd_wrd2v = train_models(sgd, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_sgd_wrd2v, 'submission_sgd_wrd2v')

Fitting SGDClassifier(class_weight='balanced', loss='modified_huber', n_jobs=-1,
              random_state=8)...
toxic: 0.9183560922724053
severe_toxic: 0.9547912841305751
obscene: 0.9135494544748106
threat: 0.974863853707754
insult: 0.9419694054684122
identity_hate: 0.9325002663391218

Overall training accuracy: 0.9393383927321798
Wall time: 39.7 s


#### Hyperparameter Tuning
The model will be tuned using `GridSearchCV` with `parameters_sgd`. It will first be tuned with Count vectors.

In [159]:
%%time
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
sgd_models_count_tuned, predictions_sgd_count_tuned = tune_and_train_models(sgd, parameters_sgd, count_train, count_test)
to_submission_csv(predictions_sgd_count_tuned, 'submission_sgd_count_tuned')

Tuning SGDClassifier(class_weight='balanced', loss='modified_huber', n_jobs=-1,
              random_state=8)...
toxic: 0.9524600334647273 {'alpha': 0.0001, 'loss': 'log'}
severe_toxic: 0.9767438945673086 {'alpha': 0.0001, 'loss': 'log'}
obscene: 0.9366551566387377 {'alpha': 0.0001, 'loss': 'modified_huber'}
threat: 0.9797770271540568 {'alpha': 0.0001, 'loss': 'log'}
insult: 0.8818895663999098 {'alpha': 0.0001, 'loss': 'modified_huber'}
identity_hate: 0.9657769895532397 {'alpha': 100, 'loss': 'log'}

Overall training accuracy: 0.9488837779629966
Wall time: 2min 27s
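The grid above switches between the `log` and `modified_huber` losses (recent scikit-learn releases name the former `log_loss`). For reference, both can be written in terms of the margin z = y * f(x) with labels y in {-1, +1}: log loss is log(1 + exp(-z)), while modified Huber is max(0, 1 - z)^2 for z >= -1 and -4z otherwise. The sketch below evaluates both at a few illustrative margins.

```python
import math

def log_loss_margin(z):
    """Logistic loss as a function of the margin z = y * f(x)."""
    return math.log1p(math.exp(-z))

def modified_huber(z):
    """Quadratically smoothed hinge loss, linear for badly misclassified points."""
    if z >= -1:
        return max(0.0, 1.0 - z) ** 2
    return -4.0 * z

# Illustrative margins only.
for z in (-3.0, -1.0, 0.0, 1.0, 2.0):
    print(z, round(log_loss_margin(z), 4), modified_huber(z))
```

Modified Huber is exactly zero once the margin reaches 1, whereas log loss keeps a small positive value for every finite margin.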


Next, it will be tuned with TF-IDF vectors.

In [160]:
%%time
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
sgd_models_tfidf_tuned, predictions_sgd_tfidf_tuned = tune_and_train_models(sgd, parameters_sgd, tfidf_train, tfidf_test)
to_submission_csv(predictions_sgd_tfidf_tuned, 'submission_sgd_tfidf_tuned')

Tuning SGDClassifier(class_weight='balanced', loss='modified_huber', n_jobs=-1,
              random_state=8)...
toxic: 0.9546032800446196 {'alpha': 0.0001, 'loss': 'modified_huber'}
severe_toxic: 0.9773705748538268 {'alpha': 0.0001, 'loss': 'modified_huber'}
obscene: 0.9798020943655176 {'alpha': 0.0001, 'loss': 'modified_huber'}
threat: 0.9962775190980817 {'alpha': 1, 'loss': 'modified_huber'}
insult: 0.9667295435887473 {'alpha': 0.0001, 'loss': 'modified_huber'}
identity_hate: 0.9911951419744189 {'alpha': 100, 'loss': 'log'}

Overall training accuracy: 0.977663025654202
Wall time: 1min 4s


Lastly, Word2Vec vectors were also tried.

In [183]:
%%time
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
sgd_models_wrd2v_tuned, predictions_sgd_wrd2v_tuned = tune_and_train_models(sgd, parameters_sgd, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_sgd_wrd2v_tuned, 'submission_sgd_wrd2v_tuned')

Tuning SGDClassifier(class_weight='balanced', loss='modified_huber', n_jobs=-1,
              random_state=8)...
toxic: 0.8969298932763472 {'alpha': 0.001, 'loss': 'modified_huber'}
severe_toxic: 0.9563642516497358 {'alpha': 0.0001, 'loss': 'log'}
obscene: 0.9164259169899293 {'alpha': 0.01, 'loss': 'modified_huber'}
threat: 0.9961396494350477 {'alpha': 100, 'loss': 'log'}
insult: 0.9419694054684122 {'alpha': 0.0001, 'loss': 'modified_huber'}
identity_hate: 0.9910572723113849 {'alpha': 100, 'loss': 'log'}

Overall training accuracy: 0.9498143981884762
Wall time: 7min 52s


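#### Test Accuracy Score
As seen from the scores returned by Kaggle, the untuned SGD model trained with TF-IDF vectors scored the best at 0.96901. Tuning improved the Count and Word2Vec models but lowered the score of the TF-IDF model.

In [185]: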
results = {
    "submission_sgd_count": ['SGDClassifier', 'Count Vectors', 'Not Tuned', 0.88323, 0.88505],
    "submission_sgd_tfidf": ['SGDClassifier', 'TF-IDF Vectors', 'Not Tuned', 0.96854, 0.97321],
    "submission_sgd_wrd2v": ['SGDClassifier', 'Word2Vec Vectors', 'Not Tuned', 0.91181, 0.91560],
    "submission_sgd_count_tuned": ['SGDClassifier', 'Count Vectors', 'Tuned', 0.89659, 0.89790],
    "submission_sgd_tfidf_tuned": ['SGDClassifier', 'TF-IDF Vectors', 'Tuned', 0.96312, 0.96252],
    "submission_sgd_wrd2v_tuned": ['SGDClassifier', 'Word2Vec Vectors', 'Tuned', 0.93585, 0.93342]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

Unnamed: 0,model,vector,tuned,private,public,test accuracy
submission_sgd_tfidf,SGDClassifier,TF-IDF Vectors,Not Tuned,0.96854,0.97321,0.96901
submission_sgd_tfidf_tuned,SGDClassifier,TF-IDF Vectors,Tuned,0.96312,0.96252,0.96306
submission_sgd_wrd2v_tuned,SGDClassifier,Word2Vec Vectors,Tuned,0.93585,0.93342,0.93561
submission_sgd_wrd2v,SGDClassifier,Word2Vec Vectors,Not Tuned,0.91181,0.9156,0.91219
submission_sgd_count_tuned,SGDClassifier,Count Vectors,Tuned,0.89659,0.8979,0.89672
submission_sgd_count,SGDClassifier,Count Vectors,Not Tuned,0.88323,0.88505,0.88341


### OneVsRest Classifier: Logistic Regression <a class="anchor" id="oc_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>


#### Model Training
First, the `OneVsRestClassifier` is initialized with a Logistic Regression model as its base estimator, using default parameter values except for `class_weight`, which is set to `'balanced'`.

In [38]:
oc_lr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))

The classifier is then trained using Count vectors.

In [None]:
%%time
oc_lr_model_count, predictions_oc_lr_count = train_model(oc_lr, count_train, count_test)
to_submission_csv(predictions_oc_lr_count, 'submission_oc_lr_count')

After this, the classifier will be trained with TF-IDF vectors.

In [39]:
%%time
oc_lr_model_tfidf, predictions_oc_lr_tfidf = train_model(oc_lr, tfidf_train, tfidf_test)
to_submission_csv(predictions_oc_lr_tfidf, 'submission_oc_lr_tfidf')

Fitting OneVsRestClassifier(estimator=LogisticRegression(class_weight='balanced'))...
0.9042369854171497
Wall time: 1min 36s
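`OneVsRestClassifier` accepts the whole label matrix at once and fits one copy of the base estimator per label column. A minimal sketch on synthetic data follows; the feature matrix, the rule that generates the labels, and the sizes are all made up for illustration and do not reflect the notebook's `train_model` helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
# Three synthetic binary labels, each driven by one feature.
Y = (X[:, :3] > 0.5).astype(int)

ovr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
ovr.fit(X, Y)

probs = ovr.predict_proba(X[:4])  # one probability column per label
print(len(ovr.estimators_), probs.shape)
```

If the single figure printed above comes from the estimator's `score` method (an assumption, since `train_model` is defined elsewhere), it would be subset accuracy, which counts a comment as correct only when all of its labels match. That would explain why it is lower than the per-label accuracies reported for the earlier models.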


#### Hyperparameter Tuning
The model will be tuned using `GridSearchCV` and the parameter grid `parameters_lr_mo`. As with the previous models, the model trained on Count vectors is tuned first.

In [None]:
%%time
oc_lr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
oc_lr_model_count_tuned, predictions_oc_lr_count_tuned = tune_and_train_model(oc_lr, 
                                                                              parameters_lr_mo, 
                                                                              count_train, 
                                                                              count_test)
to_submission_csv(predictions_oc_lr_count_tuned, 'submission_oc_lr_count_tuned')
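When a grid search wraps a `OneVsRestClassifier`, parameters of the inner model are addressed with the `estimator__` prefix. The sketch below shows this on synthetic data; the grid values and the 3-fold setting are assumptions, since `parameters_lr_mo` and `tune_and_train_model` are defined elsewhere in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 5))
# Three noisy synthetic labels, each tied to one feature.
Y = (X[:, :3] + 0.3 * rng.normal(size=(300, 3)) > 0.5).astype(int)

param_grid = {"estimator__C": [0.01, 1, 10]}  # hypothetical grid
search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(class_weight='balanced')),
    param_grid,
    cv=3,
)
search.fit(X, Y)
print(search.best_params_)
```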

Next, the model trained on TF-IDF vectors is tuned and used for prediction.

In [None]:
%%time
oc_lr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
oc_lr_model_tfidf_tuned, predictions_oc_lr_tfidf_tuned = tune_and_train_model(oc_lr, 
                                                                              parameters_lr_mo, 
                                                                              tfidf_train, 
                                                                              tfidf_test)
to_submission_csv(predictions_oc_lr_tfidf_tuned, 'submission_oc_lr_tfidf_tuned')

#### Test Accuracy Score
Based on the ROC AUC scores from the Kaggle competition, the tuned `OneVsRestClassifier` with a `LogisticRegression` base estimator and TF-IDF vectors did not improve on its untuned counterpart, as the best parameters found were the default values of the `LogisticRegression` model. For the Count vectors version, however, the tuned model received a lower score than the untuned one.

In [220]:
results = {
    "submission_oc_lr_count": ['OneVsRestClassifier (Logistic Regression)', 'Count Vectors', 'Not Tuned', 0.93996, 0.94410],
    "submission_oc_lr_tfidf": ['OneVsRestClassifier (Logistic Regression)', 'TF-IDF Vectors', 'Not Tuned', 0.97558, 0.97621],
    "submission_oc_lr_count_tuned": ['OneVsRestClassifier (Logistic Regression)', 'Count Vectors', 'Tuned', 0.91488, 0.91866],
    "submission_oc_lr_tfidf_tuned": ['OneVsRestClassifier (Logistic Regression)', 'TF-IDF Vectors', 'Tuned', 0.97558, 0.97621]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

| submission | model | vector | tuned | private | public | test accuracy |
|---|---|---|---|---|---|---|
| submission_oc_lr_tfidf | OneVsRestClassifier (Logistic Regression) | TF-IDF Vectors | Not Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_oc_lr_tfidf_tuned | OneVsRestClassifier (Logistic Regression) | TF-IDF Vectors | Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_oc_lr_count | OneVsRestClassifier (Logistic Regression) | Count Vectors | Not Tuned | 0.93996 | 0.9441 | 0.94037 |
| submission_oc_lr_count_tuned | OneVsRestClassifier (Logistic Regression) | Count Vectors | Tuned | 0.91488 | 0.91866 | 0.91526 |
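
The `test accuracy` values in this table are consistent with a weighted mean of 0.9 × private score plus 0.1 × public score. The helper that actually computes this column is defined earlier in the notebook, so the following is only a sketch under that assumption; `weighted_score` and its `private_weight` parameter are hypothetical names used for illustration.

```python
import pandas as pd

def weighted_score(private, public, private_weight=0.9):
    """Combine the two Kaggle scores into one number.

    The 0.9 / 0.1 weighting is an assumption inferred from the table values,
    not a confirmed description of the notebook's own helper.
    """
    return round(private_weight * private + (1 - private_weight) * public, 5)

# Private and public scores copied from the results cell above.
scores = pd.DataFrame(
    {
        "private": [0.97558, 0.93996, 0.91488],
        "public": [0.97621, 0.94410, 0.91866],
    },
    index=[
        "submission_oc_lr_tfidf",
        "submission_oc_lr_count",
        "submission_oc_lr_count_tuned",
    ],
)

# Apply the assumed weighting row by row.
scores["test accuracy"] = [
    weighted_score(private, public)
    for private, public in zip(scores["private"], scores["public"])
]
print(scores)
```

Under this assumption, the private score dominates the combined value, which is why the ranking by `test accuracy` follows the private scores closely.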


### OneVsRest Classifier: Multinomial Naive Bayes <a class="anchor" id="oc_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
To start training the `MultinomialNB` version of the `OneVsRestClassifier`, we would create a `OneVsRestClassifier` that wraps a `MultinomialNB` with the default parameters.

In [None]:
oc_mn = OneVsRestClassifier(MultinomialNB())

This would first be used to train a version of this model with Count vectors as its feature input.

In [None]:
%%time
oc_mn_model_count, predictions_oc_mn_count = train_model(oc_mn, count_train, count_test)
to_submission_csv(predictions_oc_mn_count, 'submission_oc_mn_count')

Afterwards, a copy of the same `OneVsRestClassifier` object would be trained with TF-IDF vectors.

In [None]:
%%time
oc_mn_model_tfidf, predictions_oc_mn_tfidf = train_model(oc_mn, tfidf_train, tfidf_test)
to_submission_csv(predictions_oc_mn_tfidf, 'submission_oc_mn_tfidf')

#### Hyperparameter Tuning
These two models of `OneVsRestClassifier` with `MultinomialNB` would be tuned using the hyperparameter values found in **parameters_mn_mo**. We would be starting with the model with Count vectors.

In [194]:
%%time
oc_mn = OneVsRestClassifier(MultinomialNB())
oc_mn_model_count_tuned, predictions_oc_mn_count_tuned = tune_and_train_model(oc_mn, 
                                                                              parameters_mn_mo, 
                                                                              count_train, 
                                                                              count_test)
to_submission_csv(predictions_oc_mn_count_tuned, 'submission_oc_mn_count_tuned')

Tuning OneVsRestClassifier(estimator=MultinomialNB())...
Overall training f1: 0.7751371196906105
Best parameters: {'estimator__alpha': 1e-05, 'estimator__fit_prior': True}
Wall time: 19.8 s
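
The `estimator__` prefix in the best parameters above is how a grid search reaches the base estimator wrapped inside a meta-estimator such as `OneVsRestClassifier`. The following is a minimal sketch of that mechanism on toy data. The grid values, the `f1_micro` scoring, the 3-fold split, and the names `X_toy`, `Y_toy`, and `param_grid` are all assumptions for illustration; the notebook's actual `parameters_mn_mo` and `tune_and_train_model` are defined elsewhere and may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy multi-label data (hypothetical): non-negative counts and 3 binary labels.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(80, 10))
Y_toy = (X_toy[:, :3] > 2).astype(int)

# Keys use the "estimator__" prefix so that the values are routed to the
# MultinomialNB inside the OneVsRestClassifier. The values are illustrative.
param_grid = {
    "estimator__alpha": [1e-5, 1e-3, 0.1, 1.0],
    "estimator__fit_prior": [True, False],
}

search = GridSearchCV(
    OneVsRestClassifier(MultinomialNB()),
    param_grid,
    scoring="f1_micro",  # assumed metric; the notebook only reports "f1"
    cv=3,
)
search.fit(X_toy, Y_toy)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is refit on the full training data with the best parameters, so it can be used directly for prediction.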


This is followed by the model that used TF-IDF vectors as features.

In [None]:
%%time
oc_mn = OneVsRestClassifier(MultinomialNB())
oc_mn_model_tfidf_tuned, predictions_oc_mn_tfidf_tuned = tune_and_train_model(oc_mn, 
                                                                              parameters_mn_mo, 
                                                                              tfidf_train, 
                                                                              tfidf_test)
to_submission_csv(predictions_oc_mn_tfidf_tuned, 'submission_oc_mn_tfidf_tuned')

#### Test Accuracy Score
Analyzing the results of the predictions on the test data, we can see that tuning the `MultinomialNB` model that used TF-IDF vectors greatly improved its score (from 0.82618 to 0.90678). Meanwhile, the model that used Count vectors yielded the same score (0.84654) before and after tuning.

In [219]:
results = {
    "submission_oc_mn_count": ['OneVsRestClassifier (Multinomial NB)', 'Count Vectors', 'Not Tuned', 0.84551, 0.85581],
    "submission_oc_mn_tfidf": ['OneVsRestClassifier (Multinomial NB)', 'TF-IDF Vectors', 'Not Tuned', 0.82510, 0.83586],
    "submission_oc_mn_count_tuned": ['OneVsRestClassifier (Multinomial NB)', 'Count Vectors', 'Tuned', 0.84551, 0.85581],
    "submission_oc_mn_tfidf_tuned": ['OneVsRestClassifier (Multinomial NB)', 'TF-IDF Vectors', 'Tuned', 0.90597, 0.91409]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

| submission | model | vector | tuned | private | public | test accuracy |
|---|---|---|---|---|---|---|
| submission_oc_mn_tfidf_tuned | OneVsRestClassifier (Multinomial NB) | TF-IDF Vectors | Tuned | 0.90597 | 0.91409 | 0.90678 |
| submission_oc_mn_count | OneVsRestClassifier (Multinomial NB) | Count Vectors | Not Tuned | 0.84551 | 0.85581 | 0.84654 |
| submission_oc_mn_count_tuned | OneVsRestClassifier (Multinomial NB) | Count Vectors | Tuned | 0.84551 | 0.85581 | 0.84654 |
| submission_oc_mn_tfidf | OneVsRestClassifier (Multinomial NB) | TF-IDF Vectors | Not Tuned | 0.8251 | 0.83586 | 0.82618 |


### MultiOutput Classifier: Logistic Regression <a class="anchor" id="mo_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
To start with experimenting with `MultiOutputClassifier`, we will first declare a `MultiOutputClassifier` object with `LogisticRegression` as its base estimator.

In [None]:
mo_lr = MultiOutputClassifier(LogisticRegression(n_jobs=-1, class_weight='balanced'))
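
As a rough illustration of how `MultiOutputClassifier` relates to `OneVsRestClassifier`, the sketch below fits both on toy multi-label data. Both train one binary classifier per label, but their `predict_proba` outputs have different layouts, which may be why a separate submission helper is used for this model. The toy data and the names `X_toy`, `Y_toy`, `ovr`, and `mo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data (hypothetical): 60 samples, 5 features, 3 binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
Y_toy = (X_toy[:, :3] + rng.normal(scale=0.5, size=(60, 3)) > 0).astype(int)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X_toy, Y_toy)
mo = MultiOutputClassifier(LogisticRegression()).fit(X_toy, Y_toy)

# OneVsRestClassifier: one array of shape (n_samples, n_labels),
# holding the positive-class probability of each label.
ovr_proba = ovr.predict_proba(X_toy)

# MultiOutputClassifier: a list with one (n_samples, 2) array per label.
mo_proba_list = mo.predict_proba(X_toy)

# Keep only the positive-class column of each label to get the same layout.
mo_proba = np.column_stack([p[:, 1] for p in mo_proba_list])

print(ovr_proba.shape, mo_proba.shape)
print(np.allclose(ovr_proba, mo_proba, atol=1e-6))
```

Because both wrappers fit the same base estimator independently on each label, their probabilities should agree once reshaped, which matches the identical TF-IDF scores reported for the two `LogisticRegression` variants.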

As we now have this object, we can now create a version of this model that will be trained using an input of Count vector.

In [None]:
%%time
mo_lr_model_count, predictions_mo_lr_count = train_model(mo_lr, count_train, count_test)
to_submission_csv_multiclass(predictions_mo_lr_count, 'submission_mo_lr_count')

Then, another copy of this model would be trained using a TF-IDF vector.

In [None]:
%%time
mo_lr_model_tfidf, predictions_mo_lr_tfidf = train_model(mo_lr, tfidf_train, tfidf_test)
to_submission_csv_multiclass(predictions_mo_lr_tfidf, 'submission_mo_lr_tfidf')

#### Hyperparameter Tuning
These two models would undergo hyperparameter tuning using **parameters_lr_mo** as their parameter grid and `GridSearchCV`. We will start with tuning the model that used Count vectors.

In [None]:
%%time
mo_lr = MultiOutputClassifier(LogisticRegression(n_jobs=-1, class_weight='balanced'))
mo_lr_model_count_tuned, predictions_mo_lr_count_tuned = tune_and_train_model(mo_lr, 
                                                                              parameters_lr_mo, 
                                                                              count_train, 
                                                                              count_test)
to_submission_csv_multiclass(predictions_mo_lr_count_tuned, 'submission_mo_lr_count_tuned')

This is followed by tuning the TF-IDF version of this model.

In [None]:
%%time
mo_lr = MultiOutputClassifier(LogisticRegression(n_jobs=-1, class_weight='balanced'))
mo_lr_model_tfidf_tuned, predictions_mo_lr_tfidf_tuned = tune_and_train_model(mo_lr, 
                                                                              parameters_lr_mo, 
                                                                              tfidf_train, 
                                                                              tfidf_test)
to_submission_csv_multiclass(predictions_mo_lr_tfidf_tuned, 'submission_mo_lr_tfidf_tuned')

#### Test Accuracy Score
With these results, we can see that the scores of all four classifiers are close to each other (ranging from about 0.95 to 0.98). The difference between the two inputs is that tuning the TF-IDF version left its score unchanged at 0.97564, while tuning the Count version raised its score slightly, from 0.94886 to 0.95118.

In [218]:
results = {
    "submission_mo_lr_count": ['MultiOutputClassifier (Logistic Regression)', 'Count Vectors', 'Not Tuned', 0.94866, 0.95065],
    "submission_mo_lr_tfidf": ['MultiOutputClassifier (Logistic Regression)', 'TF-IDF Vectors', 'Not Tuned', 0.97558, 0.97621],
    "submission_mo_lr_count_tuned": ['MultiOutputClassifier (Logistic Regression)', 'Count Vectors', 'Tuned', 0.95113, 0.95159],
    "submission_mo_lr_tfidf_tuned": ['MultiOutputClassifier (Logistic Regression)', 'TF-IDF Vectors', 'Tuned', 0.97558, 0.97621]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

| submission | model | vector | tuned | private | public | test accuracy |
|---|---|---|---|---|---|---|
| submission_mo_lr_tfidf | MultiOutputClassifier (Logistic Regression) | TF-IDF Vectors | Not Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_mo_lr_tfidf_tuned | MultiOutputClassifier (Logistic Regression) | TF-IDF Vectors | Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_mo_lr_count_tuned | MultiOutputClassifier (Logistic Regression) | Count Vectors | Tuned | 0.95113 | 0.95159 | 0.95118 |
| submission_mo_lr_count | MultiOutputClassifier (Logistic Regression) | Count Vectors | Not Tuned | 0.94866 | 0.95065 | 0.94886 |


### MultiOutput Classifier: Multinomial Naive Bayes <a class="anchor" id="mo_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
For our last approach, we would be creating a `MultiOutputClassifier` that utilizes `MultinomialNB` as its base estimator.

In [None]:
mo_mn = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)

With this object, we will be training this meta-estimator using a Count vector as its training input.

In [None]:
%%time
mo_mn_model_count, predictions_mo_mn_count = train_model(mo_mn, count_train, count_test)
to_submission_csv_multiclass(predictions_mo_mn_count, 'submission_mo_mn_count')

On the other hand, we will also be training another copy of this meta-estimator, but with the input of a TF-IDF vector.

In [None]:
%%time
mo_mn_model_tfidf, predictions_mo_mn_tfidf = train_model(mo_mn, tfidf_train, tfidf_test)
to_submission_csv_multiclass(predictions_mo_mn_tfidf, 'submission_mo_mn_tfidf')

#### Hyperparameter Tuning
Like in the previous model of `MultiOutputClassifier` (i.e., the one that utilized `LogisticRegression`), we will be tuning the model that used Count vector using `GridSearchCV` and the parameter grid `parameters_mn_mo`.

In [None]:
%%time
mo_mn = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)
mo_mn_model_count_tuned, predictions_mo_mn_count_tuned = tune_and_train_model(mo_mn, 
                                                                              parameters_mn_mo, 
                                                                              count_train, 
                                                                              count_test)
to_submission_csv_multiclass(predictions_mo_mn_count_tuned, 'submission_mo_mn_count_tuned')

Likewise, we would also be tuning the TF-IDF vector copy. 

In [None]:
%%time
mo_mn = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)
mo_mn_model_tfidf_tuned, predictions_mo_mn_tfidf_tuned = tune_and_train_model(mo_mn, 
                                                                              parameters_mn_mo, 
                                                                              tfidf_train, 
                                                                              tfidf_test)
to_submission_csv_multiclass(predictions_mo_mn_tfidf_tuned, 'submission_mo_mn_tfidf_tuned')

#### Test Accuracy Score
From the results below, we can see that the untuned TF-IDF version scored better than the tuned version (0.9094 versus 0.89669). The opposite holds for the model that used Count vectors, where tuning raised the score from 0.84654 to 0.87532.

However, comparing the results of the two `MultiOutputClassifier` models, we can see that `LogisticRegression` performed better than `MultinomialNB` for all versions. Both share the pattern that tuning improved the Count vector version, but they differ on TF-IDF: tuning lowered the `MultinomialNB` score, whereas it left the `LogisticRegression` score unchanged.

In [217]:
results = {
    "submission_mo_mn_count": ['MultiOutputClassifier (Multinomial NB)', 'Count Vectors', 'Not Tuned', 0.84551, 0.85581],
    "submission_mo_mn_tfidf": ['MultiOutputClassifier (Multinomial NB)', 'TF-IDF Vectors', 'Not Tuned', 0.90867, 0.91595],
    "submission_mo_mn_count_tuned": ['MultiOutputClassifier (Multinomial NB)', 'Count Vectors', 'Tuned', 0.87456, 0.88221],
    "submission_mo_mn_tfidf_tuned": ['MultiOutputClassifier (Multinomial NB)', 'TF-IDF Vectors', 'Tuned', 0.89620, 0.90115]
}

all_results = update_results(all_results, results)
pd.DataFrame(results, index=index).T.sort_values('test accuracy', ascending=False)

| submission | model | vector | tuned | private | public | test accuracy |
|---|---|---|---|---|---|---|
| submission_mo_mn_tfidf | MultiOutputClassifier (Multinomial NB) | TF-IDF Vectors | Not Tuned | 0.90867 | 0.91595 | 0.9094 |
| submission_mo_mn_tfidf_tuned | MultiOutputClassifier (Multinomial NB) | TF-IDF Vectors | Tuned | 0.8962 | 0.90115 | 0.89669 |
| submission_mo_mn_count_tuned | MultiOutputClassifier (Multinomial NB) | Count Vectors | Tuned | 0.87456 | 0.88221 | 0.87532 |
| submission_mo_mn_count | MultiOutputClassifier (Multinomial NB) | Count Vectors | Not Tuned | 0.84551 | 0.85581 | 0.84654 |


## All Scores

In [221]:
all_results.T.sort_values('test accuracy', ascending=False).drop_duplicates(subset=['model', 'vector', 'tuned'])

| submission | model | vector | tuned | private | public | test accuracy |
|---|---|---|---|---|---|---|
| submission_lr_tfidf | Logistic Regression | TF-IDF Vectors | Not Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_mo_lr_tfidf | MultiOutputClassifier (Logistic Regression) | TF-IDF Vectors | Not Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_lr_tfidf_tuned | Logistic Regression | TF-IDF Vectors | Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_mo_lr_tfidf_tuned | MultiOutputClassifier (Logistic Regression) | TF-IDF Vectors | Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_oc_lr_tfidf | OneVsRestClassifier (Logistic Regression) | TF-IDF Vectors | Not Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_oc_lr_tfidf_tuned | OneVsRestClassifier (Logistic Regression) | TF-IDF Vectors | Tuned | 0.97558 | 0.97621 | 0.97564 |
| submission_sgd_tfidf | SGDClassifer | TF-IDF Vectors | Not Tuned | 0.96854 | 0.97321 | 0.96901 |
| submission_xgb_tfidf | XGBClassifier | TF-IDF Vectors | Not Tuned | 0.96502 | 0.96803 | 0.96532 |
| submission_xgb_count | XGBClassifier | Count Vectors | Not Tuned | 0.96468 | 0.96783 | 0.96499 |
| submission_xgb_tfidf_tuned | XGBClassifier | TF-IDF Vectors | Tuned | 0.96396 | 0.96693 | 0.96426 |
