# You're Toxic, I'm Slippin' Under: Toxic Comment Classification Challenge

#### STINTSY S13 Group 8
- VICENTE, Francheska Josefa
- VISTA, Sophia Danielle S.

## Import Libraries
Before starting, the libraries and files needed to build and train the models are loaded into the notebook.

#### Basic Libraries 
- `numpy` contains a large collection of mathematical functions
- `pandas` contains functions that are designed for data manipulation and data analysis

In [133]:
import numpy as np
import pandas as pd

#### Natural Language Processing Libraries 
- `re` is a module that allows the use of regular expressions
- `nltk` provides functions for processing text data
- `stopwords` is a corpus from NLTK, which includes a compiled list of stopwords
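- `Counter` is from Python's `collections` module and counts hashable items, which is helpful for tallying tokens
- `string` contains functions for string operations
- `Word2Vec` and `Doc2Vec` are gensim models that learn vector representations of words and documents
- `TweetTokenizer` is an NLTK tokenizer designed for casual, tweet-like text
- `TfidfVectorizer` converts the given text documents into a matrix of TF-IDF features
- `CountVectorizer` converts the given text documents into a matrix of token counts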

In [None]:
import re
import nltk
import string

from nltk.corpus import stopwords
from collections import Counter
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from nltk.tokenize.casual import TweetTokenizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#### Machine Learning Libraries
The following code block can be used to install **scikit-multilearn** without restarting Jupyter Notebook. The `sys` module provides `sys.executable`, the path of the Python interpreter running the notebook, so that `pip` installs scikit-multilearn into the same environment.

In [135]:
import sys
!{sys.executable} -m pip install scikit-multilearn

C:\Users\User\anaconda3\python.exe: No module named pip


The following classes are multi-label classification wrappers, which allow a single model to assign more than one class to an instance.
- `ClassifierChain` chains binary classifiers so that each prediction depends on the predictions for the earlier classes
- `BinaryRelevance` uses binary classifiers to classify the classes independently
- `MultiOutputClassifier` fits one classifier per target class 
- `OneVsRestClassifier` fits one classifier per class, with that class against all the other classes

In [136]:
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier

The following classes are classifiers that implement different methods of classification.
- `RandomForestClassifier` is a class under the ensemble module that fits a number of decision trees on sub-samples of the data and averages their predictions
- `GradientBoostingClassifier` is a class under the ensemble module that optimizes arbitrary differentiable loss functions
- `AdaBoostClassifier` is a class under the ensemble module that implements AdaBoost-SAMME
- `MultinomialNB` is a class under the Naive Bayes module that allows the classification of discrete features
- `LogisticRegression` is a class under the linear models module that implements regularized logistic regression
- `SGDClassifier` is a class under the linear models module that implements regularized linear models with stochastic gradient descent (SGD) learning

In [137]:
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

Meanwhile, the following classes and functions are used for hyperparameter tuning.
- `ParameterGrid` is a class that allows the iteration over different combinations of parameter values 
- `GridSearchCV` is a cross-validation class that allows the exhaustive search over all possible combinations of hyperparameter values
- `RandomizedSearchCV` is a cross-validation class that allows a random search over some possible combinations of hyperparameter values
- `train_test_split` is a function that splits the dataset into random train and test subsets

In [138]:
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
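To make the difference between enumerating a grid and searching it concrete, the following is a minimal sketch. The grid `toy_grid` and the synthetic data from `make_classification` are assumptions made only for illustration; they are not the grids or data used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Hypothetical grid, chosen only for illustration.
toy_grid = {'C': [0.1, 1, 10], 'class_weight': ['balanced', None]}

# ParameterGrid enumerates every combination: 3 values x 2 values = 6.
n_combinations = len(ParameterGrid(toy_grid))
print('Combinations:', n_combinations)

# Synthetic binary data standing in for the vectorized comments.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# GridSearchCV fits every combination on every fold and keeps the best one.
search = GridSearchCV(LogisticRegression(max_iter=500), toy_grid, scoring='f1', cv=2)
search.fit(X_toy, y_toy)

print('Fits performed:', n_combinations * search.n_splits_)
print('Best parameters:', search.best_params_)
```

The number of fits grows with the product of the list lengths in the grid, which is why a random search can be preferable for the larger grids.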

And lastly, these functions compute different scores that measure how well a model works.
- `log_loss` computes the logistic loss (cross-entropy loss) given the true labels and the predicted probabilities
- `f1_score` computes the balanced F-score by comparing the actual classes and the predicted classes
- `accuracy_score` computes the accuracy as the fraction of predictions that match the actual classes

In [139]:
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
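As a small illustration of how the three metrics differ, the sketch below scores a toy set of four predictions. The labels and probabilities are made up for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

y_true = np.array([0, 1, 1, 0])              # actual classes
y_prob = np.array([0.1, 0.8, 0.4, 0.3])      # hypothetical predicted P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)         # thresholded at 0.5 -> [0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)         # fraction of correct predictions
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
ll = log_loss(y_true, y_prob)                # uses the probabilities, not the thresholded classes

print('Accuracy:', acc)
print('F1 score:', f1)
print('Log loss:', ll)
```

Accuracy and the F1 score only look at the thresholded classes, while the log loss also penalizes how confident each probability was.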

The `warnings` module is used to suppress any `ConvergenceWarning` that might appear when doing hyperparameter tuning. As the hyperparameter combinations that fail to converge will not be chosen due to their low scores, the warnings would only clutter the output.

In [140]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category = ConvergenceWarning)

### Load Files
The CSV files loaded here contain the datasets that have already gone through the data cleaning and preprocessing techniques discussed in the main notebook.

In [141]:
train = pd.read_csv('cleaned_data/cleaned_train.csv')
test = pd.read_csv('cleaned_data/cleaned_test.csv')

## Initialize Datasets
Before using these datasets, we need to convert the values in the `comment_text` column into strings, as the documentation of the TF-IDF vectorizer and the count vectorizer states that they expect "str, unicode or file objects".

In [142]:
test ['comment_text'] = test ['comment_text'].apply(lambda x: np.str_(x))
train ['comment_text'] = train ['comment_text'].apply(lambda x: np.str_(x))
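The sketch below shows what this conversion does to values that are not strings. The toy series is hypothetical; it assumes that some comments may be read back as `NaN`, which is how `pd.read_csv` represents empty fields by default.

```python
import numpy as np
import pandas as pd

# Hypothetical column containing a missing value and a number.
comments = pd.Series(['hello there', np.nan, 42])

# Same conversion as in the cell above: every value becomes a string.
as_text = comments.apply(lambda x: np.str_(x))

for value in as_text:
    print(repr(str(value)), isinstance(value, str))
```

Note that a missing value becomes the literal text `nan`, so the vectorizers will treat it as an ordinary token.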

Then, we declare our **X_train**, **y_train**, and **X_test**.

In [143]:
X_train = train ['comment_text']
y_train = train.loc [ : , 'toxic' : ]

X_test = test ['comment_text']

Afterwards, we declare the different classes that our model needs to predict. These can be found in the **train** data's column names.

In [144]:
classes = train.columns [2:]

## Vectorizing Data
As explained in the **Feature Engineering** part of the main notebook, two types of vectorizers would be used: (1) Count Vectorizer, and (2) TF-IDF Vectorizer.

Two versions of each vectorizer were made in consideration of the more complex estimators: one with no **max_features** parameter, and one with **max_features** set to 5000, which keeps only the 5000 most frequent terms in the training data. Limiting the number of features reduces the time and memory needed to train the estimators, which lessens the burden on our machines.
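The effect of **max_features** can be seen on a toy corpus. The three documents below are made up for illustration, and a limit of 2 is used instead of 5000 so that the truncation is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus, only for illustration.
docs = ['you are great', 'you are rude', 'rude rude comment']

full = CountVectorizer().fit(docs)                 # keeps every term
top2 = CountVectorizer(max_features=2).fit(docs)   # keeps the 2 most frequent terms

print('Full vocabulary:', sorted(full.vocabulary_))
print('Limited vocabulary:', sorted(top2.vocabulary_))

# The same limit applies to the TF-IDF vectorizer; the matrix has 2 columns.
tfidf = TfidfVectorizer(max_features=2).fit_transform(docs)
print('TF-IDF shape:', tfidf.shape)
```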

#### Count Vectorizer

In [None]:
count_vectorizer = CountVectorizer()                     # creating the Vectorizer with no max features
count_train = count_vectorizer.fit_transform(X_train)    # fitting the vectorizer according to the train data, and then
                                                         # returning the transformed train data
count_test = count_vectorizer.transform(X_test)          # returning the transformed test data

In [145]:
count_vectorizer_5000 = CountVectorizer(max_features = 5000)     # creating the Vectorizer with max features = 5000
count_train_5000 = count_vectorizer_5000.fit_transform(X_train)  # fitting the vectorizer according to the train data, and then
                                                                 # returning the transformed train data
count_test_5000 = count_vectorizer_5000.transform(X_test)        # returning the transformed test data

#### TF-IDF Vectorizer

In [None]:
tfidf_vectorizer = TfidfVectorizer()                    # creating the Vectorizer with no max features
tfidf_train = tfidf_vectorizer.fit_transform(X_train)   # fitting the vectorizer according to the train data, and then
                                                        # returning the transformed train data
tfidf_test = tfidf_vectorizer.transform(X_test)         # returning the transformed test data

In [146]:
tfidf_vectorizer_5000 = TfidfVectorizer(max_features = 5000)    # creating the Vectorizer with max features = 5000
tfidf_train_5000 = tfidf_vectorizer_5000.fit_transform(X_train) # fitting the vectorizer according to the train data, and then
                                                                # returning the transformed train data
tfidf_test_5000 = tfidf_vectorizer_5000.transform(X_test)       # returning the transformed test data

#### Average Word2Vec Vectors

In [214]:
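def tokenize(data):
    """Tokenizes each text in `data` using NLTK's TweetTokenizer."""
    t = TweetTokenizer()
    return [t.tokenize(text) for text in data]

def vectorize_word2vec(model, tokens_list):
    """Represents each document as the average of its tokens' Word2Vec vectors."""
    vectors = []
    for tokens in tokens_list:
        feat = np.zeros(model.vector_size)         # running sum of the token vectors
        count = 0                                  # number of tokens found in the vocabulary
        for token in tokens:
            if token in model.wv.key_to_index:     # dictionary lookup instead of scanning a list
                feat += model.wv[token]
                count += 1
        if count != 0:                             # keep the zero vector if no token is known
            feat /= count
        vectors.append(feat)

    return vectors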

In [206]:
word2vec_model = Word2Vec(tokenize(X_train), epochs=30, sg=0, workers=4)

In [None]:
word2vec_train = vectorize_word2vec(word2vec_model, tokenize(X_train))

In [None]:
word2vec_test = vectorize_word2vec(word2vec_model, tokenize(X_test))
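The averaging step can be checked by hand on a tiny example. The sketch below does not use gensim; the dictionary `toy_vectors` is a hypothetical stand-in for the trained Word2Vec model, with 3 dimensions instead of 100, and `average_vector` is a helper written only for this illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model.
toy_vectors = {
    'good': np.array([1.0, 0.0, 2.0]),
    'bad':  np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size):
    """Returns the mean of the known tokens' vectors, or zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

# 'unseen' is not in the vocabulary, so it is skipped when averaging.
doc_vec = average_vector(['good', 'bad', 'unseen'], toy_vectors, size=3)
empty_vec = average_vector(['unseen'], toy_vectors, size=3)

print(doc_vec)
print(empty_vec)
```

A document with no known tokens maps to the zero vector, so such documents are indistinguishable from one another in this representation.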

## Declaring Functions
Helper functions that are used repeatedly throughout the notebook are declared and discussed here.

The following `to_submission_csv` functions are used to create CSV files with the correct submission template. The first function is used by almost all models, while a modified version is used for the MultiOutput Classifier.

In [147]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id']

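def to_submission_csv(predictions, filename):
    """Saves an (n_samples, n_classes) matrix of probabilities to results/<filename>.csv."""
    for i in range(len(classes)):
        sample_submission[classes[i]] = predictions[:, i : i + 1]     # column i holds the probabilities for class i

    sample_submission.to_csv(f'results/{filename}.csv', index = False)

def to_submission_csv_multiclass(predictions, filename):
    """Saves a list of per-class (n_samples, 2) probability arrays to results/<filename>.csv."""
    for i in range(len(classes)):
        temp = list(zip(*predictions[i]))                             # transpose the rows into columns
        sample_submission[classes[i]] = temp[1]                       # keep the probability of the positive class

    sample_submission.to_csv(f'results/{filename}.csv', index = False)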
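The reason a modified function is needed for the MultiOutput Classifier is that its `predict_proba` returns a different structure. The sketch below compares it with the OneVsRest Classifier on toy data; `X_toy` and `Y_toy` are hypothetical, with two labels instead of six.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical data: 8 samples, 2 features, 2 binary labels per sample.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [2, 0], [0, 2], [2, 2], [3, 1]], dtype=float)
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [1, 0], [0, 1], [1, 1], [1, 1]])

ovr = OneVsRestClassifier(LogisticRegression()).fit(X_toy, Y_toy)
moc = MultiOutputClassifier(LogisticRegression()).fit(X_toy, Y_toy)

# OneVsRestClassifier: one (n_samples, n_labels) array of positive-class probabilities.
ovr_proba = ovr.predict_proba(X_toy)
print(type(ovr_proba).__name__, ovr_proba.shape)

# MultiOutputClassifier: a list with one (n_samples, 2) array per label.
moc_proba = moc.predict_proba(X_toy)
print(type(moc_proba).__name__, len(moc_proba), moc_proba[0].shape)

# Taking column 1 of each array gives the same layout as the OneVsRest output.
positive = np.column_stack([p[:, 1] for p in moc_proba])
print(positive.shape)
```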

As the task requires us to give predictions for six classes, the `train_models` function trains one classifier per class. The function fits each classifier using the count vectors, the TF-IDF vectors, and the averaged Word2Vec vectors discussed above, and then displays their respective training accuracies. 

The function returns the trained models and their test predictions.

In [None]:
from sklearn.base import clone

def train_models(model):
    """Trains one model per class using count, TF-IDF, and averaged Word2Vec features.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 

    Returns
    -------
    models_count
        a list of estimator objects fitted with count vectorized data
    models_tfidf
        a list of estimator objects fitted with TF-IDF vectorized data
    models_wrd2v
        a list of estimator objects fitted with averaged Word2Vec vectors
    predictions_count
        an array of prediction probabilities made for the count vectorized test data
    predictions_tfidf
        an array of prediction probabilities made for the TF-IDF vectorized test data
    predictions_wrd2v
        an array of prediction probabilities made for the averaged Word2Vec test vectors
    """
    
    predictions_count = np.zeros((len(test), len(classes)))       # initialize empty array for count predictions
    predictions_tfidf = np.zeros((len(test), len(classes)))       # initialize empty array for tf-idf predictions
    predictions_wrd2v = np.zeros((len(test), len(classes)))       # initialize empty array for word2vec predictions
    models_count = []                                             # initialize empty list for count models
    models_tfidf = []                                             # initialize empty list for tf-idf models
    models_wrd2v = []                                             # initialize empty list for word2vec models

    for i in range(len(classes)):                                 # loop for each of the six classes
        print('Fitting', classes[i] + '...')

        mdl = clone(model)                                              # unfitted copy, so each stored model stays distinct
        mdl.fit(count_train, y_train[classes[i]])                       # fit the model
        predictions = mdl.predict(count_train)                          # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], predictions)     # get training accuracy 
        print('Count Vectors:', accuracy)
        predictions_count[:,i] = mdl.predict_proba(count_test)[:,1]     # predict using test data
        models_count += [mdl]
        
        mdl = clone(model)                                              # unfitted copy, so each stored model stays distinct
        mdl.fit(tfidf_train, y_train[classes[i]])                       # fit the model
        predictions = mdl.predict(tfidf_train)                          # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], predictions)     # get training accuracy 
        print('TF-IDF Vectors:', accuracy)
        predictions_tfidf[:,i] = mdl.predict_proba(tfidf_test)[:,1]     # predict using test data
        models_tfidf += [mdl]
        
        mdl = clone(model)                                              # unfitted copy, so each stored model stays distinct
        mdl.fit(word2vec_train, y_train[classes[i]])                    # fit the model
        predictions = mdl.predict(word2vec_train)                       # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], predictions)     # get training accuracy 
        print('Word2Vec Vectors:', accuracy)
        predictions_wrd2v[:,i] = mdl.predict_proba(word2vec_test)[:,1]  # predict using test data
        models_wrd2v += [mdl]
    
    return models_count, models_tfidf, models_wrd2v, predictions_count, predictions_tfidf, predictions_wrd2v
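A list of fitted models only holds distinct models if every fit is done on its own estimator object. The sketch below illustrates the difference between giving an estimator a second name and copying it with `sklearn.base.clone`; the variable names are chosen only for this example.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(C=0.5)

alias = base           # a second name for the same object; refitting one refits the other
copy = clone(base)     # a new, unfitted estimator with the same parameters

print('alias is base:', alias is base)
print('copy is base:', copy is base)
print('copy keeps C:', copy.get_params()['C'])
```

Refitting `alias` would overwrite whatever `base` had learned, while `copy` can be fitted independently.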

Similarly, the `tune_and_train_models` function trains one classifier per class. However, this function includes hyperparameter tuning to achieve a better training accuracy.

In [149]:
from sklearn.base import clone

def tune_and_train_models(model_cv):
    """Tunes six models using both count vectorized data and TF-IDF vectorized data.

    Parameters
    ----------
    model_cv : GridSearchCV or RandomizedSearchCV object
        a cross validation object with the estimator and hyperparameters to be fitted

    Returns
    -------
    models_count_tuned
        a list of tuned estimator objects fitted with count vectorized data
    models_tfidf_tuned
        a list of tuned estimator objects fitted with TF-IDF vectorized data
    predictions_count_tuned
        an array of prediction probabilities made for the count vectorized test data
    predictions_tfidf_tuned
        an array of prediction probabilities made for the TF-IDF vectorized test data
    """
    
    predictions_count_tuned = np.zeros((len(test), len(classes)))       # initialize empty array for count predictions
    predictions_tfidf_tuned = np.zeros((len(test), len(classes)))       # initialize empty array for tf-idf predictions
    models_count_tuned = []                                             # initialize empty list for count models
    models_tfidf_tuned = []                                             # initialize empty list for tf-idf models

    for i in range(len(classes)):                                       # loop for each of the six classes
        print('Fitting', classes[i] + '...')

        mdl_tuned = clone(model_cv)                                              # unfitted copy of the cross validation object
        mdl_tuned.fit(count_train, y_train[classes[i]])                          # fit the model
        print('Best parameters:', mdl_tuned.best_params_)                        # print best parameters found 
        predictions = mdl_tuned.predict(count_train)                             # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], predictions)              # get training accuracy 
        print('Count Vectors:', accuracy)                                        # print training accuracy
        predictions_count_tuned[:,i] = mdl_tuned.predict_proba(count_test)[:,1]  # predict using test data
        models_count_tuned += [mdl_tuned]                                        # add model to list

        mdl_tuned = clone(model_cv)                                              # unfitted copy of the cross validation object
        mdl_tuned.fit(tfidf_train, y_train[classes[i]])                          # fit the model
        print('Best parameters:', mdl_tuned.best_params_)                        # print best parameters found 
        predictions = mdl_tuned.predict(tfidf_train)                             # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], predictions)              # get training accuracy 
        print('TF-IDF Vectors:', accuracy)                                       # print training accuracy
        predictions_tfidf_tuned[:,i] = mdl_tuned.predict_proba(tfidf_test)[:,1]  # predict using test data
        models_tfidf_tuned += [mdl_tuned]                                        # add model to list
    
    return models_count_tuned, models_tfidf_tuned, predictions_count_tuned, predictions_tfidf_tuned


To exhaust all options, we have opted to train different types of multi-label classifiers that are available from scikit-learn and scikit-multilearn.
As such, the `train_model` and `tune_and_train_model` functions forgo the loop and proceed to train and/or tune the model using the whole **y_train**.

In [150]:
from sklearn.base import clone

def train_model(model):
    """Trains a model using both count vectorized data and TF-IDF vectorized data.

    Parameters
    ----------
    model : estimator object
        the type of multi-label estimator to be trained 

    Returns
    -------
    model_count
        a multi-label estimator object fitted with count vectorized data
    model_tfidf
        a multi-label estimator object fitted with TF-IDF vectorized data
    predictions_count
        the prediction probabilities made for the count vectorized test data
    predictions_tfidf
        the prediction probabilities made for the TF-IDF vectorized test data
    """

    model_count = clone(model)                                    # unfitted copy, so the two models stay distinct
    model_count.fit(count_train, y_train)                         # fit the model
    predictions = model_count.predict(count_train)                # predict using train data
    accuracy = accuracy_score(y_train, predictions)               # get training accuracy 
    print('Count Vectors:', accuracy)                             # print training accuracy
    predictions_count = model_count.predict_proba(count_test)     # predict using test data
        
    model_tfidf = clone(model)                                    # unfitted copy, so the two models stay distinct
    model_tfidf.fit(tfidf_train, y_train)                         # fit the model on all classes at once
    predictions = model_tfidf.predict(tfidf_train)                # predict using train data
    accuracy = accuracy_score(y_train, predictions)               # get training accuracy 
    print('TF-IDF Vectors:', accuracy)                            # print training accuracy
    predictions_tfidf = model_tfidf.predict_proba(tfidf_test)     # predict using test data
    
    return model_count, model_tfidf, predictions_count, predictions_tfidf


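The `tune_and_train_model` function follows the same procedure, but takes a cross validation object so that the hyperparameters are tuned before the predictions are made.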

In [151]:
from sklearn.base import clone

def tune_and_train_model(model_cv):
    """Tunes a model using both count vectorized data and TF-IDF vectorized data.

    Parameters
    ----------
    model_cv : GridSearchCV or RandomizedSearchCV object
        a cross validation object with the estimator and hyperparameters to be fitted

    Returns
    -------
    model_count_tuned
        a tuned multi-label estimator object fitted with count vectorized data
    model_tfidf_tuned
        a tuned multi-label estimator object fitted with TF-IDF vectorized data
    predictions_count_tuned
        the prediction probabilities made for the count vectorized test data
    predictions_tfidf_tuned
        the prediction probabilities made for the TF-IDF vectorized test data
    """

    model_count_tuned = clone(model_cv)                                      # unfitted copy of the cross validation object
    model_count_tuned.fit(count_train, y_train)                              # fit the model
    print('Best parameters:', model_count_tuned.best_params_)                # print best parameters found 
    predictions = model_count_tuned.predict(count_train)                     # predict using train data
    accuracy = accuracy_score(y_train, predictions)                          # get training accuracy 
    print('Count Vectors:', accuracy)                                        # print training accuracy
    predictions_count_tuned = model_count_tuned.predict_proba(count_test)    # predict using test data

    model_tfidf_tuned = clone(model_cv)                                      # unfitted copy of the cross validation object
    model_tfidf_tuned.fit(tfidf_train, y_train)                              # fit the model
    print('Best parameters:', model_tfidf_tuned.best_params_)                # print best parameters found 
    predictions = model_tfidf_tuned.predict(tfidf_train)                     # predict using train data
    accuracy = accuracy_score(y_train, predictions)                          # get training accuracy 
    print('TF-IDF Vectors:', accuracy)                                       # print training accuracy  
    predictions_tfidf_tuned = model_tfidf_tuned.predict_proba(tfidf_test)    # predict using test data
    
    return model_count_tuned, model_tfidf_tuned, predictions_count_tuned, predictions_tfidf_tuned


## Declaring Hyperparameter Values
As hyperparameters for each base estimator will remain constant, these will be declared here.

<a class="anchor" id="param_lr"></a>

In [152]:
# Logistic Regression
parameters_lr = [{
    'C' : [0.01, 0.1, 1, 10],
    'max_iter' : [300, 600, 900, 1200], 
    'class_weight' : ['balanced', None]
}]

<a class="anchor" id="param_mn"></a>

In [153]:
# Multinomial Naive Bayes
parameters_mnb = [{
    'alpha' : [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000],
    'fit_prior' : [False, True]
}]

<a class="anchor" id="param_rf"></a>

In [154]:
# Random Forest Classifier
parameters_rf = [{
    'n_estimators' : [100, 200, 300, 400, 500],
    'criterion' : ['gini', 'entropy'],
    'max_depth' : [5, 10, 20, 30],
    'min_samples_split' : [2, 4, 6, 10, 15, 20],
    'max_leaf_nodes' : [3, 5, 10, 20, 50, 100],
}]


In [155]:
# Gradient Boosting Classifier
parameters_gbc = [{
    'n_estimators' : [50, 100, 250],
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2],
}]


In [156]:
# XGBoost Classifier
parameters_xgb = [{
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2],
}]


In [157]:
# Adaboost Classifier
parameters_adb = {
    'n_estimators' : [10, 25, 50, 100, 250],
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2]
}


In [158]:
# SGDClassifier
parameters_sgd = [{
    'loss' : ['log', 'modified_huber'],
    'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
}]


In [159]:
# OneVsRest Classifier and MultiOutput Classifier: Logistic Regression
parameters_lr_mo = [{
    'estimator__C': [0.01, 0.1, 1, 10],            # [1, 12, 15]
    'estimator__max_iter': [300, 600, 900, 1200],  # [600, 1800, 3000],
    'estimator__class_weight' : ['balanced', None]
}]


In [160]:
# OneVsRest Classifier and MultiOutput Classifier: Multinomial Naive Bayes
parameters_mn_mo = [{
    'estimator__alpha': [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000],   # [0.5, 0.6, 0.7, 0.8, 1.0]
    'estimator__fit_prior': [True, False]
}]


In [161]:
# Binary Relevance: Logistic Regression
parameters_lr_multi = [{
    'classifier': [LogisticRegression()],
    'classifier__C': [0.01, 0.1, 1, 10],            # [1, 12, 15]
    'classifier__max_iter': [300, 600, 900, 1200]   # [600, 1800, 3000],
}]


In [162]:
# Classifier Chain and Binary Relevance: Multinomial Naive Bayes
parameters_mn_multi = [{
    'classifier': [MultinomialNB()],
    'classifier__alpha': [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000],   # [0.5, 0.6, 0.7, 0.8, 1.0]
    'classifier__fit_prior': [True, False]
}]

## Training and Tuning Different Models <a class="anchor" id="toc"></a>


### Six Single-Label Classifiers
* [**Logistic Regression**](#lr)
* [**Multinomial Naive Bayes**](#mn)
* [**Random Forest Classifier**](#rf)
* [**Gradient Boosting Classifier**](#gbc)
* [**eXtreme Gradient Boosting Classifier**](#xgb)
* [**AdaBoost Classifier**](#adb)
* [**Stochastic Gradient Descent Classifier**](#sgd)

### Multi-Label Classifiers
* [**OneVsRest Classifier: Logistic Regression**](#oc_lr)
* [**OneVsRest Classifier: Multinomial Naive Bayes**](#oc_mn)
* [**MultiOutput Classifier: Logistic Regression**](#mo_lr)
* [**MultiOutput Classifier: Multinomial Naive Bayes**](#mo_mn)
* [**Binary Relevance: Logistic Regression**](#br_lr)
* [**Binary Relevance: Multinomial Naive Bayes**](#br_mn)
* [**Classifier Chain: Multinomial Naive Bayes**](#cc_mn)

### Logistic Regression <a class="anchor" id="lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
A `LogisticRegression()` object is initialized with `class_weight='balanced'` and otherwise default estimator parameters before it is passed to the `train_models()` function. This model will be used as a base to train one classifier for each of the six classes on each of the three feature sets (count vectors, TF-IDF vectors, and averaged Word2Vec vectors), for a total of 18 models.

Additionally, the `n_jobs` parameter is used for parallel processing, which can help lessen the training time of the model. 

In [None]:
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_count, lr_models_tfidf, lr_models_wrd2v, \
    predictions_lr_count, predictions_lr_tfidf, predictions_lr_wrd2v = train_models(lr)
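The `class_weight='balanced'` option passed to `LogisticRegression()` above weights each class inversely to its frequency. The sketch below computes these weights for a hypothetical label vector with nine negative samples and one positive sample, using scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: nine non-toxic comments and one toxic comment.
y_toy = np.array([0] * 9 + [1])

# 'balanced' uses n_samples / (n_classes * count_of_class) for each class.
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)

print('Weight of class 0:', weights[0])   # 10 / (2 * 9)
print('Weight of class 1:', weights[1])   # 10 / (2 * 1)
```

The rare class receives a much larger weight, so misclassifying a toxic comment costs more during training than misclassifying a non-toxic one.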

TODO: the results have changed.

From the output shown above, it should be noted that the Logistic Regression models fitted with count vectorized data produced higher training accuracy scores for all six classes.

Next, the test predictions are saved to CSV files before they are uploaded to Kaggle.

In [195]:
to_submission_csv(predictions_lr_count, 'submission_lr_count')
to_submission_csv(predictions_lr_tfidf, 'submission_lr_tfidf')
to_submission_csv(predictions_lr_wrd2v, 'submission_lr_wrd2v')

In [196]:
predictions_lr_wrd2v

array([[0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
       [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
       [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
       ...,
       [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
       [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
       [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]])

#### Test Accuracy Score
As seen from the scores returned by Kaggle, which evaluates this competition using the mean column-wise ROC AUC rather than accuracy, TF-IDF vectors yielded a higher test score than count vectors, which is contrary to the training results shown earlier.

TODO: the score is fairly high.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_lr_count</td>
    <td class="tg-baqh">0.93926</td>
    <td class="tg-baqh">0.94248</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_lr_tfidf</td>
    <td class="tg-baqh">0.97391</td>
    <td class="tg-baqh">0.97376</td>
  </tr>
</tbody>
</table>
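Kaggle scores this competition with the mean column-wise ROC AUC, so a similar score can be estimated locally before submitting. The sketch below is one way to compute it; `mean_columnwise_auc` is a helper name introduced here, and the toy labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_columnwise_auc(y_true, y_score):
    """Averages the ROC AUC computed separately for each class column."""
    aucs = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
    return float(np.mean(aucs))

# Hypothetical labels and predicted probabilities for 4 comments and 2 classes.
y_true = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y_score = np.array([[0.2, 0.9], [0.7, 0.4], [0.6, 0.3], [0.1, 0.2]])

score = mean_columnwise_auc(y_true, y_score)
print('Mean column-wise ROC AUC:', score)
```

Applied to a held-out part of the training data, for instance one produced by `train_test_split`, this gives a rough estimate of the leaderboard score without using a submission.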

#### Hyperparameter Tuning
In tuning the Logistic Regression models, `GridSearchCV()` was used for a more comprehensive search of the hyperparameter values. A `LogisticRegression()` object with the same settings as before serves as the base estimator, the [`parameters_lr`](#param_lr) hyperparameters define the grid, and each combination is scored with the F1 score using 2-fold cross-validation.

In [123]:
lr_tuned = GridSearchCV(LogisticRegression(n_jobs=-1, class_weight='balanced'), parameters_lr, scoring='f1', cv=2)
lr_models_count_tuned, lr_models_tfidf_tuned, \
    predictions_lr_count_tuned, predictions_lr_tfidf_tuned = tune_and_train_models(lr_tuned)

Fitting toxic...
Best parameters: {'C': 1, 'max_iter': 300}
Count Vectors: 0.9770572347105677
Best parameters: {'C': 10, 'max_iter': 600}
TF-IDF Vectors: 0.9798710291970345
Fitting severe_toxic...
Best parameters: {'C': 1, 'max_iter': 900}
Count Vectors: 0.9865953086713751
Best parameters: {'C': 10, 'max_iter': 300}
TF-IDF Vectors: 0.989340168326325
Fitting obscene...
Best parameters: {'C': 1, 'max_iter': 600}
Count Vectors: 0.9879802721045804
Best parameters: {'C': 10, 'max_iter': 300}
TF-IDF Vectors: 0.9901423190930683
Fitting threat...
Best parameters: {'C': 1, 'max_iter': 300}
Count Vectors: 0.9973930100080842
Best parameters: {'C': 10, 'max_iter': 300}
TF-IDF Vectors: 0.998145026351906
Fitting insult...
Best parameters: {'C': 1, 'max_iter': 1200}
Count Vectors: 0.9787743386956277
Best parameters: {'C': 10, 'max_iter': 300}
TF-IDF Vectors: 0.9829981638267605
Fitting identity_hate...
Best parameters: {'C': 1, 'max_iter': 900}
Count Vectors: 0.9894717711864938
Best parameters: {'C': 

In the output above, TF-IDF vectors had higher training accuracy scores than count vectors for **toxic**, **severe_toxic**, **obscene**, **threat**, and **insult**.

TODO: this cell will be run again.

Next, the test predictions are saved to CSV files before they are uploaded to Kaggle.

In [124]:
to_submission_csv(predictions_lr_count_tuned, 'submission_lr_count_tuned')
to_submission_csv(predictions_lr_tfidf_tuned, 'submission_lr_tfidf_tuned')

#### Test Accuracy Score

TODO: the score went down; investigate why.

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_lr_count_tuned</td>
    <td class="tg-baqh">0.94016</td>
    <td class="tg-baqh">0.94392</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_lr_tfidf_tuned</td>
    <td class="tg-baqh">0.97135</td>
    <td class="tg-baqh">0.97227</td>
  </tr>
</tbody>
</table>

### Multinomial Naive Bayes <a class="anchor" id="mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
A `MultinomialNB()` object with default parameters is initialized before it is passed to the `train_models()` function. A classifier for each of the six classes for each vectorizer will be trained using this object as a base.

In [37]:
mn = MultinomialNB()
mn_models_count, mn_models_tfidf, predictions_mn_count, predictions_mn_tfidf = train_models(mn)

Fitting toxic...
Count Vectors: 0.9513696097661856
TF-IDF Vectors: 0.9236828747078103
Fitting severe_toxic...
Count Vectors: 0.98641983819115
TF-IDF Vectors: 0.9899104473870566
Fitting obscene...
Count Vectors: 0.9670867513520627
TF-IDF Vectors: 0.9538449968979326
Fitting threat...
Count Vectors: 0.9955505699657206
TF-IDF Vectors: 0.996973134216117
Fitting insult...
Count Vectors: 0.9646301646289113
TF-IDF Vectors: 0.9535629907689994
Fitting identity_hate...
Count Vectors: 0.9877233331871079
TF-IDF Vectors: 0.9911074067343063


As seen from the output above, the models trained on count vectors reached higher training accuracy for the **toxic**, **obscene**, and **insult** classes, while the models trained on TF-IDF vectors reached higher training accuracy for the **severe_toxic**, **threat**, and **identity_hate** classes.

Next, the test predictions are saved to CSV files before they are uploaded to Kaggle.

In [40]:
to_submission_csv(predictions_mn_count, 'submission_mn_count')
to_submission_csv(predictions_mn_tfidf, 'submission_mn_tfidf')

#### Test Accuracy Score
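As seen from the scores returned by Kaggle, the count vector submission scored higher than the TF-IDF submission on both the private and public leaderboards, although both scores are low.

# TODO: find out why these scores are so low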

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_mn_count</td>
    <td class="tg-baqh">0.84551</td>
    <td class="tg-baqh">0.85581</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_mn_tfidf</td>
    <td class="tg-baqh">0.82510</td>
    <td class="tg-baqh">0.83586</td>
  </tr>
</tbody>
</table>
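
One thing to keep in mind when comparing these numbers with the training output is that Kaggle scores this competition with the mean column-wise ROC AUC, while the training cells print accuracy. The two metrics can disagree strongly on imbalanced labels. The sketch below uses made-up labels and scores to show a predictor with high accuracy and a chance-level AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced toy labels: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A predictor that gives every comment the same low probability.
scores = np.full(100, 0.1)

# Thresholding at 0.5 predicts "negative" everywhere, which is right 95% of the time.
acc = accuracy_score(y_true, (scores >= 0.5).astype(int))

# AUC measures ranking quality; constant scores rank nothing, so AUC is 0.5.
auc = roc_auc_score(y_true, scores)

print('accuracy:', acc)
print('ROC AUC:', auc)
```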

#### Hyperparameter Tuning
To tune the Multinomial Naive Bayes models, `GridSearchCV()` was used to search the grid exhaustively, since tuning this model takes relatively little time. A `MultinomialNB()` object with default parameters serves as the base estimator, and the hyperparameters in [`parameters_mnb`](#param_mn) define the search space.
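
As a minimal sketch of this kind of search, the block below runs `GridSearchCV()` over `alpha` and `fit_prior` on synthetic count data. The grid `demo_grid` and the data are assumptions made for illustration; the actual `parameters_mnb` values are defined earlier in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features with one informative column.
rng = np.random.default_rng(8)
X = rng.integers(0, 5, size=(40, 6))
y = np.array([0, 1] * 20)
X[y == 1, 0] += 5

# Hypothetical grid; every combination (3 x 2 = 6) is evaluated.
demo_grid = {'alpha': [0.001, 0.1, 1.0], 'fit_prior': [True, False]}

search = GridSearchCV(MultinomialNB(), demo_grid, scoring='f1', cv=2)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Candidates evaluated:', len(search.cv_results_['params']))
```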

In [125]:
mn_tuned = GridSearchCV(MultinomialNB(), parameters_mnb, scoring='f1', cv=2)
mn_models_count_tuned, mn_models_tfidf_tuned, \
    predictions_mn_count_tuned, predictions_mn_tfidf_tuned = tune_and_train_models(mn_tuned)

Fitting toxic...
Best parameters: {'alpha': 0.1, 'fit_prior': True}
Count Vectors: 0.95461581365035
Best parameters: {'alpha': 0.001, 'fit_prior': True}
TF-IDF Vectors: 0.9748262528905628
Fitting severe_toxic...
Best parameters: {'alpha': 0.001, 'fit_prior': True}
Count Vectors: 0.9872031885492978
Best parameters: {'alpha': 0.0001, 'fit_prior': True}
TF-IDF Vectors: 0.9945980159302129
Fitting obscene...
Best parameters: {'alpha': 0.1, 'fit_prior': True}
Count Vectors: 0.966065262485038
Best parameters: {'alpha': 0.001, 'fit_prior': True}
TF-IDF Vectors: 0.9862067668937339
Fitting threat...
Best parameters: {'alpha': 0.001, 'fit_prior': True}
Count Vectors: 0.9916275513721164
Best parameters: {'alpha': 1e-05, 'fit_prior': True}
TF-IDF Vectors: 0.9987403726240983
Fitting insult...
Best parameters: {'alpha': 0.1, 'fit_prior': True}
Count Vectors: 0.9630258630954246
Best parameters: {'alpha': 0.001, 'fit_prior': True}
TF-IDF Vectors: 0.9840384531023808
Fitting identity_hate...
Best paramet

In [126]:
to_submission_csv(predictions_mn_count_tuned, 'submission_mn_count_tuned')
to_submission_csv(predictions_mn_tfidf_tuned, 'submission_mn_tfidf_tuned')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_mn_count_tuned</td>
    <td class="tg-baqh">0.90205</td>
    <td class="tg-baqh">0.90411</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_mn_tfidf_tuned</td>
    <td class="tg-baqh">0.90610</td>
    <td class="tg-baqh">0.90995</td>
  </tr>
</tbody>
</table>

### RandomForestClassifier <a class="anchor" id="rf"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: add a short description of the model.

#### Model Training
A `RandomForestClassifier()` object with `n_jobs=-1` and `class_weight='balanced'` (all other parameters are left at their defaults) is initialized before it is passed to the `train_models()` function. A classifier for each of the six classes for each vectorizer will be trained using this object as a base.
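
The cell below sets `class_weight='balanced'`, which weights each class inversely to its frequency using `n_samples / (n_classes * count)`. A small sketch with a made-up 90/10 label split shows the resulting weights:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

# Negative class: 100 / (2 * 90); positive class: 100 / (2 * 10).
print(dict(zip([0, 1], weights)))
```

The rare positive class receives the larger weight, so mistakes on it count more during training.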

In [127]:
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced')
rf_models_count, rf_models_tfidf, predictions_rf_count, predictions_rf_tfidf = train_models(rf)

Fitting toxic...
Count Vectors: 0.9997869287025838
TF-IDF Vectors: 0.9997493278853927
Fitting severe_toxic...
Count Vectors: 0.9998245295197749
TF-IDF Vectors: 0.9997054602653365
Fitting obscene...
Count Vectors: 0.9998245295197749
TF-IDF Vectors: 0.9997681282939882
Fitting threat...
Count Vectors: 0.9999435987742133
TF-IDF Vectors: 0.9999373319713482
Fitting insult...
Count Vectors: 0.9996866598567409
TF-IDF Vectors: 0.9995989246166284
Fitting identity_hate...
Count Vectors: 0.9998997311541571
TF-IDF Vectors: 0.9998809307455615


In [128]:
to_submission_csv(predictions_rf_count, 'submission_rf_count')
to_submission_csv(predictions_rf_tfidf, 'submission_rf_tfidf')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_rf_count</td>
    <td class="tg-baqh">0.94071</td>
    <td class="tg-baqh">0.94602</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_rf_tfidf</td>
    <td class="tg-baqh">0.93731</td>
    <td class="tg-baqh">0.93999</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning
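The cell below uses `RandomizedSearchCV()`, which samples a fixed number of parameter combinations instead of trying every one. A minimal sketch on synthetic data follows; `demo_distributions` and `n_iter=5` are assumptions chosen to keep the example fast, not the notebook's `parameters_rf`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data where the label depends on the first feature.
rng = np.random.default_rng(8)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

# Hypothetical search space with 3 * 3 * 2 = 18 possible combinations.
demo_distributions = {
    'n_estimators': [10, 20, 30],
    'max_depth': [2, 4, None],
    'criterion': ['gini', 'entropy'],
}

# Only n_iter combinations are sampled and evaluated.
search = RandomizedSearchCV(RandomForestClassifier(random_state=8), demo_distributions,
                            n_iter=5, scoring='f1', cv=2, random_state=8)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Candidates evaluated:', len(search.cv_results_['params']))
```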

In [129]:
rf_tuned = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1, class_weight='balanced'), parameters_rf, scoring='f1', random_state=8, cv=2)
rf_models_count_tuned, rf_models_tfidf_tuned, \
    predictions_rf_count_tuned, predictions_rf_tfidf_tuned = tune_and_train_models(rf_tuned)

Fitting toxic...
Best parameters: {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
Count Vectors: 0.7163018342931986
Best parameters: {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
TF-IDF Vectors: 0.721058337667872
Fitting severe_toxic...
Best parameters: {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
Count Vectors: 0.7858633460967218
Best parameters: {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
TF-IDF Vectors: 0.7954578212833159
Fitting obscene...
Best parameters: {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
Count Vectors: 0.7358041248096459
Best parameters: {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
TF-IDF Vector

In [130]:
to_submission_csv(predictions_rf_count_tuned, 'submission_rf_count_tuned')
to_submission_csv(predictions_rf_tfidf_tuned, 'submission_rf_tfidf_tuned')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_rf_count_tuned</td>
    <td class="tg-baqh">0.96232</td>
    <td class="tg-baqh">0.96285</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_rf_tfidf_tuned</td>
    <td class="tg-baqh">0.95937</td>
    <td class="tg-baqh">0.95826</td>
  </tr>
</tbody>
</table>

### GradientBoostingClassifier <a class="anchor" id="gbc"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: add a short description of the model.

#### Model Training

In [131]:
gbc = GradientBoostingClassifier(random_state=8)
gbc_models_count, gbc_models_tfidf, predictions_gbc_count, predictions_gbc_tfidf = train_models(gbc)

Fitting toxic...
Count Vectors: 0.943373169310213
TF-IDF Vectors: 0.9443382569514511
Fitting severe_toxic...
Count Vectors: 0.9916902194007683
TF-IDF Vectors: 0.9923106328844213
Fitting obscene...
Count Vectors: 0.9760106786320822
TF-IDF Vectors: 0.9768378966102863
Fitting threat...
Count Vectors: 0.9974118104166797
TF-IDF Vectors: 0.9978191526029165
Fitting insult...
Count Vectors: 0.968609584448302
TF-IDF Vectors: 0.9696498737239223
Fitting identity_hate...
Count Vectors: 0.9942094741525715
TF-IDF Vectors: 0.9950366921307756


In [None]:
to_submission_csv(predictions_gbc_count, 'submission_gbc_count')
to_submission_csv(predictions_gbc_tfidf, 'submission_gbc_tfidf')

#### Hyperparameter Tuning

In [132]:
gbc_tuned = RandomizedSearchCV(GradientBoostingClassifier(), parameters_gbc, scoring='accuracy', random_state=8, cv=2)
gbc_models_count_tuned, gbc_models_tfidf_tuned, \
    predictions_gbc_count_tuned, predictions_gbc_tfidf_tuned = tune_and_train_models(gbc_tuned)

Fitting toxic...


KeyboardInterrupt: 

In [None]:
to_submission_csv(predictions_gbc_count_tuned, 'submission_gbc_count_tuned')
to_submission_csv(predictions_gbc_tfidf_tuned, 'submission_gbc_tfidf_tuned')

### XGBClassifier <a class="anchor" id="xgb"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: add a short description of the model.

#### Model Training

In [121]:
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_count, xgb_models_tfidf, predictions_xgb_count, predictions_xgb_tfidf = train_models(xgb)

Fitting toxic...
Count Vectors: 0.9643732257114388
TF-IDF Vectors: 0.9667358103916125
Fitting severe_toxic...
Count Vectors: 0.9943536106184708
TF-IDF Vectors: 0.9954127003026866
Fitting obscene...
Count Vectors: 0.9863947709796893
TF-IDF Vectors: 0.9876167975383998
Fitting threat...
Count Vectors: 0.9989910447387057
TF-IDF Vectors: 0.9993482525020211
Fitting insult...
Count Vectors: 0.9798772959998997
TF-IDF Vectors: 0.9813186606588916
Fitting identity_hate...
Count Vectors: 0.9955568367685858
TF-IDF Vectors: 0.9959391117433619


In [117]:
to_submission_csv(predictions_xgb_count, 'submission_xgb_count')
to_submission_csv(predictions_xgb_tfidf, 'submission_xgb_tfidf')

<table>
<thead>
  <tr>
    <th></th>
    <th>private</th>
    <th>public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>submission_xgb_count</td>
    <td>0.96468</td>
    <td>0.96783</td>
  </tr>
  <tr>
    <td>submission_xgb_tfidf</td>
    <td>0.96502</td>
    <td>0.96803</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [None]:
estimator = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_tuned = GridSearchCV(estimator, parameters_xgb, scoring='f1', cv=2)
xgb_models_count_tuned, xgb_models_tfidf_tuned, \
    predictions_xgb_count_tuned, predictions_xgb_tfidf_tuned = tune_and_train_models(xgb_tuned)

In [None]:
to_submission_csv(predictions_xgb_count_tuned, 'submission_xgb_count_tuned')
to_submission_csv(predictions_xgb_tfidf_tuned, 'submission_xgb_tfidf_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_xgb_count_tuned.csv', 'submission_xgb_tfidf_tuned.csv']
)

### AdaBoostClassifier <a class="anchor" id="adb"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
adb = AdaBoostClassifier(random_state=8)
adb_models_count, adb_models_tfidf, predictions_adb_count, predictions_adb_tfidf = train_models(adb)

In [None]:
to_submission_csv(predictions_adb_count, 'submission_adb_count')
to_submission_csv(predictions_adb_tfidf, 'submission_adb_tfidf')

In [None]:
pd.DataFrame(
    data={'private': [0.93539, 0.93830], 'public': [0.94218, 0.94145]}, 
    index=['submission_adb_count.csv', 'submission_adb_tfidf.csv']
)

#### Hyperparameter Tuning

In [None]:
adb_tuned = GridSearchCV(AdaBoostClassifier(random_state=8), parameters_adb, scoring='accuracy', cv=2)
adb_models_count_tuned, adb_models_tfidf_tuned, \
    predictions_adb_count_tuned, predictions_adb_tfidf_tuned = tune_and_train_models(adb_tuned)

In [None]:
to_submission_csv(predictions_adb_count_tuned, 'submission_adb_count_tuned')
to_submission_csv(predictions_adb_tfidf_tuned, 'submission_adb_tfidf_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_adb_count_tuned.csv', 'submission_adb_tfidf_tuned.csv']
)

### Stochastic Gradient Descent Classifier <a class="anchor" id="sgd"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
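The cell below uses `loss='modified_huber'`. One practical reason for such a choice is that `SGDClassifier` only exposes `predict_proba()` for some losses, and probabilities are what a submission file needs. The sketch below, on synthetic data, contrasts it with the default hinge loss; whether this was the reason for the choice here is an assumption.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic two-class data.
rng = np.random.default_rng(8)
X = rng.normal(size=(80, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# modified_huber supports probability estimates.
clf = SGDClassifier(loss='modified_huber', class_weight='balanced', random_state=8)
clf.fit(X, y)
proba = clf.predict_proba(X)
print('Probability matrix shape:', proba.shape)

# The default hinge loss does not expose predict_proba at all.
hinge = SGDClassifier(loss='hinge', random_state=8).fit(X, y)
has_proba = hasattr(hinge, 'predict_proba')
print('hinge has predict_proba:', has_proba)
```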

In [88]:
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
sgd_models_count, sgd_models_tfidf, predictions_sgd_count, predictions_sgd_tfidf = train_models(sgd)

Fitting toxic...
Count Vectors: 0.9740930369553364
TF-IDF Vectors: 0.9546032800446196
Fitting severe_toxic...
Count Vectors: 0.9773141736280402
TF-IDF Vectors: 0.9773705748538268
Fitting obscene...
Count Vectors: 0.9366551566387377
TF-IDF Vectors: 0.9798020943655176
Fitting threat...
Count Vectors: 0.9831673675041204
TF-IDF Vectors: 0.9951808285966748
Fitting insult...
Count Vectors: 0.8818895663999098
TF-IDF Vectors: 0.9667295435887473
Fitting identity_hate...
Count Vectors: 0.961321292716095
TF-IDF Vectors: 0.9792819497277074


In [73]:
to_submission_csv(predictions_sgd_count, 'submission_sgd_count')
to_submission_csv(predictions_sgd_tfidf, 'submission_sgd_tfidf')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_sgd_count</td>
    <td class="tg-baqh">0.68992</td>
    <td class="tg-baqh">0.70589</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_sgd_tfidf</td>
    <td class="tg-baqh">0.97214</td>
    <td class="tg-baqh">0.97625</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [None]:
sgd_tuned = GridSearchCV(SGDClassifier(n_jobs=-1, class_weight='balanced', random_state=8), parameters_sgd, scoring='f1', cv=2)
sgd_models_count_tuned, sgd_models_tfidf_tuned, \
    predictions_sgd_count_tuned, predictions_sgd_tfidf_tuned = tune_and_train_models(sgd_tuned)

In [None]:
to_submission_csv(predictions_sgd_count_tuned, 'submission_sgd_count_tuned')
to_submission_csv(predictions_sgd_tfidf_tuned, 'submission_sgd_tfidf_tuned')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_sgd_count_tuned</td>
    <td class="tg-baqh">0.81387</td>
    <td class="tg-baqh">0.81848</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_sgd_tfidf_tuned</td>
    <td class="tg-baqh">0.94660</td>
    <td class="tg-baqh">0.95230</td>
  </tr>
</tbody>
</table>

### OneVsRest Classifier: Logistic Regression <a class="anchor" id="oc_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
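As a minimal sketch of what the wrapper does, the block below fits `OneVsRestClassifier` on a synthetic multi-label target with three label columns. One logistic regression is fitted per column, and `predict_proba()` returns one probability per label. The data here is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic features and a three-column binary label matrix.
rng = np.random.default_rng(8)
X = rng.normal(size=(60, 4))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0, X[:, 2] > 0]).astype(int)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=3000))
ovr.fit(X, Y)

# One fitted estimator per label column.
print('Number of estimators:', len(ovr.estimators_))

# For multi-label targets, the result has shape (n_samples, n_labels).
proba = ovr.predict_proba(X)
print('Probability matrix shape:', proba.shape)
```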

In [None]:
oc_lr = OneVsRestClassifier(LogisticRegression(max_iter=3000))
oc_lr_model_count, oc_lr_model_tfidf, predictions_oc_lr_count, predictions_oc_lr_tfidf = train_model(oc_lr)

In [None]:
to_submission_csv_multiclass(predictions_oc_lr_count, 'submission_oc_lr_count')
to_submission_csv_multiclass(predictions_oc_lr_tfidf, 'submission_oc_lr_tfidf')

#### Hyperparameter Tuning

In [None]:
estimator = OneVsRestClassifier(LogisticRegression())
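# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
oc_lr_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs=-1, scoring='f1_macro', cv=2)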
oc_lr_model_count_tuned, oc_lr_model_tfidf_tuned, \
    predictions_oc_lr_count_tuned, predictions_oc_lr_tfidf_tuned = tune_and_train_model(oc_lr_tuned)

In [None]:
to_submission_csv_multiclass(predictions_oc_lr_count_tuned, 'submission_oc_lr_count_tuned')
to_submission_csv_multiclass(predictions_oc_lr_tfidf_tuned, 'submission_oc_lr_tfidf_tuned')

### OneVsRest Classifier: Multinomial Naive Bayes <a class="anchor" id="oc_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
oc_mn = OneVsRestClassifier(MultinomialNB())
oc_mn_model_count, oc_mn_model_tfidf, predictions_oc_mn_count, predictions_oc_mn_tfidf = train_model(oc_mn)

In [None]:
to_submission_csv_multiclass(predictions_oc_mn_count, 'submission_oc_mn_count')
to_submission_csv_multiclass(predictions_oc_mn_tfidf, 'submission_oc_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
estimator = OneVsRestClassifier(MultinomialNB())
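# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
oc_mn_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs=-1, scoring='f1_macro', cv=2)
oc_mn_model_count_tuned, oc_mn_model_tfidf_tuned, \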
    predictions_oc_mn_count_tuned, predictions_oc_mn_tfidf_tuned = tune_and_train_model(oc_mn_tuned)

In [None]:
to_submission_csv_multiclass(predictions_oc_mn_count_tuned, 'submission_oc_mn_count_tuned')
to_submission_csv_multiclass(predictions_oc_mn_tfidf_tuned, 'submission_oc_mn_tfidf_tuned')

### MultiOutput Classifier: Logistic Regression <a class="anchor" id="mo_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
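One difference from the one-vs-rest wrapper is worth noting: `MultiOutputClassifier.predict_proba()` returns a list with one `(n_samples, 2)` array per label rather than a single matrix, which may be why a separate `to_submission_csv_multiclass()` helper is used in this notebook. The sketch below, on synthetic data, shows how the positive-class columns can be stacked into one matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic features and a three-column binary label matrix.
rng = np.random.default_rng(8)
X = rng.normal(size=(60, 4))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0, X[:, 2] > 0]).astype(int)

moc = MultiOutputClassifier(LogisticRegression(max_iter=3000))
moc.fit(X, Y)

# A list of arrays: one (n_samples, 2) array of class probabilities per label.
per_label = moc.predict_proba(X)
print('Outputs:', len(per_label), 'each of shape', per_label[0].shape)

# Keep only the probability of the positive class for each label.
positive = np.column_stack([p[:, 1] for p in per_label])
print('Stacked shape:', positive.shape)
```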

In [None]:
mo_lr = MultiOutputClassifier(LogisticRegression(max_iter=3000))
mo_lr_model_count, mo_lr_model_tfidf, predictions_mo_lr_count, predictions_mo_lr_tfidf = train_model(mo_lr)

In [None]:
to_submission_csv_multiclass(predictions_mo_lr_count, 'submission_mo_lr_count')
to_submission_csv_multiclass(predictions_mo_lr_tfidf, 'submission_mo_lr_tfidf')

#### Hyperparameter Tuning

In [None]:
estimator = MultiOutputClassifier(LogisticRegression())
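# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
mo_lr_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs=-1, scoring='f1_macro', cv=2)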
mo_lr_model_count_tuned, mo_lr_model_tfidf_tuned, \
    predictions_mo_lr_count_tuned, predictions_mo_lr_tfidf_tuned = tune_and_train_model(mo_lr_tuned)

In [None]:
to_submission_csv_multiclass(predictions_mo_lr_count_tuned, 'submission_mo_lr_count_tuned')
to_submission_csv_multiclass(predictions_mo_lr_tfidf_tuned, 'submission_mo_lr_tfidf_tuned')

### MultiOutput Classifier: Multinomial Naive Bayes <a class="anchor" id="mo_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
mo_mn = MultiOutputClassifier(MultinomialNB())
mo_mn_model_count, mo_mn_model_tfidf, predictions_mo_mn_count, predictions_mo_mn_tfidf = train_model(mo_mn)

In [None]:
to_submission_csv_multiclass(predictions_mo_mn_count, 'submission_mo_mn_count')
to_submission_csv_multiclass(predictions_mo_mn_tfidf, 'submission_mo_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
estimator = MultiOutputClassifier(MultinomialNB())
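# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
mo_mn_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs=-1, scoring='f1_macro', cv=2)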
mo_mn_model_count_tuned, mo_mn_model_tfidf_tuned, \
    predictions_mo_mn_count_tuned, predictions_mo_mn_tfidf_tuned = tune_and_train_model(mo_mn_tuned)

In [None]:
to_submission_csv_multiclass(predictions_mo_mn_count_tuned, 'submission_mo_mn_count_tuned')
to_submission_csv_multiclass(predictions_mo_mn_tfidf_tuned, 'submission_mo_mn_tfidf_tuned')

### Binary Relevance: Logistic Regression <a class="anchor" id="br_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
br_lr = BinaryRelevance(LogisticRegression())
br_lr_model_count, br_lr_model_tfidf, predictions_br_lr_count, predictions_br_lr_tfidf = train_model(br_lr)

In [None]:
to_submission_csv_multiclass(predictions_br_lr_count, 'submission_br_lr_count')
to_submission_csv_multiclass(predictions_br_lr_tfidf, 'submission_br_lr_tfidf')

#### Hyperparameter Tuning

In [None]:
estimator = BinaryRelevance(LogisticRegression())
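# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
br_lr_tuned = GridSearchCV(estimator, parameters_lr_multi, n_jobs=-1, scoring='f1_macro', cv=2)
br_lr_model_count_tuned, br_lr_model_tfidf_tuned, \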
    predictions_br_lr_count_tuned, predictions_br_lr_tfidf_tuned = tune_and_train_model(br_lr_tuned)

In [None]:
to_submission_csv_multiclass(predictions_br_lr_count_tuned, 'submission_br_lr_count_tuned')
to_submission_csv_multiclass(predictions_br_lr_tfidf_tuned, 'submission_br_lr_tfidf_tuned')

### Binary Relevance: Multinomial Naive Bayes <a class="anchor" id="br_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
br_mn = BinaryRelevance(MultinomialNB())
br_mn_model_count, br_mn_model_tfidf, predictions_br_mn_count, predictions_br_mn_tfidf = train_model(br_mn)

In [None]:
to_submission_csv_multiclass(predictions_br_mn_count, 'submission_br_mn_count')
to_submission_csv_multiclass(predictions_br_mn_tfidf, 'submission_br_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
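estimator = BinaryRelevance(MultinomialNB())
# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
br_mn_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs=-1, scoring='f1_macro', cv=2)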
br_mn_model_count_tuned, br_mn_model_tfidf_tuned, \
    predictions_br_mn_count_tuned, predictions_br_mn_tfidf_tuned = tune_and_train_model(br_mn_tuned)

In [None]:
to_submission_csv_multiclass(predictions_br_mn_count_tuned, 'submission_br_mn_count_tuned')
to_submission_csv_multiclass(predictions_br_mn_tfidf_tuned, 'submission_br_mn_tfidf_tuned')

### Classifier Chain: Multinomial Naive Bayes <a class="anchor" id="cc_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
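A classifier chain fits one model per label in sequence, and each later model also receives the earlier labels as extra features. The sketch below illustrates this with scikit-learn's `sklearn.multioutput.ClassifierChain` on synthetic count data. This is an assumption made for illustration: the `ClassifierChain` imported in this notebook may come from a different library with a different interface.

```python
import numpy as np
from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features; the third label depends on the first two.
rng = np.random.default_rng(8)
X = rng.integers(0, 4, size=(60, 5))
Y = np.column_stack([X[:, 0] > 1, X[:, 1] > 1, X[:, 0] + X[:, 1] > 3]).astype(int)

# order fixes the sequence in which the labels are modeled.
chain = ClassifierChain(MultinomialNB(), order=[0, 1, 2])
chain.fit(X, Y)

# One fitted estimator per link of the chain.
print('Number of estimators:', len(chain.estimators_))

# One probability column per label.
proba = chain.predict_proba(X)
print('Probability matrix shape:', proba.shape)
```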

In [None]:
cc_mn = ClassifierChain(MultinomialNB())
cc_mn_model_count, cc_mn_model_tfidf, predictions_cc_mn_count, predictions_cc_mn_tfidf = train_model(cc_mn)

In [None]:
to_submission_csv_multiclass(predictions_cc_mn_count, 'submission_cc_mn_count')
to_submission_csv_multiclass(predictions_cc_mn_tfidf, 'submission_cc_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
estimator = ClassifierChain(MultinomialNB())
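# cv=1 is not a valid k-fold setting, and plain 'f1' only supports binary targets, so cv=2 and 'f1_macro' are used.
cc_mn_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs=-1, scoring='f1_macro', cv=2)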
cc_mn_model_count_tuned, cc_mn_model_tfidf_tuned, \
    predictions_cc_mn_count_tuned, predictions_cc_mn_tfidf_tuned = tune_and_train_model(cc_mn_tuned)

In [None]:
to_submission_csv_multiclass(predictions_cc_mn_count_tuned, 'submission_cc_mn_count_tuned')
to_submission_csv_multiclass(predictions_cc_mn_tfidf_tuned, 'submission_cc_mn_tfidf_tuned')

# old stuff below

In [27]:
lr_oc_count = OneVsRestClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_oc_count.fit(count_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                 max_iter=3000))

In [28]:
lr_oc_tf = OneVsRestClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_oc_tf.fit(tfidf_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                 max_iter=3000))

In [29]:
predictions = lr_oc_tf.predict(tfidf_train)
print('TF-IDF Vectors: ' , accuracy_score(predictions, y_train))

predictions = lr_oc_count.predict(count_train)
print('Count Vectors: ', accuracy_score(predictions, y_train))

TF-IDF Vectors:  toxic            95.739201
severe_toxic     97.941355
obscene          98.078598
threat           99.375200
insult           96.812077
identity_hate    98.102412
dtype: float64
Count Vectors:  toxic            97.955142
severe_toxic     98.672064
obscene          98.804294
threat           99.741808
insult           97.874927
identity_hate    98.947177
dtype: float64


In [30]:
predictions_lr_oc_tf = lr_oc_tf.predict_proba(tfidf_test)
predictions_lr_oc_count = lr_oc_count.predict_proba(count_test)

In [35]:
to_submission_csv(predictions_lr_oc_tf, 'submission_oc_lr_tf')
to_submission_csv(predictions_lr_oc_count, 'submission_oc_lr_count')

In [36]:
pd.DataFrame(
    data={'private': [0.94036, 0.97558], 'public': [0.94400, 0.97621]}, 
    index=['submission_oc_lr_count.csv', 'submission_oc_lr_tf.csv']
)

Unnamed: 0,private,public
submission_oc_lr_count.csv,0.94036,0.944
submission_oc_lr_tf.csv,0.97558,0.97621


#### Hyperparameter Tuning

In [57]:
predictions_lr_count_tuned = np.zeros((len(test), len(classes)))
predictions_lr_tfidf_tuned = np.zeros((len(test), len(classes)))

In [58]:
estimator = OneVsRestClassifier(LogisticRegression())

In [59]:
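# scoring is set to 'accuracy' to match the GridSearchCV settings recorded in the output of this cell
lr_oc_tf_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs=-1, verbose=10, scoring='accuracy')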
lr_oc_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


GridSearchCV(estimator=OneVsRestClassifier(estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid=[{'estimator__C': [1, 12, 15],
                          'estimator__class_weight': ['balanced', None],
                          'estimator__max_iter': [600, 1800, 3000]}],
             scoring='accuracy', verbose=10)
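
The fit count in the log above follows from the grid size: 3 values of `C`, 2 of `class_weight`, and 3 of `max_iter` give 18 candidates, and each one is fitted once per fold with the default 5-fold cross-validation. The sketch below reproduces the arithmetic with `ParameterGrid`, using the grid shown in the output.

```python
from sklearn.model_selection import ParameterGrid

# The parameter grid as printed in the GridSearchCV output.
grid = [{
    'estimator__C': [1, 12, 15],
    'estimator__class_weight': ['balanced', None],
    'estimator__max_iter': [600, 1800, 3000],
}]

n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 3 combinations
n_folds = 5                              # GridSearchCV's default when cv is not set
n_fits = n_candidates * n_folds

print(f'{n_candidates} candidates x {n_folds} folds = {n_fits} fits')
```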

In [60]:
predictions = lr_oc_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: ', accuracy_score(predictions, y_train), lr_oc_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            98.477167
severe_toxic     99.529363
obscene          99.205369
threat           99.895344
insult           98.869469
identity_hate    99.662219
dtype: float64 {'estimator__C': 12, 'estimator__class_weight': None, 'estimator__max_iter': 600}


In [61]:
predictions_lr_tfidf_tuned = lr_oc_tf_tuned.predict_proba(tfidf_test)
to_submission_csv(predictions_lr_tfidf_tuned, 'submission_oc_lr_tf_tuned')

In [None]:
# scoring is set to 'accuracy' for consistency with the TF-IDF search above
lr_oc_count_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs=-1, verbose=10, scoring='accuracy')
lr_oc_count_tuned.fit(count_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


In [None]:
predictions = lr_oc_count_tuned.predict(count_train)
print('Count Vectors: ', accuracy_score(predictions, y_train), lr_oc_count_tuned.best_params_)

In [None]:
predictions_lr_count_tuned = lr_oc_count_tuned.predict_proba(count_test)
to_submission_csv(predictions_lr_count_tuned, 'submission_oc_lr_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0.93996, 0.97558], 'public': [0.94410, 0.97621]}, 
    index=['submission_oc_lr_count_tuned.csv', 'submission_oc_lr_tf_tuned.csv']
)

#### Model Training

In [34]:
mn_oc_count = OneVsRestClassifier(MultinomialNB())
mn_oc_count.fit(count_train, y_train)

OneVsRestClassifier(estimator=MultinomialNB())

In [35]:
mn_oc_tf = OneVsRestClassifier(MultinomialNB())
mn_oc_tf.fit(tfidf_train, y_train)

OneVsRestClassifier(estimator=MultinomialNB())

In [36]:
predictions = mn_oc_tf.predict(tfidf_train)
print('TF-IDF Vectors: \n' , accuracy_score(predictions, y_train))

predictions = mn_oc_count.predict(count_train)
print('Count Vectors: \n', accuracy_score(predictions, y_train))

TF-IDF Vectors:  toxic            92.368287
severe_toxic     98.991045
obscene          95.384500
threat           99.697313
insult           95.356299
identity_hate    99.110741
dtype: float64
Count Vectors:  toxic            95.136961
severe_toxic     98.641984
obscene          96.708675
threat           99.555057
insult           96.463016
identity_hate    98.772333
dtype: float64


In [37]:
predictions_mn_oc_tf = mn_oc_tf.predict_proba(tfidf_test)
predictions_mn_oc_count = mn_oc_count.predict_proba(count_test)

In [38]:
to_submission_csv(predictions_mn_oc_tf, 'submission_oc_mn_tf')
to_submission_csv(predictions_mn_oc_count, 'submission_oc_mn_count')

In [50]:
pd.DataFrame(
    data={'private': [0.84551, 0.82510], 'public': [0.85581, 0.83586]}, 
    index=['submission_oc_mn_count.csv', 'submission_oc_mn_tf.csv']
)

Unnamed: 0,private,public
submission_oc_mn_count.csv,0.84551,0.85581
submission_oc_mn_tf.csv,0.8251,0.83586


#### Hyperparameter Tuning

In [40]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [41]:
estimator = OneVsRestClassifier(MultinomialNB())

In [42]:
mn_oc_tf_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'accuracy')
mn_oc_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


GridSearchCV(estimator=OneVsRestClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='accuracy', verbose=10)

In [43]:
predictions = mn_oc_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: ', accuracy_score(predictions, y_train), mn_oc_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            93.609114
severe_toxic     98.989791
obscene          96.043141
threat           99.691673
insult           95.876444
identity_hate    99.106354
dtype: float64 {'estimator__alpha': 0.5, 'estimator__fit_prior': True}


In [44]:
predictions_mn_tfidf_tuned = mn_oc_tf_tuned.predict_proba(tfidf_test)
to_submission_csv(predictions_mn_tfidf_tuned, 'submission_oc_mn_tf_tuned')

In [45]:
mn_oc_count_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'accuracy')
mn_oc_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 10 candidates, totalling 50 fits


GridSearchCV(estimator=OneVsRestClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='accuracy', verbose=10)

In [46]:
predictions = mn_oc_count_tuned.predict(count_train)
print('Count Vectors: ', accuracy_score(predictions, y_train), mn_oc_count_tuned.best_params_)

Count Vectors:  toxic            95.136961
severe_toxic     98.641984
obscene          96.708675
threat           99.555057
insult           96.463016
identity_hate    98.772333
dtype: float64 {'estimator__alpha': 1.0, 'estimator__fit_prior': True}


In [47]:
predictions_mn_count_tuned = mn_oc_count_tuned.predict_proba(count_test)
to_submission_csv(predictions_mn_count_tuned, 'submission_oc_mn_count_tuned')

In [49]:
pd.DataFrame(
    data={'private': [0.84551, 0.85045], 'public': [0.85581, 0.86105]}, 
    index=['submission_oc_mn_count_tuned.csv', 'submission_oc_mn_tf_tuned.csv']
)

Unnamed: 0,private,public
submission_oc_mn_count_tuned.csv,0.84551,0.85581
submission_oc_mn_tf_tuned.csv,0.85045,0.86105


### MultiOutput Classifier: Logistic Regression

#### Model Training

In [142]:
X_train = train['comment_text']
X_test = test['comment_text']
y_train = train.loc[:, 'toxic':]
y_train

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
159566,0,0,0,0,0,0
159567,0,0,0,0,0,0
159568,0,0,0,0,0,0
159569,0,0,0,0,0,0


In [143]:
lr_mo_count = MultiOutputClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_mo_count.fit(count_train, y_train)

MultiOutputClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                   max_iter=3000))

In [144]:
lr_mo_tf = MultiOutputClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_mo_tf.fit(tfidf_train, y_train)

MultiOutputClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                   max_iter=3000))

In [146]:
predictions = lr_mo_tf.predict(tfidf_train)
print('TF-IDF Vectors: ' , accuracy_score(predictions, y_train))

predictions = lr_mo_count.predict(count_train)
print('Count Vectors: ', accuracy_score(predictions, y_train))

TF-IDF Vectors:  toxic            94.403118
severe_toxic     97.335982
obscene          97.324702
threat           99.067500
insult           95.850750
identity_hate    97.123537
dtype: float64
Count Vectors:  toxic            97.955142
severe_toxic     98.672064
obscene          98.804294
threat           99.741808
insult           97.874927
identity_hate    98.947177
dtype: float64
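
The figures above are per-label accuracies measured on the training data itself, so they cannot reveal overfitting, and with labels this imbalanced a model that rarely predicts `threat` would still score very high. The Kaggle leaderboard for this challenge ranks submissions by mean column-wise ROC AUC, so a held-out estimate of that metric is a closer proxy for the private and public scores. A minimal sketch, assuming synthetic stand-in data instead of the real TF-IDF matrix (`X_demo`, `Y_demo`, and `mean_columnwise_auc` are hypothetical names, not part of this notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier


def mean_columnwise_auc(y_true, proba_list):
    """Average the ROC AUC of each label, using the positive-class column."""
    y_true = np.asarray(y_true)
    aucs = [roc_auc_score(y_true[:, j], proba[:, 1])
            for j, proba in enumerate(proba_list)]
    return float(np.mean(aucs))


# Synthetic stand-in data (assumption): 400 rows, 20 features, 3 labels.
rng = np.random.default_rng(0)
X_demo = rng.random((400, 20))
Y_demo = (X_demo[:, :3] + 0.2 * rng.random((400, 3)) > 0.8).astype(int)

# Hold out a quarter of the rows; the model never sees them during fitting.
X_fit, X_val, Y_fit, Y_val = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=0
)

model = MultiOutputClassifier(
    LogisticRegression(class_weight='balanced', max_iter=3000)
)
model.fit(X_fit, Y_fit)

# predict_proba returns one (n_samples, 2) array per label.
val_auc = mean_columnwise_auc(Y_val, model.predict_proba(X_val))
print('Validation mean column-wise ROC AUC:', round(val_auc, 4))
```

With the real data, the same split could be applied to `tfidf_train` and `y_train` before fitting, at the cost of training on fewer rows.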


In [149]:
predictions_lr_mo_tf = lr_mo_tf.predict_proba(tfidf_test)
predictions_lr_mo_count = lr_mo_count.predict_proba(count_test)

In [150]:
to_submission_csv_multiclass(predictions_lr_mo_tf, 'submission_mo_lr_tf')
to_submission_csv_multiclass(predictions_lr_mo_count, 'submission_mo_lr_count')
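
`MultiOutputClassifier.predict_proba` returns a list with one `(n_samples, 2)` array per label rather than a single matrix, which is presumably why a separate `to_submission_csv_multiclass` helper is used for these models. That helper is defined elsewhere in the notebook; the sketch below is only a guess at the reshaping step it needs, with hypothetical names (`stack_positive_proba`, `to_submission_frame`) and fabricated probabilities:

```python
import numpy as np
import pandas as pd

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def stack_positive_proba(proba_list):
    """Turn a list of (n_samples, 2) arrays into one (n_samples, n_labels) matrix."""
    return np.column_stack([proba[:, 1] for proba in proba_list])


def to_submission_frame(ids, proba_list, labels=LABELS):
    """Build a frame with an id column and one probability column per label."""
    frame = pd.DataFrame(stack_positive_proba(proba_list), columns=labels)
    frame.insert(0, 'id', list(ids))
    return frame


# Fake predict_proba output for 3 comments and 6 labels (illustration only).
rng = np.random.default_rng(0)
pos = rng.random((3, len(LABELS)))
fake_proba = [np.column_stack([1 - pos[:, j], pos[:, j]])
              for j in range(len(LABELS))]

submission_demo = to_submission_frame(['a1', 'b2', 'c3'], fake_proba)
print(submission_demo.round(3))
```

Keeping only column 1 of each array matters: column 0 holds the probability of the negative class, which would invert the ranking if written to the submission by mistake.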

In [154]:
pd.DataFrame(
    data={'private': [0.94036, 0.97063], 'public': [0.94400, 0.97183]}, 
    index=['submission_mo_lr_count.csv', 'submission_mo_lr_tf.csv']
)

Unnamed: 0,private,public
submission_mo_lr_count.csv,0.94036,0.944
submission_mo_lr_tf.csv,0.97063,0.97183


#### Hyperparameter Tuning

In [152]:
predictions_lr_count_tuned = np.zeros((len(test), len(classes)))
predictions_lr_tfidf_tuned = np.zeros((len(test), len(classes)))

In [155]:
estimator = MultiOutputClassifier(LogisticRegression())
lr_mo_tf_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_mo_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits




GridSearchCV(estimator=MultiOutputClassifier(estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid=[{'estimator__C': [1, 12, 15],
                          'estimator__class_weight': ['balanced', None],
                          'estimator__max_iter': [600, 1800, 3000]}],
             scoring='f1', verbose=10)

In [156]:
predictions = lr_mo_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: ', accuracy_score(predictions, y_train), lr_mo_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            94.403118
severe_toxic     97.335982
obscene          97.324702
threat           99.067500
insult           95.850750
identity_hate    97.123537
dtype: float64 {'estimator__C': 1, 'estimator__class_weight': 'balanced', 'estimator__max_iter': 600}
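
The selected parameters above are the first combination in the grid, and the training accuracies are identical to those of the untuned model. One possible cause worth ruling out: the `'f1'` scorer assumes a single binary target, and scikit-learn's `f1_score` raises a `ValueError` on a multilabel indicator matrix unless an averaging mode such as `'micro'`, `'macro'`, or `'samples'` is chosen. Depending on the scikit-learn version, the grid search may then record a failed score for every candidate instead of stopping. A minimal sketch on synthetic data, assuming `'f1_macro'` as a substitute (the notebook does not say which averaging was intended):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data (assumption): 300 rows, 10 features, 3 labels.
rng = np.random.default_rng(0)
X_demo = rng.random((300, 10))
Y_demo = (X_demo[:, :3] > 0.6).astype(int)

# Binary-averaged F1 is not defined for a multilabel indicator matrix.
try:
    f1_score(Y_demo, Y_demo)  # default is average='binary'
    binary_f1_failed = False
except ValueError:
    binary_f1_failed = True
print('Plain f1 rejects multilabel targets:', binary_f1_failed)

# A multilabel-aware scorer gives every candidate a finite score.
grid = GridSearchCV(
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
    param_grid={'estimator__C': [0.01, 1, 100]},
    scoring='f1_macro',
    cv=3,
)
grid.fit(X_demo, Y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
print(grid.cv_results_['mean_test_score'])
```

Inspecting `cv_results_['mean_test_score']` after a search is a quick way to confirm that the candidates were actually ranked rather than all tied on a missing score.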


In [157]:
predictions_lr_tfidf_tuned = lr_mo_tf_tuned.predict_proba(tfidf_test)
to_submission_csv_multiclass(predictions_lr_tfidf_tuned, 'submission_mo_lr_tf_tuned')

In [158]:
lr_mo_count_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_mo_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 18 candidates, totalling 90 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

GridSearchCV(estimator=MultiOutputClassifier(estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid=[{'estimator__C': [1, 12, 15],
                          'estimator__class_weight': ['balanced', None],
                          'estimator__max_iter': [600, 1800, 3000]}],
             scoring='f1', verbose=10)
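
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message above is a convergence warning from logistic regression on the raw count features. The warning itself suggests raising `max_iter`, scaling the data, or trying another solver. A minimal sketch of the last two options, assuming `MaxAbsScaler` (which keeps sparse matrices sparse) and the `liblinear` solver; neither is used in this notebook, and the tiny corpus and its two labels are made up:

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Made-up comments with two labels: [toxic, threat].
docs = [
    'you are an idiot', 'thanks for the helpful edit',
    'idiot idiot go away', 'nice article well sourced',
    'i will hurt you', 'please see the talk page',
    'you stupid idiot i will hurt you', 'great work on the references',
]
Y_demo = np.array([[1, 0], [0, 0], [1, 0], [0, 0],
                   [1, 1], [0, 0], [1, 1], [0, 0]])

pipeline = make_pipeline(
    CountVectorizer(),
    MaxAbsScaler(),  # rescales each count column to [0, 1] without densifying
    MultiOutputClassifier(LogisticRegression(solver='liblinear', max_iter=1000)),
)

# Record any convergence warnings instead of letting them flood the output.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    pipeline.fit(docs, Y_demo)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print('Convergence warnings:', n_convergence_warnings)
print(pipeline.predict(docs).shape)
```

Scaling changes the effective strength of the `C` penalty, so a grid tuned on raw counts would need to be searched again after adding the scaler.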

In [159]:
predictions = lr_mo_count_tuned.predict(count_train)
print('Count Vectors: ', accuracy_score(predictions, y_train), lr_mo_count_tuned.best_params_)

Count Vectors:  toxic            97.936342
severe_toxic     98.640730
obscene          98.735359
threat           99.741808
insult           97.858633
identity_hate    98.945924
dtype: float64 {'estimator__C': 1, 'estimator__class_weight': 'balanced', 'estimator__max_iter': 600}


In [160]:
predictions_lr_count_tuned = lr_mo_count_tuned.predict_proba(count_test)
to_submission_csv_multiclass(predictions_lr_count_tuned, 'submission_mo_lr_count_tuned')

In [161]:
pd.DataFrame(
    data={'private': [0.93996, 0.97063], 'public': [0.94410, 0.97183]}, 
    index=['submission_mo_lr_count_tuned.csv', 'submission_mo_lr_tf_tuned.csv']
)

Unnamed: 0,private,public
submission_mo_lr_count_tuned.csv,0.93996,0.9441
submission_mo_lr_tf_tuned.csv,0.97063,0.97183


### MultiOutput Classifier: Multinomial Naive Bayes

#### Model Training

In [51]:
mn_mo_count = MultiOutputClassifier(MultinomialNB())
mn_mo_count.fit(count_train, y_train)

MultiOutputClassifier(estimator=MultinomialNB())

In [52]:
mn_mo_tf = MultiOutputClassifier(MultinomialNB())
mn_mo_tf.fit(tfidf_train, y_train)

MultiOutputClassifier(estimator=MultinomialNB())

In [53]:
predictions = mn_mo_tf.predict(tfidf_train)
print('TF-IDF Vectors: \n' , accuracy_score(predictions, y_train))

predictions = mn_mo_count.predict(count_train)
print('Count Vectors: \n', accuracy_score(predictions, y_train))

TF-IDF Vectors: 
 toxic            92.368287
severe_toxic     98.991045
obscene          95.384500
threat           99.697313
insult           95.356299
identity_hate    99.110741
dtype: float64
Count Vectors: 
 toxic            95.136961
severe_toxic     98.641984
obscene          96.708675
threat           99.555057
insult           96.463016
identity_hate    98.772333
dtype: float64


In [54]:
predictions_mnb_mo_tf = mn_mo_tf.predict_proba(tfidf_test)
predictions_mnb_mo_count = mn_mo_count.predict_proba(count_test)

In [55]:
to_submission_csv_multiclass(predictions_mnb_mo_tf, 'submission_mo_mn_tf')
to_submission_csv_multiclass(predictions_mnb_mo_count, 'submission_mo_mn_count')

In [56]:
pd.DataFrame(
    data={'private': [0.84551, 0.82510], 'public': [0.85581, 0.83586]}, 
    index=['submission_mo_mn_count.csv', 'submission_mo_mn_tf.csv']
)

Unnamed: 0,private,public
submission_mo_mn_count.csv,0.84551,0.85581
submission_mo_mn_tf.csv,0.8251,0.83586


#### Hyperparameter Tuning

In [38]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [39]:
estimator = MultiOutputClassifier(MultinomialNB())
mn_mo_tf_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_mo_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits




GridSearchCV(estimator=MultiOutputClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='f1', verbose=10)
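
The grid above searches over two `MultinomialNB` settings. `alpha` is the additive smoothing constant: each per-class token probability is estimated as (token count + `alpha`) / (total count in the class + `alpha` × number of features), so smaller values trust the observed counts more. `fit_prior` decides whether class priors are learned from label frequencies or fixed as uniform. A minimal sketch on a made-up 4 × 3 count matrix that recomputes both quantities by hand and compares them with the fitted attributes:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up token counts: 4 documents, 3 vocabulary terms, binary label.
X_demo = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 1, 0],
                   [0, 0, 4]])
y_demo = np.array([1, 0, 1, 0])
alpha = 0.5

nb = MultinomialNB(alpha=alpha, fit_prior=True).fit(X_demo, y_demo)

# Smoothed likelihoods: (count + alpha) / (class total + alpha * n_features).
counts = np.vstack([X_demo[y_demo == c].sum(axis=0) for c in nb.classes_])
manual_log_prob = np.log(
    (counts + alpha)
    / (counts.sum(axis=1, keepdims=True) + alpha * X_demo.shape[1])
)
smoothing_matches = np.allclose(manual_log_prob, nb.feature_log_prob_)
print('Smoothing formula matches:', smoothing_matches)

# fit_prior=True learns priors from label frequencies ...
learned_prior = np.log(np.bincount(y_demo) / len(y_demo))
prior_matches = np.allclose(learned_prior, nb.class_log_prior_)

# ... while fit_prior=False uses a uniform prior over the classes.
nb_uniform = MultinomialNB(alpha=alpha, fit_prior=False).fit(X_demo, y_demo)
uniform_matches = np.allclose(nb_uniform.class_log_prior_, np.log(0.5))
print('Priors match:', prior_matches, uniform_matches)
```

For rare labels such as `threat`, `fit_prior=False` removes the strong prior toward the negative class, which can shift the predicted probabilities considerably.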

In [40]:
predictions = mn_mo_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: \n', accuracy_score(predictions, y_train), mn_mo_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            93.609114
severe_toxic     98.989791
obscene          96.043141
threat           99.691673
insult           95.876444
identity_hate    99.106354
dtype: float64 {'estimator__alpha': 0.5, 'estimator__fit_prior': True}


In [41]:
predictions_mn_tfidf_tuned = mn_mo_tf_tuned.predict_proba(tfidf_test)
to_submission_csv_multiclass(predictions_mn_tfidf_tuned, 'submission_mo_mn_tf_tuned')

In [43]:
mn_mo_count_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_mo_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 10 candidates, totalling 50 fits




GridSearchCV(estimator=MultiOutputClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='f1', verbose=10)

In [44]:
predictions = mn_mo_count_tuned.predict(count_train)
print('Count Vectors: \n', accuracy_score(predictions, y_train), mn_mo_count_tuned.best_params_)

Count Vectors:  toxic            95.165162
severe_toxic     98.426406
obscene          96.575192
threat           99.472335
insult           96.317627
identity_hate    98.473407
dtype: float64 {'estimator__alpha': 0.5, 'estimator__fit_prior': True}


In [45]:
predictions_mn_count_tuned = mn_mo_count_tuned.predict_proba(count_test)
to_submission_csv_multiclass(predictions_mn_count_tuned, 'submission_mo_mn_count_tuned')

In [46]:
pd.DataFrame(
    data={'private': [0.87456, 0.85045], 'public': [0.88221, 0.86105]}, 
    index=['submission_mo_mn_count_tuned.csv', 'submission_mo_mn_tf_tuned.csv']
)

Unnamed: 0,private,public
submission_mo_mn_count_tuned.csv,0.87456,0.88221
submission_mo_mn_tf_tuned.csv,0.85045,0.86105


### Classifier Chain: Multinomial Naive Bayes

#### Model Training

In [31]:
mn_cc_tf = ClassifierChain(classifier = MultinomialNB(alpha = 1.0, fit_prior = True))
mn_cc_count = ClassifierChain(classifier = MultinomialNB(alpha = 1.0, fit_prior = True))

In [32]:
mn_cc_tf.fit(tfidf_train_5000, y_train)
mn_cc_count.fit(count_train_5000, y_train)

ClassifierChain(classifier=MultinomialNB(), require_dense=[True, True])
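
A classifier chain differs from fitting one independent model per label: each link is trained on the original features plus the labels that come earlier in the chain, so correlations between labels (for example `toxic` and `insult`) can be exploited. The sketch below illustrates the idea with scikit-learn's own `sklearn.multioutput.ClassifierChain` rather than the scikit-multilearn class used in this notebook, on made-up count-like data:

```python
import numpy as np
from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import MultinomialNB

# Made-up non-negative "counts" (MultinomialNB requires non-negative input).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(200, 12))

# The second label can only be 1 when the first label is 1.
y_first = (X_demo[:, 0] > 2).astype(int)
y_second = y_first & (X_demo[:, 1] > 1).astype(int)
Y_demo = np.column_stack([y_first, y_second])

chain = ClassifierChain(MultinomialNB(alpha=1.0, fit_prior=True), order=[0, 1])
chain.fit(X_demo, Y_demo)

# Link 0 sees the 12 features; link 1 sees them plus the first label.
features_per_link = [est.n_features_in_ for est in chain.estimators_]
print(features_per_link)

proba = chain.predict_proba(X_demo[:5])
print(proba.shape)
```

The order of the chain matters, since only earlier labels are visible to later links; the notebook's scikit-multilearn chain follows the column order of `y_train` unless told otherwise.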

In [33]:
predictions_mn_cc_tf = mn_cc_tf.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n' , accuracy_score(predictions_mn_cc_tf.todense(), y_train))

predictions_mn_cc_count = mn_cc_count.predict(count_train_5000)
print('Count Vectors: \n', accuracy_score(predictions_mn_cc_count.todense(), y_train))

TF-IDF Vectors: 
 toxic            95.029799
severe_toxic     98.715932
obscene          97.173672
threat           99.046819
insult           96.766330
identity_hate    95.881457
dtype: float64
Count Vectors: 
 toxic            94.139913
severe_toxic     97.492025
obscene          94.467666
threat           95.961672
insult           94.012070
identity_hate    93.246893
dtype: float64


In [34]:
predictions_mn_cc_tf = mn_cc_tf.predict_proba(tfidf_test_5000)
predictions_mn_cc_count = mn_cc_count.predict_proba(count_test_5000)

In [35]:
to_submission_csv(predictions_mn_cc_tf.todense(), 'submission_tfidf_mn_cc')
to_submission_csv(predictions_mn_cc_count.todense(), 'submission_count_mn_cc')

In [36]:
pd.DataFrame(
    data={'private': [0.92866, 0.94711], 'public': [0.92896, 0.94614]}, 
    index=['submission_count_mn_cc.csv', 'submission_tfidf_mn_cc.csv']
)

Unnamed: 0,private,public
submission_count_mn_cc.csv,0.92866,0.92896
submission_tfidf_mn_cc.csv,0.94711,0.94614


#### Hyperparameter Tuning

In [37]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [None]:
estimator = ClassifierChain(MultinomialNB())
mn_cc_tf_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_cc_tf_tuned.fit(tfidf_train_5000, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [None]:
predictions = mn_cc_tf_tuned.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n', accuracy_score(predictions.todense(), y_train), mn_cc_tf_tuned.best_params_)

In [None]:
predictions_mn_tfidf_tuned = mn_cc_tf_tuned.predict_proba(tfidf_test_5000)
to_submission_csv(predictions_mn_tfidf_tuned.todense(), 'submission_cc_mn_tf_tuned')

In [None]:
mn_cc_count_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_cc_count_tuned.fit(count_train_5000, y_train)

In [None]:
predictions = mn_cc_count_tuned.predict(count_train_5000)
print('Count Vectors: \n', accuracy_score(predictions.todense(), y_train), mn_cc_count_tuned.best_params_)

In [None]:
predictions_mn_count_tuned = mn_cc_count_tuned.predict_proba(count_test_5000)
to_submission_csv(predictions_mn_count_tuned.todense(), 'submission_cc_mn_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_cc_mn_count_tuned.csv', 'submission_cc_mn_tf_tuned.csv']
)

### Binary Relevance: Logistic Regression

#### Model Training

In [57]:
br_lr_tf = BinaryRelevance(classifier = LogisticRegression())
br_lr_count = BinaryRelevance(classifier = LogisticRegression())

In [None]:
br_lr_tf.fit(tfidf_train_5000, y_train)
br_lr_count.fit(count_train_5000, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
predictions_br_lr_tf = br_lr_tf.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n' , accuracy_score(predictions_br_lr_tf.todense(), y_train))

predictions_br_lr_count = br_lr_count.predict(count_train_5000)
print('Count Vectors: \n', accuracy_score(predictions_br_lr_count.todense(), y_train))

In [None]:
predictions_br_lr_tf = br_lr_tf.predict_proba(tfidf_test_5000)
predictions_br_lr_count = br_lr_count.predict_proba(count_test_5000)

In [None]:
to_submission_csv(predictions_br_lr_tf.todense(), 'submission_tfidf_lr_br')
to_submission_csv(predictions_br_lr_count.todense(), 'submission_count_lr_br')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_count_lr_br.csv', 'submission_tfidf_lr_br.csv']
)

#### Hyperparameter Tuning

In [None]:
predictions_lr_count_tuned = np.zeros((len(test), len(classes)))
predictions_lr_tfidf_tuned = np.zeros((len(test), len(classes)))

In [None]:
estimator = BinaryRelevance(LogisticRegression())
lr_br_tf_tuned = GridSearchCV(estimator, parameters_lr_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_br_tf_tuned.fit(tfidf_train_5000, y_train)

In [None]:
predictions = lr_br_tf_tuned.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n', accuracy_score(predictions.todense(), y_train), lr_br_tf_tuned.best_params_)

In [None]:
predictions_lr_tfidf_tuned = lr_br_tf_tuned.predict_proba(tfidf_test_5000)
to_submission_csv(predictions_lr_tfidf_tuned.todense(), 'submission_br_lr_tf_tuned')

In [None]:
lr_br_count_tuned = GridSearchCV(estimator, parameters_lr_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_br_count_tuned.fit(count_train_5000, y_train)

In [None]:
predictions = lr_br_count_tuned.predict(count_train_5000)
print('Count Vectors: \n', accuracy_score(predictions.todense(), y_train), lr_br_count_tuned.best_params_)

In [None]:
predictions_lr_count_tuned = lr_br_count_tuned.predict_proba(count_test_5000)
to_submission_csv(predictions_lr_count_tuned.todense(), 'submission_br_lr_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_br_lr_count_tuned.csv', 'submission_br_lr_tf_tuned.csv']
)

#### Model Selection

### Binary Relevance: Multinomial Naive Bayes

#### Model Training

In [48]:
mn_br_count = BinaryRelevance(MultinomialNB())
mn_br_count.fit(count_train_5000, y_train)

BinaryRelevance(classifier=MultinomialNB(), require_dense=[True, True])

In [49]:
mn_br_tf = BinaryRelevance(MultinomialNB())
mn_br_tf.fit(tfidf_train_5000, y_train)

BinaryRelevance(classifier=MultinomialNB(), require_dense=[True, True])

In [50]:
predictions = mn_br_tf.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n' , accuracy_score(predictions.todense(), y_train))

predictions = mn_br_count.predict(count_train_5000)
print('Count Vectors: \n', accuracy_score(predictions.todense(), y_train))

TF-IDF Vectors: 
 toxic            95.029799
severe_toxic     99.088180
obscene          97.347889
threat           99.702327
insult           96.954960
identity_hate    99.159622
dtype: float64
Count Vectors: 
 toxic            94.139913
severe_toxic     98.111186
obscene          96.105182
threat           98.551115
insult           95.760508
identity_hate    97.667496
dtype: float64


In [51]:
predictions_mnb_br_tf = mn_br_tf.predict_proba(tfidf_test_5000)
predictions_mnb_br_count = mn_br_count.predict_proba(count_test_5000)

In [52]:
to_submission_csv(predictions_mnb_br_tf.todense(), 'submission_br_mn_tf')
to_submission_csv(predictions_mnb_br_count.todense(), 'submission_br_mn_count')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_br_mn_count.csv', 'submission_br_mn_tf.csv']
)

#### Hyperparameter Tuning

In [53]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [None]:
estimator = BinaryRelevance(MultinomialNB())
mn_br_tf_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_br_tf_tuned.fit(tfidf_train_5000, y_train)

In [None]:
predictions = mn_br_tf_tuned.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n', accuracy_score(predictions.todense(), y_train), mn_br_tf_tuned.best_params_)

In [None]:
predictions_mn_tfidf_tuned = mn_br_tf_tuned.predict_proba(tfidf_test_5000)
to_submission_csv(predictions_mn_tfidf_tuned.todense(), 'submission_br_mn_tf_tuned')

In [None]:
mn_br_count_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_br_count_tuned.fit(count_train_5000, y_train)   

In [None]:
predictions = mn_br_count_tuned.predict(count_train_5000)
print('Count Vectors: \n', accuracy_score(predictions.todense(), y_train), mn_br_count_tuned.best_params_)

In [None]:
predictions_mn_count_tuned = mn_br_count_tuned.predict_proba(count_test_5000)
to_submission_csv(predictions_mn_count_tuned.todense(), 'submission_br_mn_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_br_mn_count_tuned.csv', 'submission_br_mn_tf_tuned.csv']
)