# You're Toxic, I'm Slippin' Under: Toxic Comment Classification Challenge

#### STINTSY S13 Group 8
- VICENTE, Francheska Josefa
- VISTA, Sophia Danielle S.

## Import Libraries
Before starting, the libraries and files needed to build and train the models should be loaded into the notebook.

#### Basic Libraries 
- `numpy` contains a large collection of mathematical functions
- `pandas` contains functions that are designed for data manipulation and data analysis

In [1]:
import numpy as np
import pandas as pd

#### Natural Language Processing Libraries 
- `re` is a module that allows the use of regular expressions
- `nltk` provides functions for processing text data
- `stopwords` is a corpus from NLTK, which includes a compiled list of stopwords
- `Counter` is from Python's `collections` module and counts hashable items, which is helpful for counting tokens
- `string` contains functions for string operations
- `TfidfVectorizer` converts the given text documents into a matrix, which has TF-IDF features
- `CountVectorizer` converts the given text documents into a matrix, which has the counts of the tokens

In [2]:
import re
import nltk
import string

from nltk.corpus import stopwords
from collections import Counter
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from nltk.tokenize.casual import TweetTokenizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer



#### Machine Learning Libraries
The following code block can be used to install **scikit-multilearn** without restarting Jupyter Notebook. The `sys` module provides `sys.executable`, the path of the running Python interpreter, so that pip installs scikit-multilearn into the same environment as the notebook kernel.

In [3]:
import sys
!{sys.executable} -m pip install scikit-multilearn

C:\Users\User\anaconda3\python.exe: No module named pip


The following classes support multi-label classification, in which one model can assign more than one class to a single instance.
- `ClassifierChain` chains binary classifiers so that each classifier's prediction also depends on the predictions for the earlier classes in the chain
- `BinaryRelevance` uses binary classifiers to classify the classes independently
- `MultiOutputClassifier` fits one classifier per target class 
- `OneVsRestClassifier` fits one binary classifier per class, treating that class against all the other classes

In [4]:
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier

The following classes are classifiers that implement different methods of classification.
- `RandomForestClassifier` is a class under the ensemble module that fits a number of decision trees on sub-samples of the data and averages their predictions
- `GradientBoostingClassifier` is a class under the ensemble module that builds an additive model of decision trees stage by stage, optimizing an arbitrary differentiable loss function
- `AdaBoostClassifier` is a class under the ensemble module that implements AdaBoost-SAMME
- `MultinomialNB` is a class under the Naive Bayes module that allows the classification of discrete features
- `LogisticRegression` is a class under the linear models module that implements regularized logistic regression
- `SGDClassifier` is a class under the linear models module that implements regularized linear models with stochastic gradient descent (SGD) learning

In [5]:
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

Meanwhile, the following classes are used for hyperparameter tuning.
- `ParameterGrid` is a class that allows the iteration over different combinations of parameter values 
- `GridSearchCV` is a cross-validation class that allows the exhaustive search over all possible combinations of hyperparameter values
- `RandomizedSearchCV` is a cross-validation class that allows a random search over some possible combinations of hyperparameter values
- `train_test_split` divides the dataset into two subsets

In [6]:
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

And lastly, these functions compute different scores that measure how well a model works.
- `log_loss` computes the logistic loss (cross-entropy) given the true values and the predicted probabilities
- `f1_score` computes the F1 score, the harmonic mean of precision and recall, by comparing the actual classes and the predicted classes
- `accuracy_score` computes the accuracy, the fraction of predictions that match the actual classes

In [None]:
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

The `warnings` module is used to ignore any `ConvergenceWarning` that might appear when doing hyperparameter tuning. As the configurations that trigger them will not be chosen due to low accuracy scores, the warnings would only clutter the output.

In [8]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category = ConvergenceWarning)

### Load Files
The CSV files to be loaded here contain the datasets that have already gone through the data cleaning and preprocessing techniques discussed in the main notebook.

In [9]:
train = pd.read_csv('cleaned_data/cleaned_train.csv')
test = pd.read_csv('cleaned_data/cleaned_test.csv')

## Initialize Datasets
Before using these datasets, we would need to convert the values in the `comment_text` column into either "str, unicode or file objects", according to the documentation of TF-IDF vectorizer and Count Vectorizer.

In [10]:
test ['comment_text'] = test ['comment_text'].apply(lambda x: np.str_(x))
train ['comment_text'] = train ['comment_text'].apply(lambda x: np.str_(x))

Then, we would be declaring our **X_train**, **y_train**, and **X_test**.

In [11]:
X_train = train ['comment_text']
y_train = train.loc [ : , 'toxic' : ]

X_test = test ['comment_text']

Afterwards, we would be declaring the different classes that our model would need to predict. This can be found in the **train** data's column names.

In [12]:
classes = train.columns [2:]

## Vectorizing Data
As explained in the **Feature Engineering** part of the main notebook, three types of vectorizers would be used: (1) Count Vectorizer, (2) TF-IDF Vectorizer, and (3) Average Word2Vec Vectors.

Two types of CountVectorizer and TF-IDF Vectorizers were made in consideration of the more complex estimators: one with no **max_features** parameter, and one with a **max_features** parameter that is equal to 5000. Limiting the number of max features would lessen the time and space complexity from training the estimators; this would lessen the burden on our machines.
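As a rough illustration of what **max_features** does, the sketch below fits vectorizers on a tiny made-up corpus (not the competition data). With `max_features` set, only the terms with the highest total frequency in the corpus are kept, so the resulting matrix has fewer columns. The corpus and the variable names are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; "you" appears 3 times, "are" twice, the rest once
corpus = [
    "you are great",
    "you are terrible",
    "you work",
]

# No limit: every term in the corpus becomes a column
count_all = CountVectorizer().fit(corpus)

# Limit of 2: only the two most frequent terms are kept
count_top2 = CountVectorizer(max_features=2).fit(corpus)

# The same limit applies to TF-IDF features
tfidf_top2 = TfidfVectorizer(max_features=2)
tfidf_matrix = tfidf_top2.fit_transform(corpus)

print(sorted(count_all.vocabulary_))
print(sorted(count_top2.vocabulary_))
print(tfidf_matrix.shape)
```

The same trade-off applies at full scale: a smaller vocabulary means smaller matrices and faster training, at the cost of discarding rare terms.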

#### Count Vectorizer

In [13]:
count_vectorizer = CountVectorizer()                     # creating the Vectorizer with no max features
count_train = count_vectorizer.fit_transform(X_train)    # fitting the vectorizer according to the train data, and then
                                                         # returning the transformed train data
count_test = count_vectorizer.transform(X_test)          # returning the transformed test data

In [14]:
count_vectorizer_5000 = CountVectorizer(max_features = 5000)     # creating the Vectorizer with max features = 5000
count_train_5000 = count_vectorizer_5000.fit_transform(X_train)  # fitting the vectorizer according to the train data, and then
                                                                 # returning the transformed train data
count_test_5000 = count_vectorizer_5000.transform(X_test)        # returning the transformed test data

#### TF-IDF Vectorizer

In [16]:
tfidf_vectorizer = TfidfVectorizer()                    # creating the Vectorizer with no max features
tfidf_train = tfidf_vectorizer.fit_transform(X_train)   # fitting the vectorizer according to the train data, and then
                                                        # returning the transformed train data
tfidf_test = tfidf_vectorizer.transform(X_test)         # returning the transformed test data

In [17]:
tfidf_vectorizer_5000 = TfidfVectorizer(max_features = 5000)    # creating the Vectorizer with max features = 5000
tfidf_train_5000 = tfidf_vectorizer_5000.fit_transform(X_train) # fitting the vectorizer according to the train data, and then
                                                                # returning the transformed train data
tfidf_test_5000 = tfidf_vectorizer_5000.transform(X_test)       # returning the transformed test data

#### Average Word2Vec Vectors

Before building a Word2Vec model, the data must be tokenized to produce a list of lists of tokens as indicated in Gensim's documentation. 

In [18]:
def tokenize(data):
    t = TweetTokenizer()                          # one tokenizer instance reused for every text
    return [t.tokenize(text) for text in data]    # list of token lists, one per text

The train set is tokenized using NLTK's `TweetTokenizer`.

In [23]:
tokens_train = tokenize(X_train)

The Word2Vec model is trained using this tokenized list, which would transform these words into word vectors.

In [24]:
wrd2v_model = Word2Vec(tokens_train, epochs=30, sg=0, workers=4)

To transform the word vectors into usable features, the vectors of the words in each comment that appear in the model's vocabulary are averaged, giving one fixed-length vector per comment.

In [30]:
def vectorize_word2vec(model, tokens_list):
    vectors = []
    vocabulary = model.wv.key_to_index            # dict lookup is much faster than scanning index_to_key
    
    for tokens in tokens_list:                    # iterate through each sentence
        feat = np.zeros(model.wv.vector_size)     # initializes the array that will hold the summed vectors
        count = 0                                 # initializes the in-vocabulary word count for a sentence
        
        for token in tokens:                      # iterate through each word in the sentence
            if token in vocabulary:               # if the word is in the model's vocabulary...
                feat += model.wv[token]           # ...add its word vector to the sum and...
                count += 1                        # ...update the word count
        
        if count > 0:                             # if sentence contains at least 1 word in the model...
            feat /= count                         # ...divide the sum by the word count to get the average
            
        vectors.append(feat)                      # add the averaged vector to the list
        
    return vectors
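To make the averaging step concrete, here is a minimal sketch that uses a small dictionary of made-up 3-dimensional vectors in place of a trained Word2Vec model. The `toy_vectors` dictionary and the `average_vector` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "bad": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size):
    """Average the vectors of the tokens found in `vectors`; return zeros if none are found."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# "unseen" is skipped, so the result is the mean of the "good" and "bad" vectors
avg = average_vector(["good", "bad", "unseen"], toy_vectors, size=3)

# A comment with no known words falls back to the zero vector
empty = average_vector(["unseen"], toy_vectors, size=3)

print(avg)
print(empty)
```

Out-of-vocabulary tokens contribute nothing to the sum or the count, which is why a comment made up entirely of unknown words ends up as a zero vector.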

Using the defined function, the train data can be vectorized using the word vectors.

In [31]:
wrd2v_train = vectorize_word2vec(wrd2v_model, tokens_train)

The test data is also vectorized as follows:

In [34]:
tokens_test = tokenize(X_test)
wrd2v_test = vectorize_word2vec(wrd2v_model, tokens_test)

## Training and Tuning Different Models <a class="anchor" id="toc"></a>

#### Variable and Function Declarations
* [**Helper Functions**](#functs)
* [**Hyperparameters**](#params)

### Single-Label Classifiers
* [**Logistic Regression**](#lr)
* [**Multinomial Naive Bayes**](#mn)
* [**Random Forest Classifier**](#rf)
* [**Gradient Boosting Classifier**](#gbc)
* [**eXtreme Gradient Boosting Classifier**](#xgb)
* [**AdaBoost Classifier**](#adb)
* [**Stochastic Gradient Descent Classifier**](#sgd)

### Multi-Label Classifiers
* [**OneVsRest Classifier: Logistic Regression**](#oc_lr)
* [**OneVsRest Classifier: Multinomial Naive Bayes**](#oc_mn)
* [**MultiOutput Classifier: Logistic Regression**](#mo_lr)
* [**MultiOutput Classifier: Multinomial Naive Bayes**](#mo_mn)
* [**Binary Relevance: Logistic Regression**](#br_lr)
* [**Binary Relevance: Multinomial Naive Bayes**](#br_mn)
* [**Classifier Chain: Multinomial Naive Bayes**](#cc_mn)

### Declaring Helper Functions <a class="anchor" id="functs"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
Helper functions that are used repeatedly throughout the notebook are declared and discussed here.

#### Submission Template Functions
The following `to_submission_csv` functions are used to create CSV files with the correct submission template. The first function is used by almost all models, while a modified version is used for the MultiOutput Classifier.

In [35]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id']

In [36]:
def to_submission_csv(predictions, filename):
    for i in range (6):
        sample_submission[classes [i]] = predictions[:, i : i + 1]

    sample_submission.to_csv(f'results/{filename}.csv', index = False) 

In [37]:
def to_submission_csv_multiclass(predictions, filename):
    for i in range (6):
        temp = list(zip(*predictions[i]))
        sample_submission[classes [i]] = temp[1]

    sample_submission.to_csv(f'results/{filename}.csv', index = False)
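The reason a separate function is needed can be seen from the shape of the predictions. In scikit-learn, `MultiOutputClassifier.predict_proba` returns a list with one `(n_samples, 2)` array per label instead of a single 2-D array, so the positive-class column has to be pulled out label by label. The sketch below shows this on randomly generated data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical data: 40 samples, 3 features, 2 binary labels
rng = np.random.default_rng(8)
X = rng.normal(size=(40, 3))
Y = np.column_stack([(X[:, 0] > 0).astype(int), (X[:, 1] > 0).astype(int)])

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)

# A list with one array per label; each array has a column per class (0 and 1)
per_label = clf.predict_proba(X)

# Keep only the probability of the positive class for each label
positive = np.column_stack([proba[:, 1] for proba in per_label])

print(len(per_label), per_label[0].shape, positive.shape)
```

Taking `proba[:, 1]` here plays the same role as `temp[1]` in `to_submission_csv_multiclass`.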

#### Display Functions
The `update_results` function is used to compute the final test accuracy and to collect the final results in a DataFrame for display.

In [111]:
index = ['model', 'vector', 'tuned', 'private', 'public', 'test accuracy']
all_results = pd.DataFrame(index=index)

def update_results(all_results, results):
    for result in results:
        # private score accounts for 90% of the test data, while the remaining 10% is the public score
        results[result] += [round((results[result][3]*9 + results[result][4]) / 10, 5)]

    return pd.concat([pd.DataFrame(results, index=index), all_results], axis=1)
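As a worked example of the weighting in `update_results`, the sketch below combines the private and public scores of the TF-IDF Logistic Regression submission using the 90%/10% split stated in the comment above. The `weighted_score` helper is hypothetical and only restates the same arithmetic.

```python
def weighted_score(private, public, private_share=0.9):
    """Combine leaderboard scores, weighting each by its share of the test data."""
    return round(private * private_share + public * (1 - private_share), 5)

# (0.97558 * 9 + 0.97621) / 10 = 0.975643, which rounds to 0.97564
lr_tfidf = weighted_score(0.97558, 0.97621)
print(lr_tfidf)
```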

#### Training and Tuning Functions

As the task requires predictions for six classes, the `train_models` function trains one binary classifier per class. For each class, the function fits the classifier using the passed train set, computes the training accuracy, then predicts the class probabilities of the passed test set.

The function will return the trained models and their predictions.

In [127]:
from sklearn.base import clone

def train_models(model, X_train, X_test):
    """Trains six models, one per class, using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    X_train : array-like or sparse matrix
        the data used in fitting the model
    X_test : array-like or sparse matrix
        the data to be predicted

    Returns
    -------
    models
        a list of fitted estimator objects, one per class
    test_predictions
        an array of prediction probabilities with one column per class
    """
    
    test_predictions = np.zeros((len(test), len(classes)))                  # initialize empty array for predictions
    models = []                                                             # initialize empty list for models
    train_accuracy = []
    
    print('Fitting', str(model) + '...')
    
    for i in range(len(classes)):                                           # loop for each of six classes
        
        class_model = clone(model)                                          # fresh copy, so each class keeps its own fitted model
        class_model.fit(X_train, y_train[classes[i]])                       # fit the model
        
        train_predictions = class_model.predict(X_train)                    # predict using train data
        accuracy = accuracy_score(y_train[classes[i]], train_predictions)   # get training accuracy 
        print(classes[i] + ':', accuracy)
        
        test_predictions[:,i] = class_model.predict_proba(X_test)[:,1]      # predict using test data
        
        models += [class_model]
        train_accuracy += [accuracy]
    
    print('\nOverall training accuracy:', np.mean(train_accuracy))
    
    return models, test_predictions

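Similarly, the `tune_and_train_models` function trains one classifier per class, with the addition of hyperparameter tuning to achieve a better training accuracy. Hyperparameter tuning is done using `GridSearchCV`, which exhaustively tries every combination in the grid. Note that `cv=[(slice(None), slice(None))]` defines a single split in which the full train set is used both for fitting and for scoring, so no data is held out and the selected combination is the one with the best training score.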

In [208]:
def tune_and_train_models(model, hyperparameters, X_train, X_test, scoring='accuracy'):
    """Tunes six models using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    hyperparameters : dict or list of dicts
        the parameter grid used for tuning the model  
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted
    scoring : 
        the metric for deciding the best combination of parameters, either 'accuracy' or 'f1'

    Returns
    -------
    models
        a list of fitted estimator objects
    test_predictions
        a list of prediction probabilities by the fitted model
    """
    
    test_predictions = np.zeros((len(test), len(classes)))                  # initialize empty list for predictions
    models = []                                                             # initialize empty list for models
    train_accuracy = []
    
    print('Tuning', str(model) + '...')
    
    for i in range(6):                                                              # loop for each of six classes
        model_cv = GridSearchCV(model, hyperparameters, 
                                cv=[(slice(None), slice(None))], scoring=scoring)
        model_cv.fit(X_train, y_train[classes[i]])
        
        train_predictions = model_cv.predict(X_train)                               # predict using train data
        accuracy = accuracy_score(train_predictions, y_train[classes[i]])           # get training accuracy 
        print(classes[i] + ':', accuracy)
        
        test_predictions[:,i] = model_cv.predict_proba(X_test)[:,1]                 # predict using test data
        
        models += [model_cv]
        train_accuracy += [accuracy]
    
    print('\nOverall training accuracy:', np.mean(train_accuracy))                  # the values averaged here are always accuracies
    
    return models, test_predictions
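The effect of scoring a grid search on the same rows it was fitted on can be sketched with a small synthetic problem. The example below is an illustration under stated assumptions, not the notebook's pipeline: it uses a decision tree on noisy generated data, and it passes explicit index arrays (`all_rows`) for the single split instead of `slice(None)`, which expresses the same idea. A regular 5-fold search is included for contrast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: 30% of the labels are assigned at random
X, y = make_classification(n_samples=200, n_features=10, flip_y=0.3, random_state=8)

grid = {"max_depth": [1, 3, None]}
tree = DecisionTreeClassifier(random_state=8)

# One split where the train rows and the validation rows are the same rows
all_rows = np.arange(len(y))
same_data = GridSearchCV(tree, grid, cv=[(all_rows, all_rows)], scoring="accuracy").fit(X, y)

# Regular 5-fold cross-validation, where each candidate is scored on held-out rows
five_fold = GridSearchCV(tree, grid, cv=5, scoring="accuracy").fit(X, y)

print(same_data.best_params_, same_data.best_score_)
print(five_fold.best_params_, five_fold.best_score_)
```

With a single same-data split, the search favors whichever candidate memorizes the train set best (here, the unlimited-depth tree), so its best score is a training score rather than an estimate of performance on unseen data.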

Multi-label classifiers would follow the same pipeline of training the model, getting the training predictions, and predicting the classes of the test data. As such, the `train_model` and `tune_and_train_model` functions will forgo the loop and proceed to train and/or tune the model using the whole **y_train**.

In [40]:
def train_model(model, X_train, X_test, dense=False):
    """Trains a model using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted
    dense : 
        a boolean value indicating if the predictions need to be converted to dense

    Returns
    -------
    model
        a fitted estimator object
    test_predictions
        a list of prediction probabilities by the fitted model
    """
    
    print('Fitting', str(model) + '...')
    
    model.fit(X_train, y_train)                                               # fit the model
    train_predictions = model.predict(X_train)                                # predict using train data
    
    if dense:                                           
        train_predictions = train_predictions.toarray()                       # convert sparse predictions to a dense array
        
    accuracy = accuracy_score(train_predictions, y_train)                     # get training accuracy 
    print(accuracy)                                                        
    
    test_predictions = model.predict_proba(X_test)                            # predict using test data
    
    return model, test_predictions
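When `accuracy_score` is given a 2-D label matrix, as in `train_model`, scikit-learn computes the subset accuracy: a row counts as correct only if all of its labels match. The sketch below uses made-up labels to show how this differs from the per-label accuracy printed by the per-class functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for 4 comments and 3 classes
y_true = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]])

# Subset accuracy: only the first two rows match on every label
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: 10 of the 12 individual labels match
per_label_acc = (y_true == y_pred).mean()

print(subset_acc, per_label_acc)
```

Because of this stricter definition, training accuracies reported for the multi-label classifiers are not directly comparable to the per-class accuracies of the single-label models.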

As with the previous functions, the `tune_and_train_model` function will tune a single multi-label classifier using a `GridSearchCV` to increase the training accuracy.

In [193]:
def tune_and_train_model(model, hyperparameters, X_train, X_test, scoring='accuracy', dense=False):
    """Tunes a model using a given train and test set.

    Parameters
    ----------
    model : estimator object
        the type of estimator to be trained 
    hyperparameters : dict or list of dicts
        the parameter grid used for tuning the model  
    X_train : 
        the data used in fitting the model
    X_test : 
        the data to be predicted
    scoring : 
        the metric for deciding the best combination of parameters, either 'accuracy' or 'f1'
    dense : 
        a boolean value indicating if the predictions need to be converted to dense

    Returns
    -------
    model
        a  fitted estimator object
    test_predictions
        a list of prediction probabilities by the fitted model
    """
    
    print('Tuning', str(model) + '...')

    model_cv = GridSearchCV(model, hyperparameters, 
                            cv=[(slice(None), slice(None))], scoring=scoring)
    model_cv.fit(X_train, y_train)

    train_predictions = model_cv.predict(X_train)                                # predict using train data
    
    if dense:                                           
        train_predictions = train_predictions.toarray()                          # convert sparse predictions to a dense array
        
    accuracy = accuracy_score(train_predictions, y_train)                        # get training accuracy 
    print(accuracy)                                                        
    
    test_predictions = model_cv.predict_proba(X_test)                            # predict using test data
    
    return model_cv, test_predictions

### Declaring Hyperparameter Values <a class="anchor" id="params"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
As hyperparameters for each base estimator will remain constant, these will be declared here.

#### Logistic Regression Hyperparameters <a class="anchor" id="param_lr"></a>
Tuning Logistic Regression models mostly involves altering `C`, the inverse of the regularization strength (smaller values mean stronger regularization), and the maximum number of iterations. 

For the C value, different powers of 10 around the default value (1) were tested to see if a stronger or weaker regularization strength can affect the results.

During earlier testing stages, it was determined that the default number of max iterations (100) resulted in a ConvergenceWarning. With this, higher values were considered.

In [42]:
parameters_lr = [{
    'C' : [0.01, 0.1, 1, 10],
    'max_iter' : [300, 600, 900, 1200]
}]
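To get a feel for the cost of this grid, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below repeats the grid so that it runs on its own; the count of six searches assumes one grid search per class with a single split each, as in `tune_and_train_models`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as parameters_lr, repeated here so the snippet is self-contained
parameters_lr = [{
    'C': [0.01, 0.1, 1, 10],
    'max_iter': [300, 600, 900, 1200]
}]

n_candidates = len(ParameterGrid(parameters_lr))   # 4 values of C x 4 values of max_iter
n_search_fits = n_candidates * 6                   # one search per class, one split per search

print(n_candidates, n_search_fits)
```

Each search also refits the best candidate once at the end, so the total number of fits is slightly higher than `n_search_fits`.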

As the OneVsRest Classifier and MultiOutput Classifier require a slightly altered format, this was declared as a different variable.

In [43]:
parameters_lr_mo = [{
    'estimator__C': [0.01, 0.1, 1, 10],           
    'estimator__max_iter': [300, 600, 900, 1200], 
}]
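The `estimator__` prefix follows scikit-learn's convention for nested parameters: a wrapper exposes the parameters of the object it wraps as `<attribute name>__<parameter>`. A quick way to check which names are valid is `get_params()`, as in this small sketch (the variable names are for illustration only).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

ovr = OneVsRestClassifier(LogisticRegression())
mo = MultiOutputClassifier(LogisticRegression())

# Both wrappers store the base model in their `estimator` attribute,
# so its parameters appear with the `estimator__` prefix
ovr_names = sorted(name for name in ovr.get_params() if name.startswith('estimator__'))
mo_names = sorted(name for name in mo.get_params() if name.startswith('estimator__'))

# Setting a nested parameter changes the wrapped LogisticRegression
ovr.set_params(estimator__C=0.1)

print('estimator__C' in ovr_names, 'estimator__max_iter' in mo_names)
print(ovr.estimator.C)
```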

The Binary Relevance classifier also requires its own separate format, as seen below.

In [44]:
parameters_lr_multi = [{
    'classifier': [LogisticRegression()],
    'classifier__C': [0.01, 0.1, 1, 10],            
    'classifier__max_iter': [300, 600, 900, 1200] 
}]

#### Multinomial Naive Bayes Hyperparameters <a class="anchor" id="param_mnb"></a>
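For Multinomial Naive Bayes, the tuned hyperparameters are `alpha`, the additive (Laplace/Lidstone) smoothing parameter, and `fit_prior`, which controls whether class prior probabilities are learned from the data or assumed to be uniform.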

In [185]:
parameters_mnb = [{
    'alpha' : [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
    'fit_prior' : [True, False]
}]

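As with Logistic Regression, the OneVsRest Classifier and MultiOutput Classifier require a slightly altered format, in which each parameter name is prefixed with `estimator__`.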

In [46]:
parameters_mn_mo = [{
    'estimator__alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000], 
    'estimator__fit_prior': [True, False]
}]

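The Binary Relevance classifier likewise requires its own format, in which the base classifier is listed together with its `classifier__`-prefixed parameters.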

In [47]:
parameters_mn_multi = [{
    'classifier': [MultinomialNB()],
    'classifier__alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],  
    'classifier__fit_prior': [True, False]
}]

#### Random Forest Classifier Hyperparameters <a class="anchor" id="param_rf"></a>
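For the Random Forest Classifier, the tuned hyperparameters are the number of trees (`n_estimators`), the split quality measure (`criterion`), and three limits on tree growth: `max_depth`, `min_samples_split`, and `max_leaf_nodes`.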

In [48]:
parameters_rf = [{
    'n_estimators' : [100, 200, 300, 400, 500],
    'criterion' : ['gini', 'entropy'],
    'max_depth' : [5, 10, 20, 30],
    'min_samples_split' : [2, 4, 6, 10, 15, 20],
    'max_leaf_nodes' : [3, 5, 10, 20, 50, 100],
}]

#### Gradient Boosting Classifier Hyperparameters <a class="anchor" id="param_gbc"></a>
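For the Gradient Boosting Classifier, the tuned hyperparameters are the number of boosting stages (`n_estimators`) and the `learning_rate`, which shrinks the contribution of each tree.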

In [49]:
parameters_gbc = [{
    'n_estimators' : [50, 100, 250],
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2],
}]

#### XGBoost Classifier Hyperparameters <a class="anchor" id="param_xgb"></a>
For the XGBoost Classifier, only the `learning_rate` is tuned.

In [50]:
parameters_xgb = [{
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2],
}]

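#### AdaBoost Classifier Hyperparameters <a class="anchor" id="param_adb"></a>
For the AdaBoost Classifier, the tuned hyperparameters are the maximum number of estimators (`n_estimators`) and the `learning_rate`, which weights the contribution of each classifier.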

In [51]:
parameters_adb = {
    'n_estimators' : [10, 25, 50, 100, 250],
    'learning_rate' : [0.001, 0.01, 0.1, 1, 1.2]
}

#### SGDClassifier Hyperparameters <a class="anchor" id="param_sgd"></a>
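For the SGDClassifier, the tuned hyperparameters are the loss function (`loss`) and `alpha`, the constant that multiplies the regularization term. Both listed losses support `predict_proba`, which the training functions rely on. Note that scikit-learn 1.1 renamed the `'log'` loss to `'log_loss'`.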

In [52]:
parameters_sgd = [{
    'loss' : ['log', 'modified_huber'],
    'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
}]

## Model Experimentation
It should be noted that all models used below will have these constant values for parameters, if applicable:
* `n_jobs = -1`, which would ensure that all CPU cores will be used for faster processing,
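* `class_weight='balanced'`, which would weight each class inversely proportional to its frequency in the train data so that the rare positive labels are not ignored, and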
* `random_state=8`, which would ensure that the output can be reproduced
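For reference, scikit-learn documents the `'balanced'` class weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a made-up label vector in which only 1 out of 10 comments is positive; the label vector is an assumption for illustration.

```python
import numpy as np

# Hypothetical labels: 9 negative comments and 1 positive comment
y = np.array([0] * 9 + [1] * 1)

counts = np.bincount(y)                      # number of samples per class: [9, 1]
weights = len(y) / (len(counts) * counts)    # n_samples / (n_classes * counts)

print(weights)
```

The rare positive class receives a much larger weight than the common negative class, so mistakes on positive examples cost more during fitting.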

### Logistic Regression <a class="anchor" id="lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: explain why this model is used

#### Model Training
A `LogisticRegression()` object is first initialized to be used as the base classifier.

In [211]:
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')

The model is then trained using the count vectorized train data.

In [212]:
%%time
lr_models_count, predictions_lr_count = train_models(lr, count_train, count_test)
to_submission_csv(predictions_lr_count, 'submission_lr_count')

Fitting LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9616847672822756
severe_toxic: 0.9756534708687669
obscene: 0.9744627783243822
threat: 0.9962963195066773
insult: 0.9615468976192416
identity_hate: 0.9769820330761855

Overall training accuracy: 0.9744377111129215
Wall time: 1min 2s


The results for count vectors are quite accurate, seeing as the training accuracy rate for each class is at least 0.96.

Next, the model will be trained using the TF-IDF vectorized data as shown below.

In [213]:
%%time
lr_models_tfidf, predictions_lr_tfidf = train_models(lr, tfidf_train, tfidf_test)
to_submission_csv(predictions_lr_tfidf, 'submission_lr_tfidf')

Fitting LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9574233413339517
severe_toxic: 0.9794260861936066
obscene: 0.9807421147952949
threat: 0.9937519975434133
insult: 0.9681019734162223
identity_hate: 0.9810241209242281

Overall training accuracy: 0.9767449390344529
Wall time: 52.8 s


Compared to count vectors, the use of TF-IDF vectors yielded higher training accuracy scores for **severe_toxic**, **obscene**, **insult**, and **identity_hate**. Moreover, the execution time is faster by around 9 seconds.

Lastly, the Word2Vec vectorized data will be used to train the model.

In [214]:
%%time
lr_models_wrd2v, predictions_lr_wrd2v = train_models(lr, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_lr_wrd2v, 'submission_lr_wrd2v')

Fitting LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.8984088587525302
severe_toxic: 0.948117139079156
obscene: 0.92264258543219
threat: 0.9272236183266382
insult: 0.9160687092266139
identity_hate: 0.9084670773511477

Overall training accuracy: 0.9201546646947126
Wall time: 4min 6s


Word2Vec vectors produced the lowest training accuracies so far, ranging from about 0.90 to 0.95 only. Furthermore, training the six models took around 4 minutes, which is considerably longer than the others.

#### Test Accuracy Score
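As seen from the scores returned by Kaggle, TF-IDF vectors performed the best, followed by Word2Vec vectors, then count vectors. Note that the competition scores submissions by mean column-wise ROC AUC, so the values below are AUC scores rather than accuracies in the strict sense.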

In [None]:
results = {
    "submission_lr_count": ['Logistic Regression', 'Count Vectors', 'Not Tuned', 0.94845, 0.94248],
    "submission_lr_tfidf": ['Logistic Regression', 'TF-IDF Vectors', 'Not Tuned', 0.97558, 0.97621], 
    "submission_lr_wrd2v": ['Logistic Regression', 'Word2Vec Vectors', 'Not Tuned', 0.95233, 0.94982]
}

all_results = update_results(all_results, results)
all_results.T.sort_values('test accuracy', ascending=False)

TODO: discuss the results above

#### Hyperparameter Tuning
A `LogisticRegression()` object with the same constant parameters as before will serve as the base estimator, which will be tuned using the [`parameters_lr`](#param_lr) hyperparameters.

In [216]:
%%time
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_count_tuned, predictions_lr_count_tuned = tune_and_train_models(lr, parameters_lr, count_train, count_test)
to_submission_csv(predictions_lr_count_tuned, 'submission_lr_count_tuned')

Tuning LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9922228976443088
severe_toxic: 0.9934887918230756
obscene: 0.9942972093926842
threat: 0.999448521347864
insult: 0.9888012232799193
identity_hate: 0.9950554925393712

Overall training accuracy: 0.9938856893378705
Wall time: 1h 15min 48s


After tuning the models that use count vectors, the training accuracies for all classes have noticeably increased. However, because each candidate is scored on the same data it was fitted on, this increase could also mean that the models have overfitted to the train set. Also, tuning these models took around an hour and 16 minutes, which is considerably long.

In [217]:
%%time
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_tfidf_tuned, predictions_lr_tfidf_tuned = tune_and_train_models(lr, parameters_lr, tfidf_train, tfidf_test)
to_submission_csv(predictions_lr_tfidf_tuned, 'submission_lr_tfidf_tuned')

Tuning LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.9798647623941694
severe_toxic: 0.989340168326325
obscene: 0.9901485858959335
threat: 0.998145026351906
insult: 0.9829918970238953
identity_hate: 0.9930062480024566

Overall training accuracy: 0.9889161146657811
Wall time: 16min 15s


For TF-IDF vectors, the tuning time is significantly faster at around 16 minutes.

In [None]:
%%time
lr = LogisticRegression(n_jobs=-1, class_weight='balanced')
lr_models_wrd2v_tuned, predictions_lr_wrd2v_tuned = tune_and_train_models(lr, parameters_lr, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_lr_wrd2v_tuned, 'submission_lr_wrd2v_tuned')

Tuning LogisticRegression(class_weight='balanced', n_jobs=-1)...
toxic: 0.898502860795508


It can be seen above that the tuned count-vector models had higher training accuracy scores than the tuned TF-IDF models for all six classes. TODO: complete the Word2Vec comparison, since that cell will be run again

#### Test Accuracy Score
As seen from the scores returned by Kaggle, TODO

In [None]:
results = {
    "submission_lr_count_tuned": ['Logistic Regression', 'Count Vectors', 'Tuned', 0.94845, 0.94248],
    "submission_lr_tfidf_tuned": ['Logistic Regression', 'TF-IDF Vectors', 'Tuned', 0.97558, 0.97621], 
    "submission_lr_wrd2v_tuned": ['Logistic Regression', 'Word2Vec Vectors', 'Tuned', 0.95233, 0.94982]
}

all_results = update_results(all_results, results)
all_results.T.sort_values('test accuracy', ascending=False)

### Multinomial Naive Bayes <a class="anchor" id="mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training
A `MultinomialNB()` object with default parameters is initialized to serve as the base classifier.

In [195]:
mn = MultinomialNB()

The model will first be trained using the count vectorized train data.

In [196]:
%%time
mn_models_count, predictions_mn_count = train_models(mn, count_train, count_test)
to_submission_csv(predictions_mn_count, 'submission_mn_count')

Fitting MultinomialNB()...
toxic: 0.9513696097661856
severe_toxic: 0.98641983819115
obscene: 0.9670867513520627
threat: 0.9955505699657206
insult: 0.9646301646289113
identity_hate: 0.9877233331871079

Overall training accuracy: 0.975463377848523
Wall time: 3.75 s


In comparison to the Logistic Regression model fitted with count vectorized data, the training accuracy is higher by only 0.001. Moreover, the time of execution has vastly improved, as fitting the six `MultinomialNB()` estimators only took 3.75 seconds.

Then, the model will be trained using the TF-IDF vectorized data.

In [197]:
%%time
mn_models_tfidf, predictions_mn_tfidf = train_models(mn, tfidf_train, tfidf_test)
to_submission_csv(predictions_mn_tfidf, 'submission_mn_tfidf')

Fitting MultinomialNB()...
toxic: 0.9236828747078103
severe_toxic: 0.9899104473870566
obscene: 0.9538449968979326
threat: 0.996973134216117
insult: 0.9535629907689994
identity_hate: 0.9911074067343063

Overall training accuracy: 0.968180308452037
Wall time: 3.23 s


At 0.968, the overall training accuracy of the MultinomialNB models trained on TF-IDF vectors is lower than that of both the MultinomialNB models trained on count vectors and the LogisticRegression models trained on TF-IDF vectors. However, the training time of 3.23 seconds is the fastest so far.

The MultinomialNB models trained with count vectorized data performed better when predicting **toxic**, **obscene**, and **insult** classes, while TF-IDF vectorized data performed better for **severe_toxic**, **threat**, and **identity_hate** classes.
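The per-class comparison above can be reproduced directly from the printed scores. The snippet below is only a small sketch: the dictionaries are copied by hand from the two outputs above, and the helper names `mean_accuracy` and `better_vectorizer` are made up for illustration and are not part of the notebook.

```python
# Per-class training accuracies copied from the two MultinomialNB outputs above.
count_scores = {
    "toxic": 0.9513696097661856,
    "severe_toxic": 0.98641983819115,
    "obscene": 0.9670867513520627,
    "threat": 0.9955505699657206,
    "insult": 0.9646301646289113,
    "identity_hate": 0.9877233331871079,
}
tfidf_scores = {
    "toxic": 0.9236828747078103,
    "severe_toxic": 0.9899104473870566,
    "obscene": 0.9538449968979326,
    "threat": 0.996973134216117,
    "insult": 0.9535629907689994,
    "identity_hate": 0.9911074067343063,
}


def mean_accuracy(scores):
    """Average the per-class accuracies (hypothetical helper)."""
    return sum(scores.values()) / len(scores)


def better_vectorizer(count, tfidf):
    """Name the vectorizer with the higher score for each class (hypothetical helper)."""
    return {
        label: "count" if count[label] > tfidf[label] else "tfidf"
        for label in count
    }


winners = better_vectorizer(count_scores, tfidf_scores)
for label, vectorizer in winners.items():
    print(f"{label}: {vectorizer}")
print("count mean:", mean_accuracy(count_scores))
print("tfidf mean:", mean_accuracy(tfidf_scores))
```

The two means agree with the overall training accuracies printed by the cells above.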

#### Test Accuracy Score
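As seen from the scores returned by Kaggle, both MultinomialNB models scored noticeably lower on the test set than their training accuracy suggested. The count vectors model scored 0.84551 (private) and 0.85581 (public), ahead of the TF-IDF vectors model at 0.82510 (private) and 0.83586 (public).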

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_mn_count</td>
    <td class="tg-baqh">0.84551</td>
    <td class="tg-baqh">0.85581</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_mn_tfidf</td>
    <td class="tg-baqh">0.82510</td>
    <td class="tg-baqh">0.83586</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning
A `MultinomialNB()` object with default parameters is declared for use as the base estimator, to be tuned by the defined [`parameters_mnb`](#param_mn) hyperparameters.
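As a rough illustration of what tuning a `MultinomialNB()` estimator involves, the sketch below runs a grid search on toy count data. It is not the notebook's `tune_and_train_models()` function. The grid `example_grid`, the toy arrays, and `cv=3` are all assumptions made for this example, and the real `parameters_mnb` may contain different values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative "count" features; class 1 rows get extra counts in column 0.
rng = np.random.default_rng(8)
y_toy = np.array([0, 1] * 15)
X_toy = rng.poisson(1.0, size=(30, 5))
X_toy[y_toy == 1, 0] += 3

# Hypothetical grid: alpha is the smoothing strength, fit_prior toggles
# whether class priors are learned from the data.
example_grid = {"alpha": [0.01, 0.1, 1.0], "fit_prior": [True, False]}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(MultinomialNB(), example_grid, scoring="accuracy", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refitted on all of the toy data with the best combination found.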

In [209]:
%%time
mn = MultinomialNB()
mn_models_count_tuned, predictions_mn_count_tuned = tune_and_train_models(mn, parameters_mnb, count_train, count_test)
to_submission_csv(predictions_mn_count_tuned, 'submission_mn_count_tuned')

Tuning MultinomialNB()...
toxic: 0.9654135149870591
severe_toxic: 0.9901736531073942
obscene: 0.9763992204097236
threat: 0.9970044682304429
insult: 0.974074236546741
identity_hate: 0.9912515432002056

Overall training accuracy: 0.9823861060802611
Wall time: 22.9 s


In [210]:
%%time
mn = MultinomialNB()
mn_models_tfidf_tuned, predictions_mn_tfidf_tuned = tune_and_train_models(mn, parameters_mnb, tfidf_train, tfidf_test)
to_submission_csv(predictions_mn_tfidf_tuned, 'submission_mn_tfidf_tuned')

Tuning MultinomialNB()...
toxic: 0.975553202022924
severe_toxic: 0.9945980159302129
obscene: 0.9865577078541841
threat: 0.998727839018368
insult: 0.9844708625000783
identity_hate: 0.9957949752774627

Overall training accuracy: 0.9892837671005382
Wall time: 21.9 s


<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_mn_count_tuned</td>
    <td class="tg-baqh">0.90205</td>
    <td class="tg-baqh">0.90411</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_mn_tfidf_tuned</td>
    <td class="tg-baqh">0.90610</td>
    <td class="tg-baqh">0.90995</td>
  </tr>
</tbody>
</table>

### RandomForestClassifier <a class="anchor" id="rf"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: add a description of the model.

#### Model Training
A `RandomForestClassifier()` object with `n_jobs=-1`, `class_weight='balanced'`, and `random_state=8` is initialized before it is passed to the `train_models()` function. Using this object as a base, a classifier will be trained for each of the six classes for each vectorizer.
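The idea of training one classifier per class from a single base object can be sketched as follows. This is not the notebook's `train_models()` implementation. The function `fit_one_model_per_label`, the toy data, and the reduced `n_estimators=10` are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def fit_one_model_per_label(base_estimator, X, Y, label_names):
    """Fit an independent copy of the base estimator on each label column (hypothetical helper)."""
    models, scores = {}, {}
    for column, label in enumerate(label_names):
        # clone() returns an unfitted copy with the same hyperparameters,
        # so the base object itself is never fitted.
        model = clone(base_estimator)
        model.fit(X, Y[:, column])
        models[label] = model
        scores[label] = accuracy_score(Y[:, column], model.predict(X))
    return models, scores


# Toy data: 40 rows, 6 integer features, and two binary label columns.
rng = np.random.default_rng(8)
X_toy = rng.integers(0, 5, size=(40, 6))
Y_toy = np.column_stack([X_toy[:, 0] > 2, X_toy[:, 1] > 2]).astype(int)

base = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=8)
models, scores = fit_one_model_per_label(base, X_toy, Y_toy, ["toxic", "insult"])

for label, score in scores.items():
    print(f"{label}: {score}")
```

Because the scores are computed on the same rows the forests were fitted on, they are training accuracies and will usually look very high, as in the outputs that follow.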

In [141]:
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', random_state=8)

In [146]:
%%time
rf_models_count, predictions_rf_count = train_models(rf, count_train, count_test)
to_submission_csv(predictions_rf_count, 'submission_rf_count')

Fitting RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=8)...
toxic: 0.999793195505449
severe_toxic: 0.9998245295197749
obscene: 0.99983079632264
threat: 0.9999247983656178
insult: 0.9996678594481453
identity_hate: 0.9999059979570223

Overall training accuracy: 0.9998245295197749
Wall time: 42min 15s


In [147]:
%%time
rf_models_tfidf, predictions_rf_tfidf = train_models(rf, tfidf_train, tfidf_test)
to_submission_csv(predictions_rf_tfidf, 'submission_rf_tfidf')

Fitting RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=8)...
toxic: 0.9997618614911231
severe_toxic: 0.9997117270682017
obscene: 0.9997869287025838
threat: 0.999931065168483


KeyboardInterrupt: 

In [148]:
%%time
rf_models_wrd2v, predictions_rf_wrd2v = train_models(rf, wrd2v_train, wrd2v_test)
to_submission_csv(predictions_rf_wrd2v, 'submission_rf_wrd2v')

Fitting RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=8)...
toxic: 0.9996490590395498
severe_toxic: 0.9996929266596061
obscene: 0.9996678594481453
threat: 0.999931065168483
insult: 0.9994673217564595
identity_hate: 0.9998871975484267

Overall training accuracy: 0.9997159049367784
Wall time: 6min 44s


<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_rf_count</td>
    <td class="tg-baqh">0.94071</td>
    <td class="tg-baqh">0.94602</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_rf_tfidf</td>
    <td class="tg-baqh">0.93731</td>
    <td class="tg-baqh">0.93999</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [149]:
%%time
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', random_state=8)
rf_models_count_tuned, predictions_rf_count_tuned = tune_and_train_models(rf, parameters_rf, count_train, count_test)
to_submission_csv(predictions_rf_count_tuned, 'submission_rf_count_tuned')

Tuning RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=8)...


KeyboardInterrupt: 

In [150]:
%%time
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', random_state=8)
rf_models_tfidf_tuned, predictions_rf_tfidf_tuned = tune_and_train_models(rf, parameters_rf, tfidf_train, tfidf_test)
to_submission_csv(predictions_rf_tfidf_tuned, 'submission_rf_tfidf_tuned')

Tuning RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=8)...


KeyboardInterrupt: 

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_rf_count_tuned</td>
    <td class="tg-baqh">0.96232</td>
    <td class="tg-baqh">0.96285</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_rf_tfidf_tuned</td>
    <td class="tg-baqh">0.95937</td>
    <td class="tg-baqh">0.95826</td>
  </tr>
</tbody>
</table>

### GradientBoostingClassifier <a class="anchor" id="gbc"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: add a description of the model.

#### Model Training

In [None]:
gbc = GradientBoostingClassifier(random_state=8)

In [None]:
%%time
gbc_models_count, predictions_gbc_count = train_models(gbc, count_train, count_test)

In [None]:
%%time
gbc_models_tfidf, predictions_gbc_tfidf = train_models(gbc, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_gbc_count, 'submission_gbc_count')
to_submission_csv(predictions_gbc_tfidf, 'submission_gbc_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
gbc = GradientBoostingClassifier(random_state=8)
gbc_models_count_tuned, predictions_gbc_count_tuned = tune_and_train_models(gbc, parameters_gbc, count_train, count_test)

In [None]:
%%time
gbc = GradientBoostingClassifier(random_state=8)
gbc_models_tfidf_tuned, predictions_gbc_tfidf_tuned = tune_and_train_models(gbc, parameters_gbc, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_gbc_count_tuned, 'submission_gbc_count_tuned')
to_submission_csv(predictions_gbc_tfidf_tuned, 'submission_gbc_tfidf_tuned')

### XGBClassifier <a class="anchor" id="xgb"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>
TODO: add a description of the model.

#### Model Training

In [None]:
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)

In [None]:
%%time
xgb_models_count, predictions_xgb_count = train_models(xgb, count_train, count_test)

In [None]:
%%time
xgb_models_tfidf, predictions_xgb_tfidf = train_models(xgb, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_xgb_count, 'submission_xgb_count')
to_submission_csv(predictions_xgb_tfidf, 'submission_xgb_tfidf')

<table>
<thead>
  <tr>
    <th></th>
    <th>private</th>
    <th>public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>submission_xgb_count</td>
    <td>0.96468</td>
    <td>0.96783</td>
  </tr>
  <tr>
    <td>submission_xgb_tfidf</td>
    <td>0.96502</td>
    <td>0.96803</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [None]:
%%time
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_count_tuned, predictions_xgb_count_tuned = tune_and_train_models(xgb, parameters_xgb, count_train, count_test)

In [None]:
%%time
xgb = xgboost.XGBClassifier(objective="binary:logistic", eval_metric='auc', verbosity=0, use_label_encoder=False)
xgb_models_tfidf_tuned, predictions_xgb_tfidf_tuned = tune_and_train_models(xgb, parameters_xgb, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_xgb_count_tuned, 'submission_xgb_count_tuned')
to_submission_csv(predictions_xgb_tfidf_tuned, 'submission_xgb_tfidf_tuned')

### AdaBoostClassifier <a class="anchor" id="adb"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_xgb_count_tuned.csv', 'submission_xgb_tfidf_tuned.csv']
)

#### Model Training

In [None]:
adb = AdaBoostClassifier(random_state=8)

In [None]:
%%time
adb_models_count, predictions_adb_count = train_models(adb, count_train, count_test)

In [None]:
%%time
adb_models_tfidf, predictions_adb_tfidf = train_models(adb, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_adb_count, 'submission_adb_count')
to_submission_csv(predictions_adb_tfidf, 'submission_adb_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
adb = AdaBoostClassifier(random_state=8)
adb_models_count_tuned, predictions_adb_count_tuned = tune_and_train_models(adb, parameters_adb, count_train, count_test)

In [None]:
pd.DataFrame(
    data={'private': [0.93539, 0.93830], 'public': [0.94218, 0.94145]}, 
    index=['submission_adb_count.csv', 'submission_adb_tfidf.csv']
)

In [None]:
%%time
adb = AdaBoostClassifier(random_state=8)
adb_models_tfidf_tuned, predictions_adb_tfidf_tuned = tune_and_train_models(adb, parameters_adb, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_adb_count_tuned, 'submission_adb_count_tuned')
to_submission_csv(predictions_adb_tfidf_tuned, 'submission_adb_tfidf_tuned')

### Stochastic Gradient Descent Classifier <a class="anchor" id="sgd"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_xgb_count_tuned.csv', 'submission_xgb_tfidf_tuned.csv']
)

#### Model Training

In [None]:
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)

In [None]:
%%time
sgd_models_count, predictions_sgd_count = train_models(sgd, count_train, count_test)

In [None]:
%%time
sgd_models_tfidf, predictions_sgd_tfidf = train_models(sgd, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_sgd_count, 'submission_sgd_count')
to_submission_csv(predictions_sgd_tfidf, 'submission_sgd_tfidf')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_sgd_count</td>
    <td class="tg-baqh">0.68992</td>
    <td class="tg-baqh">0.70589</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_sgd_tfidf</td>
    <td class="tg-baqh">0.97214</td>
    <td class="tg-baqh">0.97625</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [None]:
%%time
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
sgd_models_count_tuned, predictions_sgd_count_tuned = tune_and_train_models(sgd, parameters_sgd, count_train, count_test)

In [None]:
%%time
sgd = SGDClassifier(loss='modified_huber', class_weight='balanced', n_jobs=-1, random_state=8)
sgd_models_tfidf_tuned, predictions_sgd_tfidf_tuned = tune_and_train_models(sgd, parameters_sgd, tfidf_train, tfidf_test)

In [None]:
to_submission_csv(predictions_sgd_count_tuned, 'submission_sgd_count_tuned')
to_submission_csv(predictions_sgd_tfidf_tuned, 'submission_sgd_tfidf_tuned')

<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_sgd_count_tuned</td>
    <td class="tg-baqh">0.81387</td>
    <td class="tg-baqh">0.81848</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_sgd_tfidf_tuned</td>
    <td class="tg-baqh">0.94660</td>
    <td class="tg-baqh">0.95230</td>
  </tr>
</tbody>
</table>

### OneVsRest Classifier: Logistic Regression <a class="anchor" id="oc_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
oc_lr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))

In [None]:
%%time
oc_lr_model_count, predictions_oc_lr_count = train_model(oc_lr, count_train, count_test)
to_submission_csv(predictions_oc_lr_count, 'submission_oc_lr_count')

In [None]:
%%time
oc_lr_model_tfidf, predictions_oc_lr_tfidf = train_model(oc_lr, tfidf_train, tfidf_test)
to_submission_csv(predictions_oc_lr_tfidf, 'submission_oc_lr_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
oc_lr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
oc_lr_model_count_tuned, predictions_oc_lr_count_tuned = tune_and_train_model(oc_lr, parameters_lr_mo, count_train, count_test)
to_submission_csv(predictions_oc_lr_count_tuned, 'submission_oc_lr_count_tuned')


In [None]:
%%time
oc_lr = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
oc_lr_model_tfidf_tuned, predictions_oc_lr_tfidf_tuned = tune_and_train_model(oc_lr, parameters_lr_mo, tfidf_train, tfidf_test)
to_submission_csv(predictions_oc_lr_tfidf_tuned, 'submission_oc_lr_tfidf_tuned')

### OneVsRest Classifier: Multinomial Naive Bayes <a class="anchor" id="oc_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
oc_mn = OneVsRestClassifier(MultinomialNB())

In [None]:
%%time
oc_mn_model_count, predictions_oc_mn_count = train_model(oc_mn, count_train, count_test)
to_submission_csv(predictions_oc_mn_count, 'submission_oc_mn_count')

In [None]:
%%time
oc_mn_model_tfidf, predictions_oc_mn_tfidf = train_model(oc_mn, tfidf_train, tfidf_test)
to_submission_csv(predictions_oc_mn_tfidf, 'submission_oc_mn_tfidf')

#### Hyperparameter Tuning

In [194]:
%%time
oc_mn = OneVsRestClassifier(MultinomialNB())
oc_mn_model_count_tuned, predictions_oc_mn_count_tuned = tune_and_train_model(oc_mn, 
                                                                              parameters_mn_mo, 
                                                                              count_train, 
                                                                              count_test, 
                                                                              scoring='f1')
to_submission_csv(predictions_oc_mn_count_tuned, 'submission_oc_mn_count_tuned')

Tuning OneVsRestClassifier(estimator=MultinomialNB())...
Overall training f1: 0.7751371196906105
Best parameters: {'estimator__alpha': 1e-05, 'estimator__fit_prior': True}
Wall time: 19.8 s
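The `estimator__` prefix in the best parameters above comes from scikit-learn's nested parameter naming: the grid search tunes the `MultinomialNB()` object stored in the `estimator` attribute of the `OneVsRestClassifier` wrapper. The sketch below shows this on toy multi-label data. The grid values, the toy arrays, `cv=3`, and the use of `scoring='f1_micro'` are assumptions for illustration and may differ from what `tune_and_train_model()` and `parameters_mn_mo` actually use.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy multi-label data: two binary label columns over 30 rows of counts.
rng = np.random.default_rng(8)
X_toy = rng.poisson(1.0, size=(30, 4))
Y_toy = np.column_stack([np.arange(30) % 2, (np.arange(30) // 2) % 2])
X_toy[Y_toy[:, 0] == 1, 0] += 3
X_toy[Y_toy[:, 1] == 1, 1] += 3

# Keys use "<attribute>__<parameter>" so they reach the wrapped MultinomialNB.
example_grid = {
    "estimator__alpha": [1e-05, 0.1, 1.0],
    "estimator__fit_prior": [True, False],
}

search = GridSearchCV(
    OneVsRestClassifier(MultinomialNB()),
    example_grid,
    scoring="f1_micro",  # assumed averaging; works with multi-label targets
    cv=3,
)
search.fit(X_toy, Y_toy)

print(search.best_params_)
```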


In [None]:
%%time
oc_mn = OneVsRestClassifier(MultinomialNB())
oc_mn_model_tfidf_tuned, predictions_oc_mn_tfidf_tuned = tune_and_train_model(oc_mn, parameters_mn_mo, tfidf_train, tfidf_test)
to_submission_csv(predictions_oc_mn_tfidf_tuned, 'submission_oc_mn_tfidf_tuned')

### MultiOutput Classifier: Logistic Regression <a class="anchor" id="mo_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
mo_lr = MultiOutputClassifier(LogisticRegression(n_jobs=-1, class_weight='balanced'), n_jobs=-1)

In [None]:
%%time
mo_lr_model_count, predictions_mo_lr_count = train_model(mo_lr, count_train, count_test)
to_submission_csv_multiclass(predictions_mo_lr_count, 'submission_mo_lr_count')

In [None]:
%%time
mo_lr_model_tfidf, predictions_mo_lr_tfidf = train_model(mo_lr, tfidf_train, tfidf_test)
to_submission_csv_multiclass(predictions_mo_lr_tfidf, 'submission_mo_lr_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
mo_lr = MultiOutputClassifier(LogisticRegression(n_jobs=-1, class_weight='balanced'), n_jobs=-1)
mo_lr_model_count_tuned, predictions_mo_lr_count_tuned = tune_and_train_model(mo_lr, parameters_lr_mo, count_train, count_test)
to_submission_csv_multiclass(predictions_mo_lr_count_tuned, 'submission_mo_lr_count_tuned')

In [None]:
%%time
mo_lr = MultiOutputClassifier(LogisticRegression(n_jobs=-1, class_weight='balanced'), n_jobs=-1)
mo_lr_model_tfidf_tuned, predictions_mo_lr_tfidf_tuned = tune_and_train_model(mo_lr, parameters_lr_mo, tfidf_train, tfidf_test)
to_submission_csv_multiclass(predictions_mo_lr_tfidf_tuned, 'submission_mo_lr_tfidf_tuned')

### MultiOutput Classifier: Multinomial Naive Bayes <a class="anchor" id="mo_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
mo_mn = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)

In [None]:
%%time
mo_mn_model_count, predictions_mo_mn_count = train_model(mo_mn, count_train, count_test)
to_submission_csv_multiclass(predictions_mo_mn_count, 'submission_mo_mn_count')

In [None]:
%%time
mo_mn_model_tfidf, predictions_mo_mn_tfidf = train_model(mo_mn, tfidf_train, tfidf_test)
to_submission_csv_multiclass(predictions_mo_mn_tfidf, 'submission_mo_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
mo_mn = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)
mo_mn_model_count_tuned, predictions_mo_mn_count_tuned = tune_and_train_model(mo_mn, parameters_mn_mo, count_train, count_test)
to_submission_csv_multiclass(predictions_mo_mn_count_tuned, 'submission_mo_mn_count_tuned')

In [None]:
%%time
mo_mn = MultiOutputClassifier(MultinomialNB(), n_jobs=-1)
mo_mn_model_tfidf_tuned, predictions_mo_mn_tfidf_tuned = tune_and_train_model(mo_mn, parameters_mn_mo, tfidf_train, tfidf_test)
to_submission_csv_multiclass(predictions_mo_mn_tfidf_tuned, 'submission_mo_mn_tfidf_tuned')

### Binary Relevance: Logistic Regression <a class="anchor" id="br_lr"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
br_lr = BinaryRelevance(LogisticRegression(n_jobs=-1, class_weight='balanced'))

In [None]:
%%time
br_lr_model_count, predictions_br_lr_count = train_model(br_lr, count_train, count_test, dense=True)
to_submission_csv_multiclass(predictions_br_lr_count, 'submission_br_lr_count')

In [None]:
%%time
br_lr_model_tfidf, predictions_br_lr_tfidf = train_model(br_lr, tfidf_train, tfidf_test, dense=True)
to_submission_csv_multiclass(predictions_br_lr_tfidf, 'submission_br_lr_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
br_lr = BinaryRelevance(LogisticRegression(n_jobs=-1, class_weight='balanced'))
br_lr_model_count_tuned, predictions_br_lr_count_tuned = \
    tune_and_train_model(br_lr, parameters_lr_multi, count_train, count_test, dense=True)
to_submission_csv_multiclass(predictions_br_lr_count_tuned, 'submission_br_lr_count_tuned')

In [None]:
%%time
br_lr = BinaryRelevance(LogisticRegression(n_jobs=-1, class_weight='balanced'))
br_lr_model_tfidf_tuned, predictions_br_lr_tfidf_tuned = \
    tune_and_train_model(br_lr, parameters_lr_multi, tfidf_train, tfidf_test, dense=True)
to_submission_csv_multiclass(predictions_br_lr_tfidf_tuned, 'submission_br_lr_tfidf_tuned')

### Binary Relevance: Multinomial Naive Bayes <a class="anchor" id="br_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
br_mn = BinaryRelevance(MultinomialNB())

In [None]:
%%time
br_mn_model_count, predictions_br_mn_count = train_model(br_mn, count_train, count_test, dense=True)
to_submission_csv_multiclass(predictions_br_mn_count, 'submission_br_mn_count')

In [None]:
%%time
br_mn_model_tfidf, predictions_br_mn_tfidf = train_model(br_mn, tfidf_train, tfidf_test, dense=True)
to_submission_csv_multiclass(predictions_br_mn_tfidf, 'submission_br_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
br_mn = BinaryRelevance(MultinomialNB())
br_mn_model_count_tuned, predictions_br_mn_count_tuned = \
    tune_and_train_model(br_mn, parameters_mn_multi, count_train, count_test, dense=True)
to_submission_csv_multiclass(predictions_br_mn_count_tuned, 'submission_br_mn_count_tuned')

In [None]:
%%time
br_mn = BinaryRelevance(MultinomialNB())
br_mn_model_tfidf_tuned, predictions_br_mn_tfidf_tuned = \
    tune_and_train_model(br_mn, parameters_mn_multi, tfidf_train, tfidf_test, dense=True)
to_submission_csv_multiclass(predictions_br_mn_tfidf_tuned, 'submission_br_mn_tfidf_tuned')

### Classifier Chain: Multinomial Naive Bayes <a class="anchor" id="cc_mn"></a><a style="float:right; font-size:11px" href="#toc">Back to Models List</a>

#### Model Training

In [None]:
cc_mn = ClassifierChain(MultinomialNB())

In [None]:
%%time
cc_mn_model_count, predictions_cc_mn_count = train_model(cc_mn, count_train, count_test, dense=True)
to_submission_csv_multiclass(predictions_cc_mn_count, 'submission_cc_mn_count')

In [None]:
%%time
cc_mn_model_tfidf, predictions_cc_mn_tfidf = train_model(cc_mn, tfidf_train, tfidf_test, dense=True)
to_submission_csv_multiclass(predictions_cc_mn_tfidf, 'submission_cc_mn_tfidf')

#### Hyperparameter Tuning

In [None]:
%%time
cc_mn = ClassifierChain(MultinomialNB(), random_state=8)
cc_mn_model_count_tuned, predictions_cc_mn_count_tuned = \
    tune_and_train_model(cc_mn, parameters_mn_multi, count_train, count_test, dense=True)
to_submission_csv_multiclass(predictions_cc_mn_count_tuned, 'submission_cc_mn_count_tuned')

In [None]:
%%time
cc_mn = ClassifierChain(MultinomialNB(), random_state=8)
cc_mn_model_tfidf_tuned, predictions_cc_mn_tfidf_tuned = \
    tune_and_train_model(cc_mn, parameters_mn_multi, tfidf_train, tfidf_test, dense=True)
to_submission_csv_multiclass(predictions_cc_mn_tfidf_tuned, 'submission_cc_mn_tfidf_tuned')