# You're Toxic, I'm Slippin' Under: Toxic Comment Classification Challenge

#### STINTSY S13 Group 8
- VICENTE, Francheska Josefa
- VISTA, Sophia Danielle S.

## Requirements and Imports
Before starting, the libraries and files needed to build and train the models should be loaded into the notebook.

### Import
Several libraries are required to perform a thorough analysis of the dataset. Each of these libraries will be imported and described below:

#### Basic Libraries 
Import `numpy` and `pandas`.
- `numpy` contains a large collection of mathematical functions
- `pandas` contains functions that are designed for data manipulation and data analysis

In [1]:
import numpy as np
import pandas as pd

#### Natural Language Processing Libraries 
- `re` is a module that allows the use of regular expressions
- `nltk` provides functions for processing text data
- `stopwords` is a corpus from NLTK, which includes a compiled list of stopwords
- `Counter` is from Python's `collections` module, which is helpful for counting tokens
- `string` contains functions for string operations
- `TfidfVectorizer` converts the given text documents into a matrix of TF-IDF features
- `CountVectorizer` converts the given text documents into a matrix of token counts

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
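
To make the difference between the two vectorizers concrete, the following is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the documented default of scikit-learn's `TfidfVectorizer`. The corpus, the whitespace tokenization, and the helper names `count_vector` and `tfidf_vector` are hypothetical and only serve as an illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization stands in for the real tokenizer
corpus = ["you are great", "you are rude", "rude rude comment"]
tokenized = [doc.split() for doc in corpus]
vocabulary = sorted({token for doc in tokenized for token in doc})

def count_vector(tokens):
    # One raw count per vocabulary term (the kind of row a count matrix stores)
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def tfidf_vector(tokens):
    # Weight counts by the smoothed inverse document frequency, then L2-normalize
    n_docs = len(tokenized)
    weights = []
    for term, count in zip(vocabulary, count_vector(tokens)):
        df = sum(term in doc for doc in tokenized)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights.append(count * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

count_matrix = [count_vector(doc) for doc in tokenized]
tfidf_matrix = [tfidf_vector(doc) for doc in tokenized]

print(vocabulary)
print(count_matrix[2])
print([round(w, 3) for w in tfidf_matrix[2]])
```

In the count row, a term that is simply repeated gets a larger value, while in the TF-IDF row the same term is also scaled down if it appears in many documents.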

#### Machine Learning Libraries
The following code block can be used to install **scikit-multilearn** without restarting Jupyter Notebook. The `sys` module provides `sys.executable`, an attribute that holds the path of the running Python interpreter, so `pip` installs scikit-multilearn into the same environment that the notebook uses.

In [3]:
import sys
!{sys.executable} -m pip install scikit-multilearn
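
For reference, the same command can be assembled in plain Python. This is only a sketch: `build_pip_command` is a hypothetical helper, and the actual installation call is left commented out so that running the cell has no side effects.

```python
import sys

def build_pip_command(package):
    # sys.executable is the path of the interpreter that is running this code
    return [sys.executable, "-m", "pip", "install", package]

command = build_pip_command("scikit-multilearn")
print(command)

# To run the installation without the notebook's "!" syntax:
# import subprocess
# subprocess.check_call(command)
```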



The following classes are multi-label classification wrappers, which allow a single model to assign more than one class to the same instance.
- `ClassifierChain` chains binary classifiers so that each prediction depends on the predictions for the earlier classes
- `BinaryRelevance` uses one binary classifier per class and classifies the classes independently
- `MultiOutputClassifier` fits one classifier per target class
- `OneVsRestClassifier` fits one classifier per class, with that class against all the other classes

In [4]:
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier

The following classes are classifiers that implement different methods of classification.
- `RandomForestClassifier` is a class under the ensemble module that fits a number of decision trees and combines their predictions
- `GradientBoostingClassifier` is a class under the ensemble module that builds trees stage by stage to optimize arbitrary differentiable loss functions
- `AdaBoostClassifier` is a class under the ensemble module that implements AdaBoost-SAMME
- `MultinomialNB` is a class under the Naive Bayes module that allows the classification of discrete features, such as word counts
- `LogisticRegression` is a class under the linear models module that implements regularized logistic regression
- `SGDClassifier` is a class under the linear models module that fits linear classifiers using stochastic gradient descent
- `xgboost` is a separate library that provides `XGBClassifier`, a gradient-boosted tree classifier

In [5]:
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

Meanwhile, the following classes and functions are used for hyperparameter tuning and for splitting the data.
- `ParameterGrid` is a class that allows the iteration over different combinations of parameter values
- `GridSearchCV` is a cross-validation class that allows the exhaustive search over all possible combinations of hyperparameter values
- `RandomizedSearchCV` is a cross-validation class that allows a random search over some possible combinations of hyperparameter values
- `train_test_split` is a function that divides the dataset into a training subset and a test subset

In [6]:
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
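
To illustrate what an exhaustive search has to cover, the following sketch expands a small grid with `itertools.product`. It only mimics the idea behind `ParameterGrid`; `expand_grid` is a hypothetical helper, the grid values are examples, and the five folds are an assumption matching the default of recent scikit-learn versions.

```python
from itertools import product

# Illustrative grid; the values are examples only
param_grid = {"C": [1, 12, 15], "max_iter": [600, 1800, 3000, 4200]}

def expand_grid(grid):
    # Yield one dictionary per combination of parameter values
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

candidates = list(expand_grid(param_grid))
print(len(candidates))  # 3 values of C x 4 values of max_iter
print(candidates[0])

n_folds = 5  # assumed number of cross-validation folds
print(len(candidates) * n_folds)  # model fits needed for one exhaustive search
```

The number of fits grows multiplicatively with every added parameter, which is why a random search over some of the combinations can be preferable for larger grids.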

And lastly, these functions compute different scores that measure how well a model works.
- `log_loss` computes the logistic loss (cross-entropy) given the true values and the predicted probabilities
- `f1_score` computes the balanced F-score (the harmonic mean of precision and recall) by comparing the actual classes and the predicted classes
- `accuracy_score` computes the accuracy by determining the fraction of instances whose class was correctly predicted

In [7]:
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
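
As a rough illustration of what these three scores measure, the sketch below computes them by hand for one binary class. The labels, the probabilities, and the 0.5 threshold are made up, and the helper functions are simplified stand-ins rather than the scikit-learn implementations.

```python
import math

# Hypothetical labels and predicted probabilities for one class
y_true = [1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

def accuracy(actual, predicted):
    # Fraction of predictions that match the actual labels
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def f1(actual, predicted):
    # Harmonic mean of precision and recall, written in terms of TP, FP, and FN
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def binary_log_loss(actual, probabilities):
    # Mean negative log-likelihood of the true labels
    terms = [-math.log(p if a == 1 else 1 - p) for a, p in zip(actual, probabilities)]
    return sum(terms) / len(terms)

print(accuracy(y_true, y_pred))
print(f1(y_true, y_pred))
print(round(binary_log_loss(y_true, y_prob), 4))
```

Accuracy and the F-score only look at the thresholded classes, while the logistic loss also penalizes probabilities that are confident and wrong.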

In [8]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category = ConvergenceWarning)

### Datasets and Files
We would be loading the cleaned data that was produced by the previous notebook.

In [9]:
train = pd.read_csv('cleaned_data/cleaned_train.csv')
test = pd.read_csv('cleaned_data/cleaned_test.csv')

## Trying different Models
To determine which model would be best for the task, we would be trying different feature extraction methods, models, and hyperparameters. 

However, before we can utilize the cleaned data, we would need to convert the values in the `comment_text` column into "str, unicode or file objects", as required by the documentation of `TfidfVectorizer` and `CountVectorizer`. Missing values in this column are read by pandas as `NaN`, which the vectorizers reject.

In [11]:
# Cast every comment to a string so that missing values (NaN) do not break the vectorizers
test['comment_text'] = test['comment_text'].apply(lambda x: np.str_(x))
train['comment_text'] = train['comment_text'].apply(lambda x: np.str_(x))
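
One side effect of this conversion is worth seeing on a small example: a missing value becomes the literal string `nan`, which the vectorizers would then treat as a token. The values below are hypothetical, and the empty-string alternative is only a possible variation, not what this notebook does.

```python
import numpy as np

# Hypothetical column values: one real comment and one missing value
values = ["hello there", float("nan")]

as_strings = [np.str_(value) for value in values]
print(as_strings)

# Possible alternative (an assumption, not used in this notebook):
# map missing values to an empty string so that no "nan" token is created
as_empty = ["" if isinstance(v, float) and np.isnan(v) else str(v) for v in values]
print(as_empty)
```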

Then, we would be declaring our **X_train**, **y_train**, and **X_test**.

In [12]:
X_train = train['comment_text']
y_train = train.loc[:, 'toxic':]

X_test = test['comment_text']

Afterwards, we would be declaring the different classes that our model would need to predict. These can be found in the **train** data's column names.

In [9]:
classes = train.columns[2:]
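
The two slices used above can be checked on a toy frame. This sketch assumes that the training data has `id` and `comment_text` as its first two columns, followed by the six label columns. The rows are made up.

```python
import pandas as pd

# Hypothetical two-row frame with the assumed column layout of the training data
toy = pd.DataFrame({
    "id": ["a1", "b2"],
    "comment_text": ["nice edit", "you are rude"],
    "toxic": [0, 1], "severe_toxic": [0, 0], "obscene": [0, 0],
    "threat": [0, 0], "insult": [0, 1], "identity_hate": [0, 0],
})

labels = toy.loc[:, "toxic":]        # label-based slice from 'toxic' to the last column
class_names = list(toy.columns[2:])  # position-based slice that skips the first two columns

print(class_names)
print(labels.shape)
```

Both slices select the same six columns only as long as the column order holds, so an extra column in the file (for example, a saved index) would shift `columns[2:]`.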

In [10]:
class mn_hyper_parameter:
    def __init__(self, class_, alpha, fit_prior):
        self.class_ = class_
        self.alpha = alpha
        self.fit_prior = fit_prior

In [11]:
class lr_hyperparameter:
    def __init__(self, class_, c, max_iter):
        self.class_ = class_
        self.c = c
        self.max_iter = max_iter

#### Helper Functions

In [12]:
def compute_accuracy(predictions, actual):
    # Percentage of predictions that match the actual labels
    accuracy = np.sum(predictions == actual) / len(predictions) * 100
    return accuracy
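
When reading the accuracies printed later, it helps to remember how this score behaves on imbalanced labels. The sketch below uses made-up labels with a single positive comment and a hypothetical "model" that always predicts the negative class.

```python
import numpy as np

def compute_accuracy(predictions, actual):
    # Percentage of predictions that match the actual labels
    return np.sum(predictions == actual) / len(predictions) * 100

# Hypothetical imbalanced labels: 1 positive among 100 comments
actual = np.zeros(100, dtype=int)
actual[0] = 1

always_negative = np.zeros(100, dtype=int)
score = compute_accuracy(always_negative, actual)
print(score)  # 99.0, although the single positive comment was missed
```

For rare classes, a high accuracy on its own therefore says little about whether the positive comments are being found.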

In [13]:
def to_submission_csv(predictions, filename):
    # Fill the sample submission with one column of predicted probabilities per class
    sample_submission = pd.read_csv('data/sample_submission.csv')
    sample_submission['id'] = test['id']

    for i in range(len(classes)):
        sample_submission[classes[i]] = predictions[:, i]

    sample_submission.to_csv('results/' + filename + '.csv', index=False)

In [14]:
def to_submission_csv_multiclass(predictions, filename):
    # `predictions` is expected to hold one array of rows per class,
    # where the second value of each row is the probability of the positive label
    sample_submission = pd.read_csv('data/sample_submission.csv')
    sample_submission['id'] = test['id']

    for i in range(len(classes)):
        temp = list(zip(*predictions[i]))
        sample_submission[classes[i]] = temp[1]

    sample_submission.to_csv('results/' + filename + '.csv', index=False)
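
The `zip(*...)` step in the second helper can be checked on a small example. The sketch assumes that `predictions` is a list with one `(n_samples, 2)` array per class, which is the layout that `MultiOutputClassifier.predict_proba` returns. The numbers are made up.

```python
import numpy as np

# Hypothetical output for 2 classes and 3 comments: one (n_samples, 2) array per class,
# with columns [P(label = 0), P(label = 1)]
predictions = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]),
    np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]),
]

for class_index, class_probabilities in enumerate(predictions):
    transposed = list(zip(*class_probabilities))    # rows become columns
    positive_via_zip = list(transposed[1])          # what the helper writes to the CSV
    positive_via_slice = class_probabilities[:, 1]  # equivalent NumPy slicing
    print(class_index, positive_via_zip, positive_via_slice)
```

Both approaches pick the probability of the positive label, so the slicing form could replace the `zip` call when the inputs are NumPy arrays.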

In [21]:
def train_models(model):
    # Fit `model` once per class on the count vectors and on the TF-IDF vectors,
    # and collect the predicted probability of the positive label on the test set
    predictions_count = np.zeros((len(test), len(classes)))
    predictions_tfidf = np.zeros((len(test), len(classes)))

    for i in range(len(classes)):
        print('Fitting', classes[i] + '...')

        # `mdl` refers to the same estimator object; calling fit() again retrains it
        mdl = model
        mdl.fit(count_train, y_train[classes[i]])
        # Note: the printed accuracy is measured on the training data itself
        print('Count Vectors:', compute_accuracy(mdl.predict(count_train), y_train[classes[i]]))
        predictions_count[:, i] = mdl.predict_proba(count_test)[:, 1]

        mdl = model
        mdl.fit(tfidf_train, y_train[classes[i]])
        print('TF-IDF Vectors:', compute_accuracy(mdl.predict(tfidf_train), y_train[classes[i]]))
        predictions_tfidf[:, i] = mdl.predict_proba(tfidf_test)[:, 1]

    return predictions_count, predictions_tfidf
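
Note that `mdl = model` does not create a new estimator. The sketch below shows the difference between such an alias and a copy made with scikit-learn's `clone`. Using `clone` here is only a suggestion for cases where each fitted model should be kept, and the value `C=12` is an arbitrary example.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=12)

alias = model         # same object, no copy is made
fresh = clone(model)  # new unfitted estimator with the same hyperparameters

print(alias is model)
print(fresh is model)
print(fresh.get_params()["C"])
```

Since each call to `fit` retrains the estimator, reusing one object works for collecting predictions as long as the predictions are taken before the next `fit` call.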

In [22]:
def tune_and_train_models(model):
    # `model` is a search object (e.g., GridSearchCV); for each class, the search is run
    # on the count vectors and on the TF-IDF vectors, and the best parameters are printed
    predictions_count_tuned = np.zeros((len(test), len(classes)))
    predictions_tfidf_tuned = np.zeros((len(test), len(classes)))

    for i in range(len(classes)):
        print('Fitting', classes[i] + '...')

        mdl_tuned = model
        mdl_tuned.fit(count_train, y_train[classes[i]])
        # Note: the printed accuracy is measured on the training data itself
        print('Count Vectors:', compute_accuracy(mdl_tuned.predict(count_train), y_train[classes[i]]), mdl_tuned.best_params_)
        predictions_count_tuned[:, i] = mdl_tuned.predict_proba(count_test)[:, 1]

        mdl_tuned = model
        mdl_tuned.fit(tfidf_train, y_train[classes[i]])
        print('TF-IDF Vectors:', compute_accuracy(mdl_tuned.predict(tfidf_train), y_train[classes[i]]), mdl_tuned.best_params_)
        predictions_tfidf_tuned[:, i] = mdl_tuned.predict_proba(tfidf_test)[:, 1]

    return predictions_count_tuned, predictions_tfidf_tuned
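
The helper above relies on two behaviors of a fitted search object: it exposes `best_params_`, and it forwards `predict` and `predict_proba` to the best estimator that was refitted on all of the data. The sketch below shows both on made-up count features. The tiny grid and `cv=2` are chosen only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count features: column 0 is frequent in positives, column 1 in negatives
X = np.array([[3, 0], [2, 1], [4, 0], [3, 1], [0, 3], [1, 2], [0, 4], [1, 3]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

search = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 1.0]}, scoring="f1", cv=2)
search.fit(X, y)

print(search.best_params_)                  # best combination found by cross-validation
print(search.predict_proba(X)[:, 1].shape)  # forwarded to the refitted best estimator
```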

### TF-IDF Vectorizer

In [10]:
tfidf_vectorizer = TfidfVectorizer()

In [14]:
X_train

0         explanation why the edits made under my userna...
1         d aww he matches this background colour i am s...
2         hey man i am really not trying to edit war it ...
3         more i can not make any real suggestions on im...
4         you sir are my hero any chance you remember wh...
                                ...                        
159566    and for the second time of asking when your vi...
159567    you should be ashamed of yourself that is a ho...
159568    spitzer umm theres no actual article for prost...
159569    and it looks like it was actually you who put ...
159570    and i really do not think you understand i cam...
Name: comment_text, Length: 159571, dtype: object

In [13]:
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

ValueError: np.nan is an invalid document, expected byte or unicode string.

In [None]:
tfidf_test = tfidf_vectorizer.transform(X_test)

In [18]:
tfidf_vectorizer_5000 = TfidfVectorizer(max_features = 5000)

In [None]:
tfidf_train_5000 = tfidf_vectorizer_5000.fit_transform(X_train)

In [None]:
tfidf_test_5000 = tfidf_vectorizer_5000.transform(X_test)

### Count Vectorizer

In [21]:
count_vectorizer = CountVectorizer()

In [22]:
count_train = count_vectorizer.fit_transform(X_train)

In [23]:
count_test = count_vectorizer.transform(X_test)

In [24]:
count_vectorizer_5000 = CountVectorizer(max_features = 5000)

In [25]:
count_train_5000 = count_vectorizer_5000.fit_transform(X_train)

In [26]:
count_test_5000 = count_vectorizer_5000.transform(X_test)
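
The effect of `max_features` can be sketched in plain Python: according to the scikit-learn documentation, only the top terms ordered by term frequency across the corpus are kept in the vocabulary. The corpus below is made up, and ties between equally frequent terms may be resolved differently by scikit-learn than by `Counter.most_common`.

```python
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization for simplicity
corpus = ["rude rude comment", "you are rude", "you are great"]
term_counts = Counter(token for doc in corpus for token in doc.split())

max_features = 2  # the notebook uses 5000 on the real data
kept_terms = [term for term, _ in term_counts.most_common(max_features)]

print(term_counts["rude"])
print(kept_terms[0])
```

Capping the vocabulary shrinks the feature matrices, at the cost of discarding rare terms that might still be informative.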

In [27]:
parameters_mn_multi = [
    {
        'classifier': [MultinomialNB()],
        'classifier__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
        'classifier__fit_prior': [True, False]
    }
]

In [28]:
parameters_lr_multi = [
    {
        'classifier': [LogisticRegression()],
        'classifier__C': [1, 12, 15],
        'classifier__max_iter': [600, 1800, 3000]
    }
]

In [29]:
parameters_mn_mo = [
    {
        'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
        'estimator__fit_prior': [True, False]
    }
]

In [30]:
parameters_lr_mo = [
    {
        'estimator__C': [1, 12, 15],
        'estimator__max_iter': [600, 1800, 3000],
        'estimator__class_weight' : ['balanced', None]
    }
]
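
The `estimator__` prefix in the two grids above follows scikit-learn's naming convention for nested parameters: the part before the double underscore is the attribute that holds the wrapped model, and the part after it is a parameter of that model. The sketch below checks this for `MultiOutputClassifier`. The `classifier__` keys in the earlier grids presumably follow the same convention for wrappers whose attribute is named `classifier`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

wrapper = MultiOutputClassifier(LogisticRegression())

# Parameters of the wrapped model are exposed as "<attribute>__<parameter>"
nested_names = [name for name in wrapper.get_params() if name.startswith("estimator__")]
print("estimator__C" in nested_names)

# set_params accepts the same names, which is what a grid search relies on
wrapper.set_params(estimator__C=12, estimator__max_iter=600)
print(wrapper.estimator.C, wrapper.estimator.max_iter)
```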

In [31]:
parameters_mnb = [
    {
        'alpha' : [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000],
        'fit_prior' : [False, True]
    }
]

In [32]:
parameters_lr = [
    {
        'C' : [1, 12, 15],
        'max_iter' :[600, 1800, 3000, 4200]
    }
]

In [33]:
parameters_rf = [
    {
        'n_estimators' : [500, 1000, 1500],
        'min_samples_split' : [2, 10, 20],
        'max_leaf_nodes' : [15, 20, 25],
        'min_samples_leaf' : [1, 5, 10],
    }
]

### Logistic Regression

In [25]:
lr = LogisticRegression()
predictions_lr_count, predictions_lr_tfidf = train_models(lr)

Fitting toxic...
Count Vectors: 97.1943523572579
TF-IDF Vectors: 96.23553151888501
Fitting severe_toxic...
Count Vectors: 99.19596919239711
TF-IDF Vectors: 99.12578100030707
Fitting obscene...
Count Vectors: 98.13750618846782
TF-IDF Vectors: 97.95514222509102
Fitting threat...
Count Vectors: 99.79632890688158
TF-IDF Vectors: 99.73366087822976
Fitting insult...
Count Vectors: 97.40241021238195
TF-IDF Vectors: 97.39551672923025
Fitting identity_hate...
Count Vectors: 99.27681094935797
TF-IDF Vectors: 99.24109017302642


In [26]:
to_submission_csv(predictions_lr_count, 'submission_lr_count')
to_submission_csv(predictions_lr_tfidf, 'submission_lr_tfidf')

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_lr_count</td>
    <td class="tg-baqh">0.93926</td>
    <td class="tg-baqh">0.94248</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_lr_tfidf</td>
    <td class="tg-baqh">0.97391</td>
    <td class="tg-baqh">0.97376</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [None]:
lr_tuned = GridSearchCV(LogisticRegression(n_jobs=-1), parameters_lr, scoring='f1', verbose=2)
predictions_lr_count_tuned, predictions_lr_tfidf_tuned = tune_and_train_models(lr_tuned)

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_lr_count_tuned</td>
    <td class="tg-baqh">N/A</td>
    <td class="tg-baqh">N/A</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_lr_tfidf_tuned</td>
    <td class="tg-baqh">N/A</td>
    <td class="tg-baqh">N/A</td>
  </tr>
</tbody>
</table>

### Naive Bayes: Multinomial Naive Bayes

In [27]:
mnb = MultinomialNB()
predictions_mnb_count, predictions_mnb_tfidf = train_models(mnb)

Fitting toxic...
Count Vectors: 95.13696097661855
TF-IDF Vectors: 92.36828747078103
Fitting severe_toxic...
Count Vectors: 98.641983819115
TF-IDF Vectors: 98.99104473870565
Fitting obscene...
Count Vectors: 96.70867513520626
TF-IDF Vectors: 95.38449968979326
Fitting threat...
Count Vectors: 99.55505699657206
TF-IDF Vectors: 99.6973134216117
Fitting insult...
Count Vectors: 96.46301646289113
TF-IDF Vectors: 95.35629907689994
Fitting identity_hate...
Count Vectors: 98.77233331871079
TF-IDF Vectors: 99.11074067343063


In [28]:
to_submission_csv(predictions_mnb_count, 'submission_mnb_count')
to_submission_csv(predictions_mnb_tfidf, 'submission_mnb_tfidf')

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_mnb_count</td>
    <td class="tg-baqh">0.84551</td>
    <td class="tg-baqh">0.85581</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_mnb_tfidf</td>
    <td class="tg-baqh">0.82510</td>
    <td class="tg-baqh">0.83586</td>
  </tr>
</tbody>
</table>

#### Hyperparameter Tuning

In [29]:
parameters_mnb = [{
    'alpha' : [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000],
    'fit_prior' : [False, True]
}]

In [30]:
mnb_tuned = GridSearchCV(MultinomialNB(), parameters_mnb, scoring='f1')
predictions_mnb_count_tuned, predictions_mnb_tfidf_tuned = tune_and_train_models(mnb_tuned)

Fitting toxic...
Count Vectors: 95.13696097661855 {'alpha': 1, 'fit_prior': True}
TF-IDF Vectors: 97.48262528905627 {'alpha': 0.001, 'fit_prior': True}
Fitting severe_toxic...
Count Vectors: 98.72031885492977 {'alpha': 0.001, 'fit_prior': True}
TF-IDF Vectors: 99.42157409554368 {'alpha': 0.001, 'fit_prior': True}
Fitting obscene...
Count Vectors: 96.70867513520626 {'alpha': 1, 'fit_prior': True}
TF-IDF Vectors: 98.62067668937338 {'alpha': 0.001, 'fit_prior': True}
Fitting threat...
Count Vectors: 99.31315840597603 {'alpha': 0.0001, 'fit_prior': True}
TF-IDF Vectors: 99.87403726240983 {'alpha': 1e-05, 'fit_prior': True}
Fitting insult...
Count Vectors: 96.46301646289113 {'alpha': 1, 'fit_prior': True}
TF-IDF Vectors: 98.40384531023808 {'alpha': 0.001, 'fit_prior': True}
Fitting identity_hate...
Count Vectors: 98.69337160260949 {'alpha': 0.0001, 'fit_prior': True}
TF-IDF Vectors: 99.56759060230243 {'alpha': 0.001, 'fit_prior': True}


In [31]:
to_submission_csv(predictions_mnb_count_tuned, 'submission_mnb_count_tuned')
to_submission_csv(predictions_mnb_tfidf_tuned, 'submission_mnb_tfidf_tuned')

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_mnb_count_tuned</td>
    <td class="tg-baqh">0.88069</td>
    <td class="tg-baqh">0.88388</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_mnb_tfidf_tuned</td>
    <td class="tg-baqh">0.91455</td>
    <td class="tg-baqh">0.91739</td>
  </tr>
</tbody>
</table>

### Ensemble Models: RandomForestClassifier

In [32]:
rf = RandomForestClassifier(n_jobs=-1)
predictions_rf_count, predictions_rf_tfidf = train_models(rf)

Fitting toxic...
Count Vectors: 99.97869287025839
TF-IDF Vectors: 99.97743950968534
Fitting severe_toxic...
Count Vectors: 99.98558635341008
TF-IDF Vectors: 99.97493278853928
Fitting obscene...
Count Vectors: 99.98370631255052
TF-IDF Vectors: 99.9793195505449
Fitting threat...
Count Vectors: 99.99435987742133
TF-IDF Vectors: 99.99373319713482
Fitting insult...
Count Vectors: 99.96991934624712
TF-IDF Vectors: 99.9655325842415
Fitting identity_hate...
Count Vectors: 99.9931065168483
TF-IDF Vectors: 99.98934643512919


In [33]:
to_submission_csv(predictions_rf_count, 'submission_rf_count')
to_submission_csv(predictions_rf_tfidf, 'submission_rf_tfidf')

In [None]:
pd.DataFrame(
    data={'private': [0.96735, 0.96784], 'public': [0.96725, 0.96710]}, 
    index=['submission_rf_count.csv', 'submission_rf_tfidf.csv']
)

#### Hyperparameter Tuning

In [36]:
parameters_rf = {
    'n_estimators' : [100, 200, 300, 400, 500],
    'criterion' : ['gini', 'entropy'],
    'max_depth' : [5, 10, 20, 30],
    'min_samples_split' : [2, 4, 6, 10, 15, 20],
    'max_leaf_nodes' : [3, 5, 10, 20, 50, 100],
}
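
A random search is a reasonable choice for this grid because of its size. The sketch below counts all combinations and then draws 10 of them, mirroring the default `n_iter=10` of `RandomizedSearchCV`. It uses Python's `random` module, so the sampled candidates will not match the ones scikit-learn draws for the same seed.

```python
import random
from itertools import product

parameters_rf = {
    "n_estimators": [100, 200, 300, 400, 500],
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 20, 30],
    "min_samples_split": [2, 4, 6, 10, 15, 20],
    "max_leaf_nodes": [3, 5, 10, 20, 50, 100],
}

keys = list(parameters_rf)
all_combinations = [dict(zip(keys, values)) for values in product(*parameters_rf.values())]
print(len(all_combinations))  # 5 * 2 * 4 * 6 * 6 = 1440

# Draw 10 distinct candidates instead of evaluating every combination
rng = random.Random(8)
sampled = rng.sample(all_combinations, 10)
print(len(sampled))
```

With 5 folds, 10 candidates mean 50 fits per search instead of 7200 for the full grid.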

In [39]:
rf_tuned = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1), parameters_rf, scoring='f1', random_state=8, verbose=1)
predictions_rf_count_tuned, predictions_rf_tfidf_tuned = tune_and_train_models(rf_tuned)

Fitting toxic...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Count Vectors: 90.41680505856327 {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
Fitting 5 folds for each of 10 candidates, totalling 50 fits
TF-IDF Vectors: 90.4180584191363 {'n_estimators': 500, 'min_samples_split': 2, 'max_leaf_nodes': 50, 'max_depth': 30, 'criterion': 'entropy'}
Fitting severe_toxic...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Count Vectors: 99.00044494300343 {'n_estimators': 200, 'min_samples_split': 2, 'max_leaf_nodes': 20, 'max_depth': 20, 'criterion': 'gini'}
Fitting 5 folds for each of 10 candidates, totalling 50 fits
TF-IDF Vectors: 99.00044494300343 {'n_estimators': 200, 'min_samples_split': 2, 'max_leaf_nodes': 20, 'max_depth': 20, 'criterion': 'gini'}
Fitting obscene...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Count Vectors: 94.70580493949402 {'n_estimators': 500, 'min_samples_spli

In [38]:
to_submission_csv(predictions_rf_count_tuned, 'submission_rf_count_tuned')
to_submission_csv(predictions_rf_tfidf_tuned, 'submission_rf_tfidf_tuned')

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-1wig{font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-1wig"></th>
    <th class="tg-1wig">private</th>
    <th class="tg-1wig">public</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-baqh">submission_rf_count_tuned</td>
    <td class="tg-baqh">0.96232</td>
    <td class="tg-baqh">0.96285</td>
  </tr>
  <tr>
    <td class="tg-baqh">submission_rf_tfidf_tuned</td>
    <td class="tg-baqh">0.95937</td>
    <td class="tg-baqh">0.95826</td>
  </tr>
</tbody>
</table>

### Ensemble Models: GradientBoostingClassifier

In [None]:
gbc = GradientBoostingClassifier()
predictions_gbc_count, predictions_gbc_tfidf = train_models(gbc)

In [None]:
to_submission_csv(predictions_gbc_count, 'submission_gbc_count')
to_submission_csv(predictions_gbc_tfidf, 'submission_gbc_tfidf')

In [None]:
pd.DataFrame(
    data={'private': [0.90663, 0.92569], 'public': [0.92024, 0.93239]}, 
    index=['submission_gbc_count.csv', 'submission_gbc_tfidf.csv']
)

#### Hyperparameter Tuning

In [None]:
parameters_gbc = [{
    'n_estimators' : [50, 100, 250],
    'learning_rate' : [0.0001, 0.001, 0.01, 0.1, 1, 1.2],
}]

In [None]:
gbc_tuned = RandomizedSearchCV(GradientBoostingClassifier(), parameters_gbc, scoring='accuracy', random_state=8, verbose=2)
predictions_gbc_count_tuned, predictions_gbc_tfidf_tuned = tune_and_train_models(gbc_tuned)

In [None]:
to_submission_csv(predictions_gbc_count_tuned, 'submission_gbc_count_tuned')
to_submission_csv(predictions_gbc_tfidf_tuned, 'submission_gbc_tfidf_tuned')

### Ensemble Models: XGBClassifier

In [None]:
xgb = xgboost.XGBClassifier(eval_metric='logloss', verbosity=0, use_label_encoder=False)
predictions_xgb_count, predictions_xgb_tfidf = train_models(xgb)

In [None]:
to_submission_csv(predictions_xgb_count, 'submission_xgb_count')
to_submission_csv(predictions_xgb_tfidf, 'submission_xgb_tfidf')

In [None]:
parameters_xgb = [{
    'n_estimators' : [50, 100, 250],
    'learning_rate' : [0.0001, 0.001, 0.01, 0.1, 1, 1.2],
}]

In [None]:
xgb_tuned = GridSearchCV(xgboost.XGBClassifier(eval_metric='logloss', verbosity=0, use_label_encoder=False), parameters_xgb, scoring='f1')
predictions_xgb_count_tuned, predictions_xgb_tfidf_tuned = tune_and_train_models(xgb_tuned)

In [None]:
to_submission_csv(predictions_xgb_count_tuned, 'submission_xgb_count_tuned')
to_submission_csv(predictions_xgb_tfidf_tuned, 'submission_xgb_tfidf_tuned')

### Ensemble Models: AdaBoostClassifier

In [None]:
adb = AdaBoostClassifier()
predictions_adb_count, predictions_adb_tfidf = train_models(adb)

In [None]:
to_submission_csv(predictions_adb_count, 'submission_adb_count')
to_submission_csv(predictions_adb_tfidf, 'submission_adb_tfidf')

In [None]:
pd.DataFrame(
    data={'private': [0.93539, 0.93830], 'public': [0.94218, 0.94145]}, 
    index=['submission_adb_count.csv', 'submission_adb_tfidf.csv']
)

In [None]:
parameters_adb = {
    'n_estimators' : [10, 25, 50, 100, 250],
    'learning_rate' : [0.0001, 0.001, 0.01, 0.1, 1, 1.2]
}

In [None]:
adb_tuned = GridSearchCV(AdaBoostClassifier(), parameters_adb, scoring='accuracy')
predictions_adb_count_tuned, predictions_adb_tfidf_tuned = tune_and_train_models(adb_tuned)

In [None]:
to_submission_csv(predictions_adb_count_tuned, 'submission_adb_count_tuned')
to_submission_csv(predictions_adb_tfidf_tuned, 'submission_adb_tfidf_tuned')

### Linear Models: SGDClassifier

In [None]:
sgd = SGDClassifier(loss='log', n_jobs=-1)
predictions_sgd_count, predictions_sgd_tfidf = train_models(sgd)

In [None]:
to_submission_csv(predictions_sgd_count, 'submission_sgd_count')
to_submission_csv(predictions_sgd_tfidf, 'submission_sgd_tfidf')

In [None]:
pd.DataFrame(
    data={'private': [0.92026, 0.95112], 'public': [0.92912, 0.95232]}, 
    index=['submission_sgd_count.csv', 'submission_sgd_tfidf.csv']
)

#### Hyperparameter Tuning

In [36]:
parameters_sgd = [{
    'loss' : ['log', 'modified_huber'],
    'alpha' : [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
}]

In [37]:
sgd_tuned = GridSearchCV(SGDClassifier(n_jobs=-1), parameters_sgd, scoring='accuracy')
predictions_sgd_count_tuned, predictions_sgd_tfidf_tuned = tune_and_train_models(sgd_tuned)

Fitting toxic...
Count Vectors: 97.49139881306755 {'alpha': 1e-05, 'loss': 'log'}
TF-IDF Vectors: 98.16570680136115 {'alpha': 1e-05, 'loss': 'modified_huber'}
Fitting severe_toxic...
Count Vectors: 99.08003333939124 {'alpha': 0.01, 'loss': 'log'}
TF-IDF Vectors: 99.10760727199805 {'alpha': 1e-05, 'loss': 'log'}
Fitting obscene...
Count Vectors: 97.64681552412405 {'alpha': 1e-05, 'loss': 'log'}


KeyboardInterrupt: 

In [None]:
to_submission_csv(predictions_sgd_count_tuned, 'submission_sgd_count_tuned')
to_submission_csv(predictions_sgd_tfidf_tuned, 'submission_sgd_tfidf_tuned')

### Ensemble Models: GradientBoostingClassifier

In [53]:
predictions_gbc_count = np.zeros((len(test), len(classes)))
predictions_gbc_tfidf = np.zeros((len(test), len(classes)))

for i in range(6):
    print('Fitting', classes[i] + '...')
    
    gbc = GradientBoostingClassifier(verbose=2)
    
    gbc.fit(count_train, y_train[classes[i]])
    print('Count Vectors:', compute_accuracy(gbc.predict(count_train), y_train[classes[i]]))
    predictions_gbc_count[:,i] = gbc.predict_proba(count_test)[:,1]
    
    gbc.fit(tfidf_train, y_train[classes[i]])
    print('TF-IDF Vectors:', compute_accuracy(gbc.predict(tfidf_train), y_train[classes[i]]))
    predictions_gbc_tfidf[:,i] = gbc.predict_proba(tfidf_test)[:,1]

Fitting toxic...
      Iter       Train Loss   Remaining Time 
         1           0.5865            2.41m
         2           0.5667            2.30m
         3           0.5524            2.20m
         4           0.5425            2.14m
         5           0.5329            2.10m
         6           0.5258            2.10m
         7           0.5180            2.10m
         8           0.5124            2.10m
         9           0.5050            2.10m
        10           0.5007            2.09m
        11           0.4942            2.06m
        12           0.4897            2.03m
        13           0.4840            2.02m
        14           0.4796            2.00m
        15           0.4761            1.99m
        16           0.4699            1.97m
        17           0.4667            1.93m
        18           0.4623            1.90m
        19           0.4596            1.87m
        20           0.4573            1.85m
        21           0.4547          

        80           0.3567            1.06m
        81           0.3558            1.01m
        82           0.3552           57.21s
        83           0.3539           54.05s
        84           0.3532           50.87s
        85           0.3525           47.67s
        86           0.3515           44.48s
        87           0.3508           41.30s
        88           0.3501           38.12s
        89           0.3496           34.93s
        90           0.3489           31.75s
        91           0.3483           28.57s
        92           0.3477           25.39s
        93           0.3469           22.21s
        94           0.3458           19.03s
        95           0.3450           15.86s
        96           0.3444           12.68s
        97           0.3438            9.51s
        98           0.3429            6.34s
        99           0.3424            3.17s
       100           0.3417            0.00s
TF-IDF Vectors: 94.43382569514512
Fitting severe_toxic.

        58           0.0540            2.30m
        59           0.0538            2.25m
        60           0.0536            2.19m
        61           0.0535            2.14m
        62           0.0532            2.08m
        63           0.0529            2.03m
        64           0.0527            1.97m
        65           0.0527            1.93m
        66           0.0525            1.87m
        67           0.0523            1.82m
        68           0.0522            1.76m
        69           0.0519            1.71m
        70           0.0519            1.65m
        71           0.0517            1.60m
        72           0.0517            1.54m
        73           0.0515            1.49m
        74           0.0514            1.43m
        75           0.0513            1.38m
        76           0.0510            1.32m
        77           0.0508            1.27m
        78           0.0506            1.21m
        79           0.0505            1.16m
        80

        37           0.1965            3.28m
        38           0.1957            3.23m
        39           0.1948            3.17m
        40           0.1937            3.12m
        41           0.1926            3.07m
        42           0.1908            3.02m
        43           0.1896            2.97m
        44           0.1887            2.91m
        45           0.1877            2.86m
        46           0.1869            2.81m
        47           0.1862            2.76m
        48           0.1853            2.71m
        49           0.1845            2.65m
        50           0.1836            2.60m
        51           0.1816            2.55m
        52           0.1810            2.50m
        53           0.1802            2.45m
        54           0.1796            2.40m
        55           0.1788            2.35m
        56           0.1778            2.29m
        57           0.1772            2.24m
        58           0.1757            2.19m
        59

        60 325854827083698219568101057963223825814381632814998035103744.0000           50.14s
        61 325854827083698219568101057963223825814381632814998035103744.0000           48.80s
        62 325854827083698219568101057963223825814381632814998035103744.0000           47.47s
        63 325854827083698219568101057963223825814381632814998035103744.0000           46.13s
        64 325854827083698219568101057963223825814381632814998035103744.0000           44.81s
        65 325854827083698219568101057963223825814381632814998035103744.0000           43.50s
        66 325854827083698219568101057963223825814381632814998035103744.0000           42.21s
        67 325854827083698219568101057963223825814381632814998035103744.0000           40.91s
        68 325854827083698219568101057963223825814381632814998035103744.0000           39.61s
        69 325854827083698219568101057963223825814381632814998035103744.0000           38.32s
        70 3258548270836982195681010579632238258143816328149

        96    35116343.4071           12.23s
        97    35116343.4071            9.17s
        98    35116343.4071            6.11s
        99    35116343.4071            3.06s
       100    35116343.4071            0.00s
TF-IDF Vectors: 99.78066189971862
Fitting insult...
      Iter       Train Loss   Remaining Time 
         1           0.3446            2.23m
         2           0.3304            2.12m
         3           0.3197            2.06m
         4           0.3116            2.01m
         5           0.3040            2.00m
         6           0.2994            1.97m
         7           0.2953            1.95m
         8           0.2903            1.92m
         9           0.2856            1.91m
        10           0.2830            1.89m
        11           0.2796            1.86m
        12           0.2771            1.84m
        13           0.2739            1.82m
        14           0.2715            1.79m
        15           0.2697            1.77m
  

        75           0.1994            1.28m
        76           0.1989            1.23m
        77           0.1979            1.18m
        78           0.1974            1.13m
        79           0.1970            1.07m
        80           0.1967            1.02m
        81           0.1964           58.33s
        82           0.1960           55.27s
        83           0.1957           52.19s
        84           0.1949           49.12s
        85           0.1945           46.05s
        86           0.1941           42.98s
        87           0.1937           39.90s
        88           0.1926           36.83s
        89           0.1922           33.76s
        90           0.1918           30.69s
        91           0.1915           27.62s
        92           0.1910           24.55s
        93           0.1901           21.48s
        94           0.1898           18.41s
        95           0.1894           15.35s
        96           0.1891           12.28s

        53           0.0492            2.38m
        54           0.0491            2.33m
        55           0.0490            2.28m
        56           0.0486            2.23m
        57           0.0484            2.18m
        58           0.0481            2.13m
        59           0.0479            2.08m
        60           0.0474            2.03m
        61           0.0473            1.98m
        62           0.0473            1.92m
        63           0.0470            1.87m
        64           0.0465            1.82m
        65           0.0462            1.77m
        66           0.0460            1.72m
        67           0.0458            1.67m
        68           0.0456            1.62m
        69           0.0454            1.57m
        70           0.0452            1.51m
        71           0.0449            1.46m
        72           0.0447            1.41m
        73           0.0445            1.36m
        74           0.0444            1.31m

In [54]:
to_submission_csv(predictions_sgd_count_tuned, 'submission_sgd_count_tuned')
to_submission_csv(predictions_sgd_tfidf_tuned, 'submission_sgd_tfidf_tuned')

In [None]:
to_submission_csv(predictions_adb_count, 'submission_adb_count')
to_submission_csv(predictions_adb_tfidf, 'submission_adb_tfidf')

In [77]:
pd.DataFrame(
    data={'private': [0.79902, 0.95247], 'public': [0.81682, 0.95784]}, 
    index=['predictions_sgd_count_tuned.csv', 'predictions_sgd_tfidf_tuned.csv']
)

Unnamed: 0,private,public
submission_adb_count.csv,0.93539,0.94218
submission_adb_tfidf.csv,0.9383,0.94145


### OneVsRest Classifier: Logistic Regression

#### Model Training

In [27]:
lr_oc_count = OneVsRestClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_oc_count.fit(count_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                 max_iter=3000))
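
For reference, `class_weight='balanced'` in scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. A minimal sketch of that formula on a made-up binary label column (the toy labels and the helper name `balanced_class_weights` are assumptions for illustration, not part of the notebook):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical label column: 9 clean comments and 1 toxic comment.
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With this toy column, the single positive example is weighted nine times as heavily as each negative one.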

In [28]:
lr_oc_tf = OneVsRestClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_oc_tf.fit(tfidf_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                 max_iter=3000))

In [29]:
predictions = lr_oc_tf.predict(tfidf_train)
print('TF-IDF Vectors: ' , compute_accuracy(predictions, y_train))

predictions = lr_oc_count.predict(count_train)
print('Count Vectors: ', compute_accuracy(predictions, y_train))

TF-IDF Vectors:  toxic            95.739201
severe_toxic     97.941355
obscene          98.078598
threat           99.375200
insult           96.812077
identity_hate    98.102412
dtype: float64
Count Vectors:  toxic            97.955142
severe_toxic     98.672064
obscene          98.804294
threat           99.741808
insult           97.874927
identity_hate    98.947177
dtype: float64
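
Per-label training accuracy can look high simply because most comments carry none of the labels. The sketch below, on made-up labels (an assumption, not the actual data; `accuracy_percent` is a hypothetical stand-in for the notebook's `compute_accuracy`), shows the accuracy an all-zero predictor would reach. That figure is a useful baseline when reading the numbers above.

```python
def accuracy_percent(predicted, actual):
    """Share of positions where prediction and label agree, in percent."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)

# Hypothetical rare label: 3 positives out of 1000 comments.
actual = [1] * 3 + [0] * 997
all_zero = [0] * len(actual)

baseline = accuracy_percent(all_zero, actual)
print(baseline)
```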


In [30]:
predictions_lr_oc_tf = lr_oc_tf.predict_proba(tfidf_test)
predictions_lr_oc_count = lr_oc_count.predict_proba(count_test)

In [35]:
to_submission_csv(predictions_lr_oc_tf, 'submission_oc_lr_tf')
to_submission_csv(predictions_lr_oc_count, 'submission_oc_lr_count')

In [36]:
pd.DataFrame(
    data={'private': [0.94036, 0.97558], 'public': [0.94400, 0.97621]}, 
    index=['submission_oc_lr_count.csv', 'submission_oc_lr_tf.csv']
)

Unnamed: 0,private,public
submission_oc_lr_count.csv,0.94036,0.944
submission_oc_lr_tf.csv,0.97558,0.97621
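
The private and public values are Kaggle leaderboard scores. This competition scores submissions by mean column-wise ROC AUC, which is presumably why `predict_proba` outputs are submitted rather than hard labels. A minimal pure-Python sketch of that metric on made-up scores (function names and data are illustrative assumptions):

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def mean_columnwise_auc(label_columns, score_columns):
    """Average the per-label AUC over all label columns."""
    aucs = [roc_auc(y, s) for y, s in zip(label_columns, score_columns)]
    return sum(aucs) / len(aucs)

# Two hypothetical label columns with predicted probabilities.
y_cols = [[0, 0, 1, 1], [0, 1, 0, 1]]
p_cols = [[0.1, 0.4, 0.35, 0.8], [0.2, 0.9, 0.3, 0.7]]
mean_auc = mean_columnwise_auc(y_cols, p_cols)
print(mean_auc)
```

The pairwise loop is quadratic in the number of samples, so it only suits small illustrations; for real data a library routine would be the practical choice.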


#### Hyperparameter Tuning

In [57]:
predictions_lr_count_tuned = np.zeros((len(test), len(classes)))
predictions_lr_tfidf_tuned = np.zeros((len(test), len(classes)))

In [58]:
estimator = OneVsRestClassifier(LogisticRegression ())

In [59]:
lr_oc_tf_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs = -1, verbose = 10, scoring = 'accuracy')
lr_oc_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


GridSearchCV(estimator=OneVsRestClassifier(estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid=[{'estimator__C': [1, 12, 15],
                          'estimator__class_weight': ['balanced', None],
                          'estimator__max_iter': [600, 1800, 3000]}],
             scoring='accuracy', verbose=10)
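
The "18 candidates, totalling 90 fits" in the log follows directly from the grid: 3 values of `C` times 2 class-weight settings times 3 iteration limits, each evaluated on 5 folds. A small sketch that enumerates the same combinations with `itertools.product` (the grid is copied from the repr above; the enumeration is only an illustration, not how `GridSearchCV` is implemented):

```python
from itertools import product

# Grid copied from the GridSearchCV repr above.
param_grid = {
    'estimator__C': [1, 12, 15],
    'estimator__class_weight': ['balanced', None],
    'estimator__max_iter': [600, 1800, 3000],
}
n_folds = 5  # number of folds reported in the fitting log

# Build every combination, iterating over keys in sorted order.
keys = sorted(param_grid)
candidates = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]

total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```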

In [60]:
predictions = lr_oc_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: ', compute_accuracy(predictions, y_train), lr_oc_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            98.477167
severe_toxic     99.529363
obscene          99.205369
threat           99.895344
insult           98.869469
identity_hate    99.662219
dtype: float64 {'estimator__C': 12, 'estimator__class_weight': None, 'estimator__max_iter': 600}


In [61]:
predictions_lr_tfidf_tuned = lr_oc_tf_tuned.predict_proba(tfidf_test)
to_submission_csv(predictions_lr_tfidf_tuned, 'submission_oc_lr_tf_tuned')

In [None]:
lr_oc_count_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_oc_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 18 candidates, totalling 90 fits


In [None]:
predictions = lr_oc_count_tuned.predict(count_train)
print('Count Vectors: ', compute_accuracy(predictions, y_train), lr_oc_count_tuned.best_params_)

In [None]:
predictions_lr_count_tuned = lr_oc_count_tuned.predict_proba(count_test)
to_submission_csv(predictions_lr_count_tuned, 'submission_oc_lr_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0.93996, 0.97558], 'public': [0.94410, 0.97621]}, 
    index=['submission_oc_lr_count_tuned.csv', 'submission_oc_lr_tf_tuned.csv']
)

### OneVsRest Classifier: Multinomial Naive Bayes

#### Model Training

In [34]:
mn_oc_count = OneVsRestClassifier(MultinomialNB())
mn_oc_count.fit(count_train, y_train)

OneVsRestClassifier(estimator=MultinomialNB())

In [35]:
mn_oc_tf = OneVsRestClassifier(MultinomialNB())
mn_oc_tf.fit(tfidf_train, y_train)

OneVsRestClassifier(estimator=MultinomialNB())

In [36]:
predictions = mn_oc_tf.predict(tfidf_train)
print('TF-IDF Vectors: \n' , compute_accuracy(predictions, y_train))

predictions = mn_oc_count.predict(count_train)
print('Count Vectors: \n', compute_accuracy(predictions, y_train))

TF-IDF Vectors:  toxic            92.368287
severe_toxic     98.991045
obscene          95.384500
threat           99.697313
insult           95.356299
identity_hate    99.110741
dtype: float64
Count Vectors:  toxic            95.136961
severe_toxic     98.641984
obscene          96.708675
threat           99.555057
insult           96.463016
identity_hate    98.772333
dtype: float64


In [37]:
predictions_mn_oc_tf = mn_oc_tf.predict_proba(tfidf_test)
predictions_mn_oc_count = mn_oc_count.predict_proba(count_test)

In [38]:
to_submission_csv(predictions_mn_oc_tf, 'submission_oc_mn_tf')
to_submission_csv(predictions_mn_oc_count, 'submission_oc_mn_count')

In [50]:
pd.DataFrame(
    data={'private': [0.84551, 0.82510], 'public': [0.85581, 0.83586]}, 
    index=['submission_oc_mn_count.csv', 'submission_oc_mn_tf.csv']
)

Unnamed: 0,private,public
submission_oc_mn_count.csv,0.84551,0.85581
submission_oc_mn_tf.csv,0.8251,0.83586


#### Hyperparameter Tuning

In [40]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [41]:
estimator = OneVsRestClassifier(MultinomialNB ())

In [42]:
mn_oc_tf_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'accuracy')
mn_oc_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


GridSearchCV(estimator=OneVsRestClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='accuracy', verbose=10)
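
`estimator__alpha` is the additive (Laplace/Lidstone) smoothing constant of `MultinomialNB`. A small sketch of the smoothed per-token likelihood, `(count + alpha) / (total + alpha * vocabulary_size)`, on made-up token counts (the tokens and the helper name are assumptions for illustration):

```python
def smoothed_likelihoods(token_counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(token_counts.values())
    vocab_size = len(token_counts)
    denom = total + alpha * vocab_size
    return {tok: (cnt + alpha) / denom for tok, cnt in token_counts.items()}

# Hypothetical token counts for one class; 'rare' never occurs in it.
counts = {'idiot': 6, 'thanks': 2, 'rare': 0}

for alpha in (0.5, 1.0):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, round(probs['rare'], 4))
```

A smaller `alpha` leaves less probability mass for unseen tokens, so the model trusts the observed counts more.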

In [43]:
predictions = mn_oc_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: ', compute_accuracy(predictions, y_train), mn_oc_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            93.609114
severe_toxic     98.989791
obscene          96.043141
threat           99.691673
insult           95.876444
identity_hate    99.106354
dtype: float64 {'estimator__alpha': 0.5, 'estimator__fit_prior': True}


In [44]:
predictions_mn_tfidf_tuned = mn_oc_tf_tuned.predict_proba(tfidf_test)
to_submission_csv(predictions_mn_tfidf_tuned, 'submission_oc_mn_tf_tuned')

In [45]:
mn_oc_count_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'accuracy')
mn_oc_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 10 candidates, totalling 50 fits


GridSearchCV(estimator=OneVsRestClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='accuracy', verbose=10)

In [46]:
predictions = mn_oc_count_tuned.predict(count_train)
print('Count Vectors: ', compute_accuracy(predictions, y_train), mn_oc_count_tuned.best_params_)

Count Vectors:  toxic            95.136961
severe_toxic     98.641984
obscene          96.708675
threat           99.555057
insult           96.463016
identity_hate    98.772333
dtype: float64 {'estimator__alpha': 1.0, 'estimator__fit_prior': True}


In [47]:
predictions_mn_count_tuned = mn_oc_count_tuned.predict_proba(count_test)
to_submission_csv(predictions_mn_count_tuned, 'submission_oc_mn_count_tuned')

In [49]:
pd.DataFrame(
    data={'private': [0.84551, 0.85045], 'public': [0.85581, 0.86105]}, 
    index=['submission_oc_mn_count_tuned.csv', 'submission_oc_mn_tf_tuned.csv']
)

Unnamed: 0,private,public
submission_oc_mn_count_tuned.csv,0.84551,0.85581
submission_oc_mn_tf_tuned.csv,0.85045,0.86105


### MultiOutput Classifier: Logistic Regression

#### Model Training

In [142]:
X_train = train['comment_text']
X_test = test['comment_text']
y_train = train.loc[:, 'toxic':]
y_train

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
159566,0,0,0,0,0,0
159567,0,0,0,0,0,0
159568,0,0,0,0,0,0
159569,0,0,0,0,0,0


In [143]:
lr_mo_count = MultiOutputClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_mo_count.fit(count_train, y_train)

MultiOutputClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                   max_iter=3000))

In [144]:
lr_mo_tf = MultiOutputClassifier(LogisticRegression(class_weight = 'balanced', max_iter = 3000))
lr_mo_tf.fit(tfidf_train, y_train)

MultiOutputClassifier(estimator=LogisticRegression(class_weight='balanced',
                                                   max_iter=3000))

In [146]:
predictions = lr_mo_tf.predict(tfidf_train)
print('TF-IDF Vectors: ' , compute_accuracy(predictions, y_train))

predictions = lr_mo_count.predict(count_train)
print('Count Vectors: ', compute_accuracy(predictions, y_train))

TF-IDF Vectors:  toxic            94.403118
severe_toxic     97.335982
obscene          97.324702
threat           99.067500
insult           95.850750
identity_hate    97.123537
dtype: float64
Count Vectors:  toxic            97.955142
severe_toxic     98.672064
obscene          98.804294
threat           99.741808
insult           97.874927
identity_hate    98.947177
dtype: float64


In [149]:
predictions_lr_mo_tf = lr_mo_tf.predict_proba(tfidf_test)
predictions_lr_mo_count = lr_mo_count.predict_proba(count_test)
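
`MultiOutputClassifier.predict_proba` returns a list with one `(n_samples, 2)` array per label rather than a single matrix, which is presumably why a separate `to_submission_csv_multiclass` helper is used for these models. Its definition is not shown in this section, so the following is only a sketch, with stand-in arrays, of how the positive-class column of each label could be stacked into an `(n_samples, n_labels)` matrix:

```python
import numpy as np

def stack_positive_probabilities(per_label_proba):
    """Take column 1 (probability of class 1) from each per-label array
    and place the columns side by side."""
    return np.column_stack([proba[:, 1] for proba in per_label_proba])

# Stand-in for predict_proba output: 2 labels, 3 samples each.
fake_proba = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]),
    np.array([[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]),
]
matrix = stack_positive_probabilities(fake_proba)
print(matrix.shape)
```

Indexing column 1 assumes every label had both classes present during training; a label with only one observed class would yield a single-column array.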

In [150]:
to_submission_csv_multiclass(predictions_lr_mo_tf, 'submission_mo_lr_tf')
to_submission_csv_multiclass(predictions_lr_mo_count, 'submission_mo_lr_count')

In [154]:
pd.DataFrame(
    data={'private': [0.94036, 0.97063], 'public': [0.94400, 0.97183]}, 
    index=['submission_mo_lr_count.csv', 'submission_mo_lr_tf.csv']
)

Unnamed: 0,private,public
submission_mo_lr_count.csv,0.94036,0.944
submission_mo_lr_tf.csv,0.97063,0.97183


#### Hyperparameter Tuning

In [152]:
predictions_lr_count_tuned = np.zeros((len(test), len(classes)))
predictions_lr_tfidf_tuned = np.zeros((len(test), len(classes)))

In [155]:
estimator = MultiOutputClassifier(LogisticRegression ())
lr_mo_tf_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_mo_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits




GridSearchCV(estimator=MultiOutputClassifier(estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid=[{'estimator__C': [1, 12, 15],
                          'estimator__class_weight': ['balanced', None],
                          'estimator__max_iter': [600, 1800, 3000]}],
             scoring='f1', verbose=10)
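
One caveat: scikit-learn's plain `'f1'` scorer is the binary F1, whereas `y_train` here has six label columns. Averaged variants such as `'f1_micro'` or `'f1_macro'` are the ones defined for multilabel targets, so it may be worth inspecting `cv_results_` to confirm the fold scores are not NaN. A pure-Python sketch of the two averages on made-up predictions (all names and data are illustrative assumptions):

```python
def label_counts(actual, predicted):
    """True positives, false positives and false negatives for one label."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(actual_cols, predicted_cols):
    """Average of the per-label F1 scores."""
    scores = [f1_from_counts(*label_counts(a, p))
              for a, p in zip(actual_cols, predicted_cols)]
    return sum(scores) / len(scores)

def micro_f1(actual_cols, predicted_cols):
    """F1 computed from counts pooled over all labels."""
    totals = [label_counts(a, p) for a, p in zip(actual_cols, predicted_cols)]
    tp, fp, fn = (sum(col) for col in zip(*totals))
    return f1_from_counts(tp, fp, fn)

# Two hypothetical labels, four comments each.
actual_cols = [[1, 1, 1, 0], [0, 0, 0, 1]]
predicted_cols = [[1, 1, 0, 0], [0, 1, 0, 0]]
print(macro_f1(actual_cols, predicted_cols), micro_f1(actual_cols, predicted_cols))
```

Macro averaging treats every label equally, so a rare label that is never predicted correctly pulls the score down sharply; micro averaging is dominated by the frequent labels.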

In [156]:
predictions = lr_mo_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: ', compute_accuracy(predictions, y_train), lr_mo_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            94.403118
severe_toxic     97.335982
obscene          97.324702
threat           99.067500
insult           95.850750
identity_hate    97.123537
dtype: float64 {'estimator__C': 1, 'estimator__class_weight': 'balanced', 'estimator__max_iter': 600}


In [157]:
predictions_lr_tfidf_tuned = lr_mo_tf_tuned.predict_proba(tfidf_test)
to_submission_csv_multiclass(predictions_lr_tfidf_tuned, 'submission_mo_lr_tf_tuned')

In [158]:
lr_mo_count_tuned = GridSearchCV(estimator, parameters_lr_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_mo_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 18 candidates, totalling 90 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

GridSearchCV(estimator=MultiOutputClassifier(estimator=LogisticRegression()),
             n_jobs=-1,
             param_grid=[{'estimator__C': [1, 12, 15],
                          'estimator__class_weight': ['balanced', None],
                          'estimator__max_iter': [600, 1800, 3000]}],
             scoring='f1', verbose=10)

In [159]:
predictions = lr_mo_count_tuned.predict(count_train)
print('Count Vectors: ', compute_accuracy(predictions, y_train), lr_mo_count_tuned.best_params_)

Count Vectors:  toxic            97.936342
severe_toxic     98.640730
obscene          98.735359
threat           99.741808
insult           97.858633
identity_hate    98.945924
dtype: float64 {'estimator__C': 1, 'estimator__class_weight': 'balanced', 'estimator__max_iter': 600}


In [160]:
predictions_lr_count_tuned = lr_mo_count_tuned.predict_proba(count_test)
to_submission_csv_multiclass(predictions_lr_count_tuned, 'submission_mo_lr_count_tuned')

In [161]:
pd.DataFrame(
    data={'private': [0.93996, 0.97063], 'public': [0.94410, 0.97183]}, 
    index=['submission_mo_lr_count_tuned.csv', 'submission_mo_lr_tf_tuned.csv']
)

Unnamed: 0,private,public
submission_mo_lr_count_tuned.csv,0.93996,0.9441
submission_mo_lr_tf_tuned.csv,0.97063,0.97183


### MultiOutput Classifier: Multinomial Naive Bayes

#### Model Training

In [51]:
mn_mo_count = MultiOutputClassifier(MultinomialNB())
mn_mo_count.fit(count_train, y_train)

MultiOutputClassifier(estimator=MultinomialNB())

In [52]:
mn_mo_tf = MultiOutputClassifier(MultinomialNB())
mn_mo_tf.fit(tfidf_train, y_train)

MultiOutputClassifier(estimator=MultinomialNB())

In [53]:
predictions = mn_mo_tf.predict(tfidf_train)
print('TF-IDF Vectors: \n' , compute_accuracy(predictions, y_train))

predictions = mn_mo_count.predict(count_train)
print('Count Vectors: \n', compute_accuracy(predictions, y_train))

TF-IDF Vectors: 
 toxic            92.368287
severe_toxic     98.991045
obscene          95.384500
threat           99.697313
insult           95.356299
identity_hate    99.110741
dtype: float64
Count Vectors: 
 toxic            95.136961
severe_toxic     98.641984
obscene          96.708675
threat           99.555057
insult           96.463016
identity_hate    98.772333
dtype: float64


In [54]:
predictions_mnb_mo_tf = mn_mo_tf.predict_proba(tfidf_test)
predictions_mnb_mo_count = mn_mo_count.predict_proba(count_test)

In [55]:
to_submission_csv_multiclass(predictions_mnb_mo_tf, 'submission_mo_mn_tf')
to_submission_csv_multiclass(predictions_mnb_mo_count, 'submission_mo_mn_count')

In [56]:
pd.DataFrame(
    data={'private': [0.84551, 0.82510], 'public': [0.85581, 0.83586]}, 
    index=['submission_mo_mn_count.csv', 'submission_mo_mn_tf.csv']
)

Unnamed: 0,private,public
submission_mo_mn_count.csv,0.84551,0.85581
submission_mo_mn_tf.csv,0.8251,0.83586


#### Hyperparameter Tuning

In [38]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [39]:
estimator = MultiOutputClassifier(MultinomialNB ())
mn_mo_tf_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_mo_tf_tuned.fit(tfidf_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits




GridSearchCV(estimator=MultiOutputClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='f1', verbose=10)

In [40]:
predictions = mn_mo_tf_tuned.predict(tfidf_train)
print('TF-IDF Vectors: \n', compute_accuracy(predictions, y_train), mn_mo_tf_tuned.best_params_)

TF-IDF Vectors:  toxic            93.609114
severe_toxic     98.989791
obscene          96.043141
threat           99.691673
insult           95.876444
identity_hate    99.106354
dtype: float64 {'estimator__alpha': 0.5, 'estimator__fit_prior': True}


In [41]:
predictions_mn_tfidf_tuned = mn_mo_tf_tuned.predict_proba(tfidf_test)
to_submission_csv_multiclass(predictions_mn_tfidf_tuned, 'submission_mo_mn_tf_tuned')

In [43]:
mn_mo_count_tuned = GridSearchCV(estimator, parameters_mn_mo, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_mo_count_tuned.fit(count_train, y_train)   

Fitting 5 folds for each of 10 candidates, totalling 50 fits




GridSearchCV(estimator=MultiOutputClassifier(estimator=MultinomialNB()),
             n_jobs=-1,
             param_grid=[{'estimator__alpha': [0.5, 0.6, 0.7, 0.8, 1.0],
                          'estimator__fit_prior': [True, False]}],
             scoring='f1', verbose=10)

In [44]:
predictions = mn_mo_count_tuned.predict(count_train)
print('Count Vectors: \n', compute_accuracy(predictions, y_train), mn_mo_count_tuned.best_params_)

Count Vectors:  toxic            95.165162
severe_toxic     98.426406
obscene          96.575192
threat           99.472335
insult           96.317627
identity_hate    98.473407
dtype: float64 {'estimator__alpha': 0.5, 'estimator__fit_prior': True}


In [45]:
predictions_mn_count_tuned = mn_mo_count_tuned.predict_proba(count_test)
to_submission_csv_multiclass(predictions_mn_count_tuned, 'submission_mo_mn_count_tuned')

In [46]:
pd.DataFrame(
    data={'private': [0.87456, 0.85045], 'public': [0.88221, 0.86105]}, 
    index=['submission_mo_mn_count_tuned.csv', 'submission_mo_mn_tf_tuned.csv']
)

Unnamed: 0,private,public
submission_mo_mn_count_tuned.csv,0.87456,0.88221
submission_mo_mn_tf_tuned.csv,0.85045,0.86105


### Classifier Chain: Multinomial Naive Bayes

#### Model Training

In [37]:
mn_cc_tf = ClassifierChain(classifier = MultinomialNB(alpha = 1.0, fit_prior = True))
mn_cc_count = ClassifierChain(classifier = MultinomialNB(alpha = 1.0, fit_prior = True))

In [38]:
mn_cc_tf.fit(tfidf_train_5000, y_train)
mn_cc_count.fit(count_train_5000, y_train)

ClassifierChain(classifier=MultinomialNB(), require_dense=[True, True])
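
A classifier chain trains one classifier per label, and each link after the first also receives the earlier labels as extra input features. The sketch below uses small made-up arrays to show how the training inputs for each link would be assembled; it illustrates the idea only and makes no claim about scikit-multilearn's internals (the function name and data are assumptions).

```python
import numpy as np

def chain_training_inputs(features, labels):
    """Yield (link_index, inputs, target), where inputs are the original
    features plus the true labels of all earlier links."""
    for j in range(labels.shape[1]):
        inputs = np.hstack([features, labels[:, :j]])
        yield j, inputs, labels[:, j]

# Hypothetical data: 3 comments, 2 features, 3 labels.
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
Y = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])

shapes = [inputs.shape for _, inputs, _ in chain_training_inputs(X, Y)]
print(shapes)
```

The input width grows by one column per link, which is how dependencies between labels can be picked up by later classifiers.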

In [40]:
predictions_mn_cc_tf = mn_cc_tf.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n' , compute_accuracy(predictions_mn_cc_tf.todense(), y_train))

predictions_mn_cc_count = mn_cc_count.predict(count_train_5000)
print('Count Vectors: \n', compute_accuracy(predictions_mn_cc_count.todense(), y_train))

TF-IDF Vectors: 
 toxic            95.029799
severe_toxic     98.715932
obscene          97.173672
threat           99.046819
insult           96.766330
identity_hate    95.881457
dtype: float64
Count Vectors: 
 toxic            94.139913
severe_toxic     97.492025
obscene          94.467666
threat           95.961672
insult           94.012070
identity_hate    93.246893
dtype: float64


In [41]:
predictions_mn_cc_tf = mn_cc_tf.predict_proba(tfidf_test_5000)
predictions_mn_cc_count = mn_cc_count.predict_proba(count_test_5000)

In [43]:
to_submission_csv(predictions_mn_cc_tf.todense(), 'submission_tfidf_mn_cc')
to_submission_csv(predictions_mn_cc_count.todense(), 'submission_count_mn_cc')

In [50]:
pd.DataFrame(
    data={'private': [0.92866, 0.94711], 'public': [0.92896, 0.94614]}, 
    index=['submission_count_mn_cc.csv', 'submission_tfidf_mn_cc.csv']
)

Unnamed: 0,private,public
submission_count_mn_cc.csv,0.92866,0.92896
submission_tfidf_mn_cc.csv,0.94711,0.94614


#### Hyperparameter Tuning

In [44]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [None]:
estimator = ClassifierChain(MultinomialNB ())
mn_cc_tf_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_cc_tf_tuned.fit(tfidf_train_5000, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [None]:
predictions = mn_cc_tf_tuned.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n', compute_accuracy(predictions.todense(), y_train), mn_cc_tf_tuned.best_params_)

In [None]:
predictions_mn_tfidf_tuned = mn_cc_tf_tuned.predict_proba(tfidf_test_5000)
to_submission_csv(predictions_mn_tfidf_tuned.todense(), 'submission_cc_mn_tf_tuned')

In [None]:
mn_cc_count_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_cc_count_tuned.fit(count_train_5000, y_train)

In [None]:
predictions = mn_cc_count_tuned.predict(count_train_5000)
print('Count Vectors: \n', compute_accuracy(predictions.todense(), y_train), mn_cc_count_tuned.best_params_)

In [None]:
predictions_mn_count_tuned = mn_cc_count_tuned.predict_proba(count_test_5000)
to_submission_csv(predictions_mn_count_tuned.todense(), 'submission_cc_mn_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_cc_mn_count_tuned.csv', 'submission_cc_mn_tf_tuned.csv']
)

### Binary Relevance: Logistic Regression

#### Model Training

In [57]:
br_lr_tf = BinaryRelevance(classifier = LogisticRegression())
br_lr_count = BinaryRelevance(classifier = LogisticRegression())

In [None]:
br_lr_tf.fit(tfidf_train_5000, y_train)
br_lr_count.fit(count_train_5000, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
predictions_br_lr_tf = br_lr_tf.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n' , compute_accuracy(predictions_br_lr_tf.todense(), y_train))

predictions_br_lr_count = br_lr_count.predict(count_train_5000)
print('Count Vectors: \n', compute_accuracy(predictions_br_lr_count.todense(), y_train))

In [None]:
predictions_br_lr_tf = br_lr_tf.predict_proba(tfidf_test_5000)
predictions_br_lr_count = br_lr_count.predict_proba(count_test_5000)

In [None]:
to_submission_csv(predictions_br_lr_tf.todense(), 'submission_tfidf_lr_br')
to_submission_csv(predictions_br_lr_count.todense(), 'submission_count_lr_br')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_count_lr_br.csv', 'submission_tfidf_lr_br.csv']
)

#### Hyperparameter Tuning

In [None]:
predictions_lr_count_tuned = np.zeros((len(test), len(classes)))
predictions_lr_tfidf_tuned = np.zeros((len(test), len(classes)))

In [None]:
estimator = BinaryRelevance(LogisticRegression())
lr_br_tf_tuned = GridSearchCV(estimator, parameters_lr_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_br_tf_tuned.fit(tfidf_train_5000, y_train)

In [None]:
predictions = lr_br_tf_tuned.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n', compute_accuracy(predictions.todense(), y_train), lr_br_tf_tuned.best_params_)

In [None]:
predictions_lr_tfidf_tuned = lr_br_tf_tuned.predict_proba(tfidf_test_5000)
to_submission_csv(predictions_lr_tfidf_tuned.todense(), 'submission_lr_cc_tf_tuned')

In [None]:
lr_br_count_tuned = GridSearchCV(estimator, parameters_lr_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
lr_br_count_tuned.fit(count_train_5000, y_train)

In [None]:
predictions = lr_br_count_tuned.predict(count_train_5000)
print('Count Vectors: \n', compute_accuracy(predictions.todense(), y_train), lr_br_count_tuned.best_params_)

In [None]:
predictions_lr_count_tuned = lr_br_count_tuned.predict_proba(count_test_5000)
to_submission_csv(predictions_lr_count_tuned.todense(), 'submission_lr_cc_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_count_lr_br.csv', 'submission_tfidf_lr_br.csv']
)

#### Model Selection

### Binary Relevance: Multinomial Naive Bayes

#### Model Training

In [None]:
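# BinaryRelevance returns sparse predictions, which the .todense() calls below rely on.
mn_br_count = BinaryRelevance(classifier = MultinomialNB())
mn_br_count.fit(count_train_5000, y_train)

In [None]:
mn_br_tf = BinaryRelevance(classifier = MultinomialNB())
mn_br_tf.fit(tfidf_train_5000, y_train)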

In [None]:
predictions = mn_br_tf.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n' , compute_accuracy(predictions.todense(), y_train))

predictions = mn_br_count.predict(count_train_5000)
print('Count Vectors: \n', compute_accuracy(predictions.todense(), y_train))

In [None]:
predictions_mnb_br_tf = mn_br_tf.predict_proba(tfidf_test_5000)
predictions_mnb_br_count = mn_br_count.predict_proba(count_test_5000)

In [None]:
to_submission_csv(predictions_mnb_br_tf.todense(), 'submission_br_mn_tf')
to_submission_csv(predictions_mnb_br_count.todense(), 'submission_br_mn_count')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_br_mn_count.csv', 'submission_br_mn_tf.csv']
)

#### Hyperparameter Tuning

In [None]:
predictions_mn_count_tuned = np.zeros((len(test), len(classes)))
predictions_mn_tfidf_tuned = np.zeros((len(test), len(classes)))

In [None]:
estimator = BinaryRelevance(MultinomialNB())
mn_br_tf_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_br_tf_tuned.fit(tfidf_train_5000, y_train)

In [None]:
predictions = mn_br_tf_tuned.predict(tfidf_train_5000)
print('TF-IDF Vectors: \n', compute_accuracy(predictions.todense(), y_train), mn_br_tf_tuned.best_params_)

In [None]:
predictions_mn_tfidf_tuned = mn_br_tf_tuned.predict_proba(tfidf_test_5000)
to_submission_csv(predictions_mn_tfidf_tuned.todense(), 'submission_br_mn_tf_tuned')

In [None]:
mn_br_count_tuned = GridSearchCV(estimator, parameters_mn_multi, n_jobs = -1, verbose = 10, scoring = 'f1')
mn_br_count_tuned.fit(count_train_5000, y_train)   

In [None]:
predictions = mn_br_count_tuned.predict(count_train_5000)
print('Count Vectors: \n', compute_accuracy(predictions.todense(), y_train), mn_br_count_tuned.best_params_)

In [None]:
predictions_mn_count_tuned = mn_br_count_tuned.predict_proba(count_test_5000)
to_submission_csv(predictions_mn_count_tuned.todense(), 'submission_br_mn_count_tuned')

In [None]:
pd.DataFrame(
    data={'private': [0, 0], 'public': [0, 0]}, 
    index=['submission_br_mn_count_tuned.csv', 'submission_br_mn_tf_tuned.csv']
)

#### Model Selection

### Multinomial Naive Bayes using TF-IDF Vectorizer

#### Model Training

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

In [None]:
tfidf_test = tfidf_vectorizer.transform(X_test)

In [None]:
# Fit one MultinomialNB model per label and keep the models in label order.
arr_model = []
for class_ in classes:
    y_train = train[class_]
    model = MultinomialNB()
    model.fit(tfidf_train, y_train)
    arr_model.append(model)

In [None]:
# Report the training accuracy of each label's model on the matrix it was fitted on
for class_, model in zip(classes, arr_model):
    print("Class: ", class_)
    y_train = train[class_]
    predictions = model.predict(tfidf_train)
    print(compute_accuracy(predictions, y_train))

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 

# Fill each label column with that label's predictions on the transformed test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict(tfidf_test)
    
sample_submission.to_csv(f'results/submission_tfidf_nb.csv', index = False) 
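
The cell above writes hard 0/1 predictions. If the submission is scored with a ranking metric such as ROC AUC (an assumption here, since the metric is not stated in this section), the positive-class probability from `predict_proba` usually carries more information than the thresholded label. The sketch below shows how to pick that column on a made-up count matrix; `X_toy` and `y_toy` are hypothetical and stand in for the vectorized comments and one label column.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix (4 comments x 3 terms) and one binary label; illustrative only
X_toy = np.array([[2, 0, 1],
                  [0, 3, 0],
                  [1, 0, 2],
                  [0, 2, 1]])
y_toy = np.array([1, 0, 1, 0])

model = MultinomialNB()
model.fit(X_toy, y_toy)

# Columns of predict_proba follow model.classes_, so look up the column for label 1
positive_column = list(model.classes_).index(1)
positive_proba = model.predict_proba(X_toy)[:, positive_column]

# Thresholding the probability at 0.5 recovers the hard labels from predict()
hard_labels = model.predict(X_toy)
print(positive_proba.round(3), hard_labels)
```

Looking the column up through `classes_` avoids relying on the positive class always being in position 1.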

#### Hyperparameter Tuning

In [None]:
X = train ['comment_text']

In [None]:
final_hyperparameters = []
classes = train.columns [2:]
arr_model = []
counter = 0

for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    best_score = 0
    best_grid = None  # reset per label so a previous label's grid is never reused
    
    model = MultinomialNB ()
    
    X_train, X_val, y_train, y_val = train_test_split (X, y_train, test_size = 0.25, stratify = y_train)
    
    X_train_sparse_matrix = tf_idf_vectorizer.fit_transform(X_train)
    X_validation_sparse_matrix = tf_idf_vectorizer.transform(X_val)
    
    for g in ParameterGrid(parameters_mnb):

        model.set_params(**g)

        model.fit(X_train_sparse_matrix, y_train)
        predictions = model.predict (X_train_sparse_matrix)
        train_acc = compute_accuracy (predictions, y_train)

        predictions = model.predict (X_validation_sparse_matrix)
        val_acc = compute_accuracy (predictions, y_val)

        if val_acc > best_score:
            best_score = val_acc
            best_grid = g
    
    print("Best accuracy: ", best_score, "%")
    print("Best grid: ", best_grid)
    temp = mn_hyper_parameter (class_, best_grid['alpha'], best_grid['fit_prior'])
    final_hyperparameters.append(temp)
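
The manual search above can be pictured on a much smaller scale. The following is a minimal sketch of the same idea: `ParameterGrid` yields one dictionary per combination, each is fitted on a training split, and the combination with the best validation accuracy is kept. The grid values and the toy matrices are hypothetical; the notebook's actual `parameters_mnb` is defined elsewhere.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB

# Hypothetical grid, standing in for parameters_mnb
example_grid = {'alpha': [0.1, 1.0], 'fit_prior': [True, False]}

# Toy count matrices for the fitting and validation splits (illustrative only)
X_fit = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y_fit = np.array([1, 0, 1, 0])
X_val = np.array([[3, 0, 0], [0, 1, 1]])
y_val = np.array([1, 0])

best_score, best_grid = -1.0, None
for g in ParameterGrid(example_grid):
    model = MultinomialNB(**g)         # one candidate per parameter combination
    model.fit(X_fit, y_fit)
    score = model.score(X_val, y_val)  # mean accuracy on the validation split
    if score > best_score:
        best_score, best_grid = score, g

print(best_score, best_grid)
```

Because the comparison is strict (`>`), ties are resolved in favour of the first combination that reaches the best score.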

#### Model Selection

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
tf_idf_train = tf_idf_vectorizer.fit_transform(X_train)
tf_idf_test = tf_idf_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    print("Class: ", class_)
    
    y_train = train[class_]
    temp = final_hyperparameters [counter]
    model = MultinomialNB (alpha = temp.alpha, fit_prior = temp.fit_prior)

    model.fit(tf_idf_train, y_train)
    predictions = model.predict(tf_idf_train)
    print(compute_accuracy(predictions, y_train))
    
    arr_model.append(model)
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 

# Fill each label column with the tuned model's predictions on the test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict(tf_idf_test)

sample_submission.to_csv('results/submission_tf_idf_mn_tuned.csv', index=False)

### Multinomial Naive Bayes using Count Vectorizer

#### Model Training

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
count_train = count_vectorizer.fit_transform(X_train)

In [None]:
count_test = count_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    y_train = train[class_]
    model = MultinomialNB ()
    model.fit(count_train, y_train)
    
    predictions = model.predict(count_train)
    arr_model.append(model)
    counter = counter + 1

In [None]:
counter = 0
for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    predictions = arr_model [counter].predict(count_train)
    print(compute_accuracy(predictions, y_train))
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')

sample_submission ['id'] = test ['id'] 
# Fill each label column with that label's predictions on the test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict(count_test)

sample_submission.to_csv('results/submission_count_nb.csv', index=False)

#### Hyperparameter Tuning

In [None]:
X = train ['comment_text']

In [None]:
final_hyperparameters = []
classes = train.columns [2:]
arr_model = []
counter = 0

for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    best_score = 0
    best_grid = None  # reset per label so a previous label's grid is never reused
    
    model = MultinomialNB ()
    
    X_train, X_val, y_train, y_val = train_test_split (X, y_train, test_size = 0.25, stratify = y_train)
    
    X_train_sparse_matrix = count_vectorizer.fit_transform(X_train)
    X_validation_sparse_matrix = count_vectorizer.transform(X_val)
    
    for g in ParameterGrid(parameters_mnb):

        model.set_params(**g)

        model.fit(X_train_sparse_matrix, y_train)
        predictions = model.predict (X_train_sparse_matrix)
        train_acc = compute_accuracy (predictions, y_train)

        predictions = model.predict (X_validation_sparse_matrix)
        val_acc = compute_accuracy (predictions, y_val)

        if val_acc > best_score:
            best_score = val_acc
            best_grid = g
    
    print("Best accuracy: ", best_score, "%")
    print("Best grid: ", best_grid)
    temp = mn_hyper_parameter (class_, best_grid['alpha'], best_grid['fit_prior'])
    final_hyperparameters.append(temp)
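
The `stratify` argument used in the split above keeps the proportion of positive comments the same in both parts, which matters when a label is rare. A minimal sketch on a made-up imbalanced label follows; the 20 percent positive rate is an illustrative assumption, and `random_state` is added here only to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy label: 20 positives out of 100 (not the real class ratio)
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

# Stratifying on y preserves the positive rate in both splits
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_fit.mean(), y_val.mean())
```

Without stratification, a rare label could end up with very few positives in the validation part, which makes the validation accuracy a noisy guide for picking hyperparameters.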

#### Model Selection

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    print("Class: ", class_)
    
    y_train = train[class_]
    temp = final_hyperparameters [counter]
    model = MultinomialNB (alpha = temp.alpha, fit_prior = temp.fit_prior)

    model.fit(count_train, y_train)
    predictions = model.predict(count_train)
    print(compute_accuracy(predictions, y_train))
    
    arr_model.append(model)
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 

# Fill each label column with the tuned model's predictions on the test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict(count_test)

sample_submission.to_csv('results/submission_count_mn_tuned.csv', index=False)

### Logistic Regression using TF-IDF Vectorizer

#### Model Training

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
tf_idf_train = tf_idf_vectorizer.fit_transform(X_train)

In [None]:
tf_idf_test = tf_idf_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    model = LogisticRegression (n_jobs=-1)

    model.fit(tf_idf_train, y_train)
    predictions = model.predict(tf_idf_train)
    print(compute_accuracy(predictions, y_train))
    arr_model.append(model)
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 

# Fill each label column with the positive-class probability on the test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict_proba(tf_idf_test)[:, 1]

sample_submission.to_csv('results/submission_logreg_1.csv', index=False)

#### Hyperparameter Tuning

In [None]:
X = train ['comment_text']

In [None]:
final_hyperparameters = []
classes = train.columns [2:]
arr_model = []
counter = 0

for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    best_score = 0
    best_grid = None  # reset per label so a previous label's grid is never reused
    
    model = LogisticRegression ()
    
    X_train, X_val, y_train, y_val = train_test_split (X, y_train, test_size = 0.25, stratify = y_train)
    
    X_train_sparse_matrix = tf_idf_vectorizer.fit_transform(X_train)
    X_validation_sparse_matrix = tf_idf_vectorizer.transform(X_val)
    
    for g in ParameterGrid(parameters_lr):

        model.set_params(**g)

        model.fit(X_train_sparse_matrix, y_train)
        predictions = model.predict (X_train_sparse_matrix)
        train_acc = compute_accuracy (predictions, y_train)

        predictions = model.predict (X_validation_sparse_matrix)
        val_acc = compute_accuracy (predictions, y_val)

        if val_acc > best_score:
            best_score = val_acc
            best_grid = g
    
    print("Best accuracy: ", best_score, "%")
    print("Best grid: ", best_grid)
    temp = lr_hyperparameter (class_, best_grid['C'], best_grid['max_iter'])
    final_hyperparameters.append(temp)
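
As an alternative to the manual loop above (not the approach used in this notebook), the vectorizer and the classifier can be wrapped in a `Pipeline` and searched with `GridSearchCV`, so that the vectorizer is refitted on the training part of every fold. The sketch below is a minimal illustration for a single label; the corpus, the labels and the grid values are made up, and the step names `tfidf` and `clf` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with one binary label (illustrative only)
texts = ["you are awful", "awful awful person", "you are terrible", "terrible awful you",
         "thanks for the edit", "nice edit thanks", "thanks for the help", "nice help thanks"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

# Hypothetical grid; the 'clf__' prefix routes each value to the classifier step
param_grid = {'clf__C': [0.1, 1.0, 10.0], 'clf__max_iter': [100, 200]}

# Two stratified folds, scored by accuracy
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

To cover all labels, the same search would be repeated once per label column, mirroring the per-class loop in the cell above.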

#### Model Selection

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
tf_idf_train = tf_idf_vectorizer.fit_transform(X_train)
tf_idf_test = tf_idf_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    temp = final_hyperparameters [counter]
    model = LogisticRegression (C = temp.c, max_iter = temp.max_iter)

    model.fit(tf_idf_train, y_train)
    predictions = model.predict(tf_idf_train)
    print(compute_accuracy(predictions, y_train))
    arr_model.append(model)
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 
counter = 0

for class_ in classes:
    # Positive-class probability, as in the untuned TF-IDF logistic regression submission
    predictions = arr_model [counter].predict_proba(tf_idf_test)[:,1]
    sample_submission [class_] = predictions
    counter = counter + 1
    
sample_submission.to_csv(f'results/submission_tf_idf_log_reg_tuned.csv', index = False) 

### Logistic Regression using Count Vectorizer

#### Model Training

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
count_train = count_vectorizer.fit_transform(X_train)

In [None]:
count_test = count_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    model = LogisticRegression ()

    model.fit(count_train, y_train)
    predictions = model.predict(count_train)
    print(compute_accuracy(predictions, y_train))
    arr_model.append(model)
    counter = counter + 1

In [None]:
counter = 0
for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    predictions = arr_model [counter].predict(count_train)
    print(compute_accuracy(predictions, y_train))
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 

# Fill each label column with that label's predictions on the test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict(count_test)

sample_submission.to_csv('results/submission_count_log_reg.csv', index=False)

#### Hyperparameter Tuning

In [None]:
X = train ['comment_text']

In [None]:
final_hyperparameters = []
classes = train.columns [2:]
arr_model = []
counter = 0

for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    best_score = 0
    best_grid = None  # reset per label so a previous label's grid is never reused
    
    model = LogisticRegression ()
    
    X_train, X_val, y_train, y_val = train_test_split (X, y_train, test_size = 0.25, stratify = y_train)
    
    X_train_sparse_matrix = count_vectorizer.fit_transform(X_train)
    X_validation_sparse_matrix = count_vectorizer.transform(X_val)
    
    for g in ParameterGrid(parameters_lr):

        model.set_params(**g)

        model.fit(X_train_sparse_matrix, y_train)
        predictions = model.predict (X_train_sparse_matrix)
        train_acc = compute_accuracy (predictions, y_train)

        predictions = model.predict (X_validation_sparse_matrix)
        val_acc = compute_accuracy (predictions, y_val)

        if val_acc > best_score:
            best_score = val_acc
            best_grid = g
    
    print("Best accuracy: ", best_score, "%")
    print("Best grid: ", best_grid)
    temp = lr_hyperparameter (class_, best_grid['C'], best_grid['max_iter'])
    final_hyperparameters.append(temp)

#### Model Selection

In [None]:
X_train = train ['comment_text']
X_test = test ['comment_text']

In [None]:
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

In [None]:
classes = train.columns [2:]
arr_model = []
counter = 0
for class_ in classes:
    print("Class: ", class_)
    y_train = train[class_]
    temp = final_hyperparameters [counter]
    model = LogisticRegression (C = temp.c, max_iter = temp.max_iter)

    model.fit(count_train, y_train)
    predictions = model.predict(count_train)
    print(compute_accuracy(predictions, y_train))
    arr_model.append(model)
    counter = counter + 1

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission ['id'] = test ['id'] 

# Fill each label column with the tuned model's predictions on the test set
for class_, model in zip(classes, arr_model):
    sample_submission[class_] = model.predict(count_test)

sample_submission.to_csv('results/submission_count_log_reg_tuned.csv', index=False)