# Modeling - Classification Algorithms

## Table of Contents:
[1. Import Train and Test Set](#1.-Import-Train-and-Test-Set)
<br>[2. Classifiers](#2.-Classifiers)
<br>&emsp;&emsp;&emsp;[2.1.1. TF-IDF and Logistic Regression](#2.1.1.-TF-IDF-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.2. TF-IDF, Random Oversampler and Logistic Regression](#2.1.2.-TF-IDF,-Random-Oversampler-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.3. TF-IDF, SMOTE Oversampling and Logistic Regression](#2.1.3.-TF-IDF,-SMOTE-Oversampling-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.4. TF-IDF, Random Undersampler and Logistic Regression](#2.1.4.-TF-IDF,-Random-Undersampler-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.5. TF-IDF, NearMiss Undersampler and Logistic Regression](#2.1.5.-TF-IDF,-NearMiss-Undersampler-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.6. Countvectorizer and Logistic Regression](#2.1.6.-Countvectorizer-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.7. Countvectorizer, Random Oversampler and Logistic Regression](#2.1.7.-Countvectorizer,-Random-Oversampler-and-Logistic-Regression)
<br>&emsp;&emsp;&emsp;[2.1.8. Countvectorizer, SMOTE Oversampler and Logistic Regression](#2.1.8.-Countvectorizer,-SMOTE-Oversampler-and-Logistic-Regression)
<br>[2.2. Naive Bayes](#2.2.-Naive-Bayes)
<br>&emsp;&emsp;&emsp;[2.2.1. TFIDF and Naive Bayes](#2.2.1.-TFIDF-and-Naive-Bayes)
<br>&emsp;&emsp;&emsp;[2.2.2. TFIDF, Random Oversampler and Naive Bayes](#2.2.2.-TFIDF,-Random-Oversampler-and-Naive-Bayes)
<br>&emsp;&emsp;&emsp;[2.2.3. TFIDF, SMOTE and Naive Bayes](#2.2.3.-TFIDF,-SMOTE-and-Naive-Bayes)
<br>&emsp;&emsp;&emsp;[2.2.4. TFIDF, Random Undersampler and Naive Bayes](#2.2.4.-TFIDF,-Random-Undersampler-and-Naive-Bayes)
<br>&emsp;&emsp;&emsp;[2.2.5. TF-IDF, NearMiss Undersampler and Naive Bayes](#2.2.5.-TF-IDF,-NearMiss-Undersampler-and-Naive-Bayes)
<br>[2.3. Random Forests](#2.3.-Random-Forests)
<br>&emsp;&emsp;&emsp;[2.3.1. TF-IDF and Random Forests](#2.3.1.-TF-IDF-and-Random-Forests)
<br>&emsp;&emsp;&emsp;[2.3.2. TF-IDF, Random Oversampler and Random Forests](#2.3.2.-TF-IDF,-Random-Oversampler-and-Random-Forests)
<br>[3. Summary of Results](#3.-Summary-of-Results)

In [37]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
from sklearn.utils.fixes import loguniform
from itertools import cycle
from timeit import default_timer as timer

from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn import naive_bayes
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from xgboost import XGBClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

from sklearn.metrics import classification_report
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import RocCurveDisplay

from sklearn.ensemble import StackingClassifier

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import precision_recall_fscore_support

from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NearMiss
from imblearn.combine import SMOTEENN
from imblearn.combine import SMOTETomek

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

from Bayes_Opt_Classification import bayes_kfold_pipeline

sns.set_context('talk', rc={"grid.linewidth": 0.8})

%matplotlib inline

  from pandas import MultiIndex, Int64Index


## 1. Import Train and Test Set

In [2]:
df = pd.read_csv('Library/cleaned_text_train_df.csv')
df.head()

Unnamed: 0,clean_text,toxic_type
0,explanation edit make username hardcore metall...,0
1,aww match background colour seemingly stuck th...,0
2,hey man really not try edit war guy constantly...,0
3,make real suggestion improvement wonder sectio...,0
4,sir hero chance remember page,0


In [3]:
df.isna().sum()

clean_text    54
toxic_type     0
dtype: int64

In [4]:
df.dropna(inplace=True)

In [5]:
df_test = pd.read_csv('Library/cleaned_text_test_df.csv')
df_test.head()

Unnamed: 0,clean_text,toxic_type
0,thank understand think highly would not revert...,0
1,dear god site horrible,0
2,somebody invariably try add religion really me...,0
3,say right type type institution need case thre...,0
4,add new product list make sure relevant add ne...,0


In [6]:
df_test.isna().sum()

clean_text    437
toxic_type      0
dtype: int64

In [7]:
df_test = df_test.dropna()

In [8]:
df_test.head()

Unnamed: 0,clean_text,toxic_type
0,thank understand think highly would not revert...,0
1,dear god site horrible,0
2,somebody invariably try add religion really me...,0
3,say right type type institution need case thre...,0
4,add new product list make sure relevant add ne...,0


In [9]:
X_train = df['clean_text']
y_train = df['toxic_type']

X_test = df_test['clean_text']
y_test = df_test['toxic_type']

summary_dic ={}

## 2. Classifiers

### 2.1. Text Vectorization

In [10]:
tfidf = TfidfVectorizer(max_features=5000)
tfidf_model = tfidf.fit(X_train)
X_train_vec = tfidf_model.transform(X_train).toarray()
X_test_vec = tfidf_model.transform(X_test).toarray()

## LR and Naive

In [11]:
LR = LogisticRegression(C=10.539, solver='liblinear')
NB = naive_bayes.MultinomialNB(alpha=1, fit_prior=True)

estimator_list1=[('LR', LR), ('NB', NB)]

clf1 = StackingClassifier(estimators=estimator_list1, final_estimator=LogisticRegression(solver='liblinear'),
                        verbose=1, n_jobs=6, cv=5)
clf1.fit(X_train_vec, y_train)

[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done   5 out of   5 | elapsed:  1.6min finished
[Parallel(n_jobs=6)]: Done   5 out of   5 | elapsed:  1.9min finished


In [13]:
y_pred = clf1.predict(X_test_vec)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96     57298
           1       0.58      0.80      0.67      6243

    accuracy                           0.92     63541
   macro avg       0.78      0.87      0.81     63541
weighted avg       0.94      0.92      0.93     63541



## LR, NB and RF Higher recall

In [32]:
LR = LogisticRegression(C=10.539, solver='liblinear', verbose=1, n_jobs=6)
NB = naive_bayes.MultinomialNB(alpha=1, fit_prior=True)
RF = RandomForestClassifier(max_depth=200, max_features='sqrt', n_estimators=500, verbose=1, n_jobs=6)

estimator_list2=[('LR', LR), ('NB', NB), ('RF', RF)]

clf2 = StackingClassifier(estimators=estimator_list2, final_estimator=LogisticRegression(n_jobs=6),
                        verbose=1, n_jobs=6, cv=5)
clf2.fit(X_train_vec, y_train)

[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.


iter  1 act 5.842e+05 pre 5.354e+05 delta 1.742e+00 f 1.165e+06 |g| 6.764e+05 CG   1
cg reaches trust region boundary
iter  2 act 6.036e+04 pre 5.700e+04 delta 2.851e+00 f 5.811e+05 |g| 1.156e+05 CG   2
cg reaches trust region boundary
iter  3 act 4.672e+04 pre 4.646e+04 delta 1.140e+01 f 5.207e+05 |g| 2.377e+04 CG   2
cg reaches trust region boundary
iter  4 act 1.196e+05 pre 1.165e+05 delta 1.841e+01 f 4.740e+05 |g| 1.928e+04 CG   2
cg reaches trust region boundary
iter  5 act 8.896e+04 pre 7.314e+04 delta 2.511e+01 f 3.544e+05 |g| 1.879e+04 CG   2
cg reaches trust region boundary
iter  6 act 5.066e+04 pre 4.231e+04 delta 3.163e+01 f 2.654e+05 |g| 1.150e+04 CG   2
cg reaches trust region boundary
iter  7 act 2.834e+04 pre 2.470e+04 delta 3.781e+01 f 2.147e+05 |g| 7.210e+03 CG   3
cg reaches trust region boundary
iter  8 act 1.321e+04 pre 1.175e+04 delta 4.355e+01 f 1.864e+05 |g| 4.624e+03 CG   5
iter  9 act 4.606e+03 pre 4.423e+03 delta 4.355e+01 f 1.732e+05 |g| 2.825e+03 CG   8
iter

[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:  4.9min


[LibLinear]

[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed: 22.7min
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed: 52.2min
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 59.5min finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=6)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=6)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=6)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=6)]: Using backend SequentialBackend with 1 concurrent workers.


iter  1 act 4.674e+05 pre 4.284e+05 delta 1.742e+00 f 9.322e+05 |g| 5.411e+05 CG   1
cg reaches trust region boundary
iter  2 act 4.836e+04 pre 4.567e+04 delta 2.853e+00 f 4.649e+05 |g| 9.249e+04 CG   2
cg reaches trust region boundary
iter  3 act 3.749e+04 pre 3.728e+04 delta 1.141e+01 f 4.165e+05 |g| 1.904e+04 CG   2
cg reaches trust region boundary
iter  4 act 9.590e+04 pre 9.319e+04 delta 1.839e+01 f 3.790e+05 |g| 1.548e+04 CG   2
cg reaches trust region boundary


[Parallel(n_jobs=6)]: Done   5 out of   5 | elapsed:  2.7min finished


iter  5 act 7.126e+04 pre 5.868e+04 delta 2.514e+01 f 2.831e+05 |g| 1.488e+04 CG   2
cg reaches trust region boundary
iter  6 act 4.082e+04 pre 3.420e+04 delta 3.170e+01 f 2.119e+05 |g| 9.098e+03 CG   2
cg reaches trust region boundary
iter  7 act 2.322e+04 pre 2.040e+04 delta 3.812e+01 f 1.710e+05 |g| 5.751e+03 CG   3
cg reaches trust region boundary
iter  8 act 1.123e+04 pre 1.004e+04 delta 4.461e+01 f 1.478e+05 |g| 3.869e+03 CG   5
iter  9 act 3.726e+03 pre 3.560e+03 delta 4.461e+01 f 1.366e+05 |g| 2.445e+03 CG   6
iter 10 act 4.662e+02 pre 4.613e+02 delta 4.461e+01 f 1.329e+05 |g| 1.184e+03 CG   4
iter 11 act 5.282e+02 pre 5.200e+02 delta 4.461e+01 f 1.324e+05 |g| 1.799e+02 CG  15
iter 12 act 1.983e+00 pre 1.982e+00 delta 4.461e+01 f 1.319e+05 |g| 1.657e+02 CG   2
iter 13 act 1.050e+01 pre 1.048e+01 delta 4.461e+01 f 1.319e+05 |g| 1.221e+01 CG  23
iter  1 act 4.674e+05 pre 4.284e+05 delta 1.742e+00 f 9.322e+05 |g| 5.411e+05 CG   1
cg reaches trust region boundary
iter  2 act 4.825e

[Parallel(n_jobs=6)]: Done   5 out of   5 | elapsed:  3.7min finished


[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]

[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 315.4min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 316.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   36.6s finished
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 316.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 316.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   25.7s finished
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   23.6s finished
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 316.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500

In [33]:
y_pred2 = clf2.predict(X_test_vec)

print(classification_report(y_test, y_pred))

[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    1.0s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    4.3s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:    9.7s


              precision    recall  f1-score   support

           0       0.98      0.93      0.96     57298
           1       0.57      0.81      0.67      6243

    accuracy                           0.92     63541
   macro avg       0.77      0.87      0.81     63541
weighted avg       0.94      0.92      0.93     63541



[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed:   11.0s finished


## LR, NB, RF, LGBM and XGB

In [38]:
LR = LogisticRegression(C=10.539, solver='liblinear', verbose=1, n_jobs=6)
NB = naive_bayes.MultinomialNB(alpha=1, fit_prior=True)
RF = RandomForestClassifier(max_depth=200, max_features='sqrt', n_estimators=500, verbose=1, n_jobs=6)
LGBM = LGBMClassifier(learning_rate= 0.0938, max_depth= 4, n_estimators= 4031, num_leaves= 4000, verbose=1, n_jobs=6)
XGB = XGBClassifier(objective='binary:logistic', seed=12, use_label_encoder=False, verbose=1, n_jobs=6)

estimator_list3=[('LR', LR), ('NB', NB), ('RF', RF), ('LGBM', LGBM), ('XGB', XGB)]

clf3 = StackingClassifier(estimators=estimator_list3,
                        final_estimator=LogisticRegression(n_jobs=6),
                        verbose=1, n_jobs=6, cv=5)

clf3.fit(X_train_vec, y_train)

  from pandas import MultiIndex, Int64Index
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.


iter  1 act 5.842e+05 pre 5.354e+05 delta 1.742e+00 f 1.165e+06 |g| 6.764e+05 CG   1
cg reaches trust region boundary
iter  2 act 6.036e+04 pre 5.700e+04 delta 2.851e+00 f 5.811e+05 |g| 1.156e+05 CG   2
cg reaches trust region boundary
iter  3 act 4.672e+04 pre 4.646e+04 delta 1.140e+01 f 5.207e+05 |g| 2.377e+04 CG   2
cg reaches trust region boundary
iter  4 act 1.196e+05 pre 1.165e+05 delta 1.841e+01 f 4.740e+05 |g| 1.928e+04 CG   2
cg reaches trust region boundary
iter  5 act 8.896e+04 pre 7.314e+04 delta 2.511e+01 f 3.544e+05 |g| 1.879e+04 CG   2
cg reaches trust region boundary
iter  6 act 5.066e+04 pre 4.231e+04 delta 3.163e+01 f 2.654e+05 |g| 1.150e+04 CG   2
cg reaches trust region boundary
iter  7 act 2.834e+04 pre 2.470e+04 delta 3.781e+01 f 2.147e+05 |g| 7.210e+03 CG   3
cg reaches trust region boundary
iter  8 act 1.321e+04 pre 1.175e+04 delta 4.355e+01 f 1.864e+05 |g| 4.624e+03 CG   5
iter  9 act 4.606e+03 pre 4.423e+03 delta 4.355e+01 f 1.732e+05 |g| 2.825e+03 CG   8
iter

























[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed: 10.0min




































[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed: 40.4min


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed: 80.0min
[Parallel(n_jobs=6)]: Done 500 out of 500 | elapsed: 87.9min finished
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
  from pandas import MultiIndex, Int64Index
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Using backend ThreadingBackend with 6 concurrent workers.
Process LokyProcess-67:
Traceback (most recent call last):
  File "/Users/zori/opt/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/Users/zori/opt/anaconda3/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/Users/zori/opt/anaconda3/lib/python3.

KeyboardInterrupt: 