#  Pre-processing & Training Data Development

## Table of Contents

[1. Introduction](#1.-Introduction)
<br>[2. Import Libraries and Data](#2.-Import-Libraries-and-Data)
<br>[3. Evaluate the Text Vectorization Algorithms](#3.-Evaluate-the-Text-Vectorization-Algorithms)
<br>&emsp;&emsp;&emsp;[3.1. BOW Term Frequency](#3.1.-BOW-Term-Frequency)
<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;[3.1.1. Initial Exploration](#3.1.1.-Initial-Exploration)
<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;[3.1.2. BOW Term Frequency - Parameters Evaluation](#3.1.2.-BOW-Term-Frequency---Parameters-Evaluation)
<br>&emsp;&emsp;&emsp;[3.2. Normalized TF-IDF](#3.2.-Normalized-TF-IDF)
<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;[3.2.1. Initial Exploration](#3.2.1.-Initial-Exploration)
<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;[3.2.2. TF-IDF Vectorization - Parameters Evaluation](#3.2.2.-TF-IDF-Vectorization---Parameters-Evaluation)
<br>&emsp;&emsp;&emsp;[3.3. Summary of Text Vectorization](#3.3.-Summary-of-Text-Vecorization)
<br>[4. Evaluate Imbalanced Algorithm](#4.-Evaluate-Imbalanced-Algorithm)
<br>&emsp;&emsp;&emsp;[4.1. Oversampling Algorithms](#4.1.-Oversampling-Algorithms)
<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;[4.1.1. Random Oversampling](#4.1.1.-Random-Oversampling)

## 1. Introduction 

Classification predictive modelling is a category of machine learning problems in which a class label is assigned to each input example. The main types are binary, multi-class, multi-label, and imbalanced classification. The present project is a binary imbalanced classification problem with two labels, “toxic” and “non-toxic”. The comments are not evenly distributed between the two labels, which makes this an imbalanced classification problem:

    
|  | **Label** | **Number of Comments** |
| :- | :-: | :-: |
|1| Non-toxic | 143346 |
|2| Toxic | 16225 |

Imbalanced classification is a challenging problem that requires a systematic framework for handling the skewed class distribution. In this project, I follow the framework below to evaluate the classification methods:
    
#### 1. Select a performance metric
    
Selecting the right metric is one of the main steps in model evaluation. Two factors drive the choice of metric: whether the dataset is balanced or imbalanced, and the business use case to be solved. For imbalanced classification, selecting an appropriate metric is harder because most widely used standard metrics assume a balanced class distribution. For example, classification accuracy is one of the most common metrics, but on an imbalanced dataset it is misleading: a model that labels every comment as non-toxic would be about 90% accurate on this dataset while detecting no toxic comments at all.
There are four types of outcomes for the classification prediction: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Four performance metrics based on these outcomes are defined as follows:
 

\begin{gather*}
Accuracy = \frac {TP + TN} {TP + FP + TN + FN}
\end{gather*}
  
    
\begin{gather*}
Precision = \frac {TP} {TP + FP}
\end{gather*}
  
    
\begin{gather*}
Recall = \frac {TP} {TP + FN}
\end{gather*}
  
    
\begin{gather*}
F1 = \frac {2 * Precision * Recall} {Precision + Recall}
\end{gather*}
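
As a quick sanity check on these definitions, the sketch below computes the four metrics from raw outcome counts. The helper name `metrics_from_counts` is illustrative and not part of the project code, and the counts describe a hypothetical classifier that labels every comment in the table above as non-toxic.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from outcome counts.

    Ratios with a zero denominator are reported as 0.0
    (a convention chosen for this sketch).
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical "always non-toxic" classifier:
# all 143346 non-toxic comments are true negatives,
# all 16225 toxic comments are false negatives.
always_negative = metrics_from_counts(tp=0, fp=0, tn=143346, fn=16225)
print(always_negative)
```

The accuracy comes out near 0.90 even though recall and F1 are zero, which is why accuracy alone is not used in this study.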
    
For binary imbalanced classification tasks, the majority class is the normal case (the “negative class”), and the minority class is the exception (the “positive class”). As mentioned earlier, accuracy is not an appropriate metric for imbalanced classification problems. The right performance metric depends on how much tolerance there is for false positives and false negatives.
In the current study, the assumption is that a social media platform wants to detect toxic comments and block the corresponding accounts. It is therefore essential to keep false positives low, since we don’t want to block an account by mistake (high precision). At the same time, it is important to catch as many toxic comments as possible (high recall). In practice, the two goals compete: for a given model, raising the decision threshold typically increases precision and reduces recall, and vice versa. This is called the precision/recall tradeoff. The F1 score, the harmonic mean of precision and recall, is therefore a better performance metric when we are seeking a balance between the two.
Moreover, the receiver operating characteristic (ROC) curve and the area under it (AUC) are useful for classification evaluation. For classifiers with probability outputs, a threshold converts each probability into a class label. The ROC curve plots sensitivity (the true positive rate) against 1 − specificity (the false positive rate) across the possible thresholds.

\begin{gather*}
Sensitivity = \frac {TP} {TP + FN}
\end{gather*}
    
\begin{gather*}
Specificity = \frac {TN} {TN + FP}
\end{gather*}

In the current study, the performance metrics used to evaluate the models are the F1 score and ROC AUC.
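
To make the threshold sweep behind the ROC curve concrete, here is a minimal sketch using scikit-learn's `roc_curve` and `roc_auc_score`. The four labels and scores are made up for illustration and are not taken from the toxic-comment data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data (illustrative only): true labels and predicted
# probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (false positive rate, true positive rate)
# point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
specificity = 1 - fpr  # complement of the false positive rate
auc = roc_auc_score(y_true, y_score)

for thr, sens, spec in zip(thresholds, tpr, specificity):
    print(f"threshold={thr:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
print("AUC:", auc)
```

Lowering the threshold moves along the curve toward higher sensitivity and lower specificity; the AUC summarizes the whole curve in one number.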

#### 2. Evaluate the text vectorization algorithms

The text vectorization algorithms used in this project are as follows:
 - Bag of Words (BOW) Term Frequency
 - Normalized TF-IDF

#### 3. Evaluate classification algorithms
I evaluate the performance of various classification algorithms as follows:
 - Logistic Regression
 - Random Forest Classifier
 - XGBoost Classifier
 - LightGBM (LGBM) Classifier
 - Naive Bayes
 - KNN

#### 4. Evaluate imbalanced algorithms
The performance of various undersampling and oversampling methods is evaluated as follows:
 - Random Oversampling
 - Synthetic Minority Oversampling (SMOTE)
 - Adaptive Synthetic Sampling (ADASYN)
 - Borderline SMOTE
 - Random Undersampling
 - NearMiss
 - Edited Nearest Neighbor
    
#### 5. Hyperparameter tuning
GridSearchCV and BayesSearchCV are applied to select the suitable hyperparameters.


## 2. Import Libraries and Data

In [194]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

# Model selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import classification_report

# Resampling techniques selection
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import KMeansSMOTE
from imblearn.over_sampling import SVMSMOTE

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import NearMiss
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTETomek
from imblearn.combine import SMOTEENN

from imblearn.pipeline import Pipeline as imbpipeline


pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.notebook_repr_html', True)

In this notebook, text vectorization and resampling techniques are examined. The initial models are trained using Logistic Regression. Two techniques, `CountVectorizer` (bag of words) and TF-IDF, are employed to vectorize the text. Moreover, several undersampling and oversampling techniques and their combinations are examined.

In [158]:
df = pd.read_csv('Library/cleaned_text_train_df.csv')
df.head()

| | clean_text | toxic_type |
| :- | :- | :-: |
| 0 | explanation edit make username hardcore metallica fan revert vandalisms closure gas vote new york dolls fac please remove template talk page since... | 0 |
| 1 | aww match background colour seemingly stuck thank talk january utc | 0 |
| 2 | hey man really not try edit war guy constantly remove relevant information talk edit instead talk page seem care formatting actual info | 0 |
| 3 | make real suggestion improvement wonder section statistic later subsection type accident think reference may need tidy exact format ie date format... | 0 |
| 4 | sir hero chance remember page | 0 |


In [159]:
df.isna().sum()

clean_text    54
toxic_type     0
dtype: int64

In [160]:
df.dropna(inplace=True)

## 3. Evaluate the Text Vectorization Algorithms

The main goal is to investigate the text vectorization algorithms and find their optimum parameters. Initially, the models are tested with the default values. Afterward, a grid search (`GridSearchCV`) is used to investigate the effect of the parameters on the model.
First, I split the dataset into train and test sets and then fit the text vectorization model on the training set only. Afterwards, both the train and test sets are transformed using this fit. It is important to perform the split first to avoid data leakage: if the vectorizer were fitted on the full dataset, its vocabulary and document frequencies would carry information from the test set into training.
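
The split-then-fit order described above can be sketched on a toy corpus. The documents are invented for illustration; the point is that the vocabulary is learned from the training texts only, so a word that appears only in the test text is ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (illustrative only)
train_docs = ["good edit thank", "bad edit revert", "thank talk page"]
test_docs = ["good page vandal"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training texts only ...
X_train_vec = vectorizer.fit_transform(train_docs)
# ... and reuse that fitted vocabulary for the test texts
X_test_vec = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(X_test_vec.toarray())
```

Both matrices have one column per training-vocabulary term, which is what the downstream classifier requires.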

### 3.1. BOW Term Frequency

### 3.1.1. Initial Exploration

In [5]:
# Features (cleaned text) and target (toxicity label)
X = df['clean_text']
y = df['toxic_type']

In [191]:
# Define function to perform train/test split and then fit and transform using CountVectorizer

def countvectorize(X, y, max_df=0.95, min_df=0.001, ngram_range=(1,1)):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                                stratify=y, random_state=1)
    
    # Pass the ngram_range argument through instead of hard-coding (1,1)
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, ngram_range=ngram_range)
    Dict_count = vectorizer.fit(X_train)
    
    X_train_vec = Dict_count.transform(X_train).toarray()
    X_test_vec = Dict_count.transform(X_test).toarray()
    
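    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
    X_train_df = pd.DataFrame(data=X_train_vec, columns=vectorizer.get_feature_names_out())
    X_test_df = pd.DataFrame(data=X_test_vec, columns=vectorizer.get_feature_names_out())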
    
    y_train_df = y_train
    y_test_df = y_test
    
    print('Number of labels in the train set: \n', y_train.value_counts(normalize=True))
    print('-----------------------------------')
    print('Number of labels in the test set: \n', y_test.value_counts(normalize=True))
    print('-----------------------------------')
    
    X_train = np.array(X_train_df)
    X_test = np.array(X_test_df)
    y_train = np.array(y_train_df)
    y_test = np.array(y_test_df)
    
    print("X train type and shape: ", type(X_train), X_train.shape)
    print("y train type and shape: ", type(y_train), y_train.shape)
    
    print("X test type and shape: ", type(X_test), X_test.shape)
    print("y test type and shape: ", type(y_test), y_test.shape)
    print('-----------------------------------')
    
    
    return X_train, X_test, y_train, y_test

In [7]:
# BOW Word Frequency

X_train_bow, X_test_bow, y_train_bow, y_test_bow = countvectorize(X, y, max_df=0.95, min_df=0.001, ngram_range=(1,1))

Number of labels in the train set: 
 0    0.898286
1    0.101714
Name: toxic_type, dtype: float64
-----------------------------------
Number of labels in the test set: 
 0    0.898289
1    0.101711
Name: toxic_type, dtype: float64
-----------------------------------
X train type and shape:  <class 'numpy.ndarray'> (127613, 2958)
y train type and shape:  <class 'numpy.ndarray'> (127613,)
X test type and shape:  <class 'numpy.ndarray'> (31904, 2958)
y test type and shape:  <class 'numpy.ndarray'> (31904,)
-----------------------------------


In [8]:
# Logistic Regression model

clf = LogisticRegression(solver='lbfgs', max_iter=20000, random_state=1)
clf_model = clf.fit(X_train_bow, y_train_bow)

y_pred_train_bow = clf_model.predict(X_train_bow)

y_pred_test_bow = clf_model.predict(X_test_bow)

I check the classification report for both the train and test sets to look for underfitting or overfitting.

In [9]:
# Check the classification metrics for the test set

print("Test Classification Report")
print(classification_report(y_test_bow, y_pred_test_bow))

Test Classification Report
              precision    recall  f1-score   support

           0       0.96      0.99      0.97     28659
           1       0.86      0.64      0.74      3245

    accuracy                           0.95     31904
   macro avg       0.91      0.82      0.86     31904
weighted avg       0.95      0.95      0.95     31904



In [10]:
# Check the classification metrics for the train set

print("Train Classification Report")
print(classification_report(y_train_bow, y_pred_train_bow))

Train Classification Report
              precision    recall  f1-score   support

           0       0.96      0.99      0.98    114633
           1       0.91      0.66      0.76     12980

    accuracy                           0.96    127613
   macro avg       0.93      0.83      0.87    127613
weighted avg       0.96      0.96      0.96    127613



As mentioned earlier, the classification metric for this study is the f1-score; however, I also keep checking precision and recall.

For the minority class (toxic comments), the initial model has higher precision than recall (0.86 vs. 0.64 on the test set). This is expected: with roughly 90% of the comments labeled non-toxic, the classifier leans toward predicting the majority class and misses toxic comments.
The goal is to improve recall and thus the f1-score.

In the next section, I investigate the effect of the vectorizer parameters on the output of the BOW model.
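
Before running the grid search, it may help to see what `min_df`, `max_df` and `ngram_range` do to the vocabulary. The sketch below uses an invented four-document corpus and a hypothetical helper, `vocabulary_size`. Note that an integer `min_df` is an absolute document count, whereas the floats used in this notebook are proportions of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only)
toy_docs = ["edit talk page", "edit page revert", "talk page thank", "edit war"]


def vocabulary_size(**params):
    """Fit a CountVectorizer on the toy corpus and return its feature count."""
    return len(CountVectorizer(**params).fit(toy_docs).vocabulary_)


n_all = vocabulary_size()                        # every unigram
n_min_df = vocabulary_size(min_df=2)             # drop terms in fewer than 2 documents
n_max_df = vocabulary_size(max_df=0.5)           # drop terms in more than 50% of documents
n_bigrams = vocabulary_size(ngram_range=(2, 2))  # bigrams only

print(n_all, n_min_df, n_max_df, n_bigrams)
```

Raising `min_df` or lowering `max_df` shrinks the feature matrix, which is why these parameters affect both the score and the fit time in the grid search.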

### 3.1.2. BOW Term Frequency - Parameters Evaluation

In [11]:
X = df['clean_text']
y = df['toxic_type']

#train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=1)


In [12]:
# Define the BOW and Logistic Regression models with different parameters

pipeline = Pipeline([('countvect', CountVectorizer()),
                     ('clf', LogisticRegression(solver='lbfgs', max_iter=20000))])

parameters = {
    'countvect__max_df': (0.95, 0.9, 0.85, 0.8),
    'countvect__min_df': (0.0001, 0.0005, 0.001, 0.002),
    'countvect__ngram_range': ((1,1), (2,2))
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring='f1')
model_bow = grid_search.fit(X_train, y_train)

In [14]:
# Rank all models based on the f1-score metric

model_bow_results = pd.DataFrame(model_bow.cv_results_).sort_values(by='mean_test_score', ascending=False)
model_bow_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_countvect__max_df,param_countvect__min_df,param_countvect__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,10.432982,1.866859,0.662657,0.068872,0.95,0.0001,"(1, 1)","{'countvect__max_df': 0.95, 'countvect__min_df...",0.750159,0.764893,0.764932,0.745038,0.757407,0.756486,0.007924,1
8,9.090825,1.215975,0.566545,0.005077,0.9,0.0001,"(1, 1)","{'countvect__max_df': 0.9, 'countvect__min_df'...",0.750159,0.764893,0.764932,0.745038,0.757407,0.756486,0.007924,1
24,8.839809,1.022437,0.558752,0.007119,0.8,0.0001,"(1, 1)","{'countvect__max_df': 0.8, 'countvect__min_df'...",0.750159,0.764893,0.764932,0.745038,0.757407,0.756486,0.007924,1
16,8.854013,1.047872,0.566128,0.007749,0.85,0.0001,"(1, 1)","{'countvect__max_df': 0.85, 'countvect__min_df...",0.750159,0.764893,0.764932,0.745038,0.757407,0.756486,0.007924,1
18,8.173404,1.001998,0.549761,0.009949,0.85,0.0005,"(1, 1)","{'countvect__max_df': 0.85, 'countvect__min_df...",0.737915,0.755251,0.748811,0.730634,0.742808,0.743084,0.008516,5
26,8.149528,1.013364,0.542664,0.008855,0.8,0.0005,"(1, 1)","{'countvect__max_df': 0.8, 'countvect__min_df'...",0.737915,0.755251,0.748811,0.730634,0.742808,0.743084,0.008516,5
2,8.699656,1.057683,0.618047,0.009326,0.95,0.0005,"(1, 1)","{'countvect__max_df': 0.95, 'countvect__min_df...",0.737915,0.755251,0.748811,0.730634,0.742808,0.743084,0.008516,5
10,8.169474,1.032815,0.554974,0.00663,0.9,0.0005,"(1, 1)","{'countvect__max_df': 0.9, 'countvect__min_df'...",0.737915,0.755251,0.748811,0.730634,0.742808,0.743084,0.008516,5
12,7.220552,1.042784,0.546806,0.00953,0.9,0.001,"(1, 1)","{'countvect__max_df': 0.9, 'countvect__min_df'...",0.724054,0.736449,0.735216,0.712009,0.730004,0.727547,0.008919,9
28,7.159653,1.030859,0.537663,0.008782,0.8,0.001,"(1, 1)","{'countvect__max_df': 0.8, 'countvect__min_df'...",0.724054,0.736449,0.735216,0.712009,0.730004,0.727547,0.008919,9


In [75]:
# Best model based on f1-score metric

print(model_bow.best_estimator_.steps)

[('countvect', CountVectorizer(max_df=0.95, min_df=0.0001)), ('clf', LogisticRegression(max_iter=20000))]


In [15]:
# Check the classification metrics for the test set

y_pred_bow = model_bow.predict(X_test)
print(classification_report(y_test, y_pred_bow))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98     28659
           1       0.85      0.70      0.77      3245

    accuracy                           0.96     31904
   macro avg       0.91      0.84      0.87     31904
weighted avg       0.95      0.96      0.95     31904



In [192]:
# Check the classification metrics for the train set

y_pred_train_bow = model_bow.predict(X_train)
print(classification_report(y_train, y_pred_train_bow))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98    114633
           1       0.95      0.76      0.84     12980

    accuracy                           0.97    127613
   macro avg       0.96      0.88      0.91    127613
weighted avg       0.97      0.97      0.97    127613



In [57]:
# Summary of the results including fit time and score

bow = model_bow_results[['mean_fit_time', 'mean_score_time', 'params', 'mean_test_score']].head(1).transpose()
bow

| | 0 |
| :- | :- |
| mean_fit_time | 10.432982 |
| mean_score_time | 0.662657 |
| params | {'countvect__max_df': 0.95, 'countvect__min_df... |
| mean_test_score | 0.756486 |


### 3.2. Normalized TF-IDF

The next text vectorization model used in this project is the TF-IDF algorithm.
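
As a worked example of what "normalized TF-IDF" means, the sketch below recomputes scikit-learn's default weighting by hand on an invented toy corpus and compares it with `TfidfVectorizer`. It assumes the default settings (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), for which the inverse document frequency is ln((1 + n) / (1 + df)) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
toy_docs = ["edit talk page", "edit page revert", "talk page thank", "edit war"]

# Raw term counts, one row per document and one column per term
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Weight the counts by idf, then scale every row to unit L2 norm
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Terms that occur in many documents receive a smaller weight than rare terms, and the row normalization removes the effect of comment length.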

### 3.2.1. Initial Exploration

In [16]:
# Features (cleaned text) and target (toxicity label)
X = df['clean_text']
y = df['toxic_type']


In [17]:
# Define function to perform train/test split and then fitting and transforming using TF-IDF

def tfidfvectorize(X, y, max_df=0.95, min_df=0.001, ngram_range=(1,1), norm='l2', 
                   use_idf=True, smooth_idf=True, sublinear_tf=False):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                                stratify=y, random_state=1)
    
    # Pass the function arguments through instead of hard-coding the values
    vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, ngram_range=ngram_range, norm=norm, 
                   use_idf=use_idf, smooth_idf=smooth_idf, sublinear_tf=sublinear_tf)
    
    Dict_count = vectorizer.fit(X_train)
    
    X_train_vec = Dict_count.transform(X_train).toarray()
    X_test_vec = Dict_count.transform(X_test).toarray()
    
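    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
    X_train_df = pd.DataFrame(data=X_train_vec, columns=vectorizer.get_feature_names_out())
    X_test_df = pd.DataFrame(data=X_test_vec, columns=vectorizer.get_feature_names_out())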
    
    y_train_df = y_train
    y_test_df = y_test
    
    print('Number of labels in the train set: \n', y_train.value_counts(normalize=True))
    print('-----------------------------------')
    print('Number of labels in the test set: \n', y_test.value_counts(normalize=True))
    print('-----------------------------------')
    
    X_train = np.array(X_train_df)
    X_test = np.array(X_test_df)
    y_train = np.array(y_train_df)
    y_test = np.array(y_test_df)
    
    print("X train type and shape: ", type(X_train), X_train.shape)
    print("y train type and shape: ", type(y_train), y_train.shape)
    
    print("X test type and shape: ", type(X_test), X_test.shape)
    print("y test type and shape: ", type(y_test), y_test.shape)
    print('-----------------------------------')
    
    
    return X_train, X_test, y_train, y_test

In [18]:
# TFIDF vectorizer

X_train_tf, X_test_tf, y_train_tf, y_test_tf = tfidfvectorize(X, y, max_df=0.95, min_df=0.001, ngram_range=(1,1), 
                                                 norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Number of labels in the train set: 
 0    0.898286
1    0.101714
Name: toxic_type, dtype: float64
-----------------------------------
Number of labels in the test set: 
 0    0.898289
1    0.101711
Name: toxic_type, dtype: float64
-----------------------------------
X train type and shape:  <class 'numpy.ndarray'> (127613, 2958)
y train type and shape:  <class 'numpy.ndarray'> (127613,)
X test type and shape:  <class 'numpy.ndarray'> (31904, 2958)
y test type and shape:  <class 'numpy.ndarray'> (31904,)
-----------------------------------


In [19]:
# Logistic Regression model

clf = LogisticRegression(solver='lbfgs', max_iter=20000, random_state=1)
clf_model = clf.fit(X_train_tf, y_train_tf)

y_pred_train_tf = clf_model.predict(X_train_tf)

y_pred_test_tf = clf_model.predict(X_test_tf)

In [20]:
# Check the classification metrics for the test set

print("Test Classification Report")
print(classification_report(y_test_tf, y_pred_test_tf))

Test Classification Report
              precision    recall  f1-score   support

           0       0.96      0.99      0.98     28659
           1       0.91      0.63      0.75      3245

    accuracy                           0.96     31904
   macro avg       0.94      0.81      0.86     31904
weighted avg       0.96      0.96      0.95     31904



In [21]:
# Check the classification metrics for the train set

print("Train Classification Report")
print(classification_report(y_train_tf, y_pred_train_tf))

Train Classification Report
              precision    recall  f1-score   support

           0       0.96      0.99      0.98    114633
           1       0.92      0.63      0.75     12980

    accuracy                           0.96    127613
   macro avg       0.94      0.81      0.86    127613
weighted avg       0.96      0.96      0.95    127613



### 3.2.2. TF-IDF Vectorization - Parameters Evaluation

In [22]:
X = df['clean_text']
y = df['toxic_type']

#train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=1)


In [23]:
# Define the TF-IDF and Logistic Regression models with different parameters

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LogisticRegression(solver='lbfgs', max_iter=20000))])

parameters = {
    'tfidf__max_df': (0.95, 0.9, 0.85, 0.8),
    'tfidf__min_df': (0.0001, 0.0005, 0.001, 0.002),
    'tfidf__ngram_range': ((1,1), (2,2)),
    'tfidf__norm': ('l1', 'l2'),
    'tfidf__use_idf': (True, False),
    'tfidf__smooth_idf': (True, False),
    'tfidf__sublinear_tf': (True, False)
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring='f1')
model_tf = grid_search.fit(X_train, y_train)

In [24]:
# Rank the results based on the f1-score metric

model_tf_results = pd.DataFrame(model_tf.cv_results_).sort_values(by='mean_test_score', ascending=False)
model_tf_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_tfidf__max_df,param_tfidf__min_df,param_tfidf__ngram_range,param_tfidf__norm,param_tfidf__smooth_idf,param_tfidf__sublinear_tf,param_tfidf__use_idf,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
172,3.052689,0.043315,0.512436,0.010285,0.9,0.0005,"(1, 1)",l2,False,True,True,"{'tfidf__max_df': 0.9, 'tfidf__min_df': 0.0005...",0.740230,0.759591,0.746905,0.730402,0.748690,0.745164,0.009653,1
300,3.061287,0.033217,0.507843,0.004753,0.85,0.0005,"(1, 1)",l2,False,True,True,"{'tfidf__max_df': 0.85, 'tfidf__min_df': 0.000...",0.740230,0.759591,0.746905,0.730402,0.748690,0.745164,0.009653,1
40,3.311951,0.125871,0.561447,0.008321,0.95,0.0005,"(1, 1)",l2,True,True,True,"{'tfidf__max_df': 0.95, 'tfidf__min_df': 0.000...",0.740230,0.759591,0.746905,0.730402,0.748690,0.745164,0.009653,1
44,3.381707,0.076519,0.572371,0.019592,0.95,0.0005,"(1, 1)",l2,False,True,True,"{'tfidf__max_df': 0.95, 'tfidf__min_df': 0.000...",0.740230,0.759591,0.746905,0.730402,0.748690,0.745164,0.009653,1
424,3.041076,0.120493,0.508623,0.004067,0.8,0.0005,"(1, 1)",l2,True,True,True,"{'tfidf__max_df': 0.8, 'tfidf__min_df': 0.0005...",0.740230,0.759591,0.746905,0.730402,0.748690,0.745164,0.009653,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497,5.244102,0.083984,0.689252,0.007039,0.8,0.002,"(2, 2)",l1,True,True,False,"{'tfidf__max_df': 0.8, 'tfidf__min_df': 0.002,...",0.059524,0.059502,0.052926,0.021220,0.056569,0.049948,0.014567,497
247,5.337852,0.054408,0.694013,0.004190,0.9,0.002,"(2, 2)",l1,False,False,False,"{'tfidf__max_df': 0.9, 'tfidf__min_df': 0.002,...",0.059524,0.059502,0.052926,0.021220,0.056569,0.049948,0.014567,497
371,5.277293,0.034150,0.685507,0.006322,0.85,0.002,"(2, 2)",l1,True,False,False,"{'tfidf__max_df': 0.85, 'tfidf__min_df': 0.002...",0.059524,0.059502,0.052926,0.021220,0.056569,0.049948,0.014567,497
113,5.350179,0.058524,0.709494,0.018204,0.95,0.002,"(2, 2)",l1,True,True,False,"{'tfidf__max_df': 0.95, 'tfidf__min_df': 0.002...",0.059524,0.059502,0.052926,0.021220,0.056569,0.049948,0.014567,497


In [27]:
# Classification report for the test set

y_pred_tf = model_tf.predict(X_test)
print(classification_report(y_test, y_pred_tf))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98     28659
           1       0.92      0.66      0.77      3245

    accuracy                           0.96     31904
   macro avg       0.94      0.83      0.87     31904
weighted avg       0.96      0.96      0.96     31904



In [207]:
# Classification report for the train set

y_pred_train_tf = model_tf.predict(X_train)
print(classification_report(y_train, y_pred_train_tf))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98    114633
           1       0.93      0.66      0.77     12980

    accuracy                           0.96    127613
   macro avg       0.95      0.83      0.87    127613
weighted avg       0.96      0.96      0.96    127613



In [28]:
# The best model based on the f1-score metric

best_tfidf_model = model_tf.best_estimator_.steps[0][1]
print(best_tfidf_model)

TfidfVectorizer(max_df=0.95, min_df=0.0005, sublinear_tf=True)


In [59]:
# Summary of the results including fit time and score

tf = model_tf_results[['mean_fit_time', 'mean_score_time', 'params', 'mean_test_score']].head(1).transpose()
tf

| | 172 |
| :- | :- |
| mean_fit_time | 3.052689 |
| mean_score_time | 0.512436 |
| params | {'tfidf__max_df': 0.9, 'tfidf__min_df': 0.0005... |
| mean_test_score | 0.745164 |
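
The top row of the sorted results (`max_df=0.9`) differs from the printed best estimator (`max_df=0.95`) because several parameter combinations tie on `mean_test_score`, and pandas' default sort does not guarantee the order of tied rows. To my understanding, `GridSearchCV` takes the first candidate with rank 1 as its best estimator. The sketch below uses a hypothetical three-row results table to show one way of selecting the first of the tied rows.

```python
import pandas as pd

# Hypothetical cv_results_-style table with a tie for the best score
results = pd.DataFrame({
    "params": ["max_df=0.95", "max_df=0.9", "max_df=0.8"],
    "mean_test_score": [0.745164, 0.745164, 0.727547],
})

# idxmax returns the label of the first row that attains the maximum,
# so ties are resolved in favour of the earliest candidate
first_best = results.loc[results["mean_test_score"].idxmax()]
print(first_best["params"])
```

Since the tied candidates have identical scores, the choice between them does not change the reported f1-score; it only affects which parameters and fit time appear in the summary.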


### 3.3. Summary of Text Vecorization

The results for BOW and TF-IDF show that the two models perform very similarly in terms of the f1-score (mean cross-validated f1 of 0.756 for BOW versus 0.745 for TF-IDF). However, the TF-IDF model is faster to fit: its mean fit time is about 3.1 s versus 10.4 s for BOW, roughly three times faster. Therefore, the TF-IDF technique is selected for the rest of the modelling.

In [66]:
vec_sum = pd.concat([bow, tf], axis=1)
vec_sum.columns = ['BOW Terms Frequency', 'TF-IDF']
vec_sum

| | BOW Terms Frequency | TF-IDF |
| :- | :- | :- |
| mean_fit_time | 10.432982 | 3.052689 |
| mean_score_time | 0.662657 | 0.512436 |
| params | {'countvect__max_df': 0.95, 'countvect__min_df': 0.0001, 'countvect__ngram_range': (1, 1)} | {'tfidf__max_df': 0.9, 'tfidf__min_df': 0.0005, 'tfidf__ngram_range': (1, 1), 'tfidf__norm': 'l2', 'tfidf__smooth_idf': False, 'tfidf__sublinear_t... |
| mean_test_score | 0.756486 | 0.745164 |


In [68]:
print('The best vectorization model:')
model_tf.best_estimator_

The best vectorization model:


Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_df=0.95, min_df=0.0005,
                                 sublinear_tf=True)),
                ('clf', LogisticRegression(max_iter=20000))])

## 4. Evaluate Imbalanced Algorithm

As explained earlier, this dataset is imbalanced: toxic comments make up about 10 percent of all comments. This imbalance results in low recall for the toxic class. To address it, I use oversampling, undersampling, and combined techniques. The resampling techniques are used in conjunction with TF-IDF, and a grid search is employed to find the best technique and its optimum sampling ratio.
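
To illustrate what a sampling ratio means, here is a minimal NumPy sketch of random oversampling. It is not imbalanced-learn's implementation: the helper `random_oversample` and its `ratio` argument are hypothetical, and the sketch assumes the ratio is the desired minority-to-majority count after resampling.

```python
import numpy as np


def random_oversample(X, y, ratio, seed=0):
    """Duplicate random minority rows until n_minority = ratio * n_majority.

    Hypothetical helper for illustration; assumes a binary target.
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_target = int(ratio * counts.max())
    n_extra = max(n_target - counts.min(), 0)

    # Draw the extra minority rows with replacement
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Toy data: 80 majority rows (label 0) and 20 minority rows (label 1)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_res, y_res = random_oversample(X_toy, y_toy, ratio=0.5)
print(np.bincount(y_res))
```

Only the training data should be resampled; placing the sampler inside an imbalanced-learn pipeline, as in the cells below, applies it during fitting and leaves the validation folds untouched.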

### 4.1. Oversampling Algorithms

### 4.1.1. Random Oversampling

In [69]:
X = df['clean_text']
y = df['toxic_type']

#train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    stratify=y, random_state=1)

In [70]:
# TF-IDF text vectorization technique in conjunction with Random Oversampling and Logistic Regression techniques

pipeline = imbpipeline([('tfidf', TfidfVectorizer(max_df=0.95,
                            min_df=0.0005, sublinear_tf=True)),
                     ('over', RandomOverSampler(random_state=0)),
                     ('clf', LogisticRegression(solver='lbfgs', max_iter=20000))])

parameters = {
    'over__sampling_strategy': (1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3)
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring='f1')
model_R1 = grid_search.fit(X_train, y_train)

In [71]:
# Rank the model based on the f1-score metric

model_R1_results = pd.DataFrame(model_R1.cv_results_).sort_values(by='mean_test_score', ascending=False)
model_R1_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_over__sampling_strategy,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
7,4.487857,0.214337,0.631202,0.02221,0.3,{'over__sampling_strategy': 0.3},0.771588,0.782678,0.774142,0.760073,0.775526,0.772801,0.007352,1
6,3.763166,0.164054,0.627233,0.036916,0.4,{'over__sampling_strategy': 0.4},0.768094,0.769608,0.768848,0.760953,0.770489,0.767598,0.003416,2
5,4.822719,0.647701,0.69638,0.089506,0.5,{'over__sampling_strategy': 0.5},0.758445,0.766158,0.761238,0.754214,0.758218,0.759655,0.003948,3
4,4.588342,0.360693,0.58725,0.015516,0.6,{'over__sampling_strategy': 0.6},0.748862,0.751494,0.748798,0.742684,0.748458,0.748059,0.0029,4
3,4.269987,0.42333,0.597438,0.051972,0.7,{'over__sampling_strategy': 0.7},0.738326,0.742964,0.735081,0.731716,0.73564,0.736745,0.003754,5
2,4.6778,0.34903,0.645635,0.029531,0.8,{'over__sampling_strategy': 0.8},0.728114,0.728239,0.727303,0.71844,0.724768,0.725373,0.003685,6
1,5.425109,0.342835,0.634329,0.035885,0.9,{'over__sampling_strategy': 0.9},0.717916,0.716442,0.718581,0.708757,0.716045,0.715548,0.00352,7
0,5.400107,0.176558,0.632666,0.015415,1.0,{'over__sampling_strategy': 1},0.707425,0.705136,0.707614,0.701428,0.707937,0.705908,0.002449,8


In [72]:
# Classification report for the test set

y_pred_R1 = model_R1.predict(X_test)
print(classification_report(y_test, y_pred_R1))

              precision    recall  f1-score   support

           0       0.97      0.98      0.98     28659
           1       0.81      0.77      0.79      3245

    accuracy                           0.96     31904
   macro avg       0.89      0.87      0.88     31904
weighted avg       0.96      0.96      0.96     31904



In [73]:
# Classification report for the train set

y_pred_train_R1 = model_R1.predict(X_train)
print(classification_report(y_train, y_pred_train_R1))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98    114633
           1       0.82      0.79      0.81     12980

    accuracy                           0.96    127613
   macro avg       0.90      0.89      0.89    127613
weighted avg       0.96      0.96      0.96    127613



In [77]:
# Summary of the results including fit time and score

R1_est = model_R1_results[['mean_fit_time', 'mean_score_time', 'params', 'mean_test_score']].head(1).transpose()
R1_est

| | 7 |
| :- | :- |
| mean_fit_time | 4.487857 |
| mean_score_time | 0.631202 |
| params | {'over__sampling_strategy': 0.3} |
| mean_test_score | 0.772801 |


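Compared with the tuned TF-IDF model without resampling, random oversampling decreased the test-set precision for the toxic class (from 0.92 to 0.81) while increasing the recall (from 0.66 to 0.77) and the f1-score (from 0.77 to 0.79).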