# Toxic Comment Classifier

## Introduction

In this kernel we address the toxic comment classification problem, a multi-label classification problem, using several machine learning and deep learning techniques.\
We start by analyzing the data. Then we apply techniques such as Naive Bayes, logistic regression, neural networks and LSTMs; we also try fine-tuning BERT.\
The resulting models are then compared based on their ROC AUC score.

In [None]:
import sys, os, re, csv, codecs, string
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# For evaluation
import torch
import torchmetrics
import transformers
from transformers import pipeline
from tqdm.notebook import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer#, CountVectorizer
from sklearn import metrics
from sklearn.metrics import accuracy_score

# For LSTM
from keras.preprocessing.text import Tokenizer
#from keras_preprocessing.sequence import pad_sequences
import keras_preprocessing
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

## Load training and test data

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
test_labels = pd.read_csv('../input/test_labels.csv')

## Identify the classes

The comments are labelled as one or more of the following six categories: toxic, severe toxic, obscene, threat, insult and identity hate.

In [None]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

## Data analysis

The training data contains a row per comment, with an id, the text of the comment, and 6 different labels that we'll try to predict.

In [None]:
train.sample(5)

Here are a couple of example comments: one toxic (labelled toxic, obscene and insult) and one with no labels.

In [None]:
train['comment_text'][67547]

In [None]:
train['comment_text'][156031]

First, let's check whether the dataset contains any null values.\
If there are any, they will need to be cleaned up later.

In [None]:
train.isnull().any(),test.isnull().any()

None of the rows in the training dataset contain null values; in particular, every row has a comment, so there is no need to clean up null fields.

Let's create a summary of the dataset. We also create a 'none' label so we can see how many comments have no labels.

In [None]:
train['none'] = 1-train[label_cols].max(axis=1)
train.describe()

The mean values are very small (some well below 0.05), as 89.8321% of the comments are not labelled in any of the six categories and are therefore not considered toxic.\
Let's see the exact numbers for the various categories as well.

In [None]:
print('Total rows in train is {}'.format(len(train)))
print('Number of unlabelled (non-toxic) comments: {}'.format(train['none'].sum()))
print(train[label_cols].sum())
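
The `none` column works because every label is either 0 or 1: the row-wise maximum is 1 as soon as any label is set, so `1 - max` flags comments with no label at all. A minimal sketch on a made-up three-row frame (the values and the `toy_label_cols` name are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy labels; the real data has six label columns.
toy = pd.DataFrame({
    "toxic":   [1, 0, 0],
    "obscene": [1, 0, 1],
    "insult":  [0, 0, 0],
})
toy_label_cols = ["toxic", "obscene", "insult"]

# 1 when every label in the row is 0, otherwise 0.
toy["none"] = 1 - toy[toy_label_cols].max(axis=1)

# Share of rows without any label.
share_unlabelled = toy["none"].mean()
print(toy)
print(share_unlabelled)
```

Taking the mean of the 0/1 `none` column gives the unlabelled share directly, which is how the percentage quoted above can be read off `describe()`.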

As mentioned, the majority of the comments in the training data are not labelled in any of these categories.\
Let's look at the character length for the rows in the training data.

In [None]:
lens = train.comment_text.str.len()
lens.mean(), lens.std(), lens.max()

The length of the comments varies a lot. Let's look at the histogram plot for text length.

In [None]:
sns.set()
lens.hist()
plt.show()

Most comments are under 500 characters long, with some up to 5,000 characters.\
Next, let's examine the correlations among the target variables.

In [None]:
data = train[label_cols]

In [None]:
colormap = plt.cm.plasma
plt.figure(figsize=(7,7))
plt.title('Correlation among target labels',y=1.05,size=14)
sns.heatmap(data.astype(float).corr(),linewidths=0.1,vmax=1.0,square=True,cmap=colormap,
           linecolor='white',annot=True);

Indeed, some of the labels are highly correlated: insult-obscene has the highest correlation at 0.74, followed by toxic-obscene and toxic-insult.

## Data pre-processing

### Clean test data

Rows with -1 values in test_labels are not used for evaluation.
Therefore, we remove them from test_labels and store their indexes so we can remove them from the predictions as well (we need to keep them in test, otherwise the row indexes of the predictions would no longer line up).

In [None]:
# Index labels of the rows that are not used for scoring (marked with -1)
indexes = test_labels.index[test_labels['toxic'] == -1]
test_labels = test_labels.drop(indexes)

### Clean up the comment text

In [None]:
def clean_text(text):
    # Lower-case, expand common English contractions, and strip non-word characters.
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    # Must run before the generic 's rule below, which would otherwise consume it.
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    # Raw strings avoid invalid-escape warnings for \W and \s.
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\s+", " ", text)
    text = text.strip(' ')
    return text

In [None]:
train['comment_text'] = train['comment_text'].map(clean_text)

In [None]:
test['comment_text'] = test['comment_text'].map(clean_text)

### Vectorize the data

Create a *bag of words* representation, as a *term document matrix*.\
First of all, define the tokenization.

In [None]:
COMMENT = 'comment_text'
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
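
To make the effect of the tokenizer concrete, here is a minimal sketch of the same idea restricted to ASCII punctuation. The names `demo_re_tok` and `demo_tokenize` are introduced only for this illustration, and `re.escape` is an added safeguard that the cell above does not use.

```python
import re
import string

# Same idea as the cell above, restricted to ASCII punctuation for brevity.
# re.escape guards characters such as ']' and '\' inside the character class.
demo_re_tok = re.compile(f"([{re.escape(string.punctuation)}])")

def demo_tokenize(s):
    # Surround each punctuation mark with spaces, then split on whitespace.
    return demo_re_tok.sub(r" \1 ", s).split()

tokens = demo_tokenize("don't stop, please!")
print(tokens)
```

Each punctuation mark becomes its own token, so `don't` is split into three pieces.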

Instantiate TfidfVectorizer.

In [None]:
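# Boolean flags are passed as True rather than 1, which recent scikit-learn versions require
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
              min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
              smooth_idf=True, sublinear_tf=True, max_features=5000, stop_words='english' )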

Learn the vocabulary in the training data, then use it to create a document-term matrix.

In [None]:
trn_term_doc = vec.fit_transform(train[COMMENT])

Transform the test data using the earlier fitted vocabulary, into a document-term matrix.

In [None]:
test_term_doc = vec.transform(test[COMMENT])

This creates a *sparse matrix* with only a small number of non-zero elements (the *stored elements* of the sparse representation). The cell below shows the shape of each matrix.

In [None]:
trn_term_doc.shape, test_term_doc.shape
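
As a small illustration of how sparse such a matrix is, the sketch below fits a default `TfidfVectorizer` on three made-up documents and compares the number of stored elements with the full matrix size. The documents and the `toy_*` names are assumptions for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents, purely for illustration.
docs = ["you are great", "you are awful", "great article thanks"]

toy_vec = TfidfVectorizer()
toy_dtm = toy_vec.fit_transform(docs)   # SciPy sparse matrix: documents x terms

n_docs, n_terms = toy_dtm.shape
# Fraction of entries that are actually stored (non-zero).
density = toy_dtm.nnz / (n_docs * n_terms)
print(toy_dtm.shape, toy_dtm.nnz, round(density, 2))
```

On real comments with thousands of features the density is typically far lower than in this toy case, which is why the sparse format pays off.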

### Alt method

In [None]:
# list_sentences_train = train["comment_text"]
# list_sentences_test = test["comment_text"]

# max_features = 500 #20000
# tokenizer = Tokenizer(num_words=max_features)
# tokenizer.fit_on_texts(list(list_sentences_train))
# list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
# list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

# maxlen = 5000#200
# trn_term_doc = pad_sequences(list_tokenized_train, maxlen=maxlen)
# test_term_doc = pad_sequences(list_tokenized_test, maxlen=maxlen)

## Logistic regression

One way to approach a multi-label classification problem is to transform it into separate single-label classification problems. This is known as 'problem transformation'. There are three methods:
* _**Binary Relevance.**_ This is probably the simplest method: it treats each label as a separate binary classification problem. The key assumption here, though, is that there is no correlation among the various labels.
* _**Classifier Chains.**_ In this method, the first classifier is trained on the input X. Each subsequent classifier is then trained on the input X and the predictions of all previous classifiers in the chain. This method attempts to exploit the correlation with the preceding target variables.
* _**Label Powerset.**_ This method transforms the problem into a multi-class problem where the classes are all the unique label combinations. In our case, with six labels, Label Powerset would in effect turn this into a 2^6 or 64-class problem.

Next we will try to address the toxic classification problem using the Binary Relevance and the Classifier Chains approaches.
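
The difference between Binary Relevance and Label Powerset targets can be sketched on a tiny made-up label matrix. The matrix `Y` and the variable names are hypothetical, and only three labels are used to keep the output short.

```python
import numpy as np

# Hypothetical label matrix: 4 comments x 3 labels (the real task has 6 labels).
Y = np.array([[1, 0, 1],
              [0, 0, 0],
              [1, 0, 1],
              [0, 1, 0]])

# Binary Relevance: one independent binary target per column.
binary_targets = [Y[:, j] for j in range(Y.shape[1])]

# Label Powerset: each distinct row (label combination) becomes one class id.
combos, powerset_target = np.unique(Y, axis=0, return_inverse=True)
powerset_target = powerset_target.ravel()  # some NumPy versions return a 2-D inverse

print(len(binary_targets), "binary problems")
print(len(combos), "observed classes out of", 2 ** Y.shape[1], "possible")
```

Only the combinations that actually occur become classes, so the effective number of Label Powerset classes can be well below the theoretical maximum.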

Instantiate the Logistic Regression model.

In [None]:
logreg = LogisticRegression(C=12.0,max_iter=500)

Create a matrix to store the predictions for the test data.

In [None]:
preds = np.zeros((len(test), len(label_cols)))

### Binary Relevance

In [None]:
for i,label in enumerate(label_cols):
    print('... Processing {}'.format(label))
    y = train[label]
    # train the model on the training document-term matrix and this label
    logreg.fit(trn_term_doc, y)
    # compute the training accuracy
    y_pred_X = logreg.predict(trn_term_doc)
    print('Training accuracy is {}'.format(accuracy_score(y, y_pred_X)))
    # store the predicted probability of the positive class for the test data
    preds[:,i] = logreg.predict_proba(test_term_doc)[:,1]
    #preds[:,i] = logreg.predict_proba(list_tokenized_test)[:,1]

### Model evaluation

In [None]:
test_ids = pd.DataFrame({'id': test["id"]})
predictions = pd.concat([test_ids, pd.DataFrame(preds, columns = label_cols)], axis=1)

Drop rows not used for evaluation.

In [None]:
predictions = predictions.drop(indexes)

Calculating ROC AUC score for each category.

In [None]:
for cat in label_cols:
    
    print(f"Category: {cat}")
    print(f"Sklearn score: {metrics.roc_auc_score(test_labels[cat], predictions[cat], multi_class='ovr')}")
    print(f"torchmetrics score: {torchmetrics.functional.classification.binary_auroc(torch.tensor(predictions[cat].values),torch.tensor(test_labels[cat].values), thresholds=None)}")
    print("#" * 30)
    print()

Calculating mean column-wise ROC AUC score on all categories.

In [None]:
y_true = test_labels[label_cols].values
y_score = predictions[label_cols].values
print(f"Sklearn score: {metrics.roc_auc_score(y_true, y_score, average='macro')}")
print(f"Torchmetrics score: {torchmetrics.functional.classification.multilabel_auroc(torch.tensor(y_score), torch.tensor(y_true), num_labels=len(label_cols), thresholds=None)}")
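
The mean column-wise ROC AUC is simply the average of the per-label scores. A minimal sketch with made-up ground truth and scores for two labels (the `toy_*` arrays are illustrative only) shows that scikit-learn's `average='macro'` matches the manual mean:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and scores for 2 labels, for illustration only.
toy_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]])

# Column-wise AUC, one value per label.
per_label = [roc_auc_score(toy_true[:, j], toy_score[:, j])
             for j in range(toy_true.shape[1])]

# Macro average over labels, computed by scikit-learn in one call.
macro = roc_auc_score(toy_true, toy_score, average="macro")

print(per_label, macro)
```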

### Classifier Chains

Create a function to add features.

In [None]:
def add_feature(X, feature_to_add):
    '''
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    '''
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

Keep references to trn_term_doc and test_term_doc in train_X and test_X, so the original matrices stay untouched while features are appended.

In [None]:
train_X, test_X = trn_term_doc, test_term_doc

In [None]:
for i,label in enumerate(label_cols):
    print('... Processing {}'.format(label))
    y = train[label]
    # train the model on the current (possibly extended) training matrix
    logreg.fit(train_X,y)
    # compute the training accuracy
    y_pred_X = logreg.predict(train_X)
    print('Training Accuracy is {}'.format(accuracy_score(y,y_pred_X)))
    # make predictions from test_X
    test_y = logreg.predict(test_X)
    test_y_prob = logreg.predict_proba(test_X)[:,1]
    preds[:,i] = test_y_prob
    # chain the true label to train_X as an extra feature
    train_X = add_feature(train_X, y)
    print('Shape of train_X is now {}'.format(train_X.shape))
    # chain the predicted label to test_X as an extra feature
    test_X = add_feature(test_X, test_y)
    print('Shape of test_X is now {}'.format(test_X.shape))
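
For comparison, scikit-learn ships a `ClassifierChain` meta-estimator that follows the same pattern as the loop above: true labels are appended as features during training and predicted labels at prediction time. The sketch below runs it on synthetic data standing in for the TF-IDF matrix; the dataset, sizes and `*_toy` names are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data standing in for the TF-IDF matrix and labels.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=3, random_state=0)

# order=[0, 1, 2] chains the labels in column order, as in the loop above.
chain = ClassifierChain(LogisticRegression(max_iter=500), order=[0, 1, 2])
chain.fit(X_toy, Y_toy)

# One probability column per label.
chain_probs = chain.predict_proba(X_toy)
print(chain_probs.shape)
```

The chain order matters, since later labels can only draw on earlier ones; the manual loop fixes it to the order of `label_cols`.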

### Model evaluation

In [None]:
test_ids = pd.DataFrame({'id': test["id"]})
predictions = pd.concat([test_ids, pd.DataFrame(preds, columns = label_cols)], axis=1)

Drop rows not used for evaluation.

In [None]:
predictions = predictions.drop(indexes)

Calculating ROC AUC score for each category.

In [None]:
for cat in label_cols:
    
    print(f"Category: {cat}")
    print(f"Sklearn score: {metrics.roc_auc_score(test_labels[cat], predictions[cat], multi_class='ovr')}")
    print(f"torchmetrics score: {torchmetrics.functional.classification.binary_auroc(torch.tensor(predictions[cat].values),torch.tensor(test_labels[cat].values), thresholds=None)}")
    print("#" * 30)
    print()

Calculating mean column-wise ROC AUC score on all categories.

In [None]:
y_true = test_labels[label_cols].values
y_score = predictions[label_cols].values
print(f"Sklearn score: {metrics.roc_auc_score(y_true, y_score, average='macro')}")
print(f"Torchmetrics score: {torchmetrics.functional.classification.multilabel_auroc(torch.tensor(y_score), torch.tensor(y_true), num_labels=len(label_cols), thresholds=None)}")

## Naive Bayes - Logistic Regression

Here we try NBSVM (Naive Bayes - Support Vector Machine), but with sklearn's logistic regression in place of the SVM; in practice the two are nearly identical.\
NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf).

Here's the basic Naive Bayes feature equation:

In [None]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [None]:
x = trn_term_doc
test_x = test_term_doc

Fit a model for one dependent variable (label) at a time:

In [None]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
    #m = LogisticRegression(C=4, dual=True) # This gives an error
    m = LogisticRegression(C=4, dual=False, max_iter=500)
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r
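
In symbols, writing $x_i$ for the TF-IDF row of comment $i$ and $N_c$ for the number of training comments whose label equals $c$ (notation introduced here for illustration), `pr` and `get_mdl` compute the following, element-wise over the vocabulary:

$$
\hat{p}_c = \frac{1 + \sum_{i:\,y_i = c} x_i}{1 + N_c},
\qquad
r = \log\frac{\hat{p}_1}{\hat{p}_0}
$$

The logistic regression is then fitted on the rescaled features $x_i \odot r$, and the same $r$ is applied to the test matrix before predicting.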

In [None]:
preds = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('fit', j)
    m,r = get_mdl(train[j])
    preds[:,i] = m.predict_proba(test_x.multiply(r))[:,1]

### Model evaluation

In [None]:
test_ids = pd.DataFrame({'id': test["id"]})
predictions = pd.concat([test_ids, pd.DataFrame(preds, columns = label_cols)], axis=1)

Drop rows not used for evaluation.

In [None]:
predictions = predictions.drop(indexes)

Calculating ROC AUC score for each category.

In [None]:
for cat in label_cols:
    
    print(f"Category: {cat}")
    print(f"Sklearn score: {metrics.roc_auc_score(test_labels[cat], predictions[cat], multi_class='ovr')}")
    print(f"torchmetrics score: {torchmetrics.functional.classification.binary_auroc(torch.tensor(predictions[cat].values),torch.tensor(test_labels[cat].values), thresholds=None)}")
    print("#" * 30)
    print()

Calculating mean column-wise ROC AUC score on all categories.

In [None]:
y_true = test_labels[label_cols].values
y_score = predictions[label_cols].values
print(f"Sklearn score: {metrics.roc_auc_score(y_true, y_score, average='macro')}")
print(f"Torchmetrics score: {torchmetrics.functional.classification.multilabel_auroc(torch.tensor(y_score), torch.tensor(y_true), num_labels=len(label_cols), thresholds=None)}")

## LSTM

The inputs to the network are the encoded comments. We begin by defining an Input layer whose shape matches the length of each encoded comment: 5000 in the code below (200 in the commented-out variant).\
Leaving the slot after the comma empty means the shape tuple has a single element; Keras adds the batch dimension automatically.

In [None]:
#inp = Input(shape=(maxlen, )) #maxlen=200 as defined earlier
inp = Input(shape=(5000, )) # maxlen

In [None]:
embed_size = 128
x = Embedding(5000, embed_size)(inp) # max_features
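
An Embedding layer looks up a vector for each integer token index, so its usual input is a fixed-length integer sequence per comment, as produced by the commented-out tokenizer cell earlier in the notebook. The sketch below mimics that encoding in plain NumPy. `texts_to_padded` is a hypothetical helper, not Keras' implementation: it ranks words by frequency, keeps indices below `num_words`, and left-pads with zeros.

```python
import numpy as np

def texts_to_padded(texts, num_words, maxlen):
    """Hypothetical helper: index words by frequency, then left-pad with zeros."""
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent word gets index 1 (ties broken alphabetically); 0 is padding.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    word_index = {w: i + 1 for i, w in enumerate(ranked)}

    out = np.zeros((len(texts), maxlen), dtype="int64")
    for row, text in enumerate(texts):
        # Keep only words whose index is below num_words.
        seq = [word_index[w] for w in text.split() if word_index[w] < num_words]
        seq = seq[-maxlen:]                  # keep the last maxlen tokens
        if seq:
            out[row, -len(seq):] = seq       # left-pad with zeros
    return out, word_index

padded, word_index = texts_to_padded(["you are great", "you are"],
                                     num_words=10, maxlen=4)
print(padded)
```

Every row has the same length, which is what lets the encoded comments be stacked into a single input array.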

In [None]:
x = LSTM(60, return_sequences=True,name='lstm_layer')(x)

In [None]:
x = GlobalMaxPool1D()(x)

In [None]:
# x = Dropout(0.1)(x)

In [None]:
# x = Dense(50, activation="relu")(x)

In [None]:
# x = Dropout(0.1)(x)

In [None]:
x = Dense(6, activation="sigmoid")(x)

In [None]:
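model = Model(inputs=inp, outputs=x)
# Multi-label task with sigmoid outputs: each label is an independent binary
# decision, so binary cross-entropy is used rather than categorical cross-entropy.
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])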
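
Because a comment can carry several labels at once, the six sigmoid outputs are scored independently and need not sum to 1; a loss commonly paired with such outputs is binary cross-entropy averaged over the labels. A minimal NumPy sketch with made-up logits for three labels (the helper functions are written out for illustration and are not Keras' own implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-label losses.
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Made-up logits for one comment and three labels.
logits = np.array([2.0, -1.0, 0.5])
probs = sigmoid(logits)              # each label is scored independently
toy_labels = np.array([1.0, 0.0, 1.0])

print(probs, probs.sum())            # the probabilities need not sum to 1
print(binary_crossentropy(toy_labels, probs))
```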

In [None]:
model.summary()

Using a batch generator.

In [None]:
def batch_generator(X_data, y_data, batch_size):
    # Yield (dense features, label matrix) batches forever, reshuffling every epoch.
    n_samples = X_data.shape[0]
    # Round up so the last partial batch is used and no empty batch is yielded.
    number_of_batches = int(np.ceil(n_samples / batch_size))
    counter = 0
    shuffle_index = np.arange(n_samples)
    np.random.shuffle(shuffle_index)
    y_values = y_data[label_cols].to_numpy(dtype='int64')
    while True:
        index_batch = shuffle_index[batch_size*counter:batch_size*(counter+1)]
        # Densify only the current batch to keep memory usage low.
        X_batch = X_data[index_batch,:].toarray()
        y_batch = y_values[index_batch]

        counter += 1
        yield X_batch, y_batch
        if counter >= number_of_batches:
            # End of epoch: reshuffle and start again.
            np.random.shuffle(shuffle_index)
            counter = 0

In [None]:
batch_size = 32
epochs = 2
#model.fit(trn_term_doc,train[label_cols], batch_size=batch_size, epochs=epochs, validation_split=0.1)
model.fit(trn_term_doc.todense(),train[label_cols], batch_size=batch_size, epochs=epochs, validation_split=0.1)
#model.fit(batch_generator(trn_term_doc, train[label_cols], batch_size), epochs=epochs)

### Model evaluation

In [None]:
preds = model.predict(test_term_doc)

In [None]:
test_ids = pd.DataFrame({'id': test["id"]})
predictions = pd.concat([test_ids, pd.DataFrame(preds, columns = label_cols)], axis=1)

Drop rows not used for evaluation.

In [None]:
predictions = predictions.drop(indexes)

Calculating ROC AUC score for each category.

In [None]:
for cat in label_cols:
    
    print(f"Category: {cat}")
    print(f"Sklearn score: {metrics.roc_auc_score(test_labels[cat], predictions[cat], multi_class='ovr')}")
    print(f"torchmetrics score: {torchmetrics.functional.classification.binary_auroc(torch.tensor(predictions[cat].values),torch.tensor(test_labels[cat].values), thresholds=None)}")
    print("#" * 30)
    print()

Calculating mean column-wise ROC AUC score on all categories.

In [None]:
y_true = test_labels[label_cols].values
y_score = predictions[label_cols].values
print(f"Sklearn score: {metrics.roc_auc_score(y_true, y_score, average='macro')}")
print(f"Torchmetrics score: {torchmetrics.functional.classification.multilabel_auroc(torch.tensor(y_score), torch.tensor(y_true), num_labels=len(label_cols), thresholds=None)}")