# Toxic Comment Classifier


Welcome to this Jupyter notebook. Here we build a single-input, multiple-output model that classifies a piece of text against several labels at once. A given comment can carry none, one, or more than one of the following labels:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

This problem ([Toxic Comment Classifier](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview)) is hosted on Kaggle.


## Outline

- [Part 1. Data Preparation](#1)
    - [1.1 Importing the Data](#1.1)
    - [1.2 Tokenizing and Padding the Data](#1.2)

- [Part 2. Creating and Training the Model](#2)
    - [2.1 Creating the Model](#2.1)
    - [2.2 Compiling and Training the Model](#2.2)

- [Part 3. Testing the Model](#3)
    - [3.1 Importing the Test Data](#3.1)
    - [3.2 Tokenizing and Padding the Test Data](#3.2)
    - [3.3 Predicting on Test Data](#3.3)
    - [3.4 Final Result](#3.4)

In [1]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

<a name='1'></a>
## 1. Data Preparation

<a name='1.1'></a>
### 1.1 Importing the Data

We first import the training dataset, which has 8 columns: `id`, `comment_text`, `toxic`, `severe_toxic`, `obscene`, `threat`, `insult` and `identity_hate`. The last 6 are the target labels.

In [30]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [16]:
texts = df.comment_text.values
texts[:10]

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       '"\nMore\nI can\'t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences f

<a name='1.2'></a>

### 1.2 Tokenizing and Padding the Data

First we replace the `\n` and `\t` characters in the text with spaces. Then we tokenize the data using the TensorFlow [Tokenizer](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/preprocessing/text/Tokenizer), and finally we pad the resulting sequences with the [pad_sequences](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function from TensorFlow so that all the training samples have the same length.

In [20]:
def remove(text):
    # Replace every newline and tab character with a space
    text = re.sub('\n',' ',text)
    text = re.sub('\t',' ',text)
    return text

In [21]:
texts = [remove(x) for x in texts]

In [22]:
texts[:10]

["Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
 "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
 "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
 '" More I can\'t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on refe

Here we tokenize the data. The `oov_token` is used to replace any out-of-vocabulary word when the tokenizer is asked to convert a piece of text, where the vocabulary is the mapping from words to integer indices built by `fit_on_texts`.

In [24]:
tokenizer = Tokenizer(oov_token='<UNK>') # Tokenizing
tokenizer.fit_on_texts(texts)

In [25]:
vocab = tokenizer.word_index
texts = tokenizer.texts_to_sequences(texts)
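
To illustrate what the `oov_token` does, here is a minimal pure-Python sketch of a word index with an out-of-vocabulary fallback. It is a simplification and not the Keras implementation: it only lowercases and splits on whitespace, and the helper names `build_word_index` and `texts_to_ids` are hypothetical. It does follow the same convention of reserving index 1 for the OOV token and ranking words by frequency.

```python
from collections import Counter

def build_word_index(texts, oov_token="<UNK>"):
    """Map each word to an integer index, most frequent word first.

    Index 1 is reserved for the out-of-vocabulary token; index 0 is left
    unused so that it can serve as the padding value later on.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_ids(texts, word_index, oov_token="<UNK>"):
    """Replace every word by its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]

# Toy data: "villain" never appears in the texts used to build the index.
toy_train = ["you are my hero", "you are not my hero"]
toy_index = build_word_index(toy_train)
toy_ids = texts_to_ids(["you are my villain"], toy_index)
print(toy_ids)  # [[2, 3, 4, 1]] -> the unseen word maps to the OOV index 1
```

The same idea is why the test comments later in this notebook are converted with the tokenizer fitted on the training data: words that were never seen during fitting fall back to the OOV index instead of being dropped.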

Here we look at the distribution of sequence lengths to choose a uniform length for all the training samples.

In [28]:
lengths = pd.Series([len(x) for x in texts])
lengths.describe()

count    159571.000000
mean         68.221569
std         101.073763
min           1.000000
25%          17.000000
50%          36.000000
75%          76.000000
max        1403.000000
dtype: float64
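
One way to judge a candidate length is to measure what fraction of sequences would fit without truncation. The sketch below assumes a hypothetical helper `coverage` and uses only the five summary values printed above as toy input, so the printed fractions describe the toy list and not the real distribution.

```python
def coverage(lengths, maxlen):
    """Return the fraction of sequences whose length is at most maxlen."""
    return sum(length <= maxlen for length in lengths) / len(lengths)

# Toy input: min, quartiles and max taken from the summary above.
toy_lengths = [1, 17, 36, 76, 1403]

for candidate in (64, 128, 256):
    print(candidate, coverage(toy_lengths, candidate))
```

In the notebook itself, calling `coverage(lengths, 128)` on the real `lengths` series would give the share of training comments that are not truncated at 128 tokens.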

In [73]:
toxic     = df.toxic.values
sev_toxic = df.severe_toxic.values
obs       = df.obscene.values
threat    = df.threat.values
ins       = df.insult.values
idn_hate  = df.identity_hate.values

In [29]:
tensors = pad_sequences(texts,maxlen=128,padding='post',truncating='post')
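
As a rough illustration of what `padding='post'` and `truncating='post'` mean, the following pure-Python sketch handles a single sequence. The helper name `pad_post` is hypothetical, and the real `pad_sequences` works on a whole batch and returns a NumPy array.

```python
def pad_post(sequence, maxlen, value=0):
    """Drop tokens beyond maxlen from the end, then pad with `value` at the end."""
    clipped = sequence[:maxlen]
    return clipped + [value] * (maxlen - len(clipped))

short_example = pad_post([5, 8, 2], maxlen=5)
long_example = pad_post([5, 8, 2, 9, 4, 7, 3], maxlen=5)

print(short_example)  # [5, 8, 2, 0, 0] -> zeros appended at the end
print(long_example)   # [5, 8, 2, 9, 4] -> tokens after position 5 removed
```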

<a name='2'></a>
## 2. Creating and Training the Model

<a name='2.1'></a>
### 2.1 Creating the Model

The following layers are used to build the model:

- [tf.keras.layers.Input()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/InputLayer): instantiates a Keras tensor that serves as the input to the model. In our case the shape of the tensor is (batch_size, 128): `tf.keras.layers.Input(shape=(128,))`.

- [tf.keras.layers.Embedding()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/Embedding): converts each token to its vector representation. Its weight matrix has the size of the vocabulary by the dimension of the model: `tf.keras.layers.Embedding(vocab_size, d_model)`. `vocab_size` is the number of entries in the given vocabulary. `d_model` is the number of elements in the word embedding.

- [tf.keras.layers.LSTM()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/LSTM): an LSTM layer of size `d_model` that returns the full output sequence (`return_sequences=True`).

- [tf.keras.layers.GlobalAveragePooling1D()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/GlobalAveragePooling1D): global average pooling operation for temporal data.

- [tf.keras.layers.BatchNormalization()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/BatchNormalization): normalizes and scales inputs or activations.

- [tf.keras.layers.Dropout()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/Dropout): applies dropout to the input. It helps prevent overfitting by deactivating a fraction of the neurons from the previous layer during training.

- [tf.keras.layers.Dense()](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/layers/Dense): we use one Dense layer with [Relu](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/activations/relu) activation, followed by 6 output Dense layers (one per label), each with a single unit and [Sigmoid](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/activations/sigmoid) activation.
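
Two of the layers above can be illustrated with a small pure-Python sketch: global average pooling over the time axis, and independent sigmoid outputs. The numbers are made up for illustration and the helper names are hypothetical. The point of the second half is that each sigmoid head produces its own probability, so the six values do not have to sum to 1 and several labels can be active at once.

```python
import math

def global_average_pool(sequence):
    """Average a (timesteps x features) list of lists over the time axis."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up LSTM outputs: 3 timesteps, 2 features per timestep.
lstm_outputs = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = global_average_pool(lstm_outputs)
print(pooled)  # [2.0, 2.0] -> one value per feature

# Made-up pre-activation values, one per output head.
head_logits = [4.0, -1.0, 3.0, -5.0, 2.0, -2.0]
probabilities = [sigmoid(z) for z in head_logits]
print(sum(probabilities) > 1.0)  # True: the heads are independent
```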

In [77]:
def Classifier(num_words = len(vocab) + 1, d_model = 128, dropout=0.1):
    
    # num_words is len(vocab) + 1 because index 0 is reserved for padding
    # Input: a batch of padded token sequences of length 128
    inp = tf.keras.layers.Input(shape=(128,))
    
    # Shared trunk: embedding -> LSTM -> average over time steps -> dense features
    x   = tf.keras.layers.Embedding(num_words,d_model)(inp)
    x   = tf.keras.layers.LSTM(d_model,return_sequences=True)(x)
    x   = tf.keras.layers.GlobalAveragePooling1D()(x)
    x   = tf.keras.layers.BatchNormalization()(x)
    x   = tf.keras.layers.Dropout(dropout)(x)
    x   = tf.keras.layers.Dense(64,'relu')(x)
    
    # One single-unit sigmoid head per label
    tox     = tf.keras.layers.Dense(1,'sigmoid',name='Toxic_Classifier')(x)
    sev_tox = tf.keras.layers.Dense(1,'sigmoid',name='Severe_Toxic_Classifier')(x)
    obs     = tf.keras.layers.Dense(1,'sigmoid',name='Obscene_Classifier')(x)
    thr     = tf.keras.layers.Dense(1,'sigmoid',name='Threat_Classifier')(x)
    ins     = tf.keras.layers.Dense(1,'sigmoid',name='Insult_Classifier')(x)
    idh     = tf.keras.layers.Dense(1,'sigmoid',name='Identity_Hate_Classifier')(x)
    
    # Single input, six outputs
    Model = tf.keras.models.Model(inp,[tox,sev_tox,obs,thr,ins,idh])
    
    return Model

In [78]:
Model = Classifier()
Model.summary()

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 128)]        0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 128, 128)     26923392    input_4[0][0]                    
__________________________________________________________________________________________________
lstm_3 (LSTM)                   (None, 128, 128)     131584      embedding_3[0][0]                
__________________________________________________________________________________________________
global_average_pooling1d_3 (Glo (None, 128)          0           lstm_3[0][0]                     
____________________________________________________________________________________________

<a name='2.2'></a>
### 2.2 Compiling and Training the Model
We use the [Adam](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/optimizers/Adam) optimizer with a learning rate of 0.01 and the [Binary Crossentropy](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf/keras/losses/BinaryCrossentropy) loss for each of the six outputs. We train the model for 5 epochs with a validation split of 0.1 and 100 steps per epoch.

In [82]:
Model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss={'Toxic_Classifier':tf.keras.losses.BinaryCrossentropy(),
                    'Severe_Toxic_Classifier':tf.keras.losses.BinaryCrossentropy(),
                    'Obscene_Classifier':tf.keras.losses.BinaryCrossentropy(),
                    'Threat_Classifier':tf.keras.losses.BinaryCrossentropy(),
                    'Insult_Classifier':tf.keras.losses.BinaryCrossentropy(),
                    'Identity_Hate_Classifier':tf.keras.losses.BinaryCrossentropy()})
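
For intuition about the loss, here is a minimal pure-Python sketch of binary cross-entropy for one output head. The labels and predictions are made up, the helper name is hypothetical, and only two of the six heads are shown. When one loss is given per output and no loss weights are set, Keras adds the per-output losses together to form the total loss that is minimized.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of 0/1 labels."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))
    return total / len(y_true)

# Made-up labels and predicted probabilities for two heads, batch of 2.
toxic_loss = binary_crossentropy([1, 0], [0.9, 0.2])
threat_loss = binary_crossentropy([0, 0], [0.1, 0.4])

# With equal weights the total loss is the sum over the heads.
total_loss = toxic_loss + threat_loss
print(toxic_loss, threat_loss, total_loss)
```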

In [83]:
Model.fit(tensors,[toxic,sev_toxic,obs,threat,ins,idn_hate],epochs=5,validation_split=0.1,steps_per_epoch=100)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1abaafbd310>

<a name='3'></a>
## 3. Testing the Model

<a name='3.1'></a>

### 3.1 Importing the Test Data

In [84]:
df_test = pd.read_csv('test.csv') # Reading the test data set
df_test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [85]:
test_comment = df_test.comment_text.values

In [86]:
test_comment[:5]

array(["Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,",
       '== From RfC == \n\n The title is fine as it is, IMO.',
       '" \n\n == Sources == \n\n * Zawe Ashton on Lapland —  /  "',
       ":If you have a look back at the source, the information I updated was the correct form. I can only guess the source hadn't updated. I shall update the information once again but thank you for your message.",
       "I don't anonymously edit articles at all."], dtype=object)

<a name='3.2'></a>
### 3.2 Tokenizing and Padding the Test Data

In [87]:
test_tensors = tokenizer.texts_to_sequences(test_comment) # tokenizing the comments

In [88]:
test_tensors = pad_sequences(test_tensors,maxlen=128,padding='post',truncating='post') #padding the tensors to a length of 128

In [89]:
test_tensors.shape

(153164, 128)

<a name='3.3'></a>

### 3.3 Predicting on Test Data

In [91]:
y_hat = Model.predict(test_tensors)  # predicting

In [92]:
test_tox = y_hat[0]
test_sev_tox = y_hat[1]
test_obs = y_hat[2]
test_threat = y_hat[3]
test_ins = y_hat[4]
test_idn_hate = y_hat[5]

In [93]:
df_sub = pd.DataFrame({'id':df_test.id.values,
                       'toxic':np.squeeze(test_tox),
                       'severe_toxic':np.squeeze(test_sev_tox),
                       'obscene':np.squeeze(test_obs),
                       'threat':np.squeeze(test_threat),
                       'insult':np.squeeze(test_ins),
                       'identity_hate':np.squeeze(test_idn_hate)})   # Creating the submission dataframe

In [94]:
df_sub.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.99619,0.541441,0.986939,0.082921,0.941591,0.315915
1,0000247867823ef7,0.002316,1e-05,6.1e-05,0.000237,0.000214,4.3e-05
2,00013b17ad220c46,0.00032,7e-06,1.6e-05,0.000109,3.9e-05,2.1e-05
3,00017563c3f7919a,6.8e-05,3e-06,2.2e-05,1.6e-05,1.7e-05,3e-06
4,00017695ad8997eb,0.000363,5e-06,2.4e-05,0.00012,5.4e-05,1.5e-05
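
The values above are per-label probabilities. To read them as label decisions, one option is to apply a cutoff. The sketch below uses a threshold of 0.5, which is an assumption chosen for illustration and not something tuned in this notebook; the helper name `to_labels` is hypothetical. The submission file itself keeps the raw probabilities.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def to_labels(probabilities, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    return [label for label, prob in zip(LABELS, probabilities) if prob >= threshold]

# Rows 0 and 4 of the prediction table shown above.
first_row = [0.99619, 0.541441, 0.986939, 0.082921, 0.941591, 0.315915]
fifth_row = [0.000363, 5e-06, 2.4e-05, 0.00012, 5.4e-05, 1.5e-05]

print(to_labels(first_row))  # ['toxic', 'severe_toxic', 'obscene', 'insult']
print(to_labels(fifth_row))  # []
```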


In [95]:
df_sub.to_csv('submission.csv',index=False)

<a name='3.4'></a>
### 3.4 Final Result

<img src='accuracy.png'>