## 0. Introduction

In an [earlier notebook](https://www.kaggle.com/anirbansen3027/jtcc-fasttext-supervised) we used the fastText library both to generate sentence embeddings and to perform multi-label text classification on the output variables toxic, severe_toxic, obscene, threat, insult and identity_hate.

In this one, we will use Keras (a high-level API that runs on top of TensorFlow) to build a 1-D Convolutional Neural Network (CNN) for multi-label text classification.

**Intuition:**

**How did it start ?**

CNNs were first introduced in the 1980s by Yann LeCun to recognize handwritten digits. But ConvNets remained on the sidelines of computer vision because they faced a serious problem: they could not scale. CNNs needed a lot of data and compute resources to work well on large images. In 2012, AlexNet won the ImageNet Large Scale Visual Recognition Challenge by a wide margin, which showed that the time had come to revisit deep learning. The availability of large datasets, namely ImageNet with its millions of labeled pictures, and of vast compute resources enabled researchers to create complex CNNs that could perform computer vision tasks that were previously impossible.

**What is a CNN?**

<img src="https://marketing3.topcoder.com/wp-content/uploads/2019/08/image-15-1024x450.png" width="500">

There are four main operations in the ConvNet shown in the image above:

**1. Convolution**
<img src="https://miro.medium.com/max/1920/1*D6iRfzDkz-sEzyjYoVZ73w.gif" width="400">

This layer is the heart of CNNs. CNNs use kernels (also called filters) to learn features of the input; the output of sliding a kernel over the input is called a feature map. For example, the above kernel [[1, 0, -1], [1, 0, -1], [1, 0, -1]] detects vertical edges in images.
The magic is that we don't need to specify the numbers in the kernels. We only specify how many kernels we want, and the model learns their values on its own, just like the weights in a normal ANN. The general idea is that the more Conv and Pool layers we stack, the more complex the features the model can detect. The first layers recognize simple things like lines and colors, and subsequent layers recognize more complex patterns.
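To make the sliding-window idea concrete, here is a minimal NumPy sketch that applies the vertical-edge kernel from the text to a small made-up image. The function name `conv2d_valid` and the 5x6 image are assumptions for illustration only. As in most deep learning libraries, the kernel is applied without flipping (strictly speaking a cross-correlation), with no padding and a stride of 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 5x6 image: bright left half (1), dark right half (0)
image = np.array([[1, 1, 1, 0, 0, 0]] * 5, dtype=float)
# The vertical-edge kernel quoted in the text
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

Every row of the resulting 3x4 feature map is `[0, 3, 3, 0]`: the response is zero over the flat regions and large only where the window straddles the bright-to-dark boundary.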

**2. Non Linearity (ReLU)**

An artificial neuron without an activation function just produces the weighted sum of its inputs, so a network built from such neurons can only model linear relationships. A nonlinear activation function lets the network learn nonlinear relationships between its inputs and outputs. Here is an in-depth blog on activations ([Activation Functions](https://machinelearningknowledge.ai/activation-functions-neural-network/#Why_we_need_Activation_Functions_in_Neural_Network)).

The sigmoid function is used in the output neurons of binary classification problems to squash the incoming signal into the range 0 to 1 so that it can be interpreted as a probability.

We use ReLU (rectified linear unit), which applies the non-saturating activation function f(x) = max(0, x). ReLU is often preferred to other functions in the hidden layers because it typically trains the network several times faster without a significant penalty to generalization accuracy. It is also far less prone to the vanishing gradient problem than the sigmoid and tanh activation functions, because its gradient is 1 for every positive input.



<img src="https://i.imgur.com/qsAiZ5G.jpg" width="700" />
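The two activations discussed above are short enough to write out by hand. This is a small NumPy sketch with made-up input values, not code from the notebook:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

def sigmoid(x):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])   # hypothetical pre-activation values
relu_out = relu(x)               # negative inputs become 0, positive inputs pass through
sigmoid_out = sigmoid(x)         # sigmoid(0) is exactly 0.5
print(relu_out, sigmoid_out)
```

ReLU leaves positive values untouched, which is why its gradient does not shrink for them, while sigmoid flattens out for large positive or negative inputs.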

**3. Pooling or Sub Sampling**
<img src="https://developers.google.com/machine-learning/practica/image-classification/images/maxpool_animation.gif" width="200">

Pooling layers reduce the dimensions of the feature maps, which cuts the amount of computation in the network and the number of parameters the following layers have to learn. A pooling layer summarises the features present in a region of the feature map generated by a convolution layer.

Here is an in-depth blog on the types of pooling layers ([Pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)).

In this notebook, we will be using MaxPooling and GlobalMaxPooling.

The above image shows MaxPooling: the operation simply computes the maximum value over one block at a time.

The other type is the Global Max Pooling layer. Here the pool size equals the input size, so the output is the maximum of the entire input. Had we applied Global Max Pooling to the input in the above image, we would get 9 as the output.
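The difference between the two pooling variants can be sketched in a few lines of NumPy. The 4x4 feature map below is made up for illustration and is not the one in the animation; `max_pool_2d` is a hypothetical helper name.

```python
import numpy as np

def max_pool_2d(feature_map, pool=2):
    """Take the maximum over non-overlapping pool x pool blocks."""
    h, w = feature_map.shape
    # Drop trailing rows/columns that do not fill a complete block
    trimmed = feature_map[:h - h % pool, :w - w % pool]
    # Axes become (block row, row inside block, block column, column inside block)
    blocks = trimmed.reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]])

pooled = max_pool_2d(fm)     # one value per 2x2 block
global_max = fm.max()        # Global Max Pooling: one value for the whole map
print(pooled, global_max)
```

Max pooling shrinks the 4x4 map to 2x2 (`[[6, 5], [7, 9]]`), whereas global max pooling collapses it to the single value 9.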

**4. Classification (Fully Connected Layer)**

Finally, after several convolutional and max pooling layers, the high-level reasoning in the neural network is done by fully connected layers, which perform classification based on the features extracted by the previous layers. Typically, such a layer is a traditional ANN layer: it multiplies the inputs by weights, adds a bias and passes the result through an activation to give an output.
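As a rough sketch of what a fully connected layer computes, the snippet below pushes a feature vector through one hidden layer with ReLU and one output layer with sigmoid. All sizes (8 features, 4 hidden units, 3 outputs) and the random weights are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=8)                        # stand-in for the pooled features

# Randomly initialised weights; in a real network these are learned
W1, b1 = rng.normal(size=(8, 4)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)) * 0.1, np.zeros(3)

hidden = np.maximum(0, features @ W1 + b1)           # weights * inputs + bias, then ReLU
scores = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # sigmoid gives one score per output
print(scores.shape)
```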

Let's dive into the code.

### Table of Contents:

[1. Importing Libraries](#1)

[2. Reading Dataset](#2)

[3. Text Preprocessing](#3)

[4. Defining a 1D CNN model](#4)

[5. Compile and fit the CNN model](#5)

[6. Predicting and Submitting for Test Data](#6)

[7. TODOs](#7)

## 1. Importing Libraries <a class="anchor" id="1"></a>

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, GlobalMaxPool1D, Embedding, Input

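MAX_SEQUENCE_LENGTH = 1000 # every comment is padded or truncated to this many tokens
MAX_NUM_WORDS = 20000 # vocabulary size passed to the Tokenizer and the Embedding layer
EMBEDDING_DIM = 100 # not used below: the Embedding layer in section 4 is created with 128 dimensions
VALIDATION_SPLIT = 0.2 # not used below: train_test_split in section 5 keeps its default 25% hold-out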

## 2. Reading Dataset <a class="anchor" id="2"></a>
All the datasets are provided as zipped files. First we have to unzip them, and then we read them into dataframes.

In [2]:
#Unzip every archive into /kaggle/working and redirect unzip's output to /dev/null to keep the cell quiet
# -o overwrites existing files, -d sets the destination directory for the unzipped files
!unzip -o '/kaggle/input/jigsaw-toxic-comment-classification-challenge/*.zip' -d /kaggle/working > /dev/null
#Reading input csv files
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")


4 archives were successfully processed.


In [3]:
#Assigning the comment texts to arrays of strings
train_texts= df_train.comment_text.values
test_texts= df_test.comment_text.values
#Assigning the labels to a separate df
train_labels = df_train[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]]
#Printing the first comment
print("First comment text in training set:\n\n", train_texts[0])

First comment text in training set:

 Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27


## 3. Text Preprocessing<a class="anchor" id="3"></a>

Text data must be encoded as numbers before it can be used as input or output for ML/DL models. The Keras library provides some basic tools to help us prepare our text data. We will use the Tokenizer class, a text tokenization utility that vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, a word count or a tf-idf weight.
This will be a 3-step process:

**1. Initializing the Tokenizer class** 

* By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized. 0 is a reserved index that won't be assigned to any word.
* We set num_words to MAX_NUM_WORDS (20000), the maximum number of words to keep based on word frequency. Only the most common num_words-1 words will be kept.

**2. Calling the fit_on_texts function - Updates internal vocabulary based on a list of texts**

This method creates the vocabulary index based on word frequency. If you give it something like "The cat sat on the mat.", it will create a dictionary such that word_index["the"] = 1, word_index["cat"] = 2, and so on. It is a word -> index dictionary, so every word gets a unique integer value, and 0 is reserved for padding. A lower integer means a more frequent word (the first few are often stop words because they appear a lot).

**3. Calling the texts_to_sequences function - Transforms each text in texts to a sequence of integers**

So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary.

***N.B.***
*After fit_on_texts, which essentially creates the word_index dictionary for the vocabulary, we can do one of 2 things:*

*call texts_to_sequences, which is used when the model has an embedding layer, or*

*call texts_to_matrix, which converts the texts to a bag of words*
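The three steps above can be mimicked in plain Python to see what the Tokenizer produces. This is a simplified sketch, not the Keras implementation: the helper names `fit_word_index` and `to_sequence` are made up, and the regular expression is only an approximation of the default punctuation filtering.

```python
import re
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent word, 0 stays reserved."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() keeps first-seen order among words with equal counts
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words=None):
    """Replace known words by their index; drop unknown words and indices >= num_words."""
    seq = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        idx = word_index.get(word)
        if idx is not None and (num_words is None or idx < num_words):
            seq.append(idx)
    return seq

word_index = fit_word_index(["The cat sat on the mat."])
print(word_index)
print(to_sequence("The dog sat on the mat", word_index))
```

For the example sentence this gives `{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}`, and the second text becomes `[1, 3, 4, 1, 5]` because "dog" was never seen during fitting and is dropped. Passing `num_words=3` would keep only indices 1 and 2, which mirrors the "most common num_words-1 words" rule.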

In [4]:
#Initializing the class
tokenizer = Tokenizer(num_words = MAX_NUM_WORDS)
#Updates internal vocabulary based on a list of texts.
tokenizer.fit_on_texts(train_texts)
#Transforms each text in texts to a sequence of integers.
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)
word_index = tokenizer.word_index
print("Length of word Index:", len(word_index))
print("First 5 elements in the word_index dictionary:", dict(list(word_index.items())[0: 5]) )
print("First comment text in training set:\n", train_sequences[0])

Length of word Index: 210337
First 5 elements in the word_index dictionary: {'the': 1, 'to': 2, 'of': 3, 'and': 4, 'a': 5}
First comment text in training set:
 [688, 75, 1, 126, 130, 177, 29, 672, 4511, 12052, 1116, 86, 331, 51, 2278, 11448, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 15190, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]


Now that we have tokenized the comment texts, we need to pad the resulting sequences so that they all have the same length.

**Why So?**

Deep learning libraries assume a vectorized representation of your data. In the case of variable length sequence prediction problems, this requires that your data be transformed such that each sequence has the same length. This vectorization allows code to efficiently perform the matrix operations in batch for your chosen deep learning algorithms. 

This is also done in Computer Vision, where we generally tend to resize all the images to a fixed size which will be the input size of the Neural Network.
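What padding does to sequences of different lengths can be sketched in plain Python. The helper `pad_pre` is hypothetical; it imitates the default behaviour of Keras `pad_sequences`, which adds zeros at the start and, for sequences that are too long, also cuts tokens from the start.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of longer sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# One short and one long toy sequence, both forced to length 5
result = pad_pre([[688, 75, 1], [5, 4, 3, 2, 1, 9]], maxlen=5)
print(result)
```

The short sequence becomes `[0, 0, 688, 75, 1]` and the long one `[4, 3, 2, 1, 9]`, so both rows can be stacked into a single rectangular array.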

In [5]:
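#Pad tokenized sequences to MAX_SEQUENCE_LENGTH (by default zeros are added at the start and longer sequences are cut at the start)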
trainvalid_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print("Shape of padded sequence list:\n", trainvalid_data.shape)
print("First comment text in training set - 0 for padding - only last 50 sequences as the rest are paddings:\n", trainvalid_data[0][-50:])

Shape of padded sequence list:
 (159571, 1000)
First comment text in training set - 0 for padding - only last 50 sequences as the rest are paddings:
 [    0     0     0   688    75     1   126   130   177    29   672  4511
 12052  1116    86   331    51  2278 11448    50  6864    15    60  2756
   148     7  2937    34   117  1221 15190  2825     4    45    59   244
     1   365    31     1    38    27   143    73  3462    89  3085  4583
  2273   985]


## 4. Defining a 1D CNN model<a class="anchor" id="4"></a>

In Keras, the easiest way to define a model is to instantiate the Sequential model class and keep adding the required layers. A Sequential model is a plain stack of layers where each layer has exactly one input tensor and one output tensor.

A standard model for document classification uses an Embedding layer as input, followed by a one-dimensional convolutional neural network, a pooling layer and then a prediction output layer. We use 1 Embedding layer, 3 Convolution layers (the first two each followed by a MaxPooling layer, the last one by a GlobalMaxPooling layer) and 2 Dense layers. For the embedding we can either use a pre-trained embedding (like Word2Vec) to build an embedding matrix of size vocabulary x embedding dimension, or train a fresh embedding as the input layer together with the other weights. Here we train a fresh embedding.

**Conv1D**

Convolutional Neural Network (CNN) models were developed for image classification, in which the model accepts a two-dimensional input representing an image’s pixels and color channels. The same process can be applied to 1D sequences of data: the model slides its kernels along the sequence and extracts features from neighbouring positions. For text, this means CNNs take the proximity of words into account to create trainable patterns.
The kernel size/height in the convolutional layer defines the number of words to consider as the convolution is passed across the input text document, providing a grouping parameter. In our case the kernel considers 5 words at a time; in the image below it considers 2 words at a time.

<img src="https://i.imgur.com/zEapf5O.png" width="300" />
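The following NumPy sketch shows one way to picture a 1D convolution over an embedded sentence. The sizes are toy values (10 words, 4-dimensional embeddings, 3 filters) rather than the notebook's, and the random numbers stand in for learned embeddings and kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, kernel_size, n_filters = 10, 4, 5, 3   # toy sizes, not the notebook's
embedded = rng.normal(size=(seq_len, emb_dim))            # one comment: one row per word
kernels = rng.normal(size=(n_filters, kernel_size, emb_dim))
bias = np.zeros(n_filters)

out_len = seq_len - kernel_size + 1                       # no padding, stride 1
features = np.zeros((out_len, n_filters))
for t in range(out_len):
    window = embedded[t:t + kernel_size]                  # 5 consecutive word vectors
    # Each filter spans the full embedding width, so it slides along the word axis only
    response = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    features[t] = np.maximum(0, response)                 # ReLU, as in the model
print(features.shape)
```

Ten words and a kernel of size 5 give 6 window positions, so the output has shape `(6, 3)`: one row per window and one column per filter.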

**Max Pooling 1D**

The Max Pooling layer consolidates the output of the convolutional layer. We saw MaxPooling 2D earlier; in MaxPooling 1D the same thing happens along a single direction only.

We use sigmoid activation in the output layer. The sigmoid function gives an independent probability score between 0 and 1 for each output node. Had we used softmax, we would get a probability distribution across the output nodes that sums to 1.

In general,
* For binary classification, we can have 1 output unit, use sigmoid activation in the output layer and use binary cross entropy loss
* For multi-class classification, we can have N output units, use softmax activation in the output layer and use categorical cross entropy loss
* For multi-label classification, we can have N output units, use sigmoid activation in the output layer and use binary cross entropy loss
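The difference between sigmoid and softmax outputs, and the binary cross entropy used for the multi-label case, can be checked numerically. The logits and labels below are made up for illustration; the loss is averaged over the 6 labels, which is the usual convention.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, -3.0, 1.5, -2.0])   # hypothetical scores for the 6 labels

sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))            # independent probability per label
softmax_probs = np.exp(logits) / np.exp(logits).sum()    # one distribution over all labels

y_true = np.array([1, 0, 1, 0, 1, 0])                    # a comment can carry several labels
bce = -np.mean(y_true * np.log(sigmoid_probs)
               + (1 - y_true) * np.log(1 - sigmoid_probs))

print(sigmoid_probs.sum(), softmax_probs.sum(), bce)
```

The softmax outputs sum to 1, so they cannot express "toxic and obscene at once", while the sigmoid outputs sum to more than 1 here because several labels are likely at the same time.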

In [6]:
cnn_model = Sequential()
cnn_model.add(Embedding(MAX_NUM_WORDS, 128))
cnn_model.add(Conv1D(filters = 128, kernel_size = 5, activation = "relu"))
cnn_model.add(MaxPooling1D(pool_size = 5))
cnn_model.add(Conv1D(filters = 128, kernel_size = 5, activation = "relu"))
cnn_model.add(MaxPooling1D(pool_size = 5))
cnn_model.add(Conv1D(filters = 128, kernel_size = 5, activation = "relu"))
cnn_model.add(GlobalMaxPool1D())
cnn_model.add(Dense(units = 128, activation = 'relu'))
cnn_model.add(Dense(units = 6, activation = 'sigmoid'))

print(cnn_model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
conv1d (Conv1D)              (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 128)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0
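The summary reports the sequence dimension as `None` because the Embedding layer is not given a fixed input length. As a rough check, the arithmetic below traces what that dimension would be for the padded inputs of length 1000 and recomputes two of the parameter counts. It assumes the Keras defaults for the layers in the cell above (no padding and stride 1 for `Conv1D`, stride equal to `pool_size` for `MaxPooling1D`); the helper names are made up.

```python
def conv1d_len(n, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of non-overlapping pooling; a trailing partial window is dropped."""
    return n // pool_size

lengths = [1000]                      # MAX_SEQUENCE_LENGTH tokens after padding
for layer, size in [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5), ("conv", 5)]:
    step = conv1d_len if layer == "conv" else maxpool1d_len
    lengths.append(step(lengths[-1], size))
print(lengths)

embedding_params = 20000 * 128        # MAX_NUM_WORDS rows of 128 values each
conv_params = 5 * 128 * 128 + 128     # kernel_size * input channels * filters + one bias per filter
print(embedding_params, conv_params)
```

Under these assumptions the sequence length shrinks 1000 → 996 → 199 → 195 → 39 → 35 before the global max pooling reduces it to a single 128-value vector, and the two counts (2560000 and 82048) agree with the `Param #` column above.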

## 5. Compile and fit the CNN model <a class="anchor" id="5"></a>

Before we start training the model, we need to configure it. We specify the loss function, which is used to calculate the error at each iteration, the optimizer, which determines how the weights are updated, and the metrics to be evaluated by the model during training and testing.

While fitting/training the model, along with the training set we also pass the following parameters:

batch_size = number of samples that go through the network before the loss is calculated and the network parameters are updated (mini-batch gradient descent)

epochs = number of times the whole set of training samples goes through the network

validation_data = the dataset that is used to evaluate the loss and any model metrics at the end of each epoch. This set is not used for training.
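To see how batch_size and epochs translate into weight updates, here is a small back-of-the-envelope calculation. It uses the 119,678 training rows that the split in this section produces, together with the batch size and epoch count passed to `fit`.

```python
import math

n_train, batch_size, epochs = 119678, 128, 1

steps_per_epoch = math.ceil(n_train / batch_size)             # one weight update per batch
last_batch = n_train - (steps_per_epoch - 1) * batch_size     # the final batch is smaller
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)
```

One epoch therefore consists of 935 parameter updates, the last of which sees only 126 samples.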

In [7]:
#Configures the model for training.
cnn_model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["AUC"])

#Split the dataset into train and validation sets for training and evaluating the model (test_size is not given, so the default 25% is held out)
X_train, X_val, y_train, y_val = train_test_split(trainvalid_data, train_labels, shuffle = True, random_state = 123)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
#Trains the model for a fixed number of epochs (iterations on a dataset)
history = cnn_model.fit(X_train, y_train, batch_size = 128, epochs = 1, validation_data = (X_val, y_val))

(119678, 1000) (119678, 6) (39893, 1000) (39893, 6)


## 6. Predicting and Submitting for Test Data <a class="anchor" id="6"></a>

In [8]:
# Merging the test dataset with sample_submission to have all the columns:
#id,text_data and the target variables in one dataframe
df_test = pd.merge(df_test, sample_submission, on = "id")
#Use the CNN model to output probabilities on test data
y_preds = cnn_model.predict(test_data)
#Assign the predictions by the model in the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = y_preds
#Drop comment_text as the sample submission doesn't have it and it isn't expected in the submission
df_test.drop(["comment_text"], axis = 1, inplace = True)
#Save the dataset as a csv to submit it
df_test.to_csv("sample_submission.csv", index = False)

## 7. TODOs <a class="anchor" id="7"></a>
* Use a multichannel CNN that combines several kernel sizes (e.g. 3, 5 and 7), so that the model looks at word windows of different lengths at the same time
* Tune the model layers and hyperparameters to improve the performance

***Do upvote if you find it helpful 😁***