### Table of Contents:

[0. Introduction](#0)

[1. Importing Libraries](#1)

[2. Reading Dataset](#2)

[3. Text Preprocessing](#3)

[4. Defining a Multi-Label LSTM model](#4)

[5. Compile and train the LSTM model](#5)

[6. Predicting and Submitting for Test Data](#6)

[7. TODOs](#7)

## 0. Introduction <a class="anchor" id="0"></a>
In an earlier notebook (https://www.kaggle.com/anirbansen3027/jtcc-cnn) we used the Keras library (a wrapper over TensorFlow) to build 1-D Convolutional Neural Networks (CNNs) for multi-label text classification on the output variables toxic, severe_toxic, obscene, threat, insult and identity_hate.

In this one, we will use the same Keras library to build a Long Short-Term Memory (LSTM) network, an improvement over regular RNNs, for multi-label text classification. We will first go through a bit of intuition on how RNNs and LSTMs work, and then implement one using a minimalistic network with a single output layer for multi-label classification (instead of creating 6 separate networks, one per type of toxicity, or a network with multiple output layers). We will use just a single LSTM layer, and after just a single epoch it gives an AUC of about 0.96 on the leaderboard.

### Why do we need RNNs?
In a traditional neural network we assume that all inputs (and outputs) are independent of each other, and features learnt at one position of the text are not shared with other positions. This is a problem for sequential data such as text or time series, where each element also depends on the previous ones. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far.

<img src="https://i.imgur.com/FQyAYBP.png" title="source: imgur.com" width = 700/>

### What is the architecture of RNNs?
The overall architecture of an RNN depends on the task at hand. For this task, which is a classification task, we will be using the 3rd one: many-to-one. But to build intuition, let's look at the 5th one, which is a more general notation for RNNs. Once we know how the 5th variant works, getting to the others is just a matter of changing a small part.

<img src="https://www.di.ens.fr/~lelarge/dldiy/slides/lecture_8/images/rnn_variants_4.png" width = 500/>
Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state (more on this soon). From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video)

<img src="https://i.imgur.com/S7AKfYi.png"/>
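To make the many-to-one variant concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. It assumes the usual tanh recurrence; the weight names (`W_xh`, `W_hh`, `W_hy`) and the toy dimensions are our own choices for illustration and are not taken from the figure or from Keras.

```python
import numpy as np

def rnn_many_to_one(inputs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a vanilla RNN over a sequence and return a single output vector."""
    hidden = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        # The same weights are reused at every time step; the hidden state
        # carries information forward from the earlier steps.
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
    # Many-to-one: only the final hidden state feeds the output layer.
    return W_hy @ hidden + b_y

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim, output_dim = 5, 4, 3, 6  # toy sizes (assumed)
inputs = rng.normal(size=(seq_len, input_dim))
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_hy = rng.normal(size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

output = rnn_many_to_one(inputs, W_xh, W_hh, b_h, W_hy, b_y)
print(output.shape)  # (6,)
```

For the synced many-to-many variant (the 5th one), the only change would be to compute an output at every time step instead of only after the last one.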

### What are vanishing gradients?
The vanishing gradient problem arises in very deep neural networks, typically Recurrent Neural Networks, that use activation functions whose gradients tend to be small (in the range 0 to 1). Because these small gradients are multiplied together during backpropagation, they tend to "vanish", or shrink towards 0, across the layers, preventing the network from learning long-range dependencies. As the sequence gets longer, the gradients/derivatives passed back to the earlier states become smaller and smaller. There are many solutions to this problem. One of them is using an LSTM.
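A small numeric sketch can make the shrinking visible. It assumes a toy scalar RNN, `h_t = tanh(w * h_{t-1} + x_t)`, so that the derivative of each state with respect to the previous one is `w * (1 - h_t ** 2)`; the weight `0.9` and the random inputs are arbitrary illustration values.

```python
import numpy as np

def gradient_through_time(w, xs):
    """Track |d h_t / d h_0| for a scalar tanh RNN, one value per time step."""
    h = 0.0
    grad = 1.0
    magnitudes = []
    for x_t in xs:
        h = np.tanh(w * h + x_t)
        # Chain rule: multiply by the local derivative d h_t / d h_{t-1}
        grad *= w * (1.0 - h ** 2)
        magnitudes.append(abs(grad))
    return magnitudes

rng = np.random.default_rng(0)
xs = rng.normal(size=50)          # 50 time steps of toy input
grads = gradient_through_time(w=0.9, xs=xs)
print(grads[0], grads[9], grads[-1])
```

Every factor in the product has magnitude at most 0.9 here, so after 50 steps the gradient reaching the first state is at most `0.9 ** 50`, which is roughly 0.005.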


### What is an LSTM?
Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. All RNNs have the form of a chain of repeating neural network modules. LSTMs also have this chain-like structure, but instead of a plain hidden layer we have something called an LSTM cell, and there is another connection that runs through all the time steps along with the hidden state. This is called the "cell state" vector, to which information can be added and from which it can be removed as and when required.

<img src="https://i.imgur.com/utWg9yZ.png"/>

Let's look at the 6 steps:

1. This is the forget gate, which controls how much of the previous cell state is kept. Since it passes through a sigmoid function, it outputs a value between 0 and 1, which is the fraction of memory to be retained.
2. This is the input gate, which controls how much new information is added to the cell state. Like the forget gate, it outputs a value between 0 and 1, which is the fraction of new memory to be added.
3. This is the creation of the new candidate cell state vector.
4. This is where the cell state is updated as a combination of the previous cell state and the candidate cell state; the contribution of each is controlled by the forget gate and the input gate respectively.
5. This is the output gate, which controls what part of the updated cell state is exposed in the hidden state, with values between 0 and 1.
6. This is the updated hidden state, which is the input to the next cell and is based on the cell state, controlled by the output gate.
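The six steps above can be sketched as a single LSTM time step in NumPy. This assumes the figure follows the standard LSTM equations; the parameter layout (dictionaries `W`, `U`, `b` keyed by gate) is a hypothetical choice made for readability and is not how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'f', 'i', 'c', 'o'."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # 1. forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # 2. input gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # 3. candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # 4. cell state update
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # 5. output gate
    h_t = o_t * np.tanh(c_t)                                    # 6. new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 4, 3  # toy sizes (assumed)
W = {k: rng.normal(size=(units, input_dim)) for k in "fico"}
U = {k: rng.normal(size=(units, units)) for k in "fico"}
b = {k: np.zeros(units) for k in "fico"}

h_t, c_t = lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units), W, U, b)
print(h_t.shape, c_t.shape)  # (3,) (3,)
```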

For a deeper dive into LSTMs, this is an awesome read: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

### How does LSTM solve vanishing gradients?
* The LSTM architecture makes it easier for the RNN to preserve information over many timesteps, e.g. if the forget gate is set to remember everything on every timestep, then the info in the cell is preserved indefinitely
* By contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix Wh that preserves info in the hidden state
* LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
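The first two bullets can be illustrated with a toy scalar comparison. This is only a sketch under hand-picked assumptions: the forget gate is fixed at 1 and the input gate at 0 (a trained LSTM learns these values), and the vanilla RNN uses a recurrent weight of 0.9 with no input.

```python
import math

cell_state = 0.8     # LSTM cell state carrying some stored information
hidden_state = 0.8   # vanilla RNN hidden state starting from the same value

for _ in range(100):
    # LSTM update with forget gate = 1 and input gate = 0: nothing is erased or added
    forget_gate, input_gate, candidate = 1.0, 0.0, 0.5
    cell_state = forget_gate * cell_state + input_gate * candidate
    # Vanilla RNN update: the state is squashed and rescaled at every step
    hidden_state = math.tanh(0.9 * hidden_state)

print(cell_state)    # 0.8
print(hidden_state)  # decays towards 0
```

The additive cell-state update is what lets the stored value survive unchanged, whereas the vanilla RNN state has shrunk to almost nothing after the same 100 steps.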

Before the birth of Transformers, LSTMs ruled the world of NLP. Even today they are used in many places.

**2015:**
Google started using an LSTM for speech recognition on Google Voice. According to the official blog post, the new model cut transcription errors by 49%.

**2016:**
Google started using an LSTM to suggest messages in the Allo conversation app. In the same year, Google released the Google Neural Machine Translation system for Google Translate which used LSTMs to reduce translation errors by 60%.

Apple announced in its Worldwide Developers Conference that it would start using the LSTM for quicktype in the iPhone and for Siri.

Amazon released Polly, which generates the voices behind Alexa, using a bidirectional LSTM for the text-to-speech technology.

**2017:**
Facebook performed some 4.5 billion automatic translations every day using long short-term memory networks.

Enough context, let's dive into the code 👨‍💻

## 1. Importing Libraries <a class="anchor" id="1"></a>

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#sklearn libraries
from sklearn.model_selection import train_test_split
#keras libraries
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Input
#Constants
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000 
EMBEDDING_DIM = 100 
VALIDATION_SPLIT = 0.2

## 2. Reading Dataset <a class="anchor" id="2"></a>
All the datasets are provided as zipped files. First we will have to unzip them and then read them into dataframes

In [2]:
#Unzipping all the zip archives into /kaggle/working and redirecting the verbose output to /dev/null to keep it quiet
# -o for overwrite -d for destination directory of unzipped file
!unzip -o '/kaggle/input/jigsaw-toxic-comment-classification-challenge/*.zip' -d /kaggle/working > /dev/null

#Reading input csv files
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")

print(df_train.shape, df_test.shape, sample_submission.shape)
df_train.head()


4 archives were successfully processed.
(159571, 8) (153164, 2) (153164, 7)


| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [3]:
#Assigning the comment texts to arrays of strings
train_texts= df_train.comment_text.values
test_texts= df_test.comment_text.values
#Assigning the labels to a separate dataframe
train_labels = df_train[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]]
#Printing the first comment text
print("First comment text in training set:\n\n", train_texts[0])

First comment text in training set:

 Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27


## 3. Text Preprocessing <a class="anchor" id="3"></a>
The preprocessing for the LSTM model is pretty much the same as for the CNN one. We use the Tokenizer class from Keras to turn each string into a sequence of integers by mapping each word to a number based on its frequency. We also use pad_sequences from Keras to pad the tokenized sequences so that they all have the same length, since the network, be it a CNN or an LSTM, expects a fixed-size input each time for vectorized calculations. For a more elaborate read, I would recommend looking at the earlier notebook (https://www.kaggle.com/anirbansen3027/jtcc-cnn#3.-Text-Preprocessing)

*We will follow these steps going ahead for Multi-Label text classification using LSTM:*

**Input String -> Tokenization -> Padding -> Embedding -> LSTM -> Classifier**
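Before running the real Keras utilities, here is a simplified pure-Python sketch of what the tokenization and padding steps do. It is not the Keras implementation: it only lowercases and splits on whitespace (the Keras Tokenizer also strips punctuation and treats `num_words` slightly differently), and the helper names `build_word_index`, `texts_to_sequences` and `pad_pre` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Map the most frequent words to integers 1..num_words (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:num_words]
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer; words outside the vocabulary are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_pre(sequence, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros up to maxlen."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts, num_words=4)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_pre(seq, maxlen=5) for seq in sequences]

print(word_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4}
print(sequences)   # [[1, 3, 2], [1, 4, 2, 1]]
print(padded)      # [[0, 0, 1, 3, 2], [0, 1, 4, 2, 1]]
```

Padding at the front mirrors the default behaviour of `pad_sequences`, which is why the zeros appear at the start of the padded arrays in the outputs further down.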

In [4]:
#Initializing the class
tokenizer = Tokenizer(num_words = MAX_NUM_WORDS)
#Updates internal vocabulary based on a list of texts.
tokenizer.fit_on_texts(train_texts)
#Transforms each text in texts to a sequence of integers.
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)
word_index = tokenizer.word_index
print("Length of word Index:", len(word_index))
print("First 5 elements in the word_index dictionary:", dict(list(word_index.items())[0: 5]) )
print("First comment text in training set:\n", train_sequences[0])

Length of word Index: 210337
First 5 elements in the word_index dictionary: {'the': 1, 'to': 2, 'of': 3, 'and': 4, 'a': 5}
First comment text in training set:
 [688, 75, 1, 126, 130, 177, 29, 672, 4511, 12052, 1116, 86, 331, 51, 2278, 11448, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 15190, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985]


In [5]:
#Pad tokenized sequences
trainvalid_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print("Shape of padded sequence list:\n", trainvalid_data.shape)
print("First comment text in training set - 0 for padding - only last 50 sequences as the rest are paddings:\n", trainvalid_data[0][-50:])

Shape of padded sequence list:
 (159571, 1000)
First comment text in training set - 0 for padding - only last 50 sequences as the rest are paddings:
 [    0     0     0   688    75     1   126   130   177    29   672  4511
 12052  1116    86   331    51  2278 11448    50  6864    15    60  2756
   148     7  2937    34   117  1221 15190  2825     4    45    59   244
     1   365    31     1    38    27   143    73  3462    89  3085  4583
  2273   985]


## 4. Defining a Multi-Label LSTM model <a class="anchor" id="4"></a>

In Keras, the easiest way to define a model is to instantiate the Sequential model class and keep adding the required layers. A Sequential model is a plain stack of layers where each layer has exactly one input tensor and one output tensor.

In this NN model, a new parameter called dropout is being used:

**Dropout**

Dropout is a technique for addressing the problem of overfitting. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.
A new hyperparameter is introduced that specifies the probability at which outputs of the layer are dropped out.

Recurrent dropout masks (or "drops") the connections between the recurrent units.
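As a rough illustration of the idea, here is a NumPy sketch of "inverted" dropout applied to a vector of activations. The helper name `dropout` and the toy input are ours; the rescaling by `1 / (1 - rate)` keeps the expected activation unchanged during training.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out each unit with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)               # toy layer output (assumed)
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # fraction of dropped units, close to 0.2
print(dropped.mean())         # close to 1.0 thanks to the rescaling
```

At prediction time no units are dropped, so nothing like this mask is applied.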

#### Important Note: In general,

* **For binary classification, we can have 1 output unit, use sigmoid activation in the output layer and use binary cross-entropy loss**
* **For multi-class classification, we can have N output units, use softmax activation in the output layer and use categorical cross-entropy loss**
* **For multi-label classification, we can have N output units, use sigmoid activation in the output layer and use binary cross-entropy loss**
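The difference between the two activations is easy to see on a toy example. The sketch below uses made-up logits for the 6 labels: softmax forces the outputs to compete and sum to 1, while sigmoid scores each label independently, which is what we want when a comment can be both toxic and insulting at the same time.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, -3.0, 1.0, -2.0])  # made-up scores for the 6 labels
labels = np.array([1, 0, 1, 0, 0, 0])                 # made-up multi-label target

# Sigmoid: one independent probability per label
sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))

# Softmax: a single distribution over mutually exclusive classes
exp_logits = np.exp(logits - logits.max())
softmax_probs = exp_logits / exp_logits.sum()

# Binary cross-entropy, averaged over the 6 labels
bce = -np.mean(labels * np.log(sigmoid_probs)
               + (1 - labels) * np.log(1 - sigmoid_probs))

print(sigmoid_probs.sum())  # can exceed 1: several labels can be "on" together
print(softmax_probs.sum())  # sums to 1 (up to rounding)
print(bce)
```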

In [6]:
rnn_model = Sequential()
rnn_model.add(Embedding(MAX_NUM_WORDS, 128))
rnn_model.add(LSTM(units = 128, dropout = 0.2, recurrent_dropout = 0.2))
rnn_model.add(Dense(units = 6, activation = 'sigmoid'))
print(rnn_model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 6)                 774       
Total params: 2,692,358
Trainable params: 2,692,358
Non-trainable params: 0
_________________________________________________________________
None
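The parameter counts in the summary can be checked by hand. The sketch below assumes the standard layout of these layers: one embedding vector per word, four gates in the LSTM (each with input weights, recurrent weights and a bias), and a dense layer with weights plus biases.

```python
vocab_size = 20000     # MAX_NUM_WORDS
embedding_dim = 128    # output size of the Embedding layer
lstm_units = 128
num_labels = 6

# One embedding_dim-sized vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Weights from every LSTM unit to every label, plus one bias per label
dense_params = lstm_units * num_labels + num_labels

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 131584 774 2692358
```

Most of the parameters sit in the embedding layer, so the vocabulary size is the main lever on the size of this model.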


## 5. Compile and train the LSTM model <a class="anchor" id="5"></a>
The compiling and training/fitting code is also pretty much same as the CNN model.

Before starting to train the model, we need to configure it. We need to specify the loss function, which is used to calculate the error at each iteration; the optimizer, which determines how the weights are updated; and the metrics, which are evaluated by the model during training and testing.

While fitting/ training the model, along with the training set we also pass the following parameters:

batch_size = Number of samples that go through the network at a time before the network parameters are updated from the computed loss (as in mini-batch gradient descent)

epochs = Number of times the whole set of training samples goes through the network

validation_data = the dataset that will be used to evaluate the loss and any model metrics at the end of each epoch. This set will not be used for training.
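To get a feel for these numbers, here is a small arithmetic sketch of how many weight updates one epoch involves. It uses the training split size printed by the next cell (119,678 rows) and assumes the final, smaller batch is kept rather than dropped.

```python
import math

train_rows = 119678   # size of the training split printed below
batch_size = 128
epochs = 1

# Each step processes one batch and performs one weight update
steps_per_epoch = math.ceil(train_rows / batch_size)
last_batch_rows = train_rows - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_rows, total_updates)  # 935 126 935
```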

In [7]:
#Configures the model for training.
rnn_model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["AUC"])

#Split the dataset into train and validation sets for training and evaluating the model (train_test_split holds out 25% by default when test_size is not given)
X_train, X_val, y_train, y_val = train_test_split(trainvalid_data, train_labels, shuffle = True, random_state = 123)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

#Trains the model for a fixed number of epochs (iterations on a dataset)
history = rnn_model.fit(X_train, y_train, batch_size = 128, epochs = 1, validation_data = (X_val, y_val))

(119678, 1000) (119678, 6) (39893, 1000) (39893, 6)
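Since we track AUC, it can help to see how a per-label AUC and its average over the labels are computed. The sketch below is a plain NumPy version on made-up predictions for two labels, assuming the score of interest is the mean column-wise ROC AUC; in practice one would more likely call `sklearn.metrics.roc_auc_score` on `y_val` and the model's validation predictions.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties between a positive and a negative count as half a correct pair
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def mean_columnwise_auc(y_true, y_score):
    """Average the per-label AUCs over all label columns."""
    return float(np.mean([roc_auc(y_true[:, j], y_score[:, j])
                          for j in range(y_true.shape[1])]))

# Made-up labels and predicted probabilities for 4 comments and 2 labels
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]])

print(mean_columnwise_auc(y_true, y_score))  # 0.875
```

The pairwise comparison is quadratic in the number of rows, so it is only meant for small illustrations like this one.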


## 6. Predicting and Submitting for Test Data <a class="anchor" id="6"></a>

In [8]:
# Merging the test dataset with sample_submission to have all the columns:
#id,text_data and the target variables in one dataframe
df_test = pd.merge(df_test, sample_submission, on = "id")
#Use the LSTM model to output probabilities on test data
y_preds = rnn_model.predict(test_data)
#Assign the predictions by the model in the final test dataset
df_test[["toxic","severe_toxic","obscene","threat","insult","identity_hate"]] = y_preds
#Drop the comment text, as the sample submission doesn't have it and it isn't expected
df_test.drop(["comment_text"], axis = 1, inplace = True)
#Save the dataset as a csv to submit it
df_test.to_csv("sample_submission.csv", index = False)

## 7. TODOs <a class="anchor" id="7"></a>
* Stack more LSTM layers
* Tune the hyperparameters

Do upvote if you find it helpful 😁