# Understanding Advanced RNN Units

In this implementation, we look at advanced RNN units such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), and compare an LSTM against a vanilla RNN.


LSTM units have an intuitive structure. An LSTM has two internal states, the hidden state and the cell state, whereas a vanilla RNN has only a hidden state. The cell state acts like a conveyor belt that runs along the top of the unit, as shown in the diagram. It is tightly regulated by the gates attached to it: gates control how much information is let through. An LSTM has three gates to control the information flow.

![](figures/LSTM.png) 


Figure: The gates present in an LSTM unit.


**Forget gate**: It regulates how much of the previous cell state is kept. A sigmoid layer looks at the input $x_t$ and the previous hidden state $h_{t-1}$ and outputs a value between 0 and 1 for each element of the cell state. An output of 1 means let everything through, and 0 means let nothing through.

$$ f_t = \sigma_g (W_f \bullet [h_{t-1}, x_t] + b_f) $$

What to keep and what to discard is learned gradually through the weights and bias attached to the forget gate.

**Input gate**: Next is the input gate, which decides what new information is stored in the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that produces the candidate values $\widetilde{C}_t$. The input gate is defined by the following equations:

$$i_t = \sigma_g (W_i \bullet [h_{t-1}, x_t] + b_i) \\
\widetilde{C}_t = \tanh(W_c \bullet [h_{t-1}, x_t] + b_c) $$

**Output gate**: It decides what information is exposed as the new hidden state, based on the cell state and the previous hidden state. A sigmoid layer decides which parts of the cell state go to the output, and a tanh squashes the cell state $c_t$ to the range $(-1, 1)$ before it is multiplied by the gate. The output gate can be written as follows:

$$o_t = \sigma_g (W_o \bullet [h_{t-1}, x_t] + b_o) \\
h_t = o_t \circ \tanh(c_t) $$

The information controlled by the forget and input gates is then merged into the cell state, as shown in the following equation:

$$c_t = f_t \circ c_{t-1} + i_t \circ \widetilde C_t $$
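To make the gate equations concrete, here is a minimal sketch of a single LSTM time step written directly from them. The function name `lstm_step`, the dictionary layout of the weights, and the tiny dimensions are assumptions made for illustration; this is not how `nn.LSTM` is implemented internally.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.

    W[k] has shape (hidden, hidden + input) and b[k] has shape (hidden,)
    for k in {"f", "i", "c", "o"} (hypothetical layout for illustration).
    """
    z = torch.cat([h_prev, x_t], dim=-1)          # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W["f"].T + b["f"])    # forget gate
    i_t = torch.sigmoid(z @ W["i"].T + b["i"])    # input gate
    c_tilde = torch.tanh(z @ W["c"].T + b["c"])   # candidate cell state
    o_t = torch.sigmoid(z @ W["o"].T + b["o"])    # output gate
    c_t = f_t * c_prev + i_t * c_tilde            # cell state update
    h_t = o_t * torch.tanh(c_t)                   # new hidden state
    return h_t, c_t

torch.manual_seed(0)
input_dim, hidden_dim, batch = 4, 3, 2
W = {k: torch.randn(hidden_dim, hidden_dim + input_dim) for k in "fico"}
b = {k: torch.zeros(hidden_dim) for k in "fico"}

x_t = torch.randn(batch, input_dim)
h_prev = torch.zeros(batch, hidden_dim)
c_prev = torch.zeros(batch, hidden_dim)

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(h_t.shape, c_t.shape)
```

Because the hidden state is a sigmoid output multiplied by a tanh output, every entry of `h_t` stays within $(-1, 1)$, while the cell state itself is not bounded in that way across many steps.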

An LSTM is simple to use in PyTorch. PyTorch provides the `nn.LSTM` module, which takes the same input shape as described for the vanilla RNN. It is used in the *Training using LSTM* section later in this notebook.
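As a quick illustration of the input and output shapes, the sketch below runs a random batch through `nn.LSTM`. The sizes are arbitrary values chosen for the example and are not the ones used in the experiment.

```python
import torch
from torch import nn

seq_len, batch_size, input_dim, hidden_dim = 5, 3, 10, 8   # illustrative sizes

lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)

# Default layout (batch_first=False): (seq_len, batch, input_size)
x = torch.randn(seq_len, batch_size, input_dim)

# The LSTM returns the per-step outputs plus the final hidden and cell states
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([5, 3, 8])
print(h_n.shape)     # torch.Size([1, 3, 8])
print(c_n.shape)     # torch.Size([1, 3, 8])
```

Unlike `nn.RNN`, which returns a single hidden state, `nn.LSTM` returns a tuple of hidden state and cell state, reflecting the two internal states described above.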


# Importing Requirements

In [None]:

import json
import os
import random
import tarfile
import urllib.request
import zipfile

import matplotlib.pyplot as plt
import nltk
import torch
from torch import nn, optim
from torchtext import data
from torchtext import vocab
from tqdm import tqdm

nltk.download('popular')
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


# Downloading required datasets
To demonstrate how these units behave in practice, we conduct an experiment on a sentiment analysis task. I have used a movie review dataset with 5331 positive and 5331 negative processed sentences. The entire experiment is divided into 5 sections.

Downloading the dataset: the dataset described above is available at http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz.



In [None]:
os.makedirs('data', exist_ok=True)  # make sure the target directory exists
data_exists = os.path.isfile('data/rt-polaritydata.tar.gz')
if not data_exists:
    urllib.request.urlretrieve(
        "http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz",
        "data/rt-polaritydata.tar.gz")
    # Extract the archive into data/rt-polaritydata/
    with tarfile.open("data/rt-polaritydata.tar.gz") as tar:
        tar.extractall(path='data/')

# Downloading embedding
Pre-trained embeddings are available and can easily be used in our model. We will use the FastText vectors trained on the wiki-news corpus.

In [None]:
embed_exists = os.path.isfile('../embeddings/wiki-news-300d-1M.vec.zip')
if embed_exists:
    print("FastText embeddings exist; if they were not downloaded properly, delete `../embeddings/wiki-news-300d-1M.vec.zip` and rerun.")
else:
    # Download and unpack the embeddings only when they are missing
    os.makedirs('../embeddings', exist_ok=True)
    urllib.request.urlretrieve("https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip","../embeddings/wiki-news-300d-1M.vec.zip")
    with zipfile.ZipFile("../embeddings/wiki-news-300d-1M.vec.zip", 'r') as zip_ref:
        zip_ref.extractall("../embeddings/")

# Preprocessing
I am using TorchText to preprocess the downloaded data. The preprocessing includes the following steps:

- Reading and parsing the data
- Defining the sentiment and label fields
- Dividing the data into train, validation and test subsets
- Forming the train, validation and test iterators

In [None]:
SEED = 1
split = 0.80

In [None]:
data_block = []
# Read the negative reviews (label 0); the file is closed automatically
with open('data/rt-polaritydata/rt-polarity.neg', encoding='utf8', errors='ignore') as f:
    negative_data = f.read().splitlines()
for line in negative_data:
    data_block.append({"sentiment": line.strip(), "label": 0})
# Read the positive reviews (label 1)
with open('data/rt-polaritydata/rt-polarity.pos', encoding='utf8', errors='ignore') as f:
    positive_data = f.read().splitlines()
for line in positive_data:
    data_block.append({"sentiment": line.strip(), "label": 1})

In [None]:
random.seed(SEED)  # make the shuffle reproducible
random.shuffle(data_block)

n_train = int(len(data_block) * split)
# Write one JSON object per line; `with` closes (and flushes) the files before TorchText reads them
with open('data/train.json', 'w') as train_file:
    for example in data_block[:n_train]:
        train_file.write(json.dumps(example) + "\n")
with open('data/test.json', 'w') as test_file:
    for example in data_block[n_train:]:
        test_file.write(json.dumps(example) + "\n")

In [None]:
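def tokenize(sentiments):
    # Identity placeholder; it is not passed to the fields defined below
    return sentiments

def pad_to_equal(x):
    # Pad or truncate every token list to exactly 61 tokens
    if len(x) < 61:
        return x + ['<pad>'] * (61 - len(x))
    return x[:61]

def to_categorical(x):
    # One-hot encode the binary label: 0 -> [1, 0], 1 -> [0, 1]
    if x == 1:
        return [0, 1]
    if x == 0:
        return [1, 0]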
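As a quick sanity check of the padding logic, the sketch below applies the same pad-or-truncate rule to a short and a long token list. The name `pad_or_truncate` and its `max_len` and `pad_token` parameters are hypothetical, added only so the example is self-contained; the notebook itself uses `pad_to_equal` with the length fixed at 61.

```python
def pad_or_truncate(tokens, max_len=61, pad_token='<pad>'):
    """Return a copy of `tokens` that is exactly `max_len` items long."""
    if len(tokens) < max_len:
        return tokens + [pad_token] * (max_len - len(tokens))
    return tokens[:max_len]

short = "a gripping and funny film".split()   # 5 tokens
long = ["word"] * 80                          # 80 tokens

padded = pad_or_truncate(short)
clipped = pad_or_truncate(long)

print(len(padded), padded[:7])
print(len(clipped))
```

Fixing every sequence to the same length lets all examples in a batch be stacked into one tensor, at the cost of spending computation on `<pad>` tokens for short sentences.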
    

In [None]:
SENTIMENT = data.Field(sequential=True, preprocessing=pad_to_equal, use_vocab=True, lower=True)
LABEL = data.Field(is_target=True, use_vocab=False, sequential=False, preprocessing=to_categorical)
fields = {'sentiment': ('sentiment', SENTIMENT), 'label': ('label', LABEL)}

In [None]:
train_data , test_data = data.TabularDataset.splits(
                            path = 'data',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields                                
)

In [None]:
print("Printing an example data : ",vars(train_data[1]))

**Splitting the training data into train and validation sets**

In [None]:
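# random.seed() returns None, so seed first and pass the resulting generator state
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())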

In [None]:
print('Number of training examples: ', len(train_data))
print('Number of validation examples: ', len(valid_data))
print('Number of testing examples: ',len(test_data))

**Loading Embedding to vocab**

In [None]:
# Load the FastText vectors extracted from the archive downloaded above
vec = vocab.Vectors(name="wiki-news-300d-1M.vec", cache="../embeddings/")

In [None]:
SENTIMENT.build_vocab(train_data, valid_data, test_data, max_size=100000, vectors=vec)

**Constructing Iterators**

In [None]:
train_iter, val_iter, test_iter = data.Iterator.splits(
        (train_data, valid_data, test_data), sort_key=lambda x: len(x.sentiment),
        batch_sizes=(32, 32, 32), device=device)

In [None]:
sentiment_vocab = SENTIMENT.vocab

In [None]:
sentiment_vocab.vectors.shape

# Training
Training will be conducted for two models: one with a vanilla RNN and one with an LSTM. I am using FastText embeddings trained on a Wikipedia news corpus, with a vector size of 300.

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8.
    Both `preds` and `y` have shape [batch, 2]; the argmax over dim 1 is the class index.
    """
    rounded_preds = torch.argmax(preds, dim=1)
    correct = (rounded_preds == torch.argmax(y, dim=1)).float()  # convert to float for division
    acc = correct.sum() / len(correct)
    return acc
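The accuracy computation can be checked by hand on a tiny batch. The sketch below re-implements the same argmax comparison under the hypothetical name `batch_accuracy` so that it runs on its own; the scores and labels are made-up values.

```python
import torch

def batch_accuracy(preds, y):
    """Fraction of rows where the predicted class matches the one-hot target."""
    predicted = torch.argmax(preds, dim=1)
    target = torch.argmax(y, dim=1)
    return (predicted == target).float().mean()

# Four examples, two classes; rows are per-class scores
scores = torch.tensor([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 1.5],
                       [1.2, 0.4]])
# One-hot targets: class 0, class 0, class 1, class 0
labels = torch.tensor([[1, 0],
                       [1, 0],
                       [0, 1],
                       [1, 0]])

acc = batch_accuracy(scores, labels)
print(acc.item())  # 0.75
```

The predicted classes are 0, 1, 1, 0 and the targets are 0, 0, 1, 0, so three of the four predictions match.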

## Training using Vanilla RNN

In [None]:
class VANILA_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, sentiment_vocab):
        super(VANILA_RNN, self).__init__()

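        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Initialise the embedding layer with the pre-trained vectors from the vocabulary
        self.embedding.weight.data.copy_(sentiment_vocab.vectors)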
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))

        output, hidden = self.rnn(embedded)
        # hidden: [n_layers * num_directions, batch, hidden_dim]
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden states
        # of the last layer and apply dropout

        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # Return raw logits: nn.BCEWithLogitsLoss applies the sigmoid itself
        return self.fc(hidden)
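The indexing `hidden[-2]` and `hidden[-1]` relies on how PyTorch orders the final hidden states of a stacked bidirectional RNN. The sketch below checks the shapes on random data; the sequence length and batch size are arbitrary illustrative values.

```python
import torch
from torch import nn

seq_len, batch_size, emb_dim, hidden_dim, n_layers = 7, 3, 300, 256, 2

rnn = nn.RNN(emb_dim, hidden_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, emb_dim)   # (seq_len, batch, features)

output, h_n = rnn(x)

# h_n stacks (layer, direction) pairs: 2 layers x 2 directions = 4 entries
print(h_n.shape)      # torch.Size([4, 3, 256])

# Last layer: h_n[-2] is the forward direction, h_n[-1] the backward direction
final = torch.cat((h_n[-2], h_n[-1]), dim=1)
print(final.shape)    # torch.Size([3, 512])
```

The concatenated vector has size `hidden_dim * 2`, which is why the linear layer in the model is declared with `hidden_dim * 2` input features.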

In [None]:
INPUT_DIM = len(SENTIMENT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 2
BATCH_SIZE = 32
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

vanila_rnn = VANILA_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, sentiment_vocab)
vanila_rnn = vanila_rnn.to(device)

In [None]:
optimizer = optim.SGD(vanila_rnn.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)
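`nn.BCEWithLogitsLoss` applies a sigmoid internally, so it expects raw, unnormalised scores rather than probabilities. The following sketch, using made-up scores and targets, shows that it gives the same value as applying `torch.sigmoid` followed by `nn.BCELoss`.

```python
import torch
from torch import nn

# Raw scores for two examples and two output units (illustrative values)
logits = torch.tensor([[1.5, -0.5],
                       [-2.0, 0.7]])
# One-hot targets as floats, matching the label encoding used in this notebook
targets = torch.tensor([[1.0, 0.0],
                        [0.0, 1.0]])

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_from_logits.item(), loss_from_probs.item())
```

Passing outputs that have already been squashed by a softmax or sigmoid into this criterion would squash them a second time, which compresses the range of the loss and can slow learning.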

In [None]:
def train(model, iterator, optimizer, criterion):
    """Run one training epoch and return the mean loss and accuracy per batch."""
    epoch_loss = 0
    epoch_acc = 0
    model.train()

    for batch in iterator:
        optimizer.zero_grad()
        # Keep inputs, predictions and labels on the same device as the model
        predictions = model(batch.sentiment.to(device))
        labels = batch.label.float().to(device)
        loss = criterion(predictions, labels)
        acc = binary_accuracy(predictions, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
rnn_loss = []
rnn_accuracy = []
for i in tqdm(range(0,100)):
    loss, accuracy =  train(vanila_rnn, train_iter, optimizer, criterion)
    print("Loss : ",loss, "Accuracy : ", accuracy )
    rnn_loss.append(loss)
    rnn_accuracy.append(accuracy)

## Training using LSTM

In [None]:
class LSTM_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, sentiment_vocab):
        super(LSTM_RNN, self).__init__()

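        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Initialise the embedding layer with the pre-trained vectors from the vocabulary
        self.embedding.weight.data.copy_(sentiment_vocab.vectors)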
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell)= self.rnn(embedded)
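        # hidden: [n_layers * num_directions, batch, hidden_dim]
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden states
        # of the last layer and apply dropout

        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # Return raw logits: nn.BCEWithLogitsLoss applies the sigmoid itself
        return self.fc(hidden)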

In [None]:
INPUT_DIM = len(SENTIMENT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 2
BATCH_SIZE = 32
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

lstm_rnn = LSTM_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, sentiment_vocab)
lstm_rnn = lstm_rnn.to(device)

In [None]:
optimizer = optim.SGD(lstm_rnn.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

In [None]:
lstm_loss = []
lstm_accuracy = []
for i in tqdm(range(0,100)):
    loss, accuracy =  train(lstm_rnn, train_iter, optimizer, criterion)
    print("Loss : ",loss, "Accuracy : ", accuracy )
    lstm_loss.append(loss)
    lstm_accuracy.append(accuracy)

## Comparison
When the sentiment analysis experiment was run for 100 epochs, I found that the LSTM performed considerably better than the vanilla RNN.

![](figures/LSTM_RNN.png)

Figure: Difference in accuracy between the LSTM and the vanilla RNN when used for text classification.

Training accuracy was above 95% with the LSTM and around 70% with the vanilla RNN.


In [None]:
plt.plot(rnn_accuracy , label = "RNN Accuracy")
plt.plot(lstm_accuracy , label = "LSTM Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(loc='upper left')
plt.show()
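The accuracies plotted above are measured on the training data. To check how well a model generalises, the same metric can be computed on held-out batches without updating the weights. The following is a minimal sketch under stated assumptions: the `evaluate` function, the toy linear model, and the random batches are hypothetical stand-ins so that the example runs on its own, and they are not part of the original experiment.

```python
import torch
from torch import nn

def evaluate(model, batches, criterion):
    """Average loss and accuracy over `batches` without updating weights."""
    model.eval()                      # disable dropout
    total_loss, total_acc, n = 0.0, 0.0, 0
    with torch.no_grad():             # no gradients needed for evaluation
        for inputs, labels in batches:
            logits = model(inputs)
            total_loss += criterion(logits, labels).item()
            correct = logits.argmax(dim=1) == labels.argmax(dim=1)
            total_acc += correct.float().mean().item()
            n += 1
    return total_loss / n, total_acc / n

# Stand-in model and data so the sketch runs on its own (hypothetical)
torch.manual_seed(0)
toy_model = nn.Linear(5, 2)
toy_batches = [(torch.randn(8, 5), torch.eye(2)[torch.randint(0, 2, (8,))])
               for _ in range(3)]

loss, acc = evaluate(toy_model, toy_batches, nn.BCEWithLogitsLoss())
print(loss, acc)
```

With the iterators built earlier in this notebook, the batches could be supplied as pairs such as `(b.sentiment.to(device), b.label.float().to(device))` for each `b` in `val_iter` or `test_iter`, assuming the model returns raw scores of shape `[batch, 2]`.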
