# Homework 3: Recurrent Neural Networks (100 points)

### Overview

We now move from image recognition to natural language processing. For this assignment, we will work on a common sentiment analysis task using the IMDB dataset. The dataset consists of review-label pairs, and the task is to predict whether a given movie review is positive or negative, which is a binary classification problem.

### RNN Architecture

You will be comparing four different recurrent neural network architectures: a standard RNN, a Gated Recurrent Unit (GRU), a standard Long Short-Term Memory (LSTM), and a bidirectional LSTM. 

Note that a GRU/LSTM cell _is_ an RNN cell, but we will refer to an RNN in the code and questions below as the most basic RNN.

### Your Task

At the bottom of this notebook file, there are three short answer questions testing your understanding of this neural network architecture. 

Below each question is a cell with the text “Type Markdown and LaTeX.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a `.html` file by choosing File - Download as - html (.html). You will be submitting this `.html` file to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import pickle

In [2]:
torch.manual_seed(0)
torch.set_num_threads(4)
torch.set_num_interop_threads(4)

In [3]:
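root_dir = 'assets_week3'
# Load the pretrained vocabulary vectors and the pickled train/test iterators,
# closing each file handle once it has been read.
with open(root_dir + '/reviewVocabVectors', 'rb') as f:
    reviewVocabVectors = pickle.load(f)
with open(root_dir + '/trainIterator', 'rb') as f:
    trainIterator = pickle.load(f)
with open(root_dir + '/testIterator', 'rb') as f:
    testIterator = pickle.load(f)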

In [4]:
embeddingSize = 100
hiddenSize = 10
dropoutRate = 0.5
numEpochs = 5
vocabSize = 20002
pad = 1
unk = 0

class MyRNN(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.name = model
        self.LSTM = (model == 'LSTM' or model == 'BiLSTM')
        self.bidir = (model == 'BiLSTM')
        
        self.embed = nn.Embedding(vocabSize, embeddingSize, padding_idx = pad)
        
        if model == 'RNN': 
            self.rnn = nn.RNN(embeddingSize, hiddenSize)
        elif model == 'GRU': 
            self.rnn = nn.GRU(embeddingSize, hiddenSize)
        else: 
            self.rnn = nn.LSTM(embeddingSize, hiddenSize, bidirectional=self.bidir)

        self.dense = nn.Linear(hiddenSize * (2 if self.bidir else 1), 1)
        self.dropout = nn.Dropout(dropoutRate)
        
    def forward(self, text, textLengths):
        embedded = self.dropout(self.embed(text))
        
        packedEmbedded = nn.utils.rnn.pack_padded_sequence(embedded, textLengths)
        if self.LSTM: 
            packedOutput, (hidden, cell) = self.rnn(packedEmbedded)
        else: 
            packedOutput, hidden = self.rnn(packedEmbedded)

        output, outputLengths = nn.utils.rnn.pad_packed_sequence(packedOutput)
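        # A bidirectional LSTM returns one final hidden state per direction;
        # concatenate the forward and backward states along the feature dimension.
        if self.bidir: 
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        else: 
            # Single layer, single direction: take the only final hidden state.
            hidden = hidden[0]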

        return self.dense(self.dropout(hidden))

In [5]:
basicRNN = MyRNN(model='RNN')
GRU = MyRNN(model='GRU') # Construct a GRU model, as above
LSTM = MyRNN(model='LSTM') # Construct a LSTM model, as above
biLSTM = MyRNN(model='BiLSTM') # Construct a BiLSTM model, as above
models = [basicRNN, GRU, LSTM, biLSTM]

In [6]:
for model in models:
    if model is None:
        continue
    model.embed.weight.data.copy_(reviewVocabVectors)
    model.embed.weight.data[unk] = torch.zeros(embeddingSize)
    model.embed.weight.data[pad] = torch.zeros(embeddingSize)

In [7]:
criterion = nn.BCEWithLogitsLoss()

def batchAccuracy(preds, targets):
    # preds are raw logits, so logit >= 0 is equivalent to sigmoid(logit) >= 0.5.
    roundedPreds = (preds >= 0)
    return (roundedPreds == targets).sum().item() / len(preds)

In [8]:
# Training

for model in models: 
    if model is not None:
        model.train()

for model in models:
    if model is None:
        continue
        
    torch.manual_seed(0)
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(numEpochs):
        epochLoss = 0
        for batch in trainIterator:
            optimizer.zero_grad()
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            loss.backward()
            optimizer.step()
            epochLoss += loss.item()
        print(f'Model: {model.name}, Epoch: {epoch + 1}, Train Loss: {epochLoss / len(trainIterator)}')
    print()

Model: RNN, Epoch: 1, Train Loss: 0.7023946511775941
Model: RNN, Epoch: 2, Train Loss: 0.6915372308257901
Model: RNN, Epoch: 3, Train Loss: 0.6850547656378783
Model: RNN, Epoch: 4, Train Loss: 0.6643099472345904
Model: RNN, Epoch: 5, Train Loss: 0.6322632445703686

Model: GRU, Epoch: 1, Train Loss: 0.7004434326115776
Model: GRU, Epoch: 2, Train Loss: 0.6853625318583321
Model: GRU, Epoch: 3, Train Loss: 0.6165663341579535
Model: GRU, Epoch: 4, Train Loss: 0.4745495242383474
Model: GRU, Epoch: 5, Train Loss: 0.38989744939462606

Model: LSTM, Epoch: 1, Train Loss: 0.6942260134250612
Model: LSTM, Epoch: 2, Train Loss: 0.6677144465544035
Model: LSTM, Epoch: 3, Train Loss: 0.5870749384850797
Model: LSTM, Epoch: 4, Train Loss: 0.4947256227893293
Model: LSTM, Epoch: 5, Train Loss: 0.42434181212006933

Model: BiLSTM, Epoch: 1, Train Loss: 0.693617703054872
Model: BiLSTM, Epoch: 2, Train Loss: 0.6840410749320789
Model: BiLSTM, Epoch: 3, Train Loss: 0.5963251682955896
Model: BiLSTM, Epoch: 4, Tra

In [9]:
# Evaluation

for model in models: 
    if model is not None:
        model.eval()

with torch.no_grad():
    
    for model in models:
        
        if model is None:
            continue

        accuracy = 0.0
        for batch in testIterator:
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            acc = batchAccuracy(predictions, batch[1])
            accuracy += acc
        print('Model: {}, Validation Accuracy: {}%'.format(model.name, accuracy / len(testIterator) * 100))

Model: RNN, Validation Accuracy: 70.63459079283886%
Model: GRU, Validation Accuracy: 82.07800511508951%
Model: LSTM, Validation Accuracy: 83.81953324808184%
Model: BiLSTM, Validation Accuracy: 80.93190537084399%


## Homework Questions

**To make sure your code produces consistent results, it is advisable to click "Kernel -> Restart & Run All" every time you want to run your code.**

### Question 1: Coding (50 points)

First, run the code given above to assess the accuracy of the default RNN model. 

Next, you will need to construct three other model types (GRU, LSTM, BiLSTM) for comparison purposes. Follow the comments in box 5 to initialize the three other model types, then rerun the code with all models enabled.

Finally, compare the accuracies of all four models (the accuracy of the default RNN should not change from the initial run). Explain your results. In doing so, give an overview of the advantages of the best-performing architecture.

**Answer:**

The following are the accuracies (rounded off) of the models run for the sentiment analysis task on the IMDb dataset:

1. RNN: 70.63%
2. GRU: 82.08%
3. LSTM: 83.82%
4. BiLSTM: 80.93%

Unsurprisingly, the vanilla RNN performs worst, while the gated recurrent networks perform much better, with the LSTM performing best of all.

Vanilla RNNs find it difficult to access information from many timesteps back because they are susceptible to vanishing and exploding gradients, which explains the poor performance of the vanilla RNN on the task above. Recurrent networks in general are also slow and difficult to parallelize across timesteps, although this affects training time rather than accuracy.

Gated networks such as the GRU, LSTM, and BiLSTM mitigate the vanishing gradient problem, which explains their superior results. LSTMs control the flow of information across many timesteps, which lets them model long-term dependencies that a vanilla RNN struggles to capture. The GRU performs comparably to the LSTM while simplifying its architecture, which gives a more parsimonious model.

The BiLSTM takes into account context from both the forward and backward directions by combining a forward and a backward LSTM. Although it is more sophisticated than a unidirectional LSTM, it likely overfit the training data and did not generalize as well, leaving its validation accuracy slightly below that of the LSTM.
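
The vanishing gradient argument above can be illustrated with a minimal sketch. It assumes a scalar RNN `h_t = tanh(w * h_{t-1} + u * x_t)` with hand-picked weights (`w = 0.5`, `u = 1.0`), which are illustrative values and not taken from the trained models in this notebook. The second function follows only the LSTM cell-state path and holds the forget gate fixed, which is a simplification.

```python
import math

def rnn_gradient(w, u, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return grad

def cell_state_gradient(forget_gate, steps):
    """Return d c_T / d c_0 along the LSTM cell-state path c_t = f * c_{t-1} + i * g,
    holding the forget gate f fixed (a simplifying assumption)."""
    return forget_gate ** steps

inputs = [1.0] * 50
print(rnn_gradient(0.5, 1.0, inputs[:5]))  # gradient after 5 steps
print(rnn_gradient(0.5, 1.0, inputs))      # gradient after 50 steps, far smaller
print(cell_state_gradient(0.99, 50))       # about 0.6: the signal survives
```

Each factor in the vanilla RNN product is at most `w` in magnitude, so with `|w| < 1` the gradient shrinks geometrically with sequence length. A forget gate close to 1 keeps the corresponding product near 1, which is one way to see why gated cells retain information over longer reviews.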

### Question 2: LSTM Gates (30 points)

LSTMs improve upon the naive RNN architecture by adding a series of gates instead of a simple matrix-vector computation. Name the gates and explain each of their functions.

**Answer:**

An LSTM has the following gates:

1. Forget Gate: It is a sigmoid applied to a weighted combination of the previous hidden state and the current input, plus a bias. Its output lies between 0 and 1 and decides how much of the information stored in the cell from the previous timestep to keep or forget.

2. Input Gate: It has the same form as the forget gate, a sigmoid of a weighted combination of the previous hidden state and the current input plus a bias, but with its own weights. It decides how much of the new candidate content is written to the cell state, and so quantifies the importance of the new information carried by the input.

3. Output Gate: It also has the same form, again with its own weights. It decides which parts of the cell contents are passed on as output to the hidden state, given the previous hidden state and the new input data.
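
The three gates above can be sketched as a single timestep of an LSTM cell. This is a minimal illustration with hidden size 1, written in plain Python rather than with `nn.LSTM`; the function name `lstm_step`, the parameter dictionary `p`, and its keys are hypothetical names chosen for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One timestep of a scalar LSTM cell.

    p maps each of "forget", "input", "output", "candidate" to a tuple
    (w_x, w_h, b): input weight, hidden-state weight, and bias.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new candidate content to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate content computed from x and h_prev
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state
    return h, c

# Forget gate saturated open, input gate saturated shut: the cell remembers.
keep = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
        "output": (0.0, 0.0, 0.0), "candidate": (0.0, 0.0, 0.5)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=keep)
print(c)  # very close to the previous cell state 0.7
```

Swapping the two biases (forget gate shut, input gate open) makes the cell overwrite its state with the candidate value instead.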

### Question 3: Applications (20 points)

LSTMs are used across many different fields, from music generation to sentiment classification to text generation. Where could they be useful in your life, whether at home, for your family, or in the workplace? Give a specific problem or application for an LSTM model that was not covered in the course slides (**though it can be related to the applications covered in the slides**). Then, with your application in mind, specifically identify the input to your model, the output from your model, and an applicable loss function. 

(As an optional extension, try implementing your LSTM on your own using the code framework given in this homework!)

**Answer:**

I often do not have the patience to read large walls of text, especially in the context of informational articles where I just want to learn something quickly or get what I came to that page for without going through the trouble of reading the entire article. A succinct summary of the article would be tremendously helpful for people like me.

The problem is to summarize an input text (i.e. a large article). Since both the input and the output are sequences, this is a many-to-many problem, and an encoder-decoder architecture in which both components are LSTMs would be well suited to it.

The input to my model would be the sequence of words from the article that I want a summary of. This text would be cleaned up by converting it to lower case and removing stopwords, special characters, etc. The encoder would convert it to a context vector, which would be the input to the decoder. The decoder would then generate the output based on the information encoded by the encoder. The final output would be the sequence of words that summarizes the input text. The applicable loss function would be categorical cross-entropy, computed over the vocabulary at each decoder timestep, since predicting each word of the summary can be treated loosely as a multi-class classification problem.
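
One way to read the categorical cross-entropy loss above for a sequence output is to apply it at every decoder timestep and average over the real (non-padding) words. The sketch below does this in plain Python. The function name `sequence_cross_entropy`, the `pad_id` argument, and the toy four-word vocabulary are assumptions made for illustration, not part of the homework code.

```python
import math

def sequence_cross_entropy(logits, targets, pad_id=None):
    """Mean cross-entropy over the non-padding decoder timesteps.

    logits:  one list of raw scores per timestep, one score per vocabulary word
    targets: the index of the reference summary word at each timestep
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits, targets):
        if target == pad_id:
            continue  # padding positions do not contribute to the loss
        m = max(step_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in step_logits))
        total += log_z - step_logits[target]  # -log softmax(step_logits)[target]
        count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 4 words, 3 decoder steps, last step is padding (id 1).
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
targets = [2, 3, 1]
print(sequence_cross_entropy(logits, targets, pad_id=1))  # equals log(4)
```

With uniform scores the model assigns probability 1/4 to every word, so the loss at each counted step is log(4); a model that puts more probability on the reference word would score lower.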