# Homework 3: Recurrent Neural Networks (100 points)

### Overview

We now move from image recognition to natural language processing. For this assignment, we will work with a common sentiment analysis task using the IMDB dataset. This set consists of review-label pairs, where the task is to predict whether the text of the given movie review is positive or negative, a binary classification.

### RNN Architecture

You will be comparing four different recurrent neural network architectures: a standard RNN, a Gated Recurrent Unit (GRU), a standard Long Short-Term Memory (LSTM), and a bidirectional LSTM. 

Note that a GRU/LSTM cell _is_ an RNN cell, but we will refer to an RNN in the code and questions below as the most basic RNN.

### Your Task

At the bottom of this notebook file, there are four short answer questions testing your understanding of this neural network architecture. As before, some questions will require you to experiment with model hyperparameters. Additionally, you will need to produce and analyze a graph.

Below each question is a cell with the text “Type Markdown and LaTeX.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a .PDF file by choosing File - Download as - pdf (.pdf). You will be submitting this .pdf to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import pickle

In [2]:
torch.manual_seed(0)
torch.set_num_threads(4)
torch.set_num_interop_threads(4)

In [3]:
root_dir = 'assets_week3'
with open(root_dir + '/reviewVocabVectors', 'rb') as f:
    reviewVocabVectors = pickle.load(f)
with open(root_dir + '/trainIterator', 'rb') as f:
    trainIterator = pickle.load(f)
with open(root_dir + '/testIterator', 'rb') as f:
    testIterator = pickle.load(f)

In [4]:
embeddingSize = 100
hiddenSize = 10
dropoutRate = 0.5
numEpochs = 5
vocabSize = 20002
pad = 1
unk = 0

class MyRNN(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.name = model
        self.LSTM = (model == 'LSTM' or model == 'BiLSTM')
        self.bidir = (model == 'BiLSTM')
        
        self.embed = nn.Embedding(vocabSize, embeddingSize, padding_idx = pad)
        
        if model == 'RNN': 
            self.rnn = nn.RNN(embeddingSize, hiddenSize)
        elif model == 'GRU': 
            self.rnn = nn.GRU(embeddingSize, hiddenSize)
        else: 
            self.rnn = nn.LSTM(embeddingSize, hiddenSize, bidirectional=self.bidir)

        self.dense = nn.Linear(hiddenSize * (2 if self.bidir else 1), 1)
        self.dropout = nn.Dropout(dropoutRate)
        
    def forward(self, text, textLengths):
        embedded = self.dropout(self.embed(text))
        
        packedEmbedded = nn.utils.rnn.pack_padded_sequence(embedded, textLengths)
        if self.LSTM: 
            packedOutput, (hidden, cell) = self.rnn(packedEmbedded)
        else: 
            packedOutput, hidden = self.rnn(packedEmbedded)

        output, outputLengths = nn.utils.rnn.pad_packed_sequence(packedOutput)
        if self.bidir: 
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        else: 
            hidden = hidden[0]

        return self.dense(self.dropout(hidden))

In [5]:
basicRNN = MyRNN(model='RNN')
GRU = MyRNN(model='GRU') # Construct a GRU model, as above
LSTM = MyRNN(model='LSTM') # Construct a LSTM model, as above
biLSTM = MyRNN(model='BiLSTM') # Construct a BiLSTM model, as above
models = [basicRNN, GRU, LSTM, biLSTM]

In [6]:
for model in models:
    if model is None:
        continue
    model.embed.weight.data.copy_(reviewVocabVectors)
    model.embed.weight.data[unk] = torch.zeros(embeddingSize)
    model.embed.weight.data[pad] = torch.zeros(embeddingSize)

In [7]:
criterion = nn.BCEWithLogitsLoss()

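def batchAccuracy(preds, targets):
    # preds are raw logits; a logit >= 0 corresponds to sigmoid(logit) >= 0.5,
    # i.e. a predicted positive label
    roundedPreds = (preds >= 0)
    # fraction of the batch whose predicted label matches the target
    return (roundedPreds == targets).sum().item() / len(preds)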

In [8]:
# Training

for model in models: 
    if model is not None:
        model.train()

for model in models:
    if model is None:
        continue
        
    torch.manual_seed(0)
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(numEpochs):
        epochLoss = 0
        for batch in trainIterator:
            optimizer.zero_grad()
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            loss.backward()
            optimizer.step()
            epochLoss += loss.item()
        print(f'Model: {model.name}, Epoch: {epoch + 1}, Train Loss: {epochLoss / len(trainIterator)}')
    print()

Model: RNN, Epoch: 1, Train Loss: 0.7023946528544511
Model: RNN, Epoch: 2, Train Loss: 0.6915372325026471
Model: RNN, Epoch: 3, Train Loss: 0.6850547603024241
Model: RNN, Epoch: 4, Train Loss: 0.6643173133625704
Model: RNN, Epoch: 5, Train Loss: 0.6291575803781104

Model: GRU, Epoch: 1, Train Loss: 0.7004434310871622
Model: GRU, Epoch: 2, Train Loss: 0.6853625314010073
Model: GRU, Epoch: 3, Train Loss: 0.6251700029653662
Model: GRU, Epoch: 4, Train Loss: 0.4790666857186486
Model: GRU, Epoch: 5, Train Loss: 0.387260852453044

Model: LSTM, Epoch: 1, Train Loss: 0.694226009461581
Model: LSTM, Epoch: 2, Train Loss: 0.6681870629110604
Model: LSTM, Epoch: 3, Train Loss: 0.6659568180056179
Model: LSTM, Epoch: 4, Train Loss: 0.560460474027697
Model: LSTM, Epoch: 5, Train Loss: 0.5663308347277629

Model: BiLSTM, Epoch: 1, Train Loss: 0.6936177038170798
Model: BiLSTM, Epoch: 2, Train Loss: 0.684041075541845
Model: BiLSTM, Epoch: 3, Train Loss: 0.6048695712595644
Model: BiLSTM, Epoch: 4, Train Lo

In [9]:
# Evaluation

for model in models: 
    if model is not None:
        model.eval()

with torch.no_grad():
    
    for model in models:
        
        if model is None:
            continue

        accuracy = 0.0
        for batch in testIterator:
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            acc = batchAccuracy(predictions, batch[1])
            accuracy += acc
        print('Model: {}, Validation Accuracy: {}%'.format(model.name, accuracy / len(testIterator) * 100))

Model: RNN, Validation Accuracy: 72.0428388746803%
Model: GRU, Validation Accuracy: 81.72234654731459%
Model: LSTM, Validation Accuracy: 79.09846547314578%
Model: BiLSTM, Validation Accuracy: 82.27621483375958%


## Homework Questions

### Question 1: Coding (50 points)

First, run the code given above to assess accuracy of the default RNN model. Note that it is normal for box 3 to take a significant amount of time to run as it is loading a large text dataset.

Next, you will need to construct three other model types (GRU, LSTM, BiLSTM) for comparison purposes. Follow the comments in box 4 to initialize the three other model types then rerun the code with all models enabled.

Finally, compare the accuracies of all four models (the accuracy of the default RNN should not change from the initial run). Explain your results. In doing so, overview the advantages of the best performing architecture.

RNN MODEL: The base-level RNN model has a calculated validation accuracy of 72.04% on the test set.

GRU MODEL: The GRU model has a calculated validation accuracy of 81.72% on the test set.

LSTM MODEL: The LSTM model has a calculated validation accuracy of 79.10% on the test set.

BiLSTM MODEL: The BiLSTM had the best validation accuracy on the test set, at 82.28%.

It is not entirely surprising that the RNN model had the lowest accuracy. A vanilla RNN is much more susceptible to the vanishing/exploding gradient problem than the other models, because backpropagation through time multiplies the gradient by the recurrent weights once per time step. If the gradient shrinks exponentially, the earliest time steps contribute almost nothing to the weight update on the backward pass, so the model struggles to learn long-range dependencies. On the flip side, if the gradient explodes, the model weights are updated by extremely large increments, which destabilizes training and makes it difficult for the model to learn from long input sequences. As for other shortcomings, vanilla RNNs are typically slow to train because their steps are difficult to parallelize: each step depends on the previous one, so the computations across time steps cannot run in parallel.
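The repeated multiplication described above can be illustrated with a toy scalar recurrence. This is a minimal sketch, assuming a linear recurrence h_t = w * h_{t-1} with no nonlinearity; the function name `gradient_through_time` is made up for this illustration and is not part of the notebook.

```python
def gradient_through_time(recurrent_weight, num_steps):
    """Derivative of h_T with respect to h_0 for h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(num_steps):
        # Chain rule: each time step contributes one factor of w.
        grad *= recurrent_weight
    return grad


# Assumed toy weights: one below 1 in magnitude, one above.
vanishing = gradient_through_time(0.5, 50)
exploding = gradient_through_time(1.5, 50)

print(f"w = 0.5 over 50 steps: {vanishing:.3e}")
print(f"w = 1.5 over 50 steps: {exploding:.3e}")
```

With a weight below 1 the gradient collapses toward zero, and with a weight above 1 it grows without bound, which is the behavior the gated architectures are designed to soften.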

Of the other three model types (LSTM, BiLSTM, and GRU), the LSTM performs the worst, the GRU second best, and the BiLSTM the best, though the accuracies of all three are quite close to one another. I expected the LSTM to at least outperform the RNN, because the LSTM is not nearly as susceptible to the vanishing/exploding gradient problem as vanilla RNNs. This is because the forget gate, input gate, and output gate work in tandem to regulate how much of the gradient flows through the cell state at each step, which helps keep it from becoming excessively small or excessively large. The GRU happens to perform better than the LSTM. It has fewer parameters than the LSTM and uses only two gates to control the flow of information (the reset gate and the update gate), which potentially removes some redundancy among the LSTM's gates. It also took less time to train than the LSTM and RNN models.
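To make the parameter comparison concrete, the recurrent-layer sizes can be worked out by hand. This is a rough sketch that assumes PyTorch's usual layout of one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors per gate block; the helper `recurrent_param_count` is a hypothetical name, and the sizes come from `embeddingSize = 100` and `hiddenSize = 10` in the notebook. The embedding and dense layers are not counted.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks, bidirectional=False):
    """Count weights in one recurrent layer, assuming two bias vectors per block."""
    per_block = (
        hidden_size * input_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # input and hidden bias vectors
    )
    directions = 2 if bidirectional else 1
    return num_blocks * per_block * directions


embedding_size, hidden_size = 100, 10

counts = {
    "RNN": recurrent_param_count(embedding_size, hidden_size, num_blocks=1),
    # GRU: reset gate, update gate, candidate state
    "GRU": recurrent_param_count(embedding_size, hidden_size, num_blocks=3),
    # LSTM: input gate, forget gate, candidate state, output gate
    "LSTM": recurrent_param_count(embedding_size, hidden_size, num_blocks=4),
    "BiLSTM": recurrent_param_count(embedding_size, hidden_size, num_blocks=4,
                                    bidirectional=True),
}

for name, count in counts.items():
    print(f"{name}: {count} recurrent parameters")
```

Under these assumptions the GRU layer is three quarters the size of the LSTM layer, and the BiLSTM layer is twice the size of the LSTM layer.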

The BiLSTM model, however, slightly edges out the GRU model and performs the best of the bunch. BiLSTMs retain all the advantages that LSTMs have over vanilla RNNs (controlling the gradient size and capturing long-range dependencies), with the added benefit that, for each point in a given sequence, the bidirectional network knows about the points after it as well as the points before it. In the context of this movie review problem, this gives the BiLSTM an edge: the model can take into account what has not yet been said about the movie in addition to what has already been said, which provides potentially valuable additional context.
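The idea of reading a sequence in both directions can be sketched with a toy recurrence in place of a real LSTM. This is only an illustration under stated assumptions: the recurrence h_t = decay * h_{t-1} + x_t and the names `run_recurrence` and `bidirectional_states` are made up here, and a real BiLSTM uses learned gates rather than a fixed decay.

```python
def run_recurrence(xs, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t; returns the state at every step."""
    h, states = 0.0, []
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states


def bidirectional_states(xs, decay=0.5):
    """Pair each position's forward state (past context) with its backward state (future context)."""
    forward = run_recurrence(xs, decay)
    # Run over the reversed sequence, then flip back so indices line up.
    backward = run_recurrence(xs[::-1], decay)[::-1]
    return list(zip(forward, backward))


sequence = [1.0, 0.0, 0.0, 4.0]  # made-up inputs for illustration
states = bidirectional_states(sequence)

for position, (fwd, bwd) in enumerate(states):
    print(f"position {position}: forward={fwd}, backward={bwd}")
```

At position 1 the forward state only reflects the earlier input of 1.0, while the backward state already carries information about the later input of 4.0. The notebook's classifier uses the same idea in a reduced form: it concatenates the final forward hidden state with the final backward hidden state before the dense layer.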


### Question 2: LSTM Gates (30 points)

LSTMs improve upon the naive RNN architecture by adding a series of gates instead of a simple matrix-vector computation. Name the gates and explain each of their functions.

Three different gates regulate the flow of information in an LSTM - the forget gate, the input gate, and the output gate. 

Forget Gate: This gate decides which information in the previous cell state should be kept and which should be thrown away. Its inputs are the previous hidden state and the current input, which are fed into a sigmoid activation function within the gate. An output close to 0 means the corresponding information should be forgotten, and an output close to 1 means it should be kept.

Input Gate: The input gate is used to update the cell state. As with the forget gate, the previous hidden state and the current input are passed into a sigmoid function within the gate. If the resulting value is close to 0, the corresponding cell-state value will not be updated, but if it is close to 1, it will. Unlike the forget gate, the input gate is paired with a tanh activation function, which receives the same hidden state and current input and produces candidate values between -1 and 1. The tanh output is then multiplied by the sigmoid output, so candidates scored near 0 are suppressed and candidates scored near 1 pass through as important information. The result is added to the cell state after the forget gate has scaled it.


Output Gate: The output gate is the final gate, and it ultimately decides what the next hidden state should be. First, the previous hidden state and the current input are passed into a sigmoid function, as in the previous two gates. Then the newly updated cell state is passed through a tanh function. The tanh output is multiplied by the sigmoid output to determine what information the hidden state should carry. Finally, the new hidden state and the updated cell state are passed along to the next time step.
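The three gate descriptions above correspond to the standard LSTM update, which can be sketched for a single scalar unit. The function `lstm_step` and the parameter dictionary are hypothetical names chosen for this illustration, and the weights are hand-picked rather than learned; a layer such as `nn.LSTM` applies the same equations to vectors with learned weight matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps a block name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))          # how much old cell state to keep
    input_gate = sigmoid(pre_activation("input"))       # how much new candidate to admit
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content in (-1, 1)
    output = sigmoid(pre_activation("output"))          # how much cell state to expose

    c_new = forget * c_prev + input_gate * candidate
    h_new = output * math.tanh(c_new)
    return h_new, c_new


# Hand-picked weights (an assumption for illustration): the large forget bias
# saturates the forget gate near 1 and the negative input bias closes the input gate.
remember = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = lstm_step(x=0.7, h_prev=0.1, c_prev=2.0, params=remember)
print(f"cell state carried forward: {c:.6f}, hidden state: {h:.6f}")
```

With the forget gate open and the input gate closed, the cell state passes through the step almost unchanged, which is the mechanism that lets an LSTM hold information across many time steps.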

### Question 3: Applications (20 points)

LSTMs are used across many different fields, from music generation to sentiment classification to text generation. Where could they be useful in your life, whether at home, for your family, or in the workplace? Give a specific problem or application for an LSTM model that was not covered in the course slides. Then, with your application in mind, specifically identify the input to your model, the output from your model, and an applicable loss function. 

(As an extension, try implementing your LSTM on your own using the code framework given in this homework!)

LSTM Many inputs to one output (Stock Market): In my daily life, LSTMs could certainly play a role in my investment finances, specifically when it comes to investing in individual stocks, where it can be difficult and risky to rely solely on one's own intuition. In the realm of stock market investing, LSTMs could be particularly advantageous, since they are far less inhibited by sequence length than vanilla RNNs and can store long-running historical information about a stock's price. This is important because the historical price of a stock is crucial to know when attempting to make any kind of prediction regarding its future price. For this model, the input will be many historical data points pertaining to stock price, and the output will be a single stock price prediction for the future. A linear output activation paired with a mean squared error loss might work well in this case, since we are trying to predict a numerical value.
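One way to frame this many-to-one setup is to slice the price history into fixed-length windows, each paired with the price that follows it. This is a minimal sketch with made-up prices; `make_windows` and `mean_squared_error` are hypothetical helpers, and mean squared error is used as one plausible regression loss rather than a prescribed choice.

```python
def make_windows(prices, window):
    """Split a price series into (history, next_price) training pairs."""
    return [
        (prices[i:i + window], prices[i + window])
        for i in range(len(prices) - window)
    ]


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


prices = [10.0, 11.0, 12.0, 11.5, 13.0]  # made-up values for illustration
pairs = make_windows(prices, window=3)

# Naive baseline standing in for the model: predict the last observed price.
baseline = [history[-1] for history, _ in pairs]
targets = [target for _, target in pairs]

print(pairs)
print(f"baseline MSE: {mean_squared_error(baseline, targets)}")
```

An LSTM would take each `history` window as its input sequence and be trained to drive this loss below what the naive baseline achieves.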


LSTM Many inputs to one output (Rhythm Generation): LSTMs would also be beneficial in my daily life with regard to music. As an avid drummer and percussionist, I could use a robust LSTM to create novel rhythmic compositions by feeding it example compositions during training. For this particular use case, an LSTM would be superior to an RNN because, in most real-life rhythmic compositions, the beats often do not follow each other rapidly one after the other. In many cases, there are significant intervals with no sound at all between beats, and an RNN might have difficulty accessing rhythmic data in a sequence with long pauses. The input to this model would be a sequence of rhythms, and the output would be a single note indicating the next rhythm. ReLU might be a good choice of activation function here, since the goal would be to make the model robust to sparse inputs (cases where there are long pauses), and a cross-entropy loss over the possible next notes could be used if the notes are treated as discrete classes.


LSTM Many inputs to one output (Feedback Classification): Although this doesn't have as much bearing on my current line of work as a software developer, in my previous role as a Quality Data Analyst, utilizing LSTMs for feedback classification would have been a big asset. I dealt primarily with historical time series data of vehicle reviews written and submitted online by customers, and I had to manually read those reviews and classify them into "good" and "bad" buckets. A robust LSTM would have been able to take each review as input, classify it as either good or bad, and improve its accuracy as it was trained on more labeled reviews. In this particular case, a BiLSTM may be even better, since for each position in a review it could draw on context from the distant "past" as well as the "future", due to its ability to retain long sequences both preceding and following any given time step. The input to this model would be the sequence of words in a review, and the output would be a single classification (good or bad review). Since we are predicting a binary outcome, a sigmoid output with a binary cross-entropy loss might be appropriate here.
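For a binary outcome like this one, the loss and the decision rule can be sketched directly on raw model outputs (logits), in the spirit of the `nn.BCEWithLogitsLoss` criterion and the `preds >= 0` threshold used earlier in the notebook. The functions below are hypothetical stand-ins written with the standard library, and the logits and labels are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def predict_label(logit):
    """A logit >= 0 is the same decision as sigmoid(logit) >= 0.5."""
    return 1.0 if logit >= 0 else 0.0


logits = [2.0, -1.0, 0.3]   # made-up model outputs
targets = [1.0, 0.0, 0.0]   # 1.0 = good review, 0.0 = bad review

losses = [bce_with_logits(z, y) for z, y in zip(logits, targets)]
mean_loss = sum(losses) / len(losses)
accuracy = sum(predict_label(z) == y for z, y in zip(logits, targets)) / len(logits)

print(f"mean loss: {mean_loss:.4f}")
print(f"accuracy: {accuracy:.4f}")
```

Working on logits avoids computing the sigmoid explicitly during training, while the probability of a "good" review can still be recovered with `sigmoid(logit)` when needed.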