# Coursework 3: RNNs

#### Instructions

Please submit on CATe a zip file named *CW3_RNNs.zip* containing a version of this notebook with your answers. Write your answers in the cells below for each question.


## Recurrent models coursework

This coursework is separated into a coding and a theory component.

For the first part, you will use the Google Speech Commands v0.02 subset that you used in the RNN tutorial: http://www.doc.ic.ac.uk/~pam213/co460_files/ 

### Part 1 - Coding
In this part you will have to:

- Implement an LSTM
- Implement a GRU

### Part 2 - Theory

Here you will answer some theoretical questions about RNNs -- no detailed proofs and no programming.

### Part 1: Coding

### Dataset

We will be using the Google [*Speech Commands*](https://www.tensorflow.org/tutorials/sequences/audio_recognition) v0.02 [1] dataset.

[1] Warden, P. (2018). [Speech commands: A dataset for limited-vocabulary speech recognition](https://arxiv.org/abs/1804.03209). *arXiv preprint arXiv:1804.03209.*

In [None]:
# Mount drive

from google.colab import drive
from pathlib import Path
%load_ext google.colab.data_table
content_path = '/content/drive/MyDrive/DL_cw/'
data_path = './data/'
drive.mount('/content/drive/')
content_path = Path(content_path)

%cd '/content/drive/MyDrive/DL_cw/'

Mounted at /content/drive/
/content/drive/MyDrive/DL_cw


In [None]:
## MAKE SURE THIS POINTS INSIDE THE DATASET FOLDER.
dataset_folder = str(content_path) + '/' # this should change depending on where you have stored the data files
print(dataset_folder)

/content/drive/MyDrive/DL_cw/


### Initial code before coursework questions start:

In [None]:
import math
import os
import random
from collections import defaultdict

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import Dataset
import numpy as np
from scipy.io.wavfile import read
import librosa
from matplotlib import pyplot as plt

cuda = torch.cuda.is_available()

Tensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor


In [None]:
def set_seed(seed_value):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

set_seed(42)

In [None]:
class SpeechCommandsDataset(Dataset):
    """Google Speech Commands dataset."""

    def __init__(self, root_dir, split):
        """
        Args:
            root_dir (string): Directory with all the data files.
            split    (string): In ["train", "valid", "test"].
        """
        self.root_dir = str(root_dir)
        self.split = split

        self.number_of_classes = len(self.get_classes())

        self.class_to_file = defaultdict(list)

        self.valid_filenames = self.get_valid_filenames()
        self.test_filenames = self.get_test_filenames()

        for c in self.get_classes():
            file_name_list = sorted(os.listdir(self.root_dir + "data_speech_commands_v0.02/" + c))
            for filename in file_name_list:
                if split == "train":
                    if (filename not in self.valid_filenames[c]) and (filename not in self.test_filenames[c]):
                        self.class_to_file[c].append(filename)
                elif split == "valid":
                    if filename in self.valid_filenames[c]:
                        self.class_to_file[c].append(filename)
                elif split == "test":
                    if filename in self.test_filenames[c]:
                        self.class_to_file[c].append(filename)
                else:
                    raise ValueError("Invalid split name.")

        self.filepath_list = list()
        self.label_list = list()
        for cc, c in enumerate(self.get_classes()):
            f_extension = sorted(list(self.class_to_file[c]))
            l_extension = [cc for i in f_extension]
            f_extension = [self.root_dir + "data_speech_commands_v0.02/" + c + "/" + filename for filename in f_extension]
            self.filepath_list.extend(f_extension)
            self.label_list.extend(l_extension)
        self.number_of_samples = len(self.filepath_list)

    def __len__(self):
        return self.number_of_samples

    def __getitem__(self, idx):
        sample = np.zeros((16000, ), dtype=np.float32)

        sample_file = self.filepath_list[idx]

        sample_from_file = read(sample_file)[1]
        sample[:sample_from_file.size] = sample_from_file
        sample = sample.reshape((16000, ))
        
        sample = librosa.feature.mfcc(y=sample, sr=16000, hop_length=512, n_fft=2048).transpose().astype(np.float32)

        label = self.label_list[idx]

        return sample, label

    def get_classes(self):
        return ['one', 'two', 'three']

    def get_valid_filenames(self):
        class_names = self.get_classes()

        class_to_filename = defaultdict(set)
        with open(self.root_dir + "data_speech_commands_v0.02/validation_list.txt", "r") as fp:
            for line in fp:
                clean_line = line.strip().split("/")

                if clean_line[0] in class_names:
                    class_to_filename[clean_line[0]].add(clean_line[1])

        return class_to_filename

    def get_test_filenames(self):
        class_names = self.get_classes()

        class_to_filename = defaultdict(set)
        with open(self.root_dir + "data_speech_commands_v0.02/testing_list.txt", "r") as fp:
            for line in fp:
                clean_line = line.strip().split("/")

                if clean_line[0] in class_names:
                    class_to_filename[clean_line[0]].add(clean_line[1])

        return class_to_filename

In [None]:
train_dataset = SpeechCommandsDataset(dataset_folder,
                                      "train")
valid_dataset = SpeechCommandsDataset(dataset_folder,
                                      "valid")

test_dataset = SpeechCommandsDataset(dataset_folder,
                                     "test")

batch_size = 100


num_epochs = 5
valid_every_n_steps = 20
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
valid_loader = torch.utils.data.DataLoader(dataset=valid_dataset,
                                           batch_size=batch_size,
                                           shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

In [None]:
train_dataset.get_classes()
train_dataset.__getitem__(100)[0][0].shape

(20,)
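
The `(20,)` above is the feature dimension of one MFCC frame. As a rough check of the full per-sample shape, the arithmetic below is a sketch that assumes librosa's default settings (`center=True` framing and `n_mfcc=20`), neither of which is set explicitly in `__getitem__`.

```python
# Assumed librosa defaults: center=True framing and n_mfcc=20.
num_samples = 16000   # each clip is zero-padded to one second at 16 kHz
hop_length = 512      # as passed to librosa.feature.mfcc in __getitem__
n_mfcc = 20           # librosa default; matches the (20,) printed above

# With centred frames, the frame count is 1 + floor(num_samples / hop_length).
seq_dim = 1 + num_samples // hop_length
input_dim = n_mfcc

print((seq_dim, input_dim))
```

Under these assumptions, `seq_dim` and `input_dim` should agree with the values unpacked from `train_dataset[0][0].shape` in the training cell.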

### Question 1:  Finalise the LSTM and GRU cells by completing the missing code

You are allowed to use nn.Linear.
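
For reference, the cells below can be read as implementing the following update equations. This is a sketch derived from the code, not an official specification: the symbols $U_k$, $V_k$ mirror the `_u`/`_v` layer suffixes, $b_k$ stands for whatever bias terms the corresponding `nn.Linear` layers carry, and $\odot$ is the element-wise product. In the LSTM only the input-side layers have a bias; in the GRU both layers do, so $b_1$ and $b_2$ there denote the sum of the two biases.

LSTM cell:

$$
\begin{aligned}
f_t &= \sigma(U_1 x_t + V_1 h_{t-1} + b_1) \\
i_t &= \sigma(U_2 x_t + V_2 h_{t-1} + b_2) \\
\tilde{c}_t &= \tanh(U_3 x_t + V_3 h_{t-1} + b_3) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(U_4 x_t + V_4 h_{t-1} + b_4) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

GRU cell (the reset gate is applied after the hidden-side linear layer, as in the code):

$$
\begin{aligned}
r_t &= \sigma(U_1 x_t + V_1 h_{t-1} + b_1) \\
z_t &= \sigma(U_2 x_t + V_2 h_{t-1} + b_2) \\
n_t &= \tanh\big(U_3 x_t + b_3 + r_t \odot (V_3 h_{t-1} + b_3')\big) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$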

In [None]:
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        
        ########################################################################
        ## START OF YOUR CODE - Question 1a) Complete the missing code
        ########################################################################
        # forget gate layers
        self.forget_u1 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.forget_v1 = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.sigmoid_forget = nn.Sigmoid()

        # input gate layers
        self.input_gate_u2 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.input_gate_v2 = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.sigmoid_gate = nn.Sigmoid()

        # cell memory layers
        self.mem_gate_u3 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.mem_gate_v3 = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.activation_gate = nn.Tanh()

        # out gate layers
        self.out_gate_u4 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.out_gate_v4 = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.sigmoid_hidden_out = nn.Sigmoid()

        self.activation_final = nn.Tanh()

        self.reset_parameters()
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

        ########################################################################
        ## START OF YOUR CODE - Question 1b) Complete the missing code
        ########################################################################
    def forget_gate(self, x, h):
        x = self.forget_u1(x)
        h = self.forget_v1(h)

        return self.sigmoid_forget(x + h)


    def input_gate(self, x, h):
        x_temp = self.input_gate_u2(x)
        h_temp = self.input_gate_v2(h)

        return self.sigmoid_gate(x_temp + h_temp)
    

    def cell_memory_gate(self, i, f, x, h, c_prev):
        x = self.mem_gate_u3(x)
        h = self.mem_gate_v3(h)

        k = self.activation_gate(x + h)
        g = k * i
        
        c = f * c_prev
        c_next = g + c

        return c_next


    def out_gate(self, x, h):
        x = self.out_gate_u4(x)
        h = self.out_gate_v4(h)

        return self.sigmoid_hidden_out(x + h)


    def forward(self, input, hx=None):
        if hx is None:
            hx = input.new_zeros(input.size(0), self.hidden_size, requires_grad=False)
            hx = (hx, hx)
            
        # We used hx to pack both the hidden and cell states
        hx, cx = hx

        i = self.input_gate(input, hx) # Pass through input gate 
        f = self.forget_gate(input, hx) # Pass through forget gate
        cy = self.cell_memory_gate(i, f, input, hx ,cx) # Pass through cell memory gate
        o = self.out_gate(input, hx) # Pass through out gate
        hy = o * self.activation_final(cy) # Final activation
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################

        return (hy, cy)

class BasicRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True, nonlinearity="tanh"):
        super(BasicRNNCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.nonlinearity = nonlinearity
        if self.nonlinearity not in ["tanh", "relu"]:
            raise ValueError("Invalid nonlinearity selected for RNN.")

        self.x2h = nn.Linear(input_size, hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=bias)

        self.reset_parameters()
        

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

            
    def forward(self, input, hx=None):
        if hx is None:
            hx = input.new_zeros(input.size(0), self.hidden_size, requires_grad=False)

        activation = getattr(nn.functional, self.nonlinearity)
        hy = activation(self.x2h(input) + self.h2h(hx))

        return hy
    
    
class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(GRUCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias

        ########################################################################
        ## START OF YOUR CODE - Question 1c) Complete the missing code
        ########################################################################
        # reset gate layers
        self.reset_gate_u1 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.reset_gate_v1 = nn.Linear(self.hidden_size, self.hidden_size, bias=bias)

        # update gate layers (named reset_gate_*2, but used by update_gate)
        self.reset_gate_u2 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.reset_gate_v2 = nn.Linear(self.hidden_size, self.hidden_size, bias=bias)
        self.activation_1 = nn.Sigmoid()  # reset gate activation

        # candidate hidden state layers
        self.mem_gate_u3 = nn.Linear(self.input_size, self.hidden_size, bias=bias)
        self.mem_gate_v3 = nn.Linear(self.hidden_size, self.hidden_size, bias=bias)
        self.activation_2 = nn.Sigmoid()  # update gate activation

        self.activation_3 = nn.Tanh()  # candidate hidden state activation
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################
        self.reset_parameters()
        

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

        ########################################################################
        ## START OF YOUR CODE - Question 1d) Complete the missing code
        ########################################################################

    def reset_gate(self, x, h):
        x_1 = self.reset_gate_u1(x)
        h_1 = self.reset_gate_v1(h)
        reset = self.activation_1(x_1 + h_1)

        return reset


    def update_gate(self, x, h):
        x_2 = self.reset_gate_u2(x)
        h_2 = self.reset_gate_v2(h)
        z = self.activation_2(h_2 + x_2)

        return z


    def update_component(self, x,h,r):
        x_3 = self.mem_gate_u3(x)
        h_3 = r * self.mem_gate_v3(h) 
        gate_update = self.activation_3(x_3 + h_3)

        return gate_update


    def forward(self, input, hx=None):
        if hx is None:
            hx = input.new_zeros(input.size(0), self.hidden_size, requires_grad=False)

        r = self.reset_gate(input, hx) # Pass through reset gate
        z = self.update_gate(input, hx) # Pass through update gate
        n = self.update_component(input, hx, r) # Pass through almost update component
        hy = (1 - z) * n  + z * hx
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################
        
        return hy

### Question 2:  Finalise the RNNModel and BidirRecurrentModel

Note that there are several different ways to implement a bi-directional recurrent neural network. In this coursework we ask for an implementation of the following type of architecture (shown with 2 layers for each direction as an example; your implementation should work for any number of layers):

In [None]:
from IPython.display import Image, display
display(Image(filename='bidirectional_rnn_arch.png', width=300))

FileNotFoundError: ignored
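
Since the figure is not shown here, the toy example below sketches the data flow that `BidirRecurrentModel` further down follows: two independent stacks of cells, one run forwards and one run backwards over the same sequence, with the top-layer forward output at the last time step and the top-layer backward output at the first time step concatenated. The scalar `toy_cell` and its weights are hypothetical stand-ins for a real RNN cell, chosen only to keep the sketch dependency-free.

```python
import math

def toy_cell(x, h, w_x=0.5, w_h=0.8):
    """Hypothetical scalar RNN cell: h_new = tanh(w_x * x + w_h * h)."""
    return math.tanh(w_x * x + w_h * h)

def run_direction(sequence, num_layers, reverse=False):
    """Run a stack of toy cells over the sequence in one direction.

    Returns the top layer's outputs, indexed by original time position.
    """
    outputs = list(sequence)
    num_steps = len(outputs)
    order = range(num_steps - 1, -1, -1) if reverse else range(num_steps)
    for _ in range(num_layers):
        h = 0.0  # zero initial hidden state for every layer
        for t in order:
            h = toy_cell(outputs[t], h)
            outputs[t] = h  # this layer's output becomes the next layer's input
    return outputs

sequence = [1.0, -0.5, 0.25, 2.0]
fwd = run_direction(sequence, num_layers=2)
bwd = run_direction(sequence, num_layers=2, reverse=True)

# The forward summary sits at the last position, the backward one at the first.
features = [fwd[-1], bwd[0]]
print(features)
```

Both entries of `features` have seen the whole sequence, which is why these two positions are the ones concatenated before the fully connected layer.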

In [None]:
class RNNModel(nn.Module):
    def __init__(self, mode, input_size, hidden_size, num_layers, bias, output_size):
        super(RNNModel, self).__init__()
        self.mode = mode
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.output_size = output_size
        
        self.rnn_cell_list = nn.ModuleList()
        
        if mode == 'LSTM':
        ########################################################################
        ## START OF YOUR CODE - Question 2a) Complete the missing code
        #  Append the appropriate LSTM cells to rnn_cell_list
        ########################################################################
            self.rnn_cell_list.append(LSTMCell(input_size, hidden_size))
            
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(LSTMCell(hidden_size, 
                                                       hidden_size))
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################

        elif mode == 'GRU':
        ########################################################################
        ## START OF YOUR CODE - Question 2b) Complete the missing code
        #  Append the appropriate GRU cells to rnn_cell_list
        ########################################################################
            self.rnn_cell_list.append(GRUCell(input_size, hidden_size))
            
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(GRUCell(hidden_size, hidden_size))
        ########################################################################
        ## END OF YOUR CODE
        ######################################################################## 
        
        elif mode == 'RNN_TANH':
        ########################################################################
        ## START OF YOUR CODE - Question 2c) Complete the missing code
        #  Append the appropriate RNN cells to rnn_cell_list
        ########################################################################
            self.rnn_cell_list.append(BasicRNNCell(input_size, hidden_size))
            
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(BasicRNNCell(hidden_size, 
                                                           hidden_size))
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################

        elif mode == 'RNN_RELU':
        ########################################################################
        ## START OF YOUR CODE - Question 2d) Complete the missing code
        #  Append the appropriate RNN cells to rnn_cell_list
        ########################################################################
            self.rnn_cell_list.append(BasicRNNCell(input_size, 
                                                   hidden_size, 
                                                   nonlinearity='relu'))
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(BasicRNNCell(hidden_size, 
                                                           hidden_size, 
                                                           nonlinearity='relu'))
        ########################################################################
        ## END OF YOUR CODE
        ########################################################################

        else:
            raise ValueError("Invalid RNN mode selected.")

        self.att_fc = nn.Linear(self.hidden_size, 1)
        self.fc = nn.Linear(self.hidden_size, self.output_size)

        
    def forward(self, input, hx=None):

        h0 = [None] * self.num_layers if hx is None else list(hx)
        
        # In this forward pass we want to create our RNN from the rnn cells,
        # ..taking the hidden states from the final RNN layer and passing these 
        # ..through our fully connected layer (fc).
        
        # The multi-layered RNN should be able to run when the mode is either 
        # .. LSTM, GRU, RNN_TANH or RNN_RELU.

        ########################################################################    
        ## START OF YOUR CODE - Question 2e) Complete the missing code  
        ########################################################################

        # Rearrange to (sequence_dim X batch_size X input_dim)
        X = list(input.permute(1, 0, 2))

        sd = input.shape[1] # Sequence dimension

        # Loops through cells then through sequence
        for k, cell in enumerate(self.rnn_cell_list):
            for i in range(sd):

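                # At the first time step, start from this layer's initial 
                # state (None by default, which the cell replaces with zeros)
                if i == 0:
                    hx = h0[k]

                # Take the i'th element of the sequence (the raw input for the
                # first layer, the previous layer's output otherwise) and pass 
                # it through the cell together with the running hidden state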
                hx_minus_one = X[i]
                hx = cell(hx_minus_one, hx)
                
                # Take first element of tuple if LSTM
                if self.mode != "LSTM":
                    X[i] = hx
                else:
                    X[i] = hx[0]

        outs = X
        ########################################################################    
        ## END OF YOUR CODE 
        ########################################################################

        out = outs[-1].squeeze()

        out = self.fc(out)
        
        return out
    

class BidirRecurrentModel(nn.Module):
    def __init__(self, mode, input_size, hidden_size, num_layers, bias, output_size):
        super(BidirRecurrentModel, self).__init__()
        self.mode = mode
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.output_size = output_size
        
        self.rnn_cell_list = nn.ModuleList()
        self.rnn_cell_list_rev = nn.ModuleList()

        # THIS HAS BEEN CHANGED TO ACCOMMODATE THE CONCATENATED OUTPUTS
        self.fc = nn.Linear(self.hidden_size*2, self.output_size) 

        ########################################################################
        ## START OF YOUR CODE - Question 2f) Complete the missing code
        ########################################################################
        if mode == 'LSTM':
        #  Append the appropriate LSTM cells to rnn_cell_list
            self.rnn_cell_list.append(LSTMCell(input_size, hidden_size))
            self.rnn_cell_list_rev.append(LSTMCell(input_size, hidden_size))
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(LSTMCell(hidden_size, 
                                                       hidden_size))
                    self.rnn_cell_list_rev.append(LSTMCell(hidden_size, 
                                                       hidden_size))

        elif mode == 'GRU':
        #  Append the appropriate GRU cells to rnn_cell_list
            self.rnn_cell_list.append(GRUCell(input_size, hidden_size))
            self.rnn_cell_list_rev.append(GRUCell(input_size, hidden_size))
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(GRUCell(hidden_size, hidden_size))
                    self.rnn_cell_list_rev.append(GRUCell(hidden_size, hidden_size))
        
        elif mode == 'RNN_TANH':
        #  Append the appropriate RNN cells to rnn_cell_list
            self.rnn_cell_list.append(BasicRNNCell(input_size, hidden_size))
            self.rnn_cell_list_rev.append(BasicRNNCell(input_size, hidden_size))
            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(BasicRNNCell(hidden_size, 
                                                           hidden_size))
                    self.rnn_cell_list_rev.append(BasicRNNCell(hidden_size, 
                                                           hidden_size))
                
        elif mode == 'RNN_RELU':
        #  Append the appropriate RNN cells to rnn_cell_list
            self.rnn_cell_list.append(BasicRNNCell(input_size, 
                                                   hidden_size, 
                                                   nonlinearity='relu'))
            self.rnn_cell_list_rev.append(BasicRNNCell(input_size, 
                                                       hidden_size,
                                                       nonlinearity='relu'))

            if num_layers > 1:
                for layer in range(num_layers - 1):
                    self.rnn_cell_list.append(BasicRNNCell(hidden_size, 
                                                           hidden_size, 
                                                           nonlinearity='relu'))
                    self.rnn_cell_list_rev.append(BasicRNNCell(hidden_size, 
                                                           hidden_size, 
                                                           nonlinearity='relu'))
        ########################################################################    
        ## END OF YOUR CODE 
        ########################################################################
        
    def forward(self, input, hx=None):

        h0 = [None] * self.num_layers if hx is None else list(hx)
        
        # In this forward pass we want to create our Bidirectional RNN from the rnn cells,
        # .. taking the hidden states from the final RNN layer with their reversed counterparts
        # .. before concatenating these and running them through the fully connected layer (fc)
        
        # The multi-layered RNN should be able to run when the mode is either 
        # .. LSTM, GRU, RNN_TANH or RNN_RELU.

        ########################################################################
        ## START OF YOUR CODE  - Question 2g) Complete the missing code
        ########################################################################

        # Rearrange to (sequence_dim X batch_size X input_dim)
        X = list(input.permute(1, 0, 2)) 
        X_rev = X.copy() # Use same data but pass backwards through

        sd = input.shape[1] # Sequence dimension

        # FORWARD
        # Loops through cells then through sequence
        for k, cell in enumerate(self.rnn_cell_list):
            for i in range(sd):
                
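                # At the first time step, start from this layer's initial 
                # state (None by default, which the cell replaces with zeros)
                if i == 0:
                    hx = h0[k]

                # Take the i'th element of the sequence (the raw input for the
                # first layer, the previous layer's output otherwise) and pass 
                # it through the cell together with the running hidden state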
                hx_minus_one = X[i]
                hx = cell(hx_minus_one, hx)
                
                # Take first element of tuple if LSTM
                if self.mode != "LSTM":
                    X[i] = hx
                else:
                    X[i] = hx[0]
        outs = X

        # REVERSE
        for k, cell in enumerate(self.rnn_cell_list_rev):
            for i in range(sd):

                if i == 0:
                    hx = h0[k]

                hx_minus_one = X_rev[-(i+1)]
                hx = cell(hx_minus_one, hx)
                
                # Update inputs
                if self.mode != "LSTM":
                    X_rev[-(i+1)] = hx
                else:
                    X_rev[-(i+1)] = hx[0]

        outs_rev = X_rev
        ########################################################################    
        ## END OF YOUR CODE 
        ########################################################################

        out = outs[-1].squeeze()
        out_rev = outs_rev[0].squeeze()
        out = torch.cat((out, out_rev), 1)

        out = self.fc(out)
        return out

The code below trains a network based on your code above. This should work without error:

In [None]:
seq_dim, input_dim = train_dataset[0][0].shape
output_dim = 3

hidden_dim = 32
layer_dim = 2
bias = True

### Change the code below to try running different models:
model = RNNModel("RNN_RELU", input_dim, hidden_dim, layer_dim, bias, output_dim)
# model = BidirRecurrentModel("GRU", input_dim, hidden_dim, layer_dim, bias, output_dim)

if torch.cuda.is_available():
    model.cuda()
    
criterion = nn.CrossEntropyLoss()

learning_rate = 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

loss_list = []
iter = 0
max_v_accuracy = 0
reported_t_accuracy = 0
max_t_accuracy = 0
for epoch in range(num_epochs):
    print(f'epoch: {epoch}')
    for i, (audio, labels) in enumerate(train_loader):
        # print(f'iteration: {i}')
        if torch.cuda.is_available():
            audio = Variable(audio.view(-1, seq_dim, input_dim).cuda())
            labels = Variable(labels.cuda())
        else:
            audio = Variable(audio.view(-1, seq_dim, input_dim))
            labels = Variable(labels)

        optimizer.zero_grad()

        outputs = model(audio)

        loss = criterion(outputs, labels)

        if torch.cuda.is_available():
            loss.cuda()

        loss.backward()

        optimizer.step()

        loss_list.append(loss.item())
        iter += 1

        if iter % valid_every_n_steps == 0:
            correct = 0
            total = 0
            for audio, labels in valid_loader:
                if torch.cuda.is_available():
                    audio = Variable(audio.view(-1, seq_dim, input_dim).cuda())
                else:
                    audio = Variable(audio.view(-1, seq_dim, input_dim))

                outputs = model(audio)

                _, predicted = torch.max(outputs.data, 1)

                total += labels.size(0)

                if torch.cuda.is_available():
                    correct += (predicted.cpu() == labels.cpu()).sum()
                else:
                    correct += (predicted == labels).sum()

            v_accuracy = 100 * correct // total
            
            is_best = False
            if v_accuracy >= max_v_accuracy:
                max_v_accuracy = v_accuracy
                is_best = True

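            if is_best:
                # Reset the counters so that the test accuracy is not mixed
                # with the validation counts accumulated above
                correct = 0
                total = 0
                for audio, labels in test_loader: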
                    if torch.cuda.is_available():
                        audio = Variable(audio.view(-1, seq_dim, input_dim).cuda())
                    else:
                        audio = Variable(audio.view(-1, seq_dim, input_dim))

                    outputs = model(audio)

                    _, predicted = torch.max(outputs.data, 1)

                    total += labels.size(0)

                    if torch.cuda.is_available():
                        correct += (predicted.cpu() == labels.cpu()).sum()
                    else:
                        correct += (predicted == labels).sum()

                t_accuracy = 100 * correct // total
                reported_t_accuracy = t_accuracy

            print('Iteration: {}. Loss: {}. V-Accuracy: {}  T-Accuracy: {}'.format(iter, loss.item(), v_accuracy, reported_t_accuracy))



epoch: 0




Iteration: 20. Loss: 1.2095831632614136. V-Accuracy: 35  T-Accuracy: 35
Iteration: 40. Loss: 1.1152595281600952. V-Accuracy: 38  T-Accuracy: 37
Iteration: 60. Loss: 1.0666297674179077. V-Accuracy: 38  T-Accuracy: 36
Iteration: 80. Loss: 1.0519579648971558. V-Accuracy: 39  T-Accuracy: 38
epoch: 1
Iteration: 100. Loss: 1.0632435083389282. V-Accuracy: 41  T-Accuracy: 40
Iteration: 120. Loss: 1.0557938814163208. V-Accuracy: 37  T-Accuracy: 40
Iteration: 140. Loss: 1.059493899345398. V-Accuracy: 37  T-Accuracy: 40
Iteration: 160. Loss: 1.1052502393722534. V-Accuracy: 36  T-Accuracy: 40
Iteration: 180. Loss: 1.0554403066635132. V-Accuracy: 39  T-Accuracy: 40
epoch: 2
Iteration: 200. Loss: 1.0627682209014893. V-Accuracy: 43  T-Accuracy: 41
Iteration: 220. Loss: 1.1014127731323242. V-Accuracy: 40  T-Accuracy: 41
Iteration: 240. Loss: 1.1135965585708618. V-Accuracy: 39  T-Accuracy: 41
Iteration: 260. Loss: 1.0812163352966309. V-Accuracy: 43  T-Accuracy: 41
epoch: 3
Iteration: 280. Loss: 1.08120

## Part 2: Theoretical questions

#### Theory question 1: 
What is the _vanishing gradients problem_ and why does it occur? Which activation functions are more or less impacted by this, and why?

#### Your answers:
* Your answer here describing vanishing gradients problem
* Two examples of activation functions more impacted by vanishing gradients
* Two examples of activation functions less impacted by vanishing gradients, why are they impacted less?

- In an RNN, information travels forward through time via a chain of cells, so the hidden state from earlier time steps feeds into every later one. A loss is computed at each time step, and instead of backpropagating errors through a single feedforward network, the errors are backpropagated through every time step (backpropagation through time). Each step back multiplies the gradient by the recurrent weight matrix and by the derivative of the activation function, so the gradient reaching an early time step is a product of many such factors. If these factors are consistently smaller than one in magnitude, the product shrinks exponentially with the number of steps and eventually 'vanishes', so the weights receive almost no learning signal from distant time steps and the network cannot be trained effectively on them. As a result, the model is biased towards capturing short-term dependencies in a sequence.

- The two main examples of activation functions impacted by this effect are sigmoid and tanh. Both saturate: the sigmoid output flattens out towards 0 and 1, and the tanh output towards -1 and 1. In the saturated regions the derivatives are extremely close to 0 (and the sigmoid derivative is never larger than 0.25 anywhere), so the product of many such derivatives shrinks to nothing.

- Two functions that mitigate this effect are ReLU and leaky ReLU. Both have a gradient of exactly 1 for inputs > 0, so a product of many of these derivatives does not diminish as before: for ReLU it is either 1 or 0. Leaky ReLU has a small non-zero gradient for negative inputs instead of the flat slope of ReLU. This means the gradient for negative inputs is never exactly zero, which addresses the 'dying ReLU' problem, where a unit whose input stays negative receives zero gradient and stops learning.
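
The effect described above can be illustrated with a minimal sketch. It assumes a toy scalar chain in which the recurrent weight is fixed at 1 and every time step sees the same pre-activation value, so the backpropagated gradient is just the local activation derivative raised to the number of steps. The function names, the pre-activation value of 0.5 and the 50 steps are illustrative choices, not part of the coursework code.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh; its maximum is 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, slope=0.01):
    """Derivative of leaky ReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if x > 0 else slope


def chained_gradient(grad_fn, pre_activation, steps):
    """Product of `steps` identical local derivatives (toy chain, weight fixed at 1)."""
    return grad_fn(pre_activation) ** steps


steps = 50            # assumed sequence length
pre_activation = 0.5  # assumed input to the activation at every step

for name, fn in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                 ("ReLU", relu_grad), ("leaky ReLU", leaky_relu_grad)]:
    print(name, chained_gradient(fn, pre_activation, steps))
```

With these assumed values the sigmoid and tanh products collapse towards zero, while the ReLU and leaky ReLU products stay at 1 because the input is positive. A real RNN also multiplies by the recurrent weight matrix at every step, which this sketch deliberately leaves out.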

#### Theory question 2: 
Why do LSTMs help address the vanishing gradient problem compared to a vanilla RNN?

LSTMs are a type of gated cell: a more complex recurrent unit that replaces the basic RNN cell and uses gates to control what information is passed through, which lets it track information over many time steps. The cell contains a series of standard neural network operations (sigmoid and tanh layers) combined with element-wise multiplications and additions to form gates. There are three gates, each with a different function:

- the forget gate discards irrelevant parts of the previous cell state
- the store (input) gate writes relevant new information into the cell state
- the output gate controls what information is sent to the next time step

The LSTM also maintains a separate cell state (c_t) alongside the output h_t. Together, these components let the cell regulate the flow of information. The most important effect is that the gates allow a largely uninterrupted flow of gradients across time by backpropagating through the cell states. The cell state is updated additively: the previous cell state scaled by the forget gate is summed with the new candidate information scaled by the store gate. Along this path, the gradient from c_t to c_{t-1} is just the forget gate activation, rather than a product of a weight matrix and a saturating activation derivative as in the vanilla RNN. When the forget gate stays close to 1, the gradient can pass back through many time steps with little shrinkage, so the product after backpropagation through time is much less likely to vanish.
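
As a hedged sketch of this argument, the cell state update can be written in the standard textbook notation below. The symbols $f_t$ (forget gate), $i_t$ (store/input gate), $\tilde{c}_t$ (candidate values) and the weight names are assumptions taken from the usual LSTM formulation, not from the Part 1 code.

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\end{aligned}
$$

Considering only the direct path through the cell state (and ignoring the indirect dependence of the gates on $c_{t-1}$ via $h_{t-1}$), the gradient across several steps is approximately

$$
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
$$

This product contains no repeated weight matrix and no activation derivative, so it only shrinks quickly when the network itself sets the forget gates well below 1.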

#### Theory question 3: 

The plot below shows the training curves for three models A, B, and C, trained on the same dataset up to 100 epochs. The three models are an RNN, an LSTM and a GRU, not necessarily in that order.

* Which could plausibly be which? Why? Please explain your reasoning.

(In the cell below please set the values for A_model, B_model and C_model to be 'RNN', 'LSTM' or 'GRU'. This needs to be exact for the automatic marking.)

In [None]:
from IPython.display import Image, display
display(Image(filename='Performance by epoch.png', width=550))

In [None]:
# Answers below:

A_model = 'GRU'
B_model = 'LSTM'
C_model = 'RNN'

# Give your reasons below:


The RNN curve (curve C) stands apart from the other two because its performance levels out at a lower value. This is due to the RNN being unable to capture long-term dependencies, as explained in the answers on vanishing gradients and information flow above. It is therefore unlikely to match the gated cells on this task. It also improves more quickly at the start of training. This is because it has fewer parameters and no gating, so there are fewer weights to train and update.
The two remaining curves reach a similar level of performance, but curve A gets there sooner. This is likely because the GRU has two gates compared with the LSTM's three. More gates mean more parameters and a longer time required to train.
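
The parameter argument above can be made concrete with a rough count. This is a sketch under stated assumptions: a single layer, one bias vector per weight block, and hypothetical sizes (`input_size = 40`, `hidden_size = 128`) that are not taken from the coursework models. Each gate or candidate computation is one weight block acting on the concatenated input and hidden state, so a vanilla RNN has one block, a GRU three (two gates plus the candidate state) and an LSTM four (three gates plus the candidate cell state).

```python
def recurrent_param_count(input_size, hidden_size, n_blocks):
    """Parameters in one recurrent layer with `n_blocks` weight blocks.

    Each block maps [x_t, h_{t-1}] to a hidden-sized vector and has one bias.
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block


# Weight blocks per cell type: gates plus the candidate-state computation.
BLOCKS = {"RNN": 1, "GRU": 3, "LSTM": 4}

input_size, hidden_size = 40, 128  # hypothetical sizes for illustration

counts = {
    name: recurrent_param_count(input_size, hidden_size, n_blocks)
    for name, n_blocks in BLOCKS.items()
}

for name, count in counts.items():
    print(name, count)
```

Under these assumptions the GRU has three times and the LSTM four times the parameters of the vanilla RNN for the same hidden size. Library implementations may count slightly differently, for example by keeping separate input and recurrent bias vectors.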


#### Theory question 4: 

When might you choose to use each of the three different types of models?

#### Your answers:
* Type of problem when best to use vanilla RNN:
* Type of problem to use GRU:
* Type of problem to use LSTM:


- Vanilla RNNs suit very short sequences, where vanishing gradients are not a concern because there are no long-term dependencies to pick up. They will usually be outperformed by the other two but need less computation and memory.
- GRUs are simpler, with two gates rather than the LSTM's three. This means they train faster, as noted in the answer to the previous question. They are known to perform better than LSTMs with less training data in some cases. They can also be considered more flexible to modify, so they are a good general choice for language modelling.
- LSTMs can in theory remember longer sequences, as they offer the finest control over information flow. For the largest datasets they should be the best option, as they can pick up the longest-term dependencies. This comes at the cost of more computation and memory than GRUs.
