# IMDB Reviews Recurrent Neural Network


The intention behind this project is to build a **Recurrent Neural Network** that can perform sentiment analysis on the IMDB dataset. In order to pursue such a project, it is necessary to have a thorough comprehension of Backpropagation Through Time (BPTT) and how sequential data affects the training the process. This project will demonstrate such a knowledge and understanding.<br>

**Main Objectives**:
- <u>Import the Data</u>
- <u>Clean the Data</u>
- <u>Preprocessing the data</u>
- <u>Train/Test split</u>
- <u>Creating Datasets and DataLoaders</u>
- <u>Building the Model</u>
- <u>Creating the Training Loop</u>
- Predictions.

**Extra**:
- <u>Implement a **Recurrent Neural Network** from scratch.</u>
- <u>Non-modular</u>
- <u>Forward pass:</u>
  - <u>Hidden state update and output computation</u>
- <u>**Backpropagation Through Time** (BPTT)</u>
  - <u>Compute gradients for weights and biases over multiple time steps.</u>
  - <u>Clip gradients to prevent exploding gradients.</u>
- Input Embedding Layer
  - Learn a low-dimensional representation of the input text.
- <u>Handling initial states, allow passing of initial hidden state $h_0$.</u>
- <u>Activation functions and loss function.</u>

In [2]:
import pandas as pd
import numpy as np
import sklearn
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

## Recurrent Neural Network (RNN) Implementation

The implementation of a non-modular **Recurrent Neural Network**. The RNN will consist of an input layer, a recurrent layer, and an output layer.

Note that the input size is (batch_size, seq_length, num_features).



### **Forward propagation**

$$M_t = X_tW_{hx}^T+H_{t-1}W_{hh}^T+b_h$$
$$H_t = tanh(M_t)$$
$$O_t = H_tW_{qh}^T+b_q$$
$$L=\frac{1}{T}\sum^T_{t=1}l(O_t, Y_t)$$

### **Backpropagation Through Time (BPTT)**
$$\overline{O}_t=\frac{1}{T}(Y_t-T_t)$$
$$\overline{H}_t=\overline{M}_{t+1}W_{hh}^T + \overline{O}_tW_{qh} $$
$$\overline{M}_t=\overline{H}_t*(1-tanh^2(M_t))$$
$$\overline{X}_t=\overline{M}_tW_{hx}$$
$$\overline{W}_{qh}=\sum^T_{t=1}(\overline{O}_t)^TH_t$$
$$\overline{W}_{hh}=\sum^T_{t=1}(\overline{M}_t)^TH_{t-1}$$
$$\overline{W}_{hx}=\sum^T_{t=0}(\overline{M}_t)^TX_t$$
$$\overline{b}_{q}=\sum^T_{t=1}(\overline{O}_t)^T1$$
$$\overline{b}_{h}=\sum^T_{t=1}(\overline{M}_t)^T1$$<br>

Note that given that the task is prediction, for the gradient calculation of the weight and bias of the output layer, there is no summation, only the output of the last time stamp.

To be continued.
- Explain vanishing gradient problem
- Add Xt, Whh,

In [311]:
class RecurrentNeuralNetwork():
  """
  This is a custom implementation of a recurrent neural network with an input layer, a recurrent layer, and an output layer.

  Parameters:
  - self.Whx: the weights matrix that connects the input at the current time step to the hidden state.
  - self.Whh: the weights matrix that connects the hidden state at the previous time step to the hidden state at the current one.
  - self.Whq: the weights matrix for computing the output at the current time step.
  - self.bh: the bias for the computation of the hidden state at the current time step.
  - self.bq: the bias for the computation of the output at the current time step.
  """
  def __init__(self, hidden_size=16, epochs=5, optimizer="None", learning_rate=0.001, clipping=0.5, lamb=0.001, batch_size=8, seq_length=57, B1=0.9, B2=0.999):
    self.Whh = None
    self.Wqh = None
    self.Whx = None
    self.bh = None
    self.bq = None
    self.M = None # current pre-activation hidden state value.
    self.hidden_size = hidden_size # size of recurrent layer
    self.epochs = epochs
    self.ht = []

    # Save hidden states, outputs, and gradients for BPTT
    self.Mts = None
    self.Hts = None
    self.Ots = None
    self.Yts = None
    self.Mt_bars = None
    self.Ht_bars = None
    self.Ot_bar = 0

    # Optimizer
    self.optimizer = optimizer

    # Learning rate
    self.alpha = learning_rate

    # Regularization strength
    self.lamb = lamb

    # Batch size
    self.batch_size = batch_size

    # Sequence length
    self.seq_length = seq_length

    # Clipping threshold
    self.clipping = clipping

    ## For Adam
    # Betas
    self.B1 = B1
    self.B2 = B2
    # Recursive
    self.mt = []
    self.vt = []

    # Bias-corrected
    self.mt_b = []
    self.vt_b = []

    self.t = 0 # current time step

  def fit(self, X, T):
    num_classes = 2
    seq_length = X.shape[1]
    vocab_size = 4
    N = X.shape[0]*X.shape[1]

    # Convert to numpy
    if isinstance(X, pd.DataFrame):
      X = X.values
    else:
      X = X

    if isinstance(T, pd.DataFrame):
      T = T.values
    else:
      T = T

    # Initialize weights, biases, and initial hidden state
    c1 = self.xavier_initialization(n_inputs=vocab_size, n_outputs=self.hidden_size)
    c2 = self.xavier_initialization(n_inputs=self.hidden_size, n_outputs=self.hidden_size)
    c3 = self.xavier_initialization(n_inputs=self.hidden_size, n_outputs=num_classes)

    self.Whx = np.random.uniform(low=-c1, high=c1, size=(self.hidden_size, vocab_size))
    self.Whh = np.random.uniform(low=-c2, high=c2, size=(self.hidden_size, self.hidden_size))
    self.Wqh = np.random.uniform(low=-c3, high=c3, size=(num_classes, self.hidden_size))

    self.bh = np.zeros((self.hidden_size, 1)).flatten()
    self.bq = np.zeros((num_classes, 1)).flatten()

    for epoch in range(self.epochs):
      # Training Loop
      random_indices = np.random.permutation(X.shape[0])
      mode = "training"

      X_train_shuffle = X[random_indices]
      T_train_shuffle = T[random_indices]

      batches = self.mini_batch(T_train_shuffle, X_train_shuffle, X.shape[0])

      total_loss = 0

      for key, value in batches.items():
        x_batch = value[0] # already in OHE form
        t_batch = value[1]
        h0 = np.zeros((x_batch.shape[0], self.hidden_size))

        # x_batch already one-hot encoded, (batch_size, seq_len, features) - (8, 57, 4)
        # t_batch_one_hot (8,)
        t_ohe_batch = self.one_hot_targets(t_batch) # problem here, t is (8, )

        # Forward propagation
        y_batch = self.forward_prop(x_batch, h0, mode, seq_length)

        # Compute loss
        loss = self.compute_loss(y_batch, t_ohe_batch)
        total_loss += loss

        # Backpropagate loss
        grads = self.back_prop_through_time(x_batch, y_batch, t_ohe_batch, loss)

        # Gradient clipping
        grads = self.norm_gradient_clipping(grads)

        # Adam Optimizer
        if self.optimizer == "Adam":
          self.adam_optimizer(grads)


        # Gradient Descent
        elif self.optimizer == "None":
          self.gradient_descent(grads)

      ## To do Validation Loop

      epoch_loss = (total_loss / len(batches)).flatten()
      print("The training loss at epoch {0} is {1:.4f}".format(epoch+1, epoch_loss[0]))

  def forward_prop(self, X, h0, mode, seq_length):
    # Set initialize hidden state to h0
    Ht = h0

    # Create lists for storage
    self.Mts = []
    self.Hts = [h0]
    self.Ots = []
    self.Yts = []

    # Go through all time steps
    for t in range(seq_length):
      Mt = X[:, t, :]@self.Whx.T + Ht@self.Whh.T + self.bh
      Ht = np.tanh(Mt)
      Ot = Ht@self.Wqh.T + self.bq
      Yt = self.softmax(Ot)

      # Save for use in backprop
      self.Mts.append(Mt)
      self.Hts.append(Ht)
      self.Ots.append(Ot)
      if t == seq_length-1:
        self.Yts.append(Yt)

    Yts = np.array(self.Yts).squeeze(axis=0) # (batch_size, num_classes)
    return Yts

  def back_prop_through_time(self, X, Y, T, loss):
    N = X.shape[0]

    # Create empty lists for storage
    self.Mt_bars = [None]*self.seq_length
    self.Ht_bars = [None]*self.seq_length
    Wqh_bar = 0
    Whh_bar = 0
    Whx_bar = 0

    # Backpropagation Through Time (BPTT)
    Ot_bar = (1/N)*(Y - T)
    self.Ot_bar = Ot_bar

    for t in reversed(range(self.seq_length)):
      if t < self.seq_length-1:
        Ht_bar = self.Mt_bars[t+1]@self.Whh.T
        self.Ht_bars[t] = Ht_bar
      else:
        Ht_bar = Ot_bar@self.Wqh
        self.Ht_bars[t] = Ht_bar

      Mt_bar = Ht_bar*(1-(np.tanh(self.Mts[t]))**2)
      self.Mt_bars[t] = Mt_bar

    for t in range(self.seq_length):
      if t > 0:
        Whh_bar +=  self.Mt_bars[t].T @ self.Hts[t-1]
      Whx_bar += self.Mt_bars[t].T@X[:, t, :]

    Wqh_bar += self.Ot_bar.T@self.Hts[-1]

    bq_bar = np.sum(self.Ot_bar, axis=0)
    bh_bar = np.sum(sum(self.Mt_bars), axis=0)

    return [Whx_bar, Whh_bar, Wqh_bar, bh_bar, bq_bar]

  def norm_gradient_clipping(self, gradients):
    clipped = []
    for grad in gradients:
      norm = np.sqrt(np.sum(grad**2))
      if norm >= self.clipping:
        grad = self.clipping*grad/norm

      clipped.append(grad)

    return clipped

  def gradient_descent(self, gradients):
    self.Whx = self.Whx - self.alpha*(gradients[0])
    self.Whh = self.Whh - self.alpha*(gradients[1])
    self.Wqh = self.Wqh - self.alpha*(gradients[2])
    self.bh = self.bh - self.alpha*gradients[3]
    self.bq = self.bq - self.alpha*gradients[4]

  def adam_optimizer(self, grads):
    grads[0] = grads[0] + self.lamb*self.Whx
    grads[1] = grads[1] + self.lamb*self.Whh
    grads[2] = grads[2] + self.lamb*self.Wqh

    self.t += 1 # update time step

    if self.mt == [] and self.vt == []:
      for i in range(len(grads)):
        self.mt.append(np.zeros_like(grads[i]))
        self.vt.append(np.zeros_like(grads[i]))
        self.mt_b.append(np.zeros_like(grads[i]))
        self.vt_b.append(np.zeros_like(grads[i]))

    for i in range(len(grads)):
      # First moment
      self.mt[i] = self.B1*self.mt[i] + (1-self.B1)*grads[i]

      # Second moment
      self.vt[i] = self.B2*self.vt[i] + (1-self.B2)*(grads[i]**2)

      # Bias corrections
      self.mt_b[i] = self.mt[i]/(1-self.B1**self.t)
      self.vt_b[i] = self.vt[i]/(1-self.B2**self.t)

    self.Whx = self.Whx - self.alpha*(self.mt_b[0]/(np.sqrt(self.vt_b[0])+1e-8))
    self.Whh = self.Whh - self.alpha*(self.mt_b[1]/(np.sqrt(self.vt_b[1])+1e-8))
    self.Wqh = self.Wqh - self.alpha*(self.mt_b[2]/(np.sqrt(self.vt_b[2])+1e-8))
    self.bh = self.bh - self.alpha*self.mt_b[3]/(np.sqrt(self.vt_b[3])+1e-8)
    self.bq = self.bq - self.alpha*self.mt_b[4]/(np.sqrt(self.vt_b[4])+1e-8)

  def compute_loss(self, y, t):
    # get N
    N = t.shape[0]
    y_clipped = np.clip(y, 1e-9, 1-1e-9) # to avoid nan in the loss

    # Computes cost - binary cross-entropy with weight decay
    return (- np.sum(t * np.log(y_clipped)) / N) + (self.lamb/2)*(np.sum(self.Wqh**2) + np.sum(self.Whh**2) + np.sum(self.Whx**2))

  def xavier_initialization(self, n_inputs, n_outputs):
    return np.sqrt(6/(n_inputs+n_outputs))

  def predict(self, X_new):
    Z, Y = self.forward_prop(X_new, "evaluation/inference")
    return np.argmax(Y, axis=1)

  def softmax(self, o):
    # To prevent overflow
    o_max = np.max(o, axis=1).reshape(-1, 1)
    o_shifted = o - o_max

    # Softmax function implementation, the keepdims is used for broadcasting purposes.
    z =  np.exp(o_shifted) / np.sum(np.exp(o_shifted), axis=1, keepdims=True)
    return z

  def mini_batch(self, t, X, N):
    batches = {}
    n_batches = N // self.batch_size
    # Create batches
    for i in range(n_batches):
        batches[i] = [X[i*self.batch_size:(i+1)*self.batch_size], t[i*self.batch_size:(i+1)*self.batch_size]]

    # Last batch should be compiled into its own batch, even if it's less than batch size
    if N % self.batch_size != 0 :
        batches[n_batches] = [X[n_batches*self.batch_size:], t[n_batches*self.batch_size:]]

    return batches

  def dropout(self, y, mode):
    if mode == "training":
      r = np.random.binomial(n=1, p=self.p, size=(y.shape[0], y.shape[1])).astype(np.float64)
      return r/(1-self.p)
    elif mode == "evaluation/inference":
      return np.ones_like(y)

  def one_hot_targets(self, y):
    y_ohe = np.zeros(shape=(y.shape[0], 2)) # (Batch Size, Num_classes)
    instances_for_indexing = np.arange(y.shape[0])[:, None]
    y_ohe[instances_for_indexing, y.reshape(y.shape[0], 1)] = 1

    return y_ohe


## Importing the Data

In this dataset, we will be working with a set of DNA sequences to determine the motifs present in the data.

In [6]:
dna = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Datasets/DNA_Sequencing/promoters.data", header=None)

In [7]:
dna.shape

(106, 3)

In [8]:
dna.head()

Unnamed: 0,0,1,2
0,+,S10,\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
1,+,AMPC,\t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaat...
2,+,AROH,\t\tgtactagagaactagtgcattagcttatttttttgttatcat...
3,+,DEOP2,\taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
4,+,LEU1_TRNA,\ttcgataattaactattgacgaaaagctgaaaaccactagaatgc...


In [9]:
dna.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       106 non-null    object
 1   1       106 non-null    object
 2   2       106 non-null    object
dtypes: object(3)
memory usage: 2.6+ KB


In [10]:
dna = dna.rename(mapper={0: "Promoter", 1: "Instance Name", 2: "DNA_Sequence"}, axis=1)

In [11]:
dna = dna.drop('Instance Name', axis=1)
dna

Unnamed: 0,Promoter,DNA_Sequence
0,+,\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
1,+,\t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaat...
2,+,\t\tgtactagagaactagtgcattagcttatttttttgttatcat...
3,+,\taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
4,+,\ttcgataattaactattgacgaaaagctgaaaaccactagaatgc...
...,...,...
101,-,\t\tcctcaatggcctctaaacgggtcttgaggggttttttgctga...
102,-,\t\tgtattctcaacaagattaaccgacagattcaatctcgtggat...
103,-,\t\tcgcgactacgatgagatgcctgagtgcttccgttactggatt...
104,-,\t\tctcgtcctcaatggcctctaaacgggtcttgaggggtttttt...


## Cleaning the Data

Important cleaning steps are lowercasing, removing punctuation, removing numbers, removing extra space, and removing contractions.

First convert the text into a series of the play lines.

In [12]:
dna_sequences = dna["DNA_Sequence"]

In [13]:
dna_sequences = [line.strip() for line in dna_sequences if line.strip() != '']
dna_sequences[0:10]

['tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt',
 'tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa',
 'gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg',
 'aattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaatactaacaaactc',
 'tcgataattaactattgacgaaaagctgaaaaccactagaatgcgcctccgtggtag',
 'aggggcaaggaggatggaaagaggttgccgtataaagaaactagagtccgtttaggt',
 'cagggggtggaggatttaagccatctcctgatgacgcatagtcagcccatcatgaat',
 'tttctacaaaacacttgatactgtatgagcatacagtataattgcttcaacagaaca',
 'cgacttaatatactgcgacaggacgtccgttctgtgtaaatcgcaatgaaatggttt',
 'ttttaaatttcctcttgtcaggccggaataactccctataatgcgccaccactgaca']

In [14]:
# Length of each sequence
len(dna_sequences[0])

57

In [15]:
# List of 106 elements, each being a DNA sequence
len(dna_sequences)

106

In [16]:
dna_sequences_series = pd.Series(dna_sequences)

In [17]:
dna_sequences_series

Unnamed: 0,0
0,tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgc...
1,tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaa...
2,gtactagagaactagtgcattagcttatttttttgttatcatgcta...
3,aattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaata...
4,tcgataattaactattgacgaaaagctgaaaaccactagaatgcgc...
...,...
101,cctcaatggcctctaaacgggtcttgaggggttttttgctgaaagg...
102,gtattctcaacaagattaaccgacagattcaatctcgtggatggac...
103,cgcgactacgatgagatgcctgagtgcttccgttactggattgtca...
104,ctcgtcctcaatggcctctaaacgggtcttgaggggttttttgctg...


In [18]:
import re

# Removes punctuation, numbers, and contractions
for i, line in enumerate(dna_sequences_series):
  dna_sequences_series[i] = re.sub(r"[.,!?;:0-9']", "", line)
  dna_sequences_series[i] = dna_sequences_series[i].lower()

dna_sequences_series

Unnamed: 0,0
0,tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgc...
1,tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaa...
2,gtactagagaactagtgcattagcttatttttttgttatcatgcta...
3,aattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaata...
4,tcgataattaactattgacgaaaagctgaaaaccactagaatgcgc...
...,...
101,cctcaatggcctctaaacgggtcttgaggggttttttgctgaaagg...
102,gtattctcaacaagattaaccgacagattcaatctcgtggatggac...
103,cgcgactacgatgagatgcctgagtgcttccgttactggattgtca...
104,ctcgtcctcaatggcctctaaacgggtcttgaggggttttttgctg...


## Preprocess Data

One-hot encode the values

In [19]:
# Create a vector of size (sequence_number, base_length, one_hot_categories)
data = np.zeros((len(dna_sequences_series), 57, 4), dtype=np.float64)

dna_vals = ['a', 'c', 'g', 't']

# k matches the indices of dna_vals


def one_hot_encode(data):
  for i in range(len(dna_sequences_series)):
    for j in range(57):
      for k in range(4):
        if dna_sequences_series[i][j] == dna_vals[k]:
          data[i][j][k] = 1

one_hot_encode(data)

data

array([[[0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        ...,
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.]],

       [[0., 0., 0., 1.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        ...,
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.]],

       [[0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        ...,
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.]],

       ...,

       [[0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        ...,
        [0., 0., 0., 1.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]],

       [[0., 1., 0., 0.],
        [0., 0., 0., 1.],
        [0., 1., 0., 0.],
        ...,
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.]],

       [[0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        ...,
        [1., 0., 0., 0.],
        [0., 1.

## Train/Test Split

Create a set of sequences of equal length that can be fed into the model.

In [20]:
int(len(data)*0.8)

84

In [21]:
Y = np.zeros((len(dna)))
Y = np.array([1 if val == "+" else 0 for val in dna["Promoter"]])
Y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [22]:
X_train = data[:int(len(data)*0.8), :, :]
X_test = data[int(len(data)*0.8):, :, :]
y_train = Y[:int(len(data)*0.8)]
y_test = Y[int(len(data)*0.8):]
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((84, 57, 4), (22, 57, 4), (84,), (22,))

## Creating Datasets and DataLoaders

In [23]:
from torch.utils.data import DataLoader, TensorDataset
train_dataset = TensorDataset(torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.long))

In [24]:
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=False)

## Building the model and training loop

In [25]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [26]:
# Model class
class RNN(nn.Module):
  def __init__(self):
    super().__init__()
    self.dropout = nn.Dropout(p=0.5)
    self.rnn = nn.RNN(input_size=4,
                      hidden_size=32,
                      num_layers=1,
                      batch_first=True) # means outputs are (batch, seq, feature)
    self.fc = nn.Linear(32, 2)

  def forward(self, x):
    x = self.dropout(x)
    h0 = torch.zeros(1, x.size(0), 32).to(device)
    out, hidden = self.rnn(x, h0)
    final_hidden = hidden.squeeze(0)
    logits = self.fc(final_hidden)

    return logits

In [27]:
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

{np.int64(0): np.int64(31), np.int64(1): np.int64(53)}


In [28]:

import torch.optim as optim

# Training loop
epochs = 10
model = RNN()


model.to(device)
print(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(epochs):
  epoch_loss = 0
  n_correct=0
  for data in train_loader:
    # Set gradients to 0
    optimizer.zero_grad()

    # Get batches
    feature, targets = data
    feature = feature.to(device)
    targets = targets.to(device).reshape(feature.shape[0], 1)

    # Predictions
    preds = model(feature)
    targets = torch.flatten(targets)

    # Calculate loss for batch
    loss = criterion(preds, targets)

    preds_labels = torch.argmax(preds, dim=1).reshape(preds.shape[0], 1)
    values = (preds_labels == targets.reshape(targets.shape[0], 1)).float()

    n_correct += torch.sum(values)

    # Backpropagation
    loss.backward()

    # Update weights
    optimizer.step()

    epoch_loss += loss
    if epoch == 2:
      break

  print("The loss for epoch {0} is {1:.4f}.".format(epoch+1, epoch_loss.item() / len(train_loader)))
  print(n_correct.item()) # predicts same thing but loss decreases
  print(n_correct.item() / len(X_train)) # figure this out



cpu
The loss for epoch 1 is 0.7080.
47.0
0.5595238095238095
The loss for epoch 2 is 0.6743.
52.0
0.6190476190476191
The loss for epoch 3 is 0.0471.
8.0
0.09523809523809523
The loss for epoch 4 is 0.6849.
54.0
0.6428571428571429
The loss for epoch 5 is 0.6679.
52.0
0.6190476190476191
The loss for epoch 6 is 0.6584.
52.0
0.6190476190476191
The loss for epoch 7 is 0.6549.
53.0
0.6309523809523809
The loss for epoch 8 is 0.6579.
52.0
0.6190476190476191
The loss for epoch 9 is 0.6618.
55.0
0.6547619047619048
The loss for epoch 10 is 0.6682.
52.0
0.6190476190476191


## Metrics

## Analysis

## Implemented Model

In [154]:
rnn_implemented = RecurrentNeuralNetwork(hidden_size=10, optimizer="Adam", epochs=50, seq_length=57, batch_size=106)
rnn_implemented.fit(X_train, y_train)

The training loss at epoch 1 is 0.4326
The training loss at epoch 2 is 0.4313
The training loss at epoch 3 is 0.4344
The training loss at epoch 4 is 0.4295
The training loss at epoch 5 is 0.4379
The training loss at epoch 6 is 0.4388
The training loss at epoch 7 is 0.4346
The training loss at epoch 8 is 0.4331
The training loss at epoch 9 is 0.4342
The training loss at epoch 10 is 0.4301


Loss for hidden_size = 16 with no Adam:
The training loss at epoch 0 is 3.3198
The training loss at epoch 1 is 3.2469
The training loss at epoch 2 is 3.2965
The training loss at epoch 3 is 3.3724
The training loss at epoch 4 is 3.4002

Final loss for hidden_size = 16 with Adam:
The training loss at epoch 1 is 7.1368
The training loss at epoch 2 is 8.8937
The training loss at epoch 3 is 10.2670
The training loss at epoch 4 is 10.9233
The training loss at epoch 5 is 10.9977
The training loss at epoch 6 is 13.9131
The training loss at epoch 7 is 39.8660
The training loss at epoch 8 is 83.5445
The training loss at epoch 9 is 69.8890
The training loss at epoch 10 is 64.4569
blows up due to time step issue

## Sequence Generation