# Recurrent Neural Networks: Programming Practice

COSC 410: Applied Machine Learning\
Colgate University\
*Prof. Apthorpe*

## Overview

This notebook will give you practice with the following topics:
  1. Creating and training simple RNNs
  2. Creating and training LSTM & GRU networks

We will use a new **natural language** dataset of IMDB movie reviews and train classifiers to label each review as having either *positive* or *negative* sentiment. The reviews have already been converted from words to integers by frequency rank: the most common word has rank 1, the second most common word has rank 2, and so on.
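
As a rough illustration of frequency-rank encoding, the sketch below builds such a mapping for a toy corpus. The corpus and the helper name `build_rank_encoding` are made up for this example; the real IMDB preprocessing was done ahead of time and may differ in details such as tokenization and reserved indices.

```python
from collections import Counter

def build_rank_encoding(texts):
    """Map each word to its frequency rank (1 = most common). Hypothetical helper."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; break ties alphabetically so the result is deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

toy_reviews = ["the movie was great", "the movie was bad", "the plot"]  # made-up data
encoding = build_rank_encoding(toy_reviews)
encoded = [[encoding[w] for w in text.split()] for text in toy_reviews]

print(encoding)
print(encoded)
```

Frequent words receive small integers, which is why capping the vocabulary at the top `max_features` words amounts to a simple threshold on the integer values.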

## Part 1. Data Import & Inspection

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras as ks

In [2]:
max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Keep at most 200 words per review (pad_sequences truncates from the front by default, so the last 200 are kept)

(x_train, y_train), (x_val, y_val) = ks.datasets.imdb.load_data(num_words=max_features) # Load the dataset

x_train = ks.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen) # Pad reviews with 0s to equal length
x_val = ks.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen) # Pad reviews with 0s to equal length

print(len(x_train), "Training sequences") # Print number of training reviews
print(len(x_val), "Validation sequences") # Print number of validation reviews

print(np.unique(y_train, return_counts=True)[1]) # print counts of classes in training set
print(np.unique(y_val, return_counts=True)[1]) # print counts of classes in validation set

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 Training sequences
25000 Validation sequences
[12500 12500]
[12500 12500]


**Note:** One of the benefits of RNNs is that they can handle sequences of differing lengths, so why do we pad the training and validation data to equal lengths? Training time! Keras is optimized for mini-batch gradient descent when each batch of training examples forms a non-ragged array (a.k.a. a *tensor*, hence "TensorFlow"). Once we have trained the model, we can make predictions on *new* reviews of arbitrary length.
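
To make the padding step concrete, here is a minimal NumPy sketch of what padding a ragged list into a dense array looks like. It mimics `pad_sequences` under the assumption that the defaults `padding="pre"` and `truncating="pre"` are in effect; `pad_pre` is a hypothetical helper written for this illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences into a dense 2D array."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]          # keep only the last `maxlen` items
        if len(kept) > 0:
            out[i, -len(kept):] = kept      # right-align so padding sits at the front
    return out

ragged = [[5, 8, 2], [7], [4, 9, 1, 3, 6]]  # made-up reviews of different lengths
batch = pad_pre(ragged, maxlen=4)
print(batch)
```

Every row now has the same length, so the whole batch can be processed as one array. Note that the longest sequence lost its first item rather than its last.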

In [7]:
# Function that converts an integer-encoded review back into a string of English words
def reverse_embedding(review):
    words_to_ints = ks.datasets.imdb.get_word_index(path="imdb_word_index.json")
    reverse_embedding_map = dict(zip(words_to_ints.values(), words_to_ints.keys()))
    return " ".join([reverse_embedding_map.get(x, "") for x in review])

# Function that prints the label (1 = positive review, 0 = negative review) and decoded text of a review
def print_example_and_label(idx):
    print(f"Label: {y_train[idx]}")
    print(reverse_embedding(x_train[idx]) + "\n")
    
print_example_and_label(0)
print_example_and_label(1)

Label: 1
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit 
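
The decoded text above does not read like a coherent review. One plausible explanation, assuming `load_data` was called with its default `index_from=3` (with 0, 1, and 2 reserved for padding, start-of-sequence, and out-of-vocabulary markers), is that the integers in `x_train` are shifted by 3 relative to the ranks returned by `get_word_index`. The sketch below applies that shift when decoding. `decode_review` and the toy word index are hypothetical stand-ins so the example runs without downloading anything.

```python
def decode_review(review, word_index, index_from=3):
    """Decode an integer-encoded review, assuming ranks were shifted by `index_from`."""
    # Reserved indices under the assumed load_data defaults
    int_to_word = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    # Shift every rank so it lines up with the integers stored in the reviews
    int_to_word.update({rank + index_from: word for word, rank in word_index.items()})
    return " ".join(int_to_word.get(int(i), "<unk>") for i in review)

toy_word_index = {"the": 1, "film": 2, "brilliant": 3}  # made-up stand-in for get_word_index()
toy_review = [0, 0, 1, 4, 5, 2, 6]                      # made-up padded, encoded review
decoded = decode_review(toy_review, toy_word_index)
print(decoded)
```

With the real data, `word_index` would be the dictionary returned by `ks.datasets.imdb.get_word_index()` and `review` a row of `x_train`.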

## Part 2. 1-Node RNNs

We'll start by creating three different 1-node RNNs, built from a `SimpleRNN`, an `LSTM`, and a `GRU` layer respectively.
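
Before training, it may help to see what a single recurrent node computes. The sketch below runs one sigmoid unit over a short sequence of scalar inputs, which is roughly the situation of a 1-node `SimpleRNN` fed 1-dimensional embeddings. The weights and inputs are made-up values, and `one_node_rnn` is a hypothetical helper rather than Keras code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_node_rnn(inputs, w_in, w_rec, bias):
    """Run one sigmoid recurrent unit over a 1D sequence and return every hidden state."""
    h = 0.0                      # initial hidden state
    states = []
    for x_t in inputs:
        # The new state depends on the current input and the previous state
        h = sigmoid(w_in * x_t + w_rec * h + bias)
        states.append(h)
    return np.array(states)

embedded = np.array([0.5, -1.0, 2.0])   # made-up 1D embeddings for a 3-word review
states = one_node_rnn(embedded, w_in=1.0, w_rec=0.5, bias=0.0)
prediction = states[-1]                 # final state, read as P(positive)
print(states, prediction)
```

Because the final state passes through a sigmoid, it lies between 0 and 1 and can be used directly with a binary cross-entropy loss.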

In [9]:
# SimpleRNN model
model = ks.Sequential([
    ks.Input(shape=(None,)),
    ks.layers.Embedding(max_features, 1),
    ks.layers.SimpleRNN(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1f2a9f900d0>

In [10]:
# LSTM model
model = ks.Sequential([
    ks.Input(shape=(None,)),
    ks.layers.Embedding(max_features, 1),
    ks.layers.LSTM(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

KeyboardInterrupt: 

In [11]:
# GRU model
model = ks.Sequential([
    ks.Input(shape=(None,)),
    ks.layers.Embedding(max_features, 1),
    ks.layers.GRU(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1f2a68145b0>
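
The three models above differ in how many trainable weights sit inside the single recurrent node. The sketch below estimates those counts under assumed standard formulations: one weight block for `SimpleRNN`, four for `LSTM`, and three for `GRU`, with the GRU count assuming the `reset_after=True` variant that keeps separate input and recurrent biases. `recurrent_param_counts` is a hypothetical helper; `model.summary()` remains the authoritative check.

```python
def recurrent_param_counts(input_dim, units):
    """Estimate trainable weights per recurrent layer type (assumed formulas)."""
    # One block = input kernel + recurrent kernel + bias
    per_block = units * (input_dim + units) + units
    return {
        "SimpleRNN": per_block,
        "LSTM": 4 * per_block,              # input, forget, and output gates plus candidate
        "GRU": 3 * (per_block + units),     # assumes reset_after=True (extra recurrent bias)
    }

counts = recurrent_param_counts(input_dim=1, units=1)
print(counts)
```

Even at one node, the gated layers carry several times as many weights as the simple recurrent unit, which is part of why they take longer per epoch.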

## Part 3. Recurrent Layer Hyperparameters

`SimpleRNN`, `LSTM` and `GRU` layers share several hyperparameters that can be set via keyword arguments. The documentation for each of these layers is here: https://keras.io/api/layers/recurrent_layers/
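  * `return_sequences`: Whether the layer returns only its output at the final time step of each input sequence (`False`, *default*) or its output at every time step, i.e. a new sequence of the same length (`True`). Set this to `True` on any recurrent layer that feeds another recurrent layer.
  * `kernel_regularizer`, `bias_regularizer`, and `recurrent_regularizer` for L1, L2, or L1L2 regularization of the input weights, biases, and recurrent weights, respectively
  * `dropout` and `recurrent_dropout` for setting the dropout rates applied to the layer's inputs and to its recurrent state, respectively.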
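
The effect of `return_sequences` on output shapes can be sketched with a small NumPy recurrent layer. This is an illustration under simplifying assumptions (a `tanh` unit, no bias, no batch dimension, random made-up weights); `simple_rnn_layer` is a hypothetical helper, not the Keras layer.

```python
import numpy as np

def simple_rnn_layer(x, w_in, w_rec, return_sequences=False):
    """Apply a tanh recurrent layer to x with shape (timesteps, input_dim)."""
    h = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        outputs.append(h)
    # Either every step's output (a new sequence) or only the final one
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                                       # 5 time steps, 3 features
w_in1, w_rec1 = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))  # layer 1: 4 units
w_in2, w_rec2 = rng.normal(size=(4, 2)), rng.normal(size=(2, 2))  # layer 2: 2 units

seq = simple_rnn_layer(x, w_in1, w_rec1, return_sequences=True)   # full sequence
last = simple_rnn_layer(seq, w_in2, w_rec2)                       # final step only
print(seq.shape, last.shape)
```

The first layer has to emit a full sequence so the second layer has something to iterate over; only the last layer collapses the sequence to a single output.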

## Part 4: Stacked RNNs

Finally, let's try a stacked LSTM network with 2 hidden layers. 

In [None]:
# Stacked LSTM model
model = ks.Sequential([
    ks.Input(shape=(None,)),
    ks.layers.Embedding(max_features, 1),
    ks.layers.LSTM(64, return_sequences=True),  # hidden layer 1: pass the full sequence on
    ks.layers.LSTM(64, return_sequences=True),  # hidden layer 2: pass the full sequence on
    ks.layers.LSTM(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))

We can see that the stacked RNN reaches a higher plateau accuracy than the 1-node models and gets there in fewer epochs. However, the growing gap between training accuracy and validation accuracy indicates that we are in an overfitting regime, so it doesn't make sense to increase the size of the network further.
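
One way to quantify the train/validation gap is to read it from the dictionary stored in the `History` object that `model.fit` returns. The sketch below assumes the keys `"accuracy"` and `"val_accuracy"`, which is what `metrics=["accuracy"]` produces in recent TensorFlow versions. The numbers are hypothetical values chosen for illustration, not results from this notebook, and both helpers are made up for this example.

```python
def generalization_gaps(history):
    """Per-epoch difference between training and validation accuracy."""
    pairs = zip(history["accuracy"], history["val_accuracy"])
    return [round(train - val, 4) for train, val in pairs]

def best_epoch(history):
    """1-indexed epoch with the highest validation accuracy."""
    val = history["val_accuracy"]
    return max(range(len(val)), key=val.__getitem__) + 1

# Hypothetical training curves for illustration only
toy_history = {
    "accuracy":     [0.70, 0.85, 0.92, 0.96],
    "val_accuracy": [0.72, 0.84, 0.85, 0.84],
}
gaps = generalization_gaps(toy_history)
print(gaps, best_epoch(toy_history))
```

A gap that keeps widening after validation accuracy has peaked is the usual sign of overfitting. In practice `toy_history` would be replaced by `history.history`, where `history` is the value returned by `model.fit`.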