# Recurrent Neural Networks (TensorFlow)

[![Open in Colab](https://lab.aef.me/files/assets/colab-badge.svg)](https://colab.research.google.com/github/adamelliotfields/lab/blob/main/files/tf/rnn.ipynb)
[![Open in Kaggle](https://lab.aef.me/files/assets/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/adamelliotfields/lab/blob/main/files/tf/rnn.ipynb)
[![Render nbviewer](https://lab.aef.me/files/assets/nbviewer_badge.svg)](https://nbviewer.org/github/adamelliotfields/lab/blob/main/files/tf/rnn.ipynb)

Recurrent Neural Networks (RNNs) differ from feed-forward networks in that they have loops within their architecture, allowing information from previous steps to influence the current operation.

RNNs can "remember" the previous words in a sentence to provide context for understanding the current word. This makes RNNs powerful for tasks involving sequential data like language modeling and time series forecasting.
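
To make the "loop" concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. It is only an illustration, not how Keras implements `SimpleRNN`; the names `rnn_forward`, `W_x`, `W_h`, and `b` and the toy shapes are my own assumptions.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over one sequence and return every hidden state."""
    h = np.zeros(W_h.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:  # one iteration per time step
        # the same weights are reused at every step; h carries the "memory"
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))        # 5 time steps, 3 features each
W_x = rng.normal(size=(4, 3)) * 0.1  # input-to-hidden weights
W_h = rng.normal(size=(4, 4)) * 0.1  # hidden-to-hidden (recurrent) weights
b = np.zeros(4)

states = rnn_forward(seq, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

Because `h` is fed back in at every step, changing the first input changes every later hidden state.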

Ilya Sutskever's PhD thesis from 2013 was on [_Training Recurrent Neural Networks_](https://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf).

This notebook includes my notes on RNNs as well as examples of variants like LSTMs, GRUs, and BiRNNs using the [Air Passengers](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/AirPassengers) and [IMDB Reviews](https://ai.stanford.edu/~amaas/data/sentiment/) datasets.

> If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs. &mdash; [Andrej Karpathy](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

## Backpropagation Through Time

A regular deep neural network has many layers, each with its own weights. When we backpropagate through the network, we calculate the gradient of the loss function with respect to each weight in each layer.

The number of "loops" in an RNN equals the number of time steps in the sequence. If the input is 100 characters, there are 100 steps. Each step effectively forms a new layer that is connected to the previous step. To train an RNN, we "unroll" these steps into a deep feed-forward network whose weights are shared across all steps, then backpropagate through it, summing each step's gradient contribution to the shared weights. This is known as BPTT (backpropagation through time).

In Keras, this is all handled transparently.
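
To see what is being handled, here is a rough sketch of BPTT for a scalar RNN, checked against a finite-difference gradient. The recurrence `h_t = tanh(w * h_{t-1} + u * x_t)`, the squared-error loss on the last state, and all of the numbers are assumptions chosen to keep the example small.

```python
import numpy as np


def forward(w, u, xs):
    """Unroll h_t = tanh(w * h_{t-1} + u * x_t) and return all states."""
    hs = [0.0]  # h_0
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2


def bptt_grad_w(w, u, xs, target):
    """Gradient of the loss with respect to the shared recurrent weight w."""
    hs = forward(w, u, xs)
    dh = hs[-1] - target  # dL/dh_T
    grad_w = 0.0
    for t in range(len(xs), 0, -1):   # walk backwards through the time steps
        da = dh * (1.0 - hs[t] ** 2)  # back through tanh at step t
        grad_w += da * hs[t - 1]      # w is shared, so contributions are summed
        dh = da * w                   # pass the gradient on to h_{t-1}
    return grad_w


xs, target, w, u = [0.5, -0.1, 0.3, 0.8], 0.2, 0.7, 1.1
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
analytic = bptt_grad_w(w, u, xs, target)
print(analytic, numeric)
```

The two printed values should agree to several decimal places. The repeated multiplication by `w` and the tanh derivative in the backward loop is also where vanishing and exploding gradients come from.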

## Statefulness

By default, the recurrent state is reset at the start of every batch. If you need to carry state across batches, set `stateful=True`: the final state of the sample at index `i` in one batch becomes the initial state of the sample at index `i` in the next batch. This is useful when a sequence is too long to fit in one batch and has to be split into consecutive chunks.
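
The effect of carrying state can be sketched in plain NumPy. This is not the Keras implementation; `run_chunk` and the toy weights are hypothetical. It shows that carrying the state across two chunks gives the same result as processing the whole sequence at once, while starting the second chunk from zeros does not.

```python
import numpy as np


def run_chunk(chunk, h, W_x, W_h):
    """Process one chunk of time steps starting from state h; return the final state."""
    for x_t in chunk:
        h = np.tanh(W_x @ x_t + W_h @ h)
    return h


rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 2)) * 0.5
W_h = rng.normal(size=(4, 4)) * 0.5
long_seq = rng.normal(size=(20, 2))           # one long sequence of 20 steps
first, second = long_seq[:10], long_seq[10:]  # split across two "batches"

h0 = np.zeros(4)
full = run_chunk(long_seq, h0, W_x, W_h)

# stateless: the second chunk starts from zeros, so the first chunk is forgotten
stateless = run_chunk(second, h0, W_x, W_h)

# stateful: the second chunk starts from the first chunk's final state
carried = run_chunk(second, run_chunk(first, h0, W_x, W_h), W_x, W_h)

print(np.abs(carried - full).max())    # difference when state is carried
print(np.abs(stateless - full).max())  # difference when state is reset
```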

## Bidirectional

Bidirectional RNNs are a variant that runs two recurrent layers over the same sequence, each with its own hidden state: one processes the sequence from left to right and the other from right to left. The outputs of the two are combined (concatenated by default in Keras), so the model can use both "past" and "future" context at every step.
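
A minimal NumPy sketch of the idea, assuming a plain tanh RNN in each direction and concatenation as the merge step. The helper names `run_rnn` and `bidirectional` are mine, not Keras APIs.

```python
import numpy as np


def run_rnn(inputs, W_x, W_h):
    """Vanilla RNN pass that returns the hidden state at every time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)


def bidirectional(inputs, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass, step by step."""
    forward_states = run_rnn(inputs, *fwd)
    # run on the reversed sequence, then flip back so index t lines up with input t
    backward_states = run_rnn(inputs[::-1], *bwd)[::-1]
    return np.concatenate([forward_states, backward_states], axis=-1)


rng = np.random.default_rng(0)
fwd = (rng.normal(size=(4, 3)) * 0.5, rng.normal(size=(4, 4)) * 0.5)
bwd = (rng.normal(size=(4, 3)) * 0.5, rng.normal(size=(4, 4)) * 0.5)
seq = rng.normal(size=(6, 3))  # 6 time steps, 3 features

out = bidirectional(seq, fwd, bwd)
print(out.shape)  # (6, 8)
```

At step 0, the first four values depend only on the first input, while the last four have already seen the whole sequence.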

## Long Short-Term Memory

Because RNN layers form very deep networks when unrolled, they suffer from the same gradient stability problems as other very deep networks: gradients can vanish or explode as they are propagated back through many steps. This makes it difficult for traditional RNNs to learn long-range dependencies in the data.

LSTM (long short-term memory) networks (Hochreiter and Schmidhuber, [1997](https://www.bioinf.jku.at/publications/older/2604.pdf)) introduce memory cells and gating mechanisms to selectively remember and forget information. There are 3 gates in an LSTM network:
1. Input: decides what new information to store in the cell state.
2. Forget: decides what information to throw away from the cell state.
3. Output: decides what to output based on input and the memory of the cell.
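
The three gates can be sketched as a single time step in NumPy. This uses the common textbook formulation with the four parameter blocks stacked into `W`, `U`, and `b`; the stacking order (input, forget, output, candidate) is my own choice for the sketch and not necessarily the layout Keras uses internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters for i, f, o, g."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell values
    c = f * c_prev + i * g  # forget part of the old memory, write new information
    h = o * np.tanh(c)      # output a gated view of the memory
    return h, c


units, features = 32, 1
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * units, features)) * 0.1
U = rng.normal(size=(4 * units, units)) * 0.1
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(12, features)):  # 12 steps, like the examples below
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, W.size + U.size + b.size)  # (32,) 4352
```

The 4352 parameters are 4 × (32 × 1 + 32 × 32 + 32), which is what a first `LSTM(32)` layer over a single feature should contribute to the 12705 total in the Air Passengers example.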

## Gated Recurrent Unit

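GRU networks (Cho et al., [2014](https://arxiv.org/abs/1406.1078)) are a simplified version of LSTMs that combine the forget and input gates into a single update gate. They also merge the cell state and hidden state into one state vector and use a reset gate to control how much of the previous state feeds into the new candidate state.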
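
A single GRU step can be sketched in NumPy as follows. The parameter stacking (update, reset, candidate) and the convention that the update gate `z` keeps the old state are assumptions for this sketch. As far as I can tell, the Keras default (`reset_after=True`) is a slightly different variant with an extra set of biases, so parameter counts will not match this exactly.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step. W, U, b hold the stacked parameters for z, r, candidate."""
    n = h_prev.shape[0]
    xz, xr, xh = np.split(W @ x_t + b, 3)
    U_z, U_r, U_h = U[:n], U[n : 2 * n], U[2 * n :]
    z = sigmoid(xz + U_z @ h_prev)              # update gate
    r = sigmoid(xr + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(xh + U_h @ (r * h_prev))  # candidate state
    # one gate does both jobs: keep z of the old state, write (1 - z) of the new
    return z * h_prev + (1.0 - z) * h_tilde


units, features = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(3 * units, features)) * 0.5
U = rng.normal(size=(3 * units, units)) * 0.5
b = np.zeros(3 * units)

h = np.zeros(units)
for x_t in rng.normal(size=(5, features)):
    h = gru_step(x_t, h, W, U, b)

print(h.shape)  # (4,)
```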

## Additional Resources

* [wikipedia.org/wiki/recurrent_neural_network](https://en.wikipedia.org/wiki/Recurrent_neural_network)
* [wikipedia.org/wiki/long_short-term_memory](https://en.wikipedia.org/wiki/Long_short-term_memory)
* [wikipedia.org/wiki/gated_recurrent_unit](https://en.wikipedia.org/wiki/Gated_recurrent_unit)
* [deeplearningbook.org/contents/rnn](https://www.deeplearningbook.org/contents/rnn.html)
* [developer.ibm.com/articles/cc-cognitive-recurrent-neural-networks](https://developer.ibm.com/articles/cc-cognitive-recurrent-neural-networks/)
* [neptune.ai/blog/recurrent-neural-network-guide](https://neptune.ai/blog/recurrent-neural-network-guide)
* [blog.paperspace.com/bidirectional-rnn-keras](https://blog.paperspace.com/bidirectional-rnn-keras/)
* [stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
* [colah.github.io/posts/2015-08-understanding-lstms](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [joshvarty.github.io/visualizingrnns](https://joshvarty.github.io/VisualizingRNNs/)
* [machinelearningmastery.com/understanding-simple-recurrent-neural-networks-in-keras](https://machinelearningmastery.com/understanding-simple-recurrent-neural-networks-in-keras/)
* [bouvet.no/bouvet-deler/explaining-recurrent-neural-networks](https://www.bouvet.no/bouvet-deler/explaining-recurrent-neural-networks)
* [tensorflow.org/guide/keras/working_with_rnns](https://www.tensorflow.org/guide/keras/working_with_rnns)
* [pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html)

## Examples

In [None]:
import os
from importlib.util import find_spec

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["KERAS_BACKEND"] = "tensorflow"

if find_spec("google.colab"):
    os.environ["TFDS_DATA_DIR"] = "/content/drive/MyDrive/tensorflow_datasets"

In [None]:
import keras
import numpy as np
import tensorflow as tf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

from datetime import datetime
from sklearn.preprocessing import MinMaxScaler

### Air Passengers

In [None]:
# converts fractional years to datetimes
def parse_time(d):
    year = int(d)
    month = d - year
    month = round(month * 12) + 1
    return datetime(year, month, 1)


def get_sequences(data, length=12):
    sequences = []
    labels = []
    for i in range(len(data) - length):
        sequences.append(data[i : i + length])
        labels.append(data[i + length])
    return np.array(sequences), np.array(labels)
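
A quick way to see what the sliding window produces is to run it on a tiny array. The function below is the same `get_sequences` as above, repeated so the snippet runs on its own; `X_demo` and `y_demo` are throwaway names.

```python
import numpy as np


def get_sequences(data, length=12):
    sequences, labels = [], []
    for i in range(len(data) - length):
        sequences.append(data[i : i + length])  # window of `length` values
        labels.append(data[i + length])         # the value right after the window
    return np.array(sequences), np.array(labels)


X_demo, y_demo = get_sequences(np.arange(6), length=3)
print(X_demo.tolist())  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(y_demo.tolist())  # [3, 4, 5]
```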

In [None]:
# https://www.rdocumentation.org/packages/datasets/topics/AirPassengers
air_df = sm.datasets.get_rdataset("AirPassengers", cache=True).data
air_df.columns = ["ds", "y"]
air_df["ds"] = air_df["ds"].apply(parse_time)
air_df.set_index("ds", inplace=True)

In [None]:
# normalize
scaler = MinMaxScaler()
air_df_scaled = air_df.copy()
air_df_scaled["y"] = scaler.fit_transform(air_df_scaled[["y"]])

# generates 132 sequences of 12 (144 - 12 = 132)
X, y = get_sequences(air_df_scaled.y.values, length=12)

# split chronologically into train (96), test (last 24), and validation (12 before test) sets
X_train, y_train = X[:-36], y[:-36]
X_test, y_test = X[-24:], y[-24:]
X_val, y_val = X[-36:-24], y[-36:-24]

# reshape for RNN input layer
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], 1)

#### Simple RNN

In [None]:
# clear model cache
keras.backend.clear_session()

# build model
# when stacking recurrent layers, set return_sequences=True on all but the last
x_input = keras.Input(shape=(12, 1))
x = keras.layers.SimpleRNN(
    32,
    activation="relu",
    recurrent_dropout=0.1,
    return_sequences=True,
)(x_input)
x = keras.layers.SimpleRNN(32, activation="relu", recurrent_dropout=0.1)(x)
x = keras.layers.Dense(1, activation="linear")(x)
rnn_model = keras.Model(x_input, outputs=x, name="AirPassengersRNN")
rnn_model.compile(optimizer="adam", loss="mse")  # 3201 params

In [None]:
# train
rnn_model.fit(X_train, y_train, epochs=200, validation_data=(X_val, y_val), verbose=0);

In [None]:
# inverse transform the predictions and actual values
y_pred = rnn_model.predict(X_test, verbose=0)
y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred_inv = scaler.inverse_transform(y_pred)

# plot the results
plt.figure(figsize=(8, 6))
plt.plot(air_df.index[-24:], y_test_inv, label="Actual")
plt.plot(air_df.index[-24:], y_pred_inv, label="Predicted")
plt.legend()
plt.show()

#### LSTM

In [None]:
keras.backend.clear_session()

x_input = keras.Input(shape=(12, 1))
x = keras.layers.LSTM(32, activation="relu", recurrent_dropout=0.1, return_sequences=True)(x_input)
x = keras.layers.LSTM(32, activation="relu", recurrent_dropout=0.1)(x)
x = keras.layers.Dense(1, activation="linear")(x)
lstm_model = keras.Model(x_input, outputs=x, name="AirPassengersLSTM")
lstm_model.compile(optimizer="adam", loss="mse")  # 12705 params
lstm_model.summary()

In [None]:
# requires more epochs and normalization
lstm_model.fit(X_train, y_train, epochs=200, validation_data=(X_val, y_val), verbose=0);

In [None]:
# inverse transform the predictions and actual values
y_pred = lstm_model.predict(X_test, verbose=0)
y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred_inv = scaler.inverse_transform(y_pred)

# plot the results
plt.figure(figsize=(8, 6))
plt.plot(air_df.index[-24:], y_test_inv, label="Actual")
plt.plot(air_df.index[-24:], y_pred_inv, label="Predicted")
plt.legend()
plt.show()

#### GRU

In [None]:
keras.backend.clear_session()

x_input = keras.Input(shape=(12, 1))
x = keras.layers.GRU(32, activation="relu", recurrent_dropout=0.1, return_sequences=True)(x_input)
x = keras.layers.GRU(32, activation="relu", recurrent_dropout=0.1)(x)
x = keras.layers.Dense(1, activation="linear")(x)
gru_model = keras.Model(x_input, outputs=x, name="AirPassengersGRU")
gru_model.compile(optimizer="adam", loss="mse")  # 9729 params
gru_model.summary()

In [None]:
# GRU requires more epochs to be competitive with the LSTM
gru_model.fit(X_train, y_train, epochs=400, validation_data=(X_val, y_val), verbose=0);

In [None]:
# inverse transform the predictions and actual values
y_pred = gru_model.predict(X_test, verbose=0)
y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred_inv = scaler.inverse_transform(y_pred)

# plot the results
plt.figure(figsize=(8, 6))
plt.plot(air_df.index[-24:], y_test_inv, label="Actual")
plt.plot(air_df.index[-24:], y_pred_inv, label="Predicted")
plt.legend()
plt.show()

### IMDB Reviews

In [None]:
(imdb_train, imdb_test, imdb_unsupervised), imdb_info = tfds.load(
    "imdb_reviews",
    with_info=True,
    as_supervised=True,
    split=("train", "test", "unsupervised"),
)

In [None]:
# features are "text" (str) and "label" (int); as_supervised=True yields (text, label) tuples
for example, label in imdb_train.take(1):
    print("text: ", example.numpy().decode("utf8"))
    print("label: ", label.numpy())

for example, label in imdb_test.take(1):
    print("text: ", example.numpy().decode("utf8"))
    print("label: ", label.numpy())

# unsupervised set is unlabeled (-1)
for example, label in imdb_unsupervised.take(1):
    print("text: ", example.numpy().decode("utf8"))
    print("label: ", label.numpy())

In [None]:
# shuffle and batch
imdb_train = imdb_train.shuffle(buffer_size=imdb_train.cardinality())
imdb_train = imdb_train.batch(128)
imdb_train = imdb_train.prefetch(tf.data.AUTOTUNE)

imdb_test = imdb_test.batch(128).prefetch(tf.data.AUTOTUNE)

In [None]:
VOCAB_SIZE = 1000

encoder = keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(imdb_train.map(lambda text, label: text))  # take only text

vocab = np.array(encoder.get_vocabulary())

#### LSTM

In [None]:
keras.backend.clear_session()

x_input = keras.Input(shape=(1,), dtype=tf.string)
x = encoder(x_input)
x = keras.layers.Embedding(
    input_dim=len(vocab),
    output_dim=64,
    mask_zero=True,
)(x)  # mask to handle variable sequence lengths
x = keras.layers.LSTM(128)(x)
x = keras.layers.Dense(64, activation="gelu")(x)
x = keras.layers.Dense(1)(x)

imdb_lstm = keras.Model(x_input, outputs=x, name="IMDBLSTM")
imdb_lstm.compile(
    metrics=["accuracy"],
    loss=keras.losses.BinaryCrossentropy(from_logits=True),  # no activation on output layer
    optimizer=keras.optimizers.AdamW(
        weight_decay=4e-4,
        learning_rate=1e-4,  # lower learning rate than default
    ),
)
imdb_lstm.summary()

In [None]:
# ~15s/epoch on T4/L4
imdb_lstm.fit(imdb_train, epochs=5, validation_data=imdb_test);

In [None]:
imdb_lstm.save("imdb_lstm.keras")
imdb_lstm = keras.saving.load_model("imdb_lstm.keras")

In [None]:
test_loss, test_acc = imdb_lstm.evaluate(imdb_test, verbose=0)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_acc}")

#### BiGRU

In [None]:
keras.backend.clear_session()

x_input = keras.Input(shape=(1,), dtype=tf.string)
x = encoder(x_input)
x = keras.layers.Embedding(input_dim=len(vocab), output_dim=64, mask_zero=True)(x)
x = keras.layers.Bidirectional(keras.layers.GRU(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.GRU(64))(x)
x = keras.layers.Dense(64, activation="gelu")(x)
x = keras.layers.Dense(1)(x)

imdb_bigru = keras.Model(x_input, outputs=x, name="IMDBBiGRU")
imdb_bigru.compile(
    metrics=["accuracy"],
    optimizer=keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=4e-4),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
)
imdb_bigru.summary()

In [None]:
# train before saving and evaluating (assumed cell: mirrors the LSTM training call above)
imdb_bigru.fit(imdb_train, epochs=5, validation_data=imdb_test);

In [None]:
imdb_bigru.save("imdb_bigru.keras")
imdb_bigru = keras.saving.load_model("imdb_bigru.keras")

In [None]:
test_loss, test_acc = imdb_bigru.evaluate(imdb_test, verbose=0)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_acc}")

In [None]:
y_preds = imdb_bigru.predict(
    np.array(["This movie was great!", "This movie was so bad."]),
    verbose=0,
)
# the model outputs logits (no sigmoid), so 0.0 is the decision boundary
for pred in y_preds:
    print(f"Sentiment: {'positive' if pred[0] >= 0.0 else 'negative'}")
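
Both IMDB models use `from_logits=True`, so their raw outputs are logits rather than probabilities. The sketch below uses made-up logit values (not real model output) to show that thresholding a logit at 0.0 is the same as thresholding the sigmoid probability at 0.5.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


logits = np.array([-2.0, 0.0, 2.0])  # made-up example values, not model output
probs = sigmoid(logits)

print(probs.round(3).tolist())  # [0.119, 0.5, 0.881]
print(bool(((logits >= 0.0) == (probs >= 0.5)).all()))  # True
```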