# Chapter 15: Processing Sequences Using RNNs and CNNs

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

The batter hits the ball. The outfielder immediately starts running, anticipating the ball’s trajectory. He tracks it, adapts his movements, and finally catches it (under a thunder of applause). Predicting the future is something you do all the time, whether you are finishing a friend’s sentence or anticipating the smell of coffee at breakfast. In this chapter we will discuss recurrent neural networks (RNNs), a class of nets that can predict the future (well, up to a point, of course). They can analyze time series data such as stock prices, and tell you when to buy or sell. In autonomous driving systems, they can anticipate car trajectories and help avoid accidents. More generally, they can work on sequences of arbitrary lengths, rather than on fixed-sized inputs like all the nets we have considered so far. For example, they can take sentences, documents, or audio samples as input, making them extremely useful for natural language processing applications such as automatic translation or speech-to-text.

In this chapter we will first look at the fundamental concepts underlying RNNs and how to train them using backpropagation through time, then we will use them to forecast a time series. After that we'll explore the two main difficulties that RNNs face:
* **Unstable gradients** (discussed in Chapter 11), which can be alleviated using various techniques, including recurrent dropout and recurrent layer normalization.
* **A very limited short-term memory**, which can be extended using LSTMs, GRUs, and 1D convolutional layers.

## 2. Recurrent Neurons and Layers

Up to now, we have focused on feedforward neural networks, where activations flow only in one direction, from the input layer to the output layer. A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. Let’s look at the simplest possible RNN, composed of just one neuron receiving inputs, producing an output, and sending that output back to itself. At each time step $t$ (also called a frame), this recurrent neuron receives the inputs $\mathbf{x}_{(t)}$ as well as its own output from the previous time step, $\hat{y}_{(t-1)}$. Since there is no previous output at the first time step, it is generally set to 0.

We can represent this small network against the time axis. This is called *unrolling the network through time*.

Each recurrent neuron has two sets of weights: one for the inputs $\mathbf{x}_{(t)}$ and the other for the outputs of the previous time step $\hat{y}_{(t-1)}$. Let’s call these weight vectors $\mathbf{w}_x$ and $\mathbf{w}_y$. If we consider the whole recurrent layer instead of just one recurrent neuron, we can place all the weight vectors in two weight matrices, $\mathbf{W}_x$ and $\mathbf{W}_y$. The output vector of the whole recurrent layer can then be computed as follows:

$$ \hat{\mathbf{y}}_{(t)} = \phi(\mathbf{W}_x^T \mathbf{x}_{(t)} + \mathbf{W}_y^T \hat{\mathbf{y}}_{(t-1)} + \mathbf{b}) $$

Where:
* $\hat{\mathbf{y}}_{(t)}$ is the output vector of the recurrent layer at time step $t$.
* $\mathbf{x}_{(t)}$ is the input vector at time step $t$.
* $\mathbf{W}_x$ is the weight matrix for the inputs.
* $\mathbf{W}_y$ is the weight matrix for the outputs of the previous time step.
* $\mathbf{b}$ is the bias vector.
* $\phi$ is the activation function (usually hyperbolic tangent, `tanh`).
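
To make the equation above concrete, here is a minimal NumPy sketch of a basic recurrent layer run over a batch of sequences. It is an illustration, not how Keras implements `SimpleRNN`: the function name `recurrent_layer_forward`, the random initialization, and the sizes are assumptions chosen for the example. Instances are stored as rows, so the code computes $\mathbf{x}_{(t)}^T \mathbf{W}_x$, which is the transposed form of $\mathbf{W}_x^T \mathbf{x}_{(t)}$.

```python
import numpy as np

def recurrent_layer_forward(X, W_x, W_y, b, activation=np.tanh):
    """Run a basic recurrent layer over a batch of sequences.

    X has shape (batch_size, n_steps, n_inputs); the result has shape
    (batch_size, n_steps, n_units).
    """
    batch_size, n_steps, _ = X.shape
    n_units = W_y.shape[0]
    y_prev = np.zeros((batch_size, n_units))  # no previous output at the first step
    outputs = []
    for t in range(n_steps):
        # Row-vector form of y_(t) = phi(W_x^T x_(t) + W_y^T y_(t-1) + b)
        y_prev = activation(X[:, t] @ W_x + y_prev @ W_y + b)
        outputs.append(y_prev)
    return np.stack(outputs, axis=1)

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(42)
n_inputs, n_units = 3, 5
W_x = rng.normal(scale=0.5, size=(n_inputs, n_units))
W_y = rng.normal(scale=0.5, size=(n_units, n_units))
b = np.zeros(n_units)

X = rng.normal(size=(4, 10, n_inputs))  # 4 sequences, 10 time steps each
Y_hat = recurrent_layer_forward(X, W_x, W_y, b)
print(Y_hat.shape)  # (4, 10, 5)
```

The same two weight matrices and bias vector are reused at every iteration of the loop, which is what "unrolling through time" amounts to.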

### Memory Cells

Since the output of a recurrent neuron at time step $t$ is a function of all the inputs from previous time steps, you could say it has a form of memory. A part of a neural network that preserves some state across time steps is called a *memory cell* (or simply a cell). For the basic cells discussed so far, the output $\hat{\mathbf{y}}_{(t)}$ is simply equal to the cell's state $\mathbf{h}_{(t)}$, but in more complex cells such as LSTM the two differ.

### Input and Output Sequences

An RNN can be wired up in several ways, depending on what goes in and what comes out:
* **Sequence-to-sequence:** the network takes a sequence of inputs and produces a sequence of outputs, one per time step. This is useful for predicting time series.
* **Sequence-to-vector:** the network is fed a sequence of inputs and all outputs except the last one are ignored. This is useful for sentiment analysis (feed in a movie review, output a sentiment score).
* **Vector-to-sequence:** the network is fed a single input at the first time step (and zeros at all other steps) and outputs a sequence. This is useful for image captioning.
* **Encoder-decoder:** a sequence-to-vector network (the encoder) is followed by a vector-to-sequence network (the decoder). This is used for translation, where the whole source sentence should be read before the first output word is produced.

## 3. Training RNNs

To train an RNN, the trick is to unroll it through time and then simply use regular backpropagation. This strategy is called *Backpropagation Through Time (BPTT)*.

Just like in regular backpropagation, there is a first forward pass through the unrolled network. Then the output sequence is evaluated using a cost function $J$ (e.g., MSE for forecasting). The gradients of that cost function are then propagated backward through the unrolled network, from the last time step to the first. Note that the parameters $\mathbf{W}_x$, $\mathbf{W}_y$, and $\mathbf{b}$ are shared across all time steps, so the gradient with respect to each parameter is the sum of its contributions from every time step.
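
The "sum over time steps" can be checked on a toy case. The sketch below is an assumption-laden illustration, not the chapter's code: it uses a single scalar recurrent neuron, a loss $J = \frac{1}{2}(\hat{y}_{(T)} - \text{target})^2$ on the last output only, and made-up parameter values. It accumulates the BPTT gradient step by step and compares one component against a finite-difference estimate.

```python
import numpy as np

def forward(x, w_x, w_y, b):
    """Scalar RNN: y_t = tanh(w_x * x_t + w_y * y_(t-1) + b), starting from 0."""
    ys = [0.0]  # ys[0] is the initial "previous output"
    for x_t in x:
        ys.append(np.tanh(w_x * x_t + w_y * ys[-1] + b))
    return ys   # ys[t] is the output at time step t (1-based)

def loss(x, target, w_x, w_y, b):
    return 0.5 * (forward(x, w_x, w_y, b)[-1] - target) ** 2

def bptt(x, target, w_x, w_y, b):
    """Gradients of the loss with respect to the shared parameters."""
    ys = forward(x, w_x, w_y, b)
    grad_w_x = grad_w_y = grad_b = 0.0
    dy = ys[-1] - target                 # dJ/dy at the last time step
    for t in range(len(x), 0, -1):       # walk backward through time
        da = dy * (1.0 - ys[t] ** 2)     # backprop through tanh
        grad_w_x += da * x[t - 1]        # this step's contribution
        grad_w_y += da * ys[t - 1]
        grad_b += da
        dy = da * w_y                    # gradient reaching step t - 1
    return grad_w_x, grad_w_y, grad_b

# Hypothetical inputs and parameter values
x = np.array([0.5, -0.3, 0.8, 0.1])
target, w_x, w_y, b = 0.2, 0.7, -0.4, 0.1
grad_w_x, grad_w_y, grad_b = bptt(x, target, w_x, w_y, b)

# Central finite difference on w_y as an independent check
eps = 1e-6
numeric_w_y = (loss(x, target, w_x, w_y + eps, b)
               - loss(x, target, w_x, w_y - eps, b)) / (2 * eps)
print(grad_w_y, numeric_w_y)
```

The two printed values should agree to several decimal places; each `+=` inside the loop is one time step's share of the total gradient.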

## 4. Forecasting a Time Series

Let's create a simple time series function to generate data for our experiments.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10)) # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # + noise
    return series[..., np.newaxis].astype(np.float32)

# Create Training, Validation, and Test sets
n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

### Baseline Metrics

Before using RNNs, it is good to have a baseline. The simplest baseline is **Naive Forecasting**: predict the last value.

In [None]:
y_pred = X_valid[:, -1]
naive_mse = np.mean(keras.losses.mean_squared_error(y_valid, y_pred))
print("Naive Forecasting MSE:", naive_mse)

Another simple approach is to use a **Linear Regression** model (or a Dense layer).

In [None]:
model_linear = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1)
])
model_linear.compile(loss="mse", optimizer="adam")
model_linear.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
print("Linear Model evaluated:", model_linear.evaluate(X_valid, y_valid))

### Implementing a Simple RNN

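The simplest RNN we can build contains a single recurrent layer with a single neuron (one unit), which processes the sequence one time step at a time. We do not need to specify the length of the input sequences, since a recurrent layer can process any number of time steps; this is why the first input dimension is `None`. By default, `SimpleRNN` uses the `tanh` activation function and returns only the output of the last time step.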
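
A quick parameter count shows how small this model is compared with the linear baseline. The helper names below are hypothetical; the formulas follow from the layer definitions (one weight per input and per recurrent connection, plus one bias per unit).

```python
def simple_rnn_params(n_inputs, n_units):
    """Parameters of a basic recurrent layer: W_x, W_y and b."""
    return n_inputs * n_units + n_units * n_units + n_units

def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights and biases."""
    return n_inputs * n_units + n_units

print(dense_params(50 * 1, 1))   # linear model on 50 flattened time steps: 51
print(simple_rnn_params(1, 1))   # single recurrent neuron: 3
```

With only three parameters, the single-neuron RNN cannot be expected to beat a linear model that has one weight per time step.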

In [None]:
model_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])
model_rnn.compile(loss="mse", optimizer="adam")
model_rnn.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
print("Simple RNN evaluated:", model_rnn.evaluate(X_valid, y_valid))

### Deep RNNs

To stack multiple RNN layers, you must ensure that all intermediate layers return their full sequence of outputs (3D tensor) instead of just the last time step. You do this by setting `return_sequences=True`.

In [None]:
model_deep_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])
model_deep_rnn.compile(loss="mse", optimizer="adam")
model_deep_rnn.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))

### Forecasting Several Steps Ahead

So far we have predicted only the next time step (1 step ahead). What if we want to predict 10 steps ahead?

**Option 1: Autoregressive Prediction**
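Use the model to predict the next value, append that prediction to the inputs as if it had actually occurred, then predict the following value, and so on. Because each prediction is built on earlier predictions, errors tend to accumulate, so the forecast is usually less accurate for the later time steps.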

In [None]:
series = generate_time_series(1, n_steps + 10)
X_new, y_new = series[:, :n_steps], series[:, n_steps:]
X = X_new
for step_ahead in range(10):
    y_pred_one = model_deep_rnn.predict(X)[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

Y_pred = X[:, n_steps:]
print("10-step prediction using autoregression complete.")

**Option 2: Seq-to-Vec (Predicting a Vector)**
Train the model to output all 10 values at once. We need to regenerate the targets to be vectors of size 10.

In [None]:
# Regenerate data with 10 targets
series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

model_vector = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10) # Output 10 values at once
])

model_vector.compile(loss="mse", optimizer="adam")
model_vector.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))

**Option 3: Seq-to-Seq**
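Instead of training the model to forecast the next 10 values only at the last time step, we can train it to forecast the next 10 values at *every* time step. The loss then includes a term for the output at each step, so many more error gradients flow through the model, which both stabilizes and speeds up training. `TimeDistributed(Dense(10))` applies the same output layer independently at every time step. Since only the forecast at the last time step matters for evaluation, the custom `last_time_step_mse` metric reports the MSE at that step only.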

In [None]:
Y = np.empty((10000, n_steps, 10)) # each target is a sequence of 10D vectors
for step_ahead in range(1, 10 + 1):
    Y[..., step_ahead - 1] = series[..., step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:7000]
Y_valid = Y[7000:9000]
Y_test = Y[9000:]

model_seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

def last_time_step_mse(Y_true, Y_pred):
    return keras.metrics.mean_squared_error(Y_true[:, -1], Y_pred[:, -1])

model_seq.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model_seq.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))
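
The target construction loop above is easier to follow on a tiny example. This sketch reuses the same indexing with toy sizes (4 input steps and a horizon of 3, both assumptions for illustration) on the series 0, 1, 2, ..., so each target value is simply its own position in the series.

```python
import numpy as np

n_steps, horizon = 4, 3   # toy sizes, not the chapter's 50 and 10
series = np.arange(n_steps + horizon, dtype=np.float32).reshape(1, -1, 1)

X = series[:, :n_steps]   # inputs: the values 0, 1, 2, 3
Y = np.empty((1, n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    # Column step_ahead - 1 holds the series shifted step_ahead steps ahead
    Y[..., step_ahead - 1] = series[..., step_ahead:step_ahead + n_steps, 0]

print(Y[0])
# [[1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]
#  [4. 5. 6.]]
```

Row `t` of the target holds the `horizon` values that follow input step `t`, so the last row is exactly what the sequence-to-vector model was trained to predict.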

## 5. Handling Long Sequences

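Training an RNN on a long sequence means running it over many time steps, which makes the unrolled network very deep. Like any deep network, it can suffer from unstable (vanishing or exploding) gradients, and it gradually loses track of the patterns seen in the first inputs.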

### Fighting Unstable Gradients
We can use good initialization, faster optimizers, and dropout. For RNNs, Layer Normalization (LN) is more effective than Batch Normalization. LN normalizes across the feature dimension instead of the batch dimension.
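
The difference between the two normalization axes can be sketched in NumPy. This is a simplified illustration: the learned scale and offset parameters are omitted, and the function names, the `eps` value, and the array sizes are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalize each instance over its own features (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-3):
    """Normalize each feature over the instances in the batch (first axis)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(32, 20))  # (batch_size, n_units)

ln = layer_norm(h)
bn = batch_norm(h)
print(ln.mean(axis=1).round(6)[:3])   # per-instance means are ~0
print(bn.mean(axis=0).round(6)[:3])   # per-feature means are ~0

# Layer normalization of one instance does not depend on the rest of the batch
same = np.allclose(layer_norm(h[:1]), ln[:1])
print(same)
```

Because each instance is normalized on its own, layer normalization behaves the same way at every time step and during both training and inference, with no batch statistics to track.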

Here is how to implement Layer Normalization within a custom memory cell:

In [None]:
class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

model_ln = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True, input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

### Tackling the Short-Term Memory Problem

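Because the data goes through a transformation at every time step, some information is lost each time. After a while, the RNN's state contains virtually no trace of the first inputs. Cells with long-term memory were introduced to address this.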

**LSTM (Long Short-Term Memory):**
The LSTM cell manages two state vectors: $\mathbf{h}_{(t)}$ (short-term state) and $\mathbf{c}_{(t)}$ (long-term state). The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it.

It uses three gates controlled by logistic activations:
* **Forget Gate ($\mathbf{f}_{(t)}$):** Controls which parts of the long-term state should be erased.
* **Input Gate ($\mathbf{i}_{(t)}$):** Controls which parts of $\mathbf{g}_{(t)}$ (the main candidate for addition) should be added to the long-term state.
* **Output Gate ($\mathbf{o}_{(t)}$):** Controls which parts of the long-term state should be read and output at this time step, both to $\mathbf{h}_{(t)}$ and to $\mathbf{y}_{(t)}$.

Equations:
$$ \mathbf{i}_{(t)} = \sigma(\mathbf{W}_{xi}^T \mathbf{x}_{(t)} + \mathbf{W}_{hi}^T \mathbf{h}_{(t-1)} + \mathbf{b}_i) $$
$$ \mathbf{f}_{(t)} = \sigma(\mathbf{W}_{xf}^T \mathbf{x}_{(t)} + \mathbf{W}_{hf}^T \mathbf{h}_{(t-1)} + \mathbf{b}_f) $$
$$ \mathbf{o}_{(t)} = \sigma(\mathbf{W}_{xo}^T \mathbf{x}_{(t)} + \mathbf{W}_{ho}^T \mathbf{h}_{(t-1)} + \mathbf{b}_o) $$
$$ \mathbf{g}_{(t)} = \tanh(\mathbf{W}_{xg}^T \mathbf{x}_{(t)} + \mathbf{W}_{hg}^T \mathbf{h}_{(t-1)} + \mathbf{b}_g) $$
$$ \mathbf{c}_{(t)} = \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} $$
$$ \mathbf{y}_{(t)} = \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh(\mathbf{c}_{(t)}) $$
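
The six equations above translate almost line for line into NumPy. The sketch below is illustrative only: `lstm_step` is a hypothetical helper, the four weight matrices of each kind are packed side by side in the order i, f, o, g (an arbitrary choice for this example; real libraries may pack them differently), and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step for a batch of instances stored as rows.

    W_x: (n_inputs, 4 * n_units), W_h: (n_units, 4 * n_units), b: (4 * n_units,)
    The four column blocks hold the i, f, o and g parameters, in that order.
    """
    z = x @ W_x + h_prev @ W_h + b
    z_i, z_f, z_o, z_g = np.split(z, 4, axis=1)
    i, f, o = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)  # gates, values in (0, 1)
    g = np.tanh(z_g)                 # candidate values
    c = f * c_prev + i * g           # new long-term state
    h = o * np.tanh(c)               # new short-term state, also the output
    return h, c

# Hypothetical sizes and random weights
rng = np.random.default_rng(1)
n_inputs, n_units, batch_size = 1, 20, 8
W_x = rng.normal(scale=0.1, size=(n_inputs, 4 * n_units))
W_h = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h = c = np.zeros((batch_size, n_units))
for t in range(50):
    x_t = rng.normal(size=(batch_size, n_inputs))
    h, c = lstm_step(x_t, h, c, W_x, W_h, b)
print(h.shape, c.shape)  # (8, 20) (8, 20)
```

Note that the long-term state `c` is only ever scaled by the forget gate and added to, never passed through a weight matrix, which is what lets information survive across many time steps.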

In [None]:
model_lstm = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])
model_lstm.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model_lstm.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))

**GRU (Gated Recurrent Unit):**
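A simplified version of the LSTM cell. It merges the two state vectors into a single vector $\mathbf{h}_{(t)}$, and a single *update gate* plays the role of both the forget gate and the input gate: wherever the previous state is kept, nothing new is written, and vice versa. There is no output gate; instead, a *reset gate* controls which part of the previous state feeds the candidate state. A GRU has fewer parameters than an LSTM with the same number of units and often performs comparably.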
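
As a rough sketch of how the merged gate works, here is one GRU time step in NumPy. It uses the common formulation with an update gate `z` and a reset gate `r`; conventions differ between formulations on whether `z` or `1 - z` multiplies the previous state, so treat the choice below as an assumption. The helper name `gru_step`, the parameter dictionary, and the random weights are all hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU time step for a batch; p maps names to weight arrays."""
    z = sigmoid(x @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])  # update gate
    r = sigmoid(x @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # reset gate
    # Candidate state, computed from the input and the reset-gated previous state
    g = np.tanh(x @ p["W_xg"] + (r * h_prev) @ p["W_hg"] + p["b_g"])
    # z near 1 keeps the old state; z near 0 overwrites it with the candidate
    return z * h_prev + (1.0 - z) * g

# Hypothetical sizes and random weights
rng = np.random.default_rng(2)
n_inputs, n_units, batch_size = 1, 20, 8
params = {}
for gate in "zrg":
    params["W_x" + gate] = rng.normal(scale=0.1, size=(n_inputs, n_units))
    params["W_h" + gate] = rng.normal(scale=0.1, size=(n_units, n_units))
    params["b_" + gate] = np.zeros(n_units)

h = np.zeros((batch_size, n_units))
for t in range(50):
    h = gru_step(rng.normal(size=(batch_size, n_inputs)), h, params)
print(h.shape)  # (8, 20)
```

The last line of `gru_step` is where the forget and input roles are coupled: the same gate value decides how much is retained and how much is written.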

In [None]:
model_gru = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

**Using 1D Convolutional Layers:**
1D Conv layers slide a kernel over a sequence. They can learn local patterns in time sequences. By setting `strides` > 1, we can downsample the sequence length, making it easier for an RNN to handle long sequences.

In [None]:
model_conv_gru = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                        input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])
model_conv_gru.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model_conv_gru.fit(X_train, Y_train[:, 3::2], epochs=20, validation_data=(X_valid, Y_valid[:, 3::2]))
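
The `3::2` slice on the targets follows from the convolution's geometry. The arithmetic below is a small sketch (the helper name is hypothetical) showing that a kernel of size 4 with stride 2 and `"valid"` padding turns 50 input steps into 24 output steps, and that the last input seen by each output step has index 3, 5, 7, and so on.

```python
def conv1d_output_length(n_steps, kernel_size, strides):
    """Output length of a 1D convolution with "valid" padding."""
    return (n_steps - kernel_size) // strides + 1

n_steps, kernel_size, strides = 50, 4, 2
n_out = conv1d_output_length(n_steps, kernel_size, strides)

# Index of the last input time step covered by each output time step
last_input_index = [kernel_size - 1 + strides * j for j in range(n_out)]
# Indices selected by the [3::2] slice applied to 50 target steps
target_index = list(range(n_steps))[kernel_size - 1::strides]

print(n_out)                             # 24
print(last_input_index[:4])              # [3, 5, 7, 9]
print(last_input_index == target_index)  # True
```

Each output step of the convolutional layer is therefore paired with the target that belongs to the most recent input it has seen.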

**WaveNet:**
Proposed in 2016 by DeepMind researchers. It stacks 1D convolutional layers, doubling the dilation rate (how spread apart the kernel inputs are) at every layer. This allows the network to have an exponentially growing receptive field, capturing extremely long-term patterns efficiently.

In [None]:
model_wavenet = keras.models.Sequential()
model_wavenet.add(keras.layers.InputLayer(input_shape=[None, 1]))
for rate in (1, 2, 4, 8) * 2:
    model_wavenet.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                          activation="relu", dilation_rate=rate))
model_wavenet.add(keras.layers.Conv1D(filters=10, kernel_size=1))
model_wavenet.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model_wavenet.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))
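
The effect of doubling the dilation rate can be quantified with a short calculation. The helper below is a hypothetical sketch: for a stack of 1D convolutions with stride 1, each layer widens the receptive field by `(kernel_size - 1) * dilation_rate` time steps.

```python
def receptive_field(kernel_size, dilation_rates):
    """Number of input time steps that can influence one output of the stack."""
    field = 1
    for rate in dilation_rates:
        field += (kernel_size - 1) * rate
    return field

rates = (1, 2, 4, 8) * 2   # same dilation schedule as the model above

print(receptive_field(2, (1, 2, 4, 8)))   # one block of four layers: 16
print(receptive_field(2, rates))          # both blocks, eight layers: 31
print(receptive_field(2, (1,) * 8))       # eight layers without dilation: 9
```

With the same number of layers and parameters, the dilated stack sees 31 past time steps where an undilated stack would see only 9.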