# Chapter 15: Processing Sequences Using RNNs and CNNs

This notebook contains the code reproductions and theoretical explanations for Chapter 15 of *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*.

## Chapter Summary

This chapter introduces **Recurrent Neural Networks (RNNs)**, a class of neural networks designed to handle sequential data. Unlike feedforward networks, RNNs have connections that point backward, allowing them to maintain a form of memory and process sequences of arbitrary lengths.

Key topics covered include:

* **Recurrent Neurons and Layers:** The fundamental concepts of RNNs, how a recurrent neuron and layer work, and how they are "unrolled through time."
* **Training RNNs:** Understanding how to train RNNs using **Backpropagation Through Time (BPTT)**.
* **Time Series Forecasting:** We build several RNNs to predict future values in a generated time series, including simple RNNs, deep RNNs, and models that can forecast multiple time steps ahead.
* **Handling Long Sequences:** We explore the two main challenges of training RNNs on long sequences:
    1.  **Unstable Gradients:** The vanishing/exploding gradient problem, which can be mitigated with techniques like gradient clipping, Layer Normalization, and recurrent dropout.
    2.  **Limited Short-Term Memory:** The inability of simple RNNs to capture long-term dependencies.
* **Advanced RNN Cells:** We introduce two powerful cell architectures designed to tackle the short-term memory problem: **Long Short-Term Memory (LSTM)** and **Gated Recurrent Unit (GRU)** cells.
* **CNNs for Sequences:** We explore using 1D convolutional layers to process sequences, which can be much faster than RNNs. This culminates in the **WaveNet** architecture, which uses dilated 1D convolutions to efficiently learn very long-term patterns.

## Setup

First, let's import the necessary libraries and set up the environment.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Common setup for plotting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## Recurrent Neurons and Layers

### Theoretical Explanation

A **recurrent neuron** is a neuron that receives not only inputs but also its own output from the previous time step. This creates a loop, allowing the network to have a form of memory. At time step *t*, the recurrent neuron receives the inputs **x**(t) and its own previous output **y**(t-1) to produce the current output **y**(t).

When a layer of recurrent neurons is built, this logic is vectorized. The output of the entire layer at time step *t*, **Y**(t), is a function of the input matrix **X**(t) and the layer's output from the previous time step, **Y**(t-1). This can be visualized by **unrolling the network through time**, creating a deep network where each "layer" represents a time step.

#### Memory Cells
A component that preserves some state across time steps is called a **memory cell** (or just a cell). A simple recurrent neuron is a very basic cell. Its hidden state **h**(t) (which, in a simple RNN, is the same as its output **y**(t)) is a function of the previous state **h**(t-1) and the current input **x**(t).
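
To make the recurrence concrete, here is a minimal NumPy sketch of a layer of recurrent neurons unrolled over a few time steps. It assumes a `tanh` activation and an all-zero initial state; the names `W_x`, `W_y`, and `b` and the sizes are illustrative, not taken from Keras.

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size, n_steps, n_inputs, n_neurons = 4, 5, 3, 2

# Illustrative parameters: input weights, recurrent weights, bias
W_x = rng.normal(scale=0.5, size=(n_inputs, n_neurons))
W_y = rng.normal(scale=0.5, size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X = rng.normal(size=(batch_size, n_steps, n_inputs))  # [batch, time, features]
Y_prev = np.zeros((batch_size, n_neurons))            # assumed initial state

outputs = []
for t in range(n_steps):
    # Y(t) = tanh(X(t) W_x + Y(t-1) W_y + b): the same weights at every step
    Y_t = np.tanh(X[:, t] @ W_x + Y_prev @ W_y + b)
    outputs.append(Y_t)
    Y_prev = Y_t

Y = np.stack(outputs, axis=1)  # [batch, time, neurons]
print(Y.shape)  # (4, 5, 2)
```

The loop body is the "cell": it maps the current input and the previous state to a new state, which here is also the output.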

#### Input and Output Sequences
RNNs can handle various types of input/output sequences:
* **Sequence-to-sequence:** Input a sequence, output a sequence (e.g., forecasting stock prices).
* **Sequence-to-vector:** Input a sequence, output a single vector (e.g., sentiment analysis of a review).
* **Vector-to-sequence:** Input a single vector, output a sequence (e.g., image captioning).
* **Encoder-Decoder:** A sequence-to-vector (encoder) network followed by a vector-to-sequence (decoder) network (e.g., machine translation).

#### Training RNNs
To train an RNN, we use **Backpropagation Through Time (BPTT)**. This involves unrolling the RNN through time (for the length of the input sequences) and then using regular backpropagation. The key is that the weights (**W**x and **W**y) are shared across all time steps. Gradients are computed at each time step and then summed up to update the weights.
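
The summing of per-step gradients can be sketched for a single recurrent neuron. This is a toy example under simplifying assumptions (scalar weights, `tanh` activation, a squared-error loss on the last output only); the function names are hypothetical. It compares the BPTT gradient for the recurrent weight with a numerical estimate.

```python
import numpy as np

def forward(x, w_x, w_h, b):
    """Unroll a single-neuron RNN; returns states h[0..T] with h[0] = 0."""
    h = np.zeros(len(x) + 1)
    for t in range(len(x)):
        h[t + 1] = np.tanh(w_x * x[t] + w_h * h[t] + b)
    return h

def loss_fn(x, y, w_x, w_h, b):
    return 0.5 * (forward(x, w_x, w_h, b)[-1] - y) ** 2

def bptt(x, y, w_x, w_h, b):
    h = forward(x, w_x, w_h, b)
    grad_w_x = grad_w_h = grad_b = 0.0
    dh = h[-1] - y                       # dLoss / d h(T)
    for t in reversed(range(len(x))):
        da = dh * (1.0 - h[t + 1] ** 2)  # back through tanh
        grad_w_x += da * x[t]            # shared weights: contributions
        grad_w_h += da * h[t]            # from every time step are summed
        grad_b += da
        dh = da * w_h                    # pass the error to the previous step
    return grad_w_x, grad_w_h, grad_b

rng = np.random.default_rng(0)
x, y = rng.normal(size=10), 0.3
w_x, w_h, b = 0.5, 0.8, 0.1

analytic = bptt(x, y, w_x, w_h, b)
eps = 1e-6
numeric_w_h = (loss_fn(x, y, w_x, w_h + eps, b)
               - loss_fn(x, y, w_x, w_h - eps, b)) / (2 * eps)
print(analytic[1], numeric_w_h)  # the two values should agree closely
```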

## Forecasting a Time Series

Let's create a function to generate some time series data. Each series will be the sum of two sine waves plus some noise.

In [2]:
def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))  # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # + noise
    return series[..., np.newaxis].astype(np.float32)

# When dealing with time series, inputs are generally 3D:
# [batch size, time steps, dimensionality]
# Here, dimensionality is 1 (univariate time series).

In [3]:
# Create the datasets
n_steps = 50
series = generate_time_series(10000, n_steps + 1)

# X_train will be the first 50 time steps, y_train will be the 51st.
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_train shape: (7000, 50, 1)
y_train shape: (7000, 1)


### Baseline Metrics

Before building a complex model, it's crucial to establish baseline metrics. If our RNN can't beat these, it's not very useful.

1.  **Naive Forecasting:** Predict the last observed value. This is surprisingly hard to beat for some series.
2.  **Simple Linear Model:** A single `Dense` layer on the flattened input, so each forecast is a linear combination of all 50 time steps (that is, linear regression).

In [7]:
# 1. Naive Forecasting
y_pred_naive = X_valid[:, -1]
naive_mse = tf.reduce_mean(tf.square(y_valid - y_pred_naive))
print("Naive Forecasting MSE:", naive_mse)

Naive Forecasting MSE: tf.Tensor(0.02153162, shape=(), dtype=float32)


In [8]:
# 2. Simple Linear Model
model_linear = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1)
])

model_linear.compile(loss="mse", optimizer="adam")
history_linear = model_linear.fit(X_train, y_train, epochs=20,
                                validation_data=(X_valid, y_valid),
                                verbose=0)

linear_mse = model_linear.evaluate(X_valid, y_valid, verbose=0)
print("Linear Model MSE:", linear_mse)



Linear Model MSE: 0.004609603434801102


### Implementing a Simple RNN

Now let's build the simplest possible RNN. It has a single layer with a single neuron. We don't need to specify the length of the input sequences, so we set the time step dimension to `None`.

In [9]:
model_simple_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

model_simple_rnn.compile(loss="mse", optimizer="adam")
history_simple_rnn = model_simple_rnn.fit(X_train, y_train, epochs=20,
                                        validation_data=(X_valid, y_valid),
                                        verbose=0)

simple_rnn_mse = model_simple_rnn.evaluate(X_valid, y_valid, verbose=0)
print("Simple RNN MSE:", simple_rnn_mse)

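# It beats naive forecasting but not the linear model: a single recurrent
# neuron has only 3 parameters (input weight, recurrent weight, bias) and must
# carry all of its memory in one value, versus 51 parameters for the linear model.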



Simple RNN MSE: 0.014739658683538437
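
The parameter counts behind this comparison are easy to check by hand. The helper functions below are hypothetical (they are not Keras functions) and assume that every layer uses a bias term, which is the Keras default.

```python
def simple_rnn_params(n_inputs, n_units):
    # input weights + recurrent weights + biases
    return n_inputs * n_units + n_units * n_units + n_units

def dense_params(n_inputs, n_units):
    # weights + biases
    return n_inputs * n_units + n_units

print(simple_rnn_params(1, 1))   # 3   (the single-neuron RNN)
print(dense_params(50, 1))       # 51  (the linear model on 50 flattened steps)
print(simple_rnn_params(1, 20))  # 440 (a 20-unit recurrent layer on 1D inputs)
```

Note that the recurrent layer's count does not depend on the sequence length, because the same weights are reused at every time step.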


### Deep RNNs

We can stack multiple RNN layers to create a **Deep RNN**. This is generally more powerful.

**Important:** All recurrent layers *except the last one* must have `return_sequences=True`. This ensures they output a 3D sequence (including the time step dimension) for the next recurrent layer to process.

In [10]:
model_deep_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1) # Only returns the last output
])

model_deep_rnn.compile(loss="mse", optimizer="adam")
history_deep_rnn = model_deep_rnn.fit(X_train, y_train, epochs=20,
                                    validation_data=(X_valid, y_valid),
                                    verbose=0)

deep_rnn_mse = model_deep_rnn.evaluate(X_valid, y_valid, verbose=0)
print("Deep RNN MSE:", deep_rnn_mse)

Deep RNN MSE: 0.003031549509614706


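The last layer doesn't need to be recurrent. A `SimpleRNN` output layer with one unit carries a hidden state of just one number from step to step, and its default `tanh` activation limits predictions to the range -1 to 1. A `Dense` output layer usually runs slightly faster, is roughly as accurate, and lets us choose any activation function. Here, the second `SimpleRNN` layer returns only its last output (`return_sequences=False`, the default), and a `Dense` layer sits on top.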

In [11]:
model_deep_dense = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20), # No return_sequences=True, so it returns [batch_size, units]
    keras.layers.Dense(1)
])

model_deep_dense.compile(loss="mse", optimizer="adam")
history_deep_dense = model_deep_dense.fit(X_train, y_train, epochs=20,
                                        validation_data=(X_valid, y_valid),
                                        verbose=0)

deep_dense_mse = model_deep_dense.evaluate(X_valid, y_valid, verbose=0)
print("Deep RNN with Dense output MSE:", deep_dense_mse)

Deep RNN with Dense output MSE: 0.0026087926235049963


### Forecasting Several Time Steps Ahead

What if we want to predict the next 10 values, not just one?

**Option 1: Iterative Forecasting**
Use the previous model, predict one step, add that prediction to the input, and call the model again to predict the next step, and so on.

In [13]:
# Generate a new series for this test
series = generate_time_series(1, n_steps + 10)
X_new, Y_new = series[:, :n_steps], series[:, n_steps:]

X = X_new
for step_ahead in range(10):
    y_pred_one = model_deep_dense.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

Y_pred = X[:, n_steps:]
print(Y_pred.shape)
print(tf.reduce_mean(tf.square(Y_new - Y_pred)))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 40ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
(1, 10, 1)
tf.Tensor(0.002272064, shape=(), dtype=float32)


This approach works, but errors tend to accumulate. The predictions for later time steps get progressively worse.

**Option 2: Multi-Output Model (Sequence-to-Vector)**
Train an RNN to predict all 10 next values at once. The output layer will be a `Dense` layer with 10 units.

In [14]:
# Prepare new targets: Y_train is now [batch_size, 10]
series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

model_seq_to_vec = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10) # Output 10 values
])

model_seq_to_vec.compile(loss="mse", optimizer="adam")
history_seq_to_vec = model_seq_to_vec.fit(X_train, Y_train, epochs=20,
                                        validation_data=(X_valid, Y_valid),
                                        verbose=0)
seq_to_vec_mse = model_seq_to_vec.evaluate(X_valid, Y_valid, verbose=0)
print("Seq-to-Vec RNN MSE:", seq_to_vec_mse)



Seq-to-Vec RNN MSE: 0.008170127868652344


**Option 3: Sequence-to-Sequence Model**

A more powerful approach is to train the model to forecast the next 10 values at *every* time step.

At time *t*, the model outputs a vector of forecasts for *t+1* to *t+10*. At time *t+1*, it outputs forecasts for *t+2* to *t+11*, etc.

This helps the model because the loss is calculated at every time step: error gradients flow in from the output at each step rather than only back through time from the last one, which tends to stabilize and speed up training.

To do this, we use `return_sequences=True` on *all* recurrent layers and wrap the final `Dense` layer in a `TimeDistributed` layer. This applies the `Dense` layer at every time step independently.

In [15]:
# Prepare targets: Y is now [batch_size, time_steps, 10]
Y = np.empty((10000, n_steps, 10))
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

Y_train = Y[:7000]
Y_valid = Y[7000:9000]
Y_test = Y[9000:]
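
The target layout is easier to see on a toy series. The sketch below reuses the same slicing with made-up sizes (5 input steps, a 3-step horizon) and a series whose value at each step equals its index.

```python
import numpy as np

n_steps, horizon = 5, 3
# One toy series with values 0, 1, ..., 7 and shape [1, 8, 1]
series = np.arange(n_steps + horizon, dtype=np.float32).reshape(1, -1, 1)

Y = np.empty((1, n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

print(Y[0])
# [[1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]
#  [4. 5. 6.]
#  [5. 6. 7.]]
```

Row *t* holds the values at *t+1* to *t+3*, so the last row is exactly the target a sequence-to-vector model would be trained on.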

In [23]:
model_seq_to_seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

# For evaluation, we only care about the forecast at the very last time step.
# So we create a custom metric.
def last_time_step_mse(Y_true, Y_pred):
    return tf.reduce_mean(tf.square(Y_true[:, -1] - Y_pred[:, -1]))

optimizer = keras.optimizers.Adam(learning_rate=0.01)
model_seq_to_seq.compile(loss="mse", optimizer=optimizer, metrics=[last_time_step_mse])

history_seq_to_seq = model_seq_to_seq.fit(X_train, Y_train, epochs=20,
                                        validation_data=(X_valid, Y_valid),
                                        verbose=0)

seq_to_seq_mse = model_seq_to_seq.evaluate(X_valid, Y_valid, verbose=0)
print("Seq-to-Seq RNN loss (full):", seq_to_seq_mse[0])
print("Seq-to-Seq RNN MSE (last step):", seq_to_seq_mse[1])

Seq-to-Seq RNN loss (full): 0.01889195293188095
Seq-to-Seq RNN MSE (last step): 0.007327883969992399
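
What `TimeDistributed(Dense(10))` computes can be sketched in NumPy: one set of weights applied to each time step separately, which is equivalent to folding the time axis into the batch axis, applying the layer once, and unfolding. The arrays `H`, `W`, and `b` are random stand-ins, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size, n_steps, n_units, n_outputs = 2, 4, 3, 10

H = rng.normal(size=(batch_size, n_steps, n_units))  # stand-in for RNN outputs
W = rng.normal(size=(n_units, n_outputs))            # stand-in Dense kernel
b = rng.normal(size=n_outputs)                       # stand-in Dense bias

# Apply the same Dense weights at each time step independently
per_step = np.stack([H[:, t] @ W + b for t in range(n_steps)], axis=1)

# Equivalent: merge batch and time axes, apply once, then split them again
reshaped = (H.reshape(-1, n_units) @ W + b).reshape(batch_size, n_steps, n_outputs)

print(per_step.shape, np.allclose(per_step, reshaped))
```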


## Handling Long Sequences

### Fighting the Unstable Gradients Problem

**Theoretical Explanation:**

Training on long sequences means BPTT has to backpropagate through many time steps. This makes the RNN effectively a very deep network, making it suffer from the **unstable gradients problem** (vanishing or exploding gradients).
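
A deliberately simplified illustration: for the linear recurrence h(t) = w · h(t-1), the gradient of the final state with respect to the initial state is w multiplied by itself once per time step. Real RNNs add activations and weight matrices, so treat this as intuition only.

```python
n_steps = 50

def gradient_through_time(w, n_steps):
    """d h(T) / d h(0) for the linear recurrence h(t) = w * h(t-1)."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= w  # one chain-rule factor per time step
    return grad

vanishing = gradient_through_time(0.9, n_steps)
exploding = gradient_through_time(1.1, n_steps)
print(f"w = 0.9 -> {vanishing:.2e}")  # shrinks toward zero
print(f"w = 1.1 -> {exploding:.2e}")  # grows by orders of magnitude
```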

Several solutions exist:
* **Gradient Clipping:** Capping the gradients at a certain threshold.
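* **Saturating Activation Functions:** Nonsaturating activations such as `ReLU` help less in RNNs than in feedforward networks. Because the same weights are applied at every time step, unbounded outputs can keep growing until they explode, which is why a saturating function like `tanh` is the default.
* **Batch Normalization:** It cannot be applied between time steps, only between recurrent layers, and it tends to help RNNs far less than it helps deep feedforward networks.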
* **Layer Normalization:** A better solution for RNNs. Instead of normalizing *across the batch dimension* (like BN), it normalizes *across the features dimension*. It's computed on the fly at each time step and behaves the same during training and testing. It's typically applied *inside* the cell.
* **Dropout:** Can be applied to the inputs (`dropout`) and to the hidden states (`recurrent_dropout`).
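
Of the remedies listed above, gradient clipping is the simplest to show in isolation. The two helper functions below are a hand-written NumPy sketch, not a library API; in Keras the same effect is normally requested through the optimizer's `clipvalue` or `clipnorm` arguments.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip each component to [-clip_value, clip_value]; may change direction."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm; keeps direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.array([0.9, 100.0])   # an "exploding" gradient
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print(by_value)  # components become 0.9 and 1.0: the direction changes a lot
print(by_norm)   # same direction as grad, but with unit length
```

Clipping by value is simpler but can distort the gradient direction, while clipping by norm preserves it at the cost of shrinking small components almost to zero.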

In [18]:
# Example of a custom RNN cell with Layer Normalization
class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units,
                                                         activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]
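
For reference, the normalization applied inside the cell can be sketched in NumPy. This is a minimal version assuming learnable scale `gamma` and shift `beta` plus a small `eps` for numerical stability; it is not the Keras implementation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x across its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
n_units = 20
x = rng.normal(loc=3.0, scale=5.0, size=(4, n_units))  # one time step: [batch, units]
gamma, beta = np.ones(n_units), np.zeros(n_units)

normed = layer_norm(x, gamma, beta)
alone = layer_norm(x[:1], gamma, beta)  # first instance, normalized on its own

print(normed.mean(axis=-1).round(6))    # approximately 0 for every instance
print(np.allclose(normed[:1], alone))   # True: no dependence on the rest of the batch
```

Because the statistics are computed per instance, the result for one sequence does not depend on which other sequences share its batch, which is why the layer behaves the same during training and testing.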

In [19]:
# We can use this custom cell with the keras.layers.RNN layer
model_ln_rnn = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])



### Tackling the Short-Term Memory Problem

**Theoretical Explanation:**

Because of the transformations the data goes through at each time step, information from early in the sequence is gradually lost. A simple RNN has a very limited short-term memory.

To solve this, more complex cells with long-term memory are used.

#### LSTM Cells
The **Long Short-Term Memory (LSTM)** cell is the most popular solution. It manages two state vectors:
* **h**(t): The short-term state (like in a simple RNN).
* **c**(t): The long-term state.

Its key idea is that it can learn what to *store* in the long-term state, what to *throw away*, and what to *read* from it. It does this using three **gates**:

1.  **Forget Gate:** Decides which parts of the long-term state **c**(t-1) to erase.
2.  **Input Gate:** Decides which parts of the candidate state **g**(t) (the "new" information) to add to the long-term state.
3.  **Output Gate:** Decides which parts of the long-term state **c**(t) to read and output as the short-term state **h**(t) and the cell output **y**(t).

This architecture allows information to be preserved for very long periods.
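
The three gates can be written out as a single time step in NumPy. This is a sketch under assumptions: the weights for all gates are packed into one matrix, and the order in which they are packed (input, forget, output, candidate) is chosen here for illustration and need not match any library's layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step. Assumed packing along the last axis: i, f, o, g."""
    z = x @ W_x + h_prev @ W_h + b
    i, f, o, g = np.split(z, 4, axis=-1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates take values in (0, 1)
    g = np.tanh(g)                                # candidate state
    c = f * c_prev + i * g   # forget part of the old memory, add new information
    h = o * np.tanh(c)       # read a filtered view of the long-term state
    return h, c

rng = np.random.default_rng(0)
batch_size, n_inputs, n_units = 2, 1, 3
W_x = rng.normal(scale=0.1, size=(n_inputs, 4 * n_units))
W_h = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h = np.zeros((batch_size, n_units))
c = np.zeros((batch_size, n_units))
X = rng.normal(size=(batch_size, 5, n_inputs))
for t in range(X.shape[1]):
    h, c = lstm_step(X[:, t], h, c, W_x, W_h, b)
print(h.shape, c.shape)  # (2, 3) (2, 3)
```

When the forget gate is close to 1 and the input gate close to 0, `c` passes through the step almost unchanged, which is how information can survive many time steps.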


In [20]:
# Using LSTM in Keras is simple
model_lstm = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

#### GRU Cells
The **Gated Recurrent Unit (GRU)** cell is a simplified version of the LSTM cell that often performs just as well.

It merges the two state vectors (**h** and **c**) into one and uses only two gates:

1.  **Update Gate (z):** Controls how much of the *previous* state to keep and how much of the *new* candidate state to add (it handles the job of both the forget and input gates).
2.  **Reset Gate (r):** Controls how much of the previous state to *show* to the main layer that computes the candidate state.

GRU is slightly more computationally efficient than LSTM.
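
A single GRU time step can be sketched the same way. The parameter names in the dictionary `p` are made up for this example, and the reset gate is applied to the previous state before the matrix product; library implementations can differ in such details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU time step; p is a dict of illustrative weight arrays."""
    z = sigmoid(x @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])  # update gate
    r = sigmoid(x @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # reset gate
    # The candidate state only sees the part of h_prev that r lets through
    g = np.tanh(x @ p["W_xg"] + (r * h_prev) @ p["W_hg"] + p["b_g"])
    # z near 1 keeps the previous state; z near 0 replaces it with the candidate
    return z * h_prev + (1.0 - z) * g

rng = np.random.default_rng(0)
n_inputs, n_units = 1, 3
p = {}
for gate in "zrg":
    p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(n_inputs, n_units))
    p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(n_units, n_units))
    p[f"b_{gate}"] = np.zeros(n_units)

h = np.zeros((2, n_units))
X = rng.normal(size=(2, 5, n_inputs))
for t in range(X.shape[1]):
    h = gru_step(X[:, t], h, p)
print(h.shape)  # (2, 3)
```

The sketch has three weight blocks (z, r, g) where an LSTM has four, which is one way to see where the modest savings come from.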

In [21]:
# Using GRU in Keras
model_gru = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

### Using 1D Convolutional Layers to Process Sequences

**Theoretical Explanation:**

RNNs are great, but they are very sequential and thus slow to train. A **1D convolutional layer** can be a great alternative. It slides several filters (kernels) across a sequence, producing a 1D feature map per filter.

* It can learn short sequential patterns (no longer than the kernel size).
* It is not recurrent, so it can be parallelized and is much faster than an RNN.
* It can be used as a preprocessing step for an RNN to downsample the sequence (using a stride > 1), which helps the RNN learn longer-term patterns.
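
The downsampling effect of a stride can be checked with a hand-written 1D convolution. The function `conv1d_valid` is a hypothetical helper for a single filter with no padding, not a Keras function; the kernel is chosen so that each output simply copies the last input in its window.

```python
import numpy as np

def conv1d_valid(x, kernel, stride):
    """Single-filter 1D convolution (cross-correlation) without padding."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([x[i * stride:i * stride + k] @ kernel for i in range(n_out)])

x = np.arange(50, dtype=np.float32)                # a sequence of 50 time steps
kernel = np.array([0, 0, 0, 1], dtype=np.float32)  # picks the last input of each window

out = conv1d_valid(x, kernel, stride=2)
print(len(out))  # 24
print(out[:3])   # [3. 5. 7.]
```

With 50 input steps, a kernel size of 4, and a stride of 2, the output has 24 steps, and output *i* ends at input index 2*i* + 3. Those are exactly the indices selected by the slice `3::2`.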

#### WaveNet
The **WaveNet** architecture shows that CNNs alone can handle very long sequences. It does this by stacking 1D convolutional layers with **doubling dilation rates** (1, 2, 4, 8, ...).

A dilated convolution applies a filter over an area larger than its size by "skipping" inputs: with a dilation rate of 2, the filter's taps are spaced 2 time steps apart. Because the rates double from one layer to the next, the network's receptive field grows *exponentially* with depth, making it highly efficient at learning long-term dependencies.

We also use `padding="causal"` to ensure that the prediction at time *t* only depends on inputs from *t* or earlier (no peeking into the future).
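
Both ideas, causal padding and the growing receptive field, can be sketched in NumPy. The helpers below are illustrative (a single filter, no activation) and are not the Keras implementation; an impulse is pushed through a stack of layers with kernel size 2 to see how far its influence spreads.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Output at time t combines x[t], x[t - d], ..., x[t - (k - 1) * d]."""
    k = len(kernel)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad, dtype=x.dtype), x])  # pad the past only
    return np.array([
        sum(kernel[j] * x_padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ])

def receptive_field(kernel_size, rates):
    """Number of input time steps that can influence a single output."""
    return 1 + sum((kernel_size - 1) * rate for rate in rates)

rates = (1, 2, 4, 8)
print(receptive_field(2, rates))      # 16
print(receptive_field(2, rates * 2))  # 31

# An impulse at t = 5 passed through four layers with all-ones kernels
y = np.zeros(40)
y[5] = 1.0
for rate in rates:
    y = causal_dilated_conv1d(y, np.ones(2), rate)
print(np.flatnonzero(y))  # time steps 5 to 20: nothing before t = 5
```

Four layers are enough for an input to reach 16 time steps forward, and no output before the impulse is affected, which is the "no peeking into the future" property.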

In [24]:
# Example of a 1D Conv layer in front of a GRU
model_cnn_rnn = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                          input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

# Note: with strides=2 and padding="valid", the Conv1D layer roughly halves the
# sequence length (50 -> 24 time steps). With kernel_size=4, output i is computed
# from inputs 2*i to 2*i + 3, so the first output covers inputs 0-3.
# The targets must match: skip the first 3 time steps, then keep every other one.
model_cnn_rnn.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history_cnn_rnn = model_cnn_rnn.fit(X_train, Y_train[:, 3::2], epochs=20,
                                    validation_data=(X_valid, Y_valid[:, 3::2]),
                                    verbose=0)

cnn_rnn_mse = model_cnn_rnn.evaluate(X_valid, Y_valid[:, 3::2], verbose=0)
print("CNN+RNN loss (full):", cnn_rnn_mse[0])
print("CNN+RNN MSE (last step):", cnn_rnn_mse[1])



CNN+RNN loss (full): 0.018182016909122467
CNN+RNN MSE (last step): 0.008125267922878265


In [25]:
# Simple WaveNet-style model
model_wavenet = keras.models.Sequential()
model_wavenet.add(keras.layers.InputLayer(input_shape=[None, 1]))

for rate in (1, 2, 4, 8) * 2: # Stack two blocks of dilated layers
    model_wavenet.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                      activation="relu", dilation_rate=rate))

model_wavenet.add(keras.layers.Conv1D(filters=10, kernel_size=1))

model_wavenet.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history_wavenet = model_wavenet.fit(X_train, Y_train, epochs=20,
                                    validation_data=(X_valid, Y_valid),
                                    verbose=0)

wavenet_mse = model_wavenet.evaluate(X_valid, Y_valid, verbose=0)
print("WaveNet loss (full):", wavenet_mse[0])
print("WaveNet MSE (last step):", wavenet_mse[1])



WaveNet loss (full): 0.020734522491693497
WaveNet MSE (last step): 0.009024088270962238
