# Chapter 15 - Processing Sequences Using RNNs and CNNs

In this chapter we will first look at the fundamental concepts underlying Recurrent Neural Networks (RNNs)
and how to train them using backpropagation through time, then we will use
them to forecast a time series. After that we’ll explore the two main difficulties
that RNNs face:

- Unstable gradients (discussed in Chapter 11), which can be alleviated
using various techniques, including recurrent dropout and recurrent
layer normalization

- A (very) limited short-term memory, which can be extended using
LSTM and GRU cells

## Recurrent Neurons and Layers

The simplest possible RNN is composed of one neuron receiving inputs, producing and output, and sending theat output back to itself. Each recurrent neuron has two sets of weights: one for the inputs $\pmb{x}_{t}$ and the other for the outputs of the previous time step, $\pmb{y}_{t-1}$. We can extend this architecture to a layer of Recurrent Neurons, and replace weight vectors for weight matrices $\pmb{W}_x$ for the input weights, and $\pmb{W}_y$, plus a bias term $\pmb{b}$:

![rnn_layers](./images/ch15_rnn_layer.png)

The output produce by a layer at time $t$ is given by the equation:

$$
\pmb{y}(t) = \phi(\pmb{W}_x^T\pmb{x}_{(t)} + \pmb{W}_y^T\pmb{y}_{(t-1)} + \pmb{b})
$$

Just as with feedforward neural networks, we can compute a recurrent layer’s
output in one shot for a whole mini-batch by placing all the inputs at time step $t$
in an input matrix $\pmb{X}_t$:

$$
\pmb{Y}(t) = \phi(\pmb{X}_{(t)}\pmb{W}_x + \pmb{Y}_{(t-1)}\pmb{W}_y + \pmb{b})
= \phi([\pmb{X}_{(t)} + \pmb{Y}_{(t-1)}]\pmb{W} + \pmb{b})
$$

with $\pmb{W} = \begin{bmatrix} \pmb{W}_x \\ \pmb{W}_y \end{bmatrix}$.

### Memory Cells

Since the output of a recurrent neuron at time step t is a function of all the inputs
from previous time steps, you could say it has a form of memory. A part of a
neural network that preserves some state across time steps is called a memory
cell (or simply a cell). A single recurrent neuron, or a layer of recurrent neurons,
is a very basic cell, capable of learning only short patterns (typically about 10
steps long, but this varies depending on the task).

### Input and Output Sequences

![input_output](./images/ch15_input_output_seq.png)

Applications:

- seq-to-seq: Predicting stock prices
- seq-to-vector: Text sentiment analysis
- vector-to-seq: Image caption
- Encoder-Decoder: Translating sentence from one language to another

## Training RNNs

To train an RNN, the trick is to unroll it through time (like we just did) and then
simply use regular backpropagation (see Figure 15-5). This strategy is called
*backpropagation through time* (BPTT).

![bptt](./images/ch15_bptt.png)

Depending on the task, the cost function $C$ may use all outputs, or just a subset of them. In the example above, it uses the last three outputs, but if the task was a sequence-to-vector, it would only use the last one. 

## Forecasting a Time Series

In [15]:
import numpy as np
from tensorflow import keras

2022-02-14 15:00:35.849948: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-02-14 15:00:35.849973: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [6]:
def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 20))  # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # + noise
    return series[..., np.newaxis].astype(np.float32)

This function creates as many time series as requested (via the `batch_size`
argument), each of length `n_steps`, and there is just one value per time step in
each series (i.e., all series are univariate). The function returns a NumPy array of
shape `[batch size, time steps, 1]`, where each series is the sum of two sine waves
of fixed amplitudes but random frequencies and phases, plus a bit of noise.

Now, let's create a training set, a validation set, and a test set using this function:

In [9]:
n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

`X_train` contains 7000 time series (i.e., its shape is [7000, 50, 1]), while `X_valid` contains 2000 and `X_test` contains 1000. Since we want to forecast a single value for each series, the targets are column vectors.

### Baseline Metrics

It is important to define the metrics to evaluate our models. The simplest approach is to predict the last value in each series. this is called *naive forecasting*:

In [17]:
y_pred = X_valid[:, -1]
np.mean(keras.losses.mean_squared_error(y_valid, y_pred))

0.040211584

Another simple approach is to use a fully connected network. Since it expects a flat list of features for each input, we need to add a `Flatten` layer. Let's implement a simple linear regression:

In [27]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1)
])

model.compile(loss="mse", optimizer="adam")

model.fit(X_train, y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f9940309400>

In [28]:
model.evaluate(X_valid, y_valid)



0.002967239124700427

Using this approach, we got a MSE of 0.00297 on the validation set.

### Implementing a Simple RNN

Now, let's use the simplest RNN we can build to perform this task. It just contains a single layer, with a single neuron. We do not need to specify the length of the input sequences (unlike in the previous model) since a rnn can process any number of time steps (this is why we set the first input dimension to `None`.

In [30]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f9940217430>

In [31]:
model.evaluate(X_valid, y_valid)



0.1439523547887802

As we can see, the performance of this RNN was worse than the Neural Network (linear model). It looks like our RNN is too simple to get good performance. Next, let's try a deep RNN.

By default, recurrent layers in Keras only return the final output. To make them return one output per time step, you must set `return_sequences=True`|, as we will see.

### Deep RNNs

It is quite common to stack multiple layers of cells, like in the figure 15-7. This gives you a deep RNN.

![deep_rnn](./images/ch15_deep_rnn.png)

Implementing a deep RNN with `tf.keras` is quite simple: just stack recurrent
layers. In this example, we use three `SimpleRNN` layers (but we could add any
other type of recurrent layer, such as an `LSTM` layer or a `GRU` layer, which we will
discuss shortly):

In [32]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f99402f19d0>

In [33]:
model.evaluate(X_valid, y_valid)



0.0027131850365549326

And bang! we beat the linear model with 0.00271 MSE.

Note that the last layer is not ideal: it must have a single unit because we want to
forecast a univariate time series, and this means we must have a single output
value per time step. 

However, having a single unit means that the hidden state is
just a single number. That’s really not much, and it’s probably not that useful;
presumably, the RNN will mostly use the hidden states of the other recurrent
layers to carry over all the information it needs from time step to time step, and
it will not use the final layer’s hidden state very much. 

Moreover, since a `SimpleRNN` layer uses the `tanh` activation function by default, the predicted
values must lie within the range –1 to 1. But what if you want to use another
activation function? For both these reasons, it might be preferable to replace the
output layer with a `Dense` layer: it would run slightly faster, the accuracy would
be roughly the same, and it would allow us to choose any output activation
function we want. If you make this change, also make sure to remove
`return_sequences=True` from the second (now last) recurrent layer:

In [34]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1)
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f9939fa2a90>

In [35]:
model.evaluate(X_valid, y_valid)



0.002368667395785451

### Forecasting Several Time Steps Ahead

So far, we've dedicated our studies in predicting the next value of the series. But what if we want to predict the next 10 values of the series? There are two possible approaches for this:

1. The first option is to use the model we already trained, make it predict the next value, then add that value to the inputs (acting as if this predicted value had actually occured), and use the model again to predict the following value, and so on, as in the following code:

In [36]:
series = generate_time_series(1, n_steps + 10)
X_new, Y_new = series[:, :n_steps], series[:, n_steps:]
X = X_new 
for step_ahead in range(10):
    y_pred_one = model.predict(X[:, step_ahead])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)
    
Y_pred = X[:, n_steps:]

In [41]:
np.mean(keras.metrics.mean_squared_error(Y_pred, Y_new))

0.031221583

In [49]:
Y_pred

array([[[ 0.34370846],
        [ 0.21279262],
        [-0.02680342],
        [-0.15851194],
        [-0.31637913],
        [-0.38260907],
        [-0.3879037 ],
        [-0.34886643],
        [-0.24574944],
        [-0.16358002]]], dtype=float32)

2. The second option is to train an RNN to predict all 10 next values at once. We can still use a sequence-to-vector model, but it will output 10 values instead of 1. However, we first need to change the targets to be vectors containing the next 10 values:

In [42]:
series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

Now we just need the output layer to have 10 units instead of 1:

In [48]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)
])

model.compile(loss="mse", optimizer="adam")
model.fit(X_train, Y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f9939a8c460>

In [50]:
model.evaluate(X_valid, Y_valid)



0.006903470028191805

In [51]:
Y_pred = model.predict(X_new)

In [53]:
np.mean(keras.metrics.mean_squared_error(Y_pred, Y_new))

0.22924933

In this particular case, the first approach was much better.

In [52]:
Y_pred

array([[ 0.42717203,  0.16126128, -0.08839938, -0.36850065, -0.5211376 ,
        -0.56968665, -0.5459968 , -0.42273957, -0.2622762 , -0.10547987]],
      dtype=float32)

We can still improve on this solution: indeed, instead of training the model to forecast the next 10 values only at the very last time step, we can train it to forecast the next 10 values at each and every time
step. In other words, we can turn this sequence-to-vector RNN into a sequenceto-sequence RNN. The advantage of this technique is that the loss will contain a term for the output of the RNN at each and every time step, not just the output at the last time step. This means there will be many more error gradients flowing through the model, and they won’t have to flow only through time; they will also flow from the output of each time step. This will both stabilize and speed up training.

To be clear, at time step 0 the model will output a vector containing the forecasts
for time steps 1 to 10, then at time step 1 the model will forecast time steps 2 to
11, and so on. So each target must be a sequence of the same length as the input
sequence, containing a 10-dimensional vector at each step. Let’s prepare these
target sequences:

In [62]:
Y = np.empty((10000, n_steps, 10)) # each target is a sequence of 10D vectors
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:7000]
Y_valid = Y[7000:9000]
Y_test = Y[9000:]

To turn the model into a sequence-to-sequence model, we must set
`return_sequences=True` in all recurrent layers (even the last one), and we must
apply the output Dense layer at every time step. Keras offers a
TimeDistributed layer for this very purpose.

In [63]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

All outputs are needed during training, but only the output at the last time step is
useful for predictions and for evaluation. So although we will rely on the MSE
over all the outputs for training, we will use a custom metric for evaluation, to
only compute the MSE over the output at the last time step:

In [65]:
def last_time_step_mse(Y_true, Y_pred):
    return keras.metrics.mean_squared_error(Y_true[:, -1], Y_pred[:, -1])

model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f9939270940>

In [66]:
model.evaluate(X_valid, Y_valid)



[0.01456688903272152, 0.00796412955969572]

In this example, we didn't get any improvement.

## Handling Long Sequences

To train an RNN on long sequences, we must run it over many time steps,
making the unrolled RNN a very deep network. Just like any deep neural
network it may suffer from the **unstable gradients problem**, discussed in Chapter
11: it may take forever to train, or training may be unstable. Moreover, when an
RNN processes a long sequence, it will gradually **forget the first inputs in the
sequence**. Let’s look at both these problems, starting with the unstable gradients
problem.

### Fighting the Unstable Gradients Problem

Many of the tricks we learned in the context of deep nets to alleviate the unstable gradients problem can also be used of RNNs: good parameter initialization, faster optimizers, dropout and so on. However, Non-saturating activation functions (ReLU) may not help here. The accumulation of gradients at every step makes the gradients explode. That's why the default activation function is the hyperbolic tangent, which saturates and limit the gradients. 

Moreover, Batch Normalization cannot be used as efficiently with RNNs. It was found that BN was slightly beneficial only when it was applied to the **inputs**, **not to the hidden states**. In other words, it was slightly better than nothing when applied between recurrent layers (i.e., vertically in Figure 15-7), but not within recurrent layers (i.e., horizontally). In Keras this can be done simply by adding a BatchNormalization layer before each recurrent layer, but don’t expect too much from it.

Another form of normalization often workss better with RNNs: *Layer Normalization*. In an RNN, it is typically used right after the linear combination of the inputs and the hidden states.

Let’s use `tf.keras` to implement Layer Normalization within a simple memory cell. For this, we need to define a custom memory cell. It is just like a regular layer, except its `call()` method takes two arguments: the inputs at the current time step and the hidden states from the previous time step. 

In [75]:
class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.outputs_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)
    
    
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]
    
# It would have been simpler to inherit from SimpleRNNCell instead so that we wouldn’t have to
# create an internal SimpleRNNCell or handle the state_size and output_size attributes, but the
# goal here was to show how to create a custom cell from scratch.

The `call()` method starts by applying the simple RNN cell, which
computes a linear combination of the current inputs and the previous hidden
states, and it returns the result twice (indeed, in a SimpleRNNCell, the outputs
are just equal to the hidden states: in other words, `new_states[0]` is equal to
outputs, so we can safely ignore new_states in the rest of the `call()` method).
Next, the `call()` method applies Layer Normalization, followed by the
activation function. Finally, it returns the outputs twice (once as the outputs, and
once as the new hidden states). To use this custom cell, all we need to do is
create a `keras.layers.RNN` layer, passing it a cell instance:

In [78]:
model = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

In [79]:
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history = model.fit(X_train, Y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [80]:
model.evaluate(X_valid, Y_valid)



[0.014905747026205063, 0.009058771654963493]

### Tackling the Short-Term Memory Problem

Various types of cells with long-term memory have been introduced to mitigate the short-term memory problem. They have proven so successful that the basic cells we've been using until now are not used much anymore. Let’s first look at the most popular of these long-term memory cells: the LSTM cell.

#### LSTM cells

The Long Short-Term Memory (LSTM) cell was proposed in 1997. If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better; training will converge faster, and it will detect long-term dependencies in the data. In Keras, you can simply use the LSTM layer instead of the SimpleRNN layer:

In [81]:
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f98fb981f40>

In [83]:
model.evaluate(X_valid, Y_valid)



[0.012923017144203186, 0.006200644187629223]

This is the best result we've obtained so far. So how does an LSTM cell work? Its architecture is shown in Figure 15-9.

![lstm_architecture](./images/ch15_lstm_architecture.png)

If you don’t look at what’s inside the box, the LSTM cell looks exactly like a
regular cell, except that its state is split into two vectors: $\pmb{h}_{(t)}$ and $\pmb{c}_{(t)}$ ("c" stands
for "cell"). You can think of $\pmb{h}_{(t)}$ as the **short-term state** and $\pmb{c}_{(t)}$ as the **long-term state**.

The key idea is that the network can learn what to **store** in the long-term state, what to **throw away**, and what to **read** from it. These three operations are controlled by the three *gate controllers*: $\pmb{f}_{(t)}$, $\pmb{i}_{(t)}$, and $\pmb{o}_{(t)}$.

First, let's start with we already know. The gate $\pmb{g}_{(t)}$ behaves exactly like the cells we've been dealing with so far. It has the usual role of analyzing the current input $\pmb{x}_{(t)}$ and the previous (short-term) state $\pmb{h}_{(t-1)}$. In a basic cell, there is nothing other than this layer, and its output goes straight to the output $\pmb{y}_{(t)}$ and $\pmb{h}_{(t)}$. 

The gate controllers are fully connected layers, responsible for filtering and selecting with informations will be passed through the network.

- the forget gate ($\pmb{f}_{(t)}$) controls which parts of the long-term state should be erased via and element-wise multiplication with the previous long-term memory state $\pmb{c}_{(t-1)}$.

- the input gate ($\pmb{i}_{(t)}$) controls which parts of $\pmb{g}_{(t)}$ should be added to the long-term state to yield the next long term state and also be passed through the $tanh$ activation.

- the output gate ($\pmb{o}_{(t)}$) control which parts of the long-term state should be read and output at this time step, both to $\pmb{h}_{(t)}$ and $\pmb{y}_{(t)}$.

The following equations summarizes how to compute the cell's long-term state, its short-term state, and its output at each time step for a single instance.

$$
\pmb{i}_{(t)} = \sigma(\pmb{W}_{xi}^T\pmb{x}_{(t)} + \pmb{W}_{hi}^T\pmb{h}_{(t-1)} + \pmb{b_i})
\\
\pmb{f}_{(t)} = \sigma(\pmb{W}_{xf}^T\pmb{x}_{(t)} + \pmb{W}_{hf}^T\pmb{h}_{(t-1)} + \pmb{b_f})
\\
\pmb{o}_{(t)} = \sigma(\pmb{W}_{xo}^T\pmb{x}_{(t)} + \pmb{W}_{ho}^T\pmb{h}_{(t-1)} + \pmb{b_o})
\\
\pmb{g}_{(t)} = \tanh(\pmb{W}_{xg}^T\pmb{x}_{(t)} + \pmb{W}_{hg}^T\pmb{h}_{(t-1)} + \pmb{b_g})
\\
\pmb{c}_{(t)} = \pmb{f}_{(t)} \otimes \pmb{c}_{(t-1)} + \pmb{i}_{(t)} \otimes \pmb{g}_{(t)}
\\
\pmb{y}_{(t)} = \pmb{h}_{(t)} = \pmb{o}_{(t)} \otimes \tanh(\pmb{c}_{(t)})
$$

Where:

- $\pmb{W}_{xi}$, $\pmb{W}_{xf}$, $\pmb{W}_{xo}$, and $\pmb{W}_{xg}$ are the weight matrices of each of the four layers for their connection to the input vector $\pmb{x}_{(t)}$.
- $\pmb{W}_{hi}$, $\pmb{W}_{hf}$, $\pmb{W}_{ho}$, and $\pmb{W}_{hg}$ are the weight matrices of each of the four layers for their connection to the previous short-term state $\pmb{h}_{(t-1)}$.

#### Peephole connections

There are many variants of the LSTM cell, one of them was proposed in 2000. This variant has extra connections called *peephole connections*: the previous long-term state $\pmb{c}_{(t-1)}$ is added as an input to the controllers of the forget gate and the input gate, and the current long-term state $\pmb{c}_{(t)}$ is added as input to the controller of the output gate. This often improves performance, but not always, and there is no clear pattern for which tasks are better off with or without them: you will have to try it on your task and see if it helps.

#### GRU cells

The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well. These are the main simplifications:

![gru_cells](./images/ch15_gru_cells.png)

- Both state vectors (for short-term and long-term memory) are merged into a single vector $\pmb{h}_{(t)}$.
- A single gate controller $\pmb{z}_{(t)}$ controls the forget gate and the input gate. If the gate controller outputs a 1, the forget gate is open (= 1) and the input gate is closed (1 – 1 = 0). If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant to the LSTM cell in and of itself.
- There is no output gate; the full state vector is output at every time step. However, there is a new gate controller $\pmb{r}_{(t)}$ that controls which part of the previous state will be shown to the main layer ($\pmb{g}_{(t)}$)

The following equations summarizes how to compute the cell's state at each time for a single instance:

$$
\pmb{z}_{(t)} = \sigma(\pmb{W}_{xz}^T\pmb{x}_{(t)} + \pmb{W}_{hz}^T\pmb{h}_{(t-1)} + \pmb{b_z})
\\
\pmb{r}_{(t)} = \sigma(\pmb{W}_{xr}^T\pmb{x}_{(t)} + \pmb{W}_{hr}^T\pmb{h}_{(t-1)} + \pmb{b_r})
\\
\pmb{g}_{(t)} = \tanh(\pmb{W}_{xg}^T\pmb{x}_{(t)} + \pmb{W}_{hg}^T(\pmb{r}_{(t)} \otimes \pmb{h}_{(t-1)}) + \pmb{b_g})
\\
\pmb{z}_{(t)} = \pmb{z}_{(t)} \otimes \pmb{h}_{(t-1)} + (1 - \pmb{z}_{(t)}) \otimes \pmb{g}_{(t)}
$$

In [84]:
model = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f99398472b0>

In [85]:
model.evaluate(X_valid, Y_valid)



[0.012731347233057022, 0.006827231030911207]

The performance obtained using GRU in this case was a little worse than with LSTM.

LSTM and GRU cells are one of the main reasons behind the success of RNNs.
Yet while they can tackle much longer sequences than simple RNNs, they still
have a fairly limited short-term memory, and they have a hard time learning
long-term patterns in sequences of 100 time steps or more, such as audio
samples, long time series, or long sentences. One way to solve this is to shorten
the input sequences, for example using 1D convolutional layers.

#### Using 1D convolutional layers to process sequences

a 1D convolutional layer slides several kernels across a sequence, producing a 1D feature map per kernel. Each kernel will learn to detect a single very short sequential pattern (no longer than the kernel size). If you use 10 kernels, then the layer’s output will be composed of 10 1-dimensional sequences (all of the same length), or equivalently you can view this output as a single 10-dimensional sequence. This means that you can build a neural network composed of a mix of recurrent layers and 1D convolutional layers (or even 1D pooling layers). If you use a 1D convolutional layer with a stride of 1 and "same" padding, then the output sequence will have the same length as the input sequence. But if you use "valid" padding or a stride greater than 1, then the output sequence will be shorter than the input sequence, so make sure you adjust the targets accordingly. For example, the following model is the same as earlier, except it starts with a 1D convolutional layer that downsamples the input sequence by a factor of 2, using a stride of 2.

Note that we must also crop off the first three time steps in the targets (since the kernel’s size is 4, the first output of the convolutional layer will be based on the input time steps 0 to 3), and downsample the targets by a factor of 2:

In [86]:
model = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid", 
                        input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
model.fit(X_train, Y_train[:, 3::2], epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f98fa6bf040>

In [88]:
model.evaluate(X_valid, Y_valid[:, 3::2])



[0.009614660404622555, 0.006209068465977907]

Using this approach, we got a performance similar to the LSTM.

## Exercises

**1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN, and a vector-to-sequence RNN?**

Sequence-to-sequence applications could be text translation or time-series forecasting. Sequence-to-vector could be a sentiment analysis task with text input data. Lastly, a vector-to-sequence application could be image captioning.

**2. How many dimensions must the inputs of an RNN layer have? What
does each dimension represent? What about its outputs?**

An RNN layer must have three input dimensions *[batch_size, time_steps, dimensionality]*, even for 1-dimensional data, in which case the last coordinate would be 1. For example, if you want to process a batch containing 5 time series of 10 time steps each, with 2 values per time step (e.g., the temperature and the wind speed), the shape will be *[5, 10, 2]*.

The output of an RNN varies according to the task, any task that is trying to predict a sequence, will have an output of *[batch_size, time_steps, dimensionality]*, and in case of a vector, it would have the dimension of the vector.

**3. If you want to build a deep sequence-to-sequence RNN, which RNN layers should have `return_sequences=True`? What about a sequence-to-vector RNN?**

All RNN inner layers must have `return_sequences=True`, in order to have sequences as outputs of the network in a sequence-to-sequence task. On the other hand, in sequence-to-vector approaches, the last RNN layer do not return a sequence, hence `return_sequences=False`. The task of defining the output dimensions is fit only for the output layer.

**4. Suppose you have a daily univariate time series, and you want to forecast the next seven days. Which RNN architecture should you use?**

I would use a sequence-to-sequence RNN architecture. The input shape would be *[batch_size, time_steps, 1]* and the output *[batch_size, time_steps, 1]*.

**5. What are the main difficulties when training RNNs? How can you handle them?**

The two main problems that can arise when training RNNs are *unstable gradients* and *forgetting long sequences*.

- *unstable gradients*: can be fought using regularization techniques, such as dropout, layer normalization, clipnorm or clip values (implemented in the optimizer) and saturating activation functions such as `tanh`.
- *forgetting long sequences*: can be overcomed using LSTM/GRU cells, which are specialized for longer sequences. And if it is not enough, you can also deploy Convolutional layers to do the job.

**6. Can you sketch the LSTM cell's architecture?**

Yes.

**7. Why would you want to use 1D convolutional layers in an RNN?**

1D convolutional layers improve the capacity of the RNN to remember long term sequences. It act by reducing the time series, while learning to filter out "uninportant" parts of the series. This also reduces complexity. 

**8. Which neural network architecture could you use to classify videos?**

To classify videos based on their visual content, one possible
architecture could be to take (say) one frame per second, then run every
frame through the same convolutional neural network (e.g., a pretrained
Xception model, possibly frozen if your dataset is not large), feed the
sequence of outputs from the CNN to a sequence-to-vector RNN, and
finally run its output through a softmax layer, giving you all the class
probabilities. For training you would use cross entropy as the cost
function. If you wanted to use the audio for classification as well, you
could use a stack of strided 1D convolutional layers to reduce the
temporal resolution from thousands of audio frames per second to just
one per second (to match the number of images per second), and
concatenate the output sequence to the inputs of the sequence-to-vector
RNN (along the last dimension).