In [None]:
from IPython.display import Image

To serve as slides: 
- Open a terminal window 
- `jupyter nbconvert /full/path/to/notebook.ipynb --to slides --post serve`

## Week 5 – Recurrent Neural Networks

# Outline


- Recurrent Neurons 
- Basic RNNs in TensorFlow 
- Training RNNs 
- Deep RNNs 
- LSTM Cell 
- GRU Cell 
- Natural Language Processing

    
# Learning Outcomes

- Understanding of fundamental concepts underlying RNNs
- Main problems RNNs face and solutions to fight them
- Comprehension of the different types of cells: LSTMs and GRUs.
- Familiarity with the implementation of RNNs using TensorFlow and Keras. 
- Exposure to applications where RNNs have shown remarkable performance
- Grasp of practical considerations: training time, parameters, vanishing/exploding gradients
- Familiarity with the typical architectures
- Practical training and deployment considerations

# Introduction

- Today, we are going to discuss recurrent neural networks (RNN):
    - A class of nets that can predict the future :)
- They can analyze time series data such as stock prices, 
- In autonomous driving systems, they can anticipate car trajectories,
- They can work on sequences of arbitrary lengths,
- They can take sentences, documents, or audio samples as input, making them extremely useful for natural language processing (NLP) systems.
- RNNs can generate sentences, image captions, and much more. 
- We could ask an RNN to predict the most likely next notes in a melody, randomly pick one of these notes and play it, then ask the net for the next most likely notes, and repeat the process again and again (melody composition; see the Magenta project by Google).


We will look at the fundamental concepts underlying RNNs, the main problems they face during training, and the solutions to those problems.

# Setup

First, let's make sure this notebook has all the required libraries, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [None]:
#File containing all definitions and utility functions.
from setups import *
from plotting import *
%matplotlib inline
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "rnn"

# Recurrent Neurons

- A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backwards. 

- Let’s look at the simplest possible RNN, one cell with feedback: 
    - At each time step t, this recurrent neuron receives the inputs $x(t)$ as well as its own output from the previous time step, $y(t–1)$. 
    - We can represent this tiny network against the time axis. This is called unrolling the network through time.


![rnn_cell.png](imgs/rnn_cell.png)




Now, to create a layer:
- Every neuron receives both the input vector $x(t)$ and the output vector from the previous time step $y(t–1)$. 
- Note that both the inputs and outputs are vectors now 

![rnn_layer.png](imgs/rnn_layer.png)

Each recurrent neuron has two sets of weights: 
- for the inputs $x(t)$: $w_x$
- for the outputs of the previous time step, $y(t–1)$: $w_y$


- We can place all the weight vectors for all neurons in two weight matrices, $W_x$ and $W_y$. 
- The output vector of the whole recurrent layer can then be computed as:  
($b$ is the bias vector and $\phi(.)$ is the activation function).


![rnneq1.png](imgs/rnneq1.png)

We can compute a recurrent layer’s output in one shot for a whole mini-batch:


![rnn_outputeq.png](imgs/rnn_outputeq.png)


- $Y(t)$ is an $m \times n_{neurons}$ matrix containing the layer’s outputs at time step $t$ for each instance in the mini-batch (m is the number of instances in the mini-batch).

- $X(t)$ is an $m \times n_{inputs}$ matrix containing the inputs for all instances 

- $W_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights for the inputs of the current time step.

- $W_y$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights for the outputs of the previous time step.

- $b$ is a vector of size $n_{neurons}$ containing each neuron’s bias term.

The weight matrices $W_x$ and $W_y$ are often concatenated vertically into a single weight matrix $W$ of shape $(n_{inputs} + n_{neurons}) \times n_{neurons}$.

The notation $[X(t) Y(t-1)]$ represents the horizontal concatenation of the matrices $X(t)$ and $Y(t-1)$.

Notice that $Y(t)$ is a function of $X(t)$ and $Y(t-1)$, which is a function of $X(t-1)$ and $Y(t-2)$, which is a function of $X(t-2)$ and $Y(t-3)$, and so on.
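
To make the shapes concrete, here is a minimal NumPy sketch of one time step of a recurrent layer for a whole mini-batch. It assumes a tanh activation and small illustrative sizes; the variable names are ours, not part of the notebook's TensorFlow code. It also checks that the concatenated form gives the same result as the two separate matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_inputs, n_neurons = 4, 3, 5   # illustrative sizes (assumption)

X_t = rng.normal(size=(m, n_inputs))          # X(t): inputs at time step t
Y_prev = rng.normal(size=(m, n_neurons))      # Y(t-1): outputs of the previous step
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

# Form 1: two separate matrix products.
Y_t = np.tanh(X_t @ W_x + Y_prev @ W_y + b)

# Form 2: concatenate [X(t) Y(t-1)] horizontally and W_x, W_y vertically.
W = np.vstack([W_x, W_y])                     # (n_inputs + n_neurons, n_neurons)
Y_t_concat = np.tanh(np.hstack([X_t, Y_prev]) @ W + b)

print(Y_t.shape, np.allclose(Y_t, Y_t_concat))
```

Both forms compute the same $m \times n_{neurons}$ output; the concatenated form simply replaces two matrix products with one.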



# Memory Cells

- The output of a recurrent neuron at time $t$ is a function of all the inputs from previous times:
    - You could say it has a form of memory. 
    
    
- A single recurrent neuron is a very basic form of memory cell.

- In general, a cell’s state $h(t)$ is a function of some inputs $x(t)$ and its state at the previous time step: 
    - $h(t) = f(h(t–1), x(t))$
    
    
- The cell's output $y(t)$ is also a function of the previous state $h(t-1)$ and the current inputs $x(t)$. For the basic cells seen so far, the output is simply equal to the state.

- We will look at more complex memory cells later today

![mem_cells.png](imgs/mem_cells.png)


# Input and Output Sequences

- An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs:
    - Useful for predicting time series


- RNNs can also take a sequence of inputs and ignore all outputs except for the last one:
    - Sequence-to-vector network. For example, feed the network a sequence of words corresponding to a movie review, and the network outputs a sentiment score (e.g., from –1 [hate] to +1 [love]).


- Conversely, you can feed the network a single input at the first time step (and zeros for all other time steps) and let it output a sequence:
    - Vector-to-sequence network. For example, the input could be an image, and the output could be a caption for that image.


- Finally, you can have a sequence-to-vector network (the encoder) followed by a vector-to-sequence network (the decoder):
    - This two-step model, called an Encoder–Decoder, can be used for translating a sentence from one language to another.
    - It works better than translating on the fly with a single sequence-to-sequence RNN, because the last words of a sentence can affect the first words of the translation, so the whole sentence must be read before translating.




![input_output_seq.png](imgs/input_output_seq.png)
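
The difference between a sequence-to-sequence and a sequence-to-vector network can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: `run_rnn` is a hypothetical helper implementing a basic tanh RNN with random weights, not the TensorFlow code used later in this notebook.

```python
import numpy as np

def run_rnn(X, W_x, W_y, b):
    """Run a basic tanh RNN over X of shape [batch, n_steps, n_inputs]."""
    batch_size, n_steps, _ = X.shape
    y = np.zeros((batch_size, W_y.shape[0]))   # initial output: all zeros
    outputs = []
    for t in range(n_steps):
        y = np.tanh(X[:, t, :] @ W_x + y @ W_y + b)
        outputs.append(y)
    return np.stack(outputs, axis=1)           # [batch, n_steps, n_neurons]

rng = np.random.default_rng(42)
n_inputs, n_neurons = 3, 5
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X = rng.normal(size=(4, 6, n_inputs))          # 4 sequences of 6 time steps
seq_to_seq = run_rnn(X, W_x, W_y, b)           # keep the output of every step
seq_to_vec = seq_to_seq[:, -1, :]              # keep only the last output

print(seq_to_seq.shape, seq_to_vec.shape)
```

The same recurrent computation serves both cases; the only difference is which outputs are kept.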


# RNNs in TensorFlow


![rnn_cell.png](imgs/rnn_cell.png)


First, we will create an RNN composed of a layer of five recurrent neurons using the tanh activation function. We will assume that the RNN runs over only two time steps, taking input vectors of size 3 at each time step. The following code builds this RNN, unrolled through two time steps:

In [None]:
n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons],dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons,n_neurons],dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()

This network looks much like a two-layer feedforward neural network, with a few twists:
- The same weights and bias terms are shared by both layers.
- We feed inputs at each layer, and we get outputs from each layer.

To run the model, we need to feed it the inputs at both time steps, like so:


In [None]:
import numpy as np

# Mini-batch:        instance 0,instance 1,instance 2,instance 3
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

This mini-batch contains four instances, each with an input sequence composed of exactly two inputs. At the end, Y0_val and Y1_val contain the outputs of the network at both time steps for all neurons and all instances in the mini-batch:



```
>>> print(Y0_val)  # output at t = 0
[[-0.0664006   0.96257669  0.68105787  0.70918542 -0.89821595]  # instance 0
 [ 0.9977755  -0.71978885 -0.99657625  0.9673925  -0.99989718]  # instance 1
 [ 0.99999774 -0.99898815 -0.99999893  0.99677622 -0.99999988]  # instance 2
 [ 1.         -1.         -1.         -0.99818915  0.99950868]] # instance 3
>>> print(Y1_val)  # output at t = 1
[[ 1.         -1.         -1.          0.40200216 -1.        ]  # instance 0
 [-0.12210433  0.62805319  0.96718419 -0.99371207 -0.25839335]  # instance 1
 [ 0.99999827 -0.9999994  -0.9999975  -0.85943311 -0.9999879 ]  # instance 2
 [ 0.99928284 -0.99999815 -0.99990582  0.98579615 -0.92205751]] # instance 3
 ```

# Static Unrolling Through Time (TF)
The static_rnn() function creates an unrolled RNN network by chaining cells. The following code creates the exact same model as the previous one:


In [None]:
X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1],
                                                dtype=tf.float32)
Y0, Y1 = output_seqs

- Steps:
    - Create the input placeholders 
    - Create a `BasicRNNCell`
    - Call static_rnn()
    
    
- The `static_rnn()` function calls the cell factory’s `__call__()` function once per input, creating two copies of the cell with shared weights and bias terms, and chains them together just as we did manually earlier. The `static_rnn()` function returns two objects:
    - A python list containing the output tensors for each time step. 
    - A tensor containing the final states of the network. 

If there were 50 time steps, you would have to define 50 input placeholders and 50 output tensors, and at execution time you would have to feed each of the 50 placeholders and manipulate the 50 outputs.

To simplify this, the following code builds the same RNN again, but this time it takes a single input placeholder of shape `[None, n_steps, n_inputs]`:

In [None]:
n_steps = 2
n_inputs = 3
n_neurons = 5

reset_graph()

In [None]:
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs,
                                                dtype=tf.float32)
outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

In [None]:
init = tf.global_variables_initializer()

In [None]:
#Now we can run the network by feeding it a single tensor that contains all the mini-batch sequences:

X_batch = np.array([
         # t = 0     t = 1
        [[0, 1, 2], [9, 8, 7]], # instance 0
        [[3, 4, 5], [0, 0, 0]], # instance 1
        [[6, 7, 8], [6, 5, 4]], # instance 2
        [[9, 0, 1], [3, 2, 1]], # instance 3
    ])

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})

```
#And we get a single outputs_val tensor for all instances, all time steps, and all neurons:

>>> print(outputs_val)
[[[-0.91279727  0.83698678 -0.89277941  0.80308062 -0.5283336 ]
  [-1.          1.         -0.99794829  0.99985468 -0.99273592]]

 [[-0.99994391  0.99951613 -0.9946925   0.99030769 -0.94413054]
  [ 0.48733309  0.93389565 -0.31362072  0.88573611  0.2424476 ]]

 [[-1.          0.99999875 -0.99975014  0.99956584 -0.99466234]
  [-0.99994856  0.99999434 -0.96058172  0.99784708 -0.9099462 ]]

 [[-0.95972425  0.99951482  0.96938795 -0.969908   -0.67668229]
  [-0.84596014  0.96288228  0.96856463 -0.14777924 -0.9119423 ]]]
  ```

- However, this approach still builds a graph containing one cell per time step. 
    - If there were 50 time steps, the graph would look pretty ugly. 

Fortunately, there is a better solution: the `dynamic_rnn()` function.

# Dynamic Unrolling Through Time

The `dynamic_rnn()` function uses a `while_loop()` operation to run over the cell the appropriate number of times, and you can set `swap_memory=True` if you want it to swap the GPU’s memory to the CPU’s memory during backpropagation to avoid OOM errors.

- It accepts a single tensor for all inputs at every time step and outputs a single tensor for all outputs at every time step 
- There is no need to stack, unstack, or transpose. 

In [None]:
reset_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

#During backpropagation, the while_loop() operation does the appropriate magic: 
#it stores the tensor values for each iteration during the forward pass 
#so it can use them to compute gradients during the reverse pass.

# Handling Variable Length Input Sequences

- What if the input sequences have variable lengths (e.g., like sentences)?
    - In this case you should set the `sequence_length` argument when calling the `dynamic_rnn()` (or `static_rnn()`) function.
    - It must be a 1D tensor indicating the length of the input sequence for each instance.

In [None]:
seq_length = tf.placeholder(tf.int32, [None])

[...]
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)


For example, suppose the second input sequence contains only one input instead of two. It must be padded with a zero vector in order to fit in the input tensor `X`:

In [None]:

X_batch = np.array([
        # step 0     step 1
        [[0, 1, 2], [9, 8, 7]], # instance 0
        [[3, 4, 5], [0, 0, 0]], # instance 1 (padded with a zero vector)
        [[6, 7, 8], [6, 5, 4]], # instance 2
        [[9, 0, 1], [3, 2, 1]], # instance 3
    ])
seq_length_batch = np.array([2, 1, 2, 2])

We now need to feed values for both placeholders X and seq_length:

In [None]:
with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], feed_dict={X: X_batch, seq_length: seq_length_batch})

Now, the RNN outputs zero vectors for every time step past the input sequence length:

```
>>> print(outputs_val)
[[[-0.68579948 -0.25901747 -0.80249101 -0.18141513 -0.37491536]
  [-0.99996698 -0.94501185  0.98072106 -0.9689762   0.99966913]]  # final state

 [[-0.99099374 -0.64768541 -0.67801034 -0.7415446   0.7719509 ]   # final state
  [ 0.          0.          0.          0.          0.        ]]  # zero vector

 [[-0.99978048 -0.85583007 -0.49696958 -0.93838578  0.98505187]
  [-0.99951065 -0.89148796  0.94170523 -0.38407657  0.97499216]]  # final state

 [[-0.02052618 -0.94588047  0.99935204  0.37283331  0.9998163 ]
  [-0.91052347  0.05769409  0.47446665 -0.44611037  0.89394671]]] # final state
  ```

The states tensor contains the final state of each cell (excluding the zero vectors):


```
>>> print(states_val)
[[-0.99996698 -0.94501185  0.98072106 -0.9689762   0.99966913]  # t = 1
 [-0.99099374 -0.64768541 -0.67801034 -0.7415446   0.7719509 ]  # t = 0 !!!
 [-0.99951065 -0.89148796  0.94170523 -0.38407657  0.97499216]  # t = 1
 [-0.91052347  0.05769409  0.47446665 -0.44611037  0.89394671]] # t = 1
```
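
The behaviour shown above (zero outputs past the end of a sequence, and a final state taken from the last valid step) can be reproduced with a small NumPy sketch. This is our own illustration of the idea, assuming a basic tanh cell with random weights; it is not how `dynamic_rnn()` is implemented, and the numbers will differ from the TensorFlow output.

```python
import numpy as np

def run_rnn_with_lengths(X, seq_length, W_x, W_y, b):
    """Basic tanh RNN that stops updating each instance after its own length."""
    batch_size, n_steps, _ = X.shape
    n_neurons = W_y.shape[0]
    state = np.zeros((batch_size, n_neurons))
    outputs = np.zeros((batch_size, n_steps, n_neurons))
    for t in range(n_steps):
        new_state = np.tanh(X[:, t, :] @ W_x + state @ W_y + b)
        active = (t < seq_length)[:, None]           # True while the sequence is running
        state = np.where(active, new_state, state)   # freeze finished sequences
        outputs[:, t, :] = np.where(active, new_state, 0.0)
    return outputs, state

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 5
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X_batch = np.array([
    [[0, 1, 2], [9, 8, 7]],  # instance 0
    [[3, 4, 5], [0, 0, 0]],  # instance 1 (padded with a zero vector)
    [[6, 7, 8], [6, 5, 4]],  # instance 2
    [[9, 0, 1], [3, 2, 1]],  # instance 3
], dtype=float)
seq_length_batch = np.array([2, 1, 2, 2])

outputs_val, states_val = run_rnn_with_lengths(
    X_batch, seq_length_batch, W_x, W_y, b)
```

Here `outputs_val[1, 1]` is a zero vector and `states_val[1]` equals `outputs_val[1, 0]`, mirroring the pattern in the printed tensors above.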

# Handling Variable-Length Output Sequences

- What if the output sequences have variable lengths as well? For example, the length of a translated sentence is generally different from the length of the input sentence.
    - In this case, the most common solution is to define a special output called an end-of-sequence token (EOS token). Any output past the EOS should be ignored.
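
Ignoring everything past the EOS token is a simple post-processing step. The sketch below assumes the decoded output is a Python list of string tokens and that the EOS marker is the string `"<eos>"`; both the helper name `truncate_at_eos` and the marker are hypothetical choices for illustration.

```python
def truncate_at_eos(tokens, eos="<eos>"):
    """Return the tokens before the first EOS marker (all tokens if none)."""
    if eos in tokens:
        return list(tokens[:tokens.index(eos)])
    return list(tokens)

# Hypothetical decoder output: anything after the first EOS is discarded.
decoded = ["je", "bois", "du", "lait", "<eos>", "lait", "<eos>"]
sentence = truncate_at_eos(decoded)
print(sentence)
```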


# Training RNNs

To train an RNN, the approach is to unroll it through time and then use regular backpropagation. This strategy is called backpropagation through time (BPTT).


![training_rnns.png](imgs/training_rnns.png)


- First, there is a forward pass through the unrolled network;
- then the output sequence is evaluated using a cost function;
- the gradients of that cost function are propagated backward through the unrolled network;
- finally, the model parameters are updated using the gradients computed during BPTT.

Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output. 

Since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.
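
To see the "sum over all time steps" explicitly, here is a minimal sketch of BPTT for a single recurrent neuron with scalar weights, checked against finite differences. The setup is an assumption made for illustration (tanh activation, squared-error cost on every output, names `forward`, `bptt`, and `numeric_grad` are ours); TensorFlow computes these gradients automatically.

```python
import numpy as np

def forward(params, x, targets):
    """Unrolled forward pass: y_t = tanh(w_x * x_t + w_y * y_{t-1} + b)."""
    w_x, w_y, b = params
    y_prev, ys = 0.0, []
    for x_t in x:
        y_prev = np.tanh(w_x * x_t + w_y * y_prev + b)
        ys.append(y_prev)
    ys = np.array(ys)
    loss = 0.5 * np.sum((ys - targets) ** 2)   # cost uses every output
    return ys, loss

def bptt(params, x, targets):
    """Gradients of the loss w.r.t. (w_x, w_y, b), summed over time steps."""
    w_x, w_y, b = params
    ys, _ = forward(params, x, targets)
    grads = np.zeros(3)
    d_from_future = 0.0                        # gradient arriving from step t+1
    for t in reversed(range(len(x))):
        d_y = (ys[t] - targets[t]) + d_from_future
        d_a = d_y * (1.0 - ys[t] ** 2)         # back through tanh
        y_before = ys[t - 1] if t > 0 else 0.0
        grads += d_a * np.array([x[t], y_before, 1.0])  # shared parameters: accumulate
        d_from_future = d_a * w_y
    return grads

def numeric_grad(params, x, targets, eps=1e-6):
    """Central finite differences, used only to check bptt()."""
    grads = np.zeros(3)
    for i in range(3):
        plus, minus = params.copy(), params.copy()
        plus[i] += eps
        minus[i] -= eps
        loss_plus = forward(plus, x, targets)[1]
        loss_minus = forward(minus, x, targets)[1]
        grads[i] = (loss_plus - loss_minus) / (2 * eps)
    return grads

rng = np.random.default_rng(0)
params = np.array([0.5, -0.3, 0.1])            # w_x, w_y, b
x = rng.normal(size=8)
targets = rng.normal(size=8)

analytic = bptt(params, x, targets)
numeric = numeric_grad(params, x, targets)
print(np.allclose(analytic, numeric, atol=1e-5))
```

Note how `grads` accumulates a contribution at every time step, and how `d_y` combines the cost at step `t` with the gradient flowing back from later steps.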



# Training a Sequence Classifier

- Let’s train an RNN to classify MNIST images:
    - We will treat each image as a sequence of 28 rows of 28 pixels each 
    - We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons connected to the output of the last time step, followed by a softmax layer

![seq_clf.png](imgs/seq_clf.png)
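
Treating an image as a sequence of rows is just a reshape. The short sketch below uses a stand-in array instead of real MNIST data (an assumption, so that it runs without downloading anything) to show how a flattened 784-pixel image becomes 28 time steps of 28 inputs each.

```python
import numpy as np

n_steps, n_inputs = 28, 28

# Stand-in for two flattened MNIST images (pixel i simply holds the value i).
flat_images = np.arange(2 * 784, dtype=np.float32).reshape(2, 784)

# Same reshape as used for the real data: [batch_size, n_steps, n_inputs].
sequences = flat_images.reshape((-1, n_steps, n_inputs))

# Time step t of an image is row t, i.e. pixels 28*t .. 28*t + 27.
row_3 = sequences[0, 3]
print(sequences.shape, row_3[:3])
```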


In [None]:
n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

reset_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

In [None]:
#Now let’s load the MNIST data and reshape the test data 
# to [batch_size, n_steps, n_inputs] as is expected by the network. 
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

In [None]:
#The execution phase is exactly the same as for a Fully Connected network (See Geron ch. 10)

n_epochs = 100
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

The output should look like this:
```
0 Train accuracy: 0.94 Test accuracy: 0.9308
1 Train accuracy: 0.933333 Test accuracy: 0.9431
[...]
98 Train accuracy: 0.98 Test accuracy: 0.9794
99 Train accuracy: 1.0 Test accuracy: 0.9804
```

We get over 98% accuracy! Plus you would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding regularization.

You can specify an initializer for the RNN by wrapping its construction code in a variable scope (e.g., use `variable_scope("rnn", initializer=variance_scaling_initializer())` to use He initialization).

# Training to Predict Time Series
Now let’s take a look at how to handle time series, such as stock prices, air temperature, brain wave patterns, and so on. 

- We will train an RNN to predict the next value in a generated time series: 
    - Each training instance is a randomly selected sequence of 20 consecutive values from the time series
    - The target sequence is the same as the input sequence, except it is shifted by one time step into the future 

![predict_ts.png](imgs/predict_ts.png)
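
One possible way to build such training instances is sketched below in NumPy. Everything here is an assumption for illustration: the helper `sample_windows`, the sine-wave series, and the batch size are ours and are not the batch-fetching code elided later in this notebook.

```python
import numpy as np

def sample_windows(series, batch_size, n_steps, rng):
    """Sample random windows and return inputs plus targets shifted by one step."""
    # Each window needs n_steps inputs plus one extra value for the last target.
    starts = rng.integers(0, len(series) - n_steps, size=batch_size)
    idx = starts[:, None] + np.arange(n_steps + 1)
    windows = series[idx]                  # [batch_size, n_steps + 1]
    X = windows[:, :-1, None]              # inputs:  positions 0 .. n_steps-1
    y = windows[:, 1:, None]               # targets: positions 1 .. n_steps
    return X, y

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 30, 600))   # hypothetical generated time series
X_batch, y_batch = sample_windows(series, batch_size=50, n_steps=20, rng=rng)
print(X_batch.shape, y_batch.shape)
```

Both arrays have shape `[batch_size, n_steps, 1]`, and each target sequence is the input sequence moved one step into the future.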


First, let’s create the RNN. It will contain 100 recurrent neurons and we will unroll it over 20 time steps since each training instance will be 20 inputs long. Each input will contain only one feature (the value at that time). The targets are also sequences of 20 values, one per time step. The code is almost the same as earlier:

In [None]:
n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

In general you would have more than just one input feature. For example, if you were trying to predict stock prices, you would likely have many other input features at each time step, such as prices of competing stocks, ratings from analysts, or any other feature that might help the system make its predictions.

- At each time step we have an output vector of size 100. But what we actually want is a single output value at each time step.
    - The simplest solution is to wrap the cell in an `OutputProjectionWrapper`. A cell wrapper acts like a normal cell, proxying every method call to an underlying cell, but it also adds some functionality. 
    - The `OutputProjectionWrapper` adds a fully connected layer of linear neurons on top of each output.
    - The resulting RNN is represented as:


![output_projections.png](imgs/output_projections.png)


Let’s tweak the preceding code by wrapping the BasicRNNCell into an OutputProjectionWrapper:

In [None]:
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)

Now, we need to define the cost function (we could use the MSE), an optimizer (Adam), the training op, and the variable initialization op:

In [None]:
learning_rate = 0.001

loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
#Now on to the execution phase:

n_iterations = 1500
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = [...]  # fetch the next training batch
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

The program’s output should look like this:

```
0       MSE: 13.6543
100     MSE: 0.538476
200     MSE: 0.168532
300     MSE: 0.0879579
400     MSE: 0.0633425
[...]
```

Once the model is trained, you can make predictions:

```
X_new = [...]  # New sequences
y_pred = sess.run(outputs, feed_dict={X: X_new})
```

Prediction after 1000 iterations:

![ts_predictions.png](imgs/ts_prediction.png)

- Besides `OutputProjectionWrapper`, there is a trickier but more efficient solution:
    - reshape the RNN outputs from `[batch_size, n_steps, n_neurons]` to `[batch_size * n_steps, n_neurons]`,
    - apply a single fully connected layer with the appropriate output size, which yields an output tensor of shape `[batch_size * n_steps, n_outputs]`, and
    - reshape this tensor to `[batch_size, n_steps, n_outputs]`.



![stack_outputs.png](imgs/stack_outputs.png)
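
The reshape trick works because a fully connected layer treats every row independently. The NumPy sketch below (random stand-in values, our own variable names) checks that reshaping, applying one linear layer, and reshaping back gives the same result as applying that layer separately at each time step.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_steps, n_neurons, n_outputs = 4, 20, 100, 1

rnn_outputs = rng.normal(size=(batch_size, n_steps, n_neurons))  # stand-in RNN outputs
W = rng.normal(size=(n_neurons, n_outputs))   # weights of the shared linear layer
b = np.zeros(n_outputs)

stacked = rnn_outputs.reshape(-1, n_neurons)            # [batch_size * n_steps, n_neurons]
stacked_out = stacked @ W + b                           # one matrix product for all steps
outputs = stacked_out.reshape(-1, n_steps, n_outputs)   # [batch_size, n_steps, n_outputs]

# Reference: apply the same layer separately at every time step.
per_step = np.stack(
    [rnn_outputs[:, t, :] @ W + b for t in range(n_steps)], axis=1)

print(outputs.shape, np.allclose(outputs, per_step))
```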



To implement this solution, we first revert to a basic cell, without the `OutputProjectionWrapper`:

In [None]:
cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

Then we stack all the outputs using the reshape() operation, apply the fully connected linear layer, and finally unstack all the outputs, again using reshape():

In [None]:
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])

# Creative RNN

- We can use the same model we trained to generate some creative sequences:
    - We need to provide a sequence seed containing n_steps values (e.g., full of zeros), 
    - use the model to predict the next value, 
    - append this predicted value to the sequence, 
    - feed the last n_steps values to the model to predict the next value, and so on. 
    
- This process generates a sequence that resembles the original time series.


In [None]:
sequence = [0.] * n_steps
for iteration in range(300):
    X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
    y_pred = sess.run(outputs, feed_dict={X: X_batch})
    sequence.append(y_pred[0, -1, 0])

![creative_seq.png](imgs/creative_seq.png)


What if you could feed all your favourite music to an RNN and see whether it can generate the next hit? You would likely need a much more powerful RNN, with more neurons and many more layers. Let’s look at deep RNNs now.

# Deep RNNs

It is quite common to stack multiple layers of cells:

![deeprnn.png](imgs/deeprnn.png)


In TensorFlow, you can create several cells and stack them into a MultiRNNCell. 

In [None]:
# In the following code we stack three identical cells
# (but you could very well use various kinds of cells with a different number of neurons):

n_neurons = 100
n_layers = 3

layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons,
                                      activation=tf.nn.relu)
          for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)


- The states variable is a tuple containing one tensor per layer, each representing the final state of that layer’s cell (with shape [batch_size, n_neurons]). 

- If you set `state_is_tuple=False`, then states becomes a single tensor containing the states from every layer, concatenated along the column axis.
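
The two layouts of `states` differ only in shape. The following NumPy sketch uses stand-in arrays (not real TensorFlow states) to show what concatenating the per-layer states along the column axis looks like.

```python
import numpy as np

batch_size, n_neurons, n_layers = 4, 100, 3

# Stand-in for the tuple form: one [batch_size, n_neurons] array per layer.
# Layer k is filled with the value k so the layers are easy to tell apart.
states_tuple = tuple(np.full((batch_size, n_neurons), float(layer))
                     for layer in range(n_layers))

# Stand-in for the single-tensor form: concatenate along the column axis.
states_concat = np.concatenate(states_tuple, axis=1)
print(states_concat.shape)
```

Columns 0 to 99 hold the first layer's state, columns 100 to 199 the second layer's, and so on.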

# Distributing a Deep RNN Across Multiple GPUs

You can efficiently distribute deep RNNs across multiple GPUs by pinning each layer to a different GPU. However, if you try to create each cell in a different device() block, it will not work:

![rnn_cell.png](imgs/rnn_cell.png)


- It would fail because a BasicRNNCell is a cell factory, not a cell per se; 
    - No cells get created when you create the factory, and thus no variables do either. The device block is simply ignored. The cells actually get created later. 
    - When you call `dynamic_rnn()`, it calls the `MultiRNNCell`, which calls each individual `BasicRNNCell`, which creates the actual cells (including their variables).
    - Unfortunately, none of these classes provide any way to control the devices on which the variables get created. 
    - If you try to put the dynamic_rnn() call within a device block, the whole RNN gets pinned to a single device. 
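- The trick is to create your own cell wrapper, which places the wrapped cell's operations inside a device block when the cell is actually called.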

In [None]:
with tf.device("/gpu:0"):  # BAD! This is ignored.
    layer1 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

with tf.device("/gpu:1"):  # BAD! Ignored again.
    layer2 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

In [None]:
import tensorflow as tf

class DeviceCellWrapper(tf.contrib.rnn.RNNCell):
    """Cell wrapper that runs the wrapped cell on the given device."""

    def __init__(self, device, cell):
        self._cell = cell
        self._device = device

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        # The wrapped cell is created and run inside the device block.
        with tf.device(self._device):
            return self._cell(inputs, state, scope)

# Applying Dropout

- If you build a very deep RNN, it may end up overfitting the training set. 
    - To prevent that, a common technique is to apply dropout
- You can simply add a dropout layer before or after the RNN as usual, but if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper. 
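
As a reminder of what dropout does to a layer's inputs, here is a minimal NumPy sketch. It assumes the usual "inverted dropout" convention, in which the kept values are scaled by `1 / keep_prob` so that the expected value is unchanged (this is also what `tf.nn.dropout` does); the `dropout` helper itself is ours.

```python
import numpy as np

def dropout(inputs, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob and rescale the rest."""
    if keep_prob >= 1.0:
        return inputs                          # dropout switched off (test time)
    mask = rng.random(inputs.shape) < keep_prob
    return np.where(mask, inputs / keep_prob, 0.0)

rng = np.random.default_rng(0)
layer_inputs = np.ones((1000, 100))

train_out = dropout(layer_inputs, keep_prob=0.5, rng=rng)   # training behaviour
test_out = dropout(layer_inputs, keep_prob=1.0, rng=rng)    # testing behaviour
print(train_out.mean(), test_out.mean())
```

With `keep_prob=0.5`, roughly half of the values become 0 and the others become 2, so the mean stays close to 1; with `keep_prob=1.0` the inputs pass through untouched.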

In [None]:
keep_prob = tf.placeholder_with_default(1.0, shape=())

cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
              for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
# The rest of the construction phase is just like earlier.

#During training, you can feed any value you want to the keep_prob placeholder (typically, 0.5):

n_iterations = 1500
batch_size = 50
train_keep_prob = 0.5

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        _, mse = sess.run([training_op, loss],
                          feed_dict={X: X_batch, y: y_batch,
                                     keep_prob: train_keep_prob})
    saver.save(sess, "./my_dropout_time_series_model")
    
    
#During testing, you should let keep_prob default to 1.0, 
#effectively turning dropout off (remember that it should only be active during training):

with tf.Session() as sess:
    saver.restore(sess, "./my_dropout_time_series_model")

    X_new = [...] # some test data
    y_pred = sess.run(outputs, feed_dict={X: X_new})

Note that it is also possible to apply dropout to the outputs by setting `output_keep_prob`, and to the cell’s state by setting `state_keep_prob`.

Unfortunately, if you want to train an RNN on long sequences, things will get a bit harder. Let’s see why and what you can do about it.

# The Difficulty of Training over Many Time Steps

- To train an RNN on long sequences, you will need to run it over many time steps, making the unrolled RNN a very deep network. 
    - Just like any deep neural network, it may suffer from the vanishing/exploding gradients problem and take a long time to train.


- Many of the usual tricks for deep nets can be applied to alleviate this problem:
    - good parameter initialization,
    - nonsaturating activation functions (e.g., ReLU),
    - Batch Normalization,
    - Gradient Clipping, and
    - faster optimizers.
- Even with all of these, training will still be very slow.

- One solution is to unroll the RNN only over a limited number of time steps during training (truncated backpropagation through time). 
    - It can be implemented by truncating the input sequences:
        - Reduce `n_steps` during training. In this case, the model will not be able to learn long-term patterns. 
        - One workaround could be to make sure that these shortened sequences contain both old and recent data, so that the model can learn to use both.
            - This workaround has its limits: what if fine-grained data from last year is actually useful? What if there was a brief but significant event that absolutely must be taken into account, even years later?


- For RNNs, the memory of the first inputs gradually fades away. Some information is lost after each time step. 
- After a while, the RNN’s state contains virtually no trace of the first inputs.
    - For example, say you want to perform sentiment analysis on a long review that starts with the four words “I loved this movie,” but the rest of the review lists the many things that could have made the movie even better. 
    - If the RNN gradually forgets the first four words, it will completely misinterpret the review. 
    - To solve this problem, various types of cells with long-term memory have been introduced. 
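
A toy calculation shows why both gradients and memories fade (or blow up) over many time steps, and what gradient clipping does about the latter. The sketch assumes a scalar linear recurrence $h(t) = w \, h(t-1) + x(t)$, for which the influence of $h(0)$ on $h(T)$ is exactly $w^T$; the `clip_by_norm` helper is our own illustration of clipping by L2 norm, not a TensorFlow function.

```python
import numpy as np

n_steps = 50

# Influence of the initial state on the state n_steps later, for three weights.
influence = {w: w ** n_steps for w in (0.5, 1.0, 1.5)}
for w, value in influence.items():
    print(f"w = {w}: influence after {n_steps} steps = {value:.3e}")

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm <= max_norm:
        return grad
    return grad * (max_norm / norm)

exploding_grad = np.array([300.0, -400.0])     # L2 norm = 500
clipped = clip_by_norm(exploding_grad, max_norm=5.0)
print(clipped)
```

With $|w| < 1$ the influence of early inputs shrinks toward zero, and with $|w| > 1$ it grows without bound; clipping keeps the direction of a large gradient but caps its size.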

# LSTM Cell

- The Long Short-Term Memory (LSTM) cell was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber.

- An LSTM cell will:
    - Converge faster while training
    - Detect long-term dependencies in the data. 
    - In TensorFlow, replace `BasicRNNCell` with `BasicLSTMCell`
        - `lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)`



- LSTM cells manage two state vectors, and for performance reasons they are kept separate by default. You can change this default behavior by setting `state_is_tuple=False`.


- Architecture:

![lstmcell.png](imgs/lstmcell.png)

- The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it


- The LSTM cell looks exactly like a regular cell, except that its state is split into two vectors:
    - $h_{(t)}$, the short-term state, and
    - $c_{(t)}$, the long-term state.


- As the long-term state $c_{(t–1)}$ traverses the network:
    - it first goes through a forget gate, dropping some memories, then 
    - adds some new memories via the addition operation.
    - At each time step, some memories are dropped and some memories are added. 
    
    
- The long-term state is then copied and passed through the `tanh` function, and the result is filtered by the output gate. This produces the short-term state $h_{(t)}$.

- Here is how it works. The current input vector $x_{(t)}$ and the previous short-term state $h_{(t-1)}$ are fed to four different fully connected layers. They all serve a different purpose:
    - The main layer is the one that outputs $g_{(t)}$. It has the usual role of analyzing the current inputs $x_{(t)}$ and the previous short-term state $h_{(t-1)}$. 
        - In a basic cell, there is nothing but this layer, and its output becomes the new state directly.
        - In an LSTM cell, this layer’s output is partially stored in the long-term state.
    - The three other layers are gate controllers. Their outputs range from 0 to 1. 
        - Their outputs are fed to element-wise multiplication operations, so if they output 0s, they close the gate, and if they output 1s, they open it. Specifically:
            - The forget gate controls which parts of the long-term state should be erased.
            - The input gate controls which parts of $g_{(t)}$ should be added to the long-term state.
            - The output gate controls which parts of the long-term state should be read and output at this time step.
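
The gating described above is just element-wise multiplication. A tiny sketch with hand-picked, saturated gate values (all numbers here are made up for illustration; real gates output values strictly between 0 and 1):

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])   # previous long-term state
g      = np.array([0.3,  0.8, -0.6])  # candidate values from the main layer

f = np.array([1.0, 0.0, 1.0])  # forget gate: keep, erase, keep
i = np.array([0.0, 1.0, 1.0])  # input gate: ignore, write, write
o = np.array([1.0, 0.0, 1.0])  # output gate: read, hide, read

c = f * c_prev + i * g   # drop some memories, then add new ones
h = o * np.tanh(c)       # read only the parts the output gate lets through

print("long-term state: ", c)
print("short-term state:", h)
```

The second component shows erase-then-write: the old value `-1.0` is dropped and replaced by `0.8`, but it is hidden from the output at this step because the output gate is closed there.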



- An LSTM cell can learn to recognize an important input, store it in the long-term state, learn to preserve it for as long as it is needed, and learn to extract it whenever it is needed. 

- They have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more.

![rnneq.png](imgs/rnneq.png)

$W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ are the weight matrices of each of the four layers for their connection to the input vector $x_{(t)}$.

$W_{hi}$, $W_{hf}$, $W_{ho}$, and $W_{hg}$ are the weight matrices of each of the four layers for their connection to the previous short-term state $h_{(t-1)}$.

$b_i$, $b_f$, $b_o$, and $b_g$ are the bias terms for each of the four layers.
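
To make the roles of these weights concrete, here is a minimal NumPy sketch of one LSTM time step. It assumes the standard formulation (sigmoid gates, `tanh` main layer, $c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)}$ and $h_{(t)} = o_{(t)} \otimes \tanh(c_{(t)})$); the function name `lstm_step`, the dictionary layout, and the sizes are illustrative choices, not TensorFlow's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step. W_x, W_h and b are dicts keyed by 'i', 'f', 'o', 'g'."""
    i = sigmoid(x @ W_x["i"] + h_prev @ W_h["i"] + b["i"])  # input gate
    f = sigmoid(x @ W_x["f"] + h_prev @ W_h["f"] + b["f"])  # forget gate
    o = sigmoid(x @ W_x["o"] + h_prev @ W_h["o"] + b["o"])  # output gate
    g = np.tanh(x @ W_x["g"] + h_prev @ W_h["g"] + b["g"])  # main layer
    c = f * c_prev + i * g   # new long-term state
    h = o * np.tanh(c)       # new short-term state (also the cell's output)
    return h, c

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch_size = 3, 5, 2
W_x = {k: rng.normal(scale=0.1, size=(n_inputs, n_neurons)) for k in "ifog"}
W_h = {k: rng.normal(scale=0.1, size=(n_neurons, n_neurons)) for k in "ifog"}
b = {k: np.zeros(n_neurons) for k in "ifog"}

x = rng.normal(size=(batch_size, n_inputs))
h = np.zeros((batch_size, n_neurons))
c = np.zeros((batch_size, n_neurons))
h, c = lstm_step(x, h, c, W_x, W_h, b)
print(h.shape, c.shape)
```

Calling `lstm_step` repeatedly, feeding `h` and `c` back in, unrolls the cell over a sequence.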

# Peephole Connections

- In a basic LSTM cell, the gate controllers can look only at the input $x_{(t)}$ and the previous short-term state $h_{(t-1)}$. It may be a good idea to give them a bit more context by letting them peek at the long-term state as well. 
    - An LSTM variant with peephole connections was proposed by Felix Gers and Jürgen Schmidhuber in 2000.
        - The previous long-term state $c_{(t-1)}$ is added as an input to the controllers of the forget gate and the input gate, and the current long-term state $c_{(t)}$ is added as input to the controller of the output gate.



- In TensorFlow, use `LSTMCell` and set `use_peepholes=True`:
    - `lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)`
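
What the peephole connections change can be sketched in NumPy. This is an illustration under assumptions, not TensorFlow's code: the peephole weights `p` are taken to be element-wise vectors (one weight per neuron), and the names `peephole_lstm_step` and `p` are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x, h_prev, c_prev, W_x, W_h, p, b):
    """One LSTM step where the gate controllers also see the long-term state."""
    # Input and forget gates peek at the previous long-term state c_prev
    i = sigmoid(x @ W_x["i"] + h_prev @ W_h["i"] + p["i"] * c_prev + b["i"])
    f = sigmoid(x @ W_x["f"] + h_prev @ W_h["f"] + p["f"] * c_prev + b["f"])
    g = np.tanh(x @ W_x["g"] + h_prev @ W_h["g"] + b["g"])
    c = f * c_prev + i * g
    # The output gate peeks at the current long-term state c
    o = sigmoid(x @ W_x["o"] + h_prev @ W_h["o"] + p["o"] * c + b["o"])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch_size = 3, 4, 2
W_x = {k: rng.normal(scale=0.1, size=(n_inputs, n_neurons)) for k in "ifog"}
W_h = {k: rng.normal(scale=0.1, size=(n_neurons, n_neurons)) for k in "ifog"}
p = {k: rng.normal(scale=0.5, size=n_neurons) for k in "ifo"}
b = {k: np.zeros(n_neurons) for k in "ifog"}

x = rng.normal(size=(batch_size, n_inputs))
h_prev = np.zeros((batch_size, n_neurons))
c_prev = np.ones((batch_size, n_neurons))
h, c = peephole_lstm_step(x, h_prev, c_prev, W_x, W_h, p, b)
print(h.shape, c.shape)
```

Setting every vector in `p` to zero recovers the plain LSTM step.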
    

# GRU Cell


- The Gated Recurrent Unit (GRU) cell was proposed by Kyunghyun Cho et al. in a 2014 paper.

![gru_cell.png](imgs/gru_cell.png)

# GRU Cell

- The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well:
    - Both state vectors are merged into a single vector $h_{(t)}$.
    - A single gate controller controls both the forget gate and the input gate:
        - If the gate controller outputs a 1, the forget gate is open and the input gate is closed. 
        - If it outputs a 0, the opposite happens. 
        - Whenever a memory must be stored, the location is erased first. 
        
    - There is no output gate; the full state vector is output at every time step. 
    - There is a new gate controller that controls which part of the previous state will be shown to the main layer.

# Equations

![gru_eq.png](imgs/gru_eq.png)
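
A minimal NumPy sketch of one GRU step, written to match the convention in the bullets of the previous slide: when the shared gate controller $z_{(t)}$ outputs 1, the previous state is kept and nothing new is written. The function name `gru_step`, the dictionary keys (`z` for the shared gate, `r` for the gate that filters the previous state, `g` for the main layer), and the sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_x, W_h, b):
    """One GRU time step. W_x, W_h and b are dicts keyed by 'z', 'r', 'g'."""
    z = sigmoid(x @ W_x["z"] + h_prev @ W_h["z"] + b["z"])  # forget/input gate
    r = sigmoid(x @ W_x["r"] + h_prev @ W_h["r"] + b["r"])  # filters h_prev
    # The main layer only sees the part of the previous state that r lets through
    g = np.tanh(x @ W_x["g"] + (r * h_prev) @ W_h["g"] + b["g"])
    # z = 1 keeps the old state; z = 0 overwrites it with g
    return z * h_prev + (1.0 - z) * g

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch_size = 3, 4, 2
W_x = {k: rng.normal(scale=0.1, size=(n_inputs, n_neurons)) for k in "zrg"}
W_h = {k: rng.normal(scale=0.1, size=(n_neurons, n_neurons)) for k in "zrg"}
b = {k: np.zeros(n_neurons) for k in "zrg"}

x = rng.normal(size=(batch_size, n_inputs))
h_prev = rng.normal(size=(batch_size, n_neurons))
h = gru_step(x, h_prev, W_x, W_h, b)
print(h.shape)
```

Because the same `z` scales both terms, a location is erased exactly to the extent that something new is written into it.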

- In TensorFlow:
    - `gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)`
- LSTM and GRU cells are among the main reasons behind the success of RNNs in recent years, in particular for applications in natural language processing (NLP).

# Natural Language Processing


- Most of the state-of-the-art NLP applications, such as machine translation, automatic summarization, parsing, sentiment analysis, and more, are based on RNNs.

- This topic is very well covered by TensorFlow’s `Word2Vec` and `Seq2Seq` tutorials, so you should definitely check them out.

## Word Embeddings

- We need to choose a word representation. One option could be to represent each word using a one-hot vector. 
    - However, with a large vocabulary, this sparse representation would not be efficient at all.
    
    
- Ideally, you want similar words to have similar representations. 
    - For example, if the model is told that “I drink milk” is a valid sentence, and it knows that “milk” is close to “water” but far from “shoes,” then it will know that “I drink water” is probably a valid sentence as well, while “I drink shoes” is probably not. 
    
    
- The most common solution is to represent each word in the vocabulary using a small and dense vector (e.g., 150 dimensions), called an embedding, and just let the neural network learn a good embedding for each word during training.


- During training, backpropagation automatically moves the embeddings around in a way that helps the neural network perform its task. Typically this means that similar words will gradually cluster close to one another, and even end up organized in a rather meaningful way.
    - For example, embeddings may end up placed along various axes that represent gender, singular/plural, adjective/noun, and so on. 
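
"Close" and "far" can be made concrete with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors picked by hand for illustration only; real embeddings are learned during training and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not learned from data
toy_embeddings = {
    "milk":  [0.9, 0.8, 0.1],
    "water": [0.8, 0.9, 0.0],
    "shoes": [0.1, 0.0, 0.9],
}

sim_water = cosine_similarity(toy_embeddings["milk"], toy_embeddings["water"])
sim_shoes = cosine_similarity(toy_embeddings["milk"], toy_embeddings["shoes"])
print("milk vs water:", round(sim_water, 3))
print("milk vs shoes:", round(sim_shoes, 3))
```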

In [None]:
# you first need to create the variable representing the embeddings 
# for every word in your vocabulary (initialized randomly)

vocabulary_size = 50000
embedding_size = 150

init_embeds = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeds)

# Once you have a list of known words, you can feed the word identifiers to TensorFlow
# using a placeholder, and apply the embedding_lookup() function to get 
# the corresponding embeddings:


train_inputs = tf.placeholder(tf.int32, shape=[None])  # from ids...
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # ...to embeddings


- Once your model has learned good word embeddings, they can actually be reused fairly efficiently in any NLP application
    - In fact, instead of training your own word embeddings, you may want to download pretrained word embeddings. 

# An Encoder–Decoder Network for Machine Translation


- Let’s take a look at a machine translation model from English to French

![translation.png](imgs/translation.png)


- The English sentences are fed to the encoder, and the decoder outputs the French translations. 
    - The decoder is given as input the word that it should have output at the previous step 
    - For the very first word, it is given a token that represents the beginning of the sentence. 
    - The decoder is expected to end the sentence with an end-of-sequence (EOS) token (e.g., `<eos>`).

- English sentences are reversed. For example “I drink milk” is reversed to “milk drink I.” 
    - This ensures that the beginning of the English sentence will be fed last to the encoder, since it is the first thing the decoder needs to translate.
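
The input and target sequences described above can be built as follows. This is a small sketch with whitespace tokenization; the token name `<go>` for the beginning-of-sentence marker and the helper name `make_training_example` are assumptions, only `<eos>` comes from the slide.

```python
GO, EOS = "<go>", "<eos>"  # "<go>" is an assumed name for the start token

def make_training_example(source_sentence, target_sentence):
    """Return (encoder input, decoder input, decoder target) as word lists."""
    encoder_input = source_sentence.split()[::-1]  # reversed English sentence
    target_words = target_sentence.split()
    decoder_input = [GO] + target_words    # the word it should have output one step earlier
    decoder_target = target_words + [EOS]  # what the decoder should output
    return encoder_input, decoder_input, decoder_target

enc, dec_in, dec_target = make_training_example("I drink milk", "Je bois du lait")
print(enc)
print(dec_in)
print(dec_target)
```

At every step, the decoder input is the target shifted right by one position.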

# An Encoder–Decoder Network for Machine Translation

- Each word is initially represented by a simple integer identifier. 
- Next, an embedding lookup returns the word embedding. 
- These word embeddings are what is actually fed to the encoder and the decoder.
- At each step, the decoder outputs a score for each word in the output vocabulary, and then the softmax layer turns these scores into probabilities. 
    - For example, at the first step the word "Je" may have a probability of 20%, "Tu" may have a probability of 1%, and so on. The word with the highest probability is output. 
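
The last step, from scores to probabilities to an output word, can be sketched in a few lines. The four-word vocabulary and the scores are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Je", "Tu", "Il", "Nous"]   # hypothetical tiny output vocabulary
scores = [2.0, -1.0, 0.5, 0.0]       # hypothetical decoder scores at the first step

probs = softmax(scores)
best_word = vocab[probs.index(max(probs))]  # the highest-probability word is output
print(dict(zip(vocab, (round(p, 3) for p in probs))))
print(best_word)
```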

![word_seq.png](imgs/word_seq.png)



# TensorFlow sequence-to-sequence

- If you go through TensorFlow’s sequence-to-sequence tutorial and you look at the code in `rnn/translate/seq2seq_model.py` (in the TensorFlow models), you will notice a few important differences:

- First, so far we have assumed that all input sequences (to the encoder and to the decoder) have a constant length. But obviously sentence lengths may vary:
    - One option is to use the `sequence_length` argument to the `static_rnn()` or `dynamic_rnn()` functions.
    - The tutorial takes another approach: sentences are grouped into buckets of similar lengths, and the shorter sentences are padded using a special padding token. 
        - For example, `"I drink milk"` becomes `"<pad> <pad> <pad> milk drink I"`, and its translation becomes `"Je bois du lait <eos> <pad>"`.
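
The padding in this example can be reproduced with a short helper. This is a simplified sketch: it uses a single bucket size for both sides and whitespace tokenization, and the name `pad_pair` is made up for illustration.

```python
PAD, EOS = "<pad>", "<eos>"

def pad_pair(source_sentence, target_sentence, bucket_size):
    """Reverse and left-pad the source; append <eos> and right-pad the target."""
    source = source_sentence.split()[::-1]
    target = target_sentence.split() + [EOS]
    if len(source) > bucket_size or len(target) > bucket_size:
        raise ValueError("sentence pair does not fit in this bucket")
    source = [PAD] * (bucket_size - len(source)) + source
    target = target + [PAD] * (bucket_size - len(target))
    return source, target

src, tgt = pad_pair("I drink milk", "Je bois du lait", bucket_size=6)
print(" ".join(src))
print(" ".join(tgt))
```

Padding the source on the left keeps the real words next to the point where the decoder takes over.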

- Second, when the output vocabulary is large, outputting a probability for each and every possible word will be slow. Computing the softmax function over such a large vector would be very computationally intensive.
    - One solution is to let the decoder output a much smaller vector, then use a sampling technique to estimate the loss without having to compute it over every single word in the target vocabulary. 
    - This Sampled Softmax technique was introduced in 2015 by Sébastien Jean et al.
    - In TensorFlow you can use the `sampled_softmax_loss()` function.
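
The idea behind the sampling can be sketched in NumPy. This is a simplified illustration, not TensorFlow's implementation: negatives are drawn uniformly, and the correction for the sampling distribution that real implementations apply is left out. The function names, sizes, and random weights are all assumptions.

```python
import numpy as np

def full_softmax_loss(hidden, W, b, true_id):
    """Cross-entropy computed over the whole vocabulary."""
    logits = hidden @ W + b
    logits = logits - logits.max()
    return -logits[true_id] + np.log(np.exp(logits).sum())

def sampled_softmax_loss_sketch(hidden, W, b, true_id, num_sampled, rng):
    """Cross-entropy over the true word plus a few uniformly sampled negatives."""
    vocab_size = W.shape[1]
    negatives = np.delete(np.arange(vocab_size), true_id)
    sampled = rng.choice(negatives, size=num_sampled, replace=False)
    candidates = np.concatenate(([true_id], sampled))
    # Only num_sampled + 1 columns of W are used instead of all vocab_size
    logits = hidden @ W[:, candidates] + b[candidates]
    logits = logits - logits.max()
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
hidden_size, vocab_size = 16, 1000
W = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
b = np.zeros(vocab_size)
hidden = rng.normal(size=hidden_size)  # the decoder's (small) output vector
true_id = 42

full = full_softmax_loss(hidden, W, b, true_id)
sampled = sampled_softmax_loss_sketch(hidden, W, b, true_id, num_sampled=20, rng=rng)
print("full loss:   ", full)
print("sampled loss:", sampled)
```

The sampled value is only an estimate used during training; at inference time the full softmax is still needed to get actual probabilities.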

- Third, the tutorial’s implementation uses an attention mechanism that lets the decoder peek into the input sequence. We will cover it later in the course.

## Using Keras

In [None]:
import math  # needed for math.sqrt in the RMSE computation
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
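
The cells that follow use `trainX`, `trainY`, `look_back` and `scaler` without defining them in this section. One plausible way to build inputs of the shape expected by `input_shape=(1, look_back)` is sketched below with NumPy only. The helper `create_dataset`, the sine-wave series, and the `demo_` names are assumptions for illustration, and the manual rescaling stands in for what `MinMaxScaler` does with its default settings.

```python
import numpy as np

def create_dataset(series, look_back=1):
    """Slide a window over a 1-D series: X holds look_back values, y the next one."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 10, 50))  # stand-in for a real time series
scaled = (series - series.min()) / (series.max() - series.min())  # rescale to [0, 1]

demo_X, demo_y = create_dataset(scaled, look_back=3)
# Keras LSTM layers expect [samples, time steps, features]
demo_X = demo_X.reshape((demo_X.shape[0], 1, 3))
print(demo_X.shape, demo_y.shape)
```

With this layout, each sample is a single time step carrying `look_back` features; an alternative is to use `look_back` time steps with one feature each.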

In [None]:
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

In [None]:
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

In [None]:
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Additional Resources

- Understanding LSTMs
    - https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Recurrent Neural Networks w/ TF:
    - https://www.tensorflow.org/tutorials/sequences/recurrent
    
- Keras recurrent layers and pretrained models:
    - https://keras.io/layers/recurrent/
    - https://keras.io/applications/
    
- Deep Learning book
    - Ian Goodfellow and Yoshua Bengio and Aaron Courville
    - https://www.deeplearningbook.org/contents/rnn.html
    