# 07 - Deep Learning

## 07 - Advanced RNN

![](https://images.unsplash.com/photo-1548018628-c56f29b810b4?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1041&q=80)
Picture by [Taras Chernus](https://unsplash.com/photos/uQMyw1VFKqI)

___

Today, we keep diving into Recurrent Neural Networks. We will learn more advanced techniques to handle long-term dependencies, and meet more types of RNNs!

# I. Translation Machine

## I.1. How to translate a sentence?

The most intuitive way to translate a sentence would be to use one of the RNN types we saw yesterday, the many-to-many:
![](images/many-to-many.png)

So the idea is that you take a sentence in English:
> `My name is Cassidy.`

If we have a well-trained network, this should work as follows:
The first word `My` is given as input, and the network outputs `Mon`.

The second word `name` outputs `nom`, `is` outputs `est`, and finally `Cassidy` outputs `Cassidy`. Our translation will be:
> `Mon nom est Cassidy.`


Alright, let's think about it a bit more: would this approach work every time?

## I.2. Swapping words

Now let's consider the following sentence, and let's do the same exercise:
> `What is your favorite movie?`

The translation now would be the following:
> `Quel est ton préféré film ?`

This is really **not good**, we want the following translation:
> `Quel est ton film préféré ?`


The problem is that English and French do not always put adjectives and nouns in the same order.

How can we deal with that problem? The neural network (and even a human!) actually has to know both words, `favorite` and `movie`, before translating either of them!

When you, as a human, want to translate a sentence, don't you read it all first? Then let's give our RNN a chance to do the same! This gives the following architecture:

![](images/many-to-many_translation.png)


What does this architecture mean? Well, it has two sides.

On the left side, we feed our RNN the sequence of words in English, so that the RNN has a chance to "read" **the whole sequence** before making any translation. Just like a human does!

The right side is a bit trickier. Once the whole sentence has been read, the RNN begins to predict the translation. It starts by predicting the first word, the same way we did up to now.

Then, the predicted word is fed as input to predict the next word. Indeed, the **RNN uses the previously predicted word to predict the next one**!

This architecture is sometimes called an **encoder-decoder sequence-to-sequence** model.
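
To make the "feed the prediction back in" idea concrete, here is a minimal sketch of the decoding loop on the right side of the architecture. Everything in it is hypothetical: `greedy_decode`, `toy_next_token`, the `<start>`/`<end>` markers and the lookup table are illustration names, and the lookup table only stands in for a trained decoder (it ignores the encoded sentence, which a real decoder would of course use).

```python
def greedy_decode(next_token_fn, context, start_token="<start>",
                  end_token="<end>", max_len=20):
    """Generate tokens one at a time, feeding each prediction back as input."""
    output = []
    previous = start_token
    for _ in range(max_len):
        # The next word depends on what was read (context) and on the last prediction
        token = next_token_fn(context, previous)
        if token == end_token:
            break
        output.append(token)
        previous = token  # the prediction becomes the next input
    return output


# Toy stand-in for a trained decoder (assumption): maps the previous word to the next one
TOY_TABLE = {
    "<start>": "Quel", "Quel": "est", "est": "ton", "ton": "film",
    "film": "préféré", "préféré": "?", "?": "<end>",
}


def toy_next_token(context, previous):
    # A real decoder would use `context` (the encoded English sentence) here
    return TOY_TABLE[previous]


translation = greedy_decode(toy_next_token, context="What is your favorite movie?")
print(" ".join(translation))  # prints: Quel est ton film préféré ?
```

Note that the loop only stops when the decoder emits the end marker (or after `max_len` steps), so the output can have a different length than the input sentence.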

## I.3. Long term dependencies

Let's look at one last translation example. We want to translate this sentence:
> `Cats, unlike my dog, don't like water.`

Into this one:
> `Les chats, contrairement à mon chien, n'aiment pas l'eau.`

How would our new architecture perform on that sentence? Quite well, but it would likely **miss the plural agreement** and give the following translation:
> `Les chats, contrairement à mon chien, n'aime pas l'eau.`

Indeed, our RNN would predict `n'aime pas` based on `mon chien` right before it, while it should actually remember that the verb relates to `Les chats`, seen much earlier in the sentence...

This is the problem of **long-term dependencies**. To manage them, we would like to **keep memory** of the past.

# II. Memory Cells

## II.1. RNN Cell structure

We will change our representation now, so as a reminder, here is how we can picture an RNN cell:
![](images/RNN_cell.png)

Here we find what we already know:
- The hidden state $h_t$ is computed using $h_{t-1}$ and $x_t$
- The prediction $\hat{y}$ is computed using $h_t$

Nothing new for now!

## II.2. Gated Recurrent Unit structure

In order to keep memory of older states, the first idea is to add an **update gate**, usually denoted $\Gamma_u$. A cell with such a gate is a **Gated Recurrent Unit** (often called **GRU**).

The idea of this update gate is to have a variable that says how much information from the previous step we should keep.

The structure would be the following:
![](images/GRU_cell.png)

As you can see on the diagram, the update gate is computed using the sigmoid function, using the same inputs as a regular RNN:

$$
\large \Gamma_u = sigmoid(W_{hu} h_{t-1} + W_{xu} X_t + b_u)
$$

This value $\Gamma_u$ can really be seen as a gate:
* If $\Gamma_u = 1$, fully update: the previous state is replaced by the newly computed one
* If $\Gamma_u = 0$, do not update at all, keep only the previous state

This should allow our network to remember information over longer sequences.
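
As a quick numeric illustration of the gate, here is a small NumPy sketch. The helper name `blend` and the example vectors are made up for illustration; `h_candidate` stands for the state freshly computed at the current step, and the gate mixes it with the previous state.

```python
import numpy as np


def blend(gamma_u, h_candidate, h_prev):
    """Mix the new candidate state and the previous state using the update gate."""
    return gamma_u * h_candidate + (1.0 - gamma_u) * h_prev


h_prev = np.array([1.0, -2.0])       # state carried over from the previous step
h_candidate = np.array([0.5, 0.5])   # state freshly computed at this step

keep_old = blend(0.0, h_candidate, h_prev)  # gate closed: equals h_prev
take_new = blend(1.0, h_candidate, h_prev)  # gate open: equals h_candidate
half = blend(0.5, h_candidate, h_prev)      # in between: [0.75, -0.75]
print(keep_old, take_new, half)
```

In a real GRU the gate is a vector with one value per hidden unit, so each unit can decide separately how much of its past to keep.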

*Optional*: For those who want the full mathematics behind it, here it is:

$$
 \Gamma_u = sigmoid(W_{hu} h_{t-1} + W_{xu} X_t + b_u)
$$

$$
 \Gamma_r = sigmoid(W_{hr} h_{t-1} + W_{xr} X_t + b_r)
$$

$$
\hat{h}_t = g(W_{hh} \Gamma_r h_{t-1} + W_{xh} X_t + b_h)
$$

$$
h_t = \Gamma_u \hat{h}_t + (1-\Gamma_u)h_{t-1}
$$

Where $g$ is an activation function (typically $\tanh$), $\hat{h}_t$ is the candidate state, and $\Gamma_r$ can be seen as a relevance gate: it controls how much of the previous state is used to compute the candidate.
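
For the curious, these equations can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not what TensorFlow does internally: $g$ is taken to be `tanh`, gates multiply states elementwise, the final line computes the new hidden state $h_t$ as the gated blend, and the helper names (`init_gru_params`, `gru_step`) and the random weights are made up for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_gru_params(input_dim, hidden_dim, seed=0):
    """Random weights for the update (u), relevance (r) and candidate (h) blocks."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("u", "r", "h"):
        params[f"W_h{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params[f"W_x{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params


def gru_step(x_t, h_prev, p):
    """One GRU time step, following the equations above (assuming g = tanh)."""
    gamma_u = sigmoid(p["W_hu"] @ h_prev + p["W_xu"] @ x_t + p["b_u"])
    gamma_r = sigmoid(p["W_hr"] @ h_prev + p["W_xr"] @ x_t + p["b_r"])
    h_cand = np.tanh(p["W_hh"] @ (gamma_r * h_prev) + p["W_xh"] @ x_t + p["b_h"])
    # New hidden state: gated blend of the candidate and the previous state
    return gamma_u * h_cand + (1.0 - gamma_u) * h_prev


params = init_gru_params(input_dim=3, hidden_dim=4)
inputs = np.random.default_rng(1).normal(size=(5, 3))  # 5 time steps, 3 features

h = np.zeros(4)
for x_t in inputs:
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```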

**No worries**, as usual, all of this is **already implemented in TensorFlow**; all you have to do is:
```Python
from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import GRU

model = Sequential()
model.add(GRU(units=16))
```

## II.3. Long Short-Term Memory Cell 

Historically, Long Short-Term Memory (LSTM) cells were invented before GRU cells. They are also more complex.

LSTM cells can often keep information over longer sequences than GRU cells. The drawback is that they have more parameters, and are thus harder to train: **LSTMs need a larger training dataset**!
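
To get a feel for the difference in size, here is a rough parameter count. It is a back-of-the-envelope sketch, assuming each gate-like block has one recurrent weight matrix, one input weight matrix and one bias (as in the equations of this chapter), with three such blocks for a GRU and four for the LSTM described below. The function names and the sizes are arbitrary illustration values.

```python
def block_params(input_dim, hidden_dim):
    """Parameters of one block: W_h (h x h), W_x (h x d) and a bias (h)."""
    return hidden_dim * hidden_dim + hidden_dim * input_dim + hidden_dim


def gru_params(input_dim, hidden_dim):
    # Update gate, relevance gate and candidate state
    return 3 * block_params(input_dim, hidden_dim)


def lstm_params(input_dim, hidden_dim):
    # Update gate, forget gate, output gate and candidate state
    return 4 * block_params(input_dim, hidden_dim)


input_dim, hidden_dim = 32, 24
rnn_count = block_params(input_dim, hidden_dim)   # 1368
gru_count = gru_params(input_dim, hidden_dim)     # 4104
lstm_count = lstm_params(input_dim, hidden_dim)   # 5472
print(rnn_count, gru_count, lstm_count)
```

The totals reported by `model.summary()` in Keras may differ slightly for GRU layers, because of the way the default implementation handles biases.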

A GRU cell can be seen as a binary update gate: you either keep all the past information, or rely only on the current information. (In practice the gate is continuous rather than binary, so it can for example keep half past and half current information.)

An LSTM offers much more freedom: it can keep all the past information **and** all the current information at the same time.

The structure is the following:

![](images/LSTM_cell.png)

The structure now looks very complicated, so let's have a closer look!

We still have the update gate, which works the same way:

$$
\Gamma_u = sigmoid(W_{hu} h_{t-1} + W_{xu} X_t + b_u)
$$

And there is now the **forget gate** $\Gamma_f$ that is computed the same way (with different weights of course!):

$$
\Gamma_f = sigmoid(W_{hf} h_{t-1} + W_{xf} X_t + b_f)
$$

And we still compute an intermediate state, now called $\hat{c}$ instead of $\hat{h}$. It is computed much like before, except that the LSTM has no relevance gate:

$$
\hat{c}_t = g(W_{hc} h_{t-1} + W_{xc} X_t + b_c)
$$

There is a new player now, the **cell state** $c_t$, which is computed from the update gate, the forget gate and the candidate state:

$$
c_t = \Gamma_f c_{t-1} + \Gamma_u \hat{c}_t
$$

Finally, there is one more gate computed like other gates, the **output gate** $\Gamma_o$:

$$
\Gamma_o = sigmoid(W_{ho} h_{t-1} + W_{xo} X_t + b_o)
$$

That allows us to compute the final hidden state $h_t$, by gating the squashed cell state:

$$
h_t = \Gamma_o \tanh(c_t)
$$
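
Putting the pieces together, one LSTM time step can be sketched in NumPy as follows. This is a hedged illustration following the standard LSTM formulation, which is an assumption on my side where it differs from the formulas as written: the candidate state uses `tanh` with no relevance gate, and the hidden state is the output gate times `tanh` of the cell state. The helper names and random weights are made up for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_lstm_params(input_dim, hidden_dim, seed=0):
    """Random weights for the update (u), forget (f), output (o) and candidate (c) blocks."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("u", "f", "o", "c"):
        params[f"W_h{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params[f"W_x{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step (standard formulation, assumed here)."""
    gamma_u = sigmoid(p["W_hu"] @ h_prev + p["W_xu"] @ x_t + p["b_u"])
    gamma_f = sigmoid(p["W_hf"] @ h_prev + p["W_xf"] @ x_t + p["b_f"])
    gamma_o = sigmoid(p["W_ho"] @ h_prev + p["W_xo"] @ x_t + p["b_o"])
    c_cand = np.tanh(p["W_hc"] @ h_prev + p["W_xc"] @ x_t + p["b_c"])
    # Cell state: keep part of the old memory, add part of the new candidate
    c_t = gamma_f * c_prev + gamma_u * c_cand
    # Hidden state: gated view of the (squashed) cell state
    h_t = gamma_o * np.tanh(c_t)
    return h_t, c_t


params = init_lstm_params(input_dim=3, hidden_dim=4)
inputs = np.random.default_rng(1).normal(size=(5, 3))  # 5 time steps, 3 features

h, c = np.zeros(4), np.zeros(4)
for x_t in inputs:
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Because the forget gate and the update gate are independent, the cell can keep its old memory and absorb new information at the same time, which a single GRU update gate cannot do.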


Like the GRU, the LSTM is **already implemented in TensorFlow**; all you have to do is:
```Python
from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import LSTM

model = Sequential()
model.add(LSTM(units=16))
```

# III. Improving the movie review

Using RNN cells, we reached about 80% accuracy score on the movie review dataset. Can we do better using GRU and/or LSTM?

Let's try!

```Python
# Load the dataset
import numpy as np
from tensorflow.keras import datasets
imdb = datasets.imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

X_train.shape, y_train.shape
```

```
((25000,), (25000,))
```

```Python
# Pad the sequences
from tensorflow.keras.preprocessing import sequence

X_train = sequence.pad_sequences(X_train,
                                 value=0,
                                 padding='post',
                                 maxlen=256)

X_test = sequence.pad_sequences(X_test,
                                value=0,
                                padding='post',
                                maxlen=256)
```

```Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GRU

# Define the model
def my_RNN():

    model = Sequential()
    model.add(Embedding(input_dim=10000, output_dim=32, input_length=256))

    # We add two layers of GRU 
    model.add(GRU(units=24, return_sequences=True))
    model.add(GRU(units=24, return_sequences=False))

    model.add(Dense(units=1, activation='sigmoid'))

    return model
```

```Python
# Compile the model
model = my_RNN()

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

```Python
# Train the model
model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)
```

```
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

<tensorflow.python.keras.callbacks.History at 0x7f7c1451e278>
```

```Python
# Evaluate the model: evaluate() returns [loss, accuracy], so we keep index 1
print('accuracy on train with NN:', model.evaluate(X_train, y_train)[1])
print('accuracy on test with NN:', model.evaluate(X_test, y_test)[1])
```

```
accuracy on train with NN: 0.98232
accuracy on test with NN: 0.83808
```


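Using GRU, we now reach about 84% test accuracy, up from about 80%: quite an improvement! And there is still room for more: the gap between the train accuracy (about 98%) and the test accuracy suggests that the model is overfitting.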

# IV. More RNNs (optional)

## IV.1. Types of RNNs

Yesterday you saw some types of RNNs; in particular, you know about many-to-many and many-to-one.
![](images/RNN_types.png)

How do we choose between them with TensorFlow?

With the parameter `return_sequences`:
- When you set `return_sequences=False` on the last layer of your RNN, it returns only the output of the last time step: this is a many-to-one.
- When you set `return_sequences=True` on the last layer of your RNN, it returns an output prediction for each time step: this is a many-to-many.
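
The difference is easiest to see on output shapes. The sketch below is not Keras code: `simple_rnn` is a hypothetical NumPy stand-in for a basic `tanh` RNN layer, written only to mimic what the `return_sequences` flag changes. The sizes and random weights are arbitrary.

```python
import numpy as np


def simple_rnn(inputs, W_h, W_x, b, return_sequences=False):
    """Run a basic tanh RNN over `inputs` of shape (timesteps, input_dim)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    # Many-to-many: one output per time step; many-to-one: only the last one
    return np.stack(states) if return_sequences else h


rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 3, 4
inputs = rng.normal(size=(timesteps, input_dim))
W_h = rng.normal(scale=0.1, size=(units, units))
W_x = rng.normal(scale=0.1, size=(units, input_dim))
b = np.zeros(units)

all_steps = simple_rnn(inputs, W_h, W_x, b, return_sequences=True)
last_step = simple_rnn(inputs, W_h, W_x, b, return_sequences=False)
print(all_steps.shape)  # (6, 4)
print(last_step.shape)  # (4,)
```

In Keras the same shapes appear with an extra leading batch dimension. This is also why stacked recurrent layers need `return_sequences=True` on every layer except possibly the last: the next recurrent layer expects a full sequence as input.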

Well, there are more types of RNNs: you already saw the translation one (the encoder-decoder sequence-to-sequence model), which is a many-to-many with a shift between input and output.

You can also do a **One-to-Many** RNN:
![](images/one-to-many.png)

This can be used for example in music generation, or any type of sequence generation! Such models are a bit more complicated to build using TensorFlow, but it is possible.

## IV.2. Bidirectional RNNs

Sometimes, you might want your RNN to read a sequence in both directions: from the beginning to the end, but also from the end to the beginning. In some cases, it can help the network catch relationships between time steps that are visible in one direction but not in the other.

Building such a network is fairly easy with TensorFlow:
```Python
from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import LSTM, Bidirectional

model = Sequential()
model.add(Bidirectional(LSTM(units=16)))
```

See, really easy, and yet sometimes very powerful!

But be warned: this roughly doubles the computation time, since the sequence is processed once in each direction!
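
To see where the extra cost comes from, here is a minimal NumPy sketch of the idea behind a bidirectional layer. It assumes a basic `tanh` RNN instead of an LSTM to stay short, and all names (`run_rnn`, `make_weights`) and sizes are made up for illustration: one RNN reads the sequence forward, a second one with its own weights reads it backward, and the two final states are concatenated.

```python
import numpy as np


def run_rnn(inputs, W_h, W_x, b):
    """Run a basic tanh RNN and return its final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    return h


def make_weights(rng, input_dim, units):
    """Random weights for one direction (each direction has its own set)."""
    W_h = rng.normal(scale=0.1, size=(units, units))
    W_x = rng.normal(scale=0.1, size=(units, input_dim))
    b = np.zeros(units)
    return W_h, W_x, b


rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 3, 4
inputs = rng.normal(size=(timesteps, input_dim))

forward_weights = make_weights(rng, input_dim, units)
backward_weights = make_weights(rng, input_dim, units)

h_forward = run_rnn(inputs, *forward_weights)          # reads first to last
h_backward = run_rnn(inputs[::-1], *backward_weights)  # reads last to first
h_bidirectional = np.concatenate([h_forward, h_backward])
print(h_bidirectional.shape)  # (8,)
```

Two passes over the sequence and two sets of weights: that is the price of reading in both directions, and also why the output is twice as wide as a single layer's output.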