<center><h1>Recurrent Neural Networks</h1></center>

<center><img src="https://gblobscdn.gitbook.com/assets%2F-LvBP1svpACTB1R1x_U4%2F-LwEQnQw8wHRB6_2zYtG%2F-LwEZT8zd07mLDuaQZwy%2Fimage.png?alt=media&token=93a3c3e2-e32b-4fec-baf5-5e6b092920c4"></center>

The main idea behind recurrent neural networks is that the input is fed to them sequentially through time.

For example, a feed-forward neural network would work in the following way.

<center><img src='images/ann.png'></center>
<center>feed forward network</center>

In the network above, each word is fed to an activation function regardless of its order. The activation function has no clue which words were seen before or after, as the output of the activation is calculated as

$$h_1(x) = h_1(\text{product}, \text{was}, \text{not}, \text{good}) = W_{1,1} \cdot \text{product} + W_{1,2} \cdot \text{was} + W_{1,3} \cdot \text{not} + W_{1,4} \cdot \text{good}$$

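And you can see here that nothing in this sum tells the network which token came before or after another; with an order-blind input such as a bag of words, changing the order of the tokens changes nothing.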

Now let's see how recurrent networks handle this.

<center><img src="images/rnn.png"></center>
<center>recurrent neural network</center>

In the recurrent network, the inputs go through the same hidden layer one after another, each one updating its hidden state. The network thus takes the order of the inputs into account, because the state it carries forward depends on the order in which the inputs arrived.

So the activation in this example is calculated as follows:


$$h_1(x) = {\color{Red} h_1(\text{good}, } {\color{Green} h_1(\text{not},} {\color{Blue} h_1(\text{was}, } {\color{Yellow} h_1(\text{product}, \langle\text{sos}\rangle)}{\color{Blue} )}{\color{Green} )}{\color{Red} )}$$


As you can see here, each function takes as its input the output of the previous function, which in turn takes into account the input before it, and so on back to the first input.
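To make the difference concrete, here is a minimal NumPy sketch. It is not the network from the figures: the one-hot token vectors, the weight values, the `tanh` activation and the function names are all assumptions chosen for illustration. An order-blind model that sums its tokens gives the same result for any ordering, while the recurrent update does not.

```python
import numpy as np

# Hypothetical one-hot vectors standing in for the four tokens
tokens = {
    "product": np.array([1.0, 0.0, 0.0, 0.0]),
    "was":     np.array([0.0, 1.0, 0.0, 0.0]),
    "not":     np.array([0.0, 0.0, 1.0, 0.0]),
    "good":    np.array([0.0, 0.0, 0.0, 1.0]),
}

# Assumed toy weights: input-to-hidden and hidden-to-hidden
W_x = np.array([[0.5, -0.3, 0.8, 0.1],
                [0.2, 0.4, -0.6, 0.7]])
W_h = np.array([[0.3, -0.2],
                [0.1, 0.5]])

def bag_of_words(sentence):
    # Order-blind: sum the token vectors, then apply one dense layer
    summed = sum(tokens[w] for w in sentence)
    return np.tanh(W_x @ summed)

def recurrent(sentence):
    # Order-aware: the hidden state is updated one token at a time
    h = np.zeros(2)  # plays the role of the <sos> start state
    for w in sentence:
        h = np.tanh(W_x @ tokens[w] + W_h @ h)
    return h

original = ["product", "was", "not", "good"]
shuffled = ["good", "not", "was", "product"]

same_bow = np.allclose(bag_of_words(original), bag_of_words(shuffled))
same_rnn = np.allclose(recurrent(original), recurrent(shuffled))
print(same_bow, same_rnn)
```

The first flag shows that the summed representation cannot tell the two orderings apart; the second shows that the recurrent state can.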

### Types of sequence problems

<center><img src='https://camo.githubusercontent.com/89a1cc7342d324ca30e45025bb278572f3f114d2/687474703a2f2f6b617270617468792e6769746875622e696f2f6173736574732f726e6e2f64696167732e6a706567'></center>
<center><a href='https://github.com/vict0rsch/deep_learning/tree/master/keras/recurrent'>source</a></center>

The cases where we might use sequence models include, for example:

1. One to Many
   - Song generation
   - Text generation
2. Many to One
   - Sentiment analysis
   - Voice verification (for example, telling whether this is person X or not)
3. Many to Many (synchronized, one output per input step)
   - Named entity recognition
   - Part-of-speech tagging
4. Many to Many (encoder-decoder)
   - Machine translation
   - Question answering

For one to one we don't need a **sequence** model.

### What happens inside the RNN block?

<center><img src="https://miro.medium.com/max/1512/1*HRuDxU1i4JNu-Ywt88LnaQ.png"></center>
<center><a href="https://mc.ai/deep-learning-recurrent-neural-networks/">source</a></center>

The cell by itself does one calculation, which produces the hidden state $h_t$. In some cases this hidden state is also fed to a sigmoid or softmax layer, which gives a second output called $a_t$ in the figure below.

<center><img src='images/rnn-detailed.png'></center>
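Written out, one common formulation of such a cell is the following. The weight names $W_x$, $W_h$, $W_a$ and the biases $b_h$, $b_a$ are notation assumed here for illustration; they are not taken from the figure.

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)$$

$$a_t = \sigma(W_a h_t + b_a)$$

The first equation is the recurrence itself; the second is the optional output layer, shown here with a sigmoid.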

### Dealing with RNN inputs

We need to pad all the inputs of the RNN so that they all have the same length (or number of time steps, so to speak). For example, the input *'the product was very good with few problems'* has **8** steps, while the next input, *'a good product'*, has only **3** steps. So we pad it with 5 more tokens to make it the same length as our longest sentence: `'a good product <PAD> <PAD> <PAD> <PAD> <PAD>'`.
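The padding step described above can be sketched in a few lines of plain Python. The helper name `pad_tokens` and the whitespace tokenization are assumptions made for this illustration.

```python
PAD = "<PAD>"

def pad_tokens(sentences, pad_token=PAD):
    # Split on whitespace and right-pad every sentence to the longest length
    tokenized = [s.split() for s in sentences]
    max_len = max(len(t) for t in tokenized)
    return [t + [pad_token] * (max_len - len(t)) for t in tokenized]

batch = pad_tokens([
    "the product was very good with few problems",
    "a good product",
])
lengths = [len(row) for row in batch]
print(lengths)
print(" ".join(batch[1]))
```

Both rows end up with 8 tokens, so they can be stacked into a single array and fed to the network together.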

### Input representation

We represent our input in this case as embeddings, in the same way we discussed before, at the character, word, or even sentence level.

Of course we can make use of a pre-trained model to generate these embeddings for us.

### Multi-layer RNN?

RNN cells can be stacked to build a multi-layer network. Let's see how it looks.

<center><img src='https://static.packt-cdn.com/products/9781787121089/graphics/image_06_008.png'></center>

As you can see, each input passes through the first layer and then the second layer, which enables the second layer to build a more complex representation on top of the first one's.
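As a rough sketch of this stacking, assuming a plain `tanh` recurrence and toy sizes picked for illustration, the second layer simply reads the hidden state that the first layer produced at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(inputs, W_x, W_h):
    # Run one recurrent layer over the sequence and return every hidden state
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

steps, input_dim, hidden_1, hidden_2 = 5, 4, 6, 3  # assumed toy sizes
sequence = rng.normal(size=(steps, input_dim))

# Layer 1 reads the raw inputs
layer1_states = rnn_layer(sequence,
                          rng.normal(size=(hidden_1, input_dim)),
                          rng.normal(size=(hidden_1, hidden_1)))
# Layer 2 reads layer 1's hidden state at every time step
layer2_states = rnn_layer(layer1_states,
                          rng.normal(size=(hidden_2, hidden_1)),
                          rng.normal(size=(hidden_2, hidden_2)))

print(layer1_states.shape, layer2_states.shape)
```

In Keras the same idea requires the lower recurrent layer to return its full sequence of states (`return_sequences=True`) so that the layer above has one input per time step.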

### Can we see the future?

We solved the problem of input order, but can we also make use of future inputs? For example, the later text in our input might help us classify the current text.

## Bidirectional Recurrent Neural Network (BiRNN)

<center><img src='https://d2l.ai/_images/birnn.svg'></center>

The main idea of the bidirectional RNN is to capture both the inputs from the past and the future. Note that we can't derive the output of the network until all of the inputs have been loaded; in other words, you can't build a many-to-many model that emits a label for each word as soon as that word arrives.
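The sketch below illustrates this with NumPy under simplifying assumptions: a plain `tanh` recurrence, small hand-picked weights, and hypothetical helper names. One pass reads the sequence left to right, a second pass reads it right to left, and the two states are concatenated per time step. Changing only the last input leaves the forward state of the first step untouched but changes its backward state, which is why the whole sequence has to be available.

```python
import numpy as np

def run_rnn(inputs, W_x, W_h):
    # Simple recurrent pass that returns the hidden state at every step
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

# Hypothetical fixed weights, one set per direction
W_x_fwd, W_h_fwd = np.eye(2), 0.5 * np.eye(2)
W_x_bwd, W_h_bwd = np.eye(2), 0.3 * np.eye(2)

def bidirectional(inputs):
    fwd = run_rnn(inputs, W_x_fwd, W_h_fwd)
    # Read the sequence right to left, then flip so row t matches time step t
    bwd = run_rnn(inputs[::-1], W_x_bwd, W_h_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

sequence = np.array([[0.1, 0.2], [0.3, -0.1], [0.0, 0.2], [-0.2, 0.1]])
states = bidirectional(sequence)

# Change only the LAST input and look at the state for the FIRST step
changed = sequence.copy()
changed[-1] += 0.5
states_changed = bidirectional(changed)

forward_same = np.allclose(states[0, :2], states_changed[0, :2])
backward_same = np.allclose(states[0, 2:], states_changed[0, 2:])
print(states.shape, forward_same, backward_same)
```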

## Building an RNN with TensorFlow

<center><img src='https://www.lewuathe.com/assets/img/posts/2019-03-06-annoucements-in-tensorflow-dev-summit-2019/catch.png'></center>

### Embedding layer

The embedding layer is used to generate embeddings for tokens as they are fed to the network. Remember what we did in the embedding step before?

In [9]:
## Load the data to get started
import pandas as pd
import numpy as np
import tensorflow as tf
import spacy

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


In [10]:
word2idx = tf.keras.datasets.imdb.get_word_index()
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data()



In [11]:
x_train[1][1]

194

In [12]:
idx2word = {word2idx[word]: word for word in word2idx.keys()}

In [13]:
idx2word[194]

'thought'

In [14]:
idx2word

{34701: 'fawn',
 52006: 'tsukino',
 52007: 'nunnery',
 16816: 'sonja',
 63951: 'vani',
 1408: 'woods',
 16115: 'spiders',
 2345: 'hanging',
 2289: 'woody',
 52008: 'trawling',
 52009: "hold's",
 11307: 'comically',
 40830: 'localized',
 30568: 'disobeying',
 52010: "'royale",
 40831: "harpo's",
 52011: 'canet',
 19313: 'aileen',
 52012: 'acurately',
 52013: "diplomat's",
 25242: 'rickman',
 6746: 'arranged',
 52014: 'rumbustious',
 52015: 'familiarness',
 52016: "spider'",
 68804: 'hahahah',
 52017: "wood'",
 40833: 'transvestism',
 34702: "hangin'",
 2338: 'bringing',
 40834: 'seamier',
 34703: 'wooded',
 52018: 'bravora',
 16817: 'grueling',
 1636: 'wooden',
 16818: 'wednesday',
 52019: "'prix",
 34704: 'altagracia',
 52020: 'circuitry',
 11585: 'crotch',
 57766: 'busybody',
 52021: "tart'n'tangy",
 14129: 'burgade',
 52023: 'thrace',
 11038: "tom's",
 52025: 'snuggles',
 29114: 'francesco',
 52027: 'complainers',
 52125: 'templarios',
 40835: '272',
 52028: '273',
 52130: 'zaniacs',

In [15]:
def reconstruct(tokens):
    text = []
    for token in tokens:
        text.append(idx2word[token])
    return " ".join(text)

In [16]:
reconstruct(x_train[0])

"the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room titillate it so heart shows to years of every never going villaronga help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of gilmore's br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but wh

In [17]:
y_train[1]

0
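The review reconstructed above reads like word salad. A likely cause, as far as the Keras documentation describes it, is that `imdb.load_data` shifts every word index by `index_from=3` by default and reserves `0`, `1` and `2` for padding, start-of-sequence and out-of-vocabulary markers, while `get_word_index()` returns the unshifted indices. The sketch below shows offset-aware decoding on a tiny made-up vocabulary; `toy_word2idx`, the helper names and the token list are hypothetical.

```python
INDEX_FROM = 3  # default offset applied by imdb.load_data (assumed unchanged)
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}

# Hypothetical vocabulary in the same format as get_word_index()
toy_word2idx = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

def build_decoder(word2idx, index_from=INDEX_FROM):
    # Shift every index by the offset, then add the reserved markers
    idx2word = {idx + index_from: word for word, idx in word2idx.items()}
    idx2word.update(SPECIAL)
    return idx2word

def reconstruct_shifted(tokens, idx2word):
    return " ".join(idx2word.get(t, "<UNK>") for t in tokens)

decoder = build_decoder(toy_word2idx)
encoded = [1, 4, 5, 6, 7, 0, 0]
decoded = reconstruct_shifted(encoded, decoder)
print(decoded)
```

With the real data, calling `reconstruct_shifted(x_train[0], build_decoder(word2idx))` should give a readable review, assuming the default `load_data` arguments were used.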

In [18]:
## In case you have multi-class labels,
## we need to one-hot encode y here
# y_train_encoded = tf.keras.utils.to_categorical(y_train)
# y_test_encoded = tf.keras.utils.to_categorical(y_test)

### Let's pad x_train to make sure all sequences are the same length

In [19]:
len(x_train[0]), len(x_train[1])

(218, 189)

In [20]:
max_sequence_len = 0
for sentence in x_train:
    max_sequence_len = max(len(sentence), max_sequence_len)
print(max_sequence_len)

2494


Because this maximum length is too large, let's cap it at 100.

In [21]:
max_sequence_len = 100

Now let's pad all the sentences to that max length (longer reviews are truncated).

In [22]:
x_train_padded = np.zeros((x_train.shape[0], max_sequence_len))
for i, sent in enumerate(x_train):
    x_train_padded[i, :len(sent)] = sent[:max_sequence_len]

In [23]:
x_train_padded.shape

(25000, 100)

The same goes for x_test:

In [24]:
x_test_padded = np.zeros((x_test.shape[0], max_sequence_len))
for i, sent in enumerate(x_test):
    x_test_padded[i, :len(sent)] = sent[:max_sequence_len]
x_test_padded.shape

(25000, 100)

### Now let's build the model

In [25]:
# check the vocabulary size
len(word2idx)

88584

In [26]:
VOCAB_SIZE = len(word2idx)

In [27]:
VOCAB_SIZE

88584

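### Important note

The `input_dim` parameter of the Keras `Embedding` layer must be equal to the maximum integer index + 1, which means it should be at least `VOCAB_SIZE + 1` in our case because our vocabulary starts indexing from 1, as shown below. Note also that `imdb.load_data` shifts every word index by `index_from=3` by default, so the largest index that actually appears in `x_train` can be higher than `len(word2idx)`; taking the maximum over the data itself is the safer way to choose `input_dim`.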

In [28]:
min(list(word2idx.values()))

1

### Using the functional API

In [29]:
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, SimpleRNN
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional

In [30]:
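# the input is a batch of padded sequences, 100 token indices each
input_word = Input(shape=(100,))
# map every token index to a 64-dimensional embedding vector
x = Embedding(input_dim=VOCAB_SIZE+1, output_dim=64, input_length=100)(input_word)
# the recurrent layer returns only its last hidden state
x = SimpleRNN(units=64)(x)
out = Dense(64, activation="relu")(x)
# a single sigmoid unit gives the probability of a positive review
out = Dense(1, activation="sigmoid")(out)
model = Model(input_word, out)
model.summary()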

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 100)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 64)           5669440   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 64)                8256      
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 5,681,921
Trainable params: 5,681,921
Non-trainable params: 0
_________________________________________________________________


### Using the Sequential API

In [31]:
model = tf.keras.models.Sequential([    
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE+1,output_dim=64,input_length=100),
    tf.keras.layers.SimpleRNN(units=64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [32]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 64)           5669440   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 5,681,921
Trainable params: 5,681,921
Non-trainable params: 0
_________________________________________________________________


### Compiling our model

In [33]:
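# the last layer already applies a sigmoid, so the loss receives probabilities
# and from_logits must stay at its default (False)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])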

In [34]:
tf.config.run_functions_eagerly(True)


### Fitting our model

In [35]:
history = model.fit(x_train_padded, y_train, epochs=3, batch_size=128,
                    validation_data=(x_test_padded, y_test), 
                    validation_steps=30)



Epoch 1/3
Epoch 2/3
Epoch 3/3


### Evaluating on test data

In [36]:
test_loss, test_acc = model.evaluate(x_test_padded, y_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.5258711576461792
Test Accuracy: 0.7674400210380554


In [37]:
# check whether TensorFlow can see a GPU
len(tf.config.list_physical_devices('GPU')) > 0


False

### Can we make it work in both directions?

In [41]:
model_bi = tf.keras.models.Sequential([    
    tf.keras.layers.Embedding(VOCAB_SIZE+1, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [42]:
model_bi.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

In [43]:
history = model_bi.fit(x_train_padded, y_train, epochs=3, batch_size=128,
                    validation_data=(x_test_padded, y_test), 
                    validation_steps=30)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [44]:
test_loss, test_acc = model_bi.evaluate(x_test_padded, y_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.7359644770622253
Test Accuracy: 0.7822399735450745


### Oops, I forgot!

* One drawback of the RNN is that it tends to forget what it saw early in the sequence, because the influence of early inputs on the hidden state fades as more steps are processed.

* Also, the gradient of the RNN cell is very likely to vanish or explode.

To solve these issues, two different variants of the simple RNN were proposed.
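A quick way to get a feel for the vanishing and exploding gradient is a scalar toy recurrence. The sketch below assumes $h_t = \tanh(w \cdot h_{t-1})$ with no input term; the function name and the numbers are chosen only for illustration. By the chain rule, the gradient of the last state with respect to the first is a product of one factor per time step, so it shrinks or grows geometrically with the sequence length.

```python
import numpy as np

def gradient_through_time(w, steps, h0=0.5):
    # Scalar RNN h_t = tanh(w * h_{t-1}); by the chain rule,
    # d h_T / d h_0 is the product of w * (1 - h_t^2) over all steps
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h ** 2)
    return grad

grad_5 = abs(gradient_through_time(w=0.9, steps=5))
grad_100 = abs(gradient_through_time(w=0.9, steps=100))

# Without the squashing activation the factor per step is just w,
# so a weight above 1 makes the gradient blow up instead
linear_grad_100 = 1.5 ** 100

print(grad_5, grad_100, linear_grad_100)
```

After 100 steps the first gradient is practically zero, so the early inputs receive almost no learning signal, while the linear case with a weight above 1 grows out of control.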

## Gated Recurrent Unit (GRU)

The GRU cell replaces the simple RNN cell and adds a **gate** to the network. This gate allows the network to choose what to keep from the previous state, so it can keep track of some memories, and it also mitigates the vanishing gradient problem.

<center><img src='https://technopremium.com/blog/wp-content/uploads/2019/06/gru.png'></center>
<center><a href='https://technopremium.com/blog/rnn-talking-about-gated-recurrent-unit/'>source</a></center>

Without going into much math detail, the gates here allow the network to forget certain parts of the hidden state.

A simplified version of the equations is as follows:

<center><img src='images/gru.png'></center>

Here we say that $C^t = a^t = h^t$, meaning that the output state and the hidden state are the same.

The vector $\tilde{C}$ is the candidate hidden state that may replace the previous one.

$$\tilde{C} = \tanh(W_c [C^{t-1}, X^t] + b_c)$$

The gate $\Gamma_u$ decides how much of the last state to forget and how much to remember.

$$\Gamma_u = \sigma(W_u [C^{t-1}, X^t] + b_u)$$

$C^t$ is the final output of the cell:

$$C^t = \Gamma_u*\tilde{C} + (1-\Gamma_u)* C^{t-1}$$

So: 

- if $\Gamma_u=0$ then $C^t = C^{t-1}$ 
- if $\Gamma_u=1$ then $C^t = \tilde{C}$ 
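The simplified equations above translate almost line by line into NumPy. This is only a sketch of that simplified cell, not the full GRU used by Keras (which has an additional reset gate); the weight shapes, the random values and the extreme biases are assumptions used to push the gate towards 0 or 1 and check the two cases listed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, b_c, W_u, b_u):
    concat = np.concatenate([c_prev, x_t])              # [C^{t-1}, X^t]
    c_tilde = np.tanh(W_c @ concat + b_c)               # candidate state
    gamma_u = sigmoid(W_u @ concat + b_u)               # update gate
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # new state
    return c_t, c_tilde, gamma_u

rng = np.random.default_rng(0)
hidden, input_dim = 3, 2                                # assumed toy sizes
W_c = 0.1 * rng.normal(size=(hidden, hidden + input_dim))
W_u = 0.1 * rng.normal(size=(hidden, hidden + input_dim))
b_c = np.zeros(hidden)

c_prev = np.array([0.5, -0.2, 0.1])
x_t = np.array([1.0, -1.0])

# A very negative bias closes the gate: the old state is kept
c_keep, _, _ = gru_step(c_prev, x_t, W_c, b_c, W_u, np.full(hidden, -50.0))
# A very positive bias opens the gate: the candidate replaces the state
c_new, c_tilde, _ = gru_step(c_prev, x_t, W_c, b_c, W_u, np.full(hidden, 50.0))

keeps_old_state = np.allclose(c_keep, c_prev)
takes_candidate = np.allclose(c_new, c_tilde)
print(keeps_old_state, takes_candidate)
```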

### Implementing GRU in TensorFlow

The implementation is as simple as swapping the `SimpleRNN` layers in the code above for `GRU` layers.

In [46]:
gru_bi = tf.keras.models.Sequential([    
    tf.keras.layers.Embedding(VOCAB_SIZE+1, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [47]:
gru_bi.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

In [48]:
history = gru_bi.fit(x_train_padded, y_train, epochs=3, batch_size=128,
                    validation_data=(x_test_padded, y_test), 
                    validation_steps=30)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [49]:
test_loss, test_acc = gru_bi.evaluate(x_test_padded, y_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.6155372262001038
Test Accuracy: 0.789359986782074


### Looong sequences

There is one more important architecture that you need to know before we wrap up: the LSTM.

## Long Short Term Memory (LSTM)

<center><img src='https://colah.github.io/images/post-covers/lstm.png'></center>
<center><a href='https://colah.github.io/posts/2015-08-Understanding-LSTMs/'>source</a></center>

The LSTM offers a long-term memory component beside the short-term one, which enables your model to remember things that it has seen far in the past. Let's go through the idea of the LSTM.

<center><img src='images/lstm.png'></center>

First, the candidate new memory is calculated via the following equation:

$$\tilde{C} = \tanh(W_c [a^{t-1}, x^t] + b_c) $$

Here we simply take the last activation output (the short-term memory) together with the current input and calculate the candidate values to be remembered.

Then we have three gates here, update, forget and output, let's examine their equations.

> update gate
$$ \Gamma_u = \sigma(W_u [a^{t-1}, x^t] + b_u)$$

---

> forget gate
$$\Gamma_f = \sigma(W_f [a^{t-1}, x^t] + b_f)$$

---

> output gate
$$\Gamma_o = \sigma(W_o [a^{t-1}, x^t] + b_o)$$

---

So the three gates learn which parts of the memory to update, forget and output, which enables our model to work with relatively long sequences.

Now to the outputs of our cell

> the long term memory
$$C^t = \Gamma_u*\tilde{C} + \Gamma_f*C^{t-1}$$

So basically, write the new values that were just learned, $\Gamma_u*\tilde{C}$, and keep from the last memory what the forget gate lets through, $\Gamma_f*C^{t-1}$.

> the short term memory
$$a^t = \Gamma_o*tanh(C^t)$$

This is just an activation function applied to the long-term memory computed in this step, filtered by the output of the $\Gamma_o$ gate.

> the output for this cell (aka prediction)
$$y^t = \text{softmax}(W_y a^t + b_y)$$

Together, these allow the LSTM to carry information from the very beginning of a sentence across a long sequence without forgetting it.
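Putting the LSTM equations above together, a single step can be sketched in NumPy as follows. The parameter dictionary, the toy sizes and the random weights are assumptions made for illustration, and the final softmax prediction is left out to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    concat = np.concatenate([a_prev, x_t])            # [a^{t-1}, x^t]
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])   # candidate memory
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])   # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev        # long-term memory
    a_t = gamma_o * np.tanh(c_t)                      # short-term memory
    return a_t, c_t

rng = np.random.default_rng(0)
hidden, input_dim, steps = 4, 3, 6                    # assumed toy sizes
params = {}
for gate in "cufo":
    params["W_" + gate] = rng.normal(scale=0.5, size=(hidden, hidden + input_dim))
    params["b_" + gate] = np.zeros(hidden)

# Carry both memories across a short random sequence
a_t, c_t = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(steps, input_dim)):
    a_t, c_t = lstm_step(a_t, c_t, x_t, params)

print(a_t.shape, c_t.shape)
```

Note that the long-term memory `c_t` is only ever scaled by the forget gate and added to, which is what lets information survive over many steps.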

### Bidirectional LSTMs

<center><img src='https://camo.githubusercontent.com/7c17fdb1a2a5f7d0b896c44aefe7490c47d23bfa/68747470733a2f2f7777772e64726f70626f782e636f6d2f732f696e63646a727575347839323038332f6269646972656374696f6e616c5f6c6f6e675f73686f72742d7465726d5f6d656d6f72792e706e673f7261773d31'></center>

The idea is the same as in RNN and GRU; the only difference is that we swap the RNN block for an LSTM block!

### Long sequences?

Now we can drop the sequence length limit that we set above!

In [50]:
max_sequence_len = 0
for sentence in x_train:
    max_sequence_len = max(len(sentence), max_sequence_len)
print(max_sequence_len)

2494


In [51]:
x_train_padded = np.zeros((x_train.shape[0], max_sequence_len))
for i, sent in enumerate(x_train):
    x_train_padded[i, :len(sent)] = sent[:max_sequence_len]

In [52]:
x_train_padded.shape

(25000, 2494)

The same goes for x_test:

In [53]:
x_test_padded = np.zeros((x_test.shape[0], max_sequence_len))
for i, sent in enumerate(x_test):
    x_test_padded[i, :len(sent)] = sent[:max_sequence_len]
x_test_padded.shape

(25000, 2494)

In [55]:
lstm_bi = tf.keras.models.Sequential([    
    tf.keras.layers.Embedding(VOCAB_SIZE+1, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [56]:
lstm_bi.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

In [59]:
history = lstm_bi.fit(x_train_padded, y_train, epochs=3,
                    validation_data=(x_test_padded, y_test), 
                    validation_steps=30)

In [None]:
test_loss, test_acc = lstm_bi.evaluate(x_test_padded, y_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

> Save the trained models (for example with `model.save('models/rnn')`) so that they can be loaded in the next section.

### How can we predict an input given these models ?

In [None]:
## Load the data
import tensorflow as tf
import numpy as np

rnn_padding = 100
lstm_padding = 2494
word2idx = tf.keras.datasets.imdb.get_word_index()

rnn = tf.keras.models.load_model('models/rnn')
rnn_bi = tf.keras.models.load_model('models/rnn_bi')
gru_bi = tf.keras.models.load_model('models/gru_bi')
lstm_bi = tf.keras.models.load_model('models/lstm_bi')

In [None]:
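def predict(text, clf, word2idx, padding_size, index_from=3, oov_index=2):
    # load_data() shifts every word index by index_from (3 by default) and
    # reserves 2 for unknown words, so we apply the same mapping here
    tokens = [word2idx[word] + index_from if word in word2idx else oov_index
              for word in text.lower().split()][:padding_size]
    # pad the indices with zeros up to padding_size
    padded_text = np.zeros((padding_size))
    padded_text[:len(tokens)] = tokens
    # predict it!
    prediction = clf.predict(tf.expand_dims(padded_text, 0))
    return prediction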

In [None]:
text= "the movie was super awesome! \
although i didn't like the fact that this piece of shit called star lord destroyed the \
whole mission, i wish they don't include him in the upcoming movies really."

In [None]:
predict(text, rnn, word2idx, rnn_padding)

In [None]:
predict(text, rnn_bi, word2idx, rnn_padding)

In [None]:
predict(text, gru_bi, word2idx, rnn_padding)

In [None]:
predict(text, lstm_bi, word2idx, lstm_padding)

## Conclusion

In this notebook we took a brief look at recurrent neural networks, how they work and how to deal with sequences using them.

We have seen that RNNs are better suited to sequences than feed-forward networks because they take the order of the inputs into account.

We also discussed how the GRU improves on the simple RNN in two main ways: how it handles memory and how it mitigates the vanishing gradient problem.

Finally, we have seen the LSTM and how it adds more gates to learn more about the sequence and to maintain a long-term memory beside the short-term one, so it can work with longer sequences.

### Good readings

- [tensorflow tutorial (Text classification with an RNN)](https://www.tensorflow.org/tutorials/text/text_classification_rnn)
- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Coursera Sequence Models course by Andrew Ng](https://www.coursera.org/learn/nlp-sequence-models)

<center><img src='https://media1.tenor.com/images/4546feb2c62df2f4b67dc952434a8bab/tenor.gif?itemid=6218501'></center>