# Different types of Recurrent Neural Networks (RNNs)

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import SimpleRNN, LSTM, GRU, Bidirectional

### Dataset

In [2]:
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

## Reshaping and normalizing MNIST images: each 28x28 image is read as 28 time steps (rows) of 28 features
x_train = x_train.reshape([-1, 28, 28]).astype("float32") / 255.0
x_test = x_test.reshape([-1, 28, 28]).astype("float32") / 255.0

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


### 1. Simple Dense RNN 

An RNN works by saving the output of a layer at one time step and feeding it back into the same layer, 
 together with the next input, to compute the output at the following time step.<br>

 <center>
 <img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse3.explicit.bing.net%2Fth%3Fid%3DOIP.C-VkQoN0koBX4KxE5z-QYQHaC6%26pid%3DApi&f=1" width=500 height=300>
 </center>

<p>The decision a recurrent net reached at time step t-1 affects the decision it will reach one moment later at time step t.<br>

 So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life.</p>
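
As a rough illustration of the recurrence described above, the NumPy sketch below runs a single tanh layer over one sequence. The helper name `simple_rnn_forward`, the weight names, and the tiny layer size are assumptions made for this example; it is not a description of how Keras implements `SimpleRNN` internally.

```python
import numpy as np

def simple_rnn_forward(x_seq, W_x, W_h, b):
    """Run a single-layer tanh RNN over one sequence and return every hidden state."""
    h = np.zeros(W_h.shape[0])          # initial hidden state (the "recent past" starts empty)
    states = []
    for x_t in x_seq:                   # iterate over time steps
        # the new state combines the present input with the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
features, units, steps = 28, 4, 28      # illustrative sizes: one MNIST row per time step
W_x = rng.normal(scale=0.1, size=(features, units))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

states = simple_rnn_forward(rng.random((steps, features)), W_x, W_h, b)
print(states.shape)  # (28, 4)
```

Keeping every state corresponds to `return_sequences=True` in Keras; keeping only `states[-1]` corresponds to `return_sequences=False`.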

In [None]:
model = keras.Sequential(name="Simple_RNNs")
# each time step carries 28 features; None leaves the number of time steps flexible (28 rows for MNIST)
model.add(keras.Input(shape=(None, 28))) 
model.add(SimpleRNN(256, activation='tanh', return_sequences=True, name="hidden1"))
model.add(SimpleRNN(256, activation='tanh', return_sequences=False, name='hidden2'))
model.add(keras.layers.Dense(10,  name='output_layer'))

## Model Summary
print(model.summary())

## Compiling the model
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"],
)


## Training the model on MNIST dataset
print("\n\nTraining The Model")
model.fit(x_train, y_train, batch_size=64, epochs=10, verbose=2)

## Evaluating the model
print("\n\nModel Evaluation")
model.evaluate(x_test, y_test, batch_size=64, verbose=2)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 hidden1 (SimpleRNN)         (None, None, 256)         72960     
                                                                 
 hidden2 (SimpleRNN)         (None, 256)               131328    
                                                                 
 output_layer (Dense)        (None, 10)                2570      
                                                                 
Total params: 206,858
Trainable params: 206,858
Non-trainable params: 0
_________________________________________________________________
None


Training The Model
Epoch 1/10
938/938 - 98s - loss: 0.3080 - accuracy: 0.9067 - 98s/epoch - 104ms/step
Epoch 2/10
938/938 - 97s - loss: 0.1810 - accuracy: 0.9485 - 97s/epoch - 104ms/step
Epoch 3/10
938/938 - 96s - loss: 0.1571 - accuracy: 0.9552 - 96s/epoch - 102ms/step
Epoch 4/10
938/938 - 97s - loss: 0

[0.1514495313167572, 0.9567999839782715]

## LSTMs (Long Short-Term Memory)
<p>LSTMs help preserve the error signal as it is backpropagated through time and layers. By keeping that error more constant, <br>they allow recurrent nets to continue learning over many time steps (over 1000), thereby opening a channel to link causes and effects that are far apart. Linking such remote events is one of the central challenges in machine learning and AI, <br>since algorithms are frequently confronted by environments where reward signals are sparse and delayed, such as life itself.
</p>
<center>
<h3>LSTM Cell</h3>
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse4.mm.bing.net%2Fth%3Fid%3DOIP.Wxt8Fv2FV3MOFvICx-GwUgHaFc%26pid%3DApi&f=1">
</center>
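
The cell pictured above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `lstm_step`, the packed weight layout, and the gate ordering (input, forget, candidate, output) are choices made for this example. The demo fixes the forget gate open and the input gate closed by hand, which shows how the additive cell-state update can carry a value unchanged across 1000 steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (features, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)         # input gate, forget gate, candidate, output gate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g            # additive cell-state update
    h_t = o * np.tanh(c_t)              # hidden state exposed to the next layer
    return h_t, c_t

units, features = 3, 28
W = np.zeros((features, 4 * units))     # zero weights so only the biases act (demo assumption)
U = np.zeros((units, 4 * units))
b = np.concatenate([
    np.full(units, -50.0),              # input gate held closed
    np.full(units, 50.0),               # forget gate held open
    np.zeros(units),                    # candidate
    np.zeros(units),                    # output gate
])

h, c = np.zeros(units), np.array([0.5, -1.0, 2.0])
for _ in range(1000):
    h, c = lstm_step(np.ones(features), h, c, W, U, b)
print(np.allclose(c, [0.5, -1.0, 2.0]))  # True
```

In a trained network the gates are learned rather than fixed, so the cell state is only approximately preserved, and only when the data calls for it.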

### Creating a model using LSTM cells

In [None]:
model2 = keras.Sequential(name="LSTMs")

model2.add(keras.Input(shape=(None, 28)))
model2.add(LSTM(256, return_sequences=True, activation="relu"))
model2.add(keras.layers.Dropout(0.2))
model2.add(LSTM(256, return_sequences=False))
model2.add(keras.layers.Dropout(0.2))
model2.add(keras.layers.Dense(10, name="output_layer"))

## Model Summary
print(model2.summary())

## Compiling the model
model2.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
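    # `decay` is a legacy argument; newer Keras optimizers reject it and expect a learning-rate schedule instead
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, decay=1e-5),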
    metrics=["accuracy"],
)

## Training the model on MNIST dataset
print("\n\nTraining The Model")
model2.fit(x_train, y_train, batch_size=64, epochs=10, verbose=2)

## Evaluating the model
print("\n\nModel Evaluation")
model2.evaluate(x_test, y_test, batch_size=64, verbose=2)

Model: "LSTMs"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_10 (LSTM)              (None, None, 256)         291840    
                                                                 
 dropout_10 (Dropout)        (None, None, 256)         0         
                                                                 
 lstm_11 (LSTM)              (None, 256)               525312    
                                                                 
 dropout_11 (Dropout)        (None, 256)               0         
                                                                 
 output_layer (Dense)        (None, 10)                2570      
                                                                 
Total params: 819,722
Trainable params: 819,722
Non-trainable params: 0
_________________________________________________________________
None


Training The Model
Epoch 1/10
938/938 - 272s - l

[0.03847460821270943, 0.9879999756813049]

## GRUs (Gated Recurrent Units)

<p>The problem that arose when LSTMs were initially introduced was their high number of parameters. <br>The motivation for the LSTM variation called the GRU is simplification, in terms of both the number of parameters and the operations performed.</p>
<p>GRUs got rid of the cell state and use the hidden state to transfer information. A GRU also has only two gates, a reset gate and an update gate.</p>

<center>
<h3>GRU Cell</h3>
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.DTOtKM1e1LLlo7VgBgH8EwHaDg%26pid%3DApi&f=1" width=500 height=300>
</center>
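
A single GRU step can be sketched as follows. The function name `gru_step` and the parameter dictionary keys are hypothetical, and the sketch assumes the convention in which the update gate `z` weights the previous state; some references swap the roles of `z` and `1 - z`, and Keras' default `reset_after=True` applies the reset gate slightly differently.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step; there is no separate cell state, only the hidden state."""
    z = sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])   # update gate
    r = sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])   # reset gate
    # candidate state: the reset gate decides how much of the past to use
    h_cand = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])
    # the update gate blends the old state with the candidate
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
features, units = 28, 4                  # illustrative sizes
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(features, units))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(units, units))
    params[f"b_{gate}"] = np.zeros(units)

h = np.zeros(units)
for x_t in rng.random((28, features)):   # 28 time steps of 28 features
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```

Compared with an LSTM, there are three weight groups instead of four, which is where the parameter saving comes from.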

### Creating a model using GRU cells

In [3]:
model3 = keras.Sequential(name="GRUs")

model3.add(keras.Input(shape=(None, 28)))
model3.add(GRU(256, return_sequences=True, activation='relu'))
model3.add(keras.layers.Dropout(0.3))
model3.add(GRU(128, return_sequences=False))
model3.add(keras.layers.Dense(10))

## Model Summary
print(model3.summary())

## Compiling the model
model3.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
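    # `decay` is a legacy argument; newer Keras optimizers reject it and expect a learning-rate schedule instead
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, decay=1e-5),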
    metrics=["accuracy"],
)

## Training the model on MNIST dataset
print("\n\nTraining The Model")
model3.fit(x_train, y_train, batch_size=64, epochs=10, verbose=2)

## Evaluating the model
print("\n\nModel Evaluation")
model3.evaluate(x_test, y_test, batch_size=64, verbose=2)

Model: "GRUs"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru (GRU)                   (None, None, 256)         219648    
                                                                 
 dropout (Dropout)           (None, None, 256)         0         
                                                                 
 gru_1 (GRU)                 (None, 128)               148224    
                                                                 
 dense (Dense)               (None, 10)                1290      
                                                                 
Total params: 369,162
Trainable params: 369,162
Non-trainable params: 0
_________________________________________________________________
None


Training The Model
Epoch 1/10
938/938 - 168s - loss: 0.3092 - accuracy: 0.8959 - 168s/epoch - 179ms/step
Epoch 2/10
938/938 - 160s - loss: 0.0759 - accuracy: 0.9771 - 160s/epoch - 1

[0.03848262503743172, 0.9900000095367432]
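
The parameter counts reported in the three model summaries can be reproduced by hand. The helper functions below are hypothetical names written for this check, not Keras APIs. The GRU formula assumes Keras' default `reset_after=True`, which keeps two bias vectors per gate.

```python
def simple_rnn_params(inputs, units):
    # one weight group: input weights + recurrent weights + bias
    return units * (inputs + units + 1)

def lstm_params(inputs, units):
    # four weight groups: input, forget, candidate, output
    return 4 * units * (inputs + units + 1)

def gru_params(inputs, units, reset_after=True):
    # three weight groups: update, reset, candidate
    biases = 2 if reset_after else 1
    return 3 * units * (inputs + units + biases)

def dense_params(inputs, units):
    return units * (inputs + 1)

simple = simple_rnn_params(28, 256) + simple_rnn_params(256, 256) + dense_params(256, 10)
lstm = lstm_params(28, 256) + lstm_params(256, 256) + dense_params(256, 10)
gru = gru_params(28, 256) + gru_params(256, 128) + dense_params(128, 10)
print(simple, lstm, gru)  # 206858 819722 369162
```

Note that the GRU model above also uses a smaller second layer (128 units instead of 256), so its lower total reflects both the cell type and the layer size.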

### Problems with RNNs

<p> As we can see, LSTMs and GRUs work better than simple RNNs on this task, and the GRU model performs as well as the LSTM model with fewer parameters (369,162 versus 819,722). <br> Here the GRU reaches slightly higher test accuracy (0.990 versus 0.988), but we can't say that it will always be a better choice than an LSTM.<br>
RNNs and LSTMs are difficult to train because their computation is bound by memory bandwidth, which is the worst nightmare for hardware designers.<br>
An LSTM requires four linear layers (MLP layers) per cell, evaluated at every time step of the sequence. Linear layers need large amounts of memory bandwidth; in fact, they often cannot keep many compute units busy because the system does not have enough memory bandwidth to feed them.<br>
Comparisons of LSTMs with Transformers (attention-based models) show that attention usually wins.
</p>