# Working with text and sequences: Recurrent Neural Networks<a id="Top"></a>

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
Table of Contents
<ul>
<li>1. Recurrent Neural Networks</li>
<li>2. <a href="#Part_2">RNN layers</a></li>     
    <ul>
        <li>2.1 <a href="#Part_2_1">SimpleRNN layers</a></li>  
        <li>2.2 <a href="#Part_2_2">LSTM layers</a></li>
        <li>2.3 <a href="#Part_2_3">GRU layers</a></li>
        <li>2.4 <a href="#Part_2_4">Notes on the usage of Keras' RNN layers</a></li>
    </ul>    
<li>3. <a href="#Part_3">Working with text data: Embedding</a></li>
</ul>
</font>
</div>

# 1. Recurrent Neural Networks (RNNs)

So far we have covered two classes of neural net architectures: fully connected networks and convolutional 
neural networks. They are often referred to as _feedforward networks_. Fully connected and convolutional 
neural networks have no memory, in the sense that 

1. Each input shown to the networks is processed independently.
2. No states are kept in between inputs.

So, to process a text sequence or time series using such networks, it is necessary to turn the entire 
sequence or series into a single data point. For example, one converts an IMDB or Amazon review into a single 
vector of encoded numbers, or uses the entire time series as one input. While this practice might work when the 
length of the data is fixed, it quickly becomes inconvenient if the data have variable length.

However, there is a more fundamental reason why a feedforward network is not the best candidate for 
processing sequential data. Think about this: more often than not, neighboring data points in a 
text dataset or a time series have some logical connection. For example, the sentence you 
are reading consists of words that follow a certain semantic order. Moreover, you process the sentence 
word by word while keeping a memory of what came before. In other words, while you read, some sort of 
memory state is carried from one data point to the next.

A recurrent neural network (RNN) is designed to exploit the internal connections/states in between 
data points in a sequence. Using François Chollet's words: __an RNN processes sequences by iterating 
through the sequence elements and maintaining a state containing information relative to what 
it has seen so far__. Effectively, each neuron in an RNN has an internal loop that sends the output 
from the previous step back to itself as an additional input for the next step, as indicated by panel (a) 
of the diagram below:

<img src='./images/fig_RNN-01.png' width=650>

More specifically, at time step $t$, the neuron receives inputs 
$\mathbf{x}_{(t)} = (x_{1, (t)}, \ldots, x_{n_i,(t)})$ and its output from the previous time step 
$y_{(t-1)}$. Here $\mathbf{x}_{(t)}$ is a vector with dimension $n_i$, the number of input features. 
The neuron then forms a weighted sum of these inputs, applies an activation function, and produces a new 
output $y_{(t)}$. If we represent the operation along the $t$ axis, we obtain panel (b) of the diagram above. 
This is referred to as __unrolling the network through time__. One can apply the same idea to a layer of recurrent neurons:

<img src='./images/fig_RNN-02.png' width=700>

where the vector $\mathbf{y}_{(t)} = (y_{1,(t)}, y_{2,(t)}, \ldots)$ represents neuron outputs.

# 2. RNN layers<a id="Part_2"></a>
<a href="#Top">Back to page top</a>

Let's recap by saying that __an RNN is a loop that reuses quantities computed in the previous iteration__. 
"Loop" refers to the procedure of unrolling over time. The "quantities computed in the previous iteration" 
are often referred to as states or hidden states, denoted as $\mathbf{h}$. So we can summarize an RNN's 
general structure in the following diagram:

<img src='./images/fig_RNN-03.png' width=500>

Here the yellow square box represents an RNN layer. Some might call it a cell. Depending on how the state variable
$\mathbf{h}$ is handled, there are three primary RNN layer architectures: `SimpleRNN`, `LSTM`, and `GRU` layers. 
Let's walk through them one by one.

## 2.1 SimpleRNN layers<a id="Part_2_1"></a>
<a href="#Top">Back to page top</a>

As the name implies, the `SimpleRNN` layer is the simplest RNN implementation of the three. 
The state in the `SimpleRNN` cell at time $t$ is just the output of the cell: 
$\mathbf{h}_{(t)} = \mathbf{y}_{(t)}$. 
For a single input instance $\mathbf{x}_{(t)}$, the `SimpleRNN` cell first forms a weighted sum of 
$\mathbf{h}_{(t-1)}$ and $\mathbf{x}_{(t)}$, then sends the result to an activation function:

$$ \mathbf{y}_{(t)} = \phi\left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_x + 
                                 \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_h + 
                                 b
                           \right), $$

where $\phi(\ldots)$ is the hyperbolic tangent activation function. 
Schematically, the operation can be summarized by the following diagram:

<img src='./images/fig_RNN-SimpleRNN.png' width=450>

For a batch of input features $\mathbf{X}_{(t)}$ of shape `(batch_size, input_features)`, the equation 
generalizes to

$$ \mathbf{Y}_{(t)} = \phi\left( \mathbf{X}_{(t)}\cdot\mathbf{W}_x + 
                                 \mathbf{h}_{(t-1)}\cdot\mathbf{W}_h + 
                                 \mathbf{b} 
                          \right),$$

with $\mathbf{h}_{(t-1)} = \mathbf{Y}_{(t-1)}$. Each tensor that appears in this equation has the following 
shape:
- $\mathbf{Y}_{(t)} = \mathbf{h}_{(t)}$ : $m\times n_n,$
- $\mathbf{X}_{(t)}$ : $m\times n_i,$
- $\mathbf{W}_x$ : $n_i \times n_n,$
- $\mathbf{W}_h$ : $n_n\times n_n,$
- $\mathbf{b}$ : $n_n.$  

Here $m$ is the size of the input batch. $n_n$ is the number of neurons in the layer, $n_i$ is the
number of input features. 
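
As a quick check of the shapes listed above, here is a minimal NumPy sketch of a single batched time step. The sizes `m`, `n_i`, `n_n` and the random weights are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: batch size, input features, neurons
m, n_i, n_n = 4, 3, 5

X_t    = rng.normal(size=(m, n_i))     # inputs at time t
H_prev = np.zeros((m, n_n))            # previous state, zero at the first step
W_x    = rng.normal(size=(n_i, n_n))   # input weights
W_h    = rng.normal(size=(n_n, n_n))   # recurrent weights
b      = np.zeros(n_n)                 # bias, broadcast over the batch

# One time step of the batched SimpleRNN equation
Y_t = np.tanh(X_t @ W_x + H_prev @ W_h + b)

print(Y_t.shape)  # (4, 5), i.e. (m, n_n)
```

The bias has shape `(n_n,)` and relies on NumPy broadcasting to be added to every row of the batch.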

To make the concept clearer, we can implement the `SimpleRNN` operation using NumPy. The code
takes a sequence of inputs. At each time step, there is only one input instance. As a result,
the input tensor has shape `(timesteps, input_features)`.

In [4]:
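import numpy as np

timesteps = 100       # number of time steps in the input sequence
input_features = 32   # number of input features at each time step
n_neurons = 64        # number of neurons, i.e. the size of the output and of the state

# Input sequence: one instance per time step
inputs = np.random.random((timesteps, input_features))

# Initial state: all zeros, since nothing has been seen before the first step
state_t = np.zeros((n_neurons,))

# Random weights and bias (Wh is the recurrent weight matrix)
Wx = np.random.random((n_neurons, input_features))
Wh = np.random.random((n_neurons, n_neurons))
b  = np.random.random((n_neurons,))

successive_outputs = []
for input_t in inputs:
    # Combine the current input with the previous state
    output_t = np.tanh(np.dot(Wx, input_t) + np.dot(Wh, state_t) + b)
    successive_outputs.append(output_t)
    # The output becomes the state for the next time step
    state_t = output_t

# Stack the per-step outputs into a 2D tensor of shape (timesteps, n_neurons)
final_output_sequence = np.stack(successive_outputs, axis=0)

In [2]:
np.shape(successive_outputs)

(100, 64)

In [3]:
np.shape(final_output_sequence)

(100, 64)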

In this example, _unrolling in time_ is realized by the for loop: in each iteration, `input_t` picks up 
the instance at time $t$ from the tensor `inputs`, the result of the hyperbolic tangent activation 
is stored in `output_t`, and `output_t` is then assigned to `state_t`, the cell's state for the next step. Note that
because no previous output exists at the first time step, `state_t` is initialized as a zero tensor. 
Finally, the network's output has shape (100, 64): it stores the output of the 64 neurons at each of the 
100 time steps. 

### Keras `SimpleRNN` layer

The program we just implemented corresponds to an actual Keras layer, the `SimpleRNN` layer:
```python
    from keras.layers import SimpleRNN
```
Instead of taking a single sequence, the `SimpleRNN` layer takes a batch of sequences of shape
`(batch_size, timesteps, input_features)`. The layer's output has two modes:
1. The full sequence of successive outputs for each timestep, just like the NumPy example. The output is a 3D tensor of shape `(batch_size, timesteps, units)`.
2. Only the last output for each input sequence. The output is a 2D tensor of shape `(batch_size, units)`.

These two modes are controlled by the `return_sequences` constructor argument, which defaults to `False` (the second mode). 
Later in the notebook, we'll learn the reason for having these two modes.
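
To illustrate the two output modes without depending on Keras, here is a hedged NumPy sketch of a batched tanh RNN. The function `simple_rnn_forward` and its `return_sequences` flag are hypothetical names that only mimic the behavior described above. This is not how Keras implements the layer.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b, return_sequences=False):
    """Run a tanh RNN over inputs of shape (batch_size, timesteps, input_features)."""
    batch_size, timesteps, _ = inputs.shape
    n_neurons = W_h.shape[0]
    state = np.zeros((batch_size, n_neurons))   # initial state is all zeros
    outputs = []
    for t in range(timesteps):
        # Same update as the batched SimpleRNN equation
        state = np.tanh(inputs[:, t, :] @ W_x + state @ W_h + b)
        outputs.append(state)
    if return_sequences:
        # Mode 1: outputs at every time step
        return np.stack(outputs, axis=1)
    # Mode 2: only the last output
    return state

rng = np.random.default_rng(0)
batch_size, timesteps, input_features, n_neurons = 2, 6, 3, 4   # illustrative sizes
x = rng.normal(size=(batch_size, timesteps, input_features))
W_x = rng.normal(size=(input_features, n_neurons))
W_h = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

full_seq = simple_rnn_forward(x, W_x, W_h, b, return_sequences=True)
last_only = simple_rnn_forward(x, W_x, W_h, b, return_sequences=False)

print(full_seq.shape)   # (2, 6, 4)
print(last_only.shape)  # (2, 4)
```

The last time slice of the full sequence, `full_seq[:, -1, :]`, equals `last_only`, so the second mode is simply a view of the final step of the first.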

## 2.2 LSTM layers<a id="Part_2_2"></a>
<a href="#Top">Back to page top</a>

While `SimpleRNN` is easy to understand, in practice it is too simplistic to achieve good performance.
`SimpleRNN` has a major problem: even though in theory it should be able to retain, at any timestep, information 
about inputs seen many timesteps before, in practice it cannot learn such long-term dependencies. The reason is the
__vanishing gradient problem__ during training.
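
A toy calculation can make the vanishing gradient concrete. The sketch below assumes a single recurrent neuron with a scalar state, $h_{(t)} = \tanh(w\,h_{(t-1)} + x_{(t)})$, and an illustrative recurrent weight `w = 0.5`. It tracks the derivative of the current state with respect to the initial state via the chain rule. The setup is a simplification chosen for illustration, not a training run.

```python
import numpy as np

w = 0.5            # recurrent weight (illustrative value)
timesteps = 50
rng = np.random.default_rng(0)
x = rng.normal(size=timesteps)

h = 0.0
grad = 1.0         # running value of d h_t / d h_0
grads = []
for t in range(timesteps):
    h = np.tanh(w * h + x[t])
    # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
    grad *= w * (1.0 - h ** 2)
    grads.append(abs(grad))

print(grads[0], grads[9], grads[-1])
```

Each factor has magnitude at most `w`, so after 50 steps the derivative is bounded by $0.5^{50}$: the early inputs have essentially no influence on the gradient signal.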

The Long Short-Term Memory (LSTM) layer was proposed by Hochreiter and Schmidhuber to address the
vanishing gradient problem. Its basic structure is like that of the `SimpleRNN` layer. On top of that, the LSTM
layer adds a way to carry information across many timesteps. This special information channel prevents
signals from gradually vanishing during processing.

The architecture of the LSTM layer is depicted in the diagram below

<img src='./images/fig_RNN-LSTM.png' width=700>

Hochreiter and Schmidhuber's insight is the addition of the extra dataflow $\mathbf{c}_{(t)}$, as indicated
by the red flow lines in the diagram. It also affects the state sent to the next timestep. The computation 
of $\mathbf{c}_{(t)}$ involves three distinct transformations. Each transformation has its own weight 
matrices and bias:

$$ \begin{align}
       \mathbf{i}_{(t)} &= \sigma\left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_{xi} +
                                        \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_{hi} + 
                                        b_i \right), \\
       \mathbf{f}_{(t)} &= \sigma\left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_{xf} +
                                        \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_{hf} + 
                                        b_f \right), \\
       \mathbf{g}_{(t)} &= \tanh\left(  \mathbf{x}_{(t)}^T\cdot\mathbf{w}_{xg} +
                                        \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_{hg} + 
                                        b_g \right),
   \end{align}
$$

where $\sigma(\ldots)$ is the sigmoid activation function. Then $\mathbf{c}_{(t)}$ is obtained by

$$ \mathbf{c}_{(t)} = \mathbf{f}_{(t)}\otimes\mathbf{c}_{(t-1)} + \mathbf{i}_{(t)}\otimes\mathbf{g}_{(t)}.$$

The symbol $\otimes$ denotes element-wise multiplication. So basically, the flow of 
$\mathbf{c}_{(t)}$ is regulated by $\mathbf{f}_{(t)}$ and $\mathbf{i}_{(t)}$. Both rely on 
the input connection as well as the recurrent connection. More specifically:
- $\mathbf{f}_{(t)}$ controls which part of $\mathbf{c}_{(t-1)}$ is retained.
- $\mathbf{i}_{(t)}$ determines how much of $\mathbf{g}_{(t)}$ is added to $\mathbf{c}_{(t)}$.

Finally, the output $\mathbf{y}_{(t)}$ is obtained by mixing the information from the flow $\mathbf{c}_{(t)}$
and the activation from the inputs $\mathbf{x}_{(t)}$ and previous state $\mathbf{h}_{(t-1)}$:

$$ \mathbf{y}_{(t)} = \mathbf{h}_{(t)} = 
                      \sigma\left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_x + 
                                   \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_h + b \right) \otimes
                      \tanh\left( \mathbf{c}_{(t)} \right)
                                   $$
                                   
This makes sure that the output $\mathbf{y}_{(t)}$ as well as state $\mathbf{h}_{(t)}$ will 
"remember" the past information encoded in $\mathbf{c}_{(t)}$.

In Keras, the LSTM layer is available by importing
```python
    from keras.layers import LSTM
```
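
To connect the LSTM equations above with code, here is a minimal NumPy sketch of one LSTM time step for a single input instance, followed by a short loop over a sequence. The parameter names `w_xo`, `w_ho`, `b_o` for the sigmoid factor in the output equation are labels assumed here (the text writes them as $\mathbf{w}_x$, $\mathbf{w}_h$, $b$). The sizes and random weights are illustrative. This is not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step for a single input instance x_t of shape (n_i,)."""
    i_t = sigmoid(x_t @ p["w_xi"] + h_prev @ p["w_hi"] + p["b_i"])
    f_t = sigmoid(x_t @ p["w_xf"] + h_prev @ p["w_hf"] + p["b_f"])
    g_t = np.tanh(x_t @ p["w_xg"] + h_prev @ p["w_hg"] + p["b_g"])
    # Sigmoid factor of the output equation (assumed label: "o")
    o_t = sigmoid(x_t @ p["w_xo"] + h_prev @ p["w_ho"] + p["b_o"])
    c_t = f_t * c_prev + i_t * g_t     # update of the carry dataflow
    h_t = o_t * np.tanh(c_t)           # output, which is also the next state
    return h_t, c_t

rng = np.random.default_rng(0)
n_i, n_n, timesteps = 3, 5, 10         # illustrative sizes

params = {}
for gate in "ifgo":
    params[f"w_x{gate}"] = rng.normal(scale=0.5, size=(n_i, n_n))
    params[f"w_h{gate}"] = rng.normal(scale=0.5, size=(n_n, n_n))
    params[f"b_{gate}"] = np.zeros(n_n)

h_t = np.zeros(n_n)
c_t = np.zeros(n_n)
for x_t in rng.normal(size=(timesteps, n_i)):
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)

print(h_t.shape, c_t.shape)  # (5,) (5,)
```

Note that two quantities are passed from step to step: the state `h_t` and the carry `c_t`.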

## 2.3 GRU layers<a id="Part_2_3"></a>
<a href="#Top">Back to page top</a>

The Gated Recurrent Unit (GRU) layer was proposed by Cho et al. in a
<a href='https://arxiv.org/pdf/1406.1078.pdf'>2014 paper</a>. The GRU layer is a simplified version of the
LSTM layer. Its architecture is shown in the following diagram

<img src='./images/fig_RNN-GRU.png' width=600>

The diagram is different from Figure 14-14 in the __Hands-on Machine Learning__ book by Géron. In Figure 14-14,
the "$1-$" operation is applied to $\mathbf{z}_{(t)}$ before it is merged with $\mathbf{h}_{(t-1)}$.
In the original paper, however, the "$1-$" operation is applied to the $\mathbf{z}_{(t)}$ that is merged
with $\mathbf{g}_{(t)}$. Géron argues in the 
<a href='https://www.oreilly.com/catalog/errata.csp?isbn=0636920052289'>Errata</a> 
that both implementations work. In either case, the _reset_ $\mathbf{r}_{(t)}$ and _update_ $\mathbf{z}_{(t)}$
regulators are given by the following equations

$$\begin{align}
    \mathbf{z}_{(t)} &= \sigma\left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_{xz} +
                                     \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_{hz} \right), \\
    \mathbf{r}_{(t)} &= \sigma\left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_{xr} +
                                     \mathbf{h}_{(t-1)}^T\cdot\mathbf{w}_{hr} \right).
\end{align}$$

The actual output $\mathbf{h}_{(t)}$ is then computed by

$$ \mathbf{h}_{(t)} =  \left( 1 - \mathbf{z}_{(t)} \right)\otimes\mathbf{h}_{(t-1)} +
                       \mathbf{z}_{(t)}\otimes\mathbf{g}_{(t)}, $$
                       
where

$$  \mathbf{g}_{(t)} = \tanh \left( \mathbf{x}_{(t)}^T\cdot\mathbf{w}_{xg} +
                                     \left(\mathbf{r}_{(t)}\otimes\mathbf{h}_{(t-1)}\right)^T\cdot\mathbf{w}_{hg}
                              \right).  $$
                              
Note that in these activations, there is no bias term. The idea behind the GRU design is:
- The update $\mathbf{z}_{(t)}$ controls how much information from the previous hidden 
  state $\mathbf{h}_{(t-1)}$ will carry over to the current hidden state.
- The reset $\mathbf{r}_{(t)}$ effectively allows the hidden state to drop any information that is
  found to be irrelevant in the future, allowing a more compact representation.
  
It is argued in the paper that since the hidden state has separate reset and update operations, each hidden
unit will learn to capture dependencies over different time scales (short-term and long-term scales). Therefore,
the GRU has a simpler design than the LSTM unit while maintaining the ability to "remember" the history at
different time scales.
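
The GRU equations above translate almost line by line into NumPy. The following is a minimal sketch for a single input instance, using the convention written in this notebook (the "$1-$" factor multiplies $\mathbf{h}_{(t-1)}$) and no bias terms. Sizes and random weights are illustrative assumptions, and this is not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step for a single input instance x_t of shape (n_i,)."""
    z_t = sigmoid(x_t @ p["w_xz"] + h_prev @ p["w_hz"])           # update
    r_t = sigmoid(x_t @ p["w_xr"] + h_prev @ p["w_hr"])           # reset
    g_t = np.tanh(x_t @ p["w_xg"] + (r_t * h_prev) @ p["w_hg"])   # candidate state
    # Blend the previous state and the candidate state
    return (1.0 - z_t) * h_prev + z_t * g_t

rng = np.random.default_rng(0)
n_i, n_n, timesteps = 3, 5, 10         # illustrative sizes

params = {}
for name in "zrg":
    params[f"w_x{name}"] = rng.normal(scale=0.5, size=(n_i, n_n))
    params[f"w_h{name}"] = rng.normal(scale=0.5, size=(n_n, n_n))

h_t = np.zeros(n_n)
for x_t in rng.normal(size=(timesteps, n_i)):
    h_t = gru_step(x_t, h_t, params)

print(h_t.shape)  # (5,)
```

Compared with the LSTM, only one quantity (`h_t`) is passed between time steps, and there are three weight pairs instead of four.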

In Keras, the GRU layer can be implemented by importing:
```python
    from keras.layers import GRU
```
  

## 2.4 Notes on the usage of Keras' RNN layers<a id="Part_2_4"></a>
<a href="#Top">Back to page top</a>

As mentioned previously, Keras has three RNN layers:
- `keras.layers.SimpleRNN()`
- `keras.layers.LSTM()`
- `keras.layers.GRU()`

These layers in general take inputs of shape `(batch_size, timesteps, input_features)`. However, if you look at
the <a href='https://keras.io/layers/recurrent/'>Keras Documentation</a> of recurrent layers, you'll find that
none of the three layers lists an input shape. What happened? It turns out that the `SimpleRNN`,
`LSTM`, and `GRU` layers all inherit from the base recurrent layer class `RNN`, which takes 3D tensors with
shape `(batch_size, timesteps, input_dim)` as its input (here `input_dim` = `input_features`). __However, so long 
as the RNN layer is not used as the first layer in a network, Keras will infer the input shape from the previous
layer. In this case, one only needs to specify the dimension of RNN layer output, which is the `units` argument__. 
If any of the RNN layers is used as the first layer, then one has to specify at least `input_dim` in the layer
definition, which sets the last dimension of the input tensor. The code example below, from the 
<a href='https://keras.io/getting-started/sequential-model-guide/'>Keras Sequential Model Guide</a>, instead passes the full `input_shape=(timesteps, data_dim)`:
```python
    from keras.models import Sequential
    from keras.layers import LSTM, Dense
    import numpy as np

    data_dim = 16
    timesteps = 8
    num_classes = 10

    # expected input data shape: (batch_size, timesteps, data_dim)
    model = Sequential()
    model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
    model.add(LSTM(32, return_sequences=True))      # returns a sequence of vectors of dimension 32
    model.add(LSTM(32))                             # return a single vector of dimension 32
    model.add(Dense(10, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
```
              
So in this example, we are stacking three LSTM layers. In the first LSTM layer, one has to write out
the shape of the input explicitly. The `batch_size` dimension is left out of `input_shape` and is implied, 
just as in the `Conv2D` layer we saw before. This network has the following architecture:

<img src='./images/fig_RNN-Stacked-LSTM.png' width=400>
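
As a worked piece of arithmetic, the number of trainable parameters in this stacked model can be estimated from the LSTM equations: each LSTM layer has four transformations (the three that feed $\mathbf{c}_{(t)}$ plus the sigmoid factor of the output), each with an input weight matrix, a recurrent weight matrix, and a bias vector. The helper names below are hypothetical, and the count assumes the default configuration with biases. It should agree with `model.summary()`, but treat it as a sketch.

```python
def lstm_param_count(n_inputs, n_units):
    # Four transformations, each with input weights, recurrent weights, and a bias
    return 4 * (n_inputs * n_units + n_units * n_units + n_units)

def dense_param_count(n_inputs, n_units):
    # One weight matrix and one bias vector
    return n_inputs * n_units + n_units

data_dim, units, num_classes = 16, 32, 10

counts = [
    lstm_param_count(data_dim, units),      # first LSTM layer
    lstm_param_count(units, units),         # second LSTM layer
    lstm_param_count(units, units),         # third LSTM layer
    dense_param_count(units, num_classes),  # softmax output layer
]

print(counts)       # [6272, 8320, 8320, 330]
print(sum(counts))  # 23242
```

Note that `timesteps` does not appear in the count: the same weights are reused at every time step.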

When working with text data, it is necessary to tokenize the text input, that is, to transform
text into numeric tensors. In this situation, it is customary to build a network using the Keras `Embedding`
layer as the first layer of the model. The `Embedding` layer takes a 2D input tensor of integer token indices of shape 
`(batch_size, sequence_length)` and returns a 3D tensor of shape `(batch_size, sequence_length, output_dim)`,
which is the input shape of Keras RNN layers. Here is an example using `Embedding` and `LSTM` layers:
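```python
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    max_features = 10000  # vocabulary size; an assumed value for illustration

    model = Sequential()
    model.add(Embedding(max_features, 32))   # maps each token index to a 32-dimensional vector
    model.add(LSTM(32))                      # input shape is inferred from the Embedding output
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])
```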

Now, in the `LSTM` layer definition, one only needs to specify the number of output units. The input shape is 
inferred from the output of the `Embedding` layer.
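
To see why the `Embedding` output already has the 3D shape an RNN layer expects, one can think of the layer as a lookup table indexed by token id. The NumPy sketch below mimics only that shape behavior. The vocabulary size, batch size, and sequence length are illustrative assumptions, and a real `Embedding` layer learns its table during training.

```python
import numpy as np

rng = np.random.default_rng(0)

max_features = 1000   # vocabulary size (illustrative)
output_dim = 32       # size of each embedding vector

# Lookup table: one row of length output_dim per token id
embedding_matrix = rng.normal(size=(max_features, output_dim))

# A batch of 4 tokenized sequences, each of length 20
token_ids = rng.integers(0, max_features, size=(4, 20))

# Integer-array indexing replaces every token id by its row in the table
embedded = embedding_matrix[token_ids]

print(token_ids.shape)  # (4, 20)      -> (batch_size, sequence_length)
print(embedded.shape)   # (4, 20, 32)  -> (batch_size, sequence_length, output_dim)
```

The last axis of `embedded` plays the role of `input_features` for the recurrent layer that follows.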

# 3. Working with text data: Embedding<a id="Part_3"></a>
<a href="#Top">Back to page top</a>