# Recurrent Neural Networks

Author: Binghen Wang

Last Updated: 31 Dec, 2022

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a> |
    <a href="./Convolutional Neural Networks.ipynb">Convolutional Neural Networks</a>
    <br>
    <b>RNN navigation:</b>
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
Basics:
- [Examples of Sequence Models](#ESM)
- [Standard Notation](#StandardN)
- [Recurrent Neural Networks](#RNN)
    - [Other Variants](#RNN-OV)
    - [Bidirectional RNN](#RNN-BRNN)
    - [Deep RNN](#RNN-DRNN)
    - [Language Model](#RNN-LM)
- [Gated Recurrent Unit (GRU)](#GRU)
- [Long Short Term Memory (LSTM)](#LSTM)

## Basics
<a name = "ESM"></a>
### Examples of Sequence Models
<table>
    <tr>
    <th>Task</th>
    <th>Input</th>
    <th>Output</th>
    </tr>
    <tr>
        <td>Named entity recognition</td>
        <td>text</td>
        <td>text with located and classified names</td>
    </tr>
    <tr>
        <td>DNA sequence analysis</td>
        <td>DNA sequence</td>
        <td>DNA sequence with located and classified segments</td>
    </tr>
    <tr>
        <td>Machine translation</td>
        <td>text in language A</td>
        <td>text in language B</td>
    </tr>
    <tr>
        <td>Speech recognition</td>
        <td>audio</td>
        <td>text</td>
    </tr><tr>
        <td>Video activity recognition</td>
        <td>video</td>
        <td>label</td>
    </tr><tr>
        <td>Sentiment classification</td>
        <td>text</td>
        <td>label (e.g. Likert scale rating)</td>
    </tr>
    <tr>
        <td>Music generation</td>
        <td>N/A</td>
        <td>audio</td>
    </tr>
    <tr>
        <td>Poem generation</td>
        <td>keywords</td>
        <td>text</td>
    </tr>
</table>

<a name = "StandardN"></a>
### Standard Notation

<table>
    <tr>
    <th>Notation</th>
    <th>Meaning</th>
    </tr>
    <tr>
        <td>$T_x^{(i)}$</td>
        <td>length of input $i$</td>
    </tr>
    <tr>
        <td>$T_y^{(i)}$</td>
        <td>length of output $i$</td>
    </tr>
    <tr>
        <td>$x^{(i)<t>}$</td>
        <td>input $i$ at temporal location $t$</td>
    </tr>
    <tr>
        <td>$y^{(i)<t>}$</td>
        <td>output $i$ at temporal location $t$</td>
    </tr>
</table>

<a name = "RNN"></a>
### Recurrent Neural Networks

<div style = "text-align: center;">
    <img src="./images/RNN.png" style="width:80%;" >
</div>

Note that the same parameters are reused at every time step. The **loss function** used for backpropagation through time sums the per-step cross-entropy losses, where $i$ indexes the output classes:
$$
L = \sum_t L(\hat y^{<t>}, y^{<t>}) = - \sum_t \sum_i y_i^{<t>} \log \hat y_i^{<t>}
$$
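
To make the weight sharing and the summed loss concrete, here is a minimal NumPy sketch. The weight names (`Waa`, `Wax`, `Wya`, `ba`, `by`), the tanh activation, and the toy dimensions are assumptions for illustration; they follow a common convention and are not read off the figure above.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax."""
    e = np.exp(z - z.max(axis=0, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def rnn_forward(xs, a0, Waa, Wax, Wya, ba, by):
    """Run a basic RNN over a list of input column vectors."""
    a, y_hats = a0, []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)    # the same weights are reused at every t
        y_hats.append(softmax(Wya @ a + by))   # predicted distribution at step t
    return y_hats

def sequence_loss(y_hats, ys):
    """Sum over time steps of the cross-entropy against one-hot targets."""
    return -sum(float(np.sum(y * np.log(y_hat))) for y_hat, y in zip(y_hats, ys))

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 4, 5, 3, 6                  # toy sizes (assumed)
xs = [rng.normal(size=(n_x, 1)) for _ in range(T)]
ys = [np.eye(n_y)[:, [rng.integers(n_y)]] for _ in range(T)]  # random one-hot targets
Waa = rng.normal(scale=0.1, size=(n_a, n_a))
Wax = rng.normal(scale=0.1, size=(n_a, n_x))
Wya = rng.normal(scale=0.1, size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

y_hats = rnn_forward(xs, np.zeros((n_a, 1)), Waa, Wax, Wya, ba, by)
loss = sequence_loss(y_hats, ys)
print(loss)
```

Because the targets are one-hot, the inner sum over classes picks out the log-probability assigned to the correct class at each step.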

<a name = "RNN-OV"></a>
#### Other RNN Variants
<div style = "text-align: center;">
    <img src="./images/RNN types.png" style="width:90%;" >
</div>

<a name = "RNN-BRNN"></a>
#### Bidirectional RNN
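A unidirectional RNN predicts $\hat y^{<t>}$ from the inputs up to time $t$ only, so it ignores the context that comes after a word. A bidirectional RNN addresses this by also processing the sequence from right to left and combining both directions at each time step.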

<div style = "text-align: center;">
    <img src="./images/BRNN.png" style="width:70%;" >
</div>
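
The idea can be sketched in NumPy as two independent passes whose hidden states are concatenated per time step. The helper names (`run_direction`, `birnn_states`), the tanh cell, and the concatenation are assumptions made for this illustration.

```python
import numpy as np

def run_direction(xs, W_a, W_x, b):
    """Run a simple tanh RNN over xs and return the hidden state at each step."""
    a = np.zeros((W_a.shape[0], 1))
    states = []
    for x in xs:
        a = np.tanh(W_a @ a + W_x @ x + b)
        states.append(a)
    return states

def birnn_states(xs, fwd_params, bwd_params):
    """Concatenate left-to-right and right-to-left hidden states per time step."""
    forward = run_direction(xs, *fwd_params)
    backward = run_direction(xs[::-1], *bwd_params)[::-1]  # re-align to original order
    return [np.vstack([f, b]) for f, b in zip(forward, backward)]

rng = np.random.default_rng(0)
n_x, n_a, T = 3, 4, 5                          # toy sizes (assumed)

def init_params():
    return (rng.normal(scale=0.5, size=(n_a, n_a)),
            rng.normal(scale=0.5, size=(n_a, n_x)),
            np.zeros((n_a, 1)))

fwd_params, bwd_params = init_params(), init_params()
xs = [rng.normal(size=(n_x, 1)) for _ in range(T)]
states = birnn_states(xs, fwd_params, bwd_params)

# Perturb only the last input and compare the state at the first time step.
xs_changed = xs[:-1] + [xs[-1] + 1.0]
states_changed = birnn_states(xs_changed, fwd_params, bwd_params)
print(np.array_equal(states[0][:n_a], states_changed[0][:n_a]))  # forward half
print(np.array_equal(states[0][n_a:], states_changed[0][n_a:]))  # backward half
```

The forward half of the first state never sees the last input, whereas the backward half does, which is exactly the later context a unidirectional RNN lacks.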

<a name = "RNN-DRNN"></a>
#### Deep RNN
A deep RNN stacks several simple RNN layers vertically: the hidden states of one layer serve as the inputs to the layer above. The added depth makes training considerably slower.

<div style = "text-align: center;">
    <img src="./images/Deep RNN.png" style="width:70%;" >
</div>
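
A rough sketch of the stacking, assuming plain tanh layers and hypothetical layer sizes:

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """Stacked tanh RNN: each layer reads the hidden states of the layer below."""
    inputs = xs
    for W_a, W_x, b in layers:                 # bottom layer first
        a = np.zeros((W_a.shape[0], 1))
        outputs = []
        for x in inputs:                       # unroll this layer through time
            a = np.tanh(W_a @ a + W_x @ x + b)
            outputs.append(a)
        inputs = outputs                       # feed the states upward
    return inputs                              # hidden states of the top layer

rng = np.random.default_rng(0)
n_x, T = 3, 5
sizes = [4, 4, 2]                              # hidden units per layer (assumed)
layers, n_in = [], n_x
for n_a in sizes:
    layers.append((rng.normal(scale=0.5, size=(n_a, n_a)),
                   rng.normal(scale=0.5, size=(n_a, n_in)),
                   np.zeros((n_a, 1))))
    n_in = n_a

xs = [rng.normal(size=(n_x, 1)) for _ in range(T)]
top_states = deep_rnn_forward(xs, layers)
print(len(top_states), top_states[0].shape)
```

Every time step now requires a recurrent update in each layer, which is one reason the training cost grows with depth.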


<a name = "RNN-LM"></a>
#### Language Model
Consider the task of training a **language model** that generates poems, either from a keyword or from no input at all. We can train a one-to-many RNN to achieve this goal.

**Data:** a large corpus of poem text $\{y^{(i)<1>}, \dots, y^{(i)<T_y^{(i)}>}\}_{i=1}^m$<br>
**Data preparation**: **Tokenize** the corpus and collect its unique words into a vocabulary/dictionary, which is then used to generate one-hot representations of the words in a sentence. It is helpful to add two extra tokens to the vocabulary: `<EOS>` (end of sentence) and `<UNK>` (unknown). <br>
**Loss function**: $$
L = \sum_t L(\hat y^{<t>}, y^{<t>}) = - \sum_t \sum_j y_j^{<t>} \log \hat y_j^{<t>}
$$
**Model during training**: We pass the true $y^{<t>}$ (*not the predicted $\hat y^{<t>}$*) as the input $x^{<t+1>}$ at time step $t+1$.
<div style = "text-align: center;">
    <img src="./images/language model training.png" style="width:50%;" >
</div><br>
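
The data preparation and the training-time inputs can be sketched together as follows. The two-line toy corpus, the helper names, whitespace tokenization, and the choice of a zero vector for $x^{<1>}$ are all assumptions for illustration.

```python
import numpy as np

corpus = ["roses are red", "violets are blue"]   # toy stand-in for a poem corpus

# Vocabulary: unique words plus the two special tokens.
words = sorted({w for line in corpus for w in line.split()})
vocab = words + ["<EOS>", "<UNK>"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    """Map words to indices, append <EOS>, and use <UNK> for unseen words."""
    tokens = sentence.split() + ["<EOS>"]
    return [word_to_idx.get(t, word_to_idx["<UNK>"]) for t in tokens]

def one_hot(indices, size):
    """One row per time step, with a single 1 at the word's index."""
    out = np.zeros((len(indices), size))
    out[np.arange(len(indices)), indices] = 1.0
    return out

def teacher_forcing_pairs(sentence):
    """Return (X, Y): Y holds the true words, X is Y shifted right by one step."""
    Y = one_hot(encode(sentence), len(vocab))
    X = np.vstack([np.zeros((1, len(vocab))), Y[:-1]])  # x<1> = 0, x<t+1> = y<t>
    return X, Y

ids = encode("roses are green")                  # "green" is not in the vocabulary
X, Y = teacher_forcing_pairs("roses are red")
print(ids)
print(X.shape, Y.shape)
```

Each row of `X` after the first is the true previous word, not a model prediction, which matches the training setup described above.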

**Model during application**: There are several nuances when applying the model to generate sequences:
1. From the second time step onwards, we **sample** a word from the **softmax distribution** produced at the previous time step, $P(y^{<t-1>} \mid \mathcal{I}_{t-2})$, and feed it as the input $x^{<t>}$ to the recurrent neural network. <br>*<font color = darkblue>Note: for this task, we do not feed the word with the maximum probability into the subsequent time step, as doing so would yield the identical sequence every time. We want the model to generate a <b>different sequence</b> each time we use it.</font>*
2. There are a few ways to determine the end of the generating process. One is to **stop the process** when the token `<EOS>` is selected. Another is to stop when the sequence reaches a pre-determined length.
3. Occasionally, the **unknown word** token `<UNK>` may be sampled from the softmax distribution. One can deal with this in two ways: (i) keep it in the sentence, or (ii) reject it and draw another word.

<div style = "text-align: center;">
    <img src="./images/language model application.png" style="width:45%;" >
</div><br>
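
The three points above can be sketched as a sampling loop. The function name `sample_sequence`, the weight names, the random (untrained) weights, and the zero vector used as the first input are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                    # shift for numerical stability
    return e / e.sum()

def sample_sequence(params, vocab, rng, max_len=20, reject_unk=True):
    """Generate words by sampling each step's softmax output and feeding it back."""
    Waa, Wax, Wya, ba, by = params
    eos, unk = vocab.index("<EOS>"), vocab.index("<UNK>")
    a = np.zeros((Waa.shape[0], 1))
    x = np.zeros((len(vocab), 1))              # assumed zero input at the first step
    words = []
    while len(words) < max_len:                # stop rule: length cap
        a = np.tanh(Waa @ a + Wax @ x + ba)
        p = softmax(Wya @ a + by).ravel()
        idx = rng.choice(len(vocab), p=p)      # sample instead of taking the argmax
        while reject_unk and idx == unk:       # redraw if <UNK> is selected
            idx = rng.choice(len(vocab), p=p)
        if idx == eos:                         # stop rule: <EOS> selected
            break
        words.append(vocab[idx])
        x = np.zeros((len(vocab), 1))
        x[idx] = 1.0                           # feed the sampled word back in
    return words

vocab = ["are", "blue", "red", "roses", "violets", "<EOS>", "<UNK>"]
rng = np.random.default_rng(0)
n_a, n_v = 8, len(vocab)
params = (rng.normal(scale=0.5, size=(n_a, n_a)),
          rng.normal(scale=0.5, size=(n_a, n_v)),
          rng.normal(scale=0.5, size=(n_v, n_a)),
          np.zeros((n_a, 1)),
          np.zeros((n_v, 1)))

seq = sample_sequence(params, vocab, np.random.default_rng(1), max_len=10)
print(seq)
```

Passing `reject_unk=False` corresponds to option (i), keeping `<UNK>` in the sentence. With untrained weights the output is meaningless; the sketch only shows the control flow.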


<a name = "GRU"></a>
### Gated Recurrent Unit
#### Vanishing Gradients Problem
Backpropagation through time in an RNN can suffer from vanishing gradients: the gradient shrinks as it is propagated back over many time steps, so the parameters receive little learning signal from distant inputs. As a result, **the basic RNN is dominated by local influences**, meaning the output $\hat y^{<t>}$ is mainly affected by values close to it (e.g., $x^{<t-1>}$, $x^{<t-2>}$) rather than by those far back (e.g., $x^{<t-10>}$). In the sentence below, for instance, the verb *were* must agree with the subject *cats*, which appears many words earlier.
<blockquote>
    The <b>cats</b>, which had already eaten a bowl of cat food, <b>were</b> full.
</blockquote>
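
A small numerical sketch of the effect, under stated assumptions: a tanh RNN with the inputs omitted, and a recurrent matrix `Waa` rescaled so that its largest singular value is 0.9 (chosen only to make the decay easy to see). The Jacobian of $a^{<t>}$ with respect to $a^{<0>}$ is accumulated with the chain rule and its norm is tracked.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 8, 30                                  # toy sizes (assumed)
Waa = rng.normal(size=(n_a, n_a))
Waa *= 0.9 / np.linalg.norm(Waa, 2)             # largest singular value becomes 0.9

a = rng.normal(size=(n_a, 1))
J = np.eye(n_a)                                 # accumulates d a<t> / d a<0>
norms = []
for t in range(T):
    a = np.tanh(Waa @ a)                        # one recurrent step, inputs omitted
    J = np.diag(1.0 - a.ravel() ** 2) @ Waa @ J # chain rule through this step
    norms.append(np.linalg.norm(J, 2))

print(norms[0], norms[-1])
```

Since the derivative of tanh is at most 1, each step multiplies the norm by at most 0.9 in this setup, so the influence of $a^{<0>}$ on $a^{<30>}$ is bounded by $0.9^{30}$. With a larger recurrent matrix the same product can instead grow, so this sketch covers only the vanishing case.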

Special network units have been designed to address this issue. Two popular ones are the **Gated Recurrent Unit (GRU)** and the **Long Short Term Memory (LSTM)**.

#### Gated Recurrent Unit
<div style = "text-align: center;">
    <img src="./images/GRU explainer.png" style="width:100%;" >
</div><br>

Each gate takes values between 0 and 1. The **reset gate** (also known as the **relevance gate**) determines how relevant the previous hidden state is when forming the candidate hidden state. The **update gate** determines how the candidate hidden state and the previous hidden state are weighted in the output hidden state. If $\Gamma_u = 0$, the previous hidden state is passed unchanged to the next time step, which helps mitigate the vanishing gradients problem.
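
A single GRU step might be sketched as below. The exact equations are an assumption chosen to agree with the description above (in particular, $\Gamma_u = 0$ carries the previous state over); the weight names and the stacking of $[c^{<t-1>}; x^{<t>}]$ are hypothetical and may differ from the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, p):
    """One GRU step; p holds weights acting on the stacked vector [c_prev; x]."""
    concat = np.vstack([c_prev, x])
    gamma_r = sigmoid(p["Wr"] @ concat + p["br"])        # reset (relevance) gate
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])        # update gate
    c_tilde = np.tanh(                                   # candidate hidden state
        p["Wc"] @ np.vstack([gamma_r * c_prev, x]) + p["bc"])
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # blend candidate and previous

rng = np.random.default_rng(0)
n_x, n_c = 3, 4                                          # toy sizes (assumed)
p = {name: rng.normal(scale=0.5, size=(n_c, n_c + n_x)) for name in ("Wr", "Wu", "Wc")}
p.update({name: np.zeros((n_c, 1)) for name in ("br", "bu", "bc")})
c_prev, x = rng.normal(size=(n_c, 1)), rng.normal(size=(n_x, 1))

c_next = gru_step(c_prev, x, p)

# A very negative update-gate bias pushes gamma_u towards 0, so the state is carried over.
p_closed = dict(p, bu=np.full((n_c, 1), -50.0))
c_carried = gru_step(c_prev, x, p_closed)
print(np.allclose(c_carried, c_prev))
```

When the update gate stays near 0 over many steps, the state is copied almost unchanged, so information (and gradient) from early time steps can survive much longer than in the basic RNN.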

<a name = "LSTM"></a>
### Long Short Term Memory
<div style = "text-align: center;">
    <img src="./images/LSTM.png" style="width:100%;" >
</div><br>

Unlike GRUs, **Long Short Term Memory (LSTM)** units pass two states to the next time step, the **long-term memory** $c^{<t>}$ and the **short-term memory** $a^{<t>}$, and use three gates to govern their updating. In GRUs, $c^{<t>} = a^{<t>}$.
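
For comparison, a single LSTM step could look like the following sketch. It assumes the common formulation with forget, update (input), and output gates; the gate and weight names are hypothetical and may not match the figure exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, p):
    """One LSTM step; p holds weights acting on the stacked vector [a_prev; x]."""
    concat = np.vstack([a_prev, x])
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])   # forget gate
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])   # update (input) gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])   # candidate memory
    c = gamma_f * c_prev + gamma_u * c_tilde        # long-term memory
    a = gamma_o * np.tanh(c)                        # short-term memory
    return a, c

rng = np.random.default_rng(0)
n_x, n_a = 3, 4                                     # toy sizes (assumed)
p = {name: rng.normal(scale=0.5, size=(n_a, n_a + n_x))
     for name in ("Wf", "Wu", "Wo", "Wc")}
p.update({name: np.zeros((n_a, 1)) for name in ("bf", "bu", "bo", "bc")})
a_prev = rng.normal(size=(n_a, 1))
c_prev = rng.normal(size=(n_a, 1))
x = rng.normal(size=(n_x, 1))

a_next, c_next = lstm_step(a_prev, c_prev, x, p)

# Forget gate near 1 and update gate near 0: the long-term memory is carried over.
p_keep = dict(p, bf=np.full((n_a, 1), 50.0), bu=np.full((n_a, 1), -50.0))
_, c_kept = lstm_step(a_prev, c_prev, x, p_keep)
print(np.allclose(c_kept, c_prev))
```

Separate forget and update gates let the unit keep old memory and write new memory independently, whereas the GRU ties the two through a single update gate.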