# Recurrent Neural Networks

Author: Binghen Wang

Last Updated: 28 Jan, 2023

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a> |
    <a href="./Convolutional Neural Networks.ipynb">Convolutional Neural Networks</a> |
    <a href="./Transformer.ipynb">The Transformer</a>
    <br>
    <b>RNN navigation:</b> <a href="./Natural Language Processing.ipynb">Natural Language Processing</a> 
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents
Basics:
- [Examples of Sequence Models](#ESM)
- [Standard Notation](#StandardN)
- [Recurrent Neural Networks](#RNN)
    - [Other Variants](#RNN-OV)
    - [Bidirectional RNN](#RNN-BRNN)
    - [Deep RNN](#RNN-DRNN)
    - [Language Model](#RNN-LM)
    - [Machine Translation](#RNN-MT)
- [Gated Recurrent Unit (GRU)](#GRU)
- [Long Short Term Memory (LSTM)](#LSTM)

Advanced Models:
- [Attention Mechanism](#Attention)

## Basics
<a name = "ESM"></a>
### Examples of Sequence Models
<table>
    <tr>
    <th>Task</th>
    <th>Input</th>
    <th>Output</th>
    </tr>
    <tr>
        <td>Named entity recognition</td>
        <td>text</td>
        <td>text with located and classified names</td>
    </tr>
    <tr>
        <td>DNA sequence analysis</td>
        <td>DNA sequence</td>
        <td>DNA sequence with located and classified segments</td>
    </tr>
    <tr>
        <td>Machine translation</td>
        <td>text in language A</td>
        <td>text in language B</td>
    </tr>
    <tr>
        <td>Speech recognition</td>
        <td>audio</td>
        <td>text</td>
    </tr><tr>
        <td>Video activity recognition</td>
        <td>video</td>
        <td>label</td>
    </tr><tr>
        <td>Sentiment classification</td>
        <td>text</td>
        <td>label (e.g. Likert scale rating)</td>
    </tr>
    <tr>
        <td>Music generation</td>
        <td>N/A</td>
        <td>audio</td>
    </tr>
    <tr>
        <td>Poem generation</td>
        <td>keywords</td>
        <td>text</td>
    </tr>
</table>

<a name = "StandardN"></a>
### Standard Notation

<table>
    <tr>
    <th>Notation</th>
    <th>Meaning</th>
    </tr>
    <tr>
        <td>$T_x^{(i)}$</td>
        <td>length of input $i$</td>
    </tr>
    <tr>
        <td>$T_y^{(i)}$</td>
        <td>length of output $i$</td>
    </tr>
    <tr>
        <td>$x^{(i)<t>}$</td>
        <td>input $i$ at temporal location $t$</td>
    </tr>
    <tr>
        <td>$y^{(i)<t>}$</td>
        <td>output $i$ at temporal location $t$</td>
    </tr>
</table>

<a name = "RNN"></a>
### Recurrent Neural Networks

<div style = "text-align: center;">
    <img src="./images/RNN.png" style="width:80%;" >
</div>

Note that the parameters are shared across time. The **loss function** used for backpropagation sums the cross-entropy over time steps, where $j$ indexes the output classes (e.g., the words in the vocabulary):
$$
L = \sum_t L(\hat y^{<t>}, y^{<t>}) = - \sum_t \sum_j y_j^{<t>} \log \hat y_j^{<t>}
$$
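
As a minimal sketch of this loss, assuming one-hot targets and softmax outputs stored as plain Python lists (the function name `sequence_cross_entropy` and the toy numbers are illustrative, not taken from the figure above):

```python
import math

def sequence_cross_entropy(y_true, y_pred):
    """Sum over time steps t of -sum_j y_j<t> * log(y_hat_j<t>)."""
    total = 0.0
    for y_t, y_hat_t in zip(y_true, y_pred):
        # Only classes with a non-zero target contribute to the inner sum.
        total -= sum(y * math.log(p) for y, p in zip(y_t, y_hat_t) if y > 0)
    return total

# Two time steps, three classes (toy values).
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
loss = sequence_cross_entropy(y_true, y_pred)
```

Here each time step contributes $-\log 0.5$, so the total is $2 \log 2$.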

<a name = "RNN-OV"></a>
#### Other RNN Variants
<div style = "text-align: center;">
    <img src="./images/RNN types.png" style="width:90%;" >
</div>

<a name = "RNN-BRNN"></a>
#### Bidirectional RNN
A unidirectional RNN has the drawback of ignoring the context that comes after a word. A bidirectional RNN helps address this by also processing the sequence in the reverse direction.

<div style = "text-align: center;">
    <img src="./images/BRNN.png" style="width:70%;" >
</div>

<a name = "RNN-DRNN"></a>
#### Deep RNN
A deep RNN stacks several simple RNN layers vertically, with the hidden states of one layer serving as the inputs to the next. The added depth makes it considerably slower to train.

<div style = "text-align: center;">
    <img src="./images/Deep RNN.png" style="width:70%;" >
</div>


<a name = "RNN-LM"></a>
#### Language Model
Consider the task of training a **language model** that generates poems, either from a keyword or from no input at all. We can train a one-to-many RNN to achieve this goal.

**Data:** a large corpus of poem text $\{y^{(i)<1>}, \dots, y^{(i)<T_y^{(i)}>}\}_{i=1}^m$<br>
**Data preparation**: **Tokenize** the corpus and collect its unique words into a vocabulary/dictionary, which is then used to generate one-hot representations of the words in a sentence. It is helpful to add two extra tokens to the vocabulary, `<EOS>` (end of sentence) and `<UNK>` (unknown). <br>
**Loss function**: $$
L = \sum_t L(\hat y^{<t>}, y^{<t>}) = - \sum_t \sum_j y_j^{<t>} \log \hat y_j^{<t>}
$$
**Model during training**: We pass the true $y^{<t>}$ (*not the predicted $\hat y^{<t>}$*) as the input $x^{<t+1>}$ at time step $t+1$.
<div style = "text-align: center;">
    <img src="./images/language model training.png" style="width:50%;" >
</div><br>
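
The data preparation and training setup described above might be sketched as follows. The helper names (`build_vocab`, `one_hot`, `teacher_forcing_pairs`) are hypothetical, whitespace splitting stands in for a real tokenizer, and using a zero vector as $x^{<1>}$ is an assumption of this sketch.

```python
def build_vocab(corpus):
    """Map each unique word to an index and append the two special tokens."""
    words = sorted({w for line in corpus for w in line.split()})
    return {tok: i for i, tok in enumerate(words + ["<EOS>", "<UNK>"])}

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def teacher_forcing_pairs(line, vocab):
    """Build (inputs, targets) so that the input at step t+1 is the true y<t>."""
    unk = vocab["<UNK>"]
    ids = [vocab.get(w, unk) for w in line.split()] + [vocab["<EOS>"]]
    size = len(vocab)
    targets = [one_hot(i, size) for i in ids]
    # Assumption: x<1> is a zero vector; later inputs are the previous true words.
    inputs = [[0] * size] + targets[:-1]
    return inputs, targets

corpus = ["roses are red", "violets are blue"]
vocab = build_vocab(corpus)
# "green" is not in the vocabulary, so it is mapped to <UNK>.
inputs, targets = teacher_forcing_pairs("roses are green", vocab)
```

Feeding the true word rather than the prediction keeps the training inputs independent of the model's current mistakes.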

**Model during application**: There are several nuances when applying the model to generate sequences:
1. From the second time step onwards, we **sample** a word from the **softmax distribution** produced at the previous time step, $P(y^{<t-1>} \vert \mathcal{I}_{t-2})$ (conditioned on the history $\mathcal{I}_{t-2}$), and feed it as the input $x^{<t>}$ to the recurrent neural network. <br>*<font color = darkblue>Note: for this task we do not feed the word with the maximum probability into the subsequent time step, as doing so would yield the same sequence every time. We want the model to generate a <b>different sequence</b> each time we use it.</font>*
2. There are a few ways to determine the end of the generating process. One is to **stop the process** when the token `<EOS>` is selected. Another is to stop when the sequence achieves a pre-determined length.
3. Occasionally, the **unknown word** token `<UNK>` may be sampled from the softmax distribution. One can deal with this in two ways: (i) keep it in the sentence, or (ii) reject it and draw another word.

<div style = "text-align: center;">
    <img src="./images/language model application.png" style="width:45%;" >
</div><br>
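
The three points above can be sketched in a short sampling loop. This is only an illustration: `next_word_probs` is a hypothetical stand-in for the trained RNN (it returns the softmax distribution given the words generated so far), and `toy_model` is a fixed distribution used purely for demonstration.

```python
import random

def sample_sequence(next_word_probs, vocab, max_len=20, reject_unk=True, rng=random):
    """Sample one word at a time until <EOS> is drawn or max_len is reached."""
    sequence = []
    while len(sequence) < max_len:
        probs = next_word_probs(sequence)            # softmax output given the prefix
        word = rng.choices(vocab, weights=probs, k=1)[0]
        if word == "<UNK>" and reject_unk:
            continue                                  # reject <UNK> and redraw
        if word == "<EOS>":
            break                                     # stop when <EOS> is selected
        sequence.append(word)
    return sequence

vocab = ["rose", "red", "<EOS>", "<UNK>"]

def toy_model(prefix):
    # Hypothetical model: the same distribution at every step.
    return [0.4, 0.3, 0.2, 0.1]

poem = sample_sequence(toy_model, vocab, max_len=5, rng=random.Random(0))
```

Calling `sample_sequence` with differently seeded generators generally produces different sequences, which is the reason for sampling instead of taking the most probable word.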

<a name = "RNN-MT"></a>
#### Machine Translation

**Machine translation** can be viewed as training a conditional language model. With an encoder-decoder structure, the output is generated with the aim to find
$$
\underset{y^{<1>}, y^{<2>}, \dots, y^{<T_y>}}{\arg\max} P(y^{<1>}, y^{<2>}, \dots, y^{<T_y>}\vert x)
$$
where $x$ is the input (e.g., an audio clip or a French sentence).

<div style = "text-align: center;">
    <img src="./images/machine translation.png" style="width:80%;" >
</div><br>

Enumerating all the possibilities is impractical given a large vocabulary and a target output sentence of even modest length. Instead, we adopt a search strategy to approximate the best solution.

##### Greedy Search Algorithm
<blockquote>
    In the decoder stage, pick $y^{<1>} := \arg\max_{y^{<1>}} P(y^{<1>}\vert x)$ and set $t=1$. <br>
    Repeat until $y^{<t>}=\text{<EOS>}$:
    <blockquote>
        $
        t := t+1
        $<br>
        $
        y^{<t>} := \arg\max_{y^{<t>}} P(y^{<t>}\vert x, y^{<1>}, \dots, y^{<t-1>})
        $
    </blockquote>
</blockquote>
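
A minimal sketch of the greedy procedure, assuming a hypothetical function `cond_probs(prefix)` that returns $P(y^{<t>}\vert x, y^{<1>}, \dots, y^{<t-1>})$ over the vocabulary. The toy probability table and the `max_len` guard are additions for illustration.

```python
def greedy_decode(cond_probs, vocab, max_len=50):
    """Pick the most probable next word at every step until <EOS>."""
    output = []
    while len(output) < max_len:                 # guard against never reaching <EOS>
        probs = cond_probs(output)
        best = max(range(len(vocab)), key=lambda j: probs[j])
        output.append(vocab[best])
        if vocab[best] == "<EOS>":
            break
    return output

vocab = ["a", "b", "<EOS>"]
# Hypothetical decoder: next-word distribution for each prefix.
table = {
    (): [0.6, 0.4, 0.0],
    ("a",): [0.3, 0.3, 0.4],
    ("b",): [0.0, 0.0, 1.0],
}

def toy_decoder(prefix):
    return table[tuple(prefix)]

greedy_output = greedy_decode(toy_decoder, vocab)
```

On this toy table greedy search returns `['a', '<EOS>']` with joint probability $0.6 \times 0.4 = 0.24$, whereas `['b', '<EOS>']` has joint probability $0.4 \times 1.0 = 0.4$.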

<div class ="alert alert-block alert-danger">
    <b>Problem with the greedy search algorithm:</b> Because it commits to the single best word at each step, it may miss a better solution (one with a higher overall probability) whose first few words are not individually the most probable.
</div>

##### Beam Search Algorithm
Introduce a hyperparameter $B$ for the beam search algorithm that governs the number of candidates kept in memory (e.g., $B=3$ keeps the 3 best candidates during the search). In the decoder stage,
<blockquote>
    <b>Step 1:</b> Calculate the probabilities $P(y^{<1>}\vert x)$ for all words $y^{<1>} \in \mathcal{V}$, and store the $B$ words that yield the highest probabilities $\{y^{<1>}_{(1)}, \dots, y^{<1>}_{(B)}\}$ together with their corresponding probabilities. Denote the word set as $\mathcal{M}^{<1>}$. <br><br>
    <b>Step 2:</b> For each element $y^{<1>}_{(b)} \in \mathcal{M}^{<1>}$, compute the probabilities $P(y^{<2>}\vert x, y^{<1>}_{(b)})$ for all words $y^{<2>} \in \mathcal{V}$. Calculate the joint probabilities $$ P(y^{<1>}_{(b)}, y^{<2>} \vert x) = P(y^{<2>} \vert x, y^{<1>}_{(b)})  \underbrace{P(y^{<1>}_{(b)} \vert x)}_{\text{previously stored}}.$$ Rank the $B \times \vert \mathcal{V}\vert$ joint probabilities across all $B$ stored words. Store the $B$ combinations that yield the highest probabilities $\{\{y^{<1>}_{(1)},y^{<2>}_{(1)}\}, \dots, \{y^{<1>}_{(B)},y^{<2>}_{(B)}\}\}$ and their corresponding probabilities. Denote the set of combinations as $\mathcal{M}^{<2>}$.<br><br>
    ...
    <br><br>
    <b>Step t:</b> Repeat until a stopping criterion is met (e.g., the top combination (with the highest joint probability) ends in $\text{<EOS>}$):
    <blockquote>
        For each element $c^{<t-1>}_{(b)} \in \mathcal{M}^{<t-1>}$, compute the probabilities $P(y^{<t>}\vert x, c^{<t-1>}_{(b)})$ for all words $y^{<t>} \in \mathcal{V}$. Calculate the joint probabilities $$ P(c^{<t-1>}_{(b)}, y^{<t>} \vert x) = P(y^{<t>} \vert x, c^{<t-1>}_{(b)})  \underbrace{P(c^{<t-1>}_{(b)} \vert x)}_{\text{previously stored}}.$$ Rank the $B \times \vert \mathcal{V}\vert$ joint probabilities across all $B$ elements. Store the $B$ combinations that yield the highest probabilities $\{\{y^{<1>}_{(1)},\dots,y^{<t>}_{(1)}\}, \dots, \{y^{<1>}_{(B)},\dots,y^{<t>}_{(B)}\}\}$ and their corresponding probabilities. Denote the set of combinations as $\mathcal{M}^{<t>}$.
    </blockquote>
</blockquote>
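
The steps above can be sketched as follows. This is an illustration under stated assumptions: `cond_probs(prefix)` is a hypothetical function returning the next-word distribution, the toy table is made up, finished candidates are carried forward unchanged, and scores are kept as log probabilities (which preserves the ranking of the joint probabilities).

```python
import math

def beam_search(cond_probs, vocab, beam_width=3, max_len=50):
    """Return (sequence, log joint probability) of the best candidate found."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<EOS>":
                candidates.append((seq, logp))       # finished: keep as is
                continue
            for word, p in zip(vocab, cond_probs(seq)):
                if p > 0:
                    # joint = conditional * previously stored (sum in log space)
                    candidates.append((seq + [word], logp + math.log(p)))
        # Rank all extensions and keep the B best.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if beams[0][0][-1] == "<EOS>":               # top candidate is complete
            break
    return beams[0]

vocab = ["a", "b", "<EOS>"]
# Hypothetical decoder: next-word distribution for each prefix.
table = {
    (): [0.6, 0.4, 0.0],
    ("a",): [0.3, 0.3, 0.4],
    ("b",): [0.0, 0.0, 1.0],
}

def toy_decoder(prefix):
    return table[tuple(prefix)]

best_seq, best_logp = beam_search(toy_decoder, vocab, beam_width=2)
narrow_seq, _ = beam_search(toy_decoder, vocab, beam_width=1)
```

With $B=2$ the search keeps the prefix `b` alive and returns `['b', '<EOS>']` (joint probability 0.4), while $B=1$ reduces to greedy search and returns `['a', '<EOS>']` (joint probability 0.24).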

<div class ="alert alert-block alert-danger">
    <b>Problem with the vanilla beam search algorithm:</b> Because of its objective function, it may undesirably <b>favor shorter sentences</b> as outputs.
    $$\arg\max_y \prod_{t=1}^{T_y} P(y^{<t>}\vert x, y^{<1>}, \dots, y^{<t-1>})$$
    (Note: A longer sentence multiplies together more conditional probabilities, each less than one, so its joint probability tends to be smaller.)
</div>

<div class ="alert alert-block alert-success">
    <b>Length normalization:</b> To refine the beam search algorithm, one could maximize the logarithm of the objective (to avoid numerical underflow) and employ <b>length normalization</b> to balance between short and long sentences.
$$
\arg\max_y \frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y} \log P(y^{<t>}\vert x, y^{<1>}, \dots, y^{<t-1>})
$$
where $\alpha$ is a hyperparameter that governs the tolerance of long sentences. ($\alpha = 0$ is no normalization; $\alpha = 1$ is full normalization and $\alpha = 0.7$ is recommended.) <br><br>
Also note that to use length normalization, one needs a slightly different search algorithm from before. Specifically, in Step $t$ use reaching a fixed length as the stopping criterion, run the search several times for different fixed lengths, and finally compare the normalized scores across the different lengths to find the best output.
</div>
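
A small numeric sketch of the normalized objective, assuming the per-step conditional probabilities of two candidate sentences are already known (the function name `normalized_score` and the probability values are made up for illustration):

```python
import math

def normalized_score(cond_probs, alpha=0.7):
    """(1 / T_y**alpha) * sum_t log P(y<t> | x, y<1>, ..., y<t-1>)."""
    t_y = len(cond_probs)
    return sum(math.log(p) for p in cond_probs) / (t_y ** alpha)

short = [0.6, 0.6]                  # 2 words, each fairly uncertain
long = [0.8, 0.8, 0.8, 0.8, 0.8]    # 5 words, each more confident

raw_short = normalized_score(short, alpha=0.0)
raw_long = normalized_score(long, alpha=0.0)
norm_short = normalized_score(short, alpha=0.7)
norm_long = normalized_score(long, alpha=0.7)
```

Without normalization ($\alpha = 0$) the short candidate scores higher, simply because it multiplies fewer probabilities. With $\alpha = 0.7$ the long candidate scores higher.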

##### Evaluating a Machine Translation System
How do we evaluate the output of a machine translation system, given that a sentence can have multiple equally good translations in another language? One popular way is to use the **BLEU (Bilingual Evaluation Understudy) score**: <a href = "https://aclanthology.org/P02-1040.pdf">link</a>.
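
As a rough illustration of one ingredient of BLEU, the sketch below computes a clipped ("modified") n-gram precision against several reference translations. It is not the full score, which also combines several n-gram orders and applies a brevity penalty. The function names are hypothetical and the sentences are toy data.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against multiple references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()
p1 = modified_precision(degenerate, references, 1)
```

Clipping is what keeps the degenerate candidate from scoring well: "the" occurs at most twice in any one reference, so only 2 of its 7 unigrams are credited.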

##### Problem with the Encoder-Decoder Structure
The encoder-decoder structure connected via a single hidden state is **not good at processing very long sentences**. The intuition is that the entire input sentence must be read and compressed into that hidden state before any of the output sentence is generated. There can be more effective ways to perform translation via clever use of <a href= "#Attention">**attention**</a>.

<a name = "GRU"></a>
### Gated Recurrent Unit
#### Vanishing Gradients Problem
Backpropagation through time in an RNN can suffer from vanishing gradients, which makes the model slow to learn dependencies on distant time steps. As a result, **the basic RNN is dominated by local influences**: the output $\hat y^{<t>}$ is mainly affected by values close to it (e.g., $x^{<t-1>}$, $x^{<t-2>}$) rather than ones far back (e.g., $x^{<t-10>}$). In the sentence below, for instance, the verb must agree with a subject that appears many words earlier.
<blockquote>
    The <b>cats</b>, which had already eaten a bowl of cat food, <b>were</b> full.
</blockquote>
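
To make the effect concrete, here is a toy sketch assuming a scalar RNN $a^{<t>} = \tanh(w\, a^{<t-1>} + x^{<t>})$ (this one-dimensional model and the values of $w$ and $x$ are illustrative assumptions). By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step.

```python
import math

def gradient_through_time(w, inputs, a0=0.0):
    """Return d a<T> / d a<0> for the scalar RNN a<t> = tanh(w * a<t-1> + x<t>)."""
    a, grad = a0, 1.0
    for x in inputs:
        a = math.tanh(w * a + x)
        grad *= w * (1.0 - a * a)   # chain rule: derivative of tanh times w
    return grad

near = gradient_through_time(0.5, [1.0] * 2)    # dependency 2 steps back
far = gradient_through_time(0.5, [1.0] * 20)    # dependency 20 steps back
```

Each factor here is well below one, so `far` is many orders of magnitude smaller than `near`, and an input 20 steps back has almost no influence on the update.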

Special network units have been designed to address this issue. Two popular ones are the **Gated Recurrent Unit (GRU)** and the **Long Short Term Memory (LSTM)**.

#### Gated Recurrent Unit
<div style = "text-align: center;">
    <img src="./images/GRU explainer.png" style="width:100%;" >
</div><br>

The gates are bounded between 0 and 1. The **reset gate** (also known as the **relevance gate**) determines how relevant the previous hidden state is in the creation of the candidate hidden state. The **update gate** determines the weights of the candidate hidden state and the previous hidden state in the output hidden state. If $\Gamma_u = 0$, the previous hidden state is passed unchanged as the hidden state to the next time step, which helps mitigate the vanishing gradients problem.
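
A scalar sketch of one GRU step, written to match the description above rather than to reproduce the figure exactly. The parameter names in `p` and their values are hypothetical, and the update rule $c^{<t>} = \Gamma_u \tilde c^{<t>} + (1 - \Gamma_u)\, c^{<t-1>}$ is assumed from the statement about $\Gamma_u = 0$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(c_prev, x, p):
    """One scalar GRU step; `p` holds hypothetical weights and biases."""
    gamma_r = sigmoid(p["w_rc"] * c_prev + p["w_rx"] * x + p["b_r"])   # reset gate
    gamma_u = sigmoid(p["w_uc"] * c_prev + p["w_ux"] * x + p["b_u"])   # update gate
    # The reset gate scales how much of the previous state enters the candidate.
    c_tilde = math.tanh(p["w_cc"] * gamma_r * c_prev + p["w_cx"] * x + p["b_c"])
    # The update gate mixes the candidate with the previous hidden state.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

params = {k: 0.5 for k in ["w_rc", "w_rx", "w_uc", "w_ux", "w_cc", "w_cx"]}
params.update({"b_r": 0.0, "b_c": 0.0, "b_u": -30.0})   # large negative bias: gate near 0
c_kept = gru_step(0.8, 1.0, params)
```

With the update gate driven close to 0, `c_kept` stays essentially equal to the previous state 0.8, so information can be carried across many time steps.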

<a name = "LSTM"></a>
### Long Short Term Memory
<div style = "text-align: center;">
    <img src="./images/LSTM.png" style="width:100%;" >
</div><br>

Unlike GRUs, **Long Short Term Memory (LSTM)** units pass two states to the next time step, the **long-term memory** $c^{<t>}$ and the **short-term memory** $a^{<t>}$, and use three gates to govern their updating. In GRUs, $c^{<t>} = a^{<t>}$.
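
A scalar sketch of one LSTM step under a commonly used formulation with update, forget, and output gates. The gate names, the parameter dictionary `p`, and its values are assumptions of this sketch and may differ in detail from the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(a_prev, c_prev, x, p):
    """One scalar LSTM step; returns (a, c). `p` holds hypothetical parameters."""
    def gate(name):
        return sigmoid(p["w_" + name + "a"] * a_prev + p["w_" + name + "x"] * x + p["b_" + name])

    gamma_u, gamma_f, gamma_o = gate("u"), gate("f"), gate("o")
    c_tilde = math.tanh(p["w_ca"] * a_prev + p["w_cx"] * x + p["b_c"])
    c = gamma_u * c_tilde + gamma_f * c_prev     # long-term memory
    a = gamma_o * math.tanh(c)                   # short-term memory
    return a, c

params = {k: 0.5 for k in ["w_ua", "w_ux", "w_fa", "w_fx", "w_oa", "w_ox", "w_ca", "w_cx"]}
# Update gate near 0 and forget gate near 1: the cell should keep its memory.
params.update({"b_u": -30.0, "b_f": 30.0, "b_o": 0.0, "b_c": 0.0})
a_next, c_next = lstm_step(0.1, 0.8, 1.0, params)
```

Because the update and forget gates are separate here, the cell can keep its old memory and ignore the candidate independently, whereas the GRU sketch ties the two through a single gate.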

## Advanced Models
<a name = "Attention"></a>
### Attention Mechanism
<div style = "text-align: center;">
    <img src="./images/attention.png" style="width:100%;" >
</div><br>

In addition to its use in machine translation, the attention mechanism has also been adapted to the task of **image captioning** by applying an attention mechanism to the visual input.

<div class ="alert alert-block alert-warning">
    <b>Drawback of the attention model:</b> Computing the attention weights has <b>quadratic</b> cost, $T_x \times T_y$, since a score is needed for every pair of input and output positions. This calls for a relatively simple neural network to compute $e^{<t,t^{\prime}>}$. Arguably, as long as the sentences in the translation task are not very long, the quadratic cost is acceptable. Research on reducing the computation cost is ongoing.
</div>
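
A minimal sketch of how the scores $e^{<t,t^{\prime}>}$ might be turned into attention weights and context vectors, assuming the weights are a softmax over input positions and the context is the weighted sum of encoder states. The scores and encoder states below are made-up toy values; in practice the scores come from a small neural network.

```python
import math

def attention_weights(scores):
    """Softmax over the input positions t' for one output step t."""
    m = max(scores)                               # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of the encoder states for one output step."""
    dim = len(encoder_states[0])
    return [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T_x = 3
scores = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]             # T_y = 2 rows of T_x scores
weights = [attention_weights(row) for row in scores]
contexts = [context_vector(w, encoder_states) for w in weights]
n_scores = sum(len(row) for row in scores)              # T_x * T_y score evaluations
```

The score table has one entry per (output step, input position) pair, which is where the $T_x \times T_y$ cost comes from.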