![Banner](img/AI_Special_Program_Banner.jpg)

# Recurrent Neural Networks (RNN) - Material 1: Some Background
---

The material presented is adapted from *Python Machine Learning 3rd Edition* by [Sebastian Raschka](https://sebastianraschka.com) & [Vahid Mirjalili](http://vahidmirjalili.com), Packt Publishing Ltd. 2019 (code available on [GitHub](https://github.com/rasbt/python-machine-learning-book-3rd-edition)).

## Recurrent Neural Networks (RNN)
---

In addition to the CNNs presented previously, there are a number of other popular neural network architectures that address a variety of real-world applications.

Many of these applications deal with so-called **sequential data**, where the task is, for example, to predict the next element(s) of a given sequence. **Recurrent Neural Networks** are a group of machine learning algorithms that is particularly well suited to such tasks.

Arguably, the most popular research and application domains for RNNs are found in **Natural Language Processing (NLP)** and **Time Series Analysis**. While you will dive even deeper into NLP as we go along, *Time Series Analysis* is beyond the scope of this course.

To start our NLP adventure, we focus here on the basic ideas behind RNNs and on developing a solid understanding of how these networks work.

As usual, we will also do some hands-on implementations using `PyTorch`.

---

## Overview
- [Modeling sequential data](#Modeling-sequential-data)
  - [Representing sequences](#Representing-sequences)
  - [Sequence modeling variations](#Sequence-modeling-variations)
    - [Examples](#Examples)
- [RNNs for modeling sequences](#RNNs-for-modeling-sequences)
  - [RNN looping mechanism](#RNN-looping-mechanism)
  - [Computing activations in an RNN](#Computing-activations-in-an-RNN)
  - [Alternate recurrence models](#Alternate-recurrence-models)
  - [Long short-term memory cells (LSTM)](#Long-short-term-memory-cells-(LSTM))
 
$\rightarrow$ &nbsp; [Project 1: Sentiment Analysis](3.3.b_RNN_2_Sentiment.ipynb)

---

## Modeling sequential data
---
* Special properties of sequential data
  + Previous assumption:  independent and identically distributed data
    * Spam detection (text): what does this assumption mean, and is it realistic?
    * Time series: e.g., stock market values $\rightarrow$ assumption is not valid
  + Sequential data: **order** matters!
* Some sort of **memory** is necessary to work with sequential data
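
As a small illustration of why order matters, the following sketch compares two toy sentences (chosen here purely for illustration) once as unordered word counts and once as ordered sequences. Only the ordered view can tell them apart.

```python
from collections import Counter

# Two sentences built from the same words but with different meaning.
s1 = "dog bites man".split()
s2 = "man bites dog".split()

# An order-agnostic (bag-of-words) view cannot distinguish them ...
same_bag = Counter(s1) == Counter(s2)

# ... whereas the ordered sequences differ.
same_sequence = s1 == s2

print("same bag of words:", same_bag)
print("same sequence:    ", same_sequence)
```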

### Representing sequences

Consider a *sequence* $[\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(T)}]$ of length $T$, where the superscript represents the *order* of the elements, e.g.
* time-series: $\mathbf{x}^{(t)}$ represents inputs at time $t$
* text: $\mathbf{x}^{(t)}$ represents
  - $t$-th word
  - $t$-th character

In figure:
* $\mathbf{x}^{(t)}$: input features, $y^{(t)}$: target labels

<img src="./img/16_01.png" width=700/>
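
In code, such a sequence is typically stored as a tensor with one row per time step. The sketch below assumes a hypothetical character-level encoding of the toy text `"hello"` with one-hot feature vectors; the variable names are illustrative only.

```python
import torch

# Hypothetical toy example: character-level encoding of a short text.
text = "hello"
vocab = sorted(set(text))                      # ['e', 'h', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

# One integer index per time step, shape (T,)
indices = torch.tensor([char_to_idx[ch] for ch in text])

# One-hot feature vector x^(t) per time step, shape (T, num_features)
x = torch.nn.functional.one_hot(indices, num_classes=len(vocab)).float()

# Add a batch dimension: (batch, T, num_features),
# the layout PyTorch recurrent layers expect with batch_first=True.
x_batch = x.unsqueeze(0)
print(x_batch.shape)
```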

### Sequence modeling variations
* many $\equiv$ sequence, one $\equiv$ fixed-size vector or scalar (aka rank-1 / rank-0 [*tensor*](https://pytorch.org/tutorials/beginner/introyt/tensors_deeper_tutorial.html); following this link is highly recommended!)
<img src="./img/16_02.png" width=700/>

#### Examples

* Many-to-one: e.g., *sentiment analysis* using text data as input and a class label as output
* One-to-many: e.g., *image captioning* - input is an image and output is an image description (text)
* Many-to-many: 
  * synchronized: e.g., video classification, where each frame in a video is labeled
  * delayed: e.g., translating entire sentences from one language to another
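
The difference between many-to-one and synchronized many-to-many can be sketched with `torch.nn.RNN`, which returns both the hidden state at every time step and the final hidden state. All sizes and the two linear "heads" below are arbitrary assumptions for illustration, not part of the original material.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, T, n_features, n_hidden = 3, 7, 4, 5
x = torch.randn(batch_size, T, n_features)

rnn = nn.RNN(input_size=n_features, hidden_size=n_hidden, batch_first=True)
# out_seq: hidden state at every time step, shape (batch, T, n_hidden)
# h_last:  hidden state after the last step, shape (num_layers, batch, n_hidden)
out_seq, h_last = rnn(x)

# Many-to-many (synchronized): one prediction per time step
per_step_head = nn.Linear(n_hidden, 2)
many_to_many = per_step_head(out_seq)       # (batch, T, 2)

# Many-to-one: a single prediction from the final hidden state
final_head = nn.Linear(n_hidden, 1)
many_to_one = final_head(h_last[-1])        # (batch, 1)

print(many_to_many.shape, many_to_one.shape)
```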

## RNNs for modeling sequences
---
* typical RNN structure includes a *recurrent* (looping) component
* hidden layer receives its input from both the input layer of the current time step **and** the hidden layer from the previous time step

###  RNN looping mechanism
* Feeding results from previous time steps back into the network provides memory
* outputs may be the *tensor* $\mathbf{o}^{(T)}$ or the *sequence*  $[\mathbf{o}^{(1)},\mathbf{o}^{(2)},\dots,\mathbf{o}^{(T)}]$

<img src="./img/16_03.png" width=700/>

* multiple hidden layers possible
* representation can be *unfolded* for better understanding

<img src="./img/16_04.png" width=600/>

$\rightarrow$ hidden units now receive *two* inputs:
* net preactivation from input layer (as in standard ANNs)
* activation of hidden layer from *previous time step* (additionally)
* in multi-layer case, hidden layer $\mathbf{h}_l^{(t)}$ receives input from
  + output $\mathbf{o}_{l-1}^{(t)}$ of *previous layer* at *current time step*
  + $\mathbf{h}_l^{(t-1)}$ hidden values from *previous time step*
* practical consequence: *inner* recurrent layers *must return a sequence as output*
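
The practical consequence for stacked layers can be sketched as follows: the inner layer hands its full output sequence to the next layer, which needs one input per time step. The layer sizes are arbitrary assumptions; `nn.RNN(..., num_layers=2)` is shown as the built-in way to stack layers of equal hidden size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 6, 3)                     # (batch, T, features)

# Manual stacking: two single-layer RNNs
layer1 = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
layer2 = nn.RNN(input_size=8, hidden_size=4, batch_first=True)

seq1, _ = layer1(x)        # full sequence of layer-1 hidden states: (2, 6, 8)
seq2, h2 = layer2(seq1)    # layer 2 consumes one layer-1 vector per time step

# Built-in stacking: PyTorch passes the inner sequence along internally
stacked = nn.RNN(input_size=3, hidden_size=4, num_layers=2, batch_first=True)
out, h_n = stacked(x)      # h_n holds the final hidden state of each layer

print(seq1.shape, seq2.shape, out.shape, h_n.shape)
```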

### Computing activations in an RNN

* separate weight matrices for the different connections
  * $\mathbf{W}_{hh}$: the weight matrix associated with the recurrent edge (from old hidden layer to new one)
  * $\mathbf{W}_{xh}$: the weight matrix between the (new) input and the hidden layer
  * $\mathbf{W}_{ho}$: the weight matrix between the hidden layer and the output layer
* sometimes $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ concatenated to $\mathbf{W}_h$:
  $\mathbf{W}_h = [\mathbf{W}_{xh},\mathbf{W}_{hh}]$
* weight updates and activation computations are handled behind the scenes when using `PyTorch`

<img src="./img/16_05.png" width=700/>
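
A minimal sketch of the forward pass, assuming the common formulation $\mathbf{h}^{(t)} = \tanh(\mathbf{W}_{xh}\mathbf{x}^{(t)} + \mathbf{W}_{hh}\mathbf{h}^{(t-1)} + \mathbf{b}_h)$ with a zero initial hidden state. It recomputes the hidden states by hand from the weights of an `nn.RNN` layer and compares them with the layer's own result. Sizes and the random input are arbitrary choices. Note that `nn.RNN` returns the hidden states themselves; $\mathbf{W}_{ho}$ would be a separate layer (e.g., `nn.Linear`) on top. PyTorch keeps two bias vectors, which together play the role of $\mathbf{b}_h$.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

w_xh = rnn_layer.weight_ih_l0    # input-to-hidden weights, shape (2, 5)
w_hh = rnn_layer.weight_hh_l0    # hidden-to-hidden weights, shape (2, 2)
b_xh = rnn_layer.bias_ih_l0      # shape (2,)
b_hh = rnn_layer.bias_hh_l0      # shape (2,)

x_seq = torch.randn(3, 5)        # T = 3 time steps, 5 features each

with torch.no_grad():
    # Reference result from the layer itself, shape (1, 3, 2)
    reference, _ = rnn_layer(x_seq.unsqueeze(0))

    # Manual recurrence, starting from a zero hidden state
    h_prev = torch.zeros(2)
    manual = []
    for t in range(x_seq.shape[0]):
        z_t = x_seq[t] @ w_xh.T + b_xh + h_prev @ w_hh.T + b_hh
        h_prev = torch.tanh(z_t)
        manual.append(h_prev)
    manual = torch.stack(manual)  # shape (3, 2)

print(torch.allclose(manual, reference.squeeze(0), atol=1e-5))
```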

### Alternate recurrence models

* so far, only *hidden recurrence*, i.e., hidden layer has recurrent property
* also possible: *output recurrence*, i.e., recurrent connection from output layer $\mathbf{o}^{(t-1)}$
  + to current hidden layer $\mathbf{h}^{(t)}$ or
  + to current output layer $\mathbf{o}^{(t)}$
* depending on recurrence model, need weight matrix
  * $\mathbf{W}_{hh}$: for hidden recurrence
  * $\mathbf{W}_{oh}$: for output recurrence (output-to-hidden)
  * $\mathbf{W}_{oo}$: for output recurrence (output-to-output)
* sometimes written as $\mathbf{W}_{\text{rec}}$ in catch-all fashion
  

<img src="./img/16_07.png" width=700/>
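
Output-to-hidden recurrence is not available as a ready-made PyTorch layer, so the sketch below writes the loop by hand. It is an illustration under stated assumptions only: the weights are random and hypothetical, biases are omitted, the hidden activation is `tanh`, and the output layer is linear.

```python
import torch

torch.manual_seed(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5

# Hypothetical weight matrices, randomly initialized for illustration only
W_xh = torch.randn(n_hidden, n_in) * 0.1     # input-to-hidden
W_oh = torch.randn(n_hidden, n_out) * 0.1    # output-to-hidden (recurrent edge)
W_ho = torch.randn(n_out, n_hidden) * 0.1    # hidden-to-output

x_seq = torch.randn(T, n_in)
o_prev = torch.zeros(n_out)                  # no previous output at the first step

outputs = []
for t in range(T):
    # The hidden layer sees the current input and the *previous output*
    h_t = torch.tanh(W_xh @ x_seq[t] + W_oh @ o_prev)
    o_prev = W_ho @ h_t
    outputs.append(o_prev)

outputs = torch.stack(outputs)               # shape (T, n_out)
print(outputs.shape)
```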

### Long short-term memory cells (LSTM)

* Introduced by [Hochreiter and Schmidhuber (1997)](https://www.researchgate.net/publication/13853244_Long_Short-term_Memory)
* designed to mitigate vanishing gradients, which, together with exploding gradients, present a *major problem* when learning long-range interactions (see [Pascanu et al.: *vanishing* or *exploding gradients*](https://arxiv.org/pdf/1211.5063.pdf))
* Representation of / replacement for hidden layer
* Cell has 
  + *state* $\mathbf{C}^{(t)}$ (for which a *candidate value* $\tilde{\mathbf{C}}^{(t)}$ is first computed), 
  + gates (*forget gate* $\mathbf{f}^{(t)}$, *input gate* $\mathbf{i}^{(t)}$, and *output gate* $\mathbf{o}^{(t)}$) as well as
  + *hidden units* $\mathbf{h}^{(t)}$
* Implementation is (once again) provided by `PyTorch`

<img src="./img/16_09.png" width=700/>
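
To make the roles of state, gates, and hidden units concrete, the following sketch recomputes a single LSTM step by hand and compares it with `nn.LSTMCell`. It assumes the standard LSTM update, $\mathbf{C}^{(t)} = \mathbf{f}^{(t)} \odot \mathbf{C}^{(t-1)} + \mathbf{i}^{(t)} \odot \tilde{\mathbf{C}}^{(t)}$ and $\mathbf{h}^{(t)} = \mathbf{o}^{(t)} \odot \tanh(\mathbf{C}^{(t)})$, as implemented by PyTorch. Sizes and inputs are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_hidden = 3, 4
cell = nn.LSTMCell(input_size=n_in, hidden_size=n_hidden)

x_t = torch.randn(1, n_in)           # one time step, batch of 1
h_prev = torch.randn(1, n_hidden)    # previous hidden units h^(t-1)
c_prev = torch.randn(1, n_hidden)    # previous cell state C^(t-1)

with torch.no_grad():
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

    # All four pre-activations at once; PyTorch stacks the weights
    # in the order (input gate, forget gate, candidate, output gate).
    z = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    z_i, z_f, z_g, z_o = z.chunk(4, dim=1)

    i_t = torch.sigmoid(z_i)         # input gate
    f_t = torch.sigmoid(z_f)         # forget gate
    c_tilde = torch.tanh(z_g)        # candidate value for the cell state
    o_t = torch.sigmoid(z_o)         # output gate

    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    h_t = o_t * torch.tanh(c_t)          # new hidden units

print(torch.allclose(h_t, h_ref, atol=1e-5), torch.allclose(c_t, c_ref, atol=1e-5))
```

For whole sequences, `nn.LSTM` applies this step across all time steps and returns the output sequence together with the final hidden and cell states.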

$\rightarrow$ let's try this in practice on [Sentiment analysis](3.3.b_RNN_2_Sentiment.ipynb)