# Recurrent Neural Networks (RNN)

<hr>

**Modeling sequences/temporal problems**<br>

Common problems:

- Forward predictions in time-series data
- Language modeling: what word comes next given a sentence?

Common challenges include deciding how many steps back to look and how to retain important, meaningful information from positions far in the past. Both make it hard to hand-engineer a mapping from the *history* into a vector representation. RNNs address some of these issues by learning that mapping instead of requiring it to be engineered.

****

**Encoding with RNN**

*We can encode everything - images, sentences, words, even events... and represent it in a new way using vectors*

$s_t = \tanh ( W^{s, s} s_{t-1} + W^{s, x} x_t )$

where $W^{s, s}$ and $W^{s, x}$ are the weight matrices that transform the previous state, $s_{t-1}$, and the new information, $x_t$, into a vector that is passed through the activation function to produce the new state, $s_t$, at time $t$.

<img alt="RNN Encoding" src="assets/rnn_transformation.png" width="300">
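As a minimal sketch of the update above, the following NumPy snippet applies one encoding step. The sizes, the random weights and the helper name `rnn_step` are assumptions made for illustration only.

```python
import numpy as np

def rnn_step(s_prev, x_t, W_ss, W_sx):
    """One encoder step: s_t = tanh(W_ss @ s_prev + W_sx @ x_t)."""
    return np.tanh(W_ss @ s_prev + W_sx @ x_t)

rng = np.random.default_rng(0)
state_dim, input_dim = 4, 3              # illustrative sizes (assumption)
W_ss = rng.normal(scale=0.5, size=(state_dim, state_dim))
W_sx = rng.normal(scale=0.5, size=(state_dim, input_dim))

s_prev = np.zeros(state_dim)             # previous state s_{t-1}
x_t = np.array([1.0, 0.0, 0.0])          # new information x_t
s_t = rnn_step(s_prev, x_t, W_ss, W_sx)  # new state s_t
print(s_t.shape)  # (4,)
```

Because of the $\tanh$, every entry of the new state lies strictly between -1 and 1.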



For a given sentence, we start with a zero-vector state, feed in one word per step to update the state, and continue through the rest of the sentence. The final state is a vector that represents the whole sentence.

Each new word corresponds to a new layer in this network. Critically, every layer shares the same parameters and applies the same transformation to produce the next state.

<img alt="RNN Encoding" src="assets/sentence_transformation.png" width="500">
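The sentence encoding described above can be sketched as a loop that reuses the same weights at every step. The toy vocabulary, the random word vectors in `embed` and the function name `encode` are hypothetical and chosen only for this example.

```python
import numpy as np

def encode(tokens, embed, W_ss, W_sx):
    """Fold a token sequence into one vector using shared weights."""
    s = np.zeros(W_ss.shape[0])          # start from the zero-vector state
    for tok in tokens:                   # one "layer" per word
        s = np.tanh(W_ss @ s + W_sx @ embed[tok])
    return s                             # final state represents the sentence

rng = np.random.default_rng(1)
vocab = ["the", "cat", "sat"]            # toy vocabulary (assumption)
embed = {w: rng.normal(size=3) for w in vocab}
W_ss = rng.normal(scale=0.5, size=(5, 5))
W_sx = rng.normal(scale=0.5, size=(5, 3))

v1 = encode(["the", "cat", "sat"], embed, W_ss, W_sx)
v2 = encode(["sat", "cat", "the"], embed, W_ss, W_sx)
print(v1.shape, np.allclose(v1, v2))
```

The output vector has a fixed size regardless of sentence length, and the two word orders give different vectors because each step depends on the state built so far.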

****

**Gating and LSTM**

Now, we introduce the idea of gating. 

A gate vector, $g_t$ of the same dimension as $s_t$, determines *how much information to overwrite in the next state*.

$g_t = \text{sigmoid}(W^{g, s} s_{t-1} + W^{g, x} x_t)$

where every element of $g_t$ lies in the open interval $(0, 1)$. The gate is then applied in the computation of $s_t$ below:

$s_t = (1 - g_t) \odot s_{t-1} + g_t \odot \tanh (W^{s,s} s_{t-1} + W^{s, x} x_t)$

where $\odot$ represents element-wise multiplication between the two vectors.

In other words, $g_t$ controls how much of the state is overwritten by the new candidate value and, equivalently, how much of the previous state is retained: where an element of $g_t$ is close to 0 the old value is kept, and where it is close to 1 it is replaced.
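To make the role of the gate concrete, here is a small sketch of the gated update. The gate weights are set by hand (an assumption purely for demonstration) so that the gate is pushed to almost fully closed and almost fully open; the name `gated_step` is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x_t, W_gs, W_gx, W_ss, W_sx):
    """Gated update: interpolate between the old state and a candidate."""
    g_t = sigmoid(W_gs @ s_prev + W_gx @ x_t)
    candidate = np.tanh(W_ss @ s_prev + W_sx @ x_t)
    s_t = (1.0 - g_t) * s_prev + g_t * candidate
    return s_t, g_t

rng = np.random.default_rng(0)
d, n = 4, 3                               # illustrative sizes (assumption)
W_ss = rng.normal(scale=0.5, size=(d, d))
W_sx = rng.normal(scale=0.5, size=(d, n))
W_gs = np.zeros((d, d))
s_prev = rng.uniform(-1, 1, size=d)
x_t = np.ones(n)

# Hand-picked gate weights that drive the gate to its extremes.
W_closed = np.full((d, n), -20.0)         # gate close to 0: keep old state
W_open = np.full((d, n), 20.0)            # gate close to 1: overwrite

s_keep, g_closed = gated_step(s_prev, x_t, W_gs, W_closed, W_ss, W_sx)
s_new, g_open = gated_step(s_prev, x_t, W_gs, W_open, W_ss, W_sx)
candidate = np.tanh(W_ss @ s_prev + W_sx @ x_t)

print(np.allclose(s_keep, s_prev))        # True
print(np.allclose(s_new, candidate))      # True
```

In a trained network the gate weights are learned, so each element of the state can end up anywhere between these two extremes.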

An *LSTM* (long short-term memory) unit adds model complexity by applying multiple gates that control how information is read from the previous state and written into the next state:

<img alt="LSTM Gating" src="assets/lstm.jpg" width="500">
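The notes do not spell out the LSTM equations, so the sketch below follows the commonly used formulation with forget, input and output gates and a separate cell state. This is an assumption, and the variant in the figure may differ. Biases are omitted, and the names `lstm_step` and the keys of `W` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W):
    """One step of a standard LSTM cell (biases omitted for brevity)."""
    z = np.concatenate([h_prev, x_t])     # previous visible state and input
    f = sigmoid(W["f"] @ z)               # forget gate: what to keep in the cell
    i = sigmoid(W["i"] @ z)               # input gate: what to write to the cell
    o = sigmoid(W["o"] @ z)               # output gate: what to expose
    c_tilde = np.tanh(W["c"] @ z)         # candidate cell content
    c_t = f * c_prev + i * c_tilde        # updated memory cell
    h_t = o * np.tanh(c_t)                # updated visible state
    return h_t, c_t

rng = np.random.default_rng(0)
d, n = 4, 3                               # illustrative sizes (assumption)
W = {k: rng.normal(scale=0.5, size=(d, d + n)) for k in ("f", "i", "o", "c")}

h, c = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(5, n)):       # a toy sequence of five inputs
    h, c = lstm_step(h, c, x_t, W)
print(h.shape, c.shape)  # (4,) (4,)
```

Compared with the single gate $g_t$, the forget and input gates here are independent, so keeping old information and writing new information are controlled separately.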

****

**Using Markov Models as neural networks to decode sequences**

Let $V$ denote the set of possible words/symbols, with $w \in V$ a single symbol. $V$ includes:

- UNK symbol for any unknown word (out of vocabulary)
- &lt;beg&gt; symbol for specifying the start of a sentence
- &lt;end&gt; symbol for specifying the end of a sentence

In a first-order Markov model (*bigram model*), the next symbol depends only on the previous one. A $k$th-order model depends on the previous $k$ symbols.
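As an illustration of a bigram model, the sketch below estimates next-word probabilities by relative frequency counts. The estimator and the toy corpus are assumptions (the notes do not specify either), and unknown words are not mapped to UNK here to keep the example short.

```python
from collections import Counter, defaultdict

def bigram_probs(sentences):
    """Estimate P(next | previous) from counts of adjacent symbol pairs."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<beg>"] + sent + ["<end>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

corpus = [                                # toy corpus (assumption)
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
probs = bigram_probs(corpus)
print(probs["the"])   # P(cat | the) = 2/3, P(dog | the) = 1/3
print(probs["cat"])   # P(sat | cat) = 0.5, P(ran | cat) = 0.5
```

Each row of `probs` is one row of the transition matrix: a distribution over the next symbol given only the previous one.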

We can also extend the Markov model into a feedforward neural network. We can calculate the values at each output unit using the following:

$z_k = \sum_{j} x_j W_{jk} + W_{0k}$

where $z_k$ is the value computed at output unit $k$, $x_j$ is the $j$th component of the one-hot vector encoding the prior word, $W_{jk}$ is the weight connecting input $j$ to output $k$, and $W_{0k}$ is the bias of output unit $k$.

To transform each $z_k$ to conditional probabilities, we can use a *softmax activation function*:

$p_k = \sigma(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}}$

<img alt="Markov Feedforward Model" src="assets/markov_feedforward.png" width="500">
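A minimal sketch of this feedforward bigram model follows: a one-hot input, a linear layer with a bias term, and a softmax over the vocabulary. The weights are random rather than trained, and the toy vocabulary and the name `next_word_probs` are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["<beg>", "<end>", "UNK", "the", "cat"]   # toy vocabulary (assumption)
V = len(vocab)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, V))               # W[j, k]: input j to output k
b = rng.normal(size=V)                    # one bias per output unit

def next_word_probs(prev_word):
    x = np.zeros(V)
    x[vocab.index(prev_word)] = 1.0       # one-hot encoding of the prior word
    z = x @ W + b                         # z_k = sum_j x_j W_jk + bias_k
    return softmax(z)

p = next_word_probs("the")
print(p.sum())
```

Because the input is one-hot, `x @ W` simply selects the row of `W` that belongs to the prior word, so each row plays the role of one row of the bigram transition matrix before normalization.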


Transforming a Markov language model into a feedforward neural network allows us to model a $k$th-order Markov model by feeding in the previous $k$ words. It also allows us to increase model complexity by inserting hidden layers between the input and the final output probabilities.

Below is an example of a trigram model:

<img alt="2nd-order Markov Feedforward Model" src="assets/second_order_markov_feedforward.png" width="500">


In addition, the neural network needs fewer parameters. For a trigram language model over a 10-word vocabulary, a Markov language model must estimate $10^3$ parameters, i.e. the entries of the transition matrix. A feedforward neural network has an input layer of size 20 (two one-hot vectors of size 10) and an output layer of size 10, which gives a weight matrix of 200 parameters plus 10 parameters for the bias vector, far fewer than the Markov model.
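The parameter counts above can be checked with a few lines of arithmetic. The helper names `markov_params` and `feedforward_params` are hypothetical, and the feedforward count assumes no hidden layer, as in the example.

```python
def markov_params(vocab_size, order):
    """Entries in the transition table: one per (context, next word) pair."""
    return vocab_size ** (order + 1)

def feedforward_params(vocab_size, order):
    """Weights plus biases for a network with no hidden layer."""
    weights = (order * vocab_size) * vocab_size   # stacked one-hot inputs
    biases = vocab_size
    return weights + biases

# Trigram model = 2nd-order Markov model over a 10-word vocabulary.
print(markov_params(10, 2))        # 1000
print(feedforward_params(10, 2))   # 210
```

The gap widens quickly with vocabulary size: for 10,000 words the table has $10^{12}$ entries, while the network has about $2 \times 10^8$ parameters.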


****

**Combining this with everything we know so far**

Given the prior word as the input vector, $x_t = \phi(w_{t-1})$, we can use this new information together with the previous state to compute the next state, $s_t$:

$s_t = \tanh(W^{s,s} s_{t-1} + W^{s, x} x_t)$

where $s_t$ summarizes all the relevant information from the first $t$ words.

Because the hidden layer computes the state at time $t$ from the previous state and the current input, the state carries information about all prior words as well as the current one, and this is what we use to predict the next word.

We can then use this computed state, $s_t$, to compute the output distribution, $p_t$, over all possible words.

$p_t = \text{softmax}(W^o s_t)$

where $W^o$ is the weight matrix that is multiplied by the current state to extract the features relevant for the prediction, while the softmax function transforms the result into a probability distribution: each entry lies in $[0, 1]$ and the entries sum to 1.

<img alt="Sequence Feedforward Model" src="assets/rnn_language_sequence.png" width="500">
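Putting the state update and the output layer together, one step of the RNN language model can be sketched as follows. The feature map $\phi$ is assumed here to be a one-hot encoding, the weights are random rather than trained, and the name `rnn_lm_step` is illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(s_prev, x_t, W_ss, W_sx, W_o):
    """Update the state, then turn it into a distribution over words."""
    s_t = np.tanh(W_ss @ s_prev + W_sx @ x_t)
    p_t = softmax(W_o @ s_t)
    return s_t, p_t

vocab = ["<beg>", "<end>", "UNK", "the", "cat"]   # toy vocabulary (assumption)
V, d = len(vocab), 6
rng = np.random.default_rng(0)
W_ss = rng.normal(scale=0.5, size=(d, d))
W_sx = rng.normal(scale=0.5, size=(d, V))
W_o = rng.normal(scale=0.5, size=(V, d))

s = np.zeros(d)
for prev_word in ["<beg>", "the"]:        # words seen so far
    x = np.zeros(V)
    x[vocab.index(prev_word)] = 1.0       # x_t = phi(w_{t-1}), one-hot here
    s, p = rnn_lm_step(s, x, W_ss, W_sx, W_o)
print(p.shape, p.sum())
```

After the loop, `p` is the distribution over the word that follows "the", conditioned on everything fed in so far through the state `s`.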


This is the crucial difference between an RNN and a feedforward NN: the RNN takes its own previous state as an input, which makes it *recurrent*. The RNN can therefore learn which parts of the sentence are relevant, wherever in the sentence they occur.

<hr>

# Basic code
A `minimal, reproducible example`
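One possible minimal example, under stated assumptions, is a decoding loop that samples a sequence from an RNN language model, starting at the &lt;beg&gt; symbol and stopping at &lt;end&gt;. The weights are random and untrained, so the output is meaningless text; the vocabulary, sizes and the name `generate` are all illustrative.

```python
import numpy as np

vocab = ["<beg>", "<end>", "UNK", "the", "cat", "sat"]  # toy vocabulary
V, d = len(vocab), 8
rng = np.random.default_rng(0)            # fixed seed for reproducibility
W_ss = rng.normal(scale=0.5, size=(d, d))
W_sx = rng.normal(scale=0.5, size=(d, V))
W_o = rng.normal(scale=0.5, size=(V, d))

def softmax(z):
    z = z - np.max(z)                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate(max_len=10):
    """Sample words one at a time, feeding each one back in as input."""
    s = np.zeros(d)                       # zero-vector initial state
    word = vocab.index("<beg>")
    sentence = []
    for _ in range(max_len):
        x = np.zeros(V)
        x[word] = 1.0                     # one-hot encoding of the prior word
        s = np.tanh(W_ss @ s + W_sx @ x)  # state update
        p = softmax(W_o @ s)              # distribution over the next word
        word = rng.choice(V, p=p)         # sample the next word index
        if vocab[word] == "<end>":
            break
        sentence.append(vocab[word])
    return sentence

sentence = generate()
print(sentence)
```

Training the weights, for example by minimizing the cross-entropy between `p` and the observed next word, is what would turn this sampler into a useful language model.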