# Recurrent Neural Networks (RNN)

<hr>

**Modeling sequences/temporal problems**<br>

Common problems:

- Forward predictions in time-series data
- Language modeling: what word comes next given a sentence?

Common challenges include deciding how many steps back to look and how to retain important, meaningful information from positions far back in the sequence. Both make it hard to hand-engineer a mapping from the *history* to a vector representation. RNNs address some of these issues by learning that mapping rather than requiring it to be engineered by hand.

****

**Encoding with RNN**

*We can encode everything - images, sentences, words, even events... and represent it in a new way using vectors*

$s_t = \tanh ( W^{s, s} s_{t-1} + W^{s, x} x_t )$

where $W^{s, s}$ and $W^{s, x}$ are the weight matrices that transform the previous state, $s_{t-1}$, and the new information, $x_t$. The sum of the two transformed vectors feeds into the activation function to produce the new state, $s_t$, at time $t$.
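As a minimal sketch of this state update, the snippet below applies the formula once using plain Python lists. The helper names (`matvec`, `rnn_step`), the toy dimensions, and the weight values are assumptions made for illustration; they are not taken from any particular library.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_step(s_prev, x_t, W_ss, W_sx):
    """One state update: s_t = tanh(W_ss s_{t-1} + W_sx x_t)."""
    from_state = matvec(W_ss, s_prev)
    from_input = matvec(W_sx, x_t)
    return [math.tanh(a + b) for a, b in zip(from_state, from_input)]

# Hypothetical toy sizes: 2-dimensional state, 3-dimensional input.
W_ss = [[0.5, -0.1], [0.2, 0.3]]
W_sx = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]

s_prev = [0.0, 0.0]      # previous state s_{t-1}
x_t = [1.0, 0.0, 2.0]    # new information x_t
s_t = rnn_step(s_prev, x_t, W_ss, W_sx)
print(s_t)
```

Because `tanh` squashes each component, every entry of the new state stays strictly between -1 and 1 regardless of the input scale.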

<img alt="RNN Encoding" src="assets/rnn_transformation.png" width="300">



For a given sentence, we start with a zero-vector state and feed in one word per step, each producing a new state. After the last word, the final state is a vector that represents the whole sentence.

Each new word corresponds to a new layer in the unrolled network. Critically, every layer shares the same parameters and applies the same transformation to produce the next state.
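The sentence encoding described above can be sketched as a loop that reuses one pair of weight matrices at every step. The one-hot vocabulary, the `encode` function, and the weight values are hypothetical choices for illustration; a real model would typically use learned word embeddings and trained weights.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def encode(inputs, W_ss, W_sx):
    """Fold a sequence of input vectors into a single state vector."""
    state = [0.0] * len(W_ss)          # start from the zero-vector state
    for x_t in inputs:                 # one "layer" per word
        from_state = matvec(W_ss, state)
        from_input = matvec(W_sx, x_t)
        # The same W_ss and W_sx are applied at every step.
        state = [math.tanh(a + b) for a, b in zip(from_state, from_input)]
    return state

# Hypothetical three-word vocabulary with one-hot word vectors.
vocab = {"the": 0, "cat": 1, "sat": 2}

def one_hot(word):
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec

W_ss = [[0.5, -0.1], [0.2, 0.3]]
W_sx = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]

sentence_vec = encode([one_hot(w) for w in ["the", "cat", "sat"]], W_ss, W_sx)
reversed_vec = encode([one_hot(w) for w in ["sat", "cat", "the"]], W_ss, W_sx)
print(sentence_vec)
print(reversed_vec)
```

The two printed vectors differ even though both sentences contain the same words, which illustrates that the encoding depends on word order.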

<img alt="RNN Encoding" src="assets/sentence_transformation.png" width="500">

****

**Gating and LSTM**

Now, we introduce the idea of gating. 

A gate vector, $g_t$ of the same dimension as $s_t$, determines *how much information to overwrite in the next state*.

$g_t = \operatorname{sigmoid}(W^{g, s} s_{t-1} + W^{g, x} x_t)$

where each component of $g_t$ lies in the open interval $(0, 1)$, since the sigmoid never outputs exactly 0 or 1. The gate is then applied in the computation of $s_t$ below:

$s_t = (1 - g_t) \odot s_{t-1} + g_t \odot \tanh (W^{s,s} s_{t-1} + W^{s, x} x_t)$

where $\odot$ represents element-wise multiplication between the two vectors.

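This essentially means that $g_t$ controls how much of the state is overwritten by the newly computed value or, equivalently, how much of the previous state is retained. Where a component of $g_t$ is close to 0, the previous state is carried over almost unchanged; where it is close to 1, it is almost entirely replaced.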
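To make the gating behaviour concrete, here is a minimal sketch of one gated update in plain Python. The function name `gated_step`, the argument names, and all weight values are assumptions for illustration only.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_step(s_prev, x_t, W_gs, W_gx, W_ss, W_sx):
    """Return (s_t, g_t) for one gated state update."""
    # Gate: g_t = sigmoid(W_gs s_{t-1} + W_gx x_t), each entry in (0, 1).
    g = [sigmoid(a + b)
         for a, b in zip(matvec(W_gs, s_prev), matvec(W_gx, x_t))]
    # Candidate: the ungated update tanh(W_ss s_{t-1} + W_sx x_t).
    candidate = [math.tanh(a + b)
                 for a, b in zip(matvec(W_ss, s_prev), matvec(W_sx, x_t))]
    # Element-wise blend of the old state and the candidate.
    s_t = [(1.0 - g_i) * s_i + g_i * c_i
           for g_i, s_i, c_i in zip(g, s_prev, candidate)]
    return s_t, g

# Hypothetical toy sizes: 2-dimensional state, 3-dimensional input.
W_gs = [[0.3, 0.0], [0.0, -0.3]]
W_gx = [[1.0, 0.0, 0.5], [-1.0, 0.0, -0.5]]
W_ss = [[0.5, -0.1], [0.2, 0.3]]
W_sx = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]

s_prev = [0.9, -0.5]
x_t = [1.0, 0.0, 2.0]
s_t, g_t = gated_step(s_prev, x_t, W_gs, W_gx, W_ss, W_sx)
print("gate:", g_t)
print("new state:", s_t)
```

With these toy weights the first gate component comes out large and the second small, so the first state entry is mostly overwritten while the second is mostly retained.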

An *LSTM* adds model complexity by applying multiple gates to control how information is read from the previous state and written into the next state:

<img alt="LSTM Gating" src="assets/lstm.jpg" width="500">
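As a rough sketch of how multiple gates fit together, the snippet below implements one step of the commonly used LSTM formulation with forget, input, and output gates plus a separate cell state. This is an assumption about which variant is meant and may differ in detail from the figure. Biases are omitted to match the equations above, and the names (`lstm_step`, `params`, the gate keys) and the seeded random weights are hypothetical.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_h, W_x, h_prev, x_t):
    """Compute W_h h_{t-1} + W_x x_t (no bias term)."""
    return [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM step; returns the new visible state h_t and cell state c_t."""
    f = [sigmoid(z) for z in affine(*params["forget"], h_prev, x_t)]
    i = [sigmoid(z) for z in affine(*params["input"], h_prev, x_t)]
    o = [sigmoid(z) for z in affine(*params["output"], h_prev, x_t)]
    cand = [math.tanh(z) for z in affine(*params["candidate"], h_prev, x_t)]
    # Forget gate scales the old cell state; input gate scales the candidate.
    c_t = [f_k * c_k + i_k * n_k
           for f_k, c_k, i_k, n_k in zip(f, c_prev, i, cand)]
    # Output gate controls how much of the cell state is exposed.
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t

# Hypothetical toy sizes and seeded random weights, for illustration only.
STATE_DIM, INPUT_DIM = 2, 3
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

params = {
    name: (random_matrix(STATE_DIM, STATE_DIM),
           random_matrix(STATE_DIM, INPUT_DIM))
    for name in ("forget", "input", "output", "candidate")
}

h, c = [0.0] * STATE_DIM, [0.0] * STATE_DIM
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(h, c, x_t, params)
print("h:", h)
print("c:", c)
```

Compared with the single-gate update, the separate forget and input gates let the cell keep old information and add new information independently, rather than trading one off against the other.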


<hr>

# Basic code
A `minimal, reproducible example`