### RNN Structure

<img src="images/RNN.png" alt="RNN Diagram" height="300">

**The weights and biases in an RNN are shared across all time steps**

$$h_t = \tanh(Ux_t + Wh_{t-1} + b_h)$$
$$o_t = Vh_t + b_o$$

$W$ and $U$ are the weight matrices for the previous hidden state and the current input, respectively.
- $W$ controls how much of the last time step's hidden state we want to keep
- $U$ controls how much of the new input we want.

$V$ is the weight matrix that maps the hidden state to the output.

$o_t$ is the output at time $t$.

$b_h$ and $b_o$ are the biases of the hidden state and the output.

The hidden state ($h$) serves as the memory of the network.
- $h_t$ carries information from every time step up to $t$, albeit in compressed form.
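The recurrence above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names `rnn_step` and `rnn_forward`, the sizes (10 inputs, 64 hidden units, 5 outputs, 7 time steps), and the random initialization are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b_h, b_o):
    """One time step: returns the new hidden state h_t and the output o_t."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b_h)
    o_t = V @ h_t + b_o
    return h_t, o_t

def rnn_forward(xs, h0, U, W, V, b_h, b_o):
    """Apply the same parameters at every time step of xs, shape (T, input_size)."""
    h = h0
    outputs = []
    for x_t in xs:
        h, o_t = rnn_step(x_t, h, U, W, V, b_h, b_o)
        outputs.append(o_t)
    return np.stack(outputs), h

# Illustrative sizes and random parameters (assumptions, not from the notes)
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 10, 64, 5, 7
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_o = np.zeros(output_size)

xs = rng.normal(size=(T, input_size))
outputs, h_T = rnn_forward(xs, np.zeros(hidden_size), U, W, V, b_h, b_o)
print(outputs.shape, h_T.shape)  # (7, 5) (64,)
```

Note that `U`, `W`, `V`, `b_h`, and `b_o` are created once and reused inside the loop, which is what sharing weights across time steps means in practice.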

We can also concatenate the hidden state and the input into a single vector $[h_{t-1}, x_t]$ and use one combined weight matrix in place of $W$ and $U$.

Example:

**No concatenation:**
- $x_t: (10,)$
- $h_{t-1}: (64,)$
- $U: (64, 10)$
- $W: (64, 64)$
- $b_h: (64,)$

Multiplications:
- $Ux_t: (64,)$
- $Wh_{t-1}: (64,)$
- $\tanh(Ux_t + Wh_{t-1} + b_h): (64,)$


**With concatenation:**
- $[h_{t-1}, x_t]: (64,) + (10,) = (74,)$
- $W: (64, 74)$, i.e. the original $W$ and $U$ placed side by side
- $b_h: (64,)$

Multiplications:
- $W[h_{t-1}, x_t]: (64,)$
- $\tanh(W[h_{t-1}, x_t] + b_h): (64,)$
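The two formulations give the same hidden state, which can be checked numerically. The sketch below uses the shapes from the example; the random values and the name `W_cat` for the combined matrix are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_t = rng.normal(size=10)
h_prev = rng.normal(size=64)
U = rng.normal(size=(64, 10))
W = rng.normal(size=(64, 64))
b_h = rng.normal(size=64)

# No concatenation: two separate matrix-vector products
h_separate = np.tanh(U @ x_t + W @ h_prev + b_h)

# With concatenation: place W and U side by side, in the same order
# as the entries of [h_{t-1}, x_t]
W_cat = np.hstack([W, U])            # (64, 74)
hx = np.concatenate([h_prev, x_t])   # (74,)
h_concat = np.tanh(W_cat @ hx + b_h)

print(W_cat.shape, hx.shape, h_concat.shape)  # (64, 74) (74,) (64,)
print(np.allclose(h_separate, h_concat))      # True
```

The order matters: because the vector is $[h_{t-1}, x_t]$, the columns acting on the hidden state must come first in the combined matrix.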


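The main problem with RNNs is vanishing and exploding gradients: because $h_{t-1}$ is multiplied by the same $W$ at every time step, gradients propagated back through many steps are repeatedly multiplied by $W$ and either shrink toward zero or blow up. Vanishing gradients make it hard for RNNs to learn long-term dependencies.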
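The effect of repeated multiplication can be seen in a simplified sketch. It makes two assumptions that are not part of the notes: the $\tanh$ derivative is ignored (a purely linear recurrence), and $W$ is a scaled orthogonal matrix, so the gradient norm changes by exactly the scale factor at each step. The helper name `backprop_norm` is hypothetical.

```python
import numpy as np

def backprop_norm(W, steps, g):
    """Norm of a gradient vector after being multiplied by W^T `steps` times."""
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # Q is orthogonal: it preserves norms
g = rng.normal(size=64)
g /= np.linalg.norm(g)                          # start from a unit-norm gradient

shrinking = backprop_norm(0.9 * Q, 50, g)  # equals 0.9**50, roughly 0.005
growing = backprop_norm(1.1 * Q, 50, g)    # equals 1.1**50, roughly 117
print(shrinking, growing)
```

After only 50 steps the gradient has either nearly vanished or grown by two orders of magnitude, even though the scale factors are close to 1.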

### LSTM

<div style="text-align: center;">
    <img src="images/LSTM.png" alt="LSTM" height="300">
</div>

Each repeating module in an LSTM contains four neural network layers: three sigmoid gate layers (forget, input, and output) and one tanh layer.

The key to LSTMs is the cell state ($C$), as it is the long-term memory of the network.

The cell state runs straight through the chain of time steps like a conveyor belt, and the LSTM can remove information from it or add information to it through **gates**.

<div style="text-align: center;">
    <img src="images/cell_state.png" alt="cell_state" height="300">
</div>

Gates are composed of a sigmoid layer and an element-wise multiplication.
The sigmoid outputs a value between 0 and 1 for each element: zero means let nothing through, and one means let everything through.
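A gate can be sketched in a few lines of NumPy. The numbers below are hypothetical and picked so that the sigmoid lands near 0, at 0.5, and near 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

values = np.array([2.0, -1.0, 0.5])             # information that wants to pass
pre_activations = np.array([-20.0, 0.0, 20.0])  # hypothetical sigmoid-layer inputs

gate = sigmoid(pre_activations)  # approximately [0, 0.5, 1]
gated = gate * values            # approximately [0, -0.5, 0.5]
print(np.round(gate, 3), np.round(gated, 3))
```

The first value is blocked almost entirely, the second is halved, and the third passes through almost unchanged.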

The forget gate layer decides what information to remove from $C_{t-1}$.
- It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each element of $C_{t-1}$.
- 0 means forget the element completely
- 1 means keep the element completely

<div style="text-align: center;">
    <img src="images/forget_layer.png" alt="forget" height="300">
</div>
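
In the commonly used notation, the forget gate can be written as below. The symbols $W_f$ and $b_f$ (the gate's weight matrix and bias) and $\sigma$ (the sigmoid function) are assumed here, since the text above does not define them:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Each entry of $f_t$ lies between 0 and 1 and scales the matching entry of $C_{t-1}$.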

The input gate layer ($i_t$) decides which parts of the cell state to update:
- 0 means don't update
- 1 means fully update

The tanh layer creates a vector of candidate values $\tilde{C}_t$ (each between -1 and 1) that could be added to the cell state.

<div style="text-align: center;">
    <img src="images/input_gate_and_tanh.png" height="300">
</div>
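
Written out in the same assumed notation ($W_i$, $b_i$ for the input gate and $W_C$, $b_C$ for the tanh layer, none of which are defined in the text above), the two layers are:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

Both layers read the same concatenated vector $[h_{t-1}, x_t]$ but have their own weights.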

Multiply the output from the input gate and the tanh layer to get the actual update to the cell state:
$$i_t \odot \tilde{C}_t$$


The new cell state is:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
<div style="text-align: center;">
    <img src="images/update.png" height="300">
</div>
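
A small numeric sketch of the cell state update follows. The gate and candidate values are made up for illustration so that each element shows a different case.

```python
import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])    # previous cell state C_{t-1}
f_t = np.array([1.0, 0.0, 0.5])        # forget gate: keep, erase, keep half
i_t = np.array([0.0, 1.0, 0.5])        # input gate: ignore, fully write, write half
C_tilde = np.array([0.5, 0.25, -0.5])  # candidate values from the tanh layer

C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # values: 1.0, 0.25, 0.0
```

The first element is carried over untouched, the second is replaced by its candidate, and the third blends half of the old value with half of the candidate.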

The output gate is another sigmoid layer; it decides which parts of the cell state to reveal as $h_t$.

The cell state is passed through tanh to squash its values into the range -1 to 1, and the result is multiplied element-wise by the output gate to get $h_t$.

<div style="text-align: center;">
    <img src="images/output.png" height="300">
</div>
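
In the same assumed notation ($W_o$ and $b_o$ are the output gate's weight matrix and bias), this step reads:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Here $o_t$ and $b_o$ belong to the LSTM output gate and are not the output $o_t$ and bias $b_o$ from the RNN section.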

Both $h_t$ and $C_t$ are passed to the next time step, but $h_t$ is also stored as the step's output.

- $C_t$ = stable long-term memory.
- $h_t$ = short-term working memory + external output.

Hence the name Long Short-Term Memory.
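
Putting the gates together, one LSTM time step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not a reference implementation: the function name `lstm_step`, the parameter names `W_f`, `W_i`, `W_C`, `W_o` and their biases, the sizes, and the random initialization are all chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step; returns (h_t, C_t)."""
    hx = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ hx + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ hx + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ hx + params["b_C"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                     # new cell state
    o_t = sigmoid(params["W_o"] @ hx + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                               # new hidden state
    return h_t, C_t

# Illustrative sizes and random parameters (assumptions)
rng = np.random.default_rng(0)
input_size, hidden_size, T = 10, 64, 7
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(
        scale=0.1, size=(hidden_size, hidden_size + input_size)
    )
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
outputs = []
for x_t in rng.normal(size=(T, input_size)):
    h, C = lstm_step(x_t, h, C, params)  # both h and C flow to the next step
    outputs.append(h)                    # h is also stored as the step's output
outputs = np.stack(outputs)
print(outputs.shape, C.shape)  # (7, 64) (64,)
```

Only `h` is collected as output, while `C` is passed along internally, mirroring the split between working memory and long-term memory described above.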