## Drawbacks of the vanilla RNN

The RNN architecture we have used so far (commonly called a "vanilla" RNN) is very simple and struggles as the sequence length grows. This is partly because of the <b>vanishing gradient</b> problem. During backpropagation through time the gradient is multiplied by a similar factor at every timestep, and when those factors are smaller than 1 it shrinks roughly exponentially with distance. As a result, the gradient signal from nearby hidden states (in time) is much larger than the signal from distant ones, so the model learns short-range dependencies but fails to capture long-term ones. <br>
For example:<br>
Q. The chef who cooked these dishes ___________ learnt to cook from his mother.<br>
- is
- are

The answer is of course 'is', because the verb agrees with 'chef' and not with 'dishes'. Getting this right relies on syntactic recency: 'chef' is the closest noun in the sentence structure, even though it is several words away. Vanilla RNNs tend to rely on sequential recency instead, and might output 'are' because the blank directly follows the word 'dishes'.

Another reason vanilla RNNs are unable to capture long-term dependencies is that the hidden state is completely rewritten at every timestep, which leads to a continuous loss of information. Hence there was a need for architectures that can model short-term as well as long-term dependencies. Two popular ones are:
1. Gated Recurrent Units (GRUs)
2. Long Short-Term Memory networks (LSTMs)
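
The vanishing gradient effect described above can be made concrete with a small experiment. The following is a minimal sketch, not taken from the projects in this repository: it assumes a plain tanh RNN written by hand, and the sizes, the variable names and the 0.5 scaling of the recurrent weights are illustrative choices (the scaling is picked so that the shrinkage is guaranteed rather than merely likely).

```python
import torch

torch.manual_seed(0)

hidden_size, input_size, seq_len = 16, 8, 30

# Recurrent weights: 0.5 times an orthogonal matrix, so every backward step
# through the recurrence shrinks the gradient norm by at least a factor of 2.
Q, _ = torch.linalg.qr(torch.randn(hidden_size, hidden_size))
W = 0.5 * Q
U = 0.1 * torch.randn(hidden_size, input_size)
b = torch.zeros(hidden_size)

xs = torch.randn(seq_len, input_size)
h = torch.zeros(hidden_size, requires_grad=True)

hidden_states = []
for x in xs:
    h = torch.tanh(W @ h + U @ x + b)
    h.retain_grad()  # keep d(loss)/d(h_t) so it can be inspected after backward()
    hidden_states.append(h)

# A loss that depends only on the final hidden state.
loss = hidden_states[-1].sum()
loss.backward()

grad_norms = [h_t.grad.norm().item() for h_t in hidden_states]
for t in (0, 10, 20, seq_len - 1):
    print(f"t={t:2d}  ||dL/dh_t|| = {grad_norms[t]:.3e}")
```

The printed norms fall off rapidly for earlier timesteps, which is why the early part of a long sequence contributes almost nothing to the weight update.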

### GRUs

<img src='Images/GRU.png'>

Here at each timestep $t$, we have the input $x^{(t)}$ and the hidden state $h^{(t)}$. The GRU makes use of 2 gates:
1. The update gate:<br>
$\large u^{(t)} = \sigma(W_u * h^{(t-1)} + U_u * x^{(t)} + b_u)$<br>
Controls which parts of the hidden state are updated and which are preserved.

2. The reset gate:<br>
$\large r^{(t)} = \sigma(W_r * h^{(t-1)} + U_r * x^{(t)} + b_r)$<br>
Controls which parts of the previous hidden state are used to calculate new content.

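These gates can be thought of as small neural network layers that look at the current input and the previous hidden state and decide how much information to let through. Since $\sigma$ is the sigmoid function, every gate value lies between 0 and 1.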

The reset gate is directly used to calculate the new hidden state content:<br>
$\large \tilde{h}^{(t)} = \tanh\Big(W_h * (r^{(t)} \bullet h^{(t-1)}) + U_h * x^{(t)} + b_h\Big)$

The new hidden state is calculated using the update gate. It simultaneously controls how much of the previous hidden state is kept and how much is replaced by the new content.

$\large h^{(t)} = (1 - u^{(t)}) \bullet h^{(t-1)} + u^{(t)} \bullet \tilde{h}^{(t)}$
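
The four equations above can be sketched directly in PyTorch. This is a minimal illustration under some assumptions: `gru_step`, `init_params` and the parameter dictionary are hypothetical names introduced here, `*` in the equations is read as a matrix product and $\bullet$ as an elementwise product, and the random initialisation is arbitrary.

```python
import torch

def gru_step(x, h_prev, p):
    """One GRU step following the equations above; `p` is a dict of parameters."""
    u = torch.sigmoid(p["W_u"] @ h_prev + p["U_u"] @ x + p["b_u"])  # update gate
    r = torch.sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x + p["b_r"])  # reset gate
    # New content: the reset gate masks the previous hidden state elementwise.
    h_tilde = torch.tanh(p["W_h"] @ (r * h_prev) + p["U_h"] @ x + p["b_h"])
    # Blend old state and new content using the update gate.
    return (1 - u) * h_prev + u * h_tilde

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    g = torch.Generator().manual_seed(seed)
    p = {}
    for name in ("u", "r", "h"):
        p[f"W_{name}"] = 0.1 * torch.randn(hidden_size, hidden_size, generator=g)
        p[f"U_{name}"] = 0.1 * torch.randn(hidden_size, input_size, generator=g)
        p[f"b_{name}"] = torch.zeros(hidden_size)
    return p

params = init_params(input_size=4, hidden_size=3)
h = torch.zeros(3)
for x in torch.randn(5, 4):  # a sequence of 5 input vectors
    h = gru_step(x, h, params)
print(h.shape)  # torch.Size([3])
```

Because the new state is a gate-weighted average of the previous state and a `tanh` output, its entries stay inside $(-1, 1)$ when the initial state does.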

#### How does a GRU solve the vanishing gradient problem?

GRUs make it easier to retain information over the long term, and they do this through the update gate. If the update gate is 0, the new hidden state becomes:<br>
$\large h^{(t)} = (1 - u^{(t)}) \bullet h^{(t-1)} + u^{(t)} \bullet \tilde{h}^{(t)}$<br>
But $u^{(t)} = 0$<br>
Hence, <br>
$\large h^{(t)} = h^{(t-1)}$

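This means that the hidden state is carried over unchanged for as long as the update gate stays at 0, so both the information and the gradient pass through those timesteps without shrinking. Since the gate values are learned, the GRU can capture long-term or short-term dependencies as it suits the problem.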
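
This behaviour can be checked numerically. The sketch below is an illustration under stated assumptions: the weights are small random values, and the update gate is forced shut with a large negative bias of -30 (a hand-picked value, not something a trained model would necessarily learn). It then runs 100 GRU steps and compares the final hidden state and its gradient against the initial state.

```python
import torch

torch.manual_seed(0)
hidden_size, input_size, seq_len = 4, 3, 100

def small_weights(rows, cols):
    return 0.1 * torch.randn(rows, cols)

W_u, U_u = small_weights(hidden_size, hidden_size), small_weights(hidden_size, input_size)
W_r, U_r = small_weights(hidden_size, hidden_size), small_weights(hidden_size, input_size)
W_h, U_h = small_weights(hidden_size, hidden_size), small_weights(hidden_size, input_size)
b_r = torch.zeros(hidden_size)
b_h = torch.zeros(hidden_size)
b_u = torch.full((hidden_size,), -30.0)  # pushes sigmoid(...) to ~0: gate closed

h0 = torch.randn(hidden_size).tanh().requires_grad_()
h = h0
for x in torch.randn(seq_len, input_size):
    u = torch.sigmoid(W_u @ h + U_u @ x + b_u)
    r = torch.sigmoid(W_r @ h + U_r @ x + b_r)
    h_tilde = torch.tanh(W_h @ (r * h) + U_h @ x + b_h)
    h = (1 - u) * h + u * h_tilde

h.sum().backward()
max_change = (h - h0).abs().max().item()
print("largest change in hidden state:", max_change)
print("gradient w.r.t. initial state:", h0.grad)
```

With the gate closed, the hidden state after 100 steps is essentially the initial one and the gradient with respect to it is close to 1 in every entry, in contrast to the exponential decay seen in a vanilla RNN.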

### LSTMs

<img src='Images/LSTM.png' />

LSTMs are older and slightly more complex than GRUs. They attempt to solve the vanishing gradient problem by keeping a separate memory called the <b>cell state</b>, in addition to the hidden state. In theory, this cell state can store information about the entire sequence. LSTMs use several gates to define how the cell state and the hidden state are updated. Performance-wise, neither GRUs nor LSTMs are clearly better: each has outperformed the other on different tasks. Due to their simpler structure, GRUs are slightly faster to train and also easier to understand. Let us look at how the LSTM gates work.

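1. Forget gate<br>
In this gate, the input $x^{(t)}$ and the previous hidden state $h^{(t-1)}$ are passed through a sigmoid layer, which squashes the result to values between 0 and 1. The previous cell state is multiplied elementwise by these values, so entries with a gate value close to 0 are "forgotten" and entries with a gate value close to 1 are "remembered".

2. Input gate<br>
This gate works in a similar fashion to the forget gate: it is a sigmoid layer over the input and the previous hidden state. Its output decides which parts of the new candidate content (computed from the same inputs by a tanh layer) are important enough to be written to the cell state.

3. Output gate<br>
The output gate is also a sigmoid layer over the input and the previous hidden state. It decides which parts of the new cell state (after a tanh) are exposed as the next hidden state.

4. Cell state<br>
The new cell state $C^{(t)}$ is calculated using the outputs of the forget gate and the input gate: the forget gate scales the previous cell state, and the input gate scales the candidate content that is added to it.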
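
Written in the same notation as the GRU equations, the commonly used LSTM formulation is shown below. The parameter names $W_f, U_f, b_f$ and so on are introduced here for illustration and follow the GRU naming:<br>
$\large f^{(t)} = \sigma(W_f * h^{(t-1)} + U_f * x^{(t)} + b_f)$<br>
$\large i^{(t)} = \sigma(W_i * h^{(t-1)} + U_i * x^{(t)} + b_i)$<br>
$\large o^{(t)} = \sigma(W_o * h^{(t-1)} + U_o * x^{(t)} + b_o)$<br>
$\large \tilde{C}^{(t)} = \tanh(W_c * h^{(t-1)} + U_c * x^{(t)} + b_c)$<br>
$\large C^{(t)} = f^{(t)} \bullet C^{(t-1)} + i^{(t)} \bullet \tilde{C}^{(t)}$<br>
$\large h^{(t)} = o^{(t)} \bullet \tanh(C^{(t)})$

A minimal sketch of one such step in PyTorch follows. `lstm_step` and the parameter dictionary are hypothetical helpers for this example, and the random weights are arbitrary.

```python
import torch

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; `p` maps names such as "W_f" to tensors."""
    def layer(name, activation):
        return activation(p[f"W_{name}"] @ h_prev + p[f"U_{name}"] @ x + p[f"b_{name}"])

    f = layer("f", torch.sigmoid)     # forget gate: what to keep from c_prev
    i = layer("i", torch.sigmoid)     # input gate: what to write
    o = layer("o", torch.sigmoid)     # output gate: what to expose as h
    c_tilde = layer("c", torch.tanh)  # candidate cell content
    c = f * c_prev + i * c_tilde      # new cell state
    h = o * torch.tanh(c)             # new hidden state
    return h, c

torch.manual_seed(0)
input_size, hidden_size = 4, 3
p = {}
for name in ("f", "i", "o", "c"):
    p[f"W_{name}"] = 0.1 * torch.randn(hidden_size, hidden_size)
    p[f"U_{name}"] = 0.1 * torch.randn(hidden_size, input_size)
    p[f"b_{name}"] = torch.zeros(hidden_size)

h = torch.zeros(hidden_size)
c = torch.zeros(hidden_size)
for x in torch.randn(6, input_size):  # a sequence of 6 input vectors
    h, c = lstm_step(x, h, c, p)
print(h.shape, c.shape)
```

Note that the cell state is only ever scaled and added to, never passed through a weight matrix. When the forget gate is close to 1 and the input gate close to 0, it is carried forward almost unchanged, which plays the same role as a closed update gate in the GRU.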

### Code

Using both LSTMs and GRUs in PyTorch is very easy with `nn.LSTM` and `nn.GRU`. They follow an API very similar to that of `nn.RNN` and can provide better results on complex problems without much modification to the rest of the code.
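
As a rough illustration of how similar the three APIs are, the sketch below builds one layer of each and runs the same batch through them. The sizes are arbitrary and `n_params` is a small helper defined only for this example. The one difference to watch for is that `nn.LSTM` returns the cell state alongside the hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, input_size, hidden_size = 2, 7, 10, 16
x = torch.randn(batch, seq_len, input_size)

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

out_rnn, h_rnn = rnn(x)
out_gru, h_gru = gru(x)                # same return signature as nn.RNN
out_lstm, (h_lstm, c_lstm) = lstm(x)   # LSTM additionally returns the cell state

print(out_gru.shape)   # torch.Size([2, 7, 16])  (batch, seq_len, hidden_size)
print(h_gru.shape)     # torch.Size([1, 2, 16])  (num_layers, batch, hidden_size)
print(c_lstm.shape)    # torch.Size([1, 2, 16])

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# A GRU holds 3 sets of weights and an LSTM 4, compared with 1 for the vanilla RNN.
print(n_params(rnn), n_params(gru), n_params(lstm))  # 448 1344 1792
```

The parameter counts show where the extra training cost of LSTMs comes from. One caveat when comparing against the equations in this section: PyTorch's documentation writes the GRU update as $h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$, so in its convention a gate value of 1 (rather than 0) keeps the old state.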

### Projects

1. [Character based language model](nbs/CharacterLevelLanguageModel.ipynb)
2. [Word based language model](nbs/WordLevelLanguageModel.ipynb)