## 1. Understanding LSTM Networks

### LSTM Intro

- LSTMs are designed to avoid the problem of the vanishing gradient.
The main practical limitation posed by the vanishing gradient is that it prevents the model from learning long-term dependencies. 

- By avoiding the vanishing gradient problem, LSTMs can retain memory for longer than ordinary RNNs (for hundreds of time steps).

- In contrast to RNNs, which only maintain a single hidden state, LSTMs have many more parameters as well as better control over what memory to store and what to discard at a given time step.

- For example, RNNs are not able to decide which memory to store and which to discard, as the hidden state is forced to be updated at every time step.
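
The vanishing gradient effect can be illustrated with a small numerical sketch. It assumes a hypothetical scalar RNN, $h_t = \tanh(w h_{t-1} + x_t)$, with an arbitrary weight `w = 0.9` and random inputs; none of these values come from a trained model. The function name `gradient_through_time` is made up for this illustration.

```python
import numpy as np

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = np.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h ** 2)
    return grad

rng = np.random.default_rng(0)
inputs = rng.normal(size=100)  # arbitrary inputs, for illustration only

# The gradient with respect to the initial state shrinks as the sequence grows.
for steps in (1, 10, 50, 100):
    print(steps, abs(gradient_through_time(0.9, inputs[:steps])))
```

Each factor in the product has magnitude at most `w`, so with `|w| < 1` the gradient decays at least geometrically with the number of time steps. This is why early inputs have almost no influence on learning in such a model.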

### What is an LSTM?

LSTMs can be seen as a more complex and capable family of RNNs. An LSTM is mainly composed of five different components:

- **Cell state:** This is the internal cell state (that is, memory) of an LSTM cell

- **Hidden state:** This is the external hidden state exposed to other layers and used to calculate predictions

- **Input gate:** This determines how much of the current input is read into the cell state

- **Forget gate:** This determines how much of the previous cell state is sent into the current cell state

- **Output gate:** This determines how much of the cell state is output into the hidden state

- We can view a standard RNN as a cell architecture as follows: 

    - The cell outputs some state (through a nonlinear activation function) that depends on the previous cell state and the current input.

    - However, in RNNs, the cell state is updated with every incoming input. This behavior is undesirable for storing long-term dependencies.

- LSTMs can decide when to add, update, or forget information stored in each neuron in the cell state. In other words, LSTMs are equipped with a mechanism to keep the cell state unchanged (if warranted for better performance), giving them the ability to store long-term dependencies.

- This is achieved by introducing a gating mechanism. LSTMs possess a gate for each operation the cell needs to perform. Each gate outputs a continuous value between 0 and 1 (usually through a sigmoid function), where 0 means no information flows through the gate and 1 means all the information flows through the gate. 

- An LSTM uses one such gate for each neuron in the cell. These gates control the following:
    - How much of the current input is written to the cell state (input gate)
    
    - How much information is forgotten from the previous cell state (forget gate)
    
    - How much information is output into the final hidden state from the cell state (output gate)
    

<div align='center'>
    <img src='images/abstract_lstm.png'/>
</div>
The thickness of each line represents how much information is flowing from/to that gate (in a hypothetical scenario).

### LSTMs in more detail

As discussed, LSTMs have a gating mechanism composed of the following three gates:

- ***Input gate:*** A gate that outputs values between 0 (**the current input is not written to the cell state**) and 1 (**the current input is fully written to the cell state**). Sigmoid activation is used to squash the output to between 0 and 1.

- ***Forget gate:*** A sigmoidal gate that outputs values between 0 (**the previous cell state is fully forgotten for calculating the current cell state**) and 1 (**the previous cell state is fully read in when calculating the current cell state**).

- ***Output gate:*** A sigmoidal gate that outputs values between 0 (**the current cell state is fully discarded when calculating the final hidden state**) and 1 (**the current cell state is fully used when calculating the final hidden state**).
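
The three gates can be put together in a minimal NumPy sketch of one LSTM time step. This is the commonly used vanilla formulation, not necessarily the exact variant these notes have in mind. The parameter names (`W_i`, `b_i`, and so on), the helper `init_params`, and the sizes are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a vanilla LSTM cell (no peephole connections)."""
    z = np.concatenate([h_prev, x])  # gates see the previous hidden state and the current input
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values to write
    c = f * c_prev + i * c_tilde  # forget part of the old memory, write part of the new input
    h = o * np.tanh(c)            # expose part of the cell state as the hidden state
    return h, c

def init_params(input_size, hidden_size, rng):
    """Hypothetical helper: small random weights and zero biases."""
    shape = (hidden_size, hidden_size + input_size)
    params = {}
    for name in "ifoc":
        params[f"W_{name}"] = rng.normal(scale=0.1, size=shape)
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

rng = np.random.default_rng(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):  # a made-up sequence of five inputs
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

When the forget gate is close to 1 and the input gate is close to 0, the update reduces to `c = c_prev`, which is how the cell can carry information unchanged across many time steps.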

### How LSTMs differ from standard RNNs

An LSTM has a more intricate structure compared to a standard RNN. 

- One of the primary differences is that an LSTM has two different states: a cell state $c_t$ and a final hidden state $h_t$. However, an RNN only has a single hidden state $h_t$. 

- The next primary difference is that, since an LSTM has three different gates, an LSTM has much more control over how the current input and the previous cell state are handled when computing the final hidden state $h_t$.

- Having the two different states is quite advantageous. With this mechanism, we can decouple the model’s short-term and long-term memory. 

- In other words, even when the cell state is changing quickly, the final hidden state can still change more slowly. So, while the cell state is learning both short-term and long-term dependencies, the final hidden state can reflect either only the short-term dependencies, only the long-term dependencies, or both.

- Next, the gating mechanism is composed of three gates: the input, forget, and output gates.

- It is quite evident that this is a more principled approach (especially compared to the standard RNNs) that permits better control over how much the current input and the previous cell state contribute to the current cell state. Also, the output gate gives better control over how much the cell state contributes to the final hidden state.
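
The extra control comes at the cost of extra parameters. As a rough sketch, assuming a single-layer standard RNN and a vanilla LSTM without peephole connections, the counts can be compared as follows. The function names and sizes are made up for this illustration.

```python
def rnn_param_count(input_size, hidden_size):
    # One weight matrix over [h_{t-1}, x_t] plus one bias vector.
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_param_count(input_size, hidden_size):
    # The same shapes, repeated for the input gate, forget gate,
    # output gate and the candidate cell values.
    return 4 * rnn_param_count(input_size, hidden_size)

print(rnn_param_count(128, 256))   # 98560
print(lstm_param_count(128, 256))  # 394240
```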

* **

## 2. Improving LSTMs

Numerous extensions have been developed to help LSTMs perform better at the prediction stage:
- **Greedy Sampling**,

- **Beam Search**, 

- using **Word Vectors** instead of a one-hot-encoded representation of words, and 

- using **bidirectional LSTMs**

It is important to note that these optimization techniques are not specific to LSTMs; rather, any sequential model can benefit from them.

### 2.1 Greedy Sampling

If we always predict the word with the highest probability, the LSTM will tend to produce very monotonous results. For example, due to the frequent occurrence of stop words (e.g. *the*), it may repeat them many times before switching to another word.

- One way to get around this is to use **greedy sampling**, where we pick the *n* most probable predictions and sample from that set. This helps to break the monotonous nature of the predictions.

Let’s consider the first sentence of the previous example:
`John gave Mary a puppy.`

Say we start with the first word and want to predict the next four words: `John ____ ____ ____ ____`.

If we attempt to choose samples deterministically, the LSTM might output something like the following:

`John gave Mary gave John.`

However, by sampling the next word from a subset of words in the vocabulary (most highly probable ones), the LSTM is forced to vary the prediction and might output the following:

`John gave Mary a puppy.`

Alternatively, it might give the following output:

`John gave puppy a puppy.`

- However, even though greedy sampling adds more diversity to the generated text, it does not guarantee that the output will always be realistic, especially when generating longer sequences of text.
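
The sampling step described above can be sketched in a few lines of NumPy. The function name `sample_from_top_n`, the toy vocabulary, and the probabilities are made up for illustration. In practice the distribution would come from the LSTM's softmax output.

```python
import numpy as np

def sample_from_top_n(probs, n, rng):
    """Sample a word index from the n most probable entries of `probs`."""
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-n:]               # indices of the n largest probabilities
    top_probs = probs[top] / probs[top].sum()  # renormalise over that subset
    return int(rng.choice(top, p=top_probs))

vocab = ["the", "gave", "Mary", "a", "puppy"]
probs = [0.40, 0.25, 0.20, 0.10, 0.05]  # made-up next-word distribution
rng = np.random.default_rng(0)
print([vocab[sample_from_top_n(probs, 3, rng)] for _ in range(5)])
```

With `n = 1` this reduces to always picking the most probable word. Larger values of `n` trade some accuracy for more varied output.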

### 2.2 Beam Search

- Beam search is a way of helping with the quality of the predictions produced by the LSTM. 

- Here, the predictions are found by solving a search problem. In particular, we predict several steps ahead for multiple candidates at each step. This gives rise to a tree-like structure with candidate sequences of words (see the figures below). 

- The crucial idea of beam search is to produce $b$ outputs (that is, $y_t, y_{t+1}, \cdots, y_{t+b-1}$) at once instead of a single output $y_t$. 

    - Here, $b$ is known as the length of the beam, and the $b$ outputs produced are known as the beam.<br></br> 

- More technically, we pick the beam that has the highest joint probability $P(y_t, y_{t+1}, \cdots, y_{t+b-1}|x_t)$ instead of picking the single output with the highest probability $P(y_t|x_t)$. 

- We are looking farther into the future before making a prediction, which usually leads to better results.

<div align='center'>
    <img src='images/beam_search_1.png'/>
</div>
<div align='center'>
    <img src='images/beam_search_2.png'/>
</div>
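
The search can be sketched as follows. `beam_length` plays the role of $b$ in these notes. `beam_width` (how many candidate sequences are kept at each step) and the toy transition table standing in for the LSTM are assumptions made for this illustration.

```python
import numpy as np

def beam_search(next_word_probs, start, beam_length, beam_width):
    """Return the best `beam_length`-word continuation of `start` and its log-probability."""
    candidates = [(0.0, [])]  # (joint log-probability, words predicted so far)
    for _ in range(beam_length):
        expanded = []
        for log_p, words in candidates:
            probs = next_word_probs(start + words)
            for word, p in enumerate(probs):
                expanded.append((log_p + np.log(p), words + [word]))
        # Keep only the `beam_width` most probable partial sequences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        candidates = expanded[:beam_width]
    best_log_p, best_words = candidates[0]
    return best_words, best_log_p

# Toy stand-in for an LSTM: the next-word distribution depends only on the last word.
TRANSITIONS = np.array([
    [0.10, 0.50, 0.40],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
])

def toy_model(sequence):
    return TRANSITIONS[sequence[-1]]

greedy, _ = beam_search(toy_model, [0], beam_length=2, beam_width=1)
beam, beam_log_p = beam_search(toy_model, [0], beam_length=2, beam_width=2)
print(greedy, beam, np.exp(beam_log_p))
```

In this toy setting, keeping a single candidate commits to word 1 (probability 0.5) and ends with a joint probability of 0.17. Keeping two candidates finds the sequence through word 2, whose joint probability is 0.36.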

### 2.3 Bidirectional LSTMs (BiLSTMs)

- Making LSTMs bidirectional is another way of improving the quality of the predictions of an LSTM. **By this we mean training the LSTM with text read in both directions: from the beginning to the end and the end to the beginning.**

<div align='center'>
    <img src='images/biLSTM.png'/>
</div>

* **

- **Another application of BiLSTMs is neural machine translation, where we translate a sentence of a source language to a target language**. 
    - Because the words of a sentence and the words of its translation do not necessarily align one-to-one, having access to both sides of a given token in the source language can greatly help to understand the context better, thus producing better translations. 
    
    - As an example, consider a translation task of translating Filipino to English. In Filipino, sentences are usually written having verb-object-subject in that order, whereas in English, it is subject-verb-object. 
    
    - In this translation task, it will be extremely helpful to read sentences both forward and backward to make a good translation.
    
* **

$\rightarrow$ ***Architecture of BiLSTM Network***

- **A BiLSTM is essentially two separate LSTM networks. One network learns data from the beginning to the end, and the other network learns data from the end to the beginning.** 

- Training occurs in two phases:
    - First, the solid-colored network is trained with data created by reading the text from the beginning to the end. This network represents the normal training procedure used for standard LSTMs. 
    
    - Second, the dashed network is trained with data generated by reading the text in the reverse direction. 
    
    - Then, at the inference phase, we use both the solid and dashed states' information (by concatenating both states and creating a vector) to predict the missing word.

<div align='center'>
    <img src='images/biLSTM_architecture.png'/>
</div>
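
The concatenation of forward and backward states can be sketched as follows. This only illustrates how the two sets of states are combined at each position, not the training procedure. The helper names, the packed weight layout, and the random weights are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(inputs, W, b, hidden_size):
    """Run a vanilla LSTM over `inputs` and return the hidden state at every step."""
    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    states = []
    for x in inputs:
        # All four pre-activations in one matrix product, then split.
        i, f, o, g = np.split(W @ np.concatenate([h, x]) + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bilstm_states(inputs, fwd, bwd, hidden_size):
    forward = run_lstm(inputs, *fwd, hidden_size)
    # Read the sequence from end to beginning, then flip back so index t lines up.
    backward = run_lstm(inputs[::-1], *bwd, hidden_size)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 6

def make_weights():
    """Hypothetical helper: random weights and zero biases for one direction."""
    W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
    return W, np.zeros(4 * hidden_size)

fwd, bwd = make_weights(), make_weights()
inputs = rng.normal(size=(steps, input_size))  # made-up input sequence
states = bilstm_states(inputs, fwd, bwd, hidden_size)
print(states.shape)  # (6, 8)
```

At every position, the first half of the vector summarizes everything read so far from the left, and the second half summarizes everything to the right.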

* **

## 3. Other variants of LSTMs

- **Peephole connections**

    - Peephole connections allow gates to see not only the current input and the previous hidden state, but also the previous cell state. This increases the number of weights in the LSTM cell. Such connections have been reported to improve results on some tasks.
    
    - Vanilla LSTMs, when calculating the gates, only look at the current input and the hidden state. With peephole connections, we make the gate computations dependent on all three: the current input, the hidden state, and the cell state. <br></br>
    
- **Gated Recurrent Units (GRUs)**

    - GRUs are a simpler variant of vanilla LSTMs that often achieve comparable performance. GRUs have only two gates and a single state, whereas vanilla LSTMs have three gates and two states.
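
Both variants can be sketched in NumPy. The peephole gate below assumes element-wise peephole weights `p`, and the GRU uses one common sign convention for the update gate (other texts swap the roles of `z` and `1 - z`). All parameter names and sizes are made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_gate(x, h_prev, c_prev, W, p, b):
    """A gate that also looks at the previous cell state through weights `p`."""
    return sigmoid(W @ np.concatenate([h_prev, x]) + p * c_prev + b)

def gru_step(x, h_prev, params):
    """One GRU time step: two gates (update, reset) and a single state h."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(params["W_z"] @ hx + params["b_z"])  # update gate
    r = sigmoid(params["W_r"] @ hx + params["b_r"])  # reset gate
    h_tilde = np.tanh(params["W_h"] @ np.concatenate([r * h_prev, x]) + params["b_h"])
    # Blend the old state and the candidate; there is no separate cell state.
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for name in "zrh":
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # a made-up sequence of five inputs
    h = gru_step(x, h, params)

gate = peephole_gate(np.ones(input_size), h, np.ones(hidden_size),
                     params["W_z"], rng.normal(size=hidden_size), np.zeros(hidden_size))
print(h.shape, gate.shape)
```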
    
* **