# **LSTM** (Long Short Term Memory Networks)


###               “the clouds are in the ???” 
### “I grew up in France… I speak fluent ??????.”

In theory, RNNs are fully capable of handling such “long-term dependencies.” In practice, however, standard RNNs struggle to learn them as the gap between the relevant information and the point where it is needed grows. **[Hochreiter (1991) [German] and Bengio, et al. (1994)]**






# **LSTM Network**


*   LSTMs are explicitly designed to avoid the long-term dependency problem.

*   LSTM (Long Short Term Memory) networks are a special kind of recurrent neural network with some additional features, namely a memory state and gates that control it.

Diagram for typical LSTM at time step 't':
![LSTM at timestep t](https://miro.medium.com/max/1400/1*GHAGjzbM7LZz7bATVRdx4w.png)

The difference between a standard RNN and an LSTM can be seen below:


![Simple RNN](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png)

**Standard RNN (single layer)**

![LSTM](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)

**LSTM (four interacting layers)**
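To make the single-layer case concrete, here is a minimal NumPy sketch of one step of the standard RNN pictured above: a single tanh layer applied to the previous hidden state and the current input. The function name `rnn_step`, the sizes, and the random toy data are assumptions made for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One step of a standard RNN: a single tanh layer over [h_prev, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    return np.tanh(W @ concat + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # hypothetical sizes
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # toy sequence of 5 steps
    h = rnn_step(x_t, h, W, b)
```

The only thing carried from step to step is `h`; the LSTM adds a second, separately managed memory state on top of this.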





**The notation used in the LSTM diagrams is explained below:**

![LSTM Notations](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM2-notation.png)




**Understanding LSTM**:

*Memory state “C”*:

![Memory State](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png)



*   It runs straight down the entire chain, with only some minor linear interactions.
*   It’s very easy for information to just flow along it unchanged.

*   Information can be added to or removed from the memory state, carefully regulated by structures called gates. Each gate is composed of a sigmoid neural net layer and a pointwise multiplication operation. **So how many gates are there in an LSTM?**

![A gate: sigmoid layer followed by pointwise multiplication](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-gate.png)
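As a rough sketch of what a gate does, the snippet below applies a sigmoid to some made-up pre-activation values and multiplies the result pointwise with a toy memory vector. The numbers are hypothetical; in a real LSTM the pre-activations come from a learned layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([2.0, -1.0, 0.5])         # toy memory state
gate_logits = np.array([10.0, -10.0, 0.0])  # hypothetical pre-activations
gate = sigmoid(gate_logits)                 # close to 1, close to 0, exactly 0.5
gated = gate * memory                       # pointwise multiplication
```

The first entry passes through almost unchanged, the second is almost entirely blocked, and the third is halved.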








*Forget Gate*:


*   The first step in our LSTM is to decide what information we’re going to throw away from the memory state. 
*   This decision is made by a sigmoid layer called the “forget gate layer.”

*   The gate outputs a number between 0 and 1 for each value in the memory state: a 1 represents “completely keep this” while a 0 represents “completely get rid of this.”

*   For example: when we see a new subject, we want to forget the gender of the old subject.


![Forget gate layer](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png)
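The forget gate described above can be sketched in a few lines of NumPy. This assumes the usual formulation in which a sigmoid layer looks at the previous hidden state and the current input; the names `forget_gate`, `W_f`, `b_f`, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x_t, h_prev, W_f, b_f):
    """Sigmoid layer over [h_prev, x_t]; one value in (0, 1) per memory entry."""
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # hypothetical sizes
W_f = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
C_prev = rng.normal(size=hidden_size)  # toy memory state

f_t = forget_gate(x_t, h_prev, W_f, b_f)
kept = f_t * C_prev  # what survives of the old memory state
```

Because every entry of `f_t` lies between 0 and 1, each entry of `kept` is never larger in magnitude than the corresponding entry of `C_prev`.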



*Input Gate*:

The next step is to decide what new information we’re going to store in the memory state. This has two parts:


*   First, a sigmoid layer decides which values we’ll update ($i_t$).
*   Second, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state.

Then we compute $i_t * \tilde{C}_t$: the new candidate values, scaled by how much we decided to update each state value.

For example: we’d want to add the gender of the new subject to the memory state, to replace the old one we’re forgetting.

![Input gate layer and candidate values](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png)
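The two parts of the input step can be sketched as follows: a sigmoid layer for the input gate and a tanh layer for the candidate values, both reading the previous hidden state and the current input. The function name, parameter names, and toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(x_t, h_prev, W_i, b_i, W_C, b_C):
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)      # which values to update
    C_tilde = np.tanh(W_C @ concat + b_C)  # candidate values to add
    return i_t, C_tilde

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # hypothetical sizes
W_i = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
W_C = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_C = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

i_t, C_tilde = input_gate_and_candidate(x_t, h_prev, W_i, b_i, W_C, b_C)
scaled_update = i_t * C_tilde  # candidates scaled by how much to update
```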








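It’s now time to update the old memory state, $C_{t-1}$, into the new memory state $C_t$. We multiply the old state by the forget gate output $f_t$, dropping what we decided to forget, and then add $i_t * \tilde{C}_t$, the scaled candidate values.

![Memory state update](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png)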

For example: this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
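A small worked example of the update, using hand-picked toy values rather than outputs of trained layers. It assumes the standard update rule: the old state scaled by the forget gate, plus the candidate values scaled by the input gate.

```python
import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])   # old memory state
f_t = np.array([1.0, 0.0, 0.5])       # forget gate: keep, drop, keep half
i_t = np.array([0.0, 1.0, 0.5])       # input gate: ignore, write, write half
C_tilde = np.array([0.9, 0.3, -0.4])  # candidate values

C_t = f_t * C_prev + i_t * C_tilde    # new memory state
```

The first entry is kept untouched, the second is fully replaced by its candidate, and the third is a blend of old and new.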




*Output Gate*:

Finally, we need to decide what we’re going to output. This output will be based on our updated memory state, but will be a filtered version.


*   First, we run a sigmoid layer which decides what parts of the memory state we’re going to output.
*   Then, we put the memory state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

![Output gate layer](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png)
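Putting the steps together, the following is a minimal sketch of one full LSTM step, ending with the output gate. It assumes each layer reads the concatenation of the previous hidden state and the current input; the `params` dictionary, its key names, and the toy sizes are hypothetical conveniences, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # candidates
    C_t = f_t * C_prev + i_t * C_tilde                         # memory update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                                   # filtered output
    return h_t, C_t

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # hypothetical sizes
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(
        scale=0.5, size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):  # toy sequence of 6 steps
    h, C = lstm_step(x_t, h, C, params)
```

Note that `h` is always bounded by 1 in magnitude, while `C` is not squashed and can accumulate values across steps.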



# **Variations of LSTM**

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections,” which let the gate layers look at the memory state.

![LSTM with peephole connections](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-peepholes.png)
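A rough sketch of a peephole gate, assuming the concatenated form suggested by the diagram, where the memory state is simply appended to the gate's input. The helper name `peephole_gate` and the toy sizes are assumptions; other formulations of peepholes exist.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_gate(C, h_prev, x_t, W, b):
    """Gate that also looks at the memory state C (concatenated form)."""
    return sigmoid(W @ np.concatenate([C, h_prev, x_t]) + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # hypothetical sizes
# The weight matrix is wider than in the plain LSTM: it also covers C.
W_f = rng.normal(scale=0.5, size=(hidden_size, 2 * hidden_size + input_size))
b_f = np.zeros(hidden_size)

C_prev = rng.normal(size=hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

f_t = peephole_gate(C_prev, h_prev, x_t, W_f, b_f)  # forget gate sees C_prev
```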



Another variation is to use coupled forget and input gates. Instead of deciding separately what to forget and what to add, we make those decisions together: we only forget when we’re going to input something in its place, and we only input new values to the state when we forget something older.

![LSTM with coupled forget and input gates](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-tied.png)
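With coupled gates, the input gate is replaced by one minus the forget gate. The toy values below are hand-picked to illustrate this, under the assumption that the update takes the form `f_t * C_prev + (1 - f_t) * C_tilde`.

```python
import numpy as np

C_prev = np.array([1.0, -2.0, 4.0])   # old memory state
f_t = np.array([1.0, 0.0, 0.25])      # forget gate
C_tilde = np.array([0.9, 0.3, -0.4])  # candidate values

# Whatever fraction is forgotten is exactly the fraction that gets written.
C_t = f_t * C_prev + (1.0 - f_t) * C_tilde
```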



Another variation is the GRU (Gated Recurrent Unit). It combines the forget and input gates into a single “update gate.” It also merges the memory state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

![Gated Recurrent Unit (GRU)](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png)
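A minimal sketch of one GRU step, assuming the common formulation with an update gate `z_t` and a reset gate `r_t`. Biases are left out for brevity, and the function name, weight names, and toy sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)  # update gate
    r_t = sigmoid(W_r @ concat)  # reset gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    # Single merged state: blend of the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # hypothetical sizes
shape = (hidden_size, hidden_size + input_size)
W_z = rng.normal(scale=0.5, size=shape)
W_r = rng.normal(scale=0.5, size=shape)
W_h = rng.normal(scale=0.5, size=shape)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # toy sequence of 5 steps
    h = gru_step(x_t, h, W_z, W_r, W_h)
```

There is no separate memory state here: `h` plays both roles, which is where the reduction in parameters comes from.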



**Which of these variants is best?**

Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

**What comes next?**



*   Attention: the idea is to let every step of an RNN pick information to look at from some larger collection of information (Xu, et al. (2015)).
*   Grid LSTMs, by Kalchbrenner, et al. (2015).
*   RNNs in generative models, such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015).
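To give a feel for the attention idea in the list above, here is a small sketch of soft attention: a query (for instance, the RNN's current hidden state) scores every item in a collection, and the scores are turned into weights for a weighted average. Dot-product scoring is an assumption chosen for simplicity and is not necessarily what the cited paper uses; all names and sizes are hypothetical.

```python
import numpy as np

def soft_attention(query, items):
    """Blend the items, weighting each by its similarity to the query."""
    scores = items @ query                   # one score per item (dot product)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ items                # weighted average of the items
    return context, weights

rng = np.random.default_rng(0)
items = rng.normal(size=(5, 4))  # hypothetical collection: 5 items, 4 features
query = rng.normal(size=4)       # e.g. the RNN's current hidden state
context, weights = soft_attention(query, items)
```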



