# Long Short-Term Memory Network

Useful Resources:

1. [Chris Olah's LSTM post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
2. [Edwin Chen's LSTM post](http://blog.echen.me/2017/05/30/exploring-lstms/)
3. [Andrej Karpathy's lecture on RNNs and LSTMs from CS231n](https://www.youtube.com/watch?v=iX5V1WpxxkY)

## RNN vs LSTM

Suppose we are using a neural network to classify an image of a wolf. The network outputs the following class probabilities (for example, from a softmax layer, since they sum to 1):

- P(dog) = 0.8
- P(wolf) = 0.15
- P(fish) = 0.05

What if this image is actually a wolf, not a dog? Suppose the previous images were a bear and a fox. How do we hint to the network that the image is a wolf, not a dog? We pass information from the previous inputs into the classifier along with the current image.

<img src="rnn_img/c1.png" alt="drawing" width="800"/>

We can combine the past data with the current input as a single vector, then feed it through a linear function followed by an activation function (sigmoid or hyperbolic tangent). The main drawback of a simple RNN is that information is lost over long time intervals. This is caused by the ***vanishing gradient*** problem: gradients shrink as they are propagated back through many time steps, so early inputs end up having little influence. This is where the LSTM comes in. An LSTM cell takes both a long-term and a short-term memory as inputs, and updates the long-term and short-term memory values as part of its output.
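The simple RNN step described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the names `rnn_step`, `W`, and `b`, the layer sizes, and the random inputs are all assumptions made for the example. The second half uses a scalar RNN to show why the gradient through many steps tends to shrink.

```python
import numpy as np

def rnn_step(h_prev, x, W, b):
    """One simple RNN step: concatenate memory and input, then linear map + tanh."""
    combined = np.concatenate([h_prev, x])
    return np.tanh(W @ combined + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3            # illustrative sizes (assumed)
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # a made-up sequence of 5 inputs
    h = rnn_step(h, x, W, b)

# Scalar illustration of the vanishing gradient.
# For h_t = tanh(w * h_{t-1} + x_t), the derivative d h_t / d h_{t-1} is
# w * (1 - h_t**2), so the gradient through T steps is a product of T factors,
# each with magnitude at most |w|.
w, h_s, grad = 0.5, 0.0, 1.0
for x_t in rng.normal(size=50):
    h_s = np.tanh(w * h_s + x_t)
    grad *= w * (1.0 - h_s ** 2)

print(abs(grad))  # a very small number: the first input barely affects step 50
```

With `|w| < 1` the product can only shrink, which is the scalar version of the information loss over long time intervals.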

<img src="rnn_img/c2.png" alt="drawing" width="800"/>

## Basics of LSTM

<img src="rnn_img/c3.png" alt="drawing" width="800"/>

The LSTM Cell is composed of 4 Gates:

1. Forget Gate
2. Learn Gate
3. Remember Gate
4. Use Gate

<img src="rnn_img/c4.png" alt="drawing" width="800"/>

Steps:

1. The long-term memory is passed to the Forget Gate.
2. The short-term memory and the event are passed to the Learn Gate.
3. Key info from the Learn Gate and the Forget Gate is combined in the Remember Gate, which produces the new long-term memory.
4. Key info from the Learn Gate and the Forget Gate is also combined in the Use Gate, which outputs the prediction and forms the new short-term memory.

<img src="rnn_img/c5.png" alt="drawing" width="800"/>


## Architecture of LSTM

The LSTM can be diagrammed as follows:

<img src="rnn_img/c7.png" alt="drawing" width="800"/>

Breaking it down:

### 1. Learn Gate

1. Takes the short-term memory and the event and combines them
2. Filters the combined info, ignoring unnecessary data

<img src="rnn_img/c8.png" alt="drawing" width="600"/>

In this example, the Learn Gate takes the wolf (event) and the squirrel and tree (short-term memory) as inputs. The information is combined and then filtered. The data from the tree is ignored, since it is meaningless for classification.

The output of the Learn Gate is $N_t i_t$, where:

<img src="rnn_img/c10.png" alt="drawing" width="400"/>

- $STM_{t-1}$ is the short-term memory from the previous time step
- $E_t$ is the current event

The new information $N_t$ is then multiplied element-wise by an ignore factor $i_t$.
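As a rough sketch, the Learn Gate can be written in NumPy. The formulas used here, $N_t = \tanh(W_n[STM_{t-1}, E_t] + b_n)$ and $i_t = \sigma(W_i[STM_{t-1}, E_t] + b_i)$, are the form commonly given for this gate and are an assumption of this example. The names `learn_gate`, `W_n`, `b_n`, `W_i`, and `b_i` and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_gate(stm_prev, event, W_n, b_n, W_i, b_i):
    """Combine STM and event, then filter the result with an ignore factor."""
    combined = np.concatenate([stm_prev, event])
    n_t = np.tanh(W_n @ combined + b_n)   # new information N_t (assumed tanh)
    i_t = sigmoid(W_i @ combined + b_i)   # ignore factor i_t, entries in (0, 1)
    return n_t * i_t                      # element-wise product N_t * i_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 3                       # illustrative sizes (assumed)
stm_prev = rng.normal(size=hidden)
event = rng.normal(size=n_in)
W_n = rng.normal(scale=0.1, size=(hidden, hidden + n_in))
W_i = rng.normal(scale=0.1, size=(hidden, hidden + n_in))
b_n, b_i = np.zeros(hidden), np.zeros(hidden)

learned = learn_gate(stm_prev, event, W_n, b_n, W_i, b_i)
```

Entries of `i_t` close to 0 suppress the matching entries of `N_t`, which is how data such as the tree gets ignored.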

### 2. Forget Gate

1. The long-term memory is passed into the forget gate
2. The long-term memory is multiplied by a forget factor and the result is returned

<img src="rnn_img/c11.png" alt="drawing" width="600"/>
<img src="rnn_img/c12.png" alt="drawing" width="400"/>

- $LTM_{t-1}$ is the long-term memory vector from the previous time step

This vector is multiplied element-wise by the forget factor, which is computed from the short-term memory and the event. This rids the memory of irrelevant data.
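A minimal NumPy sketch of the Forget Gate follows. It assumes the forget factor has the common form $f_t = \sigma(W_f[STM_{t-1}, E_t] + b_f)$. The names `forget_gate`, `W_f`, and `b_f` and the example values are hypothetical. The weights are set to zero so that only the bias decides how much is kept, which makes the two extremes easy to see.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(ltm_prev, stm_prev, event, W_f, b_f):
    """Scale each entry of the long-term memory by a forget factor in (0, 1)."""
    combined = np.concatenate([stm_prev, event])
    f_t = sigmoid(W_f @ combined + b_f)   # forget factor (assumed form)
    return ltm_prev * f_t                 # element-wise product

ltm_prev = np.array([2.0, -1.0, 0.5])
stm_prev = np.array([0.1, 0.2, 0.3])
event = np.array([1.0, 0.0])
W_f = np.zeros((3, 5))                    # zero weights: only the bias matters here

# A large positive bias gives f_t close to 1, so the memory is kept.
kept = forget_gate(ltm_prev, stm_prev, event, W_f, np.full(3, 20.0))
# A large negative bias gives f_t close to 0, so the memory is forgotten.
dropped = forget_gate(ltm_prev, stm_prev, event, W_f, np.full(3, -20.0))
```

In a trained network `W_f` is not zero, so the factor depends on the short-term memory and the event.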

### 3. Remember Gate

1. Combines the outputs of the forget gate and the learn gate to form the new long-term memory (LTM)

<img src="rnn_img/c13.png" alt="drawing" width="600"/>
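As a small worked example, the Remember Gate can be sketched as an element-wise sum of the two gate outputs, $LTM_t = LTM_{t-1} f_t + N_t i_t$. Treating "combine" as addition is an assumption here (it is the usual choice). The function name and the numbers are made up for illustration.

```python
import numpy as np

def remember_gate(forget_out, learn_out):
    """New long-term memory: what survived forgetting plus what was just learned."""
    return forget_out + learn_out

ltm_prev = np.array([1.0, -2.0, 0.5])
f_t = np.array([1.0, 0.0, 0.5])           # keep all, forget all, keep half
learn_out = np.array([0.2, 0.3, -0.1])    # stands in for N_t * i_t

new_ltm = remember_gate(ltm_prev * f_t, learn_out)
print(new_ltm)  # approximately [1.2, 0.3, 0.15]
```

The second entry shows old memory fully replaced by new information, while the first keeps the old value and adds to it.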

### 4. Use Gate

To decide what info to use:

1. Take the output of the forget gate and pass it through a tanh activation function.
2. Pass the short-term memory and the event through a linear function, followed by a sigmoid activation function.

<img src="rnn_img/c14.png" alt="drawing" width="600"/>
<img src="rnn_img/c16.png" alt="drawing" width="400"/>

3. Multiply the two results element-wise and use $U_{t}V_{t}$ as the output, which also becomes the new short-term memory.
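The Use Gate, and one full cell step that chains all four gates, can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation. It assumes $U_t$ is the tanh branch and $V_t$ is the sigmoid branch, that the tanh branch applies a linear function first, and that the other gates use their common forms. The parameter dictionary `p`, its keys, and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def use_gate(forget_out, stm_prev, event, W_u, b_u, W_v, b_v):
    """Use gate: U_t from the forget-gate output, V_t from STM and event."""
    combined = np.concatenate([stm_prev, event])
    u_t = np.tanh(W_u @ forget_out + b_u)   # step 1: tanh branch (assumed U_t)
    v_t = sigmoid(W_v @ combined + b_v)     # step 2: sigmoid branch (assumed V_t)
    return u_t * v_t                        # step 3: element-wise product

def lstm_step(ltm_prev, stm_prev, event, p):
    """One cell step: forget, learn, remember, then use."""
    combined = np.concatenate([stm_prev, event])
    forget_out = ltm_prev * sigmoid(p["W_f"] @ combined + p["b_f"])
    learn_out = (np.tanh(p["W_n"] @ combined + p["b_n"])
                 * sigmoid(p["W_i"] @ combined + p["b_i"]))
    ltm_new = forget_out + learn_out        # remember gate (assumed addition)
    stm_new = use_gate(forget_out, stm_prev, event,
                       p["W_u"], p["b_u"], p["W_v"], p["b_v"])
    return ltm_new, stm_new

rng = np.random.default_rng(0)
hidden, n_in = 4, 3                         # illustrative sizes (assumed)
p = {}
for name in ("f", "n", "i", "v"):           # gates that read [STM, event]
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + n_in))
    p[f"b_{name}"] = np.zeros(hidden)
p["W_u"] = rng.normal(scale=0.1, size=(hidden, hidden))  # reads the forget output
p["b_u"] = np.zeros(hidden)

ltm, stm = np.zeros(hidden), np.zeros(hidden)
for event in rng.normal(size=(6, n_in)):    # a made-up sequence of 6 events
    ltm, stm = lstm_step(ltm, stm, event, p)
```

The returned `stm` serves as both the output at this step and the short-term memory for the next one, while `ltm` carries the long-term memory forward.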