# Long Short-Term Memory (LSTM)

```python
import torch
from torch import nn
from d2l import torch as d2l
```

The term __long short-term memory__ comes from the following intuition: simple recurrent neural networks have _long-term memory_ in the form of weights. The weights change slowly during training, encoding general knowledge about the data. They also have _short-term memory_ in the form of ephemeral activations, which pass from each node to successive nodes.

LSTMs resemble standard recurrent neural networks, but here each ordinary recurrent node is replaced by a _memory cell_. Each memory cell contains an _internal state_, i.e., a node with a self-connected recurrent edge of fixed weight 1, ensuring that the gradient can pass across many time steps without vanishing or exploding.
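The effect of the fixed weight-1 self-connection can be checked numerically. The sketch below is only an illustration, not an LSTM: it uses a scalar recurrence `s_t = w * s_{t-1} + x_t` and compares the gradient that reaches the initial state after `T` steps for `w = 1` against two other weights. The function name `grad_through_self_loop`, the constant input `0.1`, and the values `0.5` and `1.5` are assumptions chosen for the example.

```python
import torch

def grad_through_self_loop(w, T=50):
    """Gradient of the final state w.r.t. the initial state for s_t = w * s_{t-1} + x_t."""
    s0 = torch.tensor(1.0, requires_grad=True)
    s = s0
    for _ in range(T):
        s = w * s + 0.1  # 0.1 is an arbitrary additive input
    s.backward()
    return s0.grad.item()

g_one = grad_through_self_loop(1.0)    # weight fixed at 1: gradient is preserved
g_small = grad_through_self_loop(0.5)  # weight below 1: gradient shrinks as 0.5**T
g_large = grad_through_self_loop(1.5)  # weight above 1: gradient grows as 1.5**T
print(g_one, g_small, g_large)
```

With a recurrent weight of exactly 1, the gradient is passed back unchanged at every step, which is the property the internal state relies on.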

The LSTM model introduces an intermediate type of storage via the memory cell. A memory cell is a composite unit, built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes.

## Gated Memory Cell

Each memory cell is equipped with an internal state and a number of multiplicative gates that determine whether: 

1) a given input should impact the internal state (the input gate), 
2) the internal state should be flushed to 0 (the forget gate), and
3) the internal state of a given neuron should be allowed to impact the cell's output (the output gate).
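The role of each gate can be sketched with hand-set gate values on a single memory cell. The update rule used here, `c = f * c_prev + i * candidate` and `h = o * tanh(c)`, follows the commonly used LSTM formulation and is an assumption, since the text above does not state it. The names `cell_step` and `candidate` are hypothetical.

```python
import torch

def cell_step(c_prev, candidate, i_gate, f_gate, o_gate):
    """One update of a memory cell with gate values given in [0, 1]."""
    # The forget gate scales the old state; the input gate admits the new input.
    c = f_gate * c_prev + i_gate * candidate
    # The output gate controls how much of the state leaves the cell.
    h = o_gate * torch.tanh(c)
    return c, h

c_prev = torch.tensor(2.0)
candidate = torch.tensor(-0.7)

# Input gate closed, forget gate open: the input is ignored and the state is kept.
c_keep, _ = cell_step(c_prev, candidate, i_gate=0.0, f_gate=1.0, o_gate=1.0)
# Forget gate closed (input gate also closed): the old state is flushed to 0.
c_flush, _ = cell_step(c_prev, candidate, i_gate=0.0, f_gate=0.0, o_gate=1.0)
# Output gate closed: the state is kept internally but does not reach the output.
c_hidden, h_hidden = cell_step(c_prev, candidate, i_gate=0.0, f_gate=1.0, o_gate=0.0)
print(c_keep.item(), c_flush.item(), c_hidden.item(), h_hidden.item())
```

In a trained LSTM the gate values are not set by hand; they are computed from data and typically lie strictly between 0 and 1.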

### a) Gated Hidden State

The key distinction between vanilla RNNs and LSTMs is that the latter support gating of the hidden state. This means that we have dedicated mechanisms for when a hidden state should be _updated_ and also when it should be _reset_. These mechanisms are learned and they address the concerns listed above. For instance, if the first token is of great importance, we will learn not to update the hidden state after the first observation. Likewise, we will learn to skip irrelevant temporary observations. Last, we will learn to reset the latent state whenever needed.

### b) Input Gate, Forget Gate, and Output Gate

The data feeding into the LSTM gates are __the input at the current time step__ and __the hidden state of the previous time step__. 
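As a minimal sketch of this data flow, the three gates can each be computed from the current input `X_t` and the previous hidden state `H_prev`. The sketch assumes the commonly used form of a fully connected layer followed by a sigmoid, so that every gate value lies in (0, 1). The tensor sizes, the helper `init_gate_params`, and the weight names are illustrative assumptions, not definitions taken from the text.

```python
import torch

batch_size, num_inputs, num_hiddens = 2, 4, 3  # illustrative sizes
torch.manual_seed(0)

def init_gate_params():
    """Input-to-hidden weights, hidden-to-hidden weights, and bias for one gate."""
    return (torch.randn(num_inputs, num_hiddens) * 0.01,
            torch.randn(num_hiddens, num_hiddens) * 0.01,
            torch.zeros(num_hiddens))

W_xi, W_hi, b_i = init_gate_params()  # input gate parameters
W_xf, W_hf, b_f = init_gate_params()  # forget gate parameters
W_xo, W_ho, b_o = init_gate_params()  # output gate parameters

X_t = torch.randn(batch_size, num_inputs)      # input at the current time step
H_prev = torch.zeros(batch_size, num_hiddens)  # hidden state of the previous time step

# Each gate sees the same two inputs but has its own parameters.
I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)
F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)
O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)
print(I_t.shape, F_t.shape, O_t.shape)
```

Each gate has the same shape as the hidden state, `(batch_size, num_hiddens)`, so it can be applied elementwise.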