# LSTM Networks

## Recurrent Neural Networks

Humans don’t restart their thinking from zero every second. As you read this, you understand each new word because you remember the previous ones—your thoughts have *persistence*.

Traditional neural networks don’t have this ability. For example, if you want to classify what’s happening at each moment in a movie, a standard neural network can’t naturally use information from earlier scenes to understand later ones.

Recurrent Neural Networks (RNNs) solve this problem. They contain loops that allow information to carry forward through time.

<img src="img_lstm/ex1.png">

In this diagram, a neural network block **A** takes an input $x_t$ and produces an output $h_t$. The loop lets the network pass information from one time step to the next. You can think of an RNN as many copies of the same network, each passing messages forward. When we “unroll” the loop, it looks like this:

<img src="img_lstm/ex2.png">
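
To make the unrolled picture concrete, here is a minimal sketch of the loop in plain Python. It assumes a scalar hidden state, a tanh activation, and hand-picked weights (`w_x`, `w_h`, `b`); none of these are specified in the text, and a real RNN would use learned weight matrices.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Unroll the loop: apply the same step (same weights) to every input."""
    h = 0.0  # initial hidden state (assumed to start at zero)
    hs = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)  # h carries information forward
        hs.append(h)
    return hs

hidden_states = run_rnn([1.0, 0.0, 0.0])
```

Even though the last two inputs are zero, the later hidden states stay nonzero, because the first input is still being passed along through `h`.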

This chain-like structure makes RNNs a natural fit for sequence data such as text, speech, or video.
However, standard RNNs have trouble using context that lies far back in the sequence, so in practice we often use **LSTMs**, a special type of RNN that handles this much better. Many of the best-known RNN results were achieved with LSTMs.


### The Problem of Long-Term Dependencies

Sometimes, only recent information is needed for a task. For example, if a language model tries to predict the last word in “the clouds are in the _,” it’s easy to guess **sky** without any earlier context. In these short-range situations, RNNs work well.

<img src="img_lstm/ex3.png">

But some tasks require information from much earlier. For example, to complete the sentence “I grew up in France… I speak fluent _,” we need the word **France** from far back to guess the correct language. The distance between the important information and where it’s needed can become very large.

As this gap grows, standard RNNs struggle to link the two pieces of information.

<img src="img_lstm/ex4.png">

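In theory, RNNs *could* handle these long-term dependencies, but in practice they rarely learn them. Hochreiter (1991) and Bengio et al. (1994) analyzed why: the error signal propagated back through many time steps tends to shrink toward zero or blow up, which makes gradient-based training across long gaps very hard.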
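
One commonly cited reason is that gradients shrink as they are propagated back through many steps. The sketch below illustrates this for a toy scalar RNN, $h_t = \tanh(w_h h_{t-1})$, with an assumed weight of 0.9. It is a simplified illustration, not a reproduction of the cited analyses.

```python
import math

def gradient_through_time(w_h, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w_h * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        # Chain rule: d tanh(z)/dz = 1 - tanh(z)^2, and dz/dh_prev = w_h.
        grad *= w_h * (1.0 - h * h)
    return grad

short_gap = gradient_through_time(w_h=0.9, steps=2)   # nearby dependency
long_gap = gradient_through_time(w_h=0.9, steps=50)   # distant dependency
```

Every factor in the product is at most 0.9 in magnitude, so after 50 steps the gradient is far smaller than after 2. An input that far back has almost no influence on the weight update.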

Thankfully, **LSTMs** were designed to solve exactly this issue!


### The Core Idea Behind LSTMs

The key to LSTMs is the **cell state**, the horizontal line running along the top of the diagram below. The cell state works like a conveyor belt: it runs straight down the entire chain with only minor linear interactions, so information can easily flow along it unchanged.

<img src="img_lstm/ex8.png">

The LSTM can remove information from the cell state or add information to it, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

<img src="img_lstm/ex9.png">


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
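
As a small illustration, a gate can be sketched as a sigmoid followed by an elementwise product. The gate inputs below (`10.0`, `0.0`, `-10.0`) are made-up values chosen to show the three regimes; in an LSTM they would come from a learned layer.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each value by sigmoid(logit): pointwise multiplication."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

values = [2.0, -3.0, 5.0]
# Large positive logit: almost fully open; zero: half open; large negative: almost closed.
gated = apply_gate([10.0, 0.0, -10.0], values)
```

The first value passes through almost unchanged, the second is halved, and the third is nearly erased. Because the sigmoid never reaches exactly 0 or 1, a gate gets arbitrarily close to fully closed or fully open without quite getting there.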

An LSTM has three of these gates, to protect and control the cell state.

### Step-by-Step LSTM Walkthrough

The first step in an LSTM is deciding what information to remove from the cell state. A sigmoid layer called the **forget gate** looks at $h_{t-1}$ and $x_t$, and outputs values between 0 and 1 for each part of the previous cell state $C_{t-1}$.

* **1 = keep this completely**
* **0 = forget this completely**

In a language model, the cell state might contain the gender of the current subject. When a new subject appears, we want to forget the old subject’s gender.

<img src="img_lstm/ex10.png">
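
In the standard LSTM formulation, the forget gate can be written as the equation below. The weight matrix $W_f$ and bias $b_f$ are names assumed here (the text does not name them), $\sigma$ is the sigmoid function, and $[h_{t-1}, x_t]$ denotes the two vectors concatenated.

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$

Each entry of $f_t$ lies between 0 and 1 and scales the matching entry of $C_{t-1}$.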

Next, we choose what new information to add to the cell state. This happens in two steps:

1. The **input gate** (a sigmoid layer) decides what parts to update.
2. A **tanh layer** creates candidate values $\tilde{C}_t$ that could be added to the state.

For our language example, this is where the LSTM prepares to store the gender of the new subject.

<img src="img_lstm/ex11.png">
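
Written out in the standard formulation, the two steps look like this. The symbols $W_i$, $b_i$, $W_C$, and $b_C$ are assumed names for the weights and biases of the two layers; $\sigma$ is the sigmoid function and $[h_{t-1}, x_t]$ is the concatenation of the previous output and the current input.

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{aligned}
$$

The input gate $i_t$ takes values between 0 and 1, while the candidate values $\tilde{C}_t$ lie between -1 and 1.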

Now we update the cell state.

* Multiply the old state by $f_t$ to forget what we chose to forget.
* Add $i_t \ast \tilde{C}_t$, the new candidate values scaled by how much we want to update.

This is the moment the model actually removes the old subject’s gender and stores the new one.

<img src="img_lstm/ex12.png">
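
The update described above is $C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$, applied elementwise. Here is a small worked example with made-up numbers. The two-slot state and the extreme gate values of exactly 0 and 1 are assumptions chosen to keep the arithmetic easy to follow.

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """Elementwise C_t = f_t * C_{t-1} + i_t * C~_t."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]

c_prev = [0.8, -0.5]    # old state; slot 0 holds the old subject's info
f_t = [0.0, 1.0]        # forget slot 0 completely, keep slot 1
i_t = [1.0, 0.0]        # write new information to slot 0 only
c_tilde = [-0.6, 0.9]   # candidate values from the tanh layer

c_t = update_cell_state(c_prev, f_t, i_t, c_tilde)
# c_t == [-0.6, -0.5]: slot 0 was overwritten, slot 1 passed through unchanged
```

Slot 1 shows the conveyor-belt behavior: with the forget gate at 1 and the input gate at 0, the value is carried forward untouched.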

Finally, we decide what to output. The output is a filtered version of the cell state.
A sigmoid layer, the **output gate**, decides which parts of the cell state to expose. We then pass the cell state through **tanh** (to squash values to between -1 and 1) and multiply the result by the output gate, which gives the new output $h_t$.

In the language model, this might output information like whether the subject is singular or plural, preparing the network to choose the correct verb form.

<img src="img_lstm/ex13.png">
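
Putting the four steps together, one LSTM step can be sketched in plain Python as follows. This follows the standard formulation and is not an optimized or library implementation. The `params` dictionary, its keys (`"f"`, `"i"`, `"c"`, `"o"`), and the helper names are hypothetical choices for this example, and the all-zero weights are placeholders for learned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W @ v + b for a list-of-rows matrix W and list vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def layer(params, name, z, activation):
    """Apply one named layer: activation(W @ z + b), elementwise."""
    W, b = params[name]
    return [activation(a) for a in affine(W, z, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns the new output h_t and cell state c_t."""
    z = h_prev + x_t                             # concatenation [h_{t-1}, x_t]
    f_t = layer(params, "f", z, sigmoid)         # forget gate
    i_t = layer(params, "i", z, sigmoid)         # input gate
    c_tilde = layer(params, "c", z, math.tanh)   # candidate values
    o_t = layer(params, "o", z, sigmoid)         # output gate
    # Forget part of the old state, then add the scaled candidates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Output a filtered, squashed view of the new cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

hidden, inputs = 2, 1
zeros = ([[0.0] * (hidden + inputs) for _ in range(hidden)], [0.0] * hidden)
params = {name: zeros for name in "fico"}  # placeholder (untrained) weights

# With all-zero weights every gate is 0.5 and every candidate is 0,
# so the new cell state is exactly half of the old one.
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

To process a sequence, you would call `lstm_step` once per input and feed the returned `h_t` and `c_t` back in as `h_prev` and `c_prev` for the next step.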

### Conclusion

LSTMs were a major step forward for RNNs. Many researchers see **attention** as the next big step.

Attention lets each step of an RNN look at the most relevant information from a larger collection.
For example, in image captioning, the RNN can focus on a specific part of the image for each word it generates. The work of Xu et al. (2015) is a great example—and a good place to start learning about attention.

Attention-based models have led to many exciting breakthroughs, and even more seem to be coming!