<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/umbcdata602/fall2020/blob/master/lab_gated_cell_state.ipynb">
<img src="http://introtodeeplearning.com/images/colab/colab.png?v2.0"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# Gated cell state

Vanilla RNNs suffer from vanishing and exploding gradients, a problem that gated cells, such as those used in LSTMs, are designed to mitigate.

As Raschka & Mirjalili point out on p. 577, backpropagation through time (BPTT) introduces a term in

$$
\frac{\partial L^{(t)} }{ \partial \mathbf{W}_{hh}}
$$

that is proportional to

$$
\frac{\partial h^{(t)}}{\partial h^{(t-k)}}
$$
for time lag $k$. The recurrence relation in vanilla RNNs gives rise to the problem of vanishing and exploding gradients.


<img src="https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch16/images/16_08.png" width="500px">

To gain a conceptual understanding of the problem, consider this highly simplified scalar recurrence relation,
$$
h^{(t)} = \sigma(wh^{(t-1)})
$$
which contains the essence of the problem. Here $w$ is a single recurrent weight and $\sigma$ is the activation function.



The derivative for one time step is proportional to $w$,
$$
\frac{\partial h^{(t)}}{\partial h^{(t-1)}} 
= w \sigma ' (wh^{(t-1)})
$$
where $\sigma'(x) = \frac{d\sigma(x)}{dx} $.





For two time steps, we use the chain rule

$$
\frac{\partial h^{(t)}}{\partial h^{(t-2)}} 
= w \sigma'(wh^{(t-1)}) \frac{\partial h^{(t-1)}}{\partial h^{(t-2)}}
$$

and then the recurrence relation

$$
\frac{\partial h^{(t)}}{\partial h^{(t-2)}} 
= w^2 \sigma'(wh^{(t-1)}) \sigma'(wh^{(t-2)}) 
$$

to show that the gradient is proportional to $w^2$.
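In general, applying the chain rule $k$ times contributes one factor of $w$ per time step, so the gradient is proportional to $w^k$,

$$
\frac{\partial h^{(t)}}{\partial h^{(t-k)}} = w^k \prod_{i=1}^{k} \sigma'(wh^{(t-i)}) \propto w^k
$$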
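The chain-rule expansion above can be checked numerically. The following is a minimal sketch, assuming $\sigma$ is the logistic sigmoid (the text does not fix a particular activation); the helper names `unroll`, `chain_rule_gradient`, and `finite_difference_gradient` are hypothetical and chosen only for this illustration. It compares the product of one-step factors against a central-difference estimate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unroll(h_start, w, k):
    """Apply h <- sigmoid(w * h) k times; return all hidden states, oldest first."""
    states = [h_start]
    for _ in range(k):
        states.append(sigmoid(w * states[-1]))
    return states

def chain_rule_gradient(h_start, w, k):
    """dh^(t)/dh^(t-k) as the product of k one-step factors w * sigmoid'(w * h)."""
    grad = 1.0
    for h_prev in unroll(h_start, w, k)[:-1]:
        s = sigmoid(w * h_prev)
        grad *= w * s * (1.0 - s)  # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    return grad

def finite_difference_gradient(h_start, w, k, eps=1e-6):
    """Central-difference estimate of the same derivative."""
    up = unroll(h_start + eps, w, k)[-1]
    down = unroll(h_start - eps, w, k)[-1]
    return (up - down) / (2.0 * eps)

# Arbitrary illustrative values: h^(t-2) = 0.3, w = 0.8, lag k = 2.
analytic = chain_rule_gradient(0.3, 0.8, 2)
numeric = finite_difference_gradient(0.3, 0.8, 2)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places; increasing `k` in both calls checks longer time lags the same way.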

The essence of the problem is that for large $k$ the factor $w^k$ either explodes or vanishes, depending on whether $|w|>1$ or $|w|<1$, respectively.
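To see how quickly $w^k$ grows or shrinks, here is a small sketch under a simplifying assumption: $\sigma$ is taken to be the identity, so $\sigma' = 1$ and the gradient reduces to exactly $w^k$. With a saturating activation the $\sigma'$ factors would shrink the product further. The function name `lagged_gradient` and the chosen values of $w$ and $k$ are illustrative only.

```python
def lagged_gradient(w, k):
    """dh^(t)/dh^(t-k) for the linear recurrence h^(t) = w * h^(t-1)."""
    grad = 1.0
    for _ in range(k):
        grad *= w  # one-step factor w * sigma'(.), with sigma' = 1 by assumption
    return grad

lags = (1, 10, 50)
gradients = {w: [lagged_gradient(w, k) for k in lags] for w in (0.5, 1.0, 1.5)}

for w, values in gradients.items():
    print(f"w = {w}: " + ", ".join(f"k={k}: {g:.3e}" for k, g in zip(lags, values)))
```

Only $|w| = 1$ keeps the gradient at a stable magnitude for every lag, which is too restrictive a condition to rely on during training.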


<img src="https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch16/images/16_09.png" width="500px">

Gated cells avoid the problem with the gated cell state 

$$
c^{(t)} = c^{(t-1)} f(t-1)
$$

that involves multiplication by the gate function, $f(t)$. For one time step, the gradient is no longer strictly proportional to $w$.
$$
\frac{\partial c^{(t)}}{\partial c^{(t-1)}} 
= f(t-1)
$$


For two time steps, the recurrence relation yields
$$
\frac{\partial c^{(t)}}{\partial c^{(t-2)}} 
= f(t-1)f(t-2)
$$

And for arbitrary $k$,

$$
\frac{\partial c^{(t)}}{\partial c^{(t-k)}} 
= \prod_{i=1}^{k} f(t-i)
$$
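As a rough illustration of this product, the sketch below multiplies $k$ gate values together. The gate values are hypothetical constants picked for the example; in an actual LSTM each gate value is the output of a sigmoid and changes from one time step to the next. The function name `cell_state_gradient` is likewise an assumption of this sketch.

```python
import math

def cell_state_gradient(gates):
    """dc^(t)/dc^(t-k) for c^(t) = c^(t-1) * f(t-1): the product of the k gate values."""
    return math.prod(gates)

k = 50
mostly_open = [0.99] * k  # hypothetical gate values close to 1
half_closed = [0.5] * k   # hypothetical gate values well below 1

grad_open = cell_state_gradient(mostly_open)
grad_closed = cell_state_gradient(half_closed)
print(f"gates near 1:  {grad_open:.3e}")
print(f"gates at 0.5:  {grad_closed:.3e}")
```

When the gate stays close to 1 the gradient survives across many time steps, whereas gate values well below 1 still shrink it; the difference from the vanilla case is that the gate is computed per time step rather than being a single fixed weight.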

The term involving $w^k$ is gone.

While it's still possible to get vanishing or exploding gradients, there are now multiple paths for information to flow through the network for large values of $k$.