# GRU and Further Topics in NMT

## How GRUs Fix Things

### Backpropagation through Time

Vanishing gradients are a serious problem for basic recurrent neural networks. When the gradient becomes zero, we cannot tell which of the following is true:

1. There is no dependency between `t` and `t+n` in the data
2. The parameters are badly configured, so an existing dependency is not being captured

Recall that forward propagation has the following form.

$$
f(h_{t - 1}, x_{t}) = \text{tanh}(W[x_{t}] + Uh_{t-1} + b)
$$

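The temporal derivative, i.e. the derivative of the state at one time step with respect to the state at the previous time step, is given below, where $a$ denotes the argument of the $\text{tanh}$.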

$$
\frac{\partial h_{t+1}}{\partial h_{t}} = U^{T}\frac{\partial\,\text{tanh}(a)}{\partial a}
$$

Each differentiation step through time multiplies the gradient by the weight matrix `U`. Over a sequence of N steps, the gradient is therefore multiplied by `U` N times, i.e. by `U` to the Nth power. If the largest eigenvalue of `U` has magnitude greater than 1, the gradient tends to explode. If it has magnitude less than 1, the gradient vanishes, and the `tanh` derivative, which is at most 1, only shrinks it further.
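To make the effect of repeated multiplication concrete, here is a minimal NumPy sketch. It assumes the `tanh` derivative is exactly 1 (the best case), so only the powers of `U` matter. The helper names `scaled_recurrent_matrix` and `gradient_norm_after` are made up for illustration and are not part of the original notes.

```python
import numpy as np

def scaled_recurrent_matrix(size, spectral_radius, seed=0):
    """Random matrix rescaled so its largest |eigenvalue| equals spectral_radius."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((size, size))
    largest = np.max(np.abs(np.linalg.eigvals(U)))
    return U * (spectral_radius / largest)

def gradient_norm_after(U, n_steps):
    """Frobenius norm of (U^T)^n_steps, with the tanh derivative assumed to be 1."""
    return np.linalg.norm(np.linalg.matrix_power(U.T, n_steps))

# Same random matrix, rescaled to a small and a large spectral radius.
small = gradient_norm_after(scaled_recurrent_matrix(8, 0.5), 50)
large = gradient_norm_after(scaled_recurrent_matrix(8, 1.5), 50)

print(f"largest |eigenvalue| 0.5, 50 steps: {small:.3e}")  # shrinks toward zero
print(f"largest |eigenvalue| 1.5, 50 steps: {large:.3e}")  # blows up
```

The two matrices differ only in scale, yet after 50 steps one gradient factor is negligible and the other is enormous.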

#### Shortcut Connections

This implies that the error must backpropagate through all the intermediate nodes. Perhaps we can create shortcut connections!

![backprop thru time](./assets/11_backprop_thru_time.png)

We want shortcuts such that `h[t]` can affect `h[t+2]` or `h[t+3]` directly. Then we can measure the effect of `h[t]` on `h[t+2]` without passing through every intermediate state.

Essentially that is what we are doing with the gated unit. It gives us the ability to create shortcuts *adaptively*. We enable the network to learn the strength of these shortcut connections.

$$
f(h_{t-1}, x_{t}) = u_{t}\cdot\tilde{h_{t}} + (1 - u_{t}) \cdot h_{t-1}
$$

Here $\cdot$ denotes element-wise multiplication. The candidate update $\tilde{h_{t}}$ is defined as follows.

$$
\tilde{h_{t}} = \text{tanh}\left(W[x_{t}] + U(r_{t} \cdot h_{t-1}) + b\right)
$$

#### Update Gate

The update gate $u_{t}$ controls how much of the previous timestep's hidden state is carried over to the current timestep and how much is replaced by the candidate update.

$$
u_{t} = \sigma\left(W_{u}[x_{t}] + U_{u}h_{t-1} + b_{u}\right)
$$

#### Reset Gate

We also need to let the network prune unnecessary connections adaptively. That is the job of the reset gate.

$$
r_{t} = \sigma\left(W_{r}[x_{t}] + U_{r}h_{t-1} + b_{r}\right)
$$
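The four equations above can be combined into a single forward step. The following is a minimal NumPy sketch, not a reference implementation. It assumes that $W[x_{t}]$ denotes a plain matrix-vector product with the input vector, and the `params` dictionary, `init_params`, and the sizes used are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate, reset gate, candidate, then interpolation."""
    u_t = sigmoid(params["W_u"] @ x_t + params["U_u"] @ h_prev + params["b_u"])
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    # The reset gate scales h_prev element-wise before it enters the candidate.
    h_tilde = np.tanh(params["W"] @ x_t + params["U"] @ (r_t * h_prev) + params["b"])
    return u_t * h_tilde + (1.0 - u_t) * h_prev

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for suffix in ("", "_u", "_r"):
        params["W" + suffix] = 0.1 * rng.standard_normal((hidden_size, input_size))
        params["U" + suffix] = 0.1 * rng.standard_normal((hidden_size, hidden_size))
        params["b" + suffix] = np.zeros(hidden_size)
    return params

# Run the cell over a short random sequence of 5 inputs of dimension 3.
params = init_params(input_size=3, hidden_size=4)
h = np.zeros(4)
for x_t in np.random.default_rng(1).standard_normal((5, 3)):
    h = gru_step(x_t, h, params)

print(h.shape)
```

Because each new state is a gated mix of a `tanh` output and the previous state, the hidden state stays inside (-1, 1) when it starts at zero.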

### Gradient Highway

If we look at the equation again, the beauty is in the $(1 - u_{t})\cdot h_{t-1}$ part.

$$
f(h_{t-1}, x_{t}) = u_{t}\cdot\tilde{h_{t}} + (1 - u_{t}) \cdot h_{t-1}
$$

If the update gate is close to a vector of zeros, the current `h[t]` directly copies `h[t-1]`, a mapping with a slope of 1. The information flows forward without any new transformation. That is the perfect case for gradients to flow beautifully, and it enables the network to establish long term dependencies.

On the other hand, if the update gate is *learned* to be close to a vector of ones, the hidden states are being updated aggressively. That also means there is no need for a long term dependency, so feel free to let the gradients vanish!
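Both cases can be checked with a short derivation. This is a sketch in Jacobian notation, where $\text{diag}(v)$ stands for the diagonal matrix built from a vector $v$ (a notation introduced here, not used elsewhere in these notes). Applying the product rule to the state update gives

$$
\frac{\partial h_{t}}{\partial h_{t-1}} = \text{diag}(1 - u_{t}) + \text{diag}(u_{t})\frac{\partial \tilde{h_{t}}}{\partial h_{t-1}} + \text{diag}(\tilde{h_{t}} - h_{t-1})\frac{\partial u_{t}}{\partial h_{t-1}}
$$

Since $u_{t}$ is a sigmoid, its own derivative is

$$
\frac{\partial u_{t}}{\partial h_{t-1}} = \text{diag}\left(u_{t}\cdot(1 - u_{t})\right)U_{u}
$$

As $u_{t}$ approaches zero, the first term approaches the identity matrix while the second and third terms vanish, so the gradient passes through the step essentially unchanged. As $u_{t}$ approaches one, the first and third terms vanish and only $\partial \tilde{h_{t}} / \partial h_{t-1}$ remains. That term contains `U` and the `tanh` derivative, so the step behaves much like a basic RNN step.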


## LSTM

**NOTE**: GRUs and LSTMs do not remember forever. The longest span they can remember is around 100 steps. It is called Long *Short Term Memory* for a reason.

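The hidden state of a GRU is equivalent to the cell state of an LSTM, with a small difference: the GRU ties its two coefficients together as $u_{t}$ and $1 - u_{t}$, whereas the LSTM uses a separate input gate $i_{t}$ and forget gate $f_{t}$.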

$$
\text{GRU}\; h_{t} = u_{t} \cdot \tilde{h_{t}} + (1 - u_{t})\cdot h_{t-1}
$$


$$
\text{LSTM}\; c_{t} = i_{t} \cdot \tilde{c_{t}} + f_{t}\cdot c_{t-1}
$$
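The relationship between the two updates can be sketched numerically. The snippet below only implements the two state-update equations above, not full GRU or LSTM cells. The function names and the random gate values are made up for illustration.

```python
import numpy as np

def gru_state_update(u_t, h_tilde, h_prev):
    """GRU: a single gate u_t interpolates between candidate and previous state."""
    return u_t * h_tilde + (1.0 - u_t) * h_prev

def lstm_cell_update(i_t, f_t, c_tilde, c_prev):
    """LSTM: input gate i_t and forget gate f_t are independent of each other."""
    return i_t * c_tilde + f_t * c_prev

rng = np.random.default_rng(0)
gate = rng.uniform(size=4)                  # stand-in for a sigmoid output in [0, 1)
candidate = np.tanh(rng.standard_normal(4))
previous = rng.standard_normal(4)

# Tying the forget gate to 1 - input gate recovers the GRU-style update.
tied = lstm_cell_update(gate, 1.0 - gate, candidate, previous)
matches_gru = np.allclose(tied, gru_state_update(gate, candidate, previous))

# With independent gates, the LSTM can keep the old state and add the new one.
both_open = lstm_cell_update(np.ones(4), np.ones(4), candidate, previous)
keeps_and_adds = np.allclose(both_open, candidate + previous)

print(matches_gru, keeps_and_adds)
```

The second case has no GRU counterpart, because a GRU that fully accepts the candidate must fully drop the previous state.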