# GRU and Further Topics in NMT

## How GRUs Fix Things

### Backpropagation through Time

Vanishing gradients are a serious problem for basic recurrent neural networks. When the gradient becomes zero, we cannot tell which of the following is the case:

1. There is no dependency between steps `t` and `t+n` in the data
2. The parameters are badly configured, so an existing dependency is not captured

Recall that forward propagation has the following form.

$$
f(h_{t-1}, x_{t}) = \text{tanh}(W[x_{t}] + Uh_{t-1} + b)
$$

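The temporal derivative, i.e. the derivative of the hidden state with respect to the previous hidden state, is as follows, where $a$ denotes the argument of the tanh.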

$$
\frac{\partial h_{t+1}}{\partial h_{t}} = U^{T}\frac{\partial\,\text{tanh}(a)}{\partial a}
$$

The gradient is multiplied by the weight matrix `U` once per time step. Over a long sequence of `N` steps, it is effectively multiplied by `U` to the `N`-th power. If the largest eigenvalue of `U` is large (magnitude above 1), the gradient tends to explode. If it is small (magnitude below 1), the gradient vanishes.
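
The effect of this repeated multiplication can be sketched in the scalar case. The snippet below is only an illustration: it assumes a one-dimensional hidden state, so that `U` reduces to a single number `u`, and it ignores the tanh derivative (which is at most 1 and can only shrink the product further). The function name `gradient_scale` is made up for this example.

```python
def gradient_scale(u: float, n_steps: int) -> float:
    """Magnitude of d h[t+n] / d h[t] for the scalar linear recurrence h[t+1] = u * h[t]."""
    scale = 1.0
    for _ in range(n_steps):
        scale *= abs(u)  # one factor of U per time step
    return scale


for u in (0.5, 1.0, 1.5):
    # |u| < 1 shrinks towards zero, |u| > 1 blows up, |u| = 1 stays put
    print(u, gradient_scale(u, 50))
```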

#### Shortcut Connections

This implies that the error must backpropagate through all the intermediate nodes. Perhaps we can create shortcut connections!

![backprop thru time](./assets/11_backprop_thru_time.png)

We want shortcuts such that `h[t]` can affect `h[t+2]` or `h[t+3]` directly. Then the error at `h[t+2]` can flow back to `h[t]` without passing through every intermediate step.

Essentially that is what we are doing with the gated unit. It gives us the ability to create shortcuts *adaptively*. We enable the network to learn the strength of these shortcut connections.

$$
f(h_{t-1}, x_{t}) = u_{t}\odot\tilde{h_{t}} + (1 - u_{t}) \odot h_{t-1}
$$

The candidate update $\tilde{h_{t}}$ is defined as follows, where $r_{t}$ is the reset gate.

$$
\tilde{h_{t}} = \text{tanh}\left(W[x_{t}] + U(r_{t} \odot h_{t-1}) + b\right)
$$

#### Update Gate

The update gate $u_{t}$ controls how strongly the candidate update overwrites the previous hidden state. Correspondingly, $1 - u_{t}$ is how much of the previous timestep carries over to the current timestep.

$$
u_{t} = \sigma\left(W_{u}[x_{t}] + U_{u}h_{t-1} + b_{u}\right)
$$

#### Reset Gate

We also need to let the network prune unnecessary connections adaptively, so we add a reset gate.

$$
r_{t} = \sigma\left(W_{r}[x_{t}] + U_{r}h_{t-1} + b_{r}\right)
$$
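
Putting the three equations together, one GRU step could look roughly like the sketch below. It uses plain Python lists instead of a tensor library to stay self-contained. The parameter dictionary layout (`"W"`, `"U_u"`, `"b_r"`, ...) and the helper names are assumptions made for this example, not part of any library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU step; `p` is a dict of parameters (hypothetical layout)."""
    u = [sigmoid(a) for a in affine(p["W_u"], x, p["U_u"], h_prev, p["b_u"])]  # update gate
    r = [sigmoid(a) for a in affine(p["W_r"], x, p["U_r"], h_prev, p["b_r"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                             # r ⊙ h[t-1]
    h_tilde = [math.tanh(a) for a in affine(p["W"], x, p["U"], gated, p["b"])]  # candidate
    # Interpolate between the candidate and the previous state.
    return [ui * ht + (1.0 - ui) * hp for ui, ht, hp in zip(u, h_tilde, h_prev)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {
    "W": [[0.5], [-0.5]], "U": zeros, "b": [0.0, 0.0],
    "W_u": [[0.0], [0.0]], "U_u": zeros, "b_u": [-50.0, -50.0],  # update gate ~ 0
    "W_r": [[0.0], [0.0]], "U_r": zeros, "b_r": [0.0, 0.0],
}
h = gru_step([1.0], [0.3, -0.7], params)
print(h)  # stays very close to [0.3, -0.7] because the update gate is ~0
```

With the update gate bias pushed far negative, the step simply copies the previous state. Pushing it far positive instead makes the step return the candidate.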

### Gradient Highway

If we look at the equation again, the beauty is in the $(1 - u_{t})\odot h_{t-1}$ part.

$$
f(h_{t-1}, x_{t}) = u_{t}\odot\tilde{h_{t}} + (1 - u_{t}) \odot h_{t-1}
$$

If the update gate is close to a vector of zeros, the current `h[t]` directly copies `h[t-1]`, a mapping with a slope of 1. The information flows forward without any new transformation. That is the perfect case for gradients to flow beautifully, and it enables the network to establish long-term dependencies.

On the other hand, if the update gate is *learned* to be close to a vector of ones, the hidden states are being updated aggressively. That also means there is no need for a long-term dependency there, so feel free to let the gradients vanish!
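
The two regimes can be checked numerically with a toy scalar recurrence. This is a sketch under simplifying assumptions: the update gate is held at a fixed constant instead of being computed from the input, there is no reset gate, and the candidate is just `tanh(w * h)`. The names `run` and `sensitivity` are made up for this example.

```python
import math


def run(h0, u_gate, w, n_steps):
    """Scalar recurrence h = u * tanh(w * h) + (1 - u) * h with a fixed gate value."""
    h = h0
    for _ in range(n_steps):
        h = u_gate * math.tanh(w * h) + (1.0 - u_gate) * h
    return h


def sensitivity(u_gate, w=0.5, n_steps=100, eps=1e-6):
    """Central finite-difference estimate of d h[n] / d h[0]."""
    hi = run(0.5 + eps, u_gate, w, n_steps)
    lo = run(0.5 - eps, u_gate, w, n_steps)
    return (hi - lo) / (2 * eps)


print(sensitivity(u_gate=0.01))  # gate mostly closed: the gradient survives 100 steps
print(sensitivity(u_gate=1.0))   # plain tanh recurrence: the gradient collapses towards 0
```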


## LSTM

**NOTE**: GRUs and LSTMs do not remember forever. The longest span they can remember is around 100 steps. It is called Long *Short Term Memory* for a reason.

### Comparison

Let's compare GRU and LSTM and see their difference. GRU computes its hidden state by the following equations.

- u is update gate
- r is reset gate

$$
h_{t} = u_{t} \odot \tilde{h}_{t} + (1 - u_{t})\odot h_{t-1}
$$

$$
\tilde{h}_{t} = \text{tanh}\left(W[x_{t}] + U(r_{t}\odot h_{t-1}) + b\right)
$$

$$
u_{t} = \sigma\left(W_{u}[x_{t}] + U_{u}h_{t-1} + b_{u}\right)
$$

$$
r_{t} = \sigma\left(W_{r}[x_{t}] + U_{r}h_{t-1} + b_{r}\right)
$$

The LSTM computes its hidden state by the following equations. 

- c is cell state
- o is output gate
- i is input gate
- f is forget gate

$$
h_{t} = o_{t}\odot\text{tanh}(c_{t})
$$

$$
c_{t} = i_{t} \odot \tilde{c_{t}} + f_{t}\odot c_{t-1}
$$

$$
\tilde{c_{t}} = \text{tanh}\left(W_{c}[x_{t}] + U_{c}h_{t-1} + b_{c}\right)
$$

$$
o_{t} = \sigma\left(W_{o}[x_{t}] + U_{o}h_{t-1} + b_{o}\right)
$$

$$
i_{t} = \sigma\left(W_{i}[x_{t}] + U_{i}h_{t-1} + b_{i}\right)
$$

$$
f_{t} = \sigma\left(W_{f}[x_{t}] + U_{f}h_{t-1} + b_{f}\right)
$$
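
As a rough illustration, the six LSTM equations translate into the step function below. It is a minimal sketch with plain Python lists. The parameter dictionary layout and the helper names (`affine`, `zero_params`, `lstm_step`) are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def zero_params(n_hidden, n_input):
    """All-zero parameters for the candidate (c) and the i, f, o gates."""
    p = {}
    for name in ("c", "i", "f", "o"):
        p["W_" + name] = [[0.0] * n_input for _ in range(n_hidden)]
        p["U_" + name] = [[0.0] * n_hidden for _ in range(n_hidden)]
        p["b_" + name] = [0.0] * n_hidden
    return p


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; returns the new hidden state and cell state."""
    def pre(name):
        return affine(p["W_" + name], x, p["U_" + name], h_prev, p["b_" + name])

    i = [sigmoid(a) for a in pre("i")]          # input gate
    f = [sigmoid(a) for a in pre("f")]          # forget gate
    o = [sigmoid(a) for a in pre("o")]          # output gate
    c_tilde = [math.tanh(a) for a in pre("c")]  # candidate cell state
    c = [ii * ct + ff * cp for ii, ct, ff, cp in zip(i, c_tilde, f, c_prev)]
    h = [oo * math.tanh(cc) for oo, cc in zip(o, c)]
    return h, c
```

For example, with `b_f` set to a large positive value and `b_i` to a large negative value, the cell state is carried over almost unchanged from one step to the next.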


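The hidden state of a GRU plays the same role as the cell state of an LSTM, with a small difference: the GRU ties the two coefficients together through a single update gate ($u_{t}$ and $1 - u_{t}$), while the LSTM learns separate input and forget gates ($i_{t}$ and $f_{t}$).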

$$
\text{GRU}\; h_{t} = u_{t} \odot \tilde{h_{t}} + (1 - u_{t})\odot h_{t-1}
$$


$$
\text{LSTM}\; c_{t} = i_{t} \odot \tilde{c_{t}} + f_{t}\odot c_{t-1}
$$
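
A tiny scalar check makes the relationship between the two update rules concrete: if the LSTM gates happened to satisfy `i = u` and `f = 1 - u`, the cell update would coincide with the GRU update. The function names below are made up for this illustration, and the numbers are arbitrary.

```python
def gru_state(u, h_tilde, h_prev):
    """GRU update: one gate interpolates between candidate and previous state."""
    return u * h_tilde + (1.0 - u) * h_prev


def lstm_cell(i, f, c_tilde, c_prev):
    """LSTM update: independent input and forget gates."""
    return i * c_tilde + f * c_prev


u = 0.3
tied = lstm_cell(u, 1.0 - u, 0.8, -0.2)   # LSTM with tied gates
untied = lstm_cell(u, 0.9, 0.8, -0.2)     # LSTM free to keep more of the old state
print(gru_state(u, 0.8, -0.2), tied, untied)
```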

### Secret Juice

Rather than pushing `c[t-1]` through a matrix multiplication and a non-linearity, we get `c[t]` by adding the non-linear stuff to (a gated copy of) `c[t-1]`. There is a direct linear connection between `c[t]` and `c[t-1]`. The same technique is also used in residual networks in computer vision, e.g. ResNet-50. When a convolutional network gets too deep, it suffers from the same vanishing gradient problem. The addition solves the problem!

### Practical Advice

1. Use an LSTM or GRU
2. Initialize recurrent matrices to be orthogonal
3. Initialize other matrices with small scale
4. Initialize the forget gate bias to 1, i.e. *default to remembering*
5. Use adaptive learning rate algorithms, *Adam*, *AdaDelta*, ...
6. Clip the norm of the gradient to 1 ~ 5
7. Apply dropout only vertically (between layers, not along the recurrent connections), or learn how to do it right
8. Be patient
9. Do ensembles
    - Train 8 ~ 10 nets and average their predictions
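
Item 6 of the list above (gradient norm clipping) can be sketched in a few lines. This is one common variant that rescales all gradients by their joint L2 norm. The function name `clip_by_global_norm` and the list-of-lists gradient layout are assumptions made for this example.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total  # already small enough, leave untouched
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


clipped, norm_before = clip_by_global_norm([[3.0, 4.0], [12.0]], max_norm=5.0)
print(norm_before)  # 13.0
print(clipped)      # same direction as before, joint norm rescaled to about 5
```

Clipping changes only the length of the gradient, not its direction, which is why it is a fairly safe guard against exploding gradients.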
    
## Machine Translation Evaluation

### BLEU

BLEU stands for bilingual evaluation understudy. The idea is to have a human produce a reference translation. We assume that a machine translation is good to the extent that its word n-grams (three words in a row, two words in a row, etc.) also appear anywhere in the reference translation.

#### N-gram Precision

The score lies between 0 and 1. It asks: what percentage of machine n-grams can be found in the reference translation? For each n-gram size, the machine translation is not allowed to match the identical portion of the reference translation more than once. Commonly, n-grams of up to 3 or 4 words are used.
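
The "match at most once" rule amounts to clipping each n-gram count by its count in the reference. A minimal sketch, assuming whitespace-tokenized sentences and a single reference (the function names are made up for this example):

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, each reference n-gram usable once."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


candidate = "the the the cat".split()
reference = "the cat sat on the mat".split()
print(clipped_precision(candidate, reference, 1))  # 0.75: "the" may match at most twice
```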

#### Brevity Penalty

The machine cannot just output the single word `the` to get a perfect precision. If the translation is shorter than the human translation, a penalty is applied to the final score.

#### Calculation

BLEU is a weighted geometric mean of the n-gram precisions, multiplied by a brevity penalty factor.

$$
p_{n} = \frac{\text{number of matched n-gram}}{\text{number of machine translation n-gram}}
$$

$$
w_{n} = \text{weight for each n-gram}
$$

$$
BP = \exp\left(\min\left(0, 1 - \frac{\text{len}_{ref}}{\text{len}_{MT}}\right)\right)
$$

Then the BLEU score is calculated as follows.

$$
BLEU = BP \prod^{N}_{n=1} p_{n}^{w_{n}}
$$
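
The formulas above can be combined into a short sentence-level sketch. It assumes uniform weights $w_{n} = 1/N$, a single reference, and whitespace tokenization. Real BLEU implementations usually work at the corpus level and handle multiple references and smoothing, so treat this only as an illustration. The function names are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights 1 / max_n and one reference."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        matched = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        if total == 0 or matched == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_sum += (1.0 / max_n) * math.log(matched / total)
    # Brevity penalty: below 1 only when the candidate is shorter than the reference.
    bp = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))
    return bp * math.exp(log_sum)


reference = "the cat sat on the mat".split()
print(bleu(reference, reference))      # a perfect match scores 1.0
print(bleu(reference[:5], reference))  # all n-grams match, but the brevity penalty applies
```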