# Machine Translation & LSTM

## Traditional MT Model
Traditional translation methods are statistical. It is impractical to hand-compile a state machine for translation because a given language has so many different grammatical rules.

The common approach is to use a parallel corpus, i.e. a collection of sentences paired with their translations, and then to perform the following complicated steps.

For example, let the source language be French `f` and the target language be English `e`. The probabilistic formulation using Bayes' rule is

$$
\hat{e} = \text{argmax}_{e}\;P(e \mid f) = \text{argmax}_{e}\;P(f \mid e)P(e)
$$

* Translation model `P(f|e)` is trained on a parallel corpus
* Language model `P(e)` is trained on an English-only corpus

Finally, a decoder takes these two models and combines them to search for the best-scoring translation; the combined system is tuned on proper translation data.
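As a minimal sketch of the Bayes-rule decision above, the snippet below scores a few candidate English sentences by `log P(f|e) + log P(e)` and keeps the best one. The candidate list and every probability are invented for illustration; a real decoder searches a far larger space.

```python
import math

# Hypothetical scores for a single French sentence f; all numbers are made up.
# translation_model[e] stands in for P(f | e), language_model[e] for P(e).
translation_model = {
    "Japan shaken by two new quakes": 0.20,
    "Japan shaken two new quakes by": 0.25,
    "The Japan shaken by two new quakes": 0.30,
}
language_model = {
    "Japan shaken by two new quakes": 0.010,
    "Japan shaken two new quakes by": 0.0001,
    "The Japan shaken by two new quakes": 0.001,
}

def decode(candidates):
    """Return the candidate e that maximizes log P(f|e) + log P(e)."""
    return max(
        candidates,
        key=lambda e: math.log(translation_model[e]) + math.log(language_model[e]),
    )

best = decode(list(translation_model))
```

Here the translation model alone would prefer the candidate starting with `The`, but the language model penalizes it, so the fluent sentence wins.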

### Translation Model: Alignment
The first step is to figure out which words or phrases in the source language translate to which words or phrases in the target language. This is already a very hard problem.

```python
# French is the source language and English is the target, as in the formulation above
source = 'Le Japon secoué par deux nouveaux séismes'
target = 'Japan shaken by two new quakes'
```

Notice that `Le` does not map to any English word because we don't say `The Japan`. Each phrase in the source language has many possible translations, resulting in a large search space.
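One way to picture an alignment is as a list of indices, one per French word. The sketch below is a hand-written alignment for this sentence pair, not the output of a trained model; the figure of five candidate translations per word is an arbitrary assumption used only to show how quickly the search space grows.

```python
english = "Japan shaken by two new quakes".split()
french = "Le Japon secoué par deux nouveaux séismes".split()

# Hypothetical hand-written alignment: for each French position, the index of
# the English word it aligns to, or None when it has no English counterpart.
alignment = [None, 0, 1, 2, 3, 4, 5]

pairs = [
    (f_word, english[i] if i is not None else None)
    for f_word, i in zip(french, alignment)
]
unaligned = [f_word for f_word, e_word in pairs if e_word is None]

# Assuming 5 candidate translations per French word (an invented figure),
# the number of word-by-word hypotheses already grows exponentially,
# even before word reordering is considered.
candidates_per_word = 5
num_hypotheses = candidates_per_word ** len(french)
```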

### Summary
The traditional approach relies on a lot of human feature engineering. It is a very complex system that requires many different, independently trained machine learning models to perform one translation. On top of that, there are hundreds of human languages.

## Deep Learning Approach
### Simple RNN
Here's a simple approach: train an RNN as an encoder that spits out a vector at the end. Train another RNN as a decoder that takes the output of the encoder and spits out a result.

![rnn_machine_translation](./assets/rnn_machine_translation.png)

#### Encoder
Compute a hidden vector at each timestep from the previous timestep's hidden vector and the current timestep's word vector.

$$
h_{t} = f\left(W^{hh}h_{t-1} + W^{hx}x_{t}\right)
$$
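The encoder recurrence can be sketched in NumPy as follows. This is a toy under stated assumptions: `tanh` stands in for the nonlinearity `f`, the dimensions are arbitrary, and the weights and word vectors are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, embed_dim, seq_len = 4, 3, 5  # hypothetical sizes

# Randomly initialized weights standing in for trained parameters.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hx = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))

def encode(inputs):
    """Apply h_t = tanh(W_hh h_{t-1} + W_hx x_t) and return all hidden states."""
    h = np.zeros(hidden_dim)  # initial hidden state
    states = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_hx @ x_t)
        states.append(h)
    return np.stack(states)

word_vectors = rng.normal(size=(seq_len, embed_dim))  # stand-in word embeddings
encoder_states = encode(word_vectors)
c = encoder_states[-1]  # last hidden vector, the summary passed to the decoder
```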

#### Decoder
Compute a hidden vector at each timestep from the previous timestep's hidden vector,

$$
h_{t} = f\left(W^{hh}h_{t-1}\right)
$$

and then feed the hidden vector at each timestep into a softmax to get a distribution over target words.

$$
y_{t} = \text{softmax}\left(W^{S}h_{t}\right)
$$
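A matching sketch of the decoder is shown below. It assumes the decoder's first hidden state is the encoder's final vector, uses `tanh` for `f`, and again relies on random weights and made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, vocab_size, max_len = 4, 6, 3  # hypothetical sizes

W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_S = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def decode(c, steps):
    """Unroll h_t = tanh(W_hh h_{t-1}) and y_t = softmax(W_S h_t)."""
    h = c  # assumption: start from the encoder's final hidden vector
    distributions = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)
        distributions.append(softmax(W_S @ h))
    return np.stack(distributions)

c = rng.normal(size=hidden_dim)  # stand-in for the encoder output
y = decode(c, max_len)  # one probability distribution per output timestep
```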

#### Objective
Minimize the cross-entropy loss over all target words conditioned on the source words. This is equivalent to maximizing the average log-likelihood of the `N` training pairs:

$$
\max_{\theta} \frac{1}{N}\sum_{n=1}^{N} \log P_{\theta}\left(y^{n} \mid x^{n}\right)
$$
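To make the objective concrete, the sketch below computes the average log-likelihood for two imaginary training pairs. It assumes the sentence probability factorizes into per-word probabilities, and the probability values themselves are invented.

```python
import math

# Hypothetical probabilities the model assigned to the *correct* target word
# at each timestep, for N = 2 training pairs (all values are made up).
correct_word_probs = [
    [0.5, 0.25],
    [0.8, 0.5, 0.1],
]

def sentence_log_likelihood(probs):
    """log P(y | x) as a sum of per-word log-probabilities."""
    return sum(math.log(p) for p in probs)

N = len(correct_word_probs)
avg_log_likelihood = sum(sentence_log_likelihood(p) for p in correct_word_probs) / N

# Maximizing the average log-likelihood is the same as minimizing its negative,
# which is the cross-entropy loss summed over target words.
cross_entropy_loss = -avg_log_likelihood
```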

### RNN Extensions
First, the encoder and decoder should have separate weights, one set per language. An English encoder can then be swapped out for a Chinese encoder and still work with the French decoder.

#### Hidden State
Each input to `phi` has its own linear transformation matrix.

$$
h_{t} = \phi(h_{t-1}) = f\left(W^{hh}h_{t-1}\right)
$$

Compute every hidden state in the decoder from:
* Previous hidden state
* Last hidden vector of encoder denoted as `c`
* Previous predicted output word `y[t-1]`

$$
h_{D, t} = \phi_{D}\left(h_{t-1}, c, y_{t-1}\right)
$$
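The conditioned decoder can be sketched as below, with one weight matrix per input of `phi`. Several choices here are assumptions rather than part of the notes: `tanh` as the nonlinearity, a one-hot vector for the previous word, a zero initial hidden state, index 0 as a start token, and greedy word selection.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_dim, vocab_size, max_len = 4, 6, 3  # hypothetical sizes

# One linear transformation per input of phi_D, plus the softmax weights.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hc = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))
W_S = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))

def one_hot(index):
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

def greedy_decode(c, start_index=0):
    """Unroll h_t = tanh(W_hh h_{t-1} + W_hc c + W_hy y_{t-1}) greedily."""
    h = np.zeros(hidden_dim)       # assumed initial decoder state
    y_prev = one_hot(start_index)  # assumed start-of-sentence token
    words = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_hc @ c + W_hy @ y_prev)
        # The argmax of the softmax equals the argmax of the raw scores.
        word = int(np.argmax(W_S @ h))
        words.append(word)
        y_prev = one_hot(word)
    return words

c = rng.normal(size=hidden_dim)  # stand-in for the encoder's last hidden vector
predicted = greedy_decode(c)
```

Because `c` enters every step, the decoder does not have to carry the whole source sentence through its own hidden state.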

#### Additionals
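Train stacked (deep) RNNs with multiple layers. Also consider training a bidirectional encoder, so that each position is encoded with context from both directions. Another option is to feed the input sequence in reverse order, which places the first source words close to the first target words and makes the optimization problem easier.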

#### Better Recurrent Unit
The real improvement lies in solving the vanishing gradient problem. A vanilla RNN restricts us to translating short sequences. We need a way to keep track of memories better and to provide a gradient highway. This leads us to a better recurrent unit.

## Gated Recurrent Unit (GRU)
A standard RNN computes the hidden vector at the current timestep directly. A GRU first computes an **update** gate and a **reset** gate.

### Update Gate
$$
z_{t} = \sigma\left( W^{z}x_{t} + U^{z}h_{t-1}\right)
$$

`W` and `U` are different weight matrices that map the input and the previous hidden state to the same output hidden dimension.

### Reset Gate
$$
r_{t} = \sigma\left( W^{r}x_{t} + U^{r}h_{t-1}\right)
$$

Similarly, the reset gate produces a vector with the same dimension as the output of the update gate. The entries of both gate vectors lie between 0 and 1 because the sigmoid function is applied.

### Memory Content
$$
\tilde{h}_{t} = \text{tanh}\left(Wx_{t} + r_{t}\circ Uh_{t-1} \right)
$$

The new memory content is denoted `h` with a tilde. If the reset gate is all zeros, the expression ignores the previous memory and stores only the new input (e.g. word) information. The final memory at the current timestep combines the previous hidden state with the new memory content.

$$
h_{t} = z_{t} \circ h_{t-1} + (1 - z_{t})\circ \tilde{h}_{t}
$$

Which is equivalent to

$$
h_{t} = z_{t} \circ h_{t-1} + (1 - z_{t})\circ \text{tanh}\left(Wx_{t} + r_{t}\circ Uh_{t-1} \right)
$$
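Putting the gate equations together, a single GRU step might look like the NumPy sketch below. The parameter dictionary, its key names, the dimensions, and the random weights are all illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update following the update, reset and memory equations."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)            # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)            # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + r * (p["U"] @ h_prev))    # new memory
    return z * h_prev + (1.0 - z) * h_tilde                    # final memory

rng = np.random.default_rng(3)
input_dim, hidden_dim = 3, 4  # hypothetical sizes

def init(cols):
    return rng.normal(scale=0.1, size=(hidden_dim, cols))

# W_* matrices act on the input, U_* matrices act on the previous hidden state.
params = {
    "W_z": init(input_dim), "U_z": init(hidden_dim),
    "W_r": init(input_dim), "U_r": init(hidden_dim),
    "W": init(input_dim), "U": init(hidden_dim),
}

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # stand-in word vectors
    h = gru_step(x_t, h, params)
```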

### Intuition
If the reset gate is close to zero, the unit ignores the previous hidden state, which lets the model drop information that is irrelevant to the future.

The update gate controls how much of the past state should matter now. If the update gate is close to one, the unit copies its information through many timesteps, which reduces the vanishing gradient problem.

Units with short-term dependencies often have very active reset gates.
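The copying behavior of the update gate can be illustrated with a toy experiment. The sketch fixes the update gate to a constant and sets the new memory content to zero, both artificial assumptions, so the only thing that happens is the blend `z * h_prev + (1 - z) * h_tilde`.

```python
import numpy as np

def blend(z, h_prev, h_tilde):
    """Final-memory equation of the GRU with a given update gate value."""
    return z * h_prev + (1.0 - z) * h_tilde

h0 = np.array([0.9, -0.5])       # information stored at the first timestep
h_tilde = np.array([0.0, 0.0])   # assume new content carries nothing useful

def run(z_value, steps=50):
    h = h0
    for _ in range(steps):
        h = blend(z_value, h, h_tilde)
    return h

kept = run(0.99)  # update gate near one: most of h0 survives 50 steps
lost = run(0.5)   # update gate at one half: h0 is effectively erased
```

In this toy setting the state after `steps` updates is `z ** steps * h0`, and the gradient with respect to `h0` along this path scales by the same factor, which is why a gate near one acts as a gradient highway.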

## LSTM
The next level of complexity for gated units is the long short-term memory cell, known as the LSTM cell. It has three different types of gates.

### Input
The input gate measures how much the current (new) cell content should matter.

$$
i_{t} = \sigma\left( W^{i}x_{t} + U^{i}h_{t-1}\right)
$$

### Forget
The forget gate measures how much the previous cell state should matter.

$$
f_{t} = \sigma\left(W^{f}x_{t} + U^{f}h_{t-1}\right)
$$

### Output
The output gate measures how much of the current cell state should be exposed when computing the hidden vector of the current timestep.

$$
o_{t} = \sigma\left(W^{o}x_{t} + U^{o}h_{t-1}\right)
$$

### Cell State
Every new cell state is a function of the previous cell state, the previous hidden vector, and the current input.

$$
c_{t} = i_{t} \circ \text{tanh}\left( W^{c} x_{t} + U^{c}h_{t-1}\right) +
f_{t} \circ c_{t-1}
$$

And then we can compute the hidden vector from the cell state.

$$
h_{t} = o_{t} \circ \text{tanh}\left(c_{t}\right)
$$
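The three gates and the cell update combine into a single LSTM step, sketched below in NumPy. As with the earlier sketches, the parameter names, sizes, and random weights are assumptions made for illustration, and bias terms are omitted because the equations above have none.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update following the gate and cell-state equations."""
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)        # input gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)        # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)        # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)  # new cell content
    c_t = i * c_tilde + f * c_prev                         # cell state
    h_t = o * np.tanh(c_t)                                 # hidden vector
    return h_t, c_t

rng = np.random.default_rng(4)
input_dim, hidden_dim = 3, 4  # hypothetical sizes

def init(cols):
    return rng.normal(scale=0.1, size=(hidden_dim, cols))

params = {}
for gate in ["i", "f", "o", "c"]:
    params["W_" + gate] = init(input_dim)   # acts on the input x_t
    params["U_" + gate] = init(hidden_dim)  # acts on the previous hidden vector

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # stand-in word vectors
    h, c = lstm_step(x_t, h, c, params)
```

Note that the cell state `c` is updated additively, while the hidden vector `h` is only a gated view of it.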