# Neural Machine Translation and Models with Attention

Neural Machine Translation (NMT) is the approach of modeling the entire machine translation process with a single large artificial neural network.

## Background

### Encoder

As previously discussed, the architecture is made up of an encoder and a decoder, both of which are LSTM models. The encoder reads a source sentence one symbol at a time, and its last hidden state summarizes the entire source sentence. This hidden state is fed into the decoder, which produces a sentence in a different language.
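
As a rough illustration of the encoder reading one symbol at a time, here is a minimal NumPy sketch. It swaps the LSTM for a plain `tanh` recurrent cell to stay short, and every name in it (`encode`, `W_xh`, `W_hh`, `b_h`, the sizes, the random inputs) is an assumption made for the example rather than something from the lecture.

```python
import numpy as np

def encode(source_embeddings, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over the source and return every hidden state."""
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)
    states = []
    for x in source_embeddings:          # one source symbol per step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)              # shape: (source_length, hidden_size)

rng = np.random.default_rng(0)
embed_size, hidden_size, source_length = 4, 3, 5
source = rng.normal(size=(source_length, embed_size))   # stand-in for embedded words
W_xh = rng.normal(size=(hidden_size, embed_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

encoder_states = encode(source, W_xh, W_hh, b_h)
h_encoder_final = encoder_states[-1]     # the summary vector handed to the decoder
```

Only `h_encoder_final` is passed on in the basic encoder-decoder setup; the other rows of `encoder_states` are discarded.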

### Decoder

The decoder is a bit different: at each timestep it is also fed the last hidden state from the encoder. This last encoder hidden state is known as the condition. It gives the decoder an awareness of the source text at every timestep, which is very useful for translation.

$$
h_{decoder, t} = f\left( h_{decoder, t-1}, x_{t}, h_{encoder, f} \right)
$$
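
The update above could be sketched as follows. This is only one possible choice of `f`: a `tanh` over a sum of linear maps, assumed here for brevity instead of a full LSTM cell. The weight names (`W_hh`, `W_xh`, `W_ch`), the sizes, and the random inputs are all hypothetical.

```python
import numpy as np

def decoder_step(h_prev, x_t, h_encoder_final, W_hh, W_xh, W_ch, b_h):
    """One decoder update: h_t = f(h_{t-1}, x_t, h_encoder_final)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + W_ch @ h_encoder_final + b_h)

rng = np.random.default_rng(0)
hidden_size, embed_size = 3, 4
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_xh = rng.normal(size=(hidden_size, embed_size)) * 0.1
W_ch = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

h_encoder_final = rng.normal(size=hidden_size)    # stand-in for the encoder's last state
target_inputs = rng.normal(size=(6, embed_size))  # stand-in for embedded target tokens

h = h_encoder_final          # assumption: the decoder starts from the encoder summary
decoder_states = []
for x_t in target_inputs:
    # the same condition vector is passed in at every timestep
    h = decoder_step(h, x_t, h_encoder_final, W_hh, W_xh, W_ch, b_h)
    decoder_states.append(h)
decoder_states = np.stack(decoder_states)         # shape: (6, hidden_size)
```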

### Progress of MT

![mt_progress](./assets/mt_progress.png)

Neural MT was introduced in 2014, and people started applying it in 2015. We can clearly see a big improvement from 2015 to 2016, whereas the older syntax-based machine translation systems remained stagnant over the years.

### Four Advantages of NMT

#### End to End Training

All parameters are simultaneously optimized to minimize a loss function on the network's output.

#### Distributed representations share strength

Better exploitation of word and phrase similarities.

#### Better exploitation of context

NMT can use a much bigger context, both the source text and the partial target text, to translate more accurately.

#### More fluent text generation

Text generated by deep learning models is of much higher quality.

### Google Translate

Here's a sample Chinese text
> 1519年600名西班牙人在墨西哥登陸，去征服幾百萬人口的阿茲特克帝國，初次交鋒他們損兵三分之二。

A human translation would be
> In 1519, 600 Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.

In 2009, before the invention of NMT, here's what Google Translate returned.
> 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.

In 2011,
> 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the initial loss of soldiers, two thirds of their encounters.

In 2013,
> 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.

In 2014/2015/2016,
> 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.

Here's the big improvement in 2017,
> In 1519, 600 Spaniards landed in Mexico, to conquer the millions of people of the Aztec empire, the first confrontation they killed two-thirds.

It's still not exactly correct, but NMT has clearly improved the fluency of the translated text.

## Neural Attention

The problem with feeding one fixed-dimensional vector from the encoder to the decoder is that it only captures the final state of the sentence. When humans translate a sentence, we don't just read the source sentence once and then immediately write out the translated version. We usually look back and forth between the two sentences and make modifications. Thus, we need to introduce the idea of attention to neural translation.

The solution is a random access memory for the decoder. At every timestep during decoding, the decoder should be able to look at the hidden state of any timestep from the encoder. To make that possible, we keep a pool of hidden states from the encoder.

![hidden_state_pool](./assets/hidden_state_pool.png)

What we are effectively doing is learning both translation and **alignment** during training. But the next question is, how do we decide which state to use at each timestep of decoding?

### Attention Mechanism - Scoring

When we start decoding, we feed the last hidden state of the encoder to our decoder as usual. When we are about to generate the next hidden state for the translation, we try to find the encoder hidden states that we wish to use as context. But how? We compute a score for each hidden state in the encoder.

> The `s` index is the index of hidden state from encoder, while `t-1` is the previous timestep from decoder.

$$
score(\bar{h}_{s}, h_{t-1})
$$

![attention_score](./assets/attention_score.png)

And then we convert these scores into alignment weights.

$$
a_{t}(s) = \frac{\exp\left(score(\bar{h}_{s}, h_{t-1})\right)}{\sum_{s^{\prime}} \exp\left(score(\bar{h}_{s^{\prime}}, h_{t-1})\right)}
$$

![alignment_weights](./assets/alignment_weights.png)
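
The normalization above is a softmax over the source positions. A minimal sketch, with made-up scores and a hypothetical function name `alignment_weights`, might look like this. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def alignment_weights(scores):
    """Softmax over the source positions s for one decoder timestep."""
    shifted = scores - np.max(scores)    # guards against overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 0.5, -1.0, 0.5])   # made-up scores for four encoder states
a_t = alignment_weights(scores)
```

The weights are all positive and sum to one, and the encoder state with the highest score receives the largest weight.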

And then we build a context vector for every timestep, which is essentially a weighted average.

$$
c_{t} = \sum_{s} a_{t}(s)\bar{h}_{s}
$$
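
As a small worked example of this weighted average, the snippet below uses three made-up encoder states of size 2 and made-up weights. Stacking the encoder states as rows of a matrix (an assumption of this sketch) turns the sum into a single vector-matrix product.

```python
import numpy as np

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])   # three made-up encoder states, hidden size 2
a_t = np.array([0.5, 0.25, 0.25])        # made-up alignment weights summing to 1

# c_t = sum_s a_t(s) * h_bar_s, written as a vector-matrix product
c_t = a_t @ encoder_states               # -> [0.75, 0.5]
```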

Now we can compute the next hidden state for the decoder.

$$
h_{t} = f\left(c_{t}, h_{t-1}\right)
$$

![context_vector](./assets/context_vector.png)
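
Putting the four equations together, one decoding step with attention could be sketched as below. Several things here are assumptions made for illustration: `f` is taken to be a `tanh` over two linear maps, the score function is passed in as `score_fn` with a plain dot product standing in for whatever scoring rule is actually used, and all weights and states are random.

```python
import numpy as np

def attention_step(h_prev, encoder_states, W_c, W_h, score_fn):
    """Score, normalise, average, then update the decoder state once."""
    scores = np.array([score_fn(h_s, h_prev) for h_s in encoder_states])
    exp_scores = np.exp(scores - scores.max())
    a_t = exp_scores / exp_scores.sum()      # alignment weights
    c_t = a_t @ encoder_states               # context vector
    h_t = np.tanh(W_c @ c_t + W_h @ h_prev)  # h_t = f(c_t, h_{t-1})
    return h_t, a_t

rng = np.random.default_rng(0)
source_length, hidden_size = 5, 3
encoder_states = rng.normal(size=(source_length, hidden_size))
W_c = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1

h = encoder_states[-1]                       # start from the encoder's last state
alignments = []
for _ in range(4):                           # four decoding steps
    h, a_t = attention_step(h, encoder_states, W_c, W_h, np.dot)
    alignments.append(a_t)
alignments = np.stack(alignments)            # shape: (4, source_length)
```

Each row of `alignments` shows how one decoder timestep spreads its attention over the source positions.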

### How to score it?

So far we haven't talked about how the score function is actually implemented. There are a couple of proposals; I will talk about one. Suppose we are on timestep `t` and have computed `h[t]` from the function above. We need to assign a new score to each encoder hidden state `h[s]`.

#### Bilinear Form

$$
score(\bar{h}_{s}, h_{t}) = W_{a}\bar{h}_{s} \cdot h_{t}
$$

A mediating weight matrix is inserted before the dot product of the two vectors. This weight matrix is learned during training to encapsulate the strength of interaction between the decoder hidden state and each encoder hidden state. For example, if the output of `np.dot(weight_a, h[s])` is mostly zeros, its dot product with `h[t]` is close to zero, so `h[s]` receives little attention at timestep `t` relative to encoder states with larger scores.
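
A minimal sketch of the bilinear score, computed for all encoder states at once, is shown below. The function name `bilinear_scores`, the shapes, and the random values are assumptions for the example. In a real model `W_a` would be learned during training rather than sampled.

```python
import numpy as np

def bilinear_scores(encoder_states, h_t, W_a):
    """score(h_bar_s, h_t) = (W_a h_bar_s) . h_t for every source position s."""
    projected = encoder_states @ W_a.T   # row s holds W_a @ h_bar_s
    return projected @ h_t               # one scalar score per encoder state

rng = np.random.default_rng(0)
source_length, enc_size, dec_size = 5, 3, 3
encoder_states = rng.normal(size=(source_length, enc_size))
h_t = rng.normal(size=dec_size)
W_a = rng.normal(size=(dec_size, enc_size))  # learned in training; random here

scores = bilinear_scores(encoder_states, h_t, W_a)
```

Because `W_a` has shape `(dec_size, enc_size)`, this form also works when the encoder and decoder hidden states have different sizes, which a plain dot product would not allow.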

### Long Sequences

Attention is needed to translate long input sequences because a vanilla LSTM is not going to cut it. Once a sequence is more than 20 input units long, the performance of a regular LSTM begins to drop off.

![long_sequence_performance](./assets/long_sequence_performance.png)