## Neural Machine Translation with RNNs

__g) The `generate_sent_masks()` function in `nmt_model.py` produces a tensor called `enc_masks`. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to ‘pad’ tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the `step()` function (lines 295-296). First explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this
way.__

Answer:

The masks set the attention score to $-\infty$ at every position where the mask value is 1, i.e. at every pad token. The scores are then passed through softmax to obtain the attention distribution, and since $\exp(-\infty) = 0$, every pad position receives an attention weight of exactly 0. As a result, the attention output is a weighted sum over real source tokens only.
This is necessary because pad tokens are not part of the sentence; they exist only to give every sentence in the batch the same length. We do not want them to influence the result of any part of the computation.
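
The effect of the mask can be illustrated with a minimal pure-Python sketch. It does not reproduce the code in `nmt_model.py`; the function name `masked_softmax` and the example scores are made up for illustration, and the sketch assumes at least one non-pad position per sentence.

```python
import math

def masked_softmax(scores, pad_mask):
    """Softmax over `scores`, giving zero weight where `pad_mask` is 1."""
    # Same idea as filling the masked attention scores with -inf before softmax.
    masked = [-math.inf if m == 1 else s for s, m in zip(scores, pad_mask)]
    # Subtract the largest real score for numerical stability.
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a source sentence of length 2 padded to length 4.
scores = [2.0, 1.0, 0.5, 3.0]
pad_mask = [0, 0, 1, 1]
weights = masked_softmax(scores, pad_mask)
print(weights)
```

Note that the last pad position has the highest raw score (3.0) in this toy input, yet it still ends up with weight 0, which is exactly the behaviour we want.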

__i) Please report the model’s corpus BLEU Score. It should be larger than 21.__

Answer:

```
Corpus BLEU: 22.34060291466605
```

__j) In class, we learned about dot product attention, multiplicative attention, and additive attention. Please provide one possible advantage and disadvantage of each attention mechanism, with respect to either of the other two attention mechanisms. As a reminder, dot product attention is $e_{t,i} = s^T_t h_i$, multiplicative attention is $e_{t,i} = s^T_t Wh_i$, and additive attention is $e_{t,i} = v^T (W_1 h_i + W_2 s_t)$.__

Answer:

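Moving from dot product attention to multiplicative attention to additive attention makes the attention mechanism more flexible. Dot product attention has no parameters but requires $s_t$ and $h_i$ to have the same dimensionality; multiplicative attention adds a matrix $W$, which lets the two differ in size; additive attention adds two matrices and a vector. However, each step also increases the complexity of the model and the number of trainable weights.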
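
The trade-off can be made concrete with a small pure-Python sketch of the three scoring functions, written directly from the formulas quoted in the question. All helper names and the toy vectors and matrices below are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def dot_product_score(s, h):
    # No trainable parameters, but s and h must have the same dimension.
    return dot(s, h)

def multiplicative_score(s, h, W):
    # W has shape (len(s), len(h)), so s and h may differ in size.
    return dot(s, matvec(W, h))

def additive_score(s, h, W1, W2, v):
    # Follows the formula as quoted in the question. Many formulations
    # additionally apply tanh to the sum before the product with v.
    hidden = [a + b for a, b in zip(matvec(W1, h), matvec(W2, s))]
    return dot(v, hidden)

# Toy values, chosen only so the arithmetic is easy to follow.
s = [1.0, 2.0]
h = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0]]                 # 4 weights
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 6 weights
W2 = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]   # 6 weights
v = [1.0, 1.0, 1.0]                          # 3 weights

print(dot_product_score(s, h))
print(multiplicative_score(s, h, W))
print(additive_score(s, h, W1, W2, v))
```

In this toy setting the parameter counts are 0, 4 and 15 respectively, which mirrors the increase in trainable weights described above.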

## Analyzing NMT Systems

__a) Here we present a series of errors we found in the outputs of our NMT model (which is the same as the one you just trained). For each example of a Spanish source sentence, reference (i.e., ‘gold’) English translation, and NMT (i.e., ‘model’) English translation, please:__

__1. Identify the error in the NMT translation.__

__2. Provide a reason why the model may have made the error (either due to a specific linguistic construct or specific model limitations).__

__3. Describe one possible way we might alter the NMT system to fix the observed error.__

__Below are the translations that you should analyze as described above. Note that out-of-vocabulary words are underlined.__

i.

Source Sentence: Aquí otro de mis favoritos, “La noche estrellada”.

Reference Translation: So another one of my favorites, “The Starry Night”.

NMT Translation: Here’s another favorite of my favorites, “The Starry Night”.

Answer:
    
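The word "favorite" is repeated: the model produces "another favorite of my favorites" instead of "another one of my favorites". Perhaps this is because the Spanish phrase "otro de mis favoritos" has no explicit word corresponding to "one". We can add more examples of this construction to the training set.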

ii.

Source Sentence: Ustedes saben que lo que yo hago es escribir para los niños, y, de hecho, probablemente soy el autor para niños, más leído en los EEUU.

Reference Translation: You know, what I do is write for children, and I’m probably America’s
most widely read children’s author, in fact.

NMT Translation: You know what I do is write for children, and in fact, I’m probably the
author for children, more reading in the U.S.


Answer:

The phrase "for children" is repeated twice. Maybe we can teach the model to use synonymous phrases more often. For example, "write for children" and "children's author" express the same idea, and alternating between them makes the sentence more readable.

iii.

Source Sentence: Un amigo me hizo eso – Richard Bolingbroke.

Reference Translation: A friend of mine did that – Richard Bolingbroke.

NMT Translation: A friend of mine did that – Richard (unk)

Answer:

The model was unable to find a translation for the surname "Bolingbroke". Evidently it does not appear in the target-language vocabulary. We can try to add it, or come up with a way to handle names outside the normal flow, for example by copying them from the source sentence unchanged.
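
One possible way to treat names outside the normal flow is a post-processing step that replaces each unknown token in the output with the source word that received the most attention at that decoding step. The sketch below is only an illustration of that idea: the function `replace_unk`, the `<unk>` token spelling, and the attention weights are all assumptions, not part of the assignment code.

```python
def replace_unk(target_tokens, source_tokens, attention, unk="<unk>"):
    """Replace each unk in the output with the most-attended source token.

    attention[t][i] is the weight on source position i at target step t.
    """
    out = []
    for tok, weights in zip(target_tokens, attention):
        if tok == unk:
            best = max(range(len(source_tokens)), key=lambda i: weights[i])
            tok = source_tokens[best]
        out.append(tok)
    return out

# Shortened example with invented attention weights.
source = ["Richard", "Bolingbroke", "."]
target = ["Richard", "<unk>", "."]
attention = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
fixed = replace_unk(target, source, attention)
print(" ".join(fixed))
```

This only helps when the unknown word should be copied verbatim, as with proper names; for ordinary out-of-vocabulary words the copied token would still be in the source language.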

iv.

Source Sentence: Solo tienes que dar vuelta a la manzana para verlo como una epifanía.

Reference Translation: You’ve just got to go around the block to see it as an epiphany.

NMT Translation: You just have to go back to the apple to see it as a epiphany.

Answer:

The word "manzana" has two distinct meanings in Spanish: "apple" and "block". Our model was unable to choose the correct one. Maybe we can fix this by training our word embeddings in a way that makes the distinction clearer.

v.

Source Sentence: Ella salvó mi vida al permitirme entrar al baño de la sala de profesores.

Reference Translation: She saved my life by letting me go to the bathroom in the teachers’ lounge.

NMT Translation: She saved my life by letting me go to the bathroom in the women’s room.

Answer:

The information about the "teachers' lounge" is completely lost and replaced by "women's room". Maybe this is due to too much attention being paid to the word "She". We can try to use a more complex attention mechanism.

vi.

Source Sentence: Eso es más de 100,000 hectáreas.

Reference Translation: That’s more than 250 thousand acres.

NMT Translation: That’s over 100,000 acres.

Answer:

The model keeps the number 100,000 but changes the unit from hectares to acres. The reference translation converts metric hectares into acres, a non-metric unit that is more common in English. In fact, this is a problem with the reference translation. We can include more examples of the word "hectare" in the English side of the training set so that the model produces the correct translation "100,000 hectares". Training a model to convert one unit into another is too complicated and absolutely not necessary (if not harmful).
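
To see how far apart the two outputs are, the unit conversion can be checked with a short calculation. This is just an arithmetic sketch; the constant and function names are chosen here for illustration, and the acre is taken to be the international acre.

```python
SQ_M_PER_HECTARE = 10_000.0
SQ_M_PER_ACRE = 4_046.8564224  # international acre

def hectares_to_acres(hectares):
    return hectares * SQ_M_PER_HECTARE / SQ_M_PER_ACRE

acres = hectares_to_acres(100_000)
print(round(acres))
```

The result is roughly 247,000 acres, so the reference "250 thousand acres" is a rounded conversion, while the model's "100,000 acres" understates the area by a factor of about 2.5.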

__b) Now it is time to explore the outputs of the model that you have trained! The test-set translations your model produced in question 1-i should be located in outputs/test_outputs.txt. Please identify 2 examples of errors that your model produced. The two examples you find should be different error types from one another and different error types than the examples provided in the previous question.__

Answer:

i. 

Source Sentence: Yo estaba asombrada.

Reference Translation: I was in awe.

NMT Translation: I was (unk)

It looks like the word "awe" is too uncommon in English and is not in the vocabulary, so the model was unable to produce it. We can expand our vocabulary and corpus.

ii.

Source Sentence: Y estamos observando esto en todos los ámbitos de la vida humana.

Reference Translation: So we're seeing this in all sorts of places in human life.

NMT Translation: And we're looking at this in all the spheres of human life.

The word "looking" is not right in this context. Maybe if we include some part-of-speech data or grammatical rules in our embeddings, then we will be able to deal with this problem.

__c) Please consider this example:__

Source Sentence $s$: el amor todo lo puede

Reference Translation $r_1$: love can always find a way

Reference Translation $r_2$: love makes anything possible

NMT Translation $c_1$: the love can always do

NMT Translation $c_2$: love can make anything possible

__Please compute the BLEU scores for $c_1$ and $c_2$. Let $\lambda_i = 0.5$ for $i \in \{1, 2\}$ and $\lambda_i = 0$ for $i \in \{3, 4\}$ (this means we ignore 3-grams and 4-grams, i.e., don’t compute $p_3$ or $p_4$). When computing BLEU scores, show your working (i.e., show your computed values for $p_1$, $p_2$, $c$, $r^*$ and $BP$). Which of the two NMT translations is considered the better translation according to the BLEU Score? Do you agree that it is the better translation?__

Answer:

i.

For $c_1$:

| 1-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| the | 1 | 0 | 0 | 0 | 0 |
| love | 1 | 1 | 1 | 1 | 1 |
| can | 1 | 1 | 0 | 1 | 1 |
| always | 1 | 1 | 0 | 1 | 1 |
| do | 1 | 0 | 0 | 0 | 0 |

$p_1 = \frac {\sum {\text {numerator term}}} {\text {Num of 1-grams}} = \frac 3 5 = 0.6$

| 2-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| the love | 1 | 0 | 0 | 0 | 0 |
| love can | 1 | 1 | 0 | 1 | 1 |
| can always | 1 | 1 | 0 | 1 | 1 |
| always do | 1 | 0 | 0 | 0 | 0 |

$p_2 = \frac {\sum {\text {numerator term}}} {\text {Num of 2-grams}} = \frac 2 4 = 0.5$

$r^*$ is the length of the reference whose length is closest to the candidate length $c$. Here $c = 5$ and the two references (lengths 6 and 4) are equally close, so we take the shorter one: $r^* = 4$. Since $c \ge r^*$, ${BP} = 1.0$.

${BLEU}_{c_1} = {BP} \times \exp(\sum_n {\lambda_n \log{p_n}}) = 1.0 \times \exp(0.5 \log 0.6 + 0.5 \log 0.5) \approx 0.5477$

For $c_2$:

| 1-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| love | 1 | 1 | 1 | 1 | 1 |
| can | 1 | 1 | 0 | 1 | 1 |
| make | 1 | 0 | 0 | 0 | 0 |
| anything | 1 | 0 | 1 | 1 | 1 |
| possible | 1 | 0 | 1 | 1 | 1 |

Note that "make" does not match "makes" in $r_2$, so it does not count.

$p_1 = \frac {\sum {\text {numerator term}}} {\text {Num of 1-grams}} = \frac 4 5 = 0.8$

| 2-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| love can | 1 | 1 | 0 | 1 | 1 |
| can make | 1 | 0 | 0 | 0 | 0 |
| make anything | 1 | 0 | 0 | 0 | 0 |
| anything possible | 1 | 0 | 1 | 1 | 1 |

$p_2 = \frac {\sum {\text {numerator term}}} {\text {Num of 2-grams}} = \frac 2 4 = 0.5$

Again $c = 5$ and the reference lengths 6 and 4 are equally close, so $r^* = 4$ and ${BP} = 1.0$.

${BLEU}_{c_2} = {BP} \times \exp(\sum_n {\lambda_n \log{p_n}}) = 1.0 \times \exp(0.5 \log 0.8 + 0.5 \log 0.5) \approx 0.6325$

So we have ${BLEU}_{c_1} \approx 0.5477$ and ${BLEU}_{c_2} \approx 0.6325$. The second translation is better in terms of BLEU score, and I agree that it is a better translation than the first one.

ii.

With only $r_1$ available:

For $c_1$:

| 1-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| the | 1 | 0 | NA | 0 | 0 |
| love | 1 | 1 | NA | 1 | 1 |
| can | 1 | 1 | NA | 1 | 1 |
| always | 1 | 1 | NA | 1 | 1 |
| do | 1 | 0 | NA | 0 | 0 |

$p_1 = \frac {\sum {\text {numerator term}}} {\text {Num of 1-grams}} = \frac 3 5 = 0.6$

| 2-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| the love | 1 | 0 | NA | 0 | 0 |
| love can | 1 | 1 | NA | 1 | 1 |
| can always | 1 | 1 | NA | 1 | 1 |
| always do | 1 | 0 | NA | 0 | 0 |

$p_2 = \frac {\sum {\text {numerator term}}} {\text {Num of 2-grams}} = \frac 2 4 = 0.5$

Only $r_1$ is available, so $r^* = 6$, while $c = 5$. ${BP} = \exp (1 - \frac {r^*} {c}) = \exp(-0.2) \approx 0.8187$

${BLEU}_{c_1} = {BP} \times \exp(\sum_n {\lambda_n \log{p_n}}) = 0.8187 * \exp(0.5*\log 0.6 + 0.5*\log 0.5) \approx 0.4484$ 

For $c_2$:

| 1-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| love | 1 | 1 | NA | 1 | 1 |
| can | 1 | 1 | NA | 1 | 1 |
| make | 1 | 0 | NA | 0 | 0 |
| anything | 1 | 0 | NA | 0 | 0 |
| possible | 1 | 0 | NA | 0 | 0 |

$p_1 = \frac {\sum {\text {numerator term}}} {\text {Num of 1-grams}} = \frac 2 5 = 0.4$

| 2-gram | ${Count}_c$ | ${Count}_{r_1}$ | ${Count}_{r_2}$ | ${max}({Count}_{r_1}, {Count}_{r_2})$ | $\text{numerator term}$ |
| - | - | - | - | - | - |
| love can | 1 | 1 | NA | 1 | 1 |
| can make | 1 | 0 | NA | 0 | 0 |
| make anything | 1 | 0 | NA | 0 | 0 |
| anything possible | 1 | 0 | NA | 0 | 0 |

$p_2 = \frac {\sum {\text {numerator term}}} {\text {Num of 2-grams}} = \frac 1 4 = 0.25$

Only $r_1$ is available, so $r^* = 6$, while $c = 5$. ${BP} = \exp (1 - \frac {r^*} {c}) = \exp(-0.2) \approx 0.8187$

${BLEU}_{c_2} = {BP} \times \exp(\sum_n {\lambda_n \log{p_n}}) = 0.8187 * \exp(0.5*\log 0.4 + 0.5*\log 0.25) \approx 0.2589$ 

With only $r_1$ we get ${BLEU}_{c_1} \approx 0.4484$ and a much lower ${BLEU}_{c_2} \approx 0.2589$. The first translation is now much better in terms of BLEU score. I do not agree with this ranking: $c_1$ is still not as good a translation as $c_2$.
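
The hand computations in parts i and ii can be cross-checked with a short script. This is a minimal sketch of BLEU restricted to 1-grams and 2-grams with $\lambda_1 = \lambda_2 = 0.5$, not the `corpus_bleu` implementation used for the corpus score. It assumes whitespace tokenization and that $r^*$ is the reference length closest to $c$, with ties broken in favour of the shorter reference; all function names are chosen here for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    clipped = 0
    for gram, count in cand_counts.items():
        # Clip each candidate count by its maximum count in any reference.
        max_ref = max(Counter(ngrams(ref, n))[gram] for ref in references)
        clipped += min(count, max_ref)
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    c = len(candidate)
    # Closest reference length; ties go to the shorter reference (assumption).
    r_star = min((len(ref) for ref in references), key=lambda r: (abs(r - c), r))
    return 1.0 if c >= r_star else math.exp(1 - r_star / c)

def bleu(candidate, references, weights=(0.5, 0.5)):
    log_sum = 0.0
    for n, w in enumerate(weights, start=1):
        p = modified_precision(candidate, references, n)
        if p == 0:
            return 0.0
        log_sum += w * math.log(p)
    return brevity_penalty(candidate, references) * math.exp(log_sum)

r1 = "love can always find a way".split()
r2 = "love makes anything possible".split()
c1 = "the love can always do".split()
c2 = "love can make anything possible".split()

for name, cand in (("c1", c1), ("c2", c2)):
    print(name, "both refs:", round(bleu(cand, [r1, r2]), 4),
          "r1 only:", round(bleu(cand, [r1]), 4))
```

With only $r_1$ as reference, the script reproduces the values 0.4484 and 0.2589 from part ii.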

iii.

Having only a single reference translation can strongly skew BLEU results, as we have seen above. Perfectly good translations can be scored very low if they use different words from the reference.

iv.

Advantages of BLEU:
* it can be computed automatically and used as part of model validation
* it gives reasonably good results when multiple reference translations are available
* it yields a single number that can be interpreted as a measure of translation quality

Disadvantages of BLEU:
* it requires a good corpus of reference translations
* it becomes biased when there are too few reference translations
* it can behave badly for languages that allow freer word order without changing the meaning; in this case n-gram matching can underestimate the similarity between translations.