## BLEU score on test set

```
Corpus BLEU: 24.179620034653208
```

## Character-based convolutional encoder for NMT

**a) In Assignment 4 we used 256-dimensional word embeddings ($e_{word} = 256$),
while in this assignment, it turns out that a character embedding size of 50 suffices ($e_{char} = 50$).
In 1-2 sentences, explain one reason why the embedding size used for character-level embeddings is
typically lower than that used for word embeddings.**

Answer:

The number of distinct characters, and of their combinations and co-occurrences, is much smaller than the corresponding number for words in the corpus. That is why we do not need large vectors to represent characters.

**b) Write down the total number of parameters in the character-based embedding model (Figure 2), then do the same for the word-based lookup embedding model (Figure 1). Write each answer as a single expression (though you may show working) in terms of $e_{char}$, $k$, $e_{word}$, $V_{word}$ (the size of the word-vocabulary in the lookup embedding model) and $V_{char}$ (the size of the character-vocabulary in the character-based embedding model). Given that in our code, $k = 5$, $V_{word} ≈ 50000$ and $V_{char} = 96$, state which model has more parameters, and by what factor (e.g. twice as many? a thousand times as many?).**

Answer:

The character-based embedding model has three parameterized parts: the embedding layer, the convolutional layer (weights and bias) and the highway layer (projection and gate, each with weights and bias). So there are three parts in our sum:

$\text{Total parameters}_\text{char-based} = V_{char} \cdot e_{char} + (e_{word} \cdot e_{char} \cdot k + e_{word}) + 2 \cdot (e_{word} \cdot e_{word} + e_{word}) \stackrel{\text{with values}}{=} 96 \cdot 50 + (256 \cdot 50 \cdot 5 + 256) + 2 \cdot (256 \cdot 256 + 256) = 200640$

The word-based embedding model has only one embedding layer:

$\text{Total parameters}_\text{word-based} = V_{word} \cdot e_{word} \stackrel{\text{with values}}{=} 50000 \cdot 256 = 12800000$

So the word-based model has more parameters: roughly 64 times as many as the character-based model.
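The counts above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the `conv_bias` flag is an assumption that lets the convolution be counted with or without one bias per filter (the difference is only $e_{word}$ parameters).

```python
def char_embedding_params(v_char, e_char, e_word, k, conv_bias=True):
    """Parameter count of the character-based embedding model (Figure 2)."""
    embedding = v_char * e_char                # character lookup table
    conv = e_word * e_char * k                 # e_word filters of shape (e_char, k)
    if conv_bias:
        conv += e_word                         # one bias per filter (assumed)
    highway = 2 * (e_word * e_word + e_word)   # projection and gate, each with bias
    return embedding + conv + highway


def word_embedding_params(v_word, e_word):
    """Parameter count of the word-based lookup embedding (Figure 1)."""
    return v_word * e_word


word_total = word_embedding_params(v_word=50_000, e_word=256)
for use_bias in (True, False):
    char_total = char_embedding_params(96, 50, 256, 5, conv_bias=use_bias)
    ratio = word_total / char_total
    print(f"conv_bias={use_bias}: char={char_total}, word={word_total}, ratio={ratio:.1f}")
```

Either way the ratio comes out at about 64, so the choice about the convolution bias does not change the conclusion.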


**c) In step 3 of the character-based embedding model, instead of using a 1D convnet,
we could have used a RNN instead (e.g. feed the sequence of characters into a bidirectional
LSTM and combine the hidden states using max-pooling). Explain one advantage of using a
convolutional architecture rather than a recurrent architecture for this purpose, making it clear how
the two contrast.**

Answer:

The convolutional layer has $e_{word} \cdot e_{char} \cdot k$ weights, which is 64,000 for our values. A BiLSTM with hidden size $h$ per direction has $2 \cdot 4 \cdot (e_{char} \cdot h + h^2 + h)$ parameters: two directions, four gates each, with input weights, recurrent weights and a bias. With $h = e_{word} = 256$ this is 628,736, roughly ten times more than the convolution, mainly because of the recurrent $h^2$ term. So using a BiLSTM instead of a 1D CNN is less efficient in terms of the number of parameters.
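To make the parameter comparison concrete, here is a small sketch that counts the parameters of a 1D convolution and of a BiLSTM. The function names and the two hidden sizes tried are assumptions for illustration. The LSTM count uses the textbook formulation with one bias vector per gate.

```python
def conv1d_params(e_char, e_word, k):
    """e_word filters of shape (e_char, k), plus one bias per filter."""
    return e_word * e_char * k + e_word


def bilstm_params(input_size, hidden_size):
    """Two directions, four gates each: input weights, recurrent weights, bias."""
    per_gate = input_size * hidden_size + hidden_size * hidden_size + hidden_size
    return 2 * 4 * per_gate


conv = conv1d_params(e_char=50, e_word=256, k=5)
# Hidden size 256 per direction, and 128 per direction (256 after concatenation).
for hidden in (256, 128):
    lstm = bilstm_params(input_size=50, hidden_size=hidden)
    print(f"hidden={hidden}: conv={conv}, bilstm={lstm}, ratio={lstm / conv:.1f}")
```

Even with the smaller hidden size the BiLSTM needs several times more parameters, because the recurrent term grows quadratically in the hidden size.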

**d) In lectures we learned about both max-pooling and average-pooling. For each
pooling method, please explain one advantage in comparison to the other pooling method. For
each advantage, make it clear how the two contrast, and write to a similar level of detail as in the
example given in the previous question.**

Answer:

Average pooling takes every value in the feature map into account, so it is the better choice when we do not want a single position to dominate the output. Its drawback is that when the values in a feature map average out to roughly 0, the pooled feature is also close to 0, as if the feature were not active at all, even when individual activations are strong.
Max pooling keeps only the most active value in the feature map, so it preserves a strong signal regardless of where it occurs. This makes it better suited to sparse features (when only a few positions have large values), but it discards all information about the remaining positions.
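The contrast can be seen on two toy feature maps. The numbers below are made up purely for illustration: one map is sparse, and in the other strong activations cancel each other out.

```python
def max_pool(features):
    """Keep only the largest activation."""
    return max(features)


def avg_pool(features):
    """Average all activations."""
    return sum(features) / len(features)


sparse = [0.0, 0.0, 0.9, 0.0, 0.0]    # a single strongly active position
cancelling = [0.8, -0.8, 0.7, -0.7]   # strong activations with mean 0

for name, fmap in (("sparse", sparse), ("cancelling", cancelling)):
    print(f"{name}: max={max_pool(fmap):.2f}, avg={avg_pool(fmap):.2f}")
```

On the sparse map, average pooling dilutes the single strong activation, while max pooling keeps it. On the cancelling map, average pooling reports a feature near 0, while max pooling reports only the largest positive value and ignores everything else.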

## Analyzing NMT Systems

**a) The following table shows some of the forms of the Spanish word _traducir_, which means ‘to translate’.**

| Form        | Spanish   | English     |
| -           | -         | -           |
| Infinitive  | traducir  | to translate |
| Present     | traduzco  | I translate |
| -           | traduces  | you translate |
| -           | traduce   | he or she translates |
| Subjunctive | traduzca  | that I translate |
| -           | traduzcas | that you translate |

**Use vocab.json to find (e.g. using grep) which of these six forms are in the word-vocabulary,
which consists of the 50,000 most frequent words in the training data for English and for Spanish.
Superstrings don’t count (e.g. having traducen in the vocabulary is not a hit for traduce). State
which of these six forms occur, and which do not. Explain in one sentence why this is a bad thing for
word-based NMT from Spanish to English. Then explain in detail (approximately two sentences)
how our new character-aware NMT model may overcome this problem.**

Answer:

The forms "traduzco", "traduces", "traduzca" and "traduzcas" do not occur among the top 50,000 words of the Spanish corpus; only "traducir" and "traduce" do. In word-based NMT, all of the missing forms are replaced with the special `<UNK>` token, which carries none of their meaning, so sentences containing them are translated poorly.

The character-aware NMT model instead builds a word vector for an unknown word from its characters. We expect this embedding to learn how Spanish suffixes are used, so that it can connect these unseen forms to the English word "translate" in the appropriate form.
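As an alternative to grep, the vocabulary check can be sketched in Python. The helper names are hypothetical, and the layout of `vocab.json` (a top-level `src_word2id` key mapping words to indices) is an assumption that should be checked against the actual file. The toy vocabulary at the bottom is for illustration only and is not the real file content.

```python
import json

FORMS = ["traducir", "traduzco", "traduces", "traduce", "traduzca", "traduzcas"]


def split_by_vocab(word2id, forms):
    """Exact-match lookup, so superstrings such as 'traducen' do not count."""
    present = [form for form in forms if form in word2id]
    missing = [form for form in forms if form not in word2id]
    return present, missing


def load_source_vocab(path="vocab.json", key="src_word2id"):
    """Load the source word-to-index map (the key name is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)[key]


# Toy vocabulary for illustration only.
toy_vocab = {"traducir": 0, "traduce": 1, "traducen": 2}
present, missing = split_by_vocab(toy_vocab, FORMS)
print("present:", present)
print("missing:", missing)
```

With the real file, `split_by_vocab(load_source_vocab(), FORMS)` would replace the toy lookup.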


**b.i) In Assignments 1 and 2, we investigated word embeddings created via algorithms such as Word2Vec, and found that for these embeddings, semantically similar words are close together in the embedding space. In this exercise, we’ll compare this with the word embeddings constructed using the CharCNN trained in our NMT system.
Go to https://projector.tensorflow.org/. The website by default shows data from Word2Vec. Look at the nearest neighbors of the following words (in cosine distance).**

- financial
- neuron
- Francisco
- naturally
- expectation

**For each word, report the single closest neighbor. For your convenience, for each example take a screenshot of all the nearest words (so you can compare with the CharCNN embeddings).**

Answer:

- financial: economic
- neuron: nerve
- Francisco: san
- naturally: occurring
- expectation: norms

**b.ii) The TensorFlow embedding projector also allows you to upload your own data – you may find this useful in your projects!
Now look at the nearest neighbors of the same words. Again, report the single closest neighbors and take screenshots for yourself.**

Answer:

- financial: vertical
- neuron: Newton
- Francisco: France
- naturally: practically
- expectation: exception

**b.iii) Compare the closest neighbors found by the two methods. Briefly describe what kind of similarity is modeled by Word2Vec. Briefly describe what kind of similarity is modeled by the CharCNN. Explain in detail (2-3 sentences) how the differences in the methodology of Word2Vec and a CharCNN explain the differences you have found.**

Answer:

Word2Vec is trained on word co-occurrences, so two vectors are close when their words are close in meaning (they can be found in the same contexts). The CharCNN instead builds each word vector from the word's character sequence, so two vectors are close when their words contain similar characters in a similar order: the same roots, the same suffixes, etc. This difference is easy to see in our results: the closest words for the CharCNN are not similar to the original word in meaning, but they are very close in spelling.
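The spelling-based nature of the CharCNN neighbors can be illustrated with a crude proxy: the Jaccard overlap of character bigrams. This is only a sketch. It is not how the CharCNN computes similarity, and the helper names are made up for illustration. The word triples are the neighbors reported in b.i and b.ii.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a lowercased word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# (query, Word2Vec neighbor, CharCNN neighbor)
triples = [
    ("financial", "economic", "vertical"),
    ("neuron", "nerve", "Newton"),
    ("Francisco", "san", "France"),
    ("naturally", "occurring", "practically"),
    ("expectation", "norms", "exception"),
]

for query, w2v, cnn in triples:
    print(f"{query}: {w2v}={ngram_overlap(query, w2v):.2f}, "
          f"{cnn}={ngram_overlap(query, cnn):.2f}")
```

For every query, the CharCNN neighbor shares more character bigrams with the query than the Word2Vec neighbor does, which matches the observation that the CharCNN groups words by spelling rather than by meaning.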