In [100]:
import csv
from IPython.display import Image
import json

### 3 (a)

In [3]:
with open('vocab.json') as fh:
    vocab_obj = json.load(fh)


In [18]:
words = ['traducir', 'traduce', 'traduzco', 'traduces', 'traduzca', 'traduzcas']
for w in words:
    if w not in vocab_obj['src_word2id']:
        print('{0} is not in vocab'.format(w))

traduzco is not in vocab
traduces is not in vocab
traduzca is not in vocab
traduzcas is not in vocab


These words are all forms of the verb "traducir" ("to translate"), but several of them do not make it into the 50,000 most frequent words in the vocabulary. The word-based NMT model cannot represent any form that is missing from the vocab: it has to fall back to `<unk>`, so the meaning of the word is lost.
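The fallback can be illustrated with a minimal sketch. The toy vocabulary, its ids, the `'<unk>'` key, and the helper `words_to_ids` are all assumptions made for illustration; they are not taken from `vocab.json` or from the assignment code.

```python
# Toy vocabulary: the ids and the '<unk>' entry are illustrative assumptions.
toy_word2id = {'<unk>': 0, 'traducir': 1, 'traduce': 2}

def words_to_ids(words, word2id, unk_token='<unk>'):
    """Map each word to its id, using the unk id for out-of-vocabulary words."""
    unk_id = word2id[unk_token]
    return [word2id.get(w, unk_id) for w in words]

# 'traduzco' and 'traduzcas' are missing, so both collapse to the same id.
ids = words_to_ids(['traducir', 'traduzco', 'traduzcas'], toy_word2id)
print(ids)
```

Once two different words share the unk id, a word-level model has no way to tell them apart.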

Character-aware NMT should be able to mitigate this issue, as all these variations have the same "root", i.e. "tradu", so the model might be able to infer the word from its first few characters. Furthermore, the variations following the root might have similar pronunciations, for example the "z" in "traduzco", "traduzca", and "traduzcas", and the "c" in "traduces". This could be a pattern in Spanish words that the character-aware NMT should be able to learn from the training data.
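As a quick check of the shared root, the sketch below computes the longest common prefix of the six forms with the standard-library `os.path.commonprefix` (which works character by character on any strings, not only paths). This is only a string-level illustration, not part of the NMT model.

```python
import os

verb_forms = ['traducir', 'traduce', 'traduzco', 'traduces', 'traduzca', 'traduzcas']

# Longest prefix shared by every form, compared character by character.
shared_prefix = os.path.commonprefix(verb_forms)

# What remains of each form after the shared prefix.
suffixes = [w[len(shared_prefix):] for w in verb_forms]

print(shared_prefix)
print(suffixes)
```

The suffixes split into a "c" group and a "z" group, which is the kind of regularity a character-level model could pick up on.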

### 3 (b) i.

<img src="images/financial_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/neuron_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/Francisco_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/naturally_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/expectation_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

### 3 (b) ii.

<img src="images/char_financial_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/char_neuron_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/char_Francisco_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/char_naturally_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

<img src="images/char_expectation_nearest_neighbors.png" alt="Drawing" style="width: 900px;"/>

### 3 (b) iii.

Word-based embeddings are able to capture word meaning (e.g. "financial" is close to "economic"), whereas char-based embeddings seem to capture similarity in how words are spelled (e.g. "financial" is close to "vertical").

Word2Vec is trained on which words appear together, so words that occur in similar contexts become neighbors. CharCNN, in contrast, looks at how a word is spelled and learns how characters build up into words; it does not take nearby words into account at all. As a result, words with similar spelling become neighbors even when they have rather distinct meanings.
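The spelling effect can be made concrete with a crude proxy. The sketch below scores two words by the Jaccard overlap of their character bigrams; this is an assumption chosen for illustration and is not how CharCNN computes embeddings, but it shows why "financial" can look closer to "vertical" than to "economic" when only characters are considered.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_jaccard(a, b):
    """Jaccard similarity between the character-bigram sets of two words."""
    x, y = char_bigrams(a), char_bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

# Spelling-only similarity: ignores meaning and context entirely.
spelling_sim_vertical = bigram_jaccard('financial', 'vertical')
spelling_sim_economic = bigram_jaccard('financial', 'economic')

print(spelling_sim_vertical, spelling_sim_economic)
```

"financial" and "vertical" share the bigram "al", while "financial" and "economic" share none, so the spelling-based score ranks the semantically unrelated word higher.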


### 3 (c)

In [48]:
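def read_corpus(file_path):
    """ Read file, where each sentence is delineated by a newline.
    @param file_path (str): path to file containing corpus
    @returns data (list[str]): one stripped sentence per line
    """
    data = []
    # Use a context manager so the file handle is closed after reading
    with open(file_path) as f:
        for line in f:
            sent = line.strip()  # keep each sentence as one string (not tokenized)
            data.append(sent)

    return data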

In [63]:
source_sentences = read_corpus('en_es_data/test.es')
ref_translate = read_corpus('en_es_data/test.en')

In [49]:
a4_output = read_corpus('outputs/test_outputs_a4.txt')
len(a4_output)

8064

In [50]:
a5_output = read_corpus('outputs/test_outputs.txt')
len(a5_output)

8064

In [57]:
word_embed_unks = [(idx, s) for idx, s in enumerate(a4_output) if '<unk>' in s]

In [99]:
word_embed_unks[209]

(423,
 'The colleagues on the journey of inanimate objects are not just the light that <unk> it, and the wind that goes on to the side and the heat of the room.')
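To narrow the search for examples, one option is to keep only the indices where the word-based output contains `<unk>` and the char-based output does not. The sketch below uses a hypothetical helper, `unk_resolved_indices`, and toy lists rather than the real `a4_output` and `a5_output`; each candidate still has to be read by hand to judge whether the char-based translation is acceptable.

```python
def unk_resolved_indices(word_outputs, char_outputs, unk_token='<unk>'):
    """Indices where the first system emits unk_token and the second does not."""
    return [
        idx
        for idx, (w_sent, c_sent) in enumerate(zip(word_outputs, char_outputs))
        if unk_token in w_sent and unk_token not in c_sent
    ]

# Toy data for illustration only (not taken from the real output files).
toy_word_outputs = ['circles and <unk>', 'hello world', 'the <unk> room']
toy_char_outputs = ['circles and epidemics.', 'hello world', 'the <unk> room']

candidates = unk_resolved_indices(toy_word_outputs, toy_char_outputs)
print(candidates)
```

With the real data, the call would be `unk_resolved_indices(a4_output, a5_output)`, assuming both lists are aligned line by line.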

#### Find one example where the word-based model produced `<unk>` and the char-based decoder produced an acceptable translation

In [95]:
a4_output[423]

'The colleagues on the journey of inanimate objects are not just the light that <unk> it, and the wind that goes on to the side and the heat of the room.'

In [96]:
a5_output[423]

"The travel partners of the immediate objects are not just people -- it's also the light that shines it and the wind that goes around and the heat of the room."

In [97]:
source_sentences[423]

'Los compañeros de viaje de los objetos inanimados no son solo personas, también es la luz que lo alumbra y el viento que pasa a su lado y el calor de la habitación.'

In [98]:
ref_translate[423]

"The fellow passengers for inanimate objects  are not just people,  but it's also the light shining on it  and the wind blowing past it and the heat of the room."

Explanation: The word-based model failed to translate "alumbra" ("shines") and produced `<unk>`, but the char-based decoder managed to generate "shines".

#### Find one example where the word-based model produced `<unk>` and the char-based decoder produced an incorrect translation

In [101]:
a4_output[3]

"I'm here to talk to you about circles and <unk>"

In [62]:
a5_output[3]

"I'm here today to talk to you about circles and epidemics."

In [64]:
source_sentences[3]

'Hoy estoy aquí para hablarles sobre círculos y epifanías.'

In [70]:
ref_translate[3]

"I'm here today to talk to you  about circles and epiphanies."

Explanation: the `<unk>` corresponds to "epifanías" ("epiphanies"), which is a rare word. The char-based decoder mistakes it for the word "epidemics".