![title](imgs/a5_intro.png)

![title](imgs/a5_problem1_intro_1.png)

![title](imgs/a5_problem1_intro_2.png)

![title](imgs/a5_problem1a.png)

## Answers:
We use embeddings to represent sparse data (words) in a lower-dimensional space.

There are between O(10^5) and O(10^6) words in the English language, but only O(10^2) characters, so we would expect:

1) character frequency matrices to **not** be sparse

2) character embeddings to be of significantly lower dimensionality than word embeddings

![title](imgs/a5_problem1b.png)

## Answer:
A) Number of parameters in the character-based embedding model (bias terms omitted, with the number of filters $f = e_{word}$):
$$ \begin{aligned} 
N_{char} &= |e_{char}| \times |V_{char}| + \left(f \times e_{char} \times k \right) + 2 \times \left( e_{word} \times e_{word} \right) \\
N_{char} &= |e_{char}| \times |V_{char}| + \left(e_{word} \times e_{char} \times k \right) + 2 \times \left( e_{word} \times e_{word} \right) \\
N_{char} &= 50 \times 96 + \left(256 \times 50 \times 5\right) + 2 \times \left(256^2\right) \\
N_{char} &\approx 2 \times 10^5 \\
\end{aligned}
$$

$$ \begin{aligned}
N_{word} &= \left|e_{word}\right| \times \left|V_{word}\right| \\
N_{word} &= 256 \times 50000 \\
N_{word} &= 1.28 \times 10^7 \\
\implies \frac{N_{word}}{N_{char}} &\approx 64 
\end{aligned}
$$

So the word-embedding model has roughly 64 times as many parameters as the character-based embedding model.
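
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, the values are the ones used in the equations, and bias terms are left out as in the formulas.

```python
# Sizes used in the equations above (bias terms are ignored).
e_char = 50      # character embedding size
v_char = 96      # character vocabulary size
e_word = 256     # word embedding size (also the number of CNN filters)
k = 5            # convolution kernel width
v_word = 50000   # word vocabulary size

char_embedding_params = e_char * v_char   # character lookup table
conv_params = e_word * e_char * k         # convolution filters
highway_params = 2 * e_word * e_word      # two e_word x e_word weight matrices

n_char = char_embedding_params + conv_params + highway_params
n_word = e_word * v_word                  # word lookup table
ratio = n_word / n_char

print(n_char, n_word, round(ratio, 1))
```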

![title](imgs/a5_problem1c.png)

## Answer:
With an RNN-based character-embedding model, earlier characters in a word carry less weight in the final word vector than later characters, because information is passed along a chain of RNN hidden units. For sentences this bias is reasonable: word proximity within a sentence correlates with word relatedness, so RNNs seem like a natural choice. Within a word, however, earlier characters should be no less important than later ones for determining its meaning (e.g. the prefix "un" in "unlucky"). CNNs, by definition, apply the same operation (convolution) at every position of the input sequence, so they seem like a naturally better choice for character-level NLP.
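
The position-independence argument can be illustrated with a toy 1D convolution. This is a minimal sketch on scalar inputs, not the assignment's `cnn.py`: `conv1d`, the kernel, and the two sequences are made up for illustration.

```python
def conv1d(sequence, kernel):
    """Slide the same kernel over every window of the sequence (no padding)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]

kernel = [1.0, -1.0, 1.0]

# The same pattern (1, 0, 1) placed at the start and at the end of a sequence.
pattern_at_start = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
pattern_at_end = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0]

start_response = conv1d(pattern_at_start, kernel)
end_response = conv1d(pattern_at_end, kernel)

# The strongest response is identical; only its position differs.
print(max(start_response), max(end_response))
```

Because the kernel weights are shared across positions, a prefix is detected just as strongly as a suffix, which is the property the answer relies on.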

![title](imgs/a5_problem1d.png)

## Answer:
Max-Pooling is better for detecting contrasts or edges in Computer Vision, but it is only sensitive to the max-valued input cell.

Average-Pooling is sensitive to the data in all input cells. Its disadvantage is that if some cells are very high and others very low, the average may end up close to zero, hiding the variation in the data.
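
A small numeric example makes the trade-off concrete. The values below are invented for illustration only.

```python
# One pooling window with large positive and large negative activations,
# and one window with no activity at all.
varied_window = [5.0, -5.0, 4.0, -4.0]
flat_window = [0.0, 0.0, 0.0, 0.0]

def max_pool(window):
    return max(window)

def avg_pool(window):
    return sum(window) / len(window)

# Average pooling cannot tell the two windows apart; max pooling can,
# but it ignores every cell except the largest one.
print(avg_pool(varied_window), avg_pool(flat_window))
print(max_pool(varied_window), max_pool(flat_window))
```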

![title](imgs/a5_problem1ei.png)

![title](imgs/a5_problem1eii.png)

[Code](vocab.py)

![title](imgs/a5_problem1f.png)

[Code](utils.py)

![title](imgs/a5_problem1g.png)

[Code](vocab.py)

![title](imgs/a5_problem1h.png)

[Code](highway.py)

![title](imgs/a5_problem1i.png)

[Code](cnn.py)

![title](imgs/a5_problem1j.png)

[Code](model_embeddings.py)

![title](imgs/a5_problem1k.png)

[Code](nmt_model.py)

![title](imgs/a5_problem1l.png)

![title](imgs/a5_problem2_intro_i.png)

![title](imgs/a5_problem2_intro_ii.png)

![title](imgs/a5_problem2a.png)

[Code](char_decoder.py)

![title](imgs/a5_problem2b.png)

[Code](char_decoder.py)

![title](imgs/a5_problem2c.png)

[Code](char_decoder.py)

![title](imgs/a5_problem2d.png)

[Code](char_decoder.py)

![title](imgs/a5_problem2e.png)

![title](imgs/a5_problem2fi.png)

![title](imgs/a5_problem2fii.png)

## Answer:
My BLEU score was ~24.

![title](imgs/a5_problem3a.png)

## Answers:
1) Present in vocab.json:
* _traducir_
* _traduce_

2) Not present in vocab.json:
* _traduzco_
* _traduces_
* _traduzca_
* _traduzcas_

3) Explanation:
* If a verb form is not in the training corpus, a word-based NMT model won't recognize it and won't be able to translate it.
* A character-based NMT model could help here by estimating a word's meaning encoding through the composition of character-level encodings.
* So when a new word appears in the test data but never in the training data, a character-based NMT model could still estimate its meaning from the characters it is composed of and their character-level embeddings.
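
The difference can be sketched with toy vocabularies. The dictionaries and helper functions below are hypothetical stand-ins, not the contents of `vocab.json` or the code in `vocab.py`.

```python
# Hypothetical word-level vocabulary containing only two of the verb forms.
word2id = {"<unk>": 0, "traducir": 1, "traduce": 2}

# Hypothetical character-level vocabulary: lowercase ASCII letters only.
char2id = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def word_to_id(word):
    """Word-level lookup: unseen words collapse to the <unk> index."""
    return word2id.get(word, word2id["<unk>"])

def word_to_char_ids(word):
    """Character-level lookup: an unseen word still maps to known characters."""
    return [char2id[ch] for ch in word]

print(word_to_id("traduce"))         # seen word keeps its own index
print(word_to_id("traduzco"))        # unseen word becomes <unk>
print(word_to_char_ids("traduzco"))  # every character is still representable
```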


![title](imgs/a5_problem3bi.png)

## Answers:
Word2Vec Word-Embeddings:
* financial: economic (0.462)
* neuron: nerve (0.559) 
* Francisco: san (0.184)
* naturally: occurring (0.545)
* expectation: norms (0.627)

![title](imgs/a5_problem3bii-part1.png)

![title](imgs/a5_problem3bii-part2.png)

## Answers:
Part II: Character-based Embeddings:
* financial: vertical (0.301) 
* neuron: Newton (0.354)
* Francisco: France (0.420)
* naturally: practically (0.302)
* expectation: exception (0.389)
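
The numbers in parentheses in nearest-neighbor lists like these appear to be distances (smaller means closer). Assuming they are cosine distances, which the write-up does not state, a nearest-neighbor lookup could be sketched as follows. The two-dimensional vectors are made up for illustration and are not the real embeddings.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbor(query, embeddings):
    """Return (word, distance) for the closest word other than the query."""
    query_vec = embeddings[query]
    return min(
        ((w, cosine_distance(query_vec, vec))
         for w, vec in embeddings.items() if w != query),
        key=lambda pair: pair[1],
    )

# Toy vectors only; real embeddings have many more dimensions.
toy_embeddings = {
    "financial": [1.0, 0.0],
    "economic": [0.8, 0.6],
    "neuron": [0.0, 1.0],
}

word, distance = nearest_neighbor("financial", toy_embeddings)
print(word, round(distance, 3))
```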

### financial:
![financial](imgs/word_financial.png)

### neuron:
![neuron](imgs/word_neuron.png)

### Francisco:
![Francisco](imgs/word_Francisco.png)

### naturally:
![naturally](imgs/word_naturally.png)

### expectation:
![expectation](imgs/word_expectation.png)

![title](imgs/a5_problem3biii.png)

### financial:
![financial](imgs/char_financial.png)

### neuron:
![neuron](imgs/char_neuron.png)

### Francisco:
![Francisco](imgs/char_Francisco.png)

### naturally:
![naturally](imgs/char_naturally.png)

### expectation:
![expectation](imgs/char_expectation.png)

![title](imgs/a5_problem3c.png)

## Answer:
1) Spanish:
* Bien, al día siguiente estábamos en Cleveland.
* La epifanía es que la muerte es parte de la vida.

2) English:
* Well, the next day we were in Cleveland.
* The epiphany is that death is a part of life.

3) A4 Translation:
* Well, the next day we were in &lt;unk&gt;
* &lt;unk&gt; is that death is part of life.

4) Char-CNN Translation:
* Well, the next day we were in Christmas.  
* The evidence is that death is part of life.

5) Evaluation:
* Example a: incorrect, perhaps because "Cleveland" is a proper noun (although the suffix "land" could have provided a hint that the word refers to a place).
* Example b: largely correct, because "evidence" and "epiphany" have semantically related (if not similar) meanings.