## Homework 3 - Machine Translation - MDS Computational Linguistics

### Assignment Topics
- Seq2seq with attention
- Evaluation metric


### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: March 15, 2020, 18:00:00 (Vancouver time)



## Load required packages

In [None]:
import unicodedata
import string
import re
import random
import time
import datetime
import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torchtext
from torchtext.datasets import TranslationDataset

import spacy
import numpy as np

## Set seed of randomization and working device

In [None]:
manual_seed = 77
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

## Attention! Tidy Submission

rubric={mechanics:3}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions).
- You should not use this notebook (i.e., `Lab3.ipynb`) to train your model on Colab. Colab may change the layout of the Jupyter notebook, in which case we cannot grade it and the system will miss your grade!
- We provide another jupyter notebook (i.e., `Lab3_exp.ipynb`) for you. You can load `Lab3_exp.ipynb` to Colab and run your experiments in this jupyter notebook. 
- Please download `Lab3_exp.ipynb` from Colab and include it in your final submission. 
- Some comments in your code will help us grade. Please use headings, comments, and markdown notation to organize your code. 
- Please feel free to add cells in `Lab3_exp.ipynb`.
- You don't need to submit your checkpoints.
- You can reuse any scripts of lab tutorials.

**Dataset**

In all the questions of this lab, we continue to use the English-French bilingual corpus of the [Multi30k](https://github.com/multi30k/dataset) dataset that we used in lab2 and the tutorials. Our task is to `translate text from French to English`. All your models should be trained on `train_eng_fre.tsv`, validated on `val_eng_fre.tsv`, and tested on `test_eng_fre.tsv`. 

## Exercise 1: Seq2Seq Variant

Last week, we used a **uni-directional LSTM Encoder** to compress the information of a source sentence into two fixed-length context representation vectors (i.e., the final hidden and cell states). We then used the final hidden and cell states to initialize the hidden and cell states of a uni-directional LSTM Decoder. However, the capacity of these representations is limited: they can easily forget the earlier information of a long sequence and co-reference relationships. Before introducing the attention mechanism, we can use some tricks to mitigate this context **information bottleneck** problem. 

In this exercise, please write a script to implement the following tricks in **a single seq2seq model**:
1. Use a **bi-directional LSTM** as the **Encoder** to get the context representation of the source sentence. 

2. Instead of using the final hidden and cell states of the bi-directional encoder to initialize the uni-directional decoder, (A) for $h_0$ use **the mean of all hidden states** and (B) for $c_0$ use the **final cell state** of the bi-directional encoder. 

Combine 1 & 2 in **a single seq2seq model**.

**Instruction:**
- Please paste your experiment code to answer the corresponding questions below.
- You should train your model with the following hyper-parameters:
```
INPUT_DIM = 6004
OUTPUT_DIM = 6004
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
N_LAYERS = 1
BI_DIRECTION = True
ENC_HID_DIM = 512
DEC_HID_DIM = XXXXX (you should figure out this hyper-parameter). 
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.3
TEACH_FORCING_RATE = 0.5
LEARNING_RT = 0.001
MAX_EPOCH = 15
optimizer = optim.Adam(model.parameters(),lr=LEARNING_RT)
```
- You should use the `init_weights()` function from this week's tutorial to initialize the model from a normal distribution with `mean=0` and `std=0.01`. 
- Your seed of randomization should be 77 (i.e., manual_seed = 77). 
- You should use `nn.CrossEntropyLoss()` loss function and ignore `<pad>` tokens.
- You should save the model checkpoint at the end of each epoch. You also need to save your vocabularies.
- You should use a different checkpoint directory to avoid overwriting previous models. 
- Then, load the best checkpoint and evaluate it on the TEST set. 
- You should keep your vocabularies and best checkpoints. We will use them in Exercise 3 (Error analysis). 
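
Two of the instructions above (ignoring `<pad>` in the loss and normal initialization) can be illustrated on toy tensors. This is a minimal sketch under stated assumptions: `PAD_IDX = 1` is an assumed index (in practice it would be read from the target vocabulary, as `inference()` does with `trg_vocab.vocab.stoi[trg_vocab.pad_token]`), and `init_weights_sketch` is a hypothetical stand-in, not the tutorial's `init_weights()` itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(77)

PAD_IDX = 1  # assumed index of <pad>; read it from the target vocabulary in practice
vocab_size = 6

# Cross-entropy that skips positions whose target is <pad>.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(4, vocab_size)               # [num_tokens, vocab_size]
targets = torch.tensor([2, 3, PAD_IDX, PAD_IDX])  # last two positions are padding

loss_with_pad = criterion(logits, targets)
# Same loss computed only on the two non-pad positions.
loss_no_pad = nn.CrossEntropyLoss()(logits[:2], targets[:2])
print(torch.allclose(loss_with_pad, loss_no_pad))


def init_weights_sketch(model, mean=0.0, std=0.01):
    """Draw every parameter from a normal distribution (hypothetical helper)."""
    for _, param in model.named_parameters():
        nn.init.normal_(param.data, mean=mean, std=std)


toy_model = nn.Linear(8, 4)  # hypothetical tiny model, for illustration only
init_weights_sketch(toy_model)
```

The padded positions contribute nothing to the averaged loss, so the two values printed for comparison agree.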

**Hints:**
- Although the encoder is bi-directional LSTM, the decoder must be a uni-directional LSTM.
- This exercise is more related to last week's tutorial (i.e., week 2). 
- The last hidden state of the LSTM is `h_n` of shape (num_layers * num_directions, batch, hidden_size).
- The last cell state of the LSTM is `c_n` of shape (num_layers * num_directions, batch, hidden_size).
- All the hidden states from the last LSTM layer are in `output`, of shape (seq_len, batch, num_directions * hidden_size).
- The initialization states (i.e., $s_0$, $c_0$) of the Decoder must match the dimension of the Decoder. Namely, you should choose an appropriate value of `DEC_HID_DIM` by analyzing the relation between tensor shapes. 
- You can use `print(XXX.shape)` to check the shape of your tensor. If the tensor shape doesn't match the desired shape, you should reshape it using `.view()`, `.squeeze()`, `.unsqueeze()`, or `.permute()`.
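
The shapes listed in these hints can be checked on a toy tensor. The sketch below uses made-up sizes (`toy_emb_dim`, `toy_hid_dim`, and the other names are illustrative, not the assignment's hyper-parameters) and shows one possible way to form a mean hidden state and a combined final cell state. It is a shape check under those assumptions, not a reference solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(77)

# Toy sizes chosen for illustration only (not the assignment's values).
src_len, batch_size, toy_emb_dim, toy_hid_dim = 5, 3, 8, 4

lstm = nn.LSTM(toy_emb_dim, toy_hid_dim, num_layers=1, bidirectional=True)
embedded = torch.randn(src_len, batch_size, toy_emb_dim)

output, (h_n, c_n) = lstm(embedded)

print(output.shape)  # (seq_len, batch, num_directions * hidden_size)
print(h_n.shape)     # (num_layers * num_directions, batch, hidden_size)
print(c_n.shape)     # same layout as h_n

# Averaging over the time dimension keeps the feature dimension of `output`.
mean_hidden = output.mean(dim=0)
# With one layer, index 0 is the forward direction and index 1 the backward one;
# here they are concatenated along the feature axis.
final_cell = torch.cat((c_n[0], c_n[1]), dim=1)
print(mean_hidden.shape, final_cell.shape)
```

Comparing the last two printed shapes with the shape the decoder LSTM expects for its initial states is the kind of analysis the `DEC_HID_DIM` hint refers to.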

**To facilitate your model evaluation, we provide an `inference()` function which calculates the BLEU score on a test corpus (test_eng_fre.tsv).**

In [None]:
def inference(model, file_name, src_vocab, trg_vocab, attention=False, max_trg_len = 64):
    '''
    Function for translation inference

    Input: 
    model: translation model;
    file_name: the path of the test file, in which the first column is the target reference and the second column is the source language;
    src_vocab: Source torchtext Field
    trg_vocab: Target torchtext Field
    attention: whether the model returns attention weights or not.
    max_trg_len: the maximal length of the translated text (optional), default = 64

    Output:
    Corpus BLEU score.
    '''
    from nltk.translate.bleu_score import corpus_bleu
    from nltk.translate.bleu_score import sentence_bleu
    from torchtext.data import TabularDataset
    from torchtext.data import Iterator

    # convert index to text string
    def convert_itos(convert_vocab, token_ids):
        list_string = []
        for i in token_ids:
            if i == convert_vocab.vocab.stoi['<eos>']:
                break
            else:
                token = convert_vocab.vocab.itos[i]
                list_string.append(token)
        return list_string

    test = TabularDataset(
      path=file_name, # the path of the tsv file that holds the data
      format='tsv',
      skip_header=True, # if your tsv file has a header, make sure to pass this to ensure it doesn't get processed as data!
      fields=[('TRG', trg_vocab), ('SRC', src_vocab)])

    test_iter = Iterator(
    dataset = test, # we pass in the datasets we want the iterator to draw data from
    sort = False,batch_size=128,
    sort_key=None,
    shuffle=False,
    sort_within_batch=False,
    device = device,
    train=False
    )
  
    model.eval()
    all_trg = []
    all_translated_trg = []

    TRG_PAD_IDX = trg_vocab.vocab.stoi[trg_vocab.pad_token]

    with torch.no_grad():
    
        for i, batch in enumerate(test_iter):

            src = batch.SRC
            #src = [src len, batch size]

            trg = batch.TRG
            #trg = [trg len, batch size]

            batch_size = trg.shape[1]

            # create a placeholder for the target language with shape [max_trg_len, batch_size] where all the elements are the index of <pad>, then send it to device
            trg_placeholder = torch.Tensor(max_trg_len, batch_size)
            trg_placeholder.fill_(TRG_PAD_IDX)
            trg_placeholder = trg_placeholder.long().to(device)
            if attention == True:
                output,_ = model(src, trg_placeholder, 0) #turn off teacher forcing
            else:
                output,_ = model(src, trg_placeholder, 0) #turn off teacher forcing
            # get translation results; we ignore the first token <sos> in both translation and target sentences. 
            # output_translate = [(trg len - 1), batch, output dim] output dim is size of target vocabulary.
            output_translate = output[1:]
            # store gold target sentences to a list 
            all_trg.append(trg[1:].cpu())

            # Choose top 1 word from decoder's output, we get the probability and index of the word
            prob, token_id = output_translate.data.topk(1)
            translation_token_id = token_id.squeeze(2).cpu()

            # store translated sentences to a list 
            all_translated_trg.append(translation_token_id)
      
    all_gold_text = []
    all_translated_text = []
    for i in range(len(all_trg)): 
        cur_gold = all_trg[i]
        cur_translation = all_translated_trg[i]
        for j in range(cur_gold.shape[1]):
            gold_convered_strings = convert_itos(trg_vocab,cur_gold[:,j])
            trans_convered_strings = convert_itos(trg_vocab,cur_translation[:,j])

            all_gold_text.append(gold_convered_strings)
            all_translated_text.append(trans_convered_strings)

    corpus_all_gold_text = [[item] for item in all_gold_text]
    corpus_bleu_score = corpus_bleu(corpus_all_gold_text, all_translated_text)  
    return corpus_bleu_score

The `inference()` function takes six arguments (`model`, `file_name`, `src_vocab`, `trg_vocab`, `attention`, and `max_trg_len`) as inputs and returns a corpus-level cumulative BLEU-4 score. Here is a use case.

In [None]:
# use case
print(inference(model_best, "./drive/My Drive/Colab Notebooks/eng-fre/test_eng_fre.tsv", SRC, TRG, True, 64))

### 1.0 Please paste your full training log here. Which epoch is the best?
rubric={accuracy:2}
Your log should look like this:
```
Epoch: 01 | Time: 1m 25s
	Train Loss: 4.293 | Train PPL:  73.188
	 Val. Loss: 4.263 |  Val. PPL:  71.012 
   ............
```
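
The PPL columns in this sample appear to be the exponential of the corresponding loss (for example, exp(4.293) is about 73.19). Under that assumption, a log line in this layout could be produced as sketched below. `format_epoch_log` is a hypothetical helper, not part of the tutorial code.

```python
import math


def format_epoch_log(epoch, mins, secs, train_loss, valid_loss):
    """Return one epoch's log lines in the layout shown above (hypothetical helper)."""
    return (
        f"Epoch: {epoch:02} | Time: {mins}m {secs}s\n"
        f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}\n"
        f"\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}"
    )


print(format_epoch_log(1, 1, 25, 4.293, 4.263))
```

The printed perplexities can differ slightly from the sample because the losses shown in a log are already rounded.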


**Your answer:**
**My best model is trained with XX epochs.**

```
Your log goes here
```

### 1.1 Please report the cumulative BLEU-4 score on test set (i.e., `test_eng_fre.tsv`) via corpus_bleu() function.
rubric={accuracy:2}

Hint: You can use the `inference()` function. 

**Your answer goes here**

**My best model obtains XX.XX cumulative BLEU-4 score.**

### Please paste your code for the corresponding questions.

### 1.2 Please revise the following code to build the appropriate vocabularies:
rubric={accuracy:2}

In [None]:
# Your code goes here
TRG.build_vocab(train)
SRC.build_vocab(train)

### 1.3 You may need to revise `class Encoder`. Please show your code for `class Encoder`:
rubric={accuracy:2, efficiency:2}

In [None]:
# Your code goes here

### 1.4  You may need to revise `class Decoder`. Please show your code for `class Decoder`:
rubric={accuracy:2, efficiency:2}

In [None]:
# Your code goes here

### 1.5 You may need to revise `class Seq2Seq`. Please show your code for `class Seq2Seq`:
rubric={accuracy:2, efficiency:2}

In [None]:
# Your code goes here

### 1.6 You may need to revise the code of instantiating classes. Please show your code for instantiation:
rubric={accuracy:2}

In [None]:
# Your code goes here

## Exercise 2: Seq2Seq with a variant of the attention mechanism

Attention mechanisms boost the performance of deep learning models in machine translation and classification tasks, helping models select informative context to form the context representation. They also provide meaningful interpretation of the behavior of black-box deep learning models. In the tutorial of this week, we implemented the **concatenative/additive attention**, which was initially proposed by [Bahdanau et al. (2015)](https://arxiv.org/pdf/1409.0473.pdf) to help memorize long source sentences in neural machine translation models. Additive attention computes the attention score in the following way: $e_{ij} = v_a \tanh(W_a [s_{i-1}; h_j]) \in \mathcal{R}$ where $W_a \in \mathcal{R}^{h\times 2h}$ and $v_a \in \mathcal{R}^{1\times h}$.

In the follow-up work, [Luong et al. (2015)](https://arxiv.org/pdf/1508.04025.pdf) examine two types of attention-based models (i.e., **global attention and local attention**) with effective classes of attention alignment functions (e.g., concat product, dot product, and general product). Generally, **a global approach** always attends to all source tokens, and **a local approach** only looks at a subset of source tokens at a time. [Luong et al. (2015)](https://arxiv.org/pdf/1508.04025.pdf) report the performance of models under different settings. The results show that the model trained with **global attention with dot product alignment function** is better than the MLP model trained with **global attention and concat product alignment function**. We briefly studied **global attention with dot product alignment function** as **Dot-Product/Multiplicative** attention in the tutorial.

In this exercise, you need to write code to implement **global attention with dot product alignment function** (**Dot-Product/Multiplicative**) from [Luong et al. (2015)](https://arxiv.org/pdf/1508.04025.pdf) to solve our translation task. 

Hint:
- This exercise is more related to the tutorial of this week (i.e., week 3).

### 2.1 Warm Up
rubric={accuracy:8}

As a quick warm-up, take a minute to review **the code from the tutorial of this week** and identify the code corresponding to each step below. You can copy and paste the code from the tutorial into the boxes below. You don't need to write any of your own code in this section. 

Bahdanau et al. use **global attention** with one alignment function, the **concatenative/additive attention**. This takes in the previous hidden state of the decoder, $s_{t-1} \in \mathcal{R}^h$, and all of the stacked hidden states from the encoder, $H \in \mathcal{R}^{T\times h}$. First, we calculate the *energy* between the previous decoder hidden state (i.e., $s_{t-1}$) and the encoder hidden states (i.e., $H$).

The first thing we do is `repeat` the previous decoder hidden state $T$ times to obtain a matrix, $S_{t-1} \in \mathcal{R}^{T \times h}$.

**Please paste the line(s) that perform this `repeat` operation:**

In [None]:
# Paste code here

We then calculate the energy, $E_t$, between them by concatenating them together and passing them through a linear layer (`W_a`) and a $\tanh$ activation function. 

$$E_t = \tanh(W_a([S_{t-1}; H]^T))$$

where $E_t$ is a **[decoder_hidden_dim, src_len]** tensor.

**Please paste the line(s) that calculate $E_t$:**

In [None]:
# Paste code here

Then, we want to convert $E_t$ into a vector of size **[src_len]** for **each example** in the batch, as the attention should be over the length of the source sentence. This is achieved by multiplying the `energy` by a **[1, decoder_hidden_dim]** tensor, $v_a$.

$$\hat{a}_t = v_a E_t$$

**Please paste the line(s) that calculate $\hat{a}_t$ in a batch:**

In [None]:
# Paste code here

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

**Please paste the line(s) that calculate $a_t$ and return it to Decoder:**

In [None]:
# Paste code here

### 2.2 Write code to implement global attention with dot product/multiplicative attention function
rubric={mechanics:1}

Instead of using the **additive attention** to get the attention score $\alpha_i$:
$$\alpha_i = \text{softmax}(v_a \tanh(W_a[S_{i-1}; H]^T))$$

**We will use the *global attention with dot product alignment function* (Dot-Product/Multiplicative based attention function), that is, $\alpha_i = \text{softmax}([s_{i-1}^T h_1,\cdots,s_{i-1}^T h_t])$, to calculate the attention score $\alpha_i$ and follow the subsequent steps to generate the translation token $\hat{y}_i$**: 
1. the `outputs` tensor is created to hold all predictions, $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_t\}$, where $t$ is the maximal length of the target language;
2. the source sequence, $X = \{x_1, \dots, x_t\}$, is fed into the encoder to obtain the last hidden state, $h_t$, and the last cell state, $c^{Encoder}_t$;
3. the initial decoder hidden state is set to $h_t$, and the initial decoder cell state is set to $c^{Encoder}_t$ (i.e., $s_0 = h_t$; $c^{Decoder}_0 = c^{Encoder}_t$);
4. we use a batch of `<sos>` tokens as the first `input` (i.e., $y_0$);
5. we then decode within a loop:

 for i in range(1, t), where t is the maximal length of the target language:
    1. insert the input token $y_{i-1}$, previous hidden state, $s_{i-1}$, and previous cell state, $c^{Decoder}_{i-1}$, into the Decoder;
    2. use **`attention_function()`** to calculate the attention vector $\alpha_i$ based on $h_1,\dots,h_t$ (all encoder hidden states stacked up form $H$) and $s_{i-1}$;
    3. use this attention vector to create a weighted context vector, $c_i$, denoted by `weighted`, which is a weighted sum of the encoder hidden states, $H$, using $\alpha_i$ as the weights (i.e., $c_i = \alpha_i^T H$);
    4. concatenate the embedding of the current decoder input token with the weighted context vector $c_i$, and pass this concatenation through the decoder RNN to get a new hidden state $s_{i}$ of shape `[batch, decoder hidden dimension]`;
    5. pass $s_{i}$ through the linear layer, $f_{output}$, to make a prediction of the next word in the target sentence, $\hat{y}_{i}$;
    6. decide whether to use **teacher forcing** or not, setting the next input as appropriate.


where **`attention_function()`** is based on **dot product attention**:
$\alpha_i = softmax([s_{i-1}^T h_1,\cdots,s_{i-1}^T h_t])$
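
The last step of the decoding loop above (deciding whether to use teacher forcing) can be illustrated without any model. The sketch below is plain Python with hypothetical names (`choose_next_input`, `teacher_forcing_ratio`). It assumes the common convention of drawing one random number per decoding step and feeding the gold token when the draw falls below the ratio.

```python
import random


def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the gold token with probability `teacher_forcing_ratio`, else the prediction."""
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    return gold_token if use_teacher_forcing else predicted_token


rng = random.Random(77)
gold = ["a", "woman", "reads", "<eos>"]
predicted = ["a", "girl", "reads", "<eos>"]

# With a ratio of 0.5, roughly half of the next inputs come from the gold sequence.
next_inputs = [choose_next_input(g, p, 0.5, rng) for g, p in zip(gold, predicted)]
print(next_inputs)

# A ratio of 0 never uses the gold token, which is how `inference()` turns teacher forcing off.
no_forcing = [choose_next_input(g, p, 0.0, rng) for g, p in zip(gold, predicted)]
print(no_forcing)
```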


**The pseudo code for computing attention vector:**

```
class Decoder(nn.Module):
        INPUT: 
        current Decoder hidden state (s_{i-1}), decoder_output: [batch size, dec hid dim]
        all hidden state of last layer of Encoder (H), encoder_outputs: [src len, batch size, enc hid dim]
        OUTPUT: 
        attention_vector (a_t), attention_vector: (batch_size, src_len)
        ------------------------------------------------------------------------------------------
        # For-loop version: 
        attention_vector = torch.zeros(batch_size, max_src_len)

        # For every batch, every time step of encoder's hidden state, calculate attention score.
        for b in range(batch_size):
            for t in range(max_src_len):
                # Luong et al. (2015) equation(8) -- dot form content-based attention:
                attention_vector[b,t] = decoder_output[b] **dot product** encoder_outputs[t,b]
                
        ------------------------------------------------------------------------------------------
        # Vectorized version:

        1. attention_vector =  
           (batch_size, seq_len=1, hidden_size) **batch matrix-matrix product** (batch_size, hidden_size, max_src_len) = (batch_size, seq_len=1, max_src_len)
           
         return attention_vector
```
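
The pseudo code above can be checked on toy tensors. The following is a minimal sketch, assuming the decoder state and the encoder outputs share the same hidden size. All sizes and variable names are illustrative, and it shows one possible realization, not the required implementation. It verifies that the for-loop and the batched matrix product give the same scores, then applies the softmax and forms the weighted context vector.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(77)

# Toy sizes for illustration; both hidden sizes are assumed to be equal.
batch_size, max_src_len, hid_dim = 2, 5, 4
decoder_hidden = torch.randn(batch_size, hid_dim)                # s_{i-1}
encoder_outputs = torch.randn(max_src_len, batch_size, hid_dim)  # H

# For-loop version: one dot product per (example, source position).
scores_loop = torch.zeros(batch_size, max_src_len)
for b in range(batch_size):
    for t in range(max_src_len):
        scores_loop[b, t] = torch.dot(decoder_hidden[b], encoder_outputs[t, b])

# Vectorized version: (batch, 1, hid) x (batch, hid, src_len) -> (batch, 1, src_len).
query = decoder_hidden.unsqueeze(1)
keys = encoder_outputs.permute(1, 2, 0)
scores_bmm = torch.bmm(query, keys).squeeze(1)

# Normalize over source positions, then build the weighted context vector.
attention = F.softmax(scores_bmm, dim=1)                          # (batch, src_len)
context = torch.bmm(attention.unsqueeze(1),
                    encoder_outputs.permute(1, 0, 2)).squeeze(1)  # (batch, hid)

print(torch.allclose(scores_loop, scores_bmm, atol=1e-6))
print(attention.sum(dim=1))
print(context.shape)
```

The batched product avoids the Python-level double loop, which matters once the batch size and source length grow.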

Equation 8 from Luong's paper is:
<img src="attention_images/luong_eqn8.png" title="equation 8 from Luong paper" height="550" width="450"/>

**Instruction:**
- Please paste your experiment code to answer the corresponding questions below.
- You should train your model with the following hyper-parameters:
```
INPUT_DIM = 6004
OUTPUT_DIM = 6004
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
N_LAYERS = 1
BI_DIRECTION = True
ENC_HID_DIM = 512
DEC_HID_DIM = XXXXX (you should figure out this hyper-parameter). 
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.3
TEACH_FORCING_RATE = 0.5
LEARNING_RT = 0.001
MAX_EPOCH = 15
optimizer = optim.Adam(model.parameters(),lr=LEARNING_RT)
```
- You should use the `init_weights()` function from this week's tutorial to initialize the model from a normal distribution with `mean=0` and `std=0.01`.
- Your seed of randomization should be 77 (i.e., manual_seed = 77). 
- You should use `nn.CrossEntropyLoss()` loss function and ignore `<pad>` tokens.
- You should save the model to checkpoint at the end of each epoch. You also need to save your vocabularies.
- You should use a different checkpoint directory to avoid overwriting previous models. 
- Then, load the best checkpoint and evaluate it on the TEST set. 
- You should keep your vocabularies and best checkpoints. We will use them in Exercise 3 (Error analysis). 

**Hints:**
- This exercise is more related to last tutorial (i.e., week 2). 
- Although the Encoder is bi-directional LSTM, the Decoder must be a uni-directional LSTM.
- The last hidden state of the LSTM is `h_n` of shape (num_layers * num_directions, batch, hidden_size).
- The last cell state of the LSTM is `c_n` of shape (num_layers * num_directions, batch, hidden_size).
- All the hidden states from the last LSTM layer are in `output`, of shape (seq_len, batch, num_directions * hidden_size).
- The initialization states (i.e., $s_0$, $c_0$) of the Decoder must match the dimension of the Decoder. Namely, you should choose an appropriate value of `DEC_HID_DIM` by analyzing the relation between tensor shapes. 
- You can use `print(XXX.shape)` to check the shape of your tensor. If the tensor shape doesn't match the desired shape, you should reshape it using `.view()`, `.squeeze()`, `.unsqueeze()`, or `.permute()`.

### 2.2.1 Please paste your full training log here. Which epoch is the best?
rubric={accuracy:2}
Your log should look like this:
```
Epoch: 01 | Time: 1m 25s
	Train Loss: 4.293 | Train PPL:  73.188
	 Val. Loss: 4.263 |  Val. PPL:  71.012 
   ............
```

**Your answer:**
**My best model is trained with XX epochs.**

```
Your log goes here
```

### 2.2.2 Please report the cumulative BLEU-4 score on test set (i.e., `test_eng_fre.tsv`) via the corpus_bleu() function.
rubric={accuracy:2}

Hint: You can use the `inference()` function. 

**Your answer goes here**

**My best model obtains XX.XX cumulative BLEU-4 score.**

### Please paste your code for the corresponding questions.

### 2.2.3 You may need to revise `class Encoder`. Please show your code for `class Encoder`:
rubric={accuracy:2, efficiency:2}

In [None]:
# Your code goes here

### 2.2.4  You may need to revise `class Decoder`. Please show your code for `class Decoder`:
rubric={accuracy:4, efficiency:2}

In [None]:
# Your code goes here

### 2.2.4a  Give your code for the `class Attention`:
rubric={accuracy:4, efficiency:2}

In [None]:
# Your code goes here

### 2.2.5 You may need to revise `class Seq2Seq`. Please show your code for `class Seq2Seq`:
rubric={accuracy:2, efficiency:2}

In [None]:
# Your code goes here

### 2.2.6 You may need to revise the code of instantiating classes. Please show your instantiation code:
rubric={accuracy:2}

In [None]:
# Your code goes here

### 2.2.7 Please write code to visualize the attention alignment of the following sentences using your best model in 2.2 and include your visualization pictures in your Lab3 directory. Your visualization pictures should be named `ex2_sentence_n.png` where $n$ is the index. Please include your code in `Lab3_exp.ipynb`.
rubric={accuracy:3}

In [None]:
sentence_1 = "Une femme lit un magazine par dessus l'épaule d'une autre femme."
sentence_2 = "Un gars torse nu regardant au loin tandis que trois femmes passent devant une foule assise devant un café."
sentence_3 = "Deux groupes de baigneurs barbotent dehors."

**Put your translation here**

My translation:
- sentence_1:
- sentence_2:
- sentence_3:

### 2.2.8 Please write code to visualize the attention alignment of the three sentences in 2.2.7 using your best model of `seq2seq+additive attention (this week's tutorial)` and include your visualization pictures in your Lab3 directory. Your visualization pictures should be named `tutorial3_sentence_n.png` where $n$ is the index. Please include your code in `Lab3_exp.ipynb`.
rubric={accuracy:3}

**Put your translation here**

My translation:
- sentence_1:
- sentence_2:
- sentence_3:

### 2.2.9 Please compare the translations of 2.2.7 and 2.2.8 and provide an error analysis. (OPTIONAL)
rubric={reasoning:2}

Hints:
- Try to identify issues with the translation.
- If you don’t know the two languages, you can compare against Google translation. 
- Which model is better? Why?
- Your answer should be less than 100 words.

**Your answer goes here:**

## Exercise 3 Error analysis

Now, we have four seq2seq models: (1) basic seq2seq from week2 tutorial (`seq2seq_tutorial.ipynb`); (2) seq2seq with additive attention from week3 tutorial (`attention_tutorial.ipynb`); (3) seq2seq variant of Exercise 1; (4) seq2seq with dot product attention in Exercise 2. Please use the best models of these four models to answer the following questions. 

If you had problems creating any of these 4 models, just report your results on the models you were able to make work/develop.

### 3.1 Please report the cumulative BLEU-4 score on test set (i.e., `test_eng_fre.tsv`) via the corpus_bleu() function.
rubric={accuracy:1}

Hint: You can use the `inference()` function. 

**Your answer goes here:**
- For model (1), **my best model obtains XX.XX cumulative BLEU-4 score with XX epoch(s).** 
- For model (2), **my best model obtains XX.XX cumulative BLEU-4 score with XX epoch(s).** 
- For model (3), **my best model obtains XX.XX cumulative BLEU-4 score with XX epoch(s).** 
- For model (4), **my best model obtains XX.XX cumulative BLEU-4 score with XX epoch(s).** 

### 3.2 Please evaluate any one of the four models on the test set (i.e., `test_eng_fre.tsv`) and report its BLEU-1, cumulative BLEU-2, BLEU-3, and BLEU-4 scores via the corpus_bleu() function.
rubric={accuracy:2}

Hints: 
- You can use the `inference()` function, but you will need to revise a few lines of it.
- You may need to review the BLEU implementation in week 2 tutorial. 
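
NLTK's `corpus_bleu()` accepts a `weights` tuple that controls which n-gram orders enter the score (the default is uniform weights over 1- to 4-grams). To make the meaning of cumulative BLEU-n concrete, here is a simplified from-scratch sketch. It assumes one reference per hypothesis and no smoothing, and `cumulative_bleu` is a hypothetical helper, not NLTK's implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cumulative_bleu(references, hypotheses, max_n):
    """Corpus-level BLEU with uniform weights over orders 1..max_n (simplified sketch)."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


refs = [["a", "man", "is", "riding", "a", "bike", "on", "the", "street"]]
hyps = [["a", "man", "is", "riding", "a", "bike", "on", "a", "road"]]
scores = [cumulative_bleu(refs, hyps, n) for n in range(1, 5)]
for n, score in enumerate(scores, start=1):
    print(f"cumulative BLEU-{n}: {score:.4f}")
```

In NLTK, the corresponding scores would be requested by passing, for example, `weights=(1, 0, 0, 0)` for BLEU-1 or `weights=(0.5, 0.5)` for cumulative BLEU-2.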

**Your answer goes here:**
- **I evaluate on model (X).** 
- **This model obtains XX.XX BLEU-1 score** 
- **This model obtains XX.XX cumulative BLEU-2 score** 
- **This model obtains XX.XX cumulative BLEU-3 score** 
- **This model obtains XX.XX cumulative BLEU-4 score** 

### 3.3 Please explain the results of 3.2 briefly. 
rubric={reasoning:2}

Hints: What is the relationship between BLEU-n scores? Why? 

**Your answer goes here:**

To analyze the effects of sentence length, we create two subsets of the test file: A. `long_test_eng_fre.tsv`, which only includes sentence pairs whose English reference is longer than 19 tokens; B. `short_test_eng_fre.tsv`, which only includes sentence pairs whose English reference is shorter than 9 tokens. `long_test_eng_fre.tsv` includes 67 sentence pairs. `short_test_eng_fre.tsv` includes 85 sentence pairs. 
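
For readers curious how such subsets could be built, here is a minimal sketch. It assumes whitespace tokenization, which may differ from the tokenizer actually used to create the provided files, and `filter_by_reference_length` is a hypothetical helper.

```python
def filter_by_reference_length(rows, min_len=None, max_len=None):
    """Keep (english, french) rows whose English side has more than `min_len`
    and fewer than `max_len` whitespace-separated tokens (hypothetical helper)."""
    kept = []
    for english, french in rows:
        n_tokens = len(english.split())
        if min_len is not None and n_tokens <= min_len:
            continue
        if max_len is not None and n_tokens >= max_len:
            continue
        kept.append((english, french))
    return kept


# Toy rows for illustration; the thresholds in the text are 19 (long) and 9 (short).
rows = [
    ("A woman reads a magazine .", "Une femme lit un magazine ."),
    ("Two dogs run .", "Deux chiens courent ."),
]
short_rows = filter_by_reference_length(rows, max_len=5)
long_rows = filter_by_reference_length(rows, min_len=5)
print(short_rows)
print(long_rows)
```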

### 3.4 Please report the cumulative BLEU-4 score on long sentences of test set (i.e., `long_test_eng_fre.tsv`) via the corpus_bleu() function. 
rubric={accuracy:2}

Hint: You can use the `inference()` function. 

**Your answer goes here:**

Evaluate on `long_test_eng_fre.tsv`:
- For model (1), **my best model obtains XX.XX cumulative BLEU-4.** 
- For model (2), **my best model obtains XX.XX cumulative BLEU-4.**
- For model (3), **my best model obtains XX.XX cumulative BLEU-4.**
- For model (4), **my best model obtains XX.XX cumulative BLEU-4.**

### 3.5 Please report the cumulative BLEU-4 score on short sentences of test set (i.e., `short_test_eng_fre.tsv`) via the corpus_bleu() function. 
rubric={accuracy:2}

Hint: You can use the `inference()` function. 

**Your answer goes here:**

Evaluate on `short_test_eng_fre.tsv`:
- For model (1), **my best model obtains XX.XX cumulative BLEU-4.** 
- For model (2), **my best model obtains XX.XX cumulative BLEU-4.**
- For model (3), **my best model obtains XX.XX cumulative BLEU-4.**
- For model (4), **my best model obtains XX.XX cumulative BLEU-4.**

### 3.6 Please compare the results of questions 3.1-3.3. Which model performs best? What are the differences between their performances? How does sentence length affect performance?
rubric={reasoning:4}

Hint:
- You can review Section 5 of [Luong et al. (2015)](https://arxiv.org/pdf/1508.04025.pdf) to answer this question.
- Your answer should be less than 100 words. 

**Your answer goes here:**

## Exercise 4 Conceptual Questions

### 4.1 BLEU
rubric={reasoning:2}

BLEU is one of the most widely used metrics for NLP, but why can't we just use accuracy to measure the success of translations? Does accuracy make sense in machine translation? 

Hint: Feel free to do some research to answer this question.

**Your answer goes here:**

### 4.2 Teacher Forcing
rubric={reasoning:2}

For training seq2seq models we often use Teacher Forcing as part of the training procedure. Identify two advantages of using this approach in training.  

Hint: Feel free to do some research to answer this question.

**Your answer goes here:**

### 4.3 Attention bottleneck
rubric={reasoning:1}

Learning attention functions automatically requires large volumes of data, especially for recurrent neural networks. Can you recommend a solution to overcome this problem? Barrett and Bingel investigate an approach to address this limitation. Take a few minutes to SKIM the abstract and introduction and write a few sentences summarizing takeaways that you can use in practice.

Barrett, M., Bingel, J., Hollenstein, N., Rei, M., & Søgaard, A. (2018, October). Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning (pp. 302-312).


**Your answer goes here:**

### 4.4 Global Attention Mechanisms
rubric={reasoning:1}

Global attention mechanisms consider all the encoder hidden states when deriving the context vector, while local attention mechanisms consider only a subset of the encoder hidden states when deriving the context vector. [Luong et al. 2015](https://arxiv.org/pdf/1508.04025.pdf) proposed **three global attention mechanisms**. In the tutorial, we looked at (terminology borrowed from Luong's paper):
* Dot: $e_{ij} = s_{i-1}^T h_j \in \mathcal{R}$
* Concat: $e_{ij} = v_a \tanh(W_a [s_{i-1}; h_j]) \in \mathcal{R}$ where $W_a \in \mathcal{R}^{h\times 2h}$ and $v_a \in \mathcal{R}^{1\times h}$

Can you find out the third global attention mechanism by going over the paper (especially Section 3.1) and write it down in the same format as above?

**Your answer goes here:**