## Homework 2 - Machine Translation - MDS Computational Linguistics

### Assignment Topics
- seq2seq Models
- Grid Search vs. Random Search for Hyperparameter Tuning


### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.7.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: March 6, 2021, 23:59:00 (Vancouver time)



In [None]:
import unicodedata
import string
import re
import random
import time
import datetime
import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torchtext
from torchtext.datasets import TranslationDataset

import spacy
import numpy as np

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

In [None]:
manual_seed = 531
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)


### Dataset

In all the questions of this lab, we continue to use the English-French bilingual corpus of the [Multi30k](https://github.com/multi30k/dataset) dataset. Our task is to `translate text from French to English`. All your models should be trained on `train_eng_fre.tsv`, validated on `val_eng_fre.tsv`, and tested on `test_eng_fre.tsv`.

## Exercise T2: Seq2Seq Tutorial Recap

#### This exercise isn't meant to be too challenging, but we do want you to be able to talk about the code/architecture with your team in case you need help navigating the code.

In the tutorial, we used a **uni-directional LSTM Encoder** to compress the information of a source sentence into two fixed-length context vectors (i.e., the final hidden and cell states). We then used these final hidden and cell states to initialize the hidden and cell states of a uni-directional LSTM Decoder. However, the capacity of these representations is limited: they can easily forget information from early in a long sequence, as well as co-reference relationships. Before introducing the attention mechanism, we can use a few tricks to ease this context **information bottleneck**.

In this exercise, please write a script to implement the following tricks in **a single seq2seq model**:
1. Use a **bi-directional LSTM** as the **Encoder** to get the context representation of the source sentence. 

2. Instead of using the final hidden and cell states of the bi-directional encoder to initialize the uni-directional decoder, (A) for $h_0$ use **the mean of all hidden states** and (B) for $c_0$ use the **final cell state** of the bi-directional encoder. 

Combine 1 & 2 in **a single seq2seq model**.

**Instruction:**
- Please paste your experiment code to answer the corresponding questions below.
- You should train your model with the following hyper-parameters:
```
INPUT_DIM = 6004
OUTPUT_DIM = 6004
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
N_LAYERS = 1
BI_DIRECTION = True
ENC_HID_DIM = 512
DEC_HID_DIM = XXXXX (you should figure out this hyper-parameter). 
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.3
TEACH_FORCING_RATE = 0.5
LEARNING_RT = 0.001
MAX_EPOCH = 15
optimizer = optim.Adam(model.parameters(),lr=LEARNING_RT)
```
- You should use the `init_weights()` function from this week's tutorial to initialize the model from a normal distribution with `mean=0` and `std=0.01`.
- Your seed of randomization should be 531 (i.e., manual_seed = 531). 
- You should use `nn.CrossEntropyLoss()` loss function and ignore `<pad>` tokens.
- You should save the model checkpoint at the end of each epoch. You also need to save your vocabularies.
- You should use a different checkpoint directory to avoid overwriting previous models. 
- Then, you load the best checkpoint and evaluate it on TEST set. 
- You should keep your vocabularies and best checkpoints. We will use them in Exercise 3 (Error analysis). 
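
As a minimal sketch of the loss requirement in the list above: `nn.CrossEntropyLoss` accepts an `ignore_index` argument, so positions whose gold label is `<pad>` contribute nothing to the loss. The padding index `PAD_IDX = 1`, the toy vocabulary size, and the random logits below are assumptions for illustration only; in the lab the index would come from your own target vocabulary.

```python
import torch
import torch.nn as nn

torch.manual_seed(531)

PAD_IDX = 1       # hypothetical index of <pad>; look it up in your own vocabulary
VOCAB_SIZE = 10   # toy vocabulary size, for illustration only

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Toy logits for 5 target positions and their gold token ids (last two are padding).
logits = torch.randn(5, VOCAB_SIZE)
gold = torch.tensor([4, 7, 2, PAD_IDX, PAD_IDX])

loss_with_ignore = criterion(logits, gold)

# The same loss computed only over the three non-pad positions.
mask = gold != PAD_IDX
loss_non_pad_only = nn.CrossEntropyLoss()(logits[mask], gold[mask])

print(loss_with_ignore.item(), loss_non_pad_only.item())
```

The two printed values should agree, which shows that the padded positions are excluded from the average rather than counted as errors.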

**Hints:**
- Although the encoder is bi-directional LSTM, the decoder must be a uni-directional LSTM. 
- The last hidden state of the LSTM is `h_n` of shape (num_layers * num_directions, batch, hidden_size).
- The last cell state of the LSTM is `c_n` of shape (num_layers * num_directions, batch, hidden_size).
- All the hidden states from the last LSTM layer are in `output`, of shape (seq_len, batch, num_directions * hidden_size).
- The initial states (i.e., $h_0$, $c_0$) of the Decoder must match the dimensions of the Decoder. Namely, you should choose an appropriate value of `DEC_HID_DIM` by analyzing the relation between tensor shapes.
- You can use `print(XXX.shape)` to check the shape of your tensor. If the tensor shape doesn't match the desired shape, you should reshape it using the `.view()`, `.squeeze()`, `.unsqueeze()`, or `.permute()` functions.
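
The shape hints above can be checked on a toy bi-directional LSTM. This is only a sketch: the sizes are deliberately small and are not the lab's hyper-parameters, and the concatenation at the end is one possible way of merging the two directions, not necessarily the one you should use in your model.

```python
import torch
import torch.nn as nn

torch.manual_seed(531)

# Toy sizes for illustration; these are NOT the lab's hyper-parameters.
SEQ_LEN, BATCH, EMB_DIM, HID_DIM = 7, 3, 8, 5

lstm = nn.LSTM(EMB_DIM, HID_DIM, num_layers=1, bidirectional=True)
embedded = torch.randn(SEQ_LEN, BATCH, EMB_DIM)

output, (h_n, c_n) = lstm(embedded)
print(output.shape)  # [seq_len, batch, num_directions * hidden_size]
print(h_n.shape)     # [num_layers * num_directions, batch, hidden_size]
print(c_n.shape)     # [num_layers * num_directions, batch, hidden_size]

# One possible way to merge directions: concatenate the forward and backward
# final cell states along the feature axis.
c_cat = torch.cat((c_n[0], c_n[1]), dim=1)   # [batch, 2 * hidden_size]

# Re-add the leading num_layers axis that an LSTM expects for its initial state.
c_0 = c_cat.unsqueeze(0)                     # [1, batch, 2 * hidden_size]
print(c_0.shape)
```

Printing the shapes at each step, as suggested in the hints, makes it easier to see which decoder hidden size is compatible with the states you pass in.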

**To facilitate your model evaluation, we provide an `inference()` function which calculates the BLEU score on a test corpus (test_eng_fre.tsv).**

In [None]:
def inference(model, file_name, src_vocab, trg_vocab, attention=False, max_trg_len = 64):
    '''
    Translate a test file with the given model and compute the corpus-level BLEU score.

    Input:
    model: translation model;
    file_name: path to the test file, whose first column is the target reference and whose second column is the source text;
    src_vocab: source torchtext Field;
    trg_vocab: target torchtext Field;
    attention: whether the model also returns attention weights;
    max_trg_len: maximum length of the translated text (optional), default = 64.

    Output:
    Corpus BLEU score.
    '''
    from nltk.translate.bleu_score import corpus_bleu
    from nltk.translate.bleu_score import sentence_bleu
    from torchtext.data import TabularDataset
    from torchtext.data import Iterator

    # convert index to text string
    def convert_itos(convert_vocab, token_ids):
        list_string = []
        for i in token_ids:
            if i == convert_vocab.vocab.stoi['<eos>']:
                break
            else:
                token = convert_vocab.vocab.itos[i]
                list_string.append(token)
        return list_string

    test = TabularDataset(
      path=file_name, # the root directory where the data lies
      format='tsv',
      skip_header=True, # if your tsv file has a header, make sure to pass this to ensure it doesn't get processed as data!
      fields=[('TRG', trg_vocab), ('SRC', src_vocab)])

    test_iter = Iterator(
        dataset=test, # we pass in the datasets we want the iterator to draw data from
        sort=False,
        batch_size=128,
        sort_key=None,
        shuffle=False,
        sort_within_batch=False,
        device=device,
        train=False
    )
  
    model.eval()
    all_trg = []
    all_translated_trg = []

    TRG_PAD_IDX = trg_vocab.vocab.stoi[trg_vocab.pad_token]

    with torch.no_grad():
    
        for i, batch in enumerate(test_iter):

            src = batch.SRC
            #src = [src len, batch size]

            trg = batch.TRG
            #trg = [trg len, batch size]

            batch_size = trg.shape[1]

            # create a placeholder for the target language with shape [max_trg_len, batch_size] where all the elements are the index of <pad>, then send it to device
            trg_placeholder = torch.Tensor(max_trg_len, batch_size)
            trg_placeholder.fill_(TRG_PAD_IDX)
            trg_placeholder = trg_placeholder.long().to(device)
            if attention == True:
                output,_ = model(src, trg_placeholder, 0) #turn off teacher forcing
            else:
                #original 
                #output,_ = model(src, trg_placeholder, 0) #turn off teacher forcing
                
                # update:
                output = model(src, trg_placeholder, 0) #turn off teacher forcing
            # get translation results; we ignore the first token <sos> in both translation and target sentences.
            # output_translate = [(trg len - 1), batch, output dim] output dim is size of target vocabulary.
            output_translate = output[1:]
            # store gold target sentences to a list 
            all_trg.append(trg[1:].cpu())

            # Choose top 1 word from decoder's output, we get the probability and index of the word
            prob, token_id = output_translate.data.topk(1)
            translation_token_id = token_id.squeeze(2).cpu()

            # store translated sentences to a list
            all_translated_trg.append(translation_token_id)
      
    all_gold_text = []
    all_translated_text = []
    for i in range(len(all_trg)): 
        cur_gold = all_trg[i]
        cur_translation = all_translated_trg[i]
        for j in range(cur_gold.shape[1]):
            gold_convered_strings = convert_itos(trg_vocab,cur_gold[:,j])
            trans_convered_strings = convert_itos(trg_vocab,cur_translation[:,j])

            all_gold_text.append(gold_convered_strings)
            all_translated_text.append(trans_convered_strings)

    corpus_all_gold_text = [[item] for item in all_gold_text]
    corpus_bleu_score = corpus_bleu(corpus_all_gold_text, all_translated_text)  
    return corpus_bleu_score

The `inference()` function takes six arguments (`model`, `file_name`, `src_vocab`, `trg_vocab`, `attention`, and `max_trg_len`) and returns a corpus-level cumulative BLEU-4 score. Here is a use case.

In [None]:
# use case
print(inference(model_best, "./drive/My Drive/Colab Notebooks/eng-fre/test_eng_fre.tsv", SRC, TRG, True, 64))

### Please paste your code for the corresponding questions.

### T2.1 Please revise the following code to build the appropriate vocabularies:
rubric={accuracy:2}

In [1]:
# Your code goes here


### T2.2 You may need to revise `class Encoder`. Please show your code for `class Encoder`:
rubric={accuracy:2}

In [None]:
# Your code goes here

### T2.3  You may need to revise `class Decoder`. Please show your code for `class Decoder`:
rubric={accuracy:2}

In [None]:
# Your code goes here

### T2.4 You may need to revise `class Seq2Seq`. Please show your code for `class Seq2Seq`:
rubric={accuracy:2}

In [None]:
# Your code goes here

### T2.5 You may need to revise the code of instantiating classes. Please show your code for instantiation:
rubric={accuracy:2}

In [None]:
torch.manual_seed(manual_seed)

# Your code goes here


### T2.6 Please paste your full training log here. Which epoch is the best?
rubric={accuracy:2}

Your log should look like this:
```
Epoch: 01 | Time: 1m 25s
	Train Loss: 4.293 | Train PPL:  73.188
	 Val. Loss: 4.263 |  Val. PPL:  71.012 
   ............
```


**Your answer:**
**My best model is trained with XX epochs.**

```
Your log goes here
```

### T2.7 Please report the cumulative BLEU-4 score on test set (i.e., `test_eng_fre.tsv`) via corpus_bleu() function.
rubric={accuracy:2}

Hint: You can use `inference()` function. 

**Your answer goes here**

**My best model obtains XX.XX cumulative BLEU-4 score.**

## Exercise 1: Seq2Seq Review


### 1.1 Warm Up
rubric={reasoning:2}

As a quick warm-up, take a minute to review the code from the tutorial and identify the hyperparameters related to the three sections of code: Encoder, Decoder, and Seq2Seq. You can just copy and paste these hyperparameters into the box below.

*Note: just because something is fed into the initialization function doesn't mean it's a hyperparameter (e.g., the input dimension, which depends on the problem). Similarly, there is at least one hyperparameter that is relevant but is not passed to the initialization function (for better or worse).*

#### Paste the relevant hyper parameters in each section:
```
Encoder:

Decoder:

Seq2Seq:
```

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim,n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dropout = dropout
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, enc_hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.lstm(embedded)
       
        # outputs are always from the top hidden layer, if bidirectional outputs are concatenated.
        # outputs shape [sequence_length, batch_size, hidden_dim * num_directions]
        # hidden is of shape [num_layers * num_directions, batch_size, hidden_size]
        # cell is of shape [num_layers * num_directions, batch_size, hidden_size]
        
        return hidden, cell

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, dec_hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.output_dim = output_dim
        self.dec_hid_dim = dec_hid_dim
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, dec_hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(dec_hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
             
        # input is of shape [batch_size]
        # hidden is of shape [n_layer * num_directions, batch_size, hidden_size]
        # cell is of shape [n_layer * num_directions, batch_size, hidden_size]
        
        input = input.unsqueeze(0)
        
        # input shape is now [1, batch_size]: a sequence of length 1 for each item in the batch.
        # The extra dimension is needed because, after embedding, the LSTM expects a rank-3 tensor [seq_len, batch_size, emb_dim].
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]    
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        
        # output shape is [sequence_len, batch_size, hidden_dim * num_directions]
        # hidden shape is [num_layers * num_directions, batch_size, hidden_dim]
        # cell shape is [num_layers * num_directions, batch_size, hidden_dim]

        # sequence_len and num_directions will always be 1 in the decoder.
        # output shape is [1, batch_size, hidden_dim]
        # hidden shape is [num_layers, batch_size, hidden_dim]
        # cell shape is [num_layers, batch_size, hidden_dim]
        
        prediction = self.fc_out(output.squeeze(0)) # linear expects a rank 2 tensor as input; output[0] is the top layer's hidden state, so this also works when n_layers > 1
        # predicted shape is [batch_size, output_dim]
        
        return prediction, hidden, cell

In [None]:
class Seq2Seq(nn.Module):
    ''' This class contains the implementation of a complete sequence-to-sequence network.
    It uses the encoder to produce the context vectors.
    It uses the decoder to produce the predicted target sentence.
    Args:
        encoder: An Encoder class instance.
        decoder: A Decoder class instance.
        device: The torch device on which the output tensor is allocated.
    '''
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src is of shape [sequence_len, batch_size]
        # trg is of shape [sequence_len, batch_size]
        # if teacher_forcing_ratio is 0.5 we use ground-truth inputs 50% of time and 50% time we use decoder outputs.

        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # to store the outputs of the decoder
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        # context vector, last hidden and cell state of encoder to initialize the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> tokens
        input = trg[0, :]

        for t in range(1, max_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            use_teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            input = (trg[t] if use_teacher_force else top1)

        # outputs is of shape [sequence_len, batch_size, output_dim]
        return outputs

## Exercise 2: Initialization


### 2.1 seq2seq weight initialization
rubric={accuracy:3}

In the tutorial, we used a Normal distribution for our weight initialization.

On the same translation task, compare this initialization against (a) a uniform distribution and (b) all-zero weights. Report the BLEU-4 score of each model trained with these different initializations.

(We can load the French and English pickle files that you saved before to save some time!)

In [None]:
import pickle
#load your pickles
with open("./drive/My Drive/Colab Notebooks/ckpt/TRG.Field", "rb") as f:
    TRG = pickle.load(f)  # pickle has no read(); load() deserializes the saved Field

with open("./drive/My Drive/Colab Notebooks/ckpt/SRC.Field", "rb") as f:
    SRC = pickle.load(f)

In [None]:
#feel free to change by commenting out and replacing parts
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
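
The two alternative initializations could be sketched as follows, mirroring the structure of `init_weights()`. The function names `init_weights_uniform` and `init_weights_zero`, the range `[-0.08, 0.08]`, and the toy `nn.LSTM` modules are assumptions made for illustration; the assignment does not prescribe a particular uniform range.

```python
import torch
import torch.nn as nn

torch.manual_seed(531)

def init_weights_uniform(m, low=-0.08, high=0.08):
    """Hypothetical variant: weights drawn from U(low, high), biases set to 0."""
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.uniform_(param.data, low, high)
        else:
            nn.init.constant_(param.data, 0)

def init_weights_zero(m):
    """Hypothetical variant: every parameter (weights and biases) set to 0."""
    for name, param in m.named_parameters():
        nn.init.constant_(param.data, 0)

# Toy modules standing in for the seq2seq model.
toy_uniform = nn.LSTM(4, 3)
init_weights_uniform(toy_uniform)

toy_zero = nn.LSTM(4, 3)
init_weights_zero(toy_zero)

for name, param in toy_uniform.named_parameters():
    print(name, float(param.min()), float(param.max()))
```

Either function can be passed to `model.apply(...)` in the same way as `init_weights`, provided the seed is reset before each run so that the comparison is fair.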

In [None]:
#training procedure goes here as needed. (can use the tutorial as a guide)
#BE SURE TO USE THE SAME SEED EACH TIME YOU RUN!
manual_seed = 531
torch.manual_seed(manual_seed)


### Fill in
```
Normal Initialization BLEU-4:
Uniform Initialization BLEU-4:
Zero Initialization BLEU-4:
```

## Exercise 3: BLEU in the face

BLEU can be a little counter-intuitive. It's not the only metric used to evaluate translations (others include ROUGE and METEOR), but it's by far the most common. One issue is that it's not really a percentage (even though it's out of 100); it is more useful to think of it as a relative comparison to other translations in the same domain (rather than the % of a sentence translated "correctly"). The goal of this problem is to walk you through evaluating a few different strings to see how differences in the translation affect the final BLEU score, so that you have a more intuitive sense of what the metric means.


#### Sidebar on BLEU
BLEU in some domains might be very low, but still potentially useful. State-of-the-art translation results in, for instance, low-resource spoken language translation might range from single digits to, at most, the low 20s. What's important is that the score is compared to other scores in the same domain (under the same conditions).

Finally, we'll use just a single reference sentence for this example. In some cases you might have access to corpora with multiple references; this generally leads to a higher calculated BLEU score (but may not actually reflect an improvement in translation quality per se).

### Sentence BLEU calculation

For the following sentences follow the steps to compute the Sentence BLEU score by hand.  
```
Candidate Sentence 1: "The the the the the the the the the the the the the the the the" [Length: 16]

Candidate Sentence 2: "The north wind and sun are awakening, which is even stronger." [Length: 11]

Candidate Sentence 3: "Sun had a quarrel" [Length: 4]

Reference Sentence: "The North Wind and the Sun had a quarrel about which of them was the stronger." [Length: 16]
```

#### For all of the below problems fill in the appropriate section with the fraction or decimal that corresponds to the calculated (partial) BLEU score based on the different components that make up BLEU

### 3.1 Unigram Precision
rubric={accuracy:2}

Calculate the unigram precision (i.e., the number of candidate unigrams that appear in the reference sentence, divided by the length of the candidate sentence) for each of these sentences. (Hint: it is easiest to count the words that are NOT in the reference.) LEAVE YOUR ANSWER AS A FRACTION (e.g. 7/13).
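
To check hand calculations, unigram precision can be sketched in a few lines of Python. The helper name `unigram_precision` and the toy sentences are hypothetical, and the tokenization (lowercasing and splitting on whitespace, with no punctuation handling) is an assumption; your own tokenization choices may differ.

```python
def unigram_precision(candidate, reference):
    """Return (matches, total): candidate tokens found anywhere in the reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = set(reference.lower().split())
    matches = sum(1 for tok in cand_tokens if tok in ref_tokens)
    return matches, len(cand_tokens)

# Toy sentences (not the assignment's).
toy_reference = "the cat is on the mat"
full_match = unigram_precision("the cat the cat on the mat", toy_reference)
no_match = unigram_precision("a dog sat", toy_reference)

print(f"{full_match[0]}/{full_match[1]}")
print(f"{no_match[0]}/{no_match[1]}")
```

Note that the first toy candidate scores perfectly even though it repeats words, which is the weakness that clipping addresses.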

```
Candidate 1:
Candidate 2:
Candidate 3:
```

### 3.2 Clipped Unigram Precision
rubric={accuracy:2}

Calculate the unigram precision (as before), but clip the counts so that each unigram is counted at most as many times as it occurs in the reference sentence. LEAVE YOUR ANSWER AS A FRACTION (e.g. 7/13).
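
Clipping can be sketched with `collections.Counter`: each candidate unigram's count is capped at its count in the reference. The helper name `clipped_unigram_precision` and the toy sentences are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Return (clipped_matches, total) for unigrams."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # A token counts at most as often as it appears in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total

# Toy sentences (not the assignment's): "the" occurs twice in the reference.
clipped_result = clipped_unigram_precision(
    "the the the the the the the", "the cat is on the mat"
)
print(f"{clipped_result[0]}/{clipped_result[1]}")
```

Here the repeated word is credited only twice, so the degenerate candidate no longer receives a perfect score.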

```
Candidate 1:
Candidate 2:
Candidate 3:
```

### 3.3 Clipped n-gram Precision
rubric={accuracy:3}

As with the clipped unigram precision above, calculate the clipped 2-, 3-, and 4-gram precisions of our sentences. (HINT: There are [length - (n-1)] n-grams in a sentence.)
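
The same clipping idea extends to higher-order n-grams by counting tuples of consecutive tokens. The helper names `ngrams` and `clipped_ngram_precision` and the toy sentences below are hypothetical, and the whitespace tokenization is an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list; there are len(tokens) - (n - 1) of them."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(candidate, reference, n):
    """Return (clipped_matches, total) for n-grams of order n."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total

# Toy sentences (not the assignment's).
toy_candidate = "the cat sat on the mat"
toy_reference = "the cat is on the mat"
results = {n: clipped_ngram_precision(toy_candidate, toy_reference, n) for n in (2, 3, 4)}
for n, (clipped, total) in results.items():
    print(f"{n}-gram: {clipped}/{total}")
```

For the six-token toy candidate there are 5 bigrams, 4 trigrams, and 3 four-grams, which matches the hint's formula.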

```
Candidate 1:
2gram:
3gram:
4gram:
Candidate 2:
2gram:
3gram:
4gram:
Candidate 3:
2gram:
3gram:
4gram:
```

### 3.4 Brevity Penalty 
rubric={accuracy:2}

BLEU applies a brevity penalty (BP) based on the length of the candidate sentence ($c$) compared to the length of the reference sentence ($r$), as follows: for $c \geq r$, $BP = 1$; for $c < r$, $BP = \exp(1 - r/c)$.

Compute the Brevity Penalty for our candidate sentences.  (Can leave as decimal)
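
The brevity penalty definition translates directly into code. The function name `brevity_penalty` and the toy lengths are hypothetical, and the sketch assumes a non-empty candidate ($c > 0$).

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c >= r, else exp(1 - r/c); assumes c > 0."""
    if c >= r:
        return 1.0
    return math.exp(1 - r / c)

# Toy lengths (not the assignment's).
bp_equal = brevity_penalty(6, 6)    # candidate as long as the reference
bp_short = brevity_penalty(5, 10)   # candidate half as long: exp(1 - 2) = exp(-1)

print(bp_equal)
print(round(bp_short, 4))
```

A candidate longer than the reference is not rewarded, since the penalty is capped at 1.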

```
Candidate 1:
Candidate 2:
Candidate 3:
```

### 3.5 Final BLEU 
rubric={accuracy:3}

The final BLEU score is then calculated as follows:
$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $w_n$ is the weight for each n-gram order (in our case, use 0.25 each) and $p_n$ is the clipped precision for n-grams of order $n$.
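
A minimal sketch of combining the pieces is given here. The function name `bleu_from_parts` and the toy inputs are hypothetical. Since $\log 0$ is undefined, the sketch assumes the unsmoothed convention that the score is 0 whenever any $p_n$ is 0; library implementations may instead apply smoothing.

```python
import math

def bleu_from_parts(bp, precisions, weights=None):
    """BLEU = BP * exp(sum_n w_n * log p_n), with uniform weights by default."""
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)
    # Assumed convention: a zero precision makes the geometric mean (and BLEU) zero.
    if any(p == 0 for p in precisions):
        return 0.0
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_sum)

# Toy values (not the assignment's).
score_half = bleu_from_parts(1.0, [0.5, 0.5, 0.5, 0.5])
score_zero = bleu_from_parts(1.0, [0.8, 0.5, 0.25, 0.0])

print(score_half)
print(score_zero)
```

With equal weights, the exponential term is simply the geometric mean of the four precisions, which is why one missing n-gram order drives the whole score to zero.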

```
Candidate 1:
Candidate 2:
Candidate 3:
```

## Exercise 4: Conceptual Questions

### 4.1 Sequence Length
rubric={reasoning:1}

Seq2Seq models have flexibility in terms of the sequence length they can handle. Explain briefly. (2-3 sentences are enough)

**Response goes here**

### 4.2 Same language?
rubric={reasoning:1}

Seq2seq models aren't limited to translation. Consider the task of simplifying sentences to make them easier to read: you could do this by training a seq2seq model on, say, a parallel corpus of English Wikipedia articles aligned to Simple English Wikipedia articles. List two other applications of a seq2seq model that take input in the same language as the output. Make sure you explain each application in 1-2 sentences. (Assume you could make or find the appropriate aligned corpora to make this feasible.)

**Response goes here**

### 4.3 Bidirectional?
rubric={reasoning:1}

Would bidirectional LSTMs/RNNs work to build an Encoder/Decoder model? Why/Why not?

**Response goes here**

### 4.4 Weights
rubric={reasoning:3}

There are several different ways to set the weights of different layers in a NN. We've seen uniform and normal distributions, as well as setting the weights to some constant value. [Glorot & Bengio (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?source=post_page---------------------------) investigate how the initialization of these layers can greatly impact the performance of deep neural networks. Take a few minutes to *SKIM* the abstract and then *SKIM* the bullet points in the conclusion, and write a few sentences summarizing takeaways that you can use in practice.


Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256).

**Response goes here**