## Homework 2 - Machine Translation - MDS Computational Linguistics

### Assignment Topics
- seq2seq Models
- Grid Search vs. Random Search for Hyperparameter Tuning


### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: March 7, 2020, 18:00:00 (Vancouver time)



In [None]:
import unicodedata
import string
import re
import random
import time
import datetime
import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torchtext
from torchtext.datasets import TranslationDataset

import spacy
import numpy as np

In [None]:
manual_seed = 77
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

## Exercise 1: Seq2Seq Review


### 1.1 Warm Up
rubric={reasoning:2}

As a quick warm-up, take a minute to review the code from the tutorial and identify the hyperparameters related to the three sections of code: Encoder, Decoder, and Seq2Seq. You can simply copy and paste these hyperparameters into the box below.

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dropout = dropout
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, enc_hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.lstm(embedded)
       
        # outputs are always from the top hidden layer, if bidirectional outputs are concatenated.
        # outputs shape [sequence_length, batch_size, hidden_dim * num_directions]
        # hidden is of shape [num_layers * num_directions, batch_size, hidden_size]
        # cell is of shape [num_layers * num_directions, batch_size, hidden_size]
        
        return hidden, cell

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, dec_hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.output_dim = output_dim
        self.dec_hid_dim = dec_hid_dim
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, dec_hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(dec_hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
             
        # input is of shape [batch_size]
        # hidden is of shape [n_layer * num_directions, batch_size, hidden_size]
        # cell is of shape [n_layer * num_directions, batch_size, hidden_size]
        
        input = input.unsqueeze(0)
        
        # input shape is now [1, batch_size]: a sequence of length 1 for each of the batch_size examples.
        # the extra dimension is needed because, after embedding, the LSTM expects a rank-3 tensor
        # of shape [seq_len, batch_size, emb_dim].
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]    
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        
        # output shape is [sequence_len, batch_size, hidden_dim * num_directions]
        # hidden shape is [num_layers * num_directions, batch_size, hidden_dim]
        # cell shape is [num_layers * num_directions, batch_size, hidden_dim]

        # sequence_len and num_directions will always be 1 in the decoder.
        # output shape is [1, batch_size, hidden_dim]
        # hidden shape is [num_layers, batch_size, hidden_dim]
        # cell shape is [num_layers, batch_size, hidden_dim]
        
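        # use the top-layer output rather than hidden: with n_layers > 1, hidden.squeeze(0)
        # would keep the layer dimension. output.squeeze(0) is always [batch_size, hidden_dim].
        prediction = self.fc_out(output.squeeze(0))
        # prediction shape is [batch_size, output_dim]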
        
        return prediction, hidden, cell
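
The shape comments above can be checked on a standalone `nn.LSTM`. The following is a minimal sketch, not part of the tutorial code; the sizes (`emb_dim = 8`, `hid_dim = 16`, `n_layers = 2`, `batch_size = 4`) are arbitrary assumptions chosen for illustration. It shows that, for a single decoding step, `output` has shape `[1, batch_size, hid_dim]` regardless of `n_layers`, while `hidden` keeps one slice per layer, and that the top-layer slice of `hidden` matches `output`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
emb_dim, hid_dim, n_layers, batch_size = 8, 16, 2, 4

lstm = nn.LSTM(emb_dim, hid_dim, n_layers)

# One decoding step: the sequence length is 1.
step_input = torch.randn(1, batch_size, emb_dim)
h0 = torch.zeros(n_layers, batch_size, hid_dim)
c0 = torch.zeros(n_layers, batch_size, hid_dim)

output, (hidden, cell) = lstm(step_input, (h0, c0))

print(output.shape)  # [seq_len=1, batch_size, hid_dim]
print(hidden.shape)  # [n_layers, batch_size, hid_dim]
print(cell.shape)    # [n_layers, batch_size, hid_dim]

# The top layer's final hidden state equals the single output step.
same_top_layer = torch.allclose(output[0], hidden[-1])
print(same_top_layer)
```

With `n_layers = 1`, `hidden.squeeze(0)` and `output.squeeze(0)` coincide. With more layers, `hidden.squeeze(0)` keeps the layer dimension, so only `output.squeeze(0)` has shape `[batch_size, hid_dim]`.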

In [None]:
class Seq2Seq(nn.Module):
    ''' This class contains the implementation of a complete sequence-to-sequence network.
    It uses the encoder to produce the context vectors.
    It uses the decoder to produce the predicted target sentence.
    Args:
        encoder: An Encoder class instance.
        decoder: A Decoder class instance.
        device: The torch.device on which the output tensor is allocated.
    '''
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src is of shape [sequence_len, batch_size]
        # trg is of shape [sequence_len, batch_size]
        # if teacher_forcing_ratio is 0.5, each step feeds the ground-truth token with probability 0.5
        # and the decoder's own prediction otherwise.

        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # to store the outputs of the decoder
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        # context vector, last hidden and cell state of encoder to initialize the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> tokens
        input = trg[0, :]

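        for t in range(1, max_len):
            # one decoding step: predict token t from the previous token and state
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            # decide whether to feed the ground-truth token or the model's own prediction
            use_teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)  # most likely token index for each example
            input = trg[t] if use_teacher_force else top1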

        # outputs is of shape [sequence_len, batch_size, output_dim]
        return outputs

## Exercise 2: Initialization


### 2.1 seq2seq weight initialization
rubric={accuracy:7}

In the tutorial, we used a Normal distribution for our weight initialization.

On the same translation task, compare this initialization with (a) a uniform distribution and (b) initializing all weights to zero. Report the BLEU-4 score of each model trained with these different initializations.

(To save some time, you can load the French and English pickle files that you saved before.)
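
For reference, BLEU-4 combines the modified n-gram precisions for n = 1 to 4 through a geometric mean and multiplies the result by a brevity penalty. The following is a minimal, unsmoothed, single-reference sketch in plain Python. The function names `ngram_counts` and `corpus_bleu4` are hypothetical, and the tutorial's own evaluation code (or an established library) may tokenize, smooth, or aggregate differently, so scores need not match exactly.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references):
    """Corpus-level BLEU-4 with one reference per hypothesis and no smoothing.

    hypotheses, references: lists of token lists of equal length.
    Returns a score in [0, 1].
    """
    matches = [0] * 4  # clipped n-gram matches for n = 1..4
    totals = [0] * 4   # hypothesis n-gram counts for n = 1..4
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, 5):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
perfect = corpus_bleu4([reference], [reference])
shorter = corpus_bleu4(["the cat sat on the".split()], [reference])
print(perfect, shorter)
```

In this toy case the shorter hypothesis has perfect n-gram precision, so its score is lowered only by the brevity penalty.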

In [None]:
import pickle

# load your pickles
with open("./drive/My Drive/Colab Notebooks/ckpt/TRG.Field", "rb") as f:
    TRG = pickle.load(f)

with open("./drive/My Drive/Colab Notebooks/ckpt/SRC.Field", "rb") as f:
    SRC = pickle.load(f)

In [None]:
#feel free to change by commenting out and replacing parts
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
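
The two alternative schemes asked for above can follow the same pattern as `init_weights`. The following is a minimal sketch; the function names `init_weights_uniform` and `init_weights_zero` and the range of ±0.08 are assumptions for illustration, not values taken from the tutorial.

```python
import torch
import torch.nn as nn


def init_weights_uniform(m, bound=0.08):
    """Draw weights from U(-bound, bound); set all other parameters (biases) to zero."""
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.uniform_(param.data, -bound, bound)
        else:
            nn.init.constant_(param.data, 0)


def init_weights_zero(m):
    """Set every parameter, weights and biases alike, to zero."""
    for param in m.parameters():
        nn.init.constant_(param.data, 0)


# Quick check on a small stand-in module (not the assignment's model).
toy = nn.LSTM(input_size=4, hidden_size=3)

toy.apply(init_weights_uniform)
max_abs_weight = max(
    p.abs().max().item() for n, p in toy.named_parameters() if 'weight' in n
)

toy.apply(init_weights_zero)
total_abs = sum(p.abs().sum().item() for p in toy.parameters())

print(max_abs_weight, total_abs)
```

Either function can be applied with `model.apply(...)`, where `model` stands for your `Seq2Seq` instance.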

In [None]:
#training procedure goes here as needed. (can use the tutorial as a guide)
#BE SURE TO USE THE SAME SEED EACH TIME YOU RUN!
manual_seed = 77
torch.manual_seed(manual_seed)
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

## Exercise 3: Hyperparameters

### 3.1 Hyperparameter exploration
rubric={accuracy:7}

Since you should already have built a network for 2.1, test that network using the following hyperparameter configurations and report BLEU-4 for each configuration.
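
Grid search and random search, the two strategies named in the assignment topics, differ only in how configurations are drawn from the search space. The following is a minimal sketch in plain Python; the search space values and the helper names `grid_search` and `random_search` are hypothetical placeholders, not the configurations you are asked to evaluate.

```python
import itertools
import random

# Hypothetical search space, for illustration only.
search_space = {
    "emb_dim": [128, 256],
    "hid_dim": [256, 512],
    "n_layers": [1, 2],
    "dropout": [0.2, 0.5],
}


def grid_search(space):
    """Enumerate every combination of values in the search space."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def random_search(space, n_trials, seed=77):
    """Sample n_trials configurations, choosing each value independently."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]


grid_configs = grid_search(search_space)
random_configs = random_search(search_space, n_trials=5)

print(len(grid_configs))    # 16 combinations (2 * 2 * 2 * 2)
print(len(random_configs))  # 5 sampled configurations
```

Each configuration would then be used to build, train, and score a model with the same seed, so that differences in BLEU-4 reflect the hyperparameters rather than the random initialization. Note that this simple random sampler can draw the same configuration more than once.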

## Exercise 4: Conceptual Questions

### 4.1 Sequence Length
rubric={reasoning:1}

Seq2Seq models have flexibility in terms of the sequence length they can handle. Explain briefly. (2-3 sentences are enough)

**Response goes here**

### 4.2 Same language?
rubric={reasoning:1}

Seq2seq models aren't limited to translation. Consider the task of simplifying sentences to make them easier to read: you could do this by training a seq2seq model on, say, a parallel corpus of English Wikipedia articles aligned to Simple English Wikipedia articles. List two other applications of a seq2seq model that take input in the same language as the output, and explain each application in 1-2 sentences. (Assume you could make or find the appropriate aligned corpora to make this feasible.)

**Response goes here**

### 4.3 Bidirectional?
rubric={reasoning:1}

Would bidirectional LSTMs/RNNs work to build an Encoder/Decoder model? Why/Why not?

**Response goes here**

### 4.4 Weights
rubric={reasoning:3}

There are several ways to set the weights of the layers in a neural network. We've seen uniform and normal distributions, as well as setting the weights to a constant value. [Glorot & Bengio (2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?source=post_page---------------------------) investigate how the initialization of these layers can greatly impact the performance of deep neural networks. Take a few minutes to *SKIM* the abstract and the bullet points in the conclusion, then write a few sentences summarizing the takeaways that you can use in practice.


Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256).

**Response goes here**