# TP3: using pre-trained embeddings and tuning a NN

In this practical session, we will tune a neural network model, that is, search for good values of its hyper-parameters. We will also explore changes in the architecture.

The dataset comes from IMDb (Internet Movie Database). The task is binary genre classification (drama vs. comedy) from movie descriptions. We only use the original training data, which we split into a training set, a validation set and a test set.

Our base model will be an FFNN or an RNN with **continuous representations initialized with pre-trained word embeddings**. The embeddings used are GloVe with 50 dimensions, built using the Wikipedia 2014 + Gigaword 5th Edition corpora (6B tokens, 400K vocab). **All tokens are in lowercase**.

Upload the files: 
* train_....txt
* glove.6B.50d.txt.gz (it can take several minutes).

You are required to:
- read carefully all the instructions
- add code when asked
- answer the questions and add comments in the form of a short report about the experiments

There are 15 questions in total, with the ids Q1, ..., Q15.

You can submit either a single notebook containing both the code and the report, or the code in a notebook and the report in a separate file. Put your files on Moodle (Controle continu - Groupe 1 ou 2). The filename must contain your name.

Due date: 
- Group 2: 11/03 

In [None]:
import numpy as np
import random, io
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
from sklearn.metrics import classification_report

torch.manual_seed(0) # For reproducibility: https://pytorch.org/docs/stable/notes/randomness.html

In [None]:
# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print(device)

# PART 1: read the data


## 1.1 Dataset

The code below reads and loads the dataset for genre classification.

In [None]:
train_path = "train_drama-comedy_group2.txt"

label_mapping = {'drama':0, 'comedy':1}
labels = set()
data_iter = []

# Use a context manager so that the file is closed once it has been read
with io.open(train_path, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
  for line in fin:
    # Each line has the form: id ::: title ::: label ::: text
    _id, title, label, text = line.split(':::')
    label = label.strip()
    labels.add(label)
    # lower case, because GloVe contains lower cased words
    data_iter.append( (label, text.lower().strip()) )

print("Labels:", labels)
print("Total number of examples:",len(data_iter))

# List of examples (label, text)
train = data_iter[:20709]
dev = data_iter[20709:23709] 
test =  data_iter[23709:] 

print("Train:", len(train), "Valid:", len(dev), "Test:", len(test))


Now, we need to tokenize our data, and build the corresponding vocabulary (on the train set).

#### Q1: **Print the size of the vocabulary.**

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# splits the string sentence by space.
tokenizer = get_tokenizer( None ) 

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# Build vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: label_mapping[x]

## Print the size of the vocabulary
# ..

## 1.2 Loading the word embeddings

We will use the GloVe pre-trained word embeddings, with 50 dimensions, as contained in the file glove.6B.50d.txt.gz.

#### Q2: **Write a function that loads the vectors** 
This function builds a dictionary mapping each word to its vector, as defined in the GloVe file.


In [None]:
import io, gzip

## Write a function that loads the vectors
# and builds a dictionary mapping a token to its vector
# ...
#def load_vectors(fname):


embed_file='glove.6B.50d.txt.gz'
#vectors = load_vectors( embed_file )

#### Q3: **Print the following information about the embeddings:** 

* print the vector for the token 'the'
* print the vocabulary of the GloVe embeddings
* print the size/dimension of the embeddings
* compare the vocabulary of the embeddings with the vocabulary built on training set: 
  * How many words in your data do not appear in the embeddings vocabulary?
  * Do you think this could be an issue? Why?
  * Why do we have all these unknown words?

In [None]:
# Print information about the embeddings
# ...

#### Q4: **Propose a solution to reduce the number of unknown words.**



## 1.3 Building batches (code given)

The function below can be used to build batches of examples based on offsets, as used in e.g. EmbeddingBag.
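
To make the role of the offsets concrete, here is a minimal sketch on toy data. The token indices, the vocabulary size of 10 and the embedding dimension of 4 are arbitrary values chosen for illustration, not values from the dataset. Two documents are concatenated into a single 1-D tensor, and the offsets give the start position of each document, in the same way as the cumulative sum of lengths computed in *collate_batch*.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two toy documents stored back to back in one 1-D tensor of token indices
text = torch.tensor([1, 2, 3, 4, 5])   # doc A = [1, 2, 3], doc B = [4, 5]
offsets = torch.tensor([0, 3])         # start position of each document

# Arbitrary sizes for illustration; the default mode of EmbeddingBag is "mean"
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")
doc_vectors = bag(text, offsets)       # one row per document
print(doc_vectors.shape)               # torch.Size([2, 4])

# Each row is the mean of the embeddings of the tokens of that document
manual_a = bag.weight[text[:3]].mean(dim=0)
manual_b = bag.weight[text[3:]].mean(dim=0)
print(torch.allclose(doc_vectors[0], manual_a))
print(torch.allclose(doc_vectors[1], manual_b))
```

Documents of different lengths can thus be batched without padding, since each document is reduced to a single fixed-size vector.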

In [None]:
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label.to(device), text_list.to(device), offsets.to(device)

# PART 2: Feed Forward Neural Network with pretrained embeddings

Now we need to define our learning model. 

First, we're going to build a matrix containing the embedding vectors of the words in our vocabulary (i.e. the words present in the training set). Then, we will use this matrix to initialize our embedding layer.

## 2.1 Build the weights matrix

We want to build the *weights matrix* that will be used to initialize our embedding layer. 
In this matrix, we associate each word in our (training) data with the vector retrieved from the GloVe file.

#### Q5: **Build a matrix associating each word to its vector:** 
  * For each word in the dataset’s vocabulary, we check if it is in GloVe’s vocabulary:
    * if yes: load its pre-trained word vector. 
    * else: initialize a random vector.
    
At the end, print the matrix size. 

In [None]:
emb_dim = 50
matrix_len = len(vocab)
weights_matrix = np.zeros((matrix_len, emb_dim))

# Build the weights matrix
# ...

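# At the end, convert the weight matrix to a float32 tensor
# (a NumPy array has no .to() method, so we go through torch.as_tensor)
weights_matrix = torch.as_tensor(weights_matrix, dtype=torch.float32)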

## 2.2 Model definition

Now we can define our model with an embedding layer that takes pretrained embeddings.

The code is very similar to what we had previously, except that we need to initialize the embedding layer using the weight matrix.

#### Q6: **Define the embedding layer using the weight matrix.**

You need to build embedding bags using the pretrained embeddings. Keep the embeddings trainable (i.e. not frozen). Look at the documentation for *nn.EmbeddingBag*: https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

#### Q7: **Add the other parts of the code:** 
  * *__init__()* function: add the first linear layer, the activation function, and the linear output layer (note that 'input_dim' has been replaced with 'embed_dim': the input of our model is embedding bags).
  * *forward()* function: pass the input through the embedding layer, the hidden layer (i.e. linear function + activation function) and the output layer.

In [None]:
class FeedforwardNeuralNetModel2(nn.Module):
    def __init__(self, embed_dim, hidden_dim, output_dim, weights_matrix):
        super(FeedforwardNeuralNetModel2, self).__init__()
        ## Create the embedding layer from the weight matrix
        # ...

        # Linear function
        # ...

        # Non-linearity
        # ...

        # Linear function (readout)
        # ... 

    def forward(self, text, offsets):
        # Embedding layer
        # ...

        # Linear function  
        # ...

        # Non-linearity  
        # ...

        # Linear function (readout) 
        # ...
        
        return out

## 2.3 Train and evaluate (code given)

The functions below can be used to train and evaluate your model (they are designed to work when dealing with batches and EmbeddingBag, i.e. they use the 'offsets' associated with the batches in the input).

In [None]:
def train_woffset( model, train_loader, optimizer, num_epochs=5 ):
    for epoch in range(num_epochs):
        train_loss, total_acc, total_count = 0, 0, 0
        for label, input, offsets in train_loader:
            input = input.to(device)
            label = label.to(device)
            # Step 1. Clearing the accumulated gradients
            optimizer.zero_grad()
            # Step 2. Forward pass to get output/logits
            outputs = model( input, offsets )
            # Step 3. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()
            # - Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, label)
            # - Getting gradients w.r.t. parameters
            loss.backward()
            # - Updating parameters
            optimizer.step()
            # Accumulating the loss over time
            train_loss += loss.item()
            total_acc += (outputs.argmax(1) == label).sum().item()
            total_count += label.size(0)
        # Compute accuracy on train set at each epoch
        print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/total_count, total_acc/total_count))
        total_acc, total_count = 0, 0
        train_loss = 0

def evaluate_woffset( model, dev_loader ):
    predictions = []
    gold = []
    with torch.no_grad():
        for label, input, offsets in dev_loader:
            input = input.to(device)
            label = label.to(device)
            probs = model(input, offsets)
            predictions.extend( torch.argmax(probs, dim=1).cpu().numpy() ) # <-----
            gold.extend([int(l) for l in label])
    print(classification_report(gold, predictions))
    return gold, predictions

#### Q8: **Run an experiment**

Now we are ready to run a first experiment using our model.

▶▶ **Run a first experiment with:**
* Learning rate: 0.01
* Batch size: 64
* Hidden dimension: 16
* Epochs: 5
* Optimizer: SGD

Note that:
* Loss is still the Cross Entropy Loss
* Evaluation is done on the development set

In [None]:
# Run a first experiment with the setting described above

# Hyperparameters
# ...
EPOCHS = 5 # epoch
LR = 0.01  # learning rate
BATCH_SIZE = 64 # batch size for training
hidden_dim = 16 # size of the hidden layer

output_dim = 2
emb_dim = 50

# Load the data
train_loader = DataLoader(train, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
dev_loader = DataLoader(dev, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

# Initialize the model
# ...

# Define the loss and optimization method to be used
# ...

# Train the model
# ...

# Evaluate on dev
# ...

# PART 3: Tune the model 

The model comes with a variety of hyper-parameters. To find the best model, we need to test different values for these free parameters.




#### Q9: **Test different values of the hyper-parameters:**

We want to find the best performance that the model defined above can reach, without modifying the input. Use the development set to tune your model, that is, compare the performance when modifying the following elements. At each step, you can keep the best value found at the previous step (or test a few combinations).

* a) Optimizer: try at least Adam, Adagrad and RMSProp (in addition to SGD)
* b) Batch size: test a few values, e.g. 1 (= no batch), 256, 2048 (in addition to 64, already tested above)
* c) Max number of epochs: test with 20, 50, 100 epochs. Do you think 100 epochs are required? Can you decide on a 'reasonable' (trade-off between speed and performance) number of epochs?
* d) Learning rate: test a few values e.g. 0.00001, 1 (in addition to 0.01). What do you observe?
* e) Size of the hidden layer: test a few values, e.g. 4, 128, 1024 (in addition to 16)

 
Propose a final best set of parameters, based on your experiments, and present final results on the test set.

Try to write a small report using tables and/or plots that show how performance changes with respect to the different hyper-parameters.
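
The step-by-step procedure described above (tune one hyper-parameter at a time and keep the best value found so far) can be sketched as follows. This is only one possible way to organise the experiments. The names `greedy_search` and `score_fn` are hypothetical, and `toy_score` is a stand-in that does not train anything: in practice, `score_fn` would train the model with the given configuration and return a score measured on the development set.

```python
def greedy_search(base_config, search_space, score_fn):
    """Tune one hyper-parameter at a time, keeping the best value found so far."""
    best = dict(base_config)
    best_score = score_fn(best)
    history = [(dict(best), best_score)]  # every tested (config, score) pair
    for name, values in search_space.items():
        for value in values:
            if value == best[name]:
                continue  # this configuration has already been scored
            candidate = dict(best, **{name: value})
            score = score_fn(candidate)
            history.append((candidate, score))
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score, history


def toy_score(cfg):
    # Stand-in for "train with cfg and return the dev score": NOT real results
    return -abs(cfg["hidden_dim"] - 128) - 1000 * abs(cfg["lr"] - 0.01)


base = {"lr": 0.01, "batch_size": 64, "hidden_dim": 16}
space = {"lr": [0.00001, 1], "hidden_dim": [4, 128, 1024]}
best, best_score, history = greedy_search(base, space, toy_score)
print(best)
for cfg, score in history:
    print(cfg, score)
```

The `history` list keeps every tested configuration with its score, which is a convenient starting point for the tables or plots of the report.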

#### Q10: **Modify the architecture of the FFNN:**

Now modify your model definition to test with:
* a) A different activation function: test at least tanh and ReLU (in addition to sigmoid)
* b) One additional hidden layer

Do not tune the hyper-parameters again: report the results obtained with the best set found previously, and with any value for the second hidden dimension.

# PART 4: RNNs

Now we want to test an RNN architecture to perform our classification task. In the following exercises, we **will not use the pre-trained word embeddings** but randomly initialized continuous vectors with 300 dimensions.

Note that here, the embedding layer transforms our words into continuous vectors that are the inputs of our RNN. The RNN builds the document representation and thus replaces the 'embedding bag'.

## 4.1 Using Batches (code given)

When using RNNs, we can't use the offset trick to build batches: the problem is that all the documents in a batch need to have the same length, because the size of the input defines the size of the network (i.e. each xi is associated with a state si). 

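The solution is called **padding**: we add a padding index (here 0) at the end of the sequences that are shorter than the longest sequence in the batch. Note that, with the vocabulary built above, index 0 is also the index of `<unk>`.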

The easiest way to do so is to pad the sequences using *torch.nn.utils.rnn.pad_sequence*, as done below within the *collate_batch_pad* function. This function returns a tensor of padded sequences that can be used directly as input to our model.

https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html
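
As a small illustration of padding, the sketch below pads three toy sequences of token indices (arbitrary values chosen for the example). With the default `batch_first=False`, the result has shape (max length, batch size), so each sequence is stored in a column.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sequences of different lengths (arbitrary token indices)
seqs = [torch.tensor([5, 6, 7, 8]), torch.tensor([9]), torch.tensor([10, 11])]

padded = pad_sequence(seqs, padding_value=0)
print(padded.shape)   # torch.Size([4, 3]): (max length, batch size)
print(padded.t())     # transposed for readability: one row per sequence
```

The second sequence, of length 1, becomes `[9, 0, 0, 0]` after padding.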

In [None]:
from torch.nn.utils.rnn import pad_sequence

def collate_batch_pad(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
    label = torch.tensor(label_list, dtype=torch.int64)
    # Instead of concatenating, we use padding
    text_list = pad_sequence(text_list, padding_value=0) # <-------
    return label.to(device), text_list.to(device)

## 4.2 LSTM

First, we'll test an LSTM. In PyTorch, defining an LSTM is done via the addition of an LSTM layer in the architecture.




#### Q11: **Add an embedding layer.**

Note that here we're not using *embedding bags*, but just an embedding layer that maps our words to vectors. The representation of the document is then obtained via the RNN.

See: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html


#### Q12: **Add the LSTM layer.** 

An LSTM layer will transform our input into a vector representation of size hidden_dim.
Add the LSTM layer and the output layer (we do not add an additional hidden layer here).

See: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html





#### Forward (code given)

Note that in the forward pass, we need to reshape the data using:
```
x = embeds.view(len(text), self.batch_size, -1)
```

We need to reshape our input data before passing it to the LSTM layer, because it expects a 3D tensor of shape (sequence length, batch size, input size). This is done with the 'view' method, which reshapes a tensor in PyTorch.

Read the code carefully to be sure you understand how we build the representation using the LSTM, that is, using its last hidden state.

```
out, (ht, ct) = self.lstm( x )
y = self.fc2(ht[-1])
```
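
To see what these tensors look like, here is a minimal sketch on random input. The sizes below are arbitrary values chosen for the illustration, not the ones used in the experiments. For a single-layer, unidirectional LSTM, the last hidden state `ht[-1]` is the same as the output at the last time step `out[-1]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary toy sizes, for illustration only
seq_len, batch_size, emb_dim, hidden_dim = 7, 3, 5, 4

x = torch.randn(seq_len, batch_size, emb_dim)   # (sequence length, batch size, input size)
lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim)

out, (ht, ct) = lstm(x)
print(out.shape)   # torch.Size([7, 3, 4]): one hidden state per time step
print(ht.shape)    # torch.Size([1, 3, 4]): last hidden state of the single layer

# The document representation used for classification: one vector per example
doc_repr = ht[-1]
print(doc_repr.shape)                     # torch.Size([3, 4])
print(torch.allclose(out[-1], doc_repr))
```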


In [None]:
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, batch_size):
        super(LSTMModel, self).__init__()

        self.batch_size = batch_size

        # Embedding layer
        # ...

        # LSTM layer
        # ...

        # Linear fct
        # ...

        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text):
        embeds = self.embedding(text)
        x = embeds.view(len(text), self.batch_size, -1)
        out, (ht, ct) = self.lstm( x )
        y = self.fc2(ht[-1])
        return y


# These functions are modified versions of the previous ones that ignore offsets
def train_lstm( model, train_loader, optimizer, num_epochs=5 ):
    for epoch in range(num_epochs):
        train_loss, total_acc, total_count = 0, 0, 0
        for label, input in train_loader:
            input = input.to(device)
            label = label.to(device)
            # Step 1. Clearing the accumulated gradients
            optimizer.zero_grad()
            # Step 2. Forward pass to get output/logits
            outputs = model( input )
            # Step 3. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()
            # - Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, label)
            # - Getting gradients w.r.t. parameters
            loss.backward()
            # - Updating parameters
            optimizer.step()
            # Accumulating the loss over time
            train_loss += loss.item()
            total_acc += (outputs.argmax(1) == label).sum().item()
            total_count += label.size(0)
        # Compute accuracy on train set at each epoch
        print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/total_count, total_acc/total_count))
        total_acc, total_count = 0, 0
        train_loss = 0

def evaluate( model, dev_loader ):
    predictions = []
    gold = []
    with torch.no_grad():
        for label, input in dev_loader:
            input = input.to(device)
            label = label.to(device)
            probs = model(input)
            predictions.extend( torch.argmax(probs, dim=1).cpu().numpy() ) # <-----
            gold.extend([int(l) for l in label])
    print(classification_report(gold, predictions))
    return gold, predictions

#### Q13: **Run an experiment using the following hyper-parameters**

We can now test our LSTM on our dataset. 

* epochs = 5 
* learning rate =  0.001
* size of the hidden layer = 32
* batch size = 128 
* embedding dimension = 300
* optimizer: Adam

In [None]:
# Hyperparameters
EPOCHS = 5 # epoch
LR = 0.001  # learning rate
hidden_dim = 32 # size of the hidden layer
batch_size = 128 
emb_dim = 300

output_dim = 2
vocab_size = len(vocab)

# Load data
train_loader = DataLoader(train, shuffle=False, batch_size=batch_size, 
                          collate_fn=collate_batch_pad, drop_last=True)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_size, 
                        collate_fn=collate_batch_pad, drop_last=True)

# Initialize the model
# ...


# Train the model
# ...

# Evaluate on dev
# ...

## 4.3 GRU and bi-GRU

We now want to try another RNN architecture called GRU: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html



#### Q14: **Modify your code to implement a GRU instead of an LSTM, and run an experiment (same setting).** 

Here you also need to modify the forward function.

In [None]:
class GRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, batch_size):
        super(GRUModel, self).__init__()

        # Needed by forward(), which reshapes the input with self.batch_size
        self.batch_size = batch_size

        # Embedding layer
        # ...

        # GRU layer
        # self.gru = 

        # Linear fct
        # ...  

    def forward(self, text):
        embeds = self.embedding(text)
        x = embeds.view(len(text), self.batch_size, -1)
        # Compute y
        # ...

        return y

In [None]:
# Hyperparameters
EPOCHS = 5 # epoch
LR = 0.001  # learning rate
hidden_dim = 32 # size of the hidden layer
batch_size = 128 
emb_dim = 300

output_dim = 2
vocab_size = len(vocab)

# Load data
train_loader = DataLoader(train, shuffle=False, batch_size=batch_size, 
                          collate_fn=collate_batch_pad, drop_last=True)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_size, 
                        collate_fn=collate_batch_pad, drop_last=True)

# Initialize the model
# ...

# Train the model
# ...

# Evaluate on dev
# ...

#### Q15: **Modify your code to implement a bi-GRU, i.e. a bi-directional GRU and run an experiment (same setting).** 

Which architecture gave the best results? Looking at the performance on the training and development sets, what can you conclude?

Hint: be careful, what is the size of the output of a bi-RNN?

In [None]:
class BiGRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, batch_size):
        super(BiGRUModel, self).__init__()

        self.batch_size = batch_size
        
        # Embedding layer
        # ...

        # biGRU layer
        # ...

        # linear layer
        # ...

    def forward(self, text):
        embeds = self.embedding(text)
        x = embeds.view(len(text), self.batch_size, -1)
        # Compute y
        # ...
        
        return y

In [None]:
# Hyperparameters
EPOCHS = 5 # epoch
LR = 0.001  # learning rate
hidden_dim = 32 # size of the hidden layer
batch_size = 128
emb_dim = 300

output_dim = 2
vocab_size = len(vocab)

# Load data
train_loader = DataLoader(train, shuffle=False, batch_size=batch_size, 
                          collate_fn=collate_batch_pad, drop_last=True)
dev_loader = DataLoader(dev, shuffle=False, batch_size=batch_size, 
                        collate_fn=collate_batch_pad, drop_last=True)

# Initialize the model
# ...

# Train the model
# ...

# Evaluate on dev
# ...