# Assignment 2

This notebook is intended to produce the plots and figures for the report on Problem 1 of the practical. You should not run this notebook in Google Colab until you have finished constructing the correct solutions for transformer_solution.py and encoder_decoder_solution.py

This notebook provides some limited commentary on several HuggingFace Features and toolage. You will use HuggingFace Datasets to load the Amazon Polarity dataset for sentiment analysis. The notebook will define a Bert tokenizer, collate functions, and then train and evaluate several models using the HuggingFace utilities mentioned above. Remember, the most crucial part here is running the experiments for the report.

In [1]:
#@title Mount your Google Drive
# If you run this notebook locally or on a cluster (i.e. not on Google Colab)
# you can delete this cell which is specific to Google Colab. You may also
# change the paths for data/logs in Arguments below.
% matplotlib inline
% load_ext autoreload
% autoreload 2

# from google.colab import drive
# drive.mount('/content/gdrive')

UsageError: Line magic function `%` not found.


In [None]:
#@title Link your assignment folder & install requirements
#@markdown Enter the path to the assignment folder in your Google Drive
# If you run this notebook locally or on a cluster (i.e. not on Google Colab)
# you can delete this cell which is specific to Google Colab. 
import sys
import os
import shutil
import warnings

folder = ""  #@param {type:"string"}
!ln -Ts "$folder" / content / assignment 2 > / dev / null

# Add the assignment folder to Python path
# if '/content/assignment' not in sys.path:
#   sys.path.insert(0, '/content/assignment')

# Check if CUDA is available
import torch

if not torch.cuda.is_available():
    warnings.warn('CUDA is not available.')

### Running on GPU
For this assignment, it will be necessary to run your experiments on GPU. To make sure the notebook is running on GPU, you can change the notebook settings with
* (EN) `Edit > Notebook Settings`
* (FR) `Modifier > Paramètres du notebook`


In [None]:
% matplotlib inline

import os
import numpy
import numpy as np
import matplotlib.pyplot as plt
import urllib.request
from sklearn.metrics import f1_score
import time
import json

from typing import List, Dict, Union, Optional, Tuple
import torch

from dataclasses import dataclass
from torch.utils.data import DataLoader
import torch.optim as optim
from tqdm.auto import tqdm

# !pip install -qqq datasets transformers --upgrade
from datasets import Dataset
import transformers

from datasets import load_dataset
from tokenizers import Tokenizer

from transformer_solution import Transformer
from encoder_decoder_solution import EncoderDecoder
from transformers import AutoTokenizer

from transformers import AutoModel
from encoder_decoder_solution import EncoderDecoder
from transformer_solution import Transformer
import torch.nn as nn

torch.random.manual_seed(0)

In [None]:
dataset_train = load_dataset("amazon_polarity", split="train[:10000]", cache_dir="assignment/data")
dataset_test = load_dataset("amazon_polarity", split="test[:1000]", cache_dir="assignment/data")

In [10]:
#@title 🔍 Quick look at the data { run: "auto" }
#@markdown Lets have quick look at a few samples in our test set.
n_samples_to_see = 3  #@param {type: "integer"}
for i in range(n_samples_to_see):
    print("-" * 30)
    print("title:", dataset_test[i]["title"])
    print("content:", dataset_test[i]["content"])
    print("label:", dataset_test[i]["label"])

------------------------------
title: Great CD
content: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"
label: 1
------------------------------
title: One of the best game music soundtracks - for a game I didn't really play
content: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. 

### 1️⃣ Tokenize the `text`
Tokenize the `text`portion of each sample (i.e. parsing the text to smaller chunks). Tokenization can happen in many ways; traditionally, this was done based on the white spaces. With transformer-based models, tokenization is performed based on the frequency of occurrence of "chunk of text". This frequency can be learned in many different ways. However the most common one is the [**wordpiece**](https://arxiv.org/pdf/1609.08144v2.pdf) model. 
> The wordpiece model is generated using a data-driven approach to maximize the language-model likelihood
of the training data, given an evolving word definition. Given a training corpus and a number of desired
tokens $D$, the optimization problem is to select $D$ wordpieces such that the resulting corpus is minimal in the
number of wordpieces when segmented according to the chosen wordpiece model.

Under this model:
1. Not all things can be converted to tokens depending on the model. For example, most models have been pretrained without any knowledge of emojis. So their token will be `[UNK]`, which stands for unknown.
2. Some words will be mapped to multiple tokens!
3. Depending on the kind of model, your tokens may or may not respect capitalization

In [11]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [12]:
#@title 🔍 Quick look at tokenization { run: "auto", vertical-output: true }
input_sample = "Welcome to IFT6135. We now teach you 🤗(HUGGING FACE) Library :DDD."  #@param {type: "string"}
tokenizer.tokenize(input_sample)

['welcome',
 'to',
 'if',
 '##t',
 '##6',
 '##13',
 '##5',
 '.',
 'we',
 'now',
 'teach',
 'you',
 '[UNK]',
 '(',
 'hugging',
 'face',
 ')',
 'library',
 ':',
 'dd',
 '##d',
 '.']

### 2️⃣ Encoding
Once we have tokenized the text, we then need to convert these chuncks to numbers so we can feed them to our model. This conversion is basically a look-up in a dictionary **from `str` $\to$ `int`**. The tokenizer object can also perform this work. While it does so it will also add the *special* tokens needed by the model to the encodings. 

In [13]:
#@title 🔍 Quick look at token encoding { run: "auto"}
input_sample = "Welcome to IFT6135. We now teach you 🤗(HUGGING FACE) Library :DDD."  #@param {type: "string"}

print("--> Token Encodings:\n", tokenizer.encode(input_sample))
print("-." * 15)
print("--> Token Encodings Decoded:\n", tokenizer.decode(tokenizer.encode(input_sample)))

--> Token Encodings:
 [101, 6160, 2000, 2065, 2102, 2575, 17134, 2629, 1012, 2057, 2085, 6570, 2017, 100, 1006, 17662, 2227, 1007, 3075, 1024, 20315, 2094, 1012, 102]
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
--> Token Encodings Decoded:
 [CLS] welcome to ift6135. we now teach you [UNK] ( hugging face ) library : ddd. [SEP]


### 3️⃣ Truncate/Pad samples
Since all the sample in the batch will not have the same sequence length, we would need to truncate the longer sequences (i.e. the ones that exeed a predefined maximum length) and pad the shorter ones so we that we can equal length for all the samples in the batch. Once this is achieved, we would need to convert the result to `torch.Tensor`s and return. These tensors will then be retrieved from the [dataloader](https://https//pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

In [14]:
class Collate:
    def __init__(self, tokenizer: str, max_len: int) -> None:
        self.tokenizer_name = tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)
        self.max_len = max_len

    def __call__(self, batch: List[Dict[str, Union[str, int]]]) -> Dict[str, torch.Tensor]:
        texts = list(map(lambda batch_instance: batch_instance["title"], batch))
        tokenized_inputs = self.tokenizer(
            texts,
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_tensors="pt",
            return_token_type_ids=False,
        )

        labels = list(map(lambda batch_instance: int(batch_instance["label"]), batch))
        labels = torch.LongTensor(labels)
        return dict(tokenized_inputs, **{"labels": labels})

In [15]:
#@title 🧑‍🍳 Setting up the collate function { run: "auto" }
tokenizer_name = "bert-base-uncased"  #@param {type: "string"}
sample_max_length = 256  #@param {type:"slider", min:32, max:512, step:1}
collate = Collate(tokenizer=tokenizer_name, max_len=sample_max_length)

In [16]:
class ReviewClassifier(nn.Module):
    def __init__(self, backbone: str, backbone_hidden_size: int, nb_classes: int):
        super(ReviewClassifier, self).__init__()
        self.backbone = backbone
        self.backbone_hidden_size = backbone_hidden_size
        self.nb_classes = nb_classes
        self.back_bone = AutoModel.from_pretrained(
            self.backbone,
            output_attentions=False,
            output_hidden_states=False,
        )
        self.classifier = torch.nn.Linear(self.backbone_hidden_size, self.nb_classes)

    def forward(
            self, input_ids: torch.Tensor, attention_mask: torch.Tensor, labels: Optional[torch.Tensor] = None
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        back_bone_output = self.back_bone(input_ids, attention_mask=attention_mask)
        hidden_states = back_bone_output[0]
        pooled_output = hidden_states[:, 0]  # getting the [CLS] token
        logits = self.classifier(pooled_output)
        if labels is not None:
            loss_fn = torch.nn.CrossEntropyLoss()
            loss = loss_fn(
                logits.view(-1, self.nb_classes),
                labels.view(-1),
            )
            return loss, logits
        return logits


class ReviewClassifierLSTM(nn.Module):
    def __init__(self, nb_classes: int, encoder_only: bool = False, dropout=0.5):
        super(ReviewClassifierLSTM, self).__init__()
        self.nb_classes = nb_classes
        self.encoder_only = encoder_only
        self.back_bone = EncoderDecoder(dropout=dropout, encoder_only=encoder_only)
        self.classifier = torch.nn.Linear(256, self.nb_classes)

    def forward(
            self, input_ids: torch.Tensor, attention_mask: torch.Tensor, labels: Optional[torch.Tensor] = None
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        hidden_states, _ = self.back_bone(input_ids, attention_mask)
        pooled_output = hidden_states
        logits = self.classifier(pooled_output)
        if labels is not None:
            loss_fn = torch.nn.CrossEntropyLoss()
            loss = loss_fn(
                logits.view(-1, self.nb_classes),
                labels.view(-1),
            )
            return loss, logits
        return logits


class ReviewClassifierTransformer(nn.Module):
    def __init__(self, nb_classes: int, num_heads: int = 4, num_layers: int = 4, block: str = "prenorm", dropout: float = 0.3):
        super(ReviewClassifierTransformer, self).__init__()
        self.nb_classes = nb_classes
        self.back_bone = Transformer(num_heads=num_heads, num_layers=num_layers, block=block, dropout=dropout)
        self.classifier = torch.nn.Linear(256, self.nb_classes)

    def forward(
            self, input_ids: torch.Tensor, attention_mask: torch.Tensor, labels: Optional[torch.Tensor] = None
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        back_bone_output = self.back_bone(input_ids, attention_mask)
        hidden_states = back_bone_output
        pooled_output = hidden_states
        logits = self.classifier(pooled_output)
        if labels is not None:
            loss_fn = torch.nn.CrossEntropyLoss()
            loss = loss_fn(
                logits.view(-1, self.nb_classes),
                labels.view(-1),
            )
            return loss, logits
        return logits

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"--> Device selected: {device}")


def train_one_epoch(
        model: torch.nn.Module, training_data_loader: DataLoader, optimizer: torch.optim.Optimizer, logging_frequency: int, testing_data_loader: DataLoader, logger: dict):
    model.train()
    optimizer.zero_grad()
    epoch_loss = 0
    logging_loss = 0
    start_time = time.time()
    mini_start_time = time.time()
    for step, batch in enumerate(training_data_loader):
        batch = {key: value.to(device) for key, value in batch.items()}
        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        logging_loss += loss.item()

        if (step + 1) % logging_frequency == 0:
            freq_time = time.time() - mini_start_time
            logger['train_time'].append(freq_time + logger['train_time'][-1])
            logger['train_losses'].append(logging_loss / logging_frequency)
            print(f"Training loss @ step {step + 1}: {logging_loss / logging_frequency}")
            eval_acc, eval_f1, eval_loss, eval_time = evaluate(model, testing_data_loader)
            logger['eval_accs'].append(eval_acc)
            logger['eval_f1s'].append(eval_f1)
            logger['eval_losses'].append(eval_loss)
            logger['eval_time'].append(eval_time + logger['eval_time'][-1])

            logging_loss = 0
            mini_start_time = time.time()

    return epoch_loss / len(training_data_loader), time.time() - start_time


def evaluate(model: torch.nn.Module, test_data_loader: DataLoader):
    model.eval()
    model.to(device)
    eval_loss = 0
    correct_predictions = {i: 0 for i in range(2)}
    total_predictions = {i: 0 for i in range(2)}
    preds = []
    targets = []
    start_time = time.time()
    with torch.no_grad():
        for step, batch in enumerate(test_data_loader):
            batch = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            eval_loss += loss.item()

            predictions = np.argmax(outputs[1].detach().cpu().numpy(), axis=1)
            preds.extend(predictions.tolist())
            targets.extend(batch["labels"].cpu().numpy().tolist())

            for target, prediction in zip(batch["labels"].cpu().numpy(), predictions):
                if target == prediction:
                    correct_predictions[target] += 1
                total_predictions[target] += 1
    accuracy = (100.0 * sum(correct_predictions.values())) / sum(total_predictions.values())
    f1 = f1_score(targets, preds)
    model.train()
    return accuracy, round(f1, 4), eval_loss / len(test_data_loader), time.time() - start_time


def save_logs(dictionary, log_dir, exp_id):
    log_dir = os.path.join(log_dir, exp_id)
    os.makedirs(log_dir, exist_ok=True)
    # Log arguments
    with open(os.path.join(log_dir, "args.json"), "w") as f:
        json.dump(dictionary, f, indent=2)

In [19]:
nb_epoch = 100
batch_size = 512
logging_frequency = 5
learning_rate = 1e-5

train_loader = DataLoader(dataset_train, batch_size=batch_size, shuffle=True, collate_fn=collate)
test_loader = DataLoader(dataset_test, batch_size=batch_size, shuffle=False, collate_fn=collate)
for i in range(1, 7):
    experimental_setting = i
    # 4 experimental settings

    if experimental_setting == 1:
        model = ReviewClassifierLSTM(nb_classes=2, dropout=0.3, encoder_only=True)
    if experimental_setting == 2:
        model = ReviewClassifierLSTM(nb_classes=2, dropout=0.3, encoder_only=False)
    if experimental_setting == 3:
        model = ReviewClassifierTransformer(nb_classes=2, num_heads=4, num_layers=2, block='prenorm', dropout=0.3)
    if experimental_setting == 4:
        model = ReviewClassifierTransformer(nb_classes=2, num_heads=4, num_layers=4, block='prenorm', dropout=0.3)
    if experimental_setting == 5:
        model = ReviewClassifierTransformer(nb_classes=2, num_heads=4, num_layers=2, block='postnorm', dropout=0.3)
    if experimental_setting == 6:
        model = ReviewClassifier(backbone="bert-base-uncased", backbone_hidden_size=768, nb_classes=2)
        for parameter in model.back_bone.parameters():
            parameter.requires_grad = False
        logging_frequency = 703

    # setting up the optimizer
    optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate, eps=1e-8)
    model.to(device)
    logger = dict()
    logger['train_time'] = [0]
    logger['eval_time'] = [0]
    logger['train_losses'] = []
    logger['eval_accs'] = []
    logger['eval_f1s'] = []
    logger['eval_losses'] = []

    logger['parameters'] = sum([p.numel() for p in model.back_bone.parameters() if p.requires_grad])
    nb_epoch = 1

    train_loss, train_time = train_one_epoch(model, train_loader, optimizer, logging_frequency, test_loader, logger)
    eval_acc, eval_f1, eval_loss, eval_time = evaluate(model, test_loader)
    logger["total_train_loss"] = train_loss
    logger["total_train_time"] = train_time
    logger["final_eval_loss"] = eval_loss
    logger["final_eval_time"] = eval_time
    logger["final_eval_acc"] = eval_acc
    logger["final_eval_f1"] = eval_f1
    logger['train_time'] = logger['train_time'][1:]
    logger['eval_time'] = logger['eval_time'][1:]

    print(f"    Epoch: {1} Loss/Test: {eval_loss}, Loss/Train: {train_loss}, Acc/Test: {eval_acc}, F1/Test: {eval_f1}, Train Time: {train_time}, Eval Time: {eval_time}")
    save_logs(logger, "assignment/log", str(experimental_setting))


--> Device selected: cuda


NotImplementedError: Module [ModuleList] is missing the required "forward" function

In [2]:
import os
import numpy
import matplotlib
matplotlib.use("tkagg")
import matplotlib.pyplot as plt
import urllib.request
from sklearn.metrics import f1_score
import time
import json
import csv

from typing import List, Dict, Union, Optional, Tuple
import torch


In [3]:
def load_logs(log_dir, exp_id):
    log_dir = os.path.join(log_dir, exp_id)
    with open(os.path.join(log_dir, "args.json"), "r") as f:
        dictionary = json.load(f)
    return dictionary

In [4]:
train_losses, eval_losses, eval_accs = [], [], []
total_train_time, final_eval_time = [], []
for i in range(1, 7):
    experimental_setting = i
    dictionary = load_logs("assignment/log", str(experimental_setting))
    train_losses.append(dictionary["train_losses"])
    eval_losses.append(dictionary["eval_losses"])
    eval_accs.append(dictionary["eval_accs"])

    total_train_time.append(dictionary["total_train_time"])
    final_eval_time.append(dictionary["final_eval_time"])
    # print(type(dictionary))

In [5]:
for i in range(0, 6):
    matplotlib.pyplot.figure("train_losses")
    matplotlib.pyplot.plot(train_losses[i], label=f"experiment {i+1}")
    matplotlib.pyplot.legend()
    matplotlib.pyplot.xlabel("training iteration")
    matplotlib.pyplot.ylabel("training loss")
    matplotlib.pyplot.savefig("assignment/figure/train_losses.pdf")

    matplotlib.pyplot.figure("eval_losses")
    matplotlib.pyplot.plot(eval_losses[i], label=f"experiment {i+1}")
    matplotlib.pyplot.legend()
    matplotlib.pyplot.xlabel("training iteration")
    matplotlib.pyplot.ylabel("validation loss")
    matplotlib.pyplot.savefig("assignment/figure/eval_losses.pdf")

    matplotlib.pyplot.figure("eval_accs")
    matplotlib.pyplot.plot(eval_accs[i], label=f"experiment {i+1}")
    matplotlib.pyplot.legend()
    matplotlib.pyplot.xlabel("training iteration")
    matplotlib.pyplot.ylabel("validation accuracy")
    matplotlib.pyplot.savefig("assignment/figure/eval_accs.pdf")

matplotlib.pyplot.show()

In [7]:
print(f"# arch. best_train_lost best_val._loss best_val._acc. tot_train_time eval_time")
# print(f"# arch. best_train_lost best_val._loss best_val._acc. tot_train_time eval_time")

architecture_name = ["LSTM/E(only)", "LSTM/E/D", "Transformer/2L/pre", "Transformer/4L/pre", "Transformer/2L/post", "BERT"]
# architecture_name = ["LSTM/E(only)", "LSTM/E/D", "Transf./2/pre", "Transf./4/pre", "Transf./2/post", "BERT"]

for i in range(0, 6):
    best_tl, best_vl, best_va = numpy.min(train_losses[i]), numpy.min(eval_losses[i]), numpy.max(eval_accs[i])

    print(f"{i+1} {architecture_name[i]} {best_tl:.3f} {best_vl:.3f} {best_va:.3f} {total_train_time[i]:.0f} {final_eval_time[i]:.3f}")

# arch. best_train_lost best_val._loss best_val._acc. tot_train_time eval_time
1 LSTM/E(only) 0.343 0.409 81.200 986 0.110
2 LSTM/E/D 0.313 0.371 83.300 1487 0.151
3 Transformer/2L/pre 0.421 0.454 78.700 2502 0.269
4 Transformer/4L/pre 0.420 0.436 79.700 4527 0.397
5 Transformer/2L/post 0.429 0.456 78.500 2412 0.232
6 BERT 0.420 0.425 80.900 30362 5.843


In [7]:
print(f"experiment architecture_name best_train_lost best_valid_loss best_valid_accuracy total_train_time final_eval_time")
architecture_name = ["LSTM/encoder_only", "LSTM/encoder/decoder", "Transformer/2 heads/prenorm", "Transformer/4 heads/prenorm", "Transformer/2 heads/postnorm", "BERT"]
# architecture_name = ["LSTM/encoder_only", "LSTM/encoder/decoder", "Transformer/2 heads/prenorm", "Transformer/4 heads/prenorm", "Transformer/2 heads/postnorm", "BERT"]

for i in range(0, 6):
    best_tl, best_vl, best_va = numpy.min(train_losses[i]), numpy.min(eval_losses[i]), numpy.max(eval_accs[i])

    print(f"{i+1} {architecture_name[i]:25} {best_tl:.3f} {best_vl:.3f} {best_va:.3f} {total_train_time[i]:10.3f} {final_eval_time[i]:.3f}")


experiment architecture_name best_train_lost best_valid_loss best_valid_accuracy total_train_time final_eval_time
1 LSTM/True                 0.343 0.409 81.200    937.284 0.106
2 LSTM/False                0.315 0.371 83.600   1424.998 0.147
3 Transformer/2/prenorm     0.445 0.474 77.700   2421.104 0.242
4 Transformer/4/prenorm     0.683 0.684 62.900   4503.754 0.413
5 Transformer/2/postnorm    0.479 0.496 76.500   2411.199 0.244
6 BERT                      0.420 0.425 80.900  30305.980 5.805
