## Introduction

* # TODO: INSERT LINK TO WANDB VIEW

## Setup

#### Install dependencies

* **torch**: PyTorch framework for the creation of neural networks
* **lightning**: Lightning wrapper for pytorch for simple network training
* **huggingface_hub**: HuggingFace hub for downloading word vectors
* **datasets**: HuggingFace datasets to download and load the data set
* **wandb**: Weights & Biases for experiment tracking
* **fasttext**: Word embedding library
* **nltk**: Natural Language Toolkit used for word tokenization
* **torchmetrics**: Extension to lightning to compute model metrics

In [1]:
import sys

%pip install -q torch lightning huggingface_hub datasets wandb nltk torchmetrics

if sys.platform == 'win32': # Windows requires different fasttext implementation
    %pip install -q fasttext-wheel
else: 
    %pip install -q fasttext

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


#### Load dataset

Use the pre-defined method to load the dataset and do the train and validation split

In [2]:
from datasets import load_dataset, Dataset

train: Dataset = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid: Dataset = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test: Dataset = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

README.md:   0%|          | 0.00/7.39k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/1.25M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/160k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/151k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9741 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1221 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1140 [00:00<?, ? examples/s]

8741 1000 1221


#### Setup Weights & Biases

Login to weights and biases to enable experiment tracking

In [31]:
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter:

  ········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/jovyan/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mschurtenberger-david[0m ([33mdavid-schurtenberger[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Preprocessing

#### Vocabulary/Embedding

* I decided to use the **FastText** library for this project, since in class it was said that FastText is superior to the other embedding models and there is no problem with embedding unknown words, because it can create word vectors from their subwords. Furthermore, I will be working with the **facebook/fasttext-en-vectors** word vectors from the HuggingFace hub. They embed words from the English language, which is the only relevant language.

* This choice influences decisions in the following pre-processing steps.

#### Format cleaning (e.g. html-extracted text)

* No format cleaning is performed, because we work with a carefully assembled and standardized dataset used in model benchmarking.

#### Tokenization

* *word_tokenizer* from the **nltk** library will be used. This tokenizer works well for the English language. It also splits punctuation from text, which matches the tokens the fasttext word vectors were trained on.

#### Lowercasing, stemming, lemmatizing, stopword/punctuation removal

* **Lowercasing**: Although the word vectors in use were trained on case-sensitive data, the tokenized words will be lowercased to reach a smaller vocabulary and minimize out-of-vocabulary words.
* **Stemming**: The word embedding model was not trained on word stems and therefore no stemming is carried out.
* **Lemmatizing**: The word tokens to be embedded will not be lemmatized, because the fasttext model was trained on un-lemmatized words and the n-gram encoding of the words used in fasttext preserves sub-word information.
* **Stopword/Punctudation removal**: Since the task is to answer common-sense questions, stopwords and punctuation will not be removed. Most of the questions are quite short and the loss of information if either a critical stopword in the question or punctuation that changes the meaning of the question is removed could be significant.

#### Removal of unknown/other words

* Since I am working with a fasttext model, the removal of unknown words is not necessary, because vectors for them can implicitly be built from their n-gram vectors. Also, the encounter of unknown words is not expected.

#### Truncation

* Input will not be truncated. After some data review, the question yielding the most embedded word vectors yields a tensor of shape **300x67**. Depending on the input format, the RNN model will have to perform significantly less than 100 time steps for the longest input, which is deemed to be feasible. Also, if padding is implemented correctly, for every timestep only the necessary amount of time steps will be executed.

#### Feature selection

* Of the available features, the **question**, the **choices** and the **answerKey** were chosen. While the *questionConcept* seemed like an interesting feature at first, after some data review it was determined, that this feature often simply contains a word from the question. In the end this feature was left out in order not to give too much emphasis to a single word that does likely not help answering the question at all.

#### Input format: how is data passed to the model?

###### Classifier

   * I chose the input for the *Classifier Model* to be a tensor of size **1800**. The first 300 elements are the averages of the embedded question tokens, next are 300 elements for every embedded and averaged answer vector from answer option 'A' to 'E'.
      * The average of the question vectors was chosen, because it is a good tradeoff between information retention and input  dimension for the classifier.
      * The question vector is before the answer vectors, because "Q&A" also has question first, then answers.
      * The answers are arranged from 'A' to 'E' because of alphabetical order.
      * The average of the answer embeddings has been chosen, since answers can consist of multiple words and therefore may yield multiple embedding vectors.

###### RNN + Classifier

   * I chose the input for the *RNN + Classifier Model* to be tensor of size **300 x (N + 10)**. The first *N* columns of the tensor are the word-embeddings of the question. The last *10* columns are the averages of the word embeddings for each answer choice, separated by a *SEP* token.
      * As separation token the character **¦** was chosen, because it is known to the embeddings model, but does not appear in the data.
      * The separation token was introduced to signal to the model, that after the input after this reserved token is an answer choice.
      * All answer embeddings are concatenated to the question embeddings to only need one full pass through the RNN model to get a prediction.
      * The answer embeddings are after the question embeddings, because "Q&A" also has question first, then answer.

#### Label format: what should the model predict?

* Both model architectures will predict a vector of length 5. Every answer choice ('A' through 'E') is encoded on an index in the vector (0 through 4). Since the classifier at the last stage of the model predicts the likelihood of each output, this output format seems the most reasonable. Also, with this kind of output, the model only needs to run once for every classification, which should increase compute performance.

#### Train/valid/test splits

* As seen in the *Introduction* section, the train/validation/test splits are performed as defined in the course

#### Batching, padding

* # TODO: test on gpu-hub for optimal batching. padding only necessary for RNN model?
* For each model, 
* Only the input of the **RNN + Classifier** model is of variable size, therefore padding is only necessary for that model.

### Tokenize

Create method to tokenize and lowercase a given text

In [3]:
import nltk

nltk.download("punkt_tab")

def tokenize(text: str) -> list[str]:
    return [w.lower() for w in nltk.word_tokenize(text, language="english")]

[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Word embeddings

Download the english fasttext word vectors and load their model into the variable *wv_model*

Create a function to embed a list of tokenized words and return them as a list of pytorch tensors

In [4]:
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("facebook/fasttext-en-vectors", "model.bin")
wv_model = fasttext.load_model(model_path)

def get_embeddings_for_tokens(tokens: list[str]):
    return torch.stack([torch.tensor(wv_model[t]) for t in tokens]).T
    

model.bin:   0%|          | 0.00/7.24G [00:00<?, ?B/s]

### Data Loading and Formatting

Create a **pytorch** *Dataset* class in which the HuggingFace dataset is loaded and preprocessed. This allows for an easy integration with a *DataLoader* afterward.

In [5]:
from typing import Callable

import torch
from torch.utils.data import Dataset
from datasets import Dataset as HFDataset

TransformMethod = Callable[[torch.Tensor, list[torch.Tensor]], torch.Tensor]

Create a separator token from a character that is known to the word vector model, but unused in the train, valid and test datasets 

In [6]:
def char_not_in_huggingface_dataset(char: str, dataset: HFDataset) -> bool:
    for datapoint in dataset:
        if char in datapoint["question"] or any(char in c for c in datapoint["choices"]["text"]):
            return False
    return True
    

placeholder = "¦"
assert placeholder in wv_model # Check if placeholder is a known token in the model
assert char_not_in_huggingface_dataset(placeholder, train)
assert char_not_in_huggingface_dataset(placeholder, valid)
assert char_not_in_huggingface_dataset(placeholder, test)

SEP_TOKEN = torch.tensor(wv_model[placeholder])

In [7]:
KEY_INDEX_MAPPING = {
    "A": 0,
    "B": 1,
    "C": 2,
    "D": 3,
    "E": 4,
}

class CommonsenseQADataset(Dataset):
    _target_transform: TransformMethod
    
    def __init__(self, dataset: HFDataset):
        self.dataset: list[dict[str, torch.tensor | list[torch.tensor]]] = []
        self._transform_hugging_face_dataset(dataset)
        
    def set_target_transform(self, transform: TransformMethod):
        self._target_transform = transform
    
    def _transform_hugging_face_dataset(self, dataset: HFDataset):
        self.dataset.extend([{
            "question": get_embeddings_for_tokens(tokenize(entry["question"])),
            "choices": torch.hstack([get_embeddings_for_tokens(tokenize(choice)).mean(dim=1).unsqueeze(1) for choice in entry["choices"]["text"]]),
            "answer": torch.eye(5)[KEY_INDEX_MAPPING[entry["answerKey"]]],
        } for entry in dataset])
        if len(self.dataset) != len(dataset):
            raise RuntimeError("Converted dataset is not full reflection of source data")
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        data_point = self.dataset[idx]
        feature = self._target_transform(data_point["question"], data_point["choices"])
        target = data_point["answer"]
        return feature, target

Interchangeable transform function for **Classifier** network

* Function input: Question-tensor (300 x N) and answer-tensor (300 x 5)
* Function output: Classifier vector (1800)
    * Average of question vectors yields vector of size 300
    * Question vector and answer vectors are concatenated (5 * 300 = 1800)

In [8]:
def classifier_target_transform(question: torch.Tensor, answers: torch.Tensor) -> torch.Tensor:
    return torch.cat((question.mean(dim=1), answers.T.flatten()))

Interchangeable transform function for **RNN + Classifier** network

* Function input: Question-tensor (300 x N) and answer-tensor (300 x 5)
* Function output: RNN-tensor (300 x (N + 10)) with N question vectors, 5 answer vectors and 5 SEP_TOKEN vectors separating the question from the answer and the answers from each other.

In [9]:
def rnn_target_transform(question: torch.Tensor, answers: torch.Tensor) -> torch.Tensor:
    separated_answers = torch.empty(300, 10)
    for i in range(5):
        separated_answers[:, 2*i] = SEP_TOKEN
        separated_answers[:, 2*i+1] = answers[:, i]
    return torch.cat([question, separated_answers], dim=1)

Transform HuggingFace datasets to pytorch Datasets

In [10]:
train_data = CommonsenseQADataset(train)
valid_data = CommonsenseQADataset(valid)
test_data = CommonsenseQADataset(test)

# 1. Architecture: WordEmbeddings &rarr; Classifier

## Model

I chose to use **lightning** to create a streamlined model training process. The *LightningModule* subclass was created with the help of the [API doc](https://lightning.ai/docs/pytorch/LTS/common/lightning_module.html#lightningmodule-api) and the "experiment_tracking" notebook that we looked at in the Project Discussion lecture

The model architecture complies with the required architecture in the project description.
* Between input and hidden layer there is RELU-non-linearity as activation function. Reason: required
* The output of the second layer is activated using SoftMax. Reason: meaningful output activation for multiclass classification
* The metrics "val_loss", "val_acc", "train_loss", "train_acc" are logged after every epoch. Reason: Meaningful metrics, not overwhelming experiment tracking view

In [11]:
import lightning as L
import torchmetrics
import torch.nn as nn
import torch

In [38]:
class CqaClassifier(L.LightningModule):
    def __init__(
            self, 
            input_dim: int = 1800, 
            hidden_dim: int = 4096, 
            output_dim: int = 5, 
            learning_rate: float = 1e-4,
            adam_epsilon: float = 1e-8,
            weight_decay: float = 0.0,          
    ):
        super().__init__()
        
        self.save_hyperparameters()
        
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.loss_fn = nn.CrossEntropyLoss()
        
        self._train_acc = torchmetrics.Accuracy("multiclass", num_classes=output_dim)
        self._train_loss = []
        self._valid_acc = torchmetrics.Accuracy("multiclass", num_classes=output_dim)
        self._valid_loss = []
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fc1 = torch.relu(self.fc1(x))
        output = torch.softmax(self.fc2(fc1), dim=1)
        return output
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self._train_loss.append(loss)
        self._train_acc(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self._valid_loss.append(loss)
        self._valid_acc(y_hat, y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.log('test_loss', loss)

    def on_train_epoch_end(self):
        loss = torch.stack(self._train_loss).mean()
        self.log_dict({'train_loss': loss, 'train_acc': self._train_acc.compute()}, prog_bar=True)
        self._train_loss.clear()
        self._train_acc.reset()

    def on_validation_epoch_end(self):
        loss = torch.stack(self._valid_loss).mean()
        self.log_dict({'valid_loss': loss, 'valid_acc': self._valid_acc.compute()}, prog_bar=True)
        self._train_loss.clear()
        self._valid_acc.reset()

    def configure_optimizers(self):
        return torch.optim.AdamW(
            self.parameters(), 
            lr=self.hparams.learning_rate, 
            eps=self.hparams.adam_epsilon, 
            weight_decay=self.hparams.weight_decay
        )
        

## Training

#### Utilities

Create a Utility class for run parameters, that automatically creates a meaningful run-name and returns the config for wandb

* Require a model_name, this will also be used as run-name on wandb. Reason: directly clear what model is trained, but not too much information in run name as recommended from wandb
* Optional hyperparameter *kwargs* that are passed to the wandb initialization as config. Reason: Track all hyperparameters for reproducability

In [39]:
class RunParameters:
    def __init__(self, model_name: str, **kwargs):
        self._name = model_name
        self._params = dict(kwargs)
    
    def __call__(self) -> dict:
        return self._params
    
    def __str__(self):
        return self._name

Utility function to find the optimal batch size. A batch size that uses 80% of the available GPU memory was deemed to be optimal. Reason: Use bulk of memory available, but leave some headroom

In [40]:
def batch_size_finder(data: Dataset, model: L.LightningModule, max_memory_usage=0.8) -> tuple[int, float]:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if not device == "cuda":
        raise RuntimeError("Can only be run on gpu")
    batch_size, memory_usage = 1, 0
    model.to(device)
    
    while True:
        loader = torch.utils.data.DataLoader(data, batch_size=batch_size)
        inputs = labels = torch.tensor([])
        for i, l in loader:
            if i.shape > inputs.shape:
                inputs, labels = i, l
        try:
            inputs, labels = inputs.to(device), labels.to(device)
            _ = model(inputs)
            
            # Check memory usage
            mem_allocated = torch.cuda.memory_allocated(device)
            mem_reserved = torch.cuda.memory_reserved(device)
            mem_usage = mem_allocated / mem_reserved

            print(f"Batch Size: {batch_size}, Memory Usage: {mem_usage:%}")

            if mem_usage <= memory_usage:
                return batch_size // 2, memory_usage
            
            if mem_usage >= max_memory_usage:
                return batch_size, mem_usage
           
            batch_size *= 2
            memory_usage = mem_usage
        
        except RuntimeError as e:
            if 'out of memory' in str(e):
                return batch_size // 2, memory_usage
            raise e
    

#### Run parameters

Set the data target transformer to the Classifier model function

In [41]:
train_data.set_target_transform(classifier_target_transform)
valid_data.set_target_transform(classifier_target_transform)

Find optimal batch size for a model with the given amount of hidden layers on the training set

In [42]:
HIDDEN_DIM = 4096
BATCH_SIZE, memory_usage = batch_size_finder(train_data, CqaClassifier(hidden_dim=HIDDEN_DIM))
print(f"Batch Size: {BATCH_SIZE} @ {memory_usage:%} memory usage")
BATCH_SIZE = 256 # Overwrite batch size, because entire dataset could be loaded at once

Batch Size: 1, Memory Usage: 90.755547%
Batch Size: 1 @ 90.755547% memory usage


In [43]:
parameters = RunParameters(
    "Classifier-v1",
    learning_rate=1e-4,
    epochs=20,
    hidden_dim=HIDDEN_DIM,
    adam_epsilon=1e-8,
    weight_decay=0.0,
    train_batch_size=BATCH_SIZE,
    valid_batch_size=BATCH_SIZE,
)

#### Initialize wandb experiment tracking for run

In [47]:
from lightning.pytorch.loggers import WandbLogger

In [None]:
wandb.init(
    entity="david-schurtenberger",
    project="NLP_Project_1",
    name=str(parameters),
    config=parameters(),
)
wandb_logger = WandbLogger(project="NLP_Project_1")

#### Training routine

Define the function **train_classifier** which describes the training routine for the classifier model using the **Trainer** class of *lightning* 

In [49]:
def train_classifier(config, logger, train_loader, valid_loader):
    L.seed_everything(42)
    model = CqaClassifier(
        hidden_dim=config.get("hidden_dim"),
        learning_rate=config.get("learning_rate"),
        adam_epsilon=config.get("adam_epsilon"),
        weight_decay=config.get("weight_decay")
    )
    trainer = L.Trainer(
        max_epochs=config.get("epochs"),
        accelerator="auto",
        devices=1,
        logger=logger,
    )
    trainer.fit(model, train_loader, valid_loader)

Instantiate data loaders

In [50]:
train_loader = torch.utils.data.DataLoader(train, wandb.config["train_batch_size"], shuffle=True)
valid_loader = torch.utils.data.DataLoader(train, wandb.config["valid_batch_size"])

Execute training run

In [51]:
train_classifier(wandb.config, wandb_logger, train_loader, valid_loader)

Seed set to 42
You are using the plain ModelCheckpoint callback. Consider using LitModelCheckpoint which with seamless uploading to Model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type               | Params | Mode 
----------------------------------------------------------
0 | fc1        | Linear             | 7.4 M  | train
1 | fc2        | Linear             | 20.5 K | train
2 | loss_fn    | CrossEntropyLoss   | 0      | train
3 | _train_acc | MulticlassAccuracy | 0      | train
4 | _valid_acc | MulticlassAccuracy | 0      | train
----------------------------------------------------------
7.4 M     Trainable params
0         Non-trainable params
7.4 M     Total params
29.590    Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


ValueError: too many values to unpack (expected 2)

In [None]:
wandb.finish()

## Evaluation

## Interpretation

# WordEmbeddings &rarr; RNN &rarr; Classifier

## Model

## Training

## Evaluation

## Interpretation