# Project 2: RNN BoolQ



# Introduction

Classification of BoolQ with RNNs.

The BoolQ dataset is preprocessed for use with GloVe word embeddings. The `question` and `passage` are then concatenated to form the input of the classification model, which results in an input of 304x300.

A classification model with 2 GRU layers is connected to 2 linear layers, which output the probability that the answer to the question is true.
The linear layers are connected through ReLU, and a sigmoid produces the output probability.

Experiments were run with various hyperparameter configurations to determine which works best. The varied hyperparameters were hidden layer size, learning rate, dropout and weight decay.

The final model uses hidden layer size 256, learning rate 0.001, dropout 0 and weight decay 0.

It reaches the following test performance:
- Balanced accuracy: 54%
- F1: 44%
- Loss: 7.1
- Precision: 69%
- Recall: 65%
- Accuracy: 64%


W&B Link: https://wandb.ai/yelin-zhang-hslu/nlp-project-2/workspace?nw=5v0ojjkq1bf

# Setup
Install dependencies, import used libraries and download used embeddings.

## Tools used
- GPUHub JupyterLab
- PyTorch Lightning documentation
- No AI tools used, as they do not help with reading API documentation and GitHub issues 

## Changes to stage 1
- Split up documentation and add description of chapter.
- HuggingFace login is not needed.

## Dependencies
Python 3.11.9

Install all necessary dependencies
- PyTorch: `torch lightning`
- Hugging Face: `huggingface_hub datasets`
- Weights & Biases: `wandb`
- nltk: `nltk`
- numpy: `numpy`

Optional
- Lint and Formatting: `ruff`

In [2]:
%pip install torch==2.3.1 lightning==2.4.0 huggingface_hub==0.25.2 datasets==3.0.1 wandb==0.18.3 nltk==3.9.1 numpy==1.26.4 ruff==0.6.9

Note: you may need to restart the kernel to use updated packages.


## Notebook setup
- Import all necessary libraries.

In [3]:
from datasets import load_dataset
from pathlib import Path
import shutil
from huggingface_hub import hf_hub_download
from datasets import Value
import wandb
from lightning.pytorch.loggers import WandbLogger
import torchmetrics
from torch import optim, nn, utils
import lightning as L
import nltk
import torch
import string
import numpy as np

- Log into Weights & Biases (a Hugging Face login is not needed).

In [5]:
%env "WANDB_NOTEBOOK_NAME" "project2-stage2"
wandb.login()
WANDB_PROJECT = "nlp-project-2"

env: "WANDB_NOTEBOOK_NAME"="project2-stage2"


Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yelin-zhang (yelin-zhang-hslu). Use `wandb login --relogin` to force relogin


- Download GloVe Wikipedia embeddings from HuggingFace and unzip them in the `data` folder.

> Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation

In [6]:
GLOVE_PRETRAINED = "glove.6B.zip"
model_path = hf_hub_download(repo_id="stanfordnlp/glove", filename=GLOVE_PRETRAINED)
shutil.unpack_archive(model_path, "./glove_embeddings")

- Load embeddings into `embeddings` dictionary variable.

In [7]:
embeddings = {}
with open("./glove_embeddings/glove.6B.300d.txt", encoding="utf-8") as glove:
    for word_vec in glove:
        values = word_vec.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings[word] = vector
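
The file format that the loading cell relies on is one token per line, followed by its vector components separated by spaces. The following is a minimal sketch of that parsing step on toy data; `parse_glove` and the 3-dimensional vectors are hypothetical and only illustrate the format, they are not real GloVe values.

```python
import io


def parse_glove(lines):
    """Parse lines of the form '<word> <v1> <v2> ...' into a dict of float lists."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


# Hypothetical 3-dimensional toy vectors, not real GloVe values.
toy_file = io.StringIO("the 0.1 0.2 0.3\nlanguage -0.5 0.0 0.25\n")
toy_embeddings = parse_glove(toy_file)
print(toy_embeddings["language"])
```

The notebook itself stores each vector as a `float32` NumPy array instead of a Python list, which is more compact for 300-dimensional vectors.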

# Preprocessing

Predefined requirements:
- Train / Validation / Test split
- Existing word embedding model: word2vec, GloVe, fastText
- Download the BoolQ dataset with `datasets` and split it in the predefined way.

Data treatment steps:
- Lower case `text.lower()`
    - The pretrained GloVe embeddings (trained on Wikipedia) are uncased
- Tokenize with nltk
    - Use `word_tokenize` as we are interested in every word for word embedding with GloVe
- Remove punctuation and non-ASCII characters (phonetics etc.)
    - Punctuation and non-ASCII characters are not relevant for answering questions
    - Stop words are not removed as they are relevant for answering the question
    - Remove non-ASCII characters by encoding and decoding with `ascii`: `text.encode('ascii', 'ignore').decode('ascii')`
    - Remove punctuation by checking against `string.punctuation`
- Word embedding with GloVe (pretrained on Wikipedia)
    - Preferred to word2vec because GloVe works with co-occurrence and answering questions is about context
    - Preferred to fastText because after the previous processing no subword embeddings are needed
    - Skip the word if it is not in the vocabulary of `embeddings`
- Truncating by averaging of passage
    - This is needed as the dataset contains a few very long outliers, which would bloat the input to the model
    - Enforce a maximum length such that ~99% of passages are not truncated: `np.percentile(passages_lengths, 99)`
    - Take the average of what would be cut off and store it as the last vector of the truncated passage
- Padding with 0s for question and passage up to a fixed length `np.pad`
    - This is needed so that every concatenated input has the same length
    - Pad all questions to the maximum length of all questions
    - Pad all passages to the maximum passage length determined previously
- Concatenate question and passage as the input for the model
    - `np.concatenate` is used to have a single input for the model, which is not in an extra dimension as when using `np.stack`
    - Add a separator of 8 vectors with only zeros between question and passage
        - This is needed to be able to differentiate between question and passage
- Remove `question` and `passage` columns from the dataset
    - They are not needed anymore as they are now part of the input
    - `dataset.remove_columns(["question", "passage"])`

Used features:
- `question` and `passage` as word vectors
- `answer` as label

Input format: concatenated `question` and `passage` word vectors ((max length of question + separator + max length of passage) x embedding size)

Label format:
- convert `answer` boolean to 1 or 0
- Model output is probability of 1
- `dataset.cast_column("answer", Value("int32"))`

Batch size: 64 for faster training

## Correctness tests
- Check processed passages and questions before embedding if they still make sense 
- Check embedding lengths
- Check how many words are not in the vocabulary and, if needed, adjust which pretrained GloVe embeddings are used

## Changes to stage 1
- Clarification of the separator between question and passage: not a 0 character, but multiple vectors with only 0 values.
- After concatenation the data is reshaped (flattened) to reduce the number of input dimensions.

## Implementation
The preprocessing is split into two main steps.
1. Creation of the word vectors
2. Converting into desired input format

These are run separately, because the padding and truncation lengths are only known after all word vectors have been created.

HuggingFace datasets caches computations. This caused an issue, because the preprocessing function sets global variables, which are missing when the function is not executed. As a result, one sweep finished with the wrong dimensions and I had to run the same sweep again with the correct dimensions.
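
The effect can be reproduced with any cache that skips the function body on a hit. The sketch below uses `functools.lru_cache` purely as a stand-in for the on-disk cache of `datasets` (an analogy, not the library's actual mechanism); `tokenize_and_track` and `max_len_seen` are hypothetical names.

```python
import functools

max_len_seen = 0  # global side effect, comparable to max_question_len in the notebook


@functools.lru_cache(maxsize=None)
def tokenize_and_track(text):
    """Return the tokens and, as a side effect, update the global maximum."""
    global max_len_seen
    tokens = tuple(text.split())
    max_len_seen = max(max_len_seen, len(tokens))
    return tokens


tokenize_and_track("is the sky blue")  # computed: the side effect runs
max_len_seen = 0                       # simulate a restarted kernel losing the global
tokenize_and_track("is the sky blue")  # served from the cache: the side effect is skipped
stale_max = max_len_seen               # the global was never restored

# Safer: derive the statistic from the returned data instead of a global.
derived_max = max(len(tokens) for tokens in [tokenize_and_track("is the sky blue")])
print(stale_max, derived_max)
```

The notebook avoids the problem by passing `load_from_cache_file=False` to `map`; computing the lengths from the mapped dataset, as in the last step of the sketch, would be an alternative that also works with caching enabled.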

Preprocessing has to be reasonably fast, as GPUHub seems to lose the Jupyter kernel after a few hours, which requires rerunning it.

The result of the correctness checks is:
- The answer to the question can still be pieced together just before creating the `passage` embeddings, therefore the processing was done successfully.
- The downloaded GloVe word embeddings are 300 long, as expected.
- About 8000 words are not in the vocabulary, but from looking at the missed words they do not seem very important.
- The flattening of the model input `query` results in the expected size

Download and split dataset in predefined way

In [8]:
# Predefined dataset loading code
train_raw = load_dataset("google/boolq", split="train[:-1000]")
valid_raw = load_dataset("google/boolq", split="train[-1000:]")
test_raw = load_dataset("google/boolq", split="validation")

print(len(train_raw), len(valid_raw), len(test_raw))

8427 1000 3270


- Lower case `text.lower()`
    - The pretrained GloVe embeddings (trained on Wikipedia) are uncased
- Tokenize with nltk
    - Use `word_tokenize` as we are interested in every word for word embedding with GloVe
- Remove punctuation and non-ASCII characters (phonetics etc.)
    - Punctuation and non-ASCII characters are not relevant for answering questions
    - Stop words are not removed as they are relevant for answering the question
    - Remove non-ASCII characters by encoding and decoding with `ascii`: `text.encode('ascii', 'ignore').decode('ascii')`
    - Remove punctuation by checking against `string.punctuation`
- Word embedding with GloVe (pretrained on Wikipedia)
    - Preferred to word2vec because GloVe works with co-occurrence and answering questions is about context
    - Preferred to fastText because after the previous processing no subword embeddings are needed
    - Skip the word if it is not in the vocabulary of `embeddings`

- Check processed passages and questions before embedding if they still make sense 

Why not:
- stemming/lemmatization
    - The pretrained GloVe embeddings use neither technique; to keep the preprocessing consistent I also do not apply stemming or lemmatization
- removal of other words/stopwords
    - Stopwords are important for answering the question, as negations and other important words count as stopwords.
    - Other words are automatically removed if they do not appear in the embedding vocabulary
- format cleaning
    - Removing non-ASCII characters can be counted as format cleaning; otherwise the dataset is already clean.
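
The cleaning steps described above can be sketched on a single sentence as follows. This is a simplified illustration: whitespace splitting stands in for `nltk.word_tokenize`, and `clean_tokens`, `embed` and the 1-dimensional toy vocabulary are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)


def clean_tokens(text):
    """Lower-case, drop non-ASCII characters, split on whitespace, drop punctuation tokens."""
    ascii_text = text.lower().encode("ascii", errors="ignore").decode("ascii")
    # Assumption: whitespace splitting stands in for nltk.word_tokenize here.
    return [token for token in ascii_text.split() if token not in PUNCTUATION]


def embed(tokens, vocabulary):
    """Look up each token and collect the out-of-vocabulary ones."""
    vectors, missed = [], set()
    for token in tokens:
        if token in vocabulary:
            vectors.append(vocabulary[token])
        else:
            missed.add(token)
    return vectors, missed


toy_vocab = {"is": [0.1], "not": [0.2], "farsi": [0.3]}  # hypothetical 1-d vectors
# "\u0101" and "\u012b" are the letters a and i with a macron (non-ASCII).
tokens = clean_tokens("Is F\u0101rs\u012b not , Persian ?")
vectors, missed = embed(tokens, toy_vocab)
print(tokens, missed)
```

Dropping non-ASCII characters can turn a known word into an unknown one, as with `frs` here. The same effect is visible in the debug output of the notebook, where `farsi` with diacritics becomes `frsi`.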

In [9]:
# punctuation characters to be removed
punctuation = set(string.punctuation)

text_values = ["question", "passage"]
passage_lens = []
max_question_len = 0

missed_words = set()
embedding_vocabulary = set(embeddings.keys())


def preprocess_dataset(row, debug=False):
    global max_question_len
    for key in text_values:
        # lower case the text and remove non-ascii characters
        lowered_value = row[key].lower().encode("ascii", errors="ignore").decode()
        # tokenize the text
        tokenized_value = nltk.word_tokenize(lowered_value)
        # filter out punctuations
        filtered_value = [word for word in tokenized_value if word not in punctuation]
        # convert to word vectors
        embedded_value = []
        for word in filtered_value:
            if word not in embedding_vocabulary:
                missed_words.add(word)
                continue
            embedded_value.append(embeddings[word])
        row[key] = embedded_value
        if debug:
            print(
                f"\n{key}:\nlowered and ascii: {lowered_value}"
                f"\ntokenized: {tokenized_value}"
                f"\nfiltered: {filtered_value}"
            )
    # save passage lengths
    passage_lens.append(len(row["passage"]))
    # save longest question length
    q_len = len(row["question"])
    if q_len > max_question_len:
        max_question_len = q_len
    return row


# can overwrite important global parameters if executed later
preprocess_dataset(train_raw[0], True)

train_vectorized = train_raw.map(preprocess_dataset, load_from_cache_file=False)
valid_vectorized = valid_raw.map(preprocess_dataset, load_from_cache_file=False)
test_vectorized = test_raw.map(preprocess_dataset, load_from_cache_file=False)


question:
lowered and ascii: do iran and afghanistan speak the same language
tokenized: ['do', 'iran', 'and', 'afghanistan', 'speak', 'the', 'same', 'language']
filtered: ['do', 'iran', 'and', 'afghanistan', 'speak', 'the', 'same', 'language']

passage:
lowered and ascii: persian (/prn, -n/), also known by its endonym farsi ( frsi (fsi) ( listen)), is one of the western iranian languages within the indo-iranian branch of the indo-european language family. it is primarily spoken in iran, afghanistan (officially known as dari since 1958), and tajikistan (officially known as tajiki since the soviet era), and some other regions which historically were persianate societies and considered part of greater iran. it is written in the persian alphabet, a modified variant of the arabic script, which itself evolved from the aramaic alphabet.
tokenized: ['persian', '(', '/prn', ',', '-n/', ')', ',', 'also', 'known', 'by', 'its', 'endonym', 'farsi', '(', 'frsi', '(', 'fsi', ')', '(', 'listen', ')',

Map:   0%|          | 0/8427 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3270 [00:00<?, ? examples/s]

- Check embedding lengths
- Check how many words are not in the vocabulary and, if needed, adjust which pretrained GloVe embeddings are used

In [9]:
print("\nmissed words", len(missed_words), list(missed_words)[:50])

assert len(train_vectorized[0]["passage"][0]) == 300, "Word vector should be 300 long"
assert len(missed_words) > 5, "Too few missed words, preprocessing failed"


missed words 8623 ['decimetre', '/brbeds/', 'degree-level', 'refectories', 'j.s', 'dump-truck', 'posteriore', 'late-16th', 'time-gap', 'egnx', 'member-turned', 'wros', '23,876,155', 'driveclub', 'non-trace', 'trans-meridian', 'panther-platform', 'ethenyl', 'thyrsiflora', 'book-to-film', '19.', 'cruller', "'an", '4,842', 'fransmart', '1812.', 'francisco/daly', '150,682,490', 'press-ready', 'shokugeki', 'algee', 'robertson-dworet', 'stopping.', 'aw4', '738,432', 'shenell', 'pdf/x-1a', 'pgbt', '1.3333.', 'ho-gul', '109,673.', '244,106', '130,000,000', '/nti', 'harberts', 'sochna', 'penitus', 'i-476', 'rs-a', 'coffee-mate']


Convert answer boolean to 1 or 0

In [10]:
train_casted = train_vectorized.cast_column("answer", Value("int32"))
valid_casted = valid_vectorized.cast_column("answer", Value("int32"))
test_casted = test_vectorized.cast_column("answer", Value("int32"))

- Truncating by averaging of passage
    - This is needed as the dataset contains a few very long outliers, which would bloat the input to the model
    - Enforce a maximum length such that ~99% of passages are not truncated: `np.percentile(passages_lengths, 99)`
    - Take the average of what would be cut off and store it as the last vector of the truncated passage
- Padding with 0s for question and passage up to a fixed length `np.pad`
    - This is needed so that every concatenated input has the same length
    - Pad all questions to the maximum length of all questions
    - Pad all passages to the maximum passage length determined previously
- Concatenate question and passage as the input for the model
    - `np.concatenate` is used to have a single input for the model, which is not in an extra dimension as when using `np.stack`
    - Add a separator of 8 zero vectors between question and passage
        - This is needed to be able to differentiate between question and passage

Why not:
- averaging question and passage to a "sentence vector" each and then concatenating
    - The max length of questions is very short, therefore the gain would be minimal while losing more information.
    - Passages are longer, but I assume that by not averaging to one vector more contextual information, such as word order, is preserved, which should result in slightly better results.
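
The truncation, padding and concatenation steps can be traced on a toy example. The sketch below uses plain Python lists instead of NumPy arrays so that it runs without dependencies; `mean_vector`, `truncate_or_pad` and `build_query` are hypothetical helpers that mirror the described logic, not the notebook's code.

```python
def mean_vector(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def truncate_or_pad(vectors, max_len, dim):
    """Return exactly max_len vectors: average the overflow into the last slot, or pad with zeros."""
    if len(vectors) > max_len:
        tail = mean_vector(vectors[max_len - 1:])
        return vectors[:max_len - 1] + [tail]
    padding = [[0.0] * dim for _ in range(max_len - len(vectors))]
    return vectors + padding


def build_query(question, passage, max_q, max_p, dim, sep=8):
    """Pad the question (plus sep zero vectors), fix the passage length, then flatten."""
    q = truncate_or_pad(question, max_q + sep, dim)
    p = truncate_or_pad(passage, max_p, dim)
    return [value for vector in q + p for value in vector]


# Toy data with 2-dimensional vectors (hypothetical values).
question = [[1.0, 1.0]]
passage = [[1.0, 0.0], [2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]
short = truncate_or_pad(passage, 3, dim=2)
query = build_query(question, passage, max_q=2, max_p=3, dim=2, sep=1)
print(short)
print(len(query))
```

With the values from the notebook (21 question tokens, 8 separator vectors, 275 passage tokens, 300 dimensions) the same arithmetic gives (21+8+275)*300 = 91200 values per example.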

In [11]:
# get 99th percentile of passage lengths, which will be used to truncate the data
max_passage_len = int(np.percentile(passage_lens, 99))


question_passage_padding = 8


def truncate_pad_data(row):
    # truncate by averaging too long passages to max_passage_len
    if row["passage"].shape[0] > max_passage_len:
        to_average = row["passage"][max_passage_len - 1 :]
        row["passage"][max_passage_len - 1] = np.mean(to_average, axis=0)
        row["passage"] = row["passage"][:max_passage_len]
    # pad too short passages to max_passage_len
    elif row["passage"].shape[0] < max_passage_len:
        row["passage"] = np.pad(
            row["passage"], ((0, max_passage_len - row["passage"].shape[0]), (0, 0))
        )

    # pad too short questions to max_question_len plus question passage separator
    if row["question"].shape[0] < max_question_len + question_passage_padding:
        row["question"] = np.pad(
            row["question"],
            (
                (
                    0,
                    max_question_len
                    + question_passage_padding
                    - row["question"].shape[0],
                ),
                (0, 0),
            ),
        )

    # concatenate question and passage to a single input
    row["query"] = np.concatenate((row["question"], row["passage"]), axis=0).reshape(-1)
    return row


train_concatenated = train_casted.with_format("np").map(truncate_pad_data)
valid_concatenated = valid_casted.with_format("np").map(truncate_pad_data)
test_concatenated = test_casted.with_format("np").map(truncate_pad_data)

Check second preprocessing pass for correctness.

In [12]:
input_size = (max_question_len + question_passage_padding + max_passage_len) * 300
print(train_concatenated[0]["query"].shape)
print(f"input size: ({max_question_len}+8+{max_passage_len})*300={input_size}")
assert train_concatenated[0]["query"].shape == (
    input_size,
), f"Query shape should be {input_size}"
assert train_concatenated[1]["query"].shape == (
    input_size,
), f"Query shape should be {input_size}"

(91200,)
input size: (21+8+275)*300=91200


Remove unnecessary columns as they are represented in `query`

In [13]:
train = train_concatenated.remove_columns(["question", "passage"])
valid = valid_concatenated.remove_columns(["question", "passage"])
test = test_concatenated.remove_columns(["question", "passage"])

# Model

Predefined requirements:
- RNN
    - LSTM or GRU
- Classifier
    - 2 Layers
    - ReLU

## Network Architecture
- Recurrent layers
    - `torch.nn.GRU` with 2 layers
    - Input Dimension: flattened `query` (question + separator + passage word vectors)
    - Output Dimension: hidden layer dimension
- Classifier layer 1
    - `torch.nn.Linear`
    - Input Dimension: hidden layer dimension
    - Output Dimension: hidden layer dimension
    - Activation: `torch.relu`
- Classifier layer 2 (output)
    - `torch.nn.Linear`
    - Input Dimension: hidden layer dimension
    - Output Dimension: 1
        - Output is probability of class (1 = 100% true, 0 = 0% true)
    - Activation: `torch.sigmoid`
- Normalization: [GloVe word vectors are already normalized](https://github.com/JungeAlexander/GloVe/blob/master/eval/python/evaluate.py#L29-L33)
- Regularization: weight decay in the optimizer and dropout between the GRU layers

A GRU is used because it is simpler than an LSTM and has similar performance, while being faster to train because it has one gate fewer.
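
The size difference can be estimated by counting parameters. The following is a rough sketch, assuming PyTorch's documented layout: per layer, one block of input weights, hidden weights and two bias vectors for each gate (3 blocks for a GRU, 4 for an LSTM). `rnn_parameters` is a hypothetical helper, and hidden size 128 is the default used in the correctness test.

```python
def rnn_parameters(input_size, hidden_size, num_layers, gates):
    """Count parameters of a stacked recurrent layer.

    Per layer: `gates` blocks of input weights, hidden weights and two bias vectors.
    """
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += gates * (hidden_size * (layer_input + hidden_size) + 2 * hidden_size)
        layer_input = hidden_size  # deeper layers read the previous layer's hidden state
    return total


INPUT_SIZE = (21 + 8 + 275) * 300  # 91200, from the preprocessing check
gru_params = rnn_parameters(INPUT_SIZE, 128, num_layers=2, gates=3)
lstm_params = rnn_parameters(INPUT_SIZE, 128, num_layers=2, gates=4)
print(gru_params, lstm_params)
```

The GRU count agrees with the 35.2 M reported in the model summary of the test run. Under these assumptions an LSTM of the same shape would need about a third more parameters.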

### Loss function
Binary Cross-Entropy: 
- Best choice for binary classification problems
- `torch.nn.BCELoss`
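
For reference, binary cross-entropy can be written out by hand. This is a minimal sketch of the formula, not the PyTorch implementation; the floor of -100 on the log terms mirrors the clamping that the `torch.nn.BCELoss` documentation describes, and `binary_cross_entropy` is a hypothetical helper.

```python
import math


def binary_cross_entropy(probabilities, labels, log_floor=-100.0):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) with floored log terms."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        log_p = max(math.log(p), log_floor) if p > 0 else log_floor
        log_not_p = max(math.log(1 - p), log_floor) if p < 1 else log_floor
        total -= y * log_p + (1 - y) * log_not_p
    return total / len(labels)


uncertain = binary_cross_entropy([0.5, 0.5], [1, 0])
confident_wrong = binary_cross_entropy([1e-9, 0.9], [1, 1])
print(uncertain, confident_wrong)
```

A model that always outputs 0.5 has a loss of ln 2 (about 0.69). A mean loss well above that value can only occur when some predictions are confidently wrong, since each of those contributes a very large term.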

### Optimizer
AdamW:
- Works well with less hyperparameter tuning than SGD and plain Adam
- `torch.optim.AdamW`

## Correctness test
Test run of training, validation, test and prediction with 1 input

## Changes to stage 1
- Forgot to define the linear classification layers.
- Split documentation between code cells
- Max epochs 50, as training took longer than expected

## Implementation
The correctness test of the model runs without problems.

In [14]:
class GRUQClassifier(L.LightningModule):
    def __init__(self, hidden_size=128, dropout=0, lr=1e-3, weight_decay=0):
        super().__init__()
        self.lr = lr
        self.hidden = hidden_size
        self.dropout = dropout
        self.weight_decay = weight_decay

        self.loss = torch.nn.BCELoss()

        self.layer_rnn = nn.GRU(
            input_size, hidden_size, num_layers=2, dropout=self.dropout
        )
        self.layer_1 = nn.Linear(hidden_size, hidden_size)
        self.layer_2 = nn.Linear(hidden_size, 1)

        self.save_hyperparameters()

    def configure_optimizers(self):
        return optim.AdamW(
            self.parameters(), lr=self.lr, weight_decay=self.weight_decay
        )

    def forward(self, x):
        x, _ = self.layer_rnn(x)
        x = self.layer_1(x)
        x = torch.relu(x)
        x = self.layer_2(x)
        x = torch.sigmoid(x)
        return x

Correctness test of the model definition, by running the model with one batch.

In [15]:
trainer = L.Trainer(fast_dev_run=True, limit_test_batches=1)
m = GRUQClassifier()
m(torch.from_numpy(train[:1]["query"]))

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.


tensor([[0.5025]], grad_fn=<SigmoidBackward0>)

### Checkpoints
Keep the best checkpoints based on validation balanced accuracy:
- uploaded to wandb for later use

In [16]:
checkpoint = L.pytorch.callbacks.ModelCheckpoint(
    save_top_k=10, monitor="val_balanced_accuracy", mode="max"
)

## Experiments
- Hidden layers dimension (128, 256, 512)
    - To check if more complex models are needed
- Dropout (0, 1e-1, 2e-1)
    - To check how much regularization is needed (avoid under/overfitting)
- Learning rate (1e-3, 1e-4, 1e-5)
    - To check which learning rate is optimal
    - No learning rate scheduler is used, as AdamW already adapts the step size per parameter, with the passed learning rate acting roughly as an upper bound
- Weight Decay (0, 1e-1, 1e-2)
    - To check how much regularization is needed (avoid under/overfitting)
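
The size of the resulting search space can be checked by enumerating all combinations. This is a small sketch with `itertools.product`; it only counts the configurations and makes no assumption about the order in which a wandb grid sweep visits them.

```python
import itertools

grid = {
    "hidden_size": [128, 256, 512],
    "dropout": [0, 1e-1, 2e-1],
    "lr": [1e-3, 1e-4, 1e-5],
    "weight_decay": [0, 1e-1, 1e-2],
}

# One dict per combination, in the shape expected by GRUQClassifier(**config).
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))
print(configs[0])
```

Three values for each of the four hyperparameters give 3x3x3x3 = 81 configurations.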


### Early stop
Compare to the validation balanced accuracy of previous epochs
- wandb sweeps use the Hyperband algorithm
- Max epochs 50
- Check every 10 epochs

In [17]:
experiments = {
    "method": "grid",
    "metric": {"goal": "maximize", "name": "val_balanced_accuracy"},
    "parameters": {
        "hidden_size": {"values": [128, 256, 512]},
        "dropout": {"values": [0, 1e-1, 2e-1]},
        "lr": {"values": [1e-3, 1e-4, 1e-5]},
        "weight_decay": {"values": [0, 1e-1, 1e-2]},
    },
    "early_terminate": {"type": "hyperband", "max_iter": 500, "s": 50, "eta": 3},
}

# Training

- Log training and validation metrics to wandb after every epoch
    - Log at end of epoch by using `training_epoch_end` and `validation_epoch_end`
    - Balanced accuracy
        - Accuracy is not a good metric for imbalanced datasets, as it can be misleading
        - `torchmetrics.functional.classification.accuracy(preds, target, task='multiclass', num_classes=2, average='macro')`
    - Loss
        - Loss should decrease over time
    - Precision Recall curve
        - Show the tradeoff between precision and recall
        - `torchmetrics.functional.classification.precision_recall_curve(pred, target, task='binary')`
    - F1
        - F1 is a better performance measure than accuracy in imbalanced datasets 
        - `torchmetrics.functional.classification.f1_score(preds, target, task='multiclass', num_classes=2, average='macro')`
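
Why plain accuracy is misleading on imbalanced data can be shown with a classifier that always predicts the majority class. The sketch below computes the metrics by hand on toy labels, assuming hard 0/1 predictions (probabilities would have to be thresholded first); `per_class_scores` and `summarize` are hypothetical helpers.

```python
def per_class_scores(predictions, targets, cls):
    """Recall and F1 of one class, treating that class as the positive one."""
    pairs = list(zip(predictions, targets))
    tp = sum(1 for p, t in pairs if p == cls and t == cls)
    fp = sum(1 for p, t in pairs if p == cls and t != cls)
    fn = sum(1 for p, t in pairs if p != cls and t == cls)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def summarize(predictions, targets):
    """Accuracy, balanced accuracy (mean per-class recall) and macro F1."""
    accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
    scores = [per_class_scores(predictions, targets, cls) for cls in (0, 1)]
    balanced_accuracy = sum(recall for recall, _ in scores) / 2
    macro_f1 = sum(f1 for _, f1 in scores) / 2
    return accuracy, balanced_accuracy, macro_f1


targets = [1] * 6 + [0] * 4  # toy labels with a 60% majority class
always_yes = [1] * 10
acc, bal_acc, macro_f1 = summarize(always_yes, targets)
print(acc, bal_acc, macro_f1)
```

The majority-class predictor reaches 60% accuracy on this toy data, but only 50% balanced accuracy and a macro F1 of 0.375, which is why the latter two are better suited as decision metrics here.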

## Changes to stage 1
There was no need to log in the `epoch_end` hooks, as PyTorch Lightning automatically configures the logger to log per epoch when logging from the `step` functions.
https://lightning.ai/docs/pytorch/stable/extensions/logging.html#automatic-logging

## Implementation
Precision/Recall curves were used as there is a class imbalance.

Macro F1 was chosen to reflect the F1 of both classes in one score, instead of only the majority class as in project 1.

Continuing to use the same Jupyter kernel after running the sweep might cause problems. https://docs.wandb.ai/guides/sweeps/start-sweep-agents/#stop-wb-agent

Restarting the kernel before executing the cells below helps.

The correctness test of the train and validation definition runs without problems.

In [18]:
def training_step(self, batch, batch_idx):
    _, _, loss, acc, f1_score, prc = self._get_pred_metrics(batch)
    self.log_dict(
        {
            "train_loss": loss,
            "train_balanced_accuracy": acc,
            "train_f1": f1_score,
            "train_precision": prc[0].mean(),
            "train_recall": prc[1].mean(),
            "train_threshold": prc[2].mean(),
        }
    )
    return loss


def _get_pred_metrics(self, batch):
    answers = batch["answer"]
    pred = self(batch["query"])
    pred = pred.view(-1)

    loss = self.loss(pred, answers.float())
    acc = torchmetrics.functional.classification.accuracy(
        pred, answers, task="multiclass", num_classes=2, average="macro"
    )
    prc = torchmetrics.functional.classification.precision_recall_curve(
        pred, answers, task="binary"
    )
    f1_score = torchmetrics.functional.classification.f1_score(
        pred, answers, task="multiclass", num_classes=2, average="macro"
    )
    return pred, answers, loss, acc, f1_score, prc


GRUQClassifier.training_step = training_step
GRUQClassifier._get_pred_metrics = _get_pred_metrics

- Run validation after every epoch
    - To check how the model is doing on unseen data

In [19]:
def validation_step(self, batch, batch_idx):
    _, _, loss, acc, f1_score, prc = self._get_pred_metrics(batch)
    self.log_dict(
        {
            "val_loss": loss,
            "val_balanced_accuracy": acc,
            "val_f1": f1_score,
            "val_precision": prc[0].mean(),
            "val_recall": prc[1].mean(),
            "val_threshold": prc[2].mean(),
        }
    )
    return loss


GRUQClassifier.validation_step = validation_step

Check if the train and validation was defined correctly

In [20]:
trainer.fit(
    model=m,
    train_dataloaders=utils.data.DataLoader(train, batch_size=4),
    val_dataloaders=utils.data.DataLoader(train, batch_size=4),
)

You are using a CUDA device ('NVIDIA A16') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:654: Checkpoint directory /home/jovyan/NLP/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type    | Params | Mode 
----------------------------------------------
0 | loss      | BCELoss | 0      | train
1 | layer_rnn | GRU     | 35.2 M | train
2 | layer_1   | Linear  | 16.5 K | train
3 | layer_2   | Linear  | 129    | train
----------------------------------------------
35.2 M    Trainable params
0         Non-trainable params
35.2 M    Total params
140.746   Total estimated model params size (MB)
4        

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=1` reached.


- Use wandb sweeps for hyperparameter tuning.
    - Grid search will be used, as the hyperparameter choices are discrete and the search space is not too large (3x3x3x3 = 81 experiments)
    - Manually doing many experiments is tedious therefore use wandb sweeps
    - Best integration into wandb instead of other libraries such as optuna, ray
    - `wandb.sweep`

In [None]:
train_loader = utils.data.DataLoader(
    train, batch_size=64, num_workers=2, pin_memory=True
)
valid_loader = utils.data.DataLoader(
    valid, batch_size=64, num_workers=2, pin_memory=True
)


def sweep():
    with wandb.init(project=WANDB_PROJECT) as run:
        name = f"gru/hidden_size:{wandb.config['hidden_size']}/dropout:{wandb.config['dropout']}/lr:{wandb.config['lr']}/weight_decay:{wandb.config['weight_decay']}"
        run.name = name
        logger = WandbLogger(project=WANDB_PROJECT, log_model="all", name=name)
        classifier = GRUQClassifier(**wandb.config)
        trainer = L.Trainer(
            max_epochs=50,
            logger=logger,
            accelerator="gpu",
            devices=1,
            callbacks=[checkpoint],
        )
        try:
            trainer.fit(
                model=classifier,
                train_dataloaders=train_loader,
                val_dataloaders=valid_loader,
            )
        except Exception as e:
            print(e)
        finally:
            wandb.finish()


sweep_id = wandb.sweep(sweep=experiments, project=WANDB_PROJECT)

wandb.agent(sweep_id, function=sweep)
wandb.teardown()

After all experiments have run, the best run based on validation balanced accuracy is selected as the final model to be evaluated.

Balanced accuracy is the decision metric because, unlike F1, it also includes the negative predictions. We are also interested in the negatives, because they have to be predicted correctly for question answering as well.

In [22]:
# check wandb for sweep id, if the notebook lost the variable
SWEEP_ID = "glglzbew"

api = wandb.Api()
sweep = api.sweep(f"yelin-zhang-hslu/{WANDB_PROJECT}/{SWEEP_ID}")
runs = sorted(
    sweep.runs,
    key=lambda run: run.summary.get("val_balanced_accuracy", 0),
    reverse=True,
)
val_acc = runs[0].summary.get("val_balanced_accuracy", 0)
print(f"Best run {runs[0].name} with {val_acc} validation balanced accuracy")

Best run gru/hidden_size:256/dropout:0/lr:0.001/weight_decay:0 with 0.5568439364433289 validation balanced accuracy


# Evaluation
Most metric implementations reuse the code from the training phase, as the metrics are the same.

Additionally, the accuracy and the confusion matrix are examined. Both are only implemented for the evaluation step.
- Accuracy makes it possible to compare the model to the previous project
    - As well as to check how it compares to the dataset imbalance
    - Additionally, accuracy is easier to understand as a metric than balanced accuracy
    - `torchmetrics.functional.classification.accuracy(preds, target, task='binary')`
- The confusion matrix shows where the model tends to make mistakes.
    - Whether it only predicts one class or if it mixes in predictions of the other class
    - `torchmetrics.functional.confusion_matrix(preds, target, task='binary')`
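
How the confusion matrix and the accuracy relate can be sketched without any library. The example assumes a decision threshold of 0.5 on the predicted probabilities; `confusion_matrix` here is a hypothetical plain-Python helper, not the torchmetrics function, and the inputs are toy values.

```python
def confusion_matrix(probabilities, targets, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] (rows: true class, columns: predicted class)."""
    matrix = [[0, 0], [0, 0]]
    for p, t in zip(probabilities, targets):
        predicted = 1 if p > threshold else 0
        matrix[t][predicted] += 1
    return matrix


probs = [0.9, 0.8, 0.3, 0.6, 0.1, 0.7]  # toy model outputs
labels = [1, 1, 1, 0, 0, 0]
cm = confusion_matrix(probs, labels)
accuracy = (cm[0][0] + cm[1][1]) / len(labels)
print(cm, accuracy)
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of examples. A model that only predicts one class would leave one column of the matrix empty.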

Metrics used for evaluation:
- Accuracy
- Balanced Accuracy
- Precision
- Recall
- F1
- Confusion matrix

No parameters will be changed after the final model has been evaluated, as that would be train-test leakage.

## Implementation
Implementing the confusion matrix in the PyTorch Lightning hooks is difficult with wandb, as there seems to be no good way to log the values incrementally.
Therefore it is implemented separately with `predict`.

## Result
Evaluating the top 10 hyperparameter configurations of the experiments revealed:
- Hidden size did not matter much, as all sizes were represented and the best performing was not the largest
- Larger learning rates had better performance
- Models with lower dropout performed better
- Models with lower weight decay also performed better

Overall, the best model uses hidden size 256, learning rate 0.001, dropout 0 and weight decay 0.

It reaches the following performance:
- Balanced accuracy
    - Train: 83%
    - Validation: 55%
    - Test: 54%
- F1
    - Train: 81%
    - Validation: 45%
    - Test: 44%
- Loss
    - Train: 0
    - Validation: 6.5
    - Test: 7.1
- Precision
    - Train: 80%
    - Validation: 66%
    - Test: 69%
- Recall
    - Train: 93%
    - Validation: 64%
    - Test: 65%
- Accuracy
    - Test: 64%

With the following confusion matrix (switch to light mode if it does not look right in dark mode):

![Confusion matrix](https://images2.imgbox.com/b8/49/xUx4cV9k_o.png)

The model managed to predict many of the not answerable questions correctly and did not overfit to the majority class.

Validation and test performance are very close, which is good, as it means the model did not overfit to the validation data. However, there is a large gap between training and validation performance, which seems to happen only for the best runs. The runs with the 2nd and 3rd highest validation accuracy also have test accuracy close to their validation accuracy.

In [23]:
def test_step(self, batch, batch_idx):
    preds, answers, loss, balanced_acc, f1_score, prc = self._get_pred_metrics(batch)
    # Plain accuracy is additionally computed for the test set
    acc = torchmetrics.functional.classification.accuracy(
        preds, answers, task="binary", num_classes=2
    )
    # prc holds (precision, recall, thresholds); each is averaged to a scalar for logging
    self.log_dict(
        {
            "test_loss": loss,
            "test_accuracy": acc,
            "test_balanced_accuracy": balanced_acc,
            "test_f1": f1_score,
            "test_precision": prc[0].mean(),
            "test_recall": prc[1].mean(),
            "test_threshold": prc[2].mean(),
        }
    )
    return loss


def predict_step(self, batch, batch_idx, dataloader_idx=0):
    return self(batch["query"])


GRUQClassifier.test_step = test_step
GRUQClassifier.predict_step = predict_step

Check that the implementations of test and predict are correct.

In [24]:
trainer.test(model=m, dataloaders=utils.data.DataLoader(train, batch_size=4))
trainer.predict(model=m, dataloaders=utils.data.DataLoader(train, batch_size=4))

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Testing: |          | 0/? [00:00<?, ?it/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy                 1.0
 test_balanced_accuracy             0.0
         test_f1                    0.0
        test_loss           0.5471592545509338
     test_precision                 1.0
       test_recall                  0.5
     test_threshold          0.578847348690033
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


/opt/conda/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Predicting: |          | 0/? [00:00<?, ?it/s]

[tensor([[0.5503],
         [0.5826],
         [0.5876],
         [0.5949]])]

Load the best model from wandb artifact registry.

In [25]:
checkpoint_reference = f"yelin-zhang-hslu/{WANDB_PROJECT}/model-{runs[0].id}:best"

artifact = api.artifact(checkpoint_reference).download()
wandb.init(project=WANDB_PROJECT, id=runs[0].id, resume="allow")

classifier_final = GRUQClassifier.load_from_checkpoint(Path(artifact) / "model.ckpt")
eval_logger = WandbLogger(project=WANDB_PROJECT, log_model="all")
eval_trainer = L.Trainer(
    logger=eval_logger,
    accelerator="gpu",
    devices=1,
)

[34m[1mwandb[0m: Downloading large artifact model-y5epoh0i:best, 809.12MB. 1 files... 
[34m[1mwandb[0m:   1 of 1 files downloaded.  
Done. 0:0:0.8


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Implement confusion matrix calculation and logging.

In [31]:
def create_confusion_matrix(trainer, classifier, dataset, name):
    dataloader = utils.data.DataLoader(
        dataset, batch_size=64, num_workers=2, pin_memory=True
    )
    pred_batches = trainer.predict(classifier, dataloaders=dataloader)
    preds = np.array([])
    for batch in pred_batches:
        # reshape(-1) keeps a 1-D array even when the last batch holds a single sample
        preds = np.concatenate((preds, batch.reshape(-1).numpy()))
    conf = wandb.plot.confusion_matrix(
        y_true=dataset["answer"],
        # Round the probabilities to hard 0/1 predictions
        preds=np.round(preds),
        class_names=["Incorrect", "Correct"],
    )
    # Log under the given name so validation and test matrices stay separate
    wandb.log({name: conf})

Run evaluation of final model with test and validation dataset.

In [32]:
eval_trainer.validate(classifier_final, dataloaders=valid_loader)
create_confusion_matrix(eval_trainer, classifier_final, valid, "valid_conf_matrix")

test_loader = utils.data.DataLoader(test, batch_size=64, num_workers=2, pin_memory=True)
eval_trainer.test(classifier_final, dataloaders=test_loader)
create_confusion_matrix(eval_trainer, classifier_final, test, "test_conf_matrix")
wandb.finish()

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation: |          | 0/? [00:00<?, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Validate metric           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  val_balanced_accuracy     0.5576755404472351
         val_f1             0.4544157087802887
        val_loss             6.581948757171631
      val_precision         0.6660197377204895
       val_recall           0.6403718590736389
      val_threshold         0.6956357359886169
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: |          | 0/? [00:00<?, ?it/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: |          | 0/? [00:00<?, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy          0.639755368232727
 test_balanced_accuracy     0.5482339262962341
         test_f1            0.4425245523452759
        test_loss            7.132151126861572
     test_precision         0.6943557262420654
       test_recall          0.6556441783905029
     test_threshold         0.6782917976379395
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: |          | 0/? [00:00<?, ?it/s]

# Interpretation
Expectation:
- 55% balanced accuracy on the test dataset, as this would be better than randomly guessing whether the answer to the question is true or false.
- 65% accuracy on the test dataset, as this would be better than the test label imbalance of 62.2% true labels.

## Results
Overall, the model performed about as expected, with a test balanced accuracy of 54.82% and an accuracy of 63.97%. The model learned more of the complexity of cases where the question is not answered by the passage (minority class) than when only linear layers were used. Using only 2 RNN layers and 2 linear layers is not enough to completely capture the complexity of the question answering task: many of the better performing runs seem to be overfitting, as both loss and accuracy rise in the validation step.
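
To put these numbers in context, the following sketch compares the measured test scores with a classifier that always predicts the majority class. The 62.2% share of true labels and the test scores are taken from this report; the function name is hypothetical.

```python
def majority_baseline(positive_rate):
    """Scores of a classifier that always predicts the more frequent class."""
    # Accuracy equals the share of the majority class
    accuracy = max(positive_rate, 1 - positive_rate)
    # Recall is 1.0 on the predicted class and 0.0 on the other class
    balanced_accuracy = (1.0 + 0.0) / 2
    return accuracy, balanced_accuracy


baseline_acc, baseline_bal_acc = majority_baseline(0.622)

# Measured test scores from the evaluation run
test_acc, test_bal_acc = 0.6398, 0.5482

print(f"accuracy margin over baseline: {test_acc - baseline_acc:+.4f}")
print(f"balanced accuracy margin over baseline: {test_bal_acc - baseline_bal_acc:+.4f}")
```

Both margins are positive but small, which matches the reading that the model is only slightly better than the trivial baseline.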

The final performance is slightly worse than expected, but the number of epochs used for training was also reduced from 500 to 50. In that context the performance is very satisfactory, as less time was needed than expected to reach this level of performance.

Precision and recall seem balanced as the confusion matrix reflects.

The experiments are interpreted as follows:

One key point is that the maximum number of epochs was set to 50; therefore, many of the measures against overfitting contributed negatively to the accuracy in this scenario.
If more epochs were used, dropout and weight decay would become more important.

The learning rate is also affected by the low number of epochs, as a smaller learning rate needs more time to reach the same accuracy as a larger one.

The hidden size of the model did not seem to matter, as the input to the model could not capture the complexity of the question and passage.

## Learning
Most of the decisions turned out to be decent.

Using GloVe instead of fasttext seemed to be better, as I did not need the subword abilities of fasttext.

The description of the preprocessing can be improved, as the question-passage separator was misinterpreted as literal 0 characters rather than 0 vectors used as the separator.
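
A minimal sketch of the intended separator, assuming it means a single all-zero embedding row placed between the question and passage embeddings. The function name, the sequence lengths and the random stand-in embeddings are hypothetical; only the embedding size of 300 comes from this project.

```python
import numpy as np

EMBEDDING_DIM = 300  # GloVe vector size used in this project


def join_with_zero_separator(question_vecs, passage_vecs):
    """Concatenate question and passage embeddings with one zero vector in between."""
    separator = np.zeros((1, EMBEDDING_DIM), dtype=question_vecs.dtype)
    return np.concatenate([question_vecs, separator, passage_vecs], axis=0)


# Random stand-ins for embedded tokens (5 question tokens, 12 passage tokens)
rng = np.random.default_rng(0)
question = rng.normal(size=(5, EMBEDDING_DIM)).astype(np.float32)
passage = rng.normal(size=(12, EMBEDDING_DIM)).astype(np.float32)

combined = join_with_zero_separator(question, passage)
print(combined.shape)  # (18, 300)
```

In contrast, inserting the literal token `"0"` would look up the GloVe embedding of that token, which is generally not a zero vector.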

I misread the project task and forgot to define the linear layers in stage 1, which was a blunder but not very devastating.

One point where I experienced problems was the checkpoints that are created. Even though the number of checkpoints was restricted in comparison to project 1, I managed to fill up the 100GB of Weights & Biases storage with 2 full sweeps.
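
As a rough back-of-the-envelope estimate, the sketch below shows how quickly checkpoints of this size add up. It assumes decimal units (1GB = 1000MB) and that every checkpoint is as large as the 809.12MB artifact of the best model; checkpoints of other configurations may differ in size. The function name is hypothetical.

```python
def checkpoints_until_full(quota_gb, checkpoint_mb):
    """Number of whole checkpoints of a given size that fit into a storage quota."""
    # Assumes decimal units: 1 GB = 1000 MB
    return int(quota_gb * 1000 // checkpoint_mb)


print(checkpoints_until_full(100, 809.12))  # 123
```

Under these assumptions about 123 checkpoints fill the quota, which two sweeps can reach if each run uploads more than one checkpoint.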