# Project 4: Pretrained Transformer BoolQ

The documentation is split into small chunks following the suggestion in class and from feedback for previous projects.

# Introduction

Classification of BoolQ with Pretrained Transformers.


W&B Link: TODO

# Setup
Preliminary steps for getting the project running.

## Tools used
- GPUHub JupyterLab
- No AI tools used, as they do not help with reading API documentation and GitHub issues 
- Documentation from previous projects

## Dependencies
The notebook was created with:
Python 

Install all necessary dependencies
- PyTorch: `torch`
- Hugging Face: `huggingface_hub transformers datasets`
- Weights & Biases: `wandb`
- numpy: `numpy`
- scikit-learn: `scikit-learn`
- Lint and Formatting: `ruff`

Versions of dependencies are pinned for reproducibility, except `transformers`, which is installed unpinned.

In [None]:
%pip install torch==2.3.1 huggingface_hub==0.25.2 transformers datasets==3.0.1 wandb==0.18.3 numpy==1.26.4 scikit-learn==1.5.2 ruff==0.6.9

## Notebook setup
Import all necessary libraries.

In [None]:
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
import numpy as np
from datasets import Value
from datasets import load_dataset
import wandb

Log into Weights & Biases and set the project and notebook name for the runs.

In [None]:
WANDB_PROJECT = "nlp-project-4"
os.environ["WANDB_PROJECT"] = WANDB_PROJECT
os.environ["WANDB_NOTEBOOK_NAME"] = "project4-stage2"
wandb.login()

# Preprocessing

Predefined requirements:
- Download the BoolQ dataset with `datasets` and split it in the predefined way.
- Train / Validation / Test split

Used features:
- `question` and `passage` as input to the model
- `answer` as label

Input format:
- concatenated `question` and `passage` strings
- with a special separator token from the model vocabulary in the middle to differentiate between them
- question before passage, because the question has to be known first in order to answer it

Label format:
- convert the `answer` boolean to 1 or 0
- the model output is the probability of class 1

Batch size: 64 for faster training than with individual samples

Most classic preprocessing steps are not needed, because the pretrained tokenizer of the model does most of the work. The input is the raw text without any changes.
The tokenizer does not do any stemming, stopword removal, lowercasing or format cleaning. For unknown words the special token `[UNK]` is used.
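
As a minimal sketch of the input and label format described above, the raw strings are joined without any cleaning and the boolean is mapped to a float. The helper names `build_query` and `to_label` and the example row are hypothetical and only serve as illustration.

```python
def build_query(question: str, passage: str, sep_token: str = "[SEP]") -> str:
    """Join question and passage with the separator token, question first."""
    # No stemming, lowercasing or stopword removal: the raw text is kept as is.
    return question + sep_token + passage


def to_label(answer: bool) -> float:
    """Map the boolean answer to 1.0 (true) or 0.0 (false)."""
    return float(answer)


# Made-up example row, for illustration only.
example = {
    "question": "is the sky blue",
    "passage": "The sky appears blue because of Rayleigh scattering.",
    "answer": True,
}
query = build_query(example["question"], example["passage"])
label = to_label(example["answer"])
print(query)
print(label)
```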

## Correctness tests
- Check processed passages and questions if they still make sense 

## Implementation
TODO

Download and split dataset in predefined way

In [None]:
train_raw = load_dataset("google/boolq", split="train[:-1000]")
valid_raw = load_dataset("google/boolq", split="train[-1000:]")
test_raw = load_dataset("google/boolq", split="validation")

print(len(train_raw), len(valid_raw), len(test_raw))

- Concatenate `question` and `passage` into `query`
    - Add the special separator token `[SEP]` between them to distinguish the two texts from one another
- Tokenize the query with `DebertaV2Tokenizer`
    - It handles the tokenization of the text with `SentencePiece` as well as the conversion of tokens to input IDs; the mapping to embedding vectors happens inside the model.
    - `SentencePiece` is a subword tokenizer, which learns how to split the text into subwords
    - Padding to the maximum input length of each batch is done by the `DataCollatorWithPadding` later
    - Truncation should not be needed, as the maximum input length is quite large (over the usual 512)
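
The idea of padding each batch only to its own longest sequence can be sketched in plain Python. This is an illustration of the principle, not the implementation of `DataCollatorWithPadding`; the helper `pad_batch`, the padding ID `0` and the token IDs are assumptions made for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to the length of its longest member."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding, mirroring an attention mask.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs of different lengths.
batch = [[1, 17, 42, 2], [1, 8, 2], [1, 5, 9, 11, 3, 2]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padded length depends on the batch, short batches stay short, which saves computation compared to padding everything to one global maximum.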

Convert the `answer` boolean to 1.0 or 0.0 by casting it to `float32`, because the model outputs the probability of class 1 and the loss expects float targets.

In [None]:
train_casted = train_raw.cast_column("answer", Value("float32"))
valid_casted = valid_raw.cast_column("answer", Value("float32"))
test_casted = test_raw.cast_column("answer", Value("float32"))

In [None]:
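# Hub ID of the pretrained encoder; models on the Hugging Face Hub are addressed as "<organization>/<name>".
MODEL = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL)


def preprocess_dataset(row):
    # Question first, then the passage, separated by the [SEP] token.
    query = row["question"] + "[SEP]" + row["passage"]
    tokenized = tokenizer(query)
    # Wrap the label in a list so that a batch of labels has shape (batch, 1),
    # matching the single output of the classifier.
    tokenized["labels"] = [row["answer"]]
    return tokenized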

train_concatenated = train_casted.map(preprocess_dataset)
valid_concatenated = valid_casted.map(preprocess_dataset)
test_concatenated = test_casted.map(preprocess_dataset)

Remove the unnecessary `question` and `passage` columns, as their content is already represented in the tokenized `query`.

In [None]:
train = train_concatenated.remove_columns(["question", "passage"])
valid = valid_concatenated.remove_columns(["question", "passage"])
test = test_concatenated.remove_columns(["question", "passage"])

# Model
Predefined requirements:
- pretrained Transformer encoder (from Hugging Face; it must not be finetuned on the BoolQ dataset yet)
- Classifier
    - 2 Layers
    - ReLU

## Network Architecture
- Transformer encoder
    - `DeBERTa-v3-large`
    - Input Dimension: Batch size x sequence length (token IDs from a vocabulary of 128100 entries)
    - Output Dimension: Batch size x 768
    - It was chosen as the latest version of the `DeBERTa` series, which improves on the original `BERT` model.
        - It brings improvements in attention and pretraining over `BERT` and `RoBERTa`.
        - The `large` variant is used because I assume it will fit into memory and perform better than the base variant
        - The `v2` tokenizer is compatible with the `v3` model
- Classifier
    - `nn.Linear`
        - Input Dimension: 768 (Pooled output of `DeBERTa`)
        - Output Dimension: 256
        - Activation: `torch.nn.ReLU`
        - Output shape is smaller than input, to encourage the model to learn more abstract features for the classification
    - `nn.Dropout`
        - Dropout rate: 0.1
        - Dropout is used to prevent overfitting, which happens easily with such a small dataset
    - `nn.Linear`
        - Input Dimension: 256
        - Output Dimension: 1
            - Output is probability of class (1 = 100% true, 0 = 0% true)
        - Final Activation: `torch.nn.Sigmoid`

- Normalization: Done in the `DeBERTa` model with their Masked Layer Normalization
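- Regularization: The `AdamW` optimizer applies decoupled weight decay to the weights (when a non-zero `weight_decay` is set) instead of adding an L2 penalty to the loss; no separate regularization layer is added to `DeBERTa`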

### Loss function
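Default of the `transformers` library for this problem type: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`):
- Not changed, because binary cross-entropy is the standard choice for binary classification problems
- The variant with logits applies the sigmoid inside the loss, which is numerically more stable than a separate sigmoid followed by plain binary cross-entropy
- Because this loss applies a sigmoid itself, it expects raw logits as model output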
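
The stability argument can be illustrated with a small pure-Python sketch. It uses the commonly cited rearrangement `max(x, 0) - x*y + log(1 + exp(-|x|))` of the loss for a logit `x` and a target `y`; the function names are chosen for this example and are not taken from any library.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(p: float, y: float) -> float:
    """Plain binary cross-entropy on a probability p."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(x: float, y: float) -> float:
    """Binary cross-entropy computed directly from the logit x.

    The rearranged form never takes the logarithm of a value that
    has been rounded to 0 and never exponentiates a large positive number.
    """
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# For moderate logits both variants agree.
moderate = (bce_from_probability(sigmoid(2.0), 1.0), bce_with_logits(2.0, 1.0))

# For an extreme logit the sigmoid rounds to exactly 1.0 in float64, so the
# plain version would have to evaluate log(0); the logit version stays finite.
extreme = bce_with_logits(40.0, 0.0)
print(moderate, extreme)
```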

### Optimizer
Default by transformers library: `AdamW`
- Not changed because it performs well and the original `DeBERTa` was also trained with a version of `AdamW`

## Correctness test
- Test run of training, validation, test and prediction with 1 input
- Check transformer encoder output shapes

## Implementation

In [None]:
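model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, problem_type="multi_label_classification", num_labels=1
)
# Replace the default classification head with the two-layer classifier described above.
model.classifier = torch.nn.Sequential(
    torch.nn.Linear(768, 256),  # pooled encoder output -> hidden layer
    torch.nn.ReLU(),
    torch.nn.Dropout(0.1),
    torch.nn.Linear(256, 1),  # hidden layer -> single output
    torch.nn.Sigmoid(),
)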

Correctness test of the model definition, by running the model with one batch.
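
A shape check of the classifier head alone could look roughly like the following sketch. It assumes the layer sizes from the architecture section (768, 256, 1) and feeds a random tensor in place of the pooled encoder output, so it does not need the pretrained weights; the names `head` and `pooled` are illustrative.

```python
import torch

hidden_size = 768  # pooled encoder output size as stated in the architecture section

# Stand-alone copy of the two-layer classifier head.
head = torch.nn.Sequential(
    torch.nn.Linear(hidden_size, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.1),
    torch.nn.Linear(256, 1),
    torch.nn.Sigmoid(),
)
head.eval()  # disable dropout for the check

# Random stand-in for the pooled encoder output of a batch of 4 samples.
pooled = torch.randn(4, hidden_size)
with torch.no_grad():
    probs = head(pooled)

print(probs.shape)
```

A mismatch between the two `Linear` layers would already raise an error in this check, before any expensive training run is started.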

### Checkpoints
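Checkpoints are uploaded to W&B by `transformers.integrations.WandbCallback`, which is configured through the `WANDB_LOG_MODEL` environment variable. The value `checkpoint` uploads a checkpoint at every save step, while `end` would upload only the final model at the end of training. Further configuration follows later in `TrainingArguments`.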

In [None]:
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

## Experiments
Various learning rates (from 1 down to 1e-10), sampled from a uniform distribution
- Extensive sweep to check which learning rate is optimal
- No additional learning rate scheduler is configured. `AdamW` adapts the step size per parameter on its own, with the passed learning rate acting as the upper bound. Note that `Trainer` applies a linear decay schedule by default.
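
The range described above could be expressed as a continuous sweep parameter roughly as follows. This is a sketch of the W&B sweep configuration format, not the configuration used for the runs; the variable name `lr_search_space` is hypothetical.

```python
# Continuous search space for the learning rate, sampled uniformly
# between the bounds stated above.
lr_search_space = {
    "lr": {"distribution": "uniform", "min": 1e-10, "max": 1.0},
}

# Share of a uniform distribution over this range that lies above 0.1.
bounds = lr_search_space["lr"]
share_above_0_1 = (bounds["max"] - 0.1) / (bounds["max"] - bounds["min"])
print(round(share_above_0_1, 3))
```

A uniform distribution over this range draws about 90% of its samples from values above 0.1. W&B also offers the distribution `log_uniform_values`, which spreads the samples evenly across orders of magnitude.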

No other experiments are done, because there are not many other parameters that can be changed when using a pretrained model. Additionally, finding the optimal learning rate is the most important part of training the model.
- The classifier hidden sizes are not changed, as they are intended to be decreasing, so that the model compresses the information before the binary classification.
- No weight decay, because it might affect the pretrained model negatively
- Learning rate warm-up is not done, because the maximum number of epochs is relatively low


### Early stop
Early stopping is done by wandb sweeps.
- Compare against the validation loss of previous epochs
- wandb sweeps use the Hyperband algorithm
- Max epochs 50
- Check every 10 epochs
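
To my understanding of the W&B documentation, Hyperband early termination compares runs at a set of brackets derived from `max_iter`, `eta` and the number of brackets `s`, rather than at a fixed interval. The sketch below assumes the brackets are `max_iter / eta**k` for `k = 1..s`; the helper `hyperband_brackets` is written for this illustration and is not part of `wandb`.

```python
def hyperband_brackets(max_iter: int, s: int, eta: float = 3.0) -> list[float]:
    """Iterations at which runs are compared, assuming max_iter / eta**k for k = 1..s."""
    return sorted(max_iter / eta**k for k in range(1, s + 1))


# With max_iter=27 and s=2 the assumed formula yields the brackets 3 and 9.
small_example = hyperband_brackets(27, 2)
print(small_example)

# First three brackets for max_iter=50 and eta=3.
project_example = [round(b, 2) for b in hyperband_brackets(50, 3)]
print(project_example)
```

Under this assumption, an iteration corresponds to one logged value of the sweep metric, which is one epoch when metrics are logged per epoch.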

In [None]:
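experiments = {
    "method": "bayes",  # W&B names the Bayesian search method "bayes"
    # `Trainer` reports the validation loss to W&B under the key "eval/loss".
    "metric": {"goal": "minimize", "name": "eval/loss"},
    "parameters": {
        "lr": {"values": [1e-4, 1e-5]},
    },
    "early_terminate": {"type": "hyperband", "max_iter": 50, "s": 10, "eta": 3},
}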

# Training
Training is done with the `Trainer` class from the `transformers` library.
Configure training and evaluation with `TrainingArguments`. 
- set `seed` for reproducibility
- `logging_strategy = 'epoch'` to log metrics after each epoch
- `eval_strategy = 'epoch'` to evaluate after each epoch
- `save_strategy = 'steps'` to save after every 500 steps (the default of `save_steps`)
- `save_total_limit = 3` to keep only the last 3 checkpoints
- `label_names` is left at its default, because the label is stored under the key `labels` during preprocessing, which `Trainer` picks up automatically
- `report_to = 'wandb'` to log metrics to wandb
- `run_name` to set the name of the run to an informative name
- `dataloader_num_workers = 4` to speed up data loading
- `per_device_train_batch_size = 64` to set the desired batch size for training
- `per_device_eval_batch_size = 64` to set the desired batch size for evaluation
- `num_train_epochs = 50` to set the desired max epochs

Metrics for training and validation:
- Accuracy, because we are interested in both correct true and false predictions
- Loss, to see how confident the model is in its predictions
- Metrics are logged every epoch, because logging per step is very noisy and offers no benefit.

Loss is the main metric for all decisions, as it is the quantity the model is trained to minimize. Accuracy should follow loss in a correct model. Therefore, it is not necessary to optimize for accuracy.

As discussed in class, no other metrics are needed for training and validation, as accuracy and loss are sufficient to evaluate which model is the best.

Accuracy has to be implemented separately for training and evaluation, because `Trainer` from `transformers` only logs the loss by default.

## Implementation

In [None]:
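train_args = TrainingArguments(
    output_dir="checkpoints",  # assumed directory name for local checkpoints
    seed=42,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="steps",
    save_total_limit=3,
    report_to="wandb",
    dataloader_num_workers=4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=50,
)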

- Use `wandb.sweep` for hyperparameter tuning.
- Bayesian search is used, because there is only one hyperparameter and it is continuous

In [None]:
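def compute_accuracy(eval_pred):
    # The classifier head ends in a sigmoid, so predictions are probabilities.
    probs, labels = eval_pred
    preds = np.asarray(probs).reshape(-1) > 0.5
    targets = np.asarray(labels).reshape(-1) > 0.5
    return {"accuracy": float((preds == targets).mean())}


def sweep():
    # The context manager finishes the run on exit, so no explicit wandb.finish() is needed.
    with wandb.init(project=WANDB_PROJECT) as run:
        run.name = f"lr:{wandb.config['lr']}"

        train_args = TrainingArguments(
            output_dir="checkpoints",  # assumed directory name for local checkpoints
            learning_rate=wandb.config["lr"],
            seed=42,
            logging_strategy="epoch",
            eval_strategy="epoch",
            save_strategy="steps",
            save_total_limit=3,
            report_to="wandb",
            dataloader_num_workers=4,
            per_device_train_batch_size=64,
            per_device_eval_batch_size=64,
            num_train_epochs=50,
        )
        # Note: `model` is the module-level model and is shared by all runs of this agent.
        trainer = Trainer(
            model,
            train_args,
            train_dataset=train,
            eval_dataset=valid,
            tokenizer=tokenizer,
            compute_metrics=compute_accuracy,
        )

        try:
            trainer.train()
            trainer.evaluate()
        except Exception as e:
            print(e)


sweep_id = wandb.sweep(sweep=experiments, project=WANDB_PROJECT)

wandb.agent(sweep_id, function=sweep, project=WANDB_PROJECT)
wandb.teardown()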

After all experiments have run, select the run with the smallest validation loss as the final model to be evaluated.

In [None]:
# check wandb for sweep id
SWEEP_ID = "zmgqc6oi"

api = wandb.Api()
sweep = api.sweep(f"yelin-zhang-hslu/{WANDB_PROJECT}/{SWEEP_ID}")
# `Trainer` reports the validation loss to W&B under the key "eval/loss".
runs = sorted(
    sweep.runs,
    key=lambda run: run.summary.get("eval/loss", 99),
    reverse=False,
)
val_loss = runs[0].summary.get("eval/loss", 99)
print(f"Best run {runs[0].name} with {val_loss} validation loss")

# Evaluation
Metrics:
- Accuracy
    - to be able to compare the model to the previous projects
    - As well as to check how it compares to the dataset imbalance
    - `torchmetrics.functional.classification.accuracy(preds, target, task='binary')`
- Confusion matrix
    - To be able to see where the model tends to make mistakes.
    - `torchmetrics.functional.confusion_matrix(preds, target, num_classes=2)`
    - As discussed in class: use scikit-learn instead of wandb, as it is easier to interpret
- Total false predictions
    - To see how many false predictions the model made

The averaging of the metrics is the default, `micro`, which means the metrics are calculated over all samples without weighting by class.
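
How the three evaluation metrics relate to each other can be sketched in plain Python. The predictions and labels below are made up for illustration, and the helper `confusion_counts` is not part of the notebook.

```python
def confusion_counts(preds, targets):
    """Count true/false positives and negatives for binary labels (1 = true)."""
    tp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(preds, targets) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(preds, targets) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, targets) if p == 0 and t == 1)
    return tp, tn, fp, fn


# Made-up predictions and labels, for illustration only.
preds = [1, 1, 0, 1, 0, 1]
targets = [1, 0, 0, 1, 1, 1]

tp, tn, fp, fn = confusion_counts(preds, targets)
accuracy = (tp + tn) / len(targets)  # micro: every sample counts equally
total_false = fp + fn  # total false predictions
print(tp, tn, fp, fn, accuracy, total_false)
```

The four counts are exactly the cells of the confusion matrix, so accuracy and the total number of false predictions can both be read off it.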

Evaluation will also be done with the `Trainer` class, just using the `evaluate` method and the test dataset.

## Implementation

## Result


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay


def create_confusion_matrix(trainer, dataset):
    # `Trainer.predict` batches the dataset itself and returns predictions and labels.
    output = trainer.predict(dataset)
    # The classifier head ends in a sigmoid, so predictions are probabilities.
    probs = np.asarray(output.predictions).reshape(-1)
    labels = np.asarray(output.label_ids).reshape(-1)
    matrix = ConfusionMatrixDisplay.from_predictions(
        y_true=labels.astype(int),
        y_pred=(probs > 0.5).astype(int),
    )
    return matrix

Check whether the implementations for test and predict are correct by running them once and inspecting the output.

In [None]:
checkpoint_reference = f"yelin-zhang-hslu/{WANDB_PROJECT}/model-{runs[0].id}:best"

artifact = api.artifact(checkpoint_reference).download()
wandb.init(project=WANDB_PROJECT, id=runs[0].id, resume="allow")

# classifier_final = TransformerClassifier.load_from_checkpoint(Path(artifact) / "model.ckpt")

Load the best model from wandb artifact registry.

Implement confusion matrix calculation.

Run evaluation of final model with test and validation dataset.

# Interpretation
Expectation:
70% accuracy on the test dataset. The expectation is that a good pretrained transformer encoder model (e.g. from the BERT family) has already learned the semantics of words and sentences. Therefore, the classification of the questions should be easier.

## Results

## Learning