# Project 5: LLM

The documentation is split into small chunks following the suggestion in class and from feedback for previous projects.

# Introduction



W&B Link: TODO

# Setup
Preliminary steps for getting the project running.

## Tools used
- GPUHub JupyterLab
- No AI tools used, as they do not help with reading API documentation and GitHub issues 
- Documentation from previous projects

## Dependencies
The notebook was created with:
Python 

Install all necessary dependencies:
- PyTorch: `torch`
- Hugging Face: `huggingface_hub transformers datasets peft`
- Weights & Biases: `wandb`
- NumPy: `numpy`
- scikit-learn: `scikit-learn`
- Lint and Formatting: `ruff`

Versions of dependencies are pinned for reproducibility.

## Notebook setup
Import all necessary libraries.

Log into Hugging Face and Weights & Biases.

# Preprocessing

Predefined requirements:
- Download the BoolQ dataset with `datasets` and split it in the predefined way.
- Train / Validation / Test split

Used features:
- `question` and `passage` as input to the model
- `answer` as label

Input format:
- concatenated `question` and `passage` strings
- with a special separator token from the model vocabulary in the middle to differentiate between them
- question before passage, because the question has to be known first in order to answer it from the passage

Label format:
- convert `answer` boolean to 1 or 0
- Model output is probability of 1
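
The input and label format above can be sketched as a small helper. This is a minimal illustration, not the notebook's code: the function name `build_example` and the sample question and passage are made up, and the default separator string is an assumption. In practice the separator would be whichever special token the chosen model's vocabulary provides.

```python
def build_example(question: str, passage: str, answer: bool, sep_token: str = "[SEP]") -> dict:
    """Concatenate question and passage and convert the boolean answer to 0/1."""
    # Question first, then the separator, then the passage.
    query = f"{question} {sep_token} {passage}"
    return {"query": query, "label": int(answer)}


# Made-up sample, only to show the resulting format.
example = build_example(
    question="is the sky blue",
    passage="The sky appears blue because of Rayleigh scattering.",
    answer=True,
)
print(example)
```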

Batch size: 64, because batched training is faster than processing individual samples

Many preprocessing steps are not needed, because the model's predefined tokenizer does most of the work. The input format is the raw text without any changes.
The tokenizer does not do any stemming, stopword removal, lowercasing, or format cleaning. For unknown words a special token `[UNK]` is used.

## Correctness tests
- Check whether the processed passages and questions still make sense

## Implementation
TODO

Download and split dataset in predefined way

Convert answer boolean to 1 or 0, because the model output is a probability of 1.

- Concatenate `question` and `passage` into `query`
    - Add the special separator token `[SEP]` between them to distinguish the two texts from one another
- Tokenize the query with the `LLAMA` tokenizer
    - It handles splitting the text into tokens with `TikToken` as well as the conversion to token IDs.
    - Padding to the maximum input length of each batch is done by the `DataCollatorWithPadding` later
    - Truncation should not be needed, as the maximum input length is quite large (~8000 tokens)
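
To make the per-batch padding step concrete, here is a simplified pure-Python sketch of the idea. It is not the implementation of `DataCollatorWithPadding`: the helper `pad_batch`, the padding ID `0`, and right-side padding are assumptions chosen for illustration.

```python
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> dict:
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[5, 6, 7], [8]])
print(padded["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding per batch instead of to a global maximum keeps short batches short. Some tokenizers do not define a padding token by default, in which case one has to be set before the collator can pad.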

Remove the unnecessary `question` and `passage` columns, as they are represented in `query`

# Model
Predefined requirements:
- LLM (≥ 1B parameters)
- Use a quantized version as the base model

## Network Architecture


- Normalization: Done in the `DeBERTa` model with their Masked Layer Normalization
- Regularization: The `AdamW` optimizer applies decoupled weight decay to the weights (instead of adding an L2 penalty to the loss), no regularization layer is in `DeBERTa`

### Loss function
Default by transformers library: Binary Cross-Entropy with logits loss:
- Not changed because it is well suited to binary classification problems
- Computing the loss directly from logits is preferable to applying a sigmoid followed by plain binary cross-entropy, because it is more numerically stable
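
The stability argument can be illustrated with a small standard-library sketch. It is not the library's implementation: `bce_naive` and `bce_with_logits` are hypothetical helper names, and the second one uses the common rewriting `max(x, 0) - x*y + log(1 + exp(-|x|))` of the same loss.

```python
import math


def bce_naive(logit: float, target: float) -> float:
    """Sigmoid first, then binary cross-entropy on the probability."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy computed directly from the logit."""
    # Algebraically equal to bce_naive, but never evaluates log(0).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits both versions agree.
print(bce_naive(0.5, 1.0), bce_with_logits(0.5, 1.0))

# For a confident wrong prediction the sigmoid rounds to exactly 1.0,
# so the naive version ends up evaluating log(0).
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
print(bce_with_logits(40.0, 0.0))
```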

### Optimizer
Default by transformers library: `AdamW`
- Not changed because it performs well and the original `DeBERTa` was also trained with a version of `AdamW`

## Correctness test
- Test run of training, validation, test and prediction with 1 input
- Check transformer encoder output shapes

## Implementation

Correctness test of the model definition, by running the model with one batch.

Preliminary evaluation with 5 diverse prompts.

### Checkpoints
Save checkpoints at end of training with `transformers.integrations.WandbCallback` configuration and further configuration later in `TrainingArguments`.

## Experiments



# Training
Predefined requirements:
- Then train it with parameter-efficient fine-tuning (I suggest LoRA, see e.g. the HF blog post or quicktour).
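
A LoRA setup with `peft` could look roughly like the sketch below. All hyperparameter values are hypothetical and not taken from this project, `wrap_with_lora` is an illustrative helper name, and the `target_modules` names assume a Llama-style model whose attention projections are called `q_proj` and `v_proj`.

```python
# Hypothetical hyperparameters; none of these values come from the project.
lora_kwargs = {
    "task_type": "SEQ_CLS",                  # sequence classification head
    "r": 8,                                  # rank of the low-rank update
    "lora_alpha": 16,                        # scaling factor for the update
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumed Llama-style layer names
}


def wrap_with_lora(base_model):
    """Attach LoRA adapters to an already loaded (quantized) base model."""
    # Imported here so the settings above can be inspected without peft installed.
    from peft import LoraConfig, get_peft_model

    peft_model = get_peft_model(base_model, LoraConfig(**lora_kwargs))
    peft_model.print_trainable_parameters()
    return peft_model
```

Only the adapter weights are trained, while the quantized base weights stay frozen. For quantized base models, `peft` also offers `prepare_model_for_kbit_training`, which may be needed before wrapping depending on the setup.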


Training is done with the `Trainer` class from the `transformers` library.
Configure training and evaluation with `TrainingArguments`. 
- set `seed` for reproducibility
- `logging_strategy = 'epoch'` to log metrics after each epoch
- `eval_strategy = 'epoch'` to evaluate after each epoch
- `save_strategy = 'steps'` to save after every 500 steps
- `save_total_limit = 3` to keep only the last 3 checkpoints, otherwise the limited wandb storage will fill up
- `dataloader_num_workers = 2` to speed up data loading
- `num_train_epochs = 20` to keep the number of epochs low, because every epoch will take a long time
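
The settings listed above could be collected as follows. This is a sketch under assumptions: `output_dir`, the seed value, and `report_to` are placeholders that do not come from this document, and `make_training_arguments` is an illustrative helper name.

```python
# Values marked "assumption" are placeholders, not taken from the project.
training_kwargs = {
    "output_dir": "checkpoints",         # assumption: hypothetical path
    "seed": 42,                          # assumption: any fixed seed works
    "logging_strategy": "epoch",
    "eval_strategy": "epoch",
    "save_strategy": "steps",
    "save_steps": 500,
    "save_total_limit": 3,
    "dataloader_num_workers": 2,
    "num_train_epochs": 20,
    "per_device_train_batch_size": 64,   # batch size from the preprocessing section
    "report_to": "wandb",                # assumption: metrics are sent to W&B
}


def make_training_arguments():
    """Build the `TrainingArguments` object from the settings above."""
    # Imported lazily so the settings can be read without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```

Older `transformers` releases name the evaluation setting `evaluation_strategy` instead of `eval_strategy`, so the key may need adjusting to the pinned version.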

Metrics for training and validation:
- Accuracy, because we are interested in both correct true and false predictions
- Loss, to see how confident the model is in its predictions
- Metrics are logged every epoch, because logging per step is very noisy and has no benefit.

Loss is the main metric for all decisions, because it is the quantity the model is trained to minimize. In a correctly working model, accuracy should improve as loss decreases, so it is not necessary to optimize for accuracy separately.

As discussed in class, no other metrics are needed for training and validation, as accuracy and loss are sufficient to evaluate which model is the best.

Accuracy has to be implemented separately for training and evaluation, because `Trainer` from `transformers` only logs loss by default.
- Create a `TrainerCallback` for training accuracy
- Define `compute_metrics` method for validation accuracy
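
A `compute_metrics` function for validation accuracy might look like the sketch below. It assumes `Trainer` passes a pair of logits and labels, and it handles both a single-logit output and a two-logit output, since the head configuration is not stated here. The sample values are made up.

```python
import numpy as np


def compute_metrics(eval_pred) -> dict:
    """Validation accuracy from the logits and labels passed in by `Trainer`."""
    logits, labels = eval_pred
    logits = np.asarray(logits)
    labels = np.asarray(labels).reshape(-1).astype(int)
    if logits.ndim == 2 and logits.shape[1] > 1:
        # One logit per class: pick the larger one.
        preds = logits.argmax(axis=1)
    else:
        # Single logit: sigmoid(logit) > 0.5 is the same as logit > 0.
        preds = (logits.reshape(-1) > 0).astype(int)
    return {"accuracy": float((preds == labels).mean())}


# Made-up logits and labels, only to show the call.
print(compute_metrics((np.array([[2.0], [-1.0], [0.3], [-0.2]]), np.array([1, 0, 0, 0]))))
# {'accuracy': 0.75}
```

The function would be handed to the trainer through its `compute_metrics` argument.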


## Implementation

- Use `wandb.sweep` for hyperparameter tuning.
- Bayesian search will be used, because there is only one hyperparameter to tune and it is continuous
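
A Bayesian sweep over one continuous hyperparameter could be configured roughly as follows. The document does not name the hyperparameter, so the learning rate and its range are purely illustrative assumptions, as are the metric name `eval/loss` and the helper `launch_sweep`.

```python
# Assumption: the single tuned hyperparameter is the learning rate,
# and the validation loss is logged to W&B as "eval/loss".
sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3,
        },
    },
}


def launch_sweep(train_fn, project: str, count: int = 10) -> str:
    """Register the sweep and run `count` trials of `train_fn`."""
    import wandb  # imported lazily; requires a prior W&B login

    sweep_id = wandb.sweep(sweep=sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)
    return sweep_id
```

Here `train_fn` stands for a function that reads the sampled value from `wandb.config` and runs one training job.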

After all experiments have run, select the run with the smallest loss as the final model to be evaluated.

# Evaluation
Metrics:
- Accuracy
    - to be able to compare the model to the previous projects
    - As well as to check how it compares to the dataset imbalance
    - `torchmetrics.functional.classification.accuracy(preds, target, task='binary')`
- Confusion matrix
    - To be able to see where the model tends to make mistakes.
    - `torchmetrics.functional.confusion_matrix(preds, target, num_classes=2)`
    - As discussed in class: use scikit-learn instead of wandb, as it is easier to interpret
- Total false predictions
    - To see how many false predictions the model made

The averaging of the metrics is the default of `micro`, which means the metrics are calculated over all samples together, without weighting of the classes.
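
To show what the three evaluation metrics compute, here is a dependency-free sketch. It is not the `torchmetrics` or scikit-learn code used in the notebook: `binary_report` is a hypothetical helper, and the labels and predictions are made up. The matrix layout follows scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
def binary_report(y_true: list[int], y_pred: list[int]) -> dict:
    """Accuracy, confusion matrix and number of false predictions for labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return {
        # Rows are true labels, columns are predicted labels.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        # Every sample counts equally, regardless of its class.
        "accuracy": (tn + tp) / len(pairs),
        "false_predictions": fp + fn,
    }


# Made-up labels and predictions, only to show the output format.
report = binary_report(y_true=[1, 1, 0, 0, 1], y_pred=[1, 0, 0, 1, 1])
print(report)
```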

Evaluation will also be done with the `Trainer` class, just using the `evaluate` method and the test dataset.

## Implementation

## Result


Check that the implementations for test and predict are correct.

Load the best model from wandb artifact registry.

Run evaluation of final model with test dataset.

# Interpretation

## Results

## Learning