# Project 3: Transformer BoolQ

The documentation is split into small chunks following the suggestion in class and from feedback for previous projects.

# Introduction

Classification of BoolQ with Transformers.


W&B Link: TODO

# Setup
Preliminary steps for getting the project running.

## Tools used
- GPUHub JupyterLab
- PyTorch Lightning documentation
- No AI tools used, as they do not help with reading API documentation and GitHub issues 
- Previous projects documentation

## Dependencies
The notebook was created with:
Python 

Install all necessary dependencies
- PyTorch: `torch lightning`
- Hugging Face: `huggingface_hub datasets`
- Weights & Biases: `wandb`
- nltk: `nltk`
- numpy: `numpy`
- scikit-learn: `scikit-learn`
- Lint and Formatting: `ruff`

Versions of dependencies are pinned for reproducibility.

## Notebook setup
Import all necessary libraries.

Log into Hugging Face and Weights & Biases.

# Preprocessing

Predefined requirements:
- Download the BoolQ dataset with `datasets` and split it in the predefined way.
- Train / Validation / Test split

Used features:
- `question` and `passage` as word vectors
- `answer` as label

Input format:
- concatenated `question` and `passage` word vectors
- with a special separator token in the middle to differentiate between them
- the order of the two parts should not matter much, as transformers attend to all tokens at once instead of processing them sequentially like RNNs

Label format:
- convert `answer` boolean to 1 or 0
- Model output is probability of 1
- `dataset.cast_column("answer", Value("int32"))`

Batch size: 64 for faster training than with individual samples

Why not:
- Stemming / Lemmatization
    - Removes information about how a word is used, which is important for answering questions.
- Removal of other words / stopwords
    - Stopwords are important for answering the question, as negations and other important words are counted as stopwords.
    - Some words might be worth removing (Wikipedia parsing errors, tooltip text in paragraphs), but the required effort is not worth the minimal gains.
- Format cleaning
    - Other than removing non-ASCII words, no cleaning is needed, because the examples that were inspected read like the intended text.
- Truncation
    - As discussed in class: information would be lost. As input sizes are not a problem, this is not needed.
- Padding
    - As discussed in class: padding will be done for each batch individually, instead of padding all passages to the same length.


## Correctness tests
- Check that the processed passages and questions still make sense before embedding
- Check embedding lengths

## Implementation
TODO

Download and split dataset in predefined way

- Lower case text
    - The case of words is not very important for answering questions, and lowercasing has the potential to reduce the vocabulary size
    - The question is already lowercase only, so lowercasing the passage brings both closer together in terms of format
    - With `.lower()`
- Remove special characters (punctuation)
    - Punctuation is not relevant for answering questions. Question marks are implicit for the question. Passage does not contain important punctuation
    - Could reduce the needed context for a sentence, improving the performance of the model
    - Remove by checking against `string.punctuation` 
- Tokenize sentence with `nltk`
    - Use `wordpunct_tokenize` as we are interested in every word
    - Instead of `word_tokenize`, as its tokenization was not very good in the last projects
- Remove words with non-ASCII characters (phonetics etc.)
    - Non-ASCII words are not as important for answering questions, as there are not enough of them to be relevant
    - Example: `Persian (/ˈpɜːrʒən, -ʃən/)` only the first part is important
    - Remove non-ASCII words by checking `.isascii()`
- Concatenate `question` and `passage` into `query`
    - The Transformers implementation will only work on one sequence and not multiple
    - Add a special separator token between them to distinguish both texts from one another
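
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the names `preprocess`, `build_query` and the token string `<sep>` are hypothetical, and a plain whitespace split stands in for the `nltk` tokenizer so that the example has no external dependencies.

```python
import string

SEP_TOKEN = "<sep>"  # hypothetical name for the separator token

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize and drop non-ASCII words."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    tokens = text.split()  # assumption: whitespace split replaces the nltk tokenizer
    return [token for token in tokens if token.isascii()]


def build_query(question: str, passage: str) -> list[str]:
    """Concatenate question and passage tokens with a separator in between."""
    return preprocess(question) + [SEP_TOKEN] + preprocess(passage)


example = preprocess("Persian (/ˈpɜːrʒən, -ʃən/) is a language.")
query = build_query("Is Persian a language?", "Persian is a language.")
```

With the example from the list, the phonetic transcription is dropped and only `persian` remains of the first part.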

Check processed passages and questions if they still make sense. The question must still be answerable with the passage even after the processing. 

Remove the unnecessary `question` and `passage` columns, as they are represented in `query`

Convert answer boolean to 1 or 0, because the model output is a probability of 1.

Generate vocabulary for embedding layer.
- Use all words present, to not lose information
- Introduce special tokens for padding and separation
- `torchtext.vocab.build_vocab_from_iterator`
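
To make the vocabulary step concrete, here is a small dependency-free sketch. It is a stand-in for `torchtext.vocab.build_vocab_from_iterator`, not a reproduction of it: the helper names `build_vocab` and `encode`, the token strings `<pad>` and `<sep>`, and the choice to order words by frequency are all assumptions made for illustration.

```python
from collections import Counter

PAD_TOKEN = "<pad>"  # hypothetical padding token, mapped to index 0
SEP_TOKEN = "<sep>"  # hypothetical separator token, mapped to index 1


def build_vocab(token_lists: list[list[str]]) -> dict[str, int]:
    """Map every word that occurs in the data to an integer index."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocab = {PAD_TOKEN: 0, SEP_TOKEN: 1}
    # Most frequent words first; ties are broken alphabetically.
    for token, _ in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Convert a token list into the index list used by the embedding layer."""
    return [vocab[token] for token in tokens]


queries = [["is", "a", SEP_TOKEN, "a", "b"]]
vocab = build_vocab(queries)
encoded = encode(queries[0], vocab)
```

The size of the resulting dictionary is the input dimension of the embedding layer.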

# Model
Predefined requirements:
- nn.Embedding
- nn.TransformerEncoder
    - 6 layers, randomly initialized
- Classifier
    - 2 Layers
    - ReLU

## Network Architecture
- Input layer
    - `nn.Embedding`
    - Input Dimension: Vocabulary size
    - Output Dimension: Embedding Dimension
    - Initialized with random weights
    - Embedding Dimension: 256
- Positional Encoding
    - `RotaryPositionalEmbeddings` or `PositionalEncoding`
    - As transformers do not have positional information by themselves, it has to be added to the input
    - Absolute encoding is used as a baseline
    - Rotary is the main choice over purely absolute or relative encodings, as it combines the best of both
- Hidden layers
    - `nn.TransformerEncoder`
    - Input Dimension: Embedding Dimension
    - Output Dimension: Embedding Dimension
    - No input masking will be done, as that could obscure important information for the task
- Output layer
    - `nn.Linear`
    - Input Dimension: Embedding Dimension
    - Output Dimension: 1
        - Output is the probability of the positive class (1 = 100% true, 0 = 0% true)
    - Activation: `torch.nn.ReLU`
    - Final Activation: `torch.sigmoid`

- Normalization: Done in `TransformerEncoder` with `LayerNorm`
    - It seems that `BatchNorm` is not optimal for transformers [source](https://stats.stackexchange.com/questions/474440/why-do-transformers-use-layer-norm-instead-of-batch-norm)
- Regularization: done by the optimizer (weight decay in AdamW)
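
The architecture described above could look roughly like the following sketch. Several details are assumptions that the description leaves open: the class names, mean pooling over the sequence before the classifier, the hidden size of the classifier, and a sinusoidal encoding as the absolute baseline variant. Following the description, no padding mask is passed to the encoder.

```python
import math

import torch
from torch import nn


class PositionalEncoding(nn.Module):
    """Absolute sinusoidal positional encoding (assumed baseline variant)."""

    def __init__(self, d_model: int, max_len: int = 4096):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]


class BoolQClassifier(nn.Module):
    """Embedding -> positional encoding -> encoder -> 2-layer classifier."""

    def __init__(self, vocab_size: int, d_model: int = 256, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.pos_enc(self.embedding(token_ids))
        x = self.encoder(x)
        pooled = x.mean(dim=1)  # assumption: mean pooling over all positions
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)


# Small configuration so the check runs quickly; the project uses d_model=256 and 6 layers.
model = BoolQClassifier(vocab_size=50, d_model=32, nhead=4, num_layers=2)
model.eval()
probabilities = model(torch.randint(0, 50, (3, 7)))
```

The output is one probability per sample in the batch, which matches the input expected by `torch.nn.BCELoss`.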


### Loss function
Binary Cross-Entropy: 
- Is used because it is the standard choice for binary classification with a single probability output
- `torch.nn.BCELoss`

### Optimizer
AdamW:
- Chosen because it should perform better with less hyperparameter tuning than SGD and the default Adam
- `torch.optim.AdamW`
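
A short sketch of how the loss and the optimizer fit together in one training step. The tiny linear model and the random features are placeholders, and the learning rate is one of the sweep values. One detail worth noting: `torch.nn.BCELoss` expects floating point targets, so labels stored as `int32` have to be converted first.

```python
import torch
from torch import nn

# Placeholder model that outputs one probability per sample.
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

features = torch.randn(2, 4)
labels = torch.tensor([1, 0], dtype=torch.int32)  # labels as stored after casting

probs = model(features).squeeze(-1)
loss = criterion(probs, labels.float())  # BCELoss needs float targets

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Hand-checkable case: -(ln(0.9) + ln(0.8)) / 2
reference = criterion(torch.tensor([0.9, 0.2]), torch.tensor([1.0, 0.0]))
```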

## Correctness test
Test run of training, validation, test and prediction with 1 input

## Implementation

Correctness test of the model definition, by running the model with one batch.

### Checkpoints
Best epochs based on smallest validation loss:
- Uses loss because it is the most important metric for the model
- Save only a few checkpoints (top 3) to not bloat the storage, because a previous project managed to fill the wandb storage with too many checkpoints
- Uploaded to wandb for later use
- `ModelCheckpoint`

## Experiments
- Positional Encoding (Absolute, Rotary)
    - To check if rotary is better than absolute
- Number of attention heads (4, 6, 8)
    - To check if more or less attention heads are needed
- Learning rate (1e-3, 1e-4, 1e-5)
    - To check which learning rate is optimal
    - No learning rate scheduler is used, as AdamW adapts the effective step size per parameter on its own, with the passed learning rate being roughly the maximum step size


### Early stop
Compare to the previous epochs' validation accuracy
- wandb sweeps use the Hyperband algorithm
- Max epochs 60
- Check every 10 epochs

# Training
Metrics for training and validation:
- Accuracy, because we are interested in both correct true and false predictions
- Loss, to see how confident the model is in its predictions

Loss is the main metric for all decisions, as it is the most important metric for the model. Accuracy should follow loss in a correct model. Therefore, it is not necessary to optimize for accuracy.

As discussed in class no other metrics are needed for training and validation. As accuracy and loss are sufficient to evaluate which model is the best.

Log training and validation metrics to wandb after every epoch. Logging per step would be too noisy and have no benefit.

## Implementation

- Run validation after every epoch (done automatically by PyTorch Lightning)
    - To check how the model is doing on unseen data
    - Also needed for creating model checkpoints and for early stopping

Check that the training and validation steps are defined correctly

Use `DataLoader` with a `collate_fn` that pads each batch with the defined padding token to the length of the longest concatenated question and passage in that batch
- As suggested in class, this is done instead of padding all inputs to the same length
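
A minimal sketch of per-batch padding with a `collate_fn`. It assumes that each sample is a dict with an already encoded `query` (list of token indices) and an `answer` (0 or 1), and that the padding token has index 0; the function name `collate_batch` is hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 0  # assumption: index of the padding token in the vocabulary


def collate_batch(batch):
    """Pad all queries in a batch to the longest query of that batch."""
    queries = [torch.tensor(item["query"], dtype=torch.long) for item in batch]
    labels = torch.tensor([item["answer"] for item in batch], dtype=torch.float)
    padded = pad_sequence(queries, batch_first=True, padding_value=PAD_IDX)
    return padded, labels


# Two toy samples of different length.
samples = [
    {"query": [4, 2, 1], "answer": 1},
    {"query": [5], "answer": 0},
]
loader = DataLoader(samples, batch_size=2, collate_fn=collate_batch)
padded, labels = next(iter(loader))
```

Only the shorter query is padded, and only up to the length of the longest query in its own batch.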

- Use `wandb.sweep` for hyperparameter tuning.
    - Grid search will be used, as the hyperparameter choices are discrete and the search space is not too large (2 x 3 x 3 = 18 experiments)
    - Manually running the various experiments is tedious, therefore automated sweeps are used
    - Better integration with wandb than other libraries such as optuna or ray
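
A sweep configuration for the experiments could be written as a plain dictionary like the one below and then registered with `wandb.sweep`. The parameter keys, the metric name `val_loss` and the mapping of "check every 10 epochs" to `min_iter` are assumptions, so this is a sketch rather than the configuration that was actually run. The code also counts the grid size from the values listed under Experiments.

```python
from itertools import product

EMBEDDING_DIM = 256  # embedding dimension from the network architecture

sweep_config = {
    "method": "grid",
    "metric": {"name": "val_loss", "goal": "minimize"},  # assumed metric name
    "parameters": {
        "positional_encoding": {"values": ["absolute", "rotary"]},
        "num_heads": {"values": [4, 6, 8]},
        "learning_rate": {"values": [1e-3, 1e-4, 1e-5]},
    },
    # Hyperband early termination; min_iter=10 is one possible mapping.
    "early_terminate": {"type": "hyperband", "min_iter": 10},
}

# Every combination of the listed values is one run of the grid search.
grid = list(product(*(p["values"] for p in sweep_config["parameters"].values())))
num_runs = len(grid)

# Multi-head attention needs the embedding dimension to be divisible by the head count.
head_values = sweep_config["parameters"]["num_heads"]["values"]
incompatible_heads = [h for h in head_values if EMBEDDING_DIM % h != 0]
```

The last check is worth running before starting a sweep: `nn.MultiheadAttention` requires the embedding dimension to be divisible by the number of heads, so a head count that does not divide 256 would need a different embedding dimension.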

After all experiments have run, select the best run based on the smallest validation loss as the final model to be evaluated.

# Evaluation
A few additional metrics are implemented for evaluation for better interpretation of the model results.

Metrics:
- Accuracy
    - to be able to compare the model to the previous projects
    - As well as to check how it compares to the dataset imbalance
    - `torchmetrics.functional.classification.accuracy(preds, target, task='binary')`
- Confusion matrix
    - To be able to see where the model tends to make mistakes.
    - `torchmetrics.functional.confusion_matrix(preds, target, num_classes=2)`
    - As discussed in class: use scikit-learn instead of wandb, as it is easier to interpret
- Recall and Specificity
    - For error analysis of the predictions of the classes
    - Suggested in class to see how the model performs on the different classes
    - Recall for true labels, specificity for false labels
    - `torchmetrics.functional.recall(preds, target, task='binary')`
    - `torchmetrics.functional.specificity(preds, target, task='binary')` 

The averaging of the metrics is the default of `micro`, which means the metrics are calculated without weighting of the classes.
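
To show how these metrics relate to the confusion matrix, here is a small dependency-free sketch. It illustrates the definitions only and is not the `torchmetrics` implementation; the helper names and the threshold of 0.5 are assumptions.

```python
def confusion_counts(preds: list[int], targets: list[int]) -> tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary predictions and targets."""
    tn = fp = fn = tp = 0
    for pred, target in zip(preds, targets):
        if target == 1 and pred == 1:
            tp += 1
        elif target == 1 and pred == 0:
            fn += 1
        elif target == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def summarize(preds: list[int], targets: list[int]) -> dict[str, float]:
    """Accuracy, recall (true labels) and specificity (false labels)."""
    tn, fp, fn, tp = confusion_counts(preds, targets)
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Model outputs are probabilities, so they are thresholded first (assumed 0.5).
probabilities = [0.9, 0.7, 0.2, 0.4, 0.8]
preds = [int(p >= 0.5) for p in probabilities]
targets = [1, 0, 0, 1, 1]
result = summarize(preds, targets)
```

A model that always predicts the majority class would reach a recall of 1 and a specificity of 0, which is why both are reported next to accuracy.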

## Implementation

## Result


Check that the implementations for test and predict are correct

Load the best model from wandb artifact registry.

Implement confusion matrix calculation.

Run evaluation of final model with test and validation dataset.

# Interpretation
Expectation:
70% accuracy on the test dataset, as this would be better than the test label imbalance of 62.2% true labels.
The expectation is higher than in past projects, as transformers have the potential to be better than the previous methods.
Furthermore, the expectation is that most of the accuracy comes from the majority class and not from a balanced correct prediction of both classes. This is because the model is selected by minimum loss and not by balanced accuracy.
The model with `RotaryPositionalEmbeddings` should outperform `PositionalEncoding`, as it is an improvement over absolute position encodings.
More attention heads should also perform better than fewer, as the model should be able to generalize better.

## Results

## Learning