# Project 2: RNN BoolQ



# Introduction



W&B Link: TODO

# Setup
## Notebook setup
- Install all dependencies. `pip install`
- Import all necessary libraries.
- Log into Hugging Face and Weights & Biases.
- Download the GloVe Wikipedia embeddings from Hugging Face and unzip them in the `data` folder.
- Load the embeddings into the `embeddings` dictionary.
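
The last setup step can be sketched as follows. This is a minimal sketch, assuming the unzipped GloVe file is plain text with one word per line followed by its vector components; the function name `load_glove` and the example path are hypothetical.

```python
import numpy as np


def load_glove(path):
    """Read a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumed line layout: "<word> <v1> <v2> ... <vd>", space separated.
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings


# Hypothetical usage in the notebook (the file name depends on the download):
# embeddings = load_glove("data/glove.6B.300d.txt")
```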

## Dependencies
Install all necessary dependencies
- PyTorch: `torch lightning`
- Hugging Face: `huggingface_hub datasets`
- Weights & Biases: `wandb`
- nltk: `nltk`
- numpy: `numpy`

Optional
- Lint and Formatting: `ruff`

## Tools used
- GPUHub JupyterLab
- PyTorch Lightning documentation
- Project 1 as a skeleton for texts which do not need to change
- No AI tools were used, as 90% of their generated code is either broken or code I do not understand, and therefore useless to me

# Preprocessing

Predefined requirements:
- Train / Validation / Test split
- Existing word embedding model: word2vec, GloVe, fastText
- Download the BoolQ dataset with `datasets` and split it in the predefined way.

Data treatment steps:
- Lower case `text.lower()`
    - The reason is that the GloVe embeddings pretrained on Wikipedia are uncased
- Tokenize with nltk
    - Use `word_tokenize` as we are interested in every word for word embedding with GloVe
- Remove punctuation and non-ASCII characters (phonetics etc.)
    - Punctuation and non-ASCII characters are not relevant for answering questions
    - Stop words are not removed as they are relevant for answering the question
    - Remove non-ASCII characters by encoding and decoding with `ascii`. `text.encode('ascii', 'ignore').decode('ascii')`
    - Remove punctuation by checking against `string.punctuation`
- Word embedding with GloVe (pretrained on Wikipedia)
    - Preferred to word2vec because GloVe works with co-occurrence and answering questions is about context
    - Preferred to fastText because after the previous processing no subword embeddings are needed
    - Skip the word if it is not in the vocabulary of `embeddings`
- Truncate long passages by averaging
    - This is needed because the dataset contains a few very long outliers, which would bloat the input to the model
    - Enforce a maximum length such that ~99% of the passages are not truncated `np.percentile(passages_lengths, 99)`
    - Take the average of the vectors that would be truncated and append it to the end of the passage vector
- Pad question and passage with 0 up to a fixed length `np.pad`
    - This is needed so that every concatenated question + passage input has the same length
    - Pad all questions to the maximum length of all questions
    - Pad all passages to the maximum passage length determined previously
- Concatenate question and passage as the input for the model
    - `np.concatenate` is used to have a single input for the model, which is not in an extra dimension as when using `np.stack`
    - Add a separator of 8 zero vectors between question and passage
        - This is needed to be able to differentiate between question and passage
- Remove `question` and `passage` columns from the dataset
    - They are not needed anymore as they are now part of the input
    - `dataset.remove_columns(["question", "passage"])`
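
The cleaning, embedding and truncation steps above could look roughly like this. It is a sketch under assumptions: the default `tokenize` is a simple regex stand-in so the example runs without extra downloads (in the notebook `nltk.word_tokenize` would be passed instead), the function names are hypothetical, and the truncated result is assumed to have exactly `max_len` rows with the last row holding the average of the overflow.

```python
import re
import string

import numpy as np


def regex_tokenize(text):
    # Stand-in tokenizer (assumption): splits words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def clean_tokens(text, tokenize=regex_tokenize):
    """Lowercase, drop non-ASCII characters, tokenize, remove punctuation tokens."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode("ascii")
    tokens = tokenize(text)
    # A token is dropped if it consists only of punctuation characters.
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


def embed_tokens(tokens, embeddings, dim):
    """Look up every token; words missing from the vocabulary are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros((0, dim), dtype=np.float32)
    return np.stack(vectors)


def truncate_with_average(vectors, max_len):
    """Keep the first max_len - 1 rows and append the mean of the remaining rows."""
    if len(vectors) <= max_len:
        return vectors
    head = vectors[: max_len - 1]
    tail_mean = vectors[max_len - 1 :].mean(axis=0, keepdims=True)
    return np.concatenate([head, tail_mean], axis=0)
```

The maximum passage length itself would come from `np.percentile(passages_lengths, 99)` as described above.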

Used features:
- `question` and `passage` as word vectors
- `answer` as label

Input format: concatenated `question` and `passage` word vectors (max length of question + separator + max length of passage x embedding size)
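
To make the input shape concrete, here is a small sketch of the padding and concatenation, assuming each question and passage is already a `(tokens, embedding size)` array no longer than its maximum length. The helper names `pad_to_length` and `build_input` are hypothetical.

```python
import numpy as np


def pad_to_length(vectors, length):
    """Zero-pad a (tokens, dim) array at the end so it has `length` rows."""
    missing = length - vectors.shape[0]
    return np.pad(vectors, ((0, missing), (0, 0)), mode="constant")


def build_input(question, passage, max_q_len, max_p_len, sep_len=8):
    """Concatenate padded question, zero separator and padded passage along the sequence axis."""
    dim = question.shape[1]
    separator = np.zeros((sep_len, dim), dtype=question.dtype)
    return np.concatenate(
        [pad_to_length(question, max_q_len), separator, pad_to_length(passage, max_p_len)],
        axis=0,
    )


# Toy example: 3 question tokens, 5 passage tokens, embedding size 4.
x = build_input(np.ones((3, 4)), np.full((5, 4), 2.0), max_q_len=6, max_p_len=10)
print(x.shape)  # (24, 4) = (6 + 8 + 10, 4)
```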

Label format:
- convert `answer` boolean to 1 or 0
- Model output is probability of 1
- `dataset.cast_column("answer", Value("int32"))`

Batch size: 64 for faster training

## Correctness tests
- Before embedding, check that the processed passages and questions still make sense
- Check embedding lengths
- Check how many words are not in the vocabulary and, based on that, maybe adjust which pretrained GloVe embeddings are used

## Implementation
TODO

# Model

Predefined requirements:
- RNN
    - LSTM or GRU
- Classifier
    - 2 Layers
    - ReLU

## Network Architecture
- Input layer
    - `torch.nn.GRU`
    - Input Dimension: question + passage word vector dimension
    - Output Dimension: hidden layer dimension
    - Activation: `torch.nn.ReLU`
- Output layer
    - `torch.nn.GRU`
    - Input Dimension: hidden layer dimension
    - Output Dimension: 1
        - Output is probability of class (1 = 100% true, 0 = 0% true)
    - Activation: `torch.sigmoid`
- Normalization: [GloVe word vectors are already normalized](https://github.com/JungeAlexander/GloVe/blob/master/eval/python/evaluate.py#L29-L33) 
- Regularization: weight decay in the optimizer (AdamW) and dropout, see Experiments

A GRU is used because it is simpler than an LSTM and performs similarly, while being faster to train because it has one gate fewer.
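
The layer list above could be wired up as in the following sketch. It follows the listed layers literally; the class name, the default embedding size of 300 and the choice to read the prediction from the last time step are assumptions, not decisions fixed by this plan.

```python
import torch
from torch import nn


class BoolQGRU(nn.Module):
    """GRU -> ReLU -> GRU (output size 1) -> sigmoid; sizes are illustrative."""

    def __init__(self, embedding_dim=300, hidden_dim=128):
        super().__init__()
        self.input_gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.relu = nn.ReLU()
        self.output_gru = nn.GRU(hidden_dim, 1, batch_first=True)

    def forward(self, x):
        # x: (batch, sequence_length, embedding_dim)
        hidden_seq, _ = self.input_gru(x)
        out_seq, _ = self.output_gru(self.relu(hidden_seq))
        # Assumption: the last time step summarizes the whole sequence.
        return torch.sigmoid(out_seq[:, -1, 0])


model = BoolQGRU(hidden_dim=16)
probs = model(torch.randn(4, 20, 300))
print(probs.shape)  # torch.Size([4])
```

One thing to check in the correctness test: a GRU output is bounded by tanh to (-1, 1), so a sigmoid applied directly on top of it stays between roughly 0.27 and 0.73. A linear output layer would be one way to cover the full (0, 1) range.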

### Loss function
Binary Cross-Entropy: 
- Best choice for binary classification problems
- `torch.nn.BCELoss`

### Optimizer
AdamW:
- Works better with less hyperparameter tuning than SGD and the default Adam
- `torch.optim.AdamW`
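
A single training step with this loss and optimizer might look like the sketch below. The tiny linear model is only a stand-in that outputs probabilities, and the learning rate and weight decay are picked from the experiment grid for illustration. Note that `torch.nn.BCELoss` expects float targets, so integer labels need a `.float()` conversion first.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model (assumption): any module mapping a batch to probabilities in (0, 1).
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

features = torch.randn(8, 4)
labels = torch.tensor([1, 0, 1, 1, 0, 1, 0, 1]).float()

probs = model(features).squeeze(1)
loss = criterion(probs, labels)

# The same value written out: -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -(labels * probs.log() + (1 - labels) * (1 - probs).log()).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```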

## Experiments
- Hidden layers dimension (128, 256, 512)
    - To check if more complex models are needed
- Dropout (0, 1e-1, 2e-1)
    - To check how much regularization is needed (avoid under/overfitting)
- Learning rate (1e-3, 1e-4, 1e-5)
    - To check which learning rate is optimal
    - No learning rate scheduler is planned, as AdamW adapts the per-parameter step sizes dynamically, with the passed learning rate acting as an approximate upper bound
- Weight Decay (0, 1e-1, 1e-2)
    - To check how much regularization is needed (avoid under/overfitting)

### Checkpoints
Best epochs based on validation balanced accuracy:
- uploaded to wandb for later use
- `ModelCheckpoint(save_top_k=10, monitor="val_balanced_acc", mode="max")`

### Early stop
Compare with the validation balanced accuracy of previous epochs
- wandb sweeps use the Hyperband algorithm
- Max epochs 500
- Check every 10 epochs

## Correctness test
Test run of training, validation, test and prediction with 1 input

## Implementation
TODO

# Training

- Use wandb sweeps for hyperparameter tuning.
    - Grid search will be used, as the hyperparameter choices are discrete and the search space is not too large (3x3x3x3 = 81 experiments)
    - Running many experiments manually is tedious, therefore wandb sweeps are used
    - Sweeps integrate best with wandb compared to other libraries such as Optuna or Ray
    - `wandb.sweep`
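
A possible sweep configuration for this grid is sketched below as a plain dictionary. The parameter names (`hidden_dim`, `dropout`, `learning_rate`, `weight_decay`) and the `train` function in the comment are hypothetical; the value lists are taken from the Experiments section and the metric name from the checkpoint callback.

```python
from itertools import product

sweep_config = {
    "method": "grid",
    "metric": {"name": "val_balanced_acc", "goal": "maximize"},
    "parameters": {
        "hidden_dim": {"values": [128, 256, 512]},
        "dropout": {"values": [0.0, 0.1, 0.2]},
        "learning_rate": {"values": [1e-3, 1e-4, 1e-5]},
        "weight_decay": {"values": [0.0, 1e-1, 1e-2]},
    },
}

# Enumerate the grid to confirm the number of runs.
grid = list(product(*(p["values"] for p in sweep_config["parameters"].values())))
print(len(grid))  # 81

# In the notebook the dictionary would be handed to wandb, for example:
# sweep_id = wandb.sweep(sweep_config, project="<project name>")
# wandb.agent(sweep_id, function=train)
```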

- Log training and validation metrics to wandb after every epoch
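    - Log at the end of each epoch by using `on_train_epoch_end` and `on_validation_epoch_end` (the hooks that replace `training_epoch_end` and `validation_epoch_end` in Lightning 2.x)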
    - Balanced accuracy
        - Accuracy is not a good metric for imbalanced datasets, as it can be misleading
        - `torchmetrics.functional.classification.accuracy(preds, target, task='multiclass', num_classes=2, average='macro')`
    - Loss
        - Loss should decrease over time
    - Precision Recall curve
        - Show the tradeoff between precision and recall
        - `torchmetrics.functional.classification.precision_recall_curve(preds, target, task='multiclass', num_classes=2)`
    - F1
        - F1 is a better performance measure than accuracy in imbalanced datasets 
        - `torchmetrics.functional.classification.f1_score(preds, target, task='multiclass', num_classes=2, average='macro')`
- Run validation after every epoch
    - To check how the model is doing on unseen data
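
To show what the two macro-averaged metrics measure, here is a small NumPy sketch that computes them by hand on toy predictions. It is an illustration only, not a replacement for the `torchmetrics` calls listed above; the helper names are hypothetical and both classes are assumed to occur in `target`.

```python
import numpy as np


def per_class_counts(preds, target, cls):
    """True positives, false positives and false negatives for one class."""
    tp = np.sum((preds == cls) & (target == cls))
    fp = np.sum((preds == cls) & (target != cls))
    fn = np.sum((preds != cls) & (target == cls))
    return tp, fp, fn


def balanced_accuracy(preds, target):
    """Recall averaged over both classes (macro average)."""
    recalls = []
    for cls in (0, 1):
        tp, _, fn = per_class_counts(preds, target, cls)
        recalls.append(tp / (tp + fn))
    return float(np.mean(recalls))


def macro_f1(preds, target):
    """F1 score averaged over both classes (macro average)."""
    scores = []
    for cls in (0, 1):
        tp, fp, fn = per_class_counts(preds, target, cls)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))


# Model outputs are probabilities of class 1, thresholded at 0.5.
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2])
target = np.array([1, 1, 1, 0, 1, 0, 0, 0])
preds = (probs >= 0.5).astype(int)

print(balanced_accuracy(preds, target), macro_f1(preds, target))
```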

After all experiments have run, the best run based on balanced accuracy is selected as the final model to be evaluated.

Balanced accuracy is the decision metric as it also includes the negative predictions, unlike F1. We are also interested in the negatives because they also have to be predicted correctly for question answering.

## Implementation
TODO

# Evaluation
Most metric implementations will reuse the code from the training phase, as they are the same.

Additionally, the accuracy and the confusion matrix will be examined. Both will only be implemented for the evaluation step.
- Accuracy makes it possible to compare the model to the previous project
    - As well as to check how it compares to the dataset imbalance
    - Additionally, accuracy is easier to understand as a metric than balanced accuracy
    - `torchmetrics.functional.classification.accuracy(preds, target, task='multiclass', num_classes=2, average='micro')`
- The confusion matrix shows where the model tends to make mistakes.
    - Whether it only predicts one class or whether it mixes in predictions of the other class
    - `torchmetrics.functional.confusion_matrix(preds, target, task='multiclass', num_classes=2)`
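
The two failure patterns can be told apart directly in the matrix. The following NumPy sketch is a hand-written stand-in for the `torchmetrics` call, with rows as true classes and columns as predicted classes; the toy labels are made up for illustration.

```python
import numpy as np


def confusion_matrix(preds, target, num_classes=2):
    """Count matrix with true classes as rows and predicted classes as columns."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (target, preds), 1)
    return matrix


target = np.array([1, 1, 1, 0, 0])
one_class = confusion_matrix(np.array([1, 1, 1, 1, 1]), target)
mixed = confusion_matrix(np.array([1, 0, 1, 0, 1]), target)

print(one_class)  # the column for class 0 is empty: only one class is predicted
print(mixed)      # errors appear in both off-diagonal cells
```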

Metrics used for evaluation:
- Accuracy
- Balanced Accuracy
- Precision
- Recall
- F1
- Confusion matrix


Load the best model from wandb artifact registry.

Run evaluation of final model with test and validation dataset.

No parameters will be changed after the final model has been evaluated, as that would be train-test leakage.

## Implementation
TODO

## Result
TODO

# Interpretation
Expectation:
- 55% balanced accuracy on the test dataset, as this would be better than randomly guessing whether the answer to the question is true or false.
- 65% accuracy on the test dataset, as this would be better than always predicting the majority class, given the test label imbalance of 62.2% true labels.
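
The two thresholds can be checked against a trivial baseline that always answers true. The sketch below uses synthetic labels with the stated 62.2% share of true answers, not the real test split.

```python
import numpy as np

# Synthetic labels (assumption): 1000 examples, 62.2% of them true.
target = np.array([1] * 622 + [0] * 378)
always_true = np.ones_like(target)

accuracy = (always_true == target).mean()
recall_true = (always_true[target == 1] == 1).mean()
recall_false = (always_true[target == 0] == 0).mean()
balanced_accuracy = (recall_true + recall_false) / 2

print(f"accuracy={accuracy:.3f}, balanced accuracy={balanced_accuracy:.3f}")
# accuracy=0.622, balanced accuracy=0.500
```

A model therefore has to beat 62.2% accuracy and 50% balanced accuracy to be better than this baseline, which is what the 65% and 55% targets aim for.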


## Result and learnings
TODO
