# Large Language Models for BoolQ reading comprehension

## Sources

The primary resources for this project include the Hugging Face documentation and the Meta Llama 3.2 model repository. AI tools such as ChatGPT were utilized to refine the writing and grammar in Stage 1 and will assist in debugging during Stage 2.

## Setup

### **Importing Python Packages**
To keep the notebook reproducible and running without errors, the necessary libraries are installed in a pip cell below.

### **Data Loading and Split**
The BoolQ dataset contains binary question-answer pairs. Each entry consists of a question, a passage, and the corresponding binary answer (yes/no). The dataset is split as required by the course materials:
- **Train Split:** The first 8427 entries of the training data.
- **Validation Split:** The last 1000 entries of the training data.
- **Test Split:** The validation split provided in the BoolQ dataset (3270 entries).
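
The split above amounts to plain slicing. The following is a minimal sketch, assuming the two official BoolQ splits have already been loaded into list-like sequences; the function name `make_splits` and its parameters are hypothetical, and the integer lists stand in for real rows.

```python
def make_splits(train_rows, official_val_rows, n_train=8427, n_val=1000):
    """Return (train, val, test) following the split described above."""
    # The two slices must not overlap, otherwise validation data leaks into training.
    if n_train + n_val > len(train_rows):
        raise ValueError("train and validation slices would overlap")
    train = train_rows[:n_train]   # first 8427 entries of the training data
    val = train_rows[-n_val:]      # last 1000 entries of the training data
    test = official_val_rows       # official validation split, used as the test set
    return train, val, test


# Stand-in data: integer IDs instead of real BoolQ rows (9427 = 8427 + 1000).
train, val, test = make_splits(list(range(9427)), list(range(3270)))
print(len(train), len(val), len(test))  # 8427 1000 3270
```

With a Hugging Face `Dataset` object, `select(range(...))` would likely be the closer equivalent to list slicing.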

### **Seeding for Reproducibility**
A seed value of 42 is used to ensure reproducibility of results across different runs.
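
A minimal sketch of a seeding helper is shown below. The name `set_seed` is hypothetical, and seeding NumPy and PyTorch only when they are installed is an assumption made to keep the snippet self-contained.

```python
import random


def set_seed(seed=42):
    """Seed the random number generators the notebook is likely to use."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
```

Seeding alone does not guarantee bit-identical results on a GPU, since some operations are nondeterministic, so small run-to-run differences may remain.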

### **Batch size**
The batch size is set once at the beginning of the notebook and reused throughout the code.

## Preprocessing

### Tokenization
Utilize the `AutoTokenizer` from Hugging Face's Transformers library, corresponding to the Llama 3.2 1B model. This tokenizer will handle the conversion of text into token IDs, managing aspects like whitespace, special characters, and subword tokenization.

#### Text Normalization
Retain the original casing to preserve semantic nuances, as the Llama 3.2 tokenizer is case-sensitive. Stemming, lemmatization, and stopword removal are unnecessary, as the model's tokenizer and embeddings are designed to handle such variations.

#### Padding and Truncation
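The maximum sequence length is set to 2048 tokens, following the limit cited on [GitHub](https://github.com/meta-llama/llama/issues/148). Note that the Llama 3.2 model card lists a much longer context window (128K tokens), so 2048 acts as a cap for this project rather than the model's hard limit. Sequences exceeding this length will be truncated, and shorter sequences will be padded to create uniform input lengths within each batch.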
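
To illustrate what truncation and per-batch padding do to the token IDs, here is a small pure-Python sketch. The helper `pad_and_truncate`, right-side padding, and the pad ID of `0` are assumptions for illustration; in the notebook the tokenizer would handle this itself.

```python
def pad_and_truncate(batch_ids, max_len, pad_id=0):
    """Truncate to max_len, then right-pad to the longest sequence in the batch."""
    truncated = [ids[:max_len] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_len=4)
print(ids)   # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

A Hugging Face tokenizer performs the same steps when called with `padding=True, truncation=True, max_length=...`. Llama tokenizers may not define a pad token by default, in which case one has to be assigned first.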


#### Input Format
Each input will consist of a concatenation of the question and passage, separated by a special token (e.g., `[SEP]`). This format allows the model to process both components simultaneously.


#### Label Format
The model will predict binary labels: `1` for 'yes' and `0` for 'no'.
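
The input and label formats can be sketched as a single conversion step. The helper name `encode_example` is hypothetical, the field names `question`, `passage`, and `answer` are assumed to match the loaded dataset, and the literal `[SEP]` string is only the example separator mentioned above.

```python
def encode_example(row, sep=" [SEP] "):
    """Turn one BoolQ row into (input text, binary label)."""
    text = f"{row['question']}{sep}{row['passage']}"
    label = 1 if row["answer"] else 0  # 1 for 'yes', 0 for 'no'
    return text, label


row = {"question": "is the sky blue", "passage": "The sky appears blue.", "answer": True}
print(encode_example(row))  # ('is the sky blue [SEP] The sky appears blue.', 1)
```

Passing the question and passage as a text pair to the tokenizer is an alternative that lets the tokenizer insert whatever separator tokens the model expects.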


#### Planned Correctness Tests
Verify that tokenization produces expected token IDs and that input sequences are correctly formatted and padded.

## Model

### Architecture
- **Pretrained Transformer Encoder:**
  - Hugging Face’s `bert-base-cased` processes tokenized inputs.
  - The output corresponding to the `[CLS]` token is extracted as a fixed-size representation.
  - *The `bert-base-cased` model is pretrained with masked language modeling and next sentence prediction objectives, not on question answering. This is why I consider it usable for this project, where pretrained models that were already fine-tuned for the task are not allowed.*
- **Classifier:**
  - A two-layer fully connected network processes the `[CLS]` token embedding:
    - **First Layer:** Projects the embedding to the hidden dimension with ReLU activation.
    - **Dropout Layer:** Introduced after ReLU to reduce overfitting.
    - **Second Layer:** Maps the hidden representation to a single binary output using Sigmoid activation.
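
The classifier described above could look roughly like the following PyTorch sketch. The class name `ClassifierHead` and the default `hidden_dim` and `dropout` values are placeholders; `embed_dim=768` matches the hidden size of `bert-base-cased`, and the random tensor stands in for the encoder's `[CLS]` output.

```python
import torch
from torch import nn


class ClassifierHead(nn.Module):
    """Two-layer head: Linear -> ReLU -> Dropout -> Linear -> Sigmoid."""

    def __init__(self, embed_dim=768, hidden_dim=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, cls_embedding):
        # cls_embedding: (batch_size, embed_dim) -> probabilities: (batch_size, 1)
        return self.net(cls_embedding)


head = ClassifierHead()
cls_embedding = torch.randn(4, 768)  # stand-in for last_hidden_state[:, 0]
probs = head(cls_embedding)
print(probs.shape)  # torch.Size([4, 1])
```

Because the head ends in a Sigmoid, it pairs with `nn.BCELoss`. Dropping the Sigmoid and using `nn.BCEWithLogitsLoss` is a common, numerically more stable alternative.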


### Loss Function
The Binary Cross-Entropy Loss (BCE) function is used to calculate the difference between predicted and true labels for binary classification.
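
For reference, the quantity BCE computes can be written out in a few lines of plain Python. This is an illustrative sketch, not the training code; the function name and the clamping constant `eps` are assumptions.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


print(round(binary_cross_entropy([0.9, 0.2], [1, 0]), 4))  # 0.1643
```

Confident correct predictions contribute a loss close to zero, while confident wrong predictions are penalized heavily.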


### Optimizer
*The learning rates stated here are for testing model functionality. The hyperparameters for the experiments are stated in `Experiments`.*
- **AdamW Optimizer:**
  - A learning rate of `2e-5` is used for the Transformer encoder.
  - A higher learning rate of `2e-4` is applied to the classifier layers to allow faster convergence.
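
Separate learning rates can be set through AdamW parameter groups. The sketch below uses two small `nn.Linear` layers as stand-ins for the encoder and the classifier; in the notebook these would be the real modules.

```python
import torch
from torch import nn

encoder = nn.Linear(768, 768)   # stand-in for the Transformer encoder
classifier = nn.Linear(768, 1)  # stand-in for the classifier layers

# One parameter group per component, each with its own learning rate.
optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 2e-5},
        {"params": classifier.parameters(), "lr": 2e-4},
    ]
)

print([group["lr"] for group in optimizer.param_groups])  # [2e-05, 0.0002]
```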


### Checkpointing and Early Stopping
- **Checkpointing:** Save the model whenever it reaches a new maximum validation accuracy.
- **Early Stopping:** Terminates training if validation loss does not improve for 10 consecutive epochs.
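
The two rules above can be combined in one small helper. This is a sketch under the assumption that both are evaluated once per epoch; the class name `EarlyStopping` and its `step` method are hypothetical, and the loss/accuracy history is made up for the demonstration.

```python
class EarlyStopping:
    """Track the best validation accuracy and stop when validation loss stalls."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_acc = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss, val_acc):
        """Return (save_checkpoint, stop_training) for the epoch just finished."""
        save_checkpoint = val_acc > self.best_acc
        if save_checkpoint:
            self.best_acc = val_acc
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        stop_training = self.epochs_without_improvement >= self.patience
        return save_checkpoint, stop_training


# Made-up (val_loss, val_acc) pairs; patience shortened to 2 for the demo.
stopper = EarlyStopping(patience=2)
history = [(0.60, 0.70), (0.55, 0.72), (0.57, 0.73), (0.58, 0.71)]
for epoch, (loss, acc) in enumerate(history):
    save, stop = stopper.step(loss, acc)
    print(epoch, save, stop)
```

Note that the checkpoint criterion (accuracy) and the stopping criterion (loss) can disagree: in the third epoch of the demo, accuracy improves while the loss does not.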


### Correctness Tests
- **Tokenization**:
    - Ensure the tokenized output does not exceed 512 tokens.
    - Verify alignment between `input_ids` and `attention_mask` dimensions.

- **DataLoader**:
    - Verify batch size consistency during data loading.
    - Check that the output tensors for `input_ids` and `attention_mask` match the expected batch size and sequence length.

- **Model Input/Output**:
    - Confirm the input to the Transformer encoder has the shape `(batch_size, sequence_length)`.
    - Validate that the output of the Transformer encoder has the shape `(batch_size, sequence_length, hidden_dim)`.

- **Classifier Dimensions**:
    - Check that the input to the classifier corresponds to the `[CLS]` token embedding with shape `(batch_size, hidden_dim)`.
    - Ensure the output of the classifier has the shape `(batch_size, 1)`.

- **Reproducibility**:
    - Validate consistent results across multiple runs with the same random seed.
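
Several of the shape checks listed above can be written as plain assertions. The helper names below are hypothetical, and the zero tensors are stand-ins for a real batch and real model outputs.

```python
import torch


def check_batch(input_ids, attention_mask, batch_size, max_len=512):
    """Tokenization and DataLoader checks on one batch."""
    assert input_ids.shape == attention_mask.shape, "ids/mask shapes differ"
    assert input_ids.shape[0] == batch_size, "unexpected batch size"
    assert input_ids.shape[1] <= max_len, "sequence longer than max_len"


def check_model_io(encoder_out, cls_embedding, output, batch_size, seq_len, hidden_dim):
    """Encoder and classifier shape checks."""
    assert encoder_out.shape == (batch_size, seq_len, hidden_dim)
    assert cls_embedding.shape == (batch_size, hidden_dim)
    assert output.shape == (batch_size, 1)


input_ids = torch.zeros(16, 128, dtype=torch.long)
attention_mask = torch.ones(16, 128, dtype=torch.long)
check_batch(input_ids, attention_mask, batch_size=16)

encoder_out = torch.zeros(16, 128, 768)
check_model_io(encoder_out, encoder_out[:, 0], torch.zeros(16, 1), 16, 128, 768)
print("all shape checks passed")
```

The last batch of an epoch can be smaller than `batch_size` unless the DataLoader drops it, so the batch-size check may need to allow for that case.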


## Experiments
**Batch Size:** I will start with a `batch_size` of 16 and increase it to the maximum my hardware can handle. After that it stays fixed, as it is not treated as a hyperparameter.


### Hyperparameters
The following hyperparameter ranges will be explored during tuning:
- **Learning Rate:** `[1e-6, 1e-3]`. The learning rate for the classifier will be 10x the Transformer learning rate, as described in the optimizer section.
- **Classifier Hidden Dimension:** `[64, 512]`
- **Dropout Rate:** `[0.1, 0.3]`
- **Weight Decay:** `[1e-6, 1e-4]`
- **Warmup Steps:** `[0.0, 0.1]` as a fraction of the total number of steps (0-10%)
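
The ranges above could be turned into an Optuna search space along these lines. This is a sketch: the function name `suggest_config`, the parameter names, and the choice of log-scale sampling for the learning rate and weight decay are assumptions. The function only relies on the `suggest_float` and `suggest_int` methods of the trial object it receives.

```python
def suggest_config(trial):
    """Sample one hyperparameter configuration from the ranges listed above."""
    # Log-scale sampling is an assumption; it spreads trials evenly across magnitudes.
    encoder_lr = trial.suggest_float("encoder_lr", 1e-6, 1e-3, log=True)
    return {
        "encoder_lr": encoder_lr,
        "classifier_lr": 10 * encoder_lr,  # classifier rate is 10x the encoder rate
        "hidden_dim": trial.suggest_int("hidden_dim", 64, 512),
        "dropout": trial.suggest_float("dropout", 0.1, 0.3),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-4, log=True),
        "warmup_fraction": trial.suggest_float("warmup_fraction", 0.0, 0.1),
    }
```

Inside an Optuna objective function this would be called as `config = suggest_config(trial)`, with the objective returning the validation accuracy for that configuration.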


### Training Strategy
For testing, the model will be run with manually set hyperparameters. In a second stage, Optuna will be used to search automatically for the best hyperparameter combination.
- **Epochs:** A maximum of 100 epochs is set, with early stopping enabled. *This will be adjusted based on the runtime per epoch.*
- **Warmup Steps:** Warmup over 0-10% of the training steps improved convergence in prior Transformer projects. I will test with and without warmup.
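
A warmup schedule expressed as a fraction of the total steps might look like the sketch below. The linear decay after warmup is an assumption, since only the warmup phase is specified above, and the function name `lr_multiplier` is hypothetical.

```python
def lr_multiplier(step, total_steps, warmup_fraction=0.1):
    """Linear warmup from 0 to 1, then linear decay back to 0 (decay is assumed)."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return step / warmup_steps
    remaining = total_steps - warmup_steps
    return max(0.0, (total_steps - step) / max(1, remaining))


print([round(lr_multiplier(s, 100), 2) for s in (0, 5, 10, 55, 100)])
# [0.0, 0.5, 1.0, 0.5, 0.0]
```

The returned value scales the base learning rate of each parameter group. With `warmup_fraction=0.0` the schedule starts at the full learning rate, which covers the "without warmup" case.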


### Metrics
- **Validation Accuracy:** To evaluate model performance across different hyperparameter configurations, I will use validation accuracy as the primary metric.
- **Confusion Matrix:** This will give a comprehensive view of true positives, true negatives, false positives, and false negatives, allowing me deeper insight into the model’s performance.


### Logging
Weights and Biases (WandB) is used for experiment tracking, logging metrics such as train and validation loss, accuracy, and confusion matrices.


## Evaluation
The percentage of yes answers in each data split is: Train 62.64%, Val 59.50%, Test 62.17%.

Given how difficult it was in past projects to beat the majority-class baseline by a clear margin, I am setting my goal for the pretrained BERT model at 64% accuracy on the test set.
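
The majority-class baseline referred to above is the accuracy obtained by always predicting the more frequent label. A minimal sketch, with a hypothetical helper name and toy labels:

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent binary label."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


labels = [1, 1, 1, 0, 0]  # toy labels: 3 'yes', 2 'no'
print(majority_baseline_accuracy(labels))  # 0.6
```

Since "yes" is the majority answer in every split, the baseline on each split equals its percentage of yes answers.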


### Error Analysis
To understand why the model may fail on certain predictions, I will conduct an error analysis investigating whether misclassifications are related to the confidence the model has in its predictions. Low confidence on correct answers or high confidence on wrong answers may indicate areas where the model is uncertain or overconfident.
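
One way to set up this analysis is to flag the two cases of interest from the predicted probabilities. The sketch below is illustrative: the function name, the 0.5 decision threshold, and the 0.8 cutoff for "high confidence" are assumptions, and the probabilities are made up.

```python
def confidence_report(probs, labels, threshold=0.5, high=0.8):
    """Collect indices of confidently wrong and unsure-but-correct predictions."""
    report = {"confident_wrong": [], "unsure_correct": []}
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = 1 if p >= threshold else 0
        confidence = max(p, 1 - p)  # probability assigned to the predicted class
        if pred != y and confidence >= high:
            report["confident_wrong"].append(i)
        elif pred == y and confidence < high:
            report["unsure_correct"].append(i)
    return report


probs = [0.95, 0.55, 0.10, 0.90]  # made-up sigmoid outputs
labels = [1, 1, 1, 0]
print(confidence_report(probs, labels))
# {'confident_wrong': [2, 3], 'unsure_correct': [1]}
```

The returned indices can then be used to look up the corresponding questions and passages for manual inspection.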


### Confusion Matrix
After the validation step, a confusion matrix is computed to assess true positives, false positives, true negatives, and false negatives. This provides insights into the model's prediction performance.
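
For binary labels the four counts can be computed directly, as in this small sketch. The function name is hypothetical and the predictions are toy values; a library routine could be used instead.

```python
def confusion_counts(preds, labels):
    """Count true/false positives and negatives for binary predictions."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, label in zip(preds, labels):
        if pred == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["tn" if label == 0 else "fn"] += 1
    return counts


preds = [1, 1, 0, 0, 1]
labels = [1, 0, 0, 1, 1]
print(confusion_counts(preds, labels))  # {'tp': 2, 'fp': 1, 'tn': 1, 'fn': 1}
```

Accuracy follows as `(tp + tn)` divided by the total count. A model that collapses to the majority class shows up as `tn` and `fn` both being zero.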


### Planned Correctness Tests
- Visually checking for decreasing loss during training.
- Verifying predictions with a confusion matrix.

## Interpretation

My last project went decently well, beating the majority-class accuracy of 62.17% on the test set. Before writing this interpretation, I toyed around with the `bert-large-cased` model, implementing and running it as quickly as possible just to see what it could do. With 333 million parameters in the Transformer model, I had to use a `batch_size` of 16 to avoid running out of memory. Given a single run over the weekend with a "looks about right" choice of hyperparameters, it reached a test accuracy of 72.63% after more than 23 hours of runtime. Impressed by this result, I am setting my expectation for the properly implemented and fine-tuned `bert-base-cased` model at a test accuracy of 69%. Nice.