# Transformers for BoolQ reading comprehension

## Sources

My sources for this project are linked in the respective sections of the notebook. I used AI tools such as ChatGPT to correct my writing and grammar in stage 1 of this project and plan to use them for debugging during stage 2.

## Setup

**Importing Python Packages**
To make sure the notebook is reproducible and runs without errors, I install the necessary libraries in a pip cell below.

**Data Loading and Split**
Each entry consists of a question, a passage and the answer. In total there are 12,697 entries in the dataset. Following the lecture slides, I split them into train (8427), validation (1000) and test (3270).

**Seeding for Reproducibility**
I set the random seed to 42 for reproducibility.
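
A minimal sketch of how the seed could be fixed, assuming only Python's random module and PyTorch are sources of randomness. The helper name set_seed is hypothetical, and NumPy would need its own seeding call if it is used.

```python
import random

import torch


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators used in this sketch."""
    random.seed(seed)        # Python's built-in RNG
    torch.manual_seed(seed)  # PyTorch RNG (CPU and CUDA devices)


set_seed(42)
first_draw = torch.rand(3)

set_seed(42)
second_draw = torch.rand(3)  # same values as first_draw after reseeding
```

Calling set_seed once at the top of the notebook, before the model is created, should make weight initialization and shuffling repeatable on the same hardware and library versions.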

## Preprocessing
### Tokenizer
In past projects I always did some sort of manual preprocessing of the data. In this project I deliberately refrain from any manual preprocessing and let the built-in features of the Hugging Face AutoTokenizer, loaded with from_pretrained("bert-base-cased"), handle the following steps for me:
- Whitespace cleanup and handling of special characters (e.g. emojis or phonetic pronunciations, which are mapped to [UNK] if they are not in the vocabulary)
- Case Sensitivity
- Padding and Truncation (pad automatically, truncate to a maximum of 512 tokens, the number of pretrained position embeddings)

I only recently learned about this from the Hugging Face Transformers [Preprocessing Data Documentation](https://huggingface.co/transformers/v3.0.2/preprocessing.html).

### Lowercase / Case Sensitivity
Based on the feedback I received, I now keep case sensitivity instead of lower-casing all text. For example, lower-casing would turn the word "US" into "us", which could change the meaning of a sentence drastically. <br>
*Source*: Feedback from Project 2 (LSTM)

### Padding / Truncation
I rely on the built-in padding and truncation functions of the AutoTokenizer from Hugging Face to manage sequence lengths efficiently:
- Questions are limited to a maximum of 21 tokens, based on the length of the longest question in the dataset.
- Passages are padded or truncated to a maximum of 488 tokens, ensuring that when the question (21 tokens), start token, end token, and separator token are included, the total length remains within the 512-token limit supported by the Transformer’s positional embeddings.
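
The token budget above can be checked with a small pure-Python sketch. It mimics the [CLS] question [SEP] passage [SEP] layout by hand instead of calling the tokenizer; the function name build_pair and the choice to pad only at the end of the sequence are assumptions for illustration.

```python
MAX_LEN = 512       # pretrained position embeddings of bert-base-cased
MAX_QUESTION = 21   # longest question in the dataset (from the text above)
NUM_SPECIAL = 3     # start token, separator token, end token
MAX_PASSAGE = MAX_LEN - MAX_QUESTION - NUM_SPECIAL  # 512 - 21 - 3 = 488


def build_pair(question_tokens, passage_tokens, pad_token="[PAD]"):
    """Truncate both segments, join them with special tokens, pad to MAX_LEN."""
    question = question_tokens[:MAX_QUESTION]
    passage = passage_tokens[:MAX_PASSAGE]
    tokens = ["[CLS]"] + question + ["[SEP]"] + passage + ["[SEP]"]
    attention_mask = [1] * len(tokens)  # 1 = real token, 0 = padding
    num_padding = MAX_LEN - len(tokens)
    return tokens + [pad_token] * num_padding, attention_mask + [0] * num_padding


# Hypothetical, already tokenized example
tokens, attention_mask = build_pair(["is", "this", "true", "?"], ["some", "passage", "text"])
```

In the notebook itself the tokenizer is expected to produce the same layout; the sketch only makes the 21 + 488 + 3 = 512 arithmetic explicit.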

### Stemming / Lemmatization / Stopword removal
From a past lecture I took away that stemming or lemmatization is not the right choice for a reading comprehension task because it removes valuable meaning.
I will therefore not apply stemming or lemmatization, in order to keep as much information as possible in my sequences. Stopwords will not be removed for the same reason.

### Embedding Layer
In this project, the embedding layer is implemented using PyTorch's nn.Embedding class. The embeddings are trained end-to-end alongside the rest of the model, allowing them to adapt to the specific nuances of the BoolQ dataset.
- **Vocabulary Size**: Determined by the tokenizer
- **Embedding Dimension**: Set to 300, a dimension commonly used by pretrained word embeddings such as fastText or word2vec.
- **Training**: Initialized randomly and updated during training through backpropagation.

### Absolute Position Embeddings
Since nn.TransformerEncoder does not include positional information by default, I will add it through learned absolute position embeddings. I choose learned embeddings over fixed positional encodings because they are more widely used in practice.
The learned absolute position embeddings are added to the word embeddings before the input is fed into the transformer model. They are initialized randomly and trained with the model through backpropagation. <br>
*Source*: Lecture on positional encodings
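
As a rough sketch of how the token and position embeddings could be combined, assuming a maximum length of 512 and a vocabulary size taken from the tokenizer: the class name TokenAndPositionEmbedding and the small example sizes are hypothetical.

```python
import torch
from torch import nn


class TokenAndPositionEmbedding(nn.Module):
    """Sum of a learned token embedding and a learned absolute position embedding."""

    def __init__(self, vocab_size: int, embedding_dim: int = 300, max_len: int = 512):
        super().__init__()
        # Both tables are randomly initialized and trained by backpropagation.
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = nn.Embedding(max_len, embedding_dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch_size, sequence_length) token IDs
        sequence_length = input_ids.size(1)
        positions = torch.arange(sequence_length, device=input_ids.device)
        # (sequence_length, embedding_dim) broadcasts over the batch dimension
        return self.token_embedding(input_ids) + self.position_embedding(positions)


# Small hypothetical sizes, only to show the shapes
embedding = TokenAndPositionEmbedding(vocab_size=100, embedding_dim=8, max_len=16)
example_ids = torch.randint(0, 100, (2, 16))
embedded = embedding(example_ids)  # shape (2, 16, 8)
```

Because the position table has max_len rows, sequences longer than max_len would raise an index error, which is one more reason to keep the 512-token limit.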

### Input / Output / Label format
Each data point in the dataset is made up of a question, a passage and the respective binary label. The preprocessing steps transform these into the following formats for my model inputs:
- Embedding Layer:
    - *input*: Tensor of (batch_size, sequence_length) containing token IDs.
    - *output*: Tensor of (batch_size, sequence_length, embedding_dim) with each token ID mapped to a dense vector of size embedding_dim.

- 6-Layer Transformer Encoder:
    - *input*: The embeddings with shape (batch_size, sequence_length, embedding_dim).
    - *output*: A Tensor of shape (batch_size, sequence_length, embedding_dim).

- Pooling Layer:
    - *input*: The output of the last transformer layer, with shape (batch_size, sequence_length, embedding_dim)
    - *output*: A Tensor of shape (batch_size, embedding_dim), representing the aggregated sequence information.

- 2-Layer Classifier:
    - *input*: The pooled output, with shape (batch_size, embedding_dim)
    - *output*: A tensor of shape (batch_size, hidden_dim) for the first layer and shape (batch_size, num_classes) for the final layer.

- Label format:
    - The labels will be encoded as boolean values, enabling the model to predict either 0 or 1 (False/True).

## Model
### Architecture
- **Input Layer**:
    - The first layer of my model is the nn.Embedding layer, which will be trained on the dataset together with the rest of the network.
    - Each input sequence consists of a concatenated question and passage with a [SEP] token between them, marking the boundary. The separator token allows the model to distinguish between the two segments.
    - The resulting shape of the input tensor after embedding is (batch_size, sequence_length, embedding_dim).
- **6-Layer Transformer Encoder**:
    - I use the PyTorch implementation of the Transformer Encoder. Its input is the output of the embedding layer with shape (batch_size, sequence_length, embedding_dim). Six layers are used to learn contextual representations of the concatenated question-passage sequence.
- **Pooling Layer**:
    - I apply *mean pooling* across the sequence length to reduce the output from (batch_size, sequence_length, embedding_dim) to (batch_size, embedding_dim). This yields a single fixed-size vector that summarizes the entire sequence, which keeps memory use efficient when training with varying sequence lengths and gives the classifier a fixed-size input.
- **2-Layer Classifier with ReLU**:
    - I will implement a two-layer classifier network as defined in the project assignment. The first layer takes the output of the pooling layer with shape (batch_size, embedding_dim) as its input and produces an output of shape (batch_size, hidden_dim), followed by a ReLU for non-linearity. The second layer has an output shape of (batch_size, num_classes) with num_classes=2. A softmax over these two outputs turns them into class probabilities; for two classes this is mathematically equivalent to a sigmoid on a single output, so the choice is one of convenience rather than performance.
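
The encoder, pooling and classifier stages could look roughly like the sketch below. Several details are assumptions and not part of the plan above: the class name EncoderClassifier, excluding padding positions from the mean, disabling the nested-tensor fast path so the output keeps its sequence length, and returning raw scores (logits) instead of softmax probabilities.

```python
import torch
from torch import nn


class EncoderClassifier(nn.Module):
    """6-layer Transformer encoder, mean pooling, 2-layer classifier."""

    def __init__(self, embedding_dim=300, num_heads=4, num_layers=6,
                 hidden_dim=128, num_classes=2, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, dropout=dropout, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings, attention_mask):
        # embeddings: (batch_size, sequence_length, embedding_dim)
        # attention_mask: (batch_size, sequence_length), 1 = token, 0 = padding
        hidden = self.encoder(embeddings, src_key_padding_mask=(attention_mask == 0))
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        # Mean over real tokens only (assumption: padding is excluded)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # logits, (batch_size, num_classes)


# Small hypothetical sizes, only to check the shapes
model = EncoderClassifier(embedding_dim=32, num_heads=4, hidden_dim=16)
example_embeddings = torch.randn(2, 10, 32)
example_mask = torch.ones(2, 10, dtype=torch.long)
example_mask[1, 6:] = 0  # second sequence has 4 padding positions
logits = model(example_embeddings, example_mask)
probabilities = torch.softmax(logits, dim=-1)
```

Returning logits keeps the softmax outside the model, which matters if a loss such as nn.CrossEntropyLoss is used, since that loss expects unnormalized scores.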

### Loss and Optimizer
For this binary classification task I'm using Binary Cross-Entropy Loss. BCE is widely used in binary classification problems, as it provides a probabilistic interpretation of the model's outputs, making it convenient for distinguishing between two classes. <br>
*Source*: [Binary Cross-Entropy/Log Loss for Binary Classification](https://www.geeksforgeeks.org/binary-cross-entropy-log-loss-for-binary-classification/)

For my optimizer I choose the Adam Optimizer for its adaptive learning rates and efficient handling of sparse gradients. It is well suited for deep learning tasks, provides fast convergence and has worked well in prior projects. <br>
*Source*: [Introduction to the Adam Optimizer](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
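
With two output classes, cross-entropy over the two scores and binary cross-entropy over their difference give the same value. The sketch below checks this numerically on random tensors and shows one possible Adam setup; the learning rate and weight decay are picked from the ranges listed under Experiments, and the nn.Linear stand-in model is hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(42)
logits = torch.randn(8, 2)            # hypothetical model outputs, (batch_size, num_classes)
labels = torch.randint(0, 2, (8,))    # hypothetical 0/1 labels

# Two-class cross-entropy on raw logits (softmax is applied inside the loss)
cross_entropy = nn.CrossEntropyLoss()(logits, labels)

# Binary cross-entropy on a single logit: score(True) - score(False)
binary_cross_entropy = nn.BCEWithLogitsLoss()(
    logits[:, 1] - logits[:, 0], labels.float()
)

# Adam with weight decay on a stand-in model
stand_in_model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(stand_in_model.parameters(), lr=1e-4, weight_decay=1e-5)
```

Either formulation fits the binary task; what matters is that the loss matches the shape of the model output and that a softmax or sigmoid is not applied twice.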

### Experiments
*Batch Size*: I will start with a batch_size of 16 and increase it to the maximum my hardware can handle. After that I leave it fixed, since I do not tune it as a hyperparameter.

To tune my model's hyperparameters I will be experimenting with the following ranges:
- Learning Rate: [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
- Embedding Dimension: [128, 256, 300]
- Hidden Dimension for Classifier: [64, 128, 256]
- Number of Attention Heads: [4, 8, 12, 16]
- Dropout Rate: [0.1, 0.2, 0.3]
- Weight Decay: [1e-4, 1e-5, 1e-6]
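
One constraint worth checking before running the grid: PyTorch's multi-head attention requires the embedding dimension to be divisible by the number of heads. The sketch below filters the listed values accordingly; the variable names are hypothetical.

```python
from itertools import product

embedding_dims = [128, 256, 300]
num_heads_options = [4, 8, 12, 16]

# Keep only combinations where each head gets an integer number of dimensions
valid_pairs = [
    (dim, heads)
    for dim, heads in product(embedding_dims, num_heads_options)
    if dim % heads == 0
]

for dim, heads in valid_pairs:
    print(f"embedding_dim={dim}, heads={heads}, dims per head={dim // heads}")
```

Of the 12 combinations, 8 remain. An embedding dimension of 300 only works with 4 or 12 heads, while 128 and 256 work with 4, 8 and 16 heads.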

### Training
I do not expect any run to need more than 25 epochs. I therefore limit the maximum number of epochs to 25 and implement early stopping as in past projects.

### Checkpointing and Early Stopping
**Checkpointing**: I will implement checkpointing and save the model whenever it reaches a new maximum validation accuracy.

**Early Stopping**: I stop a run early if the validation loss does not decrease within 15 epochs.
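
The two criteria can be tracked together after each epoch. This is a minimal sketch under the assumptions that a checkpoint is saved on a strictly higher validation accuracy and that patience counts consecutive epochs without a lower validation loss; the class name TrainingMonitor is hypothetical.

```python
class TrainingMonitor:
    """Tracks best validation accuracy (checkpointing) and loss (early stopping)."""

    def __init__(self, patience: int = 15):
        self.patience = patience
        self.best_val_accuracy = float("-inf")
        self.best_val_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss: float, val_accuracy: float):
        """Return (save_checkpoint, stop_training) for the epoch just finished."""
        save_checkpoint = val_accuracy > self.best_val_accuracy
        if save_checkpoint:
            self.best_val_accuracy = val_accuracy

        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1

        stop_training = self.epochs_without_improvement >= self.patience
        return save_checkpoint, stop_training


# Hypothetical validation curve with a short patience to show the behaviour
monitor = TrainingMonitor(patience=2)
history = [monitor.update(loss, acc)
           for loss, acc in [(1.00, 0.60), (0.90, 0.62), (0.95, 0.61), (0.96, 0.63)]]
```

With a maximum of 25 epochs and a patience of 15, early stopping can only trigger if the best validation loss is reached within the first 10 epochs.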

### Planned Correctness Tests
- Test the input shape to ensure the model receives a valid input format
- Test the output shape to verify the model produces the expected output shape
- Visually check that the loss is decreasing while training
- Visually check the output for overfitting
- Visually check predictions using a confusion matrix
- Ensure reproducibility by setting the random seed


## Evaluation
The percentage of yes answers in each data split is: Train: 62.64%, Val: 59.50%, Test: 62.17%.
Seeing how difficult it was in past projects to reach a much better accuracy than the majority-class baseline, I am setting my goal for the transformer model at 64% accuracy on the test set.

### Metrics
**Accuracy**: To evaluate model performance across different hyperparameter configurations, I will use validation accuracy as the primary metric.
**Confusion Matrix**: This will give a comprehensive view of true positives, true negatives, false positives, and false negatives, giving me deeper insight into the model’s performance.
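
Both metrics can be computed from the same four counts. The sketch below uses made-up labels and a predictor that always answers yes, to show how a majority-class model appears in the confusion matrix; the function names are hypothetical.

```python
def confusion_matrix(labels, predictions):
    """Count TP, TN, FP and FN for binary labels (1 = yes/True, 0 = no/False)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, prediction in zip(labels, predictions):
        if prediction == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["tn" if label == 0 else "fn"] += 1
    return counts


def accuracy(counts):
    """Share of correct predictions among all predictions."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Made-up example: 3 yes answers, 2 no answers, model always predicts yes
example_labels = [1, 1, 1, 0, 0]
always_yes = [1, 1, 1, 1, 1]
majority_counts = confusion_matrix(example_labels, always_yes)
majority_accuracy = accuracy(majority_counts)
```

A model that collapses to the majority class has zero true negatives and zero false negatives, and its accuracy equals the share of yes answers in the split.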

### Error Analysis
To understand why the model may fail on certain predictions, I will conduct an error analysis investigating whether misclassifications are related to the confidence score the model has in its predictions. Low confidence on correct answers or high confidence on wrong answers may indicate areas where the model is uncertain or overconfident.
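
One way to relate errors to confidence is to group predictions into confidence bins and compare the accuracy per bin. This is a sketch under assumptions: confidence is taken as the probability of the predicted class, the bin edges are arbitrary, and the function name and example values are hypothetical.

```python
def accuracy_by_confidence(prob_true, labels, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Count correct and total predictions per confidence bin."""
    stats = [
        {"low": low, "high": high, "correct": 0, "total": 0}
        for low, high in zip(edges[:-1], edges[1:])
    ]
    for probability, label in zip(prob_true, labels):
        prediction = 1 if probability >= 0.5 else 0
        confidence = max(probability, 1.0 - probability)
        # Find the first bin whose upper edge lies above the confidence;
        # a confidence of exactly 1.0 falls into the last bin.
        index = len(stats) - 1
        for i, bin_stats in enumerate(stats):
            if confidence < bin_stats["high"]:
                index = i
                break
        stats[index]["correct"] += int(prediction == label)
        stats[index]["total"] += 1
    return stats


# Hypothetical probabilities for the True class and their gold labels
bins = accuracy_by_confidence([0.95, 0.55, 0.15, 0.65], [1, 0, 0, 1])
```

If high-confidence bins show a low share of correct predictions, the model is overconfident; if most errors sit in the lowest bins, they are mainly cases where the model is uncertain.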

## Interpretation
My expectation for this project is to beat the majority class baseline of 62.17% on the test set. My last project wasn't very successful in that it only predicted the majority class every time. I received plenty of feedback on that project and I hope I can improve on a lot of points in this project.

Given the results from the LSTM implementation I am setting my expectation for the Transformer architecture to reach an accuracy of 63% to 65% on the test set.
