# Transformers for BoolQ reading comprehension
*All changes and additions compared to Stage 1 are marked in <span style="color: orange;">orange</span>*

## Sources

My sources for this project are linked in the respecting sections of the notebook. I used AI tools such as ChatGPT to correct my writing and grammar in stage 1 of this project and plan on using it for debugging during stage 2.

## <span style="color: orange;">TLDR</span>



## Setup

**Importing Python Packages**
Making sure the notebook is reproducible and runs without error, I will install the necessary libraries in a pip cell below.

**Data Loading and Split**
The data consists of the questions, a passage and the answer. In total there are 12'697 entries in the dataset. Splitting them according to the lecture slides into train (8427), validation (1000) and test (3270).

**Seeding for Reproducibility**
Setting the random Seed to 42 for reproducibility.

In [60]:
# TODO: make the pip install for used libraries and packages !!!
# %pip install -q 

In [61]:
import wandb
import random
from datasets import load_dataset
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import WandbLogger
from torchmetrics.classification import BinaryAccuracy, BinaryF1Score, BinaryConfusionMatrix
import matplotlib.pyplot as plt
import seaborn as sns

import optuna
import os

In [62]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 128
print(DEVICE)

cpu


In [63]:
pl.seed_everything(42, workers=True)

Seed set to 42


42

In [64]:
# Loading the dataset based on lecture slides
train_data = load_dataset('google/boolq', split='train[:-1000]')
validation_data = load_dataset('google/boolq', split='train[-1000:]')
test_data = load_dataset('google/boolq', split='validation')

In [65]:
test_question = train_data[5]['question']
test_passage = train_data[5]['passage']
print(train_data[5])
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(validation_data)}")
print(f"Number of validation samples: {len(test_data)}")

train_yes_count = sum(1 for label in train_data['answer'] if label == 1)
train_no_count = sum(1 for label in train_data['answer'] if label == 0)
train_total = train_yes_count + train_no_count

validation_yes_count = sum(1 for label in validation_data['answer'] if label == 1)
validation_no_count = sum(1 for label in validation_data['answer'] if label == 0)
validation_total = validation_yes_count + validation_no_count

test_yes_count = sum(1 for label in test_data['answer'] if label == 1)
test_no_count = sum(1 for label in test_data['answer'] if label == 0)
test_total = test_yes_count + test_no_count

print(f"Train set (yes/no) Ratio: {round(train_yes_count / train_no_count, 2)}, Percent Yes: {round(train_yes_count / train_total * 100, 2)}%")

print(f"Validation set (yes/no) Ratio: {round(validation_yes_count / validation_no_count, 2)}, Percent Yes: {round(validation_yes_count / validation_total * 100, 2)}%")

print(f"Test set (yes/no) Ratio: {round(test_yes_count / test_no_count, 2)}, Percent Yes: {round(test_yes_count / test_total * 100, 2)}%")

{'question': 'can you use oyster card at epsom station', 'answer': False, 'passage': "Epsom railway station serves the town of Epsom in Surrey. It is located off Waterloo Road and is less than two minutes' walk from the High Street. It is not in the London Oyster card zone unlike Epsom Downs or Tattenham Corner stations. The station building was replaced in 2012/2013 with a new building with apartments above the station (see end of article)."}
Number of training samples: 8427
Number of validation samples: 1000
Number of validation samples: 3270
Train set (yes/no) Ratio: 1.68, Percent Yes: 62.64%
Validation set (yes/no) Ratio: 1.47, Percent Yes: 59.5%
Test set (yes/no) Ratio: 1.64, Percent Yes: 62.17%


## Preprocessing
### Tokenizer
In past projects I always did some sort of manual preprocessing of the data. In this project I deliberately refrain from any manual preprocessing and will let the built-in features of the AutoTokenizer with the from_pretrained("bert-base-cased") model handle the following steps for me:
- Whitespace and Special Character removal (e.g. emojis or phonetic pronunciations)
- Case Sensitivity
- Padding and Truncation (pad automatically, truncate to max: 512 tokens - amount of pretrained position embeddings)

I only now found out about this from the Hugging Face Transformer [Preprocessing Data Documentation](https://huggingface.co/transformers/v3.0.2/preprocessing.html).

### Lowercase / Case Sensitivity
From my feedback I will now keep case sensitivity instead of lower-casing all text. Example of case sensitivity: the word "US" would become "us" and could thus change the meaning of a sentence drastically. <br>
*Source*: Feedback from Project 2 (LSTM)

### Padding / Truncation
I rely on the built-in padding and truncation functions of the AutoTokenizer from Hugging Face to manage sequence lengths efficiently:
- Questions are limited to a maximum of 21 tokens, based on the length of the longest question in the dataset.
- Passages are padded to a maximum of 488 tokens, ensuring that when the question (21 tokens), start token, end token, and separator token are included, the total length remains within the 512-token limit supported by the Transformer’s positional embeddings.

### Stemming / Lemmatization / Stopword removal
From a past lecture I took away that stemming or lemmatization is not the right choice for a reading comprehension task. It removes valuable meaning
No stemming or lemmatization will be done in my preprocessing as to keep the most amount of information possible in my sequences. Stopwords will also not be removed for the same reason.

### Embedding Layer
In this project, the embedding layer is implemented using PyTorch's nn.Embedding class. The embeddings are trained end-to-end alongside the rest of the model, allowing them to adapt to the specific nuances of the BoolQ dataset.
- **Vocabulary Size**: Determined by the tokenizer
- **Embedding Dimension**: Set to 300 as this is widely used by large pretrained embedding models like fastText or word2vec.
- **Training**: Initialized randomly and updated during training through backpropagation.

### Absolute Position Embeddings
Since the nn.TransformerEncoder does not by default have positional embeddings I will be implementing them through absolute position embeddings. Choosing the embeddings over the encoding because it is more widely used in practice.
Adding the learned absolute positional embeddings to the word embeddings before feeding the input into the transformer model. The position embeddings are initialized randomly and are trained with the model through backpropagation.
*Source*: Lecutre on positional encodings

### Input / Output / Label format
Each data point in the dataset is made up of a questions, passage and the respective binary label. The preprocessing steps transform these into the following formats for my model inputs:
- Embedding Layer:
    - *input*: Tensor of (batch_size, sequence_length) containing token IDs.
    - *output*: Tensor for (batch_size, sequence_length, embedding_dim) with each token ID mapped to a dense vector of size embedding_dim.

- 6-Layer Transformer Encoder:
    - *input*: The embeddings with shape (batch_size, sequence_length, embedding_dim).
    - *output*: A Tensor of shape (batch_size, sequence_length, embedding_dim).

- Pooling Layer:
    - *input*: The output of the last transformer layer, with shape (batch_size, sequence_length, embedding_dim)
    - *output*: A Tensor of shape (batch_size, embedding_dim), representing the aggregated sequence information.

- 2-Layer Classifier:
    - *input*: The pooled output, with shape (batch_size, embedding_dim)
    - *output*: A tensor of shape (batch_size, hidden_dim) for the first layer and shape (batch_size, num_classes) for the final layer.

- Label format:
    - The labels will be encoded as boolean values, enabling the model to predict either 0 or 1 (False/True).

In [66]:
# initialize tokenizer with bert-base-cased model

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
VOCAB_SIZE = tokenizer.vocab_size


def get_max_question_len(dataset):
    max_len = 0
    for item in dataset:
        question = item['question']
        tokenized_question = tokenizer.encode(question)
        max_len = max(max_len, len(tokenized_question))
    return max_len

max_question_len = get_max_question_len(train_data)
print(max_question_len) # max len of BPE tokenized question (including the CLS and SEP tokens) 


29


In [67]:
def tokenize_batch(batch):
    questions = batch['question']
    passages = batch['passage']
    
    encodings = tokenizer(
        questions,
        passages,
        max_length=512,  # Combined max length within Transformer limit
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    
    return {'input_ids': encodings['input_ids'], 'labels': torch.tensor(batch['answer'])}  # output of input_ids and labels

In [68]:
# Tokenize the datasets
train_data = train_data.map(tokenize_batch, batched=True).with_format("torch", device=DEVICE)
validation_data = validation_data.map(tokenize_batch, batched=True).with_format("torch", device=DEVICE)
test_data = test_data.map(tokenize_batch, batched=True).with_format("torch", device=DEVICE)

Map:   0%|          | 0/3270 [00:00<?, ? examples/s]

In [69]:
# Define collate function for dynamic padding in DataLoader
def collate_fn(batch):
    input_ids = [torch.tensor(item['input_ids']) for item in batch]
    labels = torch.tensor([item['labels'] for item in batch])
    
    # Pad to the longest sequence in the batch
    input_ids = nn.utils.rnn.pad_sequence(input_ids, batch_first=True)
    return {'input_ids': input_ids, 'labels': labels}

## Model
### Architecture
- **Input Layer**:
    - The input to my model is the nn.Embedding layer that will be trained on the dataset with the network.
    - Each input sequence consists of a concatenated question and passage with a [SEP] token between them, marking the boundary. The separator token allows the model to distinguish between the two segments.
    - The resulting shape of the input tensor after embedding is (batch_size, sequence_length, embedding_dim).
- **6-Layer Transformer Encoder**:
    - Using the PyTorch implementation of the Transformer Encoder. The input to this model will be the output of the embedding layer with shape (batch_size, sequence_length, embedding_dim). Using six layers to learn contextual representations of the concatenated questino-passage sequence.
- **Pooling Layer**:
    - Apply *mean pooling* across the sequence length to reducing the output from (batch_size, sequence_length, embedding_dim) to (batch_size, embedding_dim). This provides a fixed-size single vector that summarizes the entire sequence for the classifier which provides the advantage of efficient memory use in training with varying sequence lengths and a fixed-sized input for my classifier.
- **2-Layer Classifier with ReLU**
    - I will implement a two-layer classifier network as defined in the project assignment. The first layer will take the output from the pooling layer of size (batch_size, embedding_dim) as its input and provide an output shape of (batch_size, hidden_dim). Using a ReLU for non-linearity. The second layer has output dimensions of (batch_size, num_classes) with num_classes=2. The output layer will use a softmax as the activation function as it is preferable over a sigmoid function for binary classification.

### Loss and Optimizer
For this binary classification task I'm using Binary Cross-Entropy Loss. BCE is widely used in binary classification problems, as it provides a probabilistic interpretation of the model's outputs, making it convenient for distinguishing between two classes. <br>
*Source*: [Binary Cross-Entropy/Log Loss for Binary Classification](https://www.geeksforgeeks.org/binary-cross-entropy-log-loss-for-binary-classification/)

For my optimizer I choose the Adam Optimizer for its adaptive learning rates and efficient handling of sparse gradients. It is well suited for deep learning tasks, provides fast convergence and has worked well in prior projects. <br>
*Source*: [Introduction to the Adam Optimizer](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)

### Experiments
*Batch Size*: I will start with a batch_size of 16 and increase it to the maximum my hardware can handle then leaving it fixed as it is not a hyperparameter.

To tune my models' hyperparameters I will be experimenting with the following ranges:
- Learning Rate: [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
- Embedding Dimension: [128, 256, 300]
- Hidden Dimension for Classifier: [64, 128, 256]
- Number of Attention Heads: [4, 8, 12, 16]
- Dropout Rate: [0.1, 0.2, 0.3]
- Weight Decay: [1e-4, 1e-5, 1e-6]

### Training
I do not expect any run to take longer than 25 epochs. Thus limiting the maximum number of epochs to 25 and implement the early stopping criteria like in past projects.

### Checkpointing and Early Stopping
**Checkpointing**: I will implement checkpointing to save the model with the best validation accuracy. Criteria for this will be the maximum validation accuracy.

**Early Stopping**: Early stopping the run if the validation loss does not decreas within 15 epochs.

### Planned Correctness Tests
- Testing input shape to ensure the model receives a valid input format
- Testing output shape to verify the model produces the expected output shape
- Visually check the loss is decreasing while training
- Visually check the output for overfitting
- Visually check predictions using a confusion matrix
- Ensure reproducibility by setting the random seed.


In [70]:
# DataLoaders
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_data, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

In [75]:
import torch
from torch import nn
import pytorch_lightning as pl
from torch.optim.lr_scheduler import LambdaLR

class TransformerClassifier(pl.LightningModule):
    def __init__(
            self,
            vocab_size,
            embedding_dim,
            num_heads,
            hidden_dim,
            num_layers=6,
            num_classes=2,
            dropout_rate=0.1,
            learning_rate=1e-5,
            warmup_steps=0
    ):
        super().__init__()
        
        # Store hyperparameters
        self.learning_rate = learning_rate
        self.warmup_steps = warmup_steps
        
        # Embedding Layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Positional Embedding layer
        self.position_embedding = nn.Embedding(512, embedding_dim)  # 512 max length
        
        # Linear layer to project from embedding_dim to hidden_dim (input dim of Transformer)
        self.input_projection = nn.Linear(embedding_dim, hidden_dim)
        
        # Transformer encoder with hidden_dim as input dimension
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=hidden_dim * 2,  # Adjusting the feedforward layer dimension as needed
            dropout=dropout_rate,
            activation="relu"
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Pooling layer (mean pooling)
        self.pooling = nn.AdaptiveAvgPool1d(1)
        
        # Classifier: 2-layer MLP with dropout
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim // 2, num_classes)
        )
        
        # Loss function
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids):
        # Apply embedding layer
        embedded = self.embedding(input_ids)  # Shape: (batch_size, seq_length, embedding_dim)
        
        # Generate positional embeddings and add to token embeddings
        positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        pos_embeddings = self.position_embedding(positions)
        embedded = embedded + pos_embeddings
        
        # Project to hidden_dim for input to the Transformer
        projected = self.input_projection(embedded)  # Shape: (batch_size, seq_length, hidden_dim)
        
        # Pass through Transformer encoder
        projected = projected.permute(1, 0, 2)  # Transformer expects (seq_length, batch_size, hidden_dim)
        encoded = self.transformer_encoder(projected)
        encoded = encoded.permute(1, 0, 2)  # Back to (batch_size, seq_length, hidden_dim)
        
        # Apply pooling to get a fixed-size vector
        pooled = self.pooling(encoded.transpose(1, 2)).squeeze(-1)  # (batch_size, hidden_dim)
        
        # Classify
        logits = self.classifier(pooled)
        return logits

    def training_step(self, batch, batch_idx):
        input_ids = batch['input_ids'].to(self.device)
        labels = batch['labels'].float().to(self.device)
        
        # Forward pass
        outputs = self(input_ids)
        loss = self.loss_fn(outputs.squeeze(), labels.long())
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch['input_ids'].to(self.device)
        labels = batch['labels'].float().to(self.device)
        
        outputs = self(input_ids)
        loss = self.loss_fn(outputs.squeeze(), labels.long())
        self.log('val_loss', loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        # AdamW optimizer
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)

        # Learning rate scheduler with warmup
        def lr_lambda(current_step):
            if current_step < self.warmup_steps:
                return float(current_step) / float(max(1, self.warmup_steps))
            return 1.0

        scheduler = {
            'scheduler': LambdaLR(optimizer, lr_lambda=lr_lambda),
            'interval': 'step',  # Update the learning rate after each batch
            'name': 'learning_rate'
        }
        
        return [optimizer], [scheduler]


In [76]:
BATCH_SIZE = 32
SEQ_LEN = 512
DIM = 300


model = TransformerClassifier(
    vocab_size=VOCAB_SIZE,
    embedding_dim=DIM,
    num_heads=8,
    hidden_dim=256,
)

x = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN)).to(DEVICE)

# Run the forward pass and check the output shape
assert model.forward(x).shape == torch.Size([BATCH_SIZE, 2])

In [77]:
# Configuration
CONFIG = {
    "embedding_dim": 300,
    "num_heads": 8,
    "hidden_dim": 512,
    "dropout_rate": 0.1,
    "learning_rate": 1e-5,
    "warmup_steps": 1000,
    "num_classes": 2,
    
    "vocab_size": VOCAB_SIZE,
    "num_layers": 6
}

# Initialize model with config
TransformerModel = TransformerClassifier(
    vocab_size=CONFIG["vocab_size"],
    embedding_dim=CONFIG["embedding_dim"],
    num_heads=CONFIG["num_heads"],
    hidden_dim=CONFIG["hidden_dim"],
    num_layers=CONFIG["num_layers"],
    num_classes=CONFIG["num_classes"],
    dropout_rate=CONFIG["dropout_rate"],
    learning_rate=CONFIG["learning_rate"],
    warmup_steps=CONFIG["warmup_steps"]
)

run_name = (
    f"embedding_dim_{CONFIG['embedding_dim']}-"
    f"num_heads_{CONFIG['num_heads']}-"
    f"hidden_dim_{CONFIG['hidden_dim']}-"
    f"dropout_{CONFIG['dropout_rate']}-"
    f"learning_rate_{CONFIG['learning_rate']}-"
    f"warmup_steps_{CONFIG['warmup_steps']}-"
    f"num_classes_{CONFIG['num_classes']}-"
    f"batch_size_{BATCH_SIZE}-"
    f"num_layers_6"
)

print("Run Name:", run_name)

wandb_logger = WandbLogger(
    project='nlp_p3_transformer',
    name=run_name,        
)

for key, value in CONFIG.items():
    wandb_logger.experiment.config[key] = str(value)

wandb_logger.log_hyperparams(TransformerModel.hparams)

trainer = pl.Trainer(
    logger=wandb_logger,
    max_epochs=60,
    log_every_n_steps=10
)

trainer.fit(TransformerModel, train_loader, validation_loader)

wandb.finish()

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


Run Name: embedding_dim_300-num_heads_8-hidden_dim_512-dropout_0.1-learning_rate_1e-05-warmup_steps_1000-num_classes_2-batch_size_32-num_layers_6



  | Name                | Type               | Params | Mode 
-------------------------------------------------------------------
0 | embedding           | Embedding          | 8.7 M  | train
1 | position_embedding  | Embedding          | 153 K  | train
2 | input_projection    | Linear             | 154 K  | train
3 | transformer_encoder | TransformerEncoder | 12.6 M | train
4 | pooling             | AdaptiveAvgPool1d  | 0      | train
5 | classifier          | Sequential         | 131 K  | train
6 | loss_fn             | CrossEntropyLoss   | 0      | train
-------------------------------------------------------------------
21.8 M    Trainable params
0         Non-trainable params
21.8 M    Total params
87.020    Total estimated model params size (MB)
72        Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

  input_ids = [torch.tensor(item['input_ids']) for item in batch]
/Users/blackbook/anaconda3/envs/nlp/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

wandb-core(51633) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51639) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51645) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51646) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51649) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51651) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51652) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51654) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.

Detected KeyboardInterrupt, attempting graceful shutdown ...
wandb-core(51657) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
wandb-core(51659) Ma

NameError: name 'exit' is not defined

## Evaluation
The percentage of yes answers in each data split is: Train; 62.64%, Val; 59.50%, Test;62.17%
Seeing how difficult it was in past projects to reach a much better accuracy than the baseline majority class I am setting my goal for the transformer model at 64% accuracy on the test set.

### Metrics
**Accuracy**: To evaluate model performance across different hyperparameter configurations, I will use validation accuracy as the primary metric.
**Confusion Matrix**: This will give a comprehensive view of true positives, true negatives, false positives, and false negatives, allowing me deeper insight into the model’s performance.

### Error Analysis
To understand why the model may fail on certain predictions, I will conduct an error analysis investigating weather missclassifications are related to the confidence score the model has in it's predictions. Low confidence on correct answers or high confidence on wrong answers may indicate areas where the model is uncertain or overconfident.

## Interpretation
My expectation for this project are to beat the majority class baseline of 62.17% on the test set. My last project wasn't very successufl in that it only predicted the majority class every time. The feedback on that project was plenty and I hope I can improve on a lot of points for this project.

Given the results form the LSTM implementation I am setting my expecation for the Transformer architecture to reach an accuracy of 63% to 65% on the test set.
