# Project 1: Reading comprehension



# Introduction

Tutorial: https://lightning.ai/docs/pytorch/stable/starter/introduction.html#

W&B Link: TODO

# Setup

## Dependencies
Install all necessary dependencies:
- PyTorch: `torch lightning`
- Hugging Face: `huggingface_hub datasets`
- Weights & Biases: `wandb`
- NLTK: `nltk`
- fastText: `fasttext`
- Optuna: `optuna`, with these references for hyperparameter tuning with Lightning:
    - https://docs.ray.io/en/latest/tune/examples/tune-pytorch-lightning.html
    - https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py

Optional
- Lint and Formatting: `ruff`

Dependencies are pinned to the versions the code was created with.

## Notebook setup
Log into Hugging Face and Weights & Biases.

## Tools used
- Visual Studio Code
- GitHub Copilot

In [8]:
%pip install torch lightning huggingface_hub datasets wandb nltk fasttext ruff

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp312-cp312-linux_x86_64.whl size=5206042 sha256=48d5807696dcfa98e3778a78c989edf5a2e83ccc50cfcd438757be84f7243e63
  Stored in directory: /home/codespace/.cache/pip/wheels/20/27/95/a7baf1b435f1cbde017cabdf1e9688526d2b0e929255a359c6
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.3 pybind11-2.13.6
Note: you may need to restart the kernel to use updated packages.


In [1]:
from datasets import load_dataset


from huggingface_hub import login, hf_hub_download
import wandb
from lightning.pytorch.loggers import WandbLogger
from torch import optim, nn, utils
import lightning as L
import nltk
import torch
import fasttext



In [None]:
login()

In [None]:
wandb.login()

# Preprocessing

Predefined requirements:
- Train / validation / test split
- Existing word embedding model: word2vec, GloVe, or fastText

Download the BoolQ dataset with `datasets` and split it in the predefined way.

Data treatment steps:
- Tokenize
- Lowercase
- Remove stop words
- Lemmatize
- Truncate the passage (enforce a maximum length chosen so that ~99% of passages are not truncated)
- Embed with fastText

Used features: `question`, `answer`, `passage` (all of them)

Input format: `question` and `passage` vectors

Label format: `answer` as 1 or 0

Batching: none, the dataset is small enough (about 8.4k rows in the train split)

Correctness tests:
Check that the texts still make sense before they are embedded.
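The truncation step could be sketched as follows. This is a minimal sketch, assuming each text is already a list of token vectors of equal dimension. `choose_max_len` and `pad_or_truncate` are hypothetical helper names that are not part of this notebook, and the 0.99 quantile mirrors the "~99% not truncated" target.

```python
import numpy as np


def choose_max_len(lengths, keep_fraction=0.99):
    """Return a cutoff at the keep_fraction quantile of the text lengths."""
    return int(np.ceil(np.quantile(lengths, keep_fraction)))


def pad_or_truncate(vectors, max_len, dim):
    """Cut a sequence of token vectors to max_len rows, or zero-pad it up to max_len."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    clipped = np.asarray(vectors[:max_len], dtype=np.float32).reshape(-1, dim)
    out[: len(clipped)] = clipped
    return out
```

Every text then has the same shape `(max_len, dim)`, which is what a fixed-size input layer needs.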

In [2]:
nltk.download('stopwords')
model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
train_raw = load_dataset("google/boolq", split="train[:-1000]")
valid_raw = load_dataset("google/boolq", split="train[-1000:]")
test_raw = load_dataset("google/boolq", split="validation")

print(len(train_raw), len(valid_raw), len(test_raw))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


8427 1000 3270



In [None]:
embedding_model = fasttext.load_model(model_path)

In [None]:
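# Tokenizer and lemmatizer resources used below
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")

stop_words = set(nltk.corpus.stopwords.words("english"))
# Also drop quote and possessive tokens produced by the tokenizer
stop_words |= {'"', "'", "''", "`", "``", "'s"}

lemmatizer = nltk.stem.WordNetLemmatizer()

text_values = ["question", "passage"]


def preprocess_row(row):
    # Lowercase, tokenize, drop stop words, lemmatize, then embed each token
    for key in text_values:
        tokens = nltk.word_tokenize(row[key].lower())
        row[key] = [
            # Plain lists so the dataset can store the vectors
            embedding_model.get_word_vector(lemmatizer.lemmatize(word)).tolist()
            for word in tokens
            if word not in stop_words
        ]
    return row


def preprocess_dataset(dataset):
    # Rows yielded by a Dataset are copies, so assigning to them is lost;
    # map() returns a new dataset with the transformed rows
    return dataset.map(preprocess_row)


# train = preprocess_dataset(train_raw)
valid = preprocess_dataset(valid_raw)
# test = preprocess_dataset(test_raw)
valid[0]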

# Model

Predefined requirements:
- Classifier
    - 2 layers
    - ReLU

Network architecture:
- Input layer
    - dim: max len of input x 2
- 2 hidden layers
    - dim: 0.5 * (max len of input x 2)
    - ReLU activation
- Output layer
    - output of class (1 true, 0 false)
        - dim: 1x1
        - the final answer is a hard yes or no, not "60% yes", so the sigmoid output is thresholded (for example at 0.5)
    - sigmoid activation
- Normalization: done in preprocessing
- Regularization: left to the optimizer (for example weight decay in AdamW)
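The architecture above can be sketched in plain PyTorch to check the shapes. This is only a sketch under assumptions: `in_size = 64` is a placeholder for the flattened input size, and the 0.5 threshold is one possible way to turn the sigmoid output into a yes/no answer.

```python
import torch
from torch import nn

torch.manual_seed(0)

in_size = 64  # assumed flattened input size, not the real value
hidden_size = in_size // 2

net = nn.Sequential(
    nn.Linear(in_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
    nn.Sigmoid(),
)

batch = torch.randn(4, in_size)  # 4 random stand-in samples
probs = net(batch).squeeze(-1)   # shape (4,), each value between 0 and 1
answers = probs >= 0.5           # boolean yes/no decision per sample
```

Keeping the sigmoid in the network and thresholding afterwards lets the loss work on the continuous output while the reported answer stays binary.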

Loss function, either:
- HingeEmbeddingLoss: measures whether two inputs are similar or not
- Binary cross-entropy: separates two classes

Optimizer, either:
- Adam: good default choice
- AdamW: Adam with decoupled weight decay

Experiments:
- Different loss and optimizer combinations
- Size of input and hidden layers
- Epochs, learning rate

Checkpoints: every few epochs

Early stopping: if the loss does not improve

Correctness test:
?
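The early stopping rule can be illustrated with a few lines of plain Python. Lightning ships an `EarlyStopping` callback for this purpose; the sketch below only shows the patience logic, and `early_stop_epoch`, `patience`, and `min_delta` are hypothetical names chosen for the illustration.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best loss: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

A larger `patience` tolerates noisy validation losses at the cost of a few extra epochs.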

In [None]:
class QClassifier(L.LightningModule):
    def __init__(self, loss, optimizer, in_size=64, lr=1e-3):
        super().__init__()
        self.lr = lr
        hidden_size = in_size // 2  # layer sizes must be integers
        # Two hidden layers with ReLU, then a single sigmoid output unit
        self.classifier = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),
        )

        self.loss = loss
        self.optimizer = optimizer

    def forward(self, x):
        # Flatten each sample to a vector of length in_size
        return self.classifier(x.view(x.size(0), -1)).squeeze(-1)

    def training_step(self, batch, batch_idx):
        # Assumes each batch is an (inputs, labels) pair
        x, y = batch
        loss = self.loss(self(x), y.float())
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return self.optimizer(self.parameters(), lr=self.lr)


# Binary cross-entropy matches the sigmoid output and the 0/1 labels
loss = nn.functional.binary_cross_entropy
optimizer = optim.Adam
classifier = QClassifier(loss, optimizer)
logger = WandbLogger(log_model="all")

# Training

Initialize the Weights & Biases project for this project.

Use k-fold cross-validation to guard against overfitting to a single validation split.

1. Define an experiment with different hyperparameters
2. Train the model with the train dataset split
3. Check model performance with the validation dataset split
4. Log the training run to Weights & Biases
5. Repeat

After all experiments have run, select the best run's hyperparameters for the final model.
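The k-fold split could be sketched with NumPy as below. This is a minimal sketch, not the notebook's implementation: `k_fold_indices` is a hypothetical helper, and `k=5` and the seed are arbitrary choices.

```python
import numpy as np


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every row lands in exactly one validation fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_rows), k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx
```

With a Hugging Face dataset, the index arrays could be passed to `Dataset.select` to build the train and validation subsets for each fold.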

In [None]:
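# The logger is a Trainer argument; trainer.fit() does not accept it
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, logger=logger)
# Requires the preprocessed train split (see the commented-out call in Preprocessing)
train_loader = utils.data.DataLoader(train)
trainer.fit(model=classifier, train_dataloaders=train_loader)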

# Evaluation

Metrics:
- F1 score
- AUC (area under the ROC curve)

Averaging: ?

Error analysis:
- Confusion matrix
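For reference, the confusion matrix counts and the F1 score for binary labels can be computed by hand as sketched below. `confusion_counts` and `f1_score` are hypothetical helper names written for this illustration; a metrics library would normally be used instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0.0 when there are no positives at all."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

For example, `y_true = [1, 1, 0, 0, 1]` and `y_pred = [1, 0, 0, 1, 1]` give the counts `(2, 1, 1, 1)` and an F1 score of 4/6.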

# Interpretation

Compare the results of the final model to expectations.
Run the evaluation of the final model with the test dataset split,
but only at the very end. Evaluating on the test split earlier would leak test information into model selection.

Expectation: 70% accuracy with test dataset.