# Introduction

This project aims to determine whether a given question can be answered based on a provided context. The answers are binary, either True or False.

For example, a question like "Do you always have to say check in chess?" is paired with a passage of varying length. Each question can be answered by the context provided. In this case, the answer would be "False," since the passage indicates that saying "check" is not mandatory.

## Executive Summary

**Data Description**: <br>
The dataset was split into 8,430 training samples, 1,000 for validation, and 3,270 for testing.

**Description of Methods**: <br>
The data was preprocessed using several steps: tokenization to break text into individual tokens, removal of special characters, URL and phoentic pronounciation. All passages were padded to an equal length of 162 tokens and the question were padded to the length of the longest question which was 21 tokens. FastText word embeddings with a dimension of 300 was used to represent each word in the dataset. The final input for the model consisted of concatenated question and passage tensors, each represented by 300-dimensional embeddings.

**Model Architecture**: <br>
The model was built using an LSTM-based architecture. It consisted of an input layer for the FastText embeddings, two LSTM layers and a feed-forward classifier. The classifier used a two-layer architecture, with a ReLU activation function between the layers and a sigmoid activation in the final layer to predict a binary outcome (True/False). Binary Cross-Entropy Loss was used for training, and Adam Optimizer was chosen for optimization.

**Experiments**: <br>
Experiments were conducted using various configurations, including different LSTM hidden dimensions, feed-forward classifier hidden dimensions, dropout rates, and the use of bidirectional LSTMs. Hyperparameters were tuned using Optuna to find the optimal settings. 

**Results**: <br>
The model with the highest validation accuracy and the model with the highest validation accuracy among the bidirectional LSTM was selected for evaluation. The bidirectional LSTM achieved the better accuracy of 0.6314 while the classic LSTM achieved an accuracy of 0.6064 on the test set.  

Despite extensive tuning, the model continued to struggle with class imbalance, often favoring the majority class (True). Nevertheless, the introduction of bidirectional LSTMs and an increased dimension of the fasttext model to 300 provided some improvements.

In conclusion, while the model achieved a slight improvement over a simple dummy classifier (which just predicts the most frequent label), the performance did not meet my target of 65-70% accuracy. I assume that a transformer technology would help to further increase the accuracy. 

Weights&Biases Report: https://api.wandb.ai/links/nlp_luca_gafner/l84vqf45 <br>
Weights&Biases Runs: https://wandb.ai/nlp_luca_gafner/Project_2_RNN/table?nw=nwuserlukii

# Set Up

**Python Packages**
To ensure the notebook runs smoothly from start to finish, I will install all necessary dependencies using pip in this cell.

**Load Dataset**
Following the instructions from the lecture notes, I will also load the dataset in this setup section. The train/validation/test split is already performed during the dataset import.

In [1]:
%pip install -q datasets torch pytorch-lightning nltk fasttext huggingface_hub wandb optuna

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset, Dataset
from huggingface_hub import hf_hub_download

import numpy as np
import nltk
from nltk.tokenize import RegexpTokenizer
import numpy as np
import re
import fasttext
import fasttext.util

# Model Architecture
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import WandbLogger
from torchmetrics.classification import BinaryAccuracy, BinaryF1Score, BinaryConfusionMatrix
import matplotlib.pyplot as plt
import seaborn as sns


import wandb
import random
import optuna
from pathlib import Path
import os

In [3]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)


pl.seed_everything(42, workers=True)

Seed set to 42


cuda


42

In [4]:
#Downlaod the FastText Embedding model
model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
ft = fasttext.load_model(model_path)
print(ft.get_dimension())
#fasttext.util.reduce_model(ft, 50) # Better performance with Embeddings Dimension of 300

FT_DIMENSION = ft.get_dimension() # Colab crashes since the RAM is limited therefore I couldn't reduce it.


300


In [5]:
train = load_dataset("google/boolq", split="train[:-1000]")
valid = load_dataset("google/boolq", split="train[-1000:]")
test = load_dataset("google/boolq", split="validation")
print(len(train), len(valid), len(test))

8427 1000 3270


# Preprocessing

### Tokenization:
For tokenization, I will utilize the nltk library as it provides a reliable tokenizer. Specifically, I plan to use nltk.tokenize.RegexpTokenizer since it allows filtering out special characters like [!, ?].<br>*Source*: [NLTK RegexpTokenizer](https://www.geeksforgeeks.org/python-nltk-tokenize-regexp/)

### Lowercase / Case Preservation
According to the Feedback from the first project, I will no longer lowercased the words since they could become a different meaning during the word Embedding with Fasttext. Example: If I lowercase the word "US" it would become "us" which has a significantly different meaning and therefore lowercasing in combination with FastText is not a useful choice. <br>
*Source*: Feedback provided from Project 1

### Stemming / Lemmatization
Stemming and Lemmatization are not the right choice as preprocessing steps for this reading comprehension task while using fastText. These methods can lead to information loss and potentially change the meaning of a sentene. Example: When I would lemmatize the word "leaves" it becomes "leaf" (*Source: Feedback Project 1*) and that could totally change the meaning of a complete sentences. E.g. from "The person leaves the room" to "The person leaf the room." Additionally, fastText's ability to handle out-of-vocabulary words and capture subword information makes it less necessary to reduce words to their base forms.

*Source*: [FastText Doc](https://fasttext.cc/docs/en/faqs.html) /
[Word Embeddings Using FastText (Geeksforgeeks)](https://www.geeksforgeeks.org/word-embeddings-using-fasttext/) / Feedback from Project 1

### Format Cleaning / Removal of other words
I will remove / clean specific elements that don't contribute to the semantic content:
* URLs will be removed.
* Phonetic pronunciations (e.g., "/ˈpɜːrʒən, -ʃən/") will be removed to avoid distracting the model. I plan to use a Regex to filter out these phonetic pronunciations.

### Padding and Truncation
To ensure uniform input dimensions:
* Question: Since the maximum length of a question has 21 tokens, I will pad all the questions to a length of 21 . For that I will fill the missing tokens with Zero Vectors as suggested in this source. <span style="color: green;">I will use the PyTroch Implementation to do that in order to not make a manual mistale. I will append the Zero Vectors with a dimension of 300 to the end of the question. </span>(Sources: [DataScience StackExchange](https://datascience.stackexchange.com/questions/32345/initial-embeddings-for-unknown-padding), [PyTorch Documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.pad.html))

* ~~Across all splits the maximum token length of a passage is 508. Therefore I'm padding the passages to a length of 512 tokens. Similar to the question, the padding will be done using zero vectors of the same dimension as the fastText embedding (Dimension: 300). I changed the dimension from the default value of 300 from FastText to a dimension of 50.~~

* <span style="color: green;">After some considerations I decided that I will not truncate 90% of all the datapoints. That means I will start to truncate at the 90th percentile. This means I will truncate if a passage is longer than 163 tokens. For all the other passages which are short I will append Zero Tokens at the end of the passage.</span>


### Embedding
To create the embeddings for each word, I will use the fastText embedding. The default dimension from fasttext is 300 ~~but I will change that to a dimension of 50 for each word~~. I chose fastText because it captures subword information, allowing it to produce better representations for rare and out-of-vocabulary words. <span style="color: green;">I experimented with both Dimension but I got better results with a dimension off 300</span>
<br>
*Sources*: [Word Embeddings Using FastText](https://www.geeksforgeeks.org/word-embeddings-using-fasttext/) / [FastText Doc - Default Dimension](https://fasttext.cc/docs/en/crawl-vectors.html#adapt-the-dimension)

### Input Format
The final input format for each data point will be:
* A tensor of shape (184, 300), representing the concatenated question (21 tokens) and passage (163 tokens). If we add the batch Size we will have the following Input for each Batch [Batch Size, Max Sequence Length, Embeddings Dimension] --> [256, 184, 300]
* Each token will be represented by its 300-dimensional fastText embedding.

### Label format
The labels (answers) will be encoded as boolean values (True/False). Therefore, the goal of the model will be to predict either True or False represented in 0/1.

### Train / Valid / Test split
These splits are already done while importing the dataset and were given by the professor.

### Batching
When using PyTorch Lightning, batching can be easy implemented through the DataLoader class. This has the advantage that it's super simple to create batches and it offers the ability for shuffling and parallel data loading. My goal was to set the Batch Size as high as possible which was 256. <br>
Sources: [PyTorch Lightning Doc](https://pytorch-lightning.readthedocs.io/en/1.5.10/guides/data.html)

# Planned Correctness tests
To ensure the correctness of my preprocessing steps:
1. I will manually inspect a sample of preprocessed data points to verify that:
  * Important information is not lost during tokenization or cleaning.
  * Padding is applied correctly.
  * Every datapoint has a length of 184 and a dimension of 300.


In [6]:
#Preprocessing Steps:

def preprocess_data(dataset):
  passage = dataset['passage']
  question = dataset['question']

  # Remove URL from the text
  url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|www\.\S+'
  passage = re.sub(url_pattern, '', passage)
  question = re.sub(url_pattern, '', question)

  tokenizer = RegexpTokenizer(r'\w+')
  passage = tokenizer.tokenize(passage)
  question = tokenizer.tokenize(question)

  # Remove Words that contain special characters (Using Regex)
  passage = [token for token in passage if re.match(r'^[A-Za-z0-9]+$', token)]
  question = [token for token in question if re.match(r'^[A-Za-z0-9]+$', token)]

  return {'passage': passage, 'question': question}

def get_word_embeddings(dataset):
  passage = np.array([np.array(ft.get_word_vector(word)) for word in dataset['passage']])
  question = np.array([np.array(ft.get_word_vector(word)) for word in dataset['question']])

  return {'passage': passage, 'question': question}



def padding_and_truncate(dataset):
    passage = dataset['passage']
    question = dataset['question']

    passage = torch.tensor(passage) if not isinstance(passage, torch.Tensor) else passage
    question = torch.tensor(question) if not isinstance(question, torch.Tensor) else question

    # Truncate or pad the passage
    if passage.shape[0] > 163:
        passage_truncated = passage[:163]
    else:
        pad_length = 163 - passage.shape[0]
        passage_truncated = F.pad(passage, (0, 0, 0, pad_length), mode='constant', value=0)

    # Truncate or pad the question
    if question.shape[0] > 21:
        question_truncated = question[:21]
    else:
        pad_length = 21 - question.shape[0]
        question_truncated = F.pad(question, (0, 0, 0, pad_length), mode='constant', value=0)

    return {'passage': passage_truncated, 'question': question_truncated}

def concatenate_passage_quetsion(dataset):
  passage = dataset['passage']
  question = dataset['question']

  concatenated = torch.cat((passage, question), dim=0)
  return {'concatenated': concatenated}

In [7]:
#Test Case --> Check whether the phoenetic pronounciation gets filtered out. 
print(train[0]["passage"])
data_dict = preprocess_data(train[0])
print(data_dict['passage'])

Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.
['Persian', 'also', 'known', 'by', 'its', 'endonym', 'Farsi', 'listen', 'is', 'one', 'of', 'the', 'Western', 'Iranian', 'languages', 'within', 'the', 'Indo', 'Iranian', 'branch', 'of', 'the', 'Indo', 'European', 'language', 'family', 'It', 'is', 'primarily', 'spoken', 'in', 'Iran', 'Afghanistan', 'officially', 'known', 'as', 'Dari', 'since', '1958', 'and', 'Tajikistan', 'officially', 'known', 'as', 'Tajiki', '

In [8]:
#Mapping
train = train.map(preprocess_data).with_format("torch", device=DEVICE)
valid = valid.map(preprocess_data).with_format("torch", device=DEVICE)
test = test.map(preprocess_data).with_format("torch", device=DEVICE)

Map:   0%|          | 0/8427 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3270 [00:00<?, ? examples/s]

In [9]:
# Data Exploration
passage_lengths = [len(x) for x in train["passage"]]
mean_length = np.mean(passage_lengths)
print("Passage:")
print(f"Mean passage length: {mean_length}")
print(f"Max passage length: {max(passage_lengths)}")
print(f"90th percentile: {np.percentile(passage_lengths, 90)}")
print(f"99th percentile: {np.percentile(passage_lengths, 99)}")

print("Question:")
question_length = [len(x) for x in train["question"]]
mean_length = np.mean(question_length)
print(f"Mean question length: {mean_length}")
print(f"Max question length: {max(question_length)}")
print(f"90th percentile: {np.percentile(question_length, 90)}")
print(f"99th percentile: {np.percentile(question_length, 99)}")

del passage_lengths, question_length, mean_length

Passage:
Mean passage length: 95.37059451762192
Max passage length: 774
90th percentile: 163.0
99th percentile: 278.0
Question:
Mean question length: 8.86566987065385
Max question length: 21
90th percentile: 10.0
99th percentile: 14.0


In [10]:
# Word Embeddings
train = train.map(get_word_embeddings)
valid = valid.map(get_word_embeddings)
test = test.map(get_word_embeddings)
del ft



Map:   0%|          | 0/8427 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3270 [00:00<?, ? examples/s]

In [None]:
#Padding
train = train.map(padding_and_truncate)
valid = valid.map(padding_and_truncate)
test = test.map(padding_and_truncate)

Map:   0%|          | 0/8427 [00:00<?, ? examples/s]

In [None]:
# Test the shape of the tensors
assert torch.tensor(train[0]["passage"]).shape == torch.Size([163, FT_DIMENSION]), "Passage shape mismatch"
assert torch.tensor(train[0]["question"]).shape ==  torch.Size([21, FT_DIMENSION]), "Question shape mismatch"

assert torch.tensor(train[21]["passage"]).shape == torch.Size([163, FT_DIMENSION]), "Passage shape mismatch"
assert torch.tensor(train[21]["question"]).shape ==  torch.Size([21, FT_DIMENSION]), "Question shape mismatch"


In [13]:
# Concatination
train = train.map(concatenate_passage_quetsion)
valid = valid.map(concatenate_passage_quetsion)
test = test.map(concatenate_passage_quetsion)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3270 [00:00<?, ? examples/s]

In [14]:
# Test Case / Test the Shape after Concatenation
assert train[0]["concatenated"].shape == torch.Size([184, FT_DIMENSION]), "Concatenated shape mismatch"
assert train[21]["concatenated"].shape == torch.Size([184, FT_DIMENSION]), "Concatenated shape mismatch"

# Test if the dataset is on the cuda device
assert train[0]["concatenated"].device.type == "cuda", "Dataset not on CUDA device"

# Model

### Network Architecture
* Input Layer: As the input into my model, I use the pretrained FastText Embeddings which I set to a dimension of 50 per token.
* RNN Layers: I use the LSTM PyTorch Implementation as the architecture. The input dimension will be the embeddings dimension which is set to 50. For the hidden dimension I choose the 256, however, I will run experiments to find the best hidden dimension for my model.
* Classifier: I will implement a Two-Layer Feed Forward Network as written in the project description. The first Layer will have a dimension of Linear(hidden_dim, hidden_dim) where hidden_dim is set to 256 as written above. In between the two-Classifer Layer I will use the ReLU activation function.
The second Layer will be as following: Linear(hidden_dim, 1). The output layer will use the sigmoid as the activation function.

### Loss and Optimizer
For the loss I choose the Binary Cross-Entropy Loss which is the good choice for binary classification problems. BCE provides a probabilistic interpretation for binary outcomes, which is ideal for yes/no questions in this task.
For the optimizer I choose the Adam Optimizer which is compatible with my neural network architecture and is often used in deep learnign task because Adam offers a good performance across different problems including this task. <br>
*Source*: [Binary Cross Entropy for Binary Classification](https://www.geeksforgeeks.org/binary-cross-entropy-log-loss-for-binary-classification/)

### Experiments to run
For this project I plan to run several experiments to find the best hyperparameter for my model.
* Learning Rate: <span style="color: green;">Range of: </span>[1e-2, 1e-6] 
* Hidden Dimension LSTM: <span style="color: green;">Range of: </span> [64, 1024]
* Hidden Dimension FC: <span style="color: green;">Range of: </span> [64, 1024]
* Drop Out Probability: <span style="color: green;"> Range of:[0, 0.5] # I know that the Value of 0.5 is very high on the first glance however I experimented with that value in order to break my model's habit from just predicting the majority class. </span> 
* ~~Batch Sizes: [16, 32, 64, 128]~~ <span style="color: green;">Batch Size was set to the maximum available RAM which was 256</span>
* <span style="color: green;">Bidirectional: [True, False]</span>

These Experiments were all run with Optuna in order to find the best settings. The objective function is written below. I decided to track the validation accuracy because I want to find the best model which has the highest accuracy and don't want a model which overfits to the training set. Therefore I set the direction parameters to maximize. 



### Training Process
An experiment should not run longer than 30 epochs. Therefore, I will limit the max_epochs to 30. However I will implement an early Stop criterion.

### Checkpointing and Early Stopping
**Checkpointing**: I will implement checkpointing in order to retrieve a model later to measure the test set performance. These are the parameters I will implement:
* Monitor: Validation Accuracy
* Top One: Only save the one model. The best model on the valdiation accuracy.
* Mode: max (Maximazing Val Accuracy)
* Filename: Save with the same name as the run name.

**Early Stopping**: I will also implement early stopping when the validation loss does not get any smaller over 15 epochs.
* Monitor: Validation Loss
* Patience: ~~5~~ 15 epochs
* Mode: min

### Planned Correctness Tests
1. Input shape test: Ensure the model accepts the correct input dimensions: <span style="color: green;">Done, please see the implementation below</span>
2. Output shape test: Verify the model produces the expected output shape <span style="color: green;">Done, please see the implementation below</span>
3. Verify that the loss decreases during training. <span style="color: green;">Done via Weights&Biases</span>
4. Verify that the model does not overfit. For that I observe the validation loss. <span style="color: green;">Done via Weights&Biases</span>
5. Reproducibility test: Ensure results are consistent with fixed random seed. <span style="color: green;">Done, see at the beginnging of the code with Seed_everything</span>



In [15]:
train_loader = DataLoader(TensorDataset(train['concatenated'], train['answer']), shuffle=True, batch_size=256)
valid_loader = DataLoader(TensorDataset(valid['concatenated'], valid['answer']), shuffle=True, batch_size=256)
test_loader = DataLoader(TensorDataset(test['concatenated'], test['answer']), shuffle=False, batch_size=256)

In [16]:
assert train["concatenated"].shape == torch.Size([8427, 184, FT_DIMENSION]) # [DataPoints, Sequence_Length, Dimension]

In [17]:
for batch in train_loader:
    concatenated, answer = batch
    assert concatenated.shape == torch.Size([256, 184, FT_DIMENSION])
    assert answer.shape == torch.Size([256])
    break

In [18]:
#Correctness Test: Check the Input Shape of the first datapoint of the train_loader object
tens, _ = train_loader.dataset[0]
assert tens.shape == torch.Size([184, FT_DIMENSION])

In [19]:

class LSTMClassifier(pl.LightningModule):
    def __init__(self,
                 input_size,
                 lstm_hidden_size,
                 fc_hidden_size,
                 output_size,
                 pos_weight=1,
                 num_layers=2,
                 learning_rate=1e-3,
                 dropout_prob=0.2,
                 bidirectional=False):  

        super(LSTMClassifier, self).__init__()
        self.save_hyperparameters()

        # RNN Layer
        self.lstm = nn.LSTM(
            input_size,
            lstm_hidden_size,
            num_layers=num_layers,
            bidirectional=bidirectional,  
            batch_first=True)

        lstm_output_size = lstm_hidden_size * 2 if bidirectional else lstm_hidden_size
        self.fc1 = nn.Linear(lstm_output_size, fc_hidden_size)
        self.fc2 = nn.Linear(fc_hidden_size, output_size)

        self.dropout = nn.Dropout(dropout_prob)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

        self.train_accuracy = BinaryAccuracy()
        self.val_accuracy = BinaryAccuracy()
        self.test_accuracy = BinaryAccuracy()
        self.train_f1 = BinaryF1Score()
        self.val_f1 = BinaryF1Score()
        self.test_f1 = BinaryF1Score()
        self.test_confusion_matrix = BinaryConfusionMatrix()

        self.pos_weight = torch.tensor(pos_weight, device=self.device) # For weighted Loss


    def forward(self, x):
        lstm_out, (hn, cn) = self.lstm(x)

        if self.hparams.bidirectional:
            out = torch.cat((hn[2], hn[3]), dim=-1)
        else:
            out = hn[-1] 
            
        out = self.dropout(out)
        out = self.relu(self.fc1(out)) 
        out = self.dropout(out)
        out = self.sigmoid(self.fc2(out))  
        return out.squeeze(dim=1)


    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)        
        loss = F.binary_cross_entropy(y_hat, y.float())
        preds = y_hat > 0.5

        accuracy = self.train_accuracy(preds, y.int())
        f1_score = self.train_f1(preds.int(), y.int())
        
        self.log('train_loss', loss, on_epoch=True, on_step=False)
        self.log('train_accuracy', accuracy, on_epoch=True, on_step=False)
        self.log('train_f1', f1_score, on_epoch=True, on_step=False)
                
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.binary_cross_entropy(y_hat, y.float())
        preds = y_hat > 0.5

        accuracy = self.val_accuracy(preds, y.int())
        f1_score = self.val_f1(preds.int(), y.int())

        self.log('val_loss', loss, on_epoch=True, on_step=False)
        self.log('val_accuracy', accuracy, on_epoch=True, on_step=False)
        self.log('val_f1', f1_score, on_epoch=True, on_step=False)    

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.binary_cross_entropy(y_hat, y.float())
        preds = y_hat > 0.5

        accuracy = self.test_accuracy(preds, y.int())
        f1_score = self.test_f1(preds.int(), y.int())

        #Pytorch
        self.test_confusion_matrix.update(preds.int(), y.int())


        self.log('test_loss', loss, on_epoch=True, on_step=False)
        self.log('test_accuracy', accuracy, on_epoch=True, on_step=False)
        self.log('test_f1', f1_score, on_epoch=True, on_step=False)

    def on_test_epoch_end(self):
        # Create Confusion Matrix
        cm1 = self.test_confusion_matrix.compute()
        #print('Confusion Matrix:\n', cm1)
        self.plot_confusion_matrix(cm1)
                
        self.test_confusion_matrix.reset()

    def plot_confusion_matrix(self, cm):
        """
        Display the confusion matrix as a heatmap in the notebook.
        """
        plt.figure(figsize=(6, 6))
        sns.heatmap(cm.cpu().numpy(), annot=True, fmt='d', cmap='Blues', cbar=False, annot_kws={"size": 16})  # Increase font size with annot_kws
        plt.title('Confusion Matrix')
        plt.xlabel('Predicted Labels')
        plt.ylabel('True Labels')
        plt.show()

    
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)


In [20]:
# Correctness Test: Test the Output Shape
BATCH_SIZE = 32
SEQ_LENGTH = 200
DIMENSION = 50

model = LSTMClassifier(
    input_size=50,
    lstm_hidden_size=128,
    fc_hidden_size=128,
    output_size=1, 
    dropout_prob=0.2, 
    bidirectional = True,
    learning_rate=1e-4).to(DEVICE)
x = torch.randn(BATCH_SIZE, SEQ_LENGTH, DIMENSION).to(DEVICE)

assert model.forward(x).shape == torch.Size([BATCH_SIZE])

In [None]:
CONFIG = {"input_size": FT_DIMENSION,
"lstm_hidden_size": 512,
"fc_hidden_size": 256, 
"dropout_prob": 0.1, 
"bidirectional": False, 
"learning_rate":1e-4,
"pos_weight": 1, # Here I could weight the loss according to the label distribution
} 

LSTMmodel = LSTMClassifier(
    input_size=CONFIG["input_size"],
    lstm_hidden_size=CONFIG["lstm_hidden_size"],
    fc_hidden_size=CONFIG["fc_hidden_size"],
    output_size=1, 
    dropout_prob=CONFIG["dropout_prob"], 
    bidirectional = CONFIG["bidirectional"],
    learning_rate=CONFIG["learning_rate"], 
    pos_weight=CONFIG["pos_weight"])

run_name = f'''input_size_{CONFIG["input_size"]}_lstm_hidden_{CONFIG["lstm_hidden_size"]}
_fc_hidden_{CONFIG["fc_hidden_size"]}_bidirectional-{CONFIG["bidirectional"]}_{656}'''

wandb_logger = WandbLogger(project='Project_2_RNN', name=run_name)
for key, value in CONFIG.items():
  wandb_logger.experiment.config[key] = str(value)

wandb_logger.log_hyperparams(LSTMmodel.hparams)


checkpoint_callback = ModelCheckpoint(
    monitor='val_accuracy',
    mode='max',
    save_top_k=1,
    dirpath='checkpoints/',
    filename=run_name,
    verbose=True
)

early_stopping_loss = EarlyStopping(
    monitor='val_loss',
    patience=20,
    min_delta = 0.01,
    verbose=True,
    mode='min'
)


trainer = pl.Trainer(callbacks=[checkpoint_callback, early_stopping_loss], max_epochs=30, logger=wandb_logger)
trainer.fit(model=LSTMmodel, train_dataloaders=train_loader, val_dataloaders=valid_loader)
wandb.finish()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mlukii[0m ([33mnlp_luca_gafner[0m). Use [1m`wandb login --relogin`[0m to force relogin


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A16') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/opt/conda/lib/python3.11/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /home/jovyan/NLP/NLP_Project2/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

   | Name                  | Type                  | Params | Mode 
-------------------------------------------------------------------------
0  | lstm                  | LSTM                  | 3.8 M  | train
1  | fc1                   | Linear                | 131 K  | train
2  | fc2                   | Li

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/conda/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:475: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/opt/conda/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.
/opt/conda/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.
/opt/conda/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (33) is smaller than the logging interval Traine

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Metric val_loss improved. New best score: 0.678
Epoch 0, global step 33: 'val_accuracy' reached 0.59500 (best 0.59500), saving model to '/home/jovyan/NLP/NLP_Project2/checkpoints/input_size_300_lstm_hidden_512\n_fc_hidden_256_bidirectional-False_656.ckpt' as top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 1, global step 66: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 2, global step 99: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 3, global step 132: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 4, global step 165: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 5, global step 198: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 6, global step 231: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 7, global step 264: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 8, global step 297: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 9, global step 330: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 10, global step 363: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 11, global step 396: 'val_accuracy' was not in top 1


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch 12, global step 429: 'val_accuracy' reached 0.59600 (best 0.59600), saving model to '/home/jovyan/NLP/NLP_Project2/checkpoints/input_size_300_lstm_hidden_512\n_fc_hidden_256_bidirectional-False_656.ckpt' as top 1


In [None]:
# Data Exploration about the distribution of the Labels --> Used for weighted Loss

true_label_train = torch.sum(train['answer'] == True).item() 
false_label_train = torch.sum(train['answer'] == False).item()
true_count_valid = torch.sum(valid['answer'] == True).item() 
false_count_valid = torch.sum(valid['answer'] == False).item()

accuracy_train = true_label_train / (true_label_train + false_label_train)
accuracy_valid = true_count_valid / (true_count_valid + false_count_valid)

print(f'{true_label_train=}')
print(f'{false_label_train=}')
print(f"Accuracy Train: {accuracy_train:.4f}")
print(f"Accuracy Valid: {accuracy_valid:.4f}")



In [None]:
N_TRAILS = 100

def objective(trial):
    # Hyperparameters to tune
    lstm_hidden_size = trial.suggest_int('lstm_hidden_size', 64, 1024)
    fc_hidden_size = trial.suggest_int('fc_hidden_size', 64, 1024)
    dropout_prob = trial.suggest_float('dropout_prob', 0.0, 0.5)
    learning_rate = trial.suggest_loguniform('learning_rate', 1e-6, 1e-2)
    bidirectional = trial.suggest_categorical('bidirectional', [False, True]) 

    model = LSTMClassifier(
        input_size=FT_DIMENSION, 
        lstm_hidden_size=lstm_hidden_size,
        fc_hidden_size=fc_hidden_size,
        output_size=1,  
        learning_rate=learning_rate,
        dropout_prob=dropout_prob,
        bidirectional=bidirectional
    )

    run_name = f'''input_size_{FT_DIMENSION}_lstm_hidden_{lstm_hidden_size}
    _fc_hidden_{fc_hidden_size}_bidirectional-{bidirectional}_{random.randint(1, 1000)}'''

    wandb_logger = WandbLogger(project='Project_2_RNN', name=run_name, group="optunaV2")
    
    
    wandb_logger.log_hyperparams(model.hparams)

    trainer = pl.Trainer(callbacks=[checkpoint_callback], max_epochs=30, logger=wandb_logger)
    trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=valid_loader)
    wandb.finish()
    return trainer.callback_metrics['val_accuracy'].item()  # Return validation accuracy to maximize

study = optuna.create_study(direction='maximize')  
#study.optimize(objective, n_trials=N_TRAILS)  

#print("Best hyperparameters: ", study.best_params)

# Evaluation Stage 1
After exploring the data, I found that the training set contains 62% True labels. If I would train a Dummy Classifier that always predicts the most frequent label, it would achieve about 62% accuracy. My goal is at least to outperform this baseline model. For the first project I achieved an accuracy of 64%.

### Metrics
**Accuracy**: To measure the performance of my model including the different hyperparameters, I will use the accuracy on the validation set. Different to the first project, I will no longer use the F1 score which is also a good metrics but not less suitable for this project since the true/false categories do not have recognizable patterns.

**Confusion Matrix**: This will provide a detailed breakdown of true positives, true negatives, false positives, and false negatives.


### Error Analysis
To gain some insights why the model fails to predict the correct label and I want to do an error analysis. I will examine if the mislabeling has something to do with the length of the passage and the padding. I will analyize whether the amount of padding has a negative impact on the model performance.

# NEWLY ADDED: Evaluation Stage 2

Comment on the F1 score. I am aware (See Paragraph above) that this metric does not make much sense for this type of task. However, at times during the implementation I was so desperate that I started to think that I might have implemented the Accuracy incorrectly. That's why I implemented the F1 score as a test case, but all results are compared using the accuracy.

## Models to be Evaluated: 
Most of my runs achieved a validation accuracy of 0.595. After analyzing the results obtained, I found that this corresponds exactly to the accuracy when the model only predicts “True”. I checked this accordingly and indeed my model only predicted True most of the time. I tried to combat this with various methods. On the one hand, I added two dropout layers and implemented a weighted loss, which weights the loss differently based on the label distribution. But even these two methods didn't really help. That's why I decided to increase the embeddings dimension to 300 again. After some experiments with Optuna, there were always models that learned to predict more than just true over time. However, I had to increase the Early Stopping Patience from 5 to 15 to give the model more time to learn. This worked well and the Valid Accuracy increased. For the evaluation on the test set, I decided to select the model with the best Valid Accuracy. In addition, I also evaluated the model with the best valid accuracy which was trained as bidirectional. 

### Best overall Performance
I want to evalute the model which has the highest accuracy on the validation set.<br> 
Run Name: input_size_300_lstm_hidden_1020\n_fc_hidden_221_bidirectional-False_656 
* Validation Accuracy on best Epoch: 0.628

***Results of this model:***<br>
Accuracy on Testset: 0.6064<br>
(F1 on Testset: 0.6948) <br>
True Positive Count: 1467 <br>
True Negative Count: 516

### Model with best Accuracy on Validation and is bidirectional:
In addition I also evaluated the model which had the best accuracy (on the validation set) and is bidirectional. <br>
Run Name: input_size_300_lstm_hidden_865\n_fc_hidden_183_bidirectional-True_760 
* Validation Accuracy on best Epoch: 0.609

***Results of this model:*** <br>
Accuracy on Testset: 0.6314 <br>
(F1 on Testset: 0.7545) <br>
True Positive Count: 1857 <br>
True Negative Count: 208

The bidirectional model achieved the best test accuracy with 0.6314 compared to the other model (0.6064) which is a classical LSTM without bidirectionality. 
The Model with Bidirectionality has a higher True Positive Count. However this is due to the fact that it predicts more of the majority class (True) and therefore it also has the higher accuracy as the best model with out bidirectionality. Thus, the True Negative Count is lower compared to the model without bidirectionality. 


In [None]:
# Evaluation - Best Overall Performance
base_path = Path("../../NLP/NLP_Project2/checkpoints")
run_name_overall = "input_size_300_lstm_hidden_1020\n_fc_hidden_221_bidirectional-False_656"
os.path.exists(base_path)

overall = base_path / run_name_overall
overall = str(overall) + ".ckpt" 

model = LSTMClassifier.load_from_checkpoint(overall)
wandb_logger = WandbLogger(project='Project_2_RNN', name=run_name_overall, group="Evaluation")

trainer = pl.Trainer(logger=wandb_logger)  
trainer.test(model, test_loader) 
wandb.finish()

In [None]:
run_name_bidirectional = "input_size_300_lstm_hidden_865\n_fc_hidden_183_bidirectional-True_760"

bidirectional = base_path / run_name_bidirectional
bidirectional = str(bidirectional) + ".ckpt" 

model = LSTMClassifier.load_from_checkpoint(bidirectional)
wandb_logger = WandbLogger(project='Project_2_RNN', name=run_name_bidirectional, group="Evaluation")

trainer = pl.Trainer(logger=wandb_logger)  
trainer.test(model, test_loader) 
wandb.finish()

# Interpretation
My expectations for the LSTM Project: A dummy classifier achieves an accuracy of 62% on the BoolQ dataset (While predicting the most frequent label), while my model from the first project reached 64% in accuracy.

Given that LSTM layers can better capture important information, I expect to improve on this performance. Therefore, I set a baseline expectation of at least 65% accuracy and aim for 70% on the test set since the LSTM's ability is to understand the relationships between questions and passages better than simpler models and can memorize important information form earlier in the passage.


# Interpretation from Stage 2
In this project, my initial expectation was to surpass the baseline performance of 62% accuracy achieved by a dummy classifier on the BoolQ dataset and the 64% accuracy from my previous project. Given that LSTM layers are known for their ability to memorize things better with the hidden state I anticipated a performance improvement. I set a target of achieving at least 65% accuracy, with an optimistic goal of 70%. 

However, my best-performing model, using a bidirectional LSTM, reached an accuracy of 63%. A major challenge was avoiding the model's tendency to predict the most frequent label (True) for all examples. Through hyperparameter tuning with Optuna, I discovered that a higher LSTM hidden dimension and a lower classification hidden dimension achieved the best results. The model model in the end had a LSTM Hidden dim of 865 and a Classifier Hidden Dim of 183 and a Learning Rate of 0.00015


While I successfully developed a model that does not simply predict True, the overall performance was still slightly below my previous model and the expectation I set in Stage 1.

# Sources
The sources are directly linked in each paragraph. Furthermore, I used AI-Tools such as ChatGPT from OpenAI and Gemini from Google for debugging.
