# NLP - Word Embeddings - Pascal Thürig

## Introduction
The starting point for this project is the following set of key requirements:
1. Use the BoolQ dataset from Hugging Face
2. Use a pre-trained model for word embeddings (word2vec, GloVe or fastText)
3. Train a 2-layer classifier with ReLU non-linearity

In this project I will be using pre-trained embeddings from word2vec and a simple 2-layer neural network to do the reading comprehension task on the BoolQ dataset.
I will document every decision made, from preprocessing to model training and evaluation. The goal is to classify each BoolQ question-passage pair as either 'Yes' or 'No'.



## TL;DR: Key decisions and justifications
- BoolQ Dataset: Provided by the project brief
- Task: Classify BoolQ questions as either "yes" or "no" using pre-trained embeddings and a simple neural network
- Pre-trained embeddings: word2vec - Google News 300; chosen for simplicity and because I already have a bit of experience with it
- Model: 2-layer NN with ReLU activation; provided by the project brief
- Tokenizing: Yes, using a subword tokenizer.
- Lowercasing: Yes, all text will be lowercased.
- Stemming: No, stemming will not be applied.
- Lemmatizing: No, lemmatizing is not used initially but could be tried later.
- Stopword removal: No, stopwords are not removed to retain key information.
- Removal of other words: No, no other word removal is planned.
- Format cleaning: No further cleaning required, the dataset is already clean.
- Truncation: Yes, input text is truncated to a maximum of 512 tokens.
- Feature selection: None, relying on word2vec embeddings directly.
- Input format: Tokenized and padded sequences of word2vec embeddings.
- Label format: Binary (1 for "yes", 0 for "no").
- train/valid/test splits: 66% train, 8% validation, 26% test.
- Padding: Yes, sequences are padded for uniform input length.
- Embedding: Pre-trained word2vec embeddings are used for simplicity.
- Planned correctness tests: Shape consistency checks, binary label correctness, and validation of truncation and padding.
- Hyperparameters:
    - Learning Rate: 1e-2 – 1e-5
    - Batch Size: 16 - 64 (choosing maximum possible that my GPU can handle)
    - Epochs: 10 - 50 in 5-/10-step increments
    - Hidden size: 64 - 512 
    - Early Stopping: Patience of 3 - 10 Epochs of non-improvement (depending on total epoch number)
- Evaluation: Accuracy and F1 score; accuracy for general performance and F1 to handle class imbalance
- Error Analysis: Investigating false positives and false negatives to understand where the model fails

## Setup
Importing necessary libraries:
- datasets
- gensim
- nltk
- transformers
- numpy
- torch
- wandb
- sklearn

In [44]:
%pip install datasets gensim transformers numpy torch wandb scikit-learn

from datasets import load_dataset
import gensim.downloader as api
import gensim
from transformers import AutoTokenizer
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import wandb
import sklearn



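First, the BoolQ dataset is loaded. The last 1,000 examples of the official training split are held out for validation, and the official validation split serves as the test set.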

In [45]:
train_data = load_dataset('google/boolq', split='train[:-1000]')
validation_data = load_dataset('google/boolq', split='train[-1000:]')
test_data = load_dataset('google/boolq', split='validation')


test_question = train_data[5]['question']
print(train_data[5])
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(validation_data)}")
print(f"Number of test samples: {len(test_data)}")

{'question': 'can you use oyster card at epsom station', 'answer': False, 'passage': "Epsom railway station serves the town of Epsom in Surrey. It is located off Waterloo Road and is less than two minutes' walk from the High Street. It is not in the London Oyster card zone unlike Epsom Downs or Tattenham Corner stations. The station building was replaced in 2012/2013 with a new building with apartments above the station (see end of article)."}
Number of training samples: 8427
Number of validation samples: 1000
Number of test samples: 3270
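
The 66/8/26 split quoted in the key decisions can be checked against these counts. A quick sketch, with the sizes copied by hand from the output above (the variable names are only illustrative):

```python
# Split sizes copied from the cell output above.
split_sizes = {"train": 8427, "validation": 1000, "test": 3270}
total = sum(split_sizes.values())

# Share of each split, rounded to whole percent.
split_percent = {name: round(100 * n / total) for name, n in split_sizes.items()}

print(total)
print(split_percent)
```

The rounded shares agree with the percentages stated earlier.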


For easy access during experiments I like to define the hyperparameters at the top of my notebooks

In [46]:
# Hyperparameters
EMBEDDING_DIM = 300
max_seq_length = 512

run_number = 1
batch_size = 32
n_epochs = 20
learning_rate = 1e-5
hidden_dim = 64
patience = 5

wandb_project_name = "nlp-word_embeddings-pascal_thuerig"
wandb_run_name = f"run_{run_number}-batch_size_{batch_size}-n_epochs_{n_epochs}-lr_{learning_rate}-hidden_dim_{hidden_dim}"

sequence_length = max_seq_length * 2  # Concatenate question and passage
input_dim = sequence_length * EMBEDDING_DIM
output_dim = 2  # Binary classification

print(f"Seq Length: {sequence_length}")
print(f"Input dim: {input_dim}")

Seq Length: 1024
Input dim: 307200
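
To make the size of this input concrete, the arithmetic behind `input_dim` and the resulting number of trainable parameters in the 2-layer classifier from the Model section can be written out. This is a sketch that assumes the values from the hyperparameter cell above and the usual weights-plus-bias count for a fully connected layer:

```python
embedding_dim = 300
max_seq_length = 512
hidden_dim = 64
output_dim = 2

# Question and passage are each padded to max_seq_length and concatenated.
sequence_length = 2 * max_seq_length
# Flattening the (sequence_length, embedding_dim) matrix gives the input size.
input_dim = sequence_length * embedding_dim

# A fully connected layer has in*out weights plus out biases.
fc1_params = input_dim * hidden_dim + hidden_dim
fc2_params = hidden_dim * output_dim + output_dim
total_params = fc1_params + fc2_params

print(input_dim, fc1_params, fc2_params, total_params)
```

Almost all parameters sit in the first layer, which is worth keeping in mind when increasing `hidden_dim` in later experiments.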


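Next, the pre-trained embeddings are loaded via gensim. Note that although the text refers to word2vec (`word2vec-google-news-300`), the cell below loads `fasttext-wiki-news-subwords-300`.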

In [47]:
model_name = "fasttext-wiki-news-subwords-300"
model_path = "fasttext-wiki-news-subwords-300.model"

# Check if the model file exists
try:
    # Load the model if it exists locally
    embeddings_model = gensim.models.KeyedVectors.load(model_path)
    print("Model loaded from local storage.")
except FileNotFoundError:
    # Download and save the model if it doesn't exist
    print(f"Downloading {model_name} model...")
    embeddings_model = api.load(model_name)
    embeddings_model.save(model_path)  # Save the model locally
    print("Model downloaded and saved to local storage.")

Model loaded from local storage.


## Preprocessing

The BoolQ data will be processed in the following way:
1.  Tokenizing: Questions and passages are tokenized with a subword tokenizer.
2.  Lowercasing: All text is lowercased for simplicity and to reduce the total vocabulary size.
3.  Stemming: No, words are not stemmed so that no information is lost.
4.  Lemmatizing: No, not initially; it may be tried later to see whether it improves performance.
5.  Stopword removal: No, stopwords are kept so that potentially critical information is not lost ([research](https://datascience.stackexchange.com/questions/31048/pros-cons-of-stop-word-removal)).
6.  Removal of other words: No other words are removed.
7.  Format cleaning: None; the dataset is already sufficiently clean, so further cleaning should not impact performance.
8.  Truncation: The input text is truncated to a maximum of 512 tokens.
9.  Feature selection: Not applicable; the raw text is mapped directly to pre-trained word embeddings, so no further feature extraction is needed.
10. Input format: Tokenized and padded sequences of word embeddings.
11. Label format: Binary labels (1 for "yes", 0 for "no").
12. train/valid/test splits: 66/8/26, a prerequisite of the project.
13. Padding: Sequences are padded so that all inputs in a batch have the same length.
14. Embedding: Using word2vec, solely for simplicity as I already know it.
15. Planned correctness tests:
    - Check for shape mismatches between tokenized text and word embeddings.
    - Ensure that input sequences are properly truncated and padded.
    - Verify that binary labels are correctly assigned and match the expected outputs.


1. Preprocess text (lowercasing)

In [48]:
def lowercase_text(text):
    return text.lower()

print(lowercase_text(test_question))

can you use oyster card at epsom station


2. Tokenize with AutoTokenizer from Hugging Face

In [49]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_text(text):
    return tokenizer(text,
                     padding='max_length',    
                     truncation=True,         
                     max_length=max_seq_length,
                     return_tensors='pt')

tokenized_output = tokenize_text(test_question)

# Print tokens
print(tokenizer.convert_ids_to_tokens(tokenized_output['input_ids'][0]))

# Print input_ids and their corresponding tokens
test_input_ids = tokenized_output['input_ids'][0]
test_tokens = tokenizer.convert_ids_to_tokens(test_input_ids)
for token, id in zip(test_tokens, test_input_ids):
    print(f"Token: {token} - ID: {id.item()}")



['[CLS]', 'can', 'you', 'use', 'oyster', 'card', 'at', 'eps', '##om', 'station', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ...]
(output truncated; the remaining positions up to max_seq_length are all '[PAD]')



3. Truncate or add padding -> *found out I can already do this in the tokenizer*
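
For reference, what `padding='max_length'` and `truncation=True` achieve can be sketched on a plain token list. This is a simplified illustration with a hypothetical helper `pad_or_truncate`, not the tokenizer's actual implementation; the real tokenizer additionally keeps the special `[CLS]` and `[SEP]` tokens in place when it truncates.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Hypothetical helper: cut to max_length, then right-pad with pad_token."""
    clipped = tokens[:max_length]
    return clipped + [pad_token] * (max_length - len(clipped))


# A short sequence is padded on the right ...
short = pad_or_truncate(["can", "you", "use"], max_length=5)
# ... and a long one is cut off after max_length tokens.
long = pad_or_truncate(list("abcdefg"), max_length=5)

print(short)
print(long)
```

Either way the result has exactly `max_length` entries, which is what the planned shape checks rely on.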

4. Preprocess Pipeline

In [50]:
def preprocess_pipeline(text):
    text = lowercase_text(text)
    tokens = tokenize_text(text)
    
    # Ensure tokenized length is correct
    assert tokens['input_ids'].shape[1] == max_seq_length, \
        f"Tokenized length is not equal to max_seq_length: {tokens['input_ids'].shape[1]}"
    
    token_strings = tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])
    
    return token_strings

preprocessed_text = preprocess_pipeline(test_question)
print("Tokens:", preprocessed_text)


Tokens: ['[CLS]', 'can', 'you', 'use', 'oyster', 'card', 'at', 'eps', '##om', 'station', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ...]
(output truncated; the remaining positions up to max_seq_length are all '[PAD]')

5. Embed tokens using the pre-trained embedding model loaded above; tokens that are missing from its vocabulary are mapped to zero vectors

In [51]:
def tokens_to_embeddings(tokens, embedding_model=embeddings_model, embedding_dim=EMBEDDING_DIM):
    embeddings = []
    for t in tokens:
        if t in embedding_model:
            embeddings.append(embedding_model[t])
        else:
            embeddings.append(np.zeros(embedding_dim))
    return np.array(embeddings)

embedded_text = tokens_to_embeddings(preprocessed_text)
print(embedded_text)

[[ 0.        0.        0.       ...  0.        0.        0.      ]
 [-0.029857 -0.034625 -0.038095 ... -0.018202  0.049474  0.021932]
 [-0.047138  0.043438  0.024106 ... -0.011364 -0.037262 -0.031316]
 ...
 [ 0.        0.        0.       ...  0.        0.        0.      ]
 [ 0.        0.        0.       ...  0.        0.        0.      ]
 [ 0.        0.        0.       ...  0.        0.        0.      ]]
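
The zero rows in this output come from the fallback branch: special tokens such as `[CLS]` and `[PAD]` have no entry in the embedding vocabulary. A small sketch of the same lookup logic shows how the share of covered tokens could be measured. The toy table `toy_vectors` and its values are made up for illustration and say nothing about which tokens the real model actually contains.

```python
EMBEDDING_DIM = 3  # toy size; the notebook uses 300

# Toy lookup table standing in for the gensim model (values are invented).
toy_vectors = {
    "can": [0.1, 0.2, 0.3],
    "you": [0.4, 0.5, 0.6],
    "station": [0.7, 0.8, 0.9],
}


def embed(tokens, vectors, dim):
    """Look up each token; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [vectors.get(token, zero) for token in tokens]


tokens = ["[CLS]", "can", "you", "eps", "##om", "station", "[SEP]", "[PAD]"]
rows = embed(tokens, toy_vectors, EMBEDDING_DIM)

# Tokens that fell back to the zero vector, and the covered share.
missing = [token for token in tokens if token not in toy_vectors]
coverage = 1 - len(missing) / len(tokens)

print(missing)
print(f"coverage: {coverage:.2f}")
```

Running a check like this on real data would show how many positions of the 1024-token input carry no information, which matters because subword pieces produced by the BERT tokenizer need not match the embedding model's vocabulary.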


6. Create a custom BoolQ dataset class to:
    - get the data into a format compatible with the PyTorch DataLoader.
    - organize question-passage pairs and apply the preprocessing pipeline.
    - easily batch, shuffle, and load the data during training.

In [52]:
class BoolQDataset(Dataset):
    def __init__(self, data, word2vec_model, max_seq_length=max_seq_length):
        self.data = data
        self.word2vec_model = word2vec_model
        self.max_seq_length = max_seq_length
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        question = self.data[idx]['question']
        passage = self.data[idx]['passage']
        
        label = 1 if self.data[idx]['answer'] else 0
        
        question_tokens = preprocess_pipeline(question)
        passage_tokens = preprocess_pipeline(passage)
        
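        # Use the model passed to the constructor instead of the global default
        question_embeddings = tokens_to_embeddings(question_tokens, embedding_model=self.word2vec_model)
        passage_embeddings = tokens_to_embeddings(passage_tokens, embedding_model=self.word2vec_model)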
        
        embeddings = np.concatenate((question_embeddings, passage_embeddings), axis=0).flatten()
        
        return torch.tensor(embeddings, dtype=torch.float32), torch.tensor(label, dtype=torch.long)

7. DataLoaders as required by PyTorch

In [53]:
train_dataset = BoolQDataset(train_data, embeddings_model)
validation_dataset = BoolQDataset(validation_data, embeddings_model)
test_dataset = BoolQDataset(test_data, embeddings_model)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

8. Initialize Weights & Biases (wandb) for experiment tracking

In [54]:
wandb.init(project=wandb_project_name, name=wandb_run_name, config={
    "learning_rate": learning_rate,
    "epochs": n_epochs,
    "batch_size": batch_size,
    "hidden_size": hidden_dim,
})




## Model

The model architecture for this project is already fixed in the project brief as follows:
- **Network Architecture:** 2-layer network with ReLU non-linearity.
- **Loss / Optimizer:** Loss: CrossEntropyLoss / Optimizer: Adam (potentially trying SGD with or without momentum in experiments)
- **Experiments to run**: Mentioned in Training section below
- **Number of training runs**: Will depend on number of experiments
- **Checkpointing / Early stopping:** 3 - 10 epochs of non-improvement of the validation loss
- **Planned correctness tests:** Shape and dimension consistency tests, gradient check, sanity check and prediction tests

1. Creating the neural network class:

In [55]:
class TwoLayerNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(TwoLayerNN, self).__init__()
        
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
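
To illustrate what `forward` computes, here is the same composition, a linear layer, ReLU, then a second linear layer, written in plain Python on a tiny example. The weights, biases, and helper names are invented for illustration; the real model learns its parameters and works on 307200-dimensional inputs.

```python
def linear(x, weights, bias):
    # One row of weights per output unit, matching the layout of nn.Linear.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    # Negative activations are clipped to zero.
    return [max(0.0, v) for v in values]


def forward(x, w1, b1, w2, b2):
    return linear(relu(linear(x, w1, b1)), w2, b2)


x = [1.0, -2.0]                                   # 2 input features
w1 = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]        # 3 hidden units
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]          # 2 output logits
b2 = [0.0, 0.5]

hidden = relu(linear(x, w1, b1))
logits = forward(x, w1, b1, w2, b2)
print(hidden)
print(logits)
```

The first hidden unit has a negative pre-activation and is zeroed by ReLU; the two outputs are raw logits for "no" and "yes", not probabilities.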

2. Create an instance of the model and move it to the GPU if one is available (otherwise the CPU is used)

In [56]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = TwoLayerNN(input_dim, hidden_dim, output_dim).to(device)

Using device: cpu


3. Loss (nn.CrossEntropyLoss) and optimizer (optim.Adam)

In [57]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
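
`nn.CrossEntropyLoss` takes raw logits, applies a log-softmax, and returns the negative log-likelihood of the true class. A plain-Python sketch for a single example makes a useful reference point visible: a two-class model that outputs equal logits has a loss of ln 2 ≈ 0.693. The function name below is illustrative only.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uninformed = cross_entropy([0.0, 0.0], 1)        # equal logits -> ln(2)
confident_right = cross_entropy([-2.0, 2.0], 1)  # high logit on the true class
confident_wrong = cross_entropy([-2.0, 2.0], 0)  # high logit on the wrong class

print(f"{uninformed:.4f} {confident_right:.4f} {confident_wrong:.4f}")
```

Validation losses only slightly below 0.693, such as those logged in the Training section, therefore indicate predictions that are only modestly better than an uninformed guess.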

4. Training loop

In [58]:
def train_model(model, train_loader, val_loader, criterion, optimizer, epochs):
    early_stop_counter = 0
    best_val_loss = np.inf

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            # Iterate over the loader passed as an argument, not the global one
            for val_inputs, val_labels in val_loader:
                val_inputs = val_inputs.to(device)
                val_labels = val_labels.to(device)
                val_outputs = model(val_inputs)
                val_loss += criterion(val_outputs, val_labels).item()
        
        avg_val_loss = val_loss / len(val_loader)
        avg_train_loss = running_loss / len(train_loader)
        
        print(f"Epoch [{epoch+1}/{epochs}], Training loss: {avg_train_loss:.4f}, Validation loss: {avg_val_loss:.4f}")
        wandb.log({"epoch": epoch + 1, "average_training_loss": avg_train_loss, "average_validation_loss": avg_val_loss})
        
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            early_stop_counter = 0
        else:
            early_stop_counter += 1
            
        if early_stop_counter >= patience:
            print(f"Early stopping at epoch {epoch + 1}")
            break

    print("Finished Training")

## Training
Train the model with different values of the following hyperparameters:
- Learning rate: 1e-2 – 1e-5
- Batch size: 16 - 64
- Epochs: 10 - 50
- Hidden size: 64 - 512
- Early Stopping: Patience of 3 - 10 Epochs of non-improvement
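
One way to turn these ranges into a concrete list of runs is a small grid. The sketch below is only an assumption about how the ranges might be discretized; the intermediate values are my own picks and are not fixed anywhere in this notebook.

```python
from itertools import product

# Assumed grid points inside the ranges listed above.
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    "batch_size": [16, 32, 64],
    "hidden_dim": [64, 128, 256, 512],
}

# One dict per combination of grid values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs))
print(configs[0])
```

A full grid over these three parameters already means 48 runs, so testing a subset or varying one parameter at a time may be more practical.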


In [59]:
train_model(model, train_loader, validation_loader, criterion, optimizer, epochs=n_epochs)
wandb.finish()

Epoch [1/10], Training loss: 0.6608, Validation loss: 0.6726
Epoch [2/10], Training loss: 0.6384, Validation loss: 0.6656
Epoch [3/10], Training loss: 0.6084, Validation loss: 0.6614
Epoch [4/10], Training loss: 0.5790, Validation loss: 0.6570
Epoch [5/10], Training loss: 0.5532, Validation loss: 0.6539
Epoch [6/10], Training loss: 0.5290, Validation loss: 0.6587
Epoch [7/10], Training loss: 0.5061, Validation loss: 0.6578
Epoch [8/10], Training loss: 0.4846, Validation loss: 0.6540
Early stopping at epoch 8
Finished Training


Run history:
average_training_loss    █▇▆▅▄▃▂▁
average_validation_loss  █▅▄▂▁▃▂▁
epoch                    ▁▂▃▄▅▆▇█

Run summary:
average_training_loss    0.48455
average_validation_loss  0.65395
epoch                    8.0
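
The early-stopping rule from `train_model` can be replayed on the validation losses logged above. This is a standalone sketch of the same counter logic, with the losses copied by hand from the log and a hypothetical function name:

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, counter = loss, 0   # improvement resets the counter
        else:
            counter += 1              # another epoch without improvement
        if counter >= patience:
            return epoch
    return None


# Validation losses copied from the training log above.
logged = [0.6726, 0.6656, 0.6614, 0.6570, 0.6539, 0.6587, 0.6578, 0.6540]

stop_with_3 = early_stop_epoch(logged, patience=3)
stop_with_5 = early_stop_epoch(logged, patience=5)
print(stop_with_3, stop_with_5)
```

With a patience of 3 the rule stops at epoch 8, as in the log, while a patience of 5 would not trigger within these eight epochs. Together with the `/10` epoch count in the log, this suggests the output shown was produced with earlier settings than those in the hyperparameter cell.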


## Evaluation
The model will be evaluated for the key metrics of:
- Accuracy
- Precision
- Recall
- F1 Score

The results will be averaged using micro averaging because I care about the total number of correct predictions regardless of the class ("yes" or "no").

Errors will be evaluated with a confusion matrix that gives me the distribution of true positives, false positives, true negatives and false negatives, helping me figure out where the model is making most of its mistakes.
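
A sketch of how these quantities relate, on invented toy labels rather than model output. It also shows one consequence of micro averaging in a two-class, single-label setting: the pooled precision, recall and F1 all coincide with accuracy, so per-class (binary) values are more informative for spotting imbalance effects.

```python
def confusion_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def f1_from(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy data: 1 = "yes", 0 = "no" (invented for illustration).
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
f1_yes = f1_from(tp, fp, fn)                      # F1 for the "yes" class only
# Micro averaging pools the counts of both classes before computing F1.
f1_micro = f1_from(tp + tn, fp + fn, fn + fp)

print(tp, fp, tn, fn)
print(f"accuracy={accuracy:.3f} f1_yes={f1_yes:.3f} f1_micro={f1_micro:.3f}")
```

Here the "yes"-class F1 differs from accuracy, whereas the micro-averaged F1 does not.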

## Finish the WandB run
Closing the WandB run

In [60]:
wandb.finish()

## Interpretation

To set concrete expectations for my model, I take a couple of key benchmarks into account:
- **Accuracy:** For binary classification, uniform random guessing achieves an accuracy of ~50%. Because "yes" answers are more common than "no" answers in BoolQ, always predicting the majority class scores higher than that and is the stricter baseline.
    - I expect my model to hit an accuracy of ~60-75%.
- **F1 Score:** For this dataset I expect the F1 score to be similar to the accuracy, at ~60-75%.
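
A majority-class baseline is easy to compute from the labels alone. The sketch below uses invented toy labels; in this notebook the real labels could presumably be taken from the `answer` column of `train_data`, which is an assumption about how one would wire it up rather than something done above.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels: 6 "yes" (1) and 4 "no" (0), invented for illustration.
toy_labels = [1] * 6 + [0] * 4
baseline = majority_baseline_accuracy(toy_labels)
print(f"majority baseline accuracy: {baseline:.2f}")
```

Comparing the model's test accuracy against this number, rather than against 50%, shows whether it has learned anything beyond the label distribution.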