# NLP - Word Embeddings - Pascal Thürig

## Introduction
The starting point for this project is the following set of key requirements:
1. Use the BoolQ dataset from Hugging Face
2. Use a pre-trained model for word embeddings (word2vec, GloVe or fastText)
3. Train a 2-layer classifier with ReLU non-linearity

In this project I use pre-trained word2vec embeddings and a simple 2-layer neural network to tackle the reading comprehension task on the BoolQ dataset.
I document every decision made, from preprocessing to model training and evaluation. The goal is to classify each BoolQ question-passage pair as either "yes" or "no".



## TL;DR: Key decisions and justifications
- BoolQ Dataset: Provided by the project brief
- Task: Classify BoolQ questions as either "yes" or "no" using pre-trained embeddings and a simple neural network
- Pre-trained embeddings: word2vec (Google News 300), chosen for simplicity and because I already have some experience with it
- Model: 2-layer NN with ReLU activation, as provided by the project brief
- Tokenizing: Yes, using a subword tokenizer.
- Lowercasing: Yes, all text will be lowercased.
- Stemming: No, stemming will not be applied.
- Lemmatizing: No, lemmatizing is not used initially but could be tried later.
- Stopword removal: No, stopwords are not removed to retain key information.
- Removal of other words: No, no other word removal is planned.
- Format cleaning: No further cleaning required, the dataset is already clean.
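- Truncation: Yes, input text is truncated to a maximum length (planned: 512 tokens; the code below currently uses `max_seq_length = 100` each for the question and the passage).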
- Feature selection: None, relying on word2vec embeddings directly.
- Input format: Tokenized and padded sequences of word2vec embeddings.
- Label format: Binary (1 for "yes", 0 for "no").
- train/valid/test splits: 66% train, 8% validation, 26% test.
- Padding: Yes, sequences are padded for uniform input length.
- Embedding: Pre-trained word2vec embeddings are used for simplicity.
- Planned correctness tests: Shape consistency checks, binary label correctness, and validation of truncation and padding.
- Hyperparameters:
    - Learning Rate: 1e-2 – 1e-5
    - Batch Size: 16 - 64 (choosing maximum possible that my GPU can handle)
    - Epochs: 10 - 50 in increments of 5 or 10
    - Hidden size: 64 - 512 
    - Early Stopping: Patience of 3 - 10 Epochs of non-improvement (depending on total epoch number)
- Evaluation: Accuracy and F1 score; accuracy for general performance and F1 to account for class imbalance
- Error Analysis: Investigating false positives and false negatives to understand where the model fails

## Setup
Importing necessary libraries:
- datasets
- gensim
- nltk
- transformers
- numpy
- torch
- wandb
- sklearn

In [7]:
%pip install datasets gensim nltk transformers numpy torch wandb scikit-learn

from datasets import load_dataset
import gensim.downloader as api
import gensim
import nltk
from transformers import AutoTokenizer
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import wandb
import sklearn


Note: you may need to restart the kernel to use updated packages.


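First, the BoolQ dataset is loaded. The last 1,000 training examples are held out for validation, and the official `validation` split is used as the test set.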

In [8]:
train_data = load_dataset('google/boolq', split='train[:-1000]')
validation_data = load_dataset('google/boolq', split='train[-1000:]')
test_data = load_dataset('google/boolq', split='validation')

print(train_data[0])
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(validation_data)}")
print(f"Number of test samples: {len(test_data)}")

{'question': 'do iran and afghanistan speak the same language', 'answer': True, 'passage': 'Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.'}
Number of training samples: 8427
Number of validation samples: 1000
Number of test samples: 3270
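
As a quick sanity check, the 66/8/26 split quoted in the key decisions can be reproduced from the three sample counts printed above. This is only arithmetic on those counts; the variable names are made up for this sketch.

```python
# Split sizes as printed by the loading cell
split_sizes = {"train": 8427, "validation": 1000, "test": 3270}
total = sum(split_sizes.values())

# Share of each split, rounded to whole percent
split_percent = {name: round(100 * n / total) for name, n in split_sizes.items()}
print(total, split_percent)
```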


For easy access during experiments, I like to define the hyperparameters at the top of my notebooks.

In [9]:
# Hyperparameters
w2v_model_name = "word2vec-google-news-300"
w2v_model_path = "word2vec-google-news-300.model"
EMBEDDING_DIM = 300
max_seq_length = 100

batch_size = 10
n_epochs = 10
learning_rate = 0.0001
hidden_dim = 64

run_number = 1
wandb_project_name = "nlp-word_embeddings-pascal_thuerig"
wandb_run_name = f"run_{run_number}-batch_size_{batch_size}-n_epochs_{n_epochs}-lr_{learning_rate}-hidden_dim_{hidden_dim}"

sequence_length = max_seq_length * 2  # Concatenate question and passage
input_dim = sequence_length * EMBEDDING_DIM
output_dim = 2  # Binary classification
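
To get a feel for what these settings imply, the following sketch works out the flattened input size and the number of trainable parameters. It assumes the classifier consists of two fully connected layers (input to hidden, hidden to output), as the project brief describes; the `*_params` names are hypothetical.

```python
EMBEDDING_DIM = 300
max_seq_length = 100
hidden_dim = 64
output_dim = 2

sequence_length = max_seq_length * 2          # question tokens + passage tokens
input_dim = sequence_length * EMBEDDING_DIM   # flattened embedding matrix

# A linear layer has in_features * out_features weights plus out_features biases
fc1_params = input_dim * hidden_dim + hidden_dim
fc2_params = hidden_dim * output_dim + output_dim
total_params = fc1_params + fc2_params
print(input_dim, fc1_params, fc2_params, total_params)
```

Almost all parameters sit in the first layer, so the hidden size and `max_seq_length` largely determine the memory footprint.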


Next, the pre-trained word2vec embeddings (word2vec-google-news-300) are loaded:

In [10]:
# Check if the model file exists
try:
    # Load the model if it exists locally
    word2vec_model = gensim.models.KeyedVectors.load(w2v_model_path)
    print("Model loaded from local storage.")
except FileNotFoundError:
    # Download and save the model if it doesn't exist
    print("Downloading Word2Vec model...")
    word2vec_model = api.load(w2v_model_name)
    word2vec_model.save(w2v_model_path)  # Save the model locally
    print("Model downloaded and saved to local storage.")

Model loaded from local storage.


## Preprocessing

The BoolQ data will be processed in the following way:
1.  Tokenizing: The input questions and passages are tokenized with a subword tokenizer.
2.  Lowercasing: Yes, for simplicity and to reduce the total vocabulary size.
3.  Stemming: No, so as not to lose information.
4.  Lemmatizing: No, not initially; it may be tried later to see whether it improves performance.
5.  Stopword removal: No, stopwords are kept so that potentially critical information is not lost ([research](https://datascience.stackexchange.com/questions/31048/pros-cons-of-stop-word-removal)).
6.  Removal of other words: No other words are removed.
7.  Format cleaning: None; the dataset is already sufficiently clean, so further cleaning should not affect performance.
8.  Truncation: The input text is truncated to a maximum of 512 tokens (the current code uses `max_seq_length = 100` each for the question and the passage).
9.  Feature selection: Not applicable. The raw text is the input and the pre-trained word embeddings are used directly, so no further feature extraction is needed.
10. Input format: Tokenized and padded sequences of word embeddings.
11. Label format: Binary labels (1 for "yes", 0 for "no").
12. train/valid/test splits: Prerequisite of the project (66/8/26).
13. Padding: The sequences are padded so that all inputs have the same length in each batch.
14. Embedding: word2vec, chosen solely for simplicity because I already know it.
15. Planned correctness tests: Check for shape mismatches between tokenized text and word embeddings; ensure that input sequences are properly truncated and padded; verify that binary labels are correctly assigned and match the expected outputs.


1. Preprocess text (lowercasing)

In [11]:
def lowercase_text(text):
    return text.lower()

2. Tokenize with AutoTokenizer from Hugging Face

In [12]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_text(text):
    return tokenizer(text, return_tensors='pt')

sample_text = "This is an example sentence to be tokenized by AutoTokenizer."
tokenized_output = tokenize_text(sample_text)

print(tokenized_output.tokens())
print(tokenized_output['input_ids'])
print(tokenized_output['attention_mask'])

['[CLS]', 'this', 'is', 'an', 'example', 'sentence', 'to', 'be', 'token', '##ized', 'by', 'auto', '##tok', '##eni', '##zer', '.', '[SEP]']
tensor([[  101,  2023,  2003,  2019,  2742,  6251,  2000,  2022, 19204,  3550,
          2011,  8285, 18715, 18595,  6290,  1012,   102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])




3. Truncate or add padding

In [13]:
# Truncate tokens or add padding depending on max_length
def pad_or_truncate(tokens, max_length, pad_token='[PAD]'):
    if len(tokens) > max_length:
        return tokens[:max_length]
    else:
        return tokens + [pad_token] * (max_length - len(tokens))
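
The planned truncation and padding checks could look roughly like the sketch below. The function is repeated so the block runs on its own, and the sample token lists are made up.

```python
def pad_or_truncate(tokens, max_length, pad_token='[PAD]'):
    if len(tokens) > max_length:
        return tokens[:max_length]
    return tokens + [pad_token] * (max_length - len(tokens))

short = pad_or_truncate(['do', 'cats', 'purr'], 5)       # shorter than max_length
long = pad_or_truncate(list('abcdefgh'), 5)               # longer than max_length
exact = pad_or_truncate(['a', 'b', 'c', 'd', 'e'], 5)     # exactly max_length

# Every result must have the same length, whatever the input length was
assert len(short) == len(long) == len(exact) == 5
# Padding is appended at the end; truncation keeps the first tokens
assert short == ['do', 'cats', 'purr', '[PAD]', '[PAD]']
assert long == ['a', 'b', 'c', 'd', 'e']
print("padding and truncation checks passed")
```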

4. Preprocess Pipeline

In [14]:
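def preprocess_pipeline(text, word2vec_model):
    # word2vec_model is not used here; the parameter is kept so call sites stay unchanged
    text = lowercase_text(text)
    # tokenizer.tokenize returns a plain list of token strings (no [CLS]/[SEP]),
    # which is the format pad_or_truncate expects
    tokens = tokenizer.tokenize(text)
    tokens = pad_or_truncate(tokens, max_seq_length)
    return tokens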

5. Embed tokens using word2vec (word2vec-google-news-300)

In [15]:
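def tokens_to_embeddings(tokens, word2vec_model, embedding_dim=EMBEDDING_DIM):
    embeddings = []
    for token in tokens:
        if token in word2vec_model:
            embeddings.append(word2vec_model[token])
        else:
            # Tokens missing from the word2vec vocabulary map to a zero vector;
            # float32 matches the dtype of the word2vec vectors
            embeddings.append(np.zeros(embedding_dim, dtype=np.float32))
    return np.array(embeddings)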
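
Because every unknown token becomes a zero vector, it may be worth measuring how many tokens are actually found in the embedding vocabulary. Subword pieces such as `##ized` may not appear in a word-level vocabulary. The helper below is a hypothetical sketch: `oov_rate` is not part of the notebook, and a small set stands in for the word2vec model (the real model also supports `token in word2vec_model`).

```python
def oov_rate(tokens, vocab, pad_token='[PAD]'):
    """Fraction of non-padding tokens that are missing from the vocabulary."""
    real_tokens = [t for t in tokens if t != pad_token]
    if not real_tokens:
        return 0.0
    missing = sum(1 for t in real_tokens if t not in vocab)
    return missing / len(real_tokens)

# Toy stand-in for the word2vec vocabulary (assumed contents)
toy_vocab = {'this', 'is', 'token'}
tokens = ['this', 'is', 'token', '##ized', '[PAD]', '[PAD]']

rate = oov_rate(tokens, toy_vocab)
print(f"out-of-vocabulary rate: {rate:.2f}")
```

A high rate on the real data would suggest trying a word-level tokenizer instead of a subword one.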

6. Create a custom BoolQ dataset class to:
    - get the data into a format compatible with the PyTorch DataLoader.
    - organize question-passage pairs and apply the preprocessing pipeline.
    - easily batch, shuffle, and load the data during training.

In [16]:
class BoolQDataset(Dataset):
    def __init__(self, data, word2vec_model, max_seq_length=max_seq_length):
        self.data = data
        self.word2vec_model = word2vec_model
        self.max_seq_length = max_seq_length
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # idx avoids shadowing the built-in id()
        question = self.data[idx]['question']
        passage = self.data[idx]['passage']
        
        label = 1 if self.data[idx]['answer'] else 0
        
        question_tokens = preprocess_pipeline(question, self.word2vec_model)
        passage_tokens = preprocess_pipeline(passage, self.word2vec_model)
        
        question_embeddings = tokens_to_embeddings(question_tokens, self.word2vec_model)
        passage_embeddings = tokens_to_embeddings(passage_tokens, self.word2vec_model)
        
        embeddings = np.concatenate((question_embeddings, passage_embeddings), axis=0)
        
        return torch.tensor(embeddings, dtype=torch.float32), torch.tensor(label, dtype=torch.long)

7. Create the DataLoaders required by PyTorch

In [17]:
train_dataset = BoolQDataset(train_data, word2vec_model)
validation_dataset = BoolQDataset(validation_data, word2vec_model)
test_dataset = BoolQDataset(test_data, word2vec_model)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

8. Initialize Weights & Biases (WandB) for experiment tracking

In [18]:
wandb.init(project=wandb_project_name, name=wandb_run_name)
wandb.config.learning_rate = learning_rate
wandb.config.epochs = n_epochs
wandb.config.batch_size = batch_size


## Model

The model architecture for this project is already fixed in the project brief as follows:
- **Network Architecture:** 2-layer network with ReLU non-linearity.
- **Loss / Optimizer:** Loss: CrossEntropyLoss / Optimizer: Adam (potentially trying SGD with or without momentum in experiments)
- **Experiments to run**: Listed in the Training section below
- **Number of training runs**: Depends on the number of experiments
- **Checkpointing / Early stopping:** Stop after 3 - 10 epochs without improvement of the validation loss
- **Planned correctness tests:** Shape and dimension consistency tests, gradient check, sanity check and prediction tests

1. Creating the neural network class:

In [19]:
class NN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(NN, self).__init__()
        
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        
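    def forward(self, x):
        # Flatten (batch, sequence_length, EMBEDDING_DIM) into (batch, input_dim),
        # since fc1 expects one flat vector per example
        x = x.flatten(start_dim=1)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x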
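
The planned shape consistency test can be sketched as follows. The dataset returns one `(sequence_length, EMBEDDING_DIM)` matrix per example, while a linear layer expects a flat vector, so this sketch flattens the input first. The dimensions are deliberately tiny stand-ins, and `nn.Sequential` is used here only to keep the example short; it is not the notebook's `NN` class.

```python
import torch
import torch.nn as nn

# Small assumed dimensions so the check runs quickly
seq_len, emb_dim, hidden, n_classes, batch = 8, 5, 4, 2, 3

toy_model = nn.Sequential(
    nn.Flatten(),                           # (batch, seq_len, emb_dim) -> (batch, seq_len * emb_dim)
    nn.Linear(seq_len * emb_dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_classes),
)

x = torch.randn(batch, seq_len, emb_dim)    # random stand-in for a batch of embeddings
logits = toy_model(x)

# One row of logits per example, one column per class
assert logits.shape == (batch, n_classes)
print(logits.shape)
```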

2. Create an instance of the model and move it to the GPU if one is available (otherwise it stays on the CPU)

In [21]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = NN(input_dim, hidden_dim, output_dim).to(device)

Using device: cpu


3. Loss (nn.CrossEntropyLoss) and optimizer (optim.Adam)
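
A minimal sketch of this step, assuming the Adam learning rate from the hyperparameter cell (1e-4). A small stand-in network replaces the real model so the block runs on its own; the toy batch is random.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in for the classifier, with small assumed dimensions
toy_model = nn.Sequential(nn.Linear(6, 4), nn.ReLU(), nn.Linear(4, 2))

criterion = nn.CrossEntropyLoss()                       # takes raw logits and integer class labels
optimizer = optim.Adam(toy_model.parameters(), lr=1e-4)

# One optimisation step on a random toy batch
inputs = torch.randn(8, 6)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(toy_model(inputs), labels)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```

`CrossEntropyLoss` applies the softmax internally, which is why the model's `forward` returns raw logits for the two classes.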

4. Training loop
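
One possible shape for the training loop is sketched below. `train_one_epoch` is a hypothetical helper, and the toy data and model are stand-ins for `train_loader` and the real classifier; WandB logging and early stopping are left out to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over the loader and return the mean training loss."""
    model.train()
    total_loss, n_samples = 0.0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)   # weight by batch size
        n_samples += labels.size(0)
    return total_loss / n_samples

# Toy data: random features with a label that depends on the first feature
torch.manual_seed(0)
X = torch.randn(64, 6)
y = (X[:, 0] > 0).long()
toy_loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

device = torch.device('cpu')
toy_model = nn.Sequential(nn.Linear(6, 8), nn.ReLU(), nn.Linear(8, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(toy_model.parameters(), lr=1e-2)

losses = [train_one_epoch(toy_model, toy_loader, criterion, optimizer, device)
          for _ in range(20)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```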

5. Evaluation function
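
An evaluation function could follow the pattern below: switch the model to eval mode, disable gradient tracking, and collect predictions alongside the mean loss. `evaluate` and the toy model (rigged to always predict class 1) are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    """Return mean loss, accuracy, predictions and targets for one data split."""
    model.eval()
    total_loss, preds, targets = 0.0, [], []
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)
        total_loss += criterion(logits, labels).item() * labels.size(0)
        preds.extend(logits.argmax(dim=1).tolist())
        targets.extend(labels.tolist())
    accuracy = sum(p == t for p, t in zip(preds, targets)) / len(targets)
    return total_loss / len(targets), accuracy, preds, targets

# Toy model that always predicts class 1: zero weights, bias favouring class 1
toy_model = nn.Linear(4, 2)
with torch.no_grad():
    toy_model.weight.zero_()
    toy_model.bias.copy_(torch.tensor([0.0, 1.0]))

X = torch.randn(10, 4)
y = torch.tensor([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
toy_loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

val_loss, val_acc, preds, targets = evaluate(
    toy_model, toy_loader, nn.CrossEntropyLoss(), torch.device('cpu'))
print(f"loss={val_loss:.3f}, accuracy={val_acc:.2f}")
```

Returning the raw predictions and targets makes it straightforward to build a confusion matrix afterwards.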

## Training
The model is trained with hyperparameters drawn from the following ranges:
- Learning rate: 1e-2 – 1e-5
- Batch size: 16 - 64
- Epochs: 10 - 50
- Hidden size: 64 - 512
- Early Stopping: Patience of 3 - 10 Epochs of non-improvement
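
Early stopping with a patience value can be sketched as a small helper that tracks the best validation loss seen so far. The `EarlyStopping` class and the validation losses below are made up for illustration and are not results from this project.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation-loss improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

In this toy run the later improvement at epoch 6 is never reached, which illustrates the trade-off of a small patience value.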


## Evaluation
The model will be evaluated on the following key metrics:
- Accuracy
- Precision
- Recall
- F1 Score

The results will be micro-averaged because I care about the total number of correct predictions regardless of the class ("yes" or "no"). Note that for a single-label binary task, micro-averaged precision, recall and F1 all coincide with accuracy.

Errors will be analysed with a confusion matrix, which gives the distribution of true positives, false positives, true negatives and false negatives. This helps me figure out where the model makes most of its mistakes.
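
The metrics and the confusion matrix can be computed from the label and prediction lists alone, as in the sketch below. The function names and the example labels are made up; scikit-learn offers equivalent functions, but plain Python keeps the arithmetic visible.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) with label 1 ("yes") as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def binary_metrics(tp, fp, tn, fn):
    """Accuracy plus precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = binary_metrics(tp, fp, tn, fn)

# Micro-averaging pools the counts of both classes: each correct prediction is a
# true positive for its class, each error is a false positive for one class and
# a false negative for the other
micro_tp, micro_fp, micro_fn = tp + tn, fp + fn, fn + fp
micro_f1 = 2 * micro_tp / (2 * micro_tp + micro_fp + micro_fn)

print(f"tp={tp} fp={fp} tn={tn} fn={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"micro-averaged f1={micro_f1:.2f}")
```

In this example the micro-averaged F1 equals the accuracy, while the F1 of the "yes" class differs from it.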

## Finish the WandB run
The WandB run is closed with `wandb.finish()` once training and evaluation are done.

## Interpretation

To set concrete expectations for my model, I take a couple of key benchmarks into account:
- **Accuracy:** For a balanced binary classification task, random guessing reaches an accuracy of ~50%. BoolQ is skewed towards "yes" (roughly 62% of the examples), so always predicting "yes" is the stronger baseline to beat.
    - I expect my model to reach an accuracy of ~60-75%.
- **F1 Score:** For this dataset I expect the F1 score to be similar to the accuracy, at ~60-75%.
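
To put these expectations into context, it helps to work out what a trivial classifier that always answers "yes" would score. The sketch below assumes that about 62% of the examples are labelled "yes"; this share is an assumption here and should be replaced with the value measured on the actual split. `always_yes_baseline` is a hypothetical helper.

```python
def always_yes_baseline(positive_fraction):
    """Accuracy and F1 (for the "yes" class) of a classifier that always predicts "yes"."""
    accuracy = positive_fraction    # correct on every "yes" example, wrong on every "no"
    precision = positive_fraction   # share of predicted "yes" that are truly "yes"
    recall = 1.0                    # no "yes" example is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# 0.62 is an assumed share of "yes" labels, not a value computed in this notebook
baseline_acc, baseline_f1 = always_yes_baseline(0.62)
print(f"always-yes accuracy={baseline_acc:.2f}, f1={baseline_f1:.2f}")
```

Under this assumption the trivial baseline already reaches an F1 of about 0.77 for the "yes" class, so F1 should be read together with the confusion matrix and not on its own.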