## Introduction

* # TODO: INSERT LINK TO WANDB VIEW

## Setup

#### Install dependencies

* **torch**: PyTorch framework for the creation of neural networks
* **lightning**: Lightning wrapper for pytorch for simple network training
* **huggingface_hub**: HuggingFace hub for downloading word vectors
* **datasets**: HuggingFace datasets to download and load the data set
* **wandb**: Weights & Biases for experiment tracking
* **fasttext**: Word embedding library
* **nltk**: Natural Language Toolkit used for word tokenization

In [1]:
import sys

if sys.platform == 'win32': # Windows requires different fasttext implementation
    %pip install -q torch lightning huggingface_hub datasets wandb fasttext-wheel nltk evaluate
else: 
    %pip install -q torch lightning huggingface_hub datasets wandb fasttext nltk evaluate

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


#### Load dataset

Use the pre-defined method to load the dataset and do the train and validation split

In [3]:
from datasets import load_dataset, Dataset

train: Dataset = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid: Dataset = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test: Dataset = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

8741 1000 1221


#### Setup Weights & Biases

Login to weights and biases to enable experiment tracking for later network training

In [None]:
import wandb
wandb.login()

## Preprocessing

#### Vocabulary/Embedding

* I decided to use the **FastText** library for this project, since in class it was said that FastText is superior to the other embedding models and there is no problem with embedding unknown words, because it can create word vectors from their subwords. Furthermore, I will be working with the **facebook/fasttext-en-vectors** word vectors from the HuggingFace hub. They embed words from the English language, which is the only relevant language.

* This choice influences decisions in the following pre-processing steps.

#### Format cleaning (e.g. html-extracted text)

* No format cleaning is performed, because we work with a carefully assembled and standardized dataset used in model benchmarking.

#### Tokenization

* *word_tokenizer* from the **nltk** library will be used. This tokenizer works well for the English language. It also splits punctuation from text, which matches the tokens the fasttext word vectors were trained on.

#### Lowercasing, stemming, lemmatizing, stopword/punctuation removal

* **Lowercasing**: Although the word vectors in use were trained on case-sensitive data, the tokenized words will be lowercased to reach a smaller vocabulary and minimize out-of-vocabulary words.
* **Stemming**: The word embedding model was not trained on word stems and therefore no stemming is carried out.
* **Lemmatizing**: The word tokens to be embedded will not be lemmatized, because the fasttext model was trained on un-lemmatized words and the n-gram encoding of the words used in fasttext preserves sub-word information.
* **Stopword/Punctudation removal**: Since the task is to answer common-sense questions, stopwords and punctuation will not be removed. Most of the questions are quite short and the loss of information if either a critical stopword in the question or punctuation that changes the meaning of the question is removed could be significant.

#### Removal of unknown/other words

* Since I am working with a fasttext model, the removal of unknown words is not necessary, because vectors for them can implicitly be built from their n-gram vectors. Also, the encounter of unknown words is not expected.

#### Truncation

* Input will not be truncated. After some data review, the question yielding the most embedded word vectors yields a tensor of shape **300x67**. Depending on the input format, the RNN model will have to perform significantly less than 100 time steps for the longest input, which is deemed to be feasible. Also, if padding is implemented correctly, for every timestep only the necessary amount of time steps will be executed.

#### Feature selection

* Of the available features, the **question**, the **choices** and the **answerKey** were chosen. While the *questionConcept* seemed like an interesting feature at first, after some data review it was determined, that this feature often simply contains a word from the question. In the end this feature was left out in order not to give too much emphasis to a single word that does likely not help answering the question at all.

#### Input format: how is data passed to the model?

###### Classifier

   * I chose the input for the *Classifier-Model* to be a vector of size **1800**. The first 300 elements are the averages of the embedded question tokens, next are 300 elements for every embedded answer vector from answer option 'A' to 'E'.
      * The average of the question vectors is a good tradeoff between information retention and input  dimension for the classifier.
      * The question vector is before the answer vectors, because "Q&A" also has question first, then answers.
      * The answers are arranged from 'A' to 'E' because of alphabetical order.

###### RNN + Classifier

   * # TODO !

#### Label format: what should the model predict?

* Both model architectures will predict a vector of length 5. Every answer choice ('A' through 'E') is encoded on an index in the vector (0 through 4). Since the classifier at the last stage of the model predicts the likelihood of each output, this output format seems the most reasonable. Also, with this kind of output, the model only needs to run once for every classification, which should increase compute performance.

#### Train/valid/test splits

* As seen in the *Introduction* section, the train/validation/test splits are performed as defined in the course

#### Batching, padding

* # TODO: test on gpu-hub for optimal batching. padding only necessary for RNN model?

### Tokenize

Create method to tokenize and lowercase a given text

In [4]:
import nltk

nltk.download("punkt_tab")

def tokenize(text: str) -> list[str]:
    return [w.lower() for w in nltk.word_tokenize(text, language="english")]

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dave_\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Word embeddings

Download the english fasttext word vectors and load their model into the variable *wv_model*

Create a function to embed a list of tokenized words and return them as a list of pytorch tensors

In [24]:
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("facebook/fasttext-en-vectors", "model.bin")
wv_model = fasttext.load_model(model_path)

def get_embeddings_for_tokens(tokens: list[str]):
    return torch.stack([torch.tensor(wv_model[t]) for t in tokens]).T
    

### Data Loading and Formatting

Create a **pytorch** *Dataset* class in which the HuggingFace dataset is loaded and preprocessed. This allows for an easy integration with a *DataLoader* afterward.

In [41]:
from typing import Callable

import torch
from torch.utils.data import Dataset
from datasets import Dataset as HFDataset

TransformMethod = Callable[[torch.Tensor, list[torch.Tensor]], torch.Tensor]

Create a separator token from a character that is known to the word vector model, but unused in the train, valid and test datasets 

In [26]:
def char_not_in_huggingface_dataset(char: str, dataset: HFDataset) -> bool:
    for datapoint in dataset:
        if char in datapoint["question"] or any(char in c for c in datapoint["choices"]["text"]):
            return False
    return True
    

placeholder = "¦"
assert placeholder in wv_model # Check if placeholder is a known token in the model
assert char_not_in_huggingface_dataset(placeholder, train)
assert char_not_in_huggingface_dataset(placeholder, valid)
assert char_not_in_huggingface_dataset(placeholder, test)

SEP_TOKEN = torch.tensor(wv_model[placeholder])

In [71]:
KEY_INDEX_MAPPING = {
    "A": 0,
    "B": 1,
    "C": 2,
    "D": 3,
    "E": 4,
}

PLACEHOLDER_TOKEN = wv_model[placeholder]

class CommonsenseQADataset(Dataset):
    _target_transform: TransformMethod
    
    def __init__(self, dataset: HFDataset):
        self.dataset: list[dict[str, torch.tensor | list[torch.tensor]]] = []
        self._transform_hugging_face_dataset(dataset)
        
    def set_target_transform(self, transform: TransformMethod):
        self._target_transform = transform
    
    def _transform_hugging_face_dataset(self, dataset: HFDataset):
        self.dataset.extend([{
            "question": get_embeddings_for_tokens(tokenize(entry["question"])),
            "choices": torch.hstack([get_embeddings_for_tokens(tokenize(choice)).mean(dim=1).unsqueeze(1) for choice in entry["choices"]["text"]]),
            "answer": torch.eye(5)[KEY_INDEX_MAPPING[entry["answerKey"]]],
        } for entry in dataset])
        if len(self.dataset) != len(dataset):
            raise RuntimeError("Converted dataset is not full reflection of source data")
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        data_point = self.dataset[idx]
        feature = self._target_transform(data_point["question"], data_point["choices"])
        target = data_point["answer"]
        return feature, target

Interchangeable transform function for **Classifier** network

* Function input: Question-tensor (300 x N) and answer-tensor (300 x 5)
* Function output: Classifier vector (1800)
    * Average of question vectors yields vector of size 300
    * Question vector and answer vectors are concatenated (5 * 300 = 1800)

In [76]:
def classifier_target_transform(question: torch.Tensor, answers: torch.Tensor) -> torch.Tensor:
    return torch.concat((question.mean(dim=1), answers.T.flatten()))

Interchangeable transform function for **RNN + Classifier** network

* Function input: Question-tensor (300 x N) and answer-tensor (300 x 5)
* Function output: RNN-tensor (300 x (N + 10)) with N question vectors, 5 answer vectors and 5 SEP_TOKEN vectors separating the question from the answer and the answers from each other.

In [73]:
def rnn_target_transform(question: torch.Tensor, answers: torch.Tensor) -> torch.Tensor:
    separated_answers = torch.cat([SEP_TOKEN.unsqueeze(1).expand(-1, 5), answers], dim=1) # TODO: fix 5 times SEP then 5 time answers -> interleave the tensors
    return torch.cat([question, separated_answers], dim=1)

In [74]:
data = CommonsenseQADataset(train)
data.set_target_transform(rnn_target_transform)
f, t = data[0]
f.shape

torch.Size([300, 34])

In [75]:
f

tensor([[-0.0517, -0.0448, -0.0492,  ..., -0.0336,  0.1913, -0.0805],
        [ 0.0740,  0.0531, -0.0340,  ..., -0.0322,  0.1969,  0.0047],
        [-0.0131,  0.0327,  0.0386,  ..., -0.0057,  0.0787, -0.0155],
        ...,
        [ 0.2370,  0.0433, -0.0201,  ...,  0.0604, -0.0383, -0.0197],
        [ 0.0004,  0.0266,  0.0398,  ...,  0.0176,  0.0425,  0.0461],
        [-0.0042, -0.0361,  0.0157,  ...,  0.0529, -0.0680, -0.0485]])

## Model

## Training

## Evaluation

## Interpretation