## Introduction

*INSERT LINK TO WANDB VIEW*

## Setup

#### Install dependencies

* **torch**: PyTorch framework for the creation of neural networks
* **lightning**: Lightning wrapper around PyTorch that simplifies network training
* **huggingface_hub**: HuggingFace hub for downloading word vectors
* **datasets**: HuggingFace datasets to download and load the data set
* **wandb**: Weights & Biases for experiment tracking
* **fasttext**: Word embedding library
* **nltk**: Natural Language Toolkit used for word tokenization
* **evaluate**: ...

In [42]:
import sys

if sys.platform == 'win32': # Windows requires different fasttext implementation
    %pip install -q torch lightning huggingface_hub datasets wandb fasttext-wheel nltk evaluate
else: 
    %pip install -q torch lightning huggingface_hub datasets wandb fasttext nltk evaluate

Note: you may need to restart the kernel to use updated packages.





#### Load dataset

Load the dataset with the predefined splits: the last 1,000 examples of the original train split are held out for validation, and the original validation split serves as the test set.

In [56]:
from datasets import load_dataset, Dataset

train: Dataset = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid: Dataset = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test: Dataset = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

8741 1000 1221
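
The split strings above follow Python-like slice semantics. As a minimal sketch, a plain list of indices can stand in for the original train split; its size of 9,741 is inferred from the printed counts (8,741 + 1,000), and the variable names are illustrative only.

```python
# A plain list of indices stands in for the 9,741 examples of the original train split.
original_train = list(range(9741))

train_part = original_train[:-1000]  # everything except the last 1,000 examples
valid_part = original_train[-1000:]  # only the last 1,000 examples

# The two parts are disjoint and together cover the original split.
assert not set(train_part) & set(valid_part)
assert len(train_part) + len(valid_part) == len(original_train)

print(len(train_part), len(valid_part))  # 8741 1000
```

Because both slices are taken from the same split, no example can end up in both the training and the validation set.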


#### Setup Weights & Biases

Log in to Weights & Biases to enable experiment tracking for the network training later on.

In [None]:
import wandb
wandb.login()

## Preprocessing

#### Vocabulary/Embedding

* I decided to use the **FastText** library for this project. In class, FastText was presented as superior to the other embedding models, and it has no problem embedding unknown words because it can build their vectors from subword vectors. I will work with the pre-trained **facebook/fasttext-en-vectors** word vectors from the HuggingFace hub. They embed English words, and English is the only language relevant to this dataset.

* This choice influences decisions in the following pre-processing steps.

#### Format cleaning (e.g. html-extracted text)

* No format cleaning is performed, because we work with a carefully assembled and standardized dataset used in model benchmarking.

#### Tokenization

* *word_tokenize* from the **nltk** library will be used. This tokenizer works well for English. It also splits punctuation from the surrounding text, which matches the tokens the fasttext word vectors were trained on.
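
To illustrate what splitting punctuation from text means, here is a simplified stand-in tokenizer based on a regular expression. It is not the algorithm nltk uses (nltk handles contractions and other cases differently); the pattern and the name `simple_tokenize` are assumptions made for this sketch.

```python
import re

# Either a run of word characters, or a single character that is
# neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text: str) -> list[str]:
    """Split text into lowercase word and punctuation tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

tokens = simple_tokenize("Where would you find a sloth?")
print(tokens)  # ['where', 'would', 'you', 'find', 'a', 'sloth', '?']
```

The question mark becomes its own token instead of staying attached to "sloth", so the word can be looked up in the embedding model unchanged.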

#### Lowercasing, stemming, lemmatizing, stopword/punctuation removal

* **Lowercasing**: Although the word vectors in use were trained on case-sensitive data, the tokenized words will be lowercased to reach a smaller vocabulary and minimize out-of-vocabulary words.
* **Stemming**: The word embedding model was not trained on word stems and therefore no stemming is carried out.
* **Lemmatizing**: The word tokens to be embedded will not be lemmatized, because the fasttext model was trained on un-lemmatized words and the n-gram encoding of the words used in fasttext preserves sub-word information.
* **Stopword/Punctuation removal**: Since the task is to answer common-sense questions, stopwords and punctuation will not be removed. Most of the questions are quite short, so removing a critical stopword or a punctuation mark that changes the meaning of the question could cause a significant loss of information.
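
The risk of stopword removal can be shown with a small sketch. The stopword set below is hand-picked for illustration and is not the list of any particular library; the two example questions are made up as well.

```python
# Hypothetical, hand-picked stopword set (including punctuation) for illustration only.
STOPWORDS = {"what", "is", "a", "not", "?"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]

question = ["what", "is", "a", "fruit", "?"]
negated_question = ["what", "is", "not", "a", "fruit", "?"]

# Both questions collapse to the same token list although they ask opposite things.
print(remove_stopwords(question))          # ['fruit']
print(remove_stopwords(negated_question))  # ['fruit']
```

After filtering, the model could no longer distinguish the two questions, which is why the tokens are kept as they are.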

#### Removal of unknown/other words

* Since I am working with a fasttext model, removing unknown words is not necessary, because their vectors can be built implicitly from their n-gram vectors. In addition, unknown words are not expected to occur.
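
The subword idea behind this can be sketched as follows: a word is wrapped in boundary markers and split into character n-grams, and the vectors of these n-grams are combined into the word vector. The function below only shows the n-gram extraction. Its name and the default n-gram range are assumptions for this sketch; the real range is a hyperparameter of the trained model, and the library additionally hashes n-grams into buckets.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and the end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word still shares many of these n-grams with known words, which is why a vector can be produced for it.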

#### Truncation

* 

#### Feature selection

* 

#### Input format: how is data passed to the model?

* 

#### Label format: what should the model predict?

* 

#### Train/valid/test splits

* As seen in the *Load dataset* section, the train/validation/test splits are performed as defined in the course.

#### Batching, padding

* TODO: test on gpu-hub for optimal batching. padding only necessary for RNN model?

### Tokenize

Create a function that tokenizes and lowercases a given text.

In [None]:
import nltk

nltk.download("punkt_tab")

def tokenize(text: str) -> list[str]:
    return [w.lower() for w in nltk.word_tokenize(text, language="english")]

### Word embeddings

Download the English fasttext word vectors and load the model into the variable *wv_model*.

Create a function that embeds a list of tokenized words and returns them as a list of PyTorch tensors.

In [65]:
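import fasttext
from huggingface_hub import hf_hub_download
from torch import Tensor, tensor

# Download the pre-trained English fasttext model (cached after the first call)
model_path = hf_hub_download("facebook/fasttext-en-vectors", "model.bin")
wv_model = fasttext.load_model(model_path)

def get_embeddings_for_tokens(tokens: list[str]) -> list[Tensor]:
    # Look up one word vector per token and convert it to a PyTorch tensor
    return [tensor(wv_model[t]) for t in tokens]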

### Data Loading and Formatting

Create a **PyTorch** *Dataset* class in which the HuggingFace dataset is loaded and preprocessed. This allows for easy integration with a *DataLoader* afterward.

In [78]:
import torch
from torch.utils.data import Dataset
from datasets import Dataset as HFDataset

KEY_INDEX_MAPPING = {
    "A": 0,
    "B": 1,
    "C": 2,
    "D": 3,
    "E": 4,
}

class CommonsenseQADataset(Dataset):
    def __init__(self, dataset: HFDataset):
        # Each entry holds the embedded question, the embedded choices and the one-hot answer
        self.dataset: list[dict[str, torch.Tensor | list[torch.Tensor] | list[list[torch.Tensor]]]] = []
        self._transform_hugging_face_dataset(dataset)

    def _transform_hugging_face_dataset(self, dataset: HFDataset):
        for entry in dataset:
            self.dataset.append({
                "question": get_embeddings_for_tokens(tokenize(entry["question"])),
                "choices": [get_embeddings_for_tokens(tokenize(choice)) for choice in entry["choices"]["text"]],
                # One-hot vector over the five answer options A to E
                "answer": torch.eye(5)[KEY_INDEX_MAPPING[entry["answerKey"]]],
            })
        if len(self.dataset) != len(dataset):
            raise RuntimeError("Converted dataset is not a full reflection of the source data")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Return the preprocessed entry at the given index
        return self.dataset[idx]
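
Since questions differ in length, their token vectors have to be brought to a common length before a *DataLoader* can stack them into a batch. The following is a minimal sketch of zero-padding with `pad_sequence`, which could be called from a custom `collate_fn`. The helper name `pad_questions` is hypothetical, random vectors stand in for real embeddings, and the dimension of 300 is an assumption about the loaded fasttext model.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

EMBEDDING_DIM = 300  # assumed dimensionality of the fasttext word vectors

def pad_questions(questions: list[list[torch.Tensor]]) -> tuple[torch.Tensor, torch.Tensor]:
    """Stack per-token vectors and zero-pad to the longest question in the batch."""
    # Each question becomes a tensor of shape (num_tokens, EMBEDDING_DIM).
    stacked = [torch.stack(tokens) for tokens in questions]
    # Keep the true lengths so padded positions can be ignored later.
    lengths = torch.tensor([len(tokens) for tokens in questions])
    # batch_first=True yields shape (batch_size, max_num_tokens, EMBEDDING_DIM).
    padded = pad_sequence(stacked, batch_first=True, padding_value=0.0)
    return padded, lengths

# Random vectors stand in for the embeddings of a 3-token and a 5-token question.
batch = [
    [torch.randn(EMBEDDING_DIM) for _ in range(3)],
    [torch.randn(EMBEDDING_DIM) for _ in range(5)],
]
padded, lengths = pad_questions(batch)
print(padded.shape, lengths.tolist())
```

Returning the original lengths alongside the padded tensor keeps the option open to pack the sequences for an RNN or to mask the padded positions.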

## Model

## Training

## Evaluation

## Interpretation