### Task overview <br>
**Goal**: Practice fine-tuning a pre-trained LM (GPT2-small) on the particular task of commonsense question answering (QA) <br>
**Dataset**: [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) <br>
**Description** <br>
- Evaluate the model’s performance on the test set over the course of training. <br>
- Monitor: 
    - whether the model’s performance is improving; 
    - how the fine-tuned model compares with the base pretrained GPT-2.
- Steps:
    1. Data preparation. Similar to [sheet 1.1](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/01-introduction.html#main-training-data-processing-steps) and [sheet 2.3](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02c-MLP-pytorch.html#preparing-the-training-data)
    2. Load the pretrained GPT-2 model
    3. Set up the training pipeline. Steps similar to [sheet 2.5](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02e-intro-to-hf.html)
    4. Run the training while tracking the losses
    5. Save plot of losses for submission

In [1]:
# pkg preparation
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2Tokenizer, GPT2LMHeadModel
import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
!pip3 install accelerate



### Step 1. Data Preparation
1. Acquiring data <br>
2. Minimally exploring dataset <br>
3. Cleaning / wrangling data (combines step 4 from sheet 1.1 and step 1.1 above) <br>
4. Splitting data into training and test sets (we will not do any hyperparameter tuning, and the training set needs no further wrangling) <br>
5. Tokenizing data and making sure it can be batched (i.e., converted into 2D tensors). This also happens in our custom Dataset class, which is common practice when working with text data. <br>
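
To make the batching requirement in item 5 concrete, here is a minimal sketch of what padding and truncating to a fixed length does. It uses plain Python lists and made-up token ids rather than the real GPT-2 tokenizer; `pad_or_truncate`, `attention_mask` and `PAD_ID` are hypothetical names introduced only for this illustration.

```python
from typing import List


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int, side: str = "left") -> List[int]:
    """Return a copy of `ids` with exactly `max_length` entries."""
    ids = ids[:max_length]  # truncate on the right, keeping the start of the text
    padding = [pad_id] * (max_length - len(ids))
    return padding + ids if side == "left" else ids + padding


def attention_mask(ids: List[int], pad_id: int) -> List[int]:
    """1 for real tokens, 0 for padding positions."""
    return [0 if token == pad_id else 1 for token in ids]


# toy token ids (assumption: not real GPT-2 ids)
PAD_ID = 0
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]

padded = [pad_or_truncate(seq, max_length=4, pad_id=PAD_ID) for seq in batch]
masks = [attention_mask(seq, PAD_ID) for seq in padded]

print(padded)  # [[0, 11, 12, 13], [21, 22, 23, 24]]
print(masks)   # [[0, 1, 1, 1], [1, 1, 1, 1]]
```

Once every row has the same length, the list of rows can be turned into a single 2D tensor of shape (batch size, max length).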

In [4]:
# download dataset from HF
dataset = load_dataset("tau/commonsense_qa")
# inspect dataset, print all keys
print("The keys are:\n", dataset.keys())
# print a sample from the dataset
print("A sample:\n", dataset["train"][0]) # CODE
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# set padding side to be left because we are doing causal LM
tokenizer.padding_side = "left"

The keys are:
 dict_keys(['train', 'validation', 'test'])
A sample:
 {'id': '075e483d21c29a511267ef62bedc0461', 'question': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?', 'question_concept': 'punishing', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']}, 'answerKey': 'A'}


In [5]:
def massage_input_text(example):
    """
    Helper for converting input examples which have 
    a separate question, labels, answer options
    into a single string.

    Arguments
    ---------
    example: dict
        Sample input from the dataset which contains the 
        question, answer labels (e.g. A, B, C, D),
        the answer options for the question, and which 
        of the answers is correct.
    
    Returns
    -------
    input_text: str
        Formatted training text which contains the question,
        the formatted answer options (e.g., 'A. <option 1> B. <option 2>' etc)
        and the ground truth answer.
    """
    # combine each label with its corresponding text
    answer_options_list = list(zip(
        example["choices"]["label"],
        example["choices"]["text"]
    ))
    # join each label and text with . and space, then join the options with spaces
    answer_options = " ".join([f"{label}. {text}" for label, text in answer_options_list]) # CODE
    # collapse any repeated whitespace so the options form a single clean string
    answer_options_string = " ".join(answer_options.split()) # CODE
    # combine question and answer options
    input_text = example["question"] + " " + answer_options_string
    # append the true answer with a new line, "Answer: " and the label
    input_text += "\nAnswer: " + example["answerKey"]

    return input_text

# process input texts of train and test sets
massaged_datasets = dataset.map(
    lambda example: {
        "text": massage_input_text(example)
    }
)


In [8]:
# inspect a sample from our preprocessed data
prep_data_sample = massaged_datasets["train"][0] # CODE, modified
print("A sample from the preprocessed data:\n", prep_data_sample) # CODE, modified

A sample from the preprocessed data:
 {'id': '075e483d21c29a511267ef62bedc0461', 'question': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?', 'question_concept': 'punishing', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']}, 'answerKey': 'A', 'text': 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change? A. ignore B. enforce C. authoritarian D. yell at E. avoid\nAnswer: A'}


In [10]:
class CommonsenseQADataset(Dataset):
    """
    Custom dataset class for CommonsenseQA dataset.
    """

    def __init__(
            self, 
            train_split, 
            test_split,
            tokenizer,
            max_length=64,
            dataset_split="train",
        ) -> None:
        """
        Initialize the dataset object.
        
        Arguments
        ---------
        train_split: dict
            Training data dictionary with different columns.
        test_split: dict
            Test data dictionary with different columns.
        tokenizer: Tokenizer
            Initialized tokenizer for processing samples.
        max_length: int
            Maximal length of inputs. All inputs will be 
            truncated or padded to this length.
        dataset_split: str
            Specifies which split of the dataset to use. 
            Default is "train".
        """
        self.train_split = train_split['text']
        self.test_split = test_split['text']
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.dataset_split = dataset_split

    def __len__(self):
        """
        Method returning the length of the selected split (train or test).
        """
        return len(self.train_split) if self.dataset_split == "train" else len(self.test_split) # CODE
    
    def __getitem__(self, idx):
        """
        Method returning a single training example.
        Note that it also tokenizes, truncates or pads the input text.
        Further, it creates a mask tensor for the input text which 
        is used for causal masking in the transformer model.

        Arguments
        ---------
        idx: int
            Index of training sample to be retrieved from the data.
        
        Returns
        --------
        tokenized_input: dict
            Dictionary with input_ids (torch.Tensor) and an attention_mask
            (torch.Tensor).
        """
        # retrieve a training sample at the specified index idx
        # HINT: note that this might depend on self.dataset_split
        input_text = self.train_split[idx] if self.dataset_split == "train" else self.test_split[idx] # CODE
        tokenized_input = self.tokenizer(
            input_text,
            max_length = self.max_length, # CODE
            padding = "max_length",
            truncation = True,
            return_tensors = "pt"
        )
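        # drop the leading batch dimension added by return_tensors="pt",
        # so that the DataLoader stacks samples into (batch_size, max_length)
        tokenized_input["input_ids"] = tokenized_input["input_ids"].squeeze(0)
        # recompute the mask from the pad token, using the tokenizer stored on the object
        tokenized_input["attention_mask"] = (tokenized_input["input_ids"] != self.tokenizer.pad_token_id).long()
        return tokenized_input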
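
One detail the class above leaves open is how padded positions are treated in the loss. Since the pad token is the EOS token here, a common approach is to copy `input_ids` into labels and set padded positions to -100, the default `ignore_index` of PyTorch's cross-entropy loss. The following is only a sketch with plain lists; `make_labels` is a hypothetical helper and the token ids other than 50256 (GPT-2's EOS id) are made up.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def make_labels(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Copy input_ids, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# toy example, left-padded with GPT-2's EOS id; the other ids are arbitrary
input_ids = [50256, 50256, 464, 3280, 318]
attention_mask = [0, 0, 1, 1, 1]

labels = make_labels(input_ids, attention_mask)
print(labels)  # [-100, -100, 464, 3280, 318]
```

With tensors, the same effect can be obtained with `input_ids.masked_fill(attention_mask == 0, -100)`. Whether to mask padding this way is a design choice; passing `input_ids` directly as labels also trains, but then the loss includes the padded positions.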

In [11]:
# move to accelerated device 
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")

Device: cpu


### Step 2. Initialise Model

In [None]:
# load pretrained gpt2 from HF
model = AutoModelForCausalLM.from_pretrained("gpt2") # CODE
# print the total number of parameters
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f} M parameters")

# Hint: If you run out of memory while trying to run the training, try decreasing the batch size.
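
As a sanity check on the printed model size, the parameter count of GPT-2 small can be reproduced by hand. The sketch below assumes the standard GPT-2 small configuration (vocabulary 50257, 1024 positions, hidden size 768, 12 layers) and that the output head shares its weights with the token embedding; the variable names are chosen for this illustration only.

```python
# assumed GPT-2 small configuration
VOCAB, POSITIONS, HIDDEN, LAYERS = 50257, 1024, 768, 12

# token and position embeddings
embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN
# one layer norm has a weight and a bias vector
layer_norm = 2 * HIDDEN
# attention: fused query/key/value projection plus output projection (weights + biases)
attention = (HIDDEN * 3 * HIDDEN + 3 * HIDDEN) + (HIDDEN * HIDDEN + HIDDEN)
# MLP: expansion to 4x hidden size and projection back (weights + biases)
mlp = (HIDDEN * 4 * HIDDEN + 4 * HIDDEN) + (4 * HIDDEN * HIDDEN + HIDDEN)
# each transformer block has two layer norms, attention and MLP
block = 2 * layer_norm + attention + mlp

# all blocks plus the final layer norm; the LM head is tied to the token embedding
total = embeddings + LAYERS * block + layer_norm

print(f"{total:,}")                       # 124,439,808
print(f"{total / 1000**2:.1f} M parameters")  # 124.4 M parameters
```

If the number printed by the cell above differs noticeably from this, a different checkpoint than GPT-2 small was probably loaded.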


### Step 3. Set up configurations required for the training loop

In [16]:
# select the preprocessed splits
# NOTE (assumption): the validation split serves as test data here, because the
# official test split is assumed to come without gold answer labels
train_split = massaged_datasets["train"]
test_split = massaged_datasets["validation"]
# instantiate dataset with the downloaded commonsense_qa data 
train_dataset = CommonsenseQADataset(
    train_split, test_split, tokenizer, 
    max_length = 64, 
    dataset_split = "train" # CODE
)
# instantiate test dataset with the downloaded commonsense_qa data
test_dataset = CommonsenseQADataset(
    train_split, test_split, tokenizer, 
    max_length = 64, 
    dataset_split = "test" # CODE
)
# create a DataLoader for the dataset
# the data loader will automatically batch the data
# and iteratively return training examples (question answer pairs) in batches
dataloader = DataLoader(
    train_dataset, 
    batch_size=32, 
    shuffle=True
)
# create a DataLoader for the test dataset
# reason for separate data loader is that we want to
# be able to use a different index for retrieving the test batches
# we might also want to use a different batch size etc.
test_dataloader = DataLoader(
    test_dataset, 
    batch_size=32, 
    shuffle=True
)

### Step 4. Run the training of the model

In [None]:
# Hint: for implementing the forward pass and loss computation, carefully look at the exercise sheets 
# and the links to examples in HF tutorials.

# put the model in training mode
model.train()
# move the model to the device (e.g. GPU)
model = model.to(device)

# training configurations 
# feel free to play around with these
epochs  = 1
train_steps =  len(train_dataset) // 32
print("Number of training steps: ", train_steps)
# number of test steps to perform every 10 training steps
# (smaller than the entire test split for reasons of comp. time)
num_test_steps = 5

# define optimizer and learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4) 
# define some variables to accumulate the losses
losses = []
test_losses = []

# iterate over epochs
for e in range(epochs):
    # iterate over training steps
    for i in tqdm(range(train_steps)):
        # get a batch of data
        x = next(iter(dataloader))
        # move the data to the device (GPU)
        x = {key: val.to(device) for key, val in x.items()}  # Move batch to device # CODE

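        # forward pass through the model
        # passing the input ids as labels makes the model return the LM loss;
        # the model shifts the labels internally for next-token prediction
        outputs = model(
            input_ids=x["input_ids"],
            attention_mask=x["attention_mask"],
            labels=x["input_ids"],
        )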
        # get the loss
        loss = outputs.loss # CODE
        # backward pass
        loss.backward() # CODE
        losses.append(loss.item())
        # update the parameters of the model
        optimizer.step() # CODE

        # zero out gradient for next step
        optimizer.zero_grad() # CODE

        # evaluate on test set every 10 steps
        if i % 10 == 0:
            print(f"Epoch {e}, step {i}, loss {loss.item()}")
            # track test loss for the evaluation iteration
            test_loss = 0
            for j in range(num_test_steps):
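                # get test batch
                x_test = next(iter(test_dataloader))
                # move each tensor of the batch to the device
                x_test = {key: val.to(device) for key, val in x_test.items()}
                with torch.no_grad():
                    test_outputs = model(
                        **x_test, # CODE
                        labels=x_test["input_ids"], # labels are required for a loss to be returned
                    )
                # accumulate as a Python float so the values can be plotted later
                test_loss += test_outputs.loss.item() # CODE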
                
            test_losses.append(test_loss / num_test_steps)
            print("Test loss: ", test_loss/num_test_steps)
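
A note on `next(iter(dataloader))`: each call builds a fresh iterator, and with `shuffle=True` a fresh iterator means a fresh shuffle. The loop above therefore draws one random batch per step instead of walking through the data once. The toy class below mimics that behaviour in plain Python to make the difference visible; `ShufflingLoader` is a hypothetical stand-in, not the PyTorch `DataLoader`.

```python
import random
from typing import Iterable, Iterator, List


class ShufflingLoader:
    """Toy stand-in for a shuffling data loader: reshuffles on every iter() call."""

    def __init__(self, data: Iterable[int], batch_size: int, seed: int = 0) -> None:
        self.data = list(data)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[List[int]]:
        order = self.data[:]
        self.rng.shuffle(order)
        for start in range(0, len(order), self.batch_size):
            yield order[start:start + self.batch_size]


loader = ShufflingLoader(range(8), batch_size=2)

# pattern from the training loop: a fresh iterator (and shuffle) at every step
seen_fresh = [item for _ in range(4) for item in next(iter(loader))]

# alternative: a single pass over one iterator visits every item exactly once
seen_epoch = [item for batch in loader for item in batch]

print(sorted(seen_epoch))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(len(set(seen_fresh)), "distinct items out of", len(seen_fresh))
```

Both patterns train the model; the second one guarantees that an epoch covers each example once, which may matter when comparing runs.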

### Step 5. Plot the fine-tuning loss and MAKE SURE TO SAVE IT AND SUBMIT IT

In [None]:
# plot the training losses against the training steps
plt.plot(losses) # CODE
plt.xlabel("Training steps")
plt.ylabel("Loss")
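
Per-batch losses tend to be noisy, so a smoothed curve can make the trend easier to read. The sketch below shows a trailing moving average and the step positions at which the test loss was recorded (every 10 steps in the loop above). Both helpers, `moving_average` and `eval_step_positions`, are hypothetical names introduced for this example, and the loss values are made up.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Mean over a trailing window; early entries average what is available so far."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def eval_step_positions(num_train_steps: int, every: int = 10) -> List[int]:
    """Training steps at which an evaluation was run (i % every == 0)."""
    return list(range(0, num_train_steps, every))


toy_losses = [4.0, 2.0, 3.0, 1.0]  # made-up values
print(moving_average(toy_losses, window=2))  # [4.0, 3.0, 2.5, 2.0]
print(eval_step_positions(25))               # [0, 10, 20]
```

In the notebook, the smoothed list could be passed to `plt.plot`, the test losses plotted against `eval_step_positions(len(losses))`, and the figure written to disk with `plt.savefig("losses.png")`, where the file name is only a placeholder.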

In [None]:
# print a few predictions on the eval dataset to see what the model predicts

# construct a list of questions without the ground truth label
# and compare prediction of the model with the ground truth

def construct_test_samples(example):
    """
    Helper for converting input examples which have 
    a separate question, labels, answer options
    into a single string for testing the model.

    Arguments
    ---------
    example: dict
        Sample input from the dataset which contains the 
        question, answer labels (e.g. A, B, C, D),
        the answer options for the question, and which 
        of the answers is correct.
    
    Returns
    -------
    input_text: str, str
        Tuple: Formatted test text which contains the question,
        the formatted answer options (e.g., 'A. <option 1> B. <option 2>' etc); 
        the ground truth answer label only.
    """

    answer_options_list = list(zip(
        example["choices"]["label"],
        example["choices"]["text"]
    ))
    # join each label and text with . and space (same format as in the training texts)
    answer_options = " ".join([f"{label}. {text}" for label, text in answer_options_list]) # CODE
    # collapse any repeated whitespace so the options form a single clean string
    answer_options_string = " ".join(answer_options.split()) # CODE
    # combine question and answer options
    input_text = example["question"] + " " + answer_options_string
    # create the test input text which should be:
    # the input text, followed by the string "Answer: "
    # we don't need to append the ground truth answer since we are creating test inputs
    # and the answer should be predicted.
    input_text += "\nAnswer: " # CODE

    return input_text, example["answerKey"]

test_samples = [construct_test_samples(dataset["validation"][i]) for i in range(10)]
test_samples

### One More Step: Test the model 

In [None]:
# set it to evaluation mode
model.eval()

predictions = []
for sample in test_samples:
    input_text = sample[0]
    input_ids = tokenizer(input_text, return_tensors="pt").to(device)
    output = model.generate(
        input_ids.input_ids,
        attention_mask = input_ids.attention_mask,
        max_new_tokens=2,
        do_sample=True,
        temperature=0.4,
    )
    prediction = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append((input_text, prediction, sample[1]))

print("Predictions of trained model ", predictions)
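
To turn the printed predictions into a number, one option is to extract the predicted label from each generated text and compare it with the ground truth. The sketch below assumes the `(input_text, prediction, gold_label)` tuple layout used above and that the decoded prediction still contains the prompt; `extract_label` and `accuracy` are hypothetical helpers, and the toy predictions are made up.

```python
import re
from typing import List, Optional, Tuple


def extract_label(generated: str) -> Optional[str]:
    """Return the first standalone A-E letter after the last 'Answer:' marker, if any."""
    if "Answer:" not in generated:
        return None
    tail = generated.rsplit("Answer:", 1)[-1]
    match = re.search(r"\b([A-E])\b", tail)
    return match.group(1) if match else None


def accuracy(predictions: List[Tuple[str, str, str]]) -> float:
    """Fraction of samples whose extracted label equals the gold label."""
    if not predictions:
        return 0.0
    correct = sum(extract_label(pred) == gold for _, pred, gold in predictions)
    return correct / len(predictions)


# made-up examples in the (input_text, prediction, gold_label) layout
toy_predictions = [
    ("Q1 A. x B. y\nAnswer: ", "Q1 A. x B. y\nAnswer: B.", "B"),
    ("Q2 A. x B. y\nAnswer: ", "Q2 A. x B. y\nAnswer: the", "A"),
]
print(accuracy(toy_predictions))  # 0.5
```

With five answer options, guessing uniformly at random gives an accuracy of 0.2, which is a useful reference point when comparing the base and the fine-tuned model. Ten samples are too few for a reliable estimate, so a larger slice of the validation split would be needed for a meaningful comparison.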