**Optional**

Run this first cell only if you want to interact with the Hugging Face Hub (huggingface_hub), for example to push the fine-tuned model.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

**DEALING WITH DATA FIRST**

The first step is to import the data and preprocess it into a format the model can use.

We use the Situations With Adversarial Generations (SWAG) dataset - https://rowanzellers.com/swag/

Here we import the dataset from the Hugging Face Datasets library.

Use 'pip install datasets' to install the library.

In [None]:
from datasets import load_dataset

In [None]:
swag = load_dataset("swag", "regular", trust_remote_code=True)

**TOKENIZE DATA**

We have to convert the dataset into a format that a language model can understand. Just as we learn the vocabulary and grammar of a language, each model has its own vocabulary and tokenization rules, fixed when it was pretrained.

Since we are using BERT, we will use the tokenizer of bert-base-uncased.

I find it easiest to use the Hugging Face Transformers library, whose AutoTokenizer class automatically picks the right tokenizer for the model ID we choose.

Use 'pip install transformers' to install the library.

<font color=red>**WARNING** </font>

If you get a warning that PyTorch/TensorFlow is not installed on your system, install one of them first, using a CUDA-enabled build if you have a GPU.

I have used PyTorch here. One way to install it is to open the Anaconda Prompt and do the following:

conda create -n [virtual env name] python=[Python version you want for the virtual environment]

conda activate [virtual env name]

Install PyTorch with CUDA support (**very important if you have a GPU and want to use it**) - https://pytorch.org/get-started/locally/

conda deactivate (to exit the virtual environment when you are done)

Use the virtual env in your IDE/terminal for the next steps.

In [None]:
from transformers import AutoTokenizer, AutoModelForMultipleChoice
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMultipleChoice.from_pretrained('bert-base-uncased')

In [None]:
swag

**HOW DOES THE DATASET LOOK AND HOW TO PREPROCESS?**

If you have run the above cell, you will see that each split of the SWAG dataset has several ID and metadata fields plus startphrase, sent1, sent2, ending0, ending1, ending2, ending3, and label.

This is where we have to be careful: how we preprocess the data depends on the task. The problem I want to solve is to fine-tune BERT to predict the correct ending for sent2. The way I want to train BERT is the following:

1) Input - sent1 paired with sent2+ending_i for each of the four endings (i = 0..3), so every example becomes four candidate sequences, exactly one of which contains the correct ending. This is what the preprocess function below builds.

2) Loss - We then calculate the loss between BERT's prediction over the candidates and the true label, and update the weights using its gradients.
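
The loss in step 2 can be sketched in plain Python. This is a minimal illustration, assuming the model emits one score (logit) per candidate ending and that the loss is the usual softmax cross-entropy over those scores; the function name and the logit values are hypothetical and not part of the notebook.

```python
import math


def multiple_choice_loss(logits, label):
    """Softmax cross-entropy over one example's per-ending scores."""
    # Subtract the maximum logit for numerical stability.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The loss is the negative log-probability given to the correct ending.
    return -math.log(probs[label]), probs


# Hypothetical scores for ending0..ending3; the true label is ending0.
logits = [2.0, 0.5, -1.0, 0.1]
loss, probs = multiple_choice_loss(logits, label=0)
print(f"loss={loss:.4f}", [round(p, 3) for p in probs])
```

The loss shrinks as the model puts more probability on the labelled ending, which is what the gradient update pushes towards.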

So we preprocess the dataset accordingly as follows: 

I have taken the preprocess function directly from Hugging Face - https://huggingface.co/docs/transformers/en/tasks/multiple_choice 

Note that it works with BERT-based architectures, but you may need to check it for other architectures.


In [None]:
def preprocess_function(examples, tokenizer):
    """Tokenize a batch so that each example yields four (context, ending) pairs."""
    ending_names = ["ending0", "ending1", "ending2", "ending3"]

    # Repeat each context (sent1) four times, once per candidate ending.
    first_sentences = []
    for context in examples["sent1"]:
        current_first_sentences = [context] * 4
        first_sentences.append(current_first_sentences)

    question_headers = examples["sent2"]

    # Append each candidate ending to the start of the second sentence (sent2).
    second_sentences = []
    for i, header in enumerate(question_headers):
        current_second_sentences = []
        for end in ending_names:
            current_second_sentences.append(f"{header} {examples[end][i]}")
        second_sentences.append(current_second_sentences)

    # Flatten the nested lists so the tokenizer sees one long list of pairs.
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)

    # Regroup every tokenizer output into chunks of four, one chunk per example.
    result = {}
    for key, values in tokenized_examples.items():
        split_values = []
        for i in range(0, len(values), 4):
            split_values.append(values[i : i + 4])
        result[key] = split_values

    return result
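
To see what the function feeds to the tokenizer, here is a small sketch that builds only the sentence pairs, without tokenizing. The toy batch and the helper name build_pairs are hypothetical; the field names match the SWAG columns used above.

```python
def build_pairs(examples):
    """Return the flattened (first, second) sentence lists for a batch."""
    ending_names = ["ending0", "ending1", "ending2", "ending3"]
    first, second = [], []
    for i, context in enumerate(examples["sent1"]):
        for name in ending_names:
            # Same context every time, paired with sent2 plus one ending.
            first.append(context)
            second.append(f"{examples['sent2'][i]} {examples[name][i]}")
    return first, second


# Hypothetical one-example batch in the batched (column-wise) layout.
toy_batch = {
    "sent1": ["A chef walks into the kitchen."],
    "sent2": ["The chef"],
    "ending0": ["picks up a knife."],
    "ending1": ["flies to the moon."],
    "ending2": ["paints the ceiling."],
    "ending3": ["starts a car."],
}

first, second = build_pairs(toy_batch)
for a, b in zip(first, second):
    print(a, "|", b)
```

One example produces four pairs, so a batch of N examples produces 4N sequences, which is why the tokenizer output is regrouped in chunks of four.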


In [None]:
tokenized_swag = swag.map(lambda examples: preprocess_function(examples, tokenizer), batched=True)

In [None]:
tokenized_swag["train"].format

You will notice the additional fields 'input_ids', 'token_type_ids', and 'attention_mask', which show that the dataset has been tokenized. To learn what each of these fields means, refer to https://huggingface.co/docs/transformers/en/glossary
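
As a rough picture of these three fields, the sketch below encodes a sentence pair in the BERT layout ([CLS] first [SEP] second [SEP]). It is only an illustration: it splits on whitespace and assigns toy IDs from a hypothetical vocabulary, whereas the real bert-base-uncased tokenizer uses WordPiece and its own vocabulary.

```python
def encode_pair(first, second, vocab):
    """Toy BERT-style encoding of a sentence pair (not the real tokenizer)."""
    tokens = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    first_segment_len = len(tokens)
    tokens += second.lower().split() + ["[SEP]"]
    return {
        # One integer ID per token; unseen tokens get the next free toy ID.
        "input_ids": [vocab.setdefault(t, len(vocab)) for t in tokens],
        # 0 marks the first segment, 1 marks the second segment.
        "token_type_ids": [0] * first_segment_len
        + [1] * (len(tokens) - first_segment_len),
        # 1 marks a real token; padding added later would be marked 0.
        "attention_mask": [1] * len(tokens),
    }


# Hypothetical starting vocabulary with toy special-token IDs.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
encoded = encode_pair("the chef cooks", "he smiles", vocab)
print(encoded)
```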

Hugging Face Transformers doesn’t have a data collator for multiple choice, so you’ll need to adapt the DataCollatorWithPadding to create a batch of examples. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

DataCollatorForMultipleChoice flattens all the model inputs, applies padding, and then unflattens the results:

In [None]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
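
The flatten, pad, and unflatten steps can be traced with plain lists. This is a simplified sketch, assuming a pad ID of 0 and only the input IDs; the helper collate_choices and the token IDs are hypothetical, and two choices per example are used to keep the output short.

```python
def collate_choices(features, pad_id=0):
    """Pad every choice to the longest sequence in the batch, then regroup."""
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten: (batch, choices, tokens) becomes (batch * choices, tokens).
    flat = [seq for example in features for seq in example]
    # Pad dynamically to the longest sequence in this batch only.
    longest = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in flat]
    # Unflatten back to (batch, choices, tokens).
    return [
        padded[i * num_choices : (i + 1) * num_choices] for i in range(batch_size)
    ]


# Hypothetical input IDs: two examples with two choices each.
features = [
    [[101, 7, 102], [101, 7, 8, 102]],
    [[101, 9, 10, 11, 102], [101, 9, 102]],
]
batch = collate_choices(features)
for example in batch:
    print(example)
```

Because the padding length comes from the current batch, a batch of short sentences stays short instead of being padded to the dataset-wide maximum.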

Check whether a GPU is available

In [None]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

**Decide the metric you want to track while training**

I have decided to go with accuracy, the fraction of examples for which the highest-scoring ending matches the label.

In [None]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
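
What compute_metrics does can be checked by hand on a tiny example. The sketch below uses plain Python instead of numpy and evaluate; the logits and labels are hypothetical.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring ending per row and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)


# Hypothetical scores for three examples with four endings each.
logits = [
    [0.1, 2.3, -0.5, 0.0],
    [1.7, 0.2, 0.3, -1.0],
    [0.0, 0.1, 0.2, 3.0],
]
labels = [1, 2, 3]

predictions, acc = argmax_accuracy(logits, labels)
print(predictions, acc)
```

Here the first and third examples are predicted correctly and the second is not, so the accuracy is 2/3.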

**TRAINING LOOP**

I am using the TrainingArguments and Trainer classes provided by Hugging Face. They are optimized for the models in Hugging Face Transformers.

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="fine-tuned-bert-base-uncased-swag",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
    hub_token="",  # Enter your hub token here
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_swag["train"],
    eval_dataset=tokenized_swag["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()
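
To get a feel for how long training runs, the number of optimizer steps can be estimated from the batch size and epoch count. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; the dataset size of 1000 is a placeholder, so substitute len(tokenized_swag["train"]).

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Placeholder dataset size; batch size and epochs match the arguments above.
steps = total_training_steps(num_examples=1000, batch_size=16, num_epochs=3)
print(steps)
```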

**Saving the model locally instead of to the Hugging Face Hub**

Set push_to_hub=False in the TrainingArguments before training and run the cell below.


In [None]:
trainer.save_model("bert-swag-trained")