# Fine-tuning DeBERTA
In this notebook, we will provide the code for fine-tuning DeBERTA.

## Set-up environment

First, we install the libraries which we'll use: HuggingFace Transformers and Datasets.

In [20]:
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import pandas as pd
from datasets import Dataset, DatasetDict, load_dataset, load_metric
import numpy as np
import re
import torch

In [2]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

In [3]:
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    mps_device = torch.device("mps")

## Load dataset

We will read the three csv files (train, test, validation) and convert them to a HuggingFace Dataset format.

In [4]:
# choose between 'superglue', 'xsum', 'commensense'
data = 'superglue'

In [None]:
#### CHOOSE SIZE OF DATA ####
# if you just want to try out the code, select a small number
testing_only = True
num_data_points = 100

In [23]:
if data == 'superglue':
    # choose SuperGLUE BoolQ (Yes/No Questions)
    superglue_data_path = '../../data/SloveneSuperGLUE/SuperGLUE-GoogleMT/csv/BoolQ'    
    # Load your NLP dataset
    if testing_only:
        train_df = pd.read_csv(f"{superglue_data_path}/train.csv")[:num_data_points]
        eval_df = pd.read_csv(f"{superglue_data_path}/val.csv")[:num_data_points]
    else:
        train_df = pd.read_csv(f"{superglue_data_path}/train.csv")
        eval_df = pd.read_csv(f"{superglue_data_path}/val.csv")


    # Convert data into Hugging Face Dataset format
    dataset_train = Dataset.from_pandas(train_df)
    dataset_eval = Dataset.from_pandas(eval_df)

    # Create a DatasetDict containing the three splits
    dataset = DatasetDict({
        'train': dataset_train,
        'validation': dataset_eval
    })
elif data == 'commensense':
    dataset = load_dataset("commonsense_qa")
elif data == 'xsum':
    dataset = load_dataset("GEM/xsum")

As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

In [24]:
dataset

DatasetDict({
    train: Dataset({
        features: ['idx', 'label', 'passage', 'question'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['idx', 'label', 'passage', 'question'],
        num_rows: 100
    })
})

Let's check the first example of the training split:

In [25]:
example = dataset['train'][0]
example

{'idx': 0,
 'label': True,
 'passage': 'Perzijski jezik - perzijščina (/ ˈpɜːrʒən, -ʃən /), znana tudi pod endonimom farsi (فارسی fārsi (fɒːɾˈsiː) (poslušaj)), je eden od zahodnoiranskih jezikov v indoiranski veji indoevropskega jezika družina. Govorijo se predvsem v Iranu, Afganistanu (od 1958 uradno znan kot Dari) in Tadžikistanu (od sovjetske dobe uradno znani kot Tadžiki) ter nekaterih drugih regijah, ki so bile v preteklosti perzijske družbe in so veljale za del Velikega Irana. Zapisano je v perzijski abecedi, spremenjeni različici arabske pisave, ki se je sama razvila iz aramejske abecede.',
 'question': 'ali Iran in Afganistan govorita isti jezik'}

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

In [26]:
def clean_text(text):
    # Convert to lowercase
    cleaned_text = text.lower()
    
    # Remove special characters except whitespace
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)
    
    # Remove extra whitespaces
    cleaned_text = ' '.join(cleaned_text.split())
    
    return cleaned_text

We follow the example of [this](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb#scrollTo=oAeoKVaWaIEl) notebook.

In [27]:
# Load the pre-trained tokenizer for deberta
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [28]:
pad_on_right = tokenizer.padding_side == "right"
# The maximum length of a feature (question and context)
max_length = 384 
# The authorized overlap between two part of the context when splitting it is needed.
doc_stride = 128 

In [29]:
def preprocess_function(examples):
    # Clean questions and passages (or context)
    cleaned_questions = [clean_text(q).lstrip() for q in examples["question"]]
    cleaned_passages = [clean_text(p) for p in examples["passage"]]

    # Tokenize the cleaned inputs
    tokenized_examples = tokenizer(
        cleaned_questions,
        cleaned_passages,
        truncation="only_second",  # Assuming passage comes after question
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["labels"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will use 1 for True and 0 for False
        label = 1 if examples["label"][sample_mapping[i]] == "True" else 0
        tokenized_examples["labels"].append(label)

    return tokenized_examples

In [30]:
# Pre-process the train, validation, and test datasets
train_dataset = dataset['train'].map(
    preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
validation_dataset = dataset['validation'].map(
    preprocess_function, batched=True, remove_columns=dataset["train"].column_names)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map: 100%|██████████| 100/100 [00:00<00:00, 1563.18 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 4287.08 examples/s]


In [31]:
example = train_dataset[0]
print(example.keys())

dict_keys(['input_ids', 'attention_mask', 'labels'])


In [32]:
tokenizer.decode(example['input_ids'])

'[CLS] ali iran in afganistan govorita isti jezik [SEP] perzijski jezik perzijscina [UNK] ʃən znana tudi pod endonimom farsi فارسی farsi fɒːɾˈsiː poslusaj je eden od zahodnoiranskih jezikov v indoiranski veji indoevropskega jezika druzina govorijo se predvsem v iranu afganistanu od 1958 uradno znan kot dari in tadzikistanu od sovjetske dobe uradno znani kot tadziki ter nekaterih drugih regijah ki so bile v preteklosti perzijske druzbe in so veljale za del velikega irana zapisano je v perzijski abecedi spremenjeni razlicici arabske pisave ki se je sama razvila iz aramejske abecede [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [33]:
train_dataset.set_format("torch")

## Define model

Here we define a model that includes a pre-trained base (i.e. the weights from bert-base-uncased) are loaded, with a random initialized classification head (linear layer) on top. One should fine-tune this head, together with the pre-trained base on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [34]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things:

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, we for example specify that we want to evaluate after every epoch of training, we would like to save the model every epoch, we set the learning rate, the batch size to use for training/evaluation, how many epochs to train for, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [35]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-boolq",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

In [37]:
trainer = Trainer(
    model = model,
    args = args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Let's start training!

In [38]:
trainer.train()

 33%|███▎      | 7/21 [00:22<01:06,  4.76s/it]
[A
[A
[A
[A
[A
[A
                                              
[A                                     

 33%|███▎      | 7/21 [00:24<01:06,  4.76s/it]
[A
[A

{'eval_loss': 0.3122859597206116, 'eval_runtime': 2.2531, 'eval_samples_per_second': 48.377, 'eval_steps_per_second': 3.107, 'epoch': 1.0}


 67%|██████▋   | 14/21 [00:46<00:19,  2.78s/it]
[A
[A
[A
[A
[A
[A
                                               
[A                                     

 67%|██████▋   | 14/21 [00:48<00:19,  2.78s/it]
[A
[A

{'eval_loss': 0.1253909468650818, 'eval_runtime': 1.9068, 'eval_samples_per_second': 57.164, 'eval_steps_per_second': 3.671, 'epoch': 2.0}


100%|██████████| 21/21 [00:58<00:00,  1.35s/it]
[A
[A
[A
[A
[A
[A
                                               
[A                                     

100%|██████████| 21/21 [01:00<00:00,  1.35s/it]
[A
                                               
100%|██████████| 21/21 [01:00<00:00,  2.90s/it]

{'eval_loss': 0.09127047657966614, 'eval_runtime': 2.2652, 'eval_samples_per_second': 48.12, 'eval_steps_per_second': 3.09, 'epoch': 3.0}
{'train_runtime': 60.9586, 'train_samples_per_second': 5.069, 'train_steps_per_second': 0.344, 'train_loss': 0.29338477906726657, 'epoch': 3.0}





TrainOutput(global_step=21, training_loss=0.29338477906726657, metrics={'train_runtime': 60.9586, 'train_samples_per_second': 5.069, 'train_steps_per_second': 0.344, 'train_loss': 0.29338477906726657, 'epoch': 3.0})

In [None]:
save_model = False

if save_model:
    trainer.save_model(f"{model_name}-finetuned-boolq")

## Evaluate

After training, we evaluate our model on the validation set.

In [39]:
trainer.evaluate()

100%|██████████| 7/7 [00:07<00:00,  1.03s/it]


{'eval_loss': 0.09127047657966614,
 'eval_runtime': 7.262,
 'eval_samples_per_second': 15.01,
 'eval_steps_per_second': 0.964,
 'epoch': 3.0}

## Inference

Let's test the model on a new sentence: