# 1. Fine-tune a pretrained model

Welcome to the next coding lesson. Previously we learnt how to load and apply pretrained models from the transformers library. We saw that even though these models are huge and powerful, they still sometimes fail to generalise to unseen data or to data from a different domain.

One way to adapt a model to a specific domain is to fine-tune it. In this lesson we will take a pretrained model and a real dataset, and learn how to fine-tune the model on that dataset to improve its quality.

In [None]:
!pip install transformers==4.24.0
!pip install Pillow==10.0.0
!pip install -U sentence-transformers==2.2.2
!pip install datasets==2.14.4
!pip install sentencepiece==0.1.99

In [None]:
from datasets import load_dataset
import transformers
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
import numpy as np
import typing as tp

## 2. Data

We will use a dataset from the Hugging Face Hub:
https://huggingface.co/datasets/imdb

This is a collection of reviews from the IMDb website, each labelled with positive or negative sentiment.

In [None]:
imdb = load_dataset("imdb", split="test")


Convert the data to a pandas DataFrame.

In [None]:
imdb_df = imdb.to_pandas()

In [None]:
imdb_df.head()

|   | text | label |
|---|------|-------|
| 0 | I love sci-fi and am willing to put up with a ... | 0 |
| 1 | Worth the entertainment value of a rental, esp... | 0 |
| 2 | its a totally average film with a few semi-alr... | 0 |
| 3 | STAR RATING: ***** Saturday Night **** Friday ... | 0 |
| 4 | First off let me say, If you haven't enjoyed a... | 0 |


Take a sample of 1000 rows from the test split to evaluate our models.

In [None]:
imdb_sample = imdb_df.sample(n=1000, random_state=2023)

Check the class distribution to make sure the sample is roughly balanced.

In [None]:
imdb_sample["label"].value_counts()

1    529
0    471
Name: label, dtype: int64
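
Raw counts are easier to judge as shares. The snippet below is a minimal sketch in plain Python (no pandas); the `class_shares` helper and the reconstructed `labels` list are assumptions made for illustration and simply reproduce the counts printed above.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical label list that reproduces the counts shown above.
labels = [1] * 529 + [0] * 471
shares = class_shares(labels)
print(shares)
```

Both shares are close to 0.5, so the sample can be treated as balanced and a single F1 number on the positive class is a reasonable summary.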

## 3. Using a pretrained model as is.

Before we start fine-tuning, let's see how an available model performs as is, without any fine-tuning by us. Let's take a pretrained RoBERTa sentiment model. This is a very big model (its checkpoint is about 1.42 GB) that was trained to predict sentiment, so it looks like a good candidate for our problem.

In [None]:
model_name = "siebert/sentiment-roberta-large-english"

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
config = transformers.AutoConfig.from_pretrained(model_name)


### 3.1 Dataset and Dataloader creation

Whether we fine-tune our model or not, we need to create a few helper objects to handle our data and model.

To run a model on all test examples in PyTorch, we need a Dataset and a DataLoader that can iterate over the data efficiently.

Read more about the PyTorch Dataset concept: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

In [None]:
class ExampleDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenize one text: cut it to max_length tokens or pad it up to max_length
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

        # The tokenizer returns tensors of shape (1, max_length); drop the batch dimension
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(label, dtype=torch.long),
        }

Let's create an instance of the Dataset object. We pass our texts and labels, as well as the tokenizer of the pretrained model. We also pass the max_length parameter to truncate longer texts.

In [None]:
example_dataset = ExampleDataset(
    texts=imdb_sample["text"].tolist(),
    labels=imdb_sample["label"].tolist(),
    tokenizer=tokenizer,
    max_length=512,
)
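
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a rough pure-Python sketch of what happens to a list of token ids. The `pad_or_truncate` helper, the made-up token ids and `pad_id=0` are assumptions for illustration only; a real tokenizer uses its own padding id and also keeps its special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token id list to max_length, or right-pad it with pad_id.

    Returns the fixed-length ids and an attention mask
    (1 = real token, 0 = padding).
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short text gets padded, a long one gets cut (token ids are made up).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the DataLoader can stack them into one tensor per batch, and the attention mask tells the model which positions to ignore.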

The DataLoader takes a Dataset as input and iterates over it efficiently in batches.

In [None]:
dataloader = DataLoader(example_dataset, batch_size=16, shuffle=False)
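
As a rough mental model of batching, the sketch below splits 1000 items into consecutive chunks of 16 in plain Python. `iter_batches` is a hypothetical helper, not part of PyTorch; the real DataLoader additionally stacks the per-example tensors into batch tensors and can shuffle or load data in parallel.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


examples = list(range(1000))  # stand-in for the 1000 sampled reviews
batches = list(iter_batches(examples, batch_size=16))
print(len(batches), len(batches[0]), len(batches[-1]))
```

The last batch is smaller than the others because 1000 is not divisible by 16; the evaluation code has to cope with that, which it does by extending plain lists batch by batch.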

### 3.2 Evaluation function.

This function takes our model and a dataloader as input and applies the model to each batch. At the end it returns the ground-truth labels and the predicted labels, which we can use to compute metrics.

In [None]:
def evaluate(model, dataloader, device="cuda"):
    model.eval()  # switch off dropout and other train-only behaviour
    model.to(device)

    valid_preds, valid_labels = [], []  # accumulate predictions and labels

    for batch in dataloader:  # iterate over all batches
        b_input_ids = batch["input_ids"].to(device)  # move inputs to the device
        b_input_mask = batch["attention_mask"].to(device)
        b_labels = batch["label"]  # labels stay on the CPU, they are only collected

        with torch.no_grad():  # no gradients are needed for inference
            outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)

        logits = outputs.logits.detach().cpu().numpy()  # shape: (batch, num_labels)
        batch_preds = np.argmax(logits, axis=1)  # class with the highest logit

        valid_preds.extend(batch_preds.tolist())  # add to the overall list of predictions
        valid_labels.extend(b_labels.numpy().tolist())  # same for the labels

    return valid_labels, valid_preds  # ground-truth labels, predicted labels
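
The function above picks the class with the highest logit and never converts logits to probabilities. The sketch below, in plain Python with made-up logits, illustrates why that is enough: softmax preserves the ordering of the scores, so the argmax is the same before and after. The `softmax` and `argmax` helpers are written here only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


batch_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]  # made-up scores for 3 texts
preds_from_logits = [argmax(row) for row in batch_logits]
preds_from_probs = [argmax(softmax(row)) for row in batch_logits]
print(preds_from_logits, preds_from_probs)
```

Probabilities are still worth computing when you need a confidence score or want to pick a decision threshold other than 0.5.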

In [None]:
valid_labels, valid_preds = evaluate(model, dataloader)

In [None]:
# Illustration of what the returned lists look like:

# valid_labels = [0,1,1,0]
# valid_preds = [0,0,1,0]

Now we can compute metrics for our data. We will use the F1 score, the harmonic mean of precision and recall, which is a common metric for classification problems. Read more about F1: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [None]:
from sklearn.metrics import f1_score

f1_score(valid_labels, valid_preds)

0.9584905660377359
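
To see what goes into that number, here is a hand-rolled sketch of binary F1 in plain Python, applied to the toy lists `[0,1,1,0]` and `[0,0,1,0]` from the illustration earlier. `binary_f1` is a hypothetical helper written for this lesson; it treats label 1 as the positive class, which matches the default behaviour of `f1_score`.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


precision, recall, f1 = binary_f1([0, 1, 1, 0], [0, 0, 1, 0])
print(precision, recall, round(f1, 4))
```

In the toy example every predicted positive is correct (precision 1.0) but one of the two true positives is missed (recall 0.5), so the harmonic mean lands at about 0.67, closer to the weaker of the two.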

## 4. Model fine tuning.

We see that the pretrained model achieves a good F1 score of about 0.96. This is indeed good, but the model is very big. Can we achieve similar performance with a much smaller model?

Let's take the distilled BERT foundation model (DistilBERT) and fine-tune it on our sentiment analysis task. Distillation is a way to make a model more lightweight while keeping most of its quality. (Search for "knowledge distillation" to learn more about the methods.)

In [None]:
model_name = "distilbert-base-uncased"

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
config = transformers.AutoConfig.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

### 4.1 Data for fine tuning.

Let's take 5000 rows for training (fine tuning).

In [None]:
imdb_train = load_dataset("imdb", split="train")

In [None]:
imdb_train_sample = imdb_train.to_pandas().sample(n=5000, random_state=2023)

In [None]:
imdb_train_sample.label.value_counts(dropna=False)

1    2552
0    2448
Name: label, dtype: int64

Create an instance of Dataset passing train data.

In [None]:
train_dataset = ExampleDataset(
    texts=imdb_train_sample["text"].tolist(),
    labels=imdb_train_sample["label"].tolist(),
    tokenizer=tokenizer,
    max_length=512,
)

Create train dataloader.

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=False)

### 4.2 Function to fine-tune our model (in our context train == fine-tune)

The function takes the model, train_dataloader and a few training parameters as input. It then trains the model to predict the correct sentiment using the ground-truth labels.

In [None]:
def train(model, train_loader, num_epochs=2, learning_rate=2e-5, device="cuda"):
    """
    A simple training loop: fine-tunes `model` in place on `train_loader`.
    """
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=learning_rate)  # define optimizer

    for epoch in range(num_epochs):
        model.train()  # switch to training mode (enables dropout)
        total_loss = 0.0

        for batch in train_loader:  # iterate over the training batches
            # move inputs and ground-truth labels to the device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            optimizer.zero_grad()  # reset gradients left over from the previous step
            outputs = model(
                input_ids, attention_mask=attention_mask, labels=labels
            )  # forward pass; passing labels makes the model compute the loss
            loss = outputs.loss  # cross-entropy loss for this batch
            loss.backward()  # backpropagation: compute gradients
            optimizer.step()  # update the weights using the gradients

            total_loss += loss.item()
        average_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{num_epochs} - Average Loss: {average_loss:.4f}")

    print("Training complete!")
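
The same loop structure can be seen on a toy problem without any deep learning library. The sketch below fits a two-parameter logistic regression in plain Python; the data, the learning rate and the use of plain SGD (instead of the AdamW optimizer used above) are all assumptions made to keep the example tiny. Gradients are written out by hand where PyTorch would call `loss.backward()`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy data: one feature per example, label 1 when the feature is positive.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0  # the "model" parameters
learning_rate = 0.5
epoch_losses = []

for epoch in range(3):
    total_loss = 0.0
    for x, y in data:  # batch size 1 for simplicity
        p = sigmoid(w * x + b)  # forward pass
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy
        grad_w = (p - y) * x  # hand-derived gradients (the backward step)
        grad_b = p - y
        w -= learning_rate * grad_w  # plain SGD update (the optimizer step)
        b -= learning_rate * grad_b
        total_loss += loss
    epoch_losses.append(total_loss / len(data))
    print(f"Epoch {epoch + 1}/3 - Average Loss: {epoch_losses[-1]:.4f}")
```

The average loss should shrink from epoch to epoch, which is the same signal we watch for when fine-tuning the real model.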

Follow-ups:

* look for other available models (e.g. instead of plain BERT, try the more accurate RoBERTa or DeBERTa models).
* take more data for fine-tuning.
* make sure your data is clean and doesn't contain outliers (garbage in -> garbage out).
* tune the learning rate.
* which other parameters of this function can be tuned? Which ones make sense to tune, and why?
* additional techniques: weight decay, learning rate scheduler.
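
As an illustration of the last follow-up, the sketch below computes a linear warmup followed by a linear decay of the learning rate in plain Python. The `linear_warmup_decay` helper and the choice of 125 warmup steps are assumptions; the step counts follow from the 5000 training rows, batch size 8 and 2 epochs used in this lesson. In practice one would more likely rely on a ready-made helper such as `transformers.get_linear_schedule_with_warmup`, and weight decay can be set through the `weight_decay` argument of `AdamW`.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    # then decay linearly from 1 back to 0 at the last step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
steps_per_epoch = 5000 // 8  # 625 batches of 8 from the 5000 training rows
total_steps = steps_per_epoch * 2  # two epochs, as in train()
warmup_steps = 125  # assumption: about 10% of the run

lr_at = {
    step: base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    for step in (0, 125, 1250)
}
print(lr_at)
```

The learning rate starts at zero, peaks at the base value once warmup ends, and returns to zero at the final step. Small early steps tend to make fine-tuning more stable because the freshly initialised classification head produces large, noisy gradients at first.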


In [None]:
train(model, train_dataloader)

Epoch 1/2 - Average Loss: 0.3269
Epoch 2/2 - Average Loss: 0.1612
Training complete!


### 4.3 Create a validation dataset to check the performance of our fine-tuned model.

We reuse the same 1000 test reviews as before, this time tokenized with the DistilBERT tokenizer.

In [None]:
val_dataset = ExampleDataset(
    texts=imdb_sample["text"].tolist(),
    labels=imdb_sample["label"].tolist(),
    tokenizer=tokenizer,
    max_length=512,
)

val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False)

In [None]:
valid_labels, valid_preds = evaluate(model, val_dataloader)


f1_score(valid_labels, valid_preds)

0.911190053285968

Nice! We took a raw model with no knowledge of our sentiment task, fine-tuned it on that task, and it now reaches an F1 score of about 0.91, not far from the 0.96 of the much larger model. Using more data and training the model for longer may give even higher quality.

In [None]:
# in case you need to free GPU memory

model.cpu()
torch.cuda.empty_cache()

## 5. Summary

* Fine-tuning helps to adapt a model to your target data domain.
* Even when heavy pretrained models capable of solving your task exist, consider fine-tuning a smaller model for efficiency.
* Good fine-tuning usually requires at least several thousand labelled examples; in this lesson we used 5000.