In this lab you are asked to fine-tune a Vision Transformer (ViT) classifier on a target dataset.



Objectives:



1) Get familiar with **Hugging Face**, the main ecosystem of libraries for working with transformers;



2) Use **low-rank adapters** for cheap training of a transformer.

### 1) Load transformers packages & dataset



***Transformers*** is a package maintained by the Hugging Face community. It lets you load (and push) pretrained transformer models, while the companion *datasets* package does the same for datasets. The *transformers* package also integrates with PyTorch, so you can train a model yourself.



We will load a Vision Transformer (ViT) that was trained on ImageNet and fine-tune it on images of different foods.

In [1]:
!pip install transformers accelerate evaluate datasets git+https://github.com/huggingface/peft -q

In [2]:

import torch

from datasets import load_from_disk


from transformers import AutoImageProcessor


Let's implement preprocessing functions that fit the images to the input shape and pixel distribution ViT expects, and add some augmentation.
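To see what fitting the distribution means, here is a minimal NumPy sketch of per-channel normalization, assuming a mean and standard deviation of 0.5 for every channel (the values this lab passes to `Normalize`). The tiny `image` array is made up for illustration.

```python
import numpy as np

# Hypothetical 3-channel "image" of shape (channels, height, width),
# already scaled to [0, 1] the way ToTensor scales 8-bit images
image = np.array([
    [[0.0, 0.5], [1.0, 0.25]],
    [[0.0, 0.5], [1.0, 0.25]],
    [[0.0, 0.5], [1.0, 0.25]],
])

mean = np.array([0.5, 0.5, 0.5]).reshape(3, 1, 1)
std = np.array([0.5, 0.5, 0.5]).reshape(3, 1, 1)

# Same formula Normalize applies: subtract the mean, divide by the std
normalized = (image - mean) / std
print(normalized[0])
```

With these values, pixel intensities in [0, 1] are mapped to [-1, 1].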

In [3]:
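import shutil

# Source directory (read-only Kaggle input)
source_dir = "/kaggle/input/pmldl-week-8-fine-tuning-of-vi-t"

# Destination directory (writable working area)
destination_dir = "/kaggle/working/tr"

try:
    shutil.copytree(source_dir, destination_dir)
except FileExistsError:
    # copytree refuses to overwrite an existing destination
    print("Destination directory already exists")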

In [4]:
# Add augmentation procedures if you like

from torchvision.transforms import Normalize, ToTensor



from torchvision.transforms import v2

# Target dataset

dataset = load_from_disk("/kaggle/working/tr/food-101_train/food-101_train")



# Image processor that holds the preprocessing config of the model

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224", use_fast=True)



# Extract parameters from image_processor

# Write your code here

# normalize = Normalize(mean=..., std=...)

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)

# Write your code here

# Note that the size of images should fit size of image_processor

# train_transforms = Compose(

#     [

#         ...

#         ToTensor(),

#         normalize,

#     ]

# )

transforms = v2.Compose([

    v2.RandomResizedCrop(size=(224, 224), antialias=True),

    v2.RandomHorizontalFlip(p=0.5),

    # v2.ToDtype(torch.float32, scale=True),

    
    ToTensor(),

    normalize,

])
validate_transform = v2.Compose([

    v2.Resize(size=(256, 256), antialias=True),  # Resize to 256x256 (both sides, aspect ratio is not preserved)

    v2.CenterCrop(size=(224, 224)),  # Crop the center to match model input size

    ToTensor(),  # Convert to tensor

    normalize  # Apply the same normalization used during training
])

# Write your code here

# Note that the size of images should fit size of image_processor

# val_transforms = Compose(

#     [

#         ...

#         ToTensor(),

#         normalize,

#     ]

# )



def preprocess_train(example_batch):

    """Apply train_transforms across a batch."""

    example_batch["pixel_values"] = [transforms(image.convert("RGB")) for image in example_batch["image"]]

    return example_batch





def preprocess_val(example_batch):

    """Apply val_transforms across a batch."""

    example_batch["pixel_values"] = [validate_transform(image.convert("RGB")) for image in example_batch["image"]]

    return example_batch


Next, we need to map labels from strings to integer ids and vice versa.

In [5]:
label2id, id2label = dict(), dict()

labels = dataset.features["label"].names



# Go through the labels and save corresponding indexes

for i, label in enumerate(labels):

    label2id[label] = i  # map label to id

    id2label[i] = label  # map id to label


Split the data into training and validation sets.

In [6]:
# split up training into training + validation

splits = dataset.train_test_split(test_size=0.1)

train_ds = splits["train"]

val_ds = splits["test"]



train_ds.set_transform(preprocess_train)

val_ds.set_transform(preprocess_val)
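The split above is not seeded, so the validation set can differ from run to run. Below is a small NumPy sketch of a seeded 90/10 split over example indices. The helper `split_indices` and the dataset size of 1000 are hypothetical and only illustrate the idea.

```python
import numpy as np

def split_indices(num_examples, test_size=0.1, seed=42):
    """Shuffle indices with a fixed seed and split them into train and validation parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_examples)
    num_val = int(round(num_examples * test_size))
    return indices[num_val:], indices[:num_val]

train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))
```

In `datasets`, passing `seed=...` to `train_test_split` makes the split repeatable in the same way.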

### 2) Model loading



First of all, we should load the model itself

In [7]:
def print_trainable_parameters(model):

    """

    Prints the number of trainable parameters in the model.

    """

    trainable_params = 0

    all_param = 0

    for _, param in model.named_parameters():

        all_param += param.numel()

        if param.requires_grad:

            trainable_params += param.numel()

    print(

        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"

    )

In [8]:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer



# Write your code here

model = AutoModelForImageClassification.from_pretrained(

    "google/vit-base-patch16-224",

    label2id=label2id,

    id2label=id2label,

    ignore_mismatched_sizes=True

)

print_trainable_parameters(model)


Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([101]) in the model instantiated
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([101, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 85876325 || all params: 85876325 || trainable%: 100.00


### 3) Low-rank adaptation



[LoRA](https://arxiv.org/pdf/2106.09685) is a well-known method for cheap fine-tuning of transformers. Instead of updating a full weight matrix, it keeps the pretrained weights frozen and learns an additive update that is **decomposed** into the product of two much smaller matrices.



There are several parameters for LoRA. For now, let's focus on one: **r**, the rank (intrinsic dimension) of the decomposed matrices. **r** usually varies from 4 to 64.
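A minimal NumPy sketch of the idea, not the `peft` implementation: the frozen weight `W` is left untouched and the layer output gets an extra low-rank term scaled by `alpha / r`, as in the LoRA paper. The 768 x 768 shape (matching a ViT-base attention projection) and the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 32, 16  # hidden size, LoRA rank, scaling numerator

W = rng.normal(size=(d, d))               # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init: the update starts at zero

def lora_forward(x):
    """Frozen projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d))
out = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(np.allclose(out, x @ W.T))   # True: with B = 0 the adapter changes nothing yet
print(full_params, lora_params)    # 589824 49152
```

Only `A` and `B` would receive gradients, which is why a rank of 32 trains about 8% of the parameters of this single matrix.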

In [9]:
from peft import LoraConfig, get_peft_model

# Load config

# Write your code here

config = LoraConfig(

    r=32,

    lora_alpha=16,

    target_modules=["query", "value"],

    lora_dropout=0.1,

    bias="none",

    modules_to_save=["classifier"],

)

lora_model = get_peft_model(model, config)

print_trainable_parameters(lora_model)


trainable params: 1257317 || all params: 87133642 || trainable%: 1.44


That's how you prepare an adapter. Note the percentage of trainable parameters: only about 1.44% of the model will be trained.
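The reported numbers can be reproduced by hand. This sketch assumes that ViT-base has 12 encoder layers with hidden size 768, and that `modules_to_save` adds a trainable copy of the classifier head next to the frozen original. Both assumptions are consistent with the totals printed above.

```python
hidden = 768        # ViT-base hidden size
layers = 12         # ViT-base encoder layers
r = 32              # LoRA rank from the config above
num_classes = 101   # size of the new classifier head

per_module = r * hidden + hidden * r              # A (r x 768) plus B (768 x r)
lora_params = per_module * 2 * layers             # query and value in every layer
head_params = hidden * num_classes + num_classes  # classifier weight plus bias

trainable = lora_params + head_params
total = 85_876_325 + trainable                    # base model size printed earlier
print(trainable, total, round(100 * trainable / total, 2))  # 1257317 87133642 1.44
```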

### 4) Training of transformer



With `transformers` you don't need to write a training loop as you would in plain PyTorch. You set the whole training configuration in `TrainingArguments` and run a `Trainer`.




In [10]:
from transformers import TrainingArguments, Trainer



# Write your code here

batch_size = 64

epochs = 8

# Train LoRA and save it to "fine-tunned-model"

args = TrainingArguments(

    "fine-tunned-model",

    remove_unused_columns=False,

    eval_strategy="epoch",

    save_strategy="epoch",

    learning_rate=5e-3,

    per_device_train_batch_size=batch_size,

    gradient_accumulation_steps=4,

    per_device_eval_batch_size=batch_size,

    fp16=True,

    num_train_epochs=epochs,

    logging_steps=10,

    load_best_model_at_end=True,

    metric_for_best_model="accuracy",

    push_to_hub=False,

    label_names=["labels"],
    report_to=None
)
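With gradient accumulation, the optimizer sees a larger batch than the one that fits on the GPU at once. A small sketch of the arithmetic for the settings above, assuming a single GPU:

```python
per_device_train_batch_size = 64
gradient_accumulation_steps = 4
num_devices = 1  # assumption: one GPU

# Number of images that contribute to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 256
```

So each optimizer step uses gradients from 256 images, even though only 64 are processed at a time.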

Let's define a function that computes the performance metric and a collate function that turns a list of dataset samples into a batch of images and labels.

In [11]:
import numpy as np

import evaluate

import torch



metric = evaluate.load("accuracy")



# the compute_metrics function takes a Named Tuple as input:

# predictions, which are the logits of the model as Numpy arrays,

# and label_ids, which are the ground-truth labels as Numpy arrays.

# Use metric.compute(...) to calculate an accuracy between arrays
def compute_metrics(eval_pred):
    # Unpack the predictions and true labels
    logits, label_ids = eval_pred
    
    # Convert logits to predicted class indices (the class with the highest score)
    predictions = np.argmax(logits, axis=1)
    
    # Calculate the accuracy using the metric defined earlier
    accuracy = metric.compute(predictions=predictions, references=label_ids)
    
    return accuracy


def collate_fn(examples):

    pixel_values = torch.stack([example["pixel_values"] for example in examples])

    labels = torch.tensor([example["label"] for example in examples])

    return {"pixel_values": pixel_values, "labels": labels}
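To make the metric concrete, here is a NumPy-only sketch of what `compute_metrics` does: take the argmax of the logits and compare it with the labels. The logits and labels below are made up, and plain NumPy stands in for `evaluate`.

```python
import numpy as np

# Hypothetical logits for 4 samples and 3 classes
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 3.0, 0.5],
    [1.2, 0.4, 0.1],
])
label_ids = np.array([0, 2, 1, 2])

# Predicted class = index of the largest logit in each row
predictions = np.argmax(logits, axis=1)

# Accuracy = fraction of predictions that match the labels
accuracy = float((predictions == label_ids).mean())
print(predictions, accuracy)  # [0 2 1 0] 0.75
```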


Before training, free unused memory and check the GPU, then create the `Trainer` and start training:

In [12]:
import gc

gc.collect()

210

In [13]:
import torch

torch.cuda.empty_cache()

In [14]:
!nvidia-smi



Thu Oct 24 22:44:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P0             27W /  250W |       3MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [15]:
!pip install wandb



In [24]:
import wandb

try:
    # Read the W&B API key from Kaggle Secrets (label "wandb_api") instead of hard-coding it
    from kaggle_secrets import UserSecretsClient

    wandb.login(key=UserSecretsClient().get_secret("wandb_api"))
    anony = None
except Exception:
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as wandb_api. \nGet your W&B access token from here: https://wandb.ai/authorize')

[34m[1mwandb[0m: Currently logged in as: [33myazan-nukari[0m ([33myazan-nukari-innopolis-university[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [25]:
trainer = Trainer(

    lora_model,

    args,

    train_dataset=train_ds,

    eval_dataset=val_ds,

    tokenizer=image_processor,

    compute_metrics=compute_metrics,

    data_collator=collate_fn,

)

train_results = trainer.train()


KeyboardInterrupt: 

In [None]:
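from peft import PeftModel

# Save only the adapter weights (plus the classifier head listed in modules_to_save)
trainer.model.save_pretrained("my_adapter")

# get_peft_model() injected LoRA layers into `model` in place,
# so reload a clean base model before attaching the saved adapter
base_model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,
).to(model.device)

finetuned_model = PeftModel.from_pretrained(
    base_model,
    "my_adapter",
    torch_dtype=torch.float16,
    is_trainable=False,
    device_map="auto",
)

# Fold the LoRA update into the base weights and drop the adapter wrappers
finetuned_model = finetuned_model.merge_and_unload()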
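Merging folds the low-rank update into the base weight, so inference no longer needs the separate adapter matrices. Here is a NumPy sketch of why this gives the same outputs, using small made-up shapes and the `alpha / r` scaling from the LoRA paper. It illustrates the idea and is not a view into `peft` internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4

W = rng.normal(size=(d, d))   # base weight
A = rng.normal(size=(r, d))   # stand-ins for trained LoRA matrices (random here)
B = rng.normal(size=(d, r))
x = rng.normal(size=(3, d))

# Adapter kept separate: base output plus the scaled low-rank term
separate = x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# Adapter merged: a single weight matrix with the original shape
W_merged = W + (alpha / r) * (B @ A)
merged = x @ W_merged.T

print(np.allclose(separate, merged))
```

The merged matrix has the same shape as `W`, so the merged model has the same architecture and inference cost as the original ViT.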

In [None]:
# Test dataset



test_dataset = load_from_disk("/kaggle/working/tr/food-101_test_images/food-101_test_images")

test_dataset.set_transform(preprocess_val)

In [None]:
test_dataset[0]

In [None]:
def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    # labels = torch.tensor([example["label"] for example in examples])  # This line is not needed for test data
    return {"pixel_values": pixel_values}


In [None]:
import pandas as pd
import torch

# The fine-tuned model was merged and unloaded above; switch it to evaluation mode
finetuned_model.eval()

# DataLoader to efficiently batch and make predictions
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, collate_fn=collate_fn)

# Initialize a list to store the predictions and a running image id
predictions_list = []
image_id = 0
# Disable gradient calculation to save memory and computations
with torch.no_grad():
    for batch in test_loader:
        # Move the pixel values to the same device as the model
        pixel_values = batch["pixel_values"].to(finetuned_model.device)

        # Forward pass through the model to get logits
        outputs = finetuned_model(pixel_values)
        logits = outputs.logits

        # Get the predicted labels (class with the highest score)
        predicted_labels = torch.argmax(logits, dim=-1).cpu().numpy()

        # The test set is read in order (no shuffling), so a running counter serves as the image ID
        for pred in predicted_labels:
            predictions_list.append({"ID": image_id, "TARGET": id2label[int(pred)]})
            image_id += 1

# Create a DataFrame to store predictions
predictions_df = pd.DataFrame(predictions_list)



In [None]:

# Save predictions to a CSV file
predictions_df.to_csv("submission.csv", index=False)

print("Predictions saved to submission.csv")

In [None]:
predictions_df