# Fine-Tuning ViT Classification Models Using AdapterPlus

In this vision tutorial, we will show how to fine-tune ViT image classification models using `AdapterPlusConfig`, a bottleneck adapter configuration that uses the parameters defined in the `AdapterPlus` paper. For more information on bottleneck adapters, you can visit our basic [tutorial](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/01_Adapter_Training.ipynb) and our docs [page](https://docs.adapterhub.ml/methods#bottleneck-adapters).

You can find the link to the `AdapterPlus` paper by Steitz and Roth [here](https://openaccess.thecvf.com/content/CVPR2024/papers/Steitz_Adapters_Strike_Back_CVPR_2024_paper.pdf) and their GitHub page [here](https://github.com/visinf/adapter_plus).

### Installation

Before we can get started, we need to ensure the proper packages are installed. Here's a breakdown of what we need:

- `adapters` to load the model and the adapter configuration
- `accelerate` for training optimization
- `evaluate` for metric computation and model evaluation
- `datasets` to import the datasets for training and evaluation

In [None]:
!pip install -qq -U adapters datasets accelerate torchvision evaluate

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

### Dataset

For this tutorial, we will be using the lightweight image dataset `cifar100`, which contains 60,000 images across 100 classes. We use the `datasets` library to load its predefined training and test splits directly, and we use the test split for evaluation.

In [None]:
from datasets import load_dataset

num_classes = 100
train_dataset = load_dataset("uoft-cs/cifar100", split = "train")
eval_dataset = load_dataset("uoft-cs/cifar100", split = "test")

train_dataset.set_format("torch")
eval_dataset.set_format("torch")

The dataset provides two label columns, `fine_label` and `coarse_label`. We will use `fine_label` to match the number of classes (100) used in the paper.

In [None]:
train_dataset[0].keys()

We will now initialize our `ViT` image processor to convert the images into a more friendly format. It will also apply transformations to each image in order to improve the performance during training.

In [5]:
model_name_or_path = 'google/vit-base-patch16-224-in21k'

In [None]:
from transformers import ViTImageProcessor
processor = ViTImageProcessor.from_pretrained(model_name_or_path)

We'll print out the processor here to get an idea of which transformations it applies to each image.

In [None]:
processor

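For intuition, the rescale-and-normalize step that an image processor of this kind typically applies can be sketched in plain Python. The constants below (division by 255, mean 0.5, standard deviation 0.5) are assumptions chosen for illustration; the printed `processor` is the authoritative source for the actual settings. The function name `normalize_pixel` is hypothetical.

```python
# Assumed settings for illustration only; check the printed processor for the real values.
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value):
    """Map a raw 8-bit channel value to the normalized range fed to the model."""
    rescaled = value / 255.0                    # 0..255 -> 0..1
    return (rescaled - IMAGE_MEAN) / IMAGE_STD  # 0..1  -> -1..1


# Darkest, mid-gray, and brightest channel values
print([normalize_pixel(v) for v in (0, 128, 255)])
```

Under these assumed settings, the darkest and brightest pixel values end up at the two ends of a range that is symmetric around zero.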
### Data Preprocessing

We will preprocess every image with the `processor` defined above and add a `label` key that contains the corresponding `fine_label`.

In [8]:
def preprocess_image(example):
  image = processor(example["img"], return_tensors='pt')
  image["label"] = example["fine_label"]
  return image

In [None]:
train_dataset = train_dataset.map(preprocess_image)
eval_dataset = eval_dataset.map(preprocess_image)
# remove unnecessary columns
train_dataset = train_dataset.remove_columns(['img', 'fine_label', 'coarse_label'])
eval_dataset = eval_dataset.remove_columns(['img', 'fine_label', 'coarse_label'])

### Defining a Data Collator

We'll use a very simple custom data collator to combine multiple data samples into one batch.

In [10]:
from typing import Any
from dataclasses import dataclass

@dataclass
class DataCollator:
  processor : Any
  def __call__(self, inputs):

    pixel_values = [input["pixel_values"].squeeze() for input in inputs]
    labels = [input["label"] for input in inputs]

    pixel_values = torch.stack(pixel_values)
    labels = torch.stack(labels)
    return {
        'pixel_values': pixel_values,
        'labels': labels,
    }

data_collator = DataCollator(processor = processor)

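The shape bookkeeping behind the collator can be sketched without any tensors. This is a simplified model, assuming each preprocessed example carries `pixel_values` of shape `(1, 3, 224, 224)` (one image with a leading batch axis) and a scalar label; the helper names `squeeze_all` and `stacked_shape` are hypothetical.

```python
def squeeze_all(shape):
    """Drop every size-1 axis, mirroring tensor.squeeze() called without arguments."""
    return tuple(dim for dim in shape if dim != 1)


def stacked_shape(shapes):
    """Shape after stacking equally shaped items along a new leading axis."""
    first = shapes[0]
    if any(shape != first for shape in shapes):
        raise ValueError("all items must share the same shape to be stacked")
    return (len(shapes),) + first


batch_size = 64
sample_shape = (1, 3, 224, 224)  # assumed per-example pixel_values shape
label_shape = ()                 # assumed scalar label

pixel_batch = stacked_shape([squeeze_all(sample_shape)] * batch_size)
label_batch = stacked_shape([label_shape] * batch_size)
print(pixel_batch, label_batch)
```

The squeeze removes the per-example batch axis so that stacking yields a single four-dimensional batch rather than a five-dimensional one.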
### Loading the `ViT` model and the `AdapterPlusConfig`

Here we load the `vit-base-patch16-224-in21k` model, similar to the one used in the `AdapterPlus` paper. We will load the model using the `ViTAdapterModel` class from `adapters` and add the corresponding `AdapterPlusConfig`. To read more about the config, you can check out the docs page [here](https://docs.adapterhub.ml/methods#bottleneck-adapters) under `AdapterPlusConfig`.

In [None]:
from adapters import ViTAdapterModel
from adapters import AdapterPlusConfig

model = ViTAdapterModel.from_pretrained(model_name_or_path)
config = AdapterPlusConfig(original_ln_after=True)

model.add_adapter("adapterplus_config", config)
model.add_image_classification_head("adapterplus_config", num_labels=num_classes)
model.train_adapter("adapterplus_config")

In [None]:
print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
adapterplus_config       bottleneck          165,984       0.192       1       1
--------------------------------------------------------------------------------
Full model                                86,389,248     100.000               0

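The parameter count in the summary can be reproduced with a little arithmetic. The decomposition below is an assumption that happens to match the reported number, not something read from the library source: one bottleneck adapter per transformer layer, a reduction factor of 96 (bottleneck size 8 for a hidden size of 768), biases on both projections, and one learned scaling value per channel.

```python
hidden_size = 768        # ViT-Base hidden dimension
num_layers = 12          # ViT-Base transformer layers
reduction_factor = 96    # assumed reduction factor
bottleneck = hidden_size // reduction_factor  # 8

down_proj = hidden_size * bottleneck + bottleneck   # weights + bias
up_proj = bottleneck * hidden_size + hidden_size    # weights + bias
channel_scale = hidden_size                         # assumed per-channel scaling vector

per_layer = down_proj + up_proj + channel_scale
adapter_params = per_layer * num_layers

full_model_params = 86_389_248  # from the summary above
percent = 100 * adapter_params / full_model_params

print(adapter_params, round(percent, 3))
```

Under these assumptions the total agrees with the `#Param` and `%Param` columns of the summary, which shows how small the trainable adapter is relative to the frozen backbone.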

### Evaluation Metrics

We'll use accuracy as our main metric to evaluate the performance of the adapter model on the `cifar100` dataset.

In [13]:
import numpy as np
import evaluate
accuracy = evaluate.load("accuracy")

def compute_metrics(p):
  return accuracy.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)

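What the metric computes can be shown on a toy example in plain Python: take the index of the largest logit per row as the prediction, then count the fraction of predictions that match the labels. The logits and labels below are made up for illustration, and the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [
    [0.1, 2.0, -1.0],  # predicts class 1
    [1.5, 0.2, 0.3],   # predicts class 0
    [0.0, 0.1, 3.0],   # predicts class 2
    [0.9, 0.8, 0.7],   # predicts class 0
]
toy_labels = [1, 0, 2, 1]

print(toy_accuracy(toy_logits, toy_labels))
```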
### Training

Now we are ready to train our model. The same set of hyper-parameters that were used in the original paper will be re-used, except for the number of training epochs that the model will be trained on. You can always adjust the number of epochs yourself or any other hyperparameter in the notebook.

In [15]:
from adapters import AdapterTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./training_results',
    eval_strategy='epoch',
    learning_rate=10e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=10,  # the training log shown in this tutorial covers 10 epochs
    weight_decay=10e-4,
    report_to = "none",
    remove_unused_columns=False,
)

In [None]:
trainer = AdapterTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,
    compute_metrics = compute_metrics
)

trainer.train()

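| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.5235 | 0.351386 | 0.8977 |
| 2 | 0.2642 | 0.329842 | 0.9065 |
| 3 | 0.1847 | 0.324673 | 0.9114 |
| 4 | 0.1354 | 0.347025 | 0.9091 |
| 5 | 0.1147 | 0.349541 | 0.9109 |
| 6 | 0.0743 | 0.370867 | 0.9096 |
| 7 | 0.0581 | 0.373732 | 0.9124 |
| 8 | 0.0393 | 0.37622 | 0.9139 |
| 9 | 0.0282 | 0.381238 | 0.9133 |
| 10 | 0.0216 | 0.380825 | 0.9124 |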


TrainOutput(global_step=7820, training_loss=0.13645367293101748, metrics={'train_runtime': 9867.7777, 'train_samples_per_second': 50.67, 'train_steps_per_second': 0.792, 'total_flos': 3.9121684697088e+19, 'train_loss': 0.13645367293101748, 'epoch': 10.0})

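The numbers in the `TrainOutput` can be sanity-checked with simple arithmetic. This sketch assumes the 50,000-image CIFAR-100 training split, a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped.

```python
import math

train_examples = 50_000   # assumed size of the CIFAR-100 train split
batch_size = 64           # per_device_train_batch_size on a single device
epochs = 10               # number of epochs shown in the training log
train_runtime = 9867.7777 # seconds, from the TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

Under these assumptions the step count and throughput line up with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output.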
### Inference

Now, we'll use our `adapters` trained model to classify some new images!

In [None]:
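from torch.nn import Softmax
# select the first sample from the evaluation dataset
image = eval_dataset.select([0])
# move the input to the model's device, since the model may be on the GPU after training
pixel_values = image['pixel_values'].squeeze(0).to(model.device)
model.eval()  # switch to evaluation mode for inference
with torch.no_grad():
  logits = model(pixel_values)
softmax = Softmax(0)
prediction = torch.argmax(softmax(logits.logits.squeeze()))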

prediction

Our prediction is class index 49, which corresponds to the 'mountain' label.

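The softmax-then-argmax step can be illustrated in plain Python. Because softmax is monotonic, the predicted class is the same whether `argmax` is applied to the raw logits or to the probabilities; softmax is only needed when the probabilities themselves are of interest. The logits below are made up, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


toy_logits = [1.2, -0.3, 4.1, 0.5]
probs = softmax(toy_logits)

# Same predicted class from logits and from probabilities
print(argmax(toy_logits), argmax(probs))
```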
### Saving the adapter model

If you would like to save your adapter or push it to the Hugging Face Hub, you can do so with the code below. Make sure to sign in first.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
model.push_adapter_to_hub(
    "cifar100-adapterplus_config",
    "adapterplus_config",
    datasets_tag="cifar100"
)