# Image classification using LoRA with Vision Transformers

## Introduction

In this notebook, we will learn how to use [LoRA](https://arxiv.org/abs/2106.09685) from 🤗 PEFT to fine-tune an image classification model while training only **0.72%** of the model's parameters.

LoRA adds low-rank "update matrices" to certain blocks in the underlying model (in this case the attention blocks) and trains only those matrices during fine-tuning. For inference, these update matrices can be _merged_ with the original model parameters. For more details, check out the [original LoRA paper](https://arxiv.org/abs/2106.09685).

## What is LoRA

<div class="wy-nav-content-img">
    <img src="assets/LoRA-Image-Classification_lora_diagram.png" width="800px" alt="Schematic diagram of how LoRA works">
    <p>Schematic diagram of how LoRA works</p>
</div>

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that uses low-rank decomposition to reduce the number of trainable parameters. Instead of updating the entire weight matrix, LoRA trains two much smaller matrices whose product represents the weight update, while the original weight matrix stays frozen.

To implement LoRA, we first freeze the pre-trained model weights and then insert trainable rank decomposition matrices into selected layers of the transformer architecture. These matrices are much smaller than the original weights, yet they can capture the essential information needed for the adaptation. The product of the two matrices is a low-rank update that is added to the output of the frozen weights, so the adapted weights are the original weights plus this update. This can reduce the number of trainable parameters by several orders of magnitude and lowers GPU memory use during training. Once the update is merged into the weights, inference costs the same as for the original model.
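
As a minimal NumPy sketch of this idea (not the PEFT implementation), the layer below keeps a frozen weight `W0` and adds the scaled product of two small matrices to its output. The sizes, the names `W0`, `A`, `B`, and the helper `lora_forward` are illustrative assumptions. The zero initialization of `B` follows the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 768, 768, 16   # illustrative sizes (assumed)
lora_alpha = 16                 # scaling numerator; scaling = lora_alpha / r

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init


def lora_forward(x, W0, A, B, scaling):
    """Frozen path plus scaled low-rank update: x W0^T + scaling * x A^T B^T."""
    return x @ W0.T + scaling * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
y = lora_forward(x, W0, A, B, lora_alpha / r)

# Because B starts at zero, the adapted layer initially matches the frozen layer.
print(np.allclose(y, x @ W0.T))
# Frozen parameters vs. trainable LoRA parameters for this single layer.
print(W0.size, A.size + B.size)
```

Only `A` and `B` would receive gradients during fine-tuning, so `W0` can be shared across several adapters.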

For example, suppose we have a pre-trained model with 100 million parameters and 50 layers, and we want to fine-tune it. With LoRA, we can insert two rank decomposition matrices of size 100 x 10 and 10 x 100 into each layer of the model. Their product is a 100 x 100 low-rank update that is added to the frozen weights, at a cost of only 2,000 trainable parameters per layer. Across 50 layers, that is 100,000 trainable parameters instead of 100 million, a 1,000-fold reduction. After training, the update can be merged into the original weights, so inference is no slower than with the original model.
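
The arithmetic for this hypothetical example can be checked directly by counting the entries of the two matrices in each layer:

```python
# Worked arithmetic for the hypothetical 100M-parameter, 50-layer example.
d, r, num_layers = 100, 10, 50
total_params = 100_000_000

params_per_layer = d * r + r * d              # 100x10 matrix + 10x100 matrix
lora_params = params_per_layer * num_layers   # trainable parameters overall
reduction = total_params / lora_params        # how many times fewer parameters

print(params_per_layer)  # 2000
print(lora_params)       # 100000
print(reduction)         # 1000.0
```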

This approach offers several benefits:

1. Reduced Trainable Parameters: LoRA significantly reduces the number of trainable parameters, which lowers memory consumption and can speed up training.
2. Frozen Pre-trained Weights: The original weight matrix remains frozen, allowing it to be used as a foundation for multiple lightweight LoRA models. This facilitates efficient transfer learning and domain adaptation.
3. Compatibility with Other Efficiency Methods: LoRA can in principle be combined with other techniques, such as knowledge distillation and pruning, to further improve model efficiency.
4. Comparable Performance: LoRA often achieves performance comparable to fully fine-tuned models, preserving accuracy despite the much smaller number of trainable parameters.



Let's get started by importing the dependencies.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torchvision.transforms as T
from datasets import load_dataset
from transformers import AutoImageProcessor, ViTForImageClassification
from transformers import Trainer, TrainingArguments
import evaluate

## Loading the Dataset

We'll be using the [Oxford-IIIT Pets Dataset](https://huggingface.co/datasets/pcuenq/oxford-pets), a collection of images of 37 cat and dog breeds.

For a detailed introduction to this dataset, see [transfer-learning-image-classification](./transfer-learning-image-classification.ipynb).

In [2]:
dataset = load_dataset("pcuenq/oxford-pets")
classes = dataset["train"].unique("label")
print(len(classes), classes)

37 ['Siamese', 'Birman', 'shiba inu', 'staffordshire bull terrier', 'basset hound', 'Bombay', 'japanese chin', 'chihuahua', 'german shorthaired', 'pomeranian', 'beagle', 'english cocker spaniel', 'american pit bull terrier', 'Ragdoll', 'Persian', 'Egyptian Mau', 'miniature pinscher', 'Sphynx', 'Maine Coon', 'keeshond', 'yorkshire terrier', 'havanese', 'leonberger', 'wheaten terrier', 'american bulldog', 'english setter', 'boxer', 'newfoundland', 'Bengal', 'samoyed', 'British Shorthair', 'great pyrenees', 'Abyssinian', 'pug', 'saint bernard', 'Russian Blue', 'scottish terrier']


## Preprocessing

In this section, we make sure the dataset is properly formatted and transformed so that it can be fed into the ViT model. This covers splitting the data, setting up the image transformations, and defining how samples are batched for training and evaluation.

Let's start by splitting the dataset into `train` for training and `test` for evaluation.

In [3]:
dataset = dataset["train"].train_test_split(train_size=0.8)
dataset

DatasetDict({
    train: Dataset({
        features: ['path', 'label', 'dog', 'image'],
        num_rows: 5912
    })
    test: Dataset({
        features: ['path', 'label', 'dog', 'image'],
        num_rows: 1478
    })
})

To prepare the inputs for the model, we have to apply the required transformations. This can be achieved by utilizing the `AutoImageProcessor` module, which loads the appropriate transformations corresponding to the relevant model. We can see which transformations are used in the processor config.

In [4]:
model_name = "vit-base-patch16-224"
model_checkpoint = f"google/{model_name}"

processor = AutoImageProcessor.from_pretrained(model_checkpoint)

Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.


We'll also map the label names to indices with `label2id` and `id2label` so it is easy for us to read.

In [5]:
label2id = {c: idx for idx, c in enumerate(classes)}
id2label = {idx: c for idx, c in enumerate(classes)}

We can create a function that preprocesses a batch. When we attach it to the dataset with `with_transform`, it is applied on the fly each time samples are accessed, including during training.

In this `transforms` function, we do the following:

- Some images in your dataset may be grayscale or have transparency (RGBA). To avoid dimension errors, it is safer to convert them to RGB using PIL's `convert` method.
- We pass the images through the processor, which applies the transforms and converts them into PyTorch tensors.
- Using `label2id`, we convert the string labels to their integer representation.

In [6]:
def transforms(batch):
    batch["image"] = [x.convert("RGB") for x in batch["image"]]
    inputs = processor([x for x in batch["image"]], return_tensors="pt")
    inputs["labels"] = [label2id[y] for y in batch["label"]]
    return inputs
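
To illustrate the RGB conversion step in isolation, here is a small sketch using Pillow with tiny in-memory images. The images are synthetic stand-ins rather than samples from the dataset.

```python
from PIL import Image

# Three tiny in-memory images in modes that can appear in real datasets.
images = [
    Image.new("L", (8, 8)),     # grayscale, 1 channel
    Image.new("RGBA", (8, 8)),  # RGB plus alpha, 4 channels
    Image.new("RGB", (8, 8)),   # already 3 channels
]

# Same conversion as in `transforms`: every image ends up with 3 channels.
converted = [img.convert("RGB") for img in images]

print([img.mode for img in images])
print([img.mode for img in converted])
```

After the conversion every image has the same number of channels, which is what the processor and the model expect.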

In [7]:
dataset = dataset.with_transform(transforms)

We also create a `collate_fn` function to define how individual samples from the dataset are combined into batches. The `Trainer` uses it while iterating through the dataset to prepare the batches that are fed into the model.

In [8]:
def collate_fn(batch):
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }
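
As a quick sanity check, the sketch below repeats `collate_fn` and calls it on a few dummy samples. The random tensors and the label values are assumptions standing in for real preprocessed images.

```python
import torch


def collate_fn(batch):
    # Stack per-sample image tensors and gather labels into one tensor.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }


# Dummy samples shaped like single items after `transforms`
# (random values stand in for real, normalized pixels).
samples = [
    {"pixel_values": torch.randn(3, 224, 224), "labels": label}
    for label in (0, 5, 36)
]

batch = collate_fn(samples)
print(tuple(batch["pixel_values"].shape))  # (3, 3, 224, 224)
print(batch["labels"].tolist())            # [0, 5, 36]
```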

## Metric

We can use the Hugging Face [evaluate](https://huggingface.co/docs/evaluate/index) library to calculate metrics. For image classification, we use the accuracy metric.

In [9]:
accuracy = evaluate.load("accuracy")


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=1)
    acc = accuracy.compute(predictions=predictions, references=labels)
    return acc
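
To make the metric concrete, here is the same argmax-then-compare computation done by hand with NumPy. The logits and labels are made-up values for illustration.

```python
import numpy as np

# Hypothetical logits for 4 samples over 3 classes.
logits = np.array([
    [2.0, 0.1, -1.0],   # argmax 0
    [0.2, 0.3, 0.1],    # argmax 1
    [-1.0, 0.0, 3.0],   # argmax 2
    [1.5, 1.0, 0.5],    # argmax 0
])
labels = np.array([0, 1, 1, 0])

# Predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=1)
# Accuracy is the fraction of predictions that match the labels.
acc = float((predictions == labels).mean())

print(predictions.tolist(), acc)  # [0, 1, 2, 0] 0.75
```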



Before loading the model, let’s define a helper function to check the total number of parameters a model has, as well as how many of them are trainable.

In [10]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )
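
The sketch below applies the same counting logic to a toy model with one frozen layer. The helper `count_parameters` and the two-layer model are hypothetical, and the helper returns the counts instead of printing them.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


# Toy two-layer model, only for illustration.
toy = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 2))

# Freeze the first layer, as LoRA does with the pre-trained weights.
for p in toy[0].parameters():
    p.requires_grad = False

trainable, total = count_parameters(toy)
print(trainable, total)  # 8 23
```

The first layer contributes 4 * 3 + 3 = 15 frozen parameters and the second 3 * 2 + 2 = 8 trainable ones.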

## Model

The Vision Transformer (ViT) model represents a significant innovation in computer vision tasks, departing from the conventional convolutional neural network (CNN) architecture. It applies the Transformer architecture, originally designed for natural language processing (NLP), directly to image data without relying on CNNs.

ViT processes images by splitting them into sequences of fixed-size non-overlapping patches, linearly embedding these patches, adding absolute position embeddings, and then feeding this sequence of vectors into a standard Transformer encoder. A special token, often referred to as the [CLS] token, is added to serve as the representation of the entire image, allowing for image classification.
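
For the checkpoint used here (`vit-base-patch16-224`), the patch arithmetic can be sketched with NumPy as follows. This only illustrates the splitting and flattening step. The learned linear embedding, position embeddings, and encoder are omitted, and the all-zero image is a placeholder.

```python
import numpy as np

image_size, patch_size, channels = 224, 16, 3

# A dummy image in channels-first layout, as produced by the image processor.
image = np.zeros((channels, image_size, image_size), dtype=np.float32)

n = image_size // patch_size  # patches per side: 14

# Cut the image into an n x n grid of patches and flatten each patch.
patches = (
    image.reshape(channels, n, patch_size, n, patch_size)
    .transpose(1, 3, 0, 2, 4)                    # (n, n, C, P, P)
    .reshape(n * n, channels * patch_size ** 2)  # (196, 768)
)

seq_len = patches.shape[0] + 1  # +1 for the [CLS] token
print(patches.shape, seq_len)   # (196, 768) 197
```

The flattened patch length (3 * 16 * 16 = 768) happens to equal the hidden size of ViT-Base, but in general the linear embedding maps it to whatever hidden size the model uses.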

To load our pre-trained ViT model, we use the `AutoModelForImageClassification` class, which resolves to `ViTForImageClassification` for this checkpoint. We pass our label mappings `id2label` and `label2id`, from which the number of output classes is inferred. We also set `ignore_mismatched_sizes=True` because the checkpoint's 1000-class classifier head is replaced by a newly initialized 37-class one.

In [11]:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,
)
print_trainable_parameters(model)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([37]) in the model instantiated
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([37, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 85827109 || all params: 85827109 || trainable%: 100.00


Next, we use `get_peft_model` to wrap the base model so that “update” matrices are added to the respective places.

In [12]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)

lora_model = get_peft_model(model, config)
print_trainable_parameters(lora_model)

trainable params: 618277 || all params: 86445386 || trainable%: 0.72
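
These numbers can be reconciled by hand. The sketch below assumes the standard ViT-Base dimensions (hidden size 768, 12 encoder layers) and that the classifier listed in `modules_to_save` is kept as a trainable copy next to the frozen original, which is consistent with the totals printed above.

```python
hidden = 768           # ViT-Base hidden size
num_layers = 12        # ViT-Base encoder depth (standard architecture, assumed)
r = 16                 # LoRA rank from LoraConfig
num_classes = 37
targets_per_layer = 2  # "query" and "value"

lora_per_module = hidden * r + r * hidden        # lora_A + lora_B, no bias
lora_total = lora_per_module * targets_per_layer * num_layers
classifier = hidden * num_classes + num_classes  # weight + bias

trainable = lora_total + classifier
base_total = 85_827_109              # reported before wrapping
all_params = base_total + trainable  # frozen original classifier is counted too

print(trainable, all_params, round(100 * trainable / all_params, 2))
# 618277 86445386 0.72
```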


### Let's examine the LoraConfig Parameters:
- `r`: The rank of the update matrices, represented as an integer. Lower rank values result in smaller update matrices with fewer trainable parameters.

- `target_modules`: The modules (such as attention blocks) where the LoRA update matrices should be applied.

- `lora_alpha`: The scaling factor for LoRA.

- `layers_pattern`: A pattern used to match layer names in `target_modules` if `layers_to_transform` is specified. By default, the PEFT model will use a common layer pattern (layers, h, blocks, etc.). This pattern can also be used for exotic and custom models.

- `rank_pattern`: A mapping from layer names or regular expressions to ranks that differ from the default rank specified by `r`.

- `alpha_pattern`: A mapping from layer names or regular expressions to alphas that differ from the default alpha specified by `lora_alpha`.
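
To show how the rank and the alpha interact, here is a small sketch of the default LoRA scaling, which is the alpha divided by the rank. It assumes the standard (not rank-stabilized) scaling, and the helper `lora_scaling` is a hypothetical name.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Default LoRA scaling applied to the low-rank update: lora_alpha / r."""
    return lora_alpha / r


print(lora_scaling(16, 16))  # 1.0, the configuration used in this notebook
print(lora_scaling(32, 16))  # 2.0
print(lora_scaling(16, 64))  # 0.25
```

Raising the rank while keeping the alpha fixed therefore shrinks the contribution of the update, so the two are usually tuned together.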

### Here's our model architecture

In [13]:
lora_model.vit.encoder.layer[0].attention.attention

ViTSdpaSelfAttention(
  (query): lora.Linear(
    (base_layer): Linear(in_features=768, out_features=768, bias=True)
    (lora_dropout): ModuleDict(
      (default): Dropout(p=0.1, inplace=False)
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=768, out_features=16, bias=False)
    )
    (lora_B): ModuleDict(
      (default): Linear(in_features=16, out_features=768, bias=False)
    )
    (lora_embedding_A): ParameterDict()
    (lora_embedding_B): ParameterDict()
  )
  (key): Linear(in_features=768, out_features=768, bias=True)
  (value): lora.Linear(
    (base_layer): Linear(in_features=768, out_features=768, bias=True)
    (lora_dropout): ModuleDict(
      (default): Dropout(p=0.1, inplace=False)
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=768, out_features=16, bias=False)
    )
    (lora_B): ModuleDict(
      (default): Linear(in_features=16, out_features=768, bias=False)
    )
    (lora_embedding_A): ParameterDict()
    (lora_embedding_B): 


### Let's look at the components of the LoRA model:

**lora.Linear**: LoRA adapts pre-trained models using a low-rank decomposition. Here it wraps the linear layers selected in `target_modules` (query and value), while the key projection remains a plain `Linear` layer.
  - **base_layer**: The original linear transformation.
  - **lora_dropout**: Dropout applied to the input of the LoRA branch.
  - **lora_A**: The matrix A in the low-rank decomposition.
  - **lora_B**: The matrix B in the low-rank decomposition.
  - **lora_embedding_A/B**: The LoRA matrices used when the wrapped layer is an embedding layer. They are empty here.


The forward pass of the LoRA module is implemented as follows:

```python
result = self.base_layer(x, *args, **kwargs)
for active_adapter in self.active_adapters:
    if active_adapter not in self.lora_A.keys():
        continue
    lora_A = self.lora_A[active_adapter]
    lora_B = self.lora_B[active_adapter]
    dropout = self.lora_dropout[active_adapter]
    scaling = self.scaling[active_adapter]
    x = x.to(lora_A.weight.dtype)
    result += lora_B(lora_A(dropout(x))) * scaling
```
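
Written as an equation, the loop body computes the following for each active adapter. The symbols are assumptions introduced here: $W_0$ and $b$ are the weight and bias of `base_layer`, $A$ and $B$ are the weights of `lora_A` and `lora_B`, and $s$ is `scaling`, which is $\alpha / r$ under the default configuration.

$$
h = W_0 x + b + s \, B A \, \mathrm{dropout}(x), \qquad s = \frac{\alpha}{r}
$$

With several active adapters, one such low-rank term is added per adapter.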

## Training

We'll use the Hugging Face `Trainer` to train our model. We set the desired training arguments and then start the training.

In [14]:
batch_size = 128

args = TrainingArguments(
    f"logs/{model_name}-finetuned-lora-oxford-pets",
    per_device_train_batch_size=batch_size,
    learning_rate=5e-3,
    num_train_epochs=5,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=4,
    logging_steps=10,
    save_total_limit=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="accuracy",
    report_to="tensorboard",
    bf16=True,
    remove_unused_columns=False,
    load_best_model_at_end=True,
)

Some things to note here:

* We're using a larger batch size since there is only a handful of parameters to train.
* We're using a larger learning rate than is typical for full fine-tuning (1e-5, for example).

Both choices are a byproduct of the fact that we're training only a small number of parameters. This can potentially also reduce the need to conduct expensive hyperparameter tuning experiments.

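Note: as of testing on 2024-11-21, passing a `PeftModel` to `Trainer` prevents the eval metrics from being retrieved correctly. We therefore pass `model` instead. Since `get_peft_model` modifies the base model in place, `model` already contains the LoRA layers.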

In [15]:
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=processor,
)

In [16]:
trainer.train()

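| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 0 | 0.9274 | 0.21099 | 0.930311 |
| 1 | 0.1402 | 0.190188 | 0.948579 |
| 2 | 0.0717 | 0.16882 | 0.948579 |
| 4 | 0.0239 | 0.16655 | 0.955345 |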


TrainOutput(global_step=55, training_loss=0.2209889520298351, metrics={'train_runtime': 599.9586, 'train_samples_per_second': 49.27, 'train_steps_per_second': 0.092, 'total_flos': 2.166104653885735e+18, 'train_loss': 0.2209889520298351, 'epoch': 4.680851063829787})
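
The reported `global_step=55` and final epoch value can be reconciled with the training arguments. The sketch below assumes a single-GPU run. It also assumes that optimizer steps per epoch are counted with floor division, which matches the numbers above but may differ in other `transformers` versions.

```python
import math

train_samples = 5912
per_device_batch = 128
grad_accum = 4
epochs = 5
num_devices = 1  # assumption: single-GPU run

# Samples contributing to one optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices
# Mini-batches per epoch (the last one is partial).
batches_per_epoch = math.ceil(train_samples / (per_device_batch * num_devices))
# Optimizer steps per epoch, assuming floor division.
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs
# Fraction of epochs actually covered by those steps.
reported_epoch = total_updates * grad_accum / batches_per_epoch

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
# 512 47 11 55
print(round(reported_epoch, 4))  # 4.6809
```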

## Saving and Loading the LoRA Model

In [17]:
lora_model_path = f"logs/{model_name}-finetuned-lora-oxford-pets/lora-model/"
lora_model.save_pretrained(lora_model_path)

Next, we see how to load the LoRA-updated parameters along with our base model for inference. When we wrap a base model with `PeftModel`, the modifications are done in place. To avoid any issues that might stem from these in-place modifications, we initialize a fresh copy of the base model, just like we did earlier, and construct our inference model from it.

In [18]:
from peft import PeftModel

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)
model_with_lora = PeftModel.from_pretrained(model, model_id=lora_model_path)
merged_model = model_with_lora.merge_and_unload()

merged_model_path = f"logs/{model_name}-finetuned-lora-oxford-pets/merged-model/"
merged_model.save_pretrained(merged_model_path)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([37]) in the model instantiated
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([37, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
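
Conceptually, merging folds the low-rank update into the dense weight so that a single matrix multiplication gives the same output. The NumPy sketch below illustrates this equivalence with tiny random matrices. It is not the PEFT implementation of `merge_and_unload`, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, scaling = 8, 2, 1.0  # tiny hypothetical sizes

W0 = rng.normal(size=(d, d))  # frozen base weight
A = rng.normal(size=(r, d))   # trained LoRA matrices (random stand-ins)
B = rng.normal(size=(d, r))
x = rng.normal(size=(3, d))

# Unmerged: base output plus the low-rank branch.
y_lora = x @ W0.T + scaling * (x @ A.T) @ B.T

# Merged: fold the update into a single dense weight.
W_merged = W0 + scaling * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_lora, y_merged))  # True
```

This is why a merged model has the same architecture and inference cost as the original base model.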


## Inference

And let's now fetch a sample for inference.

In [19]:
from PIL import Image
import requests

url = "https://hf-mirror.com/datasets/alanahmet/LoRA-pets-dataset/resolve/main/shiba_inu_136.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained(model_checkpoint)
encoding = image_processor(image.convert("RGB"), return_tensors="pt")
print(encoding.pixel_values.shape)

# forward pass
with torch.no_grad():
    outputs = merged_model(**encoding)
    logits = outputs.logits

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", merged_model.config.id2label[predicted_class_idx])

Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.


torch.Size([1, 3, 224, 224])
Predicted class: shiba inu
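
Beyond the single top prediction, it can be useful to inspect the most probable classes and their probabilities. The sketch below applies a softmax to a logit vector with NumPy. The helper `top_k`, the logits, and the 4-class label map are hypothetical.

```python
import numpy as np


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs from raw logits."""
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(id2label[int(i)], float(probs[i])) for i in order]


# Hypothetical logits and a 4-class label map, for illustration only.
id2label = {0: "Siamese", 1: "Birman", 2: "shiba inu", 3: "pug"}
logits = np.array([0.5, -1.0, 4.0, 1.0])

for label, p in top_k(logits, id2label):
    print(f"{label}: {p:.3f}")
```

With the real model, the same helper could be called with `logits[0].numpy()` and `merged_model.config.id2label`.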


## Conclusion

In this tutorial, we've explored LoRA (Low-Rank Adaptation), a parameter-efficient training method that significantly reduces the number of trainable parameters while preserving model accuracy. By using low-rank decomposition, LoRA adapts pre-trained models to new data through small matrices that capture the essential information.

Key takeaways from this tutorial include:

1. **Parameter Efficiency**: LoRA reduces the number of trainable parameters, leading to faster training, reduced memory consumption, and improved efficiency in model adaptation.
2. **Adaptation Methodology**: The utilization of rank decomposition matrices inserted into specific layers of the model allows for rapid adaptation to new data while maintaining the original model's foundational knowledge.
3. **Model Performance**: Despite the reduction in trainable parameters, LoRA demonstrates comparable performance to fully fine-tuned models, showcasing its effectiveness in preserving accuracy.

Through step-by-step implementation, we've fine-tuned an image classification model using LoRA. We've covered dataset preparation, model loading, training, and saving, merging, and loading the LoRA-updated parameters for inference.

The ability to efficiently adapt pre-trained models using LoRA provides a powerful approach for practitioners in various domains, allowing for quick adaptation to new data while ensuring resource efficiency and maintaining high model performance.

To explore LoRA in other applications, we encourage you to read the original LoRA paper and the resources provided by Hugging Face.

For more information about LoRA you can check:


*   [Hugging Face Image Classification using LoRA](https://huggingface.co/docs/peft/task_guides/image_classification_lora)
*   [Conceptual LoRA Guide](https://huggingface.co/docs/peft/conceptual_guides/lora)
*   [Computer Vision Course Image Classification with Transfer Learning](https://github.com/shreydan/computer-vision-course/blob/main/notebooks/Unit%203%20-%20Vision%20Transformers/transfer-learning.ipynb)
*   [LoRA Paper](https://arxiv.org/abs/2106.09685)