# LoRA Fine-tuning

In this example, we'll fine-tune a base ViT model on two different datasets using LoRA. Then, we'll load the base model and dynamically swap between the two LoRA adapters depending on the task we want to complete.

## Why does this matter?

A foundation model can do many things reasonably well, but it rarely excels at any one specific task. We can fine-tune the model to produce specialized models that are very good at solving specific tasks:

<img src='images/slide1.png' width="800">

We'll use LoRA to fine-tune the foundation model and generate many specialized adapters. We can load these adapters together with a model to dynamically transform its capabilities:

<img src='images/slide2.png' width="800">

When loading the model, we'll take the foundation model's original weights and apply the LoRA weight changes to them to get the fine-tuned model weights:

<img src='images/slide3.png' width="800">
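
To make the idea of swapping adapters concrete, here is a minimal NumPy sketch. It assumes a tiny 4x4 weight matrix and two made-up sets of weight changes; the task names and the `fine_tuned_weights` helper are hypothetical and only illustrate that the base weights stay shared while the changes vary per task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared, frozen foundation-model weights (tiny on purpose).
base_weights = rng.normal(size=(4, 4))

# Hypothetical weight changes learned for two different tasks.
adapters = {
    "food": rng.normal(size=(4, 4)) * 0.1,
    "pets": rng.normal(size=(4, 4)) * 0.1,
}


def fine_tuned_weights(task):
    """Return the base weights with one task's changes applied."""
    return base_weights + adapters[task]


food_weights = fine_tuned_weights("food")
pets_weights = fine_tuned_weights("pets")

# The base weights are never modified, so we can switch tasks at any time.
print(np.allclose(food_weights - adapters["food"], base_weights))
```

Because the base weights are never overwritten, switching tasks only requires picking a different set of changes.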

The beauty of LoRA is that we don't need to fine-tune the entire matrix of weights. Instead, we can get away with fine-tuning two matrices of lower rank. These matrices, when multiplied together, give us the weight updates we need to apply to the foundation model to modify its capabilities:

<img src='images/slide4.png' width="800">
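
The low-rank update can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the PEFT implementation: it uses a single 768x768 matrix (the hidden size of ViT-Base), a rank of 16, the common convention of scaling the update by `alpha / r`, and the usual initialization where one factor starts at zero. All variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: hidden dimension, LoRA rank, and scaling factor.
d, r, alpha = 768, 16, 16

W = rng.normal(size=(d, d))         # frozen foundation-model weights
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor (r x d)
B = np.zeros((d, r))                # trainable low-rank factor (d x r), starts at zero

# The weight update is the scaled product of the two small matrices.
delta_W = (alpha / r) * (B @ A)
W_finetuned = W + delta_W

full_params = W.size
lora_params = A.size + B.size
print(f"Full matrix: {full_params:,} parameters")
print(f"LoRA factors: {lora_params:,} parameters ({100 * lora_params / full_params:.2f}%)")
```

Since `B` starts at zero in this sketch, the update is zero before training and the model initially behaves exactly like the foundation model.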

Here is how much you can save when using LoRA to fine-tune models of different sizes: 

<img src='images/slide5.png' width="800">

In [1]:
%pip install --quiet transformers accelerate evaluate datasets peft

Note: you may need to restart the kernel to use updated packages.


We are going to use a Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. [Here is the model card](https://huggingface.co/google/vit-base-patch16-224-in21k).

This model has a size of 346 MB on disk.

In [1]:
model_checkpoint = "google/vit-base-patch16-224-in21k"

## Creating a Couple of Helpful Functions

In [2]:
import os 
import torch
from peft import PeftModel, LoraConfig, get_peft_model
from transformers import AutoModelForImageClassification


def print_model_size(path):
    size = 0
    for f in os.scandir(path):
        size += os.path.getsize(f)

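    # One fixed decimal place; the bare ".2" spec would switch to
    # scientific notation for sizes of 100 MB or more.
    print(f"Model size: {(size / 1e6):.1f} MB")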


def print_trainable_parameters(model, label):
    parameters, trainable = 0, 0
    
    for _, p in model.named_parameters():
        parameters += p.numel()
        trainable += p.numel() if p.requires_grad else 0

    print(f"{label} trainable parameters: {trainable:,}/{parameters:,} ({100 * trainable / parameters:.2f}%)")


def split_dataset(dataset):
    dataset_splits = dataset.train_test_split(test_size=0.1)
    # Return the splits explicitly so the (train, test) order is obvious.
    return dataset_splits["train"], dataset_splits["test"]
    

def create_label_mappings(dataset):
    label2id, id2label = dict(), dict()
    for i, label in enumerate(dataset.features["label"].names):
        label2id[label] = i
        id2label[i] = label 

    return label2id, id2label

## Loading and Preparing the Datasets

We'll be loading two different datasets to fine-tune the base model:

1. A dataset of pictures of food.
2. A dataset of pictures of cats and dogs.

In [3]:
from datasets import load_dataset

# This is the food dataset
dataset1 = load_dataset("food101", split="train[:10000]")

# This is the dataset of pictures of cats and dogs.
# Notice we need to rename the label column so we can
# reuse the same code for both datasets.
dataset2 = load_dataset("microsoft/cats_vs_dogs", split="train", trust_remote_code=True)
dataset2 = dataset2.rename_column("labels", "label")

dataset1_train, dataset1_test = split_dataset(dataset1)
dataset2_train, dataset2_test = split_dataset(dataset2)

We need label-to-ID and ID-to-label mappings to properly fine-tune the Vision Transformer model. You can find more information in the [`PretrainedConfig`](https://huggingface.co/docs/transformers/en/main_classes/configuration#transformers.PretrainedConfig) documentation, under the "Parameters for fine-tuning tasks" section.

In [4]:
dataset1_label2id, dataset1_id2label = create_label_mappings(dataset1)
dataset2_label2id, dataset2_id2label = create_label_mappings(dataset2)
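
The mappings themselves are plain dictionaries. As a quick illustration, the sketch below builds them from a hypothetical three-label list that stands in for `dataset.features["label"].names`; the label names are made up for the example.

```python
# Hypothetical label names standing in for dataset.features["label"].names
label_names = ["apple_pie", "pizza", "sushi"]

# Map each label name to its integer ID and back.
label2id = {label: i for i, label in enumerate(label_names)}
id2label = {i: label for i, label in enumerate(label_names)}

# The two dictionaries are inverses of each other.
round_trip_ok = all(id2label[label2id[name]] == name for name in label_names)

print(label2id)
print(id2label)
print(round_trip_ok)
```

The model stores both dictionaries in its configuration, which is what later lets us turn a predicted class index back into a readable label.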

In [5]:
config = {
    "model1": {
        "train_data": dataset1_train,
        "test_data": dataset1_test,
        "label2id": dataset1_label2id,
        "id2label": dataset1_id2label,
        "epochs": 5,
        "path": "./lora-model1"
    },
    "model2": {
        "train_data": dataset2_train,
        "test_data": dataset2_test,
        "label2id": dataset2_label2id,
        "id2label": dataset2_id2label,
        "epochs": 1,
        "path": "./lora-model2"
    },
}

Let's create an image processor automatically from the [preprocessor configuration](https://huggingface.co/google/vit-base-patch16-224-in21k/blob/main/preprocessor_config.json) specified by the base model.

In [6]:
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(model_checkpoint, use_fast=True)

We can now prepare the preprocessing pipeline to transform the images in our dataset.

In [7]:
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    Resize,
    ToTensor,
)

preprocess_pipeline = Compose([
    Resize(image_processor.size["height"]),
    CenterCrop(image_processor.size["height"]),
    ToTensor(),
    Normalize(mean=image_processor.image_mean, std=image_processor.image_std),
])

def preprocess(batch):
    batch["pixel_values"] = [
        preprocess_pipeline(image.convert("RGB")) for image in batch["image"]
    ]
    return batch


# Let's set the transform function on every train and test set
for cfg in config.values():
    cfg["train_data"].set_transform(preprocess)
    cfg["test_data"].set_transform(preprocess)
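
To show what the last two steps of the pipeline do numerically, here is a NumPy sketch that mimics `ToTensor` and `Normalize` on a fake image. It assumes a mean and standard deviation of 0.5 per channel; the real values come from `image_processor.image_mean` and `image_processor.image_std`, so treat these numbers as placeholders.

```python
import numpy as np

# Assumed normalization statistics; the real values come from
# image_processor.image_mean and image_processor.image_std.
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32)

# A fake 224x224 RGB image with 8-bit pixels (height, width, channels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# ToTensor: scale pixels to [0, 1] and move channels first (C, H, W).
tensor = (image.astype(np.float32) / 255.0).transpose(2, 0, 1)

# Normalize: per-channel (x - mean) / std.
pixel_values = (tensor - mean[:, None, None]) / std[:, None, None]

print(pixel_values.shape, pixel_values.min(), pixel_values.max())
```

Before these steps, `Resize` with a single integer scales the shorter side of the image to that size, and `CenterCrop` then cuts out the central square.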

## Fine-Tuning the Model

These are the functions we'll need to fine-tune the model.

In [8]:
import numpy as np
import evaluate
import torch
from peft import PeftModel, LoraConfig, get_peft_model
from transformers import AutoModelForImageClassification


metric = evaluate.load("accuracy")


def data_collate(examples):
    """
    Prepare a batch of examples from a list of elements of the
    train or test datasets.
    """
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}


def compute_metrics(eval_pred):
    """
    Compute the model's accuracy on a batch of predictions.
    """
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)


def get_base_model(label2id, id2label):
    """
    Create an image classification base model from
    the model checkpoint.
    """
    return AutoModelForImageClassification.from_pretrained(
        model_checkpoint,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True, 
    )


def build_lora_model(label2id, id2label):
    """Build the LoRA model to fine-tune the base model."""
    model = get_base_model(label2id, id2label)
    print_trainable_parameters(model, label="Base model")

    config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=["query", "value"],
        lora_dropout=0.1,
        bias="none",
        modules_to_save=["classifier"],
    )

    lora_model = get_peft_model(model, config)
    print_trainable_parameters(lora_model, label="LoRA")

    return lora_model
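
To see what `compute_metrics` does with the raw model outputs, here is a small sketch using made-up logits and labels. It computes accuracy directly with NumPy instead of the `evaluate` metric, so it is an approximation of the same idea rather than the exact code path.

```python
import numpy as np

# Hypothetical logits for four images and three classes.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [0.0, 0.1, 3.0],
    [1.0, 2.0, 0.5],
])
label_ids = np.array([0, 1, 2, 0])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == label_ids).mean())
print(predictions, accuracy)
```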


Let's now configure the fine-tuning process.

In [9]:
from transformers import TrainingArguments

batch_size = 128
training_arguments = TrainingArguments(
    output_dir="./model-checkpoints",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=4,
    fp16=True,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    label_names=["labels"],
)
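
With gradient accumulation, the optimizer sees a larger batch than the one that fits in memory. The arithmetic below is a rough sketch that assumes a single GPU and the 9,000 food images left for training after the 90/10 split; the variable names are illustrative.

```python
import math

per_device_batch_size = 128
gradient_accumulation_steps = 4
num_devices = 1         # assumption: a single GPU
train_examples = 9_000  # 90% of the 10,000 food images

# Gradients from several batches are accumulated before each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Number of batches the data loader yields in one pass over the data.
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))

print(f"Effective batch size: {effective_batch_size}")
print(f"Batches per epoch: {batches_per_epoch}")
```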

Let's now fine-tune both models.

In [10]:
from transformers import Trainer

for cfg in config.values():
    training_arguments.num_train_epochs = cfg["epochs"]
    
    trainer = Trainer(
        build_lora_model(cfg["label2id"], cfg["id2label"]),
        training_arguments,
        train_dataset=cfg["train_data"],
        eval_dataset=cfg["test_data"],
        tokenizer=image_processor,
        compute_metrics=compute_metrics,
        data_collator=data_collate,
    )

    results = trainer.train()
    evaluation_results = trainer.evaluate(cfg['test_data'])
    print(f"Evaluation accuracy: {evaluation_results['eval_accuracy']}")

    # We can now save the fine-tuned model to disk.
    trainer.save_model(cfg["path"])
    print_model_size(cfg["path"])

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Base model trainable parameters: 85,876,325/85,876,325 (100.00%)
LoRA trainable parameters: 667,493/86,543,818 (0.77%)


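| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 0     | 2.4038        | 0.286665        | 0.937    |
| 1     | 0.1751        | 0.222291        | 0.936    |
| 2     | 0.0863        | 0.191845        | 0.939    |
| 4     | 0.0368        | 0.189346        | 0.946    |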


Evaluation accuracy: 0.946
Model size: 2.7 MB


Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Base model trainable parameters: 85,800,194/85,800,194 (100.00%)
LoRA trainable parameters: 591,362/86,391,556 (0.68%)


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 0     | 0.0088        | 0.021229        | 0.994874 |


Evaluation accuracy: 0.9948739854762921
Model size: 2.4 MB
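
The trainable parameter counts printed above can be reproduced by hand. The sketch below assumes the ViT-Base architecture (12 encoder layers with a hidden size of 768) and that the adapter weights are stored as 32-bit floats; the saved directory also contains a few small extra files, so the size estimate is approximate. The helper name is hypothetical.

```python
hidden_size = 768      # ViT-Base hidden dimension
num_layers = 12        # ViT-Base encoder layers
rank = 16              # r in LoraConfig
targets_per_layer = 2  # "query" and "value"


def lora_trainable_parameters(num_classes):
    """Count the parameters trained with the LoRA configuration above."""
    # Each adapted linear layer gets A (rank x hidden) and B (hidden x rank).
    lora = num_layers * targets_per_layer * rank * (hidden_size + hidden_size)
    # modules_to_save=["classifier"] keeps a trainable copy of the head
    # (weights plus biases).
    classifier = hidden_size * num_classes + num_classes
    return lora + classifier


for name, num_classes in [("food101", 101), ("cats_vs_dogs", 2)]:
    trainable = lora_trainable_parameters(num_classes)
    size_mb = trainable * 4 / 1e6  # assumption: 4 bytes per parameter
    print(f"{name}: {trainable:,} trainable parameters, about {size_mb:.1f} MB")
```

The two totals match the 667,493 and 591,362 trainable parameters reported during training, and the size estimates line up with the 2.7 MB and 2.4 MB adapters saved to disk.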


## Running Inference

Let's start by defining a couple of functions that will help us build the inference model and run predictions using it.

In [12]:
def build_inference_model(label2id, id2label, lora_adapter_path):
    """Build the model that will be used to run inference."""

    # Let's load the base model
    model = get_base_model(label2id, id2label)

    # Now, we can create the inference model combining the base model
    # with the fine-tuned LoRA adapter.
    return PeftModel.from_pretrained(model, lora_adapter_path)


def predict(image, model, image_processor):
    """Predict the class represented by the supplied image."""
    
    encoding = image_processor(image.convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoding)
        logits = outputs.logits

    class_index = logits.argmax(-1).item()
    return model.config.id2label[class_index]
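
The last two lines of `predict` turn logits into a label. Here is a small pure-Python sketch of that step using made-up logits for a two-class model. The softmax confidence is an addition for illustration only; `predict` itself returns just the label.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to one."""
    # Subtract the maximum for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits and label mapping for a two-class model.
logits = [-1.2, 3.4]
id2label = {0: "cat", 1: "dog"}

probabilities = softmax(logits)

# Same idea as logits.argmax(-1).item(): pick the index of the largest logit.
class_index = max(range(len(logits)), key=logits.__getitem__)

print(id2label[class_index], round(probabilities[class_index], 3))
```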


We need to create two inference models, one using each of the LoRA adapters:

In [14]:
for cfg in config.values():
    cfg["inference_model"] = build_inference_model(cfg["label2id"], cfg["id2label"], cfg["path"]) 
    cfg["image_processor"] = AutoImageProcessor.from_pretrained(cfg["path"])

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Here is a list of sample images and the model that we need to use to classify each one.

In [44]:
samples = [
    {
        "image": "https://www.allrecipes.com/thmb/AtViolcfVtInHgq_mRtv4tPZASQ=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/ALR-187822-baked-chicken-wings-4x3-5c7b4624c8554f3da5aabb7d3a91a209.jpg",
        "model": "model1",
    },
    {
        "image": "https://wallpapers.com/images/featured/kitty-cat-pictures-nzlg8fu5sqx1m6qj.jpg",
        "model": "model2",
    },
    {
        "image": "https://i.natgeofe.com/n/5f35194b-af37-4f45-a14d-60925b280986/NationalGeographic_2731043_3x4.jpg",
        "model": "model2",
    },
    {
        "image": "https://www.simplyrecipes.com/thmb/KE6iMblr3R2Db6oE8HdyVsFSj2A=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/__opt__aboutcom__coeus__resources__content_migration__simply_recipes__uploads__2019__09__easy-pepperoni-pizza-lead-3-1024x682-583b275444104ef189d693a64df625da.jpg",
        "model": "model1"
    }
]

We can now run predictions on every sample.

In [45]:
from PIL import Image
import requests

for sample in samples:
    image = Image.open(requests.get(sample["image"], stream=True).raw)
    
    inference_model = config[sample["model"]]["inference_model"]
    image_processor = config[sample["model"]]["image_processor"]

    prediction = predict(image, inference_model, image_processor)
    print(f"Prediction: {prediction}")

Prediction: chicken_wings
Prediction: cat
Prediction: dog
Prediction: pizza
