# Fine-tune SmolVLM on Visual Question Answering

In this notebook we will fine-tune SmolVLM on the VQAv2 dataset.

In [7]:
!pip install -q accelerate datasets peft bitsandbytes huggingface_hub pillow

In [9]:
!uv pip install torch

Audited 1 package in 117ms


In [2]:
!pip install -q flash-attn --no-build-isolation

We will push our model to the Hub, so we need to authenticate first.

In [1]:
from huggingface_hub import notebook_login

notebook_login()


This notebook supports three training modes, selected with two flags. QLoRA trains a LoRA adapter on top of a 4-bit quantized copy of the model, which saves memory; to use it, set `USE_QLORA` to True. For plain LoRA, set `USE_QLORA` to False and `USE_LORA` to True. For full fine-tuning, set both `USE_LORA` and `USE_QLORA` to False, which is how the flags are set in the model-loading cell below.
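As a quick illustration of how the two flags combine, here is a minimal sketch. The helper `training_mode` is hypothetical and not part of the notebook; it only mirrors the branching used in the model-loading cell, where `USE_QLORA` takes precedence over `USE_LORA`.

```python
def training_mode(use_lora: bool, use_qlora: bool) -> str:
    """Map the two notebook flags to the training mode they select."""
    if use_qlora:
        # 4-bit quantized base model plus a LoRA adapter
        return "qlora"
    if use_lora:
        # unquantized base model plus a LoRA adapter
        return "lora"
    # no adapter: all weights of the model are trained
    return "full"


for flags in [(False, False), (True, False), (False, True)]:
    print(flags, "->", training_mode(*flags))
```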

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" # only needed on a multi-GPU machine when you want to pick specific GPU indices
# to use a subset of the GPUs, list their indices, e.g. "2,4"

We will load the VQAv2 dataset. For educational purposes we only use the validation split and divide it in two: 20% for training and 80% for testing.

In [2]:
from datasets import load_dataset
ds = load_dataset('merve/vqav2-small', trust_remote_code=True)

In [3]:
split_ds = ds["validation"].train_test_split(test_size=0.8)
train_ds = split_ds["train"]
test_ds = split_ds["test"]

In [4]:
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

USE_LORA = False
USE_QLORA = False
model_id = "HuggingFaceTB/SmolVLM_converted_4"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(
    model_id
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=not USE_QLORA,  # DoRA is enabled for plain LoRA only
        init_lora_weights="gaussian"
    )
    lora_config.inference_mode = False
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config if USE_QLORA else None,
        attn_implementation="flash_attention_2",
        device_map="auto"
    )
    if USE_QLORA:
        # prepare the quantized base model before attaching the adapter
        model = prepare_model_for_kbit_training(model)
    # get_peft_model attaches the LoRA adapter; also calling model.add_adapter would attach it a second time
    model = get_peft_model(model, lora_config)
    print(model.get_nb_trainable_parameters())
else:
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    ).to(DEVICE)
    
    # # if you'd like to only fine-tune LLM
    # for param in model.model.vision_model.parameters():
    #     param.requires_grad = False

Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
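
To get a feel for what `r=8` and `lora_alpha=8` in the `LoraConfig` above imply, the following sketch counts the extra parameters plain LoRA adds to a single linear layer and computes the adapter scaling `lora_alpha / r`. The layer dimensions are hypothetical and chosen only for illustration; they are not taken from SmolVLM, and DoRA's additional magnitude vector is not counted.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by plain LoRA to one linear layer.

    LoRA learns two low-rank matrices, A (r x d_in) and B (d_out x r),
    so the adapter has r * (d_in + d_out) trainable parameters.
    """
    return r * (d_in + d_out)


r, lora_alpha = 8, 8          # values from the LoraConfig in this notebook
d_in, d_out = 2048, 2048      # hypothetical projection size, for illustration only

scaling = lora_alpha / r      # factor applied to the low-rank update
full = d_in * d_out           # parameters of the frozen base weight
extra = lora_extra_params(d_in, d_out, r)

print(f"scaling={scaling}, adapter params={extra}, fraction of base={extra / full:.4%}")
```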


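Let's write our data collating function. We apply the chat template so that each question and its answer end up in the same prompt, which lets the model learn to answer. We then pass the formatted prompts and the images to the processor, which tokenizes the text and preprocesses the images. Finally, we build the labels from the input IDs and set padding and image tokens to -100 so that the loss ignores them.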

In [5]:
image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")]

def collate_fn(examples):
  texts = []
  images = []
  for example in examples:
      image = example["image"]
      if image.mode != 'RGB':
        image = image.convert('RGB')
      question = example["question"]
      answer = example["multiple_choice_answer"]
      messages = [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "Answer briefly."},
                  {"type": "image"},
                  {"type": "text", "text": question}
              ]
          },
          {
              "role": "assistant",
              "content": [
                  {"type": "text", "text": answer}
              ]
          }
      ]
      text = processor.apply_chat_template(messages, add_generation_prompt=False)
      texts.append(text.strip())
      images.append([image])

  batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
  labels = batch["input_ids"].clone()
  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == image_token_id] = -100 
  batch["labels"] = labels

  return batch
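
The label masking at the end of `collate_fn` can be illustrated without any framework. This is a minimal sketch on plain Python lists; the token IDs are made up, and `mask_labels` is a hypothetical helper rather than part of the notebook. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions set to it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids into labels, masking pad and image positions."""
    return [
        [IGNORE_INDEX if tok in (pad_token_id, image_token_id) else tok for tok in row]
        for row in input_ids
    ]


# Hypothetical IDs: 0 is the pad token, 99 is the <image> token.
PAD, IMG = 0, 99
batch_ids = [
    [99, 99, 5, 6, 7],   # image tokens followed by text
    [99, 99, 8, 0, 0],   # shorter sample, right-padded
]
labels = mask_labels(batch_ids, PAD, IMG)
print(labels)
```

Only the text tokens keep their IDs in `labels`; the original `batch_ids` are left unchanged, just as `collate_fn` clones `input_ids` before masking.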


We can now define `TrainingArguments` and pass them to `Trainer`.

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    optim="adamw_torch", # for a paged 8-bit optimizer, pick paged_adamw_8bit
    #evaluation_strategy="epoch",
    bf16=True,
    output_dir="./idefics3-llama-vqav2",
    hub_model_id="idefics3-llama-vqav2",
    remove_unused_columns=False,
)
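
The batch-related arguments above interact: the number of samples consumed per optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. The sketch below works this out under two assumptions that are not stated in the notebook: that all four GPUs listed in `CUDA_VISIBLE_DEVICES` are used, and a hypothetical training set of 4,000 samples. The exact rounding of the step count can differ between `Trainer` versions, so treat the result as approximate.

```python
import math


def steps_per_epoch(num_samples, per_device_batch, grad_accum, num_devices):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, math.ceil(num_samples / effective_batch)


effective, steps = steps_per_epoch(
    num_samples=4000,      # hypothetical dataset size
    per_device_batch=2,    # per_device_train_batch_size
    grad_accum=8,          # gradient_accumulation_steps
    num_devices=4,         # assumption: all four visible GPUs are used
)
print(f"effective batch size: {effective}, optimizer steps per epoch: ~{steps}")
```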


In [7]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)

In [8]:
trainer.evaluate()



{'eval_loss': 1.1348639726638794,
 'eval_runtime': 2965.5955,
 'eval_samples_per_second': 5.782,
 'eval_steps_per_second': 0.181}

I'm running standalone scripts on top of tmux so the logs will not appear here. I will upload my training script to this repository.

In [9]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 25 | 0.6268 |
| 50 | 0.1128 |


TrainOutput(global_step=67, training_loss=0.30232649240920795, metrics={'train_runtime': 1017.9478, 'train_samples_per_second': 4.211, 'train_steps_per_second': 0.066, 'total_flos': 7.901553723342048e+16, 'train_loss': 0.30232649240920795, 'epoch': 1.0})

In [10]:
trainer.evaluate()



{'eval_loss': 0.10185369849205017,
 'eval_runtime': 2914.2069,
 'eval_samples_per_second': 5.884,
 'eval_steps_per_second': 0.184,
 'epoch': 1.0}
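
To put the two evaluation results side by side, this small sketch computes the relative drop in `eval_loss` and the corresponding perplexities, using the values reported by the two `trainer.evaluate()` calls. Reading `exp(loss)` as a perplexity assumes the loss is a mean per-token cross-entropy over the unmasked label positions.

```python
import math

loss_before = 1.1348639726638794   # eval_loss before training
loss_after = 0.10185369849205017   # eval_loss after one epoch

relative_drop = (loss_before - loss_after) / loss_before
ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)

print(f"relative drop in eval loss: {relative_drop:.1%}")
print(f"perplexity before: {ppl_before:.2f}, after: {ppl_after:.2f}")
```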

In [None]:
trainer.push_to_hub()