
# 🏥 MediaEval-Medico-2025 - Subtask 1: GI Image VQA (Colab/T4 Friendly)

This notebook fine-tunes **`google/paligemma-3b-pt-224`** on **Kvasir-VQA-x1** using **[ms-swift](https://swift.readthedocs.io/)**, then pushes the result to **Hugging Face Hub**.  
It’s optimized for the **free Colab T4 GPU** tier (≈16 GB) with 4-bit quantization + LoRA.
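
To see why 4-bit quantization matters on a 16 GB card, the weight footprint can be estimated with simple arithmetic. The sketch below assumes roughly 3 billion parameters (read off the model name, not measured) and counts only the weights; activations, LoRA adapters, optimizer state, and quantization constants come on top. `weight_memory_gb` is a hypothetical helper, not part of ms-swift.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 3e9  # assumption: "3B" from the model name, not an exact parameter count

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB of weights")
```

Under these assumptions the weights shrink from about 6 GB in float16 to about 1.5 GB in 4-bit, which leaves more of the T4's memory for activations and the LoRA training state.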

**Repo:** 🌐 MediaEval-Medico-2025: https://github.com/simula/MediaEval-Medico-2025


**What you’ll get**
- ✅ Data prep (images + JSONL suitable for ms-swift VLMs)
- ✅ T4-friendly training config (QLoRA, i.e. 4-bit quantization + LoRA, plus gradient checkpointing)
- ✅ Validation during training
- ✅ Auto-push to Hugging Face Hub
- ✅ Minimal inference sanity-check

> **Tip:** Tune `num_train_epochs` and the learning rate for your time budget, and the batch size for your GPU memory.


## 🔧 Runtime & GPU Check

In [None]:

# Make sure you're on Colab with GPU: Runtime → Change runtime type → T4 GPU
import sys
import torch

print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("⚠️ No GPU detected. Please enable a T4 GPU in Colab runtime.")

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
PyTorch: 2.8.0+cu126
CUDA available: True
GPU: Tesla T4


## 📦 Install dependencies

In [None]:

!pip install ms-swift bitsandbytes wandb


Collecting ms-swift
  Downloading ms_swift-3.10.0-py3-none-any.whl.metadata (35 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting addict (from ms-swift)
  Downloading addict-2.4.0-py3-none-any.whl.metadata (1.0 kB)
Collecting attrdict (from ms-swift)
  Downloading attrdict-2.0.1-py2.py3-none-any.whl.metadata (6.7 kB)
Collecting binpacking (from ms-swift)
  Downloading binpacking-1.5.2.tar.gz (8.7 kB)
  Preparing metadata (setup.py) ... done
Collecting cpm-kernels (from ms-swift)
  Downloading cpm_kernels-1.0.11-py3-none-any.whl.metadata (1.2 kB)
Collecting dacite (from ms-swift)
  Downloading dacite-1.9.2-py3-none-any.whl.metadata (17 kB)
Collecting datasets<4.0,>=3.0 (from ms-swift)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting json-repair (from ms-swift)
  Downloading json_repair-0.53.0-py3-none-any.whl.metadata (11 kB)
Collecting modelscope>=1.23 (from ms-swift)
  Downl


## 🔐 Authenticate
- **Hugging Face**: Required to push your model to Hub. Create a [token](https://huggingface.co/settings/tokens) with `write` scope.
- **Weights & Biases (optional)**: Set a project name to log metrics.


In [None]:
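from huggingface_hub import whoami, login
import wandb, os
from datetime import datetime

# Interactive Hugging Face login. The token is cached under ~/.cache/huggingface/token,
# which the training command reads later for --hub_token.
!hf auth login --add-to-git-credential
# Weights & Biases login for metric logging
wandb.login()

os.environ["WANDB_PROJECT"] = "Kvasir-VQA-x1_Subtask1"
os.environ["WANDB_DISABLED"] = "false"
# The Hub username is used later to build the model repo URL
HF_USER = whoami()["name"]
print("Logged into HF as:", HF_USER)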




## 🗂️ Data Preparation (Kvasir-VQA-x1)
We’ll:
1) Cache all images locally (once) from **`SimulaMet-HOST/Kvasir-VQA`**.  
2) Build **VLM-ready JSONL** files (`messages` + `<image>` + `images` path) for **`SimulaMet/Kvasir-VQA-x1`** train/test splits.
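
As a rough illustration of the record layout described in step 2, the sketch below builds a single JSONL line. `make_record` and `is_well_formed` are hypothetical helpers, the example question, answer, and path are made up, and the check that the number of `<image>` tags equals the number of image paths is an assumption about what the trainer expects.

```python
import json


def make_record(question: str, answer: str, image_path: str) -> dict:
    """Build one record: a user turn carrying an <image> tag, an assistant turn, and the image path."""
    return {
        "messages": [
            {"role": "user", "content": f"<image>{question}"},
            {"role": "assistant", "content": answer},
        ],
        "images": [image_path],
    }


def is_well_formed(rec: dict) -> bool:
    """Check that <image> tags match the image list and that the last turn is the assistant's."""
    n_tags = sum(m["content"].count("<image>") for m in rec["messages"])
    return n_tags == len(rec["images"]) and rec["messages"][-1]["role"] == "assistant"


# Hypothetical example values, for illustration only
rec = make_record("Is a polyp visible?", "No polyp is visible.", "Kvasir-VQA-x1/images/example.jpg")
line = json.dumps(rec, ensure_ascii=False)  # one JSON object per line in the .jsonl file
print(line)
print("well formed:", is_well_formed(rec))
```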


You can also add your own data augmentation scripts here to augment the images or the question-answer pairs.


In [None]:
from datasets import load_dataset
from pathlib import Path
from tqdm import tqdm
import json, os

# Working directories
BASE_DIR = Path("./")
DATA_DIR = BASE_DIR / "Kvasir-VQA-x1"
IMG_DIR  = DATA_DIR / "images"
DATA_DIR.mkdir(parents=True, exist_ok=True)
IMG_DIR.mkdir(parents=True, exist_ok=True)

print("Data dir:", DATA_DIR)
print("Images dir:", IMG_DIR)

# 1) Save unique images locally
print("⏬ Caching images from SimulaMet-HOST/Kvasir-VQA ...")
host = load_dataset("SimulaMet-HOST/Kvasir-VQA", split="raw")
df = host.select_columns(['source', 'question', 'answer', 'img_id']).to_pandas()
# Keep the first row per unique img_id; the DataFrame index still matches the row position in `host`
first_rows = df.drop_duplicates('img_id')
for i, row in tqdm(first_rows.iterrows(), total=len(first_rows)):
    p = IMG_DIR / f"{row['img_id']}.jpg"
    if p.exists():
        continue  # already cached on a previous run
    host[int(i)]['image'].save(p)

# 2) Create JSONLs for train/test from Kvasir-VQA-x1 (VLM-ready for ms-swift)
print("Creating JSONLs ...")
def write_jsonl(split):
    out_path = DATA_DIR / f"Kvasir-VQA-x1-{split}.jsonl"
    ds = load_dataset("SimulaMet/Kvasir-VQA-x1", split=split)
    with open(out_path, "w", encoding="utf-8") as f:
        for r in ds:
            rec = {
                "messages": [
                    {"role": "user", "content": f"<image>{r['question']}"},
                    {"role": "assistant", "content": r["answer"]}
                ],
                "images": [str(IMG_DIR / f"{r['img_id']}.jpg")]
            }
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return out_path

train_jsonl = write_jsonl("train")
test_jsonl  = write_jsonl("test")

print("Train JSONL:", train_jsonl)
print("Test  JSONL:", test_jsonl)

### 🔍 Quick sanity check

In [None]:
import json, itertools
from pathlib import Path

# Read only the first three records, and close the file afterwards
with open(train_jsonl, "r", encoding="utf-8") as f:
    sample_lines = list(itertools.islice(f, 3))
for i, line in enumerate(sample_lines, 1):
    j = json.loads(line)
    print(f"\n--- Sample {i} ---")
    print("messages:", j["messages"])
    print("images:", j["images"])
    assert Path(j["images"][0]).exists(), "Missing image file!"
print("\nLooks good ✅")


--- Sample 1 ---
messages: [{'role': 'user', 'content': '<image>Are there any abnormalities, polyps, or anatomical landmarks visible in the image?'}, {'role': 'assistant', 'content': 'Evidence of oesophagitis is present with no polyps identified and the z-line is visible.'}]
images: ['Kvasir-VQA-x1/images/clb0kvxvm90y4074yf50vf5nq.jpg']

--- Sample 2 ---
messages: [{'role': 'user', 'content': '<image>What procedure is depicted in the image and what type of polyp is observed?'}, {'role': 'assistant', 'content': 'Evidence of a colonoscopy with a paris iia polyp noted'}]
images: ['Kvasir-VQA-x1/images/cl8k2u1r71foz083278j63qnm.jpg']

--- Sample 3 ---
messages: [{'role': 'user', 'content': '<image>Have all polyps been removed, is there any text present, and where is the abnormality located in the image?'}, {'role': 'assistant', 'content': 'Polyps remain present, text is visible, and the abnormality is located in the central and upper-center regions.'}]
images: ['Kvasir-VQA-x1/images/cl8k2

⚠️ ⚠️ ⚠️  
To reduce validation time, we randomly sample 1,000 entries from the full test set with the `shuf` command:

In [None]:
!shuf -n 1000 Kvasir-VQA-x1/Kvasir-VQA-x1-test.jsonl > Kvasir-VQA-x1/Kvasir-VQA-x1-test-1000.jsonl
VAL_1000_PATH = "Kvasir-VQA-x1/Kvasir-VQA-x1-test-1000.jsonl"
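
Because `shuf` draws a different subset on every run, validation scores from separate runs are not directly comparable. A seeded Python alternative is sketched below; `sample_jsonl` is a hypothetical helper and the default seed of 42 is arbitrary.

```python
import random


def sample_jsonl(src: str, dst: str, n: int, seed: int = 42) -> int:
    """Write a seeded random subset of up to `n` lines from `src` to `dst`; return the count written."""
    with open(src, "r", encoding="utf-8") as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]  # skip blank lines
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    subset = rng.sample(lines, min(n, len(lines)))
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(subset) + "\n")
    return len(subset)
```

Calling `sample_jsonl("Kvasir-VQA-x1/Kvasir-VQA-x1-test.jsonl", "Kvasir-VQA-x1/Kvasir-VQA-x1-test-1000.jsonl", 1000)` would produce the same 1,000-line file each time, as long as the source file and the seed stay the same.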


## 🚀 Fine-tune PaliGemma 3B (QLoRA: 4-bit quantization + LoRA)
> You can also use any other multimodal models listed here:  
> https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html

In [None]:
MODEL_NAME = "google/paligemma-3b-pt-224"  # any supported model from the link above
# Target Hugging Face repo name (change as needed)
HUB_MODEL_ID = f"Kvasir-VQA-x1-lora_{datetime.now().strftime('%y%m%d-%H%M')}"  # appends a date-time suffix

TRAIN_PATH=str(train_jsonl)
VAL_PATH=str(test_jsonl)

print("Model:      ", MODEL_NAME)
print("Train file: ", TRAIN_PATH)
print("Valid file: ", VAL_PATH)
print("Hub repo:   ", HUB_MODEL_ID)

print("📝 You can find training logs after the training starts at: https://wandb.ai/home")
print("📌 After each validation stage, the HF repository will be updated with the best model.")
print(f"✅ Model will be available at: https://huggingface.co/{HF_USER}/{HUB_MODEL_ID}")

Model:       google/paligemma-3b-pt-224
Train file:  Kvasir-VQA-x1/Kvasir-VQA-x1-train.jsonl
Valid file:  Kvasir-VQA-x1/Kvasir-VQA-x1-test.jsonl
Hub repo:    Kvasir-VQA-x1-lora_250812-1155
📝 You can find training logs after the training starts at: https://wandb.ai/home
📌 After each validation stage, the HF repository will be updated with the best model.
✅ Model will be available at: https://huggingface.co/SushantGautam/Kvasir-VQA-x1-lora_250812-1155



T4-friendly defaults for 3B:
- `bnb` 4-bit quantization (nf4 + double quant)
- `per_device_train_batch_size=4` (adjust if OOM)
- `gradient_accumulation_steps=4` (effective batch size 4 × 4 = 16 on a single GPU)
- `freeze_vit=true`, `gradient_checkpointing=true`

> Increase batch size and/or `num_train_epochs` if you have more VRAM.

See https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html for all supported training parameters. Play with them to get the best results.
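
To make the batch-size arithmetic in the defaults above concrete, here is a minimal sketch of how the effective batch size, the number of optimizer steps, and the warmup length relate. The sample count is a placeholder rather than the real dataset size, both helpers are hypothetical, and the trainer's own step count may differ slightly because of rounding.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * n_gpus


def optimizer_steps(n_samples: int, per_device: int, grad_accum: int,
                    epochs: int = 1, n_gpus: int = 1) -> int:
    """Approximate number of optimizer steps for a whole run."""
    per_epoch = math.ceil(n_samples / effective_batch_size(per_device, grad_accum, n_gpus))
    return per_epoch * epochs


N_SAMPLES = 10_000  # placeholder, not the real size of the training split

steps = optimizer_steps(N_SAMPLES, per_device=4, grad_accum=4)
warmup = math.ceil(0.03 * steps)  # warmup_ratio 0.03, as in the defaults above
print("effective batch:", effective_batch_size(4, 4))
print("optimizer steps:", steps, "| warmup steps:", warmup)
```

Replacing `N_SAMPLES` with the line count of the training JSONL gives a rough idea of how often a `save_steps` interval will actually trigger within one epoch.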

In [None]:
# Training command
# To validate on the full test split instead, pass "$VAL_PATH" to --val_dataset
!swift sft \
--dataset "$TRAIN_PATH" \
--val_dataset "$VAL_1000_PATH" \
--model "$MODEL_NAME" \
--max_length 512 \
--train_type lora \
--torch_dtype float16 \
--quant_method bnb --quant_bits 4 \
--bnb_4bit_compute_dtype float16 \
--bnb_4bit_quant_type nf4 \
--bnb_4bit_use_double_quant true \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0.01 \
--lora_rank 16 --lora_alpha 32 \
--freeze_vit true \
--gradient_checkpointing true \
--load_best_model_at_end True \
--metric_for_best_model eval_token_acc \
--greater_is_better True \
--save_steps 1000 \
--save_total_limit 2 \
--logging_steps 20 \
--output_dir output_Kvasir-VQA-x1 \
--use_hf true \
--push_to_hub true \
--hub_token  "$(cat ~/.cache/huggingface/token)" \
--hub_model_id "$HUB_MODEL_ID" \
--report_to wandb \
--dataloader_num_workers 2 \
--dataset_num_proc 2 \
# --resume_from_checkpoint output_Kvasir-VQA-x1/checkpoint-<LAST_STEP>

run sh: `/usr/bin/python3 /usr/local/lib/python3.11/dist-packages/swift/cli/sft.py --dataset Kvasir-VQA-x1/Kvasir-VQA-x1-train.jsonl --val_dataset Kvasir-VQA-x1/Kvasir-VQA-x1-test-1000.jsonl --model google/paligemma-3b-pt-224 --max_length 512 --train_type lora --torch_dtype float16 --quant_method bnb --quant_bits 4 --bnb_4bit_compute_dtype float16 --bnb_4bit_quant_type nf4 --bnb_4bit_use_double_quant true --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4 --learning_rate 2e-5 --lr_scheduler_type linear --warmup_ratio 0.03 --weight_decay 0.01 --lora_rank 16 --lora_alpha 32 --freeze_vit true --gradient_checkpointing true --load_best_model_at_end True --metric_for_best_model eval_token_acc --greater_is_better True --save_steps 1000 --save_total_limit 2 --logging_steps 20 --output_dir output_Kvasir-VQA-x1 --use_hf true --push_to_hub true --hub_model_id Kvasir-VQA-x1-lora_250812-1155 --report_to wandb --dataloader_num_workers 


## 🔬 Inference Sanity Check
Load the LoRA-adapted model with ms-swift's `PtEngine` and run it on a few samples from the validation file.


In [None]:
from swift.llm import PtEngine, RequestConfig, InferRequest
import json, random
import gc
import torch
from PIL import Image

# Release any cached GPU memory held by this notebook process before loading the model
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

# The adapter repo pushed by the training run above, e.g. "SushantGautam/Kvasir-VQA-x1-lora-XXXX"
ADAPTERS = f"{HF_USER}/{HUB_MODEL_ID}"
print(f"Try to load model from: https://huggingface.co/{ADAPTERS} as an adapter to {MODEL_NAME}")
engine = PtEngine(model_id_or_path=MODEL_NAME, adapters=[ADAPTERS], max_batch_size=2, use_hf=True, model_type="paligemma")

In [None]:
VAL_SAMPLES = 10

rcfg = RequestConfig(max_tokens=64, temperature=0)
gc.collect(); torch.cuda.empty_cache(); torch.cuda.ipc_collect()

choices = random.sample([json.loads(l) for l in open(VAL_PATH)], VAL_SAMPLES)
reqs = [InferRequest(messages=[{'role':'user','content':f"<image>{c['messages'][0]['content'].replace('<image>','').strip()}"}],
                     images=[c['images'][0]]) for c in choices]

for c, r in zip(choices, engine.infer(reqs, rcfg)):
    question = c['messages'][0]['content'].replace('<image>', '').strip()
    real_answer = c['messages'][1]['content']
    pred_answer = r.choices[0].message.content

    print("\nQ:", question)
    display(Image.open(c['images'][0]).resize((256,256)))
    print("Pred:", pred_answer, "\nReal:", real_answer)
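
Eyeballing predictions works for a handful of samples; for a quick numeric impression, a normalized exact match and a token-overlap F1 can be computed per pair. This is only a sketch with hypothetical helper names, and it is not the official challenge metric.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(pred: str, ref: str) -> bool:
    """True if prediction and reference are identical after normalization."""
    return normalize(pred) == normalize(ref)


def token_f1(pred: str, ref: str) -> float:
    """Bag-of-words F1 between the normalized prediction and reference."""
    p, r = normalize(pred).split(), normalize(ref).split()
    if not p or not r:
        return float(p == r)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

Inside the loop above, `exact_match(pred_answer, real_answer)` and `token_f1(pred_answer, real_answer)` could be printed next to each prediction, or averaged over a larger sample.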

## Submitting to the competition
To submit this model, add a new file named `submission_task1.py` to the root of your submission repo and fill it in with your details, following the instructions at https://github.com/simula/MediaEval-Medico-2025/blob/main/README.md#-submission-system.




## 🧠 Tips & Tuning
- If you hit **CUDA OOM**:
  - Lower `per_device_train_batch_size` to 2 (or 1) and increase `gradient_accumulation_steps`.
  - Lower `max_length` to 384 or 256.
  - Ensure `freeze_vit=true` and `bnb` 4-bit is enabled.
- If training is too slow, reduce dataset size temporarily for prototyping.
- Increase `num_train_epochs` to 2–3 for better results if time allows.
- For different VLMs, change `--model` to any supported multimodal model (see SWIFT docs).
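
The first OOM tip can be sketched as a small fallback rule: halve the per-device batch size and scale up gradient accumulation so that their product, the effective batch size, stays the same. `shrink_batch` is a hypothetical helper; the flag names printed are the ones used in the training command.

```python
import math


def shrink_batch(per_device: int, grad_accum: int) -> tuple[int, int]:
    """Halve the per-device batch size and raise accumulation to keep the effective batch size."""
    if per_device <= 1:
        raise ValueError("per-device batch size is already 1; lower max_length instead")
    new_per_device = per_device // 2
    new_grad_accum = math.ceil(per_device * grad_accum / new_per_device)
    return new_per_device, new_grad_accum


cfg = (4, 4)  # the defaults used in this notebook
while cfg[0] > 1:
    cfg = shrink_batch(*cfg)
    print(f"--per_device_train_batch_size {cfg[0]} --gradient_accumulation_steps {cfg[1]}")
```

Each step trades memory for time: peak activation memory drops with the smaller per-device batch, while every optimizer step needs more forward and backward passes.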



---

### ✅ You’re done!
You can now use your pushed model in other notebooks or pipelines, or extend this setup for **Subtask 2** (explanations) by adding structured outputs (text / visual evidence). Good luck! 🍀


Don't hesitate to contact the organizers with any questions or requests for help.
https://github.com/simula/MediaEval-Medico-2025/blob/main/README.md#-organizers

