In [1]:
!pip -q install evaluate  torch transformers datasets tqdm unsloth

In [1]:
import os
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
import gc
import torch
import tqdm
from transformers import  AutoProcessor
from datasets import load_dataset
from PIL import Image
from evaluate import load
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!


### Baseline Evaluation

I will be using Qwen2-2B-Instruct Vision Language Model, first i calculate the BLEU score of the vanilla model on 100 examples from the websight dataset.

In [None]:
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="Qwen/Qwen2-VL-2B-Instruct", 
    dtype=torch.bfloat16,
    load_in_4bit=False,
)

FastVisionModel.for_inference(model)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

dataset = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)
subset = []
for i, sample in enumerate(dataset):
    if i >= 100:
        break
    subset.append(sample)

def preprocess_image(img):
    return img.convert("RGB").resize((224, 224))

images = [preprocess_image(s["image"]) for s in subset]
references = [s["text"] for s in subset]

messages_batch = [
    [{
        "role": "user",
        "content": [
            {"type": "image", "image": img},
            {"type": "text", "text": (
                "Generate only the HTML (with Tailwind CSS) for this UI. "
                "For images, always use this pattern: "
                "https://source.unsplash.com/random/WxH/?keyword "
                "(replace W, H, and keyword appropriately). "
            )}
        ],
    }]
    for img in images
]

texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
         for msg in messages_batch]

batch_size = 4
predictions = []

for i in tqdm.tqdm(range(0, len(subset), batch_size)):
    batch_imgs = images[i:i + batch_size]
    batch_txts = texts[i:i + batch_size]

    with torch.no_grad():
        inputs = processor(text=batch_txts, images=batch_imgs, padding=True, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=512, use_cache=False)
        
        for j in range(len(batch_imgs)):
            generated = processor.decode(
                outputs[j][inputs.input_ids[j].shape[0]:],  
                skip_special_tokens=True
            )
            predictions.append(generated)

    del inputs, outputs
    if i % (10 * batch_size) == 0:
        torch.cuda.empty_cache()
        gc.collect()

==((====))==  Unsloth 2025.11.1: Fast Qwen2_Vl patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.278 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors:   0%|          | 0.00/4.42G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/238 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/572 [00:00<?, ?B/s]

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

chat_template.json: 0.00B [00:00, ?B/s]

preprocessor_config.json:   0%|          | 0.00/347 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

chat_template.json: 0.00B [00:00, ?B/s]

README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/738 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/738 [00:00<?, ?it/s]

100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 25/25 [37:43<00:00, 90.55s/it]


In [4]:
bleu = load("bleu")

references_formatted = [[ref] for ref in references]  

results = bleu.compute(predictions=predictions, references=references_formatted)

print(f"\nBLEU score: {results['bleu']:.4f}")
print("Precisions:", results["precisions"])
print("Brevity penalty:", results["brevity_penalty"])
print("Length ratio:", results["length_ratio"])
print("Translation length:", results["translation_length"])
print("Reference length:", results["reference_length"])

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]


BLEU score: 0.4115
Precisions: [0.6047577364708305, 0.47449796086934076, 0.39389783137518763, 0.33511493954646876]
Brevity penalty: 0.932848581380388
Length ratio: 0.9350055365654036
Translation length: 38842
Reference length: 41542


The BLEU score is **0.4115** without any fine-tuning.

### **LoRA Fine-Tuning**

I apply **rsLoRA** fine-tuning on the model with the following setup:
- **Rank:** 16  
- **Alpha:** 32  
- **Data:** 1,500 examples (indices 100â€“1600) from the dataset  
- **Epochs:** 2
- **Layers:** I applied Lora on all layers(vision, language, attention, mlp)

In [None]:
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    random_state=3407,
    use_rslora=True,
)

dataset = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)
subset = [s for _, s in zip(range(100,1600), dataset)]

def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate only the HTML (with Tailwind CSS) for this UI. For images, use https://source.unsplash.com/random/WxH/?keyword"},
                {"type": "image", "image": sample["image"]}
            ]
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": sample["text"]}  
            ]
        }
    ]
    return {"messages": conversation}

converted_dataset = [convert_to_conversation(sample) for sample in subset]

FastVisionModel.for_training(model)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=2,  
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
        
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
    ),
)

trainer.train()

model.save_pretrained("lora_qwen_websight_unsloth")
tokenizer.save_pretrained("lora_qwen_websight_unsloth")

==((====))==  Unsloth 2025.11.1: Fast Qwen2_Vl patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.278 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.


Unsloth: Making `model.base_model.model.model.visual` require gradients


Resolving data files:   0%|          | 0/738 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/738 [00:00<?, ?it/s]

Unsloth: Model does not have a default image size - using 512


The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,500 | Num Epochs = 2 | Total steps = 188
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 28,950,528 of 2,237,936,128 (1.29% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.455
2,0.5334
3,0.4548
4,0.3531
5,0.3313
6,0.2282
7,0.1838
8,0.1773
9,0.1817
10,0.1425


[]

Running inference again on the same 100 examples to calculate BLEU score of the fine tuned model.

In [None]:
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="lora_qwen_websight_unsloth", 
    dtype=torch.bfloat16,
    load_in_4bit=False,
)

FastVisionModel.for_inference(model)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

dataset = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)
subset = []
for i, sample in enumerate(dataset):
    if i >= 100:
        break
    subset.append(sample)

def preprocess_image(img):
    return img.convert("RGB").resize((224, 224))

images = [preprocess_image(s["image"]) for s in subset]
references = [s["text"] for s in subset]

messages_batch = [
    [{
        "role": "user",
        "content": [
            {"type": "image", "image": img},
            {"type": "text", "text": (
                "Generate only the HTML (with Tailwind CSS) for this UI. "
                "For images, always use this pattern: "
                "https://source.unsplash.com/random/WxH/?keyword "
                "(replace W, H, and keyword appropriately). "
            )}
        ],
    }]
    for img in images
]

texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
         for msg in messages_batch]

batch_size = 4
predictions = []

for i in tqdm.tqdm(range(0, len(subset), batch_size)):
    batch_imgs = images[i:i + batch_size]
    batch_txts = texts[i:i + batch_size]

    with torch.no_grad():
        inputs = processor(text=batch_txts, images=batch_imgs, padding=True, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=512, use_cache=False)
        
        for j in range(len(batch_imgs)):
            generated = processor.decode(
                outputs[j][inputs.input_ids[j].shape[0]:],  
                skip_special_tokens=True
            )
            predictions.append(generated)

    del inputs, outputs
    if i % (10 * batch_size) == 0:
        torch.cuda.empty_cache()
        gc.collect()

==((====))==  Unsloth 2025.11.1: Fast Qwen2_Vl patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.278 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors:   0%|          | 0.00/4.42G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/238 [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/738 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/738 [00:00<?, ?it/s]

100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 25/25 [46:42<00:00, 112.11s/it]


In [None]:
bleu = load("bleu")
references_formatted = [[ref] for ref in references]  
results = bleu.compute(predictions=predictions, references=references_formatted)

print(f"\nBLEU score: {results['bleu']:.4f}")
print("Precisions:", results["precisions"])
print("Brevity penalty:", results["brevity_penalty"])
print("Length ratio:", results["length_ratio"])
print("Translation length:", results["translation_length"])
print("Reference length:", results["reference_length"])


BLEU score: 0.5895
Precisions: [0.7840160936356986, 0.6772055741827326, 0.5994852400462234, 0.5338108278912997]
Brevity penalty: 0.9182114102907166
Length ratio: 0.9213807712676327
Translation length: 38276
Reference length: 41542


The BLEU score of the fine-tuned model is **0.5895**. **rsLoRA** achieved approximately **18% improvement**.

### **Training Details**

- The notebook took around **2 hours** to run. I used **L4** GPU for complete code , i tried getting the **A100** GPU but it always took a lot of time to load and kept freezing and crashing ; this was probably due to its high demand so I had to stick with **L4**.
- **Unsloth** was used for fine-tuning : it was quite faster than training purely on Hugging Face.  

