# Qwen3-VL-2B GRPO Training: UI Screenshot → HTML/CSS

Train a vision-language model (VLM) to generate HTML with Tailwind CSS from website screenshots using Group Relative Policy Optimization (GRPO).

- **Model**: Qwen3-VL-2B-Instruct (LoRA)
- **Dataset**: WebSight v0.2 (1% subset)
- **Rewards**: format (0.1), HTML validity (0.2), visual fidelity/SSIM (0.5), structural similarity (0.2)
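
The four rewards are combined as a weighted sum per completion. The sketch below illustrates that arithmetic only; the dictionary keys, the `combined_reward` helper, and the example scores are made up for illustration, and each component is assumed to return a value in [0, 1].

```python
# Weights as listed above: format, HTML validity, visual fidelity, structure.
REWARD_WEIGHTS = {
    "format": 0.1,
    "html_validity": 0.2,
    "visual_fidelity": 0.5,
    "structural_similarity": 0.2,
}


def combined_reward(scores, weights=REWARD_WEIGHTS):
    """Weighted sum of per-component scores (each assumed to lie in [0, 1])."""
    return sum(weights[name] * scores[name] for name in weights)


# Illustrative scores for a single completion (not measured values).
example_scores = {
    "format": 1.0,
    "html_validity": 1.0,
    "visual_fidelity": 0.4,
    "structural_similarity": 0.5,
}
total = combined_reward(example_scores)
print(f"combined reward: {total:.2f}")
```

Because the weights sum to 1.0, the combined reward stays in [0, 1] whenever every component does, and visual fidelity alone accounts for half of it.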

In [None]:
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model
from trl import GRPOTrainer, GRPOConfig

from vcoder import (
    format_reward,
    html_validity_reward,
    visual_fidelity_reward,
    structural_similarity_reward,
)
from vcoder.data.websight import load_websight_dataset
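
The reward functions imported from `vcoder` are not shown in this notebook. As a rough guide to the interface TRL expects, a custom reward function receives the batch of completions (plus other columns as keyword arguments) and returns one float per completion. The function below is a hypothetical stand-in, not the `vcoder` implementation; its name and its regex-based check are assumptions.

```python
import re


def example_format_reward(completions, **kwargs):
    """Hypothetical reward: 1.0 for a complete <html>...</html> document, else 0.0."""
    rewards = []
    for completion in completions:
        # Conversational datasets pass each completion as a list of message
        # dicts; plain-text datasets pass a string.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        match = re.search(r"<html\b.*?</html>", text, flags=re.DOTALL | re.IGNORECASE)
        rewards.append(1.0 if match else 0.0)
    return rewards


print(example_format_reward(["<html><body></body></html>", "no markup here"]))
```

Accepting `**kwargs` lets the function ignore extra inputs such as prompts or dataset columns that it does not need.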

## 1. Load Dataset

In [None]:
model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")

train_dataset = load_websight_dataset(processor, split="train[:1%]")
print(f"Training samples: {len(train_dataset)}")
print(train_dataset[0])

## 2. Load Model + LoRA

In [None]:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
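
To get a feel for the numbers `print_trainable_parameters()` reports, the sketch below works through the LoRA arithmetic for a single projection. The helper name is hypothetical and the hidden size of 2048 is a placeholder, not a verified Qwen3-VL dimension; only `r=16` and `lora_alpha=32` come from the config above.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 16, 32
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

# Illustrative square projection; 2048 is a placeholder dimension.
hidden = 2048
per_projection = lora_extra_params(hidden, hidden, r)
full_projection = hidden * hidden

print(f"scaling: {scaling}")
print(f"LoRA params per projection: {per_projection}")
print(f"fraction of the frozen weight: {per_projection / full_projection:.4%}")
```

With these placeholder dimensions each adapted projection trains under 2% as many parameters as the frozen weight matrix it modifies.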

## 3. Configure GRPO Training

In [None]:
training_args = GRPOConfig(
    output_dir="Qwen3-VL-2B-HTMLCSS",
    learning_rate=5e-6,
    remove_unused_columns=False,
    num_train_epochs=1,
    bf16=True,

    # Batch / generation parameters
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_completion_length=2048,
    num_generations=4,
    max_prompt_length=4096,

    # Reporting and saving
    report_to=["tensorboard"],
    logging_steps=1,
    save_strategy="steps",
    save_steps=25,
    save_total_limit=5,
)
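
The batch settings interact with `num_generations`. The sketch below works through the arithmetic under two assumptions that are not stated in the notebook: a single training process, and a recent TRL version in which the batch size counts completions rather than prompts.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_generations = 4
num_processes = 1  # assumption: single GPU / single process

# Completions consumed per optimizer step.
completions_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)

# Each prompt contributes a full group of num_generations completions.
assert completions_per_step % num_generations == 0, "must be divisible by num_generations"
prompts_per_step = completions_per_step // num_generations

# Upper bound on sequence length: max_prompt_length + max_completion_length.
max_tokens_per_sample = 4096 + 2048

print(completions_per_step, prompts_per_step, max_tokens_per_sample)
```

Under these assumptions each optimizer step sees 16 completions drawn from 4 distinct screenshots.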

## 4. Initialize Trainer and Train

In [None]:
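# GRPOTrainer does not take a reward_weights argument; the weights are a
# GRPOConfig field, so set them on training_args (same order as reward_funcs).
training_args.reward_weights = [0.1, 0.2, 0.5, 0.2]

trainer = GRPOTrainer(
    model=model,
    processing_class=processor,
    reward_funcs=[
        format_reward,
        html_validity_reward,
        visual_fidelity_reward,
        structural_similarity_reward,
    ],
    args=training_args,
    train_dataset=train_dataset,
)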

trainer.train()
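
During training, GRPO scores each completion relative to the other completions sampled for the same screenshot rather than against a learned value function. The sketch below shows that group-relative normalization in plain Python. It is a simplified illustration, not TRL's code: the function name and `eps` value are assumptions, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative combined rewards for num_generations=4 completions of one prompt.
group_rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

If all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason a reward with fine-grained variation such as SSIM is useful here.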

## 5. Save Final Adapter

In [None]:
trainer.save_model("Qwen3-VL-2B-HTMLCSS/final")
processor.save_pretrained("Qwen3-VL-2B-HTMLCSS/final")
print("Training complete. Adapter saved to Qwen3-VL-2B-HTMLCSS/final")
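
To use the saved adapter later, it can be attached to a freshly loaded base model. The helper below is a minimal sketch, assuming the paths used in this notebook and that `peft` and `transformers` are installed; the function name `load_finetuned` is hypothetical. Imports are placed inside the function so that defining it has no side effects.

```python
def load_finetuned(
    adapter_dir="Qwen3-VL-2B-HTMLCSS/final",
    base_id="Qwen/Qwen3-VL-2B-Instruct",
):
    """Load the base model, attach the saved LoRA adapter, and return both parts."""
    import torch
    from peft import PeftModel
    from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

    base = Qwen3VLForConditionalGeneration.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    # Wrap the frozen base model with the trained adapter weights.
    model = PeftModel.from_pretrained(base, adapter_dir)
    # The processor was saved alongside the adapter in the cell above.
    processor = AutoProcessor.from_pretrained(adapter_dir)
    return model.eval(), processor
```

Calling `model, processor = load_finetuned()` downloads the base weights if they are not cached, so it is best run on the same machine used for training.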

## 6. TensorBoard

Launch TensorBoard to monitor training:
```bash
tensorboard --logdir=./Qwen3-VL-2B-HTMLCSS/runs --host=0.0.0.0 --port=6006
```

Port forward from local machine:
```bash
ssh -L 6006:localhost:6006 cn17-dgx -p 4422
```

Key metrics to watch:
- `reward`: overall weighted average reward; should trend upward
- `rewards/visual_fidelity_reward/mean`: the most important signal, since it carries the largest weight (0.5)
- `rewards/format_reward/mean`: should converge quickly to ~0.8-1.0
- `completions/clipped_ratio`: share of completions cut off before finishing; should stay below 0.2
- `completions/mean_length`: expect 500-1500 tokens
- `entropy`: should decrease gradually
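
The two completion-length metrics can be approximated from a list of completion lengths. The sketch below is an illustration only: the helper name and sample lengths are made up, and treating any completion that reaches `max_completion_length` as clipped is an approximation of how the trainer decides a completion was truncated.

```python
def completion_stats(lengths, max_completion_length=2048):
    """Approximate clipped ratio and mean length from completion token counts."""
    clipped = sum(1 for n in lengths if n >= max_completion_length)
    return {
        "clipped_ratio": clipped / len(lengths),
        "mean_length": sum(lengths) / len(lengths),
    }


# Illustrative token counts for one group of four completions.
stats = completion_stats([600, 900, 1500, 2048])

# Thresholds taken from the list above.
healthy = stats["clipped_ratio"] < 0.2 and 500 <= stats["mean_length"] <= 1500
print(stats, healthy)
```

In this made-up group one of four completions hits the length limit, so the clipped ratio exceeds the 0.2 guideline even though the mean length is in range.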