# Unsloth Vision Training Verification (Pixtral)

This notebook runs a short smoke test of the vision model fine-tuning pipeline with Pixtral-12B (2 training steps on 5 samples), covering:
- FastVisionModel loading
- LoRA adapter configuration
- Dataset loading and formatting
- SFTTrainer training loop
- Inference after training

**Model:** `unsloth/pixtral-12b-2409-bnb-4bit` (pre-quantized 4-bit)

**Important:** This notebook includes a kernel shutdown cell at the end
to release all GPU memory after the vision training test.

In [3]:
# Environment Setup
from dotenv import load_dotenv
import os
load_dotenv()
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
import transformers
import vllm
import trl
import torch

print(f"unsloth: {unsloth.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"TRL: {trl.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

HF_TOKEN loaded: Yes


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!


unsloth: 2025.12.10
transformers: 5.0.0rc1
vLLM: 0.14.0rc1.dev201+gadcf682fc
TRL: 0.26.2
PyTorch: 2.9.1+cu130
CUDA: True
GPU: NVIDIA GeForce RTX 4080 SUPER


## Pixtral VL (Vision) Training Verification

This section tests the complete vision model fine-tuning pipeline:
- FastVisionModel loading (Pixtral-12B pre-quantized)
- LoRA adapter configuration
- Dataset loading and formatting
- SFTTrainer training loop
- Inference after training

In [4]:
# Complete Vision Pipeline Test (self-contained)
# Tests: Model loading, LoRA, Dataset, Training (2 steps), Inference
print("=== Vision Training Pipeline Test (Pixtral-12B) ===")

from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# 1. Load model (pre-quantized 4-bit)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/pixtral-12b-2409-bnb-4bit",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
print(f"✓ FastVisionModel loaded: {type(model).__name__}")

# 2. Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"✓ LoRA applied ({trainable:,} trainable params)")

# 3. Load dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:5]")
instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]}
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]}
            ]}
        ]
    }

converted_dataset = [convert_to_conversation(s) for s in dataset]
print(f"✓ Dataset loaded ({len(converted_dataset)} samples)")

# 4. Train (2 steps)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        max_steps=2,
        warmup_steps=0,
        learning_rate=2e-4,
        logging_steps=1,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        output_dir="outputs_pixtral_vl_test",
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=1024,
    ),
)
trainer_stats = trainer.train()
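# Format the loss only when it is present; applying ":.4f" to the "N/A" fallback would raise
train_loss = trainer_stats.metrics.get("train_loss")
loss_text = f"{train_loss:.4f}" if train_loss is not None else "N/A"
print(f"✓ Training completed (loss: {loss_text})")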

# 5. Inference test
FastVisionModel.for_inference(model)
test_image = dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
# generate() returns prompt + new tokens, so the output must be longer than the prompt
assert output.shape[1] > inputs["input_ids"].shape[1], "generate() produced no new tokens"
print("✓ Inference test passed")
print("✓ Vision Training Pipeline test PASSED")

=== Vision Training Pipeline Test (Pixtral-12B) ===


==((====))==  Unsloth 2025.12.10: Fast Llava patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



The tokenizer you are loading from 'unsloth/pixtral-12b-2409-bnb-4bit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


✓ FastVisionModel loaded: LlavaForConditionalGeneration


Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients


✓ LoRA applied (66,060,288 trainable params)


warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


✓ Dataset loaded (5 samples)


The model is already on multiple devices. Skipping the move to device specified in `args`.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5 | Num Epochs = 1 | Total steps = 2
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 2 x 1) = 2
 "-____-"     Trainable parameters = 66,060,288 of 12,748,800,000 (0.52% trained)



✓ Training completed (loss: 3.2785)


✓ Inference test passed
✓ Vision Training Pipeline test PASSED
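
The training banner in the output above reports the trainable share as 0.52%. As a quick cross-check, that figure follows directly from the two parameter counts printed in the same banner. The helper name `trainable_percent` is illustrative and not part of Unsloth.

```python
def trainable_percent(trainable: int, total: int) -> float:
    """Return the trainable share of parameters as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * trainable / total


# Counts copied from the training banner in the output above.
pct = trainable_percent(66_060_288, 12_748_800_000)
print(f"{pct:.2f}% of parameters are trainable")
```

A share this small is what keeps LoRA fine-tuning of a 12B model within reach of a 16 GB card when the base weights are loaded in 4-bit.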


## Test Complete

The Vision Training Pipeline test has completed. The kernel will now shut down to release all GPU memory.

### What Was Verified
- FastVisionModel loading with Pixtral-12B (pre-quantized 4-bit)
- LoRA adapter application (vision + language layers)
- Dataset loading and conversation formatting
- SFTTrainer training loop (2 steps)
- Post-training inference
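
The conversation formatting step can also be checked without a GPU. The sketch below validates the message layout that `convert_to_conversation` produces for training samples: a user turn with text and image parts, followed by an assistant turn with text. `validate_conversation` is a hypothetical helper, and the rules it enforces are assumptions read off this notebook's cell, not a schema published by Unsloth or TRL.

```python
def validate_conversation(sample: dict) -> list[str]:
    """Return a list of problems found in one converted sample (empty if none)."""
    problems = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, message in enumerate(messages):
        if message.get("role") not in {"user", "assistant"}:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, list) or not content:
            problems.append(f"message {i}: 'content' must be a non-empty list")
            continue
        for j, part in enumerate(content):
            kind = part.get("type")
            if kind not in {"text", "image"}:
                problems.append(f"message {i}, part {j}: unknown type {kind!r}")
            elif part.get(kind) is None:
                # Each part stores its payload under a key named after its type.
                problems.append(f"message {i}, part {j}: missing {kind!r} payload")
    return problems


# A stand-in object replaces the PIL image so the check runs anywhere.
example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Write the LaTeX representation for this image."},
            {"type": "image", "image": object()},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": "x^2"}]},
    ]
}
print(validate_conversation(example) or "sample looks well-formed")
```

In the notebook, the same check could be run over every entry of `converted_dataset` before the trainer is built, so a malformed sample fails early instead of inside the data collator.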

### Ready for Production
If this test passed, the environment can load, train, and run inference with Pixtral-12B end to end, so it is ready for:
- `03_SFT_Training_Pixtral_Vision.ipynb` - Full vision fine-tuning
- `04_GRPO_Training_Pixtral_Vision.ipynb` - GRPO reinforcement learning
- `07_RLOO_Training_Pixtral_Vision.ipynb` - RLOO reinforcement learning

In [3]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

{'status': 'ok', 'restart': False}
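
Shutting the kernel down is the most thorough way to return GPU memory. When the session has to stay alive, a lighter option is to drop references and clear PyTorch's caching allocator. The following is a minimal sketch under that assumption. `release_gpu_memory` is a hypothetical helper, and it only frees memory that no live Python object still references, so something like `del model, trainer` has to come first.

```python
import gc


def release_gpu_memory() -> bool:
    """Run garbage collection and clear the CUDA cache if PyTorch can see a GPU.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # collect unreachable objects, including reference cycles
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return True


cleared = release_gpu_memory()
print(f"CUDA cache cleared: {cleared}")
```

This approach may leave some memory allocated (for example the CUDA context itself), which is why the notebook's kernel shutdown remains the safer choice between large tests.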