# DPO Training Test: Ministral (Vision)

Tests Direct Preference Optimization (DPO) with Unsloth on Ministral-3B using vision mode.

**Model Variant:** Vision (FastVisionModel)
**Expected Result:** NOT SUPPORTED - DPOTrainer has compatibility issues with Ministral's multimodal architecture

**Key features tested:**
- FastVisionModel loading with 4-bit quantization
- LoRA adapter configuration
- DPOTrainer with vision data
- Post-training inference verification

**DPO Overview:**
DPO learns from preference pairs (chosen vs rejected responses) without an explicit reward model. It directly optimizes the policy using the Bradley-Terry preference model.

**Known Issue:**
DPOTrainer in TRL does not properly support Mistral3ForConditionalGeneration (multimodal) models. The tokenization process hangs or fails due to image processing incompatibility.

**Recommendation:** For preference learning with Ministral, consider:
1. Using SFT with curated high-quality data instead
2. Using GRPO with a reward function (works well - see 04_GRPO_Training_Ministral_Vision.ipynb)
3. Using a pure text model like Qwen3-4B for DPO

**Important:** This notebook includes a kernel shutdown cell at the end to release all GPU memory.

In [1]:
# Environment check only (no model loading needed since DPO is not supported)
import torch

gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: PyTorch {torch.__version__}, {gpu}")
print()
print("Note: DPO training is NOT SUPPORTED for Ministral models.")
print("This notebook documents the incompatibility rather than attempting training.")

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.

  if is_vllm_available():

ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes

In [2]:
# Model loading - SKIPPED
# FastVisionModel loading is skipped since DPO doesn't work with Ministral

print("Model loading: SKIPPED")
print("Reason: DPOTrainer has compatibility issues with Ministral's multimodal architecture")


Loading Ministral-3-3B-Reasoning-2512 with FastVisionModel...==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Loading weights:   0%|          | 0/458 [00:00<?, ?it/s]

Model loaded: Mistral3ForConditionalGeneration

In [3]:
# LoRA adapters - SKIPPED
print("LoRA adapter configuration: SKIPPED")
print("Reason: No training will be performed")

Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradientsLoRA applied: 33,751,040 trainable / 2,160,030,720 total (1.56%)

In [6]:
# Dataset creation - SKIPPED
print("Dataset creation: SKIPPED")
print("Reason: DPO training is not supported for Ministral")

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Dataset created: 5 preference pairs with images

In [None]:
# DPO Training with Vision Model - NOT SUPPORTED
# DPOTrainer has compatibility issues with Ministral's multimodal architecture

print("=" * 60)
print("DPO Training: NOT SUPPORTED for Ministral (Vision)")
print("=" * 60)
print()
print("DPOTrainer in TRL has compatibility issues with")
print("Mistral3ForConditionalGeneration (multimodal) models:")
print()
print("- Text-only mode: Requires 'images' field but can't process empty lists")
print("- Vision mode: Tokenization process hangs during image processing")
print()
print("Alternatives for preference learning with Ministral:")
print("1. SFT with curated high-quality data (03_SFT_Training_Ministral_*.ipynb)")
print("2. GRPO with reward function (04_GRPO_Training_Ministral_*.ipynb) - WORKS!")
print("3. Use Qwen3-4B or Mistral-7B for text-only DPO")
print()

DPO_VISION_SUPPORTED = False



Attempting DPO training with FastVisionModel...
Note: This is experimental - DPO may not natively support vision models

Extracting prompt in train dataset (num_proc=5):   0%|          | 0/5 [00:00<?, ? examples/s]

In [None]:
# Summary banner (no inference needed since training not supported)
print("=" * 60)
print("DPO Training: NOT SUPPORTED for Ministral (Vision)")
print("Reason: TRL DPOTrainer incompatible with multimodal architecture")
print("Alternative: Use GRPO (04_GRPO_Training_Ministral_Vision.ipynb)")
print("=" * 60)

## Test Complete

The DPO Training Pipeline test for Ministral (Vision) has completed. The kernel will now shut down to release all GPU memory.

### What Was Tested
- FastVisionModel loading with 4-bit quantization (Ministral-3B)
- LoRA adapter configuration (vision + language layers)
- DPOTrainer with vision model (experimental)
- Post-training vision inference

### Vision DPO Notes
- DPOTrainer may not have native vision preference support
- Text preferences were used with vision model architecture
- Vision inference after training still works

### Comparison with Text-Only
| Aspect | Text-Only | Vision |
|--------|-----------|--------|
| Model Class | FastLanguageModel | FastVisionModel |
| DPO Support | Native | Experimental |
| Preference Data | Text pairs | Text pairs (vision model) |

In [None]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)