# DPO Training Test: Ministral (Text-Only)

Tests Direct Preference Optimization (DPO) with Unsloth on Ministral-3B using text-only mode.

**Model Variant:** Text-only (FastLanguageModel)
**Expected Result:** NOT SUPPORTED - Ministral's multimodal architecture requires images for DPO

**Key features tested:**
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- DPOTrainer with synthetic preference pairs
- Post-training inference verification

**DPO Overview:**
DPO learns from preference pairs (chosen vs rejected responses) without an explicit reward model. It directly optimizes the policy using the Bradley-Terry preference model.
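
The per-pair objective can be sketched in plain Python. This is a minimal illustration, assuming the summed log-probabilities of each response under the policy and under a frozen reference model are already available. The function name `dpo_loss` and the example numbers are hypothetical and not taken from TRL.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) in a stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Illustrative log-probabilities (assumed values, not measured)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy equals reference
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))   # policy favors the chosen response
```

When the policy matches the reference the margin is zero and the loss equals log 2. Raising the chosen response's likelihood relative to the rejected one lowers the loss.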

**Key Differences from Qwen:**
- Uses `unsloth/Ministral-3-3B-Reasoning-2512` (multimodal architecture)
- Chat template uses multimodal format: `{"type": "text", "text": "..."}`
- **DPO with text-only data is NOT SUPPORTED** due to Mistral3ForConditionalGeneration architecture

**Recommendation:** For text-only DPO, use a pure text model like Mistral-7B-Instruct or Qwen3-4B. For Ministral DPO, use the Vision variant with image data.

**Important:** This notebook includes a kernel shutdown cell at the end to release all GPU memory.

In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

import torch
from trl import DPOConfig, DPOTrainer
from datasets import Dataset

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!


Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes


In [2]:
# Load Ministral-3B with 4-bit quantization (using FastLanguageModel for text-only)
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastLanguageModel (text-only mode)...")

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=512,
    load_in_4bit=True,
    dtype=None,
)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"Model loaded: {type(model).__name__}")


Loading Ministral-3-3B-Reasoning-2512 with FastLanguageModel (text-only mode)...


==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading weights:   0%|          | 0/458 [00:00<?, ?it/s]

Model loaded: Mistral3ForConditionalGeneration


In [3]:
# Apply LoRA adapters for DPO training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")

Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients


LoRA applied: 33,751,040 trainable / 2,160,030,720 total (1.56%)
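
The percentage above depends on how the total is counted. The training banner in this notebook reports the same trainable count against 3,882,841,088 parameters (0.87%). One plausible explanation, stated here as an assumption rather than a verified fact, is that 4-bit quantized weights are stored in packed form, so `numel()` undercounts them. The sketch below reproduces both percentages from the logged numbers and shows the usual count of parameters added by one LoRA adapter. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

trainable = 33_751_040
total_numel = 2_160_030_720      # sum of p.numel() printed by the LoRA cell
total_reported = 3_882_841_088   # total printed in the Unsloth training banner

pct_numel = f"{100 * trainable / total_numel:.2f}"
pct_reported = f"{100 * trainable / total_reported:.2f}"
print(f"Against numel() total: {pct_numel}%")
print(f"Against banner total:  {pct_reported}%")

# Example layer size (assumed for illustration, not read from the model)
print(lora_param_count(d_in=4096, d_out=4096, r=16))
```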


In [8]:
# Create minimal synthetic preference dataset (5 samples)
# Using Ministral's multimodal chat format for text-only content
from PIL import Image

preference_data = [
    {
        "prompt": "Explain recursion in programming.",
        "chosen": "Recursion is when a function calls itself with a simpler version of the problem, including a base case to stop infinite loops.",
        "rejected": "Recursion is just loops."
    },
    {
        "prompt": "What is an API?",
        "chosen": "An API (Application Programming Interface) is a set of protocols that allows different software applications to communicate with each other.",
        "rejected": "API is code."
    },
    {
        "prompt": "Describe version control.",
        "chosen": "Version control is a system that records changes to files over time, allowing you to recall specific versions and collaborate with others.",
        "rejected": "Version control saves files."
    },
    {
        "prompt": "What is a database?",
        "chosen": "A database is an organized collection of structured data stored electronically, typically managed by a database management system (DBMS).",
        "rejected": "A database stores stuff."
    },
    {
        "prompt": "Explain object-oriented programming.",
        "chosen": "Object-oriented programming (OOP) is a paradigm that organizes code into objects containing data (attributes) and behavior (methods).",
        "rejected": "OOP uses objects."
    },
]

# Create a small placeholder image for multimodal model (16x16 white square)
placeholder_image = Image.new('RGB', (16, 16), color='white')

# Format for DPO using Ministral's multimodal format
# Note: Ministral expects images field - use placeholder for text-only
def format_for_dpo(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": [{"type": "text", "text": sample["prompt"]}]}],
        tokenize=False,
        add_generation_prompt=True
    )
    return {
        "prompt": prompt,
        "chosen": sample["chosen"],
        "rejected": sample["rejected"],
        "images": [placeholder_image],  # Placeholder image for multimodal model
    }

dataset = Dataset.from_list(preference_data)
dataset = dataset.map(format_for_dpo)

print(f"Dataset created: {len(dataset)} preference pairs")

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Dataset created: 5 preference pairs
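
Before handing preference pairs to a trainer, it can help to check that every row carries the three text fields used above. The following is a minimal sketch in plain Python. `validate_preference_pairs` is a hypothetical helper, not part of `datasets` or TRL, and the specific checks are assumptions about what counts as a usable pair.

```python
def validate_preference_pairs(rows):
    """Return (index, problem) tuples; an empty list means the rows look usable."""
    problems = []
    for i, row in enumerate(rows):
        # Each of the three fields must be a non-empty string
        for key in ("prompt", "chosen", "rejected"):
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
        # A pair with identical responses carries no preference signal
        chosen, rejected = row.get("chosen"), row.get("rejected")
        if isinstance(chosen, str) and chosen == rejected:
            problems.append((i, "chosen and rejected are identical"))
    return problems

sample_rows = [
    {"prompt": "What is an API?", "chosen": "An API is a set of protocols.", "rejected": "API is code."},
    {"prompt": "What is a database?", "chosen": "Same text", "rejected": "Same text"},
]
print(validate_preference_pairs(sample_rows))
```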


In [10]:
# DPO Training Configuration (minimal steps for testing)
dpo_config = DPOConfig(
    output_dir="outputs_dpo_ministral_text_test",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=2,
    warmup_steps=0,
    learning_rate=5e-6,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    beta=0.1,
    max_length=512,
    max_prompt_length=256,
    seed=42,
)

print("Starting DPO training (2 steps)...")
print("Note: Ministral is a multimodal model - DPO requires special handling")
try:
    trainer = DPOTrainer(
        model=model,
        args=dpo_config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer_stats = trainer.train()
    print("DPO training completed!")
    DPO_TEXT_SUPPORTED = True
except Exception as e:
    print(f"DPO training issue: {e}")
    print("\nNote: Ministral's multimodal architecture (Mistral3ForConditionalGeneration)")
    print("requires images for DPO training. For text-only DPO, consider using")
    print("a pure text model like Mistral-7B-Instruct-v0.3 or Qwen3-4B.")
    DPO_TEXT_SUPPORTED = False



Starting DPO training (2 steps)...
Note: Ministral is a multimodal model - DPO requires special handling

Extracting prompt in train dataset (num_proc=5):   0%|          | 0/5 [00:00<?, ? examples/s]



Applying chat template to train dataset (num_proc=5):   0%|          | 0/5 [00:00<?, ? examples/s]



Tokenizing train dataset (num_proc=5):   0%|          | 0/5 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5 | Num Epochs = 1 | Total steps = 2
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 33,751,040 of 3,882,841,088 (0.87% trained)

DPO training issue: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['prompt_input_ids', 'chosen_input_ids', 'rejected_input_ids', 'image_sizes']

Note: Ministral's multimodal architecture (Mistral3ForConditionalGeneration)
requires images for DPO training. For text-only DPO, consider using
a pure text model like Mistral-7B-Instruct-v0.3 or Qwen3-4B.
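
The error text indicates that a padding step expected a plain `input_ids` entry but received DPO-style prefixed keys. The snippet below only mirrors that key check in plain Python, using the key list copied from the message. `missing_required_keys` is a hypothetical helper, not the code in TRL or Transformers.

```python
def missing_required_keys(batch_keys, required=("input_ids",)):
    """List the required keys that are absent from a batch's key names."""
    return [key for key in required if key not in batch_keys]

# Keys copied from the error message above
keys_from_error = ["prompt_input_ids", "chosen_input_ids",
                   "rejected_input_ids", "image_sizes"]
print(missing_required_keys(keys_from_error))
```

The prefixed names contain `input_ids` as a substring, but none of them is the exact key the padding step asks for.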

In [11]:
# Post-training inference test (shows model still works even if DPO failed)
FastLanguageModel.for_inference(model)

test_prompt = "What is machine learning?"
messages = [{"role": "user", "content": [{"type": "text", "text": test_prompt}]}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
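# `tokenizer` acts as a multimodal processor here: the first positional argument is the image input (None for text-only)
inputs = tokenizer(None, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")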

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Clear success/failure banner
print("=" * 60)
if DPO_TEXT_SUPPORTED:
    print("DPO Training: SUPPORTED for Ministral (Text-Only)")
    print("Model: FastLanguageModel + Ministral-3-3B-Reasoning-2512")
else:
    print("DPO Training: NOT SUPPORTED for Ministral (Text-Only)")
    print("Reason: Mistral3ForConditionalGeneration requires images for DPO")
    print("Alternative: Use 05_DPO_Training_Ministral_Vision.ipynb with images")
print("=" * 60)
print(f"Sample generation (base model):\n{response[-200:]}")

DPO Training: NOT SUPPORTED for Ministral (Text-Only)
Reason: Mistral3ForConditionalGeneration requires images for DPO
Alternative: Use 05_DPO_Training_Ministral_Vision.ipynb with images
Sample generation (base model):
cialĠintelligenceĠthatĠdealsĠwithĠtheĠdevelopmentĠofĠalgorithmsĠthatĠallowĠcomputersĠtoĠlearnĠfromĠdata.ĠItĠinvolvesĠtrainingĠalgorithmsĠonĠdataĠtoĠrecognizeĠpatternsĠandĠmakeĠpredictionsĠorĠdecisions

## Test Complete

The DPO Training Pipeline test for Ministral (Text-Only) has completed. The kernel will now shut down to release all GPU memory.

### What Was Verified
- FastLanguageModel loading with 4-bit quantization (Ministral-3B)
- LoRA adapter configuration for preference learning
- Synthetic preference dataset with Ministral's multimodal format
- DPOTrainer training loop (2 steps attempted; failed with the `input_ids` encoding error shown above)
- Inference generation after the training attempt

### DPO Concepts Demonstrated
- **Direct Preference Optimization**: Learning from preference pairs
- **Implicit Reward Model**: No explicit reward model needed
- **Beta Parameter**: Scales the implicit reward (`beta=0.1` here); in TRL, a higher beta keeps the policy closer to the reference model
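
The role of beta can be illustrated with a short sketch. It assumes the standard DPO formulation in which the implicit reward is beta times the policy-to-reference log-ratio and the preference probability follows the Bradley-Terry model. The function names and log-probability values are hypothetical.

```python
import math

def implicit_reward(policy_logp, ref_logp, beta):
    """Implicit reward: beta times the policy-to-reference log-ratio."""
    return beta * (policy_logp - ref_logp)

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Illustrative log-probabilities (assumed values, not measured)
for beta in (0.05, 0.1, 0.5):
    r_chosen = implicit_reward(-9.0, -10.0, beta)
    r_rejected = implicit_reward(-13.0, -12.0, beta)
    p = preference_probability(r_chosen, r_rejected)
    print(f"beta={beta}: P(chosen preferred) = {p:.4f}")
```

For a fixed log-ratio gap, a larger beta turns the same gap into a larger reward difference, so the policy needs to move less from the reference to satisfy a preference.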

### Next Steps
- Compare with `05_DPO_Training_Ministral_Vision.ipynb` for vision DPO

In [12]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

{'status': 'ok', 'restart': False}