# Reward Model Training Test: Ministral (Vision)

Tests reward model training with the vision variant of Ministral 3 3B (`unsloth/Ministral-3-3B-Reasoning-2512`).

**Model Variant:** Vision (FastVisionModel)
**Expected Result:** NOT SUPPORTED - AutoModelForSequenceClassification incompatible with vision models

**Reward Model + Vision Challenge:**
Standard reward models are loaded with AutoModelForSequenceClassification, which adds a classification head that outputs one scalar reward per sequence. The vision model loads as Mistral3ForConditionalGeneration, a generation architecture with no such head.

**Why Not Supported:**
1. Vision models output generation logits, not classification scores
2. No classification head available for scalar reward output
3. Image processing pipeline incompatible with RewardTrainer
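
For context on what the missing scalar output is used for: reward models with a classification head are commonly trained on (chosen, rejected) pairs with a pairwise logistic (Bradley-Terry style) loss. The block below is a minimal pure-Python sketch of that loss under this assumption. The function name is illustrative and is not part of any library API.

```python
import math


def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(chosen_reward - rejected_reward)).

    The loss is small when the chosen response scores well above the
    rejected one, and large when the ranking is inverted.
    """
    margin = chosen_reward - rejected_reward
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

A generation-only model produces token logits rather than the single score per sequence that this loss consumes, which is why the points above rule out the standard setup.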

**Alternatives for Vision RLHF:**
1. Use GRPO with custom reward functions (04_GRPO_Training_Ministral_Vision.ipynb) - **WORKS!**
2. Use generation loss as reward proxy (shown in this notebook)
3. Use a separate text-only reward model alongside vision model

**Important:** This notebook documents the limitations and shows custom reward function alternatives.

In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported

import torch
from datasets import load_dataset

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!


Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes


In [2]:
# Attempt to load Vision Model for reward modeling
# Note: This is experimental - vision models don't natively support sequence classification
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nAttempting to load {MODEL_NAME.split('/')[-1]} for reward modeling...")
print("Note: Vision models use PixtralForConditionalGeneration, not SequenceClassification")

VISION_REWARD_SUPPORTED = False

try:
    # Load vision model (for generation, not classification)
    model, tokenizer = FastVisionModel.from_pretrained(
        MODEL_NAME,
        load_in_4bit=True,
        use_gradient_checkpointing="unsloth",
    )
    print(f"Model loaded: {type(model).__name__}")
    print("Model type: Conditional Generation (NOT Sequence Classification)")
    VISION_MODEL_LOADED = True
except Exception as e:
    print(f"Model loading failed: {e}")
    VISION_MODEL_LOADED = False


Attempting to load Ministral-3-3B-Reasoning-2512 for reward modeling...
Note: Vision models use PixtralForConditionalGeneration, not SequenceClassification


==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Model loaded: Mistral3ForConditionalGeneration
Model type: Conditional Generation (NOT Sequence Classification)


In [5]:
# Alternative: Custom reward function using vision model
# This demonstrates how to create a reward function for vision content

def vision_reward_function(model, tokenizer, image, prompt, response):
    """
    Custom reward function using vision model.
    Instead of training a separate reward model, we can use the vision model
    to evaluate responses based on generation likelihood.
    """
    # Format the full conversation
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]},
        {"role": "assistant", "content": [{"type": "text", "text": response}]}
    ]
    
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt")
    
    # Move to device with correct dtype
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    # Convert pixel_values to bfloat16 to match model weights
    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
    
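    # Use the language-modeling loss as a reward proxy (lower loss = more likely response).
    # Passing input_ids as labels scores the whole sequence, prompt included.
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss.item()
    
    # Negate the mean per-token loss so that a higher reward means a more likely sequence
    reward = -loss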
    return reward

if VISION_MODEL_LOADED:
    print("Custom vision reward function defined")
    print("Approach: Use generation loss as reward proxy")
else:
    print("Skipping - model not loaded")

Custom vision reward function defined
Approach: Use generation loss as reward proxy
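
The function above passes `input_ids` as labels, so the loss is averaged over the full sequence rather than over the response alone. One possible refinement, sketched here in pure Python, is to average only over response tokens (with Hugging Face models this is usually done by setting prompt label positions to `-100`, which the loss ignores). The helper name and its inputs (per-token log-probabilities plus a boolean response mask) are assumptions for illustration and are not taken from this notebook.

```python
from typing import Sequence


def response_only_reward(
    token_logprobs: Sequence[float],
    is_response_token: Sequence[bool],
) -> float:
    """Return the negative mean NLL computed over response tokens only.

    token_logprobs[i] is the log-probability the model assigned to token i;
    is_response_token[i] is True when token i belongs to the response.
    """
    if len(token_logprobs) != len(is_response_token):
        raise ValueError("log-probabilities and mask must have equal length")
    picked = [lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to score")
    mean_nll = -sum(picked) / len(picked)
    # Higher reward means the response tokens were more likely under the model.
    return -mean_nll
```

With this variant, two responses to the same prompt are compared on their own tokens, so the shared prompt no longer dilutes the difference between them.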


In [6]:
# Test custom vision reward function
if VISION_MODEL_LOADED:
    print("Testing custom vision reward function...")
    
    # Load test image
    vision_dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:1]")
    test_image = vision_dataset[0]["image"]
    test_prompt = "What mathematical expression is shown in this image?"
    
    # Compare good vs bad responses
    good_response = "The image shows a mathematical expression in LaTeX notation with variables and operators."
    bad_response = "Image."
    
    try:
        FastVisionModel.for_inference(model)
        good_reward = vision_reward_function(model, tokenizer, test_image, test_prompt, good_response)
        bad_reward = vision_reward_function(model, tokenizer, test_image, test_prompt, bad_response)
        
        print(f"Good response reward: {good_reward:.4f}")
        print(f"Bad response reward:  {bad_reward:.4f}")
        print(f"Preference correct:   {'Yes' if good_reward > bad_reward else 'No'}")
        VISION_REWARD_SUPPORTED = True
    except Exception as e:
        print(f"Reward function test failed: {e}")
        VISION_REWARD_SUPPORTED = False
else:
    print("Skipping test - model not loaded")

Testing custom vision reward function...


Good response reward: -5.2612
Bad response reward:  -5.6585
Preference correct:   Yes
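
The check above compares a single response pair. To run the same comparison over several pairs, a small helper like the following could be used. It is a sketch with a hypothetical name, and it assumes rewards have already been computed for each (preferred, dispreferred) pair.

```python
from typing import Sequence, Tuple


def preference_accuracy(reward_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (preferred, dispreferred) reward pairs ranked correctly."""
    if not reward_pairs:
        raise ValueError("need at least one reward pair")
    correct = sum(1 for preferred, dispreferred in reward_pairs if preferred > dispreferred)
    return correct / len(reward_pairs)


# The one pair measured in the cell above.
measured_pairs = [(-5.2612, -5.6585)]
accuracy = preference_accuracy(measured_pairs)
```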


In [7]:
# Summary: Vision Reward Model Support
print("=" * 60)
print("VISION REWARD MODEL SUMMARY")
print("=" * 60)
print()
print("Standard RewardTrainer with AutoModelForSequenceClassification:")
print("  Status: NOT SUPPORTED")
print("  Reason: Vision models use PixtralForConditionalGeneration")
print("          which doesn't support sequence classification output")
print()
print("Alternative: Custom Reward Function (demonstrated above)")
if VISION_MODEL_LOADED and VISION_REWARD_SUPPORTED:
    print("  Status: WORKING")
    print("  Method: Use generation loss as reward proxy")
else:
    print("  Status: See test results above")
print()
print("Recommended Approach for Vision RLHF:")
print("  1. Use SFT for initial vision fine-tuning (03_SFT_Vision)")
print("  2. Use custom reward functions with GRPO/RLOO")
print("  3. Or use text-only reward model + vision generation model")
print("=" * 60)

VISION REWARD MODEL SUMMARY

Standard RewardTrainer with AutoModelForSequenceClassification:
  Status: NOT SUPPORTED
  Reason: Vision models use PixtralForConditionalGeneration
          which doesn't support sequence classification output

Alternative: Custom Reward Function (demonstrated above)
  Status: WORKING
  Method: Use generation loss as reward proxy

Recommended Approach for Vision RLHF:
  1. Use SFT for initial vision fine-tuning (03_SFT_Vision)
  2. Use custom reward functions with GRPO/RLOO
  3. Or use text-only reward model + vision generation model


## Test Complete

The Reward Model Training Pipeline test for Ministral (Vision) has completed. The kernel will now shut down to release all GPU memory.

### Key Findings

**Standard Reward Training: NOT SUPPORTED**
- The vision model loads as Mistral3ForConditionalGeneration, which doesn't support sequence classification
- AutoModelForSequenceClassification expects an architecture with a classification head, which the conditional-generation class does not provide

**Alternative Approach: Custom Reward Function**
- Use vision model's generation loss as reward proxy
- Can be integrated with GRPO/RLOO trainers via `reward_funcs` parameter
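
As a sketch of the `reward_funcs` integration mentioned above: TRL's GRPO trainer calls each reward function with keyword arguments such as `prompts` and `completions` and expects a list of floats back. The wrapper below adapts a per-sample scorer to that shape. It assumes plain-string prompts and completions and a dataset column named `image`. `make_reward_func` and `score_fn` are hypothetical names, not part of TRL or Unsloth.

```python
from typing import Any, Callable, List, Optional, Sequence


def make_reward_func(score_fn: Callable[[Any, str, str], float]):
    """Wrap a per-sample scorer into a batch-level reward function.

    score_fn(image, prompt, response) is a hypothetical scorer, for example a
    closure around a loss-based reward like the one defined in this notebook.
    """

    def reward_func(
        prompts: Sequence[str],
        completions: Sequence[str],
        image: Optional[Sequence[Any]] = None,  # assumed dataset column name
        **kwargs: Any,  # absorb any extra columns the trainer forwards
    ) -> List[float]:
        images = image if image is not None else [None] * len(prompts)
        return [
            float(score_fn(img, prompt, completion))
            for img, prompt, completion in zip(images, prompts, completions)
        ]

    return reward_func
```

The returned function could then be passed in the `reward_funcs` list. Conversational datasets, where completions arrive as message lists instead of strings, would need the text extracted first.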

### Comparison with Text-Only
| Aspect | Text-Only | Vision |
|--------|-----------|--------|
| RewardTrainer | SUPPORTED | NOT SUPPORTED |
| Custom Reward | Possible | Demonstrated |
| Architecture | SequenceClassification | ConditionalGeneration |

### Recommended Vision RLHF Pipeline
1. SFT training with vision data (03_SFT_Vision)
2. Custom reward function (not trained reward model)
3. GRPO/RLOO with custom reward_funcs

In [8]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...


{'status': 'ok', 'restart': False}