# Fast Inference Test: Ministral Models

Tests `fast_inference` (vLLM backend) support for Ministral 3B models:
- Ministral 3B with standard inference (`fast_inference` is not supported for multimodal architectures)
- Ministral 3B loaded through `FastVisionModel` with `fast_inference=True` (expected to fail)
- Verification that the `fast_inference` parameter is available

**Important:** This notebook includes a kernel shutdown cell at the end.
vLLM does not release GPU memory when it runs in a single process (as in Jupyter),
so the kernel must be restarted between tests of different models.
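
As an alternative to shutting down the kernel, each model test could in principle run in its own child process, since resources held by a process are released when that process exits. The sketch below illustrates the idea with the standard library only. `run_isolated` is a hypothetical helper, not part of Unsloth or vLLM, and the code string passed to it is a stand-in for a real model test.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a fresh interpreter and capture its output.

    Hypothetical helper: anything the child process acquires (including
    GPU memory) is released by the operating system when the child exits.
    """
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Stand-in for a real model test; a real snippet would load and run the model.
result = run_isolated("print('test finished')")
print(result.returncode, result.stdout.strip())
```

This trades the convenience of shared notebook state for isolation: results have to be passed back through stdout or files rather than Python variables.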

In [1]:
# Environment Setup (quiet mode)
import warnings
import os
import sys
import logging

# Suppress all verbose output
warnings.filterwarnings("ignore")
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["TQDM_DISABLE"] = "1"
logging.getLogger("unsloth").setLevel(logging.ERROR)
logging.getLogger("vllm").setLevel(logging.ERROR)
logging.getLogger("transformers").setLevel(logging.ERROR)

from dotenv import load_dotenv
load_dotenv()

# Suppress unsloth banner during import
from contextlib import redirect_stdout, redirect_stderr
from io import StringIO
with redirect_stdout(StringIO()), redirect_stderr(StringIO()):
    import unsloth
    from unsloth import FastLanguageModel, FastVisionModel

import vllm
import torch

# Suppress model loading verbosity
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

# Single-line environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, vLLM {vllm.__version__}, {gpu}")

HF_TOKEN loaded: Yes
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.

  if is_vllm_available():

🦥 Unsloth Zoo will now patch everything to make training faster!
unsloth: 2025.12.10
transformers: 5.0.0rc1
vLLM: 0.14.0rc1.dev201+gadcf682fc
TRL: 0.26.2
PyTorch: 2.9.1+cu130
CUDA: True
GPU: NVIDIA GeForce RTX 4080 SUPER

In [2]:
# Test Ministral 3B (multimodal architecture - fast_inference NOT supported)
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nTesting {MODEL_NAME.split('/')[-1]} (multimodal architecture)...")

import time
import os

# Suppress verbose model loading output by redirecting to /dev/null
_stdout_fd = os.dup(1)
_stderr_fd = os.dup(2)
_devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(_devnull, 1)
os.dup2(_devnull, 2)

try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        MODEL_NAME,
        max_seq_length=512,
        load_in_4bit=True,
        # fast_inference=False (default) - multimodal models don't support vLLM backend
    )
finally:
    os.dup2(_stdout_fd, 1)
    os.dup2(_stderr_fd, 2)
    os.close(_devnull)
    os.close(_stdout_fd)
    os.close(_stderr_fd)

# Test standard generation
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": [{"type": "text", "text": "Say hello in one word."}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(None, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

start = time.time()
output = model.generate(**inputs, max_new_tokens=10, temperature=0.1)
elapsed = time.time() - start

# Print a concise result summary
print(f"\n{'='*60}")
print(f"Model: {MODEL_NAME}")
print(f"FastInference: ❌ NOT SUPPORTED (multimodal architecture)")
print(f"Reason: vLLM PixtralForConditionalGeneration lacks packed_modules_mapping")
print(f"Standard inference: {elapsed:.2f}s")
print(f"{'='*60}")

=== Ministral 3B Model Test ===
NOTE: Ministral 3 models are multimodal (vision+text)
fast_inference=True is NOT supported - vLLM PixtralForConditionalGeneration lacks packed_modules_mapping
Testing standard inference path...
==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Loading weights:   0%|          | 0/458 [00:00<?, ?it/s]

✓ Ministral 3B loaded: Mistral3ForConditionalGeneration
✓ Generation completed in 1.67s
  Response (last 30 chars): ...theĠuserĠwantsĠmeĠtoĠsayĠhello
✓ Ministral 3B standard inference test PASSED
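
The file-descriptor redirection used in the cell above (and repeated in the next test) could be factored into a context manager. The following is a minimal sketch of that pattern, assuming the goal is only to silence output during a block and restore it afterwards. `suppress_fd_output` is a hypothetical name introduced here for illustration.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def suppress_fd_output():
    """Silence stdout and stderr at the file-descriptor level, then restore them."""
    # Flush Python-level buffers so earlier output is not lost or reordered.
    sys.stdout.flush()
    sys.stderr.flush()
    saved_out, saved_err = os.dup(1), os.dup(2)
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, 1)
        os.dup2(devnull, 2)
        yield
    finally:
        sys.stdout.flush()
        sys.stderr.flush()
        # Point descriptors 1 and 2 back at their original targets.
        os.dup2(saved_out, 1)
        os.dup2(saved_err, 2)
        for fd in (devnull, saved_out, saved_err):
            os.close(fd)


with suppress_fd_output():
    os.write(1, b"this line is discarded\n")
print("output restored")
```

Redirecting at the descriptor level, rather than swapping `sys.stdout`, also covers output written directly by native libraries.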

In [3]:
# Test FastVisionModel with fast_inference=True (expected to fail)
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nTesting FastVisionModel with fast_inference=True...")

from datasets import load_dataset
import time
import os

fast_inference_supported = False
elapsed = None

# Suppress verbose model loading output by redirecting to /dev/null
_stdout_fd = os.dup(1)
_stderr_fd = os.dup(2)
_devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(_devnull, 1)
os.dup2(_devnull, 2)

try:
    model, tokenizer = FastVisionModel.from_pretrained(
        MODEL_NAME,
        load_in_4bit=True,
        fast_inference=True,
        gpu_memory_utilization=0.5,
    )
    fast_inference_supported = True
    
    # If it loads, test generation
    FastVisionModel.for_inference(model)
    dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:1]")
    test_image = dataset[0]["image"]
    
    messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe briefly."}]}]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
    
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=32, temperature=0.1)
    elapsed = time.time() - start

except Exception as e:
    pass  # Expected - fast_inference not supported

finally:
    os.dup2(_stdout_fd, 1)
    os.dup2(_stderr_fd, 2)
    os.close(_devnull)
    os.close(_stdout_fd)
    os.close(_stderr_fd)

# Print a concise result summary
print(f"\n{'='*60}")
print(f"Model: {MODEL_NAME} (Vision)")
if fast_inference_supported:
    print(f"FastInference: ✅ SUPPORTED")
    if elapsed:
        print(f"Generation: {elapsed:.2f}s")
else:
    print(f"FastInference: ❌ NOT SUPPORTED")
    print(f"Reason: vLLM vision models lack packed_modules_mapping")
print(f"{'='*60}")

=== Ministral 3B Vision Fast Inference Test ===
Testing if fast_inference=True works with Ministral vision models...
==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: Vision model detected, setting approx_max_num_seqs to 1
Unsloth: vLLM loading unsloth/Ministral-3-3B-Reasoning-2512 with actual GPU utilization = 15.16%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 15.57 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 1.
Unsloth: vLLM's KV Cache can use up t

  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'apply_yarn_scaling'}
`rope_parameters`'s factor field must be a float >= 1, got 16
`rope_parameters`'s beta_fast field must be a float, got 32
`rope_parameters`'s beta_slow field must be a float, got 1



  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(

INFO 01-02 23:15:58 [topk_topp_sampler.py:47] Using FlashInfer for top-p & top-k sampling.
INFO 01-02 23:15:58 [gpu_model_runner.py:3762] Starting to load model unsloth/Ministral-3-3B-Reasoning-2512...
INFO 01-02 23:15:59 [cuda.py:315] Using AttentionBackendEnum.FLASHINFER backend.
⚠ fast_inference=True NOT SUPPORTED for Ministral vision models
  Reason: vLLM's PixtralForConditionalGeneration lacks packed_modules_mapping
  This is a known vLLM limitation, not an unsloth bug
  Workaround: Use standard inference (fast_inference=False)
✓ Test completed - limitation documented

📊 Result: fast_inference=NOT SUPPORTED for Ministral vision
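
The workaround named in the output above (fall back to `fast_inference=False`) can be sketched as a small wrapper that also keeps the reason for the failure instead of discarding it. This is an illustration under stated assumptions: `load_with_fallback` and `fake_loader` are hypothetical names, and the `AttributeError` raised by the stand-in is assumed for the example, not taken from an actual vLLM traceback.

```python
from typing import Any, Callable, Optional, Tuple


def load_with_fallback(
    loader: Callable[..., Any], model_name: str, **kwargs: Any
) -> Tuple[Any, bool, Optional[str]]:
    """Try fast_inference=True first, then fall back to standard inference.

    Returns (loaded_object, used_fast_inference, error_message).
    """
    try:
        return loader(model_name, fast_inference=True, **kwargs), True, None
    except Exception as exc:
        # Keep the reason for the failure so it can be reported later.
        error = f"{type(exc).__name__}: {exc}"
    return loader(model_name, fast_inference=False, **kwargs), False, error


def fake_loader(model_name: str, fast_inference: bool = False, **kwargs: Any) -> str:
    """Hypothetical stand-in for a loader that rejects fast_inference=True."""
    if fast_inference:
        raise AttributeError("packed_modules_mapping")
    return f"loaded {model_name}"


loaded, used_fast, error = load_with_fallback(fake_loader, "demo-model")
print(loaded, used_fast, error)
```

With a real loader, a failed vLLM initialization may leave GPU memory allocated, so the fallback load might still need a fresh process to succeed.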

In [4]:
# Summary: FastInference support by model type
import inspect

print(f"\n{'='*60}")
print("FASTINFERENCE SUPPORT SUMMARY")
print(f"{'='*60}")
print(f"Llama, Qwen (text-only):     ✅ SUPPORTED")
print(f"Ministral (multimodal):      ❌ NOT SUPPORTED")
print(f"Vision models:               ❌ NOT SUPPORTED")
print(f"{'='*60}")
print(f"\nNote: Multimodal models use PixtralForConditionalGeneration")
print(f"which lacks vLLM's packed_modules_mapping for fast inference.")

=== Fast Inference Capability Check ===
✓ fast_inference parameter available: True
✓ fast_inference in FastVisionModel: True

Current versions:
  vLLM: 0.14.0rc1.dev201+gadcf682fc
  Unsloth: 2025.12.10

✓ fast_inference=True works with vLLM 0.14.0 (patched)
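
The capability check reported above states whether `fast_inference` is an accepted parameter. One way such a check might be written is with `inspect.signature`, as in the sketch below. `accepts_parameter` and `fake_from_pretrained` are hypothetical names; in the notebook the function under inspection would be a real loader such as `FastLanguageModel.from_pretrained`.

```python
import inspect
from typing import Any, Callable


def accepts_parameter(func: Callable[..., Any], name: str) -> bool:
    """Return True if `name` is an explicit parameter of `func`."""
    return name in inspect.signature(func).parameters


def fake_from_pretrained(
    model_name: str, load_in_4bit: bool = True, fast_inference: bool = False
) -> None:
    """Hypothetical stand-in for a model loader."""


print(accepts_parameter(fake_from_pretrained, "fast_inference"))
print(accepts_parameter(fake_from_pretrained, "not_a_parameter"))
```

A positive result only shows that the parameter is accepted by the function signature. It does not show that the vLLM backend works for a given architecture, which is why the load tests above are still needed.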

## Test Complete

The Ministral model tests have completed. The kernel will now shut down to release all GPU memory.

**Summary:**
- Ministral 3B (text): standard inference works; `fast_inference` is not supported because the architecture is multimodal
- Ministral 3B Vision: `fast_inference=True` is not supported because vLLM's PixtralForConditionalGeneration lacks `packed_modules_mapping`

**Next:** Run `04_Vision_Training.ipynb` for vision training pipeline testing.

In [5]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

{'status': 'ok', 'restart': False}