# Fast Inference Test: Llama-3.2-1B

Tests `fast_inference=True` with the vLLM backend on Llama-3.2-1B-Instruct.

**Important:** The final cell of this notebook shuts down the kernel.
vLLM does not release GPU memory when it runs in a single process, as it does
under Jupyter, so the kernel must be restarted before testing a different model.
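
An alternative to restarting the kernel, sketched here as an assumption rather than something this notebook does, is to run each model test in its own child process, since the GPU memory held by a process is reclaimed when that process exits. The helper name `run_isolated` is hypothetical, and the snippet passed to it is a placeholder for a real test script.

```python
import subprocess
import sys
from typing import Optional


def run_isolated(code: str, timeout: Optional[float] = None) -> subprocess.CompletedProcess:
    """Run a Python snippet in a fresh interpreter and capture its output.

    Everything the child allocates (including any GPU memory) is released
    when the child process exits.
    """
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # Placeholder snippet; a real test would load the model and generate here.
    result = run_isolated("print('hello from child')")
    print(result.returncode, result.stdout.strip())
```

The trade-off is that each child pays the full model-loading cost again, which is why a one-model-per-notebook layout with a final shutdown cell is also a reasonable choice.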

In [1]:
# Environment Setup (quiet mode)
import warnings
import os
import logging

# Suppress all verbose output
warnings.filterwarnings("ignore")
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["TQDM_DISABLE"] = "1"
os.environ["UNSLOTH_IS_PRESENT"] = "1"  # May reduce some banners
logging.getLogger("unsloth").setLevel(logging.ERROR)
logging.getLogger("vllm").setLevel(logging.ERROR)
logging.getLogger("transformers").setLevel(logging.ERROR)

from dotenv import load_dotenv
load_dotenv()

# Suppress unsloth banner during import
from contextlib import redirect_stdout, redirect_stderr
from io import StringIO
with redirect_stdout(StringIO()), redirect_stderr(StringIO()):
    import unsloth
    from unsloth import FastLanguageModel

import vllm
import torch

# Suppress model loading verbosity
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

# Single-line environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, vLLM {vllm.__version__}, {gpu}")

Environment: unsloth 2025.12.10, vLLM 0.14.0rc1.dev201+gadcf682fc, NVIDIA GeForce RTX 4080 SUPER
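
`redirect_stdout` and `redirect_stderr` only swap Python's `sys.stdout` and `sys.stderr` objects, so text written directly to file descriptors 1 and 2 (for example by native extensions) bypasses them. A minimal sketch of a descriptor-level equivalent follows; the name `suppress_fds` is hypothetical, and whether a Jupyter front end still displays such output depends on how the kernel captures it.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def suppress_fds(fds=(1, 2)):
    """Temporarily point the given file descriptors at os.devnull."""
    # Flush Python-level buffers first so pending text is not misordered.
    for stream in (sys.stdout, sys.stderr):
        try:
            stream.flush()
        except Exception:
            pass
    saved = [os.dup(fd) for fd in fds]          # backups of the originals
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        for fd in fds:
            os.dup2(devnull, fd)                # redirect to /dev/null
        yield
    finally:
        for fd, backup in zip(fds, saved):
            os.dup2(backup, fd)                 # restore the original target
            os.close(backup)
        os.close(devnull)


with suppress_fds():
    os.write(1, b"this line is discarded\n")
print("stdout is restored")
```

Wrapping the save and restore steps in a context manager keeps the restore in a `finally` block, so the descriptors are put back even if the wrapped call raises.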

In [2]:
# Test Llama-3.2-1B with fast_inference=True
MODEL_NAME = "unsloth/Llama-3.2-1B-Instruct"
print(f"\nTesting {MODEL_NAME.split('/')[-1]} with fast_inference=True...")

from vllm import SamplingParams
import time

# Suppress verbose model loading output by pointing file descriptors 1 (stdout)
# and 2 (stderr) at /dev/null. Backups of the original descriptors are kept so
# they can be restored afterwards.
_stdout_fd = os.dup(1)
_stderr_fd = os.dup(2)
_devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(_devnull, 1)
os.dup2(_devnull, 2)

try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        MODEL_NAME,
        max_seq_length=512,
        load_in_4bit=True,
        fast_inference=True,
        gpu_memory_utilization=0.5,
    )
finally:
    # Always restore stdout/stderr and close the temporary descriptors,
    # even if model loading fails.
    os.dup2(_stdout_fd, 1)
    os.dup2(_stderr_fd, 2)
    os.close(_devnull)
    os.close(_stdout_fd)
    os.close(_stderr_fd)

# Test generation
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": "Say hello in one word."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
sampling_params = SamplingParams(temperature=0.1, max_tokens=10)

# perf_counter is a monotonic, high-resolution clock suited to short intervals
start = time.perf_counter()
outputs = model.fast_generate([prompt], sampling_params=sampling_params)
elapsed = time.perf_counter() - start

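# Print a clear result summary. Reaching this point means loading and
# generation completed without raising an exception.
print(f"\n{'='*60}")
print(f"Model: {MODEL_NAME}")
print("FastInference: ✅ SUPPORTED")
print(f"Generation: {elapsed:.2f}s")
print(f"{'='*60}")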


Testing Llama-3.2-1B-Instruct with fast_inference=True...
==((====))==  Unsloth 2025.12.10: Fast Llama patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/llama-3.2-1b-instruct-unsloth-bnb-4bit with actual GPU utilization = 45.07%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 15.57 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 32.
Unsloth: vLLM's KV Cache can use up to 5.88 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `use_cudagraph` is not supported in vLLM.config.Compilati

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 22/22 [00:00<00:00, 31.52it/s]
Capturing CUDA graphs (decode, FULL):  29%|██▊       | 4/14 [00:00<00:00, 36.78it/s]

Unsloth: Just some info: will skip parsing ['post_layernorm', 'pre_feedforward_layernorm', 'post_attention_layernorm', 'norm1', 'post_feedforward_layernorm', 'norm2', 'ffn_norm', 'input_layernorm', 'q_norm', 'layer_norm2', 'norm', 'layer_norm1', 'k_norm', 'attention_norm']
Performing substitution for additional_keys=set()
Unsloth: Just some info: will skip parsing ['post_layernorm', 'pre_feedforward_layernorm', 'post_attention_layernorm', 'norm1', 'post_feedforward_layernorm', 'norm2', 'ffn_norm', 'input_layernorm', 'q_norm', 'layer_norm2', 'cross_attn_post_attention_layernorm', 'norm', 'cross_attn_input_layernorm', 'layer_norm1', 'k_norm', 'attention_norm']

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]


Model: unsloth/Llama-3.2-1B-Instruct
FastInference: ✅ SUPPORTED
Generation: 0.02s
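
The summary above reports only wall-clock time and does not look at what was generated. A small sketch of how the result could be inspected follows, assuming `fast_generate` returns vLLM-style request outputs whose first completion exposes `.text` and `.token_ids`. The helper name `summarize_generation` is hypothetical, and stand-in objects are used so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def summarize_generation(request_outputs, elapsed_s):
    """Return (texts, tokens_per_second) for vLLM-style request outputs."""
    # Each request output holds a list of completions; take the first one.
    texts = [ro.outputs[0].text for ro in request_outputs]
    n_tokens = sum(len(ro.outputs[0].token_ids) for ro in request_outputs)
    rate = n_tokens / elapsed_s if elapsed_s > 0 else float("inf")
    return texts, rate


# Stand-in data shaped like a single request with one two-token completion.
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text="Hello", token_ids=[1, 2])])
]
texts, rate = summarize_generation(fake_outputs, 0.5)
print(texts, f"{rate:.1f} tok/s")
```

In the notebook this would be called as `summarize_generation(outputs, elapsed)`; checking that the returned text is non-empty gives a stronger signal than the absence of an exception.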

## Test Complete

The Llama-3.2-1B fast_inference test has completed. The kernel will now shut down to release all GPU memory.

**Next:** Run `02_FastInference_Qwen.ipynb` for Qwen3-4B testing.

In [3]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

{'status': 'ok', 'restart': False}
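
One way to confirm that the memory was actually released, offered as a sketch and not as part of the original test, is to query `nvidia-smi` from a fresh process after the kernel has exited. The helper names are hypothetical, and the function returns `None` when `nvidia-smi` is missing or its output cannot be parsed.

```python
import shutil
import subprocess


def parse_memory_used(csv_text):
    """Parse nvidia-smi CSV output (one MiB value per GPU line) into ints."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def gpu_memory_used_mib():
    """Return used memory per GPU in MiB, or None if it cannot be determined."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    try:
        return parse_memory_used(result.stdout)
    except ValueError:
        # Some devices report non-numeric values such as "[N/A]".
        return None


print(gpu_memory_used_mib())
```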