# Fast Inference Test: Qwen3-4B

Tests `fast_inference=True` with the vLLM backend on Qwen3-4B (`unsloth/Qwen3-4B-unsloth-bnb-4bit`).

**Important:** This notebook ends with a kernel shutdown cell.
vLLM does not release GPU memory when it runs in a single process (as in Jupyter),
so the kernel must be restarted between tests of different models.
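
One way to see whether GPU memory is still held after a test is to compare free and total memory on the device. The following is a minimal sketch, assuming PyTorch's `torch.cuda.mem_get_info()` is available. The helper names `format_gib` and `gpu_memory_report` are illustrative and are not part of this notebook or of unsloth.

```python
def format_gib(num_bytes):
    """Convert a byte count to a GiB string with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


def gpu_memory_report(device=0):
    """Return a one-line free/total summary, or None if CUDA is unavailable.

    Hypothetical helper for illustration; it only wraps torch.cuda.mem_get_info().
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # mem_get_info returns (free_bytes, total_bytes) for the given device
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return f"GPU {device}: {format_gib(free_bytes)} free of {format_gib(total_bytes)}"


print(gpu_memory_report())
```

Calling `gpu_memory_report()` before loading the model and again after the test gives a rough picture of how much memory the process still holds.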

In [None]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()

import unsloth
from unsloth import FastLanguageModel

import vllm
import torch

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, vLLM {vllm.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

In [None]:
# Test Qwen3-4B with fast_inference=True
MODEL_NAME = "unsloth/Qwen3-4B-unsloth-bnb-4bit"
print(f"\nTesting {MODEL_NAME.split('/')[-1]} with fast_inference=True...")

from vllm import SamplingParams
import time

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=512,
    load_in_4bit=True,
    fast_inference=True,
)

# Test generation
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": "Say hello in one word."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
sampling_params = SamplingParams(temperature=0.1, max_tokens=10)

# perf_counter is a monotonic clock, better suited to measuring short intervals
start = time.perf_counter()
outputs = model.fast_generate([prompt], sampling_params=sampling_params)
elapsed = time.perf_counter() - start

# Print a clear result summary
print(f"\n{'='*60}")
print(f"Model: {MODEL_NAME}")
print(f"FastInference: ✅ SUPPORTED")
print(f"Generation: {elapsed:.2f}s")
print(f"{'='*60}")
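
The cell above times the call but does not inspect what was generated. The sketch below shows one way to pull out the text and a rough throughput figure. It assumes that `model.fast_generate` returns vLLM `RequestOutput` objects, each with an `outputs` list whose entries expose `text` and `token_ids`. The function `summarize_generation` and the stand-in data are hypothetical and exist only so the example runs without a GPU.

```python
from types import SimpleNamespace


def summarize_generation(outputs, elapsed_seconds):
    """Return (texts, total_tokens, tokens_per_second) for a batch of results.

    Assumes each item has .outputs[0].text and .outputs[0].token_ids,
    which is the shape of vLLM's RequestOutput / CompletionOutput.
    """
    texts = [request.outputs[0].text for request in outputs]
    total_tokens = sum(len(request.outputs[0].token_ids) for request in outputs)
    tokens_per_second = total_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return texts, total_tokens, tokens_per_second


# Stand-in objects mimicking the assumed result shape (not real model output)
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text="Hello", token_ids=[1, 2, 3, 4])])
]
texts, total_tokens, tokens_per_second = summarize_generation(fake_outputs, 2.0)
print(texts, total_tokens, tokens_per_second)
```

In the notebook, the call would be `summarize_generation(outputs, elapsed)` with the real `outputs` and `elapsed` from the test cell. Note that the first call often includes warm-up cost, so the throughput figure is only indicative.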

## Test Complete

The Qwen3-4B `fast_inference` test has completed. The next cell shuts down the kernel to release all GPU memory.

**Next:** Run `03_FastInference_Ministral.ipynb` for Ministral model testing.

In [3]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

{'status': 'ok', 'restart': False}