# Fast Inference Test: Llama-3.2-1B

Tests `fast_inference=True` with the vLLM backend on Llama-3.2-1B-Instruct.

**Important:** This notebook ends with a kernel shutdown cell.
vLLM does not release GPU memory when it runs in a single process (as in Jupyter),
so the kernel must be restarted before testing a different model.
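
To confirm that memory is actually held (or freed after a restart), it can help to print the free and total GPU memory before loading a model. The following is a minimal sketch, assuming PyTorch is installed; `format_gpu_memory` and `gpu_memory_report` are hypothetical helper names, not part of unsloth or vLLM. It relies only on `torch.cuda.mem_get_info`, which returns free and total bytes for a device.

```python
def format_gpu_memory(free_bytes: int, total_bytes: int) -> str:
    """Format free/total byte counts as a one-line GiB summary."""
    gib = 1024 ** 3
    used_fraction = 1 - free_bytes / total_bytes
    return (
        f"{free_bytes / gib:.2f} GiB free of {total_bytes / gib:.2f} GiB "
        f"({used_fraction:.0%} in use)"
    )


def gpu_memory_report(device: int = 0) -> str:
    """Return a memory summary for `device`, or a notice if no GPU is usable."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return format_gpu_memory(free_bytes, total_bytes)


print(gpu_memory_report())
```

Running this at the top of each notebook gives a quick check that the previous kernel's memory is no longer in use before the next model is loaded.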

In [None]:
# Environment Setup
from dotenv import load_dotenv
import os
load_dotenv()
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel
import transformers
import vllm
import trl
import torch

print(f"unsloth: {unsloth.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"TRL: {trl.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Test fast_inference=True with Llama-3.2-1B-Instruct
print("=== Llama-3.2-1B Fast Inference Test (vLLM Backend) ===")

from unsloth import FastLanguageModel
from vllm import SamplingParams
import torch
import time

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=512,
    load_in_4bit=True,
    fast_inference=True,
    gpu_memory_utilization=0.5,
)
print("✓ Model loaded with fast_inference=True")

FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": "Say hello in one word."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

sampling_params = SamplingParams(temperature=0.1, max_tokens=10)

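# perf_counter is a monotonic, high-resolution clock intended for measuring intervals
start = time.perf_counter()
outputs = model.fast_generate([prompt], sampling_params=sampling_params)
elapsed = time.perf_counter() - start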

response = outputs[0].outputs[0].text
print(f"✓ vLLM generation completed in {elapsed:.2f}s")
print(f"  Response: {response}")
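# Fail loudly on an empty completion instead of reporting an unconditional pass
assert response.strip(), "fast_generate returned an empty response"
print("✓ Llama-3.2-1B fast_inference test PASSED")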
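
The single timed call above may include one-time warm-up costs, so its elapsed time is not necessarily representative of steady-state latency. A minimal sketch of separating warm-up runs from timed runs is shown here; `time_calls` is a hypothetical helper, not part of unsloth or vLLM, and the workload is a stand-in so the sketch runs without a GPU.

```python
import time
from statistics import mean
from typing import Callable, List


def time_calls(fn: Callable[[], object], warmup: int = 1, repeats: int = 3) -> List[float]:
    """Call `fn` `warmup` times untimed, then return seconds for each of `repeats` timed calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload. In this notebook the callable would instead be:
#   lambda: model.fast_generate([prompt], sampling_params=sampling_params)
timings = time_calls(lambda: sum(range(100_000)))
print(f"mean: {mean(timings):.4f}s, best: {min(timings):.4f}s")
```

Reporting both the mean and the best run makes it easier to tell a slow first call from consistently slow generation.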

## Test Complete

The Llama-3.2-1B `fast_inference` test is complete. The cell below shuts down the kernel to release all GPU memory.

**Next:** Run `02_FastInference_Qwen.ipynb` for Qwen3-4B testing.

In [None]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)