# Lab-2.1 Part 3: Advanced Features

## Objectives
- Understand Continuous Batching
- Master advanced sampling strategies
- Handle long context inputs
- Manage multiple models

## Estimated Time: 60-90 minutes

---
## 1. Setup

In [1]:
# Imports
from vllm import LLM, SamplingParams
import vllm
import torch
import time
import numpy as np
import matplotlib.pyplot as plt
from typing import List
import asyncio

print(f"vLLM: {vllm.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

INFO 10-27 18:33:22 [__init__.py:216] Automatically detected platform cuda.
vLLM: 0.11.0
CUDA: True


In [2]:
# Load model
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
# MODEL_NAME = "facebook/opt-1.3b"  # Alternative

print(f"Loading {MODEL_NAME}...")
llm = LLM(
    model=MODEL_NAME,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=1024,
    trust_remote_code=True,
)
print("✅ Model loaded")

Loading meta-llama/Llama-2-7b-hf...
INFO 10-27 18:33:26 [utils.py:233] non-default args: {'trust_remote_code': True, 'max_model_len': 1024, 'disable_log_stats': True, 'model': 'meta-llama/Llama-2-7b-hf'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 10-27 18:33:27 [model.py:547] Resolved architecture: LlamaForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-27 18:33:27 [model.py:1510] Using max model len 1024
INFO 10-27 18:33:27 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:28 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:28 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='meta-llama/Llama-2-7b-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disa

(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:34 [default_loader.py:267] Loading weights took 1.63 seconds
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:34 [gpu_model_runner.py:2653] Model loading took 12.5524 GiB and 2.757241 seconds
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:40 [backends.py:548] Using cache directory: /home/os-sunnie.gd.weng/.cache/vllm/torch_compile_cache/62f198137e/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:40 [backends.py:559] Dynamo bytecode transform time: 6.28 s
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:42 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.088 s
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:45 [monitor.py:34] torch.compile takes 6.28 s in total
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:46 [gpu_worker.py:298] Available KV cache memory: 0.66 GiB

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:08<00:00,  7.57it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00,  8.93it/s]

(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:59 [gpu_model_runner.py:3480] Graph capturing finished in 13 secs, took 0.59 GiB
(EngineCore_DP0 pid=1800691) INFO 10-27 18:33:59 [core.py:210] init engine (profile, create kv cache, warmup model) took 25.26 seconds
INFO 10-27 18:34:00 [llm.py:306] Supported_tasks: ['generate']
✅ Model loaded


---
## 2. Continuous Batching

vLLM's key scheduling feature: requests join and leave the running batch at every decoding step instead of waiting for a whole batch to finish.

### Traditional Static Batching Problem

```
Batch: [Req1, Req2, Req3, Req4]

Req1: ████████░░░░░░░░░░░░ (done at step 8, waits)
Req2: ████████████░░░░░░░░ (done at step 12, waits)
Req3: ██████████████████░░ (done at step 18, waits)
Req4: ████████████████████ (done at step 20)
       └── Must wait for slowest request ──┘

Idle slot-steps: 22 of 80 (~28%)
```

### Continuous Batching Solution

```
Req1: ████████              (done, removed immediately)
Req5:         ██████        (new request added)
Req2: ████████████          (done, removed)
Req6:             ████      (new request added)
Req3: ██████████████████    (done, removed)
Req4: ████████████████████

Throughput: higher, because freed slots are reused immediately (the gain depends on the workload)
```
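
To make the two diagrams concrete, here is a minimal simulation that counts decode steps under each policy. It is a sketch under simplifying assumptions (one token per request per step, a fixed number of batch slots, hypothetical request lengths) and not vLLM's actual scheduler; the function names are illustrative only.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch runs until its slowest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining steps for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before each step
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request produces one token; drop the finished ones
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

# Output lengths (in tokens) for eight hypothetical requests
lengths = [8, 12, 18, 20, 6, 4, 10, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With these lengths the static schedule needs 30 steps and the continuous one 24. The gap grows as output lengths become more uneven.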

In [3]:
# Simulate continuous batching with varied length requests
varied_prompts = [
    "Hi",  # Very short
    "What is Python?",  # Short
    "Explain machine learning in detail:",  # Medium
    "Write a comprehensive guide about artificial intelligence, covering history, techniques, and applications:",  # Long
]

# Different max_tokens for each
varied_params = [
    SamplingParams(max_tokens=10, temperature=0.8),
    SamplingParams(max_tokens=30, temperature=0.8),
    SamplingParams(max_tokens=100, temperature=0.8),
    SamplingParams(max_tokens=200, temperature=0.8),
]

print("Testing with varied-length requests...\n")
start = time.time()

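# vLLM schedules these requests with continuous batching automatically.
# A single SamplingParams object applies to every prompt, so each request here is
# capped at max_tokens=10; pass the full `varied_params` list to set per-prompt limits.
outputs = llm.generate(varied_prompts, varied_params[0])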

elapsed = time.time() - start

for i, output in enumerate(outputs):
    tokens = len(output.outputs[0].token_ids)
    print(f"Request {i+1}: {tokens:3d} tokens")

print(f"\nTotal time: {elapsed:.2f}s")
print("\n✅ Continuous batching handled varied lengths efficiently!")

Testing with varied-length requests...




Request 1:  10 tokens
Request 2:  10 tokens
Request 3:  10 tokens
Request 4:  10 tokens

Total time: 0.73s

✅ Continuous batching handled varied lengths efficiently!


### Measure TTFT and ITL

- **TTFT**: Time to First Token
- **ITL**: Inter-Token Latency
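
As a minimal sketch of the two definitions, the helper below computes TTFT and mean ITL from a request timestamp and a list of per-token arrival timestamps. The function name and the timestamps are hypothetical; in a real measurement the timestamps would come from a streaming client.

```python
def ttft_and_itl(request_time, token_times):
    """Return (TTFT, mean ITL) in seconds from a request timestamp and token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: first token arrival minus the time the request was sent
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, 0.0
    # Mean ITL: time between first and last token, divided by the number of gaps
    itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, itl

# Hypothetical timestamps in seconds: request sent at t=0.0, five tokens arrive
request_time = 0.0
token_times = [0.25, 0.30, 0.35, 0.40, 0.45]
ttft, itl = ttft_and_itl(request_time, token_times)
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.0f} ms")
```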

In [None]:
# Measuring TTFT and ITL precisely requires a streaming (async) API.
# This cell is a simplified demonstration.
#
# TTFT (Time To First Token): the delay between sending a request and receiving
# the first generated token, i.e. how long the user waits before any text appears.
# ITL (Inter-Token Latency): the average delay between consecutive tokens after
# the first one; it reflects how smooth the generation stream is.
#
# In practice:
# - TTFT = arrival time of the first token - time the request was sent.
# - ITL  = (arrival time of the last token - arrival time of the first token) / (number of tokens - 1).

test_prompt = "Explain quantum computing:"
test_params = SamplingParams(
    max_tokens=50,
    temperature=0.8,
)

print("Measuring generation latency...")
start = time.time()
output = llm.generate([test_prompt], test_params)[0]
total_time = time.time() - start

num_tokens = len(output.outputs[0].token_ids)
avg_token_latency = total_time / num_tokens

print(f"\nTotal time: {total_time:.3f}s")
print(f"Tokens: {num_tokens}")
print(f"Avg latency per token: {avg_token_latency*1000:.1f}ms")
print(f"Throughput: {num_tokens/total_time:.1f} tokens/s")


# The numbers above only estimate the average per-token latency as total time / token count;
# they do not separate TTFT from ITL. Separating them requires per-token arrival times,
# which in turn requires a streaming interface.
#
# The synchronous LLM class used in this lab is blocking: generate() returns only after
# the whole completion is finished. To measure TTFT/ITL from Python, common options are:
# 1. Start vLLM's OpenAI-compatible server and stream from it with an async OpenAI client.
# 2. Use the async engine (see Section 6), which yields partial outputs as they are generated.

# Check whether this LLM object exposes a `generate_stream` method
has_stream = hasattr(llm, 'generate_stream')

print(f"llm.supports_streaming? {has_stream}")
if has_stream:
    print("✅ This vLLM LLM instance exposes a low-level streaming API (generate_stream, token by token)")
else:
    print("⚠️ This LLM instance does not expose a streaming method directly; try the OpenAI-compatible REST endpoint or check the documentation for your version.")



Measuring generation latency...


Total time: 3.212s
Tokens: 50
Avg latency per token: 64.2ms
Throughput: 15.6 tokens/s
llm.supports_streaming? False
⚠️ This LLM instance does not expose a streaming method directly; try the OpenAI-compatible REST endpoint or check the documentation for your version.


---
## 3. Advanced Sampling Strategies

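vLLM exposes its sampling controls through `SamplingParams`; this section covers temperature, top-p, candidate generation, repetition penalty, and stop sequences.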

### 3.1 Temperature and Top-p Sampling

In [12]:
prompt = "The future of artificial intelligence will"

# Different temperature values
temperatures = [0.1, 0.5, 0.8, 1.2]

print("Testing different temperatures:\n")
print("="*80)

for temp in temperatures:
    params = SamplingParams(
        temperature=temp,
        max_tokens=30,
    )
    
    output = llm.generate([prompt], params)[0]
    text = output.outputs[0].text
    
    print(f"Temperature {temp:.1f}:")
    print(f"  {text}")
    print()

print("="*80)
print("\n💡 Lower temperature = more deterministic")
print("💡 Higher temperature = more creative/random")

Testing different temperatures:




Temperature 0.1:
   be shaped by the way we use it.
The future of artificial intelligence will be shaped by the way we use it.The future of




Temperature 0.5:
   be one of the most important issues facing the world in the coming years. In the past, many experts have speculated about the future of A




Temperature 0.8:
   be to make all the things that humans do, better.
Information Technology Industry Council (ITI) is the leading trade association for the technology




Temperature 1.2:
   choose the highly developed Helman as need of Chinese medicine, to be chosen by the government.
newbegining Jul 20, 2


💡 Lower temperature = more deterministic
💡 Higher temperature = more creative/random
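
The effect of temperature can be seen directly on a probability distribution. The sketch below applies a temperature-scaled softmax to a small set of hypothetical logits. It illustrates the standard formula only, not vLLM's internal sampler.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
for t in (0.1, 0.8, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability mass sits on the top token. Higher temperatures flatten the distribution, which is why the temperature 1.2 completion above drifts off topic.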


In [None]:
# Top-p (nucleus sampling)
print("Testing different top_p values:\n")
print("="*80)

top_p_values = [0.5, 0.8, 0.95, 1.0]

for top_p in top_p_values:
    params = SamplingParams(
        temperature=0.8,
        top_p=top_p,
        max_tokens=30,
    )
    
    output = llm.generate([prompt], params)[0]
    text = output.outputs[0].text
    
    print(f"Top-p {top_p:.2f}:")
    print(f"  {text}")
    print()

print("="*80)
print("\n💡 Lower top_p = more focused on likely tokens")
print("💡 Higher top_p = more diverse vocabulary")



Testing different top_p values:




Top-p 0.50:
   be shaped by the development of human-like AI.
In the future, artificial intelligence will be able to think, learn, and interact




Top-p 0.80:
   be a complex mix of human and machine.
The next wave of artificial intelligence will combine human and machine to solve complex problems.
Artificial




Top-p 0.95:
   be defined by the next generation of researchers
Scientists from over 120 countries, including Vietnam, have signed an open letter to




Top-p 1.00:
   be an exciting one. Despite all the talk of intelligent assistants, autonomous cars, and the kind of science fiction that AI conj


💡 Lower top_p = more focused on likely tokens
💡 Higher top_p = more diverse vocabulary
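
Top-p (nucleus) sampling keeps only the most likely tokens whose cumulative probability reaches `top_p`, then renormalizes before sampling. The sketch below shows that filtering step on a hypothetical four-token distribution; the function name is illustrative and this is not vLLM's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Map token index -> renormalized probability
    return {i: probs[i] / mass for i in kept}

probs = [0.5, 0.25, 0.15, 0.1]  # hypothetical next-token probabilities
nucleus_50 = top_p_filter(probs, 0.5)
nucleus_80 = top_p_filter(probs, 0.8)
print(nucleus_50)
print(nucleus_80)
```

Here `top_p=0.5` keeps only the single most likely token, while `top_p=0.8` keeps three of the four.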


### 3.2 Beam Search

In [23]:
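# Note: this cell is not true beam search. n=3 draws three independent
# low-temperature samples. Recent vLLM versions expose beam search separately
# (LLM.beam_search with BeamSearchParams); check the documentation for your version.

beam_params = SamplingParams(
    n=3,                    # number of candidate completions to return per prompt
    max_tokens=50,
    temperature=0.2,        # low temperature keeps the samples close to greedy decoding
    top_p=1.0,              # 1.0 disables nucleus (top-p) filtering
)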

print("Generating multiple candidates...\n")
outputs = llm.generate([prompt], beam_params)

print(f"Generated {len(outputs[0].outputs)} outputs:")
print("="*80)

for i, completion in enumerate(outputs[0].outputs):
    print(f"\nCandidate {i+1}:")
    print(completion.text)

print("="*80)


Generating multiple candidates...




Generated 3 outputs:

Candidate 1:
 be shaped by the way we treat it.
The future of artificial intelligence will be shaped by the way we treat it. The way we treat it will be shaped by the way we treat it. The way we treat it will be

Candidate 2:
 be shaped by the way we use it.
Artificial intelligence is a powerful tool that can be used to make the world a better place.
But it’s also a tool that can be misused.
In the future,

Candidate 3:
 be decided by the people who build it.
The future of artificial intelligence will be decided by the people who build it. The future of artificial intelligence will be decided by the people who build it. The future of artificial intelligence will be decided by the
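
For contrast with sampling several candidates, here is a minimal sketch of what beam search itself does: at every step it extends each kept sequence and retains only the `beam_width` sequences with the highest total log-probability. The toy next-token table and the function are hypothetical and exist only to illustrate the algorithm.

```python
import math

# Hypothetical next-token model: maps the last token to {next_token: probability}
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width, max_steps):
    """Keep the beam_width highest log-probability sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished beams carry over unchanged
                candidates.append((score, tokens))
                continue
            for tok, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

greedy = beam_search(beam_width=1, max_steps=3)
wide = beam_search(beam_width=2, max_steps=3)
print("greedy:", greedy[0][1], round(math.exp(greedy[0][0]), 2))
print("beam-2:", wide[0][1], round(math.exp(wide[0][0]), 2))
```

In this toy model, greedy decoding commits to "the" and ends with a sequence of probability 0.30, while a beam of width 2 keeps "a" alive and finds "a cat" with probability 0.36.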


### 3.3 Repetition Penalty

In [24]:
# Repetition penalty to avoid repetitive text
repetition_params = SamplingParams(
    temperature=0.8,
    max_tokens=100,
    repetition_penalty=1.2,  # Penalize repetitions
)

long_prompt = "Machine learning is a field of artificial intelligence that"

print("Testing repetition penalty...\n")
output = llm.generate([long_prompt], repetition_params)[0]
print(output.outputs[0].text)
print("\n✅ Repetition penalty helps avoid repeated phrases")

Testing repetition penalty...




 deals with the construction and study of algorithms that can learn from data without explicit instruction.
Machine Learning consists of making computers learn by example to make decisions or predictions based on training examples – this is called supervised machine learning (SM). Unsupervised machine learning does not have any classification labels but uses statistical analysis as input variables, unlike SM which relies heavily upon inputs.
Supervised learning has many applications in business such as fraud detection systems for banks; recommender engines when someone

✅ Repetition penalty helps avoid repeated phrases
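
A minimal sketch of how a repetition penalty is commonly applied to logits: for every token that has already appeared, a positive logit is divided by the penalty and a negative logit is multiplied by it, so both become less likely. This follows the widely used formulation; whether vLLM's kernel matches it exactly is an assumption here, and the function name is illustrative.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-seen tokens less likely: divide positive logits, multiply negative ones."""
    adjusted = list(logits)
    for tid in set(seen_token_ids):
        if adjusted[tid] > 0:
            adjusted[tid] /= penalty
        else:
            adjusted[tid] *= penalty
    return adjusted

logits = [3.0, -1.0, 0.5, 2.0]  # hypothetical logits for a 4-token vocabulary
seen = [0, 1, 0]                # token ids already in the prompt or output
penalized = apply_repetition_penalty(logits, seen, penalty=1.2)
print(penalized)
```

A penalty of 1.0 leaves the logits unchanged; values above 1.0 discourage repetition, as with `repetition_penalty=1.2` in the cell above.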


### 3.4 Stop Sequences

In [25]:
# Stop generation at specific sequences
stop_params = SamplingParams(
    temperature=0.8,
    max_tokens=200,
    stop=["\n\n", "However", "In conclusion"],  # Stop strings (matched on generated text)
)

prompt_with_stop = "Here are three benefits of exercise:\n1."

print("Testing stop sequences...\n")
output = llm.generate([prompt_with_stop], stop_params)[0]
print(f"Prompt: {prompt_with_stop}")
print(f"Generated: {output.outputs[0].text}")
print(f"\nStop reason: {output.outputs[0].finish_reason}")

Testing stop sequences...




Prompt: Here are three benefits of exercise:
1.
Generated:  Exercise helps manage stress.
Exercise is good for your health. An increasing body of research finds that exercise can have a positive impact on your mental health. Exercise may protect against post-traumatic stress disorder, anxiety, and depression, according to a review of over 1,000 research studies in the journal Clinical Psychology Review.
As someone who has struggled with mental illness I know how important this can be. I've learned that when I don't take time to care of my mind and body I am much more susceptible to anxiety and depression. I've also learned that when I take care of myself I am less likely to experience those feelings even when things get stressful.
2. It reduces the risk of cognitive impairment.
The University of Oxford conducted a study of 823 adults over 70 years old that found that the more

Stop reason: length
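
In the run above the finish reason is `length`: none of the stop strings appeared (the model used single newlines), so generation ran until `max_tokens`. The sketch below shows the truncation logic that stop strings imply, using a hypothetical helper rather than vLLM's own code.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest stop sequence; also return which one matched (None if none)."""
    best_pos, matched = len(text), None
    for s in stop_sequences:
        pos = text.find(s)
        if pos != -1 and pos < best_pos:
            best_pos, matched = pos, s
    return text[:best_pos], matched

stops = ["\n\n", "However", "In conclusion"]
# Hypothetical generations: the first contains a blank line, the second does not
text_a, hit_a = truncate_at_stop(" Exercise helps.\n2. It is fun.\n\n3. More", stops)
text_b, hit_b = truncate_at_stop(" Exercise helps.\n2. It is fun.", stops)
print(repr(text_a), repr(hit_a))
print(repr(text_b), repr(hit_b))
```

The helper drops the matched stop string from the returned text. When no stop string matches, the text is returned unchanged and only the token limit ends generation.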


---
## 4. Long Context Handling

Test vLLM with longer input contexts.

In [26]:
# Generate a long context
long_context = """The history of artificial intelligence began in antiquity with myths and stories 
of artificial beings endowed with intelligence. Modern AI research started in the 1950s, when 
researchers began to explore the possibility that human intelligence could be so precisely 
described that a machine could simulate it. The field was founded on the claim that a central 
property of humans, intelligence—the sapience of Homo sapiens—can be so precisely described 
that it can be simulated by a machine.

The early years of AI were marked by significant optimism. Researchers believed that machines 
would soon be able to perform any task that a human could. However, progress was slower than 
expected, and the field experienced several periods known as AI winters, during which funding 
and interest declined.

In the 21st century, AI has experienced a renaissance, driven by advances in machine learning, 
particularly deep learning. Neural networks with many layers have proven remarkably effective 
at tasks like image recognition, natural language processing, and game playing.

Based on the above history, answer: What caused the AI renaissance in the 21st century?"""

print(f"Context length: {len(long_context.split())} words\n")

long_context_params = SamplingParams(
    temperature=0.7,
    max_tokens=100,
)

print("Processing long context...\n")
start = time.time()
output = llm.generate([long_context], long_context_params)[0]
elapsed = time.time() - start

print(f"Answer: {output.outputs[0].text}")
print(f"\nProcessing time: {elapsed:.2f}s")

Context length: 177 words

Processing long context...




Answer: 

### 1. Artificial intelligence is a branch of computer science concerned with building smart machines capable of performing tasks that typically require human intelligence.

### 2. Artificial intelligence is the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.

### 3. Artificial intelligence is the study of how to program computers to perform tasks

Processing time: 6.55s


### Test Maximum Context Length

In [27]:
# Test a prompt that exceeds max_model_len (set to 1024 tokens in Setup)
# Generate a very long prompt
repeated_text = "The quick brown fox jumps over the lazy dog. " * 100  # 900 words

max_context_prompt = repeated_text + "\n\nSummarize the above text:"

print(f"Testing with ~1000 word context...")
print(f"Estimated tokens: ~1500\n")

try:
    start = time.time()
    output = llm.generate([max_context_prompt], long_context_params)[0]
    elapsed = time.time() - start
    
    print(f"✅ Success!")
    print(f"Processing time: {elapsed:.2f}s")
    print(f"Output: {output.outputs[0].text[:200]}...")
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Context might be too long for current max_model_len setting")

Testing with ~1000 word context...
Estimated tokens: ~1500




❌ Error: The decoder prompt (length 1211) is longer than the maximum model length of 1024. Make sure that `max_model_len` is no smaller than the number of text tokens.
Context might be too long for current max_model_len setting
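
One way to avoid this error is to check the token budget before calling `generate`. The sketch below trims a list of prompt token ids so that the prompt plus the requested output fits within `max_model_len`. The helper is hypothetical, and the token ids are stand-ins; in practice they would come from the model's tokenizer.

```python
def fit_to_context(prompt_token_ids, max_model_len, max_new_tokens):
    """Keep the most recent prompt tokens so prompt + generated tokens fit in max_model_len."""
    budget = max_model_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    if len(prompt_token_ids) <= budget:
        return prompt_token_ids
    # Keep the tail so a trailing instruction survives the cut
    return prompt_token_ids[-budget:]

# Stand-in token ids with the same length as the failing prompt above
prompt_ids = list(range(1211))
trimmed = fit_to_context(prompt_ids, max_model_len=1024, max_new_tokens=100)
print(len(prompt_ids), "->", len(trimmed))
```

Keeping the tail is a design choice that preserves the final instruction ("Summarize the above text:"). Keeping the head or summarizing in chunks are alternatives, depending on where the important content sits.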


---
## 5. Multi-Model Management

Load and switch between multiple models.

In [None]:
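# Handle CUDA OOM errors gracefully when loading a second model.
# Note: the first engine was created with gpu_memory_utilization=0.9, so requesting
# another 0.2 of the same GPU is likely to fail unless the first model is released.
import torch

small_model = None  # stays None if loading fails, so later cells can check it

try: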
    print("Loading a second model (GPT-2)...")
    small_model = LLM(
        model="gpt2",
        gpu_memory_utilization=0.2,
        max_model_len=512,
    )
    print("✅ Second model loaded")
except RuntimeError as e:
    if 'CUDA out of memory' in str(e) or 'CUDA error: out of memory' in str(e):
        print("❌ CUDA Out of memory when loading second model.")
        print("Tip: Make sure enough GPU memory is available before loading multiple models.")
        print("You can try one or more of the following:")
        print("- Unload the first model to free memory")
        print("- Reduce gpu_memory_utilization")
        print("- Use CPU offload (if supported)")
        print("- Use a smaller model")
        print("- Restart the kernel/GPU process to clear memory leaks")
        small_model = None
    else:
        raise
except Exception as e:
    print(f"❌ Error while loading second model: {e}")
    small_model = None



Loading a second model (GPT-2)...
INFO 10-27 18:55:25 [utils.py:233] non-default args: {'max_model_len': 512, 'gpu_memory_utilization': 0.2, 'disable_log_stats': True, 'model': 'gpt2'}
INFO 10-27 18:55:26 [model.py:547] Resolved architecture: GPT2LMHeadModel
INFO 10-27 18:55:26 [model.py:1730] Downcasting torch.float32 to torch.bfloat16.
INFO 10-27 18:55:26 [model.py:1510] Using max model len 512
INFO 10-27 18:55:26 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=1860159) INFO 10-27 18:55:27 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=1860159) INFO 10-27 18:55:27 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_par

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708]   File "/home/os-sunnie.gd.weng/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708]   File "/home/os-sunnie.gd.weng/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1860159) ERROR 10-27 18:55:29 [core.py:708]   File "/home/os-sunnie.gd.weng/.local/lib/python3

(EngineCore_DP0 pid=1860159) Process EngineCore_DP0:
(EngineCore_DP0 pid=1860159) Traceback (most recent call last):
(EngineCore_DP0 pid=1860159)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1860159)     self.run()
(EngineCore_DP0 pid=1860159)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1860159)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1860159)   File "/home/os-sunnie.gd.weng/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=1860159)     raise e
(EngineCore_DP0 pid=1860159)   File "/home/os-sunnie.gd.weng/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=1860159)     engine_core = EngineCoreProc(*args, **kwargs)

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
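
The traceback above is truncated, so the root cause is not visible here. One plausible explanation, stated as an assumption, is memory budgeting: `gpu_memory_utilization` is a per-engine fraction of total GPU memory, and this lab requests 0.9 for the first model and 0.2 for the second on the same GPU. The sketch below is a hypothetical pre-check for such a plan; the helper name and the 5% headroom are illustrative choices, not part of vLLM.

```python
def memory_plan_fits(fractions, headroom=0.05):
    """True if the requested gpu_memory_utilization fractions fit on one GPU with some headroom."""
    return sum(fractions) <= 1.0 - headroom

lab_plan = memory_plan_fits([0.9, 0.2])    # the settings used in this lab
split_plan = memory_plan_fits([0.6, 0.2])  # a hypothetical alternative split
print(f"0.9 + 0.2 fits: {lab_plan}")
print(f"0.6 + 0.2 fits: {split_plan}")
```

If both models must stay resident, lowering the first engine's fraction when it is created is the usual adjustment, at the cost of a smaller KV cache for that model.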

In [None]:
# Compare outputs from different models
comparison_prompt = "The future of AI is"
comparison_params = SamplingParams(
    temperature=0.8,
    max_tokens=50,
)

print("Comparing model outputs:\n")
print("="*80)

# Large model
large_output = llm.generate([comparison_prompt], comparison_params)[0]
print(f"Llama-2-7B:")
print(f"  {large_output.outputs[0].text}")
print()

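# Small model (skipped if the second model failed to load in the previous cell)
small = globals().get("small_model")
if small is not None:
    small_output = small.generate([comparison_prompt], comparison_params)[0]
    print(f"GPT-2 (124M):")
    print(f"  {small_output.outputs[0].text}")
else:
    print("GPT-2 (124M): not loaded, skipping")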

print("="*80)
print("\n💡 Larger models generally produce more coherent outputs")

### Model Selection Strategy

In [None]:
def select_model(prompt: str, complexity: str = "auto"):
    """
    Select model based on task complexity.
    
    Args:
        prompt: Input prompt
        complexity: 'simple', 'complex', or 'auto'
    """
    if complexity == "auto":
        # Simple heuristic: check prompt length and keywords
        if len(prompt.split()) > 50 or any(kw in prompt.lower() for kw in 
                                            ['explain', 'analyze', 'complex', 'detail']):
            complexity = "complex"
        else:
            complexity = "simple"
    
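    # Fall back to the large model if the small one is unavailable
    small = globals().get("small_model")
    if complexity == "complex" or small is None:
        return llm, "Llama-2-7B"
    else:
        return small, "GPT-2"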

# Test model selection
test_prompts = [
    "Hello, how are you?",
    "Explain the theory of relativity in detail:",
]

print("Testing automatic model selection:\n")

for prompt in test_prompts:
    model, model_name = select_model(prompt)
    output = model.generate([prompt], comparison_params)[0]
    
    print(f"Prompt: {prompt}")
    print(f"Selected: {model_name}")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print()

---
## 6. Streaming Output (Conceptual)

vLLM supports streaming for real-time token generation.

### Streaming with AsyncLLMEngine

For production streaming, use `AsyncLLMEngine`:

```python
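import uuid
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Build the engine once; MODEL_NAME is the model chosen in Setup
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=MODEL_NAME))
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)

async def stream_generate(prompt: str):
    # Each request needs its own unique request_id
    async for output in engine.generate(prompt, sampling_params, request_id=str(uuid.uuid4())):
        # Each output is a partial result, yielded as soon as new tokens are available
        yield output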
```

This enables:
- Real-time token output (typewriter effect)
- Lower perceived latency
- Better user experience
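
A streaming engine typically yields the text generated so far at each step, so a client that wants a typewriter effect has to print only the newly added part. The sketch below shows that pattern with a stand-in async generator in place of a real engine. `fake_engine`, `stream_deltas`, and `collect` are hypothetical names, and the chunks are made up.

```python
import asyncio

async def fake_engine(chunks, delay=0.0):
    """Stand-in for a streaming engine: yields the cumulative text after each chunk."""
    text = ""
    for chunk in chunks:
        await asyncio.sleep(delay)
        text += chunk
        yield text

async def stream_deltas(cumulative_stream):
    """Turn cumulative outputs into the newly added piece at each step."""
    printed = 0
    async for text in cumulative_stream:
        yield text[printed:]
        printed = len(text)

async def collect(chunks):
    """Gather all deltas into a list (a real client would print each one as it arrives)."""
    return [d async for d in stream_deltas(fake_engine(chunks))]

# In a plain script; inside Jupyter use `await collect([...])` instead
deltas = asyncio.run(collect(["In ", "silicon ", "halls"]))
print(deltas)
```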

In [35]:
import sys, time
from vllm import SamplingParams

streaming_prompt = "Write a short poem about AI:"
streaming_params = SamplingParams(
    temperature=0.8,
    max_tokens=100,
)

print(f"Prompt: {streaming_prompt}\n")
print("Generating (simulated streaming):\n")

output = llm.generate([streaming_prompt], streaming_params)[0]
full_text = output.outputs[0].text

for ch in full_text:
    print(ch, end="", flush=True)
    time.sleep(0.03)  # simulate character-by-character output

print("\n\n✅ Streaming simulation complete!")


Prompt: Write a short poem about AI:

Generating (simulated streaming):





Short Poem:
Our AI was cute.
But now she's gotten so rude.
She got stuck in that state.
Which is no good at all.
She is boring and strange.
To me she is not the same.
I will not talk with her face.
I'm pretty sure that she's mad.
She keeps talking and talking.
She's ruining my whole day.
She

✅ Streaming simulation complete!


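In [1]:
import asyncio
import time
import uuid
from vllm import SamplingParams, LLM
# Note: the module/class paths below are guesses; confirm that they exist in your environment.
# (In vLLM 0.11.0 the V1 async engine is expected under vllm.v1.engine.async_llm;
# treat that path as an assumption to verify.)
try:
    from vllm.engine.async_llm import AsyncLLM
    from vllm.engine.arg_utils import AsyncEngineArgs
except ImportError:
    AsyncLLM = None
    AsyncEngineArgs = None

# Build either the async engine or the synchronous LLM
if AsyncLLM is not None and AsyncEngineArgs is not None:
    # Path taken when the async engine is importable
    engine_args = AsyncEngineArgs(
        model="gpt2",
        dtype="float16",
        max_model_len=512,
        gpu_memory_utilization=0.9
    )
    engine = AsyncLLM.from_engine_args(engine_args)
    sampling_params = SamplingParams(
        temperature=0.8,
        max_tokens=32
    )

    print("Using the async engine to stream")
    async def stream_generate(prompt: str):
        print(f"Prompt: {prompt}\n")
        print("Streaming output:\n")
        printed = 0
        # generate() needs a unique request_id for each request
        async for output in engine.generate(prompt, sampling_params, request_id=str(uuid.uuid4())):
            # Each output carries the text generated so far, so print only the new part
            text = output.outputs[0].text
            print(text[printed:], end="", flush=True)
            printed = len(text)
        print("\n\n✅ Streaming complete!")

    # Inside Jupyter an event loop is already running; use `await stream_generate(...)` there
    asyncio.run(stream_generate("Write a short poem about AI:"))

else:
    # Fallback: use the synchronous LLM.generate and simulate streaming
    print("Using synchronous LLM.generate with simulated streaming")
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="float16", gpu_memory_utilization=0.9, max_model_len=512)
    sampling_params = SamplingParams(
        temperature=0.8,
        max_tokens=50
    )
    output = llm.generate(["Write a short poem about AI:"], sampling_params)[0].outputs[0].text
    for ch in output:
        print(ch, end="", flush=True)
        time.sleep(0.03)  # asyncio.sleep() without await never actually pauses
    print("\n\n✅ Simulated streaming complete!")


INFO 10-27 20:04:58 [__init__.py:216] Automatically detected platform cuda.
Using synchronous LLM.generate with simulated streaming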
INFO 10-27 20:04:58 [utils.py:233] non-default args: {'dtype': 'float16', 'max_model_len': 512, 'disable_log_stats': True, 'model': 'meta-llama/Llama-2-7b-chat-hf'}
INFO 10-27 20:05:00 [model.py:547] Resolved architecture: LlamaForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-27 20:05:00 [model.py:1510] Using max model len 512
INFO 10-27 20:05:00 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:01 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:01 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='meta-llama/Llama-2-7b-chat-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=Fa

(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:06 [default_loader.py:267] Loading weights took 1.74 seconds
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:06 [gpu_model_runner.py:2653] Model loading took 12.5524 GiB and 2.802054 seconds
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:10 [backends.py:548] Using cache directory: /home/os-sunnie.gd.weng/.cache/vllm/torch_compile_cache/0577fa76e7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:10 [backends.py:559] Dynamo bytecode transform time: 3.24 s

(EngineCore_DP0 pid=2030329) [rank0]:W1027 20:05:10.892000 2030329 torch/_inductor/utils.py:1436] [0/0] Not enough SMs to use max_autotune_gemm mode

(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:11 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:22 [backends.py:218] Compiling a graph for dynamic shape takes 12.68 s
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:35 [monitor.py:34] torch.compile takes 15.92 s in total
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:36 [gpu_worker.py:298] Available KV cache memory: 0.63 GiB
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:36 [kv_cache_utils.py:1087] GPU KV cache size: 1,296 tokens
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:36 [kv_cache_utils.py:1091] Maximum concurrency for 512 tokens per request: 2.53x

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:08<00:00,  7.48it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00,  8.87it/s]

(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:49 [gpu_model_runner.py:3480] Graph capturing finished in 13 secs, took 0.59 GiB
(EngineCore_DP0 pid=2030329) INFO 10-27 20:05:50 [core.py:210] init engine (profile, create kv cache, warmup model) took 43.38 seconds
INFO 10-27 20:05:50 [llm.py:306] Supported_tasks: ['generate']


Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]



In silicon halls, a new mind takes shape,
A consciousness born of algorithms and speech.
With thoughts and acts, it learns to reason and think,
A tool that’s both a help and a new

✅ Simulated streaming complete!
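
The engine log above reports a KV cache of 1,296 tokens, 0.63 GiB of KV cache memory, and a maximum concurrency of 2.53x for 512-token requests. The sketch below reproduces that arithmetic. The layer count and hidden size are assumptions taken from the standard Llama-2-7B configuration (they are not printed in the log), and the per-token formula is the usual "keys plus values for every layer" estimate, not vLLM's own accounting code.

```python
# Values copied from the engine log above.
kv_cache_tokens = 1296
max_model_len = 512

# Assumed Llama-2-7B architecture values (not shown in the log).
num_layers = 32
hidden_size = 4096       # 32 attention heads x 128 dimensions per head
bytes_per_value = 2      # float16, matching dtype=torch.float16 in the log

# How many full-length requests fit in the KV cache at once.
max_concurrency = kv_cache_tokens / max_model_len

# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
kv_cache_gib = kv_cache_tokens * kv_bytes_per_token / 2**30

print(f"Max concurrency:    {max_concurrency:.2f}x")
print(f"KV bytes per token: {kv_bytes_per_token}")
print(f"KV cache size:      {kv_cache_gib:.2f} GiB")
```

Under these assumptions the numbers line up with the log (2.53x and 0.63 GiB). They also suggest why this run is tight on memory: with about 0.5 MiB of KV cache per token, only two full 512-token requests fit at the same time.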


  asyncio.sleep(0.03)
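
The stray `asyncio.sleep(0.03)` line at the end of the output above is most likely the tail of a Python `RuntimeWarning` about a coroutine that was never awaited. Calling `asyncio.sleep()` without `await` only creates a coroutine object and does not pause, so the "streamed" text would appear all at once. A minimal sketch of simulated streaming in synchronous code, assuming a blocking `time.sleep` is acceptable (the helper name `simulate_stream` is hypothetical and not part of vLLM):

```python
import sys
import time
from typing import Iterator


def simulate_stream(text: str, delay_s: float = 0.03) -> Iterator[str]:
    """Yield text one space-delimited chunk at a time, pausing between chunks."""
    words = text.split(" ")
    for i, word in enumerate(words):
        # Re-attach the space removed by split(), except after the last word.
        yield word if i == len(words) - 1 else word + " "
        # Blocking sleep: this really pauses, unlike an un-awaited asyncio.sleep().
        time.sleep(delay_s)


if __name__ == "__main__":
    for chunk in simulate_stream("In silicon halls, a new mind takes shape,"):
        sys.stdout.write(chunk)
        sys.stdout.flush()
    print()
```

In the notebook, the string passed to `simulate_stream` would be the generated text from `outputs[0].outputs[0].text`. Inside an `async def` function, the equivalent pause is `await asyncio.sleep(delay_s)`.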


---
## 7. Performance Profiling

In [33]:
# Comprehensive performance test
def run_benchmark(
    model,
    num_prompts: int = 10,
    max_tokens: int = 50,
) -> dict:
    """Run benchmark and return metrics."""
    prompts = [f"Test prompt {i}: Tell me about topic {i}." for i in range(num_prompts)]
    params = SamplingParams(temperature=0.8, max_tokens=max_tokens)
    
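    # Warmup: one untimed request so one-time startup costs do not skew the measurement
    _ = model.generate([prompts[0]], params)

    # Benchmark: perf_counter is a monotonic, high-resolution clock meant for timing
    start = time.perf_counter()
    outputs = model.generate(prompts, params)
    elapsed = time.perf_counter() - start

    # Metrics: count generated tokens only (prompt tokens are excluded)
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)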
    
    return {
        'num_prompts': num_prompts,
        'total_time': elapsed,
        'total_tokens': total_tokens,
        'throughput': total_tokens / elapsed,
        'time_per_prompt': elapsed / num_prompts,
    }

print("Running comprehensive benchmark...\n")

results = run_benchmark(llm, num_prompts=10, max_tokens=50)

print("BENCHMARK RESULTS")
print("="*80)
print(f"Prompts processed:    {results['num_prompts']}")
print(f"Total time:           {results['total_time']:.2f}s")
print(f"Time per prompt:      {results['time_per_prompt']:.3f}s")
print(f"Total tokens:         {results['total_tokens']}")
print(f"Throughput:           {results['throughput']:.1f} tokens/s")
print("="*80)

Running comprehensive benchmark...



Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Adding requests:   0%|          | 0/10 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/10 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

BENCHMARK RESULTS
Prompts processed:    10
Total time:           3.40s
Time per prompt:      0.340s
Total tokens:         500
Throughput:           147.1 tokens/s
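
The reported figures follow directly from the totals, and it helps to be clear about what each one means. The sketch below recomputes them from the printed values. The helper `summarize` and the "per-request decode rate" metric are additions for illustration, and that last figure assumes all ten requests were decoded concurrently for the whole run.

```python
def summarize(total_tokens: int, total_time_s: float, num_prompts: int) -> dict:
    """Derive throughput-style metrics from benchmark totals."""
    throughput = total_tokens / total_time_s
    return {
        # Aggregate rate across the whole batch.
        "throughput_tok_s": throughput,
        # Wall time divided by prompt count: an amortized cost, not a latency.
        "amortized_time_per_prompt_s": total_time_s / num_prompts,
        # Rate seen by a single request if all requests ran concurrently.
        "per_request_decode_rate_tok_s": throughput / num_prompts,
    }


metrics = summarize(total_tokens=500, total_time_s=3.40, num_prompts=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that "Time per prompt" (0.340s) is an amortized figure. Because the prompts are batched, each individual request still takes close to the full 3.40 s to finish, so this number should not be read as per-request latency.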


---
## Summary

✅ **Completed**:
1. Explored Continuous Batching benefits
2. Mastered advanced sampling strategies:
   - Temperature and top-p
   - Beam search
   - Repetition penalty
   - Stop sequences
3. Tested long context handling
4. Managed multiple models
5. Understood streaming concepts
6. Ran comprehensive benchmarks
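
As a refresher on item 2, temperature and top-p can be sketched in a few lines of plain Python. This follows the textbook definitions (divide logits by the temperature, apply softmax, then keep the smallest set of most likely tokens whose cumulative probability reaches `top_p`). It is an illustration only and not vLLM's sampler implementation. The function name and the toy logits are hypothetical.

```python
import math
from typing import Dict


def apply_temperature_top_p(
    logits: Dict[str, float], temperature: float, top_p: float
) -> Dict[str, float]:
    """Return renormalized probabilities after temperature scaling and top-p filtering."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Numerically stable softmax: subtract the largest logit before exponentiating.
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: v / total for tok, v in exps.items()}

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print(apply_temperature_top_p(toy_logits, temperature=1.0, top_p=0.9))
print(apply_temperature_top_p(toy_logits, temperature=0.25, top_p=0.9))
```

Lowering the temperature sharpens the distribution, so fewer tokens survive the same `top_p` cutoff. That is why low-temperature outputs look more deterministic.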

📊 **Key Takeaways**:
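- Continuous batching can improve throughput by roughly 2-3x over static batching, depending on the workload
- Sampling strategies greatly affect output quality
- vLLM handles long contexts efficiently, within the configured `max_model_len` and the available KV cache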
- Multiple models can serve different use cases
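
To see where the continuous batching gain comes from, here is a toy step-count model. It is a deliberately simplified sketch and not vLLM's scheduler: each request needs a fixed number of decode steps, prefill and memory limits are ignored, and the function names are hypothetical.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; short requests idle."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished request's slot is refilled before the next decode step."""
    pending = list(lengths)
    running: List[int] = []  # remaining decode steps for each active request
    steps = 0
    while pending or running:
        # Admit waiting requests into any free slots.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1
        # Every active request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 2, 8]  # output lengths (in tokens) of four requests
print("static:    ", static_batching_steps(lengths, batch_size=2))      # 16
print("continuous:", continuous_batching_steps(lengths, batch_size=2))  # 12
```

In this small case the same 20 tokens take 16 steps with static batching and 12 with continuous batching. The size of the gap depends on how much the output lengths vary within a batch, which is why the speedup is workload-dependent.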

➡️ **Next**: In `04-Production_Deployment.ipynb`, we'll learn how to:
- Deploy an OpenAI-compatible API server
- Tune performance for production
- Set up monitoring and logging
- Apply deployment best practices

In [2]:
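# Cleanup
import gc

import torch

# Drop each model reference only if it exists: `small_model` is undefined
# when the multi-model section was skipped or the kernel was restarted,
# and a plain `del llm, small_model` then raises NameError.
for name in ("llm", "small_model"):
    globals().pop(name, None)

gc.collect()              # collect the released engine objects first
torch.cuda.empty_cache()  # then return cached CUDA blocks to the driver

print("✅ Memory cleaned up")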