# Lab-2.1 Part 3: Advanced Features

## Objectives
- Understand Continuous Batching
- Master advanced sampling strategies
- Handle long context inputs
- Manage multiple models

## Estimated Time: 60-90 minutes

---
## 1. Setup

In [None]:
# Imports
import vllm  # needed for vllm.__version__ below
from vllm import LLM, SamplingParams
import torch
import time
import numpy as np
import matplotlib.pyplot as plt
from typing import List
import asyncio

print(f"vLLM: {vllm.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Load model
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
# MODEL_NAME = "facebook/opt-1.3b"  # Alternative

print(f"Loading {MODEL_NAME}...")
llm = LLM(
    model=MODEL_NAME,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=2048,
    trust_remote_code=True,
)
print("✅ Model loaded")

---
## 2. Continuous Batching

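vLLM's key feature is dynamic request scheduling: the batch is re-formed at every decoding step, so finished requests leave immediately and waiting requests take their slots.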

### Traditional Static Batching Problem

```
Batch: [Req1, Req2, Req3, Req4]

Req1: ████████░░░░░░░░░░░░ (done at step 8, waits)
Req2: ████████████░░░░░░░░ (done at step 12, waits)
Req3: ██████████████████░░ (done at step 18, waits)
Req4: ████████████████████ (done at step 20)
       └── Must wait for slowest request ──┘

Idle slot-steps: 22 of 80 (~28%)
```

### Continuous Batching Solution

```
Req1: ████████              (done, removed immediately)
Req5:         ██████        (new request added)
Req2: ████████████          (done, removed)
Req6:             ████      (new request added)
Req3: ██████████████████    (done, removed)
Req4: ████████████████████

Throughput: often 2-3x higher (workload-dependent)
```
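
The two diagrams above can be turned into a small step-counting simulation. This is a minimal sketch, not vLLM's scheduler: it assumes one token per request per step, a fixed number of batch slots, and ignores prefill cost. The function names and the request lengths (taken from the diagrams: Req1-Req6) are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_steps(lengths, batch_size):
    """Steps needed when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before each decode step
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now and free their slot
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [8, 12, 18, 20, 6, 4]  # decode steps for Req1..Req6 in the diagrams
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

With these six requests and four slots, static batching needs 26 steps and continuous batching needs 20, because Req5 and Req6 run in slots freed by Req1 and Req2.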

In [None]:
# Simulate continuous batching with varied length requests
varied_prompts = [
    "Hi",  # Very short
    "What is Python?",  # Short
    "Explain machine learning in detail:",  # Medium
    "Write a comprehensive guide about artificial intelligence, covering history, techniques, and applications:",  # Long
]

# Different max_tokens for each
varied_params = [
    SamplingParams(max_tokens=10, temperature=0.8),
    SamplingParams(max_tokens=30, temperature=0.8),
    SamplingParams(max_tokens=100, temperature=0.8),
    SamplingParams(max_tokens=200, temperature=0.8),
]

print("Testing with varied-length requests...\n")
start = time.time()

# Pass one SamplingParams per prompt so each request stops at its own max_tokens;
# vLLM schedules them together with continuous batching
outputs = llm.generate(varied_prompts, varied_params)

elapsed = time.time() - start

for i, output in enumerate(outputs):
    tokens = len(output.outputs[0].token_ids)
    print(f"Request {i+1}: {tokens:3d} tokens")

print(f"\nTotal time: {elapsed:.2f}s")
print("\n✅ Continuous batching handled varied lengths efficiently!")

### Measure TTFT and ITL

- **TTFT**: Time to First Token
- **ITL**: Inter-Token Latency
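
Both metrics can be computed from per-token arrival timestamps. The sketch below assumes you already have such timestamps, for example from a streaming client. The function name and the sample values are hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT and mean ITL from per-token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay between sending the request and receiving the first token
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens after the first
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

# Hypothetical timestamps: first token after 0.5 s, then one token every 0.25 s
ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {itl * 1000:.0f} ms")
```

TTFT is dominated by prompt processing (prefill), while ITL reflects the cost of each decode step, so the two usually need to be reported separately.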

In [None]:
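# LLM.generate returns only after the whole completion is finished, so TTFT and
# ITL cannot be separated here; that requires the streaming (async) API.
# This cell reports the end-to-end average, which folds prefill time into every token.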

test_prompt = "Explain quantum computing:"
test_params = SamplingParams(
    max_tokens=50,
    temperature=0.8,
)

print("Measuring generation latency...")
start = time.time()
output = llm.generate([test_prompt], test_params)[0]
total_time = time.time() - start

num_tokens = len(output.outputs[0].token_ids)
avg_token_latency = total_time / num_tokens

print(f"\nTotal time: {total_time:.3f}s")
print(f"Tokens: {num_tokens}")
print(f"Avg latency per token: {avg_token_latency*1000:.1f}ms")
print(f"Throughput: {num_tokens/total_time:.1f} tokens/s")

---
## 3. Advanced Sampling Strategies

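`SamplingParams` controls how vLLM picks each next token. This section covers temperature, top-p, multiple candidates, repetition penalty, and stop sequences.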

### 3.1 Temperature and Top-p Sampling

In [None]:
prompt = "The future of artificial intelligence will"

# Different temperature values
temperatures = [0.1, 0.5, 0.8, 1.2]

print("Testing different temperatures:\n")
print("="*80)

for temp in temperatures:
    params = SamplingParams(
        temperature=temp,
        max_tokens=30,
    )
    
    output = llm.generate([prompt], params)[0]
    text = output.outputs[0].text
    
    print(f"Temperature {temp:.1f}:")
    print(f"  {text}")
    print()

print("="*80)
print("\n💡 Lower temperature = more deterministic")
print("💡 Higher temperature = more creative/random")
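
To see why temperature has this effect, here is a minimal sketch of temperature-scaled softmax in plain Python. It illustrates the standard formula rather than vLLM's internal code, and the logits are hypothetical values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.1, 0.8, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability mass sits on the highest-scoring token. At higher temperatures the distribution flattens, so less likely tokens are sampled more often.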

In [None]:
# Top-p (nucleus sampling)
print("Testing different top_p values:\n")
print("="*80)

top_p_values = [0.5, 0.8, 0.95, 1.0]

for top_p in top_p_values:
    params = SamplingParams(
        temperature=0.8,
        top_p=top_p,
        max_tokens=30,
    )
    
    output = llm.generate([prompt], params)[0]
    text = output.outputs[0].text
    
    print(f"Top-p {top_p:.2f}:")
    print(f"  {text}")
    print()

print("="*80)
print("\n💡 Lower top_p = more focused on likely tokens")
print("💡 Higher top_p = more diverse vocabulary")
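
Nucleus sampling can be sketched as a filter over the next-token distribution: keep the most likely tokens until their cumulative probability reaches `top_p`, then renormalize. The function name and the distribution are illustrative, and this is not vLLM's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1
    return {index: p / total for index, p in kept}

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical next-token distribution
print(top_p_filter(probs, 0.5))   # only the top token survives
print(top_p_filter(probs, 0.75))  # the top two tokens survive
print(top_p_filter(probs, 1.0))   # nothing is filtered out
```

Sampling then draws from the renormalized survivors, which is why a low `top_p` keeps output focused on likely tokens.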

### 3.2 Beam Search

In [None]:
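# Note: this cell samples several independent candidates; it does not run beam search.
# `best_of=3` and `use_beam_search=False` matched the defaults for n=3 and are omitted;
# recent vLLM releases no longer accept `use_beam_search` in SamplingParams.
beam_params = SamplingParams(
    n=3,  # Return 3 independently sampled candidates
    temperature=0.8,
    max_tokens=50,
)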

print("Generating multiple candidates...\n")
outputs = llm.generate([prompt], beam_params)

print(f"Generated {len(outputs[0].outputs)} outputs:")
print("="*80)

for i, completion in enumerate(outputs[0].outputs):
    print(f"\nCandidate {i+1}:")
    print(completion.text)
    print(f"Cumulative logprob: {completion.cumulative_logprob:.2f}")

print("="*80)
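
For contrast with independent sampling, here is a minimal beam search over a toy model. Everything in it is hypothetical: `TOY_MODEL` is a hand-written table of next-token probabilities, not a real language model, and the sketch does not call vLLM.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the last token
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {},
    "y": {},
}

def beam_search(beam_width, steps):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(beam_width=1, steps=2)[0]
wide = beam_search(beam_width=2, steps=2)[0]
print("width 1:", greedy[0], round(math.exp(greedy[1]), 2))
print("width 2:", wide[0], round(math.exp(wide[1]), 2))
```

With a beam width of 1 the search commits to `a` and ends with a sequence of probability 0.3. With a width of 2 it also keeps `b` alive and finds `b x` with probability 0.36. Beam search is deterministic, whereas the cell above draws random samples.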

### 3.3 Repetition Penalty

In [None]:
# Repetition penalty to avoid repetitive text
repetition_params = SamplingParams(
    temperature=0.8,
    max_tokens=100,
    repetition_penalty=1.2,  # Penalize repetitions
)

long_prompt = "Machine learning is a field of artificial intelligence that"

print("Testing repetition penalty...\n")
output = llm.generate([long_prompt], repetition_params)[0]
print(output.outputs[0].text)
print("\n✅ Repetition penalty helps avoid repeated phrases")
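
A common way to implement a repetition penalty (the CTRL-style rule) is to lower the logit of every token that has already appeared. The sketch below shows that rule on plain Python lists. The function name and the logits are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Down-weight tokens that already appeared in the prompt or output."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        value = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it
        adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted

logits = [3.0, -1.5, 0.5]  # hypothetical scores for token ids 0, 1, 2
adjusted = apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=1.2)
print(adjusted)
```

A penalty of 1.0 leaves the logits unchanged, and values above 1.0 discourage repeats. Token 2 was never seen, so its score is untouched.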

### 3.4 Stop Sequences

In [None]:
# Stop generation at specific sequences
stop_params = SamplingParams(
    temperature=0.8,
    max_tokens=200,
    stop=["\n\n", "However", "In conclusion"],  # Stop strings (not included in the output by default)
)

prompt_with_stop = "Here are three benefits of exercise:\n1."

print("Testing stop sequences...\n")
output = llm.generate([prompt_with_stop], stop_params)[0]
print(f"Prompt: {prompt_with_stop}")
print(f"Generated: {output.outputs[0].text}")
print(f"\nStop reason: {output.outputs[0].finish_reason}")
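
Conceptually, stop sequences cut the generated text at the earliest match. The helper below is a hypothetical post-processing sketch of that behaviour. The real engine checks for stop strings during generation, so it also stops spending compute after the match.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest stop sequence and report which one matched."""
    best_pos, matched = len(text), None
    for stop in stop_sequences:
        pos = text.find(stop)
        if pos != -1 and pos < best_pos:
            best_pos, matched = pos, stop
    return text[:best_pos], matched

# Hypothetical model output for the exercise prompt
sample = " Better sleep\n2. Stronger heart\n\nHowever, rest matters too."
text, matched = truncate_at_stop(sample, ["\n\n", "However", "In conclusion"])
print(repr(text), repr(matched))
```

Here the blank line appears before "However", so it is the sequence that ends the output.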

---
## 4. Long Context Handling

Test vLLM with longer input contexts.

In [None]:
# Generate a long context
long_context = """The history of artificial intelligence began in antiquity with myths and stories 
of artificial beings endowed with intelligence. Modern AI research started in the 1950s, when 
researchers began to explore the possibility that human intelligence could be so precisely 
described that a machine could simulate it. The field was founded on the claim that a central 
property of humans, intelligence—the sapience of Homo sapiens—can be so precisely described 
that it can be simulated by a machine.

The early years of AI were marked by significant optimism. Researchers believed that machines 
would soon be able to perform any task that a human could. However, progress was slower than 
expected, and the field experienced several periods known as AI winters, during which funding 
and interest declined.

In the 21st century, AI has experienced a renaissance, driven by advances in machine learning, 
particularly deep learning. Neural networks with many layers have proven remarkably effective 
at tasks like image recognition, natural language processing, and game playing.

Based on the above history, answer: What caused the AI renaissance in the 21st century?"""

print(f"Context length: {len(long_context.split())} words\n")

long_context_params = SamplingParams(
    temperature=0.7,
    max_tokens=100,
)

print("Processing long context...\n")
start = time.time()
output = llm.generate([long_context], long_context_params)[0]
elapsed = time.time() - start

print(f"Answer: {output.outputs[0].text}")
print(f"\nProcessing time: {elapsed:.2f}s")

### Test Maximum Context Length

In [None]:
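# Test a long prompt against max_model_len (2048 tokens)
# 9 words per sentence x 100 repetitions = 900 words
repeated_text = "The quick brown fox jumps over the lazy dog. " * 100

max_context_prompt = repeated_text + "\n\nSummarize the above text:"

# Count tokens with the model's own tokenizer instead of guessing from the word count
prompt_token_count = len(llm.get_tokenizer().encode(max_context_prompt))
print(f"Testing with a {len(max_context_prompt.split())}-word context...")
print(f"Prompt tokens: {prompt_token_count}\n")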

try:
    start = time.time()
    output = llm.generate([max_context_prompt], long_context_params)[0]
    elapsed = time.time() - start
    
    print(f"✅ Success!")
    print(f"Processing time: {elapsed:.2f}s")
    print(f"Output: {output.outputs[0].text[:200]}...")
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Context might be too long for current max_model_len setting")
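
Before sending a long prompt, it helps to check the token budget. This sketch assumes that prompt tokens and generated tokens share one window of `max_model_len` tokens. Both helper names are hypothetical.

```python
def fits_context(prompt_tokens, max_new_tokens, max_model_len):
    """Return True when the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_new_tokens <= max_model_len

def max_prompt_tokens(max_new_tokens, max_model_len):
    """Largest prompt length that still leaves room for max_new_tokens."""
    return max(0, max_model_len - max_new_tokens)

# Values from this lab: max_model_len=2048 and max_tokens=100
print(fits_context(1500, 100, 2048))
print(fits_context(2000, 100, 2048))
print(max_prompt_tokens(100, 2048))
```

The second call returns `False`: a 2000-token prompt leaves only 48 tokens of room, so a 100-token answer would be cut short.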

---
## 5. Multi-Model Management

Load and switch between multiple models.

In [None]:
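# Note: each LLM instance reserves its own gpu_memory_utilization share of the GPU,
# so the shares of all loaded models must sum to less than 1.0. If the main model
# was loaded with 0.9, this 0.2 request can fail; reload the main model with a
# lower value (for example 0.7) before running this cell.
# For demonstration, we'll load one small model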

print("Loading a second model (GPT-2)...")
small_model = LLM(
    model="gpt2",
    gpu_memory_utilization=0.2,
    max_model_len=512,
)
print("✅ Second model loaded")

In [None]:
# Compare outputs from different models
comparison_prompt = "The future of AI is"
comparison_params = SamplingParams(
    temperature=0.8,
    max_tokens=50,
)

print("Comparing model outputs:\n")
print("="*80)

# Large model
large_output = llm.generate([comparison_prompt], comparison_params)[0]
print(f"Llama-2-7B:")
print(f"  {large_output.outputs[0].text}")
print()

# Small model
small_output = small_model.generate([comparison_prompt], comparison_params)[0]
print(f"GPT-2 (124M):")
print(f"  {small_output.outputs[0].text}")

print("="*80)
print("\n💡 Larger models generally produce more coherent outputs")

### Model Selection Strategy

In [None]:
def select_model(prompt: str, complexity: str = "auto"):
    """
    Select model based on task complexity.
    
    Args:
        prompt: Input prompt
        complexity: 'simple', 'complex', or 'auto'
    """
    if complexity == "auto":
        # Simple heuristic: check prompt length and keywords
        if len(prompt.split()) > 50 or any(kw in prompt.lower() for kw in 
                                            ['explain', 'analyze', 'complex', 'detail']):
            complexity = "complex"
        else:
            complexity = "simple"
    
    if complexity == "complex":
        return llm, "Llama-2-7B"
    else:
        return small_model, "GPT-2"

# Test model selection
test_prompts = [
    "Hello, how are you?",
    "Explain the theory of relativity in detail:",
]

print("Testing automatic model selection:\n")

for prompt in test_prompts:
    model, model_name = select_model(prompt)
    output = model.generate([prompt], comparison_params)[0]
    
    print(f"Prompt: {prompt}")
    print(f"Selected: {model_name}")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print()

---
## 6. Streaming Output (Conceptual)

vLLM supports streaming for real-time token generation.

### Streaming with AsyncLLMEngine

For production streaming, use `AsyncLLMEngine`:

```python
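import uuid
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Note: this loads the model again and needs free GPU memory
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=MODEL_NAME))
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)

async def stream_generate(prompt: str):
    # generate() needs a unique request_id for every request
    async for output in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
        yield output  # RequestOutput with the text generated so far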
```

This enables:
- Real-time token output (typewriter effect)
- Lower perceived latency
- Better user experience
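
The typewriter effect comes from printing only the newly generated piece at each step. The sketch below assumes each streamed output carries the full text so far. `fake_stream` is a hypothetical stand-in for a real streaming engine, so this runs without a GPU.

```python
import asyncio

async def fake_stream(chunks, delay=0.0):
    """Stand-in for a streaming engine: yields the cumulative text so far."""
    text = ""
    for chunk in chunks:
        await asyncio.sleep(delay)
        text += chunk
        yield text

async def collect_deltas(stream):
    """Turn cumulative snapshots into the newly added piece at each step."""
    deltas, previous = [], ""
    async for snapshot in stream:
        deltas.append(snapshot[len(previous):])
        previous = snapshot
    return deltas

deltas = asyncio.run(collect_deltas(fake_stream(["Roses", " are", " red"])))
print(deltas)
```

Inside a Jupyter cell an event loop is already running, so use `await collect_deltas(...)` there instead of `asyncio.run(...)`.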

In [None]:
# Simulate streaming by revealing the finished text progressively

streaming_prompt = "Write a short poem about AI:"
streaming_params = SamplingParams(
    temperature=0.8,
    max_tokens=100,
)

print(f"Prompt: {streaming_prompt}\n")
print("Generating (simulated streaming):\n")

output = llm.generate([streaming_prompt], streaming_params)[0]
full_text = output.outputs[0].text

# Reveal the text word by word (real streaming emits tokens, not words)
words = full_text.split()
displayed = ""

for word in words:
    displayed += word + " "
    print(f"\r{displayed}", end="", flush=True)
    time.sleep(0.1)  # Simulate generation delay

print("\n\n✅ Streaming simulation complete!")

---
## 7. Performance Profiling

In [None]:
# Comprehensive performance test
def run_benchmark(
    model,
    num_prompts: int = 10,
    max_tokens: int = 50,
) -> dict:
    """Run benchmark and return metrics."""
    prompts = [f"Test prompt {i}: Tell me about topic {i}." for i in range(num_prompts)]
    params = SamplingParams(temperature=0.8, max_tokens=max_tokens)
    
    # Warmup
    _ = model.generate([prompts[0]], params)
    
    # Benchmark
    start = time.time()
    outputs = model.generate(prompts, params)
    elapsed = time.time() - start
    
    # Metrics
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    
    return {
        'num_prompts': num_prompts,
        'total_time': elapsed,
        'total_tokens': total_tokens,
        'throughput': total_tokens / elapsed,
        'time_per_prompt': elapsed / num_prompts,
    }

print("Running comprehensive benchmark...\n")

results = run_benchmark(llm, num_prompts=10, max_tokens=50)

print("BENCHMARK RESULTS")
print("="*80)
print(f"Prompts processed:    {results['num_prompts']}")
print(f"Total time:           {results['total_time']:.2f}s")
print(f"Time per prompt:      {results['time_per_prompt']:.3f}s")
print(f"Total tokens:         {results['total_tokens']}")
print(f"Throughput:           {results['throughput']:.1f} tokens/s")
print("="*80)

---
## Summary

✅ **Completed**:
1. Explored Continuous Batching benefits
2. Mastered advanced sampling strategies:
   - Temperature and top-p
   - Multiple sampled candidates (`n`)
   - Repetition penalty
   - Stop sequences
3. Tested long context handling
4. Managed multiple models
5. Understood streaming concepts
6. Ran comprehensive benchmarks

📊 **Key Takeaways**:
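- Continuous batching raises throughput by refilling batch slots as requests finish; the 2-3x figure quoted above depends on how much request lengths vary
- Sampling strategies greatly affect output quality
- Prompt and generated tokens must fit together within `max_model_len` (2048 in this lab)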
- Multiple models can serve different use cases

➡️ **Next**: In `04-Production_Deployment.ipynb`, we'll learn:
- Deploy OpenAI-compatible API server
- Performance tuning for production
- Monitoring and logging
- Deployment best practices

In [None]:
# Cleanup
import gc

del llm, small_model
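gc.collect()  # Collect the deleted objects first so their tensors can be freed
torch.cuda.empty_cache()  # Then return cached blocks to the GPU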

print("✅ Memory cleaned up")