# Performance Benchmarking for vLLM Optimization Techniques

This notebook demonstrates how to benchmark the performance of different vLLM optimization techniques we've explored in the previous notebooks. We'll compare:

1. Baseline vLLM performance
2. vLLM with KV cache offloading
3. vLLM with remote shared KV cache

## Why Benchmark Performance?

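Benchmarking shows the trade-offs between the optimization techniques, so we can choose the approach that best fits our use case. The diagram below groups the metrics that usually matter; the benchmark cells in this notebook measure throughput and end-to-end latency only.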

```
┌─────────────────────────────────────────────────────────────┐
│                 Performance Metrics Overview                │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Throughput          │      │ Latency                 │   │
│  │                     │      │                         │   │
│  │ - Tokens per second │      │ - Time to first token   │   │
│  │ - Requests per      │      │ - Inter-token latency   │   │
│  │   second            │      │ - End-to-end response   │   │
│  │ - Concurrent users  │      │   time                  │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │           Memory Usage                              │    │
│  │                                                     │    │
│  │ - GPU memory utilization                            │    │
│  │ - CPU memory utilization                            │    │
│  │ - Memory efficiency (tokens/GB)                     │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
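
The three latency metrics in the diagram can all be derived from per-token arrival times. The following is a minimal sketch, assuming you have already recorded the time the request was sent and one timestamp per received token. The function `latency_metrics` and its argument names are illustrative and are not part of vLLM.

```python
from statistics import mean

def latency_metrics(request_start, token_times):
    """Derive latency metrics from timestamps (all values in seconds).

    request_start: time the request was sent.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")

    # Time to first token: delay between sending and the first token arriving.
    ttft = token_times[0] - request_start
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    # End-to-end response time: delay until the last token arrives.
    e2e = token_times[-1] - request_start

    return {
        "time_to_first_token": ttft,
        "inter_token_latency": itl,
        "end_to_end": e2e,
        "tokens_per_second": len(token_times) / e2e if e2e > 0 else 0.0,
    }

# Hand-picked timestamps, for illustration only.
example = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(example)
```

Keeping the calculation separate from the HTTP call makes it easy to check by hand before pointing it at a live server.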

Let's get started with our benchmarking!

## 1. Setting Up the Benchmarking Environment

First, let's make sure we have all the necessary tools installed for benchmarking. We'll use the same Mistral-7B-Instruct-v0.3 model that we've been using throughout the tutorials.

In [None]:
# Install required packages
!pip install -q matplotlib pandas numpy requests tqdm

Let's define some utility functions for our benchmarking:

In [None]:
import time
import json
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor

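# Function to measure end-to-end latency for a single (non-streaming) request.
# Note: `endpoint` is the local port the vLLM server is listening on.
def measure_latency(endpoint, prompt, max_tokens=50, temperature=0.7):
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature
    }

    # perf_counter() is a monotonic clock, better suited to timing than time.time()
    start_time = time.perf_counter()
    response = requests.post(f"http://localhost:{endpoint}/v1/completions", json=payload)
    elapsed = time.perf_counter() - start_time

    if response.status_code == 200:
        result = response.json()
        # Prefer the token count reported by the server; fall back to a
        # whitespace word count, which only approximates the token count.
        tokens_generated = (result.get('usage') or {}).get('completion_tokens')
        if tokens_generated is None:
            choices = result.get('choices') or [{}]
            tokens_generated = len(choices[0].get('text', '').split())
        return {
            'latency': elapsed,
            'tokens_generated': tokens_generated,
            'tokens_per_second': tokens_generated / elapsed if tokens_generated > 0 else 0
        }
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return {'latency': None, 'tokens_generated': 0, 'tokens_per_second': 0}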

# Function to measure throughput with concurrent requests
def measure_throughput(endpoint, prompt, num_concurrent=10, max_tokens=50):
    def make_request(_):
        return measure_latency(endpoint, prompt, max_tokens)
    
    start_time = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_concurrent) as executor:
        results = list(executor.map(make_request, range(num_concurrent)))
    total_time = time.perf_counter() - start_time

    total_tokens = sum(r['tokens_generated'] for r in results)
    # Failed requests report a latency of None, so they are excluded here
    latencies = [r['latency'] for r in results if r['latency'] is not None]

    return {
        'total_time': total_time,
        'total_tokens': total_tokens,
        'tokens_per_second': total_tokens / total_time if total_tokens > 0 else 0,
        # Count only the requests that completed successfully
        'requests_per_second': len(latencies) / total_time,
        # NaN (rather than a NumPy warning) when every request failed
        'average_latency': float(np.mean(latencies)) if latencies else float('nan')
    }

# Function to plot benchmark results
def plot_benchmark_results(results, metric='tokens_per_second', title='Throughput Comparison'):
    plt.figure(figsize=(12, 6))
    
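    # Extract data for plotting. Each value may be a plain number or a
    # dict of metrics, in which case we pick out the requested metric.
    configs = list(results.keys())
    values = [results[config][metric] if isinstance(results[config], dict) else results[config]
              for config in configs]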
    
    # Create bar chart
    bars = plt.bar(configs, values, color=['#3498db', '#2ecc71', '#e74c3c'])
    
    # Add values on top of bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                f'{height:.2f}', ha='center', va='bottom')
    
    plt.title(title)
    plt.ylabel(metric.replace('_', ' ').title())
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.tight_layout()
    
    return plt
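
`measure_latency` times the whole request, so it cannot report time to first token. One way to get per-chunk timings is to request a streamed response and timestamp each chunk as it arrives. The sketch below assumes the server emits OpenAI-style server-sent events (`data: {...}` lines, ending with `data: [DONE]`) when the payload includes `"stream": True`. The helper name `collect_chunk_times` and its injectable `clock` parameter are hypothetical and are not part of vLLM or `requests`.

```python
import json
import time

def collect_chunk_times(lines, clock=time.perf_counter):
    """Return (texts, times) for each streamed completion chunk.

    lines: iterable of raw lines (bytes or str) in server-sent-event form.
    clock: callable returning the current time in seconds.
    """
    texts, times = [], []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip keep-alive blanks and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or [{}]
        text = choices[0].get("text", "")
        if text:
            texts.append(text)
            times.append(clock())
    return texts, times

# Offline demonstration with canned lines and a fake clock.
fake_lines = [
    b'data: {"choices": [{"text": "Hello"}]}',
    b"",
    b'data: {"choices": [{"text": " world"}]}',
    b"data: [DONE]",
]
ticks = iter([0.25, 0.5])
texts, times = collect_chunk_times(fake_lines, clock=lambda: next(ticks))
print(texts, times)
```

Against a live server, you would pass `response.iter_lines()` from `requests.post(url, json=payload, stream=True)` as `lines`. A chunk may carry more than one token, so the gaps between timestamps are per chunk and only approximate inter-token latency.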

## 2. Benchmarking Baseline vLLM Performance

First, let's benchmark the baseline vLLM performance without any optimizations. Make sure you have the basic vLLM setup running from the first notebook.

In [None]:
# Define test prompts of different lengths
short_prompt = "What is artificial intelligence?"
medium_prompt = "Explain the concept of transformer models in natural language processing and how they have revolutionized the field."
long_prompt = """Write a comprehensive essay on the evolution of artificial intelligence from its early beginnings to the current state-of-the-art models. 
Include key milestones, breakthrough technologies, and discuss the ethical implications of advanced AI systems in society.
Also touch on the future directions of AI research and potential applications in various industries."""

# Benchmark latency for different prompt lengths
print("Benchmarking baseline vLLM latency...")
baseline_latency = {
    'short': measure_latency(53936, short_prompt),
    'medium': measure_latency(53936, medium_prompt),
    'long': measure_latency(53936, long_prompt)
}

# Display results
for prompt_type, result in baseline_latency.items():
    print(f"{prompt_type.title()} prompt: {result['latency']:.3f}s, {result['tokens_per_second']:.2f} tokens/sec")
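
Each prompt length above is measured with a single request, which makes the numbers sensitive to warm-up effects and noise. Repeating the request and summarizing the samples usually gives a more stable picture. The following is a rough sketch; `summarize_latencies` is a hypothetical helper, and the nearest-rank method used for the 95th percentile is one of several common definitions.

```python
import statistics

def summarize_latencies(samples):
    """Summarize repeated latency measurements (seconds), skipping failures (None)."""
    valid = sorted(s for s in samples if s is not None)
    if not valid:
        return None
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = max(1, (95 * len(valid) + 99) // 100)
    return {
        "count": len(valid),
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
        "p95": valid[rank - 1],
        "stdev": statistics.stdev(valid) if len(valid) > 1 else 0.0,
    }

# Made-up samples, for illustration only; None stands for a failed request.
samples = [1.0, 1.5, None, 2.0, 4.0]
summary = summarize_latencies(samples)
print(summary)
```

With a server running, the samples could come from something like `[measure_latency(53936, short_prompt)['latency'] for _ in range(10)]`, ideally after discarding one or two warm-up requests.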

In [None]:
# Benchmark throughput with concurrent requests
print("Benchmarking baseline vLLM throughput...")
baseline_throughput = {}

for num_concurrent in [1, 5, 10]:
    print(f"Testing with {num_concurrent} concurrent requests...")
    baseline_throughput[num_concurrent] = measure_throughput(53936, medium_prompt, num_concurrent)
    print(f"Throughput: {baseline_throughput[num_concurrent]['tokens_per_second']:.2f} tokens/sec")
    print(f"Requests per second: {baseline_throughput[num_concurrent]['requests_per_second']:.2f}")
    print(f"Average latency: {baseline_throughput[num_concurrent]['average_latency']:.3f}s")
    print("---")

## 3. Benchmarking vLLM with KV Cache Offloading

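Now, let's benchmark vLLM with KV cache offloading to CPU. Make sure the KV cache offloading setup from the second notebook is running. The cells below send requests to the same port as the baseline (53936), so this configuration must be the one serving on that port.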

In [None]:
# Benchmark latency for different prompt lengths with KV cache offloading
print("Benchmarking vLLM with KV cache offloading latency...")
kv_offload_latency = {
    'short': measure_latency(53936, short_prompt),
    'medium': measure_latency(53936, medium_prompt),
    'long': measure_latency(53936, long_prompt)
}

# Display results
for prompt_type, result in kv_offload_latency.items():
    print(f"{prompt_type.title()} prompt: {result['latency']:.3f}s, {result['tokens_per_second']:.2f} tokens/sec")

In [None]:
# Benchmark throughput with concurrent requests for KV cache offloading
print("Benchmarking vLLM with KV cache offloading throughput...")
kv_offload_throughput = {}

for num_concurrent in [1, 5, 10]:
    print(f"Testing with {num_concurrent} concurrent requests...")
    kv_offload_throughput[num_concurrent] = measure_throughput(53936, medium_prompt, num_concurrent)
    print(f"Throughput: {kv_offload_throughput[num_concurrent]['tokens_per_second']:.2f} tokens/sec")
    print(f"Requests per second: {kv_offload_throughput[num_concurrent]['requests_per_second']:.2f}")
    print(f"Average latency: {kv_offload_throughput[num_concurrent]['average_latency']:.3f}s")
    print("---")

## 4. Benchmarking vLLM with Remote Shared KV Cache

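Finally, let's benchmark vLLM with remote shared KV cache. Make sure the remote shared KV cache setup from the third notebook is running. As before, the cells below send requests to port 53936, so this configuration must be the one serving on that port.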

In [None]:
# Benchmark latency for different prompt lengths with remote shared KV cache
print("Benchmarking vLLM with remote shared KV cache latency...")
shared_kv_latency = {
    'short': measure_latency(53936, short_prompt),
    'medium': measure_latency(53936, medium_prompt),
    'long': measure_latency(53936, long_prompt)
}

# Display results
for prompt_type, result in shared_kv_latency.items():
    print(f"{prompt_type.title()} prompt: {result['latency']:.3f}s, {result['tokens_per_second']:.2f} tokens/sec")

In [None]:
# Benchmark throughput with concurrent requests for remote shared KV cache
print("Benchmarking vLLM with remote shared KV cache throughput...")
shared_kv_throughput = {}

for num_concurrent in [1, 5, 10]:
    print(f"Testing with {num_concurrent} concurrent requests...")
    shared_kv_throughput[num_concurrent] = measure_throughput(53936, medium_prompt, num_concurrent)
    print(f"Throughput: {shared_kv_throughput[num_concurrent]['tokens_per_second']:.2f} tokens/sec")
    print(f"Requests per second: {shared_kv_throughput[num_concurrent]['requests_per_second']:.2f}")
    print(f"Average latency: {shared_kv_throughput[num_concurrent]['average_latency']:.3f}s")
    print("---")

## 5. Comparing Results

Now let's compare the results of our benchmarks to see how the different optimization techniques affect performance.

In [None]:
# Compare latency for medium prompt across configurations
latency_comparison = {
    'Baseline vLLM': baseline_latency['medium']['latency'],
    'KV Cache Offloading': kv_offload_latency['medium']['latency'],
    'Remote Shared KV Cache': shared_kv_latency['medium']['latency']
}

# Plot latency comparison
plt = plot_benchmark_results(latency_comparison, metric='latency', title='Latency Comparison (Medium Prompt)')
plt.ylabel('Latency (seconds)')
plt.show()

In [None]:
# Compare throughput for 5 concurrent requests across configurations
throughput_comparison = {
    'Baseline vLLM': baseline_throughput[5]['tokens_per_second'],
    'KV Cache Offloading': kv_offload_throughput[5]['tokens_per_second'],
    'Remote Shared KV Cache': shared_kv_throughput[5]['tokens_per_second']
}

# Plot throughput comparison
plt = plot_benchmark_results(throughput_comparison, metric='tokens_per_second', title='Throughput Comparison (5 Concurrent Requests)')
plt.show()
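
Bar charts make differences visible, but a percentage change against the baseline is often easier to quote. The following sketch computes it for dictionaries shaped like `latency_comparison` and `throughput_comparison`. The helper `relative_change` is hypothetical, and the numbers in the example are made up for illustration, not measured results.

```python
def relative_change(results, baseline_key="Baseline vLLM"):
    """Return the percent change of each configuration relative to the baseline."""
    baseline = results[baseline_key]
    if not baseline:
        raise ValueError("baseline value must be non-zero")
    return {
        name: 100.0 * (value - baseline) / baseline
        for name, value in results.items()
        if name != baseline_key
    }

# Made-up throughput values (tokens/sec), for illustration only.
example = {
    "Baseline vLLM": 200.0,
    "KV Cache Offloading": 180.0,
    "Remote Shared KV Cache": 250.0,
}
changes = relative_change(example)
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}% vs. baseline")
```

For throughput a positive value is an improvement, while for latency a positive value means the configuration is slower than the baseline.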

## 6. Long Context Performance

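One of the key benefits of KV cache offloading is the ability to handle longer contexts. Let's look at how latency changes with increasingly long contexts. Note that the cell below measures only the baseline endpoint and derives the other two curves from fixed multipliers, so treat those curves as placeholders until you run each configuration for real.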

In [None]:
# Function to generate prompts of different context lengths
def generate_context_prompt(base_prompt, repeat_times):
    return (base_prompt + " ") * repeat_times

# Base prompt
base_prompt = "The quick brown fox jumps over the lazy dog."

# Generate prompts of different lengths
context_lengths = [1, 10, 50, 100, 200]
context_prompts = {length: generate_context_prompt(base_prompt, length) for length in context_lengths}

# Test with different configurations
context_results = {
    'Baseline vLLM': [],
    'KV Cache Offloading': [],
    'Remote Shared KV Cache': []
}

# Note: In a real benchmark, you would run these tests against different configurations
# For demonstration purposes, we'll just use the same endpoint for all tests
for length in context_lengths:
    print(f"Testing with context length {length} (approximately {length * 9} tokens)...")
    prompt = context_prompts[length]
    
    # In a real benchmark, you would switch between different configurations
    # For now, we'll just use the same endpoint and add some simulated differences
    baseline_result = measure_latency(53936, prompt)
    context_results['Baseline vLLM'].append(baseline_result['latency'])
    
    # Simulate KV cache offloading (latency multiplier grows with context length).
    # Distinct variable names are used so the measured kv_offload_latency and
    # shared_kv_latency dicts from sections 3 and 4 are not overwritten.
    simulated_kv_offload = baseline_result['latency'] * (1.1 + 0.01 * length)
    context_results['KV Cache Offloading'].append(simulated_kv_offload)
    
    # Simulate remote shared KV cache (higher fixed overhead, slower growth with length)
    simulated_shared_kv = baseline_result['latency'] * (1.2 + 0.005 * length)
    context_results['Remote Shared KV Cache'].append(simulated_shared_kv)
    
    print(f"Baseline: {context_results['Baseline vLLM'][-1]:.3f}s")
    print(f"KV Cache Offloading: {context_results['KV Cache Offloading'][-1]:.3f}s")
    print(f"Remote Shared KV Cache: {context_results['Remote Shared KV Cache'][-1]:.3f}s")
    print("---")

In [None]:
# Plot context length vs. latency
plt.figure(figsize=(12, 6))

token_lengths = [length * 9 for length in context_lengths]  # Approximate token counts

plt.plot(token_lengths, context_results['Baseline vLLM'], 'o-', label='Baseline vLLM', color='#3498db')
plt.plot(token_lengths, context_results['KV Cache Offloading'], 's-', label='KV Cache Offloading', color='#2ecc71')
plt.plot(token_lengths, context_results['Remote Shared KV Cache'], '^-', label='Remote Shared KV Cache', color='#e74c3c')

plt.title('Latency vs. Context Length')
plt.xlabel('Context Length (tokens)')
plt.ylabel('Latency (seconds)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()
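
To put context length in terms of memory, it helps to estimate how much KV cache a single token occupies: one key and one value vector per layer and per KV head. The sketch below applies that formula with values assumed for a Mistral-7B-style model (32 layers, 8 KV heads, head dimension 128, 16-bit values). These numbers are assumptions to confirm against the model's `config.json`, and the estimate ignores any allocation overhead in the serving engine.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: a key and a value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Mistral-7B-style values; confirm against the model's config.json.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
tokens_per_gib = (1024 ** 3) // per_token

print(f"{per_token / 1024:.0f} KiB per token, about {tokens_per_gib} tokens per GiB")

# Approximate token counts for the shortest and the two longest prompts above.
for context_tokens in (9, 900, 1800):
    mib = context_tokens * per_token / 1024 ** 2
    print(f"{context_tokens:>5} tokens -> {mib:.1f} MiB of KV cache")
```

Under these assumptions the prompts used in this section need only a few hundred MiB of KV cache at most, so much longer contexts or many concurrent requests are needed before GPU memory becomes the limiting factor.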

## 7. Conclusion and Recommendations

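The points below summarize the trade-offs to expect from each vLLM optimization technique. Check them against the numbers you measured in sections 2 to 5; the long-context curves in section 6 are simulated, so they do not count as evidence either way.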

### Baseline vLLM
- **Pros**: Lowest latency for short contexts, simplest setup
- **Cons**: Limited by GPU memory, struggles with long contexts and high concurrency
- **Best for**: Applications with short prompts and responses, low concurrency requirements

### KV Cache Offloading
- **Pros**: Can handle longer contexts, better memory efficiency, good throughput
- **Cons**: Slightly higher latency than baseline for short contexts
- **Best for**: Applications with medium to long contexts, moderate concurrency

### Remote Shared KV Cache
- **Pros**: Best for high concurrency, fault tolerance, horizontal scaling
- **Cons**: Higher latency, more complex setup
- **Best for**: Production deployments that need high concurrency and fault tolerance

### Recommendations

```
┌─────────────────────────────────────────────────────────────┐
│                 Optimization Recommendations                │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ For Development/Testing                             │    │
│  │ - Use baseline vLLM for simplicity                  │    │
│  │ - Single GPU setup is sufficient                    │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ For Long Context Applications                       │    │
│  │ - Use KV cache offloading                           │    │
│  │ - Ensure sufficient CPU memory                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ For Production/High Concurrency                     │    │
│  │ - Use remote shared KV cache                        │    │
│  │ - Deploy multiple GPU nodes                         │    │
│  │ - Consider auto-scaling based on load               │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

By carefully selecting the right optimization technique based on your specific requirements, you can achieve the best balance of performance, resource efficiency, and cost.