# vLLM Production Stack with KV Cache Offloading on Brev.dev

This notebook guides you through setting up a vLLM production stack on Brev.dev, with a focus on KV cache offloading. We'll cover:

1. Setting up a Kubernetes environment with MicroK8s
2. Installing the vLLM production stack using Helm
3. Configuring KV Cache offloading to CPU
4. Setting up remote shared KV Cache storage
5. Testing and benchmarking the setup

## What is vLLM?

vLLM is a high-performance library for LLM inference and serving. It's designed to maximize throughput and minimize latency for LLM applications. Key features include:

- PagedAttention for efficient memory management
- Continuous batching to handle concurrent requests
- Optimized CUDA kernels for faster execution
- OpenAI-compatible API for easy integration

## What is KV Cache?

The KV (Key-Value) cache is a critical component in transformer-based language models:

- During generation, each new token attends to all earlier tokens in the sequence (the prompt plus everything generated so far)
- The KV cache stores the key and value tensors from previous tokens to avoid recomputation
- This significantly speeds up inference but consumes a lot of GPU memory
- As conversations get longer, the KV cache grows linearly with the sequence length

```
┌───────────────────┐
│     LLM Model     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    KV Cache in    │
│    GPU Memory     │◄── Memory bottleneck for long sequences
└───────────────────┘
```
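
To make the linear growth concrete, here is a back-of-the-envelope sketch of the KV cache size for a single sequence. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption chosen to resemble a Mistral-7B-style model; substitute the values from your model's configuration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Total KV cache bytes for a single sequence of seq_len tokens."""
    return seq_len * kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes
    )


# Assumed model shape: 32 layers, 8 KV heads, head dimension 128, fp16 values.
per_token = kv_cache_bytes_per_token(32, 8, 128)
total = kv_cache_bytes(16384, 32, 8, 128)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total / 2**30:.1f} GiB for one 16384-token sequence")
```

Under these assumptions a single long sequence already takes a noticeable share of GPU memory, and the total scales with the number of concurrent sequences.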

## Why Offload KV Cache?

KV cache offloading addresses several challenges:

- **Memory Efficiency**: Frees GPU memory that would otherwise hold KV tensors for inactive sequences
- **Longer Contexts**: Enables processing of much longer conversations
- **Cost Optimization**: Allows using smaller/fewer GPUs for the same workload
- **Higher Throughput**: Serves more concurrent users with the same hardware

Let's get started!

## 1. Environment Setup and Prerequisites

### What is Kubernetes and why use it for LLM deployment?

Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. For LLM deployments, Kubernetes offers several advantages:

- **Scalability**: Easily scale your LLM services up or down based on demand
- **Resource Management**: Efficiently allocate GPU and CPU resources across workloads
- **High Availability**: Ensure your LLM services remain available even if individual components fail
- **Declarative Configuration**: Define your entire LLM infrastructure as code

We'll use MicroK8s, a lightweight Kubernetes distribution that's easy to set up and use on a single machine.

First, let's check if we have GPU support available:

In [None]:
!nvidia-smi
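
If you would rather read the GPU memory figures from Python than from the table above, the following sketch queries `nvidia-smi` in CSV mode and parses the result. The helper names are made up for this example, and the parser assumes every memory field is a plain number of MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'name, total, used' CSV rows (memory in MiB) into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name does not break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed rows; empty list if it is unavailable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)


for gpu in query_gpus():
    free = gpu["total_mib"] - gpu["used_mib"]
    print(f"{gpu['name']}: {free} MiB free of {gpu['total_mib']} MiB")
```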

### 1.1 Install Required Tools

Let's create a script to install MicroK8s and set up our Kubernetes environment:

In [None]:
%%writefile setup_microk8s.sh
#!/bin/bash
# Function to check command status
check_status() {
    if [ $? -ne 0 ]; then
        echo "Error: $1 failed"
        exit 1
    fi
}

# Function to check if command exists
command_exists() {
    command -v "$1" >/dev/null 2>&1
}

# Check if nvidia-smi is available
echo "Checking NVIDIA GPU..."
if ! command_exists nvidia-smi; then
    echo "Warning: nvidia-smi not found. GPU support may not be available."
else
    nvidia-smi
    check_status "nvidia-smi"
fi

# Install MicroK8s
echo "Installing MicroK8s..."
sudo snap install microk8s --classic --channel=1.25/stable
check_status "MicroK8s installation"

# Add user to microk8s group
echo "Adding user to microk8s group..."
sudo usermod -a -G microk8s $USER
check_status "Adding user to microk8s group"

# Create and set permissions for .kube directory
echo "Setting up .kube directory..."
mkdir -p ~/.kube
chmod 0700 ~/.kube
sudo chown -f -R $USER ~/.kube
check_status "Setting up .kube directory"

# Wait for MicroK8s to be ready
echo "Waiting for MicroK8s to be ready..."
sudo microk8s status --wait-ready
check_status "MicroK8s ready check"

# Enable GPU and storage support
echo "Enabling GPU and hostpath-storage..."
sudo microk8s enable gpu hostpath-storage
check_status "Enabling MicroK8s addons"

# Double check status
echo "Checking final MicroK8s status..."
sudo microk8s status --wait-ready
check_status "Final MicroK8s status check"

# Set up Helm repositories
echo "Setting up Helm repositories..."
sudo microk8s helm repo remove nvidia || true  # Remove if exists
sudo microk8s helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
sudo microk8s helm repo update

# Run the cluster checks under the microk8s group if it is not active yet.
# 'sg' is called without 'exec' so the script continues to the final message.
echo "Activating microk8s group membership..."
if ! groups | grep -q microk8s; then
    sg microk8s -c '
        echo "Testing cluster access..."
        sudo microk8s kubectl get services
        sudo microk8s kubectl get nodes
        echo "Creating example nginx deployment..."
        sudo microk8s kubectl create deployment nginx --image=nginx
        echo "Checking pods..."
        sudo microk8s kubectl get pods
    '
else
    echo "Testing cluster access..."
    sudo microk8s kubectl get services
    sudo microk8s kubectl get nodes
    echo "Creating example nginx deployment..."
    sudo microk8s kubectl create deployment nginx --image=nginx
    echo "Checking pods..."
    sudo microk8s kubectl get pods
fi

echo "Setup of MicroK8s is complete!"

Now let's make the script executable and run it:

In [None]:
!chmod +x setup_microk8s.sh
!./setup_microk8s.sh
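
After the script finishes, it can be useful to confirm from Python that the node is `Ready` and that the GPU addon has registered a GPU. The sketch below inspects the JSON printed by `kubectl get nodes -o json`. The function names are illustrative, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is what exposes the GPU.

```python
import json
import subprocess


def ready_nodes(node_list):
    """Return the names of nodes whose 'Ready' condition has status 'True'."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
            names.append(node["metadata"]["name"])
    return names


def allocatable_gpus(node_list):
    """Sum the allocatable 'nvidia.com/gpu' resources over all nodes."""
    total = 0
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        total += int(allocatable.get("nvidia.com/gpu", 0))
    return total


def get_node_list():
    """Fetch the node list as a dict through the microk8s kubectl wrapper."""
    cmd = ["sudo", "microk8s", "kubectl", "get", "nodes", "-o", "json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)
```

In the notebook this could be used as `nodes = get_node_list()` followed by `ready_nodes(nodes)` and `allocatable_gpus(nodes)`. A GPU count of zero right after enabling the addon may only mean that the GPU operator pods are still starting.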

### 1.2 Clone the vLLM Production Stack Repository

Let's clone the vLLM production stack repository to get access to the necessary configuration files and examples:

In [None]:
!git clone https://github.com/vllm-project/production-stack.git
!cd production-stack && ls -la

## 2. Minimal vLLM Installation

### Understanding Helm and vLLM Stack

Helm is a package manager for Kubernetes that simplifies the deployment and management of applications. Think of it as npm for Node.js or pip for Python, but for Kubernetes applications.

The vLLM production stack consists of several components:

```
┌─────────────────────────────────────────────────┐
│                 Client Request                  │
└───────────────────────┬─────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│                Router Service                   │
│  (Load balancing and request distribution)      │
└───────────────────────┬─────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│               vLLM Engine Pods                  │
│  (Model serving with GPU acceleration)          │
└─────────────────────────────────────────────────┘
```
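
To illustrate what "request distribution" means in the diagram, here is a toy round-robin picker. It is only a sketch of the idea and not the production-stack router's actual code; the backend addresses are hypothetical.

```python
import itertools


class RoundRobinRouter:
    """Toy router that cycles through backend URLs in order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def pick(self):
        """Return the backend that should receive the next request."""
        return next(self._cycle)


# Hypothetical engine pod addresses, for illustration only.
router = RoundRobinRouter(["http://engine-0:8000", "http://engine-1:8000"])
print([router.pick() for _ in range(4)])
```

Round-robin spreads load evenly but ignores which pod already holds the KV cache for a conversation, which is one reason a shared cache becomes interesting once there are several replicas.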

Let's start with a minimal installation of vLLM to ensure everything is working correctly:

In [None]:
# Add the vLLM Helm repository
!sudo microk8s helm repo add vllm https://vllm-project.github.io/production-stack
!sudo microk8s helm repo update

Let's create a minimal configuration file for our initial deployment:

In [None]:
%%writefile minimal-vllm-config.yaml
servingEngineSpec:
  runtimeClassName: ""                  # Runtime class name (leave empty for default)
  modelSpec:
  - name: "opt125m"                     # Name for the deployment
    repository: "vllm/vllm-openai"      # Docker image for vLLM
    tag: "latest"                       # Image tag
    modelURL: "facebook/opt-125m"       # Small model for testing
    replicaCount: 1                     # Single replica
    requestCPU: 6                       # CPU cores requested
    requestMemory: "16Gi"               # Memory requested
    requestGPU: 1                       # Number of GPUs requested

Now let's deploy the minimal vLLM stack:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f minimal-vllm-config.yaml

Let's check the status of our deployment:

In [None]:
!sudo microk8s kubectl get pods
!sudo microk8s kubectl get services

Let's test our deployment by forwarding the service port and sending a request:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 30080:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)
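
A fixed five-second sleep can be too short while the engine pod is still pulling its image or loading the model, and `kubectl port-forward` itself needs a running pod to connect to. As an alternative, the sketch below polls the models endpoint until it answers. The helper names, the 600-second default timeout, and the 5-second interval are assumptions for this example.

```python
import http.client
import time
import urllib.request


def wait_until(probe, timeout=600.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def models_endpoint_ready(url="http://localhost:30080/v1/models"):
    """Return True if the models endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False
```

Calling `wait_until(models_endpoint_ready)` returns `True` as soon as the API responds and `False` if the timeout expires first. If the port-forward process exited because no pod was running yet, it has to be started again before the probe can succeed.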

In [None]:
# Test the API by listing available models
!curl -o- http://localhost:30080/v1/models

In [None]:
# Test the completion endpoint
!curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
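
The same request can be sent from Python, which makes it easier to work with the response. This sketch uses only the standard library; the function names are illustrative, and the response is assumed to follow the OpenAI completions format, with the generated text in `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=10,
                             base_url="http://localhost:30080"):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_json):
    """Return the text of the first choice in a completions response."""
    return response_json["choices"][0]["text"]


def complete(model, prompt, max_tokens=10):
    """Send the request through the port-forward and return the generated text."""
    request = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return first_completion_text(json.load(resp))
```

For the deployment above, `complete("facebook/opt-125m", "Once upon a time,")` would return the continuation as a string.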

## 3. Configuring KV Cache Offloading to CPU

### How CPU Offloading Works

KV cache offloading to CPU works by moving the key-value tensors from GPU memory to CPU memory when they're not immediately needed, then bringing them back when required for inference.

```
┌───────────────────┐
│     LLM Model     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐         ┌───────────────────┐
│  Active KV Cache  │◄───────►│ Offloaded KV Cache│
│   (GPU Memory)    │         │   (CPU Memory)    │
└───────────────────┘         └───────────────────┘
```

**Benefits of CPU Offloading:**
- Allows handling longer sequences than would fit in GPU memory alone
- Enables more concurrent requests with the same GPU resources
- Provides a good balance between performance and memory efficiency

We'll use LMCache, which integrates with vLLM to provide efficient KV cache offloading capabilities.
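
The movement between the two tiers in the diagram can be pictured with a toy two-level cache. This is a conceptual sketch only, not how LMCache is implemented: entries are plain Python objects, capacity is counted in entries rather than bytes, and least-recently-used eviction is an assumption made for the example.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy model: a small 'GPU' tier that spills old entries to a 'CPU' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # key -> KV blob, least recently used first
        self.cpu = {}             # offloaded entries

    def put(self, key, kv):
        """Store an entry in the GPU tier, offloading the oldest ones if it is full."""
        self.cpu.pop(key, None)
        self.gpu[key] = kv
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_kv = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_kv  # offload instead of discarding

    def get(self, key):
        """Return an entry, reloading it from the CPU tier if needed; None on a miss."""
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:
            kv = self.cpu.pop(key)
            self.put(key, kv)  # bring it back into the GPU tier
            return kv
        return None  # miss: the KV tensors would have to be recomputed


cache = TwoTierKVCache(gpu_capacity=2)
for seq in ("a", "b", "c"):
    cache.put(seq, f"kv-{seq}")
print("GPU tier:", list(cache.gpu), "CPU tier:", list(cache.cpu))
```

The point of the sketch is the last branch of `get`: without the CPU tier, every eviction from GPU memory turns a later request for the same prefix into a full recomputation.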

Now that we have a basic vLLM deployment working, let's configure KV cache offloading to CPU using LMCache. First, let's uninstall our previous deployment:

In [None]:
!sudo microk8s helm uninstall vllm

Now let's create a configuration file for KV cache offloading to CPU:

In [None]:
%%writefile cpu-offloading-config.yaml
servingEngineSpec:
  modelSpec:
  - name: "mistral"                      # Name for the deployment
    repository: "lmcache/vllm-openai"    # Docker image with LMCache support
    tag: "latest"                        # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"  # HuggingFace model ID
    replicaCount: 1                      # Number of replicas to deploy
    requestCPU: 10                       # CPU cores requested
    requestMemory: "40Gi"                # Memory requested
    requestGPU: 1                        # Number of GPUs requested
    pvcStorage: "50Gi"                   # Persistent volume size
    vllmConfig:                          # vLLM-specific configuration
      enableChunkedPrefill: false        # Disable chunked prefill
      enablePrefixCaching: false         # Disable prefix caching
      maxModelLen: 16384                 # Maximum sequence length
    
    lmcacheConfig:                       # LMCache configuration
      enabled: true                      # Enable LMCache
      cpuOffloadingBufferSize: "20"      # 20GB of CPU memory for KV cache
    
    hf_token: ""                         # HuggingFace token (if needed)
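
Because the offload buffer lives in the pod's CPU memory, it has to fit inside `requestMemory` together with everything else the engine needs. The following sketch checks that relationship. It assumes `cpuOffloadingBufferSize` is given in decimal gigabytes, as the comment in the config suggests, and the 8 GB headroom figure is an arbitrary placeholder.

```python
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal suffixes
}


def parse_k8s_memory(quantity):
    """Convert a Kubernetes memory quantity such as '40Gi' or '8G' to bytes."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # no suffix: a plain number of bytes


def offload_buffer_fits(request_memory, buffer_gb, headroom_gb=8):
    """True if the offload buffer plus headroom fits in the pod memory request."""
    needed = (buffer_gb + headroom_gb) * 10**9
    return needed <= parse_k8s_memory(request_memory)


print(offload_buffer_fits("40Gi", 20))
```

With the values from the config above (`"40Gi"` and `"20"`), the check passes with room to spare; lowering `requestMemory` without also lowering the buffer size would make it fail.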

Now let's deploy the vLLM stack with KV cache offloading to CPU:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f cpu-offloading-config.yaml

Let's check the status of our deployment:

In [None]:
!sudo microk8s kubectl get pods

Let's check the logs to verify that LMCache is active:

In [None]:
# Get the pod name for the vLLM deployment
!POD_NAME=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}') && \
sudo microk8s kubectl logs $POD_NAME | grep -i lmcache

Let's test our deployment with KV cache offloading:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 30080:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)

In [None]:
# Test the API by listing available models
!curl -o- http://localhost:30080/v1/models

In [None]:
# Test the completion endpoint
!curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Explain the significance of KV cache in language models.",
    "max_tokens": 100
  }'

## 4. Setting Up Remote Shared KV Cache Storage

### Understanding Remote Shared KV Cache

Remote shared KV cache takes offloading a step further by letting multiple vLLM instances share a common KV cache store alongside the cache each pod keeps locally.

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  vLLM Pod #1  │     │  vLLM Pod #2  │     │  vLLM Pod #3  │
└───────┬───────┘     └───────┬───────┘     └───────┬───────┘
        │                     │                     │
        └─────────────┬───────┴─────────────┬───────┘
                      │                     │
                      ▼                     ▼
        ┌───────────────────────┐ ┌───────────────────────┐
        │   Local KV Cache      │ │   Shared KV Cache     │
        │   (GPU Memory)        │ │   (Remote Storage)    │
        └───────────────────────┘ └───────────────────────┘
```

**Key Benefits:**
- **Resource Efficiency**: Multiple model instances can share cached KV pairs
- **Fault Tolerance**: If one pod fails, others can still access the shared cache
- **Horizontal Scaling**: Add more vLLM pods without duplicating cached data
- **Consistent Performance**: Provides more predictable performance across pods
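
For several pods to share cached KV pairs, they need to agree on how a piece of cache is identified. One common approach, shown here as a sketch rather than as LMCache's actual scheme, is to split the token sequence into fixed-size chunks and key each chunk by a hash of all tokens up to and including it. The chunk size of 4 and the in-memory dictionary standing in for the cache server are assumptions for the example.

```python
import hashlib

shared_store = {}  # stands in for the remote cache server


def chunk_keys(token_ids, chunk_size=4):
    """Return one key per full chunk; each key covers all tokens up to that chunk."""
    keys = []
    prefix_hash = hashlib.sha256()
    full_length = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full_length, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        prefix_hash.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(prefix_hash.copy().hexdigest())
    return keys


def store(token_ids, chunk_size=4):
    """Publish placeholder KV entries for every full chunk of this sequence."""
    for key in chunk_keys(token_ids, chunk_size):
        shared_store.setdefault(key, "kv-placeholder")


def cached_prefix_chunks(token_ids, chunk_size=4):
    """Count how many leading chunks of this prompt are already in the shared store."""
    hits = 0
    for key in chunk_keys(token_ids, chunk_size):
        if key not in shared_store:
            break
        hits += 1
    return hits


# "Pod 1" publishes a prompt; "pod 2" later sees a prompt with the same first chunk.
store([1, 2, 3, 4, 5, 6, 7, 8])
print(cached_prefix_chunks([1, 2, 3, 4, 9, 9, 9, 9]))
```

Because each key depends on the whole prefix, a chunk is only reused when everything before it matches too, which is what attention requires for the cached values to be valid.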

Now let's set up remote shared KV cache storage. First, let's uninstall our previous deployment:

In [None]:
!sudo microk8s helm uninstall vllm

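Now let's create a configuration file for remote shared KV cache storage. With `replicaCount: 2` and `requestGPU: 1`, this deployment needs two GPUs; on a single-GPU machine the second engine pod will stay in `Pending`: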

In [None]:
%%writefile remote-shared-storage-config.yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "mistral"                      # Name for the deployment
    repository: "lmcache/vllm-openai"    # Docker image with LMCache support
    tag: "latest"                        # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"  # HuggingFace model ID
    replicaCount: 2                      # Deploy 2 replicas to demonstrate sharing
    requestCPU: 10                       # CPU cores requested
    requestMemory: "40Gi"                # Memory requested
    requestGPU: 1                        # Number of GPUs requested
    pvcStorage: "50Gi"                   # Persistent volume size
    vllmConfig:                          # vLLM-specific configuration
      enableChunkedPrefill: false        # Disable chunked prefill
      enablePrefixCaching: false         # Disable prefix caching
      maxModelLen: 16384                 # Maximum sequence length
    
    lmcacheConfig:                       # LMCache configuration
      enabled: true                      # Enable LMCache
      cpuOffloadingBufferSize: "20"      # 20GB of CPU memory for KV cache
    
    hf_token: ""                         # HuggingFace token (if needed)

# This section configures the shared cache server
cacheserverSpec:
  replicaCount: 1                        # Number of cache server replicas
  containerPort: 8080                    # Container port
  servicePort: 81                        # Service port
  serde: "naive"                         # Serialization/deserialization method
  repository: "lmcache/vllm-openai"      # Docker image
  tag: "latest"                          # Image tag
  resources:                             # Resource requests and limits
    requests:
      cpu: "4"                           # CPU cores requested
      memory: "8G"                       # Memory requested
    limits:
      cpu: "4"                           # CPU cores limit
      memory: "10G"                      # Memory limit
  labels:                                # Kubernetes labels
    environment: "cacheserver"           # Environment label
    release: "cacheserver"               # Release label

Now let's deploy the vLLM stack with remote shared KV cache storage:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f remote-shared-storage-config.yaml

Let's check the status of our deployment:

In [None]:
!sudo microk8s kubectl get pods

Let's check the logs to verify that LMCache with remote shared storage is active:

In [None]:
# Get the pod name for the vLLM deployment
!POD_NAME=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}' | head -1) && \
sudo microk8s kubectl logs $POD_NAME | grep -i lmcache

Let's test our deployment with remote shared KV cache storage:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 30080:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)

In [None]:
# Test the completion endpoint
!curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Explain the significance of KV cache in language models.",
    "max_tokens": 100
  }'

## 5. Benchmarking KV Cache Offloading

### Why Benchmark KV Cache Offloading?

Benchmarking helps us understand the performance characteristics and trade-offs of KV cache offloading:

- **First Request Latency**: Initial requests might be slower due to cache setup
- **Subsequent Request Performance**: Later requests should benefit from cached KV pairs
- **Memory Usage Patterns**: How memory is distributed between GPU and CPU/remote storage
- **Throughput Under Load**: How the system performs with concurrent requests

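Our benchmark measures end-to-end request latency across multiple identical requests. Cache reuse mainly shortens the prefill phase, so with a short prompt and 100 generated tokens the difference between requests may be small. The chart below illustrates the expected trend and is not measured data: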

```
Request 1: [■■■■■■■■■■] 100% (Initial - No cache benefit)
Request 2: [■■■■■■■■  ] 80%  (Some cache hits)
Request 3: [■■■■■■    ] 60%  (More cache hits)
Request 4: [■■■■      ] 40%  (Significant cache hits)
Request 5: [■■■       ] 30%  (Maximum cache benefit)
```
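
Since a cache hit mostly saves prefill work, time to first token (TTFT) is usually a more sensitive signal than total latency. The sketch below measures it from a streamed response. It assumes the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`) when the request sets `"stream": true`; the function names are illustrative.

```python
import json
import time


def iter_sse_payloads(lines):
    """Yield decoded JSON payloads from server-sent-event lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def time_to_first_token(open_stream, clock=time.perf_counter):
    """Return (seconds until the first non-empty text chunk, total seconds).

    open_stream is a callable that sends the request and returns an iterable
    of response lines, so the timer also covers the time spent connecting.
    """
    start = clock()
    first = None
    for payload in iter_sse_payloads(open_stream()):
        text = payload["choices"][0].get("text", "")
        if text and first is None:
            first = clock() - start
    return first, clock() - start
```

With the `requests` library, `open_stream` could be a lambda that calls `requests.post(url, json={..., "stream": True}, stream=True).iter_lines()`. Comparing the first value across repeated identical prompts shows the prefill saving more directly than the end-to-end time does.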

Let's create a simple benchmark that sends the same prompt several times and records the latency of each request:

In [None]:
%%writefile benchmark.py
import requests
import time
import statistics
import matplotlib.pyplot as plt

def send_request(prompt, max_tokens=100):
    """Send a completion request to the vLLM API and measure response time"""
    url = "http://localhost:30080/v1/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": prompt,
        "max_tokens": max_tokens
    }
    
    # perf_counter is a monotonic clock intended for measuring intervals
    start_time = time.perf_counter()
    response = requests.post(url, headers=headers, json=data, timeout=300)
    end_time = time.perf_counter()
    
    return {
        "status_code": response.status_code,
        "response": response.json() if response.status_code == 200 else None,
        "time": end_time - start_time
    }

def run_benchmark(prompt, num_requests=5):
    """Run a benchmark with multiple identical requests to measure KV cache benefits"""
    print(f"Running benchmark with {num_requests} requests...")
    times = []
    
    for i in range(num_requests):
        print(f"Request {i+1}/{num_requests}")
        result = send_request(prompt)
        if result["status_code"] == 200:
            times.append(result["time"])
            print(f"Request {i+1} completed in {result['time']:.2f} seconds")
        else:
            print(f"Request {i+1} failed with status code {result['status_code']}")
    
    if times:
        print("\nBenchmark Results:")
        print(f"Average time: {statistics.mean(times):.2f} seconds")
        print(f"Median time: {statistics.median(times):.2f} seconds")
        print(f"Min time: {min(times):.2f} seconds")
        print(f"Max time: {max(times):.2f} seconds")
        if len(times) > 1:
            print(f"Standard deviation: {statistics.stdev(times):.2f} seconds")
        
        # Create a visualization of the results
        try:
            plt.figure(figsize=(10, 6))
            plt.plot(range(1, len(times) + 1), times, marker='o', linestyle='-', color='blue')
            plt.title('Request Latency Over Time (KV Cache Effect)')
            plt.xlabel('Request Number')
            plt.ylabel('Time (seconds)')
            plt.grid(True, linestyle='--', alpha=0.7)
            plt.savefig('kv_cache_benchmark.png')
            print("\nBenchmark visualization saved to 'kv_cache_benchmark.png'")
        except Exception as e:
            print(f"\nCouldn't create visualization: {e}")
    else:
        print("No successful requests to analyze.")

if __name__ == "__main__":
    # Using a consistent prompt to demonstrate KV cache benefits
    prompt = "Explain the significance of KV cache in language models. Provide details about how it improves inference performance and what are the trade-offs involved."
    run_benchmark(prompt, num_requests=5)

Now let's run the benchmark:

In [None]:
!python benchmark.py
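
The script above sends requests one at a time, so it says nothing about throughput under load. A minimal way to add concurrency is a thread pool, sketched below. The `send` argument is a placeholder for any function that issues one request and returns `True` on success, and the worker count of 4 is an arbitrary default.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(send, prompts, max_workers=4, clock=time.perf_counter):
    """Send all prompts through a thread pool and summarize latency and throughput."""

    def timed(prompt):
        start = clock()
        ok = send(prompt)
        return ok, clock() - start

    wall_start = clock()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(timed, prompts))
    wall = clock() - wall_start

    latencies = [elapsed for ok, elapsed in results if ok]
    return {
        "succeeded": len(latencies),
        "failed": len(results) - len(latencies),
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else None,
        "requests_per_s": len(latencies) / wall if wall > 0 else None,
    }
```

To reuse the earlier helper, `send` could be `lambda p: send_request(p)["status_code"] == 200`. Running the same prompt list with different `max_workers` values gives a rough picture of how latency and throughput trade off as concurrency increases.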

## 6. Cleanup

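Let's remove the vLLM release and the example nginx deployment created by the setup script:

In [None]:
!sudo microk8s helm uninstall vllm
!sudo microk8s kubectl delete deployment nginx --ignore-not-found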

## Conclusion

In this notebook, we've successfully:

1. Set up a Kubernetes environment with MicroK8s
2. Installed the vLLM production stack using Helm
3. Configured KV Cache offloading to CPU using LMCache
4. Set up remote shared KV Cache storage
5. Tested and benchmarked the setup

### Key Takeaways

- **Memory Efficiency**: KV cache offloading allows you to serve longer sequences and more concurrent requests with the same GPU resources
- **Scalability**: Remote shared KV cache enables efficient horizontal scaling of your LLM deployment
- **Performance Trade-offs**: There's a balance between keeping KV cache in GPU memory (faster) vs. offloading (more memory efficient)
- **Production Readiness**: The vLLM production stack with LMCache provides a robust, Kubernetes-native solution for deploying LLMs at scale

### When to Use KV Cache Offloading

Consider implementing KV cache offloading when:
- You need to support long conversations or documents
- You want to maximize the number of concurrent users per GPU
- You're working with limited GPU memory but have ample CPU memory
- You need to scale horizontally while maintaining efficient resource usage

### Next Steps

To further enhance your vLLM deployment, consider exploring:
- Fine-tuning the offloading parameters based on your specific workload
- Implementing monitoring and observability for your deployment
- Setting up auto-scaling based on request load
- Exploring other vLLM features like LoRA adapters for model customization

The vLLM production stack with LMCache provides a robust solution for deploying LLMs in production with KV Cache offloading capabilities, enabling you to build more efficient and scalable AI applications.