# KV Cache Offloading with vLLM Production Stack

This notebook demonstrates how to implement KV cache offloading in the vLLM production stack to optimize memory usage and improve performance for large language models.

## What is KV Cache?

The KV (Key-Value) cache is a critical component in transformer-based language models:

- During generation, each new token attends to all previous tokens (the prompt plus everything generated so far)
- The KV cache stores the key and value tensors from previous tokens to avoid recomputation
- This significantly speeds up inference but consumes a lot of GPU memory
- As conversations get longer, the KV cache grows linearly with the sequence length

```
┌───────────────────┐
│     LLM Model     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│    KV Cache in    │
│    GPU Memory     │◄── Memory bottleneck for long sequences
└───────────────────┘
```
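
To make the linear growth concrete, here is a rough back-of-the-envelope estimate. The function name is ours, and the architecture numbers are assumptions for `mistralai/Mistral-7B-Instruct-v0.3` (the model ID used in the deployment config later in this notebook); check the model's `config.json` before relying on them.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed values: 32 layers, 8 key/value heads (grouped-query attention),
# head dimension 128, 16-bit precision (2 bytes per element).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (2048, 16384):
    gib = per_token * tokens / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB of KV cache per sequence")
```

Under these assumptions a single full-length sequence needs a few GiB, and that figure is multiplied by the number of concurrent sequences, which is why the cache becomes the bottleneck.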

## Why Offload KV Cache?

KV cache offloading addresses several challenges:

- **Memory Efficiency**: Frees GPU memory by moving inactive KV tensors to CPU memory, leaving more room for active requests
- **Longer Contexts**: Enables processing of much longer conversations
- **Cost Optimization**: Allows using smaller/fewer GPUs for the same workload
- **Higher Throughput**: Serves more concurrent users with the same hardware

## Recommended GPU VM Configuration

For this tutorial, we recommend:

- **GPU**: NVIDIA A10G or better (24GB+ VRAM)
- **CPU**: 16+ cores (important for CPU offloading)
- **RAM**: 64GB+ (critical for CPU offloading)
- **Storage**: 100GB+ SSD

Let's get started!

## Prerequisites

Before proceeding, make sure you have completed the setup in the first notebook (`01_setup_vllm_production_stack.ipynb`). Let's verify that our Kubernetes environment is running:

In [None]:
!sudo microk8s status

## 1. How CPU Offloading Works

KV cache offloading to CPU works by moving the key-value tensors from GPU memory to CPU memory when they're not immediately needed, then bringing them back when required for inference.

```
┌───────────────────┐
│     LLM Model     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐         ┌───────────────────┐
│  Active KV Cache  │◄───────►│ Offloaded KV Cache│
│   (GPU Memory)    │         │   (CPU Memory)    │
└───────────────────┘         └───────────────────┘
```
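
The diagram can be mimicked with a toy two-tier cache. This is only a sketch of the general idea, not how LMCache is implemented: the class `TwoTierKVCache`, its methods, and the least-recently-used eviction policy are all hypothetical, and plain strings stand in for KV tensors.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Toy model of offloading: a small 'GPU' tier backed by a larger 'CPU' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # least recently used entry sits at the front
        self.cpu = {}

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_value = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_value  # offload instead of discarding

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:
            value = self.cpu.pop(key)
            self.put(key, value)  # bring it back; may offload something else
            return value
        return None  # not cached anywhere: the caller would have to recompute

cache = TwoTierKVCache(gpu_capacity=2)
for chunk in ("chunk-a", "chunk-b", "chunk-c"):
    cache.put(chunk, f"kv-for-{chunk}")
print("GPU tier:", list(cache.gpu), "CPU tier:", list(cache.cpu))
```

The point of the second tier is that a miss in the fast tier becomes a copy rather than a recomputation.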

**Benefits of CPU Offloading:**
- Allows handling longer sequences than would fit in GPU memory alone
- Enables more concurrent requests with the same GPU resources
- Trades some GPU-CPU transfer latency for a much larger effective cache

We'll use LMCache, which integrates with vLLM to provide efficient KV cache offloading capabilities.

## 2. Configuring KV Cache Offloading to CPU

Let's create a configuration file for KV cache offloading to CPU. We'll use Mistral-7B-Instruct-v0.3 as our model, which is a good balance between size and performance for demonstrating KV cache offloading:

In [None]:
%%writefile cpu-offloading-config.yaml
servingEngineSpec:
  modelSpec:
  - name: "mistral"                      # Name for the deployment
    repository: "lmcache/vllm-openai"    # Docker image with LMCache support
    tag: "latest"                        # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.3"  # HuggingFace model ID
    replicaCount: 1                      # Number of replicas to deploy
    requestCPU: 12                       # CPU cores requested
    requestMemory: "48Gi"                # Memory requested
    requestGPU: 1                        # Number of GPUs requested
    pvcStorage: "50Gi"                   # Persistent volume size
    vllmConfig:                          # vLLM-specific configuration
      enableChunkedPrefill: false        # Disable chunked prefill
      enablePrefixCaching: true          # Enable prefix caching
      maxModelLen: 16384                 # Maximum sequence length
    
    lmcacheConfig:                       # LMCache configuration
      enabled: true                      # Enable LMCache
      cpuOffloadingBufferSize: "20"      # 20GB of CPU memory for KV cache
    
    hf_token: ""                         # HuggingFace token (if needed)

### Key Configuration Parameters Explained

- **repository**: We use `lmcache/vllm-openai` instead of the standard vLLM image to get LMCache support
- **lmcacheConfig.enabled**: Enables the LMCache functionality
- **cpuOffloadingBufferSize**: Specifies how much CPU memory (in GB) to allocate for KV cache offloading
- **maxModelLen**: Sets the maximum sequence length; longer sequences benefit more from offloading
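
As a rough sanity check on these numbers, the sketch below estimates how many tokens of KV cache fit in the offload buffer and how much of the pod's requested memory is left for everything else. The helper `offload_budget` is hypothetical, it assumes the buffer size is expressed in decimal GB, and the per-token figure is an assumption rather than a measured value.

```python
def offload_budget(request_memory_gib, offload_buffer_gb, kv_bytes_per_token):
    """Estimate what a CPU offload buffer holds and what pod memory is left over."""
    buffer_bytes = offload_buffer_gb * 10**9  # assumes the setting is in decimal GB
    return {
        "tokens_in_buffer": buffer_bytes // kv_bytes_per_token,
        "headroom_gib": request_memory_gib - buffer_bytes / 2**30,
    }

# 131072 bytes/token is an assumed figure for a 32-layer model with 8 KV heads of
# dimension 128 in 16-bit precision; substitute the value for your own model.
budget = offload_budget(request_memory_gib=48, offload_buffer_gb=20,
                        kv_bytes_per_token=131072)
print(budget)
```

If the headroom comes out small or negative, either raise `requestMemory` or shrink `cpuOffloadingBufferSize`, since the buffer has to fit inside the pod's memory alongside the serving process itself.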

Now let's deploy the vLLM stack with KV cache offloading to CPU:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f cpu-offloading-config.yaml

Let's check the status of our deployment. This might take a few minutes as the model is downloaded and loaded:

In [None]:
!sudo microk8s kubectl get pods
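
If you would rather check readiness programmatically than read the table by eye, a small parser over the `kubectl get pods` text is enough. This is a sketch: `pods_ready` is our own helper, and the sample text below is made up (the pod names are hypothetical) but follows the standard column layout.

```python
def pods_ready(get_pods_output):
    """Map pod name -> True when STATUS is Running and READY is n/n."""
    status = {}
    for line in get_pods_output.strip().splitlines()[1:]:  # skip the header row
        name, ready, phase = line.split()[:3]
        have, want = ready.split("/")
        status[name] = phase == "Running" and have == want
    return status

# Made-up sample in the usual `kubectl get pods` layout; pod names are hypothetical.
sample = """\
NAME                                  READY   STATUS              RESTARTS   AGE
vllm-deployment-router-abc123         1/1     Running             0          2m
vllm-mistral-deployment-vllm-def456   0/1     ContainerCreating   0          2m
"""
print(pods_ready(sample))
```

In a notebook cell you can capture the real output with `out = !sudo microk8s kubectl get pods` and call `pods_ready("\n".join(out))`. Alternatively, `kubectl wait --for=condition=Ready pod --all --timeout=600s` blocks until the pods are ready.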

Let's check the logs to verify that LMCache is active:

In [None]:
# Get the pod name for the vLLM deployment (assumes the pod name contains the modelSpec name, "mistral")
!POD_NAME=$(sudo microk8s kubectl get pods | grep mistral | awk '{print $1}' | head -n 1) && \
sudo microk8s kubectl logs $POD_NAME | grep -i lmcache

## 3. Testing KV Cache Offloading

Let's test our deployment with KV cache offloading by sending a request with a long prompt:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 30080:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)
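
A fixed sleep can be too short on a slow machine and needlessly long on a fast one. As an alternative sketch, the hypothetical helper below polls the forwarded port until it accepts a TCP connection or a timeout expires. It only checks that something is listening, not that the model behind it has finished loading.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # nothing listening yet; try again shortly
    return False
```

In this notebook the call would be `wait_for_port("localhost", 30080)`.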

In [None]:
# Test the API by listing available models
!curl -o- http://localhost:30080/v1/models

Now let's create a long prompt to test KV cache offloading. We'll create a conversation with multiple turns to build up a large KV cache:

In [None]:
long_prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Can you explain what KV cache is in transformer models? [/INST]

In transformer models like those used in large language models (LLMs), the KV cache (Key-Value cache) is an optimization technique that significantly improves inference speed, especially during text generation.

When a transformer generates text token by token, it needs to compute attention over all previous tokens for each new token. This involves calculating key (K) and value (V) vectors for each token through the model's layers. Without caching, these K and V vectors would need to be recomputed for all previous tokens every time a new token is generated, which is computationally expensive.

The KV cache stores these key and value vectors after they're computed the first time, so they don't need to be recalculated for subsequent token generations. This makes the generation of each new token much faster after the first one.

However, the KV cache does consume memory proportionally to the sequence length, which can become a limitation for very long contexts or when serving many concurrent requests.

[INST] That's interesting! How much memory does the KV cache typically use? [/INST]

The memory usage of the KV cache depends on several factors, but I can give you a general idea of how it scales:

1. **Model size**: Larger models with more layers and wider attention heads use more memory for the KV cache.

2. **Sequence length**: The KV cache grows linearly with the number of tokens in the sequence. This is why very long contexts can be challenging to handle.

3. **Precision**: Whether the model is running in FP16, BF16, or FP32 precision affects memory usage.

For a concrete example, let's consider a model like Llama 2 (7B) running with 16-bit precision:
- Each key and value vector might use about 2 bytes per parameter
- For each token in the sequence, the model stores K and V tensors for each layer
- With 32 layers and a hidden size of 4096 (divided into 32 attention heads), the memory per token is substantial

A rough calculation for a 7B parameter model with a 2048 token context might use around 1-2 GB just for the KV cache. For larger models like a 70B parameter model, this could be 10+ GB.

This is why techniques like KV cache offloading (moving parts of the cache to CPU memory) or attention sinking (discarding some KV cache entries for tokens early in the context) are becoming important for efficient deployment.

[INST] What are the main techniques for optimizing KV cache usage? [/INST]
"""

# Write the prompt to a file for easier use with curl
with open("long_prompt.txt", "w") as f:
    f.write(long_prompt)

In [None]:
# Test the completion endpoint with a long prompt
!curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"mistralai/Mistral-7B-Instruct-v0.3\",
    \"prompt\": $(cat long_prompt.txt | jq -Rs .),
    \"max_tokens\": 500
  }"
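
The `jq -Rs .` step above exists only to turn the multi-line prompt into a valid JSON string. If `jq` is not installed, the same request can be built in Python with the standard library. This is a sketch: the function names are ours, and the URL and model ID are simply the ones used in the curl command.

```python
import json
import urllib.request

def build_completion_payload(prompt, model="mistralai/Mistral-7B-Instruct-v0.3",
                             max_tokens=500):
    """Serialize the request body; json.dumps escapes quotes and newlines."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

def post_completion(body, url="http://localhost:30080/v1/completions"):
    """Send the body to the completions endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)

# Escaping demo on a short prompt containing quotes and a newline.
body = build_completion_payload('He said "hi"\nthen left.')
print(body)
```

With the port-forward running, `post_completion(build_completion_payload(long_prompt))` sends the same request as the curl cell.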

Let's check the logs again to see if we can observe KV cache offloading in action:

In [None]:
# Get the pod name for the vLLM deployment (assumes the pod name contains the modelSpec name, "mistral")
!POD_NAME=$(sudo microk8s kubectl get pods | grep mistral | awk '{print $1}' | head -n 1) && \
sudo microk8s kubectl logs $POD_NAME | grep -i "cache\|offload" | tail -20

## 4. Benchmarking KV Cache Offloading

Let's create a simple benchmark to test the performance of KV cache offloading. We'll send multiple requests with the same prompt to see how the KV cache offloading affects performance:

In [None]:
%%writefile benchmark.py
import statistics
import time

import matplotlib
matplotlib.use("Agg")  # headless backend: the script only saves the plot to a file
import matplotlib.pyplot as plt
import requests

def send_request(prompt, max_tokens=100):
    """Send a completion request to the vLLM API and measure response time"""
    url = "http://localhost:30080/v1/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": prompt,
        "max_tokens": max_tokens
    }
    
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    response = requests.post(url, headers=headers, json=data, timeout=300)
    end_time = time.perf_counter()
    
    return {
        "status_code": response.status_code,
        "response": response.json() if response.status_code == 200 else None,
        "time": end_time - start_time
    }

def run_benchmark(prompt, num_requests=5):
    """Run a benchmark with multiple identical requests to measure KV cache benefits"""
    print(f"Running benchmark with {num_requests} requests...")
    times = []
    
    for i in range(num_requests):
        print(f"Request {i+1}/{num_requests}")
        result = send_request(prompt)
        if result["status_code"] == 200:
            times.append(result["time"])
            print(f"Request {i+1} completed in {result['time']:.2f} seconds")
        else:
            print(f"Request {i+1} failed with status code {result['status_code']}")
    
    if times:
        print("\nBenchmark Results:")
        print(f"Average time: {statistics.mean(times):.2f} seconds")
        print(f"Median time: {statistics.median(times):.2f} seconds")
        print(f"Min time: {min(times):.2f} seconds")
        print(f"Max time: {max(times):.2f} seconds")
        if len(times) > 1:
            print(f"Standard deviation: {statistics.stdev(times):.2f} seconds")
        
        # Create a visualization of the results
        try:
            plt.figure(figsize=(10, 6))
            plt.plot(range(1, len(times) + 1), times, marker='o', linestyle='-', color='blue')
            plt.title('Request Latency Over Time (KV Cache Effect)')
            plt.xlabel('Request Number')
            plt.ylabel('Time (seconds)')
            plt.grid(True, linestyle='--', alpha=0.7)
            plt.savefig('kv_cache_benchmark.png')
            print("\nBenchmark visualization saved to 'kv_cache_benchmark.png'")
        except Exception as e:
            print(f"\nCouldn't create visualization: {e}")
    else:
        print("No successful requests to analyze.")

if __name__ == "__main__":
    # Using a consistent prompt to demonstrate KV cache benefits
    with open("long_prompt.txt", "r") as f:
        prompt = f.read()
    
    run_benchmark(prompt, num_requests=5)

Now let's run the benchmark:

In [None]:
# Install the benchmark's dependencies if needed
!pip install matplotlib requests

In [None]:
!python benchmark.py

Let's display the benchmark visualization:

In [None]:
from IPython.display import Image
Image(filename='kv_cache_benchmark.png')
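
When reading the plot, the comparison that matters is the first request (cold cache) against the later ones (warm cache). The hypothetical helper below makes that comparison explicit. The latencies in the example are made-up placeholders, not measurements.

```python
import statistics

def cold_vs_warm(times):
    """Compare the first request's latency with the median of the rest."""
    if len(times) < 2:
        raise ValueError("need at least two timings to compare cold and warm")
    cold = times[0]
    warm = statistics.median(times[1:])
    return {"cold_s": cold, "warm_median_s": warm, "speedup": cold / warm}

# Placeholder numbers for illustration only; substitute the timings printed above.
example = cold_vs_warm([4.0, 2.0, 2.2, 1.8, 2.0])
print(example)
```

Because `enablePrefixCaching` is on in the config above, a warm request may be served from the prefix cache in GPU memory rather than from the CPU buffer, so this benchmark alone cannot attribute a speedup to offloading.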

## 5. Monitoring GPU Memory Usage

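Let's check GPU memory usage during inference. vLLM reserves most of its GPU memory when it starts, so the totals from `nvidia-smi` may change little as the KV cache fills; compare readings taken before and after sending requests: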

In [None]:
!nvidia-smi
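
To log memory figures rather than read the full table, `nvidia-smi` can emit just the used and total memory as CSV. The sketch below wraps that query; the helper names are ours, and the parser assumes one `used, total` pair in MiB per GPU line.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total})
    return rows

def read_gpu_memory():
    """Run the query on this machine; requires an NVIDIA driver to be installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `read_gpu_memory()` before and after the benchmark gives two snapshots you can diff.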

## 6. Cleanup

When you're done, you can clean up the deployment:

In [None]:
!sudo microk8s helm uninstall vllm

## Conclusion

In this notebook, we've successfully:

1. Configured KV cache offloading to CPU using LMCache
2. Deployed a Mistral-7B-Instruct model with KV cache offloading
3. Tested the deployment with a long prompt
4. Benchmarked the performance of KV cache offloading

### Key Takeaways

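- KV cache offloading lets you keep longer sequences and more concurrent sessions cached than GPU memory alone would hold
- The performance cost of offloading depends on the workload: reloading KV tensors from CPU memory is typically cheaper than recomputing a long prefill, but measure it for your own traffic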
- CPU memory becomes an important resource when using KV cache offloading
- This technique is particularly valuable for applications that require long conversations or document processing

### When to Use KV Cache Offloading

Consider implementing KV cache offloading when:
- You need to support long conversations or documents
- You want to maximize the number of concurrent users per GPU
- You're working with limited GPU memory but have ample CPU memory
- You need to scale horizontally while maintaining efficient resource usage

In the next notebook, we'll explore remote shared KV cache storage, which takes offloading a step further by allowing multiple vLLM instances to share a common KV cache.