# KV Cache Offloading with vLLM and Mistral-7B-Instruct-v0.3

This notebook demonstrates how to enable KV cache offloading for vLLM using LMCache, which moves KV cache data from GPU memory to CPU memory when serving Mistral-7B-Instruct-v0.3. This frees GPU memory and makes longer contexts practical.

## What is KV Cache?

The KV (Key-Value) cache is a critical component in transformer-based language models:

```
┌─────────────────────────────────────────────────────────────┐
│                     KV Cache Overview                       │
│                                                             │
│  ┌─────────────┐       ┌─────────────┐       ┌──────────┐   │
│  │ Input       │       │ Transformer │       │ Output   │   │
│  │ Tokens      │──────►│ Layer       │──────►│ Tokens   │   │
│  └─────────────┘       └──────┬──────┘       └──────────┘   │
│                               │                             │
│                               ▼                             │
│                      ┌─────────────────┐                    │
│                      │   KV Cache      │                    │
│                      │ (GPU Memory)    │                    │
│                      └─────────────────┘                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

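- During generation, each new token attends to every earlier token in the sequence (both the prompt and the output generated so far)
- The KV cache stores the key and value tensors of those earlier tokens so they are not recomputed at every decoding step
- This speeds up inference but consumes GPU memory on top of the model weights
- Cache size grows linearly with sequence length and with the number of concurrent sequences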
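
To make the linear growth concrete, here is a minimal sketch that estimates the cache size from the model shape. The architecture values (32 layers, 8 key-value heads, head dimension 128) are assumptions taken from the published Mistral-7B model configuration, FP16 storage (2 bytes per element) is assumed, and the function name is purely illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size for one sequence.

    Each token stores one key and one value vector (the factor 2)
    per layer and per KV head. Defaults are assumed Mistral-7B values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


for tokens in (1024, 4096, 16384):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>6} tokens -> {gib:.3f} GiB")
```

Under these assumptions each token costs 128 KiB, so a single 16,384-token sequence needs about 2 GiB of cache before any batching is considered.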

## Why Offload KV Cache?

KV cache offloading addresses several challenges:

```
┌─────────────────────────────────────────────────────────────┐
│                 KV Cache Offloading Benefits                │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Memory Efficiency   │      │ Longer Contexts         │   │
│  │ - Free GPU memory   │      │ - Handle long chats     │   │
│  │ - Use for params    │      │ - Process documents     │   │
│  │ - Better utilization│      │ - Maintain history      │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Cost Optimization   │      │ Higher Throughput       │   │
│  │ - Use smaller GPUs  │      │ - More concurrent users │   │
│  │ - Fewer GPUs needed │      │ - Better resource use   │   │
│  │ - Lower TCO         │      │ - Faster responses      │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Recommended GPU VM Configuration

For KV cache offloading with Mistral-7B-Instruct-v0.3, we recommend:

- **GPU**: NVIDIA A10G or better (24GB+ VRAM)
- **CPU**: 12+ cores (important for CPU offloading)
- **RAM**: 48GB+ (critical for CPU offloading)
- **Storage**: 100GB+ SSD

Let's get started!

## 1. Setting Up KV Cache Offloading

First, let's create a configuration file that enables KV cache offloading:

In [None]:
%%writefile kv-offload-config.yaml
servingEngineSpec:
  modelSpec:
  - name: "mistral"                      # Name for the deployment
    repository: "lmcache/vllm-openai"    # Docker image with LMCache support
    tag: "latest"                        # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.3"  # HuggingFace model ID
    replicaCount: 1                      # Number of replicas to deploy
    requestCPU: 12                       # CPU cores requested
    requestMemory: "48Gi"                # Memory requested
    requestGPU: 1                        # Number of GPUs requested
    pvcStorage: "50Gi"                   # Persistent volume size
    vllmConfig:                          # vLLM-specific configuration
      enableChunkedPrefill: false        # Disable chunked prefill
      enablePrefixCaching: false         # Disable prefix caching
      maxModelLen: 16384                 # Maximum sequence length
    
    lmcacheConfig:                       # LMCache configuration
      enabled: true                      # Enable LMCache
      cpuOffloadingBufferSize: "20"      # 20GB of CPU memory for KV cache
    
    hf_token: ""                         # HuggingFace token (if needed)

### Understanding the Configuration

Let's break down the key configuration parameters:

```
┌─────────────────────────────────────────────────────────────┐
│                Configuration Parameters                     │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Resource Config     │      │ Model Config            │   │
│  │ - 12 CPU cores      │      │ - Mistral-7B-Instruct   │   │
│  │ - 48GB RAM          │      │ - 16K context length    │   │
│  │ - 1 GPU             │      │ - No prefix caching     │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ LMCache Config      │      │ Storage Config          │   │
│  │ - Enabled           │      │ - 50GB PVC              │   │
│  │ - 20GB CPU buffer   │      │ - Persistent storage    │   │
│  │ - Auto-offloading   │      │ - Model checkpoints     │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

Key differences from the basic setup:
- Using `lmcache/vllm-openai` image for LMCache support
- Increased CPU and memory resources for offloading
- Enabled LMCache with 20GB CPU buffer
- Increased maximum sequence length to 16K tokens
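
As a rough sanity check on these values, the sketch below estimates how many cached tokens the CPU buffer could hold. It assumes the `cpuOffloadingBufferSize` value of 20 is interpreted as GiB and that each token costs 128 KiB of KV cache (FP16, with layer and head counts taken from the published Mistral-7B configuration). Both are assumptions rather than values confirmed by this notebook, and the helper name is hypothetical.

```python
# Assumed per-token cost: K and V * 32 layers * 8 KV heads * 128 dims * 2 bytes (FP16)
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def buffer_capacity(buffer_gib, max_model_len, bytes_per_token=KV_BYTES_PER_TOKEN):
    """Return (tokens, full-length sequences) that fit in a CPU buffer of buffer_gib GiB."""
    tokens = (buffer_gib * 1024**3) // bytes_per_token
    return tokens, tokens // max_model_len


tokens, sequences = buffer_capacity(buffer_gib=20, max_model_len=16384)
print(f"~{tokens:,} tokens, or about {sequences} sequences at the maximum length")
```

If the buffer is measured in decimal GB instead, the capacity is slightly lower; the point is only that the buffer holds several full-length contexts while leaving headroom within the 48Gi memory request.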

## 2. Deploying with KV Cache Offloading

First, let's uninstall our previous deployment:

In [None]:
!sudo microk8s helm uninstall vllm

Now let's deploy with KV cache offloading enabled:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f kv-offload-config.yaml

Let's check the status of our deployment:

In [None]:
!sudo microk8s kubectl get pods
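
Model download and loading can take several minutes, so the pods may not be ready right away. The sketch below shows one way to check readiness programmatically by parsing the default `kubectl get pods` table. The helper name and the sample pod names are made up for illustration; the sample text is not real output from this deployment.

```python
def all_pods_ready(kubectl_output):
    """Return True when every pod row reports STATUS Running and READY n/n."""
    lines = kubectl_output.strip().splitlines()[1:]  # skip the header row
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return False  # no pods listed yet
    for name, ready, status, *_ in rows:
        have, want = ready.split("/")
        if status != "Running" or have != want:
            return False
    return True


# Illustrative sample only (hypothetical pod names)
sample_output = """\
NAME                                   READY   STATUS    RESTARTS   AGE
vllm-deployment-router-example         1/1     Running   0          5m
vllm-mistral-deployment-vllm-example   0/1     Running   0          5m
"""
print(all_pods_ready(sample_output))  # False: the second pod is not ready yet
```

In the notebook you could feed it the real table, for example the `stdout` of `subprocess.run(["sudo", "microk8s", "kubectl", "get", "pods"], capture_output=True, text=True)`, and repeat the check every few seconds until it returns `True`.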

Let's check the logs to see if KV cache offloading is working:

In [None]:
# Get the pod name for the vLLM deployment
!POD_NAME=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}') && \
sudo microk8s kubectl logs $POD_NAME --tail=50

## 3. Testing Long Context Performance

Let's test how our setup handles long contexts by sending a request with a long prompt:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 53936:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)
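
A fixed sleep works in most cases, but it can be too short on a busy machine. As an alternative, here is a minimal sketch that polls the forwarded port until it accepts TCP connections. The function name and default timings are illustrative choices, not part of vLLM or kubectl.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # port not open yet; retry shortly
    return False


# Usage in this notebook (port taken from the port-forward command above):
# wait_for_port("localhost", 53936)
```

Note that this only confirms that the local port accepts connections. It does not guarantee that the model behind the router has finished loading.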

In [None]:
# Create a long context by repeating a paragraph 20 times
import requests

long_context = """The development of artificial intelligence has been one of the most significant technological advances in recent history.
It has transformed industries, enhanced scientific research, and changed how we interact with technology.
From machine learning algorithms that power recommendation systems to natural language processing models that enable human-like conversations,
AI continues to push the boundaries of what's possible. However, with these advances come important ethical considerations and
responsibilities that we must carefully consider as we move forward.\n\n""" * 20

# Send the request from Python rather than through !curl: the context contains
# newlines and an apostrophe, which would produce invalid JSON inside a shell string
response = requests.post(
    "http://localhost:53936/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": long_context + "Given this context about AI development, what are the three most important ethical considerations we should focus on? Please provide a concise answer."},
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

## 4. Monitoring Memory Usage

Let's check GPU and CPU memory usage to see the effect of KV cache offloading. Each command below prints a single snapshot, so run these cells while a request is in flight:

In [None]:
# Monitor GPU memory
!nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv

In [None]:
# Monitor CPU memory
!free -h
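
To compare readings over time, it helps to turn the `nvidia-smi` output into numbers. The sketch below parses the output of the same query with `--format=csv,noheader,nounits`, which drops the header row and the unit suffixes. The function name is illustrative and the sample line contains made-up values, not measurements from this setup.

```python
import csv
import io


def parse_gpu_memory(csv_text):
    """Parse rows of 'memory.used, memory.total, utilization.gpu' (MiB, MiB, %)."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text.strip())):
        used_mib, total_mib, util_pct = (float(field) for field in row)
        gpus.append({
            "used_mib": used_mib,
            "total_mib": total_mib,
            "util_pct": util_pct,
            "used_frac": used_mib / total_mib,
        })
    return gpus


sample = "18432, 23028, 35\n"  # made-up values for illustration only
for gpu in parse_gpu_memory(sample):
    print(f"{gpu['used_mib']:.0f} / {gpu['total_mib']:.0f} MiB used "
          f"({gpu['used_frac']:.0%}), GPU utilization {gpu['util_pct']:.0f}%")
```

Calling this before, during, and after a long-context request gives a simple way to see whether GPU memory stays flat while CPU memory grows.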

## 5. Testing Concurrent Requests

Let's test how our setup handles multiple concurrent requests:

In [None]:
import requests
import time
import statistics
import matplotlib.pyplot as plt
from concurrent.futures import ThreadPoolExecutor

def send_request():
    url = "http://localhost:53936/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."}, 
            {"role": "user", "content": "What are three key benefits of KV cache offloading in LLM inference?"}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }
    
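    start_time = time.perf_counter()
    response = requests.post(url, headers=headers, json=data, timeout=300)
    end_time = time.perf_counter()
    response.raise_for_status()  # Don't count a failed request as a valid latency sample

    return end_time - start_time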

# Test with different numbers of concurrent requests
concurrent_requests = [1, 2, 4, 8]
latencies = []

for n_requests in concurrent_requests:
    print(f"Testing with {n_requests} concurrent requests...")
    with ThreadPoolExecutor(max_workers=n_requests) as executor:
        batch_latencies = list(executor.map(lambda _: send_request(), range(n_requests)))
    
    avg_latency = statistics.mean(batch_latencies)
    latencies.append(avg_latency)
    print(f"Average latency: {avg_latency:.2f} seconds\n")

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(concurrent_requests, latencies, 'o-')
plt.title('Latency vs Concurrent Requests')
plt.xlabel('Number of Concurrent Requests')
plt.ylabel('Average Latency (seconds)')
plt.grid(True)
plt.show()
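
Average latency alone can hide slow outliers and says nothing about throughput. As a possible extension, the sketch below wraps one batch in a helper that also reports the median, the slowest request, and requests per second. `measure_batch` and `fake_send` are hypothetical names; `fake_send` only sleeps so that the example runs without a server, and in the notebook you would pass `send_request` instead.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def measure_batch(send, n_requests):
    """Run send() n_requests times concurrently; return latency stats and throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as executor:
        latencies = list(executor.map(lambda _: send(), range(n_requests)))
    wall_time = time.perf_counter() - start
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "requests_per_s": n_requests / wall_time,
    }


def fake_send():
    """Stand-in for send_request(): sleeps briefly and returns the elapsed time."""
    start = time.perf_counter()
    time.sleep(0.05)
    return time.perf_counter() - start


print(measure_batch(fake_send, 4))
```

Plotting `requests_per_s` next to the latency curve shows whether extra concurrency actually increases throughput or only makes each request slower.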

## 6. Conclusion

We've successfully implemented KV cache offloading with Mistral-7B-Instruct-v0.3. Here are the key takeaways:

```
┌─────────────────────────────────────────────────────────────┐
│                       Key Takeaways                         │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Performance Impact  │      │ Resource Usage          │   │
│  │ - Longer contexts   │      │ - Lower GPU memory      │   │
│  │ - More concurrent   │      │ - Higher CPU usage      │   │
│  │   requests          │      │ - Better efficiency     │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Best Use Cases      │      │ Considerations          │   │
│  │ - Long conversations│      │ - CPU memory needed     │   │
│  │ - Document analysis │      │ - Latency trade-off     │   │
│  │ - Multi-user serving│      │ - Resource balance      │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

When to use KV cache offloading:
- You need to handle long conversations or documents
- GPU memory is a bottleneck
- You have sufficient CPU memory available
- You want to serve more concurrent users

Next, you can explore remote shared KV cache (03_remote_shared_kv_cache.ipynb) for even more advanced optimizations.