# Remote Shared KV Cache with vLLM and Mistral-7B-Instruct-v0.3

This notebook demonstrates how to set up remote shared KV cache storage in vLLM, allowing multiple instances to share a common KV cache for improved efficiency and fault tolerance.

## Understanding Remote Shared KV Cache

Remote shared KV cache takes the concept of offloading a step further by allowing multiple vLLM instances to share a common KV cache storage:

```
┌─────────────────────────────────────────────────────────────┐
│                Remote Shared KV Cache                       │
│                                                             │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
│  │ vLLM Pod #1 │     │ vLLM Pod #2 │     │ vLLM Pod #3 │   │
│  └─────┬───────┘     └─────┬───────┘     └─────┬───────┘   │
│        │                   │                   │           │
│        └─────────┬─────────┴─────────┬────────┘           │
│                  │                   │                     │
│                  ▼                   ▼                     │
│      ┌───────────────────┐ ┌───────────────────┐          │
│      │   Local KV Cache  │ │  Shared KV Cache  │          │
│      │   (GPU Memory)    │ │  (Remote Storage) │          │
│      └───────────────────┘ └───────────────────┘          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

**Key Benefits:**
- Multiple model instances can share cached KV pairs
- Fault tolerance: if one pod fails, others can still access the cache
- Horizontal scaling: add more vLLM pods without duplicating cached data
- More consistent performance across pods, since a prefix cached by one pod can be reused by the others
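To make the sharing idea concrete, here is a toy sketch in plain Python. It is not how vLLM or LMCache implement caching: the chunk size, the `ToyPod` class, and the dictionary standing in for remote storage are all hypothetical. It only illustrates that when cache keys are derived from the token prefix, a second pod can reuse chunks that a first pod already stored.

```python
import hashlib

CHUNK_SIZE = 4  # tokens per cache chunk (hypothetical value chosen for readability)

def chunk_keys(token_ids, chunk_size=CHUNK_SIZE):
    """Return one key per full chunk; each key hashes the whole prefix up to that chunk."""
    keys = []
    hasher = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.copy().hexdigest())
    return keys

class ToyPod:
    """Stand-in for a vLLM pod that checks a shared store before computing a chunk."""

    def __init__(self, name, shared_store):
        self.name = name
        self.shared_store = shared_store  # plays the role of the remote cache

    def prefill(self, token_ids):
        """Return (chunks reused from the store, chunks computed and stored)."""
        reused = computed = 0
        for key in chunk_keys(token_ids):
            if key in self.shared_store:
                reused += 1
            else:
                self.shared_store[key] = f"kv-from-{self.name}"  # placeholder for KV tensors
                computed += 1
        return reused, computed

shared_store = {}
pod1 = ToyPod("pod-1", shared_store)
pod2 = ToyPod("pod-2", shared_store)

prefix = list(range(12))  # 12 shared tokens -> 3 full chunks
print(pod1.prefill(prefix + [100, 101, 102, 103]))  # (0, 4): nothing cached yet
print(pod2.prefill(prefix + [200, 201, 202, 203]))  # (3, 1): shared prefix is reused
```

Because each key covers the entire prefix, a request only benefits from chunks that match from the very first token, which is why the tests later in this notebook put the shared text at the start of the prompt.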

## Recommended GPU VM Configuration

For remote shared KV cache with Mistral-7B-Instruct-v0.3, we recommend:

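- **GPU**: 2x NVIDIA A10G or better (24GB+ VRAM each), one GPU per replica
- **CPU**: 10+ cores per replica (the configuration below requests 10 cores for each of its 2 replicas, so plan for 20+ cores on a single node)
- **RAM**: 40GB+ per replica (the configuration below requests 40Gi for each of its 2 replicas, so plan for 80GB+ on a single node)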
- **Storage**: 100GB+ SSD

Let's get started!

## 1. Setting Up Remote Shared KV Cache

First, let's create a configuration file that enables remote shared KV cache:

In [None]:
%%writefile shared-kv-config.yaml
servingEngineSpec:
  modelSpec:
  - name: "mistral"                      # Name for the deployment
    repository: "lmcache/vllm-openai"    # Docker image with LMCache support
    tag: "latest"                        # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.3"  # HuggingFace model ID
    replicaCount: 2                      # Number of replicas to deploy
    requestCPU: 10                       # CPU cores requested
    requestMemory: "40Gi"                # Memory requested
    requestGPU: 1                        # Number of GPUs requested
    pvcStorage: "50Gi"                   # Persistent volume size
    vllmConfig:                          # vLLM-specific configuration
      enableChunkedPrefill: false        # Disable chunked prefill
      enablePrefixCaching: true          # Enable prefix caching
      maxModelLen: 16384                 # Maximum sequence length
    
    lmcacheConfig:                       # LMCache configuration
      enabled: true                      # Enable LMCache
      cpuOffloadingBufferSize: "20"      # 20GB of CPU memory for KV cache
      remoteSharedCache:                 # Remote shared cache configuration
        enabled: true                    # Enable remote shared cache
        endpoint: "redis:6379"           # Redis endpoint
        maxCacheSize: "50Gi"             # Maximum shared cache size
    
    hf_token: ""                         # HuggingFace token (if needed)

### Understanding the Configuration

Let's break down the key configuration parameters:

```
┌─────────────────────────────────────────────────────────────┐
│                Configuration Parameters                     │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Resource Config     │      │ Model Config            │   │
│  │ - 10 CPU cores      │      │ - Mistral-7B-Instruct   │   │
│  │ - 40GB RAM          │      │ - 16K context length    │   │
│  │ - 1 GPU per pod     │      │ - Prefix caching on     │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ LMCache Config      │      │ Shared Cache Config     │   │
│  │ - Enabled           │      │ - Redis backend         │   │
│  │ - 20GB CPU buffer   │      │ - 50GB max size         │   │
│  │ - Auto-offloading   │      │ - Shared across pods    │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

Key differences from KV cache offloading:
- Multiple replicas (2 pods)
- Remote shared cache enabled
- Redis backend for shared storage
- Prefix caching enabled for better performance
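Since this setup runs two replicas, it can help to add up what the cluster must provide before deploying. The short calculation below is a sketch: the numbers are copied by hand from `shared-kv-config.yaml` above and from the Redis Deployment defined in the next section, and the dictionary layout is just an illustrative choice.

```python
# Per-replica resource requests, copied from the YAML files in this notebook.
vllm = {"replicas": 2, "cpu": 10, "memory_gi": 40, "gpu": 1}
redis = {"replicas": 1, "cpu": 2, "memory_gi": 4, "gpu": 0}

def total_requests(*workloads):
    """Sum per-replica resource requests across all workloads."""
    totals = {"cpu": 0, "memory_gi": 0, "gpu": 0}
    for workload in workloads:
        for key in totals:
            totals[key] += workload["replicas"] * workload[key]
    return totals

totals = total_requests(vllm, redis)
print(totals)  # {'cpu': 22, 'memory_gi': 84, 'gpu': 2}
```

If the node has fewer allocatable resources than these totals, the second vLLM pod will likely stay in `Pending` until you lower the requests or add capacity.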

## 2. Setting Up Redis for Shared Cache

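Before deploying vLLM, let's deploy Redis as the shared cache store. Note that this Deployment limits Redis to 8Gi of memory, well below the 50Gi `maxCacheSize` set in `shared-kv-config.yaml`; raise the limit if you need the full cache size: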

In [None]:
%%writefile redis-config.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:latest
        ports:
        - containerPort: 6379
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
---
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis
  ports:
  - port: 6379
    targetPort: 6379

In [None]:
!sudo microk8s kubectl apply -f redis-config.yaml
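Before moving on, you may want to confirm that Redis answers. The sketch below sends Redis's inline `PING` command over a plain TCP socket, so it needs no extra packages. It assumes you have forwarded the `redis` Service to your machine (for example with `sudo microk8s kubectl port-forward svc/redis 6379:6379` in another terminal) and that no password is configured, as in the manifest above. The `redis_ping` helper is our own name, not part of any library.

```python
import socket

def redis_ping(host="localhost", port=6379, timeout=2.0):
    """Send an inline PING and return True if the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Connection refused, timed out, or otherwise unreachable
        return False
    return reply.startswith(b"+PONG")

print("Redis reachable:", redis_ping())
```

A `False` result here only means the notebook host could not reach Redis; the vLLM pods connect through the in-cluster address `redis:6379` and are not affected by the port-forward.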

## 3. Deploying with Remote Shared KV Cache

If you still have a `vllm` release installed from a previous notebook, uninstall it first:

In [None]:
!sudo microk8s helm uninstall vllm

Now let's deploy with remote shared KV cache enabled:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f shared-kv-config.yaml

Let's check the status of our deployment:

In [None]:
!sudo microk8s kubectl get pods

Let's check the logs to see if remote shared KV cache is working:

In [None]:
# Get the pod names for the vLLM deployment
!POD_NAMES=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}') && \
for pod in $POD_NAMES; do \
    printf '\nLogs for %s:\n' "$pod" && \
    sudo microk8s kubectl logs $pod --tail=20; \
done

## 4. Testing Load Distribution

Let's test how our setup handles requests across multiple pods:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 53936:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)
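A fixed five-second sleep can be too short on a slow machine and wasteful on a fast one. As an alternative, here is a small sketch that polls the forwarded port until it accepts a TCP connection. The `wait_for_port` helper and its default timeout are our own choices; the port number matches the `port-forward` command above.

```python
import socket
import time

def wait_for_port(host="localhost", port=53936, timeout=30.0, interval=0.5):
    """Poll until a TCP connection succeeds; return True, or False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet; wait briefly and try again
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` returns as soon as the local port is listening. This only shows that the port-forward is up, not that the model has finished loading, so the first request may still take a while; if the call returns `False`, `port_forward.log` is the place to look.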

In [None]:
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# List of prompts to test with
prompts = [
    "What are the key benefits of remote shared KV cache?",
    "How does vLLM improve inference performance?",
    "Explain the concept of KV cache offloading.",
    "What are the best practices for LLM deployment?"
]

def send_request(prompt):
    url = "http://localhost:53936/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."}, 
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }
    
    start_time = time.time()
    # Set a generous timeout so a stuck pod cannot hang the notebook indefinitely
    response = requests.post(url, headers=headers, json=data, timeout=120)
    end_time = time.time()
    
    return {
        'latency': end_time - start_time,
        'status': response.status_code,
        'prompt': prompt
    }

# Test with multiple concurrent requests
n_requests = 8
print(f"Sending {n_requests} concurrent requests...\n")

# Pick random prompts for testing
test_prompts = [random.choice(prompts) for _ in range(n_requests)]

wall_start = time.time()
with ThreadPoolExecutor(max_workers=n_requests) as executor:
    results = list(executor.map(send_request, test_prompts))
wall_time = time.time() - wall_start  # elapsed time for the whole batch

# Analyze results
total_latency = sum(r['latency'] for r in results)
avg_latency = total_latency / len(results)
success_rate = sum(1 for r in results if r['status'] == 200) / len(results) * 100

print(f"Average latency: {avg_latency:.2f} seconds")
print(f"Success rate: {success_rate:.1f}%")
# Requests ran concurrently, so divide by wall-clock time, not by summed latency
print(f"Total throughput: {n_requests / wall_time:.2f} requests/second")
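An average can hide slow outliers, and for concurrent requests throughput is the number of requests divided by the wall-clock time of the whole batch rather than by the sum of individual latencies. The sketch below shows one way to summarize a run with nearest-rank percentiles. The `percentile` and `summarize` helpers are our own, and the sample numbers are made up for illustration, not measured.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(latencies, wall_clock_seconds):
    """Summarize per-request latencies and overall batch throughput."""
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": max(latencies),
        "throughput_rps": len(latencies) / wall_clock_seconds,
    }

# Hypothetical latencies in seconds, for illustration only
example = summarize([1.2, 1.4, 1.3, 2.9, 1.5, 1.6, 1.4, 1.3], wall_clock_seconds=3.0)
print(example)
```

To apply it to the run above, pass `[r['latency'] for r in results]` together with the elapsed time you measured around the `ThreadPoolExecutor` block.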

## 5. Testing Cache Sharing

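Let's test whether the KV cache is being reused by sending prompts that share a long common prefix. The router decides which pod serves each request, so consecutive requests may or may not land on different pods: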

In [None]:
# Shared prefix; a different question is appended to it for each request
base_prompt = """The development of artificial intelligence has been one of the most significant technological advances in recent history. 
It has transformed industries, enhanced scientific research, and changed how we interact with technology. 
From machine learning algorithms that power recommendation systems to natural language processing models that enable human-like conversations, 
AI continues to push the boundaries of what's possible."""

# List of questions to append to the base prompt
questions = [
    "What are the key ethical considerations in AI development?",
    "How can we ensure responsible AI development?",
    "What are the main challenges in AI governance?",
    "How can we address AI bias and fairness?"
]

# Send requests with similar prompts
results = []
for question in questions:
    prompt = f"{base_prompt}\n\n{question}"
    result = send_request(prompt)
    results.append(result)
    print(f"Question: {question}")
    print(f"Latency: {result['latency']:.2f} seconds\n")
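One simple way to read these numbers is to compare the first request, which has to compute the shared prefix, with the later ones, which may reuse it. The helper below is a sketch under that assumption; `cold_vs_warm` is our own name and the sample values are hypothetical.

```python
from statistics import mean

def cold_vs_warm(latencies):
    """Compare the first (cold) request with the mean of the later (warm) ones."""
    if len(latencies) < 2:
        raise ValueError("need at least two latencies")
    cold = latencies[0]
    warm = mean(latencies[1:])
    return {"cold": cold, "warm_mean": warm, "speedup": cold / warm}

# Hypothetical latencies in seconds, for illustration only
print(cold_vs_warm([2.0, 1.0, 1.0, 1.0]))
```

For the run above you could call `cold_vs_warm([r['latency'] for r in results])`. Treat the result with some caution: each request also generates up to 100 tokens, so decoding time is included in every latency, and the base prompt here is fairly short, so the observed speedup may be small. A speedup also does not prove that the remote cache was used, since a request routed to the same pod can hit that pod's local prefix cache instead.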

## 6. Conclusion

We've successfully implemented remote shared KV cache with Mistral-7B-Instruct-v0.3. Here are the key takeaways:

```
┌─────────────────────────────────────────────────────────────┐
│                       Key Takeaways                         │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Performance Impact  │      │ Resource Usage          │   │
│  │ - Shared caching    │      │ - Distributed load      │   │
│  │ - Better scaling    │      │ - Efficient storage     │   │
│  │ - Load balancing    │      │ - Resource sharing      │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Best Use Cases      │      │ Considerations          │   │
│  │ - High availability │      │ - Network latency       │   │
│  │ - Multi-pod deploy  │      │ - Cache consistency     │   │
│  │ - Fault tolerance   │      │ - Storage requirements  │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

When to use remote shared KV cache:
- You need high availability and fault tolerance
- You're running multiple vLLM instances
- You want to optimize resource usage across pods
- You need horizontal scaling capabilities

Next, you can explore performance benchmarking (04_performance_benchmarking.ipynb) to measure and compare different configurations.