# Remote Shared KV Cache with vLLM Production Stack

This notebook demonstrates how to set up remote shared KV cache storage in the vLLM production stack, allowing multiple vLLM instances to share a common KV cache for improved efficiency and fault tolerance.

## Understanding Remote Shared KV Cache

Remote shared KV cache takes offloading a step further: multiple vLLM instances share a common KV cache store. As the diagram shows, each pod still keeps its own KV cache in GPU memory, while all pods read from and write to the same remote store:

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  vLLM Pod #1  │     │  vLLM Pod #2  │     │  vLLM Pod #3  │
└───────┬───────┘     └───────┬───────┘     └───────┬───────┘
        │                     │                     │
        └─────────────┬───────┴─────────────┬───────┘
                      │                     │
                      ▼                     ▼
        ┌───────────────────────┐ ┌───────────────────────┐
        │   Local KV Cache      │ │   Shared KV Cache     │
        │   (GPU Memory)        │ │   (Remote Storage)    │
        └───────────────────────┘ └───────────────────────┘
```

**Key Benefits:**
- **Resource Efficiency**: Multiple model instances can share cached KV pairs
- **Fault Tolerance**: If one pod fails, others can still access the shared cache
- **Horizontal Scaling**: Add more vLLM pods without duplicating cached data
- **Consistent Performance**: A prefix cached by one pod can be reused by any other, so latency depends less on which pod serves a request

## Recommended GPU VM Configuration

For this tutorial, we recommend:

- **GPU**: 2x NVIDIA A10G or better (24GB+ VRAM each)
- **CPU**: 16+ cores
- **RAM**: 64GB+
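- **Storage**: 100GB+ SSD

Note that the configuration in Section 2 requests 10 CPU cores and 40Gi of memory for each of the two vLLM replicas, plus 4 cores and 8G for the cache server. On a VM close to the minimums above, lower `requestCPU` and `requestMemory` so that both replicas can be scheduled.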

Let's get started!

## Prerequisites

Before proceeding, make sure you have completed the setup in the first notebook (`01_setup_vllm_production_stack.ipynb`). Let's verify that our Kubernetes environment is running:

In [None]:
!sudo microk8s status

## 1. How Remote Shared KV Cache Works

Remote shared KV cache works by setting up a dedicated cache server that multiple vLLM instances can connect to. When a vLLM instance generates tokens, it stores the KV cache entries in the shared cache server, and other instances can retrieve and reuse these entries.

This is particularly useful for:
- **Stateless scaling**: Instances can be added or removed without losing cached data
- **Load balancing**: Requests can be distributed across instances while maintaining cache benefits
- **High availability**: If one instance fails, others can continue serving requests using the shared cache

In the vLLM production stack, this is implemented using LMCache's remote storage capabilities.
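
To make the store-and-reuse flow concrete, here is a toy model in plain Python. It is only a sketch of the idea, not LMCache's implementation: `ToyInstance`, `chunk_keys`, the 4-token chunk size, and the dictionary standing in for the cache server are all hypothetical names and values chosen for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tokens per cached chunk (illustrative value, not an LMCache setting)


def chunk_keys(tokens, chunk_size=CHUNK_SIZE):
    """Yield one key per full chunk; each key hashes the whole prefix up to that chunk."""
    digest = hashlib.sha256()
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        digest.update(repr(tokens[start:start + chunk_size]).encode())
        yield digest.copy().hexdigest()


class ToyInstance:
    """Stand-in for one vLLM pod; `store` plays the role of the shared cache server."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def prefill(self, tokens):
        """Return (chunks reused from the store, chunks computed locally)."""
        reused = computed = 0
        miss_seen = False
        for key in chunk_keys(tokens):
            if not miss_seen and key in self.store:
                reused += 1
            else:
                # After the first miss, every later chunk must be recomputed.
                miss_seen = True
                computed += 1
                self.store[key] = f"kv-from-{self.name}"  # placeholder for KV tensors
        return reused, computed


shared_store = {}
pod_a = ToyInstance("pod-a", shared_store)
pod_b = ToyInstance("pod-b", shared_store)

prompt = list(range(12))  # 12 token IDs, i.e. 3 full chunks
print(pod_a.prefill(prompt))                        # (0, 3): nothing cached yet
print(pod_b.prefill(prompt + [99, 100, 101, 102]))  # (3, 1): shared prefix reused
```

The second call reuses the three chunks that `pod_a` stored and computes only the new one. Keying each chunk by a hash of the entire prefix up to that point is a design choice in this sketch, so that a chunk is reused only when everything before it matches.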

## 2. Configuring Remote Shared KV Cache

Let's create a configuration file for remote shared KV cache. We'll deploy two vLLM instances that share a common KV cache server:

In [None]:
%%writefile remote-shared-storage-config.yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "mistral"                      # Name for the deployment
    repository: "lmcache/vllm-openai"    # Docker image with LMCache support
    tag: "latest"                        # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.3"  # HuggingFace model ID
    replicaCount: 2                      # Deploy 2 replicas to demonstrate sharing
    requestCPU: 10                       # CPU cores requested
    requestMemory: "40Gi"                # Memory requested
    requestGPU: 1                        # Number of GPUs requested
    pvcStorage: "50Gi"                   # Persistent volume size
    vllmConfig:                          # vLLM-specific configuration
      enableChunkedPrefill: false        # Disable chunked prefill
      enablePrefixCaching: false         # Disable prefix caching
      maxModelLen: 16384                 # Maximum sequence length
    
    lmcacheConfig:                       # LMCache configuration
      enabled: true                      # Enable LMCache
      cpuOffloadingBufferSize: "20"      # 20GB of CPU memory for KV cache
    
    hf_token: ""                         # HuggingFace token (if needed)

# This section configures the shared cache server
cacheserverSpec:
  replicaCount: 1                        # Number of cache server replicas
  containerPort: 8080                    # Container port
  servicePort: 81                        # Service port
  serde: "naive"                         # Serialization/deserialization method
  repository: "lmcache/vllm-openai"      # Docker image
  tag: "latest"                          # Image tag
  resources:                             # Resource requests and limits
    requests:
      cpu: "4"                           # CPU cores requested
      memory: "8G"                       # Memory requested
    limits:
      cpu: "4"                           # CPU cores limit
      memory: "10G"                      # Memory limit
  labels:                                # Kubernetes labels
    environment: "cacheserver"           # Environment label
    release: "cacheserver"               # Release label

### Key Configuration Parameters Explained

- **replicaCount: 2**: We're deploying two vLLM instances to demonstrate shared caching
- **cacheserverSpec**: This section configures the dedicated cache server
  - **replicaCount: 1**: We're deploying one cache server instance
  - **serde: "naive"**: The serialization method for cache entries
  - **resources**: CPU and memory resources for the cache server
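
As a quick sanity check before deploying, the resource requests in the file above can be added up. The sketch below copies the values by hand into dictionaries (the names `engine`, `cache_server`, and `total_requests` are illustrative) and ignores the router pod and system overhead, so treat the result as a lower bound.

```python
# Values copied by hand from remote-shared-storage-config.yaml.
engine = {"replicas": 2, "cpu": 10, "memory_gi": 40, "gpu": 1}
cache_server = {"replicas": 1, "cpu": 4, "memory_g": 8}

GIB = 2**30  # Kubernetes "Gi" suffix
GB = 10**9   # Kubernetes "G" suffix


def total_requests(engine, cache_server):
    """Sum the CPU, memory, and GPU requests across all replicas."""
    cpu = (engine["replicas"] * engine["cpu"]
           + cache_server["replicas"] * cache_server["cpu"])
    memory_bytes = (engine["replicas"] * engine["memory_gi"] * GIB
                    + cache_server["replicas"] * cache_server["memory_g"] * GB)
    gpu = engine["replicas"] * engine["gpu"]
    return {"cpu": cpu, "memory_gib": memory_bytes / GIB, "gpu": gpu}


totals = total_requests(engine, cache_server)
print(f"CPU cores: {totals['cpu']}")
print(f"Memory:    {totals['memory_gib']:.1f} GiB")
print(f"GPUs:      {totals['gpu']}")
```

If the totals exceed what the node can allocate, some pods will stay in `Pending` until the requests are reduced or more capacity is added.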

Now let's deploy the vLLM stack with remote shared KV cache:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f remote-shared-storage-config.yaml

Let's check the status of our deployment. This might take a few minutes as the model is downloaded and loaded:

In [None]:
!sudo microk8s kubectl get pods
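
Reading the `READY` column by eye works, but the same check can be scripted. The following sketch assumes the standard `kubectl get pods -o json` output, where each pod lists a `Ready` condition under `status.conditions`; the helper names `unready_pods` and `fetch_pods` are hypothetical.

```python
import json
import subprocess


def unready_pods(pod_list):
    """Return the names of pods whose Ready condition is not "True"."""
    unready = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            unready.append(pod["metadata"]["name"])
    return unready


def fetch_pods():
    """Run kubectl in the default namespace and parse its JSON output."""
    result = subprocess.run(
        ["sudo", "microk8s", "kubectl", "get", "pods", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `unready_pods(fetch_pods())` in a notebook cell returns an empty list once every pod is ready. Pods that have run to completion would also be reported as not ready, which is acceptable here because this deployment has none.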

Let's check the logs to verify that LMCache with remote shared storage is active:

In [None]:
# Get the pod name for the first vLLM deployment
!POD_NAME=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}' | head -1) && \
sudo microk8s kubectl logs $POD_NAME | grep -i lmcache

Let's also check the logs of the cache server:

In [None]:
# Get the pod name for the cache server
!POD_NAME=$(sudo microk8s kubectl get pods | grep cacheserver | awk '{print $1}') && \
sudo microk8s kubectl logs $POD_NAME | head -20

## 3. Testing Remote Shared KV Cache

Let's test our deployment by sending requests through the router service, which distributes them across the two vLLM instances:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 30080:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)
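
A fixed five-second sleep is usually enough, but it can be replaced by polling until the forwarded port accepts connections. This is a minimal sketch with a hypothetical helper `wait_for_port`; it only confirms that something is listening on the local port, not that the router behind it is healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; retry after a short pause.
            time.sleep(interval)
    return False
```

In the notebook, `wait_for_port("localhost", 30080)` would take the place of `time.sleep(5)`.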

In [None]:
# Test the API by listing available models
!curl -o- http://localhost:30080/v1/models

Now let's create a script to send multiple requests and observe the shared cache behavior:

In [None]:
%%writefile test_shared_cache.py
import requests
import time
import random
import threading

def send_request(prompt, session_id, max_tokens=100):
    """Send a completion request to the vLLM API"""
    url = "http://localhost:30080/v1/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Use a consistent session ID to simulate the same user across different pods
        "user": f"user-{session_id}"
    }
    
    start_time = time.time()
    response = requests.post(url, headers=headers, json=data)
    end_time = time.time()
    
    return {
        "status_code": response.status_code,
        "response": response.json() if response.status_code == 200 else None,
        "time": end_time - start_time
    }

def run_session(session_id, prompt, num_requests=3):
    """Run a series of requests with the same session ID"""
    print(f"Starting session {session_id}")
    times = []
    
    for i in range(num_requests):
        print(f"Session {session_id}, Request {i+1}/{num_requests}")
        result = send_request(prompt, session_id)
        if result["status_code"] == 200:
            times.append(result["time"])
            print(f"Session {session_id}, Request {i+1} completed in {result['time']:.2f} seconds")
            # Add a small delay to simulate real-world usage
            time.sleep(random.uniform(0.5, 1.5))
        else:
            print(f"Session {session_id}, Request {i+1} failed with status code {result['status_code']}")
    
    if times:
        print(f"\nSession {session_id} Results:")
        print(f"Average time: {sum(times)/len(times):.2f} seconds")
        print(f"First request: {times[0]:.2f} seconds")
        print(f"Last request: {times[-1]:.2f} seconds")
        print(f"Improvement: {(times[0] - times[-1])/times[0]*100:.2f}%")
    else:
        print(f"No successful requests for session {session_id}.")

if __name__ == "__main__":
    # Create a prompt that will benefit from KV cache
    prompt = """Explain the concept of distributed systems and how they relate to cloud computing. 
    Include information about consistency models, fault tolerance, and scalability."""
    
    # Run multiple sessions in parallel to simulate different users
    threads = []
    for i in range(3):
        thread = threading.Thread(target=run_session, args=(i, prompt))
        threads.append(thread)
        thread.start()
    
    # Wait for all sessions to complete
    for thread in threads:
        thread.join()

Now let's run the test script:

In [None]:
!python test_shared_cache.py
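
The script compares only the first and last request of each session, which can be noisy. One alternative, sketched below with a hypothetical `summarize` helper, is to compare the first (cold) request with the median of the later (warm) ones. The timings passed in at the end are placeholder numbers for illustration, not measurements.

```python
import statistics


def summarize(times):
    """Compare the first (cold) timing with the median of the remaining (warm) timings."""
    if len(times) < 2:
        raise ValueError("need at least two timings")
    cold = times[0]
    warm = statistics.median(times[1:])
    return {"cold_s": cold, "warm_median_s": warm, "speedup": cold / warm}


# Placeholder timings in seconds; replace with the values collected per session.
print(summarize([2.0, 1.0, 0.5, 1.5]))
```

Two caveats apply when reading the numbers. The three sessions send the same prompt in parallel, so a session's "first" request may already benefit from entries stored by another session. End-to-end time also includes generating up to 100 tokens, which the KV cache does not speed up, so the gain from reusing a short prompt's prefill may be small.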

Let's check the logs of both vLLM instances and the cache server to see the shared cache in action:

In [None]:
# Get the pod names for the vLLM deployments
!POD_NAMES=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}') && \
for POD in $POD_NAMES; do \
    printf '\nLogs for %s:\n' "$POD" && \
    sudo microk8s kubectl logs $POD | grep -i "cache\|hit\|miss" | tail -10; \
done

In [None]:
# Check the cache server logs
!POD_NAME=$(sudo microk8s kubectl get pods | grep cacheserver | awk '{print $1}') && \
sudo microk8s kubectl logs $POD_NAME | grep -i "cache\|hit\|miss" | tail -20

## 4. Demonstrating Fault Tolerance

One of the key benefits of remote shared KV cache is fault tolerance. Let's demonstrate this by deleting one of the vLLM pods and observing how the system continues to function:

In [None]:
# Get the pod names for the vLLM deployments
!POD_NAMES=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}') && \
POD_TO_DELETE=$(echo $POD_NAMES | awk '{print $1}') && \
echo "Deleting pod $POD_TO_DELETE" && \
sudo microk8s kubectl delete pod $POD_TO_DELETE
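
While the replacement pod starts, the remaining replica should keep serving requests. One way to observe this is to probe the endpoint repeatedly and count the successes. The sketch below is an assumption-laden illustration: `probe` is a hypothetical helper that takes any zero-argument callable returning `True` on success.

```python
import time


def probe(send, attempts=10, delay=1.0):
    """Call `send()` `attempts` times and return how many calls succeeded."""
    ok = 0
    for _ in range(attempts):
        try:
            if send():
                ok += 1
        except OSError:
            # Connection errors count as failed attempts.
            pass
        time.sleep(delay)
    return ok
```

With the test script from Section 3 in the working directory, a possible call is `probe(lambda: send_request("Hello", 0, max_tokens=5)["status_code"] == 200)` after `from test_shared_cache import send_request`. Connection errors raised by `requests` derive from `OSError`, so they are counted as failures rather than stopping the loop.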

Let's wait for Kubernetes to create a new pod to replace the deleted one:

In [None]:
# Wait for the new pod to be created
import time
print("Waiting for new pod to be created...")
for i in range(10):
    !sudo microk8s kubectl get pods | grep vllm-mistral-deployment
    time.sleep(5)

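Once the replacement pod is running with all containers ready, run the test script again to check that the shared cache is still working. Loading the model can take longer than the roughly 50 seconds the loop above waits, so re-run that cell if the pod is not ready yet: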

In [None]:
!python test_shared_cache.py

## 5. Cleanup

When you're done, you can clean up the deployment:

In [None]:
!sudo microk8s helm uninstall vllm

## Conclusion

In this notebook, we've successfully:

1. Configured remote shared KV cache using LMCache
2. Deployed multiple vLLM instances that share a common KV cache server
3. Tested the deployment with parallel requests
4. Demonstrated the fault tolerance of the system

### Key Takeaways

- Remote shared KV cache allows multiple vLLM instances to share cached data
- This improves resource efficiency and enables horizontal scaling
- The system is more fault-tolerant, as cached data persists even if individual pods fail
- Latency varies less between instances, because a prefix cached by one pod can be reused by the others

### When to Use Remote Shared KV Cache

Consider implementing remote shared KV cache when:
- You need to deploy multiple vLLM instances for high availability
- You want to scale horizontally while maintaining efficient resource usage
- You need fault tolerance for your LLM deployment
- You have many users sending similar queries

Remote shared KV cache is a powerful technique for building scalable, efficient, and fault-tolerant LLM deployments.