# Setting Up vLLM Production Stack with Mistral-7B-Instruct-v0.3

This notebook guides you through setting up a vLLM production stack with Mistral-7B-Instruct-v0.3. We'll cover:

1. Setting up a Kubernetes environment with MicroK8s
2. Installing the vLLM production stack using Helm
3. Deploying and testing Mistral-7B-Instruct-v0.3

## What is vLLM?

vLLM is an open-source library for LLM inference and serving. It is designed for high throughput under concurrent load, mainly by managing the attention key-value (KV) cache efficiently and by batching incoming requests continuously.

```
┌─────────────────────────────────────────────────────────────┐
│                        vLLM Architecture                    │
│                                                             │
│  ┌─────────────┐       ┌─────────────┐      ┌──────────┐   │
│  │ API Server  │◄─────►│ Scheduler   │◄────►│ Worker 1 │   │
│  └─────────────┘       └─────────────┘      └──────────┘   │
│         ▲                     ▲                  ▲         │
│         │                     │                  │         │
│         ▼                     ▼                  ▼         │
│  ┌─────────────┐       ┌─────────────┐      ┌──────────┐   │
│  │ Client      │       │ PagedAttn   │      │ Worker 2 │   │
│  │ Applications│       │ Memory Mgmt │      └──────────┘   │
│  └─────────────┘       └─────────────┘                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

Key features include:
- PagedAttention, which stores the KV cache in fixed-size blocks to reduce memory fragmentation
- Continuous batching, which adds new requests to a running batch instead of waiting for it to finish
- Optimized CUDA kernels for faster execution
- OpenAI-compatible API for easy integration
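
To make the PagedAttention idea more concrete, here is a toy sketch of block-table bookkeeping. Each sequence's KV cache is split into fixed-size blocks that map to whichever physical blocks are free, so memory is claimed on demand rather than reserved up front for the maximum length. The class and method names are hypothetical, the 16-token block size is an assumption for illustration, and this is not vLLM's actual implementation.

```python
class ToyBlockAllocator:
    """Toy paged KV-cache bookkeeping (illustrative only, not vLLM's code)."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Account for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = ToyBlockAllocator(num_physical_blocks=4)
for _ in range(17):  # 17 tokens need two 16-token blocks
    allocator.append_token("request-a")
allocator.append_token("request-b")
print(allocator.block_tables)  # {'request-a': [0, 1], 'request-b': [2]}
```

Because blocks are handed out one at a time, a short request never holds memory sized for a long one, which is what lets more requests share the same GPU.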

## Recommended GPU VM Configuration

For running Mistral-7B-Instruct-v0.3, we recommend:

- **GPU**: NVIDIA A10G or better (24GB+ VRAM)
- **CPU**: 8+ cores
- **RAM**: 32GB+ (64GB recommended)
- **Storage**: 100GB+ SSD
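
The 24GB figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes roughly 7.25 billion parameters, 16-bit weights and KV cache, 32 layers, 8 key-value heads, and a head dimension of 128. These values are assumptions taken from the commonly published model configuration, so verify them against the model's `config.json` before relying on the result.

```python
GIB = 1024 ** 3

# Assumed model figures (check the model card / config.json).
num_params = 7.25e9   # approximate parameter count
bytes_per_value = 2   # 16-bit weights and KV cache
num_layers = 32
num_kv_heads = 8      # grouped-query attention
head_dim = 128

weights_gib = num_params * bytes_per_value / GIB

# One K and one V vector per KV head, per layer, for every token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_gib_per_8k_seq = kv_bytes_per_token * 8192 / GIB

print(f"weights: {weights_gib:.1f} GiB")
print(f"KV cache per token: {kv_bytes_per_token // 1024} KiB")
print(f"KV cache per 8192-token sequence: {kv_gib_per_8k_seq:.1f} GiB")
```

Under these assumptions the weights take about 13.5 GiB and each full 8192-token sequence adds about 1 GiB of KV cache, so a 24GB card leaves room for only a handful of full-length sequences at once.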

Let's get started!

## 1. Environment Setup and Prerequisites

### What is Kubernetes and why use it for LLM deployment?

Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. For LLM deployments, Kubernetes offers several advantages:

```
┌─────────────────────────────────────────────────────────────┐
│                 Kubernetes Benefits for LLMs                │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ Scalability         │      │ Resource Management     │   │
│  │ - Auto-scaling      │      │ - GPU allocation        │   │
│  │ - Load balancing    │      │ - Memory limits         │   │
│  │ - Rolling updates   │      │ - Resource quotas       │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
│  ┌─────────────────────┐      ┌─────────────────────────┐   │
│  │ High Availability   │      │ Infrastructure as Code  │   │
│  │ - Pod replication   │      │ - Declarative configs   │   │
│  │ - Self-healing      │      │ - Version control       │   │
│  │ - Node failover     │      │ - Easy replication      │   │
│  └─────────────────────┘      └─────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

First, let's check if we have GPU support available:

In [None]:
!nvidia-smi
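
If you would rather check the GPU from Python, for example to compare the reported memory against the 24GB recommendation, something like the following may help. It is a minimal sketch that relies on `nvidia-smi`'s CSV query output; the helper names `parse_gpu_csv` and `list_gpus` are hypothetical.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse `name, memory.total` CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name is kept.
        name, mem = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus


def list_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `list_gpus()` on the VM returns one tuple per GPU; a memory value well below 24000 MiB suggests the card is smaller than recommended.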

### 1.1 Install Required Tools

Let's create a script to install MicroK8s and set up our Kubernetes environment:

In [None]:
%%writefile setup_microk8s.sh
#!/bin/bash
# Function to check command status
check_status() {
    if [ $? -ne 0 ]; then
        echo "Error: $1 failed"
        exit 1
    fi
}

# Function to check if command exists
command_exists() {
    command -v "$1" >/dev/null 2>&1
}

# Check if nvidia-smi is available
echo "Checking NVIDIA GPU..."
if ! command_exists nvidia-smi; then
    echo "Warning: nvidia-smi not found. GPU support may not be available."
else
    nvidia-smi
    check_status "nvidia-smi"
fi

# Install MicroK8s
echo "Installing MicroK8s..."
sudo snap install microk8s --classic --channel=1.25/stable
check_status "MicroK8s installation"

# Add user to microk8s group
echo "Adding user to microk8s group..."
sudo usermod -a -G microk8s "$USER"
check_status "Adding user to microk8s group"

# Create and set permissions for .kube directory
echo "Setting up .kube directory..."
mkdir -p ~/.kube
chmod 0700 ~/.kube
sudo chown -f -R "$USER" ~/.kube
check_status "Setting up .kube directory"

# Wait for MicroK8s to be ready
echo "Waiting for MicroK8s to be ready..."
sudo microk8s status --wait-ready
check_status "MicroK8s ready check"

# Enable GPU and storage support
echo "Enabling GPU and hostpath-storage..."
sudo microk8s enable gpu hostpath-storage
check_status "Enabling MicroK8s addons"

# Double check status
echo "Checking final MicroK8s status..."
sudo microk8s status --wait-ready
check_status "Final MicroK8s status check"

# Set up Helm repositories
echo "Setting up Helm repositories..."
sudo microk8s helm repo remove nvidia || true  # Remove if exists
sudo microk8s helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
sudo microk8s helm repo update

# Run the smoke test with the new group membership without requiring logout
echo "Activating microk8s group membership..."
if ! groups | grep -q microk8s; then
    # No `exec` here: it would replace this shell and skip the final message
    sg microk8s -c '
        echo "Testing cluster access..."
        sudo microk8s kubectl get services
        sudo microk8s kubectl get nodes
        echo "Creating example nginx deployment..."
        sudo microk8s kubectl create deployment nginx --image=nginx
        echo "Checking pods..."
        sudo microk8s kubectl get pods
    '
else
    echo "Testing cluster access..."
    sudo microk8s kubectl get services
    sudo microk8s kubectl get nodes
    echo "Creating example nginx deployment..."
    sudo microk8s kubectl create deployment nginx --image=nginx
    echo "Checking pods..."
    sudo microk8s kubectl get pods
fi

echo "Setup of MicroK8s is complete!"

Now let's make the script executable and run it:

In [None]:
!chmod +x setup_microk8s.sh
!./setup_microk8s.sh
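
Once the script finishes, it can be worth confirming that the GPU addon actually exposed the GPU to Kubernetes. The sketch below reads the node list as JSON and reports how many `nvidia.com/gpu` resources each node advertises as allocatable. The helper names are hypothetical, and it assumes the addon registers GPUs under that standard resource name.

```python
import json
import subprocess


def gpu_capacity(node_list):
    """Map node name -> number of allocatable `nvidia.com/gpu` resources."""
    return {
        node["metadata"]["name"]: int(
            node["status"].get("allocatable", {}).get("nvidia.com/gpu", 0)
        )
        for node in node_list.get("items", [])
    }


def fetch_nodes():
    """Fetch the node list from the MicroK8s cluster as a dict."""
    out = subprocess.run(
        ["sudo", "microk8s", "kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

Running `gpu_capacity(fetch_nodes())` should show at least one node with a value of 1 or more. A value of 0 usually means the GPU operator pods are still starting, which can take several minutes.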

## 2. Setting Up vLLM with Helm

Now that we have our Kubernetes environment ready, let's set up vLLM using Helm. First, we'll add the vLLM Helm repository:

In [None]:
# Add the vLLM Helm repository
!sudo microk8s helm repo add vllm https://vllm-project.github.io/production-stack
!sudo microk8s helm repo update

## 3. Deploying Mistral-7B-Instruct-v0.3

Now let's create a configuration file for deploying the Mistral-7B-Instruct-v0.3 model:

In [None]:
%%writefile mistral-config.yaml
servingEngineSpec:
  runtimeClassName: ""                  # Runtime class name (leave empty for default)
  modelSpec:
  - name: "mistral"                     # Name for the deployment
    repository: "vllm/vllm-openai"      # Docker image for vLLM
    tag: "latest"                       # Image tag
    modelURL: "mistralai/Mistral-7B-Instruct-v0.3"  # Mistral model
    hf_token: ""                        # Your HuggingFace token (required if the model repo is gated)
    replicaCount: 1                     # Single replica
    requestCPU: 8                       # CPU cores requested
    requestMemory: "32Gi"               # Memory requested
    requestGPU: 1                       # Number of GPUs requested
    pvcStorage: "50Gi"                  # Persistent volume size
    vllmConfig:                         # vLLM-specific configuration
      enableChunkedPrefill: false       # Disable chunked prefill
      enablePrefixCaching: true         # Enable prefix caching
      maxModelLen: 8192                 # Maximum sequence length

Now let's deploy the Mistral model:

In [None]:
!sudo microk8s helm install vllm vllm/vllm-stack -f mistral-config.yaml

Let's check the status of our deployment. This might take a few minutes as the model is downloaded and loaded:

In [None]:
!sudo microk8s kubectl get pods
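
Rather than re-running the cell above by hand, you could check readiness programmatically. This is a minimal sketch that inspects each pod's `Ready` condition in the JSON output of `kubectl get pods`; the helper names and the `default` namespace are assumptions.

```python
import json
import subprocess


def unready_pods(pod_list):
    """Return the names of pods whose `Ready` condition is not "True"."""
    not_ready = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready


def fetch_pods(namespace="default"):
    """Fetch the pod list for a namespace as a dict."""
    out = subprocess.run(
        ["sudo", "microk8s", "kubectl", "get", "pods",
         "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

An empty list from `unready_pods(fetch_pods())` means every pod reports ready; while the model is still downloading, the serving pod's name will appear in the result.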

Let's check the logs to see the progress of the model loading:

In [None]:
# Get the pod name for the vLLM deployment
!POD_NAME=$(sudo microk8s kubectl get pods | grep vllm-mistral-deployment | awk '{print $1}' | head -n 1) && \
sudo microk8s kubectl logs "$POD_NAME" --tail=50

## 4. Testing the Deployment

Once the pod is running, let's test our deployment by forwarding the service port and sending a request:

In [None]:
# This will run in the background
!sudo microk8s kubectl port-forward svc/vllm-router-service 53936:80 > port_forward.log 2>&1 &

In [None]:
# Wait a moment for the port forwarding to establish
import time
time.sleep(5)
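
A fixed sleep can be too short on a slow machine. As an alternative, the sketch below polls the endpoint until it responds or a timeout expires. The function name `wait_for_endpoint` and its `fetch` parameter (an injectable callable, mainly to make the function testable) are hypothetical.

```python
import time
import urllib.request


def wait_for_endpoint(url, timeout=60.0, interval=2.0, fetch=None):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    def default_fetch(target):
        with urllib.request.urlopen(target, timeout=5) as resp:
            return resp.status

    fetch = fetch or default_fetch
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Covers refused connections, timeouts, and urllib's URLError
            # (including HTTP error statuses): the service is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For this notebook, `wait_for_endpoint("http://localhost:53936/v1/models", timeout=120)` returns `True` as soon as the port-forward and router are both up.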

In [None]:
# Test the API by listing available models
!curl -o- http://localhost:53936/v1/models

In [None]:
# Test the completion endpoint
!curl -X POST http://localhost:53936/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "Write a short poem about artificial intelligence.",
    "max_tokens": 150,
    "temperature": 0.7
  }'

## 5. Testing the Chat Endpoint

Let's also test the chat endpoint, which is more appropriate for instruction-tuned models like Mistral-7B-Instruct-v0.3:

In [None]:
!curl -X POST http://localhost:53936/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain how vLLM improves LLM inference performance in 3 bullet points."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
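
The same request can be sent from Python, which is often more convenient when building an application on top of the endpoint. This sketch uses only the standard library and assumes the port-forward from the earlier cell is still running on port 53936. The helper names are hypothetical, and the response parsing follows the OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

BASE_URL = "http://localhost:53936"  # port-forwarded router service


def build_chat_request(user_prompt,
                       model="mistralai/Mistral-7B-Instruct-v0.3",
                       max_tokens=200, temperature=0.7):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant's text out of an OpenAI-style chat response."""
    return response_body["choices"][0]["message"]["content"]


def chat(user_prompt):
    """Send one prompt and return the model's reply as a string."""
    request = build_chat_request(user_prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the deployment running, `print(chat("Explain how vLLM improves LLM inference performance in 3 bullet points."))` should print the generated answer.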

## 6. Next Steps

Now that you have a basic vLLM deployment working with Mistral-7B-Instruct-v0.3, you can explore advanced optimization techniques:

1. **KV Cache Offloading** (02_kv_cache_offloading.ipynb)
   - Offload key-value cache to CPU memory
   - Handle longer sequences
   - Optimize GPU memory usage

2. **Remote Shared KV Cache** (03_remote_shared_kv_cache.ipynb)
   - Share KV cache across multiple instances
   - Improve fault tolerance
   - Enable horizontal scaling

3. **Performance Benchmarking** (04_performance_benchmarking.ipynb)
   - Measure throughput and latency
   - Compare different configurations
   - Optimize for your use case

These techniques will help you get the most out of your GPU resources and optimize your LLM deployment for production use.