# Distributed Inference with Dynamo

This interactive notebook guides you through deploying distributed inference with Dynamo on Kubernetes.

## Prerequisites

Before starting, ensure you have:
- ✅ Kubernetes cluster with GPU support
- ✅ `kubectl` and `helm` 3.x installed
- ✅ HuggingFace token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

---


## Part 1: Single-Node-Sized Models with Aggregated Serving

Deploy multiple replicas of a model with KV cache-based routing for load balancing.

### Configuration

Set your configuration variables:


In [None]:
# Set your configuration
export RELEASE_VERSION=0.5.0
export NAMESPACE=your-namespace-here  # Replace with your namespace
export HF_TOKEN=your_huggingface_token  # Replace with your HuggingFace token
export CACHE_PATH=/data/huggingface-cache  # Replace with your cache path

echo "✓ Configuration set:"
echo "  Release Version: $RELEASE_VERSION"
echo "  Namespace: $NAMESPACE"
echo "  Cache Path: $CACHE_PATH"
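
Later steps fail in confusing ways if a placeholder value is left in place, so it can help to check the variables before continuing. The following is a minimal sketch; `check_config_var` is a hypothetical helper (not part of Dynamo), and the placeholder strings it rejects are the ones used in the cell above.

```shell
# Hypothetical helper: fail if a variable is empty or still holds a tutorial placeholder.
check_config_var() {
    local name="$1" value="$2"
    case "$value" in
        ""|your-namespace-here|your_huggingface_token)
            echo "✗ $name is unset or still a placeholder" >&2
            return 1
            ;;
    esac
    echo "✓ $name looks set"
}

# Example usage with literal values so the sketch runs on its own.
check_config_var NAMESPACE "dynamo-demo"
check_config_var HF_TOKEN "your_huggingface_token" || echo "Fix HF_TOKEN before continuing"
```

In the notebook you would pass the real variables instead, for example `check_config_var NAMESPACE "$NAMESPACE"`.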


### Step 1: Install Dynamo CRDs

**Note:** CRDs are cluster-wide resources and only need to be installed **once per cluster**. The cell below checks for them and skips installation if they are already present.


In [None]:
# Check if CRDs already exist
if kubectl get crd dynamographdeployments.nvidia.com &>/dev/null && \
   kubectl get crd dynamocomponentdeployments.nvidia.com &>/dev/null; then
    echo "✓ CRDs already installed, skipping to Step 2"
else
    echo "Installing Dynamo CRDs..."
    helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
    helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
    
    echo ""
    echo "Verifying CRD installation:"
    kubectl get crd | grep nvidia.com
fi


### Step 2: Install Dynamo Platform

This installs ETCD, NATS, and the Dynamo Operator Controller in your namespace.


In [None]:
# Create namespace
kubectl create namespace ${NAMESPACE} 2>/dev/null || echo "Namespace ${NAMESPACE} already exists"

# Download platform chart
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz

# Install or upgrade in one idempotent step
helm upgrade --install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}

echo ""
echo "Platform installation initiated. Checking status..."
kubectl get pods -n ${NAMESPACE}


### Step 3: Configure and Deploy Model

**⚠️ IMPORTANT:** Before deploying, we need to update the YAML configuration files with your specific values.


In [None]:
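# Update agg_router.yaml with your configuration
# (the path below is relative to the repository root; skip the cd if you are already in this directory)
cd examples/basics/kubernetes/Distributed_Inference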

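# Replace my-tag with the actual version.
# "-i.bak" works with both GNU sed (Linux) and BSD sed (macOS); "-i ''" is BSD-only.
sed -i.bak "s/my-tag/${RELEASE_VERSION}/g" agg_router.yaml

# Replace cache path
sed -i.bak "s|/YOUR/LOCAL/CACHE/FOLDER|${CACHE_PATH}|g" agg_router.yaml
rm -f agg_router.yaml.bak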

echo "✓ Configuration updated in agg_router.yaml"
echo ""
echo "Verify image tags (should show version, not my-tag):"
grep "image:" agg_router.yaml
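
Besides eyeballing the image tags, you can check mechanically that neither placeholder survived the substitution. This is a small sketch: `find_placeholders` is a hypothetical helper, and the demo file contents are illustrative rather than taken from the real manifest.

```shell
# Hypothetical helper: report any tutorial placeholders left in a manifest.
# Prints matching lines with line numbers; exits non-zero when none are found.
find_placeholders() {
    grep -nE 'my-tag|/YOUR/LOCAL/CACHE/FOLDER' "$1"
}

# Self-contained demo on a throwaway file (illustrative contents only).
demo=$(mktemp)
printf 'image: example/runtime:my-tag\npath: /data/huggingface-cache\n' > "$demo"

if find_placeholders "$demo"; then
    echo "✗ Placeholders remain; re-run the sed commands"
else
    echo "✓ No placeholders left"
fi
rm -f "$demo"
```

Against the real file, the call would be `find_placeholders agg_router.yaml`.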


Create HuggingFace secret and deploy:


In [None]:
# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="${HF_TOKEN}" \
    --namespace ${NAMESPACE} 2>/dev/null || echo "Secret not created (it may already exist)"

# Deploy the model
kubectl apply -f agg_router.yaml --namespace ${NAMESPACE}

echo ""
echo "✓ Deployment created. The first run takes about 4-6 minutes:"
echo "  - Pulling container images"
echo "  - Downloading model from HuggingFace"
echo "  - Loading model and running torch.compile"


Monitor deployment progress:


In [None]:
# Check deployment status
kubectl get dynamographdeployment -n ${NAMESPACE}

echo ""
echo "Pod status (wait for all pods to be 1/1 Ready):"
kubectl get pods -n ${NAMESPACE} | grep vllm

# To watch in real-time, uncomment the line below:
# kubectl get pods -n ${NAMESPACE} -w
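
Instead of re-running the status cell by hand, you can filter the pod listing down to the pods that are not fully ready yet. The sketch below assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); `not_ready_pods` is a hypothetical helper and the pod names in the sample are made up.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the name of every pod whose READY column (e.g. 0/1) is not fully ready.
not_ready_pods() {
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Illustrative sample in the column layout NAME READY STATUS RESTARTS AGE.
sample='frontend-abc   1/1   Running             0   5m
worker-def     0/1   ContainerCreating   0   5m'

echo "$sample" | not_ready_pods
```

With a live cluster, `kubectl get pods -n ${NAMESPACE} --no-headers | grep vllm | not_ready_pods` lists the pods still starting up. An empty result means either that every pod is ready or that no pods exist yet, so check the full listing at least once.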


### Step 4: Test the Deployment

Once all pods are `1/1 Ready`, forward the service port. The cell below runs the port-forward in the background; alternatively, run it in a separate terminal without the trailing `&`.


In [None]:
# Forward the service port (run in background with &)
kubectl port-forward deployment/vllm-agg-router-frontend 8000:8000 -n ${NAMESPACE} &

echo "✓ Port forward started on localhost:8000"
echo "  (To stop: use 'pkill -f port-forward' or press Ctrl+C in the terminal running it)"
sleep 5  # Give it time to start
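
A fixed `sleep 5` is a guess: the port-forward may need more or less time. One alternative is to poll until a request succeeds. The following sketch defines a generic `retry` helper (hypothetical, not part of Dynamo or kubectl) and demonstrates it with a command that always succeeds.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "✗ Gave up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "✓ Succeeded on attempt $attempt"
}

# Self-contained demo: `true` succeeds immediately.
retry 3 0 true
```

To wait for the frontend, you could run something like `retry 30 2 curl -sf -o /dev/null localhost:8000/v1/models`. This assumes the frontend exposes an OpenAI-style `/v1/models` endpoint, which this notebook does not confirm; substitute any request you know the service answers.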


#### Test 1: Simple Non-Streaming Request


In [None]:
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "stream": false,
    "max_tokens": 50
  }'


#### Test 2: Streaming Request


In [None]:
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Write a short poem about AI"}],
    "stream": true,
    "max_tokens": 100
  }'
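
With `"stream": true`, OpenAI-compatible servers return server-sent events: each chunk arrives on a line prefixed with `data: `, and the stream ends with `data: [DONE]`. The sketch below strips that framing so only the JSON payloads remain. It assumes the Dynamo frontend follows this format; `sse_payloads` is a hypothetical helper, and the sample is a heavily trimmed illustration, not captured output.

```shell
# Hypothetical helper: keep only the JSON payloads from an SSE stream on stdin.
# Drops carriage returns, blank separator lines, and the final [DONE] marker.
sse_payloads() {
    tr -d '\r' | sed -n 's/^data: //p' | grep -v '^\[DONE\]$'
}

# Illustrative, heavily trimmed sample of a streamed response.
sample='data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: [DONE]'

echo "$sample" | sse_payloads
```

If `jq` is installed, piping the result through `jq -rj '.choices[0].delta.content // empty'` should print just the generated text. Add `-sN` to the `curl` command above so progress output is suppressed and chunks are not buffered.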


---

## Part 2: Deploy with AIConfigurator

AIConfigurator helps find optimal configurations for disaggregated serving by analyzing your model and hardware.

### Step 1: Install AIConfigurator


In [None]:
pip3 install aiconfigurator


### Step 2: Run Configuration Analysis

Example: find the optimal configuration for Llama 3.1-70B on 16 H200 GPUs:


In [None]:
aiconfigurator cli default --model LLAMA3.1_70B --total_gpus 16 --system h200_sxm


### Step 3: Deploy with Recommended Settings

Based on AIConfigurator output, update and deploy `disagg_router.yaml`:


In [None]:
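# Update disagg_router.yaml
# "-i.bak" works with both GNU sed (Linux) and BSD sed (macOS); "-i ''" is BSD-only.
sed -i.bak "s/my-tag/${RELEASE_VERSION}/g" disagg_router.yaml
sed -i.bak "s|/YOUR/LOCAL/CACHE/FOLDER|${CACHE_PATH}|g" disagg_router.yaml
rm -f disagg_router.yaml.bak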

echo "✓ Configuration updated"
grep "image:" disagg_router.yaml

# Deploy
kubectl apply -f disagg_router.yaml --namespace ${NAMESPACE}


---

## Troubleshooting

### Check if pods are stuck in ImagePullBackOff


In [None]:
# Check for image pull errors
PODS=$(kubectl get pods -n ${NAMESPACE} --no-headers | grep vllm)
POD=$(echo "$PODS" | grep -v Running | head -1 | awk '{print $1}')

if [ -z "$PODS" ]; then
    echo "No vllm pods found in namespace ${NAMESPACE}"
elif [ -n "$POD" ]; then
    echo "Checking pod: $POD"
    kubectl describe pod "$POD" -n ${NAMESPACE} | grep -A 5 "Failed"
else
    echo "✓ All vllm pods are in the Running state"
fi


### View logs from a worker pod


In [None]:
# Get logs from first worker pod
WORKER_POD=$(kubectl get pods -n ${NAMESPACE} | grep vllmdecodeworker | head -1 | awk '{print $1}')

if [ -n "$WORKER_POD" ]; then
    echo "Viewing logs from: $WORKER_POD"
    echo "Look for:"
    echo "  - 'Loading model weights...' (downloading)"
    echo "  - 'Model loading took X.XX GiB' (loaded)"
    echo "  - 'torch.compile takes X.X s' (ready)"
    echo ""
    kubectl logs $WORKER_POD -n ${NAMESPACE} --tail=50
else
    echo "No worker pods found yet"
fi
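
The three log markers listed above can also be checked automatically. This sketch reads log text on stdin and reports the furthest startup stage it finds, using the marker strings and stage names from the cell above. `startup_stage` is a hypothetical helper, and the exact log wording may differ between vLLM versions.

```shell
# Hypothetical helper: read worker logs on stdin and print the furthest startup stage seen.
startup_stage() {
    local logs
    logs=$(cat)
    if grep -qF 'torch.compile takes' <<< "$logs"; then
        echo "ready"
    elif grep -qF 'Model loading took' <<< "$logs"; then
        echo "loaded"
    elif grep -qF 'Loading model weights' <<< "$logs"; then
        echo "downloading"
    else
        echo "starting"
    fi
}

# Illustrative log excerpt (X.XX is a placeholder, not a real measurement).
printf 'Loading model weights...\nModel loading took X.XX GiB\n' | startup_stage
```

On a live pod, drop `--tail=50` so early markers are not cut off: `kubectl logs $WORKER_POD -n ${NAMESPACE} | startup_stage`.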


---

## Cleanup

To remove the deployment when done:


In [None]:
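# Delete deployment
kubectl delete dynamographdeployment vllm-agg-router -n ${NAMESPACE}
# If you also deployed disagg_router.yaml in Part 2, remove it the same way:
# kubectl delete -f disagg_router.yaml -n ${NAMESPACE}
kubectl delete secret hf-token-secret -n ${NAMESPACE}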

# (Optional) Uninstall platform
# helm uninstall dynamo-platform -n ${NAMESPACE}

# (Optional) Delete namespace
# kubectl delete namespace ${NAMESPACE}

echo "✓ Cleanup complete"


---

## Additional Resources

- 📖 [Dynamo Documentation](https://docs.dynamo.nvidia.com)
- 🔧 [AIPerf Benchmarking Tool](https://github.com/ai-dynamo/aiperf)
- 📦 [NGC Container Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
- 🎯 [vLLM Backend Guide](../../../components/backends/vllm/deploy/README.md)

---

**Congratulations! 🎉** You've successfully deployed Dynamo distributed inference on Kubernetes!
