# Lab 3.2: Wide EP Production Deployment

## Overview

In this lab, you will apply the Expert Parallelism concepts from Lab 3.1 to production deployments.

**Prerequisites**: Complete Lab 3.1 (Expert Parallelism Foundations)

**Duration**: 60-75 minutes

### What You'll Learn
- Deploy Dynamo with Kubernetes and the Dynamo Operator
- Configure multi-node SGLang deployments with Wide EP and EPLB
- Set up TensorRT-LLM for optimized MoE inference
- Monitor, troubleshoot, and tune production deployments
- Benchmark and optimize performance

---

## Table of Contents

**Getting Started**
- [Quick Recap: Lab 3.1 Concepts](#Quick-Recap:-Lab-3.1-Concepts)

**Section 1: Concepts**
- [Wide EP Deep Dive & MoE Deployment](#Section-1:-Wide-EP-Deep-Dive-&-MoE-Deployment)

**Section 2: Infrastructure**
- [Kubernetes Deployment with Dynamo Operator](#Section-2:-Kubernetes-Deployment-with-Dynamo-Operator)

**Section 3: SGLang Backend**
- [Deploying MoE Models with SGLang](#Section-3:-Deploying-MoE-Models-with-SGLang-and-Expert-Parallelism)
  - [Configuration Files Overview](#Configuration-Files-Overview)
  - [Multi-Node WideEP Deployment](#Example-2:-Multi-Node-WideEP-Deployment-with-DeepSeek-R1)
  - [Monitoring Expert Parallelism and EPLB](#Monitoring-Expert-Parallelism-and-EPLB)

**Section 4: TensorRT-LLM Backend**
- [TensorRT-LLM Wide EP Implementation](#Section-4:-TensorRT-LLM-Wide-EP-Implementation)
  - [Architecture Comparison](#Architecture:-TensorRT-LLM-vs-SGLang)
  - [Configuration Comparison](#Configuration-Comparison)
  - [Hands-On Deployment](#Hands-On:-Deploying-DeepSeek-R1-with-TensorRT-LLM-Wide-EP)
  - [Performance Comparison](#Performance-Comparison:-TensorRT-LLM-vs-SGLang)
  - [When to Use Each Backend](#When-to-Use-Each)

**Section 5: Performance**
- [Performance Benchmarking](#Section-5:-Performance-Benchmarking-for-EP-Deployments)

**Wrap-Up**
- [Summary](#Summary)

---

## Quick Recap: Lab 3.1 Concepts

In Lab 3.1, you learned the foundations of Expert Parallelism:

**Key Concepts**:
- **MoE Models**: Activate only a subset of experts per token (e.g., DeepSeek-R1 routes each token to 8 of its 256 routed experts)
- **Expert Parallelism (EP)**: Distribute experts across GPUs to scale capacity
- **Wide EP**: Horizontal scaling with experts distributed across many nodes
- **EPLB**: Dynamic load balancing to prevent expert hotspots
- **NVL72**: Rack-scale NVLink domain (72 GPUs in GB200 NVL72) whose high-bandwidth interconnect enables up to 1.8× throughput gains

**EP Variants**:
- **Standard EP**: Static expert placement
- **Wide EP**: Distributed across clusters for throughput
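- **Deep EP**: Hierarchical experts for specialization (distinct from DeepEP, the all-to-all communication library used as the SGLang MoE backend later in this lab)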
- **Dynamic EP (EPLB)**: Runtime load balancing

Now let's deploy these concepts in production!

---

## Section 1: Wide EP Deep Dive & MoE Deployment

### What is Wide EP (Wide Expert Parallelism)?

Wide EP enables **horizontal scaling** of LLM inference by deploying multiple complete model replicas across many nodes. Unlike tensor or pipeline parallelism, which split a single model across GPUs, Wide EP creates independent replicas that process requests in parallel.

### Important: "Wide EP" is MoE-Specific Terminology

⚠️ **Clarification**: The term "**Wide EP**" specifically refers to **Expert Parallelism** for **MoE models**:
- **EP** = **Expert Parallelism** (distributing experts across GPUs)
- **Wide EP** = Multiple replicas, each using EP internally

For **non-MoE models** (like Llama, GPT), there are **no experts**, so:
- ✅ Use: "**Multi-Replica Deployment**" or "**Wide DP**" (Wide Data Parallelism)
- ❌ Don't use: "Wide EP" (there are no experts to parallelize!)

### Multi-Replica Deployment Patterns

#### 1. **Multi-Replica Deployment** (Non-MoE Models)
For standard dense models without experts:

```
Load Balancer → Replica 1 (1-8 GPUs)  → Complete Llama-2-7B model
              → Replica 2 (1-8 GPUs)  → Complete Llama-2-7B model
              → Replica N (1-8 GPUs)  → Complete Llama-2-7B model
```

**Characteristics**:
- No experts (dense model)
- May use **TP** (Tensor Parallelism) if model is large
- May use **PP** (Pipeline Parallelism) for very deep models
- Horizontal scaling via replication
- **Load Balancer** distributes requests across replicas (infrastructure level)
- Examples: Llama-2-7B, Mistral-7B, GPT-3

#### 2. **Wide EP Deployment** (MoE Models) ⭐
For Mixture-of-Experts models with expert parallelism:

```
Load Balancer → Replica 1 (8 GPUs)  → DeepSeek-R1 with 256 experts via EP
              → Replica 2 (8 GPUs)  → DeepSeek-R1 with 256 experts via EP
              → Replica N (8 GPUs)  → DeepSeek-R1 with 256 experts via EP
                                      ↑
                                      Each replica has a Router (model level)
                                      that selects top-K experts per token
```

**Characteristics**:
- Has experts (MoE architecture)
- Uses **EP** (Expert Parallelism) to distribute experts across GPUs
- Uses **EPLB** for load balancing experts
- Combines horizontal (replicas) + vertical (EP) scaling
- **Load Balancer** distributes requests (infrastructure)
- **Router** selects experts within each replica (model architecture)
- Examples: DeepSeek-R1, Mixtral-8x7B, DeepSeek-V3
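
The two levels of dispatch listed above (a load balancer choosing a replica, then a router choosing experts) can be sketched in a few lines. This is an illustrative toy, not how Dynamo or SGLang implement either step: the replica names, the contiguous expert-to-GPU placement, and both function names are assumptions made for the example.

```python
import heapq
from itertools import cycle

def pick_top_k_experts(router_scores, k):
    """Model-level routing: return the indices of the k highest-scoring experts."""
    return heapq.nlargest(k, range(len(router_scores)), key=router_scores.__getitem__)

def gpu_for_expert(expert_id, num_experts, num_gpus):
    """Assumed contiguous placement: experts are split evenly across GPUs."""
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu

# Infrastructure-level load balancing: round-robin over hypothetical replicas.
replicas = cycle(["replica-1", "replica-2", "replica-3"])
assigned = [next(replicas) for _ in range(4)]
print(assigned)  # ['replica-1', 'replica-2', 'replica-3', 'replica-1']

# Inside one replica: 8 experts on 4 GPUs, top-2 routing for a single token.
scores = [0.05, 0.30, 0.02, 0.10, 0.25, 0.08, 0.15, 0.05]
chosen = pick_top_k_experts(scores, k=2)
print(chosen)  # [1, 4]
print([gpu_for_expert(e, num_experts=8, num_gpus=4) for e in chosen])  # [0, 2]
```

The load balancer never sees experts and the router never sees replicas, which is why the two can be tuned independently.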

### Key Advantages for Production

1. **Horizontal Throughput Scaling**: Add more replicas = more throughput
2. **Fault Tolerance**: Loss of one replica doesn't affect others
3. **Load Balancing**: Load balancer distributes requests across healthy replicas
4. **Independent Operation**: No inter-replica synchronization needed
5. **Geographic Distribution**: Can place replicas in different datacenter zones

Let's deploy Wide EP with Dynamo and SGLang!


## Section 2: Kubernetes Deployment with Dynamo Operator

### Using Pre-Created Kubernetes Manifests

For production deployments, we use the **Dynamo Kubernetes Operator** with `DynamoGraphDeployment` CRDs. 

All necessary Kubernetes manifests are **pre-created** in the `k8s/` directory:

**Available Manifests**:
- `k8s/deepseek-r1-wideep.yaml` - SGLang backend with Wide EP
- `k8s/deepseek-r1-trtllm.yaml` - TensorRT-LLM backend with Wide EP  
- `k8s/README.md` - Detailed deployment guide

These manifests use the Dynamo Operator to manage:
- Multi-node coordination
- Service discovery via etcd
- Automatic health checks
- Resource allocation
- Pod scheduling with anti-affinity

### Quick Deployment Steps

**Step 1: Install Dynamo Platform** (if not already installed):
```bash
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.3.2  # Use latest version
helm install dynamo-platform nvidia/dynamo-platform \
  --version ${RELEASE_VERSION} \
  --namespace ${NAMESPACE} --create-namespace
```

**Step 2: Create Secrets**:
```bash
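# Workloads in this lab run in the `dynamo` namespace; create it if it does not exist
kubectl create namespace dynamo --dry-run=client -o yaml | kubectl apply -f -

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='your_hf_token_here' -n dynamo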
```

**Step 3: Deploy DeepSeek-R1 with SGLang**:
```bash
# Deploy using the pre-created manifest
kubectl apply -f k8s/deepseek-r1-wideep.yaml -n dynamo

# Monitor deployment
kubectl get dynamographdeployment -n dynamo
kubectl get pods -n dynamo -w
```
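
Watching `kubectl get pods -w` works in a terminal; from a notebook cell it can be more convenient to check readiness programmatically. The sketch below assumes `kubectl` is on the PATH and that workloads run in the `dynamo` namespace used above. The helper names and the sample pod names are hypothetical; the code relies only on the standard Pod `status.conditions` fields.

```python
import json
import subprocess

def unready_pods(pod_list):
    """Return names of pods whose Ready condition is not "True"."""
    not_ready = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                    for c in conditions)
        if not ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready

def fetch_pods(namespace="dynamo"):
    """Call kubectl and parse its JSON output (requires cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

# Offline example with a hand-written pod list (pod names are made up).
sample = {"items": [
    {"metadata": {"name": "frontend-0"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "prefill-0"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]}
print(unready_pods(sample))  # ['prefill-0']
# Against a live cluster: print(unready_pods(fetch_pods("dynamo")))
```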

**Step 4: Test the Deployment**:
```bash
# Port forward to access the frontend
kubectl port-forward svc/deepseek-r1-wideep-frontend 8000:8000 -n dynamo

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Explain MoE models"}],
    "max_tokens": 100
  }'
```

For detailed configuration options and customization, see `k8s/README.md`.

---


## Section 3: Deploying MoE Models with SGLang and Expert Parallelism

Now that you understand how to deploy with Kubernetes, let's dive deeper into hands-on deployment of MoE models with Expert Parallelism using Dynamo's **SGLang backend**.

**In this section, you'll learn:**
- How to configure SGLang for Expert Parallelism
- Single-node vs multi-node deployment strategies
- EPLB configuration and tuning
- Monitoring and troubleshooting EP deployments

### Prerequisites for MoE Deployment

**What you need**:
- Multiple GPUs (minimum 4 GPUs for this example)
- NATS and etcd running (infrastructure from Lab 2)
- Model that fits with EP distribution
- High-bandwidth interconnect (InfiniBand or NVLink preferred)

**Check GPU availability**:
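
A minimal way to do this from a notebook cell is to query `nvidia-smi`. The sketch below assumes `nvidia-smi` is installed on the node; the function names are illustrative, and the 4-GPU threshold comes from the prerequisite listed above.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    """Parse nvidia-smi CSV rows into (index, name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append((int(index), name, int(memory)))
    return gpus

def list_gpus():
    """Return visible GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    try:
        return parse_gpu_csv(result.stdout)
    except ValueError:
        return []  # Unexpected output format

gpus = list_gpus()
print(f"Found {len(gpus)} GPU(s)")
for index, name, memory_mib in gpus:
    print(f"  GPU {index}: {name} ({memory_mib} MiB)")
if len(gpus) < 4:
    print("This example expects at least 4 GPUs.")
```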


### Configuration Files Overview

All configuration files for this lab are pre-created in the `configs/trtllm/` directory:

**TensorRT-LLM Configurations**:
- `eplb.yaml` - EPLB settings (algorithm, redundant experts, rebalancing)
- `wide_ep_prefill.yaml` - Prefill worker configuration
- `wide_ep_decode.yaml` - Decode worker configuration  
- `wide_ep_agg.yaml` - Aggregated mode (optional)

These files are ready to use and include detailed comments explaining each parameter.

**SGLang Deployments**: 
Command examples are provided directly in the cells below for easy copy-paste execution.

### Example 1: Single-Node MoE with DP Attention (4 GPUs)

This example deploys a MoE model with Expert Parallelism and DP (Data Parallel) attention on a single node with 4 GPUs.

**Configuration**:
- Model: DeepSeek-R1-Distill-Llama-8B (a small, Llama-based dense model used here to practice the launch flags; it has no experts, so substitute a small MoE checkpoint to exercise EP itself)
- Topology: Disaggregated prefill/decode
- Parallelism: TP=1, DP=4, DP attention enabled
- Mode: Prefill worker

**Why this configuration**:
- DP attention allows parallel processing of attention across multiple requests
- With an MoE checkpoint, expert parallelism distributes experts across the 4 GPUs
- Good for learning concepts before scaling to multi-node


In [None]:
# List available configuration files
from pathlib import Path

print("=" * 60)
print("Lab 3 Configuration Files")
print("=" * 60)

configs_dir = Path("configs")

# TensorRT-LLM configs
print("\nTensorRT-LLM Configurations (configs/trtllm/):")
print("-" * 60)
trtllm_dir = configs_dir / "trtllm"
if trtllm_dir.exists():
    for file in sorted(trtllm_dir.glob("*.yaml")):
        size = file.stat().st_size
        print(f"  ✓ {file.name:<30} ({size:>6,} bytes)")
else:
    print("  ⚠️  Directory not found")

# Kubernetes manifests
print("\nKubernetes Manifests (k8s/):")
print("-" * 60)
k8s_dir = Path("k8s")
if k8s_dir.exists():
    for file in sorted(k8s_dir.glob("*.yaml")):
        size = file.stat().st_size
        print(f"  ✓ {file.name:<30} ({size:>6,} bytes)")
    if (k8s_dir / "README.md").exists():
        print("  ✓ README.md (deployment guide)")
else:
    print("  ⚠️  Directory not found")

print("=" * 60)


### Example 2: Multi-Node WideEP Deployment with DeepSeek-R1

This is a production-scale deployment of DeepSeek-R1 with WideEP across multiple nodes.

**Architecture**:
```
┌──────────────────────────────────────────────────────────────┐
│                        Load Balancer                         │
└──────────────────────┬───────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        │                             │
┌───────▼─────────────┐    ┌──────────▼────────────┐
│  Prefill Cluster    │    │  Decode Cluster       │
│  (4 nodes)          │    │  (4 nodes)            │
│                     │    │                       │
│  32 GPUs total      │    │  32 GPUs total        │
│  TP=32, DP=32       │───▶│  TP=32, DP=32         │
│  + DP Attention     │    │  + DP Attention       │
│  + EPLB             │    │  + EPLB               │
│  + DeepEP backend   │    │  + DeepEP backend     │
└─────────────────────┘    └───────────────────────┘
```

**Configuration Details**:

**Prefill Workers** (4 nodes × 8 GPUs = 32 GPUs):
- Model: DeepSeek-R1 (671B parameters, 256 experts)
- TP Size: 32 (tensor parallelism across 32 GPUs)
- DP Size: 32 (data parallelism)
- DP Attention: Enabled for efficient attention computation
- Expert Parallelism: Experts distributed across GPUs
- EPLB: 32 redundant experts with dynamic load balancing
- MoE Backend: DeepEP (high-performance all-to-all)
- KV Transfer: NIXL (RDMA-based transfer to decode)

**Decode Workers** (4 nodes × 8 GPUs = 32 GPUs):
- Same model and parallelism configuration
- Optimized for decode phase with CUDA graphs
- Receives KV cache from prefill via NIXL

**Key Parameters**:
```bash
# Common parameters
--model-path deepseek-ai/DeepSeek-R1
--tp-size 32
--dp-size 32
--enable-dp-attention
--trust-remote-code

# Expert parallelism parameters
--ep-num-redundant-experts 32       # Create 32 additional expert copies
--eplb-algorithm deepseek            # Use DeepSeek's EPLB algorithm
--moe-a2a-backend deepep             # Use DeepEP for expert communication
--moe-dense-tp-size 1                # TP size for dense layers
--enable-dp-lm-head                  # Enable DP for LM head

# Memory and performance
--mem-fraction-static 0.85           # Fraction of GPU memory for weights + KV cache pool
--page-size 1                        # KV cache page size
--disable-radix-cache                # Disable radix cache for disagg
--enable-two-batch-overlap           # Overlap computation
--watchdog-timeout 1000000           # Long timeout for large model
```
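
Because prefill and decode workers share most of these flags, it can help to assemble the command line from one dictionary instead of maintaining two long strings. The sketch below is one possible way to do that, using only flags that appear in this lab; `COMMON_FLAGS` and `build_command` are hypothetical names, not part of Dynamo or SGLang.

```python
import shlex

COMMON_FLAGS = {
    "--model-path": "deepseek-ai/DeepSeek-R1",
    "--tp-size": 32,
    "--dp-size": 32,
    "--enable-dp-attention": True,   # Boolean switch, emitted without a value
    "--trust-remote-code": True,
    "--moe-a2a-backend": "deepep",
    "--ep-num-redundant-experts": 32,
    "--eplb-algorithm": "deepseek",
}

def build_command(mode, extra_flags=None):
    """Assemble a worker command line for the given disaggregation mode."""
    flags = {**COMMON_FLAGS, "--disaggregation-mode": mode, **(extra_flags or {})}
    argv = ["python", "-m", "dynamo.sglang"]
    for name, value in flags.items():
        if value is True:
            argv.append(name)
        elif value is not False and value is not None:
            argv.extend([name, str(value)])
    return shlex.join(argv)

print(build_command("prefill"))
print(build_command("decode", {"--mem-fraction-static": 0.85}))
```

Keeping the shared flags in one place makes it harder for the prefill and decode parallelism settings to drift apart.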


In [None]:
# Multi-Node DeepSeek-R1 WideEP Deployment with SGLang
# This example shows the full deployment command for reference
# In production, use the Kubernetes manifests in k8s/

print("""
=================================================================
Multi-Node DeepSeek-R1 Deployment Example
=================================================================

This deployment uses:
- 4 prefill nodes × 8 GPUs = 32 GPUs
- 4 decode nodes × 8 GPUs = 32 GPUs  
- TP=32, DP=32, DP Attention
- DeepEP backend with EPLB
- NIXL for KV transfer

For production, use: kubectl apply -f k8s/deepseek-r1-wideep.yaml

Command reference (for understanding the configuration):
""")

# Prefill worker command (runs on 4 nodes)
prefill_cmd = r"""python -m dynamo.sglang \
  --model-path /path/to/DeepSeek-R1 \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --skip-tokenizer-init \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend nixl \
  --host 0.0.0.0 \
  --disaggregation-bootstrap-port 30001 \
  --tp-size 32 \
  --dp-size 32 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --ep-num-redundant-experts 32 \
  --eplb-algorithm deepseek \
  --trust-remote-code"""

print("\nPrefill Worker Command:")
print(prefill_cmd)

print("\n✅ See k8s/deepseek-r1-wideep.yaml for full Kubernetes deployment")


### Monitoring Expert Parallelism and EPLB

When running MoE models with EP and EPLB, monitoring is crucial to ensure optimal performance.

#### Key Metrics to Monitor

**1. Expert Usage Distribution**
```python
# SGLang can record expert usage statistics
# (see --expert-distribution-recorder-mode).
# Illustrative output, showing the fraction of tokens routed to each expert:
# "Expert usage: [0.05, 0.12, 0.03, 0.15, ...]"
```
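
Once you have per-expert usage fractions, a couple of summary numbers make imbalance easier to track over time than a raw list. The sketch below is a generic calculation, not an SGLang API; the function name and the sample distributions are made up for illustration.

```python
import statistics

def imbalance_report(usage):
    """Summarize how far expert usage deviates from a uniform split."""
    mean = statistics.fmean(usage)
    return {
        "max_over_mean": max(usage) / mean,   # 1.0 means perfectly balanced
        "coeff_of_variation": statistics.pstdev(usage) / mean,
        "hottest_expert": usage.index(max(usage)),
    }

balanced = [0.25, 0.25, 0.25, 0.25]
skewed = [0.55, 0.15, 0.20, 0.10]
print(imbalance_report(balanced))
print(imbalance_report(skewed))
```

A `max_over_mean` well above 1.0 means one expert receives far more than its share of tokens, which is the hotspot pattern EPLB is meant to smooth out.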

**2. GPU Utilization per Expert**
```bash
# Use nvidia-smi to check GPU utilization
watch -n 1 nvidia-smi

# For detailed metrics, use DCGM (-d is the sampling interval in milliseconds):
dcgmi dmon -e 203,204,1004,1008 -d 1000
# 203  = GPU utilization
# 204  = Memory copy utilization
# 1004 = Tensor core pipe activity
# 1008 = FP16 pipe activity
```

**3. EPLB Rebalancing Events**
```python
# Enable verbose logging to see EPLB rebalancing
# Set environment variable: DYN_LOG=debug

# Look for rebalancing messages. The lines below illustrate the kind of
# information to look for, not the exact log format:
# "EPLB: Rebalancing experts after 100 iterations"
# "EPLB: Expert 5 replicated to GPU 2 (high usage: 0.25)"
# "EPLB: Expert 17 removed from GPU 3 (low usage: 0.01)"
```

**4. Network Bandwidth (for Multi-Node)**
```bash
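# Check InfiniBand port state and link rate
ibstat

# Monitor IP traffic on the IB interface
# (RDMA traffic bypasses the kernel network stack and will not appear here)
iftop -i ib0  # Replace ib0 with your IB interface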
```

#### Troubleshooting Common Issues

**Issue 1: Uneven GPU Utilization**
```
Symptoms:
- Some GPUs at 100%, others at <50%
- Throughput lower than expected
- Long token generation times

Solution:
- Enable EPLB: --enable-eplb
- Increase redundant experts: --ep-num-redundant-experts 32
- Adjust rebalancing frequency: --eplb-rebalance-num-iterations 50
```

**Issue 2: High Memory Usage**
```
Symptoms:
- OOM errors
- Cannot create redundant experts

Solution:
- Reduce memory fraction: --mem-fraction-static 0.80 (from 0.85)
- Reduce redundant experts: --ep-num-redundant-experts 16
- Disable features: --disable-radix-cache
```

**Issue 3: Slow Expert All-to-All Communication**
```
Symptoms:
- High latency during expert routing
- Low GPU utilization despite balanced load

Solution:
- Use DeepEP backend: --moe-a2a-backend deepep
- Enable two-batch overlap: --enable-two-batch-overlap
- Check network: Ensure InfiniBand is active and configured
```

**Issue 4: EPLB Not Rebalancing**
```
Symptoms:
- No rebalancing logs
- Expert usage remains imbalanced over time

Solution:
- Enable explicit EPLB: --enable-eplb
- Use appropriate recorder mode: --expert-distribution-recorder-mode stat
- Shorten the rebalance interval: --eplb-rebalance-num-iterations 50
```

#### Performance Tuning Tips

**1. Optimize Memory Allocation**
```bash
# Start with conservative memory fraction
--mem-fraction-static 0.80

# Gradually increase if no OOM
--mem-fraction-static 0.85

# Monitor with nvidia-smi
```

**2. Tune Redundant Expert Count**
```bash
# Rule of thumb: redundant_experts between num_GPUs / 2 and num_GPUs
# For 32 GPUs: try 16-32 redundant experts

# Start low
--ep-num-redundant-experts 16

# Increase if imbalance persists
--ep-num-redundant-experts 32
```
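
Redundant experts cost memory, so it is worth working out what a given setting implies before launching. The sketch below does that arithmetic for the 256-expert, 32-GPU setup used in this lab; the function name is hypothetical, and it assumes every expert copy has the same size.

```python
def redundancy_overhead(num_experts, num_redundant, num_gpus):
    """Return slot counts and the relative extra expert-weight memory."""
    total_slots = num_experts + num_redundant
    return {
        "total_slots": total_slots,
        "avg_slots_per_gpu": total_slots / num_gpus,
        "extra_expert_memory": num_redundant / num_experts,
    }

for redundant in (16, 32):
    info = redundancy_overhead(num_experts=256, num_redundant=redundant, num_gpus=32)
    print(f"{redundant} redundant: {info['total_slots']} slots, "
          f"{info['avg_slots_per_gpu']:.1f} per GPU, "
          f"{info['extra_expert_memory']:.1%} extra expert memory")
```

A non-integer average means the slots cannot be spread evenly across GPUs; whether that is accepted depends on the backend, so check its requirements before choosing a value.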

**3. DeepEP Mode Selection**
```bash
# For prefill (focus on throughput)
--deepep-mode normal

# For decode (focus on latency)
--deepep-mode low_latency
```

**4. Batch Size Tuning**
```bash
# For decode, tune CUDA graph batch size
# Larger = better throughput, more memory
--cuda-graph-bs 128

# If OOM, reduce
--cuda-graph-bs 64
```


## Section 4: TensorRT-LLM Wide EP Implementation

You've learned how to deploy Wide EP with SGLang. Now let's explore **TensorRT-LLM**, NVIDIA's highly optimized inference engine.

NVIDIA's [TensorRT-LLM also supports Wide EP](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) with its own optimizations, offering maximum performance for production deployments.

**In this section, you'll learn:**
- How TensorRT-LLM differs from SGLang
- YAML-based configuration approach
- FP8 quantization and CUDA graph optimizations
- When to choose TensorRT-LLM vs SGLang

### TensorRT-LLM Wide EP Overview

TensorRT-LLM (TRT-LLM) provides a highly optimized inference engine with native Wide EP support for MoE models. It's particularly well-suited for production deployments requiring maximum performance.

**Key Features**:
- Built-in WIDEEP backend for expert parallelism
- Integrated EPLB (Expert Parallelism Load Balancer)
- Highly optimized CUDA kernels
- FP8 quantization support for MoE layers
- CUDA graphs for decode optimization

### Architecture: TensorRT-LLM vs SGLang

**TensorRT-LLM Approach**:
```
┌──────────────────────────────────────────────────────┐
│          TensorRT-LLM Wide EP Architecture           │
└──────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                                   │
   ┌────▼─────┐                       ┌────▼─────┐
   │ Prefill  │                       │  Decode  │
   │ Workers  │ ─────KV Transfer────▶ │ Workers  │
   │          │                       │          │
   │ - TP=16  │                       │ - TP=16  │
   │ - EP=16  │                       │ - EP=16  │
   │ - WIDEEP │                       │ - WIDEEP │
   │ - EPLB   │                       │ - EPLB   │
   │ - FP8 KV │                       │ - FP8 KV │
   └──────────┘                       └──────────┘
```

**SGLang Approach**:
```
┌──────────────────────────────────────────────────────┐
│            SGLang Wide EP Architecture               │
└──────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                                   │
   ┌────▼─────┐                       ┌────▼─────┐
   │ Prefill  │                       │  Decode  │
   │ Workers  │ ──────NIXL/────────▶  │ Workers  │
   │          │    Mooncake           │          │
   │ - TP=32  │                       │ - TP=32  │
   │ - DP=32  │                       │ - DP=32  │
   │ - DeepEP │                       │ - DeepEP │
   │ - EPLB   │                       │ - EPLB   │
   └──────────┘                       └──────────┘
```

### Configuration Comparison

#### TensorRT-LLM Configuration (YAML-based)

**Prefill Worker** (`wide_ep_prefill.yaml`):
```yaml
backend: pytorch

# WideEP settings
moe_config:
  backend: WIDEEP
  load_balancer: /path/to/eplb.yaml

# Parallelism
tensor_parallel_size: 16
moe_expert_parallel_size: 16
pipeline_parallel_size: 1
enable_attention_dp: true

# Batch and sequence settings
max_batch_size: 256
max_num_tokens: 256
max_seq_len: 8448

# KV cache with FP8
kv_cache_config:
  free_gpu_memory_fraction: 0.30
  dtype: fp8

# CUDA graphs
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
```

**Decode Worker** (`wide_ep_decode.yaml`):
```yaml
backend: pytorch

moe_config:
  backend: WIDEEP
  load_balancer: /path/to/eplb.yaml

tensor_parallel_size: 16
moe_expert_parallel_size: 16
enable_attention_dp: true

max_batch_size: 256
max_seq_len: 8448

kv_cache_config:
  free_gpu_memory_fraction: 0.30
  dtype: fp8

cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
```

#### SGLang Configuration (Command-line based)

**Prefill Worker**:
```bash
python -m dynamo.sglang \
  --model-path deepseek-ai/DeepSeek-R1 \
  --tp-size 32 \
  --dp-size 32 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --ep-num-redundant-experts 32 \
  --eplb-algorithm deepseek \
  --disaggregation-mode prefill
```

### Key Differences

| Aspect | TensorRT-LLM | SGLang |
|--------|--------------|--------|
| **Configuration** | YAML files | Command-line arguments |
| **MoE Backend** | WIDEEP (built-in) | DeepEP (external library) |
| **KV Cache** | FP8 quantization native | FP16/BF16 default |
| **CUDA Graphs** | Explicit batch size list | Auto-generated |
| **Optimization** | Highly tuned kernels | Flexible, easier to customize |
| **Deployment** | More structured | More flexible |
| **Memory Management** | `free_gpu_memory_fraction` | `mem_fraction_static` |

### EPLB Configuration (TensorRT-LLM)

TensorRT-LLM uses a separate YAML file for EPLB configuration:

**`eplb.yaml`**:
```yaml
# Expert Parallelism Load Balancer
algorithm: deepseek  # or hierarchical, global
redundant_experts: 32
rebalance_frequency: 100
usage_recorder_mode: stat
buffer_size: 10
```

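Keeping EPLB settings in a separate file lets you tune load balancing without editing the worker configs. Workers read the file at startup, so restart them to pick up changes.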

### Performance Considerations

**TensorRT-LLM Advantages**:
- ✅ FP8 KV cache reduces KV cache memory by ~50% relative to FP16/BF16
- ✅ Highly optimized CUDA kernels
- ✅ Better CUDA graph integration
- ✅ Lower latency for decode phase
- ✅ Built-in profiling and optimization tools

**SGLang Advantages**:
- ✅ More flexible configuration
- ✅ Easier to experiment and customize
- ✅ Better support for diverse models
- ✅ Active development and community
- ✅ Simpler deployment workflow

### When to Use Each

**Use TensorRT-LLM Wide EP when**:
- Maximum performance is critical
- You have NVIDIA GPUs (H100, H200, GB200)
- Production deployment with stable configuration
- Need FP8 quantization
- Willing to invest in optimization

**Use SGLang Wide EP when**:
- Rapid experimentation needed
- Diverse model support required
- Flexible deployment patterns
- Active development workflow
- Community support important


### Hands-On: Deploying DeepSeek-R1 with TensorRT-LLM Wide EP

Here's a complete example of deploying DeepSeek-R1 using TensorRT-LLM's Wide EP backend with Dynamo.

**Prerequisites**:
1. TensorRT-LLM installed (version with Wide EP support)
2. DeepSeek-R1 model converted to TensorRT-LLM engine format
3. Multi-node cluster with H100/H200 GPUs
4. Dynamo with TensorRT-LLM backend support

**Configuration Files Provided**

All necessary configuration files are pre-created in `configs/trtllm/`:

- **`eplb.yaml`** - EPLB configuration (algorithm, redundant experts, rebalancing)
- **`wide_ep_prefill.yaml`** - Prefill worker config (TP=16, EP=16, FP8 KV cache)
- **`wide_ep_decode.yaml`** - Decode worker config (CUDA graphs, FP8 KV cache)
- **`wide_ep_agg.yaml`** - Aggregated mode config (optional, for non-disaggregated deployment)

You can view and customize these files in the `configs/trtllm/` directory.

**Key Configuration Highlights**:

From `eplb.yaml`:
- Algorithm: `deepseek` (or hierarchical, global)
- Redundant experts: `32` (adjust based on GPU count)
- Rebalancing frequency: `100` iterations

From `wide_ep_prefill.yaml` and `wide_ep_decode.yaml`:
- Tensor Parallelism: `16` GPUs
- Expert Parallelism: `16` GPUs
- FP8 KV cache for 50% memory savings
- DP attention enabled
- Max batch size: `256`

**Step 1: Start Infrastructure**
```bash
# Start NATS and etcd
docker compose -f deploy/docker-compose.yml up -d

# Start Dynamo frontend
python -m dynamo.frontend --http-port=8000 &
```

**Step 2: Launch Prefill Workers** (2 nodes × 8 GPUs = 16 GPUs)
```bash
# On Prefill Node 0:
python -m dynamo.trtllm \
  --engine-dir /path/to/deepseek-r1-engine \
  --config-path ./configs/trtllm/wide_ep_prefill.yaml \
  --disaggregation-mode prefill \
  --dist-init-addr ${PREFILL_HEAD_IP}:29500 \
  --nnodes 2 \
  --node-rank 0

# On Prefill Node 1:
# Same command with --node-rank 1
```

**Step 3: Launch Decode Workers** (2 nodes × 8 GPUs = 16 GPUs)
```bash
# On Decode Node 0:
python -m dynamo.trtllm \
  --engine-dir /path/to/deepseek-r1-engine \
  --config-path ./configs/trtllm/wide_ep_decode.yaml \
  --disaggregation-mode decode \
  --dist-init-addr ${DECODE_HEAD_IP}:29500 \
  --nnodes 2 \
  --node-rank 0

# On Decode Node 1:
# Same command with --node-rank 1
```

**Step 4: Test the Deployment**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement"}
    ],
    "max_tokens": 200
  }'
```

### TensorRT-LLM Specific Optimizations

**1. FP8 Quantization for KV Cache**

TensorRT-LLM's FP8 KV cache provides ~50% memory savings:

```yaml
kv_cache_config:
  dtype: fp8  # Instead of fp16/bf16
  free_gpu_memory_fraction: 0.30
```

Memory comparison for DeepSeek-R1:
- FP16 KV cache: ~280GB across 16 GPUs
- FP8 KV cache: ~140GB across 16 GPUs
- **Savings**: 50% memory, enabling higher batch sizes
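
The 50% figure follows directly from element size: FP8 stores one byte per value where FP16/BF16 store two. The sketch below reproduces the numbers above under that assumption alone; it ignores any small overhead from quantization scale factors, and the function name is illustrative.

```python
BYTES_PER_ELEMENT = {"fp16": 2, "bf16": 2, "fp8": 1}

def scaled_kv_cache_gb(baseline_gb, baseline_dtype, target_dtype):
    """Scale a KV cache footprint by the ratio of element sizes."""
    ratio = BYTES_PER_ELEMENT[target_dtype] / BYTES_PER_ELEMENT[baseline_dtype]
    return baseline_gb * ratio

fp16_total_gb = 280  # Figure quoted above for 16 GPUs
fp8_total_gb = scaled_kv_cache_gb(fp16_total_gb, "fp16", "fp8")
print(f"FP8 total: {fp8_total_gb:.0f} GB")  # FP8 total: 140 GB
print(f"Per GPU: {fp16_total_gb / 16:.2f} GB -> {fp8_total_gb / 16:.2f} GB")
```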

**2. CUDA Graph Optimization**

Explicitly specify the batch sizes for which CUDA graphs are captured:

```yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256]
```

Benefits:
- Pre-captured graphs for common batch sizes, replayed without per-kernel launch overhead
- Lower decode latency (5-10ms improvement)
- Better GPU utilization
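
To see why `enable_padding` matters, consider a batch whose size is not in the list. Assuming padding rounds the batch up to the next captured size (which is how this option is commonly described), the mapping looks like the sketch below. `padded_batch_size` is a hypothetical helper written for this example, not a TensorRT-LLM function.

```python
from bisect import bisect_left

CAPTURED_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]

def padded_batch_size(batch_size, captured=CAPTURED_SIZES):
    """Return the smallest captured size >= batch_size, or None if too large."""
    i = bisect_left(captured, batch_size)
    return captured[i] if i < len(captured) else None

for n in (1, 3, 100, 256, 300):
    print(n, "->", padded_batch_size(n))
```

A batch of 100 runs on the 128-wide graph, so a sparser list saves capture memory at the cost of more wasted slots per step.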

**3. DP Attention Tuning**

Balance memory usage with attention DP:

```yaml
enable_attention_dp: true
kv_cache_config:
  # Lower fraction when using DP attention
  free_gpu_memory_fraction: 0.30  # vs 0.85 without DP
```

**4. Expert-Specific Configurations**

```yaml
moe_config:
  backend: WIDEEP
  # Maximum tokens processed by MoE layers
  moe_max_num_tokens: 4096  # = max_batch_size * EP size
```

### Performance Comparison: TensorRT-LLM vs SGLang

**DeepSeek-R1 on 32 H100 GPUs (16 prefill + 16 decode)**:

| Metric | TensorRT-LLM | SGLang | Improvement |
|--------|--------------|--------|-------------|
| TTFT (median) | 280ms | 450ms | **38% faster** |
| TPOT (median) | 15ms | 22ms | **32% faster** |
| Throughput | 12,000 tok/s | 10,000 tok/s | **20% higher** |
| KV Cache Memory | 140GB (FP8) | 280GB (FP16) | **50% less** |
| Max Batch Size | 256 | 128 | **2x larger** |

*Note: Results vary based on workload, hardware, and configuration*
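
The "Improvement" column can be re-derived from the two measurement columns, which is a useful habit when you replace these figures with your own benchmark results. The helper names below are illustrative.

```python
def pct_reduction(baseline, improved):
    """Percent by which `improved` is lower than `baseline`."""
    return 100 * (baseline - improved) / baseline

def pct_increase(baseline, improved):
    """Percent by which `improved` is higher than `baseline`."""
    return 100 * (improved - baseline) / baseline

print(f"TTFT: {pct_reduction(450, 280):.0f}% lower")               # TTFT: 38% lower
print(f"TPOT: {pct_reduction(22, 15):.0f}% lower")                 # TPOT: 32% lower
print(f"Throughput: {pct_increase(10_000, 12_000):.0f}% higher")   # Throughput: 20% higher
print(f"KV cache memory: {pct_reduction(280, 140):.0f}% lower")    # KV cache memory: 50% lower
```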

### Troubleshooting TensorRT-LLM Wide EP

**Issue 1: FP8 Accuracy Degradation**
```
Symptom: Lower quality outputs with FP8 KV cache
Solution: 
- Use FP16 for critical applications
- Enable FP8 only for decode phase
- Tune quantization scales
```

**Issue 2: CUDA Graph OOM**
```
Symptom: Out of memory during CUDA graph capture
Solution:
- Reduce cuda_graph_config.batch_sizes list
- Decrease free_gpu_memory_fraction slightly so the KV cache leaves more headroom for graph capture
- Disable padding: enable_padding: false
```

**Issue 3: WIDEEP Backend Not Found**
```
Symptom: "WIDEEP backend not available" error
Solution:
- Ensure TensorRT-LLM built with Wide EP support
- Check library paths for DeepEP/DeepGEMM
- Verify GPU architecture support (Hopper recommended)
```


## Section 5: Performance Benchmarking for EP Deployments

Now that you've deployed Wide EP with both SGLang and TensorRT-LLM, let's learn how to **measure and optimize performance**.

**In this section, you'll learn:**
- Key metrics for MoE model deployments
- How to benchmark Expert Parallelism and EPLB
- Comparing single-node vs multi-node performance
- Measuring expert load balancing effectiveness

### Objectives
- Benchmark Expert Parallelism and EPLB performance
- Compare single-node vs multi-node deployments
- Measure expert load balancing effectiveness
- Analyze throughput and latency characteristics

### Key Metrics for MoE Models

#### 1. **Throughput Metrics**
```python
# Requests per second across all replicas
# Tokens per second (both input and output)
# Expert activations per second
```

#### 2. **Latency Metrics**
```python
# Time to First Token (TTFT)
# Time per Output Token (TPOT)  
# Expert routing latency
# All-to-all communication time
```
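
TTFT and TPOT can be computed from the arrival time of each streamed token. The sketch below assumes you have already recorded those timestamps (for example with `time.perf_counter()` while reading a streaming response); the function name and sample values are made up for illustration.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT and mean TPOT (seconds) from per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, None  # TPOT is undefined for a single token
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

ttft, tpot = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
print(f"TTFT={ttft * 1000:.0f} ms, TPOT={tpot * 1000:.0f} ms")  # TTFT=500 ms, TPOT=250 ms
```

TPOT is averaged over the intervals between tokens, so the first token (which includes prefill time) does not inflate it.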

#### 3. **Load Balancing Metrics**
```python
# GPU utilization variance (should be low with EPLB)
# Expert usage distribution (should be balanced)
# EPLB rebalancing frequency
# Redundant expert utilization
```

#### 4. **Resource Utilization**
```python
# GPU memory usage per worker
# Network bandwidth (especially for multi-node)
# CPU usage for pre/post-processing
```

### Benchmarking Exercise 1: Expert Load Distribution

**Goal**: Measure how EPLB improves expert load balancing

**Setup**:
1. Deploy a MoE model WITHOUT EPLB
2. Run workload and measure GPU utilization variance
3. Enable EPLB and re-run same workload
4. Compare results
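
For steps 2 to 4, one way to quantify "GPU utilization variance" is the standard deviation of per-GPU utilization sampled during each run. The numbers below are made up for illustration; replace them with values collected from `nvidia-smi` or DCGM. The function name is hypothetical.

```python
import statistics

def utilization_spread(per_gpu_util):
    """Return (mean, population std dev) of per-GPU utilization percentages."""
    return statistics.fmean(per_gpu_util), statistics.pstdev(per_gpu_util)

# Made-up samples for illustration only.
without_eplb = [100, 95, 40, 45]
with_eplb = [72, 70, 68, 70]

for label, sample in (("without EPLB", without_eplb), ("with EPLB", with_eplb)):
    mean, spread = utilization_spread(sample)
    print(f"{label}: mean={mean:.1f}% stdev={spread:.1f}")
```

Comparing the spread at a similar mean separates "the load is better balanced" from "the GPUs are simply busier".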


In [None]:
import time
import requests
import statistics

def benchmark_deployment(endpoint, num_requests=10):
    """Benchmark an EP deployment"""
    print(f"Benchmarking {endpoint}...")
    print(f"Sending {num_requests} requests...\n")
    
    latencies = []
    
    for i in range(num_requests):
        start = time.perf_counter()
        try:
            response = requests.post(
                f"{endpoint}/v1/chat/completions",
                json={
                    "model": "deepseek-ai/DeepSeek-R1",
                    "messages": [{"role": "user", "content": "Hello"}],
                    "max_tokens": 50
                },
                timeout=30
            )
            response.raise_for_status()  # Count HTTP errors as failures
            latency = time.perf_counter() - start
            latencies.append(latency)
            print(f"Request {i+1}: {latency:.2f}s")
        except requests.RequestException as e:
            print(f"Request {i+1}: Failed - {e}")
    
    if latencies:
        print("\nResults:")
        print(f"  Successful requests: {len(latencies)}/{num_requests}")
        print(f"  Mean latency: {statistics.mean(latencies):.2f}s")
        print(f"  Median latency: {statistics.median(latencies):.2f}s")
        # Requests are sent one at a time, so this is sequential throughput
        print(f"  Throughput: {len(latencies) / sum(latencies):.2f} req/s")

# Example usage (uncomment when deployment is running):
# benchmark_deployment("http://localhost:8000", num_requests=10)


## Summary

### What You Learned
- ✅ Kubernetes deployment with the Dynamo Operator
- ✅ Multi-node SGLang deployments with Wide EP and EPLB
- ✅ TensorRT-LLM configuration for MoE inference
- ✅ Monitoring, troubleshooting, and tuning of EP deployments
- ✅ Benchmarking throughput, latency, and load balance

### Key Takeaways
- Wide EP enables datacenter-scale deployments
- EPLB significantly improves load balancing and throughput
- Multi-node deployments require careful network and resource planning
- Production deployments need comprehensive monitoring and HA configuration
- TensorRT-LLM favors peak performance; SGLang favors flexibility and faster experimentation

### Performance Improvements with Wide EP
Improvements cited in this lab (actual results depend on workload, hardware, and configuration):
- ~50% less KV cache memory with FP8 (TensorRT-LLM)
- Lower decode latency from CUDA graphs
- More even GPU utilization with EPLB and redundant experts
- Higher aggregate throughput as replicas are added

### Next Steps
- Apply these techniques to your production deployments
- Experiment with different configurations for your specific workloads
- Contribute optimizations back to the Dynamo community
- Explore the latest features in the [Dynamo repository](https://github.com/ai-dynamo/dynamo)

---

## Congratulations!

You've completed the Dynamo Workshop. You now have the knowledge to:
- Deploy Dynamo from local to datacenter scale
- Choose the right topology for your use case
- Optimize performance with Wide EP and EPLB
- Operate production-grade LLM inference infrastructure
