# Lab 3: Wide EP Deployments and KVBM

## Overview

In this lab, you will:
- Deploy Dynamo across multiple nodes (wide EP deployments)
- Deploy KVBM (KV Block Manager) for advanced KV cache management
- Measure and compare performance of different deployment strategies
- Optimize for production-scale workloads

## Duration: ~120 minutes

---


## Section 1: Wide EP Deployments - Overview

### Objectives
- Understand wide Expert Parallelism (EP) deployments
- Configure multi-node deployments
- Implement cross-node communication
- Optimize for datacenter-scale workloads

### What is Wide EP?
- Deploy Dynamo across multiple nodes/machines
- Distribute model replicas for higher throughput
- Load balance across distributed workers
- Scale horizontally beyond single-node limits

### Tasks
- [ ] Review multi-node architecture
- [ ] Plan node distribution strategy
- [ ] Configure network policies for cross-node communication
- [ ] Understand resource allocation across nodes


## Section 2: Deploy Multi-Node Dynamo Cluster

### Objectives
- Deploy Dynamo workers across multiple Kubernetes nodes
- Configure node affinity and anti-affinity
- Implement cross-node service discovery
- Verify distributed deployment

### Architecture
```
Node 1: Frontend + etcd + NATS
Node 2: Worker Replicas (Set A)
Node 3: Worker Replicas (Set B)
Node 4: Worker Replicas (Set C)
```

### Tasks
- [ ] Define node selection strategy
- [ ] Configure pod affinity rules
- [ ] Deploy workers to specific nodes
- [ ] Configure inter-node networking
- [ ] Verify worker distribution across nodes
- [ ] Test load balancing across nodes
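
The affinity and verification tasks above can be sketched in Python. This is a minimal illustration, not Dynamo's own tooling: the `app` label value, the pod and node names, and the helper functions are all hypothetical. The affinity stanza uses the standard Kubernetes `podAntiAffinity` fields, with a soft (`preferred`) rule so that several replicas can still share a node when needed.

```python
from collections import Counter


def worker_anti_affinity(app_label: str) -> dict:
    """Build a pod-spec `affinity` stanza that prefers to spread pods
    carrying the same `app` label across different nodes."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        # One topology domain per node.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


def pods_per_node(placements: dict[str, str]) -> Counter:
    """Count pods per node from a {pod_name: node_name} mapping."""
    return Counter(placements.values())


def is_evenly_spread(placements: dict[str, str], nodes: list[str],
                     max_skew: int = 1) -> bool:
    """Return True if the busiest and emptiest worker nodes differ by
    at most `max_skew` pods. Nodes without pods count as zero."""
    counts = pods_per_node(placements)
    per_node = [counts.get(node, 0) for node in nodes]
    return max(per_node) - min(per_node) <= max_skew
```

The `placements` mapping could be filled from the NODE column of `kubectl get pods -o wide`. The resulting dictionary from `worker_anti_affinity` would sit under `spec.affinity` in the worker pod template.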


## Section 3: KVBM (KV Block Manager)

### Objectives
- Understand KVBM architecture and benefits
- Deploy KVBM for advanced KV cache management
- Configure cache policies and eviction strategies
- Optimize memory bandwidth utilization

### What is KVBM?
KVBM (KV Block Manager) is Dynamo's advanced KV cache management system that:
- Manages KV cache transfer between prefill and decode
- Optimizes memory bandwidth usage
- Implements intelligent cache eviction policies
- Enables efficient cache reuse across requests
- Reduces memory pressure on GPUs

### KVBM Architecture
```
Prefill Workers → KV Cache → KVBM → Decode Workers
                                ↓
                         Cache Store
                                ↓
                     Bandwidth Optimization
```

### Tasks
- [ ] Review KVBM documentation
- [ ] Deploy KVBM component
- [ ] Configure cache size limits
- [ ] Set eviction policies
- [ ] Enable cache metrics collection


## Section 4: Configure KVBM Policies

### Objectives
- Understand different cache eviction policies
- Configure KVBM for specific workload patterns
- Tune performance parameters

### KVBM Configuration Options

#### Cache Eviction Policies
- **LRU (Least Recently Used)**: Default policy, evicts the entries that were accessed least recently
- **LFU (Least Frequently Used)**: Evicts least frequently accessed entries
- **TTL (Time To Live)**: Cache entries expire after specified time
- **Adaptive**: Dynamically adjusts based on workload patterns
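
To make the difference between LRU and LFU concrete, here is a toy simulation. It is a sketch of the two textbook policies, not KVBM's implementation: the class names, the `access` method, and the request trace are hypothetical. The trace models a hot shared prefix (`"sys"`) interleaved with one-off requests.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the entry that was accessed least recently."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest access first

    def access(self, key: str) -> bool:
        """Record an access and return True on a cache hit."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop least recently used
        self.entries[key] = True
        return False


class LFUCache:
    """Evicts the resident entry with the fewest accesses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counts = {}  # key -> access count while resident

    def access(self, key: str) -> bool:
        """Record an access and return True on a cache hit."""
        if key in self.counts:
            self.counts[key] += 1
            return True
        if len(self.counts) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[key] = 1
        return False


def hit_rate(cache, trace: list[str]) -> float:
    """Replay a trace of keys and return the fraction of hits."""
    hits = sum(cache.access(key) for key in trace)
    return hits / len(trace)


trace = ["sys", "sys", "sys", "a", "b", "sys", "c", "d", "sys"]
print(f"LRU hit rate: {hit_rate(LRUCache(2), trace):.2f}")
print(f"LFU hit rate: {hit_rate(LFUCache(2), trace):.2f}")
```

On this trace LFU keeps the hot prefix resident while LRU lets the one-off keys push it out, so LFU scores more hits. A trace with shifting hot sets would favor LRU instead, which is why the policy should match the workload.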

#### Performance Tuning
- Cache size limits
- Bandwidth allocation
- Transfer batch sizes
- Prefetch strategies
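
When choosing a cache size limit, it helps to estimate how many tokens a given budget can hold. The sketch below uses the common sizing rule for standard multi-head or grouped-query attention (one key and one value vector per layer, per KV head). The function names and the model dimensions in the example are illustrative assumptions, not values from this lab.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int) -> int:
    """Bytes of KV cache needed for one token.

    The factor 2 accounts for storing both the key and the value.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(cache_bytes: int, bytes_per_token: int) -> int:
    """How many whole tokens fit into a cache budget."""
    return cache_bytes // bytes_per_token


# Illustrative dimensions: 32 layers, 8 KV heads, head size 128, FP16.
per_token = kv_bytes_per_token(32, 8, 128, 2)
budget = 16 * 1024**3  # hypothetical 16 GiB cache limit
print(per_token)                             # 131072 bytes (128 KiB)
print(max_cached_tokens(budget, per_token))  # 131072 tokens
```

Dividing the token capacity by the typical context length gives a rough count of concurrent sequences the cache can hold before eviction starts.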

### Tasks
- [ ] Select appropriate eviction policy
- [ ] Configure cache size limits
- [ ] Set bandwidth allocation
- [ ] Enable prefetching
- [ ] Configure monitoring and alerting


## Section 5: Performance Measurement and Comparison

### Objectives
- Benchmark KVBM-enabled deployments
- Compare with standard disaggregated serving
- Measure cache hit rates and bandwidth utilization
- Analyze performance improvements

### Metrics to Measure

#### Throughput Metrics
- Requests per second
- Tokens per second (input and output)
- GPU utilization across nodes

#### Latency Metrics
- Time to First Token (TTFT)
- Time per Output Token (TPOT)
- End-to-end latency
- Cache transfer latency
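
The first three latency metrics can be derived from per-token arrival timestamps on the client side. This is a minimal sketch: the function name and the timestamp format (seconds, as returned by `time.perf_counter()`) are assumptions. TPOT is computed here as the mean gap between output tokens after the first one, which is a common convention.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    """Compute TTFT, TPOT, and end-to-end latency for one request.

    request_sent: time the request was issued.
    token_times:  arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - request_sent
    e2e = token_times[-1] - request_sent
    n = len(token_times)
    # Mean inter-token gap; undefined for a single token, so report 0.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}


print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

Aggregating these per-request values into percentiles (for example p50 and p99) across a benchmark run gives a fuller picture than the mean alone.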

#### Cache Metrics
- Cache hit rate
- Cache miss rate
- Eviction frequency
- Memory bandwidth utilization

#### Resource Metrics
- GPU memory usage
- Network bandwidth usage
- CPU utilization
- Inter-node communication overhead

### Tasks
- [ ] Set up monitoring infrastructure
- [ ] Run baseline benchmarks (without KVBM)
- [ ] Run KVBM-enabled benchmarks
- [ ] Collect and analyze metrics
- [ ] Compare performance results
- [ ] Identify bottlenecks and optimization opportunities
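
For the comparison step, a small helper that reports the relative change per metric keeps the baseline and KVBM-enabled runs side by side. The metric names and the numbers below are placeholders for illustration, not measured results.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Percent change for every metric present in both runs."""
    return {
        name: percent_change(baseline[name], candidate[name])
        for name in baseline
        if name in candidate
    }


# Placeholder values only; substitute your own benchmark output.
baseline_run = {"tokens_per_s": 1000.0, "ttft_s": 0.5}
kvbm_run = {"tokens_per_s": 1200.0, "ttft_s": 0.4}
for name, change in compare_runs(baseline_run, kvbm_run).items():
    print(f"{name}: {change:+.1f}%")
```

Remember that the sign means different things per metric: a positive change is an improvement for throughput but a regression for latency.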


## Section 6: Optimization Techniques

### Objectives
- Apply advanced optimization techniques
- Tune for specific workload patterns
- Maximize resource utilization
- Achieve production-grade performance

### Optimization Strategies

#### Network Optimization
- Enable RDMA for low-latency communication
- Configure network bandwidth allocation
- Optimize inter-node routing

#### Cache Optimization
- Tune cache size based on model and workload
- Adjust eviction policies for access patterns
- Enable intelligent prefetching

#### Resource Optimization
- Balance GPU allocation across nodes
- Optimize CPU/GPU ratios
- Configure memory limits

### Tasks
- [ ] Profile current deployment
- [ ] Identify optimization opportunities
- [ ] Apply optimization techniques
- [ ] Re-benchmark after optimizations
- [ ] Document performance improvements


## Section 7: Production Considerations

### Objectives
- Prepare deployment for production
- Implement monitoring and alerting
- Configure high availability
- Plan for disaster recovery

### Production Checklist

#### High Availability
- [ ] Deploy multiple frontend replicas
- [ ] Configure etcd cluster (3+ nodes)
- [ ] Set up NATS cluster
- [ ] Implement health checks and auto-recovery

#### Monitoring
- [ ] Deploy Prometheus for metrics
- [ ] Configure Grafana dashboards
- [ ] Set up alerting rules
- [ ] Monitor GPU health (DCGM)

#### Security
- [ ] Enable TLS for all communications
- [ ] Configure RBAC policies
- [ ] Implement network policies
- [ ] Set up secrets management

#### Scalability
- [ ] Configure horizontal pod autoscaling
- [ ] Plan for cluster expansion
- [ ] Document scaling procedures

### Tasks
- [ ] Review production checklist
- [ ] Implement critical items
- [ ] Test failover scenarios
- [ ] Document operational procedures


## Section 8: Exercises

### Exercise 1: Scale Testing
- Deploy across 4+ nodes
- Gradually increase load
- Measure scaling efficiency
- Identify scaling limits
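
Scaling efficiency can be expressed as measured throughput divided by the throughput that perfect linear scaling from the smallest run would give. The sketch below assumes you have recorded one throughput figure per node count; the function name and the sample numbers are hypothetical.

```python
def scaling_efficiency(throughput_by_nodes: dict[int, float]) -> dict[int, float]:
    """Efficiency relative to linear scaling from the smallest run.

    1.0 means perfectly linear; lower values indicate overhead.
    """
    base_nodes = min(throughput_by_nodes)
    per_node_base = throughput_by_nodes[base_nodes] / base_nodes
    return {
        nodes: throughput / (nodes * per_node_base)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Hypothetical measurements in requests per second.
print(scaling_efficiency({1: 100.0, 2: 190.0, 4: 320.0}))
```

A sharp drop in efficiency between two node counts is a hint to look at inter-node communication or the frontend as the limiting factor.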

### Exercise 2: KVBM Policy Comparison
- Test different eviction policies
- Compare performance for different workload patterns
- Determine optimal policy for your use case

### Exercise 3: Failure Scenarios
- Simulate node failures
- Test automatic recovery
- Measure impact on service availability
- Verify cache resilience
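
To quantify the impact of a simulated failure, record the timestamp and outcome of each request from a steady client loop, then summarize. This is a sketch under that assumption; the function names and the `(timestamp, ok)` record format are hypothetical.

```python
def availability(results: list[tuple[float, bool]]) -> float:
    """Fraction of requests that succeeded."""
    if not results:
        raise ValueError("no results recorded")
    return sum(ok for _, ok in results) / len(results)


def longest_outage(results: list[tuple[float, bool]]) -> float:
    """Longest time from a first failed request to the next success.

    An outage still open at the end of the log is measured up to the
    last recorded request.
    """
    longest = 0.0
    outage_start = None
    last_time = None
    for timestamp, ok in sorted(results):
        last_time = timestamp
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, last_time - outage_start)
    return longest


log = [(0.0, True), (1.0, True), (2.0, False),
       (3.0, False), (4.0, True), (5.0, True)]
print(availability(log), longest_outage(log))
```

The outage duration is bounded by the probing interval, so a shorter interval between requests gives a tighter recovery-time estimate.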

### Exercise 4: Multi-Model Wide Deployment
- Deploy multiple models across nodes
- Implement model-aware routing
- Optimize resource allocation per model
- Benchmark multi-model performance


## Summary

### What You Learned
- ✅ Wide EP deployments across multiple nodes
- ✅ KVBM architecture and configuration
- ✅ Advanced performance measurement and optimization
- ✅ Production deployment best practices
- ✅ Cache management and bandwidth optimization

### Key Takeaways
- Wide EP enables datacenter-scale deployments
- KVBM can improve cache efficiency and reduce GPU memory pressure, especially for workloads with reusable prefixes
- Multi-node deployments require careful network and resource planning
- Production deployments need comprehensive monitoring and HA configuration
- Different cache policies suit different workload patterns

### Performance Improvements with KVBM
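Results depend heavily on model, hardware, and workload. The ranges below are indicative only; confirm them with your own Section 5 benchmarks: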
- 30-50% reduction in GPU memory usage
- 20-40% increase in throughput for cache-friendly workloads
- Reduced TTFT for cached prefixes
- Better resource utilization across the cluster

### Next Steps
- Apply these techniques to your production deployments
- Experiment with different configurations for your specific workloads
- Contribute optimizations back to the Dynamo community
- Explore the latest features in the [Dynamo repository](https://github.com/ai-dynamo/dynamo)

---

## Congratulations!

You've completed the Dynamo Workshop. You now have the knowledge to:
- Deploy Dynamo from local to datacenter scale
- Choose the right topology for your use case
- Optimize performance with advanced features like KVBM
- Operate production-grade LLM inference infrastructure
