# Lab 3.1: Expert Parallelism Foundations

Welcome! This notebook starts from first principles and builds up to Expert Parallelism (EP) for Mixture-of-Experts (MoE) models in Dynamo. You'll learn the main parallelism strategies, why EP exists, and how Dynamo supports WideEP, DeepEP, and dynamic load balancing (EPLB). You'll also run small, illustrative Python snippets to visualize concepts.

**Duration**: 45-60 minutes

**Next**: After completing this lab, continue to **Lab 3.2: Wide EP Production Deployment** to learn how to deploy these concepts in production with Kubernetes, SGLang, and TensorRT-LLM.

---

## Learning Objectives

By the end of this notebook, you will be able to:
- Explain why parallelism is needed for LLM inference (prefill vs decode)
- Differentiate DP, TP, PP, SP, and EP
- Describe what MoE is and why EP applies only to MoE models
- Understand Expert Parallelism in Dynamo (Standard, Wide, Deep, Dynamic/EPLB)
- Identify when to use Wide-EP and how EPLB balances load
- Interpret high-level NVL72 Wide-EP insights and their implications

## Table of Contents

**Foundations (Sections 1-5)**
1. [Why Parallelism Matters for LLMs](#1.-Why-Parallelism-Matters-for-LLMs)
2. [Parallelism Strategies at a Glance](#2.-Parallelism-Strategies-at-a-Glance)
3. [From Dense to MoE: Why Experts?](#3.-From-Dense-to-MoE:-Why-Experts?)
   - [3.1 How MoE Expert Selection Works](#3.1-How-MoE-Expert-Selection-Works-(Flat-Expert-Pool))
4. [Expert Parallelism (EP): Core Idea](#4.-Expert-Parallelism-(EP):-Core-Idea)
5. [Expert Parallelism in Dynamo](#5.-Expert-Parallelism-in-Dynamo)

**Deep Dives (Section 6)**
- [6. Deep Dives: EP Variants](#6.-Deep-Dives:-EP-Variants)
  - 6.1 [Standard EP](#6.1-Deep-Dive:-Standard-EP)
  - 6.2 [Wide EP](#6.2-Deep-Dive:-Wide-EP)
  - 6.3 [Deep EP](#6.3-Deep-Dive:-Deep-EP)
  - 6.4 [Dynamic EP (EPLB)](#6.4-Deep-Dive:-Dynamic-EP-(EPLB))

**Hands-On (Section 7)**
- [7. Hands-on Exercises](#7.-Hands-on-Exercises)
  - [Exercise 1: WideEP Routing Simulation](#Exercise-1:-WideEP-Routing-Simulation)
  - [Exercise 2: Throughput Comparison Chart](#Exercise-2:-Throughput-Comparison-Chart)

**Advanced Topics (Section 8)**
- [8. Advanced: Large-Scale Wide-EP on GB200 NVL72](#8.-Advanced:-Large-Scale-Wide-EP-on-GB200-NVL72)
  - [GroupGEMM and Weight-Loading Intuition](#GroupGEMM-and-Weight-Loading-Intuition)
  - [When to use Wide-EP (NVL72 perspective)](#When-to-use-Wide-EP-(NVL72-perspective))

**Resources (Section 9)**
- [9. Further Reading](#9.-Further-Reading)

---

## How to Use This Notebook

- **Beginners**: Start with Sections 1-5 for core concepts
- **Intermediate**: Explore Section 6 (Deep Dives) for EP variants
- **Advanced**: Read Section 8 for NVL72 insights and state-of-the-art optimizations
- **Hands-on learners**: Run the exercises in Section 7 to build intuition

## 1. Why Parallelism Matters for LLMs

Large Language Models are big and compute-intensive. Parallelism helps distribute memory and compute across multiple GPUs/nodes.

### LLM Inference Phases
```
┌─────────────────────────────────────────────────────┐
│                  LLM Inference                      │
└─────────────────────────────────────────────────────┘
                         │
        ┌────────────────┴────────────────┐
        │                                 │
   ┌────▼────┐                       ┌────▼────┐
   │ Phase 1 │                       │ Phase 2 │
   │ PREFILL │                       │ DECODE  │
   └─────────┘                       └─────────┘
```
- **Prefill**: processes prompt tokens in parallel (compute-bound)
- **Decode**: generates one token at a time using KV cache (memory-bound)
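
The compute-bound versus memory-bound split can be made concrete with a rough arithmetic-intensity estimate. The sketch below is illustrative only: it assumes a single `d_model x d_model` linear layer, FP16 weights, and that reading the weights dominates memory traffic. The function name and the sizes are made up for this example.

```python
# Rough arithmetic-intensity estimate for one d x d linear layer (illustrative only).
# Assumptions: FP16 weights (2 bytes/param); weight reads dominate memory traffic.

def arithmetic_intensity(num_tokens: int, d_model: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read when num_tokens share one weight load."""
    flops = 2 * num_tokens * d_model * d_model          # one multiply-add per weight per token
    weight_bytes = d_model * d_model * bytes_per_param  # weights are read once per step
    return flops / weight_bytes

prefill_ai = arithmetic_intensity(num_tokens=2048, d_model=4096)  # whole prompt at once
decode_ai = arithmetic_intensity(num_tokens=1, d_model=4096)      # one new token per step

print(f"Prefill: {prefill_ai:.0f} FLOPs/byte")
print(f"Decode:  {decode_ai:.0f} FLOPs/byte")
```

Under these assumptions, prefill reuses each loaded weight across every prompt token, while a single-sequence decode step uses each weight once. Batching many decode requests together raises the decode figure.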

---

## 2. Parallelism Strategies at a Glance

| Type | Description | Key Benefit | Applies To |
|------|-------------|-------------|------------|
| Data Parallelism (DP) | Same model copy on each GPU, different data shards | Simple throughput scaling | Dense + MoE |
| Tensor Parallelism (TP) | Split individual layers across GPUs | Fit larger models than one GPU | Dense + MoE |
| Pipeline Parallelism (PP) | Split model layers into stages across GPUs | Memory-efficient for deep models | Dense + MoE |
| Sequence Parallelism (SP) | Distribute sequence tokens across GPUs for attention | Efficient long sequences | Dense + MoE |
| Expert Parallelism (EP) | Distribute MoE experts across GPUs, route tokens | Scale capacity without linear compute | MoE only |

**Notes:**
- DP/TP/PP/SP apply to all models.
- EP applies only to MoE models (because only MoE has experts).

---

## 3. From Dense to MoE: Why Experts?

Dense models activate all parameters for every token. MoE models activate only a small subset of experts (e.g., top-1 or top-2) per token.

**Example:**
- Dense: 200B parameters → all active per token
- MoE: 1T parameters → ~200B active per token (top-k experts)

This gives huge capacity without proportional compute cost. The challenge becomes routing tokens to the right experts efficiently.

### 3.1 How MoE Expert Selection Works (Flat Expert Pool)

Most production MoE models (DeepSeek-V3, Qwen3, Mixtral) use a **flat expert pool** architecture. Let's understand how it works:

#### The Router Network

```
Input Token Embedding
         │
         ▼
    ┌─────────┐
    │ Router  │ ← Small neural network (learned during training)
    │ Network │    Outputs: score for EACH expert
    └────┬────┘
         │
         ▼
   [s₀, s₁, s₂, ..., s₂₅₅]  ← Scores for all 256 experts
         │
         ▼
    Top-K Selection (e.g., k=8)
         │
         ▼
   [E₁₂, E₄₅, E₇₈, E₁₂₃, E₁₅₆, E₁₈₉, E₂₀₁, E₂₃₄]
         │
         ▼
   Weighted Sum → Output
```

**Key Components:**

1. **Router Network**: A small learned neural network that:
   - Takes token embedding as input
   - Outputs a score for **every expert** in the pool
   - Trained end-to-end with the model

2. **Top-K Selection**: 
   - Selects the k experts with highest scores
   - Only these experts process the token
   - Others are completely skipped (sparse activation)

3. **Weighted Combination**:
   - Each selected expert's output is weighted by its score
   - Final output = weighted sum of expert outputs
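
The three components above can be sketched end to end for a single token. This is a toy under stated assumptions, not any model's actual implementation: the router is a single random matrix, each "expert" is one small linear layer, and the gate weights are a softmax over the selected scores. Real models differ in how they normalize gate weights. All names (`router_w`, `expert_w`, `moe_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2   # toy sizes, chosen for illustration

# Hypothetical parameters: a linear router and one small linear "expert" each.
router_w = rng.normal(size=(d_model, num_experts))
expert_w = rng.normal(size=(num_experts, d_model, d_model))

def moe_forward(x):
    scores = x @ router_w                            # 1. one score per expert
    top_idx = np.argsort(scores)[-top_k:][::-1]      # 2. keep the k highest scores
    top_scores = scores[top_idx]
    gates = np.exp(top_scores - top_scores.max())    # 3. softmax over the selected scores
    gates /= gates.sum()
    # Only the selected experts run; the others are skipped entirely.
    out = sum(g * (x @ expert_w[e]) for g, e in zip(gates, top_idx))
    return out, top_idx, gates

token = rng.normal(size=d_model)
output, chosen, gates = moe_forward(token)
print("experts:", chosen.tolist(), "gates:", np.round(gates, 3).tolist())
```

Note that the expert matrices for unselected experts are never touched, which is where the compute savings of sparse activation come from.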

#### Example: DeepSeek-V3 Configuration

- **Total experts**: 256 (flat pool)
- **Active per token**: 8 (top-8 selection)
- **Activation ratio**: 8/256 = 3.1% of experts per token
- **Effective compute**: ~37B parameters active (out of 671B total)
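
As a quick check of these ratios, the arithmetic can be reproduced directly from the figures listed above (the variable names are just for this sketch):

```python
total_experts, active_experts = 256, 8
total_params_b, active_params_b = 671, 37   # billions, from the figures above

expert_ratio = active_experts / total_experts
param_ratio = active_params_b / total_params_b

print(f"Experts active per token:    {expert_ratio:.1%}")
print(f"Parameters active per token: {param_ratio:.1%}")
```

The parameter ratio comes out higher than the expert ratio because non-expert parameters such as attention and embeddings are active for every token.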

#### Why "Flat" Pool?

All 256 experts are:
- ✅ Scored simultaneously by the router
- ✅ Treated equally (no hierarchy or grouping)
- ✅ Can be selected for any token (maximum flexibility)

**Contrast with hierarchical** (rare in practice):
- ❌ Would have multiple routing stages
- ❌ Experts grouped by domain (Language/Reasoning/etc.)
- ❌ First choose group, then choose expert within group

#### Hands-On: Simulating MoE Routing

Let's simulate how a router selects experts for different tokens:

In [None]:
# Simulate MoE expert routing with flat expert pool
import numpy as np
import matplotlib.pyplot as plt

# Configuration (similar to DeepSeek-V3)
NUM_EXPERTS = 256
TOP_K = 8
NUM_TOKENS = 5

# Simulate router scores for different tokens
np.random.seed(42)

print("=" * 70)
print("MoE Expert Routing Simulation (Flat Expert Pool)")
print("=" * 70)
print(f"Total experts: {NUM_EXPERTS}")
print(f"Active experts per token: {TOP_K}")
print(f"Activation ratio: {TOP_K/NUM_EXPERTS*100:.1f}%\n")

# Simulate routing for different tokens
tokens = ["The", "quantum", "entanglement", "phenomenon", "is"]

fig, axes = plt.subplots(1, len(tokens), figsize=(15, 3))
fig.suptitle("Expert Selection for Different Tokens (Top-8 from 256 experts)", fontsize=14)

for idx, token in enumerate(tokens):
    # Simulate router scores (in practice, these come from a learned neural network)
    router_scores = np.random.randn(NUM_EXPERTS)
    
    # Top-K selection
    top_k_indices = np.argsort(router_scores)[-TOP_K:][::-1]
    top_k_scores = router_scores[top_k_indices]
    
    # Normalize scores to weights (softmax-like)
    weights = np.exp(top_k_scores) / np.sum(np.exp(top_k_scores))
    
    print(f"Token: '{token}'")
    print(f"  Selected experts: {top_k_indices.tolist()}")
    print(f"  Weights: {[f'{w:.3f}' for w in weights]}")
    print()
    
    # Visualize
    ax = axes[idx]
    colors = ['#ff6b6b' if i in top_k_indices else '#e0e0e0' for i in range(NUM_EXPERTS)]
    ax.bar(range(NUM_EXPERTS), [1 if i in top_k_indices else 0.1 for i in range(NUM_EXPERTS)], 
           color=colors, width=1.0, edgecolor='none')
    ax.set_title(f'"{token}"', fontsize=12, fontweight='bold')
    ax.set_xlabel('Expert ID', fontsize=9)
    ax.set_ylim(0, 1.2)
    ax.set_xlim(0, NUM_EXPERTS)
    if idx == 0:
        ax.set_ylabel('Selected', fontsize=9)
    ax.set_xticks([0, 64, 128, 192, 255])
    ax.set_yticks([])
    ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("=" * 70)
print("Key Observations:")
print("  • Different tokens route to different experts")
print("  • Only 3.1% of experts active per token (8/256)")
print("  • This is why MoE scales capacity without proportional compute!")
print("=" * 70)


#### Connection to Expert Parallelism Deployment

Now you understand how MoE models work internally. But here's the deployment challenge:

**Problem**: DeepSeek-V3 has 256 experts. How do we distribute them across GPUs?

**Solutions** (covered in Sections 4-6 and 8):
- **Standard EP**: Distribute experts across 8-16 GPUs
- **Wide EP**: Distribute across 32-64+ GPUs for maximum throughput
- **EPLB**: Dynamically balance load when some experts are more popular

The router mechanism you just learned operates **within** whatever EP deployment strategy you choose!

---


## 4. Expert Parallelism (EP): Core Idea

In Expert Parallelism, each GPU or node hosts a subset of experts. Tokens are routed to the selected experts via efficient all-to-all communication.

**Diagram:**
```
[GPU0: Experts 0,1]  <-->  [GPU1: Experts 2,3]  <-->  [GPU2: Experts 4,5]
              ↘  Tokens routed dynamically  ↙
```

**Performance depends on:**
- Expert placement (static vs dynamic)
- Load balancing (even token distribution)
- Communication optimization (all-to-all)

**Key consideration:** How many experts should each GPU store? We'll explore this question in detail in Section 8.
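
The dispatch half of that all-to-all can be sketched in plain Python. This is a minimal illustration, assuming the static two-experts-per-GPU layout from the diagram and a made-up routing table. Real systems perform this step with GPU collectives rather than Python dictionaries.

```python
from collections import defaultdict

NUM_EXPERTS, NUM_GPUS = 6, 3
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS   # matches the diagram: 2 experts per GPU

def owner_gpu(expert_id: int) -> int:
    """Static contiguous placement: experts 0,1 -> GPU0; 2,3 -> GPU1; 4,5 -> GPU2."""
    return expert_id // EXPERTS_PER_GPU

# Hypothetical router output: token id -> selected expert ids (top-2).
routing = {"T0": [0, 3], "T1": [2, 3], "T2": [5, 1], "T3": [4, 0]}

# Dispatch step of the all-to-all: group (token, expert) pairs by destination GPU.
send_buckets = defaultdict(list)
for token, expert_ids in routing.items():
    for e in expert_ids:
        send_buckets[owner_gpu(e)].append((token, e))

for gpu in range(NUM_GPUS):
    print(f"GPU{gpu} receives: {send_buckets[gpu]}")
```

After the experts run, a second all-to-all (the combine step, not shown) returns each expert output to the GPU that owns the token.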

---

## 5. Expert Parallelism in Dynamo

Dynamo provides pluggable backends with different EP capabilities.

| EP Type | Description | Example Backend / Model |
|---------|-------------|-------------------------|
| Standard EP | Static expert distribution | Mixtral via SGLang |
| Wide EP | Disaggregated experts across clusters | DeepSeek-R1 WideEP |
| Deep EP | Hierarchical/nested experts | DeepSeek-V2/V3 |
| Dynamic EP (EPLB) | Adaptive routing/load balancing | EPLB in SGLang |

---

## 6. Deep Dives: EP Variants

This section goes deeper into each Expert Parallelism variant. Read these after the basics.

- **Standard EP**: static placement of experts across GPUs
- **Wide EP**: distributed experts across nodes/clusters; used with horizontal replicas
- **Deep EP**: hierarchical/nested experts for specialization
- **Dynamic EP (EPLB)**: runtime (or static) load balancing of expert placements

---

### 6.1 Deep Dive: Standard EP

- Experts are statically sharded across GPUs
- Router selects top-k experts per token, then all-to-all routes tokens
- Good starting point for small-to-medium MoE deployments
- Combine with TP/PP for dense layers if needed

**Tip:** Monitor expert usage distribution; if imbalance appears, consider enabling EPLB.
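
One way to act on this tip is to track a simple imbalance metric. The sketch below assumes contiguous static sharding and uses made-up per-expert token counts. The helper names and the max/mean metric are choices for this example, not part of any backend's API.

```python
import numpy as np

def gpu_loads(tokens_per_expert, num_gpus):
    """Sum per-expert token counts under contiguous static sharding."""
    counts = np.asarray(tokens_per_expert)
    # Requires the number of experts to be divisible by num_gpus.
    return counts.reshape(num_gpus, -1).sum(axis=1)

def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return loads.max() / loads.mean()

# Hypothetical per-expert token counts collected over a monitoring window.
observed = [1000, 800, 600, 500, 300, 200, 100, 50]
loads = gpu_loads(observed, num_gpus=4)
print("Per-GPU load:", loads.tolist())
print(f"Imbalance (max/mean): {imbalance(loads):.2f}")
```

A value well above 1.0 means one GPU is doing far more work than the average, which is the situation load balancing is meant to address.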

---

### 6.2 Deep Dive: Wide EP

- Distributes experts across many GPUs/nodes; often used with multi-replica (horizontal) serving
- Targets high throughput at scale; pairs well with disaggregated prefill/decode
- Benefits from high-bandwidth interconnect (e.g., NVL72) to hide all-to-all costs
- Works best with batching and GroupGEMM-friendly routing

#### WideEP vs DeepEP Comparison

| Dimension | Wide EP | Deep EP |
|-----------|---------|---------|
| Goal | Maximize throughput & utilization | Enhance specialization |
| Structure | Experts spread across GPUs/nodes | Experts grouped hierarchically |
| Routing | Token → top-k experts globally | Token → coarse → fine experts |
| Use Case | Distributed inference at scale | Hierarchical reasoning |

**Visuals:**
```
WideEP:  GPU clusters -> flat expert pool
         [E0,E1,E2,...E64]

DeepEP:  Hierarchical expert tree
         Root
          ├── Language Experts
          │    ├── Code Expert
          │    ├── Math Expert
          ├── Reasoning Experts
               ├── Symbolic
               ├── Commonsense
```

---

### 6.3 Deep Dive: Deep EP

- Organizes experts hierarchically (coarse → fine routing)
- Improves specialization; useful for mixed domains and complex reasoning tasks
- May reduce routing search space at each level; pairs with expert grouping/placement policies

**Diagram (conceptual):**
```
Root
 ├── Language Experts
 │    ├── Code Expert
 │    ├── Math Expert
 ├── Reasoning Experts
      ├── Symbolic
      ├── Commonsense
```

---

### 6.4 Deep Dive: Dynamic EP (EPLB)

**Expert Parallelism Load Balancer (EPLB)** dynamically optimizes expert placement across GPUs to balance workload.

#### Critical: EPLB Changes WHERE Experts Live, NOT Which Expert Processes a Token

**Important distinction to avoid confusion:**

**What the Router Does (Never Changes):**
```
Token "quantum" → Router → Selects Expert 0, Expert 45, Expert 123
```
- The router's decision is **fixed by the model's learned weights**
- If the router says "Expert 0 should process this token," then **Expert 0 MUST process it**
- This is critical for correctness and **never changes during inference**

**What EPLB Does (Physical Placement Only):**

EPLB only changes **WHERE** Expert 0 physically lives, not which tokens it processes.

**Without EPLB (naive placement):**
```
GPU 0: Expert 0, Expert 1  ← Expert 0 gets 1000 tokens, Expert 1 gets 800 tokens
GPU 1: Expert 2, Expert 3  ← Expert 2 gets 600 tokens, Expert 3 gets 500 tokens
GPU 2: Expert 4, Expert 5  ← Expert 4 gets 300 tokens, Expert 5 gets 200 tokens
GPU 3: Expert 6, Expert 7  ← Expert 6 gets 100 tokens, Expert 7 gets 50 tokens

Problem: GPU 0 is overloaded (1800 tokens), GPU 3 is mostly idle (150 tokens)
Throughput: Limited by slowest GPU!
```

**With EPLB (smart rebalancing):**
```
GPU 0: Expert 0           ← Still processes the same 1000 tokens
GPU 1: Expert 1, Expert 4 ← 800 + 300 = 1100 tokens
GPU 2: Expert 2, Expert 5, Expert 6 ← 600 + 200 + 100 = 900 tokens
GPU 3: Expert 3, Expert 7 ← 500 + 50 = 550 tokens

Result: Max GPU load drops from 1800 to 1100 tokens (average ~888 per GPU)
Throughput: ~1.6x improvement (1800 / 1100), since the busiest GPU sets the pace
```

**Key Point:** A token routed to Expert 4 is still processed by Expert 4. It just travels to GPU 1 now instead of GPU 2.

#### EPLB Modes

- **Static EPLB**: precomputed expert → GPU mappings from historical data
- **Online EPLB**: runtime redistribution to match live workload patterns
- Can replicate hot experts and relocate cold ones
- Performed between forward passes without breaking CUDA graphs
- **Goal**: minimize GPU utilization variance and improve tokens/sec
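
To make the static mode concrete, here is one simple way a mapping could be precomputed from historical counts: a greedy "heaviest expert first, onto the least-loaded GPU" heuristic. This is an illustrative stand-in, not the algorithm EPLB itself uses, and it ignores replication. The function name and inputs are hypothetical.

```python
import heapq

def greedy_placement(tokens_per_expert, num_gpus):
    """Assign experts to GPUs, heaviest first, always onto the least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]      # (current load, gpu id)
    placement = {gpu: [] for gpu in range(num_gpus)}
    order = sorted(range(len(tokens_per_expert)),
                   key=lambda e: tokens_per_expert[e], reverse=True)
    for expert in order:
        load, gpu = heapq.heappop(heap)               # least-loaded GPU so far
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + tokens_per_expert[expert], gpu))
    loads = {gpu: sum(tokens_per_expert[e] for e in experts)
             for gpu, experts in placement.items()}
    return placement, loads

history = [1000, 800, 600, 500, 300, 200, 100, 50]   # same counts as the example above
placement, loads = greedy_placement(history, num_gpus=4)
for gpu in placement:
    print(f"GPU {gpu}: experts {placement[gpu]} -> {loads[gpu]} tokens")
```

On these counts the heaviest GPU ends at 1000 tokens, which is the load of Expert 0 alone. No placement can go below that without splitting Expert 0's traffic, which is what replication is for.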

#### Expert Replication

EPLB can also **replicate** popular experts:
```
GPU 0: Expert 0 (replica A)
GPU 1: Expert 0 (replica B), Expert 1
```
- Both replicas have **identical weights**
- A token needing Expert 0 can go to **either replica** (whichever GPU is less busy)
- Output is **exactly the same** regardless of which replica processes it
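
The "whichever GPU is less busy" rule can be sketched as a least-loaded replica choice. The replica table mirrors the two-GPU layout above. The load counter and the `dispatch` helper are simplifications invented for this example.

```python
# Hypothetical replica table: expert id -> GPUs that hold a copy of its weights.
replicas = {0: [0, 1], 1: [1]}
gpu_load = {0: 0, 1: 0}            # tokens queued per GPU (toy counter)

def dispatch(expert_id: int) -> int:
    """Send one token to the least-loaded GPU that hosts the requested expert."""
    gpu = min(replicas[expert_id], key=lambda g: gpu_load[g])
    gpu_load[gpu] += 1
    return gpu

# Six tokens for Expert 0 and two tokens for Expert 1, interleaved.
requests = [0, 0, 1, 0, 0, 1, 0, 0]
assignments = [dispatch(e) for e in requests]
print("Token -> GPU:", assignments)
print("Final load:  ", gpu_load)
```

Because both replicas hold identical weights, the choice of GPU affects only load, never the result.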

### Figure 4: EPLB Load Balancing

![Figure 4: EPLB](images/figure4_eplb_load_balancing.gif)

*Diagram showing Expert Parallel Load Balancer (EPLB) redistributes experts to ensure balanced GPU workload, preventing over- and under-utilization.*

---

In [None]:
# EPLB Load Balancing Demonstration
import matplotlib.pyplot as plt
import numpy as np

# Scenario: 8 experts, 4 GPUs
# Each expert has different popularity (tokens to process)
expert_popularity = np.array([1000, 800, 600, 500, 300, 200, 100, 50])

print("=" * 70)
print("EPLB Load Balancing Demonstration")
print("=" * 70)
print("\n📊 Expert Popularity (tokens each expert needs to process):")
for i, tokens in enumerate(expert_popularity):
    bar = '█' * (tokens // 100)
    print(f"  Expert {i}: {bar} {tokens} tokens")

# Without EPLB: naive assignment (2 experts per GPU)
print("\n\n❌ WITHOUT EPLB (Naive Assignment: 2 experts per GPU)")
print("-" * 70)
naive_assignment = {
    0: [0, 1],  # GPU 0 gets Expert 0, 1
    1: [2, 3],  # GPU 1 gets Expert 2, 3
    2: [4, 5],  # GPU 2 gets Expert 4, 5
    3: [6, 7]   # GPU 3 gets Expert 6, 7
}

naive_loads = []
for gpu_id, expert_ids in naive_assignment.items():
    load = sum(expert_popularity[e] for e in expert_ids)
    naive_loads.append(load)
    expert_str = ', '.join([f"E{e}" for e in expert_ids])
    print(f"GPU {gpu_id}: [{expert_str}] → {load} tokens")

print(f"\n⚠️  Problem:")
print(f"    • GPU 0 is overloaded: {max(naive_loads)} tokens")
print(f"    • GPU 3 is mostly idle: {min(naive_loads)} tokens")
print(f"    • Throughput limited by slowest GPU!")

# With EPLB: balanced assignment
print("\n\n✅ WITH EPLB (Smart Rebalancing)")
print("-" * 70)
eplb_assignment = {
    0: [0],        # Most popular expert gets its own GPU
    1: [1, 4],     # Balance: 800 + 300 = 1100
    2: [2, 5, 6],  # Balance: 600 + 200 + 100 = 900
    3: [3, 7]      # Balance: 500 + 50 = 550
}

eplb_loads = []
for gpu_id, expert_ids in eplb_assignment.items():
    load = sum(expert_popularity[e] for e in expert_ids)
    eplb_loads.append(load)
    expert_str = ', '.join([f"E{e}" for e in expert_ids])
    print(f"GPU {gpu_id}: [{expert_str}] → {load} tokens")

print(f"\n✅ Result:")
print(f"    • All GPUs balanced (~{np.mean(eplb_loads):.0f} tokens average)")
print(f"    • Max load reduced from {max(naive_loads)} to {max(eplb_loads)}")
print(f"    • Throughput improvement: ~{max(naive_loads)/max(eplb_loads):.1f}x faster!")

# Visualize the difference
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Without EPLB
colors_naive = ['#ff6b6b' if load > 1500 else '#ffa500' if load > 1000 else '#90EE90' for load in naive_loads]
bars1 = ax1.bar(range(4), naive_loads, color=colors_naive, edgecolor='black', linewidth=1.5)
ax1.axhline(y=np.mean(naive_loads), color='blue', linestyle='--', linewidth=2, label=f'Average: {np.mean(naive_loads):.0f}')
ax1.set_title('Without EPLB: Imbalanced', fontsize=14, fontweight='bold', color='#d32f2f')
ax1.set_xlabel('GPU', fontsize=12)
ax1.set_ylabel('Tokens to Process', fontsize=12)
ax1.set_xticks(range(4))
ax1.set_xticklabels([f'GPU {i}' for i in range(4)])
ax1.legend(fontsize=10)
ax1.grid(axis='y', alpha=0.3)
ax1.set_ylim(0, 2000)

# Add load labels on bars
for i, (bar, load) in enumerate(zip(bars1, naive_loads)):
    ax1.text(bar.get_x() + bar.get_width()/2, load + 50, f'{load}', 
             ha='center', va='bottom', fontweight='bold', fontsize=10)

# With EPLB
bars2 = ax2.bar(range(4), eplb_loads, color='#66cc99', edgecolor='black', linewidth=1.5)
ax2.axhline(y=np.mean(eplb_loads), color='blue', linestyle='--', linewidth=2, label=f'Average: {np.mean(eplb_loads):.0f}')
ax2.set_title('With EPLB: Balanced', fontsize=14, fontweight='bold', color='#388e3c')
ax2.set_xlabel('GPU', fontsize=12)
ax2.set_ylabel('Tokens to Process', fontsize=12)
ax2.set_xticks(range(4))
ax2.set_xticklabels([f'GPU {i}' for i in range(4)])
ax2.legend(fontsize=10)
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim(0, 2000)

# Add load labels on bars
for i, (bar, load) in enumerate(zip(bars2, eplb_loads)):
    ax2.text(bar.get_x() + bar.get_width()/2, load + 50, f'{load}', 
             ha='center', va='bottom', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("🎯 Key Takeaway:")
print("   EPLB monitors expert popularity and redistributes them across GPUs")
print("   to prevent bottlenecks and maximize throughput!")
print("   Remember: Experts still process the SAME tokens, just on different GPUs!")
print("=" * 70)

---

## 🎯 Transition: From Theory to Practice

Great work! You now understand:
- ✅ Why parallelism matters for LLMs
- ✅ Different parallelism strategies (DP, TP, PP, SP, EP)
- ✅ How MoE models work with expert routing
- ✅ Expert Parallelism variants (Standard, Wide, Deep, Dynamic/EPLB)

**Now let's reinforce these concepts with hands-on exercises!** The following interactive examples will help you visualize how EP and EPLB work in practice.

---

## 7. Hands-on Exercises

### Exercise 1: WideEP Routing Simulation

Simulate how tokens get assigned to top-k experts.

Run the cell below and observe the randomized token → experts assignments.

In [None]:
# Exercise 1: token → experts (top-k) routing simulation
import random

experts = [f"E{i}" for i in range(8)]
tokens = [f"T{i}" for i in range(16)]

def route(tokens, experts, topk=2):
    routing = {}
    for t in tokens:
        routing[t] = random.sample(experts, topk)
    return routing

routing = route(tokens, experts)
for t, e in routing.items():
    print(f"{t} → {e}")

### Exercise 2: Throughput Comparison Chart

Visualize relative throughput improvements from different strategies.

**⚠️ Note**: The values below are **illustrative examples only** to demonstrate the concept. Actual performance will vary significantly based on your hardware, model size, workload characteristics, and configuration. Always run your own benchmarks to measure real performance!

In [None]:
# Exercise 2: simple bar chart of relative throughput (ILLUSTRATIVE VALUES)
import matplotlib.pyplot as plt

x = ["Dense", "Standard EP", "WideEP", "WideEP + EPLB"]
y = [1.0, 1.8, 2.5, 2.9]  # Illustrative values - NOT actual benchmarks!

plt.figure(figsize=(8,4))
plt.bar(x, y, color=["#8888ff", "#66cc99", "#ffcc66", "#ff8888"])
plt.title("Relative Throughput by Strategy (Illustrative)")
plt.ylabel("Speedup (× Dense)")
plt.ylim(0, 3.2)
plt.grid(axis='y', alpha=0.3)
plt.text(0.5, 3.0, 'Values are illustrative only!', 
         ha='center', fontsize=10, style='italic', color='red')
plt.show()

## 8. Advanced: Large-Scale Wide-EP on GB200 NVL72

In this section, we explore how **Wide Expert Parallelism (Wide-EP)** on **GB200 NVL72** rack-scale systems enables efficient inference for massive MoE models like DeepSeek-R1.

### 8.1 The Scaling Challenge

As MoE models like DeepSeek-R1 grow to 671B parameters with 256 experts, deploying them efficiently becomes critical. We've seen that Expert Parallelism (EP) distributes experts across GPUs, but **how many GPUs should we use?**

Consider deploying DeepSeek-R1 (256 experts):
- **With 8 GPUs:** Each GPU stores 256 ÷ 8 = **32 experts per GPU**
- **With 16 GPUs:** Each GPU stores 256 ÷ 16 = **16 experts per GPU**
- **With 64 GPUs:** Each GPU stores 256 ÷ 64 = **4 experts per GPU**

**The Question:**
Does it matter how many GPUs we use? Is there a "sweet spot" for the number of experts per GPU?

To answer this, we need to understand what happens during MoE inference at high throughput...

---

### 8.2 The Problem: Weight-Loading Bottleneck

MoE models dynamically load expert weights on a per-token, per-layer basis. In high-throughput inference:

**The Challenge:**
1. Each GPU stores many experts (e.g., 32 experts/GPU in small EP)
2. Every token activates 8 experts, so the tokens in a batch spread across many different experts on the same GPU
3. Weights must be loaded into on-chip memory/registers before computation
4. **GroupGEMM kernels** batch tokens per expert for efficiency, but are bottlenecked by weight-loading overhead

**GroupGEMM**: A fused kernel that packs all tokens routed to the same expert into a single matrix multiplication. This is efficient _if_ we can keep weights in fast memory and reuse them across many tokens.

### Figure 2: GroupGEMM Token Routing

![Figure 2: GroupGEMM](images/figure2_groupgemm_token_routing.webp)

*Tokens routed to the same expert are packed together and processed with a single fused GroupGEMM kernel for efficient MoE inference.*

**The Bottleneck:**
- More experts per GPU → more weight-loading pressure
- Less weight reuse → lower arithmetic intensity (FLOPs/byte)
- GroupGEMM efficiency degrades as experts/GPU increases
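
A back-of-the-envelope model shows why weight reuse depends on EP size. It rests on two loudly stated assumptions: every GPU contributes the same number of tokens per step (`TOKENS_PER_GPU`, a made-up value), and routing is uniform across experts. Under those assumptions the global token count grows with the number of GPUs while the expert count stays fixed at 256, so each expert's weights are reused by more tokens.

```python
TOTAL_EXPERTS, TOP_K = 256, 8
TOKENS_PER_GPU = 128          # hypothetical decode batch contributed by each GPU

def tokens_per_expert(ep_size: int) -> float:
    """Average tokens each expert sees per step, assuming uniform routing."""
    global_tokens = TOKENS_PER_GPU * ep_size
    return global_tokens * TOP_K / TOTAL_EXPERTS

for ep_size in (8, 16, 32, 64):
    experts_per_gpu = TOTAL_EXPERTS // ep_size
    reuse = tokens_per_expert(ep_size)
    print(f"EP={ep_size:2d}: {experts_per_gpu:2d} experts/GPU, "
          f"~{reuse:.0f} tokens reuse each expert's weights")
```

Real routing is not uniform, so actual reuse per expert will vary around these averages.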

---

### 🤔 Think About It: How Can We Solve This?

Now that we've identified the weight-loading bottleneck, think about this:

**Question:** How can we reduce the number of experts stored on each GPU? What's the key insight?

<details>
<summary>üëâ Click to reveal answer</summary>

**The Key Insight:**
If we **distribute the same 256 experts across MORE GPUs**, each GPU will store fewer experts!

**Example:**
- **Small EP (8 GPUs):** 256 experts ÷ 8 = 32 experts/GPU ❌ (too many!)
- **Wide EP (64 GPUs):** 256 experts ÷ 64 = 4 experts/GPU ✅ (much better!)

**The Solution:**
Instead of using 8 GPUs, use 32-64 GPUs (or more) to spread experts thinner. This reduces weight-loading pressure per GPU while maintaining the same total model capacity.

**Trade-off:** We need more GPUs, but each GPU becomes more efficient!

**Next step:** Let's see how this works in practice! 🚀
</details>

---

### 8.3 The Solution: Wide-EP (Distribute Experts Across More GPUs)

The solution is elegant: **spread experts across many more GPUs**, reducing the experts-per-GPU ratio.

**This is Wide Expert Parallelism (Wide-EP):**
- Instead of 8 GPUs, use 32-64 GPUs (or more)
- Each GPU stores fewer experts (e.g., 4 instead of 32)
- **Result**: Less weight-loading pressure per GPU, better GroupGEMM efficiency

### Figure 1: Small-Scale vs Large-Scale EP

![Figure 1: Small vs Large Scale EP](images/figure1_small_vs_large_scale_ep.gif)

*Animation showing how small-scale EP deploys many experts per GPU, while large-scale EP (Wide-EP) spreads fewer experts per GPU across a much larger cluster, enabling efficient scaling of MoE layers.*

Let's visualize how distributing DeepSeek-R1's 256 experts across different EP configurations reduces the burden per GPU:

---


### 🤔 Think About It: How Does This Solve the Problem?

Now that we've seen the solution (distributing experts across more GPUs), think about this:

**Question:** How does Wide-EP (fewer experts per GPU) actually solve the weight-loading bottleneck we identified?

<details>
<summary>üëâ Click to reveal answer</summary>

**The Solution:**
By distributing experts across more GPUs:
- **Each GPU stores fewer experts** (e.g., 4 instead of 32)
- **Less weight-loading overhead** per GPU (fewer experts to load from memory)
- **Better weight reuse** within GroupGEMM kernels (same expert weights reused for more tokens)
- **Higher arithmetic intensity** (more FLOPs per byte loaded)

**Trade-off:** We need more GPUs, but the per-GPU efficiency improves significantly.

**Key insight:** Wide-EP trades GPU count for per-GPU efficiency. But there's a catch: we need high-bandwidth communication to make this practical!

**Next step:** What hardware makes this practical? 🤔
</details>

---


In [None]:
# Wide-EP: Reducing Experts per GPU
import matplotlib.pyplot as plt
import numpy as np

# Simple visualization: How Wide-EP distributes 256 experts
TOTAL_EXPERTS = 256
ep_configs = [
    ("Small EP\n(8 GPUs)", 8),
    ("Medium EP\n(16 GPUs)", 16),
    ("Wide EP\n(32 GPUs)", 32),
    ("Very Wide EP\n(64 GPUs)", 64)
]

print("=" * 70)
print("Wide-EP: Distributing 256 Experts Across GPUs")
print("=" * 70)
print("\nDeepSeek-V3 has 256 experts. How many experts per GPU?\n")

labels = []
experts_per_gpu = []
colors = []

for label, num_gpus in ep_configs:
    epg = TOTAL_EXPERTS // num_gpus
    experts_per_gpu.append(epg)
    labels.append(label)
    
    # Color code: red for many experts/GPU, green for few
    if epg >= 20:
        colors.append('#ff6b6b')  # Red - too many!
    elif epg >= 10:
        colors.append('#ffa500')  # Orange - moderate
    else:
        colors.append('#66cc99')  # Green - good!
    
    print(f"{label:20s}: {epg:2d} experts/GPU")

print("\n" + "=" * 70)
print("Key Insight:")
print("  • Fewer experts per GPU = Less weight-loading pressure")
print("  • Wide-EP (32-64 GPUs) enables efficient GroupGEMM operations")
print("  • Trade-off: More GPUs needed, but higher throughput per GPU")
print("=" * 70)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(range(len(labels)), experts_per_gpu, color=colors, 
              edgecolor='black', linewidth=1.5, width=0.6)

ax.set_title('Wide-EP: Experts per GPU (256 total experts)', 
             fontsize=14, fontweight='bold')
ax.set_xlabel('EP Configuration', fontsize=12)
ax.set_ylabel('Experts per GPU', fontsize=12)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, fontsize=10)
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, value in zip(bars, experts_per_gpu):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2., height + 0.5,
            f'{value}', ha='center', va='bottom', 
            fontweight='bold', fontsize=12)

# Add annotation
ax.annotate('More efficient!\nLess weight-loading', 
            xy=(3, experts_per_gpu[3]), 
            xytext=(2.5, 15),
            arrowprops=dict(arrowstyle='->', color='green', lw=2),
            fontsize=11, color='green', fontweight='bold')

plt.tight_layout()
plt.show()

### 8.4 System Architecture: GB200 NVL72 Rack-Scale System

Now we understand _why_ Wide-EP helps (fewer experts per GPU), but how do we make it practical? The answer is **GB200 NVL72**.

**GB200 NVL72 Key Features:**
- **72 GPUs** in a single coherent NVLink domain
- **130 TB/s aggregate bandwidth** for GPU-to-GPU communication
- Enables efficient all-to-all token routing during decode phase
- Custom NCCL kernels handle non-static communication sizes with CUDA graph compatibility

**Why EP=64 (not 72)?**
- Clean division: 256 experts ÷ 64 = 4 experts/GPU (integer)
- Power-of-two efficiency for collectives (64-way vs 72-way)
- Reserve 8 GPUs for prefill/MTP/headroom in disaggregated serving
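
The divisibility argument is easy to check. The small helper below (a name invented for this sketch) compares how 256 experts split across 64 and 72 GPUs:

```python
TOTAL_EXPERTS = 256

def split(ep_size: int):
    """Return (experts per GPU, leftover experts) for an even split attempt."""
    return divmod(TOTAL_EXPERTS, ep_size)

for ep_size in (64, 72):
    per_gpu, leftover = split(ep_size)
    status = "even split" if leftover == 0 else f"{leftover} experts left over"
    print(f"EP={ep_size}: {per_gpu} experts/GPU, {status}")
```

With 72 GPUs the experts cannot be spread evenly, so some GPUs would hold more experts than others unless slots are padded or experts are replicated.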

### Figure 3: MoE Deployment on NVL72

![Figure 3: MoE Deployment NVL72](images/figure3_moe_deployment_nvl72.gif)

*Schematic diagram showing an MoE deployment with 232 experts per GPU (4 experts × 58 MoE layers) and only 8 experts activated per layer, coordinated across 72 GPUs in a GB200 NVL72 NVLink domain.*

**EPLB (Expert Parallel Load Balancer):**
- **Static mode**: Pre-computed expert → GPU mappings from historical patterns
- **Online mode**: Runtime redistribution with non-blocking weight updates between forward passes
- Prevents "hot experts" from concentrating on the same GPU

---

### 🤔 Think About It: How Does NVL72 Make This Practical?

Now that we understand the NVL72 architecture, think about this:

**Question:** Why is NVL72's 130 TB/s NVLink bandwidth critical for making Wide-EP practical? What would happen without it?

<details>
<summary>üëâ Click to reveal answer</summary>

**Why NVL72 Matters:**
- **Communication overhead:** With Wide-EP, tokens must be routed across many GPUs via all-to-all collectives
- **130 TB/s bandwidth** hides this communication cost, making the all-to-all operations efficient
- **Without high bandwidth:** Communication would become the bottleneck, negating the benefits of Wide-EP
- **Coherent NVLink domain:** Enables efficient token routing without going through slow network interfaces

**The Key Insight:**
Wide-EP solves the weight-loading problem, but creates a communication problem. NVL72's massive bandwidth solves the communication problem, making Wide-EP practical!

**Without NVL72:** You'd need to use fewer GPUs (smaller EP), which brings back the weight-loading bottleneck.

**Next step:** Does this actually work in practice? Let's see the results! 📊
</details>

---

### 8.5 Performance Results: Up to 1.8× Throughput Gain

So does Wide-EP actually work? NVIDIA's benchmarks show significant performance improvements:

### Figure 5: EP Throughput Comparison

![Figure 5: Throughput](images/figure5_ep_throughput_comparison.webp)

*Large-scale Expert Parallelism (EP=32) delivers up to **1.8× higher output token throughput per GPU** compared to small EP (EP=8) at 100 tokens/sec per user. Both configurations leverage disaggregated serving and multi-token prediction (MTP).*

**Key Findings:**
- **Super-linear scaling**: 4× more GPUs → 7.2× total throughput (not just 4×)
- **Per-GPU efficiency improves** with larger EP configurations
- Enabled by NVL72's 130 TB/s NVLink hiding communication overhead
- Works best with disaggregated serving (separate prefill/decode pools)

**Why super-linear?**
1. Fewer experts per GPU → better GroupGEMM efficiency
2. Higher weight reuse → better arithmetic intensity
3. NVL72 bandwidth offsets communication costs
4. EPLB prevents load imbalance
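
The first two points can be made concrete with a small sketch. It assumes a fixed decode batch per GPU (128 tokens, an illustrative value) and perfectly uniform routing, which real workloads do not have. Under those assumptions each GPU handles the same number of routed token copies at every EP size, but those copies are spread over fewer resident experts, so each loaded expert weight is reused for more tokens.

```python
NUM_EXPERTS = 256      # routed experts per MoE layer (from the text)
TOP_K = 8              # experts activated per token (from the text)
TOKENS_PER_GPU = 128   # ASSUMPTION: decode batch held by each GPU


def weight_reuse(ep_size: int) -> tuple[int, float]:
    """Return (experts resident per GPU, average tokens each expert processes)."""
    experts_per_gpu = NUM_EXPERTS // ep_size
    # Every GPU in the EP group feeds tokens into the same pool of experts.
    routed_copies = TOKENS_PER_GPU * ep_size * TOP_K
    tokens_per_expert = routed_copies / NUM_EXPERTS  # assumes uniform routing
    return experts_per_gpu, tokens_per_expert


for ep_size in (8, 32, 64):
    experts, tokens = weight_reuse(ep_size)
    print(f"EP={ep_size:>2}: {experts:>2} experts/GPU, ~{tokens:.0f} tokens per expert weight load")
```

Tokens per expert grow in proportion to the EP size, which is the "higher weight reuse" effect: more useful FLOPs for every byte of expert weights read from memory.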

---


### 8.6 When to Use Wide-EP: Practical Guidance

Wide-EP is powerful but not always the right choice. Here's when it makes sense:

**✅ Good candidates for Wide-EP:**
- **Large MoE models** with many experts (e.g., DeepSeek-R1 with 256 experts)
- **Latency-constrained throughput** scenarios (need high tokens/sec at fixed latency)
- **High-bandwidth interconnect** available (NVL72, multi-node NVLink clusters)
- **Variable workloads** where EPLB can balance load dynamically

**❌ When to avoid Wide-EP:**
- Small models with few experts (communication overhead dominates)
- Low-latency, low-throughput scenarios (overhead not justified)
- Limited interconnect bandwidth (communication becomes bottleneck)
- Single-node deployments with < 8 GPUs

**Integration with NVIDIA Dynamo:**

Wide-EP works best when orchestrated by **NVIDIA Dynamo** for disaggregated serving:

| Component | Role |
|-----------|------|
| **NVIDIA Dynamo** | Orchestration: disaggregates prefill/decode, SLA-aware autoscaling, dynamic routing |
| **TensorRT-LLM Wide-EP** | Execution: expert-parallel MoE with optimized kernels, FP8, CUDA graphs, EPLB |

**Together they enable:**
- Prefill pool scales independently from decode pool
- Decode pool uses Wide-EP (e.g., EP=64) for throughput
- Dynamo Planner adapts to ISL/OSL fluctuations in real-time
- EPLB balances expert load within the decode pool

**Reference:** [NVIDIA Blog: Scaling Large MoE Models with Wide Expert Parallelism on NVL72](https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/)

---


## 🧠 Knowledge Check: Pop Quiz

Test your understanding! Try to answer each question before revealing the answer.

---

### Question 1: Active Parameters

**Q:** DeepSeek-R1 has 671B total parameters but only ~37B active per token. Why is this efficient?

<details>
<summary>👉 Click to reveal answer</summary>

**Answer:**
- Only 8 out of 256 experts are activated per token (sparse activation)
- Per-token compute scales with the ~37B active parameters rather than all 671B (roughly 18× fewer parameters touched per token)
- Enables "dense model quality at sparse model cost"
- Active parameters = dense layers + (8 active experts × expert size)

**Key insight:** Sparse activation gives you the benefits of a huge model without paying the full computational cost!
</details>

---

### Question 2: EP vs TP

**Q:** What's the key difference between Expert Parallelism (EP) and Tensor Parallelism (TP)?

<details>
<summary>👉 Click to reveal answer</summary>

**Answer:**

**TP (Tensor Parallelism):**
- Splits individual tensor computations across GPUs
- Applies to ALL layers (dense and MoE)
- Each GPU computes a slice of the same operation
- Example: 16K-dim matrix → 8 GPUs × 2K slices

**EP (Expert Parallelism):**
- Distributes entire experts across GPUs
- Applies ONLY to MoE layers
- Each GPU holds complete expert weights for a subset of experts
- Example: 256 experts → 64 GPUs × 4 experts/GPU

**Key difference:** TP splits operations, EP distributes whole experts!
</details>

---

### Question 3: NVL72 Mystery

**Q:** NVL72 has 72 GPUs. Why do we use EP=64 instead of EP=72 for DeepSeek-R1?

<details>
<summary>👉 Click to reveal answer</summary>

**Answer:**

**Multiple reasons:**
1. **Clean division**: 256 experts ÷ 64 = 4 experts/GPU (integer division)
2. **Power-of-two efficiency**: 64-way collectives optimize better than 72-way
3. **Topology alignment**: Maps cleanly to 8×8 groupings; reduces cross-island traffic
4. **Disaggregated serving**: Reserve 8 GPUs for prefill/MTP/headroom while 64 handle decode
5. **Load balancing**: Equal experts per GPU simplifies EPLB algorithms

**Key insight:** EP size is a design choice for performance, not dictated by raw GPU count!
</details>

---

### Question 4: EPLB Mechanics

**Q:** Does EPLB (Expert Parallel Load Balancer) change which expert processes a token?

<details>
<summary>👉 Click to reveal answer</summary>

**Answer: NO!** 

EPLB changes **WHERE** experts live, NOT which expert processes a token.

**What stays the same:**
- Router decision is fixed by learned model weights
- Token "quantum" → Router → Expert 0, Expert 45, Expert 123
- These expert IDs never change

**What EPLB changes:**
- Physical GPU placement of experts
- Example: Move Expert 0 from GPU 0 to GPU 5 to balance load
- Token still goes to Expert 0, just travels to a different GPU

**Analogy:** Like reassigning doctors to different hospital wings: patients still see the same specialists, just in different locations!
</details>

---

### Question 5: GroupGEMM Magic

**Q:** Why does Wide-EP (fewer experts per GPU) improve GroupGEMM efficiency?

<details>
<summary>👉 Click to reveal answer</summary>

**Answer:**

**Weight-loading bottleneck:**
- In each MoE layer, a GPU must load the weights of every expert that its current tokens activate
- More experts per GPU = more weight-loading overhead
- GroupGEMM needs weights in registers/on-chip memory before multiplication

**Wide-EP solution:**
- Fewer experts per GPU = less weight-loading pressure
- Higher weight reuse within the GroupGEMM kernel
- Better arithmetic intensity (more FLOPs per byte loaded)
- Example: 32 experts/GPU → 4 experts/GPU (8× reduction in weight pressure)

**Trade-off:**
- Need more GPUs
- BUT: 130 TB/s NVLink on NVL72 hides communication overhead
- Result: Up to 1.8× higher per-GPU throughput

**Key insight:** Distribute experts to reduce memory pressure and improve compute efficiency!
</details>

---

### Bonus Question: Throughput Math 📊

**Q:** If EP=8 achieves 100 tokens/sec/GPU, and EP=32 achieves 1.8× improvement, what's the total cluster throughput gain when scaling from 8 to 32 GPUs?

<details>
<summary>👉 Click to reveal answer</summary>

**Answer:**

**EP=8 baseline:**
- 8 GPUs × 100 tokens/sec/GPU = 800 tokens/sec total

**EP=32 with 1.8× per-GPU improvement:**
- 32 GPUs × (100 × 1.8) tokens/sec/GPU = 32 × 180 = **5,760 tokens/sec total**

**Gain:**
- 5,760 ÷ 800 = **7.2× total throughput increase**
- This is **super-linear scaling**! (4× more GPUs → 7.2× throughput)
- Enabled by NVL72's high-bandwidth interconnect hiding communication costs

**Key insight:** Wide-EP doesn't just scale linearly: it improves per-GPU efficiency, giving you super-linear gains! 🚀
</details>

---


## Summary

Congratulations! You've completed Lab 3.1. You now understand:

- **DP/TP/PP/SP**: apply to all models; solve memory/compute distribution differently.
- **EP**: MoE-only; route tokens to a small set of experts per token.
- **Dynamo**: supports Standard EP, WideEP, DeepEP, and dynamic EPLB via backends.
- **WideEP + EPLB**: production-proven for scaling throughput and balancing load.
- **NVL72 insights**: Large-scale EP with high-bandwidth interconnects delivers up to 1.8× per-GPU throughput gains.

### 🚀 Next Steps

Ready to deploy these concepts in production? Continue to:

**Lab 3.2: Wide EP Production Deployment** (`lab3.2-wide-ep-deployment.ipynb`)

In Lab 3.2, you'll learn:
- Kubernetes deployment with Dynamo Operator
- Multi-node SGLang deployment with EPLB
- TensorRT-LLM Wide EP configuration
- Monitoring, troubleshooting, and performance tuning
- Production best practices

---

## 9. Further Reading

### MoE Architecture and Comparisons
- [The Big LLM Architecture Comparison (Sebastian Raschka)](https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) - Detailed comparison of DeepSeek-V3, Qwen3, and other MoE architectures

### Expert Parallelism Deployment
- [Scaling Large MoE Models with Wide Expert Parallelism on NVL72](https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/) - NVIDIA's technical deep-dive on Wide-EP deployment strategies
- [TensorRT-LLM Wide-EP Examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep)

### Dynamo and Backends
- [NVIDIA Dynamo Documentation](https://github.com/NVIDIA/Dynamo)
- [SGLang Documentation](https://github.com/sgl-project/sglang)