# Module 3: Building Mixture of Experts (MoE)

In this notebook, we'll dive deep into **Mixture of Experts** (MoE) - a powerful technique for scaling language models efficiently.

## Learning Objectives

By the end of this notebook, you will:
1. Understand the motivation behind MoE architectures
2. Learn how expert routing works (Top-K gating)
3. Implement a complete MoE layer from scratch
4. Explore load balancing and routing strategies
5. Compare MoE vs dense models in terms of parameters and compute
6. Build a complete MoE transformer

## What You'll Learn

- **Sparse vs Dense Models**: Why not all parameters need to be active
- **Expert Routing**: How to intelligently route tokens to specialists
- **Load Balancing**: Ensuring experts are used efficiently
- **Scaling Laws**: Get more capacity without proportional compute cost

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Dict

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Configure plotting
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

## Part 1: Why Mixture of Experts?

### The Problem: Scaling Language Models

Larger models generally perform better, but they're expensive:
- **GPT-3**: 175B parameters, huge compute cost
- **Training**: Requires massive GPU clusters
- **Inference**: Slow and costly

### The Insight: Conditional Computation

Not all parameters need to be active for every input!

**Dense Model**:
```
Every token → All parameters → Output
500M params active per token
```

**MoE Model**:
```
Every token → Route to 2 of 8 experts → Output  
Total: 500M params, Active: ~125M params per token
```

### Benefits:

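1. **More parameters, modest compute**: With 8 experts and top-2 routing, an MoE layer holds 8x the FFN parameters of a single dense FFN but runs only 2 expert FFNs per token
2. **Specialization**: Different experts learn different patterns
3. **Better scaling**: Compute grows with the active parameters, not the total parameter count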
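
To make the arithmetic above concrete, here is a minimal sketch. It assumes the same simplification as the diagrams: all parameters sit in equally sized experts, and each token uses `top_k` of `num_experts` of them. The helper name `active_params` is hypothetical.

```python
def active_params(total_params: float, num_experts: int, top_k: int) -> float:
    """Parameters touched per token, assuming all parameters sit in equally sized experts."""
    return total_params * top_k / num_experts


# 500M total parameters, 8 experts, top-2 routing
print(active_params(500e6, num_experts=8, top_k=2) / 1e6)  # 125.0
```

Real models also carry attention and embedding parameters that are active for every token, so the true active fraction is somewhat higher than this estimate.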

### Visualizing the Concept

In [None]:
# Compare parameter count vs active parameters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Model sizes
model_sizes = ["100M", "500M", "1B", "5B", "10B"]
x_pos = np.arange(len(model_sizes))

# Dense model: all parameters active
dense_total = [100, 500, 1000, 5000, 10000]
dense_active = dense_total.copy()

# MoE model: only fraction active (assume 25% with 8 experts, top-2)
moe_total = [100, 500, 1000, 5000, 10000]
moe_active = [25, 125, 250, 1250, 2500]

# Plot 1: Total vs Active Parameters
width = 0.35
ax1.bar(x_pos - width / 2, dense_active, width, label="Dense (Active)", alpha=0.8)
ax1.bar(x_pos + width / 2, moe_active, width, label="MoE (Active)", alpha=0.8)
ax1.set_ylabel("Active Parameters (Millions)")
ax1.set_xlabel("Model Size")
ax1.set_title("Active Parameters: Dense vs MoE")
ax1.set_xticks(x_pos)
ax1.set_xticklabels(model_sizes)
ax1.legend()
ax1.grid(axis="y", alpha=0.3)

# Plot 2: Compute efficiency (FLOPs per forward pass)
# Assume compute proportional to active parameters
ax2.plot(
    dense_total, dense_active, "o-", label="Dense Model", linewidth=2, markersize=8
)
ax2.plot(moe_total, moe_active, "s-", label="MoE Model", linewidth=2, markersize=8)
ax2.set_xlabel("Total Parameters (Millions)")
ax2.set_ylabel("Compute Cost (Arbitrary Units)")
ax2.set_title("Compute Scaling: Dense vs MoE")
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Insight: MoE compute scales with active parameters, not total parameters!")
print("At 10B parameters: Dense uses 10B, MoE uses only 2.5B per token")

## Part 2: Expert Routing - The Core Mechanism

### How Do We Choose Which Experts to Use?

The **router** (also called **gating network**) decides which experts process each token.

### Top-K Routing Algorithm:

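1. **Router Network**: A small linear layer scores each expert
2. **Softmax**: Convert the scores to probabilities over all experts
3. **Top-K Selection**: Keep the K experts with the highest probabilities
4. **Renormalization**: Rescale the K kept weights so they sum to 1
5. **Weighted Combination**: Combine the selected experts' outputs using these weights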

$$\text{Router}(x) = \text{Top-K}(\text{Softmax}(W_r \cdot x))$$
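
The routing steps can be traced on a toy example. This is a minimal sketch with made-up logits for a single token and four experts; the values are illustrative assumptions, not output from a trained router.

```python
import torch
import torch.nn.functional as F

# Hypothetical router logits for one token over 4 experts
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])

probs = F.softmax(logits, dim=-1)  # probabilities over all experts
top_w, top_idx = torch.topk(probs, k=2, dim=-1)  # keep the 2 most likely experts
top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize so kept weights sum to 1

print(top_idx.tolist())  # [[0, 1]]
print(top_w)
```

After renormalization the two weights are roughly 0.73 and 0.27, which is the same result as taking a softmax over the two selected logits alone.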

In [None]:
class TopKRouter(nn.Module):
    """
    Top-K router for Mixture of Experts.

    Routes each token to the top-K experts based on learned routing scores.
    """

    def __init__(
        self,
        hidden_size: int,
        num_experts: int,
        top_k: int = 2,
        use_load_balancing: bool = True,
        load_balancing_weight: float = 0.01,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_experts = num_experts
        self.top_k = top_k
        self.use_load_balancing = use_load_balancing
        self.load_balancing_weight = load_balancing_weight

        # Router is a simple linear layer
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, Dict]:
        """
        Route tokens to top-K experts.

        Args:
            x: Input tensor (batch_size, seq_len, hidden_size)

        Returns:
            expert_indices: Indices of selected experts (batch, seq_len, top_k)
            expert_weights: Weights for selected experts (batch, seq_len, top_k)
            stats: Dictionary of routing statistics
        """
        batch_size, seq_len, hidden_size = x.shape

        # Flatten batch and sequence dimensions
        # reshape (rather than view) also handles non-contiguous inputs
        x_flat = x.reshape(-1, hidden_size)  # (batch * seq_len, hidden_size)

        # Compute routing logits
        logits = self.gate(x_flat)  # (batch * seq_len, num_experts)

        # Apply softmax to get probabilities
        routing_probs = F.softmax(logits, dim=-1)

        # Select top-K experts
        expert_weights, expert_indices = torch.topk(routing_probs, self.top_k, dim=-1)

        # Normalize weights to sum to 1
        expert_weights = expert_weights / expert_weights.sum(dim=-1, keepdim=True)

        # Reshape back
        expert_indices = expert_indices.view(batch_size, seq_len, self.top_k)
        expert_weights = expert_weights.view(batch_size, seq_len, self.top_k)

        # Compute statistics for monitoring
        stats = self._compute_stats(routing_probs, expert_indices)

        # Compute load balancing loss if enabled
        if self.use_load_balancing and self.training:
            stats["load_balancing_loss"] = self._load_balancing_loss(routing_probs)

        return expert_indices, expert_weights, stats

    def _compute_stats(
        self, routing_probs: torch.Tensor, expert_indices: torch.Tensor
    ) -> Dict:
        """Compute routing statistics for monitoring."""
        stats = {}

        # Expert utilization (how many tokens are routed to each expert)
        expert_counts = torch.zeros(self.num_experts, device=routing_probs.device)
        for i in range(self.num_experts):
            expert_counts[i] = (expert_indices == i).sum()

        # Normalize to get distribution
        expert_distribution = expert_counts / expert_counts.sum()
        stats["expert_distribution"] = expert_distribution

        # Balance metric (how evenly distributed)
        # Perfect balance = 1.0; lower is worse
        # (1 / num_experts if a single expert receives every assignment)
        uniform_dist = torch.ones_like(expert_distribution) / self.num_experts
        balance_metric = (
            1.0 - torch.sum(torch.abs(expert_distribution - uniform_dist)) / 2.0
        )
        stats["expert_balance_metric"] = balance_metric.item()

        # Routing entropy (higher = more diverse routing)
        entropy = -torch.sum(
            routing_probs * torch.log(routing_probs + 1e-10), dim=-1
        ).mean()
        stats["routing_entropy"] = entropy.item()

        return stats

    def _load_balancing_loss(self, routing_probs: torch.Tensor) -> torch.Tensor:
        """
        Compute load balancing loss to encourage even expert utilization.

        This follows the auxiliary loss from the Switch Transformer paper.
        The per-expert token fractions use top-1 assignments, even when
        top_k > 1.
        """
        # Average routing probability per expert
        expert_probs = routing_probs.mean(dim=0)  # (num_experts,)

        # Count how many tokens are routed to each expert
        top1_indices = routing_probs.argmax(dim=-1)
        expert_counts = torch.zeros_like(expert_probs)
        for i in range(self.num_experts):
            expert_counts[i] = (top1_indices == i).float().mean()

        # Load balancing loss
        loss = self.num_experts * torch.sum(expert_probs * expert_counts)
        return self.load_balancing_weight * loss


# Test the router
hidden_size = 512
num_experts = 8
top_k = 2
batch_size = 4
seq_len = 10

router = TopKRouter(hidden_size, num_experts, top_k)
x = torch.randn(batch_size, seq_len, hidden_size)

expert_indices, expert_weights, stats = router(x)

print(f"Input shape: {x.shape}")
print(f"Expert indices shape: {expert_indices.shape}")
print(f"Expert weights shape: {expert_weights.shape}")
print("\nRouting Statistics:")
print(f"  Balance metric: {stats['expert_balance_metric']:.3f} (1.0 = perfect balance)")
print(f"  Routing entropy: {stats['routing_entropy']:.3f}")
print("\nExample routing for first token:")
print(f"  Selected experts: {expert_indices[0, 0].tolist()}")
print(f"  Expert weights: {expert_weights[0, 0].tolist()}")

### Visualizing Expert Routing

In [None]:
def visualize_routing(expert_indices, expert_weights, num_experts=8):
    """
    Visualize which experts are chosen for different tokens.
    """
    batch_size, seq_len, top_k = expert_indices.shape

    # Take first batch
    indices = expert_indices[0].cpu().numpy()
    weights = expert_weights[0].cpu().numpy()

    # Create routing matrix (seq_len x num_experts)
    routing_matrix = np.zeros((seq_len, num_experts))
    for i in range(seq_len):
        for k in range(top_k):
            expert_idx = indices[i, k]
            weight = weights[i, k]
            routing_matrix[i, expert_idx] = weight

    # Plot
    plt.figure(figsize=(12, 6))
    sns.heatmap(
        routing_matrix,
        annot=True,
        fmt=".2f",
        cmap="YlOrRd",
        xticklabels=[f"E{i}" for i in range(num_experts)],
        yticklabels=[f"T{i}" for i in range(seq_len)],
        cbar_kws={"label": "Routing Weight"},
    )
    plt.xlabel("Expert")
    plt.ylabel("Token Position")
    plt.title("Token-to-Expert Routing Pattern")
    plt.tight_layout()
    plt.show()


visualize_routing(expert_indices, expert_weights, num_experts)
print("Each row shows which experts process that token (brighter = higher weight)")
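
The balance metric reported in the routing statistics is one minus the total variation distance between the observed expert usage and the uniform distribution. A minimal standalone sketch (the helper name `balance_metric` and the token counts are illustrative assumptions) shows its range:

```python
import torch


def balance_metric(counts: torch.Tensor) -> float:
    """1 minus the total variation distance between expert usage and uniform."""
    dist = counts / counts.sum()
    uniform = torch.full_like(dist, 1.0 / dist.numel())
    return (1.0 - (dist - uniform).abs().sum() / 2.0).item()


print(balance_metric(torch.tensor([10.0, 10.0, 10.0, 10.0])))  # 1.0
print(balance_metric(torch.tensor([40.0, 0.0, 0.0, 0.0])))  # 0.25
```

With four experts, the fully collapsed case scores 0.25 rather than 0, so the floor of the metric is `1 / num_experts`.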

## Part 3: Building the Expert Layer

Each expert is typically a feed-forward network (FFN), identical in architecture but with different learned parameters.

In [None]:
class Expert(nn.Module):
    """
    Single expert: a feed-forward network.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, dropout: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.activation = nn.GELU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through expert."""
        x = self.fc1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x


# Test a single expert
expert = Expert(hidden_size=512, intermediate_size=2048)
x = torch.randn(4, 10, 512)
output = expert(x)
print(f"Expert input: {x.shape}")
print(f"Expert output: {output.shape}")
print(f"Expert parameters: {sum(p.numel() for p in expert.parameters()):,}")

## Part 4: Complete MoE Layer

Now let's combine the router and experts into a complete MoE layer.
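
As a sketch of what the layer computes for one token $x$: the notation here is introduced only for illustration, with $\mathcal{T}(x)$ the set of top-K experts chosen by the router, $g_i(x)$ the renormalized routing weights, and $E_i$ the $i$-th expert FFN.

$$y = \sum_{i \in \mathcal{T}(x)} g_i(x) \, E_i(x), \qquad \sum_{i \in \mathcal{T}(x)} g_i(x) = 1$$

Experts outside $\mathcal{T}(x)$ are never evaluated for that token, which is where the compute savings come from.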

In [None]:
class MoELayer(nn.Module):
    """
    Mixture of Experts layer with Top-K routing.
    """

    def __init__(
        self,
        hidden_size: int,
        intermediate_size: int,
        num_experts: int = 8,
        top_k: int = 2,
        dropout: float = 0.1,
        use_load_balancing: bool = True,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_experts = num_experts
        self.top_k = top_k

        # Create experts
        self.experts = nn.ModuleList(
            [
                Expert(hidden_size, intermediate_size, dropout)
                for _ in range(num_experts)
            ]
        )

        # Create router
        self.router = TopKRouter(hidden_size, num_experts, top_k, use_load_balancing)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, Dict]:
        """
        Forward pass through MoE layer.

        Args:
            x: Input tensor (batch_size, seq_len, hidden_size)

        Returns:
            output: MoE output (batch_size, seq_len, hidden_size)
            stats: Routing statistics
        """
        batch_size, seq_len, hidden_size = x.shape

        # Get routing decisions
        expert_indices, expert_weights, stats = self.router(x)

        # Initialize output
        output = torch.zeros_like(x)

        # Process each expert
        # Note: This is a naive implementation. Production systems use
        # more efficient batching strategies.
        for expert_idx in range(self.num_experts):
            # Top-k slots that selected this expert: (batch, seq_len, top_k)
            slot_mask = expert_indices == expert_idx

            # Tokens routed to this expert in any slot: (batch, seq_len)
            expert_mask = slot_mask.any(dim=-1)

            if not expert_mask.any():
                continue  # No tokens for this expert

            # Get tokens for this expert
            expert_input = x[expert_mask]  # (num_tokens, hidden_size)

            # Process through expert
            expert_output = self.experts[expert_idx](expert_input)

            # topk returns distinct experts per token, so at most one slot
            # matches; summing over slots picks out that slot's weight
            token_weights = (expert_weights * slot_mask).sum(dim=-1)

            # Rows of expert_output follow the same order as expert_mask
            output[expert_mask] += (
                token_weights[expert_mask].unsqueeze(-1) * expert_output
            )

        return output, stats


# Test MoE layer
moe_layer = MoELayer(hidden_size=512, intermediate_size=2048, num_experts=8, top_k=2)

x = torch.randn(4, 10, 512)
output, stats = moe_layer(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print("\nMoE Layer Statistics:")
print(f"  Expert balance: {stats['expert_balance_metric']:.3f}")
print(f"  Routing entropy: {stats['routing_entropy']:.3f}")
print(f"\nTotal parameters: {sum(p.numel() for p in moe_layer.parameters()):,}")
print(
    f"Active expert parameters per token: ~{sum(p.numel() for p in moe_layer.experts[0].parameters()) * moe_layer.top_k:,}"
)

## Part 5: Load Balancing - A Critical Challenge

### The Problem

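Without an explicit incentive, routing tends to collapse: experts that receive more tokens get more gradient updates, improve faster, and attract even more tokens, while the remaining experts are ignored.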

### Solutions:

1. **Load Balancing Loss**: Penalize uneven expert usage
2. **Capacity Factor**: Limit tokens per expert
3. **Expert Dropout**: Randomly drop experts during training
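
The load balancing loss from item 1 can be sketched in isolation. This is a minimal version of the Switch Transformer style auxiliary loss, $N \sum_i f_i P_i$, using top-1 dispatch fractions $f_i$ and mean router probabilities $P_i$. The function name `switch_aux_loss` and the toy logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def switch_aux_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss N * sum_i f_i * P_i with top-1 dispatch fractions f_i."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)  # (tokens, experts)
    mean_prob = probs.mean(dim=0)  # P_i: average router probability
    top1 = probs.argmax(dim=-1)  # expert chosen by each token
    frac = F.one_hot(top1, num_experts).float().mean(dim=0)  # f_i: token fraction
    return num_experts * torch.sum(frac * mean_prob)


# 4 tokens, each strongly preferring a different expert
balanced = torch.eye(4) * 10.0
# 4 tokens, all strongly preferring expert 0
collapsed = torch.tensor([[10.0, 0.0, 0.0, 0.0]]).repeat(4, 1)

print(switch_aux_loss(balanced).item())
print(switch_aux_loss(collapsed).item())
```

The balanced case gives a loss of about 1.0, the minimum, while the collapsed case gives a value close to 4.0 (the number of experts), so minimizing this term pushes the router toward even usage.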

Let's visualize the effect of load balancing:

In [None]:
# Compare two freshly initialized routers: one reports the load balancing loss, one does not
# Note: no training happens in this cell, and `use_load_balancing` only controls
# whether the auxiliary loss is returned in `stats`; it does not change routing.
torch.manual_seed(42)
router_with_lb = TopKRouter(512, 8, 2, use_load_balancing=True)
router_without_lb = TopKRouter(512, 8, 2, use_load_balancing=False)

# A larger batch gives smoother utilization statistics
x = torch.randn(32, 20, 512)

# Get routing decisions
indices_with, weights_with, stats_with = router_with_lb(x)
indices_without, weights_without, stats_without = router_without_lb(x)

# Plot expert utilization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# With load balancing
dist_with = stats_with["expert_distribution"].cpu().numpy()
ax1.bar(range(8), dist_with, color="steelblue", alpha=0.8)
ax1.axhline(y=1 / 8, color="red", linestyle="--", label="Ideal (uniform)")
ax1.set_xlabel("Expert ID")
ax1.set_ylabel("Utilization")
ax1.set_title(
    f"With Load Balancing (Balance: {stats_with['expert_balance_metric']:.3f})"
)
ax1.legend()
ax1.set_ylim(0, 0.3)

# Without load balancing
dist_without = stats_without["expert_distribution"].cpu().numpy()
ax2.bar(range(8), dist_without, color="coral", alpha=0.8)
ax2.axhline(y=1 / 8, color="red", linestyle="--", label="Ideal (uniform)")
ax2.set_xlabel("Expert ID")
ax2.set_ylabel("Utilization")
ax2.set_title(
    f"Without Load Balancing (Balance: {stats_without['expert_balance_metric']:.3f})"
)
ax2.legend()
ax2.set_ylim(0, 0.3)

plt.tight_layout()
plt.show()

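print("Note: both routers are untrained, so differences here come from random initialization.")
print("Load balancing only evens out expert usage once its loss is added to the training objective.")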

## Part 6: MoE Transformer Block

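Let's integrate MoE into a complete transformer block. MoE typically replaces the FFN in only some layers, for example every N-th layer (the `moe_frequency` setting in the exercise at the end of this notebook). The block below exposes this choice through its `use_moe` flag.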
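
One way to turn an "every N layers" rule into per-layer flags is sketched here. The helper `moe_layer_schedule` is hypothetical, and placing MoE on every `moe_frequency`-th layer counting from 1 is an assumption. Other placements are equally valid.

```python
from typing import List


def moe_layer_schedule(num_layers: int, moe_frequency: int) -> List[bool]:
    """True where a layer uses an MoE FFN: every `moe_frequency`-th layer, counting from 1."""
    return [(i + 1) % moe_frequency == 0 for i in range(num_layers)]


print(moe_layer_schedule(num_layers=6, moe_frequency=2))
# [False, True, False, True, False, True]
```

Each flag could then be passed as the `use_moe` argument when constructing the corresponding transformer block.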

In [None]:
# Stand-in for the attention module from the previous notebook
class MultiHeadAttention(nn.Module):
    """Simplified version - see notebook 02 for full implementation."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Placeholder: only the output projection is applied, so no attention
        # is actually computed here and `mask` is ignored.
        # See notebook 02 for the complete implementation.
        output = self.W_o(x)
        return output


class MoETransformerBlock(nn.Module):
    """
    Transformer block with MoE in the FFN layer.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        num_experts: int = 8,
        top_k: int = 2,
        dropout: float = 0.1,
        use_moe: bool = True,
    ):
        super().__init__()
        self.use_moe = use_moe

        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)

        # Feed-forward: either MoE or dense
        if use_moe:
            self.ffn = MoELayer(
                hidden_size=d_model,
                intermediate_size=d_ff,
                num_experts=num_experts,
                top_k=top_k,
                dropout=dropout,
            )
        else:
            self.ffn = Expert(d_model, d_ff, dropout)

        # Layer norms
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Forward pass through MoE transformer block.

        Returns:
            output: Block output
            moe_stats: MoE routing statistics (if MoE is used)
        """
        # 1. Attention with residual
        attn_output = self.attention(x, mask)
        x = self.ln1(x + self.dropout(attn_output))

        # 2. FFN (MoE or dense) with residual
        if self.use_moe:
            ffn_output, moe_stats = self.ffn(x)
        else:
            ffn_output = self.ffn(x)
            moe_stats = None

        x = self.ln2(x + self.dropout(ffn_output))

        return x, moe_stats


# Compare MoE vs Dense block
moe_block = MoETransformerBlock(d_model=512, num_heads=8, d_ff=2048, use_moe=True)
dense_block = MoETransformerBlock(d_model=512, num_heads=8, d_ff=2048, use_moe=False)

x = torch.randn(4, 10, 512)
moe_out, stats = moe_block(x)
dense_out, _ = dense_block(x)

moe_params = sum(p.numel() for p in moe_block.parameters())
dense_params = sum(p.numel() for p in dense_block.parameters())

print(f"MoE Block Parameters: {moe_params:,}")
print(f"Dense Block Parameters: {dense_params:,}")
print(f"\nParameter Ratio: {moe_params / dense_params:.2f}x more parameters")
print(f"Active expert fraction: {2 / 8:.2f} (each token uses 2 of 8 experts)")
print("FFN compute vs. the dense block: ~2x (two expert FFNs per token instead of one)")

## Part 7: Understanding the Efficiency Gains
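
As a rough per-layer accounting (a sketch that ignores embeddings, layer norms, and attention biases), let $d$ be the model width, $d_{ff}$ the FFN width, $N$ the number of experts, and $K$ the number of experts used per token. Each expert is assumed to be as large as the dense FFN it replaces.

$$P_{\text{ffn}} = 2 \, d \, d_{ff} + d_{ff} + d, \qquad P_{\text{attn}} \approx 4 d^2$$

$$P_{\text{dense}} = P_{\text{attn}} + P_{\text{ffn}}, \qquad P_{\text{moe}} = P_{\text{attn}} + N \, P_{\text{ffn}} + d N, \qquad P_{\text{moe, active}} = P_{\text{attn}} + K \, P_{\text{ffn}} + d N$$

For $d = 1024$, $d_{ff} = 4096$, $N = 8$, $K = 2$, this gives $P_{\text{moe}} / P_{\text{dense}} \approx 5.67$ and $P_{\text{moe, active}} / P_{\text{dense}} \approx 1.67$.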

In [None]:
def analyze_model_efficiency(d_model, d_ff, num_experts, top_k, num_layers):
    """
    Analyze parameter count and compute cost for MoE vs Dense.
    """
    # Dense FFN parameters per layer
    dense_ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model

    # MoE FFN parameters per layer
    moe_ffn_params = num_experts * dense_ffn_params + d_model * num_experts  # +router

    # Attention parameters (same for both)
    attn_params = 4 * d_model * d_model  # Q, K, V, O projections

    # Total parameters
    dense_total = num_layers * (attn_params + dense_ffn_params)
    moe_total = num_layers * (attn_params + moe_ffn_params)

    # Active parameters per token
    dense_active = dense_total
    # Each token runs top_k full-size experts plus the router
    moe_active = num_layers * (
        attn_params + top_k * dense_ffn_params + d_model * num_experts
    )

    return {
        "dense_total": dense_total,
        "moe_total": moe_total,
        "dense_active": dense_active,
        "moe_active": moe_active,
    }


# Our Storyteller model configuration
results = analyze_model_efficiency(
    d_model=1024, d_ff=4096, num_experts=8, top_k=2, num_layers=16
)

print("Storyteller Model Analysis:")
print("=" * 50)
print("\nIf we used Dense FFN everywhere:")
print(f"  Total parameters: {results['dense_total'] / 1e6:.1f}M")
print(f"  Active per forward: {results['dense_active'] / 1e6:.1f}M")
print("\nWith MoE (8 experts, top-2):")
print(f"  Total parameters: {results['moe_total'] / 1e6:.1f}M")
print(f"  Active per forward: {results['moe_active'] / 1e6:.1f}M")
print("\nKey Insights:")
print(f"  Parameter increase: {results['moe_total'] / results['dense_total']:.2f}x")
print(f"  Compute increase: {results['moe_active'] / results['dense_active']:.2f}x")
print(
    f"  Efficiency ratio: {(results['moe_total'] / results['dense_total']) / (results['moe_active'] / results['dense_active']):.2f}x"
)

## Part 8: Practical Considerations

### When to Use MoE?

**Good for:**
- Large-scale models (billions of parameters)
- Diverse tasks (different experts for different patterns)
- Training efficiency (more parameters without proportional compute)

**Challenges:**
- Load balancing requires careful tuning
- Communication overhead in distributed training
- Larger memory footprint (storing all experts)
- More complex implementation

### Hyperparameters to Tune:

1. **Number of experts**: 8-64 typical (power of 2)
2. **Top-K**: Usually 1 or 2
3. **MoE frequency**: Every 2-4 layers
4. **Load balancing weight**: 0.01-0.1
5. **Capacity factor**: 1.0-2.0 (for capacity-based routing)
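
For capacity-based routing, the capacity factor sets how many tokens each expert may accept per batch. A minimal sketch, following the Switch Transformer definition of (tokens / experts) × capacity factor: rounding up and the helper name `expert_capacity` are assumptions, and top-K implementations often also scale by K.

```python
import math


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Max tokens one expert accepts per batch: (tokens / experts) * capacity_factor, rounded up."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


# 4 sequences of 10 tokens, 8 experts
print(expert_capacity(40, 8, 1.0))  # 5
print(expert_capacity(40, 8, 1.25))  # 7
```

Tokens beyond an expert's capacity are typically dropped or passed through the residual connection, so a larger factor trades memory and compute for fewer dropped tokens.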

## Summary and Key Takeaways

In this notebook, you learned:

1. **MoE Motivation**:
   - Scale model capacity without proportional compute cost
   - Conditional computation via expert routing

2. **Core Components**:
   - **Router**: Learns which experts to use for each token
   - **Experts**: Specialized feed-forward networks
   - **Top-K Gating**: Select K best experts per token

3. **Load Balancing**:
   - Critical for utilizing all experts
   - Auxiliary loss encourages even distribution

4. **Efficiency**:
   - 4-8x more parameters, only 1.5-2x more compute
   - Compute scales with active parameters, not total capacity

5. **Implementation**:
   - Built complete MoE layer from scratch
   - Integrated into transformer blocks
   - Monitored routing statistics

### What's Next?

In the next notebook, we'll:
- Set up the complete training pipeline
- Implement data loading and batching
- Configure optimizers and learning rate schedules
- Train a small MoE model

### Further Reading

- [Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer](https://arxiv.org/abs/1701.06538) (Shazeer et al., 2017)
- [Switch Transformers](https://arxiv.org/abs/2101.03961) (Fedus et al., 2021)
- [GLaM: Efficient Scaling of Language Models](https://arxiv.org/abs/2112.06905) (Du et al., 2021)
- [ST-MoE: Designing Stable and Transferable MoE](https://arxiv.org/abs/2202.08906) (Zoph et al., 2022)

## Exercise: Design Your Own MoE Configuration

Given these constraints:
- 32GB GPU memory
- Target: ~500M total parameters
- Want to maximize capacity while keeping training feasible

Design an MoE configuration and justify your choices:

In [None]:
# Your configuration here
config = {
    "d_model": 1024,  # Hidden size
    "d_ff": 4096,  # FFN intermediate size
    "num_layers": 16,  # Number of transformer layers
    "num_experts": 8,  # Experts per MoE layer
    "top_k": 2,  # Experts activated per token
    "moe_frequency": 2,  # Apply MoE every N layers
}

# Analyze your configuration
# TODO: Estimate total parameters, memory usage, and training throughput

# Justify your choices:
# - Why this number of experts?
# - Why this top-k value?
# - Why this MoE frequency?
# - What tradeoffs did you make?
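
As a starting point for the parameter estimate, here is a rough sketch. The helper `estimate_params` is hypothetical. It ignores embeddings, layer norms, and attention biases, and assumes every `moe_frequency`-th layer is MoE while the rest are dense.

```python
def estimate_params(d_model, d_ff, num_layers, num_experts, top_k, moe_frequency):
    """Rough total and active parameter counts for a mixed dense/MoE stack."""
    ffn = 2 * d_model * d_ff + d_ff + d_model  # one FFN (weights + biases)
    attn = 4 * d_model * d_model  # Q, K, V, O projections (no biases)
    router = d_model * num_experts  # bias-free linear gate

    # Assumption: every `moe_frequency`-th layer is MoE, the rest are dense
    num_moe = num_layers // moe_frequency
    num_dense = num_layers - num_moe

    shared = num_layers * attn + num_dense * ffn
    total = shared + num_moe * (num_experts * ffn + router)
    active = shared + num_moe * (top_k * ffn + router)
    return total, active


total, active = estimate_params(
    d_model=1024, d_ff=4096, num_layers=16, num_experts=8, top_k=2, moe_frequency=2
)
print(f"total: {total / 1e6:.1f}M, active per token: {active / 1e6:.1f}M")
# total: 671.5M, active per token: 268.6M
```

Memory usage and training throughput depend on precision, optimizer state, and hardware, so they are left for your own analysis.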