# Muon Tutorial

A tutorial for understanding the increasingly popular Muon optimizer.

## References

https://kellerjordan.github.io/posts/muon/

https://thinkingmachines.ai/blog/modular-manifolds/

https://arxiv.org/abs/2502.16982v1

## Background

**Notation**: 
- $\theta \in \mathbb{R}^d$ = parameters (model weights)
- $g_t \in \mathbb{R}^d$ = gradient at step $t$
- $\alpha \in \mathbb{R}$ = learning rate (scalar)
- $\beta, \beta_1, \beta_2 \in [0,1)$ = decay rates (scalars)
- $\epsilon > 0$ = numerical stability constant (scalar, typically small)
- $\lambda \geq 0$ = weight decay coefficient (scalar)
- Squares, square roots, and divisions applied to vectors (e.g. $g_t^2$, $\sqrt{v_t}$) are element-wise

### 1. SGD

$$\theta_t = \theta_{t-1} - \alpha \cdot g_t$$

**Dimensionality**: $g_t$ same shape as $\theta$

Basic gradient descent. Moves directly opposite to the gradient with a fixed learning rate.
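As a minimal sketch of the update rule above, the following NumPy snippet applies a few SGD steps to a toy quadratic. The function name `sgd_step`, the toy objective, and the learning rate value are illustrative assumptions, not part of any library.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One SGD update: theta_t = theta_{t-1} - lr * g_t."""
    return theta - lr * grad

# Toy objective (assumed for illustration): f(theta) = 0.5 * ||theta||^2,
# whose gradient is theta itself.
theta = np.array([1.0, -2.0])
for _ in range(3):
    theta = sgd_step(theta, grad=theta)

print(theta)  # each step scales theta by (1 - lr)
```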

### 2. SGD + Momentum

$$v_t = \beta \cdot v_{t-1} + (1-\beta) \cdot g_t \quad \text{[build velocity]}$$
$$\theta_t = \theta_{t-1} - \alpha \cdot v_t \quad \text{[move with velocity]}$$

**Dimensionality**: $v_t$ same shape as $\theta$

Adds velocity memory: $v_t$ is an exponential moving average of gradients. This smooths noisy updates and accelerates movement in directions where the gradient is consistent.
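The two equations above can be sketched as follows. This is a hedged illustration of the EMA form of momentum used in this tutorial; `momentum_step` and its default hyperparameters are hypothetical names and values chosen for the example.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """One SGD + momentum update using the EMA form of the velocity."""
    v = beta * v + (1.0 - beta) * grad   # build velocity
    theta = theta - lr * v               # move with velocity
    return theta, v

# With a constant gradient, the velocity ramps up toward the gradient value.
theta, v = np.zeros(2), np.zeros(2)
grad = np.array([1.0, -1.0])
for _ in range(2):
    theta, v = momentum_step(theta, v, grad)

print(v, theta)
```

Note that the velocity state `v` must be carried between calls, which is why the function returns it alongside the parameters.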

### 3. RMSProp

$$v_t = \beta \cdot v_{t-1} + (1-\beta) \cdot g_t^2 \quad \text{[scale track]}$$
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{g_t}{\sqrt{v_t} + \epsilon} \quad \text{[adaptive scaling]}$$

**Dimensionality**: $v_t$ same shape as $\theta$

Adds per-parameter adaptive scaling. The "scale track" $v_t$ is an exponential moving average of squared gradients, so it tracks each parameter's gradient magnitude history. Dividing by $\sqrt{v_t}$ gives larger steps to parameters with small gradients and smaller steps to those with large gradients.
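A small sketch can make the scaling effect concrete. The snippet below assumes $\epsilon$ is added outside the square root (a common convention); `rmsprop_step` and the hyperparameter defaults are illustrative, not taken from a specific library.

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update with element-wise adaptive scaling."""
    v = beta * v + (1.0 - beta) * grad**2            # scale track
    theta = theta - lr * grad / (np.sqrt(v) + eps)   # adaptive scaling
    return theta, v

# Two parameters whose gradients differ by four orders of magnitude.
theta = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, 0.01])

new_theta, v = rmsprop_step(theta, v, grad)
step = theta - new_theta
print(step)  # both entries are nearly equal despite the gradient gap
```

After normalization the step size depends on the learning rate rather than the raw gradient magnitude, which is the sense in which the rate is adaptive per parameter.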

### 4. Adam

$$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \quad \text{[momentum track]}$$
$$v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 \quad \text{[scale track]}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{[bias correction]}$$
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \quad \text{[combine both]}$$

**Dimensionality**: $m_t$, $v_t$, $\theta$ all same shape (element-wise)

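Adam $\approx$ "RMSProp + Momentum": it combines both velocity and scale memory. The hats denote bias-corrected estimates, which compensate for the zero initialization $m_0 = v_0 = 0$.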
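The following is a minimal sketch of one Adam step, including the standard bias correction $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$. The function name `adam_step` and the default hyperparameters are assumptions for illustration; $\epsilon$ is placed outside the square root.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad       # momentum track
    v = beta2 * v + (1.0 - beta2) * grad**2    # scale track
    m_hat = m / (1.0 - beta1**t)               # undo bias toward zero init
    v_hat = v / (1.0 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
grad = np.array([10.0, -0.001])

theta, m, v = adam_step(theta, m, v, grad, t=1)
print(theta)  # on the first step each entry moves by about lr, in the direction -sign(grad)
```

On the very first step the bias-corrected ratio reduces to roughly `sign(grad)`, so the update magnitude is close to the learning rate regardless of gradient scale.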

### 5. AdamW

$$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \quad \text{[momentum track]}$$
$$v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 \quad \text{[scale track]}$$
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha \cdot \lambda \cdot \theta_{t-1} \quad \text{[Adam + weight decay]}$$

**Key difference**: Weight decay $(\lambda \cdot \theta_{t-1})$ is applied directly to the parameters (decoupled) instead of being added to the gradient as in standard L2 regularization. Because the decay term bypasses Adam's adaptive scaling, every parameter is regularized at the same relative rate.
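A hedged sketch of the AdamW step is shown below. It mirrors the equations above, with the decay term computed from the pre-update parameters; `adamw_step` and the argument name `weight_decay` (standing in for $\lambda$) are illustrative choices, and the bias correction follows the standard Adam definition.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: Adam step plus decoupled weight decay."""
    m = beta1 * m + (1.0 - beta1) * grad       # momentum track
    v = beta2 * v + (1.0 - beta2) * grad**2    # scale track
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    adam_update = lr * m_hat / (np.sqrt(v_hat) + eps)
    decay = lr * weight_decay * theta          # not divided by sqrt(v_hat)
    return theta - adam_update - decay, m, v

# With a zero gradient, only the decay term acts on the weights.
theta = np.array([2.0, -4.0])
new_theta, m, v = adamw_step(theta, np.zeros(2), np.zeros(2), np.zeros(2), t=1)
print(new_theta / theta)  # same shrink factor for every parameter
```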




#### Example: Why L2 Regularization is Weaker in Adam

L2 regularization: $g_t \leftarrow g_t + \lambda \cdot \theta_{t-1}$

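With L2 regularization, the penalty passes through Adam along with the gradient. The term $\lambda \cdot \theta_{t-1}$ enters $m_t$ and is then divided by $\sqrt{v_t}$:
$$\text{regularization contribution} \propto \frac{\lambda \cdot \theta_{t-1}}{\sqrt{v_t}}$$

Parameters with a history of large gradients (large $v_t$) are therefore regularized less than parameters with small gradients, so the effective penalty is uneven across parameters.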

AdamW decouples weight decay from the adaptive scaling:
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha \cdot \lambda \cdot \theta_{t-1}$$

The decay factor $\alpha \cdot \lambda$ is the same for every parameter, regardless of its gradient history.
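To make the contrast concrete, here is a small numerical sketch comparing the shrinkage each scheme applies on the first step ($t = 1$) to two weights with equal value but very different gradient scales. All values are made up for illustration, and the L2 shrinkage is isolated by looking only at the part of the Adam update attributable to the penalty.

```python
import numpy as np

lr, lam, eps = 1e-3, 0.1, 1e-8
theta = np.array([1.0, 1.0])    # two weights with the same value
grad = np.array([10.0, 0.01])   # but very different gradient scales

# L2 regularization: the penalty is folded into the gradient before Adam sees it.
g_l2 = grad + lam * theta
# At t = 1, bias correction gives m_hat = g_l2 and v_hat = g_l2**2.
v_hat = g_l2**2
l2_shrink = lr * lam * theta / (np.sqrt(v_hat) + eps)

# Decoupled weight decay (AdamW): the shrinkage bypasses the adaptive scaling.
adamw_shrink = lr * lam * theta

print("L2 via Adam:", l2_shrink)
print("AdamW:      ", adamw_shrink)
```

Under these assumed values, the weight with the large gradient receives far less L2 shrinkage than the weight with the small gradient, while the decoupled decay is identical for both.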

## Muon