# ZeRO (Zero Redundancy Optimizer)

## Definition

ZeRO (Zero Redundancy Optimizer) is a memory optimization technology developed by Microsoft as part of the DeepSpeed library. Traditional data parallelism replicates all model states on every GPU. ZeRO instead partitions the model states (optimizer states, gradients, and parameters) across the data-parallel devices, which removes the redundant copies while keeping the computation of data parallelism unchanged.

## Mathematical Foundation

In distributed training with data parallelism across $N$ devices, each device traditionally stores:

- Complete model parameters: $\theta$
- Complete gradients: $\nabla\theta$
- Complete optimizer states: $S_{\theta}$ (e.g., $m$ and $v$ for Adam optimizer)

With traditional data parallelism, the memory requirement per device is $O(M)$, where $M$ is the total size of the model states.

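For the Adam optimizer in single precision, the model-state memory per device is:
$$M_{Adam} = \text{sizeof}(\theta) + \text{sizeof}(\nabla\theta) + 2 \times \text{sizeof}(\theta) = 4 \times \text{sizeof}(\theta)$$

Where the components represent:
- $\text{sizeof}(\theta)$: Parameters
- $\text{sizeof}(\nabla\theta)$: Gradients (the same size as the parameters)
- $2 \times \text{sizeof}(\theta)$: Adam optimizer states (momentum $m$ and variance $v$)

Activations and temporary buffers come on top of this and depend on batch size and sequence length rather than on the optimizer. In mixed-precision training, the ZeRO paper counts 16 bytes per parameter: 2 for fp16 parameters, 2 for fp16 gradients, and 12 for the fp32 master parameters, momentum, and variance.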

With ZeRO, these states are partitioned across $N$ devices, reducing per-device memory complexity to $O(M/N)$.
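
The $O(M)$ versus $O(M/N)$ comparison can be made concrete by counting bytes per parameter. The sketch below is only an illustration: the function names and the 7.5B-parameter model are assumptions chosen for the example, and activations are ignored.

```python
def adam_model_state_bytes(num_params: int, mixed_precision: bool = True) -> int:
    """Bytes of model state (parameters, gradients, Adam moments) for one full replica."""
    if mixed_precision:
        # fp16 params (2) + fp16 grads (2) + fp32 master params, momentum, variance (12)
        bytes_per_param = 2 + 2 + 12
    else:
        # fp32 params (4) + fp32 grads (4) + fp32 momentum and variance (8)
        bytes_per_param = 4 + 4 + 8
    return num_params * bytes_per_param


def per_device_bytes(num_params: int, num_devices: int, partitioned: bool) -> float:
    """Model-state bytes per device, either fully replicated or fully partitioned."""
    total = adam_model_state_bytes(num_params)
    return total / num_devices if partitioned else float(total)


num_params = 7_500_000_000  # hypothetical 7.5B-parameter model
replicated = per_device_bytes(num_params, 64, partitioned=False)
sharded = per_device_bytes(num_params, 64, partitioned=True)
print(f"replicated: {replicated / 1e9:.1f} GB per device")
print(f"partitioned over 64 devices: {sharded / 1e9:.3f} GB per device")
```

Both precisions come out at 16 bytes per parameter here, because mixed precision adds an fp32 master copy of the parameters while halving the size of the working parameters and gradients.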

## Core Principles of ZeRO

ZeRO is built upon three fundamental principles:

1. **Strategic Partitioning**: Model states (optimizer states, gradients, and parameters) are partitioned across devices instead of being replicated.
   
2. **Dynamic Communication**: Devices exchange necessary data during computation phases through collective communication operations.
   
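3. **Progressive Optimization Stages**: ZeRO implements memory optimization in incremental stages (1, 2, 3), each removing more redundancy. According to the ZeRO paper, Stages 1 and 2 keep the communication volume of standard data parallelism, while Stage 3 raises it to about 1.5x.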

## Detailed Explanation of ZeRO Stages

### ZeRO Stage 1: Optimizer State Partitioning

In Stage 1, ZeRO partitions only the optimizer states across GPUs.

For Adam optimizer with parameters $\theta$, momentum $m$, and variance $v$:
- Each GPU $i$ stores:
  - Complete parameters $\theta$
  - Complete gradients $\nabla\theta$
  - Partial optimizer states $m_i$ and $v_i$ (only for parameters assigned to GPU $i$)

The memory consumption per device becomes:
$$M_{Stage1} = \text{sizeof}(\theta) + \text{sizeof}(\nabla\theta) + \frac{2 \times \text{sizeof}(\theta)}{N} + \text{activations}$$

**Process flow:**
1. Each GPU computes gradients for all parameters
2. All-reduce operation to average gradients across GPUs
3. Each GPU updates only its assigned optimizer states and parameters
4. All-gather operation to synchronize updated parameters

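Memory reduction: Setting $N = 1$ in the formula recovers standard data parallelism at $4 \times \text{sizeof}(\theta)$ of model states. For large $N$, Stage 1 approaches $2 \times \text{sizeof}(\theta)$, about a 50% reduction. In mixed-precision training, where the fp32 optimizer states account for 12 of the 16 bytes per parameter, the ZeRO paper reports up to a 4x reduction.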
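
The process flow can be simulated on a single machine to check that partitioning the optimizer states does not change the result. This is a minimal sketch, not DeepSpeed's implementation: ranks are modeled as list entries, the collectives are plain Python loops, and `adam_step` and `zero1_step` are hypothetical helper names.

```python
import math


def adam_step(params, grads, m, v, step, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """In-place Adam update on parallel lists of floats."""
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** step)
        v_hat = v[i] / (1 - b2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def zero1_step(params, local_grads, shard_states, step):
    """One simulated Stage 1 step; returns the new full parameter list.

    params: full parameters (identical on every rank)
    local_grads: one full gradient list per rank
    shard_states: per-rank dict with m and v for that rank's shard only
    """
    n_ranks = len(local_grads)
    shard = len(params) // n_ranks  # assumes an even split

    # All-reduce: every rank ends up with the full averaged gradient.
    avg = [sum(g[i] for g in local_grads) / n_ranks for i in range(len(params))]

    # Each rank updates only the shard whose optimizer states it owns.
    updated = []
    for rank in range(n_ranks):
        lo, hi = rank * shard, (rank + 1) * shard
        p = params[lo:hi]
        state = shard_states[rank]
        adam_step(p, avg[lo:hi], state["m"], state["v"], step)
        updated.append(p)

    # All-gather: concatenate the updated shards into the full parameters.
    return [x for part in updated for x in part]


params = [1.0, -2.0, 0.5, 3.0]
local_grads = [[0.1, 0.2, -0.3, 0.4], [0.3, -0.2, 0.1, 0.0]]
states = [{"m": [0.0, 0.0], "v": [0.0, 0.0]} for _ in range(2)]
new_params = zero1_step(params, local_grads, states, step=1)

# Reference: ordinary Adam with full optimizer state on the averaged gradient.
ref = list(params)
avg = [(a + b) / 2 for a, b in zip(*local_grads)]
adam_step(ref, avg, [0.0] * 4, [0.0] * 4, step=1)
assert all(math.isclose(a, b) for a, b in zip(new_params, ref))
```

Each simulated rank holds only two of the four momentum and variance entries, yet the gathered parameters match the unpartitioned update.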

### ZeRO Stage 2: Gradient Partitioning

Stage 2 adds partitioning of gradients on top of Stage 1:

Each GPU $i$ stores:
- Complete parameters $\theta$
- Partial gradients $\nabla\theta_i$ (only for parameters assigned to GPU $i$)
- Partial optimizer states $m_i$ and $v_i$ (only for parameters assigned to GPU $i$)

The memory consumption per device becomes:
$$M_{Stage2} = \text{sizeof}(\theta) + \frac{\text{sizeof}(\nabla\theta)}{N} + \frac{2 \times \text{sizeof}(\theta)}{N} + \text{activations}$$

**Process flow:**
1. During backward pass, each GPU computes gradients for all parameters
2. Reduce-scatter operation: gradients are partitioned so each GPU retains only its assigned portion
3. Each GPU updates only its assigned optimizer states and parameters
4. All-gather operation to synchronize updated parameters

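Memory reduction: For large $N$, model-state memory approaches $\text{sizeof}(\theta)$, about 75% less than the $4 \times \text{sizeof}(\theta)$ of standard data parallelism (the formula with $N = 1$). For mixed-precision training, the ZeRO paper reports up to an 8x reduction.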
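
The difference from Stage 1 is the reduce-scatter in step 2: each GPU keeps only its own slice of the averaged gradient. The sketch below simulates this with Python lists. The helper names are hypothetical, and plain SGD stands in for Adam to keep the example short.

```python
def reduce_scatter_mean(local_grads):
    """Simulate reduce-scatter: rank r receives only shard r of the averaged gradient."""
    n_ranks = len(local_grads)
    shard = len(local_grads[0]) // n_ranks  # assumes an even split
    shards = []
    for rank in range(n_ranks):
        lo, hi = rank * shard, (rank + 1) * shard
        shards.append(
            [sum(g[i] for g in local_grads) / n_ranks for i in range(lo, hi)]
        )
    return shards


def all_gather(shards):
    """Simulate all-gather: every rank receives the concatenation of all shards."""
    return [x for part in shards for x in part]


params = [10.0, 20.0, 30.0, 40.0]
local_grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # one list per rank
lr = 0.5

# Step 2: after reduce-scatter, each rank holds 2 of the 4 gradient values.
grad_shards = reduce_scatter_mean(local_grads)
print(grad_shards)  # [[2.0, 3.0], [4.0, 5.0]]

# Step 3: each rank updates only its own parameter shard (SGD as a stand-in).
param_shards = []
for rank, g_shard in enumerate(grad_shards):
    lo = rank * len(g_shard)
    param_shards.append([p - lr * g for p, g in zip(params[lo:lo + len(g_shard)], g_shard)])

# Step 4: all-gather restores the full, updated parameters on every rank.
params = all_gather(param_shards)
print(params)  # [9.0, 18.5, 28.0, 37.5]
```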

### ZeRO Stage 3: Parameter Partitioning

Stage 3 represents the most aggressive memory optimization, partitioning everything:

Each GPU $i$ stores:
- Partial parameters $\theta_i$
- Partial gradients $\nabla\theta_i$
- Partial optimizer states $m_i$ and $v_i$

The memory consumption per device becomes:
$$M_{Stage3} = \frac{\text{sizeof}(\theta)}{N} + \frac{\text{sizeof}(\nabla\theta)}{N} + \frac{2 \times \text{sizeof}(\theta)}{N} + \text{activations}$$

**Process flow:**
1. Before forward pass, all-gather to collect necessary parameters from other GPUs
2. Forward pass computes using the gathered parameters
3. Release gathered parameters to free memory
4. Re-gather necessary parameters for backward pass
5. Compute gradients during backward pass
6. Reduce-scatter gradients to appropriate GPUs
7. Each GPU updates only its portion of parameters

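Memory reduction: Model-state memory shrinks by a factor of $N$ compared to standard data parallelism, so the saving grows linearly with the number of GPUs (for example, 64x on 64 GPUs). Activations are not reduced by this partitioning.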
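
The three per-stage formulas can be compared side by side. The following sketch evaluates them in units of $\text{sizeof}(\theta)$, with activations left out. The function name is hypothetical, and stage 0 denotes standard data parallelism.

```python
def model_state_units(stage: int, n: int) -> float:
    """Model-state memory per device in units of sizeof(theta), activations excluded.

    Parameters count 1 unit, gradients 1 unit, Adam momentum and variance 2 units.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 1 / n if stage >= 3 else 1.0     # partitioned from Stage 3
    grads = 1 / n if stage >= 2 else 1.0      # partitioned from Stage 2
    optimizer = 2 / n if stage >= 1 else 2.0  # partitioned from Stage 1
    return params + grads + optimizer


baseline = model_state_units(0, n=64)
for stage in range(4):
    units = model_state_units(stage, n=64)
    print(f"stage {stage}: {units:.4f} units, {baseline / units:.1f}x smaller than stage 0")
```

With the mixed-precision accounting of the ZeRO paper (2, 2, and 12 bytes per parameter), the same structure applies but the weights differ, which is why the paper's reduction factors for Stages 1 and 2 are larger.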

## ZeRO-Offload

ZeRO-Offload extends ZeRO by offloading optimizer computation and model-state memory to the CPU.

**Key components:**
- Optimizer states and gradients stored in CPU memory
- Parameters remain in GPU for computation
- CPU-GPU communication occurs during optimization
- CPU performs the optimizer update; with the optional one-step delayed parameter update, this can overlap with the GPU computing the next batch

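**Mathematical representation:**
For a model with $P$ parameters trained with mixed-precision Adam, the model states occupy about $16P$ bytes: $2P$ for fp16 parameters, $2P$ for fp16 gradients, and $12P$ for the fp32 master parameters, momentum, and variance. Without offloading, all of this sits in GPU memory alongside the activations. With ZeRO-Offload, only the fp16 parameters stay on the GPU:
$$M_{GPU} \approx 2P + \text{activation memory}$$
$$M_{CPU} \approx 14P \text{ (fp16 gradients plus fp32 optimizer states, for Adam)}$$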

**Benefits:**
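- Enables training of models roughly 10x larger than fit on the same GPU without offloading; the ZeRO-Offload paper reports over 13 billion parameters on a single V100 GPU
- Reduces GPU memory for model states from about 16 bytes to about 2 bytes per parameter with mixed-precision Adam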
- Works with a single GPU, unlike basic ZeRO which requires multiple GPUs
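
In DeepSpeed, offloading is switched on through the configuration rather than through code changes. The dictionary below is a minimal sketch of such a configuration. The batch size is a placeholder value, and the exact set of supported options should be checked against the DeepSpeed configuration documentation for the installed version.

```python
import json

# Sketch of a DeepSpeed config enabling ZeRO Stage 2 with optimizer offload to CPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder value for illustration
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {
        "stage": 2,                       # partition optimizer states and gradients
        "offload_optimizer": {
            "device": "cpu",              # keep optimizer states in CPU memory
            "pin_memory": True,           # pinned host memory for faster transfers
        },
    },
}

# The same settings can be written to a JSON file and passed to the launcher.
print(json.dumps(ds_config, indent=2))
```

A dictionary like this is typically passed as the `config` argument of `deepspeed.initialize`, together with the model and its parameters.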

## ZeRO-Infinity

ZeRO-Infinity takes offloading to the extreme by leveraging both CPU memory and NVMe storage.

**Key features:**
- Partitions model states across GPUs (like ZeRO Stage 3)
- Offloads to CPU memory (like ZeRO-Offload)
- Further offloads to NVMe storage when CPU memory is insufficient
- Uses aggressive pre-fetching and computation-communication overlapping
- Employs bandwidth optimization and memory-centric tiling

**Mathematical model:**
Memory capacity effectively becomes:
$$M_{effective} = M_{GPU} + M_{CPU} + M_{NVMe}$$

With hierarchical memory management:
$$T_{access} = \begin{cases}
T_{fast}, & \text{for GPU memory} \\
T_{medium}, & \text{for CPU memory} \\
T_{slow}, & \text{for NVMe storage}
\end{cases}$$

**Performance optimization:**
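ZeRO-Infinity overlaps memory swapping with computation. For a transfer of $S_{data}$ bytes at achieved bandwidth $BW_{achieved}$, the part of the swap time hidden behind computation is:
$$T_{hidden} = \min\left(\frac{S_{data}}{BW_{achieved}}, T_{computation}\right)$$

The step stalls only for the remainder, $\max\left(0, \frac{S_{data}}{BW_{achieved}} - T_{computation}\right)$, which aggressive prefetching aims to drive to zero by issuing transfers before the data is needed.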
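
To illustrate the effect of overlap, the sketch below splits a transfer into the part hidden behind computation and the part that stalls the step. The function name, the bandwidth figures, and the timings are assumptions picked for the example, not measurements.

```python
def overlap_breakdown(bytes_to_move: float, bandwidth_bytes_per_s: float,
                      compute_time_s: float) -> tuple[float, float]:
    """Return (hidden, exposed) transfer time in seconds when a swap overlaps compute."""
    transfer = bytes_to_move / bandwidth_bytes_per_s
    hidden = min(transfer, compute_time_s)  # overlapped with computation
    exposed = transfer - hidden             # time the step has to wait
    return hidden, exposed


# Hypothetical scenario: 2 GB to prefetch while 0.25 s of computation is running.
for name, bandwidth in [("CPU memory", 10e9), ("NVMe", 3e9)]:
    hidden, exposed = overlap_breakdown(2e9, bandwidth, compute_time_s=0.25)
    print(f"{name}: hidden {hidden:.3f} s, exposed {exposed:.3f} s")
```

In this example the faster tier is fully hidden, while the slower tier leaves part of the transfer exposed. That is the case where higher achieved bandwidth or earlier prefetching pays off.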

## Importance of ZeRO in Deep Learning

ZeRO is crucial for several reasons:

1. **Model Size Scaling**: Enables training of trillion-parameter models that would otherwise be impossible on available hardware.

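2. **Democratization**: Allows researchers with limited hardware to train larger models. A model whose states do not fit on a single GPU under standard data parallelism can be sharded across the GPUs that are available, so fewer or smaller devices are needed for the same model.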

3. **Resource Efficiency**: Maximizes memory utilization across distributed systems, reducing wasted resources.

4. **Training Speed**: The memory freed on each GPU can be used for larger per-GPU batch sizes, which can raise training throughput.

5. **Cost Reduction**: Reduces infrastructure requirements for large-scale model training, lowering overall costs.

## Pros and Cons

### Pros:
- Drastically reduces model-state memory per GPU (by a factor approaching the number of GPUs in Stage 3)
- Enables training of models that would otherwise be impossible
- Maintains computational efficiency with minimal overhead
- Works with existing PyTorch code with minimal changes
- Compatible with other optimization techniques (mixed precision, checkpointing)
- Scales efficiently to thousands of GPUs

### Cons:
- Increased communication overhead in Stage 3 (about 1.5x the volume of standard data parallelism, according to the ZeRO paper)
- Additional complexity in implementation and debugging
- Performance depends heavily on network bandwidth and latency
- Potential I/O bottlenecks with ZeRO-Infinity
- Stage 3 may slow down training due to parameter communication
- Integration with custom training loops can be challenging

## Recent Advancements

Recent developments in ZeRO technology include:

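1. **ZeRO++**: Reduces communication volume through quantized weight communication (qwZ), hierarchical partitioning that keeps a secondary copy of the parameters within each node (hpZ), and quantized gradient communication (qgZ). The authors report up to a 4x reduction in communication volume relative to ZeRO Stage 3.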

2. **ZeRO-Offload++**: Builds on ZeRO-Offload with Twin-Flow, which splits the optimizer update between CPU and GPU according to a configurable ratio instead of placing all of it on the CPU.

3. **ZeRO-Inference**: Adaptation of ZeRO concepts for inference. Model weights are kept in CPU memory or on NVMe and streamed to the GPU as needed, enabling larger models to be deployed on limited GPU memory.

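4. **ZeRO-DP**: The name the ZeRO paper gives to its data-parallel component, that is, the three partitioning stages described above. It can be combined with DeepSpeed pipeline parallelism for hybrid parallelism, combining the benefits of both approaches.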

5. **Activation Checkpointing with ZeRO**: Coordinated memory-saving techniques that intelligently manage both activation memory and model state memory.

6. **Composition with Tensor Parallelism**: Combined with Megatron-style tensor parallelism for maximum efficiency, enabling training of models beyond a trillion parameters.

7. **Heterogeneous Training**: Support for clusters with diverse GPU configurations, optimizing memory distribution based on available resources.

8. **Smart Gradient Accumulation**: Techniques that reduce memory requirements during gradient accumulation phases for very large batch sizes.