# Zero Redundancy Optimization (ZeRO)

In this session, we will learn about ZeRO, a set of memory optimization techniques for neural network training developed by Microsoft.


## 1. Mixed Precision

As modern GPUs support computation with lower precision, most modern neural network training uses a mixed precision approach that combines FP16 (half) and FP32 (single). On a V100 GPU, FP32 delivers around 14 TFLOPS, while FP16 on Tensor Cores can exceed 100 TFLOPS at peak. In addition, using FP16 reduces the model size, which provides advantages not only during training but also during deployment.

<br>

![](../images/mixed_precision_1.png)

<br>

### But why Mixed?
This raises a question: why not train the model using only FP16? Why is it necessary to use both FP32 and FP16 together? In short, training with only FP16 often causes the loss to diverge, making training almost impossible. FP16 keeps only about three significant decimal digits and has a narrow exponent range, so small gradients underflow to zero and small weight updates are rounded away, which prevents precise learning. Therefore, the goal is to combine the high speed of FP16 with the high accuracy of FP32 to take advantage of both approaches.
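
To make the precision problem concrete, here is a minimal sketch that rounds Python floats to FP16 using the standard library's `struct` module. The helper name `to_fp16` is an illustrative assumption, not part of any library, and 64-bit Python floats stand in for the higher-precision copy.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weight = 1.0
update = 1e-4  # e.g. learning rate times gradient

# Near 1.0 the spacing between FP16 values is 2**-10 (about 0.00098),
# so an update of 1e-4 is lost when the sum is stored in FP16.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))

# Python floats are 64-bit; they stand in for the higher-precision copy here.
high_precision_result = weight + update

print(fp16_result == weight)            # the FP16 weight did not move
print(high_precision_result == weight)  # the higher-precision weight did
```

This is why the weight update is applied to a higher-precision copy of the parameters rather than to the FP16 weights directly.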

![](../images/ddp_analysis_3.png)

The computationally expensive Forward and Backward passes are performed using the FP16 model, and the computed gradients are copied to the higher-precision FP32 model to update the weights. This leads to another question: how should FP16 gradients be applied to FP32? Experimental results showed that when backpropagating a loss computed in FP16, some values with very small magnitudes (shown on the left in the figure) became zero during computation.

![](../images/mixed_precision_4.png)

<br>

### Loss Scaling
How can this problem be resolved? The idea is very simple: multiply the loss by a large constant before backpropagation. By the chain rule, every gradient is multiplied by the same constant, which shifts the gradient distribution to the right and keeps small values from underflowing to zero in FP16. This technique is called loss scaling. Before the weights are updated, the gradients are divided by the same constant again, so the update itself is unchanged.
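
The effect can be sketched with plain Python, again rounding to FP16 through `struct`. The helper `to_fp16`, the gradient value, and the scale of 1024 are assumptions chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8    # smaller than the smallest positive FP16 value (about 6e-8)
loss_scale = 1024.0

# Without scaling, the gradient underflows to zero in FP16.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which moves this value into the representable FP16 range.
scaled_fp16 = to_fp16(true_grad * loss_scale)

# The gradient is divided by the scale again in higher precision
# before the optimizer uses it.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, scaled_fp16, recovered)
```

The recovered value is not exact, because the scaled gradient is still rounded to FP16, but it is no longer zero.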

![](../images/mixed_precision_5.png)


In [None]:
"""
Reference: apex/apex/amp/opt.py
"""

import contextlib

@contextlib.contextmanager
def scale_loss(self, loss):
    if not self._amp_handle.is_active():
        yield loss
        return

    # When there are multiple losses per-optimizer, we need
    # to save out current grad accumulation, since we won't be
    # able to unscale this particular loss once the grads are
    # all mixed together.
    cached_grads = []
    if self._loss_idx > 0:
        for p in master_params(self._optimizer):
            if p.grad is not None:
                cached_grads.append(p.grad.data.detach().clone())
            else:
                cached_grads.append(None)
        self._optimizer.zero_grad()

    loss_scale = self._cur_loss_scaler().loss_scale()
    yield loss * loss_scale

In [None]:
"""
Reference: apex/tests/L0/run_amp/test_fused_sgd.py
"""

with amp.scale_loss(loss0, optimizer, loss_id=loss_ids[0]) as scaled_loss:
    scaled_loss.backward()
    if i == inject_inf and which_backward == 0:
        if inject_inf_loc == "fp32":
            model0.weight0.grad[0] = float('inf')
        elif inject_inf_loc == "fp16":
            model0.weight1.grad[0] = float('inf')

In practice, as shown in the figure below, multiplying the loss by a large value allowed training to proceed without divergence. The gray graph represents performance without scaling, while the green graph shows performance with scaling applied. Surprisingly, the performance is almost identical to FP32.

![](../images/mixed_precision_2.png)

For this reason, mixed precision using both FP16 and FP32 has become almost essential in modern neural network training. Until bfloat16 (Google TPU), which provides FP32-level coverage with FP16-level storage requirements, is supported on a wider variety of GPUs and becomes more widely adopted, FP16 + FP32 mixed precision training will remain a core technique in neural network training.

<br>

### How Mixed Precision Works

The following figure illustrates how mixed precision operates. Let’s take a closer look at the process using code and equations.

<br>

![](../images/mixed_precision_33.png)


### 0) Create the model and optimizer

We define a neural network with two layers.


In [None]:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(512, 512, bias=False)
        self.w2 = nn.Linear(512, 1, bias=False)
    
    def forward(self, x):
        z1 = self.w1(x)
        z2 = self.w2(z1)
        return z2

We create the neural network to be trained and the optimizer.


In [None]:
from torch.optim import SGD

fp32_model= Net().to("cuda")
optimizer = SGD(fp32_model.parameters(), lr=1e-2)

In [None]:
f"GPU = {torch.cuda.memory_allocated(0) / (1024 ** 2)} MiB"

<br>

### 1)  Float2Half

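This step casts the FP32 parameters to FP16: a value such as `0.524796132` is rounded to the nearest representable FP16 value, which keeps only about three significant decimal digits (approximately `0.525`).

As you can see, the FP16 copy takes roughly half the memory of the FP32 model, so about 1.5 MiB is allocated in total (1.0 MiB + 0.5 MiB).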


In [None]:
fp16_model = Net().half().to("cuda")
fp16_model.load_state_dict(fp32_model.state_dict())

In [None]:
f"GPU = {torch.cuda.memory_allocated(0) / (1024 ** 2)} MiB"
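
As a rough cross-check that does not need a GPU, the parameter memory of the two bias-free layers can be estimated by hand, and the rounding done by the cast can be reproduced with `struct`. The helper `to_fp16` is an illustrative assumption; the real allocator may round sizes up slightly.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Parameter count of the two bias-free linear layers (512x512 and 512x1).
num_params = 512 * 512 + 512 * 1

fp32_mib = num_params * 4 / 1024**2  # 4 bytes per FP32 value
fp16_mib = num_params * 2 / 1024**2  # 2 bytes per FP16 value
print(f"FP32: {fp32_mib:.2f} MiB, FP16: {fp16_mib:.2f} MiB")

# Casting rounds to the nearest representable FP16 value.
print(to_fp16(0.524796132))
```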

<br>

### 2) Forward

We perform the forward pass using the model copied to fp16.

$z_1 = w_1 \cdot x \; $ (FWD: layer1)

$z_2 = w_2 \cdot z_1 \; $ (FWD: layer2)


In [None]:
import torch

# example input sizes
batch_size, hidden_size = 4, 512

# create dummy data (bsz=4, hid=512)
x = torch.randn(batch_size,hidden_size, dtype=torch.half, device="cuda") 

# do forward
z2 = fp16_model(x)

# check dtype of output logits
f"logits type = {z2.dtype}"

We compute the loss using the output values calculated in FP16.

$L = \frac{(y - z_2)^2}{2} \; $ (Loss computation)


In [None]:
# create dummy data (bsz=4)
y = torch.tensor([[1.9], [9.5], [0.9], [1.2]], dtype=torch.half, device="cuda")

# compute mean square error loss
L = torch.nn.functional.mse_loss(z2, y)

# check dtype of loss
f"loss type = {L.dtype}"

<br>

### 3) Backward 

Now we need to update the model parameters using the Gradient Descent rule:

$w_n := w_n - lr \cdot \frac{dL}{dw_n}$

Therefore, we must compute the gradients $\frac{dL}{dw_1}$ and $\frac{dL}{dw_2}$.  
Applying the chain rule gives the following (treating each quantity as a scalar for simplicity):

$\frac{dL}{dw_2} = \frac{dL}{dz_2} \cdot \frac{dz_2}{dw_2}$

$\frac{dL}{dw_1} = \frac{dL}{dz_2} \cdot \frac{dz_2}{dz_1} \cdot \frac{dz_1}{dw_1}$


<br>


More concretely, they are:

$\frac{dL}{dz_2} = z_2 - y \; $ (BWD-activation: layer2)

$\frac{dz_2}{dw_2} = z_1 \;$ (BWD-weight: layer2)

$\frac{dz_2}{dz_1} = w_2 \;$ (BWD-activation: layer1)

$\frac{dz_1}{dw_1} = x \; $ (BWD-weight: layer1)

<br>

$\frac{dL}{dw_2} = (z_2 - y) \cdot z_1$

$\frac{dL}{dw_1} = (z_2 - y) \cdot w_2 \cdot x$
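
A scalar finite-difference check is a simple way to sanity-check chain-rule expressions like these. The sketch below uses scalars in place of tensors and arbitrary example values; note that differentiating $\frac{(y - z_2)^2}{2}$ with respect to $z_2$ gives $z_2 - y$.

```python
def loss_fn(w1, w2, x, y):
    z1 = w1 * x   # FWD: layer1
    z2 = w2 * z1  # FWD: layer2
    return (y - z2) ** 2 / 2


def analytic_grads(w1, w2, x, y):
    z1 = w1 * x
    z2 = w2 * z1
    dL_dz2 = z2 - y           # derivative of (y - z2)^2 / 2 w.r.t. z2
    dL_dw2 = dL_dz2 * z1      # BWD-weight: layer2
    dL_dw1 = dL_dz2 * w2 * x  # BWD-activation then BWD-weight: layer1
    return dL_dw1, dL_dw2


def numeric_grads(w1, w2, x, y, eps=1e-6):
    # Central differences: perturb one weight at a time.
    dL_dw1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
    dL_dw2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)
    return dL_dw1, dL_dw2


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
print(analytic_grads(w1, w2, x, y))
print(numeric_grads(w1, w2, x, y))
```

Both calls should agree up to floating-point error, which confirms the sign and the order of the factors.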


In [None]:
# loss scaling
L *= 1024

# do backward
L.backward()

<br>

### 4) Update Weight

Finally, to update the parameters, we call `optimizer.step()`.

$w_1 := w_1 - lr \cdot \frac{dL}{dw_1} \; $ (Weight Update)

$w_2 := w_2 - lr \cdot \frac{dL}{dw_2} \; $ (Weight Update)


In [None]:
print(f'before: {fp32_model.w1.weight}\n')
optimizer.step()
print(f'after: {fp32_model.w1.weight}\n')

If you think about it, the FP32 model has never performed forward or backward passes.  
Therefore, it does not have gradient tensors. As a result, even if we call `optimizer.step()`, the values do not change.  
So, before calling `optimizer.step()`, we must copy the gradients from the FP16 model that went through `backward()`.

For reference, in PyTorch, all parameters (`nn.Parameter`) that have `requires_grad=True` set contain an attribute called `grad`.  
When `backward()` is called on a tensor produced by the model, PyTorch traverses the computation graph backward, computes derivatives, and stores the results in the `grad` attribute.  

Since `grad` has the same size as the corresponding tensor, if a model occupies 10GB of memory, its gradients will also require about 10GB.  
This is one of the reasons why training requires much more memory than inference.  

Therefore, for tensors that are not used for training, you should always set `requires_grad=False` to prevent unnecessary memory consumption.


In [None]:
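# copy gradients to the FP32 model, undoing the loss scaling (factor 1024) applied before backward()
loss_scale = 1024
fp32_model.w1.weight.grad = fp16_model.w1.weight.grad.float() / loss_scale
fp32_model.w2.weight.grad = fp16_model.w2.weight.grad.float() / loss_scale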

In [None]:
print(f'before: {fp32_model.w1.weight}\n')
optimizer.step()
print(f'after: {fp32_model.w1.weight}\n')

### Performing Mixed Precision Training in PyTorch

In PyTorch, you can easily perform mixed precision training as follows.


In [None]:
# Reference: https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
import torch
# Creates once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
   optimizer.zero_grad()
   # Casts operations to mixed precision
   with torch.cuda.amp.autocast():
      loss = model(data)

   # Scales the loss, and calls backward()
   # to create scaled gradients
   scaler.scale(loss).backward()

   # Unscales gradients and calls
   # or skips optimizer.step()
   scaler.step(optimizer)

   # Updates the scale for next iteration
   scaler.update()

<br>

### Dynamic Loss Scaling

Loss Scaling makes mixed precision training very effective. However, it is very difficult to know what scale value is optimal. To address this problem, several open-source projects propose a technique called **Dynamic Loss Scaling**. This is implemented in tools such as NVIDIA’s `amp` and Microsoft’s `deepspeed`.

The idea of Dynamic Loss Scaling is very simple. **The goal is to keep the scale value as large as possible without causing gradient overflow.** Increasing gradient values is generally beneficial, but if they become too large, overflow occurs. Therefore, the scale value is increased to the maximum level that does not cause overflow.

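As a result, training starts with a very large scale value. In the `deepspeed` configuration used later in this session, `initial_scale_power` is set to 32, which gives an initial scale of $2^{32}$. Backpropagation is performed with this scaled loss, and if gradient overflow (inf or NaN) is detected, the optimizer step is skipped and the scale value is halved. Conversely, if no overflow occurs for a given number of consecutive steps (`loss_scale_window`), the scale is increased again. Repeating this process keeps the scale near the largest value that does not cause overflow, which is exactly what Dynamic Loss Scaling does.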
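
The search for a workable scale can be sketched as a small state machine. `DynamicLossScaler` and `overflows_in_fp16` below are hypothetical names written for this illustration, not the `deepspeed` or `apex` implementation; details such as hysteresis are left out, and overflow is simplified to "the scaled gradient exceeds the largest finite FP16 value".

```python
import math

FP16_MAX = 65504.0  # largest finite FP16 value


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical class, not a library API)."""

    def __init__(self, init_scale=2.0**32, factor=2.0, window=1000, min_scale=1.0):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if overflow:
            # Overflow: skip this step and halve the scale.
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            # A full window without overflow: try a larger scale again.
            self.scale *= self.factor
        return True


def overflows_in_fp16(true_grad: float, scale: float) -> bool:
    """Simplified overflow check: the scaled gradient leaves the FP16 range."""
    return abs(true_grad * scale) > FP16_MAX


scaler = DynamicLossScaler(init_scale=2.0**32, window=1000)
true_grad = 3.0  # pretend every step produces this unscaled gradient

skipped = 0
while not scaler.update(overflows_in_fp16(true_grad, scaler.scale)):
    skipped += 1

print(f"skipped {skipped} steps, settled at scale 2**{int(math.log2(scaler.scale))}")
```

Starting very high costs a handful of skipped steps at the beginning of training, which is usually negligible compared with the total number of steps.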


<br>

### AMP (Automatic Mixed Precision) with Apex

`apex` is a library developed by NVIDIA and is one of the most well-known mixed precision libraries. Nowadays, mixed precision functionality is built directly into `torch`, and tools such as DeepSpeed and PyTorch Lightning are widely used, so `apex` is not used as frequently as before. However, it is still a commonly used library. Its usage is very simple, as shown below.


In [None]:
import torch
from apex import amp


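# Example sizes and dummy data (illustrative values)
D_in, D_out = 512, 10
x = torch.randn(4, D_in, device="cuda")
y = torch.randn(4, D_out, device="cuda")

# Declare model and optimizer as usual, with default (FP32) precision
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Allow Amp to perform casts as required by the opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = torch.nn.functional.mse_loss(model(x), y)

# loss.backward() becomes:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()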

Looking at the code above, you can see a parameter called `opt_level`.  
`apex` provides a feature to configure different mixed precision levels, and understanding these levels can be very useful if you ever need to use `apex` in the future.  
(For reference, these are the letter **O** followed by the numbers **0, 1, 2, 3**.)

![](../images/apex.png)

- `O0`: FP32 training  
- `O1`: Operations that are well supported by Tensor Cores run in FP16 / the rest run in FP32  
- `O2`: All parameters are set to FP16 except normalization weights  
- `O3`: Pure FP16 training


<br>

## 2. Zero Redundancy Optimization

By using FP16 and FP32 together, training speed improves significantly, but a drawback appears: **memory usage**.  
Because the FP32 master weights, FP16 parameters, and gradients are all kept on the GPU at the same time, more memory is required compared to before.

![](../images/zero_1.png)

And even if the model parameters themselves exist in FP16, optimization is performed in FP32.  
Therefore, tensors required by adaptive optimizers such as AdaGrad and Adam—such as **variance** and **momentum**—still need to be stored in FP32.

![](../images/adam.png)


In [None]:
"""
Reference: pytorch/torch/optim/adam.py
"""

@torch.no_grad()
def step(self, closure=None):
    """Performs a single optimization step.

    Args:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        with torch.enable_grad():
            loss = closure()

    for group in self.param_groups:
        params_with_grad = []
        grads = []
        exp_avgs = []
        exp_avg_sqs = []
        max_exp_avg_sqs = []
        state_steps = []
        beta1, beta2 = group['betas']

        for p in group['params']:
            if p.grad is not None:
                params_with_grad.append(p)
                if p.grad.is_sparse:
                    raise RuntimeError(
                        'Adam does not support sparse gradients, please consider SparseAdam instead'
                    )
                grads.append(p.grad)

                state = self.state[p]
                # Lazy state initialization
                # For every parameter, tensors `exp_avg` and `exp_avg_sq`
                # of the same size are maintained.
                # Because of this, when using Adam-based optimizers,
                # GPU memory equivalent to **two additional model copies**
                # is required.
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(
                        p, memory_format=torch.preserve_format
                    )
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(
                        p, memory_format=torch.preserve_format
                    )
                    if group['amsgrad']:
                        # Maintains the maximum of all exponential moving
                        # averages of squared gradient values
                        state['max_exp_avg_sq'] = torch.zeros_like(
                            p, memory_format=torch.preserve_format
                        )

                exp_avgs.append(state['exp_avg'])
                exp_avg_sqs.append(state['exp_avg_sq'])

                if group['amsgrad']:
                    max_exp_avg_sqs.append(state['max_exp_avg_sq'])

                # Update the step count for each parameter
                state['step'] += 1
                # Record the step after update
                state_steps.append(state['step'])

        F.adam(
            params_with_grad,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            amsgrad=group['amsgrad'],
            beta1=beta1,
            beta2=beta2,
            lr=group['lr'],
            weight_decay=group['weight_decay'],
            eps=group['eps'],
        )
    return loss


Up to this point, we have examined the different kinds of tensors that are allocated in memory when training a model, such as **FP16 parameters**, **FP16 gradients**, **FP32 parameters**, **FP32 gradients**, **momentum**, **variance**, and so on.
What is surprising is that the *actual model parameters themselves occupy only a small portion* of the total memory. During training, an **enormous number of additional tensors** are allocated in GPU memory beyond the model itself.
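
A back-of-the-envelope calculation makes this concrete. The sketch below follows the accounting used in the ZeRO paper for mixed precision training with Adam (2 + 2 + 12 bytes per parameter); the function name is an assumption for illustration, and FP32 gradient copies, activations, and buffers are left out.

```python
def model_state_bytes(num_params: int) -> dict:
    """Bytes per tensor group for mixed precision training with Adam."""
    return {
        "fp16_params": 2 * num_params,
        "fp16_grads": 2 * num_params,
        "fp32_params": 4 * num_params,  # master weights
        "momentum": 4 * num_params,     # Adam exp_avg
        "variance": 4 * num_params,     # Adam exp_avg_sq
    }


breakdown = model_state_bytes(1_500_000_000)  # a 1.5B-parameter model
total = sum(breakdown.values())

for name, size in breakdown.items():
    print(f"{name:<12} {size / 1e9:5.1f} GB ({100 * size / total:4.1f}%)")
print(f"{'total':<12} {total / 1e9:5.1f} GB")
```

For a 1.5B-parameter model this comes to 24 GB of model states, of which the FP16 parameters used in the forward pass are only 3 GB.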

<br>

![](../images/memory.png)

<br>

In addition, **Data tensors** and **Activation tensors** are also allocated in memory.

* **Data tensors** refer to tensors in token form *before* they are fed into the model.
* **Activation tensors** refer to tensors such as hidden states that are produced during the Forward & Backward passes.

Furthermore, when distributed training is used, extra memory is required for **bucket buffers that temporarily hold tensors during communication**. We already discussed buckets earlier in the Data Parallelism session, for example in the context of *Gradient Bucketing*.

Therefore, it is not enough to parallelize only the **model** and the **data**. We also need to carefully manage **optimizer states** (variance, momentum), as well as **Data & Activation memory**, and even **communication-related buffers**.

<br>

**Zero Redundancy Optimization (ZeRO)** is a collection of **memory optimization techniques** designed to manage all of these components very efficiently. Broadly speaking, ZeRO consists of solutions such as **ZeRO-DP (ZeRO Data Parallelism)** and **ZeRO-R (ZeRO Residual States)**.

From here on, let’s go through them step by step.


<br>

## 3. ZeRO Data Parallelism

First, if we examine memory usage, the left side of the figure above (FP16, FP32, model & optimizer & gradient) occupies the largest space. Therefore, these need to be partitioned and managed efficiently. ZeRO-DP helps with this by enabling these tensors to be partitioned across devices together with Data Parallelism.

![](../images/zero_2.png)

ZeRO-DP is provided in four stages, and it can be selectively applied through the `DeepSpeed` library.

- **Stage 0**:
  - No Partitioning
  - ZeRO-DP is not applied.
- **Stage 1**:
  - Optimizer States Partitioning
  - Optimizer states (momentum, variance) tensors are partitioned across multiple GPUs.
  - ~4× reduction in memory usage
  - Similar amount of communication cost as before
- **Stage 2**:
  - Stage 1 + Gradient partitioning
  - Gradient tensors are partitioned across multiple GPUs.
  - ~2× further reduction in memory usage
  - Similar amount of communication cost as before
- **Stage 3**:
  - Stage 2 + Parameter partitioning
  - Parameter (model) tensors are partitioned across multiple GPUs.
  - Memory usage decreases linearly depending on the partitioning level
  - ~1.5× more communication cost than before
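
The reductions listed for each stage can be reproduced with a short calculation. This is a sketch based on the per-parameter accounting in the ZeRO paper (2 bytes for FP16 parameters, 2 bytes for FP16 gradients, 12 bytes for Adam optimizer states); the function name is an assumption for illustration.

```python
def zero_dp_bytes_per_gpu(num_params: int, num_gpus: int, stage: int) -> float:
    """Model-state memory per GPU for a given ZeRO-DP stage (0 to 3)."""
    params = 2.0 * num_params   # FP16 parameters
    grads = 2.0 * num_params    # FP16 gradients
    optim = 12.0 * num_params   # FP32 parameters, momentum, variance
    if stage >= 1:
        optim /= num_gpus       # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus       # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus      # Stage 3: also partition parameters
    return params + grads + optim


for stage in range(4):
    gb = zero_dp_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With 7.5B parameters on 64 GPUs this gives roughly 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per GPU for stages 0 to 3, which matches the example reported in the ZeRO paper.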
  
Because the operation of ZeRO-DP is quite complex, let’s watch a video to understand it.

https://www.microsoft.com/en-us/research/uploads/prod/2020/02/Turing-Animation.mp4?_=1


In [None]:
from IPython.display import HTML

HTML("""
<div align="middle">
<video width="80%" controls>
      <source src="../images/zero_video.mp4" type="video/mp4">
</video></div>""")


![](../images/zero_3.png)

In conclusion, by applying ZeRO-DP, you can train much larger models on smaller GPUs than before. Let’s try it out right away. First, we create a configuration file. I enabled the learning rate scheduler, fp16, and zero optimization (stage 1). In addition to these, the DeepSpeed configuration provides a wide variety of options. You can find more options at https://www.deepspeed.ai/docs/config-json.


```
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": 300,
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-5,
      "warmup_num_steps": 30
    }
  },
  "fp16": {
    "enabled": true,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 1
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 9999999999
}

```

Then, write the following code. The argument parser must include the options `--local_rank` and `--deepspeed_config`; of these, `--local_rank` is provided automatically by the launcher when the script is executed.


In [None]:
"""
src/ch7_zero_redundancy_optimization/zero_dp_args.py
"""
from argparse import ArgumentParser
from datasets import load_dataset
from torch.optim import Adam
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import deepspeed
import torch.distributed as dist

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

parser = ArgumentParser()
parser.add_argument(
    "--deepspeed_config", default="../src/zero_dp_config.json", type=str
)
parser.add_argument("--local_rank", default=0, type=int)
args = parser.parse_args()

optimizer = Adam(model.parameters(), lr=3e-5, weight_decay=3e-7)

engine, optimizer, _, scheduler = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
)

datasets = load_dataset("squad").data["train"]["context"]
datasets = [str(sample) for sample in datasets]
data_loader = DataLoader(datasets, batch_size=8, num_workers=8)

for i, data in enumerate(data_loader):
    tokens = tokenizer(
        data,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=1024,
    )

    loss = engine(
        input_ids=tokens.input_ids.cuda(),
        attention_mask=tokens.attention_mask.cuda(),
        labels=tokens.input_ids.cuda(),
    ).loss

    engine.backward(loss)
    engine.step()

    if i % 10 == 0 and dist.get_rank() == 0:
        print(f"step:{i}, loss:{loss}")

    if i >= 300:
        break


In [None]:
!deepspeed --num_gpus=4 ../src/zero_args.py --deepspeed_config=../src/ch7_zero_redundancy_optimization/zero_dp_config.json

Or, you can directly pass the configuration into deepspeed.initialize().

In [None]:
"""
src/ch7_zero_redundancy_optimization/zero_config.py
"""
from datasets import load_dataset
from torch.optim import Adam
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import deepspeed
import torch.distributed as dist

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
optimizer = Adam(model.parameters(), lr=3e-5, weight_decay=3e-7)

engine, optimizer, _, scheduler = deepspeed.initialize(
    optimizer=optimizer,
    model=model,
    config={
        "train_batch_size": 16,
        "gradient_accumulation_steps": 1,
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "total_num_steps": 300,
                "warmup_min_lr": 0,
                "warmup_max_lr": 3e-5,
                "warmup_num_steps": 30,
            },
        },
        "fp16": {
            "enabled": True,
            "initial_scale_power": 32,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        },
        "zero_optimization": {
            "stage": 1,
            "allgather_partitions": True,
            "allgather_bucket_size": 5e8,
            "overlap_comm": False,
            "reduce_scatter": True,
            "reduce_bucket_size": 5e8,
            "contiguous_gradients": True,
        },
        "zero_allow_untested_optimizer": True,
        "wall_clock_breakdown": False,
        "steps_per_print": 9999999999,
    },
)

datasets = load_dataset("squad").data["train"]["context"]
datasets = [str(sample) for sample in datasets]
data_loader = DataLoader(datasets, batch_size=8, num_workers=8)

for i, data in enumerate(data_loader):
    tokens = tokenizer(
        data,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=1024,
    )

    loss = engine(
        input_ids=tokens.input_ids.cuda(),
        attention_mask=tokens.attention_mask.cuda(),
        labels=tokens.input_ids.cuda(),
    ).loss

    engine.backward(loss)
    engine.step()

    if i % 10 == 0 and dist.get_rank() == 0:
        print(f"step:{i}, loss:{loss}")

    if i >= 300:
        break


In [None]:
!deepspeed --num_gpus=4 ../src/ch7_zero_redundancy_optimization/zero_config.py

<br>

## 4. Activation Checkpointing

In addition to FP16 and FP32 model parameters, gradients, and optimizer states, another large memory region is **Activation Memory**.  
Activations refer to the input tensors that are multiplied by the model weights.

For example, consider the following neural network:

$y = w_1 \cdot (w_2 \cdot x)$

In this case:
- The tensor `x`, which is multiplied by `w_2`
- The tensor `w_2 · x`, which is multiplied by `w_1`

are both stored as **Activation Memory**.

These activation tensors must be kept during the forward pass because they are required later to compute gradients during the backward pass.


In [None]:
"""
Reference: https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html
"""

import torch


class ReLU(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # the input tensor is saved here for use in the backward pass
        
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

![](../images/max_pooling.png)

In the Pipeline Parallelism session, we mentioned that during the backward pass, activation tensors used in the forward pass must be stored.  
As shown above, in order to compute gradients for a Max Pooling layer, the original positions of the pooled values are required. Therefore, the input tensor from the forward pass must be preserved.

Likewise, if you look at the implementation of `ReLU`, you can see that the `input` tensor is saved using `ctx.save_for_backward`.

![](../images/checkpoint_full_act.gif)

**In other words, to perform the backward step, inputs from the forward step must be stored.**  
The animation above illustrates this process.

However, storing activations everywhere significantly increases memory consumption.

![](../images/checkpoint_no_act.gif)

If activation tensors are not stored, memory usage can be greatly reduced.  
But in that case, as shown above, the forward computation must be executed again during the backward pass to recompute the activations.

Activation Checkpointing combines the advantages of both approaches by storing activations only at selected points.

![](../images/checkpoint_act.gif)

By storing activations only at intermediate checkpoints, we avoid recomputing the forward pass from the very beginning.  
Instead, the forward computation resumes from the nearest checkpoint, reducing computation time.  
At the same time, since most activations are discarded, memory consumption is significantly reduced.

This technique—**saving activations at intermediate points and recomputing forward passes from those checkpoints when needed**—is called **Activation Checkpointing**.
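
The mechanism can be sketched without any framework. The toy network below is a chain of scalar layers `tanh(w * x)`; all function names are assumptions for illustration, and the "peak" counter is only a rough count of stored layer inputs, not a real memory measurement.

```python
import math


def layer_forward(w, x):
    return math.tanh(w * x)


def layer_backward(w, x, grad_out):
    """Return (grad w.r.t. input, grad w.r.t. weight); needs the layer input x."""
    d = 1.0 - math.tanh(w * x) ** 2
    return grad_out * d * w, grad_out * d * x


def backward_full(weights, x0):
    """Store every layer input during forward, then backpropagate."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad, grads_w = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grads_w[i] = layer_backward(weights[i], inputs[i], grad)
    return grads_w, len(inputs)


def backward_checkpointed(weights, x0, segment_size):
    """Store only segment inputs; recompute each segment during backward."""
    n = len(weights)
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment_size == 0:
            checkpoints[i] = x  # keep only the input of each segment
        x = layer_forward(w, x)

    grad, grads_w, peak = 1.0, [0.0] * n, 0
    for start in reversed(range(0, n, segment_size)):
        end = min(start + segment_size, n)
        # Re-run the forward pass of this segment from its checkpoint.
        inputs, x = [], checkpoints[start]
        for i in range(start, end):
            inputs.append(x)
            x = layer_forward(weights[i], x)
        peak = max(peak, len(checkpoints) + len(inputs))
        for i in reversed(range(start, end)):
            grad, grads_w[i] = layer_backward(weights[i], inputs[i - start], grad)
    return grads_w, peak


weights = [0.5 + 0.05 * i for i in range(16)]
full_grads, full_stored = backward_full(weights, 0.3)
ckpt_grads, ckpt_peak = backward_checkpointed(weights, 0.3, segment_size=4)

print(f"stored inputs: full={full_stored}, checkpointed peak={ckpt_peak}")
print(max(abs(a - b) for a, b in zip(full_grads, ckpt_grads)))
```

The gradients are the same in both versions; the checkpointed version holds fewer tensors at once and pays for it with one extra forward computation per segment.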

PyTorch already provides built-in support for checkpointing.  
Let’s try it out using PyTorch.


In [None]:
"""
src/ch7_zero_redundancy_optimization/checkpointing.py
"""
from torch import nn
from torch.utils.checkpoint import checkpoint
from transformers import BertTokenizer, BertLayer, BertConfig

config = BertConfig.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer("Hello I am Kevin", return_tensors="pt")

embedding = nn.Embedding(tokenizer.vocab_size, config.hidden_size)
layers = nn.ModuleList([BertLayer(config) for _ in range(6)])

hidden_states = embedding(tokens.input_ids)
attention_mask = tokens.attention_mask

for i, layer_module in enumerate(layers):
    layer_outputs = checkpoint(
        layer_module,
        hidden_states,
        attention_mask,
    )

    hidden_states = layer_outputs[0]

print(f"output: {hidden_states}")

Usage is very simple. As shown in the example above, you only need to **replace a call such as `module(a, b, c)` with `checkpoint(module, a, b, c)`**, and that’s it.

In addition, most models in Hugging Face `transformers`, which we frequently use, already include built-in support for Activation Checkpointing.  
You can **enable or disable it simply by calling**:

- `model.gradient_checkpointing_enable()`
- `model.gradient_checkpointing_disable()`

Isn’t that incredibly easy?


<br>

## 5. ZeRO-R

ZeRO-R is a collection of techniques designed to highly optimize areas such as **Activation Memory** and **Communication Buckets**.

![](../images/zero_r_1.png)

In the previous chapter, we improved **model state memory** (FP16 & FP32 parameters, gradients, optimizer states) efficiently using **ZeRO-DP**.  
ZeRO-R proposes the following three additional solutions:

- **Activation Memory Partitioning**
- **Constant Size Buffer**
- **Memory Defragmentation**

Let’s go through each one.

<br>

### 1) Activation Memory Partitioning

<br>

![](../images/zero_r_2.png)

Although Activation Checkpointing reduces activation memory at the cost of some recomputation, the checkpoints themselves can still cause significant memory issues when training very large models.  
This is especially true when combined with model parallelism, where each GPU keeps its own replicated copy of the same activation tensors after the forward pass.

**ZeRO-R addresses this by partitioning the activation checkpoints across the model-parallel GPUs and All-gathering them only when they are needed again during the backward pass.**  
Additionally, extremely large activations can be checkpointed to **CPU RAM**, slightly sacrificing speed but significantly saving GPU memory.

<br>

### 2) Constant Size Buffer

Constant Size Buffer refers to a technique that **keeps the size of communication buckets (used in All-reduce, All-gather, etc.) constant**.

In general, as the model grows larger, it is beneficial for communication buckets to grow as well.  
However, when models become extremely large, bucket sizes can grow too much and occupy a significant portion of GPU memory.

To address this, **an upper limit is imposed on bucket sizes so they cannot grow beyond a fixed maximum value**.  
Once the bucket size reaches a certain threshold, maintaining it at that size is sufficient to achieve good efficiency without further growth.
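
A minimal sketch of the idea, assuming a flat tensor that is communicated in chunks: the number of chunks grows with the model, while the buffer that holds one chunk stays capped. The helper `bucket_ranges` is a hypothetical name, and the bucket size of `5e8` elements mirrors the `reduce_bucket_size` value used in the configuration in this session.

```python
def bucket_ranges(num_elements: int, bucket_size: int) -> list:
    """Split a flat tensor of `num_elements` into chunks of at most `bucket_size`."""
    return [
        (start, min(start + bucket_size, num_elements))
        for start in range(0, num_elements, bucket_size)
    ]


bucket_size = int(5e8)  # elements per bucket

for num_elements in (300_000_000, 1_200_000_000, 12_000_000_000):
    buckets = bucket_ranges(num_elements, bucket_size)
    buffer_elems = min(bucket_size, num_elements)
    print(f"{num_elements:>14,} elements -> {len(buckets):>2} buckets, "
          f"buffer holds at most {buffer_elems:,} elements")
```

A larger model simply needs more communication rounds; the temporary buffer no longer scales with the model size.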

<br>

### 3) Memory Defragmentation (Contiguous Checkpointing)

![](../images/zero_r_3.png)

During training, tensors are frequently created and destroyed, which causes **GPU memory fragmentation** to occur very often.  
In some cases, even when there is enough total GPU memory available, fragmentation prevents large contiguous tensors from being allocated.

To solve this, ZeRO-R **pre-allocates contiguous empty memory regions** that can hold activations, gradients, and similar tensors.  
When tensors of similar sizes are created, they are **moved into these reserved regions**, minimizing fragmentation as much as possible.


Just like ZeRO-DP, you only need to write a simple configuration.

- **Constant Size Buffer**
  - The maximum bucket size is determined using `allgather_bucket_size` and `reduce_bucket_size`.
- **Activation Memory**
  - Activation memory is partitioned across GPUs using `partition_activations`.
  - Extremely large activation tensors are offloaded to the CPU using `cpu_checkpointing`.
- **Memory Defragmentation**
  - Memory fragmentation is mitigated using `contiguous_memory_optimization`.

In addition to these techniques, many more optimizations exist.  
If you want to learn more about the available options in detail, please refer to the paper and the official documentation.




```
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": 300,
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-5,
      "warmup_num_steps": 30
    }
  },
  "fp16": {
    "enabled": true,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 9999999999
}
```
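
As a rough sanity check on the bucket sizes in this config, the snippet below converts a bucket size into bytes. It assumes that the bucket size counts tensor elements and that each element is an FP16 value of 2 bytes; `bucket_bytes` is a hypothetical helper written for this illustration.

```python
def bucket_bytes(num_elements, bytes_per_element=2):
    """Approximate buffer size in bytes for a bucket of FP16 elements."""
    return int(num_elements) * bytes_per_element


size = bucket_bytes(5e8)
print(f"{size / 2**30:.2f} GiB")  # prints "0.93 GiB"
```

Under these assumptions, each `5e8` bucket needs a little under 1 GiB of buffer space, which is why smaller bucket sizes are often chosen when GPU memory is tight.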

In [None]:
!deepspeed --num_gpus=4 ../src/zero_args.py --deepspeed_config=../src/ch7_zero_redundancy_optimization/zero_r_config.json

<br>

## 5. ZeRO Offload

In the previous section, we saw that **Activation Memory Partitioning** included the ability to offload very large activation tensors to the CPU.  
**ZeRO Offload**, which is a successor to ZeRO-R, goes one step further by **offloading parts of the model itself to CPU RAM**, effectively breaking the GPU memory capacity limit.

The core idea of ZeRO Offload is as follows.

![](../images/zero_off_1.png)

#### GPU-side
- FP16 parameters and gradients reside on the GPU.
- Forward and backward passes are executed on the GPU (because they involve heavy computation).

#### CPU-side
- FP32 parameters, gradients, and optimizer states reside on the CPU.
- Weight updates are performed on the CPU (because they are relatively lightweight operations).
- In particular, a highly optimized **CPU Adam optimizer** is implemented to run very efficiently on the CPU.

In general, CPU computation is tens of times slower than GPU computation.  
Heavy computation therefore has to stay on the GPU, which is why the forward and backward passes are executed there.

On the other hand, a large portion of GPU memory is occupied by FP32 parameters, gradients, and optimizer states, even though they are only used in the relatively low-cost **weight update** step.

![](../images/ddp_analysis_3.png)

By offloading all FP32 components to the CPU, the GPU is left with only the FP16 tensors that are strictly necessary for GPU computation, which frees a substantial amount of GPU memory.
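
To get a feel for the numbers, here is a back-of-the-envelope sketch. It assumes the common accounting for mixed-precision Adam of 16 bytes per parameter (2 for FP16 parameters, 2 for FP16 gradients, and 12 for FP32 parameters, momentum, and variance) and ignores activations, temporary buffers, and any FP32 gradient copy. The function name and the 1.5B-parameter model size are illustrative choices, not values from this tutorial.

```python
def offload_memory_split(num_params):
    """Estimate bytes kept on the GPU and moved to the CPU (rough sketch)."""
    fp16_params = 2 * num_params   # used by the forward and backward passes
    fp16_grads = 2 * num_params    # produced by the backward pass
    fp32_states = 12 * num_params  # FP32 params + Adam momentum + variance
    return {"gpu": fp16_params + fp16_grads, "cpu": fp32_states}


split = offload_memory_split(1_500_000_000)  # hypothetical 1.5B-parameter model
print(f"GPU: {split['gpu'] / 1e9:.1f} GB, CPU: {split['cpu'] / 1e9:.1f} GB")
```

Under these assumptions, three quarters of the model-state memory moves off the GPU.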

<br>

### DPU: Delayed Parameter Update

![](../images/zero_off_2.png)

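If data transfer to the CPU starts only after all forward and backward computations on the GPU are finished, the GPU must remain idle during the transfer and the CPU-side update.  
ZeRO Offload introduces a technique called **Delayed Parameter Update (DPU)**: the parameter update is applied one step late, so the CPU-side update and the CPU-GPU transfers can run while the GPU is already computing the next step. Similar to Gradient Bucketing in DDP, this **overlaps communication and computation to reduce overall execution time**.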

![](../images/zero_off_3.png)

Experimental results show that applying DPU does not negatively impact model quality and can slightly improve training speed.
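
The effect of overlapping can be sketched with a very simple timing model. This is an idealized illustration, not a measurement: the function `step_time` and all durations are made up, and perfect overlap is assumed, which real systems do not reach.

```python
def step_time(gpu_compute, transfer, cpu_update, delayed_update):
    """Estimated duration of one training step in arbitrary time units."""
    offload_work = transfer + cpu_update
    if delayed_update:
        # The CPU processes the previous step's gradients while the GPU
        # computes the current step, so the slower side sets the pace.
        return max(gpu_compute, offload_work)
    # Without DPU the GPU waits for the transfer and the CPU update.
    return gpu_compute + offload_work


baseline = step_time(10.0, 2.0, 3.0, delayed_update=False)
with_dpu = step_time(10.0, 2.0, 3.0, delayed_update=True)
print(baseline, with_dpu)  # prints "15.0 10.0"
```

In this toy model the CPU work is completely hidden behind GPU computation as long as it takes less time than the forward and backward passes.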

<br>

### ZeRO Offload + ZeRO-DP

![](../images/zero_off_4.png)

ZeRO Offload can be combined with **ZeRO-DP**.  
If optimizer states and gradients are offloaded to the CPU while ZeRO-DP is enabled, the system takes the form shown above.

Note that the combination of ZeRO-DP and Offload is supported starting from **ZeRO stage 2**, and to offload model parameters as well, the ZeRO stage must be set to **stage 3**.

- **ZeRO stage 2**: Optimizer States Offload
- **ZeRO stage 3**: Optimizer States + Parameter Offload
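
The stage requirements above can be written down as a small check. This is a hypothetical helper that only encodes the rule as stated in this section; it is not part of DeepSpeed, and the exact requirements may differ between DeepSpeed versions.

```python
def check_offload_config(zero_cfg):
    """Return a list of problems found in a `zero_optimization` dict."""
    stage = zero_cfg.get("stage", 0)
    problems = []
    if "offload_optimizer" in zero_cfg and stage < 2:
        problems.append("offload_optimizer needs ZeRO stage 2 or higher")
    if "offload_param" in zero_cfg and stage < 3:
        problems.append("offload_param needs ZeRO stage 3")
    return problems


stage2_cfg = {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},  # not allowed at stage 2
}
print(check_offload_config(stage2_cfg))
```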

<br>

### CPU Adam

The standard Adam optimizer is highly optimized for GPUs, so running it on the CPU is generally slower.  
To address this, ZeRO Offload provides a **CPU Adam optimizer** that applies various optimization techniques to run very efficiently on the CPU.

The implementation of CPU Adam is closer to computer architecture and operating systems than to machine learning or distributed systems, so it is not covered in detail here.  
For more information, please refer to the paper.  
(Honestly, I also skipped a deep dive into this part. If anyone has studied it in depth, please let me know by opening an issue.)

![](../images/cpu_adam.png)
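
For reference, the weight update that gets offloaded is the ordinary Adam step. The sketch below shows that step for a single scalar parameter in plain Python. It is the textbook Adam formula and says nothing about how the CPU Adam optimizer makes it fast; `adam_step` and its default hyperparameters are illustrative.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment (variance)
    m_hat = m / (1 - beta1 ** step)            # bias correction
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=0.1)
print(param)
```

Each parameter needs only a handful of arithmetic operations per step, which is why this part is a reasonable candidate for running on the CPU.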


Let’s try **ZeRO Offload** in practice.  
As before, we first modify the configuration.

To offload both the **optimizer** and **parameters**, the ZeRO stage is set to **stage 3**, and the options `offload_param` and `offload_optimizer` are added.


```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": 300,
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-5,
      "warmup_num_steps": 30
    }
  },
  "fp16": {
    "enabled": true,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 9999999999
}
```

In [None]:
!deepspeed --num_gpus=4 ../src/zero_args.py --deepspeed_config=../src/zero_off_config.json

<br>

## 6. ZeRO Infinity

ZeRO Infinity stores parameters on **NVMe (SSD) storage**.  
Because NVMe drives offer far more capacity than even CPU RAM, this approach pushes the memory limit even further.  
The ZeRO Infinity algorithm is also very complex, so let’s watch a video to understand it:

https://www.microsoft.com/en-us/research/uploads/prod/2021/04/1400x788_deepspeed_nologo-1.mp4


In [None]:
from IPython.display import HTML

HTML("""
<div align="middle">
<video width="80%" controls>
      <source src="../images/zero_infinity.mp4" type="video/mp4">
</video></div>""")


### Core Idea of ZeRO Infinity

ZeRO Infinity is an extension of **ZeRO Offload**.  
Previously, ZeRO Offload used CPU RAM and GPU VRAM as follows:

- **GPU**: FP16 parameters and gradients reside on the GPU, performing **forward and backward passes**.
- **CPU**: FP32 parameters, gradients, and optimizer states reside on the CPU, performing **weight updates**.

ZeRO Infinity introduces **NVMe storage**, resulting in the use of three types of devices. The usage is as follows:

- **NVMe**: By default, all parameters reside on NVMe when they are not in use.
- **GPU**: FP16 parameters and gradients are loaded from NVMe to the GPU **only when forward and backward passes need to be performed**.
- **CPU**: FP32 parameters, gradients, and optimizer states are loaded from NVMe to the CPU **only when weight updates need to be performed**.

In short, **all tensors are stored on NVMe by default and are moved to CPU or GPU only when needed for computation**.
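
This placement policy can be mimicked with a toy bookkeeping class. It is only a sketch of the idea: `TieredStore` is a hypothetical helper that tracks where each named tensor lives and moves no real data.

```python
from contextlib import contextmanager


class TieredStore:
    """Tracks where each named tensor currently lives (toy model)."""

    def __init__(self, names):
        # Everything starts on NVMe, the default home of all tensors.
        self.location = {name: "nvme" for name in names}

    @contextmanager
    def on(self, name, device):
        """Temporarily place `name` on `device` for the duration of a computation."""
        if device not in ("cpu", "gpu"):
            raise ValueError(f"unknown device: {device}")
        self.location[name] = device
        try:
            yield name
        finally:
            self.location[name] = "nvme"  # return to NVMe when done


store = TieredStore(["fp16_params", "fp32_optimizer_states"])
with store.on("fp16_params", "gpu"):
    during = dict(store.location)  # snapshot taken during forward/backward
after = dict(store.location)
print(during)
print(after)
```

While the forward and backward passes run, only the FP16 parameters sit on the GPU; once the block exits, everything is back on NVMe.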

<br>
    
### Comparison with Offload & ZeRO-DP

![](../images/zero_infinity.png)

Since ZeRO Infinity stores almost all tensors on NVMe, both CPU and GPU memory remain nearly empty.  
As shown above, this makes it possible to train models that were too large for the previous techniques.

According to experimental results, ZeRO Infinity was also **faster than ZeRO Offload**.

ZeRO Infinity requires devices equipped with NVMe storage, so we will not run a hands-on experiment in this tutorial.  
If you have an NVMe-enabled system, you can enable ZeRO Infinity by setting the device of `offload_param` and `offload_optimizer` to `nvme` and configuring the `nvme_path` appropriately, as shown below:

```json
"offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
},
"offload_optimizer": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
}
