Support bf16 optimizer states with CPU offload#8010

Merged
tohtana merged 5 commits into
deepspeedai:masterfrom
lucaspirola:feature/bf16-cpu-offload-optimizer-states
May 20, 2026

Conversation

@lucaspirola
Contributor

What

DeepSpeed's bf16_optimizer_states option (in the bf16 config section) and
offload_optimizer are currently mutually exclusive, per the support matrix in
docs/_pages/config-json.md:

|            | bf16_optimizer_states=false | bf16_optimizer_states=true |
|------------|-----------------------------|----------------------------|
| ZeRO 1/2/3 | requires ZeRO-Offload + DeepSpeedCPUAdam; states fp32, on CPU | supported without offload; states bf16, on GPU |

This PR fills the missing cell: bf16_optimizer_states=true together with
offload_optimizer: {device: cpu} for ZeRO 1/2/3 — Adam moments held in bf16
and offloaded to CPU host RAM. That reduces the offloaded optimizer state
from ~10 to ~6 bytes/param (bf16 master + two bf16 moments) with no added GPU
memory.
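As a rough illustration, the combination this PR enables might be configured as in the sketch below, written as a Python dict rather than a JSON file. The names bf16_optimizer_states, bf16_master_weights_and_grads and offload_optimizer come from this description; the choice of stage 3, the placement of bf16_master_weights_and_grads under the bf16 section, and the helper uses_bf16_cpu_offload are assumptions made for the example. Unrelated required keys (batch size, optimizer, and so on) are omitted.

```python
# Sketch of a DeepSpeed config fragment: bf16 optimizer states + CPU offload.
# Stage 3 is an arbitrary choice; the PR targets ZeRO stages 1, 2 and 3.
ds_config = {
    "bf16": {
        "enabled": True,
        "bf16_master_weights_and_grads": True,  # required by bf16_optimizer_states
        "bf16_optimizer_states": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
}


def uses_bf16_cpu_offload(cfg: dict) -> bool:
    """Hypothetical check: does cfg request bf16 states offloaded to CPU?"""
    bf16_states = cfg.get("bf16", {}).get("bf16_optimizer_states", False)
    offload = cfg.get("zero_optimization", {}).get("offload_optimizer", {})
    return bool(bf16_states) and offload.get("device") == "cpu"


print(uses_bf16_cpu_offload(ds_config))
```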

Why

CPU offload currently forces fp32 optimizer states; for large models the
offloaded optimizer state dominates host RAM. Keeping the moments in bf16
(matching the already-bf16 master weights) cuts that footprint substantially
while keeping the state off the GPU.
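The ~10 and ~6 bytes/param figures can be reproduced with a small calculation. The 6-byte breakdown (bf16 master + two bf16 moments) is stated above; reading the 10-byte baseline as a bf16 master plus two fp32 moments is an inference, and the sketch counts only the master weight and the two Adam moments.

```python
# Bytes per element for each dtype.
BYTES = {"fp32": 4, "bf16": 2}


def offloaded_bytes_per_param(master: str, moments: str) -> int:
    """Master weight plus the two Adam moments (momentum and variance)."""
    return BYTES[master] + 2 * BYTES[moments]


before = offloaded_bytes_per_param(master="bf16", moments="fp32")  # 2 + 4 + 4
after = offloaded_bytes_per_param(master="bf16", moments="bf16")   # 2 + 2 + 2
print(before, after)  # prints "10 6"
```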

How

DeepSpeedCPUAdam already supports bf16 momentum/variance through its
fp32_optimizer_states constructor flag — the feature was simply not wired up.
No C++/CUDA kernel changes.

  • engine.py: _configure_basic_optimizer builds DeepSpeedCPUAdam /
    ZenFlowCPUAdam with fp32_optimizer_states=False when bf16_optimizer_states
    is set (a user-supplied value is popped to avoid a keyword clash, and overridden
    with a warning).
  • base_optimizer.py: _configure_master_weights runs the offload +
    DeepSpeedCPUAdam validator whenever offload is configured (not only for the
    fp32-states case), and asserts a user-provided optimizer actually stores bf16
    moments.
  • stage3.py / stage_1_and_2.py: pass offload_enabled through.
  • config-json.md: updated bf16 support matrix.
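The pop-and-override step described for engine.py could look roughly like the sketch below. It is not the code from this PR: resolve_cpu_adam_kwargs is a hypothetical helper, and honoring a user-supplied value when bf16_optimizer_states is off is an assumption. Only the fp32_optimizer_states keyword and the default of True are taken from the description.

```python
import warnings
from typing import Any


def resolve_cpu_adam_kwargs(optimizer_params: dict[str, Any],
                            bf16_optimizer_states: bool) -> dict[str, Any]:
    """Hypothetical helper: derive keyword arguments for a CPU Adam constructor."""
    kwargs = dict(optimizer_params)  # copy so the caller's dict is not mutated
    # Pop any user-supplied value so it cannot clash with the one set below.
    user_value = kwargs.pop("fp32_optimizer_states", None)
    if bf16_optimizer_states:
        if user_value:
            warnings.warn("fp32_optimizer_states=True is overridden because "
                          "bf16_optimizer_states is set")
        kwargs["fp32_optimizer_states"] = False
    else:
        # Assumption: keep the user's value, defaulting to fp32 states.
        kwargs["fp32_optimizer_states"] = True if user_value is None else user_value
    return kwargs


print(resolve_cpu_adam_kwargs({"lr": 1e-3}, bf16_optimizer_states=True))
```

Copying the parameter dict before popping keeps the override local, so the same user config can be reused elsewhere unchanged.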

Backward compatibility

false+offload and true+no-offload configs are unaffected: the default
resolves to fp32_optimizer_states=True (prior behavior), and the no-offload
bf16-states path (FusedAdam on GPU) is untouched.

Numerics

bf16_optimizer_states continues to require bf16_master_weights_and_grads, so
master weights are bf16 — identical precision to the existing on-GPU bf16-states
path. CPU Adam computes updates in fp32 internally and rounds moments to bf16
(round-to-nearest-even), matching that path.
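For readers unfamiliar with the rounding mode, here is a minimal pure-Python sketch of fp32-to-bf16 round-to-nearest-even on raw bit patterns. It illustrates the rounding rule only and is not the CPU Adam kernel's code; the function names are made up for the example, and NaN/infinity handling is omitted.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round an fp32 value to bf16 (round-to-nearest-even); returns 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1        # lowest bit that survives truncation
    rounded = bits + 0x7FFF + lsb  # exact ties round toward an even result
    return (rounded >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen a bf16 bit pattern back to a Python float via fp32."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


# 1 + 2**-8 lies exactly halfway between bf16 neighbours 1.0 and 1 + 2**-7;
# the tie goes to the neighbour with an even mantissa, which is 1.0.
print(bf16_bits_to_float(fp32_to_bf16_bits(1.0 + 2**-8)))  # prints 1.0
```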

Testing

  • tests/unit/ops/adam/test_cpu_adam.py: DeepSpeedCPUAdam bf16 moment
    allocation + fp32 parity.
  • tests/unit/v1/half_precision/test_bf16.py: bf16_optimizer_states + CPU
    offload across ZeRO 1/2/3 (extends TestBF16MasterWeightsGradients), plus a
    guard test that a user-provided DeepSpeedCPUAdam must opt into bf16 moments.

All new and affected existing tests pass; TestBF16MasterWeightsGradients
(9 cases) was verified on a 2-GPU host.

🤖 Generated with Claude Code

The bf16 option `bf16_optimizer_states` and `offload_optimizer` were effectively
mutually exclusive: bf16 optimizer states were only realized on the GPU,
and CPU-offloaded optimizer states were forced to fp32. This allows
`bf16_optimizer_states=true` together with `offload_optimizer: {device:
cpu}` for ZeRO 1/2/3, so the Adam moments are held in bf16 *and*
offloaded to CPU host RAM -- taking the offloaded optimizer state from
~10 to ~6 bytes/param with no added GPU memory.

The CPU Adam kernel already supports bf16 momentum/variance via
DeepSpeedCPUAdam's `fp32_optimizer_states` flag; the feature was only
unwired. Changes:

- engine.py: build DeepSpeedCPUAdam/ZenFlowCPUAdam with
  fp32_optimizer_states=False when bf16_optimizer_states is set.
- base_optimizer.py: run the offload + DeepSpeedCPUAdam validator
  whenever offload is configured, and assert the offloaded optimizer
  stores bf16 moments.
- stage3.py / stage_1_and_2.py: pass offload_enabled through.
- config-json.md: update the bf16 support matrix.
- tests: cover bf16 CPU-offloaded optimizer states for ZeRO 1/2/3 and
  DeepSpeedCPUAdam bf16 moment allocation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lucas Pirola <lucas@pirola.eu>
lucaspirola added a commit to lucaspirola/kdr that referenced this pull request May 17, 2026
Add ds_configs/zero3_offload_bf16_optim.json: ZeRO-3 with bf16 optimizer
states offloaded to CPU host RAM (~151 GB for the 25.2B model, vs ~302 GB
for the fp32 zero3_offload_optim.json). Pairs with the configs' existing
`optimizer: deepspeed_cpu_adam` — no kdr code change.
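The quoted host-RAM figures are consistent with 6 and 12 bytes per parameter in decimal gigabytes, as the sketch below shows. Reading the fp32 config as 12 bytes/param (fp32 master + two fp32 moments) is an inference from the numbers, not something the commit message states.

```python
def optimizer_state_gb(num_params: float, bytes_per_param: int) -> float:
    """Offloaded optimizer state in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 25.2e9  # the 25.2B model mentioned above
print(round(optimizer_state_gb(num_params, 6), 1))   # bf16 states: 151.2
print(round(optimizer_state_gb(num_params, 12), 1))  # fp32 states: 302.4
```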

Requires a DeepSpeed build with deepspeedai/DeepSpeed#8010, which composes
`bf16_optimizer_states` with `offload_optimizer.device=cpu` (previously
mutually exclusive). The config documents the #8010 dependency and the
fp32 fallback. gradient_clipping hardcoded 0.5 to match the gemma-4 /
profile-J grad_clip_norm.

Both gemma-4 config headers updated to reference it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lucaspirola added a commit to lucaspirola/kdr that referenced this pull request May 18, 2026
…or #8010

The zero3_offload_bf16_optim.json config sets bf16.bf16_optimizer_states
(deepspeedai/DeepSpeed#8010 — bf16 optimizer moments offloaded to host
RAM). DeepSpeed's _configure_master_weights then asserts the offloaded
DeepSpeedCPUAdam was constructed with fp32_optimizer_states=False, so
its moments are actually stored in bf16. kdr's build_optimizer never
passed it → AssertionError at deepspeed.initialize.

Add wants_bf16_optimizer_states(accelerator): reads
bf16.bf16_optimizer_states from the active ds_config. build_optimizer gains a matching
bf16_optimizer_states flag and passes fp32_optimizer_states=False to
DeepSpeedCPUAdam only when set — a standard (non-#8010) DeepSpeed build
has no such kwarg and is paired with the fp32-states fallback config.
run_recovery wires the flag from the ds_config.
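A simplified sketch of the pattern described in this commit follows. The real wants_bf16_optimizer_states takes an accelerator; this variant takes a plain config dict instead, and cpu_adam_kwargs is a hypothetical stand-in for the relevant part of build_optimizer.

```python
from typing import Any


def wants_bf16_optimizer_states(ds_config: dict[str, Any]) -> bool:
    """True when the config sets bf16.bf16_optimizer_states (dict variant)."""
    return bool(ds_config.get("bf16", {}).get("bf16_optimizer_states", False))


def cpu_adam_kwargs(ds_config: dict[str, Any], lr: float) -> dict[str, Any]:
    """Hypothetical: add fp32_optimizer_states=False only when it is requested."""
    kwargs: dict[str, Any] = {"lr": lr}
    if wants_bf16_optimizer_states(ds_config):
        kwargs["fp32_optimizer_states"] = False
    return kwargs


print(cpu_adam_kwargs({"bf16": {"bf16_optimizer_states": True}}, lr=1e-4))
```

Leaving the keyword out entirely when the flag is unset means the same call works whether or not the optimizer accepts it.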

Fixes probe bug #6.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collaborator

@tohtana tohtana left a comment

This looks good to me, thank you @lucaspirola!
When I developed the bf16_optimizer_states feature, I wasn't able to include CPU offload due to a lack of bandwidth. I really appreciate you completing it.

@tohtana
Collaborator

tohtana commented May 19, 2026

It seems that our CI has an issue with the combination of PyTorch and transformers versions. Let me take a look at it.

@tohtana tohtana merged commit 3c337b5 into deepspeedai:master May 20, 2026
2 of 3 checks passed