Support bf16 optimizer states with CPU offload by lucaspirola · Pull Request #8010 · deepspeedai/DeepSpeed

lucaspirola · 2026-05-17T18:56:55Z

What

The bf16_optimizer_states option in DeepSpeed's bf16 config and offload_optimizer
are currently mutually exclusive, per the support matrix in docs/_pages/config-json.md:

| ZeRO 1/2/3 | `bf16_optimizer_states=false` | `bf16_optimizer_states=true` |
|---|---|---|
| | requires ZeRO-Offload + `DeepSpeedCPUAdam`; states fp32, on CPU | supported without offload; states bf16, on GPU |

This PR fills the missing cell: bf16_optimizer_states=true together with
offload_optimizer: {device: cpu} for ZeRO 1/2/3, with Adam moments held in bf16
and offloaded to CPU host RAM. That reduces the offloaded optimizer state
from ~10 bytes/param (bf16 master + two fp32 moments) to ~6 bytes/param
(bf16 master + two bf16 moments) with no added GPU memory.
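For illustration, a config exercising the new combination might look like the sketch below, written as a Python dict so it can be passed to deepspeed.initialize or dumped to JSON. Only the keys named in this PR (bf16_optimizer_states, bf16_master_weights_and_grads, offload_optimizer.device) plus the standard bf16.enabled and zero_optimization.stage keys are shown; batch size, optimizer, and other required settings are omitted and would need to be filled in for a real run.

```python
import json

# Minimal sketch of the relevant config fragment (not a complete DeepSpeed config).
ds_config = {
    "bf16": {
        "enabled": True,
        # bf16_optimizer_states requires bf16 master weights and grads.
        "bf16_master_weights_and_grads": True,
        "bf16_optimizer_states": True,
    },
    "zero_optimization": {
        "stage": 3,  # the PR targets ZeRO 1/2/3; stage 3 is just one choice
        "offload_optimizer": {"device": "cpu"},
    },
}

print(json.dumps(ds_config, indent=2))
```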

Why

CPU offload currently forces fp32 optimizer states; for large models the
offloaded optimizer state dominates host RAM. Keeping the moments in bf16
(matching the already-bf16 master weights) cuts that footprint substantially
while keeping the state off the GPU.
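The footprint arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch that counts only the master weight and the two Adam moments per parameter; the function names are illustrative, and the 25.2B parameter count is borrowed from a downstream commit message that references this PR.

```python
# Bytes per element for the two dtypes discussed in this PR.
BYTES = {"fp32": 4, "bf16": 2}

def offloaded_bytes_per_param(master_dtype: str, moment_dtype: str) -> int:
    """Master weight plus two Adam moments (momentum and variance)."""
    return BYTES[master_dtype] + 2 * BYTES[moment_dtype]

def offloaded_gb(num_params: int, master_dtype: str, moment_dtype: str) -> float:
    """Total offloaded optimizer state in decimal gigabytes."""
    return num_params * offloaded_bytes_per_param(master_dtype, moment_dtype) / 1e9

params = 25_200_000_000
print(offloaded_bytes_per_param("bf16", "fp32"))  # bf16 master, fp32 moments
print(offloaded_bytes_per_param("bf16", "bf16"))  # bf16 master, bf16 moments
print(offloaded_gb(params, "bf16", "bf16"))
print(offloaded_gb(params, "fp32", "fp32"))       # all-fp32 baseline
```

Under these assumptions the bf16 case works out to 6 bytes/param, and an all-fp32 baseline (fp32 master + two fp32 moments) to 12 bytes/param, which is consistent with the ~151 GB versus ~302 GB figures quoted for the 25.2B model.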

How

DeepSpeedCPUAdam already supports bf16 momentum/variance through its
fp32_optimizer_states constructor flag — the feature was simply not wired up.
No C++/CUDA kernel changes.

- engine.py: _configure_basic_optimizer builds DeepSpeedCPUAdam /
  ZenFlowCPUAdam with fp32_optimizer_states=False when bf16_optimizer_states
  is set (a user-supplied value is popped to avoid a keyword clash, and overridden
  with a warning).
- base_optimizer.py: _configure_master_weights runs the offload +
  DeepSpeedCPUAdam validator whenever offload is configured (not only for the
  fp32-states case), and asserts a user-provided optimizer actually stores bf16
  moments.
- stage3.py / stage_1_and_2.py: pass offload_enabled through.
- config-json.md: updated bf16 support matrix.
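The pop-and-override step described for engine.py can be sketched as follows. This is not the code from the PR: resolve_cpu_adam_kwargs is a hypothetical helper, and honoring a user-supplied value when bf16_optimizer_states is off is an assumption made for the example.

```python
import warnings

def resolve_cpu_adam_kwargs(optimizer_params: dict, bf16_optimizer_states: bool) -> dict:
    """Hypothetical helper: decide the fp32_optimizer_states kwarg for CPU Adam."""
    kwargs = dict(optimizer_params)  # copy so the caller's dict is not mutated
    # Pop any user-supplied value so it cannot clash with the one set below.
    user_value = kwargs.pop("fp32_optimizer_states", None)
    if bf16_optimizer_states:
        if user_value:
            warnings.warn(
                "fp32_optimizer_states=True is overridden because "
                "bf16_optimizer_states is enabled"
            )
        kwargs["fp32_optimizer_states"] = False
    else:
        # Assumption: keep the user's choice, defaulting to fp32 (prior behavior).
        kwargs["fp32_optimizer_states"] = True if user_value is None else user_value
    return kwargs

print(resolve_cpu_adam_kwargs({"lr": 1e-3}, bf16_optimizer_states=True))
```

The resulting dict would then be splatted into the optimizer constructor, so the flag is passed exactly once.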

Backward compatibility

Configs with bf16_optimizer_states=false plus offload, and with
bf16_optimizer_states=true without offload, are unaffected: the default
resolves to fp32_optimizer_states=True (prior behavior), and the no-offload
bf16-states path (FusedAdam on GPU) is untouched.

Numerics

bf16_optimizer_states continues to require bf16_master_weights_and_grads, so
master weights are bf16 — identical precision to the existing on-GPU bf16-states
path. CPU Adam computes updates in fp32 internally and rounds moments to bf16
(round-to-nearest-even), matching that path.
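To make the rounding step concrete, the sketch below emulates fp32-to-bf16 round-to-nearest-even in pure Python using bit manipulation. It is an illustration of the rounding rule only, not the CPU Adam kernel, and it ignores NaN handling for brevity.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round an fp32 value to bf16 (upper 16 bits) with round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1          # lowest bit that survives truncation
    rounded = bits + 0x7FFF + lsb   # exact ties go to the even neighbor
    return (rounded >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen bf16 bits back to a Python float via fp32."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def round_to_bf16(x: float) -> float:
    return bf16_bits_to_float(fp32_to_bf16_bits(x))

# 1 + 2**-8 sits exactly halfway between two bf16 values and rounds to the even one.
print(round_to_bf16(1.0 + 2**-8))
# 1 + 3 * 2**-8 is also a tie; the even neighbor here is the larger value.
print(round_to_bf16(1.0 + 3 * 2**-8))
```

bf16 keeps 7 explicit mantissa bits, so each stored moment carries roughly two to three significant decimal digits even though the update itself is computed in fp32.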

Testing

- tests/unit/ops/adam/test_cpu_adam.py: DeepSpeedCPUAdam bf16 moment
  allocation + fp32 parity.
- tests/unit/v1/half_precision/test_bf16.py: bf16_optimizer_states + CPU
  offload across ZeRO 1/2/3 (extends TestBF16MasterWeightsGradients), plus a
  guard test that a user-provided DeepSpeedCPUAdam must opt into bf16 moments.

All new and affected existing tests pass; TestBF16MasterWeightsGradients
(9 cases) was verified on a 2-GPU host.

🤖 Generated with Claude Code

bf16 `bf16_optimizer_states` and `offload_optimizer` were effectively mutually exclusive: bf16 optimizer states were only realized on the GPU, and CPU-offloaded optimizer states were forced to fp32.

This allows `bf16_optimizer_states=true` together with `offload_optimizer: {device: cpu}` for ZeRO 1/2/3, so the Adam moments are held in bf16 *and* offloaded to CPU host RAM -- taking the offloaded optimizer state from ~10 to ~6 bytes/param with no added GPU memory.

The CPU Adam kernel already supports bf16 momentum/variance via DeepSpeedCPUAdam's `fp32_optimizer_states` flag; the feature was only unwired.

Changes:
- engine.py: build DeepSpeedCPUAdam/ZenFlowCPUAdam with fp32_optimizer_states=False when bf16_optimizer_states is set.
- base_optimizer.py: run the offload + DeepSpeedCPUAdam validator whenever offload is configured, and assert the offloaded optimizer stores bf16 moments.
- stage3.py / stage_1_and_2.py: pass offload_enabled through.
- config-json.md: update the bf16 support matrix.
- tests: cover bf16 CPU-offloaded optimizer states for ZeRO 1/2/3 and DeepSpeedCPUAdam bf16 moment allocation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lucas Pirola <lucas@pirola.eu>

Add ds_configs/zero3_offload_bf16_optim.json: ZeRO-3 with bf16 optimizer states offloaded to CPU host RAM (~151 GB for the 25.2B model, vs ~302 GB for the fp32 zero3_offload_optim.json). Pairs with the configs' existing `optimizer: deepspeed_cpu_adam` -- no kdr code change.

Requires a DeepSpeed build with deepspeedai/DeepSpeed#8010, which composes `bf16_optimizer_states` with `offload_optimizer.device=cpu` (previously mutually exclusive). The config documents the #8010 dependency and the fp32 fallback.

gradient_clipping hardcoded 0.5 to match the gemma-4 / profile-J grad_clip_norm. Both gemma-4 config headers updated to reference it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…or #8010

The zero3_offload_bf16_optim.json config sets bf16.bf16_optimizer_states (deepspeedai/DeepSpeed#8010 -- bf16 optimizer moments offloaded to host RAM). DeepSpeed's _configure_master_weights then asserts the offloaded DeepSpeedCPUAdam was constructed with fp32_optimizer_states=False, so its moments are actually stored in bf16. kdr's build_optimizer never passed it → AssertionError at deepspeed.initialize.

Add wants_bf16_optimizer_states(accelerator): reads bf16.bf16_optimizer_states from the active ds_config. build_optimizer gains a matching bf16_optimizer_states flag and passes fp32_optimizer_states=False to DeepSpeedCPUAdam only when set -- a standard (non-#8010) DeepSpeed build has no such kwarg and is paired with the fp32-states fallback config. run_recovery wires the flag from the ds_config.

Fixes probe bug #6.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
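The flag lookup described in this commit could be sketched as below. The real helper takes an accelerator object; this simplified version takes the ds_config dict directly, and cpu_adam_extra_kwargs is a hypothetical name introduced only for the example.

```python
def wants_bf16_optimizer_states(ds_config: dict) -> bool:
    """Simplified stand-in: read bf16.bf16_optimizer_states from a config dict."""
    bf16_section = ds_config.get("bf16") or {}
    return bool(bf16_section.get("bf16_optimizer_states", False))

def cpu_adam_extra_kwargs(ds_config: dict) -> dict:
    """Hypothetical helper: pass the kwarg only when the config asks for bf16 states."""
    if wants_bf16_optimizer_states(ds_config):
        return {"fp32_optimizer_states": False}
    # Leave the constructor call untouched for the fp32-states fallback config.
    return {}

bf16_cfg = {"bf16": {"enabled": True, "bf16_optimizer_states": True}}
fp32_cfg = {"bf16": {"enabled": True}}
print(cpu_adam_extra_kwargs(bf16_cfg))
print(cpu_adam_extra_kwargs(fp32_cfg))
```

Returning an empty dict in the fallback case keeps the optimizer constructor call identical to what it was before the flag existed.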

tohtana

This looks good to me, thank you @lucaspirola!
When I developed the bf16_optimizer_states feature, I wasn't able to include CPU offload due to a lack of bandwidth. I really appreciate you completing it.

tohtana · 2026-05-19T18:19:16Z

It seems that our CI has an issue with the combination of PyTorch and transformers versions. Let me take a look at it.

lucaspirola requested review from loadams, tjruwase and tohtana as code owners May 17, 2026 18:56

tohtana added 3 commits May 18, 2026 21:00

Merge branch 'master' into feature/bf16-cpu-offload-optimizer-states

e4dd825

Merge branch 'master' into feature/bf16-cpu-offload-optimizer-states

cfbad14

Merge branch 'master' into feature/bf16-cpu-offload-optimizer-states

9dead79

tohtana approved these changes May 19, 2026


Merge branch 'master' into feature/bf16-cpu-offload-optimizer-states

66e993a

tohtana merged commit 3c337b5 into deepspeedai:master May 20, 2026
2 of 3 checks passed

tohtana merged 5 commits into deepspeedai:master from lucaspirola:feature/bf16-cpu-offload-optimizer-states
