Support bf16 optimizer states with CPU offload #8010
Merged
tohtana merged 5 commits on May 20, 2026
The bf16 option `bf16_optimizer_states` and `offload_optimizer` were
effectively mutually exclusive: bf16 optimizer states were only realized
on the GPU, and CPU-offloaded optimizer states were forced to fp32. This
change allows `bf16_optimizer_states=true` together with
`offload_optimizer: {device: cpu}` for ZeRO 1/2/3, so the Adam moments
are held in bf16 *and* offloaded to CPU host RAM, taking the offloaded
optimizer state from ~10 to ~6 bytes/param with no added GPU memory.
The CPU Adam kernel already supports bf16 momentum/variance via
DeepSpeedCPUAdam's `fp32_optimizer_states` flag; the feature was simply
not wired up. Changes:
- engine.py: build DeepSpeedCPUAdam/ZenFlowCPUAdam with
  fp32_optimizer_states=False when bf16_optimizer_states is set.
- base_optimizer.py: run the offload + DeepSpeedCPUAdam validator
  whenever offload is configured, and assert the offloaded optimizer
  stores bf16 moments.
- stage3.py / stage_1_and_2.py: pass offload_enabled through.
- config-json.md: update the bf16 support matrix.
- tests: cover bf16 CPU-offloaded optimizer states for ZeRO 1/2/3 and
  DeepSpeedCPUAdam bf16 moment allocation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lucas Pirola <lucas@pirola.eu>
lucaspirola added a commit to lucaspirola/kdr that referenced this pull request on May 17, 2026
Add ds_configs/zero3_offload_bf16_optim.json: ZeRO-3 with bf16 optimizer states offloaded to CPU host RAM (~151 GB for the 25.2B model, vs ~302 GB for the fp32 zero3_offload_optim.json). Pairs with the configs' existing `optimizer: deepspeed_cpu_adam`; no kdr code change.
Requires a DeepSpeed build with deepspeedai/DeepSpeed#8010, which composes `bf16_optimizer_states` with `offload_optimizer.device=cpu` (previously mutually exclusive). The config documents the #8010 dependency and the fp32 fallback. gradient_clipping is hardcoded to 0.5 to match the gemma-4 / profile-J grad_clip_norm. Both gemma-4 config headers are updated to reference it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lucaspirola added a commit to lucaspirola/kdr that referenced this pull request on May 18, 2026
…or #8010
The zero3_offload_bf16_optim.json config sets bf16.bf16_optimizer_states (deepspeedai/DeepSpeed#8010: bf16 optimizer moments offloaded to host RAM). DeepSpeed's _configure_master_weights then asserts the offloaded DeepSpeedCPUAdam was constructed with fp32_optimizer_states=False, so its moments are actually stored in bf16. kdr's build_optimizer never passed it, which caused an AssertionError at deepspeed.initialize.
Add wants_bf16_optimizer_states(accelerator): reads bf16.bf16_optimizer_states from the active ds_config. build_optimizer gains a matching bf16_optimizer_states flag and passes fp32_optimizer_states=False to DeepSpeedCPUAdam only when set; a standard (non-#8010) DeepSpeed build has no such kwarg and is paired with the fp32-states fallback config. run_recovery wires the flag from the ds_config. Fixes probe bug #6.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
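The conditional wiring described in this commit can be sketched in a few lines. This is a minimal illustration, not kdr's actual code: the helper names `ds_config_wants_bf16_states` and `cpu_adam_kwargs` are hypothetical, and the helper takes a plain config dict rather than the `accelerator` object the commit mentions. Only the `bf16.bf16_optimizer_states` key and the `fp32_optimizer_states` kwarg come from the text above.

```python
from typing import Any, Dict


def ds_config_wants_bf16_states(ds_config: Dict[str, Any]) -> bool:
    """Hypothetical helper: True when bf16.bf16_optimizer_states is set."""
    return bool(ds_config.get("bf16", {}).get("bf16_optimizer_states", False))


def cpu_adam_kwargs(lr: float, bf16_optimizer_states: bool) -> Dict[str, Any]:
    """Hypothetical helper: build constructor kwargs for a CPU Adam optimizer.

    fp32_optimizer_states is added only when bf16 states are requested, so
    the kwarg is never passed in the default fp32-states configuration.
    """
    kwargs: Dict[str, Any] = {"lr": lr}
    if bf16_optimizer_states:
        kwargs["fp32_optimizer_states"] = False
    return kwargs


bf16_cfg = {"bf16": {"enabled": True, "bf16_optimizer_states": True}}
fp32_cfg = {"bf16": {"enabled": True}}

print(cpu_adam_kwargs(1e-4, ds_config_wants_bf16_states(bf16_cfg)))
print(cpu_adam_kwargs(1e-4, ds_config_wants_bf16_states(fp32_cfg)))
```

Leaving the kwarg out entirely in the fp32 case, rather than passing `True`, is what keeps the fallback config usable with a build that does not know the flag.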
tohtana approved these changes on May 19, 2026
tohtana (Collaborator) left a comment:
This looks good to me, thank you @lucaspirola!
When I developed the bf16_optimizer_states feature, I wasn't able to include CPU offload due to a lack of bandwidth. I really appreciate you completing it.
Collaborator
It seems that our CI has an issue with the combination of PyTorch and transformers versions. Let me take a look at it.
What
DeepSpeed's bf16 `bf16_optimizer_states` option and `offload_optimizer` are currently mutually exclusive, per the support matrix in `docs/_pages/config-json.md`. The matrix has columns `bf16_optimizer_states=false` and `bf16_optimizer_states=true`, and its existing CPU-offload entry is "`DeepSpeedCPUAdam`; states fp32, on CPU".
This PR fills the missing cell: `bf16_optimizer_states=true` together with `offload_optimizer: {device: cpu}` for ZeRO 1/2/3, with Adam moments held in bf16 and offloaded to CPU host RAM. That reduces the offloaded optimizer state from ~10 to ~6 bytes/param (bf16 master + two bf16 moments) with no added GPU memory.
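For orientation, a config combining the two options might look like the sketch below, written as a Python dict of the kind that can be dumped to a DeepSpeed JSON config. The nesting of `bf16_optimizer_states` under `bf16` follows the `bf16.bf16_optimizer_states` path quoted in this thread; placing `bf16_master_weights_and_grads` in the same block is an assumption, and batch-size, optimizer, and other keys a real config needs are omitted.

```python
import json

# Sketch of the relevant config fragment only; not a complete DeepSpeed config.
ds_config = {
    "bf16": {
        "enabled": True,
        # Assumed to live alongside bf16_optimizer_states in the bf16 block.
        "bf16_master_weights_and_grads": True,
        "bf16_optimizer_states": True,
    },
    "zero_optimization": {
        "stage": 3,  # the PR covers ZeRO stages 1, 2, and 3
        "offload_optimizer": {"device": "cpu"},
    },
}

print(json.dumps(ds_config, indent=2))
```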
Why
CPU offload currently forces fp32 optimizer states; for large models the
offloaded optimizer state dominates host RAM. Keeping the moments in bf16
(matching the already-bf16 master weights) cuts that footprint substantially
while keeping the state off the GPU.
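The footprint numbers can be checked with a few lines of arithmetic. The sketch below assumes the ~10 bytes/param figure decomposes as a bf16 master copy plus two fp32 moments; only the ~6 bytes/param breakdown (bf16 master + two bf16 moments) is stated explicitly in this PR. The 25.2B parameter count is taken from the kdr commit referenced in this thread, and `offloaded_bytes_per_param` is a hypothetical helper name.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}


def offloaded_bytes_per_param(master_dtype: str, moment_dtype: str) -> int:
    """One master weight plus the two Adam moments (momentum and variance)."""
    return BYTES_PER_ELEMENT[master_dtype] + 2 * BYTES_PER_ELEMENT[moment_dtype]


before = offloaded_bytes_per_param("bf16", "fp32")  # 10 (assumed breakdown)
after = offloaded_bytes_per_param("bf16", "bf16")   # 6

num_params = 25_200_000_000
host_ram_gb = num_params * after / 1e9  # decimal gigabytes

print(before, after, host_ram_gb)
```

At 6 bytes/param, the 25.2B-parameter model comes to about 151 GB of host RAM, which agrees with the ~151 GB quoted in the referenced commit.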
How
`DeepSpeedCPUAdam` already supports bf16 momentum/variance through its `fp32_optimizer_states` constructor flag; the feature was simply not wired up. No C++/CUDA kernel changes.
- `engine.py`: `_configure_basic_optimizer` builds `DeepSpeedCPUAdam`/`ZenFlowCPUAdam` with `fp32_optimizer_states=False` when `bf16_optimizer_states` is set (a user-supplied value is popped to avoid a keyword clash, and overridden with a warning).
- `base_optimizer.py`: `_configure_master_weights` runs the offload + `DeepSpeedCPUAdam` validator whenever offload is configured (not only for the fp32-states case), and asserts a user-provided optimizer actually stores bf16 moments.
- `stage3.py` / `stage_1_and_2.py`: pass `offload_enabled` through.
- `config-json.md`: updated bf16 support matrix.
Backward compatibility
`false`+offload and `true`+no-offload configs are unaffected: the default resolves to `fp32_optimizer_states=True` (prior behavior), and the no-offload bf16-states path (FusedAdam on GPU) is untouched.
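The pop-and-override behavior described for `engine.py` can be illustrated without DeepSpeed. This is a sketch under stated assumptions, not the code in `_configure_basic_optimizer`: the function name `resolve_cpu_adam_params` is hypothetical, and honoring a user-supplied value when `bf16_optimizer_states` is off is an assumption about the non-bf16 path.

```python
import warnings
from typing import Any, Dict


def resolve_cpu_adam_params(optimizer_params: Dict[str, Any],
                            bf16_optimizer_states: bool) -> Dict[str, Any]:
    """Hypothetical helper: decide the fp32_optimizer_states kwarg."""
    params = dict(optimizer_params)  # leave the caller's dict untouched
    # Pop any user-supplied value so the kwarg cannot be passed twice.
    user_value = params.pop("fp32_optimizer_states", None)
    if bf16_optimizer_states:
        if user_value is True:
            warnings.warn("fp32_optimizer_states=True is overridden because "
                          "bf16_optimizer_states is set")
        params["fp32_optimizer_states"] = False
    else:
        # Default resolves to True (prior behavior); assumed to respect
        # an explicit user value when bf16 states are not requested.
        params["fp32_optimizer_states"] = True if user_value is None else user_value
    return params


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    overridden = resolve_cpu_adam_params(
        {"lr": 1e-3, "fp32_optimizer_states": True}, bf16_optimizer_states=True)

default = resolve_cpu_adam_params({"lr": 1e-3}, bf16_optimizer_states=False)
print(overridden, default, len(caught))
```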
Numerics
`bf16_optimizer_states` continues to require `bf16_master_weights_and_grads`, so master weights are bf16, identical in precision to the existing on-GPU bf16-states path. CPU Adam computes updates in fp32 internally and rounds moments to bf16 (round-to-nearest-even), matching that path.
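To make the rounding mode concrete, here is a pure-Python sketch of fp32 to bf16 conversion with round-to-nearest-even, using the common bit-level formulation. It is an illustration of the rounding rule only, not the DeepSpeed CPU Adam kernel; the function names are hypothetical and NaN handling is deliberately left out.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round a float32 value to bf16 (round-to-nearest-even); NaN not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit rounds exact ties toward the
    # value whose last kept mantissa bit is even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen bf16 bits back to a Python float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


# bf16 keeps 7 mantissa bits, so values near 1.0 are spaced 2**-7 apart.
tie_down = bf16_bits_to_float(fp32_to_bf16_bits(1.0 + 2**-8))    # 1.0
tie_up = bf16_bits_to_float(fp32_to_bf16_bits(1.0 + 3 * 2**-8))  # 1.015625
print(tie_down, tie_up)
```

Both inputs sit exactly halfway between two bf16 values; the first rounds down and the second rounds up, each landing on the neighbor with an even last mantissa bit.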
Testing
- `tests/unit/ops/adam/test_cpu_adam.py`: `DeepSpeedCPUAdam` bf16 moment allocation + fp32 parity.
- `tests/unit/v1/half_precision/test_bf16.py`: `bf16_optimizer_states` + CPU offload across ZeRO 1/2/3 (extends `TestBF16MasterWeightsGradients`), plus a guard test that a user-provided `DeepSpeedCPUAdam` must opt into bf16 moments.
All new and affected existing tests pass; `TestBF16MasterWeightsGradients` (9 cases) was verified on a 2-GPU host.
🤖 Generated with Claude Code