
[BUG] DeepSpeed ZeRO Stage-3 + CPU offloaded optimizer (CPUAdam): inconsistent optimizer metadata between subgroups #7819

@st-bang97

Description

Describe the bug
When using DeepSpeed ZeRO Stage-3 + CPU offloaded optimizer (CPUAdam), if there are 2+ subgroups, the optimizer state update within a single training step appears to apply inconsistent bias-correction values across subgroups.
Concretely, for the same global step (e.g., step=3), _bias_correction2 differs between the first subgroup call and subsequent subgroup calls inside the same step, which implies that later subgroups may effectively be using a slightly different optimizer state than the earlier subgroup(s) for the same step.

This can lead to learning progress from one subgroup being reflected inconsistently within a single step; in other words, the optimizer state is not step-consistent across subgroup partitions.
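For reference, the standard Adam bias-correction terms are direct functions of the accumulated beta powers, which cpu_adam caches as _betta1_t/_betta2_t (the exact cached form of _bias_correction2 is whatever update_state derives from _betta2_t), so any last-ULP drift in _betta2_t surfaces directly in _bias_correction2:

$$\text{bias\_correction}_1 = 1 - \beta_1^{t}, \qquad \text{bias\_correction}_2 = 1 - \beta_2^{t}$$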

Debug instrumentation details

Add a debug helper in csrc/include/cpu_adam.h inside Adam_Optimizer:

```cpp
inline void print_state() {
    printf("ds_adam_state: _betta1_t=%f, _betta2_t=%f _bias_correction1=%f, _bias_correction2=%f\n",
           _betta1_t, _betta2_t, _bias_correction1, _bias_correction2);
}
```
Add prints in csrc/adam/cpu_adam_impl.cpp inside ds_adam_step:
```cpp
opt->IncrementStep(step, beta1, beta2);
opt->update_state(lr, epsilon, weight_decay, bias_correction);
printf("ds_adam_step, optimizer_id=%d, step=%d, lr=%f, beta1=%f, beta2=%f, epsilon=%f, weight_decay=%f, bias_correction=%d\n",
       optimizer_id, (int)step, lr, beta1, beta2, epsilon, weight_decay, bias_correction);
opt->print_state();
```
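For the added prints to take effect, the CPUAdam extension has to be rebuilt. One way to do that, assuming an editable install from the DeepSpeed source tree (the DS_BUILD_CPU_ADAM flag pre-builds the op instead of relying on JIT compilation):

```bash
# Rebuild DeepSpeed with the CPUAdam op pre-compiled so the new prints are picked up.
DS_BUILD_CPU_ADAM=1 pip install -e . --no-build-isolation
```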

To Reproduce

1. Build DeepSpeed with the debugging prints added (details above) and rebuild the CPUAdam op (see the sketch above).
2. Clone the examples repo and run the SuperOffload fine-tuning script:

   ```bash
   git clone https://github.com/deepspeedai/DeepSpeedExamples/
   cd DeepSpeedExamples
   training/DeepSpeed-SuperOffload/finetune_llama-8b_1gpu.sh zerooffload 1
   ```

3. Ensure the run configuration results in 2+ subgroups when using ZeRO-3 + CPU offload; I observe the issue only when the subgroup count is ≥ 2 (see the config sketch after this list).
4. Observe the ds_adam_step logs showing a _bias_correction2 mismatch within the same step.
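An illustrative sketch of step 3: lowering zero_optimization.sub_group_size forces the ZeRO-3 parameters into multiple subgroups. This is not the exact config the script ships with, and the value below is a placeholder:

```json
{
  "zero_optimization": {
    "stage": 3,
    "sub_group_size": 100000000,
    "offload_optimizer": { "device": "cpu" }
  }
}
```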

Expected behavior
Within a single optimizer step (step=N), all subgroup calls should produce identical bias-correction terms (e.g., _betta2_t, _bias_correction2) for the same optimizer instance, assuming they are logically part of the same step update.
In particular, _bias_correction2 should be deterministic and consistent across subgroup invocations within the same step.

ds_report output
I have already checked multiple versions.

Screenshots
For the same step, _bias_correction2 differs between the first subgroup and subsequent subgroup calls.

Example excerpt (single step step=3, same optimizer_id=0):

[screenshot]

Similarly for step=5:

[screenshot]

Suspected root cause
In IncrementStep, the state is updated like:

```cpp
_step++;
if (_step != step) {
    _betta1_t = std::pow(_betta1, step);
    _betta2_t = std::pow(_betta2, step);
    _step = step;
} else {
    _betta1_t *= _betta1;
    _betta2_t *= _betta2;
}
```

When ds_adam_step is invoked multiple times per global step (due to subgrouping), only the first subgroup call takes the else path: its _step++ brings _step up to step, so the accumulators advance by a single multiplication (*=). From the second subgroup call onward, _step++ overshoots (_step becomes step + 1 != step), so the code enters the std::pow path, recomputes _betta1_t/_betta2_t from scratch, and resets _step = step. std::pow(beta, step) is not guaranteed to bit-match the iteratively accumulated product, so the recomputed values can differ from the first subgroup's values in the last few ULPs. That difference propagates _betta2_t → _bias_correction2, creating subgroup-to-subgroup inconsistency within the same step.

This matches the observed pattern: the first subgroup produces one _bias_correction2 value, and later subgroups converge to a slightly different value for the same step.
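A minimal standalone sketch of the suspected numeric effect (not DeepSpeed code; beta2 = 0.999 and the loop bound are arbitrary, and the first step at which the two paths disagree depends on the platform's libm):

```cpp
// Compare the running product taken by the first subgroup call against the
// std::pow() recomputation taken by later subgroup calls within the same step.
#include <cmath>
#include <cstdio>

int main()
{
    const float betta2 = 0.999f;
    float running = 1.0f;  // accumulated as in the `else` branch
    for (int step = 1; step <= 50; ++step) {
        running *= betta2;  // first-subgroup path: one multiply per step
        const float recomputed = std::pow(betta2, (float)step);  // later-subgroup path
        if (running != recomputed) {
            std::printf("step=%d running=%.9g pow=%.9g\n", step, running, recomputed);
        }
    }
    return 0;
}
```

If this diagnosis is correct, one possible direction for a fix would be to treat repeated IncrementStep calls with an unchanged step as a no-op, so every subgroup within a step reuses the same cached _betta1_t/_betta2_t instead of some subgroups recomputing them via std::pow.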

Labels: bug (Something isn't working), training