
[BUG] DeepSpeed ZeRO Stage-3 + CPU offloaded optimizer (CPUAdam): inconsistent optimizer metadata between subgroups #7819

@st-bang97

Description

Describe the bug
When using DeepSpeed ZeRO Stage-3 + CPU offloaded optimizer (CPUAdam), if there are 2+ subgroups, the optimizer state update within a single training step appears to apply inconsistent bias-correction values across subgroups.
Concretely, for the same global step (e.g., step=3), _bias_correction2 differs between the first subgroup call and subsequent subgroup calls inside the same step, which implies that later subgroups may effectively be using a slightly different optimizer state than the earlier subgroup(s) for the same step.

This can lead to learning progress from one subgroup being reflected inconsistently within a single step; in other words, the optimizer state is not step-consistent across subgroup partitions.
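For reference, the standard Adam bias-correction terms are direct functions of the accumulated beta powers, which cpu_adam caches as _betta1_t/_betta2_t (the exact cached form of _bias_correction2 is whatever update_state derives from _betta2_t), so any last-ULP drift in _betta2_t surfaces directly in _bias_correction2:

$$\text{bias\_correction}_1 = 1 - \beta_1^{t}, \qquad \text{bias\_correction}_2 = 1 - \beta_2^{t}$$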

Debug instrumentation details

Add a debug helper in csrc/include/cpu_adam.h inside Adam_Optimizer:

```cpp
inline void print_state() {
    printf("ds_adam_state: _betta1_t=%f, _betta2_t=%f _bias_correction1=%f, _bias_correction2=%f\n",
           _betta1_t, _betta2_t, _bias_correction1, _bias_correction2);
}
```
Add prints in csrc/adam/cpu_adam_impl.cpp inside ds_adam_step:
```cpp
opt->IncrementStep(step, beta1, beta2);
opt->update_state(lr, epsilon, weight_decay, bias_correction);
printf("ds_adam_step, optimizer_id=%d, step=%d, lr=%f, beta1=%f, beta2=%f, epsilon=%f, weight_decay=%f, bias_correction=%d\n",
       optimizer_id, (int)step, lr, beta1, beta2, epsilon, weight_decay, bias_correction);
opt->print_state();
```
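For the added prints to take effect, the CPUAdam extension has to be rebuilt. One way to do that, assuming an editable install from the DeepSpeed source tree (the DS_BUILD_CPU_ADAM flag pre-builds the op instead of relying on JIT compilation):

```bash
# Rebuild DeepSpeed with the CPUAdam op pre-compiled so the new prints are picked up.
DS_BUILD_CPU_ADAM=1 pip install -e . --no-build-isolation
```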

To Reproduce

1. Build DeepSpeed with the debugging prints added (details above) and rebuild the CPUAdam op (see the sketch above).
2. Clone the examples repo and run the SuperOffload fine-tuning script:

   ```bash
   git clone https://github.com/deepspeedai/DeepSpeedExamples/
   cd DeepSpeedExamples
   training/DeepSpeed-SuperOffload/finetune_llama-8b_1gpu.sh zerooffload 1
   ```

3. Ensure the run configuration results in 2+ subgroups when using ZeRO-3 + CPU offload; I observe the issue only when the subgroup count is ≥ 2 (see the config sketch after this list).
4. Observe the ds_adam_step logs showing a _bias_correction2 mismatch within the same step.
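An illustrative sketch of step 3: lowering zero_optimization.sub_group_size forces the ZeRO-3 parameters into multiple subgroups. This is not the exact config the script ships with, and the value below is a placeholder:

```json
{
  "zero_optimization": {
    "stage": 3,
    "sub_group_size": 100000000,
    "offload_optimizer": { "device": "cpu" }
  }
}
```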

Expected behavior
Within a single optimizer step (step=N), all subgroup calls should produce identical bias-correction terms (e.g., _betta2_t, _bias_correction2) for the same optimizer instance, assuming they are logically part of the same step update.
In particular, _bias_correction2 should be deterministic and consistent across subgroup invocations within the same step.

ds_report output
I have already checked multiple versions.

Screenshots
For the same step, _bias_correction2 differs between the first subgroup and subsequent subgroup calls.

Example excerpt (single step step=3, same optimizer_id=0):

[screenshot]

Similarly for step=5:

[screenshot]

Suspected root cause
In IncrementStep, the state is updated like:

```cpp
_step++;
if (_step != step) {
    _betta1_t = std::pow(_betta1, step);
    _betta2_t = std::pow(_betta2, step);
    _step = step;
} else {
    _betta1_t *= _betta1;
    _betta2_t *= _betta2;
}
```

When ds_adam_step is invoked multiple times per global step (due to subgrouping), only the first subgroup call takes the else path: its _step++ brings _step up to step, so the accumulators advance by a single multiplication (*=). From the second subgroup call onward, _step++ overshoots (_step becomes step + 1 != step), so the code enters the std::pow path, recomputes _betta1_t/_betta2_t from scratch, and resets _step = step. std::pow(beta, step) is not guaranteed to bit-match the iteratively accumulated product, so the recomputed values can differ from the first subgroup's values in the last few ULPs. That difference propagates _betta2_t → _bias_correction2, creating subgroup-to-subgroup inconsistency within the same step.

This matches the observed pattern: the first subgroup produces one _bias_correction2 value, and later subgroups converge to a slightly different value for the same step.
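A minimal standalone sketch of the suspected numeric effect (not DeepSpeed code; beta2 = 0.999 and the loop bound are arbitrary, and the first step at which the two paths disagree depends on the platform's libm):

```cpp
// Compare the running product taken by the first subgroup call against the
// std::pow() recomputation taken by later subgroup calls within the same step.
#include <cmath>
#include <cstdio>

int main()
{
    const float betta2 = 0.999f;
    float running = 1.0f;  // accumulated as in the `else` branch
    for (int step = 1; step <= 50; ++step) {
        running *= betta2;  // first-subgroup path: one multiply per step
        const float recomputed = std::pow(betta2, (float)step);  // later-subgroup path
        if (running != recomputed) {
            std::printf("step=%d running=%.9g pow=%.9g\n", step, running, recomputed);
        }
    }
    return 0;
}
```

If this diagnosis is correct, one possible direction for a fix would be to treat repeated IncrementStep calls with an unchanged step as a no-op, so every subgroup within a step reuses the same cached _betta1_t/_betta2_t instead of some subgroups recomputing them via std::pow.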

Labels: bug (Something isn't working), training