Fix ping-pong buffer index reset and remove redundant stream sync #7805
Conversation
Running the script, the result with the fix matches the baseline perfectly. To demonstrate that the gradient corruption (which #7371 tried to fix) was indeed caused by the broken buffer indexing from #6993, I also verified the incorrect behavior (sync removed, buffer index bug remains).
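The corruption mode being demonstrated can be illustrated with a toy simulation (pure Python, hypothetical simplification; the real code uses CUDA streams and flattened gradient buffers). With `overlap_comm`, an "async" reduction may read its buffer only after the next bucket has already been packed; with a single effective buffer the new bucket overwrites the data the pending reduction still needs:

```python
# Toy model of overlapped reduction with one vs. two IPG buffers.
# Hypothetical simplification: a pending async reduce reads its buffer
# one step late, as overlap_comm permits.

def run(buckets, num_buffers):
    buffers = [None] * num_buffers
    pending = []   # (buffer_slot, expected_data) for in-flight reductions
    results = []
    index = 0
    for bucket in buckets:
        buffers[index] = list(bucket)          # pack bucket into buffer
        pending.append((index, list(bucket)))  # launch "async" reduce
        index = (index + 1) % num_buffers      # ping-pong swap (no-op if 1)
        if len(pending) == 2:                  # reduce completes one step late
            slot, expected = pending.pop(0)
            results.append(buffers[slot] == expected)
    for slot, expected in pending:             # drain remaining work
        results.append(buffers[slot] == expected)
    return results  # True = reduced correct data, False = corrupted

print(run([[1], [2], [3]], num_buffers=2))  # [True, True, True]
print(run([[1], [2], [3]], num_buffers=1))  # [False, False, True]
```

With two buffers every in-flight reduction still sees its own data; with one, each overlapped reduction reads the next bucket's contents instead, which is exactly the corruption #7371 papered over with extra synchronization.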
Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Thanks @undersilence for the fix! This is truly crucial, and it looks good for the single-dtype case. One concern for the multi-dtype case: I opened another PR on your PR branch to address it by launching an allreduce per dtype. Could you share your thoughts?
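The per-dtype idea can be sketched without `torch` (hypothetical names; the real change calls a collective such as `torch.distributed.all_reduce` on each group): partition the bucket's gradients by dtype and issue one reduction per group, rather than mixing fp16/bf16/fp32 in a single flattened call.

```python
from collections import defaultdict

def allreduce_per_dtype(grads, reduce_op):
    """Group gradients by dtype and launch one reduction per group.

    `grads` is a list of (dtype, values) pairs; `reduce_op` is a stand-in
    for a real collective such as torch.distributed.all_reduce.
    """
    groups = defaultdict(list)
    for dtype, values in grads:
        groups[dtype].append(values)
    # One collective call per dtype keeps each flattened buffer homogeneous.
    return {dtype: reduce_op(bufs) for dtype, bufs in groups.items()}

# Toy reduce_op: elementwise sum across the group's buffers.
def sum_reduce(bufs):
    return [sum(col) for col in zip(*bufs)]

grads = [("fp16", [1, 2]), ("fp32", [10, 20]), ("fp16", [3, 4])]
print(allreduce_per_dtype(grads, sum_reduce))
# → {'fp16': [4, 6], 'fp32': [10, 20]}
```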
Thanks for the detailed review and for opening the PR! You are right; isolating the allreduce by dtype is the correct fix here. I really appreciate your help in making this robust!
Launch allreduce per dtype
@undersilence Thank you for merging my PR! This is a great improvement, and we really appreciate your contribution.
Fix ping-pong buffer index reset and remove redundant stream sync (deepspeedai#7805)

Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
Fix #7804 and #7188

After investigating the code in `deepspeed/runtime/zero/stage_1_and_2.py`, I have identified the root cause. The regression in communication overlap was introduced in PR #7371. While the additional two-stream synchronization in that PR fixes gradient corruption, it effectively disables the overlapping behavior.

The underlying issue causing the gradient corruption (which #7371 attempted to fix) was actually introduced in PR #6993. In that PR, `bucket.clear()` incorrectly resets the ping-pong buffer index to 0 at the end of `reduce_ipg_grads`. This disrupts the buffer-index swapping mechanism within `reduce_independent_p_g_buckets_and_remove_grads`.

To fix this, L121 in `deepspeed/runtime/zero/stage_1_and_2.py` should be removed so the buffer index is no longer reset. Additionally, the stream synchronization logic introduced in #7371 should be removed to restore `overlap_comm=True` functionality.
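The indexing mechanism described above can be sketched in isolation (hypothetical function; the real logic lives in `reduce_independent_p_g_buckets_and_remove_grads`): each flush is supposed to swap to the other buffer so the next bucket can fill while the previous one is still being reduced, and an unconditional reset pins every bucket to buffer 0.

```python
# Minimal sketch of the ping-pong buffer index (hypothetical names).
# Models the effect of the bucket.clear() reset from deepspeedai#6993.

def buffer_indices(num_flushes, reset_after_flush):
    """Return the buffer index used by each successive bucket flush."""
    index = 0
    used = []
    for _ in range(num_flushes):
        used.append(index)
        index = (index + 1) % 2  # swap buffers so reduction can overlap
        if reset_after_flush:
            index = 0            # the bug: the reset cancels the swap
    return used

print(buffer_indices(4, reset_after_flush=False))  # alternates: [0, 1, 0, 1]
print(buffer_indices(4, reset_after_flush=True))   # stuck:      [0, 0, 0, 0]
```

With the reset in place both "ping" and "pong" collapse onto buffer 0, so an overlapped reduction can read data that the next bucket has already overwritten; removing the reset restores the alternation and makes the extra synchronization from #7371 unnecessary.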