Fix hook count performance regression from v0.18.5 #7886

Merged: tohtana merged 5 commits into deepspeedai:master from tohtana:tohtana/fix-perf-regression on Mar 5, 2026

Conversation

@tohtana (Collaborator) commented Mar 4, 2026

Fixes performance regressions reported in #7882 and #7885.

PR #7780 added dynamic hook count computation for reentrant checkpointing correctness, but placed the call inside every gradient hook closure. For a model with n parameter tensors, this adds an O(n) scan to each of the n hooks, i.e. O(n^2) work per backward pass.

Summary:

  1. Added should_refresh_expected_hook_count() predicate that returns true only at backward phase boundaries (first hook, or new reentrant phase), so count_used_parameters_in_backward() is called once per phase instead of once per hook.
  2. Applied this predicate in ZeRO-1/2 (stage_1_and_2.py) and both ZeRO-3 hook sites (stage3.py), reusing the cached_max_expected_hooks_seen value when refresh isn't needed.
  3. Changed enter_backward() to reset hook counters on first real backward entry, preventing pollution from pre-user-backward autograd calls (e.g., TiledFusedLogitsLoss).
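The refresh pattern in the summary above can be sketched as follows. This is a simplified illustration, not the actual DeepSpeed implementation: the names (`enter_backward`, `should_refresh_expected_hook_count`, `cached_max_expected_hooks_seen`, the injected `count_used_parameters_in_backward` callable) follow this PR's description, but the state layout and `on_gradient_hook` helper are assumptions.

```python
class BackwardHookStateManager:
    """Sketch of the phase-boundary refresh pattern described in this PR.

    The expensive O(n) parameter scan (count_used_parameters_in_backward)
    runs once per backward phase instead of once per gradient hook.
    """

    def __init__(self, count_used_parameters_in_backward):
        self._count_fn = count_used_parameters_in_backward
        self.hooks_seen_in_phase = 0
        self.cached_max_expected_hooks_seen = 0

    def should_refresh_expected_hook_count(self):
        # A phase boundary is the point where no hooks have fired yet
        # in the current (possibly reentrant) backward phase.
        return self.hooks_seen_in_phase == 0

    def enter_backward(self):
        # Reset on entry to a backward phase so the next hook triggers
        # exactly one refresh; this also discards counter state polluted
        # by pre-user-backward autograd calls.
        self.hooks_seen_in_phase = 0

    def on_gradient_hook(self):
        # Called from each gradient hook closure (hypothetical helper).
        if self.should_refresh_expected_hook_count():
            self.cached_max_expected_hooks_seen = self._count_fn()
        self.hooks_seen_in_phase += 1
        return self.cached_max_expected_hooks_seen
```

With 147 parameter tensors, the scan callable fires once for the whole phase rather than 147 times, which is the core of the fix.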

With a 24-layer transformer (~267M params, 147 parameter tensors), ZeRO-2, 8×H100 80GB, bf16, batch size 8, 20 warmup + 20 measured iterations:

  • Before fix: 0.1265s/iter
  • After fix: 0.0505s/iter (~2.5× faster)

tohtana added 3 commits March 4, 2026 12:49
Add phase-refresh predicate to BackwardHookStateManager to avoid
calling count_used_parameters_in_backward() on every gradient hook.
Harden enter_backward() to reset counters on first real backward
entry, preventing pre-user-backward hook pollution.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Use should_refresh_expected_hook_count() to only recompute the
expected hook count at phase boundaries instead of every hook
invocation, reducing O(n^2) overhead to O(p*n).

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Apply the same refresh/cache pattern in both reduce_partition_and_remove_grads
and reduce_leaf_module_grads to avoid per-hook O(n) overhead.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana (Collaborator, Author) commented Mar 4, 2026

@codex

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. More of your lovely PRs please.


    @instrument_w_nvtx
    def reduce_partition_and_remove_grads(*notneeded):
        refresh_expected = self.should_refresh_expected_hook_count()
@delock (Collaborator) commented:

Hi @tohtana, the refresh_expected = self.should_refresh_expected_hook_count() line is used on L1289, so why is it placed before L1284 (reenter_backward_if_needed()) and L1286 (reduce_ready_partitions_and_remove_grads())? Is there an implicit dependency here?

@tohtana (Collaborator, Author) replied:

Thank you for checking, @delock!

Yes, there's actually a dependency. should_refresh_expected_hook_count() detects reentrant phase boundaries by checking backward_active_depth == 0. reenter_backward_if_needed() increments backward_active_depth from 0 → 1 when it detects a new phase. If we called reenter_backward first, it would set backward_active_depth = 1 before the predicate runs, making the condition always false.

I added a comment to clarify it.
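The ordering dependency explained above can be shown with a toy model. The names (`backward_active_depth`, `reenter_backward_if_needed`) follow the discussion, but the class below is a deliberately minimal assumption, not DeepSpeed code:

```python
class HookState:
    """Toy model of why the predicate must run before reenter_backward_if_needed()."""

    def __init__(self):
        self.backward_active_depth = 0  # 0 = no active backward phase yet

    def should_refresh_expected_hook_count(self):
        # A new reentrant phase is detected while the depth is still 0.
        return self.backward_active_depth == 0

    def reenter_backward_if_needed(self):
        # Entering the phase bumps the depth from 0 to 1, so the
        # predicate can never fire again if this runs first.
        if self.backward_active_depth == 0:
            self.backward_active_depth = 1

# Correct order: read the predicate first, then enter the phase.
s = HookState()
refresh_expected = s.should_refresh_expected_hook_count()  # True
s.reenter_backward_if_needed()

# Wrong order: entering first makes the predicate always False.
s2 = HookState()
s2.reenter_backward_if_needed()
broken = s2.should_refresh_expected_hook_count()  # False
```

This is why the call site reads the predicate before reenter_backward_if_needed(), even though the value is only consumed later.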

@mjkvaak-amd

Confirming that this fix improves Qwen-Image training performance on 8× AMD MI300X by ~10× compared to DeepSpeed v0.18.6.

@rraminen (Contributor) commented Mar 5, 2026

This fix resolves the ~20% performance regression introduced in DeepSpeed v0.18.5 (311674f) for the DeepSpeed-Megatron GPT-3 13B workload.

@tohtana tohtana marked this pull request as ready for review March 5, 2026 17:06
@tohtana tohtana requested review from loadams and tjruwase as code owners March 5, 2026 17:06
tohtana and others added 2 commits March 5, 2026 09:09
Co-authored-by: Ramya Ramineni <rraminen@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana force-pushed the tohtana/fix-perf-regression branch from fd933f3 to c43e423 on March 5, 2026 17:10
@tohtana tohtana merged commit 6c59d54 into deepspeedai:master Mar 5, 2026
1 check passed


4 participants