Fix hook count performance regression from v0.18.5 #7886

Merged: tohtana merged 5 commits into deepspeedai:master from tohtana:tohtana/fix-perf-regression on Mar 5, 2026

Conversation

@tohtana (Collaborator) commented Mar 4, 2026

Fixes performance regressions reported in #7882 and #7885.

PR #7780 added dynamic hook count computation for reentrant checkpointing correctness, but placed the call inside every gradient hook closure. For a model with n parameter tensors, this adds an O(n) scan to each of the n hooks, i.e. O(n^2) work per backward pass.

Summary:

  1. Added should_refresh_expected_hook_count() predicate that returns true only at backward phase boundaries (first hook, or new reentrant phase), so count_used_parameters_in_backward() is called once per phase instead of once per hook.
  2. Applied this predicate in ZeRO-1/2 (stage_1_and_2.py) and both ZeRO-3 hook sites (stage3.py), reusing the cached_max_expected_hooks_seen value when refresh isn't needed.
  3. Changed enter_backward() to reset hook counters on first real backward entry, preventing pollution from pre-user-backward autograd calls (e.g., TiledFusedLogitsLoss).
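The refresh pattern in the summary above can be sketched as follows. This is a simplified illustration, not the actual DeepSpeed implementation: the names (`enter_backward`, `should_refresh_expected_hook_count`, `cached_max_expected_hooks_seen`, the injected `count_used_parameters_in_backward` callable) follow this PR's description, but the state layout and `on_gradient_hook` helper are assumptions.

```python
class BackwardHookStateManager:
    """Sketch of the phase-boundary refresh pattern described in this PR.

    The expensive O(n) parameter scan (count_used_parameters_in_backward)
    runs once per backward phase instead of once per gradient hook.
    """

    def __init__(self, count_used_parameters_in_backward):
        self._count_fn = count_used_parameters_in_backward
        self.hooks_seen_in_phase = 0
        self.cached_max_expected_hooks_seen = 0

    def should_refresh_expected_hook_count(self):
        # A phase boundary is the point where no hooks have fired yet
        # in the current (possibly reentrant) backward phase.
        return self.hooks_seen_in_phase == 0

    def enter_backward(self):
        # Reset on entry to a backward phase so the next hook triggers
        # exactly one refresh; this also discards counter state polluted
        # by pre-user-backward autograd calls.
        self.hooks_seen_in_phase = 0

    def on_gradient_hook(self):
        # Called from each gradient hook closure (hypothetical helper).
        if self.should_refresh_expected_hook_count():
            self.cached_max_expected_hooks_seen = self._count_fn()
        self.hooks_seen_in_phase += 1
        return self.cached_max_expected_hooks_seen
```

With 147 parameter tensors, the scan callable fires once for the whole phase rather than 147 times, which is the core of the fix.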

With a 24-layer transformer (~267M params, 147 parameter tensors), ZeRO-2, 8×H100 80GB, bf16, batch size 8, 20 warmup + 20 measured iterations:

  • Before fix: 0.1265s/iter
  • After fix: 0.0505s/iter (~2.5× faster)

tohtana added 3 commits March 4, 2026 12:49
Add phase-refresh predicate to BackwardHookStateManager to avoid
calling count_used_parameters_in_backward() on every gradient hook.
Harden enter_backward() to reset counters on first real backward
entry, preventing pre-user-backward hook pollution.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Use should_refresh_expected_hook_count() to only recompute the
expected hook count at phase boundaries instead of every hook
invocation, reducing O(n^2) overhead to O(p*n).

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Apply the same refresh/cache pattern in both reduce_partition_and_remove_grads
and reduce_leaf_module_grads to avoid per-hook O(n) overhead.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana (Collaborator, Author) commented Mar 4, 2026

@codex

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. More of your lovely PRs please.


    @instrument_w_nvtx
    def reduce_partition_and_remove_grads(*notneeded):
        refresh_expected = self.should_refresh_expected_hook_count()
@delock (Collaborator) commented:

Hi @tohtana, the refresh_expected = self.should_refresh_expected_hook_count() line is used on L1289, so why is it placed before L1284 (reenter_backward_if_needed()) and L1286 (reduce_ready_partitions_and_remove_grads())? Is there an implicit dependency here?

@tohtana (Collaborator, Author) replied:

Thank you for checking, @delock!

Yes, there's actually a dependency. should_refresh_expected_hook_count() detects reentrant phase boundaries by checking backward_active_depth == 0. reenter_backward_if_needed() increments backward_active_depth from 0 → 1 when it detects a new phase. If we called reenter_backward first, it would set backward_active_depth = 1 before the predicate runs, making the condition always false.

I added a comment to clarify it.
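The ordering dependency explained above can be shown with a toy model. The names (`backward_active_depth`, `reenter_backward_if_needed`) follow the discussion, but the class below is a deliberately minimal assumption, not DeepSpeed code:

```python
class HookState:
    """Toy model of why the predicate must run before reenter_backward_if_needed()."""

    def __init__(self):
        self.backward_active_depth = 0  # 0 = no active backward phase yet

    def should_refresh_expected_hook_count(self):
        # A new reentrant phase is detected while the depth is still 0.
        return self.backward_active_depth == 0

    def reenter_backward_if_needed(self):
        # Entering the phase bumps the depth from 0 to 1, so the
        # predicate can never fire again if this runs first.
        if self.backward_active_depth == 0:
            self.backward_active_depth = 1

# Correct order: read the predicate first, then enter the phase.
s = HookState()
refresh_expected = s.should_refresh_expected_hook_count()  # True
s.reenter_backward_if_needed()

# Wrong order: entering first makes the predicate always False.
s2 = HookState()
s2.reenter_backward_if_needed()
broken = s2.should_refresh_expected_hook_count()  # False
```

This is why the call site reads the predicate before reenter_backward_if_needed(), even though the value is only consumed later.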

@mjkvaak-amd

Confirming that this fix improves Qwen-Image training performance on 8× AMD MI300X by ~10× compared to DeepSpeed v0.18.6.

@rraminen (Contributor) commented Mar 5, 2026

This fix resolves the ~20% performance regression introduced in DeepSpeed v0.18.5 (311674f) for the DeepSpeed-Megatron GPT-3 13B workload.

@tohtana tohtana marked this pull request as ready for review March 5, 2026 17:06
@tohtana tohtana requested review from loadams and tjruwase as code owners March 5, 2026 17:06
tohtana and others added 2 commits March 5, 2026 09:09
Co-authored-by: Ramya Ramineni <rraminen@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana force-pushed the tohtana/fix-perf-regression branch from fd933f3 to c43e423 on March 5, 2026 17:10
@tohtana tohtana merged commit 6c59d54 into deepspeedai:master Mar 5, 2026
1 check passed


4 participants