Fix hook count performance regression from v0.18.5 #7886
tohtana merged 5 commits into deepspeedai:master
Conversation
Add phase-refresh predicate to BackwardHookStateManager to avoid calling count_used_parameters_in_backward() on every gradient hook. Harden enter_backward() to reset counters on first real backward entry, preventing pre-user-backward hook pollution. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Use should_refresh_expected_hook_count() to only recompute the expected hook count at phase boundaries instead of every hook invocation, reducing O(n^2) overhead to O(p*n). Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Apply the same refresh/cache pattern in both reduce_partition_and_remove_grads and reduce_leaf_module_grads to avoid per-hook O(n) overhead. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
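The refresh/cache pattern described in these commits can be sketched as follows. This is a simplified, hypothetical rendering: should_refresh_expected_hook_count(), count_used_parameters_in_backward(), backward_active_depth, and cached_max_expected_hooks_seen are named in the PR, but the class body and on_grad_hook() helper are illustrative only, not DeepSpeed's actual implementation:

```python
# Sketch of the phase-boundary refresh pattern: the O(n) parameter
# recount runs only when a new backward phase begins; every other
# hook invocation reuses the cached value.

class BackwardHookStateManager:
    def __init__(self, params):
        self.params = params
        self.backward_active_depth = 0
        self.cached_max_expected_hooks_seen = 0

    def should_refresh_expected_hook_count(self):
        # True only at a phase boundary: no backward phase active yet.
        return self.backward_active_depth == 0

    def count_used_parameters_in_backward(self):
        # O(n) scan over all parameter tensors.
        return sum(1 for p in self.params if p.get("requires_grad", True))

    def on_grad_hook(self):
        # Called once per parameter gradient hook. Recompute the
        # expected count only at phase boundaries; otherwise reuse it.
        if self.should_refresh_expected_hook_count():
            self.cached_max_expected_hooks_seen = (
                self.count_used_parameters_in_backward()
            )
            self.backward_active_depth = 1
        return self.cached_max_expected_hooks_seen
```

The key design point is that both hook paths (reduce_partition_and_remove_grads and reduce_leaf_module_grads) consult the same predicate and cache, so neither pays the O(n) recount on steady-state hook invocations.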
Codex Review: Didn't find any major issues. More of your lovely PRs please.
@instrument_w_nvtx
def reduce_partition_and_remove_grads(*notneeded):
    refresh_expected = self.should_refresh_expected_hook_count()
Hi @tohtana, the refresh_expected = self.should_refresh_expected_hook_count() result is used on L1289, so why is it placed before L1284 (reenter_backward_if_needed()) and L1286 (reduce_ready_partitions_and_remove_grads())? Is there an implicit dependency here?
Thank you for checking, @delock!
Yes, there's actually a dependency. should_refresh_expected_hook_count() detects reentrant phase boundaries by checking backward_active_depth == 0. reenter_backward_if_needed() increments backward_active_depth from 0 → 1 when it detects a new phase. If we called reenter_backward first, it would set backward_active_depth = 1 before the predicate runs, making the condition always false.
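The ordering dependency described above can be illustrated with a minimal sketch. This is a hypothetical, simplified tracker (only backward_active_depth and the two method names come from the PR thread; everything else is made up for illustration), showing why the predicate must be evaluated before reenter_backward_if_needed():

```python
# Minimal sketch of the ordering dependency: reenter_backward_if_needed()
# bumps backward_active_depth from 0 to 1, so if it ran first, the
# depth == 0 check in the predicate would always be false.

class PhaseTracker:
    def __init__(self):
        self.backward_active_depth = 0

    def should_refresh_expected_hook_count(self):
        # A new backward phase is detected only while depth is still 0.
        return self.backward_active_depth == 0

    def reenter_backward_if_needed(self):
        # Entering a new phase raises the depth to 1.
        if self.backward_active_depth == 0:
            self.backward_active_depth += 1

# Correct order: evaluate the predicate first, then reenter.
t = PhaseTracker()
refresh_expected = t.should_refresh_expected_hook_count()  # True at boundary
t.reenter_backward_if_needed()

# Wrong order: reentering first makes the predicate always False.
t2 = PhaseTracker()
t2.reenter_backward_if_needed()
refresh_wrong = t2.should_refresh_expected_hook_count()  # False
```

With the wrong order the expected hook count would never be refreshed at reentrant phase boundaries, which is exactly the bug the reviewer was probing for.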
I added a comment to clarify it.
Confirming that this fix improves Qwen-Image training performance on AMD 8× MI300X by ~10× compared to DeepSpeed v0.18.6.
This fix resolves the ~20% performance regression introduced in DeepSpeed v0.18.5 (311674f) for the DeepSpeed-Megatron GPT-3 13B workload. |
Co-authored-by: Ramya Ramineni <rraminen@users.noreply.github.com> Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Fixes performance regressions reported in #7882 and #7885.
PR #7780 added dynamic hook count computation for reentrant checkpointing correctness, but placed the call inside every gradient hook closure. For a model with n parameter tensors, each hook then performs an O(n) recount, giving O(n^2) work per backward pass.
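A back-of-envelope sketch of the cost difference (the counter functions are hypothetical; n = 147 matches the benchmark setup below, and p = 2 is an assumed phase count, e.g. the main backward plus one reentrant checkpoint phase):

```python
# Hypothetical cost model for the per-hook vs per-phase recount.

def ops_per_backward_old(n):
    # Before the fix: every one of the n hooks triggers an O(n)
    # recount -> O(n^2) per backward pass.
    return n * n

def ops_per_backward_new(n, p):
    # After the fix: the O(n) recount runs once per phase (p scans),
    # plus O(1) cached-value lookups in each of the n hooks.
    return p * n + n

n = 147  # parameter tensors, per the benchmark setup in this PR
p = 2    # assumed: main backward + one reentrant checkpoint phase
print(ops_per_backward_old(n))     # 21609
print(ops_per_backward_new(n, p))  # 441
```

Even at this modest model size the recount work drops by roughly 50×, which is consistent with the regression being most visible on large, many-tensor models.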
Summary:
- Added a should_refresh_expected_hook_count() predicate that returns true only at backward phase boundaries (first hook, or a new reentrant phase), so count_used_parameters_in_backward() is called once per phase instead of once per hook.
- When a refresh isn't needed, hooks reuse the cached_max_expected_hooks_seen value.

With a 24-layer transformer, ~267M params (147 parameter tensors), ZeRO-2, 8×H100 80GB, bf16, batch size 8, 20 warmup + 20 measured iterations: