Fix ping-pong buffer index reset and remove redundant stream sync #7805
Conversation
Running the script, the result with the fix matches the baseline perfectly. To demonstrate that the gradient corruption (which #7371 tried to fix) was indeed caused by the broken buffer indexing from #6993, I also verified the incorrect behavior (sync removed, buffer index bug remains).
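The corruption mode being demonstrated can be illustrated with a toy simulation (pure Python, hypothetical simplification; the real code uses CUDA streams and flattened gradient buffers). With `overlap_comm`, an "async" reduction may read its buffer only after the next bucket has already been packed; with a single effective buffer the new bucket overwrites the data the pending reduction still needs:

```python
# Toy model of overlapped reduction with one vs. two IPG buffers.
# Hypothetical simplification: a pending async reduce reads its buffer
# one step late, as overlap_comm permits.

def run(buckets, num_buffers):
    buffers = [None] * num_buffers
    pending = []   # (buffer_slot, expected_data) for in-flight reductions
    results = []
    index = 0
    for bucket in buckets:
        buffers[index] = list(bucket)          # pack bucket into buffer
        pending.append((index, list(bucket)))  # launch "async" reduce
        index = (index + 1) % num_buffers      # ping-pong swap (no-op if 1)
        if len(pending) == 2:                  # reduce completes one step late
            slot, expected = pending.pop(0)
            results.append(buffers[slot] == expected)
    for slot, expected in pending:             # drain remaining work
        results.append(buffers[slot] == expected)
    return results  # True = reduced correct data, False = corrupted

print(run([[1], [2], [3]], num_buffers=2))  # [True, True, True]
print(run([[1], [2], [3]], num_buffers=1))  # [False, False, True]
```

With two buffers every in-flight reduction still sees its own data; with one, each overlapped reduction reads the next bucket's contents instead, which is exactly the corruption #7371 papered over with extra synchronization.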
Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Thanks @undersilence for the fix! This is truly crucial, and it looks good for the single-dtype case. One concern for the multi-dtype case: I opened another PR on your PR branch to address it by launching an allreduce per dtype. Could you share your thoughts?
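The per-dtype idea can be sketched without `torch` (hypothetical names; the real change calls a collective such as `torch.distributed.all_reduce` on each group): partition the bucket's gradients by dtype and issue one reduction per group, rather than mixing fp16/bf16/fp32 in a single flattened call.

```python
from collections import defaultdict

def allreduce_per_dtype(grads, reduce_op):
    """Group gradients by dtype and launch one reduction per group.

    `grads` is a list of (dtype, values) pairs; `reduce_op` is a stand-in
    for a real collective such as torch.distributed.all_reduce.
    """
    groups = defaultdict(list)
    for dtype, values in grads:
        groups[dtype].append(values)
    # One collective call per dtype keeps each flattened buffer homogeneous.
    return {dtype: reduce_op(bufs) for dtype, bufs in groups.items()}

# Toy reduce_op: elementwise sum across the group's buffers.
def sum_reduce(bufs):
    return [sum(col) for col in zip(*bufs)]

grads = [("fp16", [1, 2]), ("fp32", [10, 20]), ("fp16", [3, 4])]
print(allreduce_per_dtype(grads, sum_reduce))
# → {'fp16': [4, 6], 'fp32': [10, 20]}
```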
Thanks for the detailed review and for opening the PR! You are right; isolating the allreduce by dtype is the correct fix here. I really appreciate your help in making this robust!
Launch allreduce per dtype
@undersilence Thank you for merging my PR! This is a great improvement, and we really appreciate your contribution.
Fix ping-pong buffer index reset and remove redundant stream sync (deepspeedai#7805)

Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
Fix #7804 and #7188

After investigating the code in `deepspeed/runtime/zero/stage_1_and_2.py`, I have identified the root cause. The regression in communication overlap was introduced in PR #7371. While the additional two-stream synchronization in that PR fixes gradient corruption, it effectively disables the overlapping behavior.

The underlying issue causing the gradient corruption (which #7371 attempted to fix) was actually introduced in PR #6993. In that PR, `bucket.clear()` incorrectly resets the ping-pong buffer index to 0 at the end of `reduce_ipg_grads`. This disrupts the buffer-index swapping mechanism within `reduce_independent_p_g_buckets_and_remove_grads`.

To fix this, L121 in `deepspeed/runtime/zero/stage_1_and_2.py` should be removed so the buffer index is no longer reset. Additionally, the stream synchronization logic introduced in #7371 should be removed to restore `overlap_comm=True` functionality.
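The indexing mechanism described above can be sketched in isolation (hypothetical function; the real logic lives in `reduce_independent_p_g_buckets_and_remove_grads`): each flush is supposed to swap to the other buffer so the next bucket can fill while the previous one is still being reduced, and an unconditional reset pins every bucket to buffer 0.

```python
# Minimal sketch of the ping-pong buffer index (hypothetical names).
# Models the effect of the bucket.clear() reset from deepspeedai#6993.

def buffer_indices(num_flushes, reset_after_flush):
    """Return the buffer index used by each successive bucket flush."""
    index = 0
    used = []
    for _ in range(num_flushes):
        used.append(index)
        index = (index + 1) % 2  # swap buffers so reduction can overlap
        if reset_after_flush:
            index = 0            # the bug: the reset cancels the swap
    return used

print(buffer_indices(4, reset_after_flush=False))  # alternates: [0, 1, 0, 1]
print(buffer_indices(4, reset_after_flush=True))   # stuck:      [0, 0, 0, 0]
```

With the reset in place both "ping" and "pong" collapse onto buffer 0, so an overlapped reduction can read data that the next bucket has already overwritten; removing the reset restores the alternation and makes the extra synchronization from #7371 unnecessary.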