[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed #42067
Conversation
lgtm
@mathewjacob1002 please add the related labels to the PR title.
@mathewjacob1002 the failure in …
Merged to master and branch-3.5.
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed

Made the DeepspeedTorchDistributor run() method use the _run() function as its backbone. This allows the user to easily run distributed training of a function with DeepSpeed: the user can now pass a function as the train_object when calling DeepspeedTorchDistributor.run(). The function must contain all necessary imports and must be picklable. An example use case can be found in the Python file linked in the JIRA ticket. Tested via the notebook/file linked in the JIRA ticket; formal e2e tests will come in a future PR.

Next steps:
- [ ] Add more e2e tests for both running a regular PyTorch file and running a function for training
- [ ] Write more documentation

Closes #42067 from mathewjacob1002/add_func_deepspeed.

Authored-by: Mathew Jacob <mathew.jacob@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 392f8d8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Made the DeepspeedTorchDistributor run() method use the _run() function as its backbone.
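The delegation described above can be pictured with a minimal sketch. This is not the actual PySpark implementation: the class and the helper names other than run()/_run() are hypothetical, and the sketch only illustrates how a single _run() backbone can serve both a callable train_object and a script path.

```python
import os


class DistributorSketch:
    """Illustrative stand-in for DeepspeedTorchDistributor (not the real class)."""

    def run(self, train_object, *args):
        # run() only decides how to launch; the shared _run() does the work.
        if callable(train_object):
            return self._run(self._launch_function, train_object, *args)
        # Otherwise treat train_object as a path to a training script.
        return self._run(self._launch_script, train_object, *args)

    def _run(self, launcher, train_object, *args):
        # Common backbone: in a real distributor, environment setup,
        # process launch, and teardown would all live here.
        os.environ.setdefault("WORLD_SIZE", "1")
        return launcher(train_object, *args)

    def _launch_function(self, fn, *args):
        return fn(*args)

    def _launch_script(self, path, *args):
        return f"would exec {path} with args {args}"


dist = DistributorSketch()
print(dist.run(lambda x: x * 2, 21))          # function branch -> 42
print(dist.run("train.py", "--epochs", "3"))  # script-path branch
```

Routing both entry points through one backbone is what lets the function-training path reuse the environment setup already written for script-based training.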
Why are the changes needed?
It allows the user to easily run distributed training of a function with DeepSpeed.
Does this PR introduce any user-facing change?
Yes. The user can now pass a function as the train_object when calling DeepspeedTorchDistributor.run(). The function must contain all necessary imports and must be picklable. An example use case can be found in the Python file linked in the JIRA ticket.
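The self-containment and picklability requirements can be sketched with the standard library alone; train_fn here is a hypothetical stand-in for a real training function, not code from this PR. Standard pickle serializes module-level functions by reference (Spark actually ships functions with cloudpickle, which serializes by value), but the constraint is the same either way: the function must carry its own imports, since it executes in worker processes that share none of the driver's namespace.

```python
import pickle


def train_fn(learning_rate, epochs):
    # All imports live inside the function so it is self-contained
    # when it runs on a worker process.
    import math
    # A stand-in for a real DeepSpeed training loop.
    return learning_rate * math.sqrt(epochs)


# A distributor can only ship functions it can serialize, so a quick
# round-trip check catches problems before any cluster work starts.
payload = pickle.dumps(train_fn)
restored = pickle.loads(payload)
print(restored(0.1, 4))  # prints 0.2
```

A function that closes over objects defined only on the driver (open file handles, unimportable locals) would fail this round trip, which is exactly the class of error the picklability requirement rules out up front.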
How was this patch tested?
Tested via the notebook/file linked in the JIRA ticket. Formal e2e tests will come in a future PR.
Next Steps/Timeline
- [ ] Add more e2e tests for both running a regular PyTorch file and running a function for training
- [ ] Write more documentation