
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed #42067

Closed

Conversation

mathewjacob1002 (Contributor)

What changes were proposed in this pull request?

Made the DeepspeedTorchDistributor run() method use the internal _run() function as its backbone.
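
For context, a rough sketch of the structure this describes (the wrapper helper names below are hypothetical illustrations, not the actual private API):

```python
from pyspark.ml.torch.distributor import TorchDistributor

# Hypothetical sketch of the run()-over-_run() structure described above;
# the wrapper names are illustrative, not the real private methods.
class DeepspeedTorchDistributor(TorchDistributor):
    def run(self, train_object, *args):
        if callable(train_object):
            # A picklable function: wrap it and ship it to the workers.
            wrapper = self._run_training_on_function  # hypothetical name
        else:
            # A path to a training file: launch it as a script on each worker.
            wrapper = self._run_training_on_file  # hypothetical name
        return self._run(wrapper, train_object, *args)
```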

Why are the changes needed?

It allows the user to easily run distributed training of a function with DeepSpeed.

Does this PR introduce any user-facing change?

Yes. This adds the ability for the user to pass a function as the train_object when calling DeepspeedTorchDistributor.run(). All necessary imports must live inside the function itself, and the function must be picklable. An example use case can be found in the Python file linked in the JIRA ticket.
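
For illustration, a minimal usage sketch (not taken from the PR; the constructor arguments and the DeepSpeed config below are assumptions, shown only to make the calling convention concrete):

```python
from pyspark.ml.deepspeed.deepspeed_distributor import DeepspeedTorchDistributor

def train_fn(learning_rate):
    # All imports live inside the function so it stays picklable and its
    # dependencies resolve on each worker process.
    import torch.nn as nn
    import deepspeed

    model = nn.Linear(10, 1)
    # Illustrative DeepSpeed config; tune for a real workload.
    ds_config = {
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": learning_rate}},
    }
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    # ... training loop over `engine` goes here ...
    return "finished"

# Constructor parameters assumed to mirror TorchDistributor conventions;
# requires an active SparkSession (and a cluster when localMode=False).
distributor = DeepspeedTorchDistributor(numGpus=2, nnodes=1, localMode=False)
result = distributor.run(train_fn, 1e-3)  # extra args are forwarded to train_fn
```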

How was this patch tested?

Manually, via the notebook/file linked in the JIRA ticket. Formal end-to-end tests will come in a future PR.

Next Steps/Timeline

  • Add more end-to-end tests for both running a regular PyTorch file and running a function for training
  • Write more documentation

rithwik-db (Contributor) left a comment

lgtm

@zhengruifeng zhengruifeng changed the title [SPARK-44264] Support Distributed Training of Functions Using Deepspeed [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed Jul 19, 2023
zhengruifeng (Contributor)

@mathewjacob1002 please add the related labels to the PR title

@mathewjacob1002 mathewjacob1002 changed the title [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed [SPARK-44264][ML][CORE][PYTHON] Support Distributed Training of Functions Using Deepspeed Jul 19, 2023
@zhengruifeng zhengruifeng changed the title [SPARK-44264][ML][CORE][PYTHON] Support Distributed Training of Functions Using Deepspeed [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed Jul 19, 2023
@zhengruifeng zhengruifeng removed the CORE label Jul 19, 2023
zhengruifeng (Contributor)

the [CORE] label automatically added by the labeler is not correct :)

zhengruifeng (Contributor)

@mathewjacob1002 the failure in the Run Spark on Kubernetes Integration test is unrelated, but you may have to fix the Python lint by running dev/reformat-python

@github-actions github-actions bot added the CORE label Jul 19, 2023
@mathewjacob1002 mathewjacob1002 marked this pull request as ready for review July 19, 2023 06:21
HyukjinKwon (Member)

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Jul 19, 2023
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed

Closes #42067 from mathewjacob1002/add_func_deepspeed.

Authored-by: Mathew Jacob <mathew.jacob@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 392f8d8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed

Closes apache#42067 from mathewjacob1002/add_func_deepspeed.

Authored-by: Mathew Jacob <mathew.jacob@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>