
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed #42067

Closed

Conversation

mathewjacob1002 (Contributor)

What changes were proposed in this pull request?

Made the DeepspeedTorchDistributor run() method use the internal _run() function as its backbone.
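
For context, a rough sketch of the structure this describes (the wrapper helper names below are hypothetical illustrations, not the actual private API):

```python
from pyspark.ml.torch.distributor import TorchDistributor

# Hypothetical sketch of the run()-over-_run() structure described above;
# the wrapper names are illustrative, not the real private methods.
class DeepspeedTorchDistributor(TorchDistributor):
    def run(self, train_object, *args):
        if callable(train_object):
            # A picklable function: wrap it and ship it to the workers.
            wrapper = self._run_training_on_function  # hypothetical name
        else:
            # A path to a training file: launch it as a script on each worker.
            wrapper = self._run_training_on_file  # hypothetical name
        return self._run(wrapper, train_object, *args)
```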

Why are the changes needed?

It allows the user to easily run distributed training of a function with DeepSpeed.

Does this PR introduce any user-facing change?

Yes. This adds the ability for the user to pass a function as the train_object when calling DeepspeedTorchDistributor.run(). All necessary imports must live inside the function itself, and the function must be picklable. An example use case can be found in the Python file linked in the JIRA ticket.
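
For illustration, a minimal usage sketch (not taken from the PR; the constructor arguments and the DeepSpeed config below are assumptions, shown only to make the calling convention concrete):

```python
from pyspark.ml.deepspeed.deepspeed_distributor import DeepspeedTorchDistributor

def train_fn(learning_rate):
    # All imports live inside the function so it stays picklable and its
    # dependencies resolve on each worker process.
    import torch.nn as nn
    import deepspeed

    model = nn.Linear(10, 1)
    # Illustrative DeepSpeed config; tune for a real workload.
    ds_config = {
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": learning_rate}},
    }
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    # ... training loop over `engine` goes here ...
    return "finished"

# Constructor parameters assumed to mirror TorchDistributor conventions;
# requires an active SparkSession (and a cluster when localMode=False).
distributor = DeepspeedTorchDistributor(numGpus=2, nnodes=1, localMode=False)
result = distributor.run(train_fn, 1e-3)  # extra args are forwarded to train_fn
```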

How was this patch tested?

Manually, via the notebook/file linked in the JIRA ticket. Formal end-to-end tests will come in a future PR.

Next Steps/Timeline

  • Add more end-to-end tests for both running a regular PyTorch file and running a function for training
  • Write more documentation

rithwik-db (Contributor) left a comment

lgtm

@zhengruifeng zhengruifeng changed the title [SPARK-44264] Support Distributed Training of Functions Using Deepspeed [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed Jul 19, 2023
zhengruifeng (Contributor)

@mathewjacob1002 please add the related labels to the PR title

@mathewjacob1002 mathewjacob1002 changed the title [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed [SPARK-44264][ML][CORE][PYTHON] Support Distributed Training of Functions Using Deepspeed Jul 19, 2023
@zhengruifeng zhengruifeng changed the title [SPARK-44264][ML][CORE][PYTHON] Support Distributed Training of Functions Using Deepspeed [SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed Jul 19, 2023
@zhengruifeng zhengruifeng removed the CORE label Jul 19, 2023
zhengruifeng (Contributor)

the [CORE] label automatically added by the labeler is not correct :)

zhengruifeng (Contributor)

@mathewjacob1002 the failure in the Run Spark on Kubernetes Integration test is unrelated, but you may have to fix the Python lint by running dev/reformat-python

@github-actions github-actions bot added the CORE label Jul 19, 2023
@mathewjacob1002 mathewjacob1002 marked this pull request as ready for review July 19, 2023 06:21
HyukjinKwon (Member)

Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Jul 19, 2023
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed

Closes #42067 from mathewjacob1002/add_func_deepspeed.

Authored-by: Mathew Jacob <mathew.jacob@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 392f8d8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
[SPARK-44264][ML][PYTHON] Support Distributed Training of Functions Using Deepspeed

Closes apache#42067 from mathewjacob1002/add_func_deepspeed.

Authored-by: Mathew Jacob <mathew.jacob@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>