
[SPARK-44264][ML][PYTHON] Refactoring TorchDistributor To Allow for Custom "run_training_on_file" Function Pointer #41973

Closed

mathewjacob1002 wants to merge 14 commits into apache:master from mathewjacob1002:distributed_func_support_prototype

Conversation

mathewjacob1002 (Contributor) commented Jul 12, 2023

What Was Changed

We allow a custom function pointer to be passed through the private functions that perform distributed training.

Why Do We Need This Change

By abstracting "run_training_on_pytorch_file" into a function that can be passed in, we make it much easier to create distributors that run on top of torch.distributed. Specifically, it makes it easy to implement distributed training of picklable functions in DeepspeedTorchDistributor. Likewise, if accelerators built on top of torch.distributed come out in the future, supporting them in Spark becomes straightforward. One can simply do the following (see the sketch after this list):

  1. Inherit from TorchDistributor and define a _run_training_on_pytorch_file function (or equivalent) for your class.
  2. When defining run(...), simply return _run() and pass your custom _run_training_on_pytorch_file function in as the respective argument.
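A minimal sketch of those two steps. The subclass name and the `_run_training_on_my_accelerator_file` runner are illustrative assumptions, not Spark internals; the `_run(...)` argument order follows the signature discussed in the review below:

```python
from typing import Any, Callable, Dict, Optional, Union

from pyspark.ml.torch.distributor import TorchDistributor


class MyAcceleratorDistributor(TorchDistributor):
    """Hypothetical distributor for a future accelerator built on torch.distributed."""

    @staticmethod
    def _run_training_on_my_accelerator_file(
        input_params: Dict[str, Any], train_path: str, *args: Any, **kwargs: Any
    ) -> None:
        # Step 1: launch the training file with the accelerator's own
        # torchrun-style launcher instead of the stock PyTorch one.
        ...

    def run(self, train_object: Union[Callable, str], *args: Any, **kwargs: Any) -> Optional[Any]:
        # Step 2: choose the wrapper based on the type of train_object and hand
        # the custom file-runner to the shared _run(...) machinery.
        framework_wrapper: Callable = (
            MyAcceleratorDistributor._run_training_on_my_accelerator_file
            if isinstance(train_object, str)
            else TorchDistributor._run_training_on_pytorch_function
        )
        return self._run(
            framework_wrapper,
            train_object,
            MyAcceleratorDistributor._run_training_on_my_accelerator_file,
            *args,
            **kwargs,
        )
```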

Any User-Facing Changes?

No.

How Is This Tested?

The existing tests for TorchDistributor.

…raining_on_pytorch_file function pointer. Motivation is to make it easier for future developers to add their own distributed trainer for other accelerators that come out in the future
@mathewjacob1002 mathewjacob1002 changed the title [DO NOT MERGE/REVIEW] PROTOTYPING: refactoring the TorchDistributor code to take in a run_t… [Spark Ticket] Refactoring TorchDistributor To Allow for Custom "run_training_on_file" Function Pointer Jul 14, 2023
@mathewjacob1002 mathewjacob1002 changed the title [Spark Ticket] Refactoring TorchDistributor To Allow for Custom "run_training_on_file" Function Pointer [SPARK-44264] Refactoring TorchDistributor To Allow for Custom "run_training_on_file" Function Pointer Jul 14, 2023
        return framework_wrapper(input_params, train_object, *args, **kwargs)
    else:
        # We are doing training with a function; will call run_training_on_pytorch_function
        if not run_pytorch_file_fn:
Contributor:
Remove this and set the parameter to be run_pytorch_file_fn: Optional[Callable] = TorchDistributor._run...

Contributor Author:
This won't work because of the *args and **kwargs after it: a parameter with a default placed before *args is still filled by the first positional argument, so in my experience Python can't make the default value behave here.
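To illustrate the pitfall with a hypothetical toy signature (not the actual code): the default never applies once callers pass extra positional arguments.

```python
def _run(framework_wrapper_fn, run_pytorch_file_fn=None, *args):
    # run_pytorch_file_fn sits before *args, so any extra positional
    # argument binds to it first, not to args.
    print(run_pytorch_file_fn, args)

_run("wrapper", 1, 2, 3)  # prints: 1 (2, 3) -- the default was silently clobbered
```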

Contributor:
@staticmethod
def _run_training_on_pytorch_function(
-    input_params: Dict[str, Any], train_fn: Callable, *args: Any, **kwargs: Any
+    input_params: Dict[str, Any], train_fn: Callable, run_pytorch_file_fn: Optional[Callable], *args: Any, **kwargs: Any
Contributor:
Same as here

    self,
    framework_wrapper_fn: Callable,
    train_object: Union[Callable, str],
    run_pytorch_file_fn: Optional[Callable],
rithwik-db (Contributor) commented Jul 14, 2023:
I'd probably move this variable before train_object, same as with the other functions.

Contributor Author:
Do you mind if I ask why?

Contributor:
This is just a nit, but we want to keep the code that relates to the training (train_object, *args, **kwargs) away from the utility arguments like framework_wrapper_fn and run_pytorch_file_fn, for the sake of readability.

Contributor Author:
When you say other functions, do you mean all of them? Wouldn't that interfere with the default-args comment, since defaults have to come after positional args, IIRC?

Contributor:
Yes, run_pytorch_file_fn isn't a keyword argument (yet); you can either do this comment or the other, but I'd prefer Weichen to weigh in first.

Contributor Author:
OK, sounds good!

@staticmethod
def _get_output_from_framework_wrapper(
    framework_wrapper: Optional[Callable],
    input_params: Dict,
    train_object: Union[Callable, str],
    run_pytorch_file_fn: Optional[Callable],
    *args,
    **kwargs,
) -> Optional[Any]:
    if not framework_wrapper:
        raise RuntimeError(
            "In the _get_output_from_framework_wrapper function, found a framework wrapper that is none. I wonder why this is..."
        )
Contributor:
What does this error message mean?

Contributor Author:
If the framework_wrapper is ever not a Callable, we want this error to be thrown, because that isn't supposed to happen. We only typed it as Optional[Callable] because the linter complained that otherwise nothing could be assigned to framework_wrapper.

Contributor:
We should make the error message clear.
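For example, something along these lines might be clearer (an illustrative wording, not necessarily the message that landed):

```python
if framework_wrapper is None:
    raise RuntimeError(
        "_get_output_from_framework_wrapper expected a callable framework_wrapper "
        "(the file-based or function-based training runner chosen from "
        "train_object), but got None."
    )
```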

Contributor Author:
How does the new one sound @lu-wang-dl?

Parameters
----------
framework_wrapper: Optional[Callable]
    Function pointer that will be invoked. Can either be the function that runs distributed training on
Contributor:
User-provided function?

Contributor:
Could we add a comment to indicate which one comes from the user input?

Contributor Author:
train_object is from the user - it's either a string representing a filepath or a function pointer that the user wants to run in a distributed fashion. I will try to make this more explicit in the docstring.

Returns
-------
Optional[Any]
    Returns the result of the framework_wrapper
Contributor:
Do we expect framework_wrapper to return anything?

Contributor Author:
What it returns depends on train_object. This is the same train_object as in the rest of the code: either a path to a file to execute or a function to run in a distributed fashion, and what framework_wrapper returns follows from which one it is.

Contributor Author:
framework_wrapper has the same meaning as before, in the run method.

    for functions if the train_object is a Callable
input_params: Dict
    A dictionary that maps parameters to arguments for the command to be created.
train_object: Union[Callable, str]
Contributor:
I cannot tell the difference between train_object and framework_wrapper from the comments.

Contributor Author:
Tried again to make it more obvious which is which. In a nutshell, train_object is passed in by the user, and framework_wrapper is something that DeepspeedTorchDistributor decides based on the type of train_object. See the sketch below.
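In other words, the dispatch looks roughly like this (an illustrative sketch of the selection logic, not the exact source):

```python
# train_object comes from the user: either a path to a training file (str)
# or a picklable training function (Callable).
if isinstance(train_object, str):
    framework_wrapper = TorchDistributor._run_training_on_pytorch_file
else:
    framework_wrapper = TorchDistributor._run_training_on_pytorch_function
```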

lu-wang-dl (Contributor) left a comment:
LGTM on my side. I will let Ricky do the final approval.

@mathewjacob1002 mathewjacob1002 marked this pull request as ready for review July 17, 2023 20:04
WeichenXu123 (Contributor) left a comment:
LGTM

rithwik-db (Contributor) left a comment:
Since Weichen is okay with this too, LGTM.

@HyukjinKwon HyukjinKwon changed the title [SPARK-44264] Refactoring TorchDistributor To Allow for Custom "run_training_on_file" Function Pointer [SPARK-44264][ML][PYTHON] Refactoring TorchDistributor To Allow for Custom "run_training_on_file" Function Pointer Jul 19, 2023
HyukjinKwon (Member) commented:
Merged to master and branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Jul 19, 2023
…ustom "run_training_on_file" Function Pointer

Closes #41973 from mathewjacob1002/distributed_func_support_prototype.

Lead-authored-by: Mathew Jacob <mathew.jacob@databricks.com>
Co-authored-by: Mathew Jacob <134338709+mathewjacob1002@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ee0e687)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>