
[SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode #40695

Closed
zhengruifeng wants to merge 8 commits from the torch_local_mode branch

Conversation

@zhengruifeng (Contributor) commented Apr 7, 2023

What changes were proposed in this pull request?

  • Add a new proto message for sc.resources (see the usage sketch below)
  • Make the PyTorch Distributor support local mode with GPUs
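For context, the new proto mirrors the classic PySpark sc.resources API, which maps resource names to ResourceInformation objects. A minimal illustration of that classic API (assumes a GPU discovery script is configured on the session; otherwise the dict is empty):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Each entry maps a resource name (e.g. "gpu") to a ResourceInformation
# object exposing .name and .addresses.
for name, info in spark.sparkContext.resources.items():
    print(name, info.addresses)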

Why are the changes needed?

For feature parity with classic PySpark.
After this PR, all UTs in test_distributor are reused and enabled in Spark Connect.

Does this PR introduce any user-facing change?

Yes, a new mode is supported in Connect.

How was this patch tested?

Enabled the existing UTs in test_distributor.

@zhengruifeng (Author)

cc @WeichenXu123 @HyukjinKwon

@@ -867,6 +878,8 @@ def _analyze(self, method: str, **kwargs: Any) -> AnalyzeResult:
        req.unpersist.blocking = cast(bool, kwargs.get("blocking"))
    elif method == "get_storage_level":
        req.get_storage_level.relation.CopyFrom(cast(pb2.Relation, kwargs.get("relation")))
    elif method == "resources":
A reviewer (Contributor) commented:

I'm not sure this is a good approach as it overloads the semantics of analyze with generic system metadata information.

My suggestion is to implement resource metadata using a command instead.

@zhengruifeng (Author) replied:

I feel sc.resources itself is not command-like; it is similar to spark.version, which also uses the analyze RPC.

The reviewer replied:

It's the same issue; I don't appreciate the way it was implemented. The analyze RPC has no semantic relationship to either version or resources. Please change resources now, as it's too late for version.

@WeichenXu123 (Contributor) commented Apr 10, 2023

We also need to change the TorchDistributor._run_local_training implementation.

The existing code executes the PyTorch code on the client side, but in the Spark Connect case we should execute it on the server side. We can reuse the _run_distributed_training code for that case; however, a local-mode Spark job does not support GPU scheduling, so instead we can broadcast the selected driver GPU list to all tasks and have each task select its GPU id via its task rank.
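A minimal sketch of that idea as a plain barrier job (num_tasks and the training launch are placeholders, not the actual TorchDistributor internals):

import os
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
num_tasks = 2  # placeholder for the number of training processes

# Driver side: collect the GPUs visible to the local-mode driver and broadcast them.
driver_gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")
gpus_bc = sc.broadcast(driver_gpus)

def run_task(_):
    ctx = BarrierTaskContext.get()
    # Each barrier task claims exactly one GPU, indexed by its task rank.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpus_bc.value[ctx.partitionId()]
    # ... launch the torch training process here ...
    yield ctx.partitionId()

sc.parallelize(range(num_tasks), num_tasks).barrier().mapPartitions(run_task).collect()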

@WeichenXu123 commented Apr 10, 2023

On second thought, I propose to make TorchDistributor._run_local_training support only Spark legacy mode.

For TorchDistributor._run_distributed_training, however, we should support both legacy mode and Spark Connect mode. That is, when running on a Spark local-mode cluster with TorchDistributor.local_mode=False, it executes TorchDistributor._run_distributed_training; in this case the current master code does not handle GPU allocation correctly, and you need to fix it (again, we can broadcast the selected driver GPU list to all tasks and have each task select its GPU id via its task rank).

@zhengruifeng (Author)

> The existing code executes the PyTorch code on the client side, but in the Spark Connect case we should execute it on the server side

Yes, but I feel it is non-trivial to execute the PyTorch code on the server side, since we would need to launch a new Python process on the server and then communicate with it.

> I propose to make TorchDistributor._run_local_training support only Spark legacy mode

Agreed.

> running on a Spark local-mode cluster, but the user sets TorchDistributor.local_mode=False

Why not just fail in that case?

@WeichenXu123

> running on a Spark local-mode cluster, but the user sets TorchDistributor.local_mode=False

I think we need to support this, because Spark ML algorithms implemented on top of TorchDistributor should work in both Spark local mode and Spark cluster mode.

@WeichenXu123 commented Apr 10, 2023

> Yes, but I feel it is non-trivial to execute the PyTorch code on the server side, since we would need to launch a new Python process on the server and then communicate with it.

I think it does not require too much work; we can reuse most of the TorchDistributor._run_distributed_training code. We just need to fix one issue: the current master code does not handle GPU allocation correctly. We can broadcast the selected driver GPU list to all tasks and have each task select its GPU id via its task rank.

@WeichenXu123 commented Apr 10, 2023

Summary, for Spark Connect mode:

  • If TorchDistributor.local_mode is True, raise an error saying it is not supported.
  • If TorchDistributor.local_mode is False and the Spark server runs in local mode, we need to fix the GPU-allocation issue above (#40695 (comment)).
  • If TorchDistributor.local_mode is False and the Spark server runs in cluster mode, the current master code works fine, with or without a GPU config.

@grundprinzip (Contributor)

What is local mode and why would you not support it on the client?

@WeichenXu123

> What is local mode

Let me clarify.

TorchDistributor has a "local mode" configuration: if True, it simply runs the torch program on the client side; if False, it launches a Spark job to run the torch program. So the TorchDistributor "local mode" has nothing to do with Spark's local master mode.
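A quick illustration of the two settings (train_fn is a placeholder training function):

from pyspark.ml.torch.distributor import TorchDistributor

def train_fn():
    pass  # placeholder: real code would build and train a torch model

# local_mode=True: train_fn runs directly where the distributor is created.
TorchDistributor(num_processes=2, local_mode=True, use_gpu=True).run(train_fn)

# local_mode=False: a barrier Spark job runs train_fn inside Spark tasks.
TorchDistributor(num_processes=2, local_mode=False, use_gpu=True).run(train_fn)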

> why would you not support it on the client?

I assume you mean running the torch program on the Spark Connect client machine. We could support this, but I think it is less useful: a client machine should usually run lightweight workloads, while torch programs are heavy and often require GPUs, a condition client machines can rarely satisfy.

@grundprinzip

> I assume you mean running the torch program on the Spark Connect client machine. We could support this, but I think it is less useful: a client machine should usually run lightweight workloads, while torch programs are heavy and often require GPUs, a condition client machines can rarely satisfy.

I don't think this is a valid assumption. With Spark Connect you can actually build an environment in which you have a GPU locally but no GPU on your cluster. In this case you still want to leverage the same execution flow. I've previously talked to users who were looking for an EC2 setup with a GPU attached, running workloads from there against Spark using Spark Connect.

This is very similar to running sklearn locally on the client side.

It's not the only way, but it's a very valid way.

@WeichenXu123 commented Apr 10, 2023

> I don't think this is a valid assumption. With Spark Connect you can actually build an environment in which you have a GPU locally but no GPU on your cluster. In this case you still want to leverage the same execution flow. I've previously talked to users who were looking for an EC2 setup with a GPU attached, running workloads from there against Spark using Spark Connect.

OK, makes sense; we can support it too, @zhengruifeng. In this case we can read the client-side environment variable CUDA_VISIBLE_DEVICES to determine which GPU devices we can use.
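A minimal sketch of that client-side check (the helper name _get_local_gpus is hypothetical, not from the PR):

import os

def _get_local_gpus():
    # Read the GPU ids visible to the client process; an unset or empty
    # CUDA_VISIBLE_DEVICES yields no GPUs.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in visible.split(",") if d]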

Comment on lines +566 to +572
if CUDA_VISIBLE_DEVICES in os.environ:
    return
A reviewer (Contributor) commented:

Shall we add this check?

@zhengruifeng (Author) replied:

I am not sure about this; it follows the distributed mode here, which respects the CUDA_VISIBLE_DEVICES env.

Which check? Do you mean to fail if CUDA_VISIBLE_DEVICES is already set?

@WeichenXu123 left a review:

LGTM except one last comment.

@zhengruifeng (Author)

@grundprinzip would you mind taking another look at the changes in protos?

else:

    def set_gpus(context: "BarrierTaskContext") -> None:
        if CUDA_VISIBLE_DEVICES in os.environ:
A reviewer (Contributor) commented:

Shall we add this check?


  env_vars = {"CUDA_VISIBLE_DEVICES": "3,4,5"}
  self.setup_env_vars(env_vars)
- self.assertEqual(get_gpus_owned(self.spark), ["3", "4", "5"])
+ self.assertEqual(_get_gpus_owned(self.spark), ["3", "4", "5"])
  self.delete_env_vars(env_vars)
A reviewer (Contributor) commented:

Let's add Spark Connect mode tests for local training with GPU for the following cases: spark.master=local and spark.master=local-cluster.

@zhengruifeng (Author) replied:

TorchDistributorLocalUnitTests and TorchDistributorLocalUnitTestsOnConnect already test local-cluster, so I just added TorchDistributorLocalUnitTestsII and TorchDistributorLocalUnitTestsIIOnConnect for local[4]; see the sketch below.
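A rough sketch of what the local[4] variant could look like (the _get_spark_master hook is hypothetical; the real suites may wire the master URL differently):

class TorchDistributorLocalUnitTestsII(TorchDistributorLocalUnitTests):
    # Hypothetical hook: reuse all tests from the local-cluster suite,
    # but run them against a local[4] master.
    @classmethod
    def _get_spark_master(cls):
        return "local[4]"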

@zhengruifeng zhengruifeng changed the title [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode Apr 12, 2023
@zhengruifeng (Author)

@WeichenXu123 mind taking another look?

@zhengruifeng (Author)

Merged to master.

@zhengruifeng zhengruifeng deleted the torch_local_mode branch April 12, 2023 07:52