
[SPARK-41592][PYTHON][ML] Pytorch file Distributed Training #39267

Closed

Conversation

@rithwik-db (Contributor) commented Dec 28, 2022

What changes were proposed in this pull request?

This is an addition to #39188 to add support for multi-node training using PyTorch files. Users would follow the second workflow in the design document to run training on the executors (see the usage sketch after this description). I added some new utility functions and built on top of existing ones. This is largely WIP, so tests will be added very soon.

Why are the changes needed?

See the main ticket for more details.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested with a pseudo-integration test. Integration tests will be added in a future PR.
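
As a rough illustration of the second workflow described above (a sketch only: the script path and arguments are placeholders, the import path assumes the pyspark.ml.torch.distributor module from the base PR #39188, and local_mode=False is assumed to request executor-side training):

from pyspark.ml.torch.distributor import TorchDistributor

# Run an existing PyTorch training script across the executors,
# instead of passing a Python training function.
distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)
distributor.run("/path/to/train.py", "--learning-rate", "0.001")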

@rithwik-db changed the title from [SPARK-41592] Pytorch file Distributed Training to [WIP][SPARK-41592][PYTHON][ML] Pytorch file Distributed Training on Dec 28, 2022
@rithwik-db force-pushed the pytorch-file-distributed-training branch from ead876c to 1755625 on December 29, 2022 02:18
@AmplabJenkins

Can one of the admins verify this patch?

@rithwik-db force-pushed the pytorch-file-distributed-training branch from 1755625 to 7706376 on January 3, 2023 18:25
@rithwik-db force-pushed the pytorch-file-distributed-training branch from 7706376 to 732e350 on January 11, 2023 19:46
@github-actions bot removed the BUILD label on Jan 11, 2023
@rithwik-db force-pushed the pytorch-file-distributed-training branch 2 times, most recently from 5fb8333 to f8c464f on January 11, 2023 23:56
@@ -407,13 +418,6 @@ def _run_local_training(
        try:
            if self.use_gpu:
                gpus_owned = get_gpus_owned(self.sc)

Contributor Author (@rithwik-db):

This is actually no longer needed, since if num_processes > len(gpus_owned) we now set num_processes = len(gpus_owned).


CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"

# The idea of setting the random port to 0 doesn't seem to work?
Contributor:

What does this mean?

Contributor Author (@rithwik-db):

Something like the following seems to error:

import socket
sock = socket.socket()
sock.bind((master_address, 0))  # binding to port 0 should let the OS pick a free port
port = sock.getsockname()[1]

So I just find a port using randomness.

Contributor:

What happens if two processes choose the same port?

Contributor Author (@rithwik-db):

I believe it will raise a RuntimeError: Address already in use

if use_gpu:
    set_gpus(context)
else:
    os.environ[CUDA_VISIBLE_DEVICES] = ""
Contributor:

Do we need to do this?

Contributor Author (@rithwik-db):

I think it should be added because, if the user runs training with TorchDistributor(use_gpu=False, **kwargs).run(train_fn) but accidentally has some PyTorch Lightning code like pl.Trainer(accelerator="gpu") in their train_fn, an error should be raised saying no CUDA devices are available even though a GPU accelerator was specified.

We already have a check in get_num_tasks for the case where use_gpu=True but no GPUs are available; this code addresses the opposite case, where use_gpu=False but the internal code uses GPUs.
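
A minimal sketch of the intended effect (assumes only that PyTorch is installed; this is not code from the PR): once CUDA_VISIBLE_DEVICES is set to an empty string before any CUDA initialization, torch reports no usable devices, so GPU-requesting code inside train_fn fails fast instead of silently grabbing a GPU.

import os
import torch

# Must happen before any CUDA context is created in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

print(torch.cuda.is_available())  # False, even on a host with GPUs
print(torch.cuda.device_count())  # 0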

Contributor Author (@rithwik-db):

@WeichenXu123 @lu-wang-dl is my logic reasonable here or did I misunderstand anything?

Contributor:

I still don't understand. If the user runs something like pl.Trainer(accelerator="gpu") on a CPU cluster, what is the behavior from PyTorch Lightning?

Contributor Author (@rithwik-db):

PyTorch Lightning will raise a MisconfigurationException ("No supported gpu backend found!"). This is what we expect to see if the user sets use_gpu=False and calls pl.Trainer(accelerator="gpu"). My understanding is that if a user runs this code on a local cluster with GPUs on each node and we don't set os.environ[CUDA_VISIBLE_DEVICES] = "", then the task could be assigned a GPU even though use_gpu=False.
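
A minimal repro of that behavior (hypothetical, assuming pytorch_lightning is installed; not code from the PR):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # mask GPUs, as the else branch above does

import pytorch_lightning as pl

# Expected to raise MisconfigurationException ("No supported gpu backend found!")
# because a GPU accelerator is requested but no CUDA device is visible.
trainer = pl.Trainer(accelerator="gpu")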

    ) -> Optional[Any]:
        if not framework_wrapper_fn:
            raise RuntimeError("Unknown combination of parameters")
        spark_task_program = self._get_spark_task_program(framework_wrapper_fn, train_fn, *args)
Contributor:

Why not just define the function here?

Contributor Author (@rithwik-db):

I guess just for the sake of modularity. We could just define the function here.

@lu-wang-dl (Contributor) left a comment:

Overall LGTM. Just some minor comments/questions.

import socket
import random

while True:
Contributor:

Shall we add a sleep(0.1) in the loop body?

Contributor:

And I recommend setting a maximum retry count (e.g. 100) for the get_free_port loop, to avoid an infinite loop in unexpected cases.
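
One possible shape for that suggestion (a sketch only, with an illustrative signature; the PR's actual get_free_port may differ):

import random
import socket
import time

def get_free_port(master_address: str, max_retries: int = 100) -> int:
    """Pick a random high port on master_address, retrying up to max_retries times."""
    for _ in range(max_retries):
        port = random.randint(32768, 60999)
        sock = socket.socket()
        try:
            sock.bind((master_address, port))
            return port  # port was free at bind time
        except OSError:
            time.sleep(0.1)  # brief back-off before trying another port
        finally:
            sock.close()
    raise RuntimeError(f"Could not find a free port after {max_retries} attempts")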

context = BarrierTaskContext.get()

if use_gpu:
    set_gpus(context)
Contributor:

We can simplify the set_gpus function:

- if the CUDA_VISIBLE_DEVICES env var already exists, do nothing (Spark has already set CUDA_VISIBLE_DEVICES properly);
- otherwise, generate CUDA_VISIBLE_DEVICES from taskcontext.resources["gpu"].addresses.
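
A sketch of that simplification (hypothetical; it borrows the set_gpus and BarrierTaskContext names from the snippets above and uses the PySpark resources() accessor, not necessarily the PR's final code):

import os
from pyspark import BarrierTaskContext

CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"

def set_gpus(context: BarrierTaskContext) -> None:
    # If Spark has already populated CUDA_VISIBLE_DEVICES, leave it as-is.
    if CUDA_VISIBLE_DEVICES in os.environ:
        return
    # Otherwise derive it from the GPU addresses Spark assigned to this task.
    gpu_addresses = context.resources()["gpu"].addresses
    os.environ[CUDA_VISIBLE_DEVICES] = ",".join(gpu_addresses)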


@rithwik-db force-pushed the pytorch-file-distributed-training branch 3 times, most recently from 7e51d28 to 80d82bc on January 18, 2023 04:04
@rithwik-db force-pushed the pytorch-file-distributed-training branch from 80d82bc to bfd6879 on January 18, 2023 06:17
@WeichenXu123 (Contributor) left a comment:

LGTM, thanks!

@HyukjinKwon (Member):

Merged to master.

@HyukjinKwon changed the title from [WIP][SPARK-41592][PYTHON][ML] Pytorch file Distributed Training to [SPARK-41592][PYTHON][ML] Pytorch file Distributed Training on Jan 19, 2023
@mattoh91:

@rithwik-db Can I clarify that the num_processors attribute of the TorchDistributor class refers to the number of spark.executor.cores used, and not the number of spark.executor.instances?
Trying to use >1 num_processors seems to take up more cores / slots on a single executor during training (using spark operator on k8s).

@rithwik-db (Contributor Author):

If we are using CPUs for training, the num_processors attribute refers to the number of Spark tasks that will be created for training, and each Spark task can use more than one CPU depending on what spark.task.cpus says. (This function is where that logic is defined.)
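
For example (illustrative values only; note the parameter is spelled num_processes in the diff context earlier in this thread, and the import path and local_mode flag are assumptions about the distributor module rather than part of this conversation):

from pyspark.ml.torch.distributor import TorchDistributor

def train_fn():
    # CPU-only training body goes here.
    ...

# Two Spark barrier tasks each run one training process;
# how many cores each task gets is governed by spark.task.cpus,
# not by the number of executor instances.
TorchDistributor(num_processes=2, local_mode=False, use_gpu=False).run(train_fn)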
