
Switch to accelerate for Multi GPU #87

Closed
Ssukriti opened this issue Mar 13, 2024 · 3 comments · Fixed by #92

Comments

@Ssukriti
Collaborator

Ssukriti commented Mar 13, 2024

Is your feature request related to a problem? Please describe.

Accelerate (https://huggingface.co/docs/transformers/en/accelerate) was created by Hugging Face to help users easily train a Transformers model on any type of distributed setup, whether it is multiple GPUs on one machine or multiple GPUs across several machines. We should leverage the library for its ease of use.

Describe the solution you'd like

  1. Update the README.md, removing the instructions for torchrun and replacing them with accelerate launch.
  2. Replace the FSDP JSON with an accelerate config YAML (see the sketch after this list).
  3. Move fsdp_config.json out of tuning/config (which houses code) into a directory that only houses config fixtures.
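
For illustration, a minimal accelerate config YAML for single-node FSDP could look roughly like the sketch below. This is a hedged sketch, not the repo's final config: the exact keys and values depend on the accelerate version, and the training-script path in the launch comment is a placeholder.

```yaml
# Hypothetical fsdp_config.yaml sketch; the keys follow the layout written by
# `accelerate config`, but the values here are illustrative, not this repo's defaults.
# Launch (placeholder script path):
#   accelerate launch --config_file fsdp_config.yaml train.py
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 2             # one process per GPU on this machine
machine_rank: 0
main_training_function: main
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1  # FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
use_cpu: false
```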

Describe alternatives you've considered

torchrun, which is used currently, can be less user friendly. See #80.

(@fabianlim thanks for the suggestions)

@fabianlim
Copy link
Collaborator

fabianlim commented Mar 14, 2024

Consider the constrained problem of running GPU jobs within a single pod, where each GPU is handled by a single process. There are the following options for how the multiple processes are run:

  1. All together in a single container, within a single Kube pod.
  2. Individually, each within its own container, with all containers housed within a single Kube pod.

There is also a third option where the processes are distributed across multiple Kube pods, but this may be overly complex. This would be the standard Kubeflow training operator approach.

Hugging Face's recommendation is to run distributed training jobs using accelerate:

  • accelerate launch builds on top of torchrun; the main process spawns multiple worker processes for distributed training.
  • torchrun has a watchdog agent that handles things like fault tolerance.
  • Processes communicate via various rendezvous backends (e.g., static or c10d); see the sketch after this list.
  • GPUs communicate over GPU-network interfaces (e.g., NCCL).
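
To make the rendezvous point above concrete, here is a hedged sketch of the launcher-related fields in an accelerate config for a two-machine job; the addresses, ports, and process counts are made-up placeholders, and GPU collectives would still go over NCCL by default.

```yaml
# Hypothetical multi-node accelerate config fragment (all values are placeholders).
# accelerate launch reads these fields and drives the underlying torchrun machinery.
distributed_type: MULTI_GPU
num_machines: 2
num_processes: 4             # total processes across both machines (2 GPUs each)
machine_rank: 0              # set to 1 when launching on the second machine
main_process_ip: 10.0.0.1    # placeholder address of the rank-0 machine
main_process_port: 29500
rdzv_backend: static         # or c10d for elastic rendezvous
same_network: true
```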

Option 1:

  • It is Docker-compliant to run multiple processes that parallelize work (e.g., running workers to parallelize an SQL query). This is very much like a master process spawning many child processes for distributed training.
  • It has some fault-tolerance capabilities already built in.

Option 2:

  • Although the HF blog seems to claim that accelerate is backward compatible with torchrun, it is not clear how much of that is true. Certainly, in the code there are a lot of accelerate-specific flags that will only be set by accelerate launch.
  • For Kube jobs it is mostly recommended to use the PyTorch job controller; here is an example for distributed CPU jobs that can be extrapolated to GPUs. They use torchrun, but it probably also works for accelerate launch (due to the similarities in the API). This is incorrect, as it will launch each worker in its own pod.

@kmehant
Collaborator

kmehant commented Mar 15, 2024

There is also a third option where the processes are distributed across multiple Kube pods, but this may be overly complex. This would be the standard Kubeflow training operator approach.

The "PyTorchJob" operator/CR from standard Kubeflow training operator allows us to run multiple processes within single container in a pod (Master pod) like the option 1 when we just want to run a multi-gpu single node training job. When we wish to spawn multi-node multi-gpu job, then we would leverage the worker pod where distributed environment variables (node rank, master address, port etc) are automatically injected by the operator. We just simply replicate accelerate launch in the worker pod and the node rank from the operator determines whether the pod is a worker pod or not. Also there is local rank created by torch.distributed which differentiates between all the processes.

In option 1, AFAIK, most of the popular container runtimes are multiprocess-friendly, and on the resource side, resource requests and limits are set at the container level in Kubernetes.

@Ssukriti
Collaborator Author

Ssukriti commented Mar 15, 2024

Thanks for the findings. We are working on the Kubernetes solution with the platform team next week.

The "PyTorchJob" operator/CR from standard Kubeflow training operator allows us to run multiple processes within single container in a pod (Master pod)

We will be testing it with the Kubeflow training operator and will update here when the work is done, as part of issue #88.
