
How to perform multinode training with torch.distributed.launch? #30

Closed
fangruizhu opened this issue Oct 9, 2020 · 5 comments

Comments

@fangruizhu

Hi, nice work! I tried to do pretraining with main_swav.py on multiple machines.

Here is the command I use to launch distributed training:

python -m torch.distributed.launch main_swav.py --rank 0 \
--world_size 8 \
--dist_url 'tcp://172.31.11.200:23456' \

I commented out lines 55-59 in src/utils.py in order to set the rank for each machine. It runs fine.

But I found that during training, only 1 GPU was used on each machine. I think it is caused by this line:

args.gpu_to_work_on = args.rank % torch.cuda.device_count()

Could you help me figure it out?
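
(To illustrate the effect: with a single process launched per machine, that mapping only ever picks one device per machine. A minimal, purely illustrative sketch, assuming 8 GPUs per machine:)

# Illustrative only: one process per machine with manually assigned ranks 0 and 1
# means each process selects exactly one GPU; the other 7 GPUs stay idle.
device_count = 8                          # e.g. torch.cuda.device_count() on an 8-GPU machine
for rank in (0, 1):                       # one process per machine in this setup
    gpu_to_work_on = rank % device_count  # rank 0 -> GPU 0, rank 1 -> GPU 1
    print(f"rank {rank} uses GPU {gpu_to_work_on}")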

Many thanks!

@mathildecaron31
Contributor

Hi @fangruizhu

When training main_swav.py on multiple machines, I simply launch the sbatch scripts (for example https://github.com/facebookresearch/swav/blob/master/scripts/swav_800ep_pretrain.sh). In this case SLURM automatically creates 8 processes per node and I do not use the torch.distributed.launch utility.
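
(For context, a minimal sketch of how a SLURM job with several tasks per node can initialise torch.distributed; this is illustrative and not the exact code in src/utils.py:)

import os
import torch
import torch.distributed as dist

# Illustrative sketch: SLURM exposes each task's global rank and the total
# number of tasks through environment variables, so every process can join
# the same process group without torch.distributed.launch.
rank = int(os.environ["SLURM_PROCID"])        # global rank, 0 .. world_size - 1
world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes

dist.init_process_group(
    backend="nccl",
    init_method="tcp://172.31.11.200:23456",  # address/port reused from the question, illustrative
    rank=rank,
    world_size=world_size,
)
torch.cuda.set_device(rank % torch.cuda.device_count())  # one GPU per process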

I think you can adapt the code to train on multiple nodes while using torch.distributed.launch: I would recommend reading https://pytorch.org/docs/stable/distributed.html#launch-utility. Reading the "Multi-Node multi-process distributed training: (e.g. two nodes)" paragraph, it seems that you should not pass rank or world_size as arguments.
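
(For reference, a minimal sketch of that pattern; the names below are illustrative and this is not the swav code. The launcher computes RANK and WORLD_SIZE itself from --nnodes, --nproc_per_node and --node_rank, exposes them as environment variables, and passes --local_rank to the script:)

import argparse
import torch
import torch.distributed as dist

# Illustrative sketch: started on each node with something like
#   python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 \
#       --node_rank=0 --master_addr=172.31.11.200 --master_port=23456 main_swav.py ...
# the launcher sets RANK / WORLD_SIZE in the environment and passes --local_rank,
# so the script does not need explicit --rank / --world_size flags.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args, _ = parser.parse_known_args()

dist.init_process_group(backend="nccl", init_method="env://")  # reads RANK / WORLD_SIZE
torch.cuda.set_device(args.local_rank)                         # one process per GPU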

Hope that helps

@mathildecaron31
Contributor

You can also add a print(args.rank) statement to make sure that each process has a different rank.

@fangruizhu
Author

Thanks :) @mathildecaron31 I fixed it with the torch.multiprocessing package and multinode training now works fine.
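
(For anyone landing here later: a minimal sketch of that kind of torch.multiprocessing approach, with illustrative names and arguments rather than the actual fix:)

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Illustrative sketch: spawn one process per GPU on each node and derive the
# global rank from the node rank plus the local GPU index.
def worker(local_rank, node_rank, gpus_per_node, world_size, dist_url):
    rank = node_rank * gpus_per_node + local_rank
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, run training ...

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank, num_nodes = 0, 2                    # e.g. 2 machines; set node_rank per machine
    world_size = num_nodes * gpus_per_node
    mp.spawn(worker, nprocs=gpus_per_node,
             args=(node_rank, gpus_per_node, world_size, "tcp://172.31.11.200:23456"))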

@mathildecaron31
Contributor

Awesome!

@priyamdey

Hi @mathildecaron31, I have a doubt about the process-to-GPU mapping for the 800ep script. Since only 8 processes are spawned per node, each process would serve 8 GPUs (as there are 64 GPUs in a node). Although I'm aware that multiple process-to-GPU mappings are possible, how is the batch size set in such scenarios? In your script you set it to 64. Is this per process or per GPU? Per process would make it 4096 in total. But isn't batch size usually set per GPU?
