
How to perform multinode training with torch.distributed.launch? #30

Closed
fangruizhu opened this issue Oct 9, 2020 · 5 comments

Comments

@fangruizhu

Hi, nice work! I tried to do pretraining with main_swav.py on multiple machines.

Here is the command I use to launch distributed training:

python -m torch.distributed.launch main_swav.py --rank 0 \
--world_size 8 \
--dist_url 'tcp://172.31.11.200:23456' \

I commented out lines 55-59 in src/utils.py in order to set the rank for each machine. It runs fine.

But I found that during training, only 1 GPU was used on each machine. I think it is caused by this line:

args.gpu_to_work_on = args.rank % torch.cuda.device_count()

Could you help me figure it out?
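
(To illustrate the effect: with a single process launched per machine, that mapping only ever picks one device per machine. A minimal, purely illustrative sketch, assuming 8 GPUs per machine:)

# Illustrative only: one process per machine with manually assigned ranks 0 and 1
# means each process selects exactly one GPU; the other 7 GPUs stay idle.
device_count = 8                          # e.g. torch.cuda.device_count() on an 8-GPU machine
for rank in (0, 1):                       # one process per machine in this setup
    gpu_to_work_on = rank % device_count  # rank 0 -> GPU 0, rank 1 -> GPU 1
    print(f"rank {rank} uses GPU {gpu_to_work_on}")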

Many thanks!

@mathildecaron31
Contributor

Hi @fangruizhu

When training main_swav.py on multiple machines, I simply launch the sbatch scripts (for example https://github.com/facebookresearch/swav/blob/master/scripts/swav_800ep_pretrain.sh). In this case SLURM automatically creates 8 processes per node and I do not use the torch.distributed.launch utility.
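
(For context, a minimal sketch of how a SLURM job with several tasks per node can initialise torch.distributed; this is illustrative and not the exact code in src/utils.py:)

import os
import torch
import torch.distributed as dist

# Illustrative sketch: SLURM exposes each task's global rank and the total
# number of tasks through environment variables, so every process can join
# the same process group without torch.distributed.launch.
rank = int(os.environ["SLURM_PROCID"])        # global rank, 0 .. world_size - 1
world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes

dist.init_process_group(
    backend="nccl",
    init_method="tcp://172.31.11.200:23456",  # address/port reused from the question, illustrative
    rank=rank,
    world_size=world_size,
)
torch.cuda.set_device(rank % torch.cuda.device_count())  # one GPU per process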

I think you can adapt the code to train on multiple nodes while using torch.distributed.launch: I would recommend reading https://pytorch.org/docs/stable/distributed.html#launch-utility. Reading the "Multi-Node multi-process distributed training: (e.g. two nodes)" paragraph, it seems that you should not pass rank or world_size as arguments.
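
(For reference, a minimal sketch of that pattern; the names below are illustrative and this is not the swav code. The launcher computes RANK and WORLD_SIZE itself from --nnodes, --nproc_per_node and --node_rank, exposes them as environment variables, and passes --local_rank to the script:)

import argparse
import torch
import torch.distributed as dist

# Illustrative sketch: started on each node with something like
#   python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 \
#       --node_rank=0 --master_addr=172.31.11.200 --master_port=23456 main_swav.py ...
# the launcher sets RANK / WORLD_SIZE in the environment and passes --local_rank,
# so the script does not need explicit --rank / --world_size flags.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args, _ = parser.parse_known_args()

dist.init_process_group(backend="nccl", init_method="env://")  # reads RANK / WORLD_SIZE
torch.cuda.set_device(args.local_rank)                         # one process per GPU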

Hope that helps

@mathildecaron31
Contributor

You can also add a print(args.rank) statement to make sure that each process has a different rank.

@fangruizhu
Author

Thanks :) @mathildecaron31 I fixed it with the torch.multiprocessing package and multinode training now works fine.
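
(For anyone landing here later: a minimal sketch of that kind of torch.multiprocessing approach, with illustrative names and arguments rather than the actual fix:)

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Illustrative sketch: spawn one process per GPU on each node and derive the
# global rank from the node rank plus the local GPU index.
def worker(local_rank, node_rank, gpus_per_node, world_size, dist_url):
    rank = node_rank * gpus_per_node + local_rank
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, run training ...

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank, num_nodes = 0, 2                    # e.g. 2 machines; set node_rank per machine
    world_size = num_nodes * gpus_per_node
    mp.spawn(worker, nprocs=gpus_per_node,
             args=(node_rank, gpus_per_node, world_size, "tcp://172.31.11.200:23456"))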

@mathildecaron31
Contributor

Awesome!

@priyamdey

Hi @mathildecaron31, I have a doubt about the process-to-GPU mapping for the 800ep script. Since only 8 processes are spawned per node, each process would serve 8 GPUs (as there are 64 GPUs in a node). Although I'm aware that multiple process-to-GPU mappings are possible, how is the batch size set in such scenarios? In your script you set it to 64. Is this per process or per GPU? Per process would make it 4096 in total. But isn't batch size usually set per GPU?
