How to perform multinode training with torch.distributed.launch? #30
Hi @fangruizhu. When training main_swav.py on multiple machines I simply launch the sbatch scripts (https://github.com/facebookresearch/swav/blob/master/scripts/swav_800ep_pretrain.sh for example). In this case SLURM automatically creates 8 processes per node and I do not use torch.distributed.launch. I think you can adapt the code to train on multiple nodes while using torch.distributed.launch. Hope that helps!
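For concreteness, here is a minimal sketch (not code from this repo) of an init path compatible with torch.distributed.launch. It relies only on the launcher's documented behaviour: it exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for every process, and passes the local rank either as a `--local_rank` argument (the default) or via the LOCAL_RANK environment variable (with `--use_env`):

```python
import argparse
import os

import torch
import torch.distributed as dist

def init_distributed_from_launch():
    # torch.distributed.launch passes --local_rank by default; with
    # --use_env it sets the LOCAL_RANK environment variable instead.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args, _ = parser.parse_known_args()

    # Global rank and world size are exported by the launcher.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # One process per GPU: pin this process to its local GPU.
    torch.cuda.set_device(args.local_rank)

    # env:// reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
    # from the environment variables set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    return rank, world_size, args.local_rank
```

Launched, for example, on node 0 of 2 with:

```
python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
    --nproc_per_node=8 --master_addr=<master_ip> --master_port=29500 \
    main_swav.py [your args]
```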
You can also add
Thanks :) @mathildecaron31, I fixed it with the torch multiprocessing package and multinode training now works fine.
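For anyone landing here: a minimal sketch of that approach (the commenter's actual code is not shown above, so the names and the rendezvous address below are illustrative). Each node spawns one process per local GPU with torch.multiprocessing, and the global rank is derived from the node rank:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_gpu, node_rank, gpus_per_node, world_size, dist_url):
    # Global rank: processes 0..gpus_per_node-1 live on node 0, and so on.
    rank = node_rank * gpus_per_node + local_gpu
    torch.cuda.set_device(local_gpu)
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            world_size=world_size, rank=rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank = 0                          # set differently on each machine
    world_size = 2 * gpus_per_node         # assuming 2 nodes here
    dist_url = "tcp://<master_ip>:29500"   # hypothetical rendezvous address
    # Spawns gpus_per_node processes; each receives its index as local_gpu.
    mp.spawn(worker, nprocs=gpus_per_node,
             args=(node_rank, gpus_per_node, world_size, dist_url))
```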
Awesome!
Hi @mathildecaron31, I had a doubt about the process-to-GPU mapping for the 800ep script. As only 8 processes are spawned per node, each process will serve 8 GPUs (as there are 64 GPUs in a node). Although I'm aware that multiple process-to-GPU mappings are possible, how is the batch size set in such scenarios? In your script you've set it to 64. Is this per process or per GPU? Per process would make it 4096. But isn't batch size set per GPU?
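If I read the 800ep script correctly, the job requests 8 nodes with 8 tasks per node, i.e. 64 GPUs in total across the job (not per node), with one process per GPU. Under that assumption --batch_size is simultaneously per process and per GPU, and the arithmetic works out as:

```python
# Effective batch size for the 800ep script, assuming 8 nodes x 8 GPUs
# with one SLURM task (process) per GPU and --batch_size applied per process.
nodes = 8
gpus_per_node = 8
batch_size_per_gpu = 64   # the --batch_size flag in the script

effective_batch = nodes * gpus_per_node * batch_size_per_gpu
print(effective_batch)    # 4096 samples per optimization step across the job
```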
Hi, nice work! I tried to do pretraining with main_swav.py on multiple machines.
Here's the main code for distributed training.
I commented out lines 55-59 in src/utils.py in order to set the ranks for each machine myself, and it runs without errors.
But I found that during training only 1 GPU was used on each machine. I think it is caused by this line: swav/src/utils.py, line 68 (at commit 77f7185).
Many thanks!
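For context on the line referenced above: the rank-to-GPU mapping there looks roughly like the sketch below (names are illustrative, not copied from src/utils.py). A rank % device_count mapping only spreads work across GPUs when one process is launched per GPU; with a single process per machine, each machine gets pinned to exactly one GPU, which matches the behaviour described above:

```python
import torch

def pin_process_to_gpu(rank: int) -> int:
    # Map this process's global rank onto a local GPU index. With one
    # process per GPU this uses every device; with one process per
    # machine it always selects a single GPU on that machine.
    gpu_to_work_on = rank % torch.cuda.device_count()
    torch.cuda.set_device(gpu_to_work_on)
    return gpu_to_work_on
```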