gpus >1 with VarNet, parallel #48

Closed
zhan4817 opened this issue Jul 9, 2020 · 5 comments

Comments

@zhan4817

zhan4817 commented Jul 9, 2020

Hello, I am trying to train the knee dataset by running VarNet with multiple GPUs. Every time it finishes the first epoch and is about to enter the second, it gets stuck. However, VarNet with 1 GPU works fine. Can you please help fix this?

FYI, U-Net works fine in my tests with either 1 GPU or multiple GPUs.
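For reference, here is a minimal sketch of the kind of multi-GPU run being described, using PyTorch Lightning's ddp backend (circa mid-2020 API). `VarNetModule` is a placeholder name and its constructor arguments are omitted; this is not the exact fastMRI entry point.

```python
# Hypothetical multi-GPU launch sketch; `VarNetModule` stands in for the
# actual fastMRI VarNet LightningModule and is assumed to be defined elsewhere.
import pytorch_lightning as pl

model = VarNetModule()  # placeholder; the real module takes data/model args

trainer = pl.Trainer(
    gpus=4,                     # >1 GPU: hangs at the end of epoch 1 in this report
    distributed_backend="ddp",  # ddp backend, as used in this issue
    max_epochs=50,
)
trainer.fit(model)  # with gpus=1 the same run completes without issue
```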

@z-fabian
Contributor

I ran into the same problem. Multi-GPU training hangs after the first epoch, whereas VarNet has no issues on a single GPU. For context, I'm using a single compute node with 4x RTX 2080 Ti GPUs (11GB RAM each), CUDA 10.2, and the ddp PyTorch Lightning backend. Could it be an OOM issue?

@mmuckley
Contributor

That is curious. I'm hoping to roll out a big PR soon (perhaps tomorrow, or in a couple of weeks after my PTO). It will try to make things a bit cleaner so it's easier to see where a ddp issue might be coming from.

Also note that ddp in general is quite finicky: you'll want to make sure your CUDA installation is as up-to-date as possible, ideally with the latest versions of both PyTorch and NCCL.
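For anyone hitting this, a quick way to check which versions ddp is actually running with (standard PyTorch calls, nothing fastMRI-specific):

```python
# Print the PyTorch / CUDA / cuDNN / NCCL versions and visible GPU count.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
print("GPUs visible:", torch.cuda.device_count())
```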

@z-fabian
Contributor

z-fabian commented Aug 4, 2020

Update: it looks like the issue has something to do with unequal load on different GPUs during validation. Training on 4 GPUs works if we simply use DistributedSampler for validation (which makes sure that each GPU gets the exact same number of examples) instead of VolumeSampler. I also tried forcing the VolumeSampler to distribute an equal number of examples to each GPU by discarding some slices, and training worked just fine. It is interesting that this is not an issue for U-Net. The VarNet forward pass might take longer than U-Net's, leading to large differences in when the GPUs finish.
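A rough sketch of that workaround, assuming a LightningModule that builds its own validation loader (`self.val_dataset` here is an assumed attribute, not the exact fastMRI code):

```python
# Use torch's DistributedSampler for validation so every process sees the
# same number of examples and no rank finishes early and blocks the others.
from torch.utils.data import DataLoader, DistributedSampler

def val_dataloader(self):
    sampler = DistributedSampler(self.val_dataset, shuffle=False)
    return DataLoader(
        self.val_dataset,
        batch_size=1,
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
    )
```

The trade-off is that slices of one volume can land on different processes, which matters for volume-level metrics (see the next comment).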

@mmuckley
Contributor

mmuckley commented Aug 6, 2020

Thanks - this helped me a lot in debugging. An issue with discarding the VolumeSampler is that the SSIM values will change substantially, since SSIM needs the maximum value over the whole volume. In the refactor update I applied a fix to the VolumeSampler that pads the number of samples to make them equal across processes. During the metrics calculation phase I added some logic to detect duplicate slices, so this shouldn't affect the metrics.

This should be fixed in the package refactor; see experimental/varnet/train_varnet_demo.py.
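For illustration, a hedged sketch of the padding idea, not the actual VolumeSampler implementation from the refactor:

```python
# Each rank keeps whole volumes; ranks with fewer slices cycle through their
# own indices until all ranks yield the same count, so ddp never blocks on a
# process that ran out of validation batches. The duplicate slices introduced
# by the padding are detected and dropped when metrics are aggregated.
from torch.utils.data import Sampler

class PaddedVolumeSampler(Sampler):
    def __init__(self, per_rank_indices, rank):
        # per_rank_indices: one list of dataset indices per process, grouped
        # so that all slices of a given volume land on the same rank.
        self.per_rank_indices = per_rank_indices
        self.rank = rank
        self.max_len = max(len(idx) for idx in per_rank_indices)

    def __iter__(self):
        own = self.per_rank_indices[self.rank]
        # pad by cycling through this rank's indices up to the common length
        return iter([own[i % len(own)] for i in range(self.max_len)])

    def __len__(self):
        return self.max_len
```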

@z-fabian
Contributor

z-fabian commented Aug 6, 2020

Great, thanks for the fix. Adding the dummy slices is probably the best solution here. I tried training on multiple GPUs with the new VolumeSampler and mri_module and it no longer gets stuck at validation. This issue is solved!

mmuckley closed this as completed Aug 6, 2020