gpus >1 with VarNet, parallel #48

Closed
zhan4817 opened this issue Jul 9, 2020 · 5 comments

Comments

@zhan4817

zhan4817 commented Jul 9, 2020

Hello, I am trying to train the knee dataset by running VarNet with multiple GPUs. Every time it finishes the first epoch and is about to enter the second, it gets stuck. However, VarNet with 1 GPU works fine. Can you please help fix this?

FYI, U-Net works fine in my tests with either 1 GPU or multiple GPUs.
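For reference, here is a minimal sketch of the kind of multi-GPU run being described, using PyTorch Lightning's ddp backend (circa mid-2020 API). `VarNetModule` is a placeholder name and its constructor arguments are omitted; this is not the exact fastMRI entry point.

```python
# Hypothetical multi-GPU launch sketch; `VarNetModule` stands in for the
# actual fastMRI VarNet LightningModule and is assumed to be defined elsewhere.
import pytorch_lightning as pl

model = VarNetModule()  # placeholder; the real module takes data/model args

trainer = pl.Trainer(
    gpus=4,                     # >1 GPU: hangs at the end of epoch 1 in this report
    distributed_backend="ddp",  # ddp backend, as used in this issue
    max_epochs=50,
)
trainer.fit(model)  # with gpus=1 the same run completes without issue
```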

@z-fabian
Contributor

I ran into the same problem. Multi-GPU training hangs after the first epoch, whereas VarNet has no issues on a single GPU. For context, I'm using a single compute node with 4x RTX 2080 Ti GPUs (11GB RAM each), CUDA 10.2, and the ddp PyTorch Lightning backend. Could it be an OOM issue?

@mmuckley
Contributor

That is curious. I'm hoping to roll out a big PR soon (perhaps tomorrow, or in a couple of weeks after my PTO). It will try to make things a bit cleaner so it's easier to see where a ddp issue might be coming from.

Also note that ddp in general is quite finicky: you'll want to make sure your CUDA installation is as up-to-date as possible, ideally with the latest versions of both PyTorch and NCCL.
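For anyone hitting this, a quick way to check which versions ddp is actually running with (standard PyTorch calls, nothing fastMRI-specific):

```python
# Print the PyTorch / CUDA / cuDNN / NCCL versions and visible GPU count.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
print("GPUs visible:", torch.cuda.device_count())
```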

@z-fabian
Contributor

z-fabian commented Aug 4, 2020

Update: it looks like the issue has something to do with unequal load on different GPUs during validation. Training on 4 GPUs works if we simply use DistributedSampler for validation (which makes sure that each GPU gets the exact same number of examples) instead of VolumeSampler. I also tried forcing the VolumeSampler to distribute an equal number of examples to each GPU by discarding some slices, and training worked just fine. It is interesting that this is not an issue for U-Net. The VarNet forward pass might take longer than U-Net's, leading to large differences in when the GPUs finish.
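A rough sketch of that workaround, assuming a LightningModule that builds its own validation loader (`self.val_dataset` here is an assumed attribute, not the exact fastMRI code):

```python
# Use torch's DistributedSampler for validation so every process sees the
# same number of examples and no rank finishes early and blocks the others.
from torch.utils.data import DataLoader, DistributedSampler

def val_dataloader(self):
    sampler = DistributedSampler(self.val_dataset, shuffle=False)
    return DataLoader(
        self.val_dataset,
        batch_size=1,
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
    )
```

The trade-off is that slices of one volume can land on different processes, which matters for volume-level metrics (see the next comment).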

@mmuckley
Contributor

mmuckley commented Aug 6, 2020

Thanks - this helped me a lot in debugging. An issue with discarding the VolumeSampler is that the SSIM values will change substantially, since SSIM needs the maximum value over the whole volume. In the refactor update I applied a fix to the VolumeSampler that pads the number of samples to make them equal across processes. During the metrics calculation phase I added some logic to detect duplicate slices, so this shouldn't affect the metrics.

This should be fixed in the package refactor; see experimental/varnet/train_varnet_demo.py.
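For illustration, a hedged sketch of the padding idea, not the actual VolumeSampler implementation from the refactor:

```python
# Each rank keeps whole volumes; ranks with fewer slices cycle through their
# own indices until all ranks yield the same count, so ddp never blocks on a
# process that ran out of validation batches. The duplicate slices introduced
# by the padding are detected and dropped when metrics are aggregated.
from torch.utils.data import Sampler

class PaddedVolumeSampler(Sampler):
    def __init__(self, per_rank_indices, rank):
        # per_rank_indices: one list of dataset indices per process, grouped
        # so that all slices of a given volume land on the same rank.
        self.per_rank_indices = per_rank_indices
        self.rank = rank
        self.max_len = max(len(idx) for idx in per_rank_indices)

    def __iter__(self):
        own = self.per_rank_indices[self.rank]
        # pad by cycling through this rank's indices up to the common length
        return iter([own[i % len(own)] for i in range(self.max_len)])

    def __len__(self):
        return self.max_len
```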

@z-fabian
Contributor

z-fabian commented Aug 6, 2020

Great, thanks for the fix. Adding the dummy slices is probably the best solution here. I tried training on multiple GPUs with the new VolumeSampler and mri_module and it no longer gets stuck at validation. This issue is solved!

mmuckley closed this as completed Aug 6, 2020