gpus >1 with VarNet, parallel #48
Comments
I ran into the same problem. Multi-GPU training hangs after the first epoch, whereas VarNet trains without issues on a single GPU. For context, I'm using a single compute node with 4x RTX 2080 Ti GPUs (11 GB RAM each), CUDA 10.2, and the PyTorch Lightning ddp backend. Could it be an OOM issue?
That is curious. I'm hoping to roll out a big PR soon (perhaps tomorrow, or maybe in a couple of weeks after my PTO). It should make things a bit cleaner, so it's easier to see where a ddp issue might be. Also note that ddp in general is quite finicky - you'll want to make sure your CUDA is as up-to-date as possible, ideally with the latest versions of both PyTorch and NCCL.
Update: it looks like the issue has something to do with unequal load on different GPUs during validation. Training on 4 GPUs works if we simply use DistributedSampler for validation (which makes sure that each GPU gets exactly the same number of examples) instead of VolumeSampler. I also tried forcing the VolumeSampler to distribute an equal number of examples to each GPU by discarding some slices, and training worked just fine. Interestingly, this is not an issue for U-Net. The VarNet forward pass probably takes longer than U-Net's, leading to larger differences in GPU finish times.
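To illustrate why a volume-based sampler can cause this, here is a minimal sketch (the function name and volume sizes are hypothetical, not fastMRI's actual VolumeSampler): when whole volumes are assigned to ranks so that each volume's slices stay on one GPU, ranks can end up with different example counts. In DDP, the rank that runs out of batches first blocks in the next collective op, and the run hangs.

```python
def assign_volumes_round_robin(volume_sizes, num_ranks):
    """Return the number of slices each rank receives when whole
    volumes (each a list of slices) are handed out round-robin.

    Unlike DistributedSampler, which splits at the slice level and
    pads to equal length, this keeps volumes intact and so can
    produce unequal per-rank counts.
    """
    counts = [0] * num_ranks
    for i, n_slices in enumerate(volume_sizes):
        counts[i % num_ranks] += n_slices
    return counts

# e.g. 5 volumes with 40, 36, 44, 38, 42 slices on 4 GPUs:
# rank 0 receives volumes 0 and 4, so it has far more slices
# to process than ranks 1-3.
print(assign_volumes_round_robin([40, 36, 44, 38, 42], 4))  # [82, 36, 44, 38]
```

With counts like these, ranks 1-3 finish validation well before rank 0 and then wait at the next synchronization point, which matches the observed hang between epochs.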
Thanks - this helped me a lot in debugging. An issue with discarding the […] This should be fixed in the package refactor if you look at […]
Great, thanks for the fix. Adding the dummy slices is probably the best solution here. I tried training on multiple GPUs with the new […]
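The dummy-slice idea can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual fastMRI code): each rank's index list is padded by cycling through its own indices until every rank has as many examples as the longest one, so all ranks take the same number of validation steps.

```python
import math

def pad_to_equal_length(indices_per_rank):
    """Pad each rank's slice-index list by repeating its own leading
    indices until all ranks match the longest list.

    The duplicated ("dummy") slices get evaluated twice; that is
    harmless as long as the duplicates are dropped (or metrics are
    aggregated per volume) before scoring.
    """
    max_len = max(len(idx) for idx in indices_per_rank)
    padded = []
    for idx in indices_per_rank:
        reps = math.ceil(max_len / len(idx))
        padded.append((idx * reps)[:max_len])
    return padded
```

This is the same trick DistributedSampler uses to keep ranks in lockstep, applied after the volume-level assignment so that no volume is split across GPUs.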
Hello, I am trying to train on the knee datasets by running VarNet with multiple GPUs. Every time it finishes the first epoch and is about to enter the second, it gets stuck. However, VarNet with 1 GPU works fine. Could you please help fix this?
FYI, U-Net is OK in my tests with 1 GPU or multiple GPUs.