Uneven GPU memory caused by multi-GPU training #5
Comments
Strange, I never had that problem before. What sort of GPUs are you using? What about batch size? Is the imbalance actually causing an issue? It could be related to this: but then moving the loss computation into the nn.Module doesn't quite make logical sense here (since the loss requires multiple images and their relative pose to compute) and also adds overhead.
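For reference, the pattern being alluded to looks roughly like the sketch below. This is a generic, hedged illustration rather than this repo's code: the wrapper class, the stand-in network, and the MSE loss are all made up for the example. The idea is that the loss is computed inside the module's forward, so nn.DataParallel evaluates it on each replica's GPU and only a small per-replica value is gathered back to GPU 0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelWithLoss(nn.Module):
    """Wraps a network so forward() returns the loss rather than raw predictions.
    Under nn.DataParallel each replica then computes its own loss on its own GPU."""

    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, images, targets):
        preds = self.net(images)
        loss = F.mse_loss(preds, targets)   # placeholder loss for illustration
        return loss.unsqueeze(0)            # shape [1] so DataParallel can gather it

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 6))   # stand-in network
model = nn.DataParallel(ModelWithLoss(net)).cuda()

images = torch.randn(8, 3, 64, 64).cuda()
targets = torch.randn(8, 6).cuda()

loss = model(images, targets).mean()   # average the per-GPU losses
loss.backward()
```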
I think you can also do something like this in your bash script: export CUDA_VISIBLE_DEVICES=0,1 if you want to use a smaller batch size, and it should allocate the work onto a second GPU.
Thanks, you mean like this?
I did some digging. I think this is just the nature of PyTorch: it replicates anything wrapped in DataParallel across GPUs, but the first GPU is still the "master", so it also needs to hold the optimizer state, the parameters, and any operation that is not parallelized.
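As a rough illustration of that point (a toy model, not this repo's code): even though nn.DataParallel parallelizes the forward pass, the canonical parameters, their gradients, and the optimizer state all stay on the first visible GPU.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()   # canonical parameters live on cuda:0
dp_model = nn.DataParallel(model)      # replicas are created per forward pass, then freed
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(64, 1024).cuda()
out = dp_model(x)                      # input scattered to all GPUs, output gathered to cuda:0
out.sum().backward()
optimizer.step()

print(model.weight.device)                              # cuda:0
print(model.weight.grad.device)                         # cuda:0
print(optimizer.state[model.weight]["exp_avg"].device)  # cuda:0 (Adam's running averages)
```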
As for the above, I think you'll need to replace the export statement in the bash file.
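If it helps, the same restriction can also be applied from inside the Python script instead of the bash file; a minimal sketch, assuming it runs before CUDA is initialized:

```python
import os

# Must be set before torch initializes CUDA, otherwise it is ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # expose only GPUs 0 and 1

import torch
print(torch.cuda.device_count())             # reports 2 visible devices
```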
Thanks for your reply. I also found that DataParallel causes very low training efficiency due to the cross-GPU interactions, so I use 2 GPUs to keep a balance between efficiency and batch size. I can easily reproduce the results of your paper. Thanks again.
Great, thanks, closing this issue.
Hi, Alex,
Thanks for your nice work. I'm facing the problem of uneven GPU memory when training the model with multiple GPUs: it uses much more memory on GPU #0 than on the others. I think the main reason is that DataParallel can only compute losses on GPU #0. Would you give some advice on balancing the GPU memory? Thanks in advance.
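To make the imbalance concrete, here is a minimal sketch of the usual nn.DataParallel pattern (dummy model and loss, purely for illustration): the per-GPU outputs are gathered back to GPU #0 by default, so the loss and its autograd graph are built there, on top of the parameters and optimizer state that already live on that device.

```python
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(512, 512).cuda())   # default output_device is GPU 0
criterion = nn.MSELoss()

x = torch.randn(256, 512).cuda()
target = torch.randn(256, 512).cuda()

out = model(x)                  # forward runs on every visible GPU...
print(out.device)               # ...but the gathered output lands on cuda:0
loss = criterion(out, target)   # so the loss (and its graph) is built on cuda:0 only
loss.backward()
```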