geoopt.optim.RiemannianSGD does not work with Distributed Data Parallel #168
Comments
Thanks, let me check what's going on.
Update: the problem appears when using `copy_or_set_`. If I replace it with `copy_`, the gradients are correctly synchronized. Other than efficiency, is there any other undesirable side effect if I always use `copy_`?

Thanks
This copy-or-set was introduced because one of the retractions returned a non-contiguous array, and a long time ago PyTorch used to have…
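For context, a minimal sketch of what such a helper plausibly looks like (the name `copy_or_set_` and the stride check are assumptions based on this thread, not necessarily geoopt's exact code): when the layouts match, `set_` is a cheap storage rebind, while `copy_` actually writes the data and so also handles a non-contiguous source.

```python
import torch

def copy_or_set_(dest: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of the helper discussed above (not geoopt's
    # verbatim code): if the strides differ (e.g. a non-contiguous
    # retraction result), fall back to copy_, which writes the data;
    # otherwise use set_, which just rebinds dest to source's storage.
    if dest.stride() != source.stride():
        return dest.copy_(source)
    return dest.set_(source)

dest = torch.zeros(3)
source = torch.ones(3)
copy_or_set_(dest, source)  # strides match, so dest is rebound via set_
```

The rebinding branch is the suspicious one for DDP: after `set_`, `dest` points at `source`'s storage, so anything that held a reference to the old storage no longer sees updates.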
Awesome. Do you want to keep the issue open?
Yes, gonna fix that soon to close it
@surisdi you can go ahead and open a PR, I'm on a long vacation. I can review it later.
Hi @ferrine, I will do that. But would it be possible to have some context on the `copy_or_set_` function?

Thanks!
The context was speeding the optimizers up a tiny bit. This is not the case any more.
The solution is to get rid of this function and replace its usages with `copy_`.
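A small illustration of why that replacement matters, assuming (as one plausible mechanism) that DDP keeps flat views into parameter storage: `copy_` writes into the existing storage, so views stay in sync, while `set_` rebinds the tensor to new storage and leaves the old views stale.

```python
import torch

p = torch.zeros(2, 2)    # stands in for a model parameter
p_view = p.view(-1)      # stands in for an external flat view (e.g. a DDP bucket)

p.copy_(torch.ones(2, 2))   # in-place write: the view still tracks p
# p_view is now all ones

q = torch.zeros(2, 2)
q_view = q.view(-1)
q.set_(torch.ones(2, 2))    # storage rebind: q points at new memory
# q_view still reads the old storage and stays all zeros
```

So after a `set_`-based parameter update, anything holding the old storage silently stops observing the parameter, which matches the desynchronization reported below.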
First of all, thank you for this library!
Description of the bug
When training with Distributed Data Parallel (DDP), the gradients are not correctly synchronized across devices when using RiemannianSGD (or RiemannianAdam). Replacing it with a standard torch.optim.SGD works well. Note that when using DDP the gradients are synchronized during .backward() (see this link).

To Reproduce
Simple code training on ImageNet:
In order to run, use:
CUDA_VISIBLE_DEVICES=0,1 python run.py
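The original run.py is not reproduced above; the following is a hypothetical stand-in for its skeleton, shrunk to a single-process CPU "gloo" group so it runs anywhere (the actual report used two CUDA devices and an ImageNet loader, which are assumptions not shown here).

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Hypothetical minimal stand-in for the issue's run.py. A one-rank "gloo"
# group replaces the two-GPU launch; the real report spawned two ranks
# with CUDA_VISIBLE_DEVICES=0,1.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.parallel.DistributedDataParallel(nn.Linear(4, 2))
# Baseline optimizer that synchronizes correctly; per the report, the bug
# appears when this is swapped for
# geoopt.optim.RiemannianSGD(model.parameters(), lr=0.1, stabilize=10).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)           # stands in for a real data batch
loss = model(x).pow(2).mean()
loss.backward()                  # DDP all-reduces gradients across ranks here
optimizer.step()

dist.destroy_process_group()
```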
Expected behavior
The expected behavior is the one that occurs when the line
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
is uncommented and the line
optimizer = geoopt.optim.RiemannianSGD(model.parameters(), lr=0.1, stabilize=10)
is commented. In that case, the output shows that the gradients on the two GPUs are correctly synchronized. However, when using RiemannianSGD, the output shows a problem with the gradient synchronization, which causes the weights on the two devices to diverge.
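One way to surface that divergence explicitly is a small debugging helper (hypothetical, not part of the issue's script) that gathers each parameter from every rank and compares it against rank 0's copy; the demo below uses a one-rank "gloo" group, whereas in the two-GPU run it would be called on every rank after each optimizer.step().

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def params_in_sync(model: nn.Module) -> bool:
    # Hypothetical debugging helper: all_gather every parameter and compare
    # against rank 0's copy, so divergence is detected explicitly instead
    # of drifting silently during training.
    for p in model.parameters():
        gathered = [torch.empty_like(p) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, p.detach())
        if any(not torch.allclose(gathered[0], g) for g in gathered[1:]):
            return False
    return True

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)
in_sync = params_in_sync(nn.Linear(4, 2))
dist.destroy_process_group()
```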
Library version information:
- python -c 'import torch;print("torch:", torch.version.__version__, end=" ");print("cuda:", torch.version.cuda)'
  torch: 1.8.1 cuda: 11.1
- the way you installed geoopt (github, pip): pip
- OS: Ubuntu 18.04.5 LTS
EDIT: I simplified the code a little bit by removing mixed precision.