
About LPF-SGD implementation #1

Closed
lucasliunju opened this issue Feb 27, 2022 · 5 comments
@lucasliunju

Thanks so much for releasing the code. I have several questions about the implementation of LPF-SGD.

  1. In noise.append(- init_mp - temp), what is the reason for using init_mp to obtain the noise?

  2. In mp.grad.add_((-(n**2 + 1) / mp.view(-1).norm().item())*batch_loss.item()), why do we still need to add this value to the gradient?

@devansh20la
Owner

devansh20la commented Feb 28, 2022

I think you might be looking at the wrong file. This was from another optimizer.

EDIT: I had a naming error. The code you are looking for starts here:

# add noise to the parameters (theta + noise)
with torch.no_grad():
    noise = []
    for mp in model.parameters():
        if len(mp.shape) > 1:
            # weight matrices / conv filters: noise std scaled by each output filter's norm
            sh = mp.shape
            sh_mul = np.prod(sh[1:])
            temp = mp.view(sh[0], -1).norm(dim=1, keepdim=True).repeat(1, sh_mul).view(mp.shape)
            temp = torch.normal(0, args.std * temp).to(mp.data.device)
        else:
            # biases / 1-D parameters: noise std scaled by the whole vector's norm
            temp = torch.empty_like(mp, device=mp.data.device)
            temp.normal_(0, args.std * (mp.view(-1).norm().item() + 1e-16))
        noise.append(temp)
        mp.data.add_(noise[-1])

# single sample convolution approximation
with torch.set_grad_enabled(True):
    outputs = model(inputs)
    batch_loss = criterion(outputs, targets) / args.M
    batch_loss.backward()

# going back to theta without the noise
with torch.no_grad():
    for mp, n in zip(model.parameters(), noise):
        mp.data.sub_(n)
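
For additional context, here is a minimal, hedged sketch of how this perturbation step fits into the update for one mini-batch, where the gradient is accumulated over args.M noise draws before the optimizer step. The function names and the simplified per-tensor noise scaling below are assumptions for illustration, not the repository's exact code:

import torch

def add_scaled_noise(model, std):
    # Sample Gaussian noise with std proportional to each parameter's norm,
    # add it to the parameters in place, and return it so it can be removed.
    # (Simplified to per-tensor scaling; the snippet above scales conv
    # filters per output channel.)
    noise = []
    with torch.no_grad():
        for mp in model.parameters():
            scale = std * (mp.view(-1).norm().item() + 1e-16)
            n = torch.randn_like(mp) * scale
            noise.append(n)
            mp.data.add_(n)
    return noise

def lpf_sgd_step(model, criterion, optimizer, inputs, targets, std, M):
    # One LPF-SGD update: average the gradient over M perturbed copies of
    # the parameters, then take a single optimizer step.
    optimizer.zero_grad()
    for _ in range(M):
        noise = add_scaled_noise(model, std)          # theta + noise
        loss = criterion(model(inputs), targets) / M  # divide by M so grads average
        loss.backward()                               # accumulate into .grad
        with torch.no_grad():                         # back to theta
            for mp, n in zip(model.parameters(), noise):
                mp.data.sub_(n)
    optimizer.step()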

@lucasliunju
Author

Thank you very much for your reply!

@lucasliunju
Author

Dear Devansh,

This is great work, and LPF-SGD significantly improves performance compared with vanilla momentum SGD. I would like to ask whether cosine learning rate decay also works for LPF-SGD on WRN and ResNet, since most of the learning rate schedules I see in the code use StepLR.

Thank you very much!

Best,
Lucas

@devansh20la
Owner

I did not try cosine learning rate decay. I can't think of any reason it shouldn't work; it might just require fine-tuning to find the best hyperparameters.
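
For anyone who wants to try it, a minimal sketch of swapping the schedulers in a standard PyTorch training script (args.epochs, the SGD hyperparameters, and the training routine are assumed for illustration, not taken from this repository):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# StepLR schedule: drop the learning rate by 10x every 60 epochs
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

# Cosine schedule: anneal the learning rate toward 0 over the whole run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=args.epochs)

for epoch in range(args.epochs):
    # ... run one epoch of LPF-SGD training here ...
    scheduler.step()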

@lucasliunju
Author

Dear Devansh,
Thank you very much for your reply. I have reproduced the results with cosine decay, and they match the results in the paper. Thanks again!

Best,
Lucas
