optimizer step #20

Closed
tojiboyevf opened this issue May 8, 2023 · 2 comments

tojiboyevf commented May 8, 2023

Dear @amaralibey,

I have some questions. I would be grateful if you could answer them.

  1. In the optimizer_step() function in main.py, you multiply lr by lr_scale for the first 650 steps, but this part is not mentioned in the paper. Do you use the same warm-up for Adam and AdamW?
  2. How do you select the learning rate for SGD, Adam, and AdamW when you increase the batch size? For instance, some authors of self-supervised models select the learning rate with the formula lr = base_lr * batch_size / 256, where base_lr can be 0.2, 0.3, or other values.
  3. Do you use the same scheduler with the same settings for Adam and AdamW optimizers?
  4. Did you use any framework to find the best hyperparameters?

Thanks for your attention!

tojiboyevf reopened this May 8, 2023

amaralibey (Owner) commented May 26, 2023

Hello @tojiboyevf, thank you for your interest.

Q: In the optimizer_step() function in main.py, you multiply lr by lr_scale for the first 650 steps, but this part is not mentioned in the paper. Do you use the same warm-up for Adam and AdamW?
A: Yes, we did use warm-up in our implementation. Although it did not significantly improve performance, it helped stabilize the models during the first epochs. We applied the warm-up strategy to the SGD, Adam, and AdamW optimizers.
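For reference, here is a minimal sketch of what such a linear warm-up looks like in plain PyTorch (the actual hook in main.py may differ; WARMUP_STEPS, BASE_LR, and apply_warmup are illustrative names, not the repository's):

```python
import torch

WARMUP_STEPS = 650  # warm-up duration mentioned above (assumed linear ramp)
BASE_LR = 0.05      # target learning rate after warm-up (example value for SGD)

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR, momentum=0.9)

def apply_warmup(optimizer, step):
    """Linearly scale the learning rate from ~0 up to BASE_LR over the first WARMUP_STEPS."""
    if step < WARMUP_STEPS:
        lr_scale = (step + 1) / WARMUP_STEPS
        for pg in optimizer.param_groups:
            pg["lr"] = lr_scale * BASE_LR

# inside the training loop, before each optimizer.step():
#   apply_warmup(optimizer, global_step)
#   optimizer.step()
#   global_step += 1
```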

Q: How do you select the learning rate for SGD, Adam, and AdamW when you increase the batch size? For instance, some authors of self-supervised models select the learning rate with the formula lr = base_lr * batch_size / 256, where base_lr can be 0.2, 0.3, or other values.
A: For SGD, we set the learning rate to 0.03 when the batch size is 100-120 places (which corresponds to 400-480 images). You can adjust the learning rate proportionally to the batch size: in our case, lr = 0.05 * BS / 120 for SGD and lr = 0.0002 * BS / 120 for AdamW.
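In other words, a small sketch using the quoted base values and a reference batch size of 120 places (scaled_lr is a hypothetical helper, not code from the repo):

```python
def scaled_lr(batch_size_places: int, optimizer_name: str) -> float:
    """Scale the learning rate linearly with the batch size (in places),
    using the base values quoted above (reference batch size of 120 places)."""
    if optimizer_name == "sgd":
        return 0.05 * batch_size_places / 120
    if optimizer_name == "adamw":
        return 0.0002 * batch_size_places / 120
    raise ValueError(f"unknown optimizer: {optimizer_name}")

print(scaled_lr(120, "sgd"))    # 0.05
print(scaled_lr(240, "adamw"))  # 0.0004
```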

Q: Do you use the same scheduler with the same settings for Adam and AdamW optimizers?
A: Yes, we use the same scheduler for all optimizers. Specifically, we multiply the learning rate by 0.3 after every 5 epochs. Although we experimented with different strategies, we found that they yielded similar performance.
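That schedule can be expressed, for example, with PyTorch's built-in StepLR (a sketch; the repository may implement it differently):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# multiply the learning rate by 0.3 every 5 epochs, as described above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)

for epoch in range(30):
    # ... train for one epoch, calling optimizer.step() per batch ...
    scheduler.step()
```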

Q: Did you use any framework to find the best hyperparameters?
A: No, we just tried some values and went with those that performed best on pitts30-val.

tojiboyevf (Author) commented

Cool! Thanks for your answers!
