optimizer step #20

Closed
tojiboyevf opened this issue May 8, 2023 · 2 comments

tojiboyevf commented May 8, 2023

Dear @amaralibey,

I have some questions. I would be grateful if you could answer them.

  1. In the optimizer_step() function in main.py, you multiply lr by lr_scale for the first 650 steps, but this part is not mentioned in the paper. Do you use the same warm-up for Adam and AdamW?
  2. How do you select the learning rate for SGD, Adam, and AdamW when you increase the batch size? For instance, some authors of self-supervised models select the learning rate with the formula lr = base_lr * batch_size / 256, where base_lr can be 0.2, 0.3, or other values.
  3. Do you use the same scheduler with the same settings for Adam and AdamW optimizers?
  4. Did you use any framework to find the best hyperparameters?

Thanks for your attention!

tojiboyevf reopened this May 8, 2023

amaralibey (Owner) commented May 26, 2023

Hello @tojiboyevf, thank you for your interest.

Q: In the optimizer_step() function in main.py, you multiply lr by lr_scale for the first 650 steps, but this part is not mentioned in the paper. Do you use the same warm-up for Adam and AdamW?
A: Yes, we did use warm-up in our implementation. Although it did not significantly improve performance, it helped stabilize the models during the first epochs. We applied the warm-up strategy to the SGD, Adam, and AdamW optimizers.
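For reference, here is a minimal sketch of what such a linear warm-up looks like in plain PyTorch (the actual hook in main.py may differ; WARMUP_STEPS, BASE_LR, and apply_warmup are illustrative names, not the repository's):

```python
import torch

WARMUP_STEPS = 650  # warm-up duration mentioned above (assumed linear ramp)
BASE_LR = 0.05      # target learning rate after warm-up (example value for SGD)

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR, momentum=0.9)

def apply_warmup(optimizer, step):
    """Linearly scale the learning rate from ~0 up to BASE_LR over the first WARMUP_STEPS."""
    if step < WARMUP_STEPS:
        lr_scale = (step + 1) / WARMUP_STEPS
        for pg in optimizer.param_groups:
            pg["lr"] = lr_scale * BASE_LR

# inside the training loop, before each optimizer.step():
#   apply_warmup(optimizer, global_step)
#   optimizer.step()
#   global_step += 1
```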

Q: How do you select the learning rate for SGD, Adam, and AdamW when you increase the batch size? For instance, some authors of self-supervised models select the learning rate with the formula lr = base_lr * batch_size / 256, where base_lr can be 0.2, 0.3, or other values.
A: For SGD, we set the learning rate to 0.03 when the batch size is 100-120 places (which corresponds to 400-480 images). You can adjust the learning rate proportionally to the batch size: in our case, lr = 0.05 * BS / 120 for SGD and lr = 0.0002 * BS / 120 for AdamW.
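In other words, a small sketch using the quoted base values and a reference batch size of 120 places (scaled_lr is a hypothetical helper, not code from the repo):

```python
def scaled_lr(batch_size_places: int, optimizer_name: str) -> float:
    """Scale the learning rate linearly with the batch size (in places),
    using the base values quoted above (reference batch size of 120 places)."""
    if optimizer_name == "sgd":
        return 0.05 * batch_size_places / 120
    if optimizer_name == "adamw":
        return 0.0002 * batch_size_places / 120
    raise ValueError(f"unknown optimizer: {optimizer_name}")

print(scaled_lr(120, "sgd"))    # 0.05
print(scaled_lr(240, "adamw"))  # 0.0004
```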

Q: Do you use the same scheduler with the same settings for Adam and AdamW optimizers?
A: Yes, we use the same scheduler for all optimizers. Specifically, we multiply the learning rate by 0.3 after every 5 epochs. Although we experimented with different strategies, we found that they yielded similar performance.
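That schedule can be expressed, for example, with PyTorch's built-in StepLR (a sketch; the repository may implement it differently):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# multiply the learning rate by 0.3 every 5 epochs, as described above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)

for epoch in range(30):
    # ... train for one epoch, calling optimizer.step() per batch ...
    scheduler.step()
```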

Q: Did you use any framework to find the best hyperparameters?
A: No, we just tried some values and went with those that performed best on pitts30-val.

tojiboyevf (Author) commented

Cool! Thanks for your answers!
