Hello,
I have a question regarding adjusting the learning rate with LAMP.
In your case you have a fixed learning rate of 0.000125, and you then divide or multiply it by some factors to get the correct base learning rate depending on the number of GPUs. Then you apply another equation to get the final learning rate:
```python
BASE_LR_BATCHSIZE = 32

total_gpus = num_gpus_per_machine * config.machines
global_batch_size = config.batch_size * total_gpus
# linear LR scaling (https://arxiv.org/abs/1706.02677)
lr = config.base_lr * (global_batch_size / BASE_LR_BATCHSIZE)
```
This means that using 16 nodes at Amazon we get a bigger global batch size and a bigger learning rate:
0.00020833333 * (96 * 16 * 8 / 32) = 0.07999999872
While a single node at Amazon gets a smaller global batch size and a smaller learning rate:
0.00003125 * (96 * 1 * 8 / 32) = 0.00075
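For reference, here is a minimal sketch (my own, not taken from the repo) that plugs both configurations into the scaling formula above; `PER_GPU_BATCH = 96` and `GPUS_PER_NODE = 8` are assumptions read off the arithmetic:

```python
# Minimal sketch reproducing the two calculations above.
# PER_GPU_BATCH and GPUS_PER_NODE are assumed from the numbers in this issue.
BASE_LR_BATCHSIZE = 32
PER_GPU_BATCH = 96
GPUS_PER_NODE = 8

def scaled_lr(base_lr: float, num_nodes: int) -> float:
    """Apply the linear LR scaling rule for a given node count."""
    global_batch_size = PER_GPU_BATCH * GPUS_PER_NODE * num_nodes
    return base_lr * (global_batch_size / BASE_LR_BATCHSIZE)

print(scaled_lr(0.00020833333, 16))  # ~0.08   (16 nodes)
print(scaled_lr(0.00003125, 1))      # 0.00075 (1 node)
```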
My questions are:
1. Why is BASE_LR_BATCHSIZE 32 and not 96?
2. If I want to train the model on x nodes with a per-GPU batch size of y, how can I determine the correct base_lr?
Thanks a lot.