Hi, I noticed that for both unsupervised NMT training and MLM training, the learning rate is 0.0001. Is this the learning rate used when training with 8 GPUs? If I use 4 GPUs, how should I adjust the learning rate and warm-up? Thank you very much.
0.0001 is what worked best overall in our experiments on 8 GPUs. On 64 GPUs, we found the best value was between 0.0001 and 0.0003. For 4 GPUs, I think 0.0001 should do the trick. That said, the model is quite sensitive to the learning rate (unless it has only a few layers), so I would suggest trying a few values around that one to see which works best.
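For reference, here is a minimal sketch of two common heuristics touched on in this thread: scaling the peak learning rate linearly with GPU count (i.e., with the effective batch size) and a Transformer-style warm-up followed by inverse-sqrt decay. This is not code from this repo; the function names, the 4000-step warm-up default, and the baseline values are assumptions for illustration only.

```python
# Hedged sketch, not this repo's implementation: illustrates (1) the linear
# learning-rate scaling heuristic across GPU counts and (2) a Transformer-style
# warm-up + inverse-sqrt schedule. All names and defaults are assumptions.

def scaled_lr(base_lr: float, base_gpus: int, num_gpus: int) -> float:
    """Linear-scaling heuristic: peak lr grows with the effective batch size."""
    return base_lr * num_gpus / base_gpus

def lr_at_step(step: int, peak_lr: float, warmup_steps: int = 4000) -> float:
    """Linear warm-up to peak_lr, then inverse-sqrt decay after warm-up."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

# Linear scaling would suggest 0.00005 on 4 GPUs for an 8-GPU baseline of
# 0.0001, though the answer above says simply keeping 0.0001 should work.
print(scaled_lr(1e-4, base_gpus=8, num_gpus=4))  # 5e-05
print(lr_at_step(2000, peak_lr=1e-4))            # mid warm-up: 5e-05
```

In practice, as the answer suggests, treating these heuristics as starting points and sweeping a few learning-rate values around them is the safer approach.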