Did you ever try MNMT systems? #15
There is a possible bug in the `StableEmbedding` layer. If you want to test this yourself, you could try two different things:
I don't think I'll get to it anytime soon since this is rather low priority for me right now, but in case you check it and manage to resolve it, let me know! It would be really beneficial to have stable performance for these types of models as well =)
I had a look at this, and indeed there was a bug in the `StableEmbedding` layer.
I don't have the exact version at hand right now, but if it wasn't part of the
Seems like I'm running into training instabilities now with the same hyperparameter config that works with default Adam and without StableEmbedding. Initially it seems to work fine, but after a few update steps training produces NaNs.
Hi Robin, that is expected. Can you try again with the StableEmbedding layer? What I meant is that previously the StableEmbedding layer did not work correctly, and if you use it now you should see better results.
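For reference, this is roughly what using the `StableEmbedding` layer together with 8-bit Adam looks like outside of fairseq. This is a minimal sketch, not the setup from this thread: the toy model, vocabulary size, and hyperparameters are illustrative, and module paths follow the bitsandbytes README (they may differ across versions). It assumes a CUDA device, since the 8-bit optimizers run on GPU.

```python
# Minimal sketch: bnb.nn.StableEmbedding + bnb.optim.Adam8bit on a toy model.
# StableEmbedding combines Xavier-uniform init, a layer norm on the embedding
# output, and 32-bit optimizer state for the embedding weights.
import torch
import torch.nn as nn
import bitsandbytes as bnb

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = bnb.nn.StableEmbedding(vocab_size, dim)  # instead of nn.Embedding
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))

model = TinyLM().cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

tokens = torch.randint(0, 1000, (8, 16), device="cuda")
targets = torch.randint(0, 1000, (8, 16), device="cuda")
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.view(-1, 1000), targets.view(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The key point is simply that the embedding module is swapped for `StableEmbedding` while everything else, including the training loop, stays unchanged.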
Ah, maybe that was badly formulated, my bad! I'm using 8-bit Adam + StableEmbedding and still run into this issue.
Thanks for the feedback. In that case, could you share a bit more about your data, model, and hyperparameters? The more information, the better. It would also be helpful to share the software you are using (fairseq, any particular branch, etc.). The best would be some setup that I might be able to replicate.
That might be a bit hard, as I'm using an internal fairseq fork and training on internal data and models. I haven't checked if this reproduces the issue, but what you can try is training the `multilingual_transformer` on a few language pairs with 8-bit Adam and the StableEmbedding layer.
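One way to set up such a reproduction without touching the model definition is to retrofit `StableEmbedding` onto an already-built model. The helper below is hypothetical (it is not part of fairseq or bitsandbytes), and it deliberately does not handle embeddings that are shared between encoder and decoder; with shared embeddings the replacement would have to preserve that sharing.

```python
# Hypothetical helper: walk a built model and replace every nn.Embedding with
# bnb.nn.StableEmbedding, copying over the existing weights. The rest of the
# model and the fairseq training loop stay as they are.
import torch.nn as nn
import bitsandbytes as bnb

def swap_to_stable_embeddings(model: nn.Module) -> nn.Module:
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Embedding):
            stable = bnb.nn.StableEmbedding(
                module.num_embeddings,
                module.embedding_dim,
                padding_idx=module.padding_idx,
            )
            # Keep the already-initialized weights instead of re-initializing.
            stable.weight.data.copy_(module.weight.data)
            setattr(model, name, stable)
        else:
            swap_to_stable_embeddings(module)
    return model
```

After the swap, the model's parameters can be passed to `bnb.optim.Adam8bit` as in the earlier sketch.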
As reported in the paper, for training a bi-directional transformer model on WMT14 or WMT16 the performance of 8-bit Adam stays relatively consistent with the 32-bit counterparts. I was also able to verify this on other data sources for training bi-directional models with my own setup.
However, I've also tried multiple variations of 8-bit optimizers on multilingual neural machine translation (MNMT) models in fairseq, and there it seems that even with `--no-scale-embedding` as well as the `StableEmbedding` layer, the performance is roughly 3 BLEU behind the 32-bit counterparts. The `--no-scale-embedding` flag (see the sketch below for what it changes) accounts for roughly a 7 BLEU gain, while the Xavier init accounts for roughly a 0.4 BLEU gain. I haven't looked into the effect of the layer norm of the stable embeddings yet.

Did you do any testing on that, and do you have practical tips for getting the performance up?
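For context, `--no-scale-embedding` toggles the `sqrt(embed_dim)` factor that fairseq's transformer applies to token embeddings before adding positional embeddings. The snippet below is a standalone sketch paraphrasing that behavior, not fairseq's actual code; the dimensions and vocabulary size are illustrative.

```python
# Sketch of what --no-scale-embedding changes: with the flag set, the embedding
# scale factor is 1.0 instead of sqrt(embed_dim).
import math
import torch
import torch.nn as nn

embed_dim = 512
embed_tokens = nn.Embedding(32000, embed_dim, padding_idx=1)
no_scale_embedding = True  # i.e. passing --no-scale-embedding

embed_scale = 1.0 if no_scale_embedding else math.sqrt(embed_dim)
tokens = torch.randint(0, 32000, (2, 10))
x = embed_scale * embed_tokens(tokens)  # embeddings fed to the encoder/decoder
```

Dropping the scale factor keeps the embedding outputs in a smaller range, which interacts with how the embedding layer norm in `StableEmbedding` behaves.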