Did you ever try MNMT systems? #15
There is a possible bug in the `StableEmbedding` layer. If you want to test this yourself, you could try two different things:
I don't think I'll get to it anytime soon since this is rather low priority for me right now, but in case you check it and manage to resolve it, let me know! It would be really beneficial to have stable performance for these types of models as well =)
I had a look at this, and indeed there was a bug in the `StableEmbedding` layer.
I don't have the exact version at hand right now, but if it wasn't part of the
Seems like I'm running into training instabilities now with the same hyperparameter config that works with default Adam and without StableEmbedding. Initially it seems to work fine, but after a few update steps training produces NaNs.
Hi Robin, that is expected. Can you try again with the StableEmbedding layer? What I meant is that previously the StableEmbedding layer did not work correctly, and if you use it now you should see better results.
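For reference, this is roughly what using the `StableEmbedding` layer together with 8-bit Adam looks like outside of fairseq. This is a minimal sketch, not the setup from this thread: the toy model, vocabulary size, and hyperparameters are illustrative, and module paths follow the bitsandbytes README (they may differ across versions). It assumes a CUDA device, since the 8-bit optimizers run on GPU.

```python
# Minimal sketch: bnb.nn.StableEmbedding + bnb.optim.Adam8bit on a toy model.
# StableEmbedding combines Xavier-uniform init, a layer norm on the embedding
# output, and 32-bit optimizer state for the embedding weights.
import torch
import torch.nn as nn
import bitsandbytes as bnb

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = bnb.nn.StableEmbedding(vocab_size, dim)  # instead of nn.Embedding
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))

model = TinyLM().cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

tokens = torch.randint(0, 1000, (8, 16), device="cuda")
targets = torch.randint(0, 1000, (8, 16), device="cuda")
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.view(-1, 1000), targets.view(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The key point is simply that the embedding module is swapped for `StableEmbedding` while everything else, including the training loop, stays unchanged.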
Ah, maybe that was badly formulated, my bad! I'm using 8-bit Adam + StableEmbedding and still run into this issue.
Thanks for the feedback. In that case, could you share a bit more about your data, model, and hyperparameters? The more information, the better. It would also be helpful to share the software you are using (fairseq, any particular branch, etc.). The best would be some setup that I might be able to replicate.
That might be a bit hard, as I'm using an internal fairseq fork and training on internal data and models. I haven't checked if this reproduces the issue, but what you can try is training the `multilingual_transformer` on a few language pairs with 8-bit Adam and the StableEmbedding layer.
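One way to set up such a reproduction without touching the model definition is to retrofit `StableEmbedding` onto an already-built model. The helper below is hypothetical (it is not part of fairseq or bitsandbytes), and it deliberately does not handle embeddings that are shared between encoder and decoder; with shared embeddings the replacement would have to preserve that sharing.

```python
# Hypothetical helper: walk a built model and replace every nn.Embedding with
# bnb.nn.StableEmbedding, copying over the existing weights. The rest of the
# model and the fairseq training loop stay as they are.
import torch.nn as nn
import bitsandbytes as bnb

def swap_to_stable_embeddings(model: nn.Module) -> nn.Module:
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Embedding):
            stable = bnb.nn.StableEmbedding(
                module.num_embeddings,
                module.embedding_dim,
                padding_idx=module.padding_idx,
            )
            # Keep the already-initialized weights instead of re-initializing.
            stable.weight.data.copy_(module.weight.data)
            setattr(model, name, stable)
        else:
            swap_to_stable_embeddings(module)
    return model
```

After the swap, the model's parameters can be passed to `bnb.optim.Adam8bit` as in the earlier sketch.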
As reported in the paper, for training a bi-directional transformer model on WMT14 or WMT16 the performance of 8-bit Adam stays relatively consistent with the 32-bit counterparts. I was also able to verify this on other data sources for training bi-directional models with my own setup.
However, I've also tried multiple variations of 8-bit optimizers on multilingual neural machine translation (MNMT) models in fairseq, and there it seems that even with `--no-scale-embedding` as well as the `StableEmbedding` layer, the performance is roughly 3 BLEU behind the 32-bit counterparts. The `--no-scale-embedding` flag (see the sketch below for what it changes) accounts for roughly a 7 BLEU gain, while the Xavier init accounts for roughly a 0.4 BLEU gain. I haven't looked into the effect of the layer norm of the stable embeddings yet.

Did you do any testing on that, and do you have practical tips for getting the performance up?
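For context, `--no-scale-embedding` toggles the `sqrt(embed_dim)` factor that fairseq's transformer applies to token embeddings before adding positional embeddings. The snippet below is a standalone sketch paraphrasing that behavior, not fairseq's actual code; the dimensions and vocabulary size are illustrative.

```python
# Sketch of what --no-scale-embedding changes: with the flag set, the embedding
# scale factor is 1.0 instead of sqrt(embed_dim).
import math
import torch
import torch.nn as nn

embed_dim = 512
embed_tokens = nn.Embedding(32000, embed_dim, padding_idx=1)
no_scale_embedding = True  # i.e. passing --no-scale-embedding

embed_scale = 1.0 if no_scale_embedding else math.sqrt(embed_dim)
tokens = torch.randint(0, 32000, (2, 10))
x = embed_scale * embed_tokens(tokens)  # embeddings fed to the encoder/decoder
```

Dropping the scale factor keeps the embedding outputs in a smaller range, which interacts with how the embedding layer norm in `StableEmbedding` behaves.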