
Did you ever try MNMT systems? #15

Closed
SirRob1997 opened this issue Nov 15, 2021 · 9 comments

@SirRob1997 commented Nov 15, 2021

As reported in the paper, when training a bi-directional transformer model on WMT14 or WMT16, the performance of 8-bit Adam stays relatively consistent with its 32-bit counterpart. I was also able to verify this on other data sources when training bi-directional models with my own setup.

However, I've also tried multiple variations of the 8-bit optimizers on multilingual neural machine translation (MNMT) models in fairseq, and there it seems that even with --no-scale-embedding as well as the StableEmbedding layer, performance is roughly 3 BLEU behind the 32-bit counterparts. The --no-scale-embedding flag accounts for roughly a 7 BLEU gain, while the Xavier init accounts for roughly a 0.4 BLEU gain. I haven't looked into the effect of the layer norm in the stable embeddings yet.
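For reference, the swap looks roughly like this in my setup (a minimal sketch rather than my exact internal fork; the vocabulary size, dimensions, and learning rate below are only illustrative):

```python
import torch
import bitsandbytes as bnb

# Illustrative toy dimensions.
vocab_size, embed_dim, pad_idx = 32000, 1024, 1

# StableEmbedding = Xavier init + layer norm + 32-bit optimizer states
# for the embedding weights.
embedding = bnb.nn.StableEmbedding(vocab_size, embed_dim, padding_idx=pad_idx)

# Stand-in for the rest of the translation model.
model = torch.nn.Sequential(embedding, torch.nn.Linear(embed_dim, vocab_size))

# 8-bit Adam in place of the default 32-bit Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
```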

Did you do any testing on that, and do you have practical tips for getting the performance up?

TimDettmers added the bug and question labels on Nov 17, 2021
@TimDettmers

There is a possible bug in the StableEmbedding layer, depending on when/how/if the parameters are registered with the GlobalOptimManager. I have to look into this; I'll probably get to it sometime next week.

If you want to test this yourself you could try two different things:

  1. Call GlobalOptimManager.get_instance().register_parameters(model.parameters()) before the parameters are on the GPU. For fairseq that is here (see the sketch right after this list).
  2. Change the order of these two lines and see if it makes a difference.
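For option 1, a minimal sketch of what I mean (the tiny model below just stands in for fairseq's model construction; sizes are arbitrary):

```python
import torch
import bitsandbytes as bnb

# Stand-in for fairseq's model construction; the model starts out on the CPU.
model = torch.nn.Sequential(
    bnb.nn.StableEmbedding(1000, 64),
    torch.nn.Linear(64, 1000),
)

# Register the parameters *before* they are moved to the GPU so that the
# 8-bit optimizer can match its per-parameter config (e.g. 32-bit states
# for the embedding) to the right tensors.
bnb.optim.GlobalOptimManager.get_instance().register_parameters(model.parameters())

model = model.cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-4)
```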

@SirRob1997

I don't think I'll get to it anytime soon since this is rather low priority for me right now, but in case you check it and manage to resolve it, let me know! It would be really beneficial to have stable performance for these types of models as well =)

@TimDettmers

I had a look at this, and indeed there was a bug in the StableEmbedding layer. Could you inspect the code with which you ran this multilingual NMT baseline? If you did not use GlobalOptimManager.get_instance().register_parameters(model.parameters()), that would explain the poor performance.
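If you want to rule this out explicitly, you can also pin the embedding weight to 32-bit optimizer states through the manager yourself; a rough sketch (module names and sizes are only illustrative):

```python
import torch
import bitsandbytes as bnb

emb = bnb.nn.StableEmbedding(1000, 64)
model = torch.nn.Sequential(emb, torch.nn.Linear(64, 1000))

mng = bnb.optim.GlobalOptimManager.get_instance()
mng.register_parameters(model.parameters())        # while the model is still on the CPU
mng.override_config(emb.weight, "optim_bits", 32)  # keep the embedding states in 32-bit

model = model.cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-4)
```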

@SirRob1997

I don't have the exact version at hand right now, but if it wasn't part of the StableEmbedding layer when this issue was raised, I didn't use it, as I simply imported the layer as is. I'll try to kick off an experiment on version 0.26.0 and see if it fixes the issue for me. Will report back!

@SirRob1997

It seems like I run into training instabilities now with the same hyperparameter config that works with default Adam and without StableEmbedding. Initially training seems to work fine, but after a few update steps it produces NaNs.
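A small guard along these lines is how the first non-finite value can be spotted (purely a diagnostic sketch, not code from my actual setup):

```python
import torch

def assert_finite(model: torch.nn.Module, loss: torch.Tensor, step: int) -> None:
    # Fail fast as soon as the loss or any gradient contains NaN/Inf.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"non-finite loss at update step {step}")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"non-finite gradient in {name} at update step {step}")
```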

@TimDettmers

Hi Robin, that is to be expected. Can you try again with the StableEmbedding layer? What I meant is that previously the StableEmbedding layer did not work correctly; if you use it now, you should see better results.

@SirRob1997

Ah, maybe that was badly phrased, my bad! I am using 8-bit Adam + StableEmbedding and still run into this issue.

@TimDettmers commented Dec 13, 2021

Thanks for the feedback. In that case, could you share a bit more about your data, model, and hyperparameters? The more information, the better. It would also be helpful to share the software you are using (fairseq, any particular branch, etc.). Best of all would be some setup that I could replicate.

@SirRob1997 commented Dec 13, 2021

That might be a bit hard, as I'm using an internal fairseq fork and training on internal data and models. I haven't checked whether this reproduces the issue, but what you can try is training the multilingual_transformer on a few language pairs with --share-encoder-embeddings and --share-decoder-embeddings. If you are able to reproduce it, great! Otherwise, I'm afraid this is an issue on my end and probably not many people will run into it.
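Something along these lines should be close (an untested sketch; the data-bin path, language pairs, and hyperparameters are placeholders, and the 8-bit Adam / StableEmbedding swap itself still has to happen in the model/optimizer code rather than via flags):

```python
import sys
from fairseq_cli import train

# Placeholder data directory and language pairs; swap in your own.
sys.argv = [
    "fairseq-train", "data-bin/multilingual",
    "--task", "multilingual_translation",
    "--lang-pairs", "de-en,fr-en",
    "--arch", "multilingual_transformer",
    "--share-encoder-embeddings",
    "--share-decoder-embeddings",
    "--optimizer", "adam",
    "--lr", "0.0005",
    "--lr-scheduler", "inverse_sqrt",
    "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    "--max-tokens", "4096",
]
train.cli_main()
```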
