Failed to train the Mask-predict with larger model/hidden dimension #7
@yinhanliu Hoping for your advice 😉
Have you tried using the hyperparameters for transformer big? https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#3-train-a-model. I think I ran into a similar problem at some point, and switching to the hyperparameters from Ott et al. 2018 (large) got it to work.
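For reference, here is a sketch of the relevant pieces of that transformer-big recipe. The values are transcribed from the linked scaling_nmt README and may differ across fairseq versions, so treat them as assumptions to verify against the linked page; the schedule function is just an illustration of how inverse-sqrt warmup behaves, not fairseq's actual implementation.

```python
# Sketch of the Ott et al. 2018 transformer-big hyperparameters
# (assumed from the linked fairseq scaling_nmt README; double-check
# against your fairseq version before relying on them).
big_hparams = {
    "arch": "transformer_vaswani_wmt_en_de_big",
    "optimizer": "adam",
    "adam_betas": (0.9, 0.98),
    "lr": 5e-4,
    "lr_scheduler": "inverse_sqrt",
    "warmup_updates": 4000,
    "dropout": 0.3,
    "weight_decay": 0.0,
    "label_smoothing": 0.1,
}

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup=4000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) / (step ** 0.5)

# The learning rate peaks at the end of warmup and decays afterwards;
# too short a warmup (or too high a peak) is a common cause of
# divergence when scaling up the model dimension.
print(inverse_sqrt_lr(4000))   # peak
print(inverse_sqrt_lr(16000))  # half the peak, since 16000 = 4 * 4000
```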
Also, don't forget to preprocess your data with the code from this branch.
@jungokasai @Marjan-GH Thanks! I adjusted the original command to:
And this seems to work for me; I picked one line of the training log:
@yinhanliu @omerlevy @Marjan-GH Hi, dear authors. I have trained the large-scale Mask-Predict (hidden size: 1024/4096, vocab size: 60k+); the parameter count is ~270M. Since the number of parameters increases, the translation quality should be better than that of the base-scale Mask-Predict (512/2048), as expected. However, the BLEU score of the large-scale model on EN-DE newstest14 is only ~26. I'm pretty sure the model has converged; some indicators are shown below. The loss of the latest large-scale checkpoint that I used for evaluation is as follows:
which looks clearly better than the base-scale Mask-Predict. With the base model I reproduced your result on the same EN-DE dataset, where BLEU reaches ~27; the loss of that model is:
So, I am wondering whether the Mask-Predict model only fits the base (512/2048) scale and does not work in the large-scale setting? Looking forward to your reply. Best
That seems a bit strange. The perplexity and length loss on validation are smaller, so I would expect the large transformer to be at least as good as the base one in BLEU as well. It does not look like a training issue. Just as a sanity check, could you check the BLEU and loss on the validation data with both base and large? Roughly speaking, there should be a correlation between BLEU and the loss; if there isn't, there might be something wrong with the inference. Otherwise it might be overfitting? I wouldn't have expected that to happen with dropout 0.3, though. Also, please make sure you are distilling from the same large autoregressive transformer.
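The sanity check above can be done mechanically: collect (validation loss, validation BLEU) pairs over several checkpoints and look at their correlation. A minimal sketch, where the checkpoint values are made-up placeholders to be replaced with your own numbers:

```python
# Sanity check: validation loss vs. validation BLEU should correlate
# negatively (lower loss -> higher BLEU). The lists below are
# hypothetical placeholders; fill in your own checkpoints' values.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

val_loss = [4.10, 3.80, 3.60, 3.50, 3.45]    # per-checkpoint val loss
val_bleu = [22.0, 24.5, 25.8, 26.4, 26.6]    # per-checkpoint val BLEU

r = pearson(val_loss, val_bleu)
print(f"Pearson r = {r:.3f}")  # strongly negative if loss tracks BLEU
# If validation loss and validation BLEU do correlate but the *test*
# BLEU is still off, suspect the inference/evaluation path (decoding
# settings, tokenization, checkpoint selection) rather than training.
```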
Hi @jungokasai, the BLEU scores on the validation set using the best single checkpoint are: large-scale model↓
base scale model↓
There does exist a positive correlation between validation BLEU and the loss. However, the BLEU scores on the test set with the best single checkpoint are: large-scale model↓
base scale model↓
This phenomenon is strange. Can it be said that the Mask-Predict architecture is not suitable for large scale? BTW, all my distilled data is derived from a pretrained, powerful big AT model; refer to that issue response.
Hi Liam,
@omerlevy Because of my relatively limited computing resources, the experiment took a long time. Looking forward to your results! This will be helpful for researchers who follow this paper. Thank you!
Elegant work! In addition to training a transformer_base-scale model, I am also trying to train a large model (e.g., 1024 model dim. & 4096 hidden dim.), so that I can fine-tune Mask-Predict with XLM.
However, when I simply change the dimensions and keep the other arguments fixed, training fails: the ppl keeps increasing. Could you give me some advice? Below is my training command:
and the following is the log of one training step:
BTW, because I reused the XLM vocabulary list, the vocab size of the larger Mask-Predict is over 60k.
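As a rough sanity check on model size, a back-of-the-envelope parameter count for a 6+6-layer encoder-decoder at these dimensions with a 60k shared vocabulary can be sketched as follows (an estimate only: it assumes tied input/output embeddings and ignores positional embeddings and final layer norms, so the true count will be somewhat higher):

```python
# Back-of-the-envelope parameter count for a transformer encoder-decoder.
# Assumes tied embeddings; ignores positional embeddings and final layer
# norms, so real implementations report somewhat higher numbers.
def transformer_params(d_model, d_ffn, vocab, enc_layers=6, dec_layers=6):
    attn = 4 * (d_model * d_model + d_model)     # Q, K, V, output projections
    ffn = 2 * d_model * d_ffn + d_ffn + d_model  # two linear layers + biases
    lnorm = 2 * d_model                          # layer-norm gain + bias
    enc_layer = attn + ffn + 2 * lnorm           # self-attn + FFN
    dec_layer = 2 * attn + ffn + 3 * lnorm       # adds cross-attention
    emb = vocab * d_model                        # tied in/out embedding
    return emb + enc_layers * enc_layer + dec_layers * dec_layer

base = transformer_params(512, 2048, 60_000)
big = transformer_params(1024, 4096, 60_000)
print(f"base (512/2048):   ~{base / 1e6:.0f}M parameters")
print(f"large (1024/4096): ~{big / 1e6:.0f}M parameters")
# The large setting lands in the same ballpark as the ~270M reported
# above; with a 60k vocabulary, roughly a quarter of the parameters
# sit in the embedding table alone.
```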