This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Will you release the distillation dataset of wmt-en-de? #3

Closed
SunbowLiu opened this issue Oct 3, 2019 · 9 comments

Comments

@SunbowLiu

Hi,

I have successfully reproduced the 27.03 BLEU score (N=10, l=5) and the 1.2x speedup (N=10, l=2) using your pre-trained wmt-en-de model.

I want to train the model from scratch, but the performance heavily relies on the distillation dataset you used (with the raw data, I can only reach ~24 BLEU), so it would be very helpful if you could provide this dataset.

Thank you!

@ftakanashi

Hello there. I tried to reproduce the en-de results in the paper, but I could only get about 22.6 BLEU. Could you share some details about your reproduction, such as which dataset you used and what the other hyperparameters were? Any information would be very helpful. Thanks!

@SunbowLiu
Author


Hi,

All hyperparameters are the same as in the paper and the provided script. The dataset is https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8
They use a batch size of 16*8192, so if you have only 8 V100s, you should set --update-freq to 2. With this setup I could train a model to the ~24 BLEU score reported in the paper, matching my experimental results.

Thank you!
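The --update-freq arithmetic above can be checked directly: the effective batch size per update is the number of GPUs times the per-GPU --max-tokens times --update-freq. A quick sanity check (the helper name is just illustrative):

```python
def effective_batch(num_gpus: int, max_tokens: int, update_freq: int) -> int:
    """Effective tokens per optimizer update under gradient accumulation."""
    return num_gpus * max_tokens * update_freq

# 8 V100s with --update-freq 2 matches the paper's 16-GPU setup:
assert effective_batch(8, 8192, 2) == effective_batch(16, 8192, 1)
print(effective_batch(8, 8192, 2))  # 131072 tokens per update
```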

@ftakanashi

Thank you very much. The reason I couldn't reproduce the result seems to be a problem in my data preprocessing: I lowercased all my data, which introduced too many inconsistent representations into the corpus. When I use your data directly, it works! Thanks again!

@yinhanliu

Thanks for your interest. Please use the code here

https://github.com/pytorch/fairseq/tree/master/examples/translation

with this command

python train.py your-data-bin \
    --arch transformer --share-all-embeddings \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --max-tokens 8192 --dropout 0.3 \
    --encoder-layers 6 --encoder-embed-dim 1024 \
    --decoder-layers 6 --decoder-embed-dim 1024 \
    --max-update 300000 --update-freq 2 --fp16 \
    --max-source-positions 10000 --max-target-positions 10000 \
    --save-dir checkpoints
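The command above trains the teacher; building the distillation dataset then typically means decoding the source side of the training data with that teacher and pairing each source sentence with the teacher's beam output. fairseq-generate-style output interleaves tab-separated S- (source), T- (target), and H- (hypothesis) lines, so the distilled targets can be pulled out of the H- lines. A hedged sketch (the sample lines below are illustrative, not real model output):

```python
def extract_hypotheses(generate_output: str) -> dict:
    """Map sentence id -> hypothesis string from fairseq-generate-style output.

    H- lines are tab-separated: "H-<id>\t<score>\t<hypothesis tokens>".
    """
    hyps = {}
    for line in generate_output.splitlines():
        if line.startswith("H-"):
            tag, _score, text = line.split("\t", 2)
            hyps[int(tag[2:])] = text
    return hyps

# Illustrative sample in the generate-output format:
sample = (
    "S-0\tthe cat sat .\n"
    "H-0\t-0.31\tdie Katze sass .\n"
    "S-1\thello world .\n"
    "H-1\t-0.25\thallo Welt .\n"
)
print(extract_hypotheses(sample))  # {0: 'die Katze sass .', 1: 'hallo Welt .'}
```

Writing the hypotheses back out in sentence-id order gives the target side of the distilled parallel corpus.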

@ftakanashi

Hello, Liu! Thanks for sharing the test dataset with me last month. Now I am also trying to train the model from scratch and have run into the same problem you did.
I've tried the command @yinhanliu provided above to generate the distillation dataset, but the model still performs poorly after being trained on it. Did you manage to reproduce the BLEU score from the paper using the distillation data?

@SunbowLiu
Author


Hi,

I have successfully trained wmt-en-de from scratch. I used a distillation dataset produced by a strong Transformer-big teacher (~29.3 BLEU, https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#pre-trained-models), which reproduces a final BLEU score of >27.2.
Note that Mask-Predict uses a batch size of 16*8192, so if you have only 8 V100s, you should set --update-freq to 2.

@PanXiebit

Hi @SunbowLiu
Thank you for the information you have provided, but there isn't a de->en pretrained model at https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#pre-trained-models.

Do you have any advice?

@SunbowLiu
Author


The only way might be to train one from scratch.

@dmortem

dmortem commented Nov 8, 2020

Hi, when I used the checkpoint_best.pt provided in the README with the inference script "python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict", I could only get a BLEU of 20.90. What is the problem? Are there any other hyperparameters I need to modify in the inference script?

I see "average the 5 best checkpoints to create the final model" in the paper. So is the checkpoint_best.pt provided in the link the final model? If not, how should the best checkpoints be averaged? Do we run forward passes with all 5 models and average their prediction distributions?

Thank you!
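For reference on the averaging question: fairseq ships scripts/average_checkpoints.py, which averages the saved model parameters element-wise to produce a single model, rather than ensembling five models' prediction distributions at inference time. A minimal sketch of that parameter-averaging idea using plain dicts of floats (real checkpoints hold tensors; the names here are illustrative):

```python
def average_state_dicts(states):
    """Element-wise mean of a list of parameter dicts sharing the same keys."""
    keys = states[0].keys()
    return {k: sum(s[k] for s in states) / len(states) for k in keys}

# Two toy "checkpoints" with the same parameter names:
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
print(average_state_dicts(ckpts))  # {'w': 2.0, 'b': 1.0}
```

The averaged dict is then saved as one checkpoint and used for decoding like any single model.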
