This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Will you release the distillation dataset of wmt-en-de? #3

Closed
SunbowLiu opened this issue Oct 3, 2019 · 9 comments

Comments

@SunbowLiu

Hi,

I have successfully reproduced the 27.03 BLEU score (N=10, l=5) and the 1.2x speedup (N=10, l=2) using your pre-trained wmt-en-de model.

I want to train the model from scratch, but the performance heavily relies on the distillation dataset you used (with the raw data, I can only reach ~24 BLEU), so it would be very helpful if you could provide this dataset.

Thank you!

@ftakanashi

Hello there. I tried to reproduce the en-de results in the paper, but I could only get about 22.6 BLEU. Could you share some details about your reproduction, such as which dataset you used and what the other hyperparameters were? Any information would be very helpful. Thanks!

@SunbowLiu
Author


Hi,

All hyperparameters are the same as in the paper and the provided script. The dataset is https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8
They use a batch size of 16*8192, so if you have only 8 V100s, you should set --update-freq to 2. With this setup I could train a model to the ~24 BLEU score reported in the paper, matching my experimental results.

Thank you!
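The --update-freq arithmetic above can be checked directly: the effective batch size per update is the number of GPUs times the per-GPU --max-tokens times --update-freq. A quick sanity check (the helper name is just illustrative):

```python
def effective_batch(num_gpus: int, max_tokens: int, update_freq: int) -> int:
    """Effective tokens per optimizer update under gradient accumulation."""
    return num_gpus * max_tokens * update_freq

# 8 V100s with --update-freq 2 matches the paper's 16-GPU setup:
assert effective_batch(8, 8192, 2) == effective_batch(16, 8192, 1)
print(effective_batch(8, 8192, 2))  # 131072 tokens per update
```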

@ftakanashi

Thank you very much. The reason I couldn't reproduce the result seems to be a problem in my data preprocessing: I lowercased all my data, which introduced too many inconsistent representations into the corpus. When I use your data directly, it works! Thanks again!

@yinhanliu

Thanks for your interest. Please use the code here

https://github.com/pytorch/fairseq/tree/master/examples/translation

with this command

python train.py your-data-bin \
    --arch transformer --share-all-embeddings \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --max-tokens 8192 --dropout 0.3 \
    --encoder-layers 6 --encoder-embed-dim 1024 \
    --decoder-layers 6 --decoder-embed-dim 1024 \
    --max-update 300000 --update-freq 2 --fp16 \
    --max-source-positions 10000 --max-target-positions 10000 \
    --save-dir checkpoints
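The command above trains the teacher; building the distillation dataset then typically means decoding the source side of the training data with that teacher and pairing each source sentence with the teacher's beam output. fairseq-generate-style output interleaves tab-separated S- (source), T- (target), and H- (hypothesis) lines, so the distilled targets can be pulled out of the H- lines. A hedged sketch (the sample lines below are illustrative, not real model output):

```python
def extract_hypotheses(generate_output: str) -> dict:
    """Map sentence id -> hypothesis string from fairseq-generate-style output.

    H- lines are tab-separated: "H-<id>\t<score>\t<hypothesis tokens>".
    """
    hyps = {}
    for line in generate_output.splitlines():
        if line.startswith("H-"):
            tag, _score, text = line.split("\t", 2)
            hyps[int(tag[2:])] = text
    return hyps

# Illustrative sample in the generate-output format:
sample = (
    "S-0\tthe cat sat .\n"
    "H-0\t-0.31\tdie Katze sass .\n"
    "S-1\thello world .\n"
    "H-1\t-0.25\thallo Welt .\n"
)
print(extract_hypotheses(sample))  # {0: 'die Katze sass .', 1: 'hallo Welt .'}
```

Writing the hypotheses back out in sentence-id order gives the target side of the distilled parallel corpus.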

@ftakanashi

Hello, Liu! Thanks for sharing the test dataset with me last month. Now I am also trying to train the model from scratch and have run into the same problem you did.
I've tried the command @yinhanliu provided above to generate the distillation dataset, but the model still performs poorly after being trained on it. Did you manage to reproduce the BLEU score from the paper using the distillation data?

@SunbowLiu
Author


Hi,

I have successfully trained wmt-en-de from scratch. I used a distillation dataset produced by a strong Transformer-big teacher (~29.3 BLEU, https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#pre-trained-models), which reproduces a final BLEU score of >27.2.
Note that Mask-Predict uses a batch size of 16*8192, so if you have only 8 V100s, you should set --update-freq to 2.

@PanXiebit

Hi @SunbowLiu
Thank you for the information you have provided, but there isn't a de->en pretrained model at https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#pre-trained-models.

Do you have any advice?

@SunbowLiu
Author


The only way might be to train one from scratch.

@dmortem

dmortem commented Nov 8, 2020

Hi, when I used the checkpoint_best.pt provided in the README with the inference script "python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict", I could only get a BLEU of 20.90. What is the problem? Are there any other hyperparameters I need to modify in the inference script?

I see "average the 5 best checkpoints to create the final model" in the paper. So is the checkpoint_best.pt provided in the link the final model? If not, how should the best checkpoints be averaged? Do we run forward passes with all 5 models and average their prediction distributions?

Thank you!
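For reference on the averaging question: fairseq ships scripts/average_checkpoints.py, which averages the saved model parameters element-wise to produce a single model, rather than ensembling five models' prediction distributions at inference time. A minimal sketch of that parameter-averaging idea using plain dicts of floats (real checkpoints hold tensors; the names here are illustrative):

```python
def average_state_dicts(states):
    """Element-wise mean of a list of parameter dicts sharing the same keys."""
    keys = states[0].keys()
    return {k: sum(s[k] for s in states) / len(states) for k in keys}

# Two toy "checkpoints" with the same parameter names:
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
print(average_state_dicts(ckpts))  # {'w': 2.0, 'b': 1.0}
```

The averaged dict is then saved as one checkpoint and used for decoding like any single model.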
