Dear authors,
I understand this repo isn't primarily intended for supervised MT, but your codebase contains a Transformer encoder-decoder model and, more importantly, it is much simpler than the standard supervised MT codebases (e.g. T2T, Fairseq, OpenNMT).
With the intention of reproducing the WMT14 En-De SOTA performance, I used the data & BPE from Fairseq and trained the Transformer base model (emb_dim=512) with only mt_steps="en-de" on 4x 2080 Ti (a single GPU gave even lower scores). I ended up with a tokenized BLEU score of 25.63 with beam_size 4 and length_penalty 0.6, which is more than 1 BLEU lower than reported in the Transformer paper.
Training script:
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en/fairseq --lgs 'en-de' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 6000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0,1,2,3'
Translate results:
My intuition is that the model structure is slightly different (GELU, layer norm, etc.). May I ask whether you have tried this codebase on the supervised WMT14 benchmark, and what your thoughts are on this?
Best.
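For concreteness, here is a minimal sketch of how a tokenized BLEU score like the one above can be computed, assuming sacrebleu and already-tokenized hypothesis/reference files (hyp.de / ref.de are placeholder names; this is not necessarily the exact scoring pipeline used above, and tokenized BLEU is only comparable across identical tokenization):

import sacrebleu

# Placeholder file names; one tokenized sentence per line, BPE merges already undone.
with open("hyp.de", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.de", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# tokenize="none" because the text is already tokenized; force=True silences the
# warning about pre-tokenized input. The resulting score is only comparable to
# other scores computed on the same tokenization.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="none", force=True)
print(f"BLEU = {bleu.score:.2f}")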
Yes, unfortunately I tried as well, and I have never been able to reproduce fairseq results with XLM on supervised tasks; there was always a difference of 1 or 2 BLEU. This is a bit annoying, because if we could match the supervised results we would probably also be better on the unsupervised / semi-supervised tasks, etc.
I really don't think that the differences in architecture (we have one extra layer norm after the embeddings, I believe, and when I compared I didn't use GELU) can explain the difference in BLEU. There are a couple of things fairseq has that we don't, such as a smoothed softmax (label smoothing) and checkpoint averaging, and I think these are the features we are missing in order to get SOTA results in supervised MT. If you see other features that may explain the differences, I can try to implement them and retry on the supervised task.
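For reference, a rough PyTorch sketch of those two features (not XLM or fairseq code; the fairseq versions differ in details such as padding masks and how the smoothing mass is distributed):

import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(logits, target, eps=0.1):
    # logits: (batch, vocab); target: (batch,)
    # Mix the one-hot target with a uniform distribution over the vocabulary.
    lprobs = F.log_softmax(logits, dim=-1)
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -lprobs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * smooth).mean()

def average_checkpoints(paths):
    # Average parameter tensors over several saved state_dicts
    # ("checkpoint averaging"; the Transformer paper averages the last checkpoints).
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}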
Hi @yilinyang7, I wanted to pretrain a language model with the MLM objective and use it for supervised MT on En-De translation. However, I am unable to do so because of an error. If you did supervised machine translation using the pretrained language model, could you please elaborate on the steps to follow? It would be great if you could share the commands you used. Kindly share your progress.
Thanks in advance.
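Not an official answer, but if I remember the README correctly, the intended route is to pass the pretrained MLM checkpoint to train.py via --reload_model 'encoder_path,decoder_path' together with MT options like the ones above. Conceptually this just initializes the encoder (and decoder, where shapes match) from the pretrained weights before supervised training starts; a generic PyTorch sketch of that idea, with hypothetical names, and not XLM's actual reload code:

import torch

def reload_pretrained(encoder, decoder, checkpoint_path):
    # Copy every parameter whose name and shape match the pretrained MLM
    # checkpoint; everything else (e.g. the decoder's cross-attention)
    # keeps its random initialization.
    state = torch.load(checkpoint_path, map_location="cpu")
    pretrained = state.get("model", state)  # some checkpoints wrap the weights
    for module in (encoder, decoder):
        own = module.state_dict()
        compatible = {k: v for k, v in pretrained.items()
                      if k in own and own[k].shape == v.shape}
        module.load_state_dict(compatible, strict=False)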