Dear authors,
I understand this repo isn't primarily intended for supervised MT, but your codebase contains a Transformer encoder-decoder model and, more importantly, it is much simpler than the standard supervised MT codebases (e.g. T2T, Fairseq, OpenNMT).
With the intention of reproducing the WMT14 En-De SOTA performance, I used the data & BPE from Fairseq and trained the Transformer base model (emb_dim=512) with only mt_steps="en-de" on 4x 2080 Ti (a single GPU gave even lower scores). I ended up with a tokenized BLEU score of 25.63 with beam_size 4 and length_penalty 0.6, which is more than 1 BLEU lower than reported in the Transformer paper.
Training script:
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en/fairseq --lgs 'en-de' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 6000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0,1,2,3'
Translate results:
My intuition is that the model structure is slightly different (GELU, layer norm, etc.). May I ask whether you have tried this codebase on the supervised WMT14 benchmark, and what your thoughts are on this?
Best.
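For concreteness, here is a minimal sketch of how a tokenized BLEU score like the one above can be computed, assuming sacrebleu and already-tokenized hypothesis/reference files (hyp.de / ref.de are placeholder names; this is not necessarily the exact scoring pipeline used above, and tokenized BLEU is only comparable across identical tokenization):

import sacrebleu

# Placeholder file names; one tokenized sentence per line, BPE merges already undone.
with open("hyp.de", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.de", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# tokenize="none" because the text is already tokenized; force=True silences the
# warning about pre-tokenized input. The resulting score is only comparable to
# other scores computed on the same tokenization.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="none", force=True)
print(f"BLEU = {bleu.score:.2f}")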
Yes, unfortunately I tried as well, and I have never been able to reproduce fairseq results with XLM on supervised tasks; there was always a difference of 1 or 2 BLEU. This is a bit annoying, because if we could match the supervised results we would probably also be better on the unsupervised / semi-supervised tasks, etc.
I really don't think that the differences in architecture (we have one extra layer norm after the embeddings, I believe, and when I compared I didn't use GELU) can explain the difference in BLEU. There are a couple of things fairseq has that we don't, such as a smoothed softmax (label smoothing) and checkpoint averaging, and I think these are the features we are missing in order to get SOTA results in supervised MT. If you see other features that may explain the differences, I can try to implement them and retry on the supervised task.
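For reference, a rough PyTorch sketch of those two features (not XLM or fairseq code; the fairseq versions differ in details such as padding masks and how the smoothing mass is distributed):

import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(logits, target, eps=0.1):
    # logits: (batch, vocab); target: (batch,)
    # Mix the one-hot target with a uniform distribution over the vocabulary.
    lprobs = F.log_softmax(logits, dim=-1)
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -lprobs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * smooth).mean()

def average_checkpoints(paths):
    # Average parameter tensors over several saved state_dicts
    # ("checkpoint averaging"; the Transformer paper averages the last checkpoints).
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}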
Hi @yilinyang7, I wanted to pretrain a language model with the MLM objective and use it for supervised MT on En-De translation. However, I am unable to do so because of an error. If you did supervised machine translation using the pretrained language model, could you please elaborate on the steps to follow? It would be great if you could share the commands you used. Kindly share your progress.
Thanks in advance.
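Not an official answer, but if I remember the README correctly, the intended route is to pass the pretrained MLM checkpoint to train.py via --reload_model 'encoder_path,decoder_path' together with MT options like the ones above. Conceptually this just initializes the encoder (and decoder, where shapes match) from the pretrained weights before supervised training starts; a generic PyTorch sketch of that idea, with hypothetical names, and not XLM's actual reload code:

import torch

def reload_pretrained(encoder, decoder, checkpoint_path):
    # Copy every parameter whose name and shape match the pretrained MLM
    # checkpoint; everything else (e.g. the decoder's cross-attention)
    # keeps its random initialization.
    state = torch.load(checkpoint_path, map_location="cpu")
    pretrained = state.get("model", state)  # some checkpoints wrap the weights
    for module in (encoder, decoder):
        own = module.state_dict()
        compatible = {k: v for k, v in pretrained.items()
                      if k in own and own[k].shape == v.shape}
        module.load_state_dict(compatible, strict=False)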