I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}_${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict_${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${data_dir}/data-bin --path ./maskPredict_${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

At first I got a BLEU score of only 9. Then I noticed that the dictionary in data-bin differed from the dictionary shipped with the saved model, so I manually removed the "finalize" call when saving the dictionary in preprocess.py (it looked like an optimization only anyway). This made the dictionaries identical, but the BLEU score was still only 16. Inspecting the outputs manually, they look reasonable: there is no apparent mismatch between the trained model's vocabulary and the vocabulary used at evaluation. The one salient issue is that around 10% of the tokens in both source and target are UNK, which seems rather high.
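For context on why removing that call changes things: in fairseq, Dictionary.finalize() re-sorts entries by frequency, applies the frequency threshold, and pads the vocabulary to a multiple of `padding_factor` with `madeupword` symbols, so token-to-index assignments can stop matching the dictionary shipped with the checkpoint. A minimal sketch of the reordering (the toy words are made up; the API is the fairseq `Dictionary` this repo forks, as far as I can tell):

```python
from fairseq.data.dictionary import Dictionary

# Build a tiny dictionary; the counts decide the order after finalize().
d = Dictionary()
for word in ["low", "lower", "newest", "newest"]:
    d.add_symbol(word)

before = [d[i] for i in range(len(d))]
d.finalize(padding_factor=8)  # re-sorts by count, pads with madeupword symbols
after = [d[i] for i in range(len(d))]

print(before)  # insertion order: ..., 'low', 'lower', 'newest'
print(after)   # frequency order: ..., 'newest', 'low', 'lower', 'madeupword0000', ...
```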
Does anyone know what the issue might be? Is it possible that the uploaded model is not the one that produced the BLEU scores reported in the paper?
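One way to sanity-check the ~10% figure is to encode the raw test file with the model's dictionary, the same way preprocess.py would, and count UNKs directly. A quick sketch, assuming the fairseq `Dictionary` API this fork inherits and hypothetical en-de file paths:

```python
from fairseq.data.dictionary import Dictionary

# Assumed paths: the dictionary shipped with the released model and the
# raw newstest source file being binarized.
dictionary = Dictionary.load("maskPredict_en_de/dict.en.txt")

total = unk = 0
with open("newstest2014.en", encoding="utf-8") as f:
    for line in f:
        # append_eos=False so EOS tokens don't dilute the UNK rate
        ids = dictionary.encode_line(line, add_if_not_exist=False, append_eos=False)
        total += ids.numel()
        unk += (ids == dictionary.unk()).sum().item()

print(f"UNK rate: {unk / total:.1%} ({unk}/{total} tokens)")
```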
Actually, if you download the data referenced by this file, you can reproduce the results from the paper almost perfectly: https://github.com/facebookresearch/Mask-Predict/blob/master/get_data.sh. It is unfortunate that there seems to be some pre-tokenization step that is not included in this repository.
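For anyone else landing here: the Stanford-hosted newstest files are word-tokenized but not BPE-segmented, so they will not line up with the released dictionaries. As a rough sketch of the kind of pre-tokenization that appears to be missing — Moses tokenization followed by BPE is my assumption, and the BPE-codes path is hypothetical, since the repo does not ship that step:

```python
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

# Hypothetical inputs: the raw source file and the BPE codes the released
# model was trained with (not distributed in this repo, as far as I can see).
tokenizer = MosesTokenizer(lang="en")
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

with open("newstest2014.en", encoding="utf-8") as fin, \
        open("newstest2014.bpe.en", "w", encoding="utf-8") as fout:
    for line in fin:
        tokenized = tokenizer.tokenize(line.strip(), return_str=True)
        fout.write(bpe.process_line(tokenized) + "\n")
```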