This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Not able to reproduce the BLEU score with the saved model for en-de or de-en #10

Closed
mdragus opened this issue Apr 25, 2020 · 1 comment

Comments


mdragus commented Apr 25, 2020

I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}_${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict_${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${data_dir}/data-bin --path ./maskPredict_${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

At first I got a BLEU score of only 9. Then I noticed that the dictionary in data-bin differed from the dictionary shipped with the saved model, so I manually removed the "finalize" call when saving the dictionary in preprocess.py (it looked like an optimization only anyway). This made the dictionaries identical, but the BLEU score was still only 16. Inspecting the outputs manually, they look reasonable in the sense that there is no apparent mismatch between the trained model's vocabulary and the vocabulary used in evaluation. The one salient issue is that around 10% of the tokens in both source and target come out as UNK, which seems rather high.
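A toy sketch (not fairseq's actual binarization code) of why mismatched preprocessing shows up as a high UNK rate: any token the model's dictionary has never seen maps to `<unk>`, and raw text that was never BPE-split rarely matches a BPE vocabulary. The vocabulary and sentences below are invented for illustration.

```python
def unk_rate(tokens, vocab):
    """Fraction of tokens that would map to <unk> under the given vocabulary."""
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

# Hypothetical BPE vocabulary: "house" only exists as the subwords "ho@@" + "use".
bpe_vocab = {"the", "ho@@", "use", "is", "red"}

raw_tokens = "the house is red".split()     # raw text, BPE never applied
bpe_tokens = "the ho@@ use is red".split()  # matches the training preprocessing

print(unk_rate(raw_tokens, bpe_vocab))  # 0.25 — mismatched preprocessing
print(unk_rate(bpe_tokens, bpe_vocab))  # 0.0
```

So a nonzero UNK rate on the source side is a strong hint that the test data was not run through the same tokenization/BPE pipeline as the training data, independent of any dictionary truncation.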

Does anyone know what the possible issue is? Is it possible that the uploaded model is not the one that got the BLEU scores reported in the paper?


mdragus commented Apr 26, 2020

Actually, if you download the data referenced by this file, you can reproduce the results from the paper almost perfectly: https://github.com/facebookresearch/Mask-Predict/blob/master/get_data.sh. It is unfortunate that there seems to be a pre-tokenization step that is not included in this repository.
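For anyone hitting the same wall: the missing step is presumably Moses-style tokenization followed by BPE segmentation, which is what get_data.sh applies before binarization. A toy sketch of greedy BPE merge application with fairseq-style "@@ " continuation markers — the merge rules here are invented for illustration, not the actual learned codes:

```python
def apply_bpe(word, merges):
    """Greedily apply learned BPE merge rules, in priority order, to one word."""
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # this pair was learned as a merge
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # fairseq marks every subword except the last with a trailing "@@"
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]

merges = [("h", "o"), ("u", "s"), ("us", "e")]
print(apply_bpe("house", merges))  # ['ho@@', 'use']
```

Evaluating on text that skipped this step produces exactly the UNK-heavy, low-BLEU output described above, since whole words like "house" never appear in the subword dictionary.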

@mdragus mdragus closed this as completed Apr 26, 2020