I tried reproducing the results with the provided saved model on the newstest corpora found here: https://nlp.stanford.edu/projects/nmt/. I ran the following commands:

python preprocess.py --source-lang ${src} --target-lang ${tgt} --testpref ${raw_text}/newstest2014 --destdir ${data_dir}/data-bin --workers 60 --srcdict ${model_path}/maskPredict_${src}_${tgt}/dict.${src}.txt --tgtdict ${model_path}/maskPredict_${src}_${tgt}/dict.${tgt}.txt

python generate_cmlm.py ${data_dir}/data-bin --path ./maskPredict_${src}_${tgt}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 2 --decoding-iterations 10 --dehyphenate --decoding-strategy mask_predict

At first I got a BLEU score of only 9. Then I noticed that the dictionary in data-bin differed from the dictionary shipped with the saved model, so I manually removed the "finalize" call when saving the dictionary in preprocess.py (it looked like an optimization only anyway). This made the dictionaries identical, but the BLEU score was still only 16. Inspecting the outputs manually, they look reasonable: there is no apparent mismatch between the trained model's vocabulary and the vocabulary used at evaluation. The one salient issue is that around 10% of the tokens in both source and target are UNK, which seems rather high.
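For context on why removing that call changes things: in fairseq, Dictionary.finalize() re-sorts entries by frequency, applies the frequency threshold, and pads the vocabulary to a multiple of `padding_factor` with `madeupword` symbols, so token-to-index assignments can stop matching the dictionary shipped with the checkpoint. A minimal sketch of the reordering (the toy words are made up; the API is the fairseq `Dictionary` this repo forks, as far as I can tell):

```python
from fairseq.data.dictionary import Dictionary

# Build a tiny dictionary; the counts decide the order after finalize().
d = Dictionary()
for word in ["low", "lower", "newest", "newest"]:
    d.add_symbol(word)

before = [d[i] for i in range(len(d))]
d.finalize(padding_factor=8)  # re-sorts by count, pads with madeupword symbols
after = [d[i] for i in range(len(d))]

print(before)  # insertion order: ..., 'low', 'lower', 'newest'
print(after)   # frequency order: ..., 'newest', 'low', 'lower', 'madeupword0000', ...
```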
Does anyone know what the issue might be? Is it possible that the uploaded model is not the one that produced the BLEU scores reported in the paper?
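One way to sanity-check the ~10% figure is to encode the raw test file with the model's dictionary, the same way preprocess.py would, and count UNKs directly. A quick sketch, assuming the fairseq `Dictionary` API this fork inherits and hypothetical en-de file paths:

```python
from fairseq.data.dictionary import Dictionary

# Assumed paths: the dictionary shipped with the released model and the
# raw newstest source file being binarized.
dictionary = Dictionary.load("maskPredict_en_de/dict.en.txt")

total = unk = 0
with open("newstest2014.en", encoding="utf-8") as f:
    for line in f:
        # append_eos=False so EOS tokens don't dilute the UNK rate
        ids = dictionary.encode_line(line, add_if_not_exist=False, append_eos=False)
        total += ids.numel()
        unk += (ids == dictionary.unk()).sum().item()

print(f"UNK rate: {unk / total:.1%} ({unk}/{total} tokens)")
```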
Actually, if you download the data referenced by this file, you can reproduce the results from the paper almost perfectly: https://github.com/facebookresearch/Mask-Predict/blob/master/get_data.sh. It is unfortunate that there seems to be some pre-tokenization step that is not included in this repository.
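For anyone else landing here: the Stanford-hosted newstest files are word-tokenized but not BPE-segmented, so they will not line up with the released dictionaries. As a rough sketch of the kind of pre-tokenization that appears to be missing — Moses tokenization followed by BPE is my assumption, and the BPE-codes path is hypothetical, since the repo does not ship that step:

```python
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

# Hypothetical inputs: the raw source file and the BPE codes the released
# model was trained with (not distributed in this repo, as far as I can see).
tokenizer = MosesTokenizer(lang="en")
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

with open("newstest2014.en", encoding="utf-8") as fin, \
        open("newstest2014.bpe.en", "w", encoding="utf-8") as fout:
    for line in fin:
        tokenized = tokenizer.tokenize(line.strip(), return_str=True)
        fout.write(bpe.process_line(tokenized) + "\n")
```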