parfda WMT'15 SMT Datasets
We make available, for research purposes, the English, Czech, Finnish, French, German, and Russian datasets used to build the parfda Moses SMT systems.
Reference translations for the test set are available from http://www.statmt.org/wmt15/translation-task.html. Results are reported in the WMT'15 (http://www.statmt.org/wmt15/) paper cited below.
Citation:
Ergun Biçici, Qun Liu, and Andy Way. ParFDA for Fast Deployment of Accurate Statistical Machine Translation Systems, Benchmarks, and Statistics. In Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September 2015.
The datasets and the SMT results can serve as a benchmark for SMT research, including work that applies further linguistic processing. The datasets also allow fast deployment of accurate SMT systems.
The language model corpora contain 15M sentences, some of which were selected by ParFDA from LDC Gigaword corpora:
[5 use the LDC English Gigaword 5th edition]
- Czech - English
- Finnish - English
- French - English
- German - English
- Russian - English
[1 use the LDC French Gigaword 3rd edition]
- English - French
LICENSE: Dublin City University License for Open Data, which allows use for research and academic purposes.