chardmeier edited this page Oct 24, 2013 · 10 revisions

Docent - Document-Level Local Search Decoder for Phrase-Based SMT

Docent is a decoder for phrase-based Statistical Machine Translation (SMT). Unlike most existing SMT decoders, it treats complete documents, rather than single sentences, as translation units and permits the inclusion of features with cross-sentence dependencies to facilitate the development of discourse-level models for SMT. Docent implements the local search decoding approach described by Hardmeier et al. (EMNLP 2012).

Docent is aimed at researchers who want to develop discourse-wide models for SMT without being hampered by the locality constraints of dynamic-programming beam search, the search algorithm used by most other SMT decoders. This is an area of active research. Docent provides you with a framework to do your own research and development; the models it comes with are unlikely to improve your SMT system when used out of the box. If you're looking for a mature SMT system to use in a production environment, Docent is not what you want. If you're excited by unsolved research problems and you want to find out how you can improve SMT to translate texts as texts rather than bags of sentences, try it out!
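To make the contrast with sentence-level beam search concrete, here is a toy sketch of hill-climbing local search over a whole document. It is not Docent's implementation: the function names, the per-sentence candidate lists and the "consistency" feature are all assumptions made for illustration, and the search state is simplified to one candidate choice per sentence. The point is only that the score is computed over the complete document, so a feature can depend on several sentences at once.

```python
import random

def document_score(choices, candidates):
    """Toy document-level score (hypothetical, for illustration only).

    candidates[i] is a list of (translation, local_score) pairs for sentence i;
    choices[i] is the index of the candidate currently selected for sentence i.
    """
    local = sum(candidates[i][c][1] for i, c in enumerate(choices))
    words = {candidates[i][c][0] for i, c in enumerate(choices)}
    # Cross-sentence feature: penalise every distinct translation beyond the first.
    consistency_penalty = len(words) - 1
    return local - consistency_penalty

def hill_climb(candidates, steps=200, seed=0):
    """Hill climbing: propose a random single-sentence change, keep it only if it improves the score."""
    rng = random.Random(seed)
    # Start from the locally best candidate for each sentence.
    choices = [max(range(len(c)), key=lambda j: c[j][1]) for c in candidates]
    best = document_score(choices, candidates)
    for _ in range(steps):
        i = rng.randrange(len(candidates))      # pick a sentence
        j = rng.randrange(len(candidates[i]))   # pick a candidate for it
        if j == choices[i]:
            continue
        proposal = choices[:i] + [j] + choices[i + 1:]
        score = document_score(proposal, candidates)
        if score > best:                        # accept only improvements
            choices, best = proposal, score
    return choices, best

# Three sentences that each translate the same ambiguous source word.
toy_candidates = [
    [("bank", -1.0), ("shore", -1.5)],
    [("bank", -1.25), ("shore", -1.0)],
    [("bank", -1.0), ("shore", -1.5)],
]

if __name__ == "__main__":
    print(hill_climb(toy_candidates))
```

In this toy example the sentence-by-sentence best choices are inconsistent across the document, and the search trades a slightly worse local score in the second sentence for a consistent translation throughout. A sentence-level decoder cannot see that trade-off because the consistency feature is not local to any one sentence.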

If you use Docent in your published work, please cite our ACL 2013 system demonstration paper:

@inproceedings{Hardmeier:2013,
    Author = {Hardmeier, Christian and Stymne, Sara and Tiedemann, J\"{o}rg and Nivre, Joakim},
    Booktitle = {Proceedings of the 51st Annual Meeting of the
            Association for Computational Linguistics: System Demonstrations},
    Month = {August},
    Pages = {193--198},
    Publisher = {Association for Computational Linguistics},
    Title = {Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation},
    Address = {Sofia, Bulgaria},
    Year = {2013}}

To refer to the document-level search algorithm used in Docent, you should cite our EMNLP 2012 paper instead:

@inproceedings{Hardmeier:2012a,
    Author = {Hardmeier, Christian and Nivre, Joakim and Tiedemann, J\"{o}rg},
    Booktitle = {Proceedings of the 2012 Joint Conference on Empirical
            Methods in Natural Language Processing and Computational Natural Language
            Learning},
    Month = {July},
    Pages = {1179--1190},
    Publisher = {Association for Computational Linguistics},
    Title = {Document-Wide Decoding for Phrase-Based Statistical Machine Translation},
    Address = {Jeju Island, Korea},
    Year = {2012}}

Getting Started

Installing Docent can be a bit tricky because of its library dependencies. Start with the instructions in the README file; you may need to do some troubleshooting of your own to resolve the problems you encounter. If you can't manage to solve them, or if you find errors in the code, contact us at docent (at) stp.lingfil.uu.se.

Docent doesn't currently include training code, so we recommend that you use the Moses toolkit to train your models. Docent reads binary phrase tables created with the Moses tool processPhraseTable and language models in the KenLM probing hash binary format. Use the example configuration files provided in the tests/config directory and the instructions in the README file as a starting point for your own systems. The configuration file format is partly documented on the Docent Configuration wiki page.

Input files can be provided in NIST-XML format. See the end of NIST's 2009 MT evaluation plan for a description of this format. It is very similar to the file format used by the WMT shared tasks, but unfortunately the files distributed by WMT don't conform exactly. In particular, you will have to add an XML header and surrounding <mteval> tags to use the .sgm files from WMT with Docent.
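The wrapping step described above can be sketched in a few lines of Python. This is a minimal sketch, not part of Docent: the function names `wrap_sgm` and `convert_file` are hypothetical, and the code assumes that the .sgm file is UTF-8 encoded and is otherwise well-formed XML. If a file contains characters that need escaping in XML (for example a bare `&`), it will need further cleaning that this sketch does not attempt.

```python
from pathlib import Path

XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'

def wrap_sgm(sgm_text: str) -> str:
    """Add an XML header and surrounding <mteval> tags to the contents of a .sgm file.

    Assumption: the input has neither an XML header nor <mteval> tags yet.
    """
    body = sgm_text.strip()
    if body.startswith("<?xml") or "<mteval" in body:
        raise ValueError("input already has an XML header or <mteval> tags")
    return f"{XML_HEADER}<mteval>\n{body}\n</mteval>\n"

def convert_file(src: str, dst: str) -> None:
    """Read a .sgm file (assumed UTF-8) and write the wrapped version to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(wrap_sgm(text), encoding="utf-8")
```

A call such as `convert_file("input.sgm", "input.xml")` (file names are placeholders) writes the wrapped file; it is worth checking the result with an XML parser before passing it to the decoder.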

If your models require annotated input (e.g. with coreference links), the MMAX format is also available. The models included in the distribution don't require MMAX input.

Acknowledgements

Developing Docent would have been impossible if we hadn't been able to draw upon the work in Moses and KenLM.

The name Docent (for Document-Centered Translator) was invented by Sara Stymne.