This repository holds the scripts, models and descriptions resulting from neural machine translation research at TartuNLP. A live demo of the latest models is available at translate.ut.ee. Below you will find a brief description of the latest approach, links to the trained MT models, and the source code for running the models as an MT server and an API service.
Our first MT project was called "KaMa: kasutatav eesti masintõlge" (Usable Estonian Machine Translation). Kama is also a national Estonian food item :-)
Our current approach is multilingual multi-domain neural machine translation: a single NMT model translates between several languages and is also aware of the domain of the text it translates.
More specifically, we use the Transformer architecture with the output language and text domain as additional input information. The approach is described in this paper.
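As a minimal illustration of the idea (not the actual preprocessing code of this repository), the sketch below prepends a target-language tag and a domain tag to a tokenized source sentence so that a single model can be steered to different outputs. The tag format and the helper `tag_source` are assumptions for illustration only; the real format follows the paper.

```python
# Illustrative only: the exact tag format, and whether tags are separate tokens
# or input factors, follows the paper rather than this sketch.
def tag_source(tokens, tgt_lang, domain):
    """Prepend hypothetical target-language and domain tags to a tokenized sentence."""
    return [f"<2{tgt_lang}>", f"<{domain}>"] + tokens

src = "Sie können kirjutada daudz gemischt языки .".split()

# The same code-switched input, steered towards two different target languages:
print(" ".join(tag_source(src, "en", "subtitles")))
print(" ".join(tag_source(src, "et", "subtitles")))
```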
Besides multilingual translation, the approach exhibits interesting additional functionality, such as handling code-switched input and monolingual zero-shot translation, which can be used for error correction and style adaptation. Some examples from our current 7-language model:
- Sie können kirjutada daudz gemischt языки, and see переведёт kõik into vienu keelde. -> You can write a lot of mixed languages, and it translates everything into one language.
- Sie können kirjutada daudz gemischt языки, and see переведёт kõik into vienu keelde. -> Te võite kirjutada palju sega keeli, ja see tõlgib kõik ühte keelde.
- Sie können kirjutada daudz gemischt языки, and see переведёт kõik into vienu keelde. -> Вы можете написать много смешанных языков, и это переводит все в одно язык.
- Ich legen Buch an Regal neben Tisch. -> Ich lege das Buch an Regal neben dem Tisch.
- Ma arvab et homme miski põnev näeb. -> Ma arvan, et homme näeb midagi põnevat.
- Наш программа переводит текст с ошибок в правильную. -> Наша программа переводит текст с ошибками в правильный.
English correction does not work quite as well, although there are occasional successful examples:
- I be large reader, I has big library. -> I am a big reader, I have a big library.
Cross-lingual:
- That is freaky -> See on kohutav. (formal) / See on vastik. (informal)
- That is freaky -> Это ужасно. (formal) / Это отвратительно. (informal)
Monolingual:
- Kes oled? -> Kes te olete? (formal)
- Wer bist du? -> Wer sind Sie? (formal)
- I will be remunerated. -> I'll be rewarded. (informal)
All our models are currently trained with Sockeye using open parallel corpora, pre-processed with our truecaser and Google's SentencePiece.
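As a rough sketch of the subword preprocessing step, SentencePiece can be applied to the (already truecased) training data as shown below. The file names, vocabulary size and other settings here are illustrative assumptions, not the settings used for the released models.

```python
import sentencepiece as spm

# Train a subword model on the truecased training text.
# "train.tc.txt", "sp" and the vocabulary size are placeholder values.
spm.SentencePieceTrainer.train(
    input="train.tc.txt", model_prefix="sp", vocab_size=32000)

sp = spm.SentencePieceProcessor(model_file="sp.model")

# Segment a sentence into subword pieces before translation ...
pieces = sp.encode("see on näide lausest", out_type=str)
print(pieces)

# ... and merge the pieces back together after translation.
print(sp.decode(pieces))
```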
Models with their language and domain combinations:
- English-German-French
  - Domains/corpora: Europarl-OpenSubtitles-JRCAcquis
  - Sockeye version 1.18.56
- English-Estonian-Latvian
  - Domains/corpora: Europarl-OpenSubtitles-JRCAcquis-EMEA
  - Sockeye version 1.18.56
- English-Estonian-Latvian-Russian
  - Domains/corpora: Europarl-OpenSubtitles-JRCAcquis-EMEA-UNcorpus-DGTTM-ParaCrawl-NewsCommentary
  - Sockeye version 1.18.56
- English-Estonian-Latvian-Lithuanian-Russian-German-Finnish
  - Domains/corpora: Europarl-OpenSubtitles-JRCAcquis-EMEA-DGTTM-ParaCrawl-NewsCommentary
  - Sockeye version 1.18.106
NMT provider implementation: Nazgul
NMT API server implementation: Sauron
Integration with translation frameworks:
The work has been part of several projects and collaborations. National projects:
- KaMa: kasutatav eesti masintõlge (Usable Estonian Machine Translation), 2015--2017, funded by NPELT
- Neurotõlge: Adaptive, Multilingual and Reliable Machine Translation for Estonian, 2018--2020, funded by NPELT
Related projects:
- Bergamot, Horizon 2020 Research and Innovation Action, grant agreement No 825303