Skip to content
This repository has been archived by the owner on Feb 2, 2022. It is now read-only.
/ kama Public archive

Models, scripts and reports for multilingual multi-domain neural machine translation, developed at TartuNLP / University of Tartu

License

Notifications You must be signed in to change notification settings

TartuNLP/kama

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 

Repository files navigation

Kama

This repository holds the scripts, models and descriptions for the output of research of neural machine translation at TartuNLP. A live demo of the latest models is available at translate.ut.ee. Below you will find a brief description of the latest approach, links to trained MT models and source code for running the models as an MT server and an API service.

Why "Kama"?

Our first MT project was called "KaMa: kasutatav eesti masintõlge" (Usable Estonian Machine Translation). Kama is also a national Estonian food item :-)

Description

Our current approach is multilingual multi-domain neural machine translation. That means that a single NMT model translates between several languages and is also aware of the text domain of the texts it works with.

More specifically, we use the Transformer with the output language and text domain as additional input information. The approach is described in this paper.

Besides multilingual translation, the approach exhibits interesting additional functionality, such as handling code-switching and monolingual zero-shot translation, that can be used for error correction and style adaptation. Some examples from our current 7-language model:

Code switching examples

  • Sie können kirjutada daudz gemischt языки, and see переведёт kõik into vienu keelde. -> You can write a lot of mixed languages, and it translates everything into one language.
  • Sie können kirjutada daudz gemischt языки, and see переведёт kõik into vienu keelde. -> Te võite kirjutada palju sega keeli, ja see tõlgib kõik ühte keelde.
  • Sie können kirjutada daudz gemischt языки, and see переведёт kõik into vienu keelde. -> Вы можете написать много смешанных языков, и это переводит все в одно язык.

Error correction examples

  • Ich legen Buch an Regal neben Tisch. -> Ich lege das Buch an Regal neben dem Tisch.
  • Ma arvab et homme miski põnev näeb. -> Ma arvan, et homme näeb midagi põnevat.
  • Наш программа переводит текст с ошибок в правильную. -> Наша программа переводит текст с ошибками в правильный.

English correction does not work all too well, with some rare examples:

  • I be large reader, I has big library. -> I am a big reader, I have a big library.

Style adaptation examples

Cross-lingual:

  • That is freaky -> See on kohutav. (formal) / See on vastik. (informal)
  • That is freaky -> Это ужасно. (formal) / Это отвратительно. (informal)

Monolingual:

  • Kes oled? -> Kes te olete? (formal)
  • Wer bist du? -> Wer sind Sie? (formal)
  • I will be remunerated. -> I'll be rewarded. (informal)

Models

All our models are currently trained with Sockeye using open parallel corpora, pre-processed with our truecaser and Google's SentencePiece.

Models with their language and domain combinations:

  • English-German-French
    • Domains/corpora: Europarl-OpenSubtitles-JRCAcquis
    • Sockeye version 1.18.56
  • English-Estonian-Latvian
    • Domains/corpora: Europarl-OpenSubtitles-JRCAcquis-EMEA
    • Sockeye version 1.18.56
  • English-Estonian-Latvian-Russian
    • Domains/corpora: Europarl-OpenSubtitles-JRCAcquis-EMEA-UNcorpus-DGTTM-ParaCrawl-NewsCommentary
    • Sockeye version 1.18.56
  • English-Estonian-Latvian-Lithuanian-Russian-German-Finnish
    • Domains/corpora: Europarl-OpenSubtitles-JRCAcquis-EMEA-DGTTM-ParaCrawl-NewsCommentary
    • Sockeye version 1.18.106

Models on owncloud.ut.ee

Software

NMT provider implementation: Nazgul

NMT API server implementation: Sauron

Integration with translation frameworks:

Projects

The work has been part of several projects and collaborations. National projects:

  • KaMa: kasutatav eesti masintõlge (Usable Estonian Machine Translation), 2015--2017, funded by NPELT
  • Neurotõlge: Adaptive, Multilingual and Reliable Machine Translation for Estonian, 2018--2020, funded by NPELT

Related projects:

  • Bergamot, Horizon 2020 Research and Innovation Action, grant agreement No 825303

About

Models, scripts and reports for multilingual multi-domain neural machine translation, developed at TartuNLP / University of Tartu

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published