Japanese--Russian--English News Commentary Parallel Data
This repository contains manually curated parallel sentences for the Japanese--Russian, Japanese--English, and Russian--English language pairs in the news domain.
Japanese--Russian is one of the most distant language pairs, and only a limited quantity of parallel data is available for training machine translation (MT) systems. To promote research on low-resource MT, we curated parallel sentences that can be used as development and test data, through the following procedure:
- We downloaded the OPUS News Commentary data for Japanese--Russian (586 sentence pairs) and Japanese--English (637 sentence pairs).
- These Japanese--Russian and Japanese--English data share many lines on the Japanese side, so we first compiled them into Russian--Japanese--English tri-text data.
- In each line, we identified the corresponding parts across languages and split off unaligned parts into new lines.
- As a result, we obtained 1,654 lines of data comprising trilingual, bilingual, and monolingual segments (mainly sentences).
- For the sake of comparability, we randomly chose 600 trilingual sentences to create a test set, and concatenated the remaining trilingual sentences with the bilingual sentences to form development sets.
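The automatic parts of the procedure above (joining the two bitexts on their shared Japanese side, then drawing a random trilingual test set) can be sketched in Python. This is a minimal illustration, not the script used to build this dataset: the function names `build_tritext` and `split_test_dev`, the exact-match join on Japanese lines, and the `seed` parameter are all assumptions, and the manual alignment and splitting of unaligned parts is not reproduced.

```python
import random


def build_tritext(ja_ru_pairs, ja_en_pairs):
    """Join two bitexts on the Japanese side (assumes exact string match).

    Each input is a list of (japanese, other_language) tuples.
    Returns a list of (ja, ru, en) rows; a missing side is None.
    """
    ru_by_ja = {ja: ru for ja, ru in ja_ru_pairs}
    en_by_ja = {ja: en for ja, en in ja_en_pairs}
    # Keep Japanese lines in first-seen order across both inputs.
    ja_lines = list(dict.fromkeys(
        [ja for ja, _ in ja_ru_pairs] + [ja for ja, _ in ja_en_pairs]
    ))
    return [(ja, ru_by_ja.get(ja), en_by_ja.get(ja)) for ja in ja_lines]


def split_test_dev(rows, test_size, seed=0):
    """Sample `test_size` trilingual rows as the test set.

    The remaining trilingual rows plus all bilingual rows form the
    development pool; monolingual rows are left out.
    """
    def n_filled(row):
        return sum(x is not None for x in row)

    trilingual = [r for r in rows if n_filled(r) == 3]
    bilingual = [r for r in rows if n_filled(r) == 2]

    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    test_idx = set(rng.sample(range(len(trilingual)), test_size))

    test = [r for i, r in enumerate(trilingual) if i in test_idx]
    dev = [r for i, r in enumerate(trilingual) if i not in test_idx]
    return test, dev + bilingual
```

Calling `split_test_dev(rows, 600)` on the full tri-text would mirror the split size described above. A development set for a single language pair can then be derived from the development pool by keeping the rows in which both sides of that pair are present.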
Distribution of tri-texts
Development and test splits (available in this repository)
| System description | Resources used | Ja-to-Ru | Ru-to-Ja |
|---|---|---|---|
| Uni-directional Transformer NMT | (a) | 0.70 | 1.96 |
| Multi-to-multi Transformer NMT involving English | (a) | 3.72 | 8.35 |
| Same but with multi-lingual multi-stage fine-tuning | (a) (b) (c) (d) | 7.49 | 12.10 |
The data used for the above systems are as follows:
(a) Global Voices parallel data retrieved from OPUS (v2015; included in this repository)
(b) ASPEC: Asian Scientific Paper Excerpt Corpus (out-of-domain Japanese--English parallel data)
(c) UN provided for WMT 18 (out-of-domain Russian--English parallel data)
(d) Yandex provided for WMT 18 (out-of-domain Russian--English parallel data)
- Aizhan Imankulova, Raj Dabre, Atsushi Fujita, and Kenji Imamura. Exploiting Out-of-Domain Parallel Data through Multilingual Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 17th Machine Translation Summit (MT Summit), Aug., 2019. (to appear)
- The National Institute of Information and Communications Technology (henceforth, NICT) has made the database publicly available under the license conditions specified below.
- NICT bears no responsibility for the contents of the database and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the database.
- If any copyright infringement or other problems are found in the database, please contact us at atsushi.fujita[at]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
The dataset was developed as part of work at the Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology.