How Should Markup Tags Be Translated?

This repository provides the dev and test sets used by the paper "How Should Markup Tags Be Translated?" by Greg Hanneman and Georgiana Dinu, published at WMT 2020. The sets are intended to help develop, test, and compare machine translation techniques that treat segment-internal markup tags in addition to plain-text content.

Format

The sets are released as parallel text files, in UTF-8 encoding, with one segment per line and one language per file. Languages covered are English (en), German (de), French (fr), and Hungarian (hu). Inline markup tags are represented according to the XLIFF 1.2 standard.

As in the paper, we distinguish three classes of dev and/or test sets:

  • The "EUR-Lex" dev and test sets: data/{dev,test}-eurlex.{en,de,fr,hu}
  • The "EUR-Lex mono" test set: data/test-eurlex-mono.en
  • The "Glossary" dev and test sets: data/{dev,test}-glossary.{en,de,fr,hu}
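Since the files are plain UTF-8 text with one segment per line, and the files of a given set are line-aligned across languages, loading a language pair amounts to reading two files in parallel. Below is a minimal loading sketch in Python; the paths follow the naming scheme above, while the helper name load_parallel is ours and not part of the release.

# Minimal sketch: load the EUR-Lex dev set as aligned (English, German) segment pairs.
def load_parallel(src_path, tgt_path):
    """Read two line-aligned UTF-8 files and return a list of (src, tgt) pairs."""
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.rstrip("\n") for line in f_src]
        tgt_lines = [line.rstrip("\n") for line in f_tgt]
    assert len(src_lines) == len(tgt_lines), "set files are expected to be line-parallel"
    return list(zip(src_lines, tgt_lines))

pairs = load_parallel("data/dev-eurlex.en", "data/dev-eurlex.de")
print(len(pairs))   # 1888 segment pairs in the EUR-Lex dev set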

Construction of the Sets

These dev and test sets are derived from EUR-Lex, the European Union's online repository of legal documents, which are published in parallel in several structured formats and in the EU's 24 official languages. Each language version of a document is available as a separate monolingual download. In this work, we use only English, German, French, and Hungarian. The sets below differ in how the raw EUR-Lex documents have been processed.

EUR-Lex Dev and Test

The "EUR-Lex" sets are derived from a cohesive block of documents in Microsoft Word format, from CELEX numbers 52019DC0601 through 52019DC0680. Certain documents with the same CELEX number consist of multiple files; these are kept separate and distinguished with suffixes "_1", "_2", etc.

For each set of four monolingual documents, we first extracted each one from Word to XLIFF format using the open-source Okapi Tikal document filter. Aside from performing automatic paragraph and sentence segmentation according to pre-defined rules, the Okapi filter also converts the inline Microsoft markup to XLIFF 1.2 tags. We then checked the extracted XLIFF documents for parallelism at several levels:

  • Any document set that does not contain the same number of paragraphs across all four languages is entirely rejected.
  • Any paragraph that does not have the same number of sentences across languages is skipped.
  • Any sentence that does not have the same set of XLIFF tags across all languages is likewise skipped.

Each successfully extracted document is then assigned to either the dev or the test set. The dev set is made up of documents 601_1, 601_2, 604, 617_2, 625, 630, 633, 640_2, 650, and 652. Documents 610, 615, 616, 617_1, 637_1, 637_2, 638, 639, 640_1, and 641 are in the test set. Segments in each set are deduplicated to unique sentence 4-tuples. This leaves 1888 lines in the EUR-Lex dev set and 1450 in the EUR-Lex test set.
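To make the sentence-level check and the deduplication concrete, the sketch below compares the inline XLIFF tags found in each language's version of a sentence and keeps only unique 4-tuples whose tag inventories agree. It is an illustration under our own assumptions; the tag regular expression and the exact comparison are not taken from the actual filtering code.

import re

# Illustrative only: keep a 4-way tuple of segments when all four languages contain the
# same inventory of inline XLIFF tags, then deduplicate identical tuples.
XLIFF_TAG = re.compile(r"</?(?:g|x|bx|ex|bpt|ept|ph|it)\b[^>]*>")

def tag_inventory(segment):
    """Sorted tuple of inline XLIFF tags occurring in one segment."""
    return tuple(sorted(XLIFF_TAG.findall(segment)))

def filter_and_dedup(quads):
    """quads: iterable of (en, de, fr, hu) string tuples; returns the kept, unique tuples."""
    kept, seen = [], set()
    for quad in quads:
        if len({tag_inventory(seg) for seg in quad}) != 1:
            continue          # tag sets differ across languages: skip the sentence
        if quad in seen:
            continue          # duplicate sentence 4-tuple: skip
        seen.add(quad)
        kept.append(quad)
    return kept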

EUR-Lex Mono Test

The "EUR-Lex mono" test set is made up of the same documents assigned to the EUR-Lex test set, above. In this case, only the English documents are extracted to XLIFF, and all of the parallelism restrictions are dropped. After deduplication of English sentences, this referenceless test set contains 2525 lines.

Glossary Dev and Test

The "Glossary" sets are derived from another cohesive block of documents in Microsoft Word format, from CELEX numbers 52019DC0520 to 52019DC0599.

Sets of monolingual documents are extracted and filtered as for the EUR-Lex sets, with the additional step of removing all inline tags to obtain plain text. Given four-way parallel segments, we then searched within each line for a synchronous occurrence of entries from in-house, human-curated translation glossaries for EN--DE, EN--FR, and EN--HU. Where a match was found, the terms were surrounded by a pair of identical XLIFF <g>...</g> tags in each language.
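As a rough picture of the tagging step, the sketch below wraps one glossary entry in matching <g> tags on both sides of a segment pair. The whole-word matching policy, the id numbering, and the example sentences are our own assumptions for illustration; they are not the exact procedure or data used to build the sets.

import re

# Hypothetical illustration: wrap a glossary entry (src_term, tgt_term) in identical
# <g> tags on both sides, if it occurs as a whole word in both segments.
def tag_term_pair(src_seg, tgt_seg, src_term, tgt_term, tag_id=1):
    src_pat = re.compile(r"\b" + re.escape(src_term) + r"\b")
    tgt_pat = re.compile(r"\b" + re.escape(tgt_term) + r"\b")
    if not (src_pat.search(src_seg) and tgt_pat.search(tgt_seg)):
        return None                                   # no synchronous occurrence
    wrap = lambda m: '<g id="%d">%s</g>' % (tag_id, m.group(0))
    return src_pat.sub(wrap, src_seg, count=1), tgt_pat.sub(wrap, tgt_seg, count=1)

# Constructed example (not taken from the released data):
tagged = tag_term_pair("The internal market remains open.",
                       "Der Binnenmarkt bleibt offen.",
                       "internal market", "Binnenmarkt")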

The successfully tagged segments from each document are assigned to either the dev or the test set. The dev set contains documents 526, 530, 531, 532, 533, 540, 542, 548, 560, 566, 594, and 598. In the test set are documents 520, 521, 522, 523, 524, 525, 527, 528, 529, 534, 541, and 597. The final sizes are 286 lines for dev and 289 for test.

Citation

If you use these tagged data sets in your work, please cite the publication

@inproceedings{HannemanDinu-MTMarkupTags,
    title = "How Should Markup Tags Be Translated?",
    author = "Hanneman, Greg and Dinu, Georgiana",
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}

Note that this work is currently forthcoming in November 2020.

License

This data is re-released under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. We acknowledge the European Commission and EUR-Lex as the original source of the data.

Contact

For general questions about either these data sets or the accompanying publication, contact Greg Hanneman at ghannema@amazon.com. See also the CONTRIBUTING file for details on how to contribute to this repository.
