biomedical-translation-corpora/corpora

Welcome to the repositories of the WMT Biomedical Translation Task

Here we host various datasets that we have compiled for the Biomedical Translation Task at WMT.

  • Medline dataset of titles and abstracts of scientific publications (FR/EN, PT/EN, ES/EN, DE/EN, ZH/EN, RO/EN, IT/EN, RU/EN)
  • Scielo dataset of scientific publications (FR/EN, PT/EN, ES/PT)
  • EDP dataset of scientific publications (FR/EN)
  • ReBEC clinical trials (PT/EN)

List of corpora

Medline corpus

datasets    2016     2017     2018     2019     2020     2021     2022
training    WMT'16   -        -        WMT'19   WMT'20   -        WMT'22 [1]
test set    -        -        WMT'18   WMT'19   WMT'20   WMT'21   WMT'22

[1] The parallel abstracts can be retrieved from Medline using our script, wmtbio22_train_data.py. It uses Biopython, and you will need a valid email address to access the data in Medline.
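
To illustrate why an email address is needed, here is a minimal sketch of the kind of request such a retrieval involves. It is not the wmtbio22_train_data.py script and does not use Biopython; it builds a request to the NCBI E-utilities EFetch endpoint with the standard library only. The function names, the tool label, and the example PMIDs are hypothetical and chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, email, tool="wmt-biomedical-example"):
    """Build an EFetch URL for a list of PubMed IDs (hypothetical helper).

    NCBI asks clients to identify themselves with a contact email,
    which is why the retrieval script needs a valid address.
    """
    if not email or "@" not in email:
        raise ValueError("a valid contact email is required")
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "medline",  # plain-text MEDLINE record format
        "retmode": "text",
        "tool": tool,          # assumed label; any descriptive name works
        "email": email,
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_medline_records(pmids, email):
    """Download the records as text (performs a network request)."""
    with urlopen(build_efetch_url(pmids, email)) as response:
        return response.read().decode("utf-8")
```

Calling `build_efetch_url([123, 456], "you@example.org")` only constructs the URL, so it can be inspected without contacting NCBI; `fetch_medline_records` performs the actual download.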

training 2016 2019 2020 2022
en/es x x x
en/fr x x x
en/pt x x x
en/de x x
en/it x x
en/ru x x
test set 2018 2019 2020 2021
en/es x x x x
en/fr x x x x
en/pt x x x x
en/de x x x x
en/zh x x x x
en/ro x
en/it x x
en/ru x x

Scielo corpus

test set               2016          2017
en/es, en/fr, en/pt    test WMT'16   test WMT'17

training               parallel      monolingual
en/es, en/fr, en/pt    training      monolingual

EDP corpus

test set 2017
en/fr test WMT'17

ReBEC corpus

training
en/pt dataset

Publications

Please cite our publications if you use our corpora.

(WMT'22 Biomedical Task) Neves M, Jimeno Yepes A, Siu A, Roller R, Thomas P, Vicente Navarro M, Yeganova L, Wiemann D, Di Nunzio GM, Vezzani F, Gerardin C, Bawden R, Johan Estrada D, Lima-Lopez S, Farre-Maduel E, Krallinger M, Grozea C, Neveol A. Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports. PDF and BibTeX

(WMT'21 Biomedical Task) Yeganova L, Wiemann D, Neves M, Vezzani F, Siu A, Jauregi Unanue I, Oronoz M, Mah N, Névéol A, Martinez D, Bawden R, Di Nunzio GM, Roller R, Thomas P, Grozea C, Perez-de-Viñaspre O, Vicente Navarro M and Jimeno Yepes A. Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set, 6th Conference on Machine Translation, EMNLP 2021. PDF and BibTeX

(WMT'20 Biomedical Task) Bawden R, Di Nunzio GM, Grozea C, Jauregi Unanue I, Jimeno Yepes A, Mah N, Martinez D, Neveol A, Neves M, Oronoz M, Perez de Viñaspre O, Piccardi M, Roller R, Siu A, Thomas P, Vezzani F, Vicente Navarro M, Wiemann D, Yeganova L. Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages, 5th Conference on Machine Translation, EMNLP 2020, online. PDF and BibTeX

(Survey of Authors’ Abstract Writing Practice) Neveol A, Jimeno Yepes A, Neves M. MEDLINE as a Parallel Corpus: a Survey to Gain Insight on French-, Spanish- and Portuguese-Speaking Authors’ Abstract Writing Practice, 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France. PDF and BibTeX

(WMT'19 Biomedical Task) Bawden R, Bretonnel Cohen K, Grozea C, Jimeno Yepes A, Kittner M, Krallinger M, Mah N, Neveol A, Neves M, Soares F, Siu A, Verspoor K, Vicente Navarro M. Findings of the WMT 2019 Biomedical Translation Shared Task: Evaluation for MEDLINE Abstracts and Biomedical Terminologies, 4th Conference on Machine Translation, ACL 2019, Florence, Italy. PDF and BibTeX

(WMT'18 Biomedical Task) Neves M, Jimeno Yepes A, Névéol A, Grozea C, Siu A, Kittner M, Verspoor K. Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets, Proceedings of the Third Conference on Machine Translation (WMT) at EMNLP, 2018, Brussels, Belgium. PDF and BibTeX

(Parallel Biomedical Corpora) Névéol A, Jimeno Yepes A, Neves M, Verspoor K. Parallel Corpora for the Biomedical Domain, International Conference on Language Resources and Evaluation (LREC), 2018, Miyazaki, Japan. PDF and BibTeX

(WMT'17 Biomedical Task) Jimeno Yepes A, Névéol A, Neves M, Verspoor K, Bojar O, Boyer A, Grozea C, Haddow B, Kittner M, Lichtblau Y, Pecina P, Roller R, Rosa R, Siu A, Thomas P, Trescher S. Findings of the WMT 2017 Biomedical Translation Shared Task, Proceedings of the Second Conference on Machine Translation (WMT17) at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark. PDF and BibTeX

(WMT'16 Biomedical Task) Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Névéol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K and Zampieri M. Findings of the 2016 Conference on Machine Translation, ACL 2016, Proceedings of the First Conference on Machine Translation (WMT16), pp. 131-198, 2016, Berlin, Germany. PDF and BibTeX

(Scielo corpus) Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portorož, Slovenia. PDF and BibTeX

Support or Contact

Please contact us by mail. Please also join our discussion forum.

  • Antonio Jimeno Yepes (RMIT University, Australia)
  • Aurélie Névéol (LIMSI, CNRS, France)
  • Mariana Neves (German Federal Institute for Risk Assessment, Germany)
