CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus (CC0 Licensed)

CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

License: CC0-1.0

End-to-end speech translation (E2E ST) has recently attracted increased interest owing to its system simplicity, lower inference latency and fewer compounding errors compared to cascaded systems (speech recognition + machine translation). E2E ST model training, however, is often hampered by the lack of parallel data. Thus, we created CoVoST, a large and diverse multilingual speech-to-text translation corpus based on Common Voice (2019-06-12 release). It includes speech in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), along with transcripts and English translations. We also provide an additional out-of-domain evaluation set from Tatoeba for 5 languages (French, German, Dutch, Russian and Spanish), translated into English.

Please check out our paper for more details and the VizSeq example for exploring CoVoST data.

CoVoST Statistics

What's New

  • 2020-02-27: Example added for exploring CoVoST data with VizSeq
  • 2020-02-13: Paper accepted to LREC 2020 (Oral)
  • 2020-02-07: CoVoST released

Getting Data


  1. Download the 2019-06-12 release of Common Voice (NOT the latest 2019-12-10 one from the web page) for speeches and transcripts:

  2. Download translations for all the 11 languages, where validated.<lang>_en.en are matched with the transcripts in validated.tsv.
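Once both downloads are in place, the transcripts and translations can be paired up. The following is a minimal sketch, not part of the CoVoST tooling. It assumes that validated.<lang>_en.en contains one translation per line, in the same order as the data rows of validated.tsv, and that validated.tsv has a header row with `path` and `sentence` columns. The function name `pair_transcripts_with_translations` is hypothetical.

```python
import csv


def pair_transcripts_with_translations(tsv_path, translation_path):
    """Return a list of (clip path, transcript, English translation) tuples.

    Assumption (not confirmed by the README): the translation file is
    line-aligned with the data rows of the TSV, i.e. line i of the
    translation file translates row i (after the header) of the TSV.
    """
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quote characters inside sentences as literal text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)

    with open(translation_path, encoding="utf-8") as f:
        translations = [line.rstrip("\n") for line in f]

    # Fail loudly rather than silently mis-pairing sentences.
    if len(rows) != len(translations):
        raise ValueError(
            f"{len(rows)} transcripts but {len(translations)} translations"
        )

    # "path" and "sentence" are assumed column names in validated.tsv.
    return [(r["path"], r["sentence"], t) for r, t in zip(rows, translations)]
```

For French, for example, this could be called as `pair_transcripts_with_translations("fr/validated.tsv", "validated.fr_en.en")`, where both paths are placeholders for wherever the files were extracted. The length check is there because a mismatched Common Voice release (such as the 2019-12-10 one) would otherwise yield misaligned pairs without any error.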

Tatoeba Evaluation Data

  1. Download transcripts and translations and extract files to data/tt/*.

  2. Download speech data:

python --root <mp3 download root (default to data/tt/mp3)>

Exploring Data

VizSeq Example


License

  • CoVoST data: CC0
  • Tatoeba sentences: CC BY 2.0 FR
  • Tatoeba speeches: Various CC licenses (please check out data/tt/tatoeba_s2t.<lang>_en.<lang>_lic)
  • Anything else: CC BY-NC 4.0


Please cite as

    title={CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus},
    author={Changhan Wang and Juan Pino and Anne Wu and Jiatao Gu},


Changhan Wang, Juan Miguel Pino, Jiatao Gu
