CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

License: CC0-1.0

End-to-end speech translation (E2E ST) has recently attracted increased interest because of its simpler system design, lower inference latency and fewer compounding errors compared with cascaded systems (speech recognition + machine translation). E2E ST model training, however, is often hampered by the lack of parallel data. We therefore created CoVoST, a large and diverse multilingual speech-to-text translation corpus based on Common Voice (2019-06-12 release). It includes speech in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), along with transcripts and English translations. We also provide an additional out-of-domain evaluation set from Tatoeba for translation from 5 languages (French, German, Dutch, Russian and Spanish) into English.

Please check out our paper for more details and the VizSeq example for exploring CoVoST data.

CoVoST Statistics

What's New

  • 2020-02-27: Example added for exploring CoVoST data with VizSeq
  • 2020-02-13: Paper accepted to LREC 2020 (Oral)
  • 2020-02-07: CoVoST released

Getting Data

CoVoST

  1. Download the 2019-06-12 release of Common Voice (NOT the latest 2019-12-10 one from the web page) for speeches and transcripts:

  2. Download the translations for all 11 languages. Each validated.<lang>_en.en file is matched with the transcripts in the corresponding validated.tsv.
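
Once both downloads are in place, the transcripts and translations can be paired up. The following is a minimal sketch, not part of the released tooling. It assumes that "matched" means line-by-line alignment (the N-th line of validated.<lang>_en.en translates the N-th data row of validated.tsv) and that validated.tsv has a header row with `path` and `sentence` columns. The function name `load_pairs` and its parameters are hypothetical; adjust the column names if your copy of Common Voice differs.

```python
import csv


def load_pairs(tsv_path, translation_path,
               audio_column="path", text_column="sentence"):
    """Pair each transcript row with the translation on the same line.

    Assumption: validated.tsv has a header row, and the translation file
    has exactly one line per data row, in the same order.
    """
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)

    with open(translation_path, encoding="utf-8") as f:
        translations = [line.rstrip("\n") for line in f]

    # Fail loudly instead of silently misaligning the two files.
    if len(rows) != len(translations):
        raise ValueError(
            f"{len(rows)} transcript rows but {len(translations)} translations"
        )

    return [
        (row[audio_column], row[text_column], translation)
        for row, translation in zip(rows, translations)
    ]
```

For French, for example, this could be called as `load_pairs("fr/validated.tsv", "validated.fr_en.en")`, where both paths are placeholders for wherever the files were extracted. Each returned tuple holds the audio file name, the source transcript and the English translation.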

Tatoeba Evaluation Data

  1. Download transcripts and translations and extract files to data/tt/*.

  2. Download speech data:

python get_tt_speech.py --root <mp3 download root (default to data/tt/mp3)>

Exploring Data

VizSeq Example

License

| | License |
|---|---|
| CoVoST data | CC0 |
| Tatoeba sentences | CC BY 2.0 FR |
| Tatoeba speeches | Various CC licenses (please check out data/tt/tatoeba_s2t.<lang>_en.<lang>_lic) |
| Anything else | CC BY-NC 4.0 |

Citation

Please cite as

@misc{wang2020covost,
    title={CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus},
    author={Changhan Wang and Juan Pino and Anne Wu and Jiatao Gu},
    year={2020},
    eprint={2002.01320},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contact

Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com)
