Skip to content

Exgc/AVMuST-TED

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

AVMuST-TED

The first Audio-Visual Multi-lingual Speech Translation dataset.

Paper: MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Introduction

AVMuST-TED contains over 706 hours of videos and the translation text of 4 different languages.

Target Language Hours Sents Vocab Tokens Labeling Type
Spanish (Es) 198h 258K 95K 2.0M Machine Translation + Manual Review
French (Fr) 185h 244K 91K 1.9M Machine Translation + Manual Review
Italian (It) 165h 218K 95K 1.6M Machine Translation + Manual Review
Portuguese (Pt) 158h 205K 84K 1.5M Machine Translation + Manual Review

Acquisition of AVMuST-TED

  • Transcription & Translation

    The transcription and translations in the AVMuST-TED dataset are taken directly from the high reliability translated subtitles in TED. All translations and transcriptions are performed strictly in accordance with the TED Translation Guidelines. The transcription and translations go through the following steps before publication:

    ./resource/TED.png

    1. Transcription

      TED provides an original transcript for all TED and TED-Ed content. For TEDx talks, volunteers are able to utilize auto-generated transcriptions as a base, or create their own from scratch.

    2. Translation

      Subtitles are translated from the original language into the target language, using a dynamic subtitle editor. Note that TEDx transcripts must be reviewed and published before any translations can be created.

    3. Review

      Before publication, subtitles are reviewed by an experienced volunteer. Once reviewed, subtitles publish directly to the TED.com video player and YouTube for the world to enjoy.

  • Audio and Visual Speech Preparation

    We follow the previous audio-video speech dataset acquisitions pipeline to extract audio speech and visual speech (i.e., talking head). The visual speech corpus is provided as face-centered video in .avi files with a resolution of 224*224 and a frame rate of 25fps. The audio speech corpus is provided as the single-track, 16-bit 16kHz .wav files.

  • AV Synchronization and Active Speaker Detection

    To detect active speakers, SyncNet is further used to determine which face's lip movements match the audio, and if there is no match, the clip is rejected as being a voice-over.

Tasks on AVMuST-TED

  • Visual Speech Translation (Lip Translation)
Method M En-Es En-Fr En-It En-Pt
Cascaded V 12.7 11.3 11.5 13.2
AV-Hubert V 14.2 12.6 12.9 14.8
Cascaded A(+Noise) 16.0 12.9 12.6 15.5
AV-Hubert A(+Noise) 17.6 14.5 14.1 17.1
MixSpeech(ours) V 18.5 15.1 14.3 17.2
  • Audio Speech Translation
Src-Tgt Method M -20db -10db 0db 10db 20db Avg. Clean
En-Es Cascaded A 1.4±0.1 5.8±0.2 21.1±0.3 25.5±0.3 26.3±0.2 16.0 26.6
AV-Hubert A 1.5±0.2 6.7±0.2 22.3±0.4 27.7±0.2 28.6±0.3 17.6 28.9
En-Fr Cascaded A 1.3±0.2 4.5±0.3 16.6±0.4 20.9±0.3 21.3±0.1 12.9 21.7
AV-Hubert A 1.4±0.2 5.5±0.3 18.5±0.4 23.2±0.2 23.6±0.1 14.5 23.9
En-It Cascaded A 0.9±0.3 4.0±0.3 16.1±0.2 20.7±0.1 21.2±0.2 12.6 21.5
AV-Hubert A 1.0±0.2 5.1±0.5 18.3±0.3 22.7±0.2 23.6±0.2 14.1 23.8
En-Pt Cascaded A 1.1±0.3 5.4±0.5 20.1±0.4 24.9±0.2 26.0±0.1 15.5 26.2
AV-Hubert A 1.2±0.2 6.3±0.4 22.2±0.3 27.4±0.3 28.4±0.1 17.1 28.6
  • Audio-Visual Speech Translation
Src-Tgt Method M -20db -10db 0db 10db 20db Avg. Clean
En-Es Cascaded AV 6.7±0.2 15.3±0.4 24.6±0.4 26.3±0.2 26.7±0.2 19.9 26.9
AV-Hubert AV 6.9±0.3 16.4±0.5 26.6±0.3 28.7±0.1 28.9±0.2 21.5 29.1
En-Fr Cascaded AV 4.6±0.1 11.4±0.5 19.4±0.3 21.5±0.2 22.0±0.2 15.8 22.3
AV-Hubert AV 4.9±0.2 12.1±0.3 21.6±0.4 23.7±0.3 24.3±0.1 17.3 24.6
En-It Cascaded AV 4.8±0.3 11.8±0.4 19.5±0.3 21.4±0.2 22.1±0.1 15.9 22.3
AV-Hubert AV 5.0±0.4 12.4±0.6 21.9±0.3 23.7±0.1 24.1±0.2 17.4 24.5
En-Pt Cascaded AV 5.8±0.4 13.8±0.6 23.5±0.4 25.8±0.2 26.3±0.1 19.0 26.4
AV-Hubert AV 6.1±0.3 15.5±0.4 26.0±0.3 28.2±0.3 28.6±0.2 20.9 28.8
  • Other Cross-Lingual Audio-Visual Speech Tasks

AVMuST-TED can also be used for quantitative verification of cross-lingual multimodal generation tasks, such as Cross-lingual Talking Head Generation.

Download

BaiduDisk 链接: https://pan.baidu.com/s/1Yr6tm56ygEJpCCKe37cAQg?pwd=b9up

License

CC-BY-NC 4.0

Citation

@article{cheng2023mixspeech,
  title={MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition},
  author={Cheng, Xize and Li, Linjun and Jin, Tao and Huang, Rongjie and Lin, Wang and Wang, Zehan and Liu, Huangdai and Wang, Ye and Yin, Aoxiong and Zhao, Zhou},
  journal={arXiv preprint arXiv:2303.05309},
  year={2023}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published