Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
The Chinese-LiPS dataset is a multimodal dataset designed for audio-visual speech recognition (AVSR) in Mandarin Chinese. This dataset combines speech, video, and textual transcriptions to enhance automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.
- Total Duration: 100.84 hours
- Number of Speakers: 207 professional speakers
- Number of Clips: 36,208 video clips
- Audio Format: Stereo WAV, 48 kHz sampling rate
- Video Format:
  - Slide Video: 1080p resolution, 30 fps
  - Lip-Reading Video: 720p resolution, 30 fps
- Annotations: JSON format with transcriptions and extracted text from slides
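The audio format listed above can be checked on a downloaded clip before any resampling. The following is a minimal sketch using only Python's standard `wave` module; it assumes the files are plain PCM WAV, and the helper names `describe_wav` and `matches_raw_spec` are illustrative and not part of the dataset's tooling.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, sample_rate_hz, duration_s) read from a PCM WAV header."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


def matches_raw_spec(source):
    """True if the file is stereo at 48 kHz, the format listed for the raw audio."""
    channels, rate, _ = describe_wav(source)
    return channels == 2 and rate == 48_000


# Self-contained demo: one second of 16-bit stereo silence at 48 kHz,
# written to memory so the sketch runs without the dataset on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)  # bytes per sample (assumed 16-bit for the demo only)
    out.setframerate(48_000)
    out.writeframes(b"\x00" * (48_000 * 2 * 2))

buffer.seek(0)
print(describe_wav(buffer))
```

`describe_wav` also accepts a file path, so it can be pointed at a `.wav` file from the dataset once it has been extracted.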
Split | Duration (hrs) | # Segments | # Speakers |
---|---|---|---|
Train | 85.37 | 30,341 | 175 |
Validation | 5.35 | 1,959 | 11 |
Test | 10.12 | 3,908 | 21 |
Total | 100.84 | 36,208 | 207 |
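As a quick sanity check, the per-split figures in the table can be summed and compared against the Total row. The sketch below copies the numbers from the table; the function names are illustrative only.

```python
# (duration in hours, number of segments, number of speakers), copied from the table
SPLITS = {
    "train": (85.37, 30_341, 175),
    "validation": (5.35, 1_959, 11),
    "test": (10.12, 3_908, 21),
}
TOTAL = (100.84, 36_208, 207)


def column_sums(splits):
    """Sum each column across splits; hours are rounded to two decimals."""
    hours = round(sum(h for h, _, _ in splits.values()), 2)
    segments = sum(s for _, s, _ in splits.values())
    speakers = sum(p for _, _, p in splits.values())
    return hours, segments, speakers


def hour_shares(splits):
    """Percentage of the total duration that falls in each split."""
    total = sum(h for h, _, _ in splits.values())
    return {name: round(100 * h / total, 1) for name, (h, _, _) in splits.items()}


def mean_clip_seconds(total):
    """Average clip length in seconds implied by the Total row."""
    hours, segments, _ = total
    return hours * 3600 / segments


print(column_sums(SPLITS) == TOTAL)
print(hour_shares(SPLITS))
print(round(mean_clip_seconds(TOTAL), 1))
```

The speaker counts adding up to the total is consistent with each speaker appearing in only one split.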
The dataset is structured into several compressed files:
- image.zip: First-frame images from slide videos (used for OCR and vision-language models).
- pptdataset.zip: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.
- train.zip, test.zip, val.zip: Data split into training, testing, and validation sets. Each contains:

      ├── ID1_age_gender_topic/
      │   ├── WAV/
      │   │   ├── ID1_age_gender_topic_001.json      # Annotation file
      │   │   └── ID1_age_gender_topic_001.wav       # Audio file (48 kHz)
      │   ├── PPT/
      │   │   └── ID1_age_gender_topic_001_PPT.mp4   # Slide video (1080p 30fps)
      │   └── FACE/
      │       └── ID1_age_gender_topic_001_FACE.mp4  # Lip-reading video (720p 30fps)
      └── ...
- meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields. The TOPIC field uses abbreviations of the Chinese topic names:
  - DZJJ: E-sports & Gaming
  - JKYS: Health & Wellness
  - KJ: Science & Technology
  - LY: Travel & Exploration
  - QC: Automobile & Industry
  - RWLS: Culture & History
  - TY: Sports & Competitions
  - YS: Movies & TV Series
  - ZX: Others
- meta_test.json: Includes OCR and InternVL2 prompts for the test set. Fields:
  - wav_path: Path to the audio file.
  - ppt_path: Path to the first-frame image of the slide video.
  - ocr_text: Text extracted by PaddleOCR.
  - vl2_text: Text extracted by InternVL2.
  - gt_text: Ground truth transcription of the audio.
  - ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
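The layout and metadata fields described above can be navigated with a few small helpers. The following is a sketch under stated assumptions, not the authors' loader: it assumes speaker folders are named `ID_age_gender_topic` with the topic code last, that clip indices are zero-padded to three digits as in `_001`, and that meta_test.json is a JSON array of objects carrying the listed keys. The folder name `ID1_25_F_KJ` and all function names are hypothetical.

```python
import json
from pathlib import Path

# Topic codes and English names as listed for the TOPIC field.
TOPIC_NAMES = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}

TEST_KEYS = {"wav_path", "ppt_path", "ocr_text", "vl2_text", "gt_text", "ocr_vl2_text"}


def clip_paths(split_root, speaker_dir, clip_index):
    """Build the four per-clip paths implied by the directory tree."""
    stem = f"{speaker_dir}_{clip_index:03d}"  # assumed three-digit index
    base = Path(split_root) / speaker_dir
    return {
        "json": base / "WAV" / f"{stem}.json",
        "wav": base / "WAV" / f"{stem}.wav",
        "ppt": base / "PPT" / f"{stem}_PPT.mp4",
        "face": base / "FACE" / f"{stem}_FACE.mp4",
    }


def topic_of(speaker_dir):
    """Map the trailing topic code of a folder name to its English name."""
    code = speaker_dir.rsplit("_", 1)[-1]  # assumes the topic code is the last field
    return TOPIC_NAMES.get(code, code)


def parse_test_records(text):
    """Parse meta_test.json content, keeping entries that carry every listed key."""
    records = json.loads(text)  # assumed to be a JSON array of objects
    return [r for r in records if TEST_KEYS <= r.keys()]


def load_test_records(path):
    """Read meta_test.json from disk and parse it."""
    return parse_test_records(Path(path).read_text(encoding="utf-8"))


paths = clip_paths("train", "ID1_25_F_KJ", 1)  # hypothetical folder name
print(paths["face"].name, "|", topic_of("ID1_25_F_KJ"))
```

If the real folder names or JSON structure differ from these assumptions, only `clip_paths`, `topic_of`, and `parse_test_records` need adjusting.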
You can download the dataset from the following sources:
- Download from OneDrive
- Download from Hugging Face
- Download from Baidu Netdisk (Password: vg2a)
@misc{zhao2025chineselipschineseaudiovisualspeech,
title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides},
author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
year={2025},
eprint={2504.15066},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2504.15066}
}