Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

⭐ Introduction

The Chinese-LiPS dataset is a multimodal dataset for audio-visual speech recognition (AVSR) in Mandarin Chinese. It combines speech audio, lip-reading video, presentation-slide video, and text transcriptions, with the goal of improving automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.

🚀 Dataset Details

Total Duration: 100.84 hours
Number of Speakers: 207 professional speakers
Number of Clips: 36,208 video clips
Audio Format: Stereo WAV, 48 kHz sampling rate
Video Format:
- Slide Video: 1080p resolution, 30 fps
- Lip-Reading Video: 720p resolution, 30 fps
Annotations: JSON format with transcriptions and extracted text from slides
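
To confirm that a downloaded audio file matches the format listed above, its header can be inspected with Python's standard `wave` module. This is a minimal sketch, assuming the files are plain PCM WAV; the function names `wav_info` and `matches_raw_format` are illustrative and not part of the dataset tooling.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return channels, rate, duration


def matches_raw_format(path, channels=2, rate=48000):
    """True if the file is stereo at 48 kHz, as listed for the raw audio."""
    ch, sr, _ = wav_info(path)
    return ch == channels and sr == rate
```

Calling `matches_raw_format` on a file from `pptdataset.zip` would be expected to return `False`, since that archive is described as holding 16 kHz audio.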

Dataset Statistics

Split	Duration (hrs)	# Segments	# Speakers
Train	85.37	30,341	175
Validation	5.35	1,959	11
Test	10.12	3,908	21
Total	100.84	36,208	207

📂 Dataset Organization

The dataset is structured into several compressed files:

image.zip: First-frame images from slide videos (used for OCR and vision-language models).
pptdataset.zip: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.

train.zip, test.zip, val.zip: Data split into training, testing, and validation sets. Each contains:

├── ID1_age_gender_topic/
│   ├── WAV/
│   │   ├── ID1_age_gender_topic_001.json  # Annotation file
│   │   ├── ID1_age_gender_topic_001.wav   # Audio file (48 kHz)
│   ├── PPT/
│   │   ├── ID1_age_gender_topic_001_PPT.mp4  # Slide video (1080p 30fps)
│   ├── FACE/
│   │   ├── ID1_age_gender_topic_001_FACE.mp4  # Lip-reading video (720p 30fps)
├── ...
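
The layout above can be traversed with a short script that pairs each audio file with its annotation and videos. This is a sketch that assumes the split archive has been extracted and that every clip follows the naming pattern shown in the tree; `collect_clips` and `missing_files` are hypothetical helper names.

```python
from pathlib import Path


def collect_clips(split_dir):
    """Yield one dict per clip found under an extracted split directory."""
    for wav in sorted(Path(split_dir).glob("*/WAV/*.wav")):
        session_dir = wav.parent.parent  # e.g. ID1_age_gender_topic/
        stem = wav.stem                  # e.g. ID1_age_gender_topic_001
        yield {
            "id": stem,
            "wav": wav,
            "json": wav.with_suffix(".json"),
            "ppt": session_dir / "PPT" / f"{stem}_PPT.mp4",
            "face": session_dir / "FACE" / f"{stem}_FACE.mp4",
        }


def missing_files(clip):
    """Return the keys of a clip dict whose files are absent on disk."""
    return [k for k in ("wav", "json", "ppt", "face") if not clip[k].exists()]
```

Running `missing_files` over every clip is a quick way to check that an archive was extracted completely before training.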

meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields.

The TOPIC field uses abbreviations derived from the Chinese topic names: DZJJ = E-sports & Gaming, JKYS = Health & Wellness, KJ = Science & Technology, LY = Travel & Exploration, QC = Automobile & Industry, RWLS = Culture & History, TY = Sports & Competitions, YS = Movies & TV Series, ZX = Others.
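
As an illustration of how the TOPIC codes might be used, the sketch below counts clips per topic in one of the metadata files. It assumes the CSV is comma-delimited, UTF-8 encoded, and has a header row containing the field names listed above; `TOPIC_NAMES` and `topic_counts` are names introduced here for the example.

```python
import csv
from collections import Counter

# Mapping taken from the abbreviation list in this README.
TOPIC_NAMES = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}


def topic_counts(meta_csv_path):
    """Count rows per topic; unknown codes are kept under their raw code."""
    counts = Counter()
    # utf-8-sig tolerates an optional byte-order mark (an assumption, not a known property of the files).
    with open(meta_csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            code = row["TOPIC"]
            counts[TOPIC_NAMES.get(code, code)] += 1
    return counts
```

For example, `topic_counts("meta_train.csv")` would return a `Counter` keyed by the English topic names.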

meta_test.json: Includes OCR and InternVL2 prompts for the test set. Each entry has the following fields:

wav_path: Path to the audio file.
ppt_path: Path to the first-frame image of the slide video.
ocr_text: Text extracted by PaddleOCR.
vl2_text: Text extracted by InternVL2.
gt_text: Ground truth transcription of the audio.
ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
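
The fields above can be loaded and checked with a few lines of Python. The top-level structure of meta_test.json is not documented here, so this sketch assumes it is either a JSON array of objects or an object whose values are such entries; `load_test_entries` is a hypothetical helper.

```python
import json

FIELDS = ("wav_path", "ppt_path", "ocr_text", "vl2_text", "gt_text", "ocr_vl2_text")


def load_test_entries(path):
    """Load meta_test.json and verify that every entry carries all six fields."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    # Assumption: a dict at the top level maps some key (e.g. an ID) to an entry.
    if isinstance(entries, dict):
        entries = list(entries.values())
    for i, entry in enumerate(entries):
        missing = [k for k in FIELDS if k not in entry]
        if missing:
            raise KeyError(f"entry {i} is missing fields: {missing}")
    return entries
```

Each returned entry pairs the ground-truth transcription (`gt_text`) with the three slide-text variants, so one variant can be swapped for another as prompt context without touching the audio path.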

📥 Download

You can download the dataset from the following sources:

Download from OneDrive
Download from Huggingface
Download from Baidu Netdisk (Password: vg2a)

📚 Citation

@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides}, 
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}
