
flageval-baai/Chinese-LiPS

Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

Hugging Face Datasets | License: CC BY-NC-SA-4.0 | GitHub Pages | arXiv

⭐ Introduction

The Chinese-LiPS dataset is a multimodal dataset designed for audio-visual speech recognition (AVSR) in Mandarin Chinese. This dataset combines speech, video, and textual transcriptions to enhance automatic speech recognition (ASR) performance, especially in educational and instructional scenarios.

🚀 Dataset Details

  • Total Duration: 100.84 hours
  • Number of Speakers: 207 professional speakers
  • Number of Clips: 36,208 video clips
  • Audio Format: Stereo WAV, 48 kHz sampling rate
  • Video Format:
    • Slide Video: 1080p resolution, 30 fps
    • Lip-Reading Video: 720p resolution, 30 fps
  • Annotations: JSON format with transcriptions and extracted text from slides

Dataset Statistics

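| Split      | Duration (hrs) | # Segments | # Speakers |
|------------|----------------|------------|------------|
| Train      | 85.37          | 30,341     | 175        |
| Validation | 5.35           | 1,959      | 11         |
| Test       | 10.12          | 3,908      | 21         |
| Total      | 100.84         | 36,208     | 207        |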
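
As a quick sanity check on the table above, the per-split figures can be summed and compared with the Total row. The sketch below is illustrative only: the `SPLITS` dictionary and the helper names are hypothetical, and the numbers are copied by hand from the table rather than read from the dataset.

```python
import math

# Split statistics copied from the table above: (hours, segments, speakers).
# These are hand-entered values, not read from the dataset files.
SPLITS = {
    "train": (85.37, 30341, 175),
    "validation": (5.35, 1959, 11),
    "test": (10.12, 3908, 21),
}
TOTAL = (100.84, 36208, 207)  # the "Total" row of the table


def summarize(splits):
    """Sum hours, segments, and speakers over all splits."""
    hours = sum(h for h, _, _ in splits.values())
    segments = sum(s for _, s, _ in splits.values())
    speakers = sum(p for _, _, p in splits.values())
    return hours, segments, speakers


def mean_clip_seconds(hours, segments):
    """Average clip length in seconds for a given duration and clip count."""
    return hours * 3600.0 / segments


if __name__ == "__main__":
    hours, segments, speakers = summarize(SPLITS)
    # Floating-point sums are compared with a tolerance, not exact equality.
    assert math.isclose(hours, TOTAL[0], abs_tol=1e-6)
    assert segments == TOTAL[1]
    assert speakers == TOTAL[2]
    print(f"mean clip length: {mean_clip_seconds(*TOTAL[:2]):.2f} s")
```

The totals work out to an average of roughly 10 seconds per clip. The speaker counts also add up to 207, which is consistent with each speaker appearing in only one split.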

📂 Dataset Organization

The dataset is structured into several compressed files:

  • image.zip: First-frame images from slide videos (used for OCR and vision-language models).

  • pptdataset.zip: Processed data with 16 kHz audio, 96×96 25-frame lip-reading videos, and JSON annotations.

  • train.zip, test.zip, val.zip: Data split into training, testing, and validation sets. Each contains:

    ├── ID1_age_gender_topic/
    │   ├── WAV/
    │   │   ├── ID1_age_gender_topic_001.json  # Annotation file
    │   │   ├── ID1_age_gender_topic_001.wav   # Audio file (48 kHz)
    │   ├── PPT/
    │   │   ├── ID1_age_gender_topic_001_PPT.mp4  # Slide video (1080p 30fps)
    │   ├── FACE/
    │   │   ├── ID1_age_gender_topic_001_FACE.mp4  # Lip-reading video (720p 30fps)
    ├── ...
    
  • meta_all.csv, meta_train.csv, meta_valid.csv, meta_test.csv: Metadata files with ID, TOPIC, WAV, PPT, FACE, and TEXT fields.

    The TOPIC field uses abbreviations derived from the Chinese topic names: DZJJ = E-sports & Gaming, JKYS = Health & Wellness, KJ = Science & Technology, LY = Travel & Exploration, QC = Automobile & Industry, RWLS = Culture & History, TY = Sports & Competitions, YS = Movies & TV Series, ZX = Others.

  • meta_test.json: Includes OCR and InternVL2 prompts for the test set.

    wav_path: Path to the audio file.
    ppt_path: Path to the first-frame image of the slide video.
    ocr_text: Text extracted by PaddleOCR.
    vl2_text: Text extracted by InternVL2.
    gt_text: Ground truth transcription of the audio.
    ocr_vl2_text: OCR text reprocessed by InternVL2 (not a concatenation of PaddleOCR and InternVL2 results).
    
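
The six fields listed for meta_test.json can be read into typed records along the following lines. This is a minimal sketch, not the authors' loader: it assumes the file is a JSON array of objects with exactly those keys, which this README does not confirm. `TestRecord` and `parse_records` are hypothetical names, and the sample values are placeholders.

```python
import json
from dataclasses import dataclass
from typing import List

# Field names taken from the meta_test.json description above.
FIELDS = ("wav_path", "ppt_path", "ocr_text", "vl2_text", "gt_text", "ocr_vl2_text")


@dataclass
class TestRecord:
    wav_path: str      # path to the audio file
    ppt_path: str      # path to the first-frame slide image
    ocr_text: str      # text extracted by PaddleOCR
    vl2_text: str      # text extracted by InternVL2
    gt_text: str       # ground-truth transcription
    ocr_vl2_text: str  # OCR text reprocessed by InternVL2


def parse_records(raw: str) -> List[TestRecord]:
    """Parse JSON text into records. ASSUMPTION: top level is a list of objects."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    records = []
    for index, item in enumerate(data):
        missing = [name for name in FIELDS if name not in item]
        if missing:
            raise KeyError(f"record {index} is missing fields: {missing}")
        records.append(TestRecord(**{name: item[name] for name in FIELDS}))
    return records


# Placeholder record for illustration only; not real dataset content.
SAMPLE = json.dumps([{
    "wav_path": "test/example.wav",
    "ppt_path": "image/example.png",
    "ocr_text": "placeholder OCR text",
    "vl2_text": "placeholder InternVL2 text",
    "gt_text": "placeholder transcription",
    "ocr_vl2_text": "placeholder reprocessed text",
}])

if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(record.wav_path, "->", record.gt_text)
```

To load the real file, pass the result of `open("meta_test.json", encoding="utf-8").read()` to `parse_records`, and adjust the parsing if the actual layout differs from the assumed array.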

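The directory layout and TOPIC codes described above can be turned into path and label lookups as sketched below. The helper names `clip_paths` and `topic_name` are hypothetical. The sketch assumes that clip indices are zero-padded to three digits and that each file stem repeats its folder name, as the `ID1_age_gender_topic_001` example suggests; the naming in the extracted archives may differ.

```python
from pathlib import PurePosixPath

# TOPIC abbreviations as listed in the metadata description above.
TOPICS = {
    "DZJJ": "E-sports & Gaming",
    "JKYS": "Health & Wellness",
    "KJ": "Science & Technology",
    "LY": "Travel & Exploration",
    "QC": "Automobile & Industry",
    "RWLS": "Culture & History",
    "TY": "Sports & Competitions",
    "YS": "Movies & TV Series",
    "ZX": "Others",
}


def topic_name(code: str) -> str:
    """Return the English topic label for a TOPIC abbreviation."""
    return TOPICS[code.upper()]


def clip_paths(split_root: str, speaker_dir: str, clip_index: int) -> dict:
    """Build the expected file paths for one clip.

    ASSUMPTION: the index is zero-padded to three digits and the stem
    repeats the folder name, following the example tree in this README.
    """
    stem = f"{speaker_dir}_{clip_index:03d}"
    base = PurePosixPath(split_root) / speaker_dir
    return {
        "json": base / "WAV" / f"{stem}.json",       # annotation file
        "wav": base / "WAV" / f"{stem}.wav",         # 48 kHz audio
        "ppt": base / "PPT" / f"{stem}_PPT.mp4",     # slide video
        "face": base / "FACE" / f"{stem}_FACE.mp4",  # lip-reading video
    }


if __name__ == "__main__":
    # "ID1_age_gender_topic" is the placeholder folder name from the tree above.
    for kind, path in clip_paths("train", "ID1_age_gender_topic", 1).items():
        print(f"{kind:5s} {path}")
    print(topic_name("KJ"))
```

`PurePosixPath` keeps the sketch independent of the local filesystem; swap in `pathlib.Path` when checking that the files actually exist after extraction.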
📥 Download

You can download the dataset from the following sources:

📚 Citation

@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides}, 
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}
