KMSAV: Korean Multi-speaker Spontaneous Audio-Visual Speech Recognition Dataset

Introduction

We are excited to announce the release of our Korean Audio-Visual Speech Recognition (AVSR) Dataset. The dataset was collected from a wide range of YouTube videos containing dialogues among multiple participants, and is intended for researchers and developers working on audio-visual speech recognition and related fields.

This dataset aims to facilitate advancements in the field of AVSR by providing a diverse and challenging collection of samples. By making this dataset publicly available, we hope to foster collaboration, stimulate new ideas, and contribute to the community's understanding of multimodal speech recognition.

Dataset Description

The KMSAV dataset includes:

83 hours of audio-visual content spoken in Korean
Videos sourced from YouTube, covering various topics and domains
A varying number of participants per video
Varied background noise levels and acoustic environments
High-quality manually verified transcriptions for each video

Data Format

The release consists of:

URLs of the source video files
AVSR metadata containing the transcription and the active speaker's face region for each utterance
A set of scripts that extract the audio-visual utterance clips described by the metadata
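
As a rough illustration of how utterance-level metadata can drive clip extraction, the sketch below builds an ffmpeg command that trims one utterance from a downloaded video and crops it to the speaker's face region. The `Utterance` record and its field names are hypothetical and are not the actual KMSAV metadata schema, and `build_ffmpeg_command` is not one of the repository's scripts. Those scripts remain the reference for real preprocessing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """Hypothetical metadata record; the field names are assumptions."""
    video_path: str  # local path of the downloaded source video
    start: float     # utterance start time, in seconds
    end: float       # utterance end time, in seconds
    x: int           # left edge of the face region, in pixels
    y: int           # top edge of the face region, in pixels
    width: int       # width of the face region, in pixels
    height: int      # height of the face region, in pixels
    text: str        # transcription of the utterance


def build_ffmpeg_command(utt: Utterance, out_path: str) -> List[str]:
    """Return an ffmpeg argument list that trims and crops one utterance."""
    duration = utt.end - utt.start
    if duration <= 0:
        raise ValueError("utterance end must be later than its start")
    # ffmpeg's crop filter takes width:height:x:y of the region to keep.
    crop = f"crop={utt.width}:{utt.height}:{utt.x}:{utt.y}"
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-i", utt.video_path,
        "-ss", f"{utt.start:.3f}",   # seek to the start of the utterance
        "-t", f"{duration:.3f}",     # keep only the utterance duration
        "-vf", crop,
        out_path,
    ]


example = Utterance("video.mp4", 12.0, 13.5, 120, 40, 96, 96, "안녕하세요")
command = build_ffmpeg_command(example, "utt_0001.mp4")
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed. Building the command separately from running it makes it easy to inspect or log before any video is processed.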

Usage Guidelines

To ensure the responsible use of this dataset, please follow these guidelines:

Attribution: Please credit the creators of the dataset by linking back to this repository in your research publications or project documentation.
Privacy: Do not use the dataset to identify or infer sensitive information about individuals featured in the videos.
Non-commercial use: This dataset is made available for academic and research purposes only. Commercial use of the dataset is prohibited.

Download and Data Preparation

To access the dataset, clone this repository and follow the Data Preparation guidelines for dataset preprocessing.

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The source code is licensed under the MIT License. See LICENSE for details.

Citation

If you use this dataset in your research, please cite it as follows:

@misc{kmsav,
    title={KMSAV: Korean Multi-speaker Spontaneous Audio-Visual Speech Recognition Dataset},
    author={Kiyoung Park and Changhan Oh and Sunghee Dong},
    year={2024},
    journal={ETRI Journal},
}

Acknowledgements

This work is supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01376, Development of the multi-speaker conversational speech recognition technology).

Contact

For any questions or concerns related to this dataset, please reach out to us by opening an issue on this repository or by contacting us at pkyoung@etri.re.kr.
