Prepare Datasets for Training

Supported datasets:

  • MDCC
  • AISHELL-1
  • THCHS-30
  • MAGICDATA Mandarin Chinese Read Speech Corpus

Download and Extract MDCC Dataset

sh mdcc.sh

Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022). "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

Download and Extract AISHELL-1 Dataset

sh aishell_1.sh

Download and Extract THCHS-30 Dataset

sh thchs_30.sh

Download and Extract MAGICDATA Mandarin Chinese Dataset

sh magicdata_mcrsc.sh
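
To prepare every supported dataset in one pass, the four commands above can be wrapped in a small helper. This is a minimal sketch, not part of the repository: the function name `prepare_all` is hypothetical, and it assumes that all four scripts sit in the current directory and can be run independently in any order.

```shell
#!/bin/sh
# Hypothetical helper: run each dataset script in turn and stop at the
# first one that is missing or exits with a non-zero status.
# Assumption: the scripts are in the current directory and do not depend
# on one another.
prepare_all() {
    for script in mdcc.sh aishell_1.sh thchs_30.sh magicdata_mcrsc.sh; do
        if [ ! -f "$script" ]; then
            echo "missing: $script" >&2
            return 1
        fi
        echo "running: $script"
        sh "$script" || return 1
    done
}
```

Call `prepare_all` from the directory that contains the scripts. Stopping at the first failure keeps a broken download from going unnoticed behind later output.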