A benchmark corpus for the ASR hypothesis revising task
HypR: A comprehensive study for ASR hypothesis revising with a reference corpus

Yi-Wei Wang, Ke-Han Lu, and Kuan-Yu Chen

| Paper | Download Datasets |

Abstract

With the development of deep learning, automatic speech recognition (ASR) has made significant progress. To further enhance performance, revising recognition results is a lightweight but effective approach. Existing methods can be roughly classified into N-best reranking methods and error correction models. The former aim to select the hypothesis with the lowest error rate from a set of candidates generated by ASR for a given input speech. The latter focus on detecting recognition errors in a given hypothesis and correcting them to obtain an enhanced result. However, we observe that these studies are hardly comparable to each other, as they are usually evaluated on different corpora, paired with different ASR models, and even trained on different datasets. Accordingly, we first concentrate on releasing an ASR hypothesis revising (HypR) dataset in this study. HypR covers several commonly used corpora (AISHELL-1, TED-LIUM 2, and LibriSpeech) and provides 50 recognition hypotheses for each speech utterance. The checkpoints of the ASR models are also published. In addition, we implement and compare several classic and representative methods, showing the recent research progress in revising speech recognition results. We hope the publicly available HypR dataset can become a reference benchmark for subsequent research and advance this line of research.

HypR Dataset

Dataset format

{
  utt_id (str): the utterance ID.
  ref (str): the reference transcript.
  hyps (list[str]): the 50-best hypotheses.
  att_score (list[float]): the attention decoding score from the ASR system for each hypothesis.
  ctc_score (list[float]): the CTC decoding score from the ASR system for each hypothesis.
  lm_score (list[float]): the language model score for each hypothesis. This field is only included when an LM is used.
  score (list[float]): the overall score for each hypothesis.
}

Example

{
 'utt_id': '1272-128104-0000',
 'ref': 'mister quilter is the apostle of the middle classes and we are glad to welcome his gospel',
 'hyps': ['mister quilter is the apostle of the middle classes and we are glad to welcome his gospel',
  'mister quiltter is the apostle of the middle classes and we are glad to welcome his gospel',
  'mister quiltar is the apostle of the middle classes and we are glad to welcome his gospel', ...],
 'att_score': [-1.76402, -10.33629, -4.36253, -5.41844, -7.47313, ...],
 'ctc_score': [-5.13712, -0.45096, -7.37892, -7.42734, -8.10472, ...],
 'score': [-3.45057, -5.39363, -5.87073, -6.42289, -7.78892, ...]
}

Following the default settings of the ESPnet toolkit, the att_score, ctc_score, and lm_score in HypR are calculated by summing the natural-log probability of every token in the hypothesis. The final score is computed by: $\text{score} = [(1 - \lambda_{CTC}) \times \text{att\_score} + \lambda_{CTC}\times\text{ctc\_score}] + \lambda_{LM} \times \text{lm\_score}$

The hyperparameters of $\lambda_{CTC}$ and $\lambda_{LM}$ for each dataset are listed below:

| Dataset | AISHELL-1 | TED-LIUM 2 | LibriSpeech |
| --- | --- | --- | --- |
| $\lambda_{CTC}$ | 0.5 | 0.3 | 0.4 |
| $\lambda_{LM}$ | 0.7 | 0.5 | 0.7 |
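The score combination above can be sketched as a small helper. This is an illustrative reimplementation of the formula, not code from the released toolkit; the function name and default weights are placeholders:

```python
def combine_scores(att_score, ctc_score, lm_score=None,
                   lam_ctc=0.5, lam_lm=0.7):
    """Combine per-hypothesis log scores as in HypR:
    (1 - lam_ctc) * att + lam_ctc * ctc, plus lam_lm * lm when an LM is used."""
    scores = []
    for i, (att, ctc) in enumerate(zip(att_score, ctc_score)):
        s = (1 - lam_ctc) * att + lam_ctc * ctc
        if lm_score is not None:
            s += lam_lm * lm_score[i]
        scores.append(s)
    return scores
```

Passing the per-hypothesis `att_score`, `ctc_score`, and (if present) `lm_score` lists from one record should reproduce the `score` field up to floating-point rounding, given the matching $\lambda$ values.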

Download

We have made HypR accessible via the Huggingface Datasets platform. To explore advanced usage, please refer to the official tutorial.

pip install datasets

The datasets on Huggingface are named in the following format: ASR-HypR/{LibriSpeech, TEDLIUM2, AISHELL1}_{withLM, withoutLM}. You can find more information about these datasets at https://huggingface.co/ASR-HypR.

from datasets import load_dataset

# load the whole dataset
dataset = load_dataset("ASR-HypR/LibriSpeech_withoutLM")

# load a specific split
dataset = load_dataset("ASR-HypR/LibriSpeech_withoutLM", split="dev_clean")
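A typical first baseline on HypR is score-based reranking: pick the hypothesis whose overall score is highest. A minimal sketch using the fields described above, where the record is a hand-written stand-in for one dataset entry (the hypothesis strings are placeholders):

```python
# one HypR record (illustrative values; real records hold 50 hypotheses)
record = {
    "hyps": ["hypothesis a", "hypothesis b", "hypothesis c"],
    "score": [-3.45057, -5.39363, -5.87073],
}

# indices of hypotheses sorted from best (highest log score) to worst
ranking = sorted(range(len(record["score"])),
                 key=lambda i: record["score"][i], reverse=True)
best_hyp = record["hyps"][ranking[0]]
```

Comparing `best_hyp` (or any reranked choice) against `ref` gives the word error rate of the corresponding reranking strategy.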

Licence

HypR is built upon data from the LibriSpeech, TED-LIUM, and AISHELL corpora, and we respect the licenses associated with these sources. HypR is freely available for academic purposes, but please consult the original licenses of these datasets for commercial use.

Citation

If you find our work useful, please cite our paper:

@misc{wang2023hypr,
      title={HypR: A comprehensive study for ASR hypothesis revising with a reference corpus}, 
      author={Yi-Wei Wang and Ke-Han Lu and Kuan-Yu Chen},
      year={2023},
      eprint={2309.09838},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Please provide proper attribution to the referenced paper when using the corresponding dataset.

@INPROCEEDINGS{AISHELL1,
  author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao},
  booktitle={Proceedings of O-COCOSDA}, 
  title={AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline}, 
  year={2017},
  volume={},
  number={},
  pages={1-5},
  doi={10.1109/ICSDA.2017.8384449}}

@article{AISHELL2,
  author       = {Jiayu Du and
                  Xingyu Na and
                  Xuechen Liu and
                  Hui Bu},
  title        = {{AISHELL-2:} Transforming Mandarin {ASR} Research Into Industrial
                  Scale},
  journal      = {CoRR},
  volume       = {abs/1808.10583},
  year         = {2018},
  url          = {http://arxiv.org/abs/1808.10583},
  eprinttype    = {arXiv},
  eprint       = {1808.10583},
  timestamp    = {Mon, 03 Sep 2018 13:36:40 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-1808-10583.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{TEDLIUM2,
    title = "Enhancing the {TED}-{LIUM} Corpus with Selected Data for Language Modeling and More {TED} Talks",
    author = "Rousseau, Anthony  and
      Del{\'e}glise, Paul  and
      Est{\`e}ve, Yannick",
    booktitle = "Proceedings of LREC'14",
    year = "2014",
    pages = "3935--3939",
    url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/1104_Paper.pdf",
    abstract = "In this paper, we present improvements made to the TED-LIUM corpus we released in 2012. These enhancements fall into two categories. First, we describe how we filtered publicly available monolingual data and used it to estimate well-suited language models (LMs), using open-source tools. Then, we describe the process of selection we applied to new acoustic data from TED talks, providing additions to our previously released corpus. Finally, we report some experiments we made around these improvements.",
}

@INPROCEEDINGS{LibriSpeech,
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle= {Proceedings of ICASSP}, 
  title={Librispeech: An ASR corpus based on public domain audio books}, 
  year={2015},
  volume={},
  number={},
  pages={5206-5210},
  doi={10.1109/ICASSP.2015.7178964}}
