Acoustic models of New Zealand English

douglasbagnall edited this page Jul 21, 2012 · 4 revisions

We want to make freely usable acoustic models of New Zealand English (NZE) for use in open source speech recognition systems like Pocketsphinx and Julius. This wiki is for collating information about the project, including descriptions of tasks, pointers to corpora, and ultimately links to models.

Adaptation vs. Creation

It seems there are two ways to arrive at a new acoustic model: you can create it from scratch using a large corpus of transcribed speech, or you can alter an existing model using a small corpus. These are described in the Sphinx adaptation and training tutorials.
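As a rough illustration of the data-preparation side of either route, here is a minimal Python sketch that writes the two list files the Sphinx tutorials ask for, as far as we understand them: a `.fileids` file with one recording name per line, and a `.transcription` file with one `<s> ... </s> (fileid)` line per utterance. The function names, the `nze_adapt` base name, and the text normalisation are our own assumptions for illustration, not part of Sphinx.

```python
from pathlib import Path


def normalise_transcript(text):
    """Lowercase and keep only letters and apostrophes (an assumed, simplistic rule)."""
    cleaned = "".join(ch if ch.isalpha() or ch == "'" else " " for ch in text.lower())
    return " ".join(cleaned.split())


def transcription_line(file_id, text):
    """Format one utterance as '<s> words </s> (file_id)'."""
    return f"<s> {normalise_transcript(text)} </s> ({file_id})"


def write_adaptation_lists(utterances, out_dir, name="nze_adapt"):
    """Write <name>.fileids and <name>.transcription for (file_id, text) pairs.

    'name' is a hypothetical base name; the two files list the same
    utterances in the same order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fileids_path = out_dir / f"{name}.fileids"
    transcription_path = out_dir / f"{name}.transcription"
    with open(fileids_path, "w", encoding="utf-8") as ids, \
            open(transcription_path, "w", encoding="utf-8") as trans:
        for file_id, text in utterances:
            ids.write(f"{file_id}\n")
            trans.write(transcription_line(file_id, text) + "\n")
    return fileids_path, transcription_path
```

Every word that survives normalisation still has to exist in the pronunciation dictionary used with the model, which this sketch does not check.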

Advantages of adaptation

  • Adaptation is faster, requiring much less speech and probably less work.
  • It works incrementally, so we should see progress quickly.
  • The North American English model that we would probably start with is very mature, and has been successfully used as a base even for non-English models.
  • Because the existing model is already a black box, there is no particular argument for insisting on a redistributable speech corpus, which makes finding a usable corpus much simpler. The resulting model will be unrestricted in its use and modifiable, but not as easy to re-create or diagnose.

Advantages of creation

  • The model can be open source in itself, rather than a black box.
  • Given enough well recorded and transcribed speech, the model should be a better fit.
  • If the speech corpus is freely available, the whole model can be open source, not just the tools used to create it. This will appeal to some people. Unfortunately, open speech corpora are scarce.

It seems adaptation is the more practical solution.

Available corpora

Existing corpora tend to have restrictive terms on their redistribution (which makes sense on privacy grounds). Voxforge is an exception.

The Wellington Corpus of Spoken New Zealand English

Available to “bona fide researchers” only, though they think this use qualifies. It contains a large amount (500,000 words) of private casual speech, and the restrictions are mainly to protect the privacy of the participants. Names and other identifying features have been changed in many of the transcriptions, so the text presumably no longer matches the audio at those points, which reduces the quality of the corpus for speech recognition purposes. The transcriptions are not tightly aligned to the speech in time, so the corpus is difficult to break down into snippets of the recommended size.

The New Zealand Component of the International Corpus of English (ICE-NZ)

Overlaps with the Wellington Corpus.

“The Corpus must be used for non-profit linguistic research purposes only. The licence cannot be transferred, lent, or re-sold.”

The ONZE Canterbury Corpus

Seemingly smaller than the Wellington Corpus. I could not find terms of use. The transcriptions are said to be relatively well aligned with the recordings in time.


Voxforge

Voxforge attempts to be an open source (GPLv2) speech corpus. It contains about 500 NZE sentences read by around 10 speakers. Almost all speakers are male.

It is not legally possible to distribute a model combining the GPL Voxforge corpus and the existing sourceless models, but it is OK to create such a model and not share it. This is useful for testing. The Voxforge speech can also be used as test data for other models.
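One way to use held-out speech as test data is to compare each recogniser output against its reference transcript and report the word error rate (WER). The sketch below is a minimal, self-contained version of that calculation, assuming plain whitespace-separated transcripts. The function names are our own, and nothing here is tied to Pocketsphinx or Julius.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) counted in words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)


def word_error_rate(pairs):
    """Corpus-level WER over an iterable of (reference, hypothesis) pairs."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, length = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += length
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return total_errors / total_words
```

Summing errors and reference words over the whole test set, rather than averaging per-sentence rates, keeps short sentences from dominating the score. With only about 10 speakers, almost all male, any figure from this test set should be read with caution.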