Speech recognition basics

This page aims to provide the minimal knowledge of speech recognition needed to work with CMU Sphinx projects such as pocketsphinx and sphinx4. The subjects covered here were gathered mainly from forum discussions.

Abstract concept

The process of speech recognition can be divided into these general steps:

Audio acquisition.
Audio processing.
Base units search.
Language model search.
Post-processing.
Related topics.

This wiki covers the first four, because post-processing is usually carried out by users rather than by the speech recognition system. But before we dive in, it is worth outlining the basic components of a typical speech recognition engine and listing some special terms.

ASR - automatic speech recognition.
(Recognition) hypothesis - a text string obtained as the result of recognition.
Dictionary - a list of words with their pronunciations.
Basic units - pieces of speech that are treated by the ASR system as atomic.
Acoustic model - one or multiple files containing some statistical information about basic units of speech.
Language model - a file containing information about the construction of phrases from words.

Acquisition

Speech recognition distinguishes two kinds of audio input: online and offline. Online means that the size of the data to be processed is not known in advance, so the speech recognizer accepts it in small portions and offers its best recognition result at each step. With offline recognition the whole of the data is already available, so recognition can be somewhat more accurate. In both cases additional passes can be made to refine the hypothesis.
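The difference in control flow between the two modes can be sketched as follows. This is only an illustration, not the pocketsphinx or sphinx4 API: `ToyRecognizer`, `read_chunks`, `recognize_online` and `recognize_offline` are hypothetical names, and the "recognizer" merely counts bytes instead of decoding speech. The chunk size of 3200 bytes is an assumption that corresponds to 100 ms of 16 kHz, 16-bit mono audio.

```python
import io


def read_chunks(stream, chunk_bytes=3200):
    """Yield successive chunks from a stream whose total length is unknown."""
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            return
        yield chunk


class ToyRecognizer:
    """Hypothetical stand-in for a real decoder; it only counts bytes."""

    def __init__(self):
        self.bytes_seen = 0

    def feed(self, chunk):
        self.bytes_seen += len(chunk)

    def hypothesis(self):
        return "<hypothesis after %d bytes>" % self.bytes_seen


def recognize_online(stream):
    """Feed audio in small portions and collect the result after each one."""
    recognizer = ToyRecognizer()
    partials = []
    for chunk in read_chunks(stream):
        recognizer.feed(chunk)
        partials.append(recognizer.hypothesis())
    return partials


def recognize_offline(data):
    """Process all of the audio at once and return a single result."""
    recognizer = ToyRecognizer()
    recognizer.feed(data)
    return recognizer.hypothesis()


audio = bytes(8000)  # 8000 zero bytes standing in for recorded audio
online = recognize_online(io.BytesIO(audio))
offline = recognize_offline(audio)
```

In the online case a partial result is available after every chunk, while the offline case produces one result once all of the data has been seen.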

TODO: CMVN computation

Processing

Audio data is represented by a sequence of numbers denoting the signal amplitude at given moments in time. The numbers can have different precision (the sample format); usually signed 16-bit little-endian integers are used. The number of measurements taken per second also matters. This is called the sampling frequency or sampling rate. 16 kHz is the standard for everything except telephone networks, where frequencies above 4000 Hz are filtered out, so a rate of 8 kHz is used instead. Speech recognizers may accept different file types, but for simplicity support is usually limited to a single WAVE format, so in most cases it is the user's responsibility to convert audio from other formats, such as MP3, to this one.
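As a minimal sketch of the format described above, the following uses Python's standard `wave` and `struct` modules to write and then read back a mono, 16-bit little-endian PCM WAVE file at 16 kHz. The helper names `write_test_wav` and `read_pcm16` and the sample values are made up for illustration; a real application would read a file recorded from a microphone or converted from another format.

```python
import io
import struct
import wave


def write_test_wav(samples, sample_rate=16000):
    """Build a mono 16-bit little-endian PCM WAVE file in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)  # samples per second
        # "<h" means little-endian signed 16-bit integer
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
    buf.seek(0)
    return buf


def read_pcm16(stream):
    """Return (sample_rate, samples) and reject any other sample format."""
    with wave.open(stream, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    return rate, samples


wav = write_test_wav([0, 1000, -1000, 32767, -32768])
rate, samples = read_pcm16(wav)
```

Checking the channel count and sample width up front matters because a recognizer fed audio in an unexpected format will typically produce poor results rather than a clear error.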

Base units

Language model

The language model is a higher level of recognition. Consider this simple example to get an idea of why it is required. Try to break the string ""... The recognizer faces the same difficulty. Although it seems that people pause between words when they speak, in reality they do not: the end of one word runs into the beginning of the next. But people do not speak words randomly.
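The word-boundary difficulty can be illustrated with a small sketch that lists every way of splitting an unspaced string into known words. The string "nowhere" and the tiny vocabulary are assumptions chosen for this illustration (they are not the example originally intended above), and `segmentations` is a hypothetical helper, not part of any CMU Sphinx tool.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into words from `vocabulary`."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Try every split of the remainder after this word.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical vocabulary, for illustration only.
VOCAB = {"no", "now", "here", "where", "nowhere"}

candidates = segmentations("nowhere", VOCAB)
# candidates is [["no", "where"], ["now", "here"], ["nowhere"]]
```

All three splits are made of valid words, so something beyond the word list is needed to choose between them. Knowledge about which word sequences are likely is what lets a recognizer prefer one candidate over the others.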

Fixed grammars

N-gram models

Post-processing

Once a recognition hypothesis has been obtained, it can be post-processed to extract additional information or to enrich it with new information. Post-processing steps include, but are not limited to:

Punctuation.
Number parsing: "one two three" → "123"
Token parsing: ""
Sentiment analysis.
Emotion recognition.
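Number parsing from the list above can be sketched as a simple pass over the hypothesis that joins runs of spoken digits. This is a minimal illustration that assumes the hypothesis contains only single-digit words; `DIGITS` and `parse_spoken_digits` are hypothetical names, and forms such as "twenty one" or "one hundred" are not handled.

```python
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_spoken_digits(hypothesis):
    """Replace each run of spoken digit words with the digits it spells."""
    output = []
    run = []  # digits collected from consecutive digit words
    for word in hypothesis.split():
        digit = DIGITS.get(word.lower())
        if digit is not None:
            run.append(digit)
            continue
        if run:
            output.append("".join(run))
            run = []
        output.append(word)
    if run:
        output.append("".join(run))
    return " ".join(output)


parsed = parse_spoken_digits("one two three")
mixed = parse_spoken_digits("call one two three now")
```

Here `parsed` is "123" and `mixed` is "call 123 now"; words that are not digits pass through unchanged.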
