# How to talk to your computer
## An introduction to Automatic Speech Recognition (ASR) in Python

Sarah Braden

Twitter: @ifmoonwascookie

28 October 2015

DesertPy Meetup

# Fun stuff people do with ASR
* voice control of your computer
* intelligent cars/homes
* speech transcription
* closed captioning
* speech translation
* voice search
* language learning
* language testing


# What is CMUSphinx?
[CMU Sphinx](http://cmusphinx.sourceforge.net/)

* Developed by Carnegie Mellon University
* Speaker-independent continuous speech recognition engine
* BSD-like license which allows commercial distribution

* Support for several languages:
    * US English
    * UK English
    * European French
    * Mandarin
    * German
    * Dutch
    * Russian
* Ability to build models for other languages

# What is CMUSphinx?

* Designed specifically for low-resource platforms
* Multiple tools
    * Pocketsphinx - recognizer library written in C
    * Sphinxtrain - acoustic model training tools
    * Sphinxbase - support library required by Pocketsphinx and Sphinxtrain
    * Sphinx4 - adjustable, modifiable recognizer written in Java

# Let's be real

* No, this is not going to be as awesome as Google or Siri right out of the box
* The word error rate is going to vary depending on your language and audio quality
* It takes a while to figure out the details, but it is better than starting from scratch
* You can train it to "fit" your voice and get more accurate results with sphinxtrain (http://cmusphinx.sourceforge.net/wiki/tutorialam)

## Install version 5prealpha from source

Install both [sphinxbase-5prealpha](http://sourceforge.net/projects/cmusphinx/files/sphinxbase/5prealpha/) and [pocketsphinx-5prealpha](http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/5prealpha/).

* Compiling from the source code on SourceForge is the best option.
* Despite the name, 5prealpha is the recommended "stable" release. 
* Make sure to install both pocketsphinx-5prealpha and sphinxbase-5prealpha. 
* Follow the instructions in the README included in the download. 
* It takes a while to install and compile.

The following command should work if CMUSphinx is installed correctly. It listens to the microphone input and prints what it hears to the screen.

```bash
$ pocketsphinx_continuous -inmic yes
```

Another command takes a .wav file as input:

```bash
$ pocketsphinx_continuous -infile test_mono_16k.wav
```

Get the full list of options by typing:

```bash
$ pocketsphinx_continuous
```

## Useful Tips

* pocketsphinx requires mono recordings. Also make sure the sample rate of your audio matches the rate the acoustic model expects.

* An easy way to create a test audio file is to record it in Audacity (http://audacityteam.org/) and set the sample rate there.
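
Before handing a recording to the decoder, it can help to check its format programmatically. The sketch below uses only Python's standard `wave` module. The helper names `describe_wav` and `is_compatible` are hypothetical, and the 16-bit sample width check is an assumption about what the recognizer expects, so adjust it if your setup differs.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, sample_width_bytes) for a WAV file."""
    with wave.open(path, 'rb') as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth()


def is_compatible(path, expected_rate=16000):
    """True if the file is mono, 16-bit (assumed), and at the expected rate.

    Hypothetical helper: expected_rate should be the rate of the acoustic
    model you plan to use, for example 8000 or 16000.
    """
    channels, rate, width = describe_wav(path)
    return channels == 1 and rate == expected_rate and width == 2
```

For example, `is_compatible('test_mono_16k.wav')` should return `True` for a mono 16 kHz recording, while a stereo or 44.1 kHz file would return `False` and need to be converted first.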

```python
from os import path
import pocketsphinx

base_dir = '/usr/local/share/pocketsphinx/model/en-us'

# Acoustic model (trained on 8 kHz audio)
hidden_markov_model = path.join(base_dir, 'hub4wsj_sc_8k')

# Language model
language_model = path.join(base_dir, 'cmusphinx-5.0-en-us.lm')

# Pronunciation dictionary
dictionary = path.join(base_dir, 'cmudict-en-us.dict')

# Create a decoder configured with the models above
config = pocketsphinx.Decoder.default_config()
config.set_string('-hmm', hidden_markov_model)
config.set_string('-dict', dictionary)
config.set_string('-lm', language_model)
# Sample rate of the input audio in Hz; it must match the acoustic model
config.set_float('-samprate', 8000.0)

decoder = pocketsphinx.Decoder(config)
```

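```python
filename = 'the_customer_is_eating_an_apple.wav'

# Decode the whole recording in one call
with open(filename, 'rb') as wav:
    decoder.decode_raw(wav)

# hyp() returns the best hypothesis; hypstr is its text
result = decoder.hyp().hypstr

print(result)
```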

## 8k audio

8 kHz audio is what telephones and VoIP use.

The result printed to screen is: "it directed eating an apple"

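## 16k audio

16 kHz audio is the default sample rate for CMU Sphinx.

To decode 16 kHz audio, point the acoustic model at the `en-us` model and change `-samprate` to 16000.0 so that it matches:

`hidden_markov_model = path.join(base_dir, 'en-us')`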

The result printed to screen is: "the customer is eating an apple"
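
To put a number on the difference between the two results, you can compute the word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal, pure-Python version. The function name `word_error_rate` is hypothetical, and the reference transcript is an assumption taken from the name of the audio file.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError('reference transcript must not be empty')

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / float(len(ref))


# Assumed reference: the sentence spelled out in the file name.
reference = 'the customer is eating an apple'

print(word_error_rate(reference, 'it directed eating an apple'))      # 0.5
print(word_error_rate(reference, 'the customer is eating an apple'))  # 0.0
```

The 8k result needs two substitutions and one deletion to match the six-word reference, which gives a WER of 0.5. The 16k result matches exactly.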

## What does the language model do?

* Restricts the word search by defining which word could follow previously recognized words
* Contains statistics of word sequences called N-grams
* The model file starts with a header, introduced by the keyword `\data\`, that lists the number of N-grams of each length:

```
\data\
ngram 1=19794
ngram 2=1377200
ngram 3=3178194
```

Each N-gram line starts with the logarithm (base 10) of the conditional probability p of that N-gram, followed by the words making up the N-gram.

Most of the file consists of lines like these:

-1.4612 zero one two 

-2.0386 zero one three 

-1.6586 zurich switzerland hello

-0.4470 zoom in on 

-2.5581 zoom in please
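
Because each line stores a base-10 logarithm, raising 10 to that value recovers the conditional probability. The sketch below parses lines in the layout shown above. The function name `parse_ngram_line` is hypothetical, and the code assumes no trailing backoff weight on the line, which real model files may include for lower-order N-grams.

```python
def parse_ngram_line(line):
    """Split an N-gram line into (words, probability).

    Assumes the layout '<log10 probability> <word> <word> ...' with no
    trailing backoff weight.
    """
    fields = line.split()
    log10_probability = float(fields[0])
    words = tuple(fields[1:])
    return words, 10 ** log10_probability


for line in ['-0.4470 zoom in on', '-2.5581 zoom in please']:
    words, probability = parse_ngram_line(line)
    print('%s -> %.4f' % (' '.join(words), probability))
```

With these two sample lines, "zoom in on" comes out far more probable than "zoom in please", which is exactly the kind of preference the decoder uses to narrow its word search.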

## What does the dictionary do?

* The dictionary I used above has 133425 pronunciations of words!
* Contains a mapping from words to phones 
* What if you have an accent?

The file consists of lines like these:

abbreviate AH B R IY V IY EY T

abbreviated AH B R IY V IY EY T AH D

abbreviated(2) AH B R IY V IY EY T IH D

zoological Z UW L AA JH IH K AH L

zoologist Z OW AA L AH JH AH S T

zoologists Z OW AA L AH JH AH S T S

zoologists(2) Z OW AA L AH JH AH S

# Grammars and Keywords

* Java Speech Grammar Format (JSGF)
* Keyword Spotting

```
#JSGF V1.0;

grammar example;

public <s> = <simple>;

<simple> = up
     | down
     | left
     | right
     ;
```

## Using a grammar file

Instead of setting the language model, you set the grammar file in the configuration of your decoder.

`config.set_string('-jsgf', 'grammar_file.jsgf')`
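
If the list of command words changes often, the grammar text can be generated rather than written by hand. The sketch below builds a grammar with the same structure as the example above. The function name `build_jsgf` is hypothetical, and it only covers the simple case of a single rule listing alternative words.

```python
def build_jsgf(grammar_name, words):
    """Return JSGF text whose public rule matches any one of `words`."""
    if not words:
        raise ValueError('at least one word is required')
    alternatives = ' | '.join(words)
    return (
        '#JSGF V1.0;\n'
        '\n'
        'grammar {name};\n'
        '\n'
        'public <s> = <simple>;\n'
        '\n'
        '<simple> = {alternatives};\n'
    ).format(name=grammar_name, alternatives=alternatives)


grammar_text = build_jsgf('example', ['up', 'down', 'left', 'right'])
print(grammar_text)
```

Saving `grammar_text` to a file and passing that file's path as the `-jsgf` value would then configure the decoder in the same way as a hand-written grammar.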

## Setting a keyphrase or multiple keyphrases
Instead of setting the language model, you set the keyphrase or keyphrase file in the configuration of your decoder.

`config.set_string('-keyphrase', 'bananas')`

or

`config.set_string('-kws', 'keyphrases.txt')`

keyphrases.txt is a file with keyphrases to spot, one per line.
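
A keyphrase file in this one-per-line layout is easy to produce from a Python list. The sketch below is one way to do it. The function name `write_keyphrase_file` is hypothetical, and the phrases other than "bananas" are made-up examples.

```python
def write_keyphrase_file(path, keyphrases):
    """Write one keyphrase per line, skipping blank entries.

    Returns the list of phrases that were actually written.
    """
    cleaned = [phrase.strip() for phrase in keyphrases if phrase.strip()]
    with open(path, 'w') as handle:
        for phrase in cleaned:
            handle.write(phrase + '\n')
    return cleaned


# Example phrases are illustrative only.
written = write_keyphrase_file('keyphrases.txt', ['bananas', 'hello computer', ' '])
print(written)  # ['bananas', 'hello computer']
```

The resulting path is what you would pass as the `-kws` value in the decoder configuration.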