Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Ruby-based web service for speech recognition, using the PocketSphinx gstreamer module
branch: master

typo

latest commit 2c13f37765
Tanel Alumäe authored
Failed to load latest commit information.
lib CMN for unseen devices is now calculated from global mean
scripts Converting Estonian models to UTF-8
views doc updates
LICENSE doc updates
README.rdoc Default conf is now uses English models
conf-english.yaml
conf-estonian.yaml typo
conf.yaml
config.ru Refectored
dummy.fsg Experimental suppert for user grammars
unicorn.conf.rb use 1 decoder process by default

README.rdoc

Introduction

Ruby-based web service for speech recognition, using the PocketSphinx gstreamer module.

Requirements

  • Ruby 1.8

  • Sinatra

  • Rack

  • Unicorn

  • PocketSphinx (NOTE: some features of the server require patched PocketSphinx, see below)

  • Some acoustic and language models for PocketSphinx

Installing

CMU Sphinx

  • Install sphinxbase from SVN (make, make install)

Apply PocketSphinx patch

In cmusphinx/pocketsphinx directory:

wget http://www.phon.ioc.ee/~tanela/ps_gst.patch
patch  -p0 -i ps_gst.patch

Make sure you have GStreamer devevelopment packages installed. In Debian Squeeze:

apt-get install libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev

And configure, make, make install as usual.

Install Ruby gems: Unicorn and Sinatra, UUID tools, JSON, locale

This assumes you have ruby and rubygems installed.

You might want to do this as root:

gem install unicorn
gem install sinatra
gem install uuidtools
gem install json
gem install locale

Install ruby-gstreamer package (might vary depending on your distribution):

apt-get install libgst-ruby1.8

Additional tools

English GF-based recognizer also need:

Run ruby-pocketsphinx-server

Clone the git repository:

git clone git://github.com/alumae/ruby-pocketsphinx-server.git

Before executing, add `/usr/local/lib` to the path where GStreamer plugins are looked for:

export GST_PLUGIN_PATH=/usr/local/lib

Running

unicorn -c unicorn.conf.rb config.ru

If you installed Unicorn as a Ruby gem, you might need to execute:

/var/lib/gems/1.8/bin/unicorn -c unicorn.conf.rb config.ru

Test the default configuration (English WSJ language model with HUB4 acostic models), using a raw audio file in the PocketSphinx test directory (replace `$(POCKETSPHINX_DIR)` with the Pocketsphinx source directory):

curl -T $(POCKETSPHINX_DIR)/test/data/wsj/n800_440c0207.wav -H "Content-Type: audio/x-wav"  "http://localhost:8080/recognize"

Response should be:

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "the agency isn't likely to take any action until the union's rank and file votes on the contract into three weeks"
    },
    {
      "utterance": "the agency isn't likely to take any action until the union's rank and file puts on the contract into three weeks"
    },
    {
      "utterance": "the agency isn't likely to take any action until the union's rank and file funds from the contract into three weeks"
    },
    {
      "utterance": "the agency isn't likely to take any action until the union's rank and file for from the contract into three weeks"
    },
    {
      "utterance": "the agency isn't likely to take any action until the union's rank and file parts of the contract into three weeks"
    }
  ],
  "id": "8686a37b5674cbdc63deb13f73de81a5"
}

Configuration

Web service

Unicorn configuration is in file unicorn.conf.rb. See unicorn.bogomips.org/examples/unicorn.conf.rb for more info.

Recognizer

See conf.yaml

Using the web service

Some of the more advanced examples below are specific to the Estonian configuration.

Example 1

Record a sentence to a wav file, in mono (hit Ctrl-C when done speaking):

rec -c 1 sentence.wav

Send it to the web service:

curl   -X POST --data-binary @sentence.wav -H "Content-Type: audio/x-wav"  http://localhost:8080/recognize

Output (encoded using json, the example uses Estonian models):

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": [
        "t\u00e4na on v\u00e4ljas \u00fcsna ilus ilm"
      ]
    }
  ],
  "id": "e30f54561135d681599915562d77d240"
}

Example 2

Record a raw file using arecord:

arecord --format=S16_LE  --file-type raw  --channels 1 --rate 16000 > sentence2.raw

Send it to web service:

curl -X POST --data-binary @sentence2.raw -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize

Example 3

Record a 5 second audio, pipe it to curl, which streams it directly to web service using PUT (and gets almost instant response):

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize

Support for JSGF grammars

Users can use their own grammars to recognize certain sentences. The grammars should be in JSGF format.

Example JSGF (let's call it robot.jsgf)

#JSGF V1.0;

grammar robot;

public <command> = (liigu | mine ) [ ( üks | kaks | kolm | neli | viis ) meetrit ] (edasi | tagasi);

NB! Grammars should be in the same charset that the server is using for dictionary, which currently is latin-1 (sorry for that).

You need to upload the JSGF file to somewhere where the server can fetch it, let's say www.example.com/robot.txt

Now, let the server download and compile it:

curl -vv  http://localhost:8080/fetch-lm?url=http://www.example.com/robot.jsgf

This should result in HTTP/1.1 200 OK.

Now you can use the grammar to recognize a sentence that is accepted by the grammar:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | \
curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  http://localhost:8080/recognize?lm=http://www.example.com/robot.jsgf

Result:

{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "mine viis meetrit tagasi"
    }
  ],
  "id": "9e3895e9ee0b5138e73c6fca30f51a58"
}

If you update the grammar on the server, you need to make the /fetch-jsgf request again, as the server doesn't check for changes every time a recognition request is done (for efficiency reasons).

Support for GF grammars

GF (Grammatical Framework) grammars are supported.

A GF grammar must be compiled into a .pgf file. To upload it to the server, use the fetch-pgf API call, e.g.:

curl "http://bark.phon.ioc.ee/speech-api/v1/fetch-lm?url=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf&lang=Est"

The 'lang' attribute (defaults to 'Est') specifies input languages of the grammar. Many comma-separated languages can be specified, e.g lang=Est,Est2

To recognize with a GF, use similar request as with JSGF, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  "http://localhost:8080/recognize?lm=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf

You can also specify output language(s) that will be used to linearize the raw recognition result, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000"  "http://localhost:8080/recognize?lm=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf&output-lang=App"

Output:

{
 "status": 0,
 "hypotheses": [
   {
     "utterance": "viis minutit sekundites",
     "linearizations": [
       {
         "lang": "App",
         "output": "5 ' IN \""
       },
       {
         "lang": "App",
         "output": "5 min IN s"
       }
     ]
   }
 ],
 "id": "83486feaca30995401ed4a66951a3f23"
}

Multiple output languages can be used, by using comma-separated values: “..&output-lang=App,App2”

Something went wrong with that request. Please try again.