## Instructions

Install DeepSpeech with pip:

```
pip install deepspeech
```

The files used in this demo can be obtained as follows.

1. Follow the instructions at https://deepspeech.readthedocs.io/en/r0.9/
    - Download the pre-trained English model and external scorer files
    ```
    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
    ```
    - Download the example audio files
    ```
    curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
    tar xvf audio-0.9.3.tar.gz
    ```
2. Download client.py, which contains useful utility functions (resampling, etc.), from https://deepspeech.readthedocs.io/en/r0.9/_downloads/67bac4343abf2261d69231fdaead59fb/client.py
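
Before running the cells below, it can help to confirm that the downloads are where the notebook expects them. The following is a minimal sketch, not part of DeepSpeech or client.py: the `missing_files` helper is hypothetical, and the file names assume the v0.9.3 release files and the sample WAV named in the parameters cell, all placed in the working directory.

```python
from pathlib import Path

# File names taken from the download steps and the parameters cell of this demo.
# Assumption: everything sits in the current working directory.
REQUIRED_FILES = [
    'deepspeech-0.9.3-models.pbmm',
    'deepspeech-0.9.3-models.scorer',
    'client.py',
    '2830-3980-0043.wav',
]


def missing_files(names, directory='.'):
    """Return the names that are not regular files inside `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == '__main__':
    missing = missing_files(REQUIRED_FILES)
    if missing:
        print('Missing files:', ', '.join(missing))
    else:
        print('All demo files are present.')
```

If a file is reported missing, adjust the path in the parameters cell or move the file into the working directory.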

In [1]:
import numpy as np
import wave
import deepspeech
import client

In [2]:
# Parameters
# Files obtained as instructed in https://deepspeech.readthedocs.io/en/r0.9/?badge=latest
# Note: despite the DIR suffix, each constant below is a path to a single file.
DATADIR = '2830-3980-0043.wav'
MODELDIR = 'deepspeech-0.9.3-models.pbmm'
SCORERDIR = 'deepspeech-0.9.3-models.scorer'
USE_SCORER = True

In [3]:
# Load model
model = deepspeech.Model(MODELDIR)
if USE_SCORER:
    model.enableExternalScorer(SCORERDIR)
modelSamplingRate = model.sampleRate()
    
# Print some model parameters
print('Beam width', model.beamWidth())
print('Sampling rate expected by the model', modelSamplingRate)

Beam width 500
Sampling rate expected by the model 16000


In [4]:
# Load audio
# Note that deepspeech accepts audio of data type int16
fin = wave.open(DATADIR, 'rb')
fs_orig = fin.getframerate()
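if fs_orig != modelSamplingRate:
    # Adjacent string literals are concatenated, which avoids embedding the
    # indentation of a backslash-continued line in the printed message.
    print('Warning: original sample rate ({} Hz) differs from the model rate ({} Hz). '
          'Resampling might produce erratic speech recognition.'
          .format(fs_orig, modelSamplingRate))
    # convert_samplerate shells out to SoX, which must be installed
    fs_new, audio = client.convert_samplerate(DATADIR, modelSamplingRate)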
else:
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

# Duration of the original recording, in seconds
audio_length = fin.getnframes() / fs_orig
fin.close()
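
The cell above only compares sample rates. As a hedged sketch, the WAV header can also be inspected with the standard `wave` module to confirm the file is mono and 16-bit before it is passed to the model. The helpers `describe_wav` and `is_model_ready` are hypothetical names introduced for illustration, and the default rate of 16000 is taken from the model output printed earlier.

```python
import wave


def describe_wav(path):
    """Return basic format facts for a WAV file using the standard library."""
    with wave.open(path, 'rb') as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        rate = wav.getframerate()
        frames = wav.getnframes()
    return {
        'channels': channels,
        'sample_width_bytes': sample_width,
        'sample_rate': rate,
        'duration_s': frames / rate,
    }


def is_model_ready(info, model_rate=16000):
    """True if the file is mono, 16-bit, and already at the given rate.

    Assumption: the model expects mono int16 audio at `model_rate`.
    """
    return (info['channels'] == 1
            and info['sample_width_bytes'] == 2
            and info['sample_rate'] == model_rate)
```

In the notebook, `is_model_ready(describe_wav(DATADIR), modelSamplingRate)` would indicate whether the file can be read directly with `np.frombuffer` or needs conversion first.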

In [5]:
# Prediction
# sttWithMetadata returns a Metadata object, containing a list of CandidateTranscript objects
predictions = model.sttWithMetadata(audio, num_results=5)
for candidate in predictions.transcripts:
    print(client.metadata_to_string(candidate))

experience proves this
experience proves his
experience proves this 
experience proves is
experience proves that
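
The printed candidates above omit their confidence scores and timing. The sketch below shows one way to pair each transcript with its confidence and to group character tokens into words with start times. It relies only on the attributes visible in the `Metadata` output of this notebook (`transcripts`, `confidence`, `tokens`, `text`, `start_time`); the two helper functions are hypothetical and are not part of DeepSpeech or client.py.

```python
def candidates_with_confidence(metadata):
    """Return (confidence, text) pairs, one per candidate transcript."""
    results = []
    for candidate in metadata.transcripts:
        text = ''.join(token.text for token in candidate.tokens)
        results.append((candidate.confidence, text))
    return results


def word_start_times(candidate):
    """Group character tokens into (word, start_time) pairs.

    A space token closes the current word; the start time of a word is the
    start time of its first character.
    """
    words = []
    current = ''
    start = None
    for token in candidate.tokens:
        if token.text == ' ':
            if current:
                words.append((current, start))
                current = ''
                start = None
        else:
            if not current:
                start = token.start_time
            current += token.text
    if current:
        words.append((current, start))
    return words


# Usage inside the notebook, assuming `predictions` from the cell above:
# for confidence, text in candidates_with_confidence(predictions):
#     print('{:8.3f}  {}'.format(confidence, text))
# print(word_start_times(predictions.transcripts[0]))
```

Because the functions only read attributes, they can be exercised with simple stand-in objects when no model is loaded.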


In [6]:
predictions

Metadata(transcripts=[
  CandidateTranscript(confidence=-16.056909561157227, tokens=[
    TokenMetadata(text='e', timestep=34, start_time=0.6800000071525574),
    TokenMetadata(text='x', timestep=36, start_time=0.7199999690055847),
    TokenMetadata(text='p', timestep=40, start_time=0.7999999523162842),
    TokenMetadata(text='e', timestep=41, start_time=0.8199999928474426),
    TokenMetadata(text='r', timestep=43, start_time=0.85999995470047),
    TokenMetadata(text='i', timestep=45, start_time=0.8999999761581421),
    TokenMetadata(text='e', timestep=47, start_time=0.9399999976158142),
    TokenMetadata(text='n', timestep=49, start_time=0.9799999594688416),
    TokenMetadata(text='c', timestep=50, start_time=1.0),
    TokenMetadata(text='e', timestep=52, start_time=1.0399999618530273),
    TokenMetadata(text=' ', timestep=56, start_time=1.1200000047683716),
    TokenMetadata(text='p', timestep=61, start_time=1.2200000286102295),
    TokenMetadata(text='r', timestep=62, start_time=1.2