# DEEPSPEECH
___
This notebook demonstrates how to install and use the [deepspeech](https://deepspeech.readthedocs.io/en/v0.7.4/) library developed by Mozilla. It installs the library itself, a pretrained model, a scorer, and sample audio files for trying out the model. Additional, more complex audio files can be downloaded from [here](https://www.voiptroubleshooter.com/open_speech/american.html).

Several different ways to use the library are shown:
* Simple - using a plain CLI command
* CLI + using variables for the model, scorer, and audio file
* Same as above + saving output in a separate file
* Full version of the code from Mozilla's client.py file (gives you the fullest control of the process)
___
The [SpaCy](https://spacy.io/usage) library and its largest state-of-the-art pre-trained deep-learning model for English are used at the end to detect sentence boundaries. Its performance depends heavily on the quality of the deepspeech output.

## 1. Install dependencies
Install these once, and reinstall them after the environment is reset (uncomment as needed).

In [1]:
# INSTALL DEEPSPEECH MODELS & SOX FOR AUDIO CONVERSION
#!pip install deepspeech
#!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-models.pbmm
#!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-models.scorer
#!conda install -c conda-forge sox --yes

In [None]:
# SAMPLE TEST AUDIO FILES (COME WITH DEEPSPEECH)
#!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/audio-0.7.3.tar.gz
#!tar -xvf audio-0.7.3.tar.gz

In [4]:
# MORE COMPLEX AUDIO FILES WITH SAMPLE RATE 8kHz - REQUIRE CONVERSION TO 16kHz WITH SoX
#!curl -LO https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
#!curl -LO https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0011_8k.wav
#!mv OSR_us_000_0010_8k.wav ./audio
#!mv OSR_us_000_0011_8k.wav ./audio
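
Before running inference, it can help to confirm whether a WAV file actually needs resampling. The following is a minimal sketch using only the standard-library `wave` module. The helper name `needs_resampling` and the default target of 16000 Hz are assumptions made for illustration; in the full version of the code the target rate is read from the model via `ds.sampleRate()`.

```python
import wave


def needs_resampling(wav_path, desired_rate=16000):
    """Return (needs_conversion, original_rate) for a WAV file.

    `desired_rate` defaults to 16000 Hz, which is an assumption here;
    pass the value reported by the model if it differs.
    """
    with wave.open(wav_path, 'rb') as wav_file:
        original_rate = wav_file.getframerate()
    return original_rate != desired_rate, original_rate
```

For example, `needs_resampling('./audio/OSR_us_000_0011_8k.wav')` should report whether that file has to go through SoX before it is passed to the model.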

In [3]:
# INSTALL SPACY FOR SENTENCE BOUNDARY DETECTION
#!pip install -U spacy
#!python -m spacy download en_core_web_lg

Requirement already up-to-date: spacy in /opt/conda/lib/python3.6/site-packages
Requirement already up-to-date: preshed<3.1.0,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: cymem<2.1.0,>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: catalogue<1.1.0,>=0.0.7 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: tqdm<5.0.0,>=4.38.0 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: srsly<1.1.0,>=1.0.2 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: wasabi<1.1.0,>=0.4.0 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: numpy>=1.15.0 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: plac<1.2.0,>=0.9.6 in /opt/conda/lib/python3.6/site-packages (from spacy)
Requirement already up-to-date: blis<0.5.0,>=0.4.0 in /opt/conda

## 2. Inference

In [4]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [5]:
def detect_sents(text):
    '''
    This function detects sentence boundaries using the largest state-of-the-art deep-learning model from SpaCy.
    Performance greatly depends on the quality of the speech-to-text output
    '''
        
    if not text: return None
    sents = []
    doc = nlp(text.lower())
    
    if doc:
        sents = [sent.text.strip().capitalize() + '.' for sent in list(doc.sents)]
                
    return ' '.join(sents)
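
The formatting step inside `detect_sents` can be tried without loading the SpaCy model. The sketch below assumes the sentences have already been split (the list passed in is hypothetical input, not SpaCy output) and applies the same `capitalize() + '.'` treatment. The helper name `format_sentences` is made up for this illustration.

```python
def format_sentences(sentences):
    """Capitalize each non-empty sentence and terminate it with a period."""
    cleaned = [s.strip() for s in sentences if s and s.strip()]
    return ' '.join(s.capitalize() + '.' for s in cleaned)


# Hypothetical, already-split input
print(format_sentences(['the boy was there when the sun rose',
                        'he went to Boston']))
# prints: The boy was there when the sun rose. He went to boston.
```

Note that `str.capitalize()` lowercases everything after the first character, so proper nouns lose their capital letter. This makes no difference for the deepspeech output in this notebook, which is already lowercase.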

### Simple way
deepspeech --model *model_file* --scorer *scorer_file* --audio *path_to_audio_file*

In [6]:
!deepspeech --model deepspeech-0.7.3-models.pbmm --scorer deepspeech-0.7.3-models.scorer --audio audio/OSR_us_000_0011_8k.wav        #audio/2830-3980-0043.wav

Loading model from file deepspeech-0.7.3-models.pbmm
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.4-0-gfcd9563
2020-06-27 04:41:18.252687: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0283s.
Loading scorer from files deepspeech-0.7.3-models.scorer
Loaded scorer in 0.0131s.
Running inference.
the boy was there when the sun rose he roasteth pain salmon the source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet the potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds
Inference took 33.818s for 32.785s audio file.
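
The same command can also be assembled in Python as an argument list, which avoids quoting problems when a path contains spaces. This is a minimal sketch: the helper name `build_deepspeech_command` is hypothetical, and only the `--model`, `--scorer`, and `--audio` flags shown above are used.

```python
import shlex


def build_deepspeech_command(model_path, scorer_path, audio_path):
    """Return the CLI invocation as an argument list."""
    return ['deepspeech',
            '--model', model_path,
            '--scorer', scorer_path,
            '--audio', audio_path]


cmd = build_deepspeech_command('deepspeech-0.7.3-models.pbmm',
                               'deepspeech-0.7.3-models.scorer',
                               'audio/OSR_us_000_0011_8k.wav')
# Shell-safe string form, e.g. for use with `!{...}` in a notebook cell
print(' '.join(shlex.quote(part) for part in cmd))
```

The list form can be handed directly to `subprocess.run(cmd)`, while the quoted string form suits notebook shell magics.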


### Compose from parts, run in command line, capture output

In [7]:
# SET THESE VARIABLES ONCE FOR ENTIRE FILE
path_to_model = 'deepspeech-0.7.3-models.pbmm'
path_to_scorer = 'deepspeech-0.7.3-models.scorer'
path_to_audio = './audio/OSR_us_000_0011_8k.wav'

In [8]:
%%capture output
s = 'deepspeech --model ' + path_to_model + ' --scorer ' + path_to_scorer + ' --audio ' + path_to_audio

!{s} 

In [9]:
#  PRINT ALL
output = str(output)
print(output)

Loading model from file deepspeech-0.7.3-models.pbmm
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.4-0-gfcd9563
2020-06-27 04:41:53.715901: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.019s.
Loading scorer from files deepspeech-0.7.3-models.scorer
Loaded scorer in 0.000447s.
Running inference.
the boy was there when the sun rose he roasteth pain salmon the source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet the potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds
Inference took 32.052s for 32.785s audio file.



In [10]:
# SELECT AND PRINT ONLY THE STTed TEXT REMOVING StdErr
idx1 = output.find('Running inference.')
idx2 = output.find('Inference took')
output_stt = output[ idx1+18: idx2 ].strip()
print(output_stt)

the boy was there when the sun rose he roasteth pain salmon the source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet the potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds
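
Slicing by fixed offsets works for the output shown above, but it depends on the exact order of the log lines. A slightly more tolerant alternative is sketched below: it filters out lines that look like log messages and keeps the rest. The helper name `extract_transcript` and the list of prefixes are assumptions derived from the log lines printed in this notebook, and other deepspeech versions may print different messages.

```python
import re

# Prefixes of the log lines seen in this notebook's output (an assumption,
# not an exhaustive list of what the CLI can print).
LOG_PREFIXES = ('Loading ', 'Loaded ', 'TensorFlow:', 'DeepSpeech:',
                'Running inference', 'Inference took')
# TensorFlow log lines start with a timestamp such as "2020-06-27 04:41:18"
TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')


def extract_transcript(captured_text):
    """Return the captured text with recognizable log lines removed."""
    kept = []
    for line in captured_text.splitlines():
        line = line.strip()
        if not line or line.startswith(LOG_PREFIXES) or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return ' '.join(kept)
```

Because the filter does not rely on line order, it should give the same result whether the transcript appears before or after the "Inference took" line.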


In [11]:
# DETECT SENTENCES
print(detect_sents(output_stt))

The boy was there when the sun rose. He roasteth pain salmon. The source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet. The potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds.


### Same, but save output to file too
The various options for saving terminal output to a file are described in detail [in two answers here](https://askubuntu.com/questions/420981/how-do-i-save-terminal-output-to-a-file).

In [12]:
%%capture output2
s = 'deepspeech --model ' + path_to_model + ' --scorer ' + path_to_scorer + ' --audio ' + path_to_audio
s += ' | tee -a output.txt'                                                                 # tee RECEIVES ONLY STDOUT, SO ONLY THE STTed TEXT IS SAVED TO FILE

!{s}

In [13]:
output2 = str(output2)
print(output2)

Loading model from file deepspeech-0.7.3-models.pbmm
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.4-0-gfcd9563
2020-06-27 04:42:27.497597: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0233s.
Loading scorer from files deepspeech-0.7.3-models.scorer
Loaded scorer in 0.000631s.
Running inference.
Inference took 31.607s for 32.785s audio file.
the boy was there when the sun rose he roasteth pain salmon the source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet the potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds



In [14]:
# SELECT AND PRINT ONLY THE STTed TEXT REMOVING StdErr
idx = output2.find('s audio file.')
output_stt2 = output2[idx + 14 :].strip()
print(output_stt2)

the boy was there when the sun rose he roasteth pain salmon the source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet the potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds
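
The outputs above suggest that the transcript arrives on stdout while the log messages arrive on stderr, which is why `tee` saves only the text. Under that assumption, the two streams can be kept apart from the start with `subprocess`, so no string slicing is needed. This is a sketch: `run_and_split` and `append_to_file` are hypothetical helper names, and a stand-in Python command is used in place of the real `deepspeech` call so the example runs anywhere.

```python
import subprocess
import sys


def run_and_split(cmd):
    """Run `cmd` (an argument list) and return (stdout, stderr) as text."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip(), result.stderr.strip()


def append_to_file(text, path):
    """Append `text` plus a newline to `path`, similar to `tee -a`."""
    with open(path, 'a', encoding='utf-8') as handle:
        handle.write(text + '\n')


# Stand-in command: prints one line to stdout and one line to stderr
demo_cmd = [sys.executable, '-c',
            "import sys; print('transcript text'); print('log line', file=sys.stderr)"]
transcript, logs = run_and_split(demo_cmd)
print(transcript)
```

Replacing `demo_cmd` with the real argument list for `deepspeech` and calling `append_to_file(transcript, 'output.txt')` would reproduce the effect of the `tee -a` cell.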


In [15]:
# DETECT SENTENCES
print(detect_sents(output_stt2))

The boy was there when the sun rose. He roasteth pain salmon. The source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet. The potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds.


## 3. Inference: full version

In [16]:
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave
import json

from deepspeech import Model, version
from timeit import default_timer as timer

try:
    from shlex import quote
except ImportError:
    from pipes import quote

In [17]:
def convert_samplerate(audio_path, desired_sample_rate):
    # sox -S OSR_us_000_0010_8k.wav output.wav rate -L -s 16000
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use {}hz files or install it: {}'.format(desired_sample_rate, e.strerror))

    return desired_sample_rate, np.frombuffer(output, np.int16)


def metadata_to_string(metadata):
    return ''.join(token.text for token in metadata.tokens)


def words_from_candidate_transcript(metadata):
    word = ""
    word_list = []
    word_start_time = 0
    # Loop through each character
    for i, token in enumerate(metadata.tokens):
        # Append character to word if it's not a space
        if token.text != " ":
            if len(word) == 0:
                # Log the start time of the new word
                word_start_time = token.start_time

            word = word + token.text
        # Word boundary is either a space or the last character in the array
        if token.text == " " or i == len(metadata.tokens) - 1:
            word_duration = token.start_time - word_start_time

            if word_duration < 0:
                word_duration = 0

            each_word = dict()
            each_word["word"] = word
            each_word["start_time"] = round(word_start_time, 4)
            each_word["duration"] = round(word_duration, 4)

            word_list.append(each_word)
            # Reset
            word = ""
            word_start_time = 0

    return word_list


def metadata_json_output(metadata):
    json_result = dict()
    json_result["transcripts"] = [{
        "confidence": transcript.confidence,
        "words": words_from_candidate_transcript(transcript),
    } for transcript in metadata.transcripts]
    return json.dumps(json_result, indent=2)



class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        print('DeepSpeech ', version())
        exit(0)
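
To see how the character-level tokens are grouped into words, the logic of `words_from_candidate_transcript` can be exercised without a model. The sketch below restates that loop in compact form and feeds it hand-made tokens. The `SimpleNamespace` objects and their timings are invented stand-ins for the real metadata tokens, and `group_tokens_into_words` is a hypothetical name.

```python
from types import SimpleNamespace


def group_tokens_into_words(tokens):
    """Group per-character tokens into words with start time and duration."""
    words = []
    current, start = '', 0
    for i, token in enumerate(tokens):
        if token.text != ' ':
            if not current:
                start = token.start_time      # first character of a new word
            current += token.text
        # A word ends at a space or at the last token
        if token.text == ' ' or i == len(tokens) - 1:
            duration = max(token.start_time - start, 0)
            words.append({'word': current,
                          'start_time': round(start, 4),
                          'duration': round(duration, 4)})
            current, start = '', 0
    return words


# Invented tokens spelling "hi yo"
fake_tokens = [SimpleNamespace(text=t, start_time=s)
               for t, s in [('h', 0.0), ('i', 0.25), (' ', 0.5),
                            ('y', 1.0), ('o', 1.25)]]
print(group_tokens_into_words(fake_tokens))
```

Each token carries only a start time, so the duration of the final word is measured up to the start of its last character and will tend to be slightly short.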

In [18]:
print('Loading model from file {}'.format(path_to_model), file=sys.stderr)
model_load_start = timer()

ds = Model(path_to_model)

model_load_end = timer() - model_load_start
print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

desired_sample_rate = ds.sampleRate()

if path_to_scorer:
    print('Loading scorer from files {}'.format(path_to_scorer), file=sys.stderr)
    scorer_load_start = timer()
    ds.enableExternalScorer(path_to_scorer)
    scorer_load_end = timer() - scorer_load_start
    print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)

Loading model from file deepspeech-0.7.3-models.pbmm
Loaded model in 0.0943s.
Loading scorer from files deepspeech-0.7.3-models.scorer
Loaded scorer in 0.000282s.


In [19]:
path_to_audio = './audio/OSR_us_000_0011_8k.wav'

fin = wave.open(path_to_audio, 'rb')
fs_orig = fin.getframerate()
if fs_orig != desired_sample_rate:
    print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
    fs_new, audio = convert_samplerate(path_to_audio, desired_sample_rate)
else:
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

audio_length = fin.getnframes() * (1/fs_orig)
fin.close()

print('Running inference.', file=sys.stderr)
inference_start = timer()

output_stt_full = ds.stt(audio).strip()

inference_end = timer() - inference_start
print('Done! Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length))

Running inference.


Done! Inference took 31.773s for 32.785s audio file.
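
The two numbers in this message can be combined into a real-time factor, the ratio of processing time to audio duration. A value below 1 means the transcription ran faster than the audio plays. This is a small sketch, and the helper name `real_time_factor` is an assumption made for illustration.

```python
def real_time_factor(inference_seconds, audio_seconds):
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError('audio_seconds must be positive')
    return inference_seconds / audio_seconds


# Values taken from the message printed above
rtf = real_time_factor(31.773, 32.785)
print('Real-time factor: {:.2f}'.format(rtf))
```

With the timings reported above the factor comes out at roughly 0.97, so on this machine inference runs only marginally faster than real time.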


In [20]:
print(output_stt_full)

the boy was there when the sun rose he roasteth pain salmon the source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet the potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds


In [21]:
# DETECT SENTENCES
print(detect_sents(output_stt_full))

The boy was there when the sun rose. He roasteth pain salmon. The source of the huge river is a clear spring checked the boston followed through helped the women get back to her feet. The potestas to pass the evening smoke fires lackman had the soft cushion broke the man's fall to salt breath came across the sea the girl at the booty bonds.
