Speech-To-Text Module #2

Open
chidiewenike opened this issue Jul 26, 2020 · 24 comments

@chidiewenike
Collaborator

Objective

Explore offline Speech-To-Text (STT) libraries that will convert raw audio bytes to a string.

Key Result

Create a function that will output a string from raw audio bytes input.

Details

The function will take raw audio bytes as input. The properties of the audio are TBD. The raw audio bytes are then converted to a string by an offline/local STT library. Beyond memory, the priority should be a library that allows custom speech adaptation, i.e. some form of user input (a list of words, transcripts, etc.) to disambiguate uncommon words. A minimal sketch of such a function follows the priority list below.

When selecting the appropriate library, priorities are as follows:

  1. Memory
  2. Customizable speech understanding
  3. Customizability of sound properties
  4. Runtime
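
For illustration, a minimal sketch of the key-result function, assuming DeepSpeech (the library explored in the comments below) and 16 kHz, 16-bit, mono PCM input; the model path is hypothetical:

# Minimal sketch of the key result: raw audio bytes in, transcript string out.
# Assumes DeepSpeech, 16 kHz / 16-bit / mono PCM input, and a hypothetical
# local model path.
import numpy as np
from deepspeech import Model

_model = Model("deepspeech-0.8.0-models.pbmm")

def speech_to_text(raw_audio: bytes) -> str:
    """Convert raw 16-bit PCM audio bytes to a transcript string."""
    samples = np.frombuffer(raw_audio, dtype=np.int16)
    return _model.stt(samples)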
@chidiewenike chidiewenike changed the title STT Module Speech-To-Text Module Jul 26, 2020
@hhokari hhokari assigned hhokari and Jason-Ku and unassigned Jason-Ku Jul 26, 2020
@mfekadu
Member

mfekadu commented Jul 27, 2020

@hhokari and @Jason-Ku, check out this fantastic open-source STT/TTS project by Mozilla (DeepSpeech):

They have releases here (pre-trained models); it seems to be kept up to date, since the latest release was last month:

and their source data here:

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

I just set up and tried out DeepSpeech; it's pretty darn cool and pretty much works out of the box! Awesome find, Michael.

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

Some preliminary testing shows the STT module running at just shy of 12.8% of my computer's memory (16 GB), so we're looking at just over 2 GB of memory.

@chidiewenike
Collaborator Author

Do you know what libraries are pulled in? Which model are you using? I remember there being a TFLite model as well, which is built for mobile apps and embedded systems.

@chidiewenike
Collaborator Author

We have up to 8 GB of memory, so it won't cause any serious issues, but it does increase the cost per device.

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

I'm using this model: https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-models.pbmm

Just realized it's not the latest one (0.8.0), so I'll download that when my internet starts working again and give it a shot.

Not sure what all of the libraries being pulled in are.

@chidiewenike
Collaborator Author

See if you can work with the TFLite model. That is built to be a bit more lightweight.
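
For reference, the TFLite model appears to load through the same Python API as the .pbmm one; a rough sketch, assuming the deepspeech-tflite wheel on x86 (the standard deepspeech wheel on ARM is already TFLite-based) and illustrative file paths:

# Rough sketch, assuming `pip install deepspeech-tflite` on x86 (the regular
# deepspeech wheel on ARM is TFLite-based); model and audio paths are illustrative.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.tflite")
fin = wave.open("audio/sample-16khz.wav", "rb")   # 16 kHz mono WAV
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()
print(ds.stt(audio))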

@mfekadu
Member

mfekadu commented Aug 2, 2020

What about this one?

native_client.rpi3.cpu.linux.tar.xz
940 KB

download link

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

I'll give that one a shot later tonight, @mfekadu! Or @hhokari can try it out.

Here are some metrics from a sample usage of the tflite model:

NOTES:

  • memory usage is measured in MEBIBYTES (MiB)
  • the memory profiler itself uses quite a bit of memory
  • I only put the profiler on the main method
  • it totally bungled the sentence (transcript below)
how did copleston pacific ranch
Memory usage (in chunks of .1 seconds): [26.81640625, 26.8515625, 69.7421875, 75.79296875, 76.296875, 93.7890625, 96.40625, 98.66015625, 101.80078125, 102.08203125, 28.9921875]
Maximum memory usage: 102.08203125
Filename: audio.py

Line #    Mem usage    Increment   Line Contents
================================================
    92   26.852 MiB   26.852 MiB   @profile
    93                             def stt():
    94   26.855 MiB    0.004 MiB       parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    95   26.855 MiB    0.000 MiB       parser.add_argument('--model', required=True,
    96   26.855 MiB    0.000 MiB                           help='Path to the model (protocol buffer binary file)')
    97   26.855 MiB    0.000 MiB       parser.add_argument('--scorer', required=False,
    98   26.855 MiB    0.000 MiB                           help='Path to the external scorer file')
    99   26.855 MiB    0.000 MiB       parser.add_argument('--audio', required=True,
   100   26.855 MiB    0.000 MiB                           help='Path to the audio file to run (WAV format)')
   101   26.855 MiB    0.000 MiB       parser.add_argument('--beam_width', type=int,
   102   26.855 MiB    0.000 MiB                           help='Beam width for the CTC decoder')
   103   26.855 MiB    0.000 MiB       parser.add_argument('--lm_alpha', type=float,
   104   26.859 MiB    0.004 MiB                           help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
   105   26.859 MiB    0.000 MiB       parser.add_argument('--lm_beta', type=float,
   106   26.859 MiB    0.000 MiB                           help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
   107   26.859 MiB    0.000 MiB       parser.add_argument('--version', action=VersionAction,
   108   26.859 MiB    0.000 MiB                           help='Print version and exits')
   109   26.859 MiB    0.000 MiB       parser.add_argument('--extended', required=False, action='store_true',
   110   26.859 MiB    0.000 MiB                           help='Output string from extended metadata')
   111   26.859 MiB    0.000 MiB       parser.add_argument('--json', required=False, action='store_true',
   112   26.859 MiB    0.000 MiB                           help='Output json from metadata with timestamp of each word')
   113   26.859 MiB    0.000 MiB       parser.add_argument('--candidate_transcripts', type=int, default=3,
   114   26.859 MiB    0.000 MiB                           help='Number of candidate transcripts to include in JSON output')
   115   26.863 MiB    0.004 MiB       args = parser.parse_args()
   116                             
   117   26.863 MiB    0.000 MiB       print('Loading model from file {}'.format(args.model), file=sys.stderr)
   118   26.863 MiB    0.000 MiB       model_load_start = timer()
   119                                 # sphinx-doc: python_ref_model_start
   120   27.707 MiB    0.844 MiB       ds = Model(args.model)
   121                                 # sphinx-doc: python_ref_model_stop
   122   27.707 MiB    0.000 MiB       model_load_end = timer() - model_load_start
   123   27.711 MiB    0.004 MiB       print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)
   124                             
   125   27.711 MiB    0.000 MiB       if args.beam_width:
   126                                     ds.setBeamWidth(args.beam_width)
   127                             
   128   27.715 MiB    0.004 MiB       desired_sample_rate = ds.sampleRate()
   129                             
   130   27.715 MiB    0.000 MiB       if args.scorer:
   131   27.715 MiB    0.000 MiB           print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
   132   27.715 MiB    0.000 MiB           scorer_load_start = timer()
   133   27.863 MiB    0.148 MiB           ds.enableExternalScorer(args.scorer)
   134   27.863 MiB    0.000 MiB           scorer_load_end = timer() - scorer_load_start
   135   27.863 MiB    0.000 MiB           print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)
   136                             
   137   27.863 MiB    0.000 MiB           if args.lm_alpha and args.lm_beta:
   138                                         ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
   139                             
   140   27.863 MiB    0.000 MiB       fin = wave.open(args.audio, 'rb')
   141   27.863 MiB    0.000 MiB       fs_orig = fin.getframerate()
   142   27.863 MiB    0.000 MiB       if fs_orig != desired_sample_rate:
   143   27.863 MiB    0.000 MiB           print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
   144   28.145 MiB    0.281 MiB           fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
   145                                 else:
   146                                     audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
   147                             
   148   28.145 MiB    0.000 MiB       audio_length = fin.getnframes() * (1/fs_orig)
   149   28.145 MiB    0.000 MiB       fin.close()
   150                             
   151   28.145 MiB    0.000 MiB       print('Running inference.', file=sys.stderr)
   152   28.145 MiB    0.000 MiB       inference_start = timer()
   153                                 # sphinx-doc: python_ref_inference_start
   154   28.145 MiB    0.000 MiB       if args.extended:
   155                                     print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
   156   28.145 MiB    0.000 MiB       elif args.json:
   157                                     print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
   158                                 else:
   159  102.641 MiB   74.496 MiB           print(ds.stt(audio))
   160                             
   161                                 # sphinx-doc: python_ref_inference_stop
   162  102.641 MiB    0.000 MiB       inference_end = timer() - inference_start
   163  102.641 MiB    0.000 MiB       print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)



@mfekadu
Member

mfekadu commented Aug 3, 2020

Super cool @Jason-Ku

Perhaps we can make good use of the extra memory by fine-tuning the pre-trained model to ensure that domain-specific words will work (e.g. Cal Poly != Cow Police).

Highlighted below is the data they used to train on:
[screenshot: training datasets listed in the DeepSpeech release notes]
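
One hedged sketch of that idea without retraining the acoustic model: build a domain-specific external scorer from a Cal Poly corpus (DeepSpeech ships generate_lm.py and generate_scorer_package for this) and enable it at inference time. The scorer path and alpha/beta weights below are purely illustrative:

# Hedged sketch: same pre-trained acoustic model, plus a hypothetical
# domain-specific scorer to bias decoding toward Cal Poly vocabulary.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.pbmm")
ds.enableExternalScorer("cal_poly.scorer")   # hypothetical custom scorer
ds.setScorerAlphaBeta(0.93, 1.18)            # illustrative weights, not tuned

fin = wave.open("audio/cal_poly_question.wav", "rb")   # 16 kHz mono WAV
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()
print(ds.stt(audio))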

@Jason-Ku
Collaborator

Jason-Ku commented Aug 9, 2020

what about this one?

native_client.rpi3.cpu.linux.tar.xz
940 KB

download link

Not sure how to get this working; I unzipped it and there are no model files here, just a bunch of hex data.

Might need to compile it in C.
Bummer: /usr/bin/ld: unknown architecture of input file `deepspeech' is incompatible with i386:x86-64 output

@snekiam
Member

snekiam commented Aug 9, 2020

It's set up for ARM; I'll test it on a Raspberry Pi.

@snekiam
Member

snekiam commented Aug 9, 2020

Running on the Pi; it looks like it needs SoX installed:

./deepspeech: error while loading shared libraries: libsox.so.3: cannot open shared object file: No such file or directory

@mfekadu
Member

mfekadu commented Aug 9, 2020

@Jason-Ku's memory-profiling Python script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave
import json
import time

from deepspeech import Model, version
from timeit import default_timer as timer
from memory_profiler import memory_usage, profile  # profile supplies the @profile decorator used below

try:
    from shlex import quote
except ImportError:
    from pipes import quote


def convert_samplerate(audio_path, desired_sample_rate):
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use {}hz files or install it: {}'.format(desired_sample_rate, e.strerror))

    return desired_sample_rate, np.frombuffer(output, np.int16)


def metadata_to_string(metadata):
    return ''.join(token.text for token in metadata.tokens)


def words_from_candidate_transcript(metadata):
    word = ""
    word_list = []
    word_start_time = 0
    # Loop through each character
    for i, token in enumerate(metadata.tokens):
        # Append character to word if it's not a space
        if token.text != " ":
            if len(word) == 0:
                # Log the start time of the new word
                word_start_time = token.start_time

            word = word + token.text
        # Word boundary is either a space or the last character in the array
        if token.text == " " or i == len(metadata.tokens) - 1:
            word_duration = token.start_time - word_start_time

            if word_duration < 0:
                word_duration = 0

            each_word = dict()
            each_word["word"] = word
            each_word["start_time "] = round(word_start_time, 4)
            each_word["duration"] = round(word_duration, 4)

            word_list.append(each_word)
            # Reset
            word = ""
            word_start_time = 0

    return word_list


def metadata_json_output(metadata):
    json_result = dict()
    json_result["transcripts"] = [{
        "confidence": transcript.confidence,
        "words": words_from_candidate_transcript(transcript),
    } for transcript in metadata.transcripts]
    return json.dumps(json_result, indent=2)



class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        print('DeepSpeech ', version())
        exit(0)


@profile
def stt():
    parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    parser.add_argument('--model', required=True,
                        help='Path to the model (protocol buffer binary file)')
    parser.add_argument('--scorer', required=False,
                        help='Path to the external scorer file')
    parser.add_argument('--audio', required=True,
                        help='Path to the audio file to run (WAV format)')
    parser.add_argument('--beam_width', type=int,
                        help='Beam width for the CTC decoder')
    parser.add_argument('--lm_alpha', type=float,
                        help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
    parser.add_argument('--lm_beta', type=float,
                        help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
    parser.add_argument('--version', action=VersionAction,
                        help='Print version and exits')
    parser.add_argument('--extended', required=False, action='store_true',
                        help='Output string from extended metadata')
    parser.add_argument('--json', required=False, action='store_true',
                        help='Output json from metadata with timestamp of each word')
    parser.add_argument('--candidate_transcripts', type=int, default=3,
                        help='Number of candidate transcripts to include in JSON output')
    args = parser.parse_args()

    print('Loading model from file {}'.format(args.model), file=sys.stderr)
    model_load_start = timer()
    # sphinx-doc: python_ref_model_start
    ds = Model(args.model)
    # sphinx-doc: python_ref_model_stop
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    if args.beam_width:
        ds.setBeamWidth(args.beam_width)

    desired_sample_rate = ds.sampleRate()

    if args.scorer:
        print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
        scorer_load_start = timer()
        ds.enableExternalScorer(args.scorer)
        scorer_load_end = timer() - scorer_load_start
        print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)

        if args.lm_alpha and args.lm_beta:
            ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)

    fin = wave.open(args.audio, 'rb')
    fs_orig = fin.getframerate()
    if fs_orig != desired_sample_rate:
        print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
        fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    audio_length = fin.getnframes() * (1/fs_orig)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    # sphinx-doc: python_ref_inference_start
    if args.extended:
        print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
    elif args.json:
        print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
    else:
        print(ds.stt(audio))

    # sphinx-doc: python_ref_inference_stop
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    mem_usage = memory_usage(stt)
    print('Memory usage (in chunks of .1 seconds): %s' % mem_usage)
    print('Maximum memory usage: %s' % max(mem_usage))

@hhokari
Collaborator

hhokari commented Aug 9, 2020

I was just able to get DeepSpeech running; really cool!

@snekiam
Member

snekiam commented Aug 10, 2020

Ran on a Raspberry Pi 4B with 1 GB of RAM:
[screenshot: benchmark output on the Pi 4B]
It's possible that benchmarking slows things down, but we're very much CPU-bound on this, not RAM-bound. Ran on a ~10 s audio file, found here (preamble10.wav).

@mfekadu
Member

mfekadu commented Aug 10, 2020

That's great, @hhokari!

Thanks for the analysis and sound file, @snekiam!

@mfekadu
Member

mfekadu commented Aug 10, 2020

For some reason, that audio file (preamble10.wav) does not work nicely with my deepspeech executable on my Mac.

@snekiam

[screenshot of the failing command output]

WAVE: RIFF header not found
➜  native_client.amd64.cpu.osx ./deepspeech --model model/deepspeech-0.8.0-models.pbmm --audio audio/preamble10.wav -t
TensorFlow: v2.2.0-17-g0854bb5188
DeepSpeech: v0.8.0-0-gf56b07da
2020-08-09 17:45:05.920602: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
formats: can't open input file `audio/preamble10.wav': WAVE: RIFF header not found
Assertion failed: (input), function GetAudioBuffer, file client.cc, line 228.
[1]    39132 abort      ./deepspeech --model model/deepspeech-0.8.0-models.pbmm --audio  -t

@mfekadu
Member

mfekadu commented Aug 10, 2020

I realized that my audio file was corrupted during the download. Re-downloading fixed it.

New issue: I found some interesting mistakes (highlighted below) that occur on my CPU but not in your screenshot, @snekiam:

[screenshot: transcript comparison with the recognition mistakes highlighted]

@mfekadu
Member

mfekadu commented Aug 10, 2020

The screenshot above is also using the pbmm model (deepspeech-0.8.0-models.pbmm) rather than the tflite model.

Here is a link to the docs about the pre-trained models.

@snekiam
Member

snekiam commented Aug 10, 2020

Some more interesting info on preamble10.wav, potentially related to why it took so long to process:
[screenshot: preamble10.wav file properties]
22.05 kHz is potentially a higher sample rate than we're going to use. We might also want to consider a USB accelerator, like the Coral, if things don't perform well, but I'm not 100% convinced that we'll need it; DeepSpeech should be able to do real-time audio on a Pi 4B, according to Mozilla. I'm going to try a Pi 4B-specific build rather than the Pi 3B version I ran earlier.

@snekiam
Member

snekiam commented Aug 10, 2020

Some more interesting data: Mozilla claims DeepSpeech is real-time on the Pi 4, and we're not constrained by the 1 GB of memory on this specific Pi 4B. I wonder if we're limited by the SD card. I'll try a quick flash drive tomorrow; I'm kinda interested in trying to boot from USB rather than SD anyway.
[screenshot: timing comparison between DeepSpeech's reported inference time and wall-clock time]
This shows the difference between the DeepSpeech-reported time and the system's: 2.164 s for a 1.975 s audio file (a real-time factor of roughly 1.1) is much better than the 22 seconds for the preamble file above! The above file has a lower bitrate and sampling rate, which might also have an effect:
[screenshot: properties of the shorter audio file]
Ideally, we'll want to process audio as it comes in rather than reading from disk anyway, which may make the read/write speed of our storage irrelevant.

@chidiewenike
Collaborator Author

Audio data should be read from the audio stream buffer and kept in RAM. That is what we do for the wake word on NIMBUS and the GCP STT API.
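
A rough sketch of that approach using DeepSpeech's streaming API (createStream / feedAudioContent / finishStream); PyAudio, the chunk size, and the model path are assumptions here, not settled choices:

# Hedged sketch: feed 16 kHz, 16-bit mono PCM chunks from the microphone buffer
# straight into a DeepSpeech stream, so nothing touches the disk.
import numpy as np
import pyaudio                                # assumption: any capture library works
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.pbmm")    # illustrative path
stream_ctx = ds.createStream()

pa = pyaudio.PyAudio()
mic = pa.open(format=pyaudio.paInt16, channels=1,
              rate=ds.sampleRate(), input=True, frames_per_buffer=1024)

for _ in range(int(ds.sampleRate() / 1024 * 5)):   # roughly 5 seconds of audio
    chunk = mic.read(1024)
    stream_ctx.feedAudioContent(np.frombuffer(chunk, np.int16))

print(stream_ctx.finishStream())              # final transcript
mic.stop_stream()
mic.close()
pa.terminate()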

@chidiewenike
Collaborator Author

Based on their documentation, they seem to use 16 kHz, although the Baidu paper suggests that both 16 kHz and 8 kHz datasets were used. They seem to use SoX to resample their data; that process might add some time.
