Speech-To-Text Module #2

Open
chidiewenike opened this issue Jul 26, 2020 · 24 comments

@chidiewenike
Collaborator

Objective

Explore offline Speech-To-Text (STT) libraries that will convert raw audio bytes to a string.

Key Result

Create a function that will output a string from raw audio bytes input.

Details

The function will take raw audio bytes as input. The properties of the audio are TBD. The raw audio bytes are then converted to a string by an offline/local STT library. Beyond memory, the priority should be a library that allows custom speech adaptation, i.e. some form of user input (a list of words, transcripts, etc.) to disambiguate uncommon words. A minimal sketch of such a function follows the priority list below.

When selecting the appropriate library, priorities are as follows:

  1. Memory
  2. Customizable speech understanding
  3. Customizability of sound properties
  4. Runtime
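
For illustration, a minimal sketch of the key-result function, assuming DeepSpeech (the library explored in the comments below) and 16 kHz, 16-bit, mono PCM input; the model path is hypothetical:

# Minimal sketch of the key result: raw audio bytes in, transcript string out.
# Assumes DeepSpeech, 16 kHz / 16-bit / mono PCM input, and a hypothetical
# local model path.
import numpy as np
from deepspeech import Model

_model = Model("deepspeech-0.8.0-models.pbmm")

def speech_to_text(raw_audio: bytes) -> str:
    """Convert raw 16-bit PCM audio bytes to a transcript string."""
    samples = np.frombuffer(raw_audio, dtype=np.int16)
    return _model.stt(samples)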
@chidiewenike chidiewenike changed the title STT Module Speech-To-Text Module Jul 26, 2020
@hhokari hhokari assigned hhokari and Jason-Ku and unassigned Jason-Ku Jul 26, 2020
@mfekadu
Member

mfekadu commented Jul 27, 2020

@hhokari and @Jason-Ku, check out this fantastic open-source STT/TTS project by Mozilla (DeepSpeech):

They have releases here (pre-trained models); it seems to be kept up to date, since the latest release was last month:

and their source data here:

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

I just set up and tried out DeepSpeech; it's pretty darn cool and pretty much works out of the box! Awesome find, Michael.

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

Some preliminary testing shows the STT module running at just shy of 12.8% of my computer's memory (16 GB), so we're looking at just over 2 GB of memory.

@chidiewenike
Collaborator Author

Do you know what libraries are pulled in? Which model are you using? I remember there being a TFLite model as well, which is built for mobile apps and embedded systems.

@chidiewenike
Collaborator Author

We have up to 8 GB of memory, so it won't cause any serious issues, but it does increase the cost per device.

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

I'm using this model: https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-models.pbmm

Just realized it's not the latest one (0.8.0), so I'll download that when my internet starts working again and give it a shot.

Not sure what all of the libraries being pulled in are.

@chidiewenike
Collaborator Author

See if you can work with the TFLite model. That is built to be a bit more lightweight.
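
For reference, the TFLite model appears to load through the same Python API as the .pbmm one; a rough sketch, assuming the deepspeech-tflite wheel on x86 (the standard deepspeech wheel on ARM is already TFLite-based) and illustrative file paths:

# Rough sketch, assuming `pip install deepspeech-tflite` on x86 (the regular
# deepspeech wheel on ARM is TFLite-based); model and audio paths are illustrative.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.tflite")
fin = wave.open("audio/sample-16khz.wav", "rb")   # 16 kHz mono WAV
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()
print(ds.stt(audio))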

@mfekadu
Member

mfekadu commented Aug 2, 2020

What about this one?

native_client.rpi3.cpu.linux.tar.xz
940 KB

download link

@Jason-Ku
Collaborator

Jason-Ku commented Aug 2, 2020

I'll give that one a shot later tonight, @mfekadu! Or @hhokari can try it out.

Here are some metrics from a sample usage of the tflite model:

NOTES:

  • memory usage is measured in MEBIBYTES (MiB)
  • the memory profiler itself uses quite a bit of memory
  • I only put the profiler on the main method
  • it totally bungled the sentence (transcript below)
how did copleston pacific ranch
Memory usage (in chunks of .1 seconds): [26.81640625, 26.8515625, 69.7421875, 75.79296875, 76.296875, 93.7890625, 96.40625, 98.66015625, 101.80078125, 102.08203125, 28.9921875]
Maximum memory usage: 102.08203125
Filename: audio.py

Line #    Mem usage    Increment   Line Contents
================================================
    92   26.852 MiB   26.852 MiB   @profile
    93                             def stt():
    94   26.855 MiB    0.004 MiB       parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    95   26.855 MiB    0.000 MiB       parser.add_argument('--model', required=True,
    96   26.855 MiB    0.000 MiB                           help='Path to the model (protocol buffer binary file)')
    97   26.855 MiB    0.000 MiB       parser.add_argument('--scorer', required=False,
    98   26.855 MiB    0.000 MiB                           help='Path to the external scorer file')
    99   26.855 MiB    0.000 MiB       parser.add_argument('--audio', required=True,
   100   26.855 MiB    0.000 MiB                           help='Path to the audio file to run (WAV format)')
   101   26.855 MiB    0.000 MiB       parser.add_argument('--beam_width', type=int,
   102   26.855 MiB    0.000 MiB                           help='Beam width for the CTC decoder')
   103   26.855 MiB    0.000 MiB       parser.add_argument('--lm_alpha', type=float,
   104   26.859 MiB    0.004 MiB                           help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
   105   26.859 MiB    0.000 MiB       parser.add_argument('--lm_beta', type=float,
   106   26.859 MiB    0.000 MiB                           help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
   107   26.859 MiB    0.000 MiB       parser.add_argument('--version', action=VersionAction,
   108   26.859 MiB    0.000 MiB                           help='Print version and exits')
   109   26.859 MiB    0.000 MiB       parser.add_argument('--extended', required=False, action='store_true',
   110   26.859 MiB    0.000 MiB                           help='Output string from extended metadata')
   111   26.859 MiB    0.000 MiB       parser.add_argument('--json', required=False, action='store_true',
   112   26.859 MiB    0.000 MiB                           help='Output json from metadata with timestamp of each word')
   113   26.859 MiB    0.000 MiB       parser.add_argument('--candidate_transcripts', type=int, default=3,
   114   26.859 MiB    0.000 MiB                           help='Number of candidate transcripts to include in JSON output')
   115   26.863 MiB    0.004 MiB       args = parser.parse_args()
   116                             
   117   26.863 MiB    0.000 MiB       print('Loading model from file {}'.format(args.model), file=sys.stderr)
   118   26.863 MiB    0.000 MiB       model_load_start = timer()
   119                                 # sphinx-doc: python_ref_model_start
   120   27.707 MiB    0.844 MiB       ds = Model(args.model)
   121                                 # sphinx-doc: python_ref_model_stop
   122   27.707 MiB    0.000 MiB       model_load_end = timer() - model_load_start
   123   27.711 MiB    0.004 MiB       print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)
   124                             
   125   27.711 MiB    0.000 MiB       if args.beam_width:
   126                                     ds.setBeamWidth(args.beam_width)
   127                             
   128   27.715 MiB    0.004 MiB       desired_sample_rate = ds.sampleRate()
   129                             
   130   27.715 MiB    0.000 MiB       if args.scorer:
   131   27.715 MiB    0.000 MiB           print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
   132   27.715 MiB    0.000 MiB           scorer_load_start = timer()
   133   27.863 MiB    0.148 MiB           ds.enableExternalScorer(args.scorer)
   134   27.863 MiB    0.000 MiB           scorer_load_end = timer() - scorer_load_start
   135   27.863 MiB    0.000 MiB           print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)
   136                             
   137   27.863 MiB    0.000 MiB           if args.lm_alpha and args.lm_beta:
   138                                         ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
   139                             
   140   27.863 MiB    0.000 MiB       fin = wave.open(args.audio, 'rb')
   141   27.863 MiB    0.000 MiB       fs_orig = fin.getframerate()
   142   27.863 MiB    0.000 MiB       if fs_orig != desired_sample_rate:
   143   27.863 MiB    0.000 MiB           print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
   144   28.145 MiB    0.281 MiB           fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
   145                                 else:
   146                                     audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
   147                             
   148   28.145 MiB    0.000 MiB       audio_length = fin.getnframes() * (1/fs_orig)
   149   28.145 MiB    0.000 MiB       fin.close()
   150                             
   151   28.145 MiB    0.000 MiB       print('Running inference.', file=sys.stderr)
   152   28.145 MiB    0.000 MiB       inference_start = timer()
   153                                 # sphinx-doc: python_ref_inference_start
   154   28.145 MiB    0.000 MiB       if args.extended:
   155                                     print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
   156   28.145 MiB    0.000 MiB       elif args.json:
   157                                     print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
   158                                 else:
   159  102.641 MiB   74.496 MiB           print(ds.stt(audio))
   160                             
   161                                 # sphinx-doc: python_ref_inference_stop
   162  102.641 MiB    0.000 MiB       inference_end = timer() - inference_start
   163  102.641 MiB    0.000 MiB       print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)



@mfekadu
Member

mfekadu commented Aug 3, 2020

Super cool @Jason-Ku

Perhaps we can make good use of the extra memory by fine-tuning the pre-trained model to ensure that domain-specific words will work (e.g. Cal Poly != Cow Police).

Highlighted below is the data they used to train on:
[screenshot: training datasets listed in the DeepSpeech release notes]
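
One hedged sketch of that idea without retraining the acoustic model: build a domain-specific external scorer from a Cal Poly corpus (DeepSpeech ships generate_lm.py and generate_scorer_package for this) and enable it at inference time. The scorer path and alpha/beta weights below are purely illustrative:

# Hedged sketch: same pre-trained acoustic model, plus a hypothetical
# domain-specific scorer to bias decoding toward Cal Poly vocabulary.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.pbmm")
ds.enableExternalScorer("cal_poly.scorer")   # hypothetical custom scorer
ds.setScorerAlphaBeta(0.93, 1.18)            # illustrative weights, not tuned

fin = wave.open("audio/cal_poly_question.wav", "rb")   # 16 kHz mono WAV
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()
print(ds.stt(audio))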

@Jason-Ku
Collaborator

Jason-Ku commented Aug 9, 2020

what about this one?

native_client.rpi3.cpu.linux.tar.xz
940 KB

download link

Not sure how to get this working; I unzipped it and there are no model files here, just a bunch of hex data.

Might need to compile it in C.
Bummer: /usr/bin/ld: unknown architecture of input file `deepspeech' is incompatible with i386:x86-64 output

@snekiam
Member

snekiam commented Aug 9, 2020

It's set up for ARM; I'll test it on a Raspberry Pi.

@snekiam
Member

snekiam commented Aug 9, 2020

Running on the Pi; it looks like it needs SoX installed:

./deepspeech: error while loading shared libraries: libsox.so.3: cannot open shared object file: No such file or directory

@mfekadu
Member

mfekadu commented Aug 9, 2020

@Jason-Ku's memory-profiling Python script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave
import json
import time

from deepspeech import Model, version
from timeit import default_timer as timer
from memory_profiler import memory_usage, profile  # profile supplies the @profile decorator used below

try:
    from shlex import quote
except ImportError:
    from pipes import quote


def convert_samplerate(audio_path, desired_sample_rate):
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate {} --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path), desired_sample_rate)
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use {}hz files or install it: {}'.format(desired_sample_rate, e.strerror))

    return desired_sample_rate, np.frombuffer(output, np.int16)


def metadata_to_string(metadata):
    return ''.join(token.text for token in metadata.tokens)


def words_from_candidate_transcript(metadata):
    word = ""
    word_list = []
    word_start_time = 0
    # Loop through each character
    for i, token in enumerate(metadata.tokens):
        # Append character to word if it's not a space
        if token.text != " ":
            if len(word) == 0:
                # Log the start time of the new word
                word_start_time = token.start_time

            word = word + token.text
        # Word boundary is either a space or the last character in the array
        if token.text == " " or i == len(metadata.tokens) - 1:
            word_duration = token.start_time - word_start_time

            if word_duration < 0:
                word_duration = 0

            each_word = dict()
            each_word["word"] = word
            each_word["start_time "] = round(word_start_time, 4)
            each_word["duration"] = round(word_duration, 4)

            word_list.append(each_word)
            # Reset
            word = ""
            word_start_time = 0

    return word_list


def metadata_json_output(metadata):
    json_result = dict()
    json_result["transcripts"] = [{
        "confidence": transcript.confidence,
        "words": words_from_candidate_transcript(transcript),
    } for transcript in metadata.transcripts]
    return json.dumps(json_result, indent=2)



class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        print('DeepSpeech ', version())
        exit(0)


@profile
def stt():
    parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    parser.add_argument('--model', required=True,
                        help='Path to the model (protocol buffer binary file)')
    parser.add_argument('--scorer', required=False,
                        help='Path to the external scorer file')
    parser.add_argument('--audio', required=True,
                        help='Path to the audio file to run (WAV format)')
    parser.add_argument('--beam_width', type=int,
                        help='Beam width for the CTC decoder')
    parser.add_argument('--lm_alpha', type=float,
                        help='Language model weight (lm_alpha). If not specified, use default from the scorer package.')
    parser.add_argument('--lm_beta', type=float,
                        help='Word insertion bonus (lm_beta). If not specified, use default from the scorer package.')
    parser.add_argument('--version', action=VersionAction,
                        help='Print version and exits')
    parser.add_argument('--extended', required=False, action='store_true',
                        help='Output string from extended metadata')
    parser.add_argument('--json', required=False, action='store_true',
                        help='Output json from metadata with timestamp of each word')
    parser.add_argument('--candidate_transcripts', type=int, default=3,
                        help='Number of candidate transcripts to include in JSON output')
    args = parser.parse_args()

    print('Loading model from file {}'.format(args.model), file=sys.stderr)
    model_load_start = timer()
    # sphinx-doc: python_ref_model_start
    ds = Model(args.model)
    # sphinx-doc: python_ref_model_stop
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    if args.beam_width:
        ds.setBeamWidth(args.beam_width)

    desired_sample_rate = ds.sampleRate()

    if args.scorer:
        print('Loading scorer from files {}'.format(args.scorer), file=sys.stderr)
        scorer_load_start = timer()
        ds.enableExternalScorer(args.scorer)
        scorer_load_end = timer() - scorer_load_start
        print('Loaded scorer in {:.3}s.'.format(scorer_load_end), file=sys.stderr)

        if args.lm_alpha and args.lm_beta:
            ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)

    fin = wave.open(args.audio, 'rb')
    fs_orig = fin.getframerate()
    if fs_orig != desired_sample_rate:
        print('Warning: original sample rate ({}) is different than {}hz. Resampling might produce erratic speech recognition.'.format(fs_orig, desired_sample_rate), file=sys.stderr)
        fs_new, audio = convert_samplerate(args.audio, desired_sample_rate)
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    audio_length = fin.getnframes() * (1/fs_orig)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    # sphinx-doc: python_ref_inference_start
    if args.extended:
        print(metadata_to_string(ds.sttWithMetadata(audio, 1).transcripts[0]))
    elif args.json:
        print(metadata_json_output(ds.sttWithMetadata(audio, args.candidate_transcripts)))
    else:
        print(ds.stt(audio))

    # sphinx-doc: python_ref_inference_stop
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    mem_usage = memory_usage(stt)
    print('Memory usage (in chunks of .1 seconds): %s' % mem_usage)
    print('Maximum memory usage: %s' % max(mem_usage))

@hhokari
Collaborator

hhokari commented Aug 9, 2020

I was just able to get DeepSpeech running; really cool!

@snekiam
Member

snekiam commented Aug 10, 2020

Ran on a Raspberry Pi 4B with 1 GB of RAM:
[screenshot: benchmark output on the Pi 4B]
It's possible that benchmarking slows things down, but we're very much CPU-bound on this, not RAM-bound. Ran on a ~10 s audio file, found here (preamble10.wav).

@mfekadu
Member

mfekadu commented Aug 10, 2020

That's great, @hhokari!

Thanks for the analysis and sound file, @snekiam!

@mfekadu
Member

mfekadu commented Aug 10, 2020

For some reason, that audio file (preamble10.wav) does not work nicely with my deepspeech executable on my Mac.

@snekiam

[screenshot of the failing command output]

WAVE: RIFF header not found
➜  native_client.amd64.cpu.osx ./deepspeech --model model/deepspeech-0.8.0-models.pbmm --audio audio/preamble10.wav -t
TensorFlow: v2.2.0-17-g0854bb5188
DeepSpeech: v0.8.0-0-gf56b07da
2020-08-09 17:45:05.920602: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
formats: can't open input file `audio/preamble10.wav': WAVE: RIFF header not found
Assertion failed: (input), function GetAudioBuffer, file client.cc, line 228.
[1]    39132 abort      ./deepspeech --model model/deepspeech-0.8.0-models.pbmm --audio  -t

@mfekadu
Member

mfekadu commented Aug 10, 2020

I realized that my audio file was corrupted during the download. Re-downloading fixed it.

New issue: I found some interesting mistakes (highlighted below) that occur on my CPU but not in your screenshot, @snekiam:

[screenshot: transcript comparison with the recognition mistakes highlighted]

@mfekadu
Member

mfekadu commented Aug 10, 2020

The screenshot above is also using the pbmm model (deepspeech-0.8.0-models.pbmm) rather than the tflite model.

Here is a link to the docs about the pre-trained models.

@snekiam
Member

snekiam commented Aug 10, 2020

Some more interesting info on preamble10.wav, potentially related to why it took so long to process:
[screenshot: preamble10.wav file properties]
22.05 kHz is potentially a higher sample rate than we're going to use. We might also want to consider a USB accelerator, like the Coral, if things don't perform well, but I'm not 100% convinced that we'll need it; DeepSpeech should be able to do real-time audio on a Pi 4B, according to Mozilla. I'm going to try a Pi 4B-specific build rather than the Pi 3B version I ran earlier.

@snekiam
Member

snekiam commented Aug 10, 2020

Some more interesting data: Mozilla claims DeepSpeech is real-time on the Pi 4, and we're not constrained by the 1 GB of memory on this specific Pi 4B. I wonder if we're limited by the SD card. I'll try a quick flash drive tomorrow; I'm kinda interested in trying to boot from USB rather than SD anyway.
[screenshot: timing comparison between DeepSpeech's reported inference time and wall-clock time]
This shows the difference between the DeepSpeech-reported time and the system's: 2.164 s for a 1.975 s audio file (a real-time factor of roughly 1.1) is much better than the 22 seconds for the preamble file above! The above file has a lower bitrate and sampling rate, which might also have an effect:
[screenshot: properties of the shorter audio file]
Ideally, we'll want to process audio as it comes in rather than reading from disk anyway, which may make the read/write speed of our storage irrelevant.

@chidiewenike
Collaborator Author

Audio data should be read from the audio stream buffer and kept in RAM. That is what we do for the wake word on NIMBUS and the GCP STT API.
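
A rough sketch of that approach using DeepSpeech's streaming API (createStream / feedAudioContent / finishStream); PyAudio, the chunk size, and the model path are assumptions here, not settled choices:

# Hedged sketch: feed 16 kHz, 16-bit mono PCM chunks from the microphone buffer
# straight into a DeepSpeech stream, so nothing touches the disk.
import numpy as np
import pyaudio                                # assumption: any capture library works
from deepspeech import Model

ds = Model("deepspeech-0.8.0-models.pbmm")    # illustrative path
stream_ctx = ds.createStream()

pa = pyaudio.PyAudio()
mic = pa.open(format=pyaudio.paInt16, channels=1,
              rate=ds.sampleRate(), input=True, frames_per_buffer=1024)

for _ in range(int(ds.sampleRate() / 1024 * 5)):   # roughly 5 seconds of audio
    chunk = mic.read(1024)
    stream_ctx.feedAudioContent(np.frombuffer(chunk, np.int16))

print(stream_ctx.finishStream())              # final transcript
mic.stop_stream()
mic.close()
pa.terminate()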

@chidiewenike
Collaborator Author

Based on their documentation, they seem to use 16 kHz, although the Baidu paper suggests that both 16 kHz and 8 kHz datasets were used. They seem to use SoX to resample their data; that process might add some time.
