
PSA: Realtime audio frontend demo for macOS #327

lunixbochs opened this issue Jun 9, 2019 · 20 comments

@lunixbochs (Contributor) commented Jun 9, 2019:
This is a working-out-of-the-box demo for realtime speech recognition on macOS with wav2letter++.

This is based on my C API in #326.
There's a src dir in the w2l_cli tarball with the frontend source (w2l_cli.cpp) and scripts/instructions for building all of this from scratch.

To install:

wget https://talonvoice.com/research/w2l_cli.tar.gz
tar -xf w2l_cli.tar.gz && rm w2l_cli.tar.gz
cd w2l_cli
wget https://talonvoice.com/research/epoch186-ls3_14.tar.gz
tar -xf epoch186-ls3_14.tar.gz && rm epoch186-ls3_14.tar.gz

To run:
./bin/w2l emit epoch186-ls3_14/model.bin epoch186-ls3_14/tokens.txt

Then speak; after each utterance you should see emissions (letter predictions) in the terminal, for example:

$ ./bin/w2l emit epoch186-ls3_14/model.bin epoch186-ls3_14/tokens.txt 
helow|world
this|is|a|test|of|wave|to|leter

Language model decoding is also wired up via ./bin/w2l decode am tokens lm lexicon, but, as noted in #326, it currently segfaults when setting up the Trie.

There are more pretrained English acoustic models at https://talonvoice.com/research/ that you can try as well.

@timdoug commented Jun 10, 2019:

The w2l binary points to a dynamic library that isn't present:

$ ./bin/w2l 
dyld: Library not loaded: @rpath/libclang_rt.asan_osx_dynamic.dylib
  Referenced from: .../bin/w2l
  Reason: image not found
Abort trap: 6
$ otool -L ./bin/w2l | grep rpath
	@rpath/libaf.3.dylib (compatibility version 3.0.0, current version 3.6.2)
	@rpath/libmkldnn.0.dylib (compatibility version 0.0.0, current version 0.18.1)
	@rpath/libiomp5.dylib (compatibility version 5.0.0, current version 5.0.0)
$ ls bin/
libaf.3.dylib     libafcpu.3.dylib  libiomp5.dylib    libmkldnn.0.dylib libmklml.dylib    w2l
$

Fixed with install_name_tool:

$ install_name_tool -change @rpath/libclang_rt.asan_osx_dynamic.dylib /Library/Developer/CommandLineTools/usr/lib/clang/10.0.1/lib/darwin/libclang_rt.asan_osx_dynamic.dylib bin/w2l 
$ ./bin/w2l 
Usage: ./bin/w2l emit   <acoustic model> <tokens.txt>
Usage: ./bin/w2l decode <acoustic model> <tokens.txt> <language model> <lexicon>
$

This is on 10.14.5 with the command-line dev tools installed, not full Xcode; the path may differ with full Xcode or with other versions.


@lunixbochs (Contributor) commented Jun 10, 2019:

OK, I uploaded a new version at the same URL that isn't compiled with -fsanitize=address, along with a small improvement to capture more audio at the start of an utterance.

@cogmeta commented Jul 9, 2019:

Will it build successfully on Ubuntu?


@cogmeta commented Jul 10, 2019:

As a favor, can you please provide an example that takes input from a wav file (without using any of the macOS audio-capture code) and prints the result? I was able to build everything, including libw2l.a, on Ubuntu 16.04, but I'm now stuck on a sample example.

@lunixbochs (Contributor) commented Jul 10, 2019:

No; wav2letter already has code in the featurization path for loading an audio file in the right format using libsndfile, and the Test/Decode binaries can already run against sound files if you put them in the right dataset format.

@cogmeta commented Jul 11, 2019:

Thanks, I did try ./Decoder but it shows no results.

./Decoder -test ./data/ -am ../../../epoch186-ls3_14/model.bin -lm ../../wav2letter/src/decoder/test/lm.arpa -showletters -show
Loading the LM will be faster if you build a binary file.
Reading ../../wav2letter/src/decoder/test/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


|T|:
|P|:
|t|: my|internet|is|not|working|man
|p|:
[sample: 0, WER: 100%, LER: 100%, slice WER: 100%, slice LER: 100%, progress: 100%]

@lunixbochs (Contributor) commented Jul 11, 2019:

I think you should open a new issue for that. It's possible your sound file is in the wrong format; make sure it's 16-bit, 16 kHz, single channel.

@maiasmith commented Jul 13, 2019:

FYI

wget https://talonvoice.com/research/w2l_cli.tar.gz
...
ERROR: cannot verify talonvoice.com's certificate, issued by ‘CN=Let's Encrypt Authority X3,O=Let's Encrypt,C=US’:
  Issued certificate has expired.

@bobtwinkles commented Jul 15, 2019:

I hacked together a Linux port of this here. If you don't want to lose your mind trying to get it to build, I recommend using the provided Nix derivations. I had to make several changes to the upstream w2l C API to get it working against the latest wav2letter; those are captured in this fork of wav2letter. It "works" in the sense that it'll pass data to and from the Wav2Letter system, though using epoch186, kindly provided by @lunixbochs, results in somewhat underwhelming performance. That might be due to my sketchy normalization, bad microphone, and weak grasp of many of the deep technical details involved in actually deploying an accurate speech-to-text solution.


@lunixbochs (Contributor) commented Jul 15, 2019:

Try the emit example. You can copy the updated emit code from the branch I linked. If saying "hello world" slowly and clearly doesn't result in something remotely like "helow|world" or "helo|world", your audio input pipe is probably the culprit.

@lunixbochs (Contributor) commented Jul 15, 2019:

You shouldn't be "normalizing" at all. Just divide by INT16_MAX, with no fabs(), to convert directly between the two ranges.

PCM signed int16 ranges from -32768 to 32767; PCM float is a signed floating-point value from -1.0 to 1.0. The mapping between the two ranges is linear.

Also, if you can ask PulseAudio for floating-point samples, that's even better: you only need to multiply by INT16_MAX for FVAD, and you can keep the correct wav2letter format through the whole pipeline without converting it yourself.
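The linear mapping described above can be sketched as follows (a minimal self-contained example, not code from the w2l_cli tarball; the helper names are mine):

```cpp
#include <cstdint>
#include <vector>

// int16 -> float: divide by INT16_MAX, with no fabs(); the mapping is linear.
static std::vector<float> pcm_s16_to_float(const std::vector<int16_t>& in) {
    std::vector<float> out;
    out.reserve(in.size());
    for (int16_t s : in)
        out.push_back(static_cast<float>(s) / INT16_MAX);
    return out;
}

// float -> int16: multiply by INT16_MAX, e.g. to feed float samples to FVAD.
static std::vector<int16_t> pcm_float_to_s16(const std::vector<float>& in) {
    std::vector<int16_t> out;
    out.reserve(in.size());
    for (float s : in)
        out.push_back(static_cast<int16_t>(s * INT16_MAX));
    return out;
}
```

If the audio server hands you float samples directly, as suggested above, only the float-to-int16 direction is needed, and only for FVAD.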

@bobtwinkles commented Jul 16, 2019:

Dividing by INT16_MAX and using the w2l-mac branch (with some minor tweaks to the CMakeLists.txt to get it building in my environment) seems to be working well -- the letter decoder is very accurate now. Trying to use the word decoder crashes for some reason I have yet to fully debug. I'll push my fixes shortly if other people want to try it out on Linux. Thanks for the help @lunixbochs =)


@cogmeta commented Jul 16, 2019:

@bobtwinkles thank you.
