Add wav2letter engine support #245

Closed

lexxish opened this issue Apr 30, 2020 · 8 comments
Labels
Feature Request (feature requests not planned to be implemented by OP)

Comments

lexxish (Contributor) commented Apr 30, 2020

Wav2letter would be valuable because it is one of the lowest-latency speech engines (significantly faster than even Kaldi).

See the chart of latencies on page 2: https://arxiv.org/pdf/1812.07625

daanzu (Collaborator) commented Apr 30, 2020

Regarding latency, I don't think that is correct. Which chart are you referring to, @lexxish? My Kaldi model has a latency on the order of ~10ms for commands in my experience, and that would be hard to beat.

More engines are welcome, but I haven't been convinced that wav2letter is superior for our usage yet.

lexxish (Contributor, Author) commented Apr 30, 2020

> Regarding latency, I don't think that is correct. Which chart are you referring to, @lexxish? My Kaldi model has a latency on the order of ~10ms for commands in my experience, and that would be hard to beat.
>
> More engines are welcome, but I haven't been convinced that wav2letter is superior for our usage yet.

Sorry, the chart is at the top right of page 3, not page 2.

daanzu (Collaborator) commented Apr 30, 2020

> Sorry, the chart is at the top right of page 3, not page 2.

Figure 3 is for time spent training, not for decoding (normal usage). (Also, frankly, I question the relevance of including Kaldi in that figure. Kaldi uses a completely different architecture and was likely trained on far fewer epochs, making the per-epoch comparison of questionable relevance, although the paper doesn't provide enough detail to say for sure.)

drmfinlay (Member) commented

David is more knowledgeable on this subject than I am. I have to agree that the Kaldi engine backend would be hard to beat at this point. I suppose it would be nice if training were easier.

I would welcome pull requests for wav2letter support. I am not, however, willing to develop or maintain such an engine implementation myself; maintaining Dragonfly is difficult enough with all the components it currently has, let alone with another. Someone else would need to come on board for improvements, bug fixes, etc.

drmfinlay added the Feature Request label (feature requests not planned to be implemented by OP) on Apr 30, 2020
lexxish (Contributor, Author) commented Apr 30, 2020

> Sorry, the chart is at the top right of page 3, not page 2.
>
> Figure 3 is for time spent training, not for decoding (normal usage). (Also, frankly, I question the relevance of including Kaldi in that figure. Kaldi uses a completely different architecture and was likely trained on far fewer epochs, making the per-epoch comparison of questionable relevance, although the paper doesn't provide enough detail to say for sure.)

It looks like Table 2 on page 4 might be decoding time, although for some reason they didn't include Kaldi. After reading more about it, perhaps you are right: Kaldi may be as good, or in some ways better, depending on how it's trained/configured.

Is it possible to configure Kaldi to reduce latency? Looking at Table 2, is it saying they tested wav2letter++ with 140ms and 10ms latency? I'm sure I get more than 10ms of latency right now with Kaldi, so I'm wondering if it can be improved at a minor accuracy cost, similar to what they show for wav2letter++.

Also, regarding training, should individuals be training Kaldi to their specific voice to reach optimal WERs? How much of a difference would you expect that to make for most people?

daanzu (Collaborator) commented May 2, 2020

@lexxish

Table 2: They are a bit vague about how they set up the decoding test, but I think it is probably also quite different from our usage, which might also be why they didn't include Kaldi in the comparison. They are probably batching the utterances, whereas we only have one voice at a time (though I would like to have a multi-threaded voice). Also, they are running on massive servers with beastly GPUs, which I think are being used for the decoding as well.

Kaldi latency: There are two components to the latency: the time for the VAD (voice activity detector) to determine you've actually stopped speaking, and then the processing time for Kaldi to finalize the decoding. The former is easy to adjust (it is a Dragonfly engine parameter), but essentially impossible to remove, since lowering it makes it more likely to falsely end your utterance prematurely. The 10ms I referred to is the latter.
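
For concreteness, the VAD end-of-utterance padding is set when the engine is created. Below is a minimal sketch assuming the Dragonfly Kaldi backend's keyword arguments; the option names (e.g. vad_padding_end_ms, vad_aggressiveness) and the values shown are assumptions that may differ between versions, so check the engine documentation for your install.

```python
# Minimal sketch: trimming the VAD end-of-utterance padding to trade
# robustness for latency. Option names and defaults are assumptions and
# may differ between dragonfly / kaldi-active-grammar versions.
from dragonfly import get_engine

engine = get_engine(
    "kaldi",
    model_dir="kaldi_model",   # path to the Kaldi model directory
    vad_aggressiveness=3,      # 0-3; higher filters silence more aggressively
    vad_padding_end_ms=250,    # silence (ms) required before finalizing an
                               # utterance; smaller = lower latency, but more
                               # risk of cutting an utterance off early
)
engine.connect()
```

The second component, the decoding time itself (the ~10ms figure above), is what the rest of this comment is about.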

The factors affecting that decoding time:

  • Acoustic model size: determined at training time.
  • Language model size: determined at model construction time; mostly relevant only for dictation, as command grammars are simple by comparison.
  • Decoding parameters: this came up in the Kaldi Gitter recently (see the quote and the sketch that follows it):

> I haven't really experimented too much with the decoding parameters (like beam size and others), and they are not currently exposed (although I probably should expose them at some point), but they are in the Python code, so it would not be hard for you to modify them and try out different values: just edit the file in your Python library directory and try the first four numbers on this line: https://github.com/daanzu/kaldi-active-grammar/blob/master/kaldi_active_grammar/wrapper.py#L396
> Please let me know how it goes if you experiment with this.
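
For illustration, these are the kinds of decoder options that line controls in a typical Kaldi setup; the names, ordering, and values below are assumptions (the actual constants in wrapper.py may differ), so treat this purely as a sketch of the speed/accuracy trade-off.

```python
# Illustrative only: typical Kaldi lattice-decoder options and how they trade
# accuracy for speed. The real values on the referenced wrapper.py line may
# use different names, ordering, and defaults.
decoder_opts = {
    "beam": 13.0,           # main search beam; smaller = faster, less accurate
    "max-active": 7000,     # cap on active decoding states per frame
    "lattice-beam": 6.0,    # pruning beam for the output lattice
    "acoustic-scale": 1.0,  # weight of acoustic vs. language-model scores
}

# Render as Kaldi-style command-line flags, e.g. "--beam=13.0 --max-active=7000 ..."
flags = " ".join(f"--{name}={value}" for name, value in decoder_opts.items())
print(flags)
```

Lowering beam and max-active is usually the first thing to try when trading a small amount of accuracy for lower decoding latency.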

Training: see daanzu/kaldi-active-grammar#13.

lunixbochs commented

I recommend closing this as per #326.

drmfinlay (Member) commented

Thanks.
