Add wav2letter engine support #245

Closed

lexxish opened this issue Apr 30, 2020 · 8 comments
Labels
Feature Request (feature requests not planned to be implemented by OP)

Comments

lexxish (Contributor) commented Apr 30, 2020

Wav2letter would be valuable because it is one of the lowest-latency speech engines (significantly faster than even Kaldi).

See the chart of latencies on page 2: https://arxiv.org/pdf/1812.07625

daanzu (Collaborator) commented Apr 30, 2020

Regarding latency, I don't think that is correct. Which chart are you referring to, @lexxish? My Kaldi model has a latency on the order of ~10ms for commands in my experience, and that would be hard to beat.

More engines are welcome, but I haven't been convinced that wav2letter is superior for our usage yet.

lexxish (Contributor, Author) commented Apr 30, 2020

> Regarding latency, I don't think that is correct. Which chart are you referring to, @lexxish? My Kaldi model has a latency on the order of ~10ms for commands in my experience, and that would be hard to beat.
>
> More engines are welcome, but I haven't been convinced that wav2letter is superior for our usage yet.

Sorry, the chart is at the top right of page 3, not page 2.

daanzu (Collaborator) commented Apr 30, 2020

> Sorry, the chart is at the top right of page 3, not page 2.

Figure 3 is for time spent training, not for decoding (normal usage). (Also, frankly, I question the relevance of including Kaldi in that figure. Kaldi uses a completely different architecture and was likely trained on far fewer epochs, making the per-epoch comparison of questionable relevance, although the paper doesn't provide enough detail to say for sure.)

drmfinlay (Member) commented

David is more knowledgeable on this subject than I am. I have to agree that the Kaldi engine backend would be hard to beat at this point. I suppose it would be nice if training were easier.

I would welcome pull requests for wav2letter support. I am not, however, willing to develop or maintain such an engine implementation myself; maintaining Dragonfly is difficult enough with all the components it currently has, let alone with another. Someone else would need to come on board for improvements, bug fixes, etc.

drmfinlay added the Feature Request label (feature requests not planned to be implemented by OP) on Apr 30, 2020
lexxish (Contributor, Author) commented Apr 30, 2020

> Sorry, the chart is at the top right of page 3, not page 2.
>
> Figure 3 is for time spent training, not for decoding (normal usage). (Also, frankly, I question the relevance of including Kaldi in that figure. Kaldi uses a completely different architecture and was likely trained on far fewer epochs, making the per-epoch comparison of questionable relevance, although the paper doesn't provide enough detail to say for sure.)

It looks like Table 2 on page 4 might be decoding time, although for some reason they didn't include Kaldi. After reading more about it, perhaps you are right: Kaldi may be as good, or in some ways better, depending on how it's trained/configured.

Is it possible to configure Kaldi to reduce latency? Looking at Table 2, is it saying they tested wav2letter++ with 140ms and 10ms latency? I'm sure I get more than 10ms of latency right now with Kaldi, so I'm wondering if it can be improved at a minor accuracy cost, similar to what they show for wav2letter++.

Also, regarding training, should individuals be training Kaldi to their specific voice to reach optimal WERs? How much of a difference would you expect that to make for most people?

daanzu (Collaborator) commented May 2, 2020

@lexxish

Table 2: They are a bit vague about how they set up the decoding test, but I think it is probably also quite different from our usage, which might also be why they didn't include Kaldi in the comparison. They are probably batching the utterances, whereas we only have one voice at a time (though I would like to have a multi-threaded voice). Also, they are running on massive servers with beastly GPUs, which I think are being used for the decoding as well.

Kaldi latency: There are two components to the latency: the time for the VAD (voice activity detector) to determine you've actually stopped speaking, and then the processing time for Kaldi to finalize the decoding. The former is easy to adjust (it is a Dragonfly engine parameter), but essentially impossible to remove, since lowering it makes it more likely to falsely end your utterance prematurely. The 10ms I referred to is the latter.
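
For concreteness, the VAD end-of-utterance padding is set when the engine is created. Below is a minimal sketch assuming the Dragonfly Kaldi backend's keyword arguments; the option names (e.g. vad_padding_end_ms, vad_aggressiveness) and the values shown are assumptions that may differ between versions, so check the engine documentation for your install.

```python
# Minimal sketch: trimming the VAD end-of-utterance padding to trade
# robustness for latency. Option names and defaults are assumptions and
# may differ between dragonfly / kaldi-active-grammar versions.
from dragonfly import get_engine

engine = get_engine(
    "kaldi",
    model_dir="kaldi_model",   # path to the Kaldi model directory
    vad_aggressiveness=3,      # 0-3; higher filters silence more aggressively
    vad_padding_end_ms=250,    # silence (ms) required before finalizing an
                               # utterance; smaller = lower latency, but more
                               # risk of cutting an utterance off early
)
engine.connect()
```

The second component, the decoding time itself (the ~10ms figure above), is what the rest of this comment is about.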

The factors affecting that decoding time:

  • Acoustic model size: determined at training time.
  • Language model size: determined at model construction time; mostly relevant only for dictation, as command grammars are simple by comparison.
  • Decoding parameters: this came up in the Kaldi Gitter recently (see the quote and the sketch that follows it):

> I haven't really experimented too much with the decoding parameters (like beam size and others), and they are not currently exposed (although I probably should expose them at some point), but they are in the Python code, so it would not be hard for you to modify them and try out different values: just edit the file in your Python library directory and try the first four numbers on this line: https://github.com/daanzu/kaldi-active-grammar/blob/master/kaldi_active_grammar/wrapper.py#L396
> Please let me know how it goes if you experiment with this.
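
For illustration, these are the kinds of decoder options that line controls in a typical Kaldi setup; the names, ordering, and values below are assumptions (the actual constants in wrapper.py may differ), so treat this purely as a sketch of the speed/accuracy trade-off.

```python
# Illustrative only: typical Kaldi lattice-decoder options and how they trade
# accuracy for speed. The real values on the referenced wrapper.py line may
# use different names, ordering, and defaults.
decoder_opts = {
    "beam": 13.0,           # main search beam; smaller = faster, less accurate
    "max-active": 7000,     # cap on active decoding states per frame
    "lattice-beam": 6.0,    # pruning beam for the output lattice
    "acoustic-scale": 1.0,  # weight of acoustic vs. language-model scores
}

# Render as Kaldi-style command-line flags, e.g. "--beam=13.0 --max-active=7000 ..."
flags = " ".join(f"--{name}={value}" for name, value in decoder_opts.items())
print(flags)
```

Lowering beam and max-active is usually the first thing to try when trading a small amount of accuracy for lower decoding latency.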

Training: see daanzu/kaldi-active-grammar#13.

lunixbochs commented

I recommend closing this as per #326.

drmfinlay (Member) commented

Thanks.
