Add wav2letter engine support #245
Wav2letter would be valuable because it is one of the lowest-latency speech engines (significantly faster than even Kaldi). See the chart of latencies on page 2: https://arxiv.org/pdf/1812.07625

Comments
Regarding latency, I don't think that is correct. Which chart are you referring to, @lexxish? My Kaldi model has a latency on the order of 10ms for commands in my experience, and that would be hard to beat. More engines are welcome, but I haven't been convinced that wav2letter is superior for our usage yet.
Sorry, the chart is at the top right of page 3, not page 2.
Figure 3 is for time spent training, not for decoding (normal usage). (Also, frankly, I question the relevance of including Kaldi in that figure: Kaldi uses a completely different architecture and was likely trained for far fewer epochs, making the per-epoch comparison dubious, although the paper doesn't provide enough detail to say for sure.)
David is more knowledgeable on this subject than I am. I have to agree that the Kaldi engine backend would be hard to beat at this point, though I suppose it would be nice if training were easier. I would welcome pull requests for wav2letter support. I am not, however, willing to develop or maintain such an engine implementation myself; maintaining Dragonfly is difficult enough with all the components it currently has, let alone with another. Someone else would need to come on board for improvements, bug fixes, etc.
It looks like Table 2 on page 4 might be decoding time, although for some reason they didn't include Kaldi. After reading more about it, perhaps you are right: Kaldi may be as good, or in some ways better, depending on how it's trained/configured. Is it possible to configure Kaldi to reduce latency? Looking at Table 2, is it saying they tested wav2letter++ with 140ms and 10ms latency? I'm sure I get more than 10ms of latency right now with Kaldi, so I'm wondering if it can be reduced at a minor accuracy cost, similar to what they show for wav2letter++. Also, regarding training: should individuals be training Kaldi on their specific voice to reach optimal WERs? How much of a difference would you expect that to make for most people?
Table 2: They are a bit vague about how they set up the decoding test, but I think it is probably also quite different from our usage, which might be why they didn't include Kaldi in the comparison. They are probably batching the utterances, whereas we only have one voice at a time (though I would like to have a multi-threaded voice). Also, they are running on massive servers with beastly GPUs, which I think are being used for the decoding as well.
Kaldi latency: There are two components to the latency: the time for the VAD (voice activity detector) to determine that you've actually stopped speaking, and then the processing time for Kaldi to finalize the decoding. The former is easy to adjust (it is a Dragonfly engine parameter; see the sketch after this comment), but essentially impossible to remove, since lowering it makes it more likely that your utterance is falsely ended prematurely. The 10ms I referred to is the latter, which depends on several factors.
Training: see daanzu/kaldi-active-grammar#13.
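For anyone wanting to experiment with the VAD component, here is a minimal sketch of loading the Kaldi backend with a shorter end-of-utterance padding. The keyword names (`vad_aggressiveness`, `vad_padding_end_ms`) are assumptions based on the Dragonfly Kaldi engine documentation and have varied between releases, so check the docs for your installed version.

```python
# Minimal sketch: load Dragonfly's Kaldi backend with shorter VAD padding,
# trading some robustness for lower end-of-utterance latency.
# NOTE: the keyword arguments below are assumptions based on the Dragonfly
# Kaldi engine docs and have changed across releases; verify against your version.
from dragonfly import get_engine

engine = get_engine(
    "kaldi",
    model_dir="kaldi_model",   # path to your kaldi-active-grammar model
    vad_aggressiveness=3,      # 0-3; higher rejects non-speech more aggressively
    vad_padding_end_ms=150,    # silence (ms) before the utterance is considered
                               # finished; lower = snappier, but more likely to
                               # cut you off mid-utterance
)
engine.connect()
```

Lowering the padding shaves the first latency component directly, at exactly the cost described above: the shorter the window, the more likely an utterance is ended prematurely.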
I recommend closing this, as per #326.
Thanks.