Adding domain-specific words to the language model - how much do I need? #2061
Unanswered
etlweather asked this question in Q&A
Replies: 0
I went through the documentation (https://stt.readthedocs.io/en/latest/LANGUAGE_MODEL.html) and I understand what I need to do to create a language model from scratch. I haven't tried it yet, but it seems relatively simple.
What I am uncertain about is how much data containing my domain-specific terms I need. I would want to fine-tune the language model, since my audio files contain far more common words than domain-specific ones - I would not want to create a model from scratch. But there are a number of domain-specific terms I want the model to recognize and transcribe properly.
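Concretely, my rough plan (hypothetical file names, not from the docs) would be to append my domain text to a general corpus before running the LM-building steps described in the documentation:

```shell
# Hypothetical file names - combine general and domain-specific text
# into one corpus before building the language model from it.
cat librispeech_corpus.txt domain_sentences.txt > combined_corpus.txt
wc -l combined_corpus.txt
```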
I did something similar in Vosk, and it was very simple: I only had to add my domain-specific words to a separate text file, and the build script combined them. My understanding is that in Vosk this works more like a dictionary of valid words than a language model.
So it seems that for Coqui STT, I will need a decent amount of text so the model can learn my domain-specific words in sentence context. Is that right?
It also seems the domain text will need to be frequent enough to "stand out" above the high repetition of common words in the LibriSpeech dataset.
Finally, the command in the doc takes a `top_k` parameter. This `top_k` parameter makes me worry that my words will be ignored, because they might not fall within the `top_k` most frequent words if I don't have enough domain content - though it could be that I am simply misunderstanding this parameter. Thanks for any clarification!
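My understanding of `top_k` (which may be wrong - this is just how I read the docs) is that only the k most frequent words survive into the vocabulary, which a toy sketch makes easy to see:

```python
from collections import Counter

def top_k_vocab(corpus_words, k):
    """Keep only the k most frequent words - roughly what I understand
    the top_k option to do when filtering the corpus vocabulary."""
    counts = Counter(corpus_words)
    return {word for word, _ in counts.most_common(k)}

words = ["the"] * 100 + ["and"] * 80 + ["model"] * 5 + ["ketorolac"]
vocab = top_k_vocab(words, k=2)

# A rare domain word falls outside the top-k cutoff and is dropped.
print("ketorolac" in vocab)
```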