Adding domain-specific words to the language model - how much do I need? #2061
Unanswered
etlweather asked this question in Q&A
Replies: 0
I went through the documentation (https://stt.readthedocs.io/en/latest/LANGUAGE_MODEL.html) and I understand what I need to do to create a language model from scratch. I haven't tried it yet, but it seems relatively simple.
What I am uncertain about is how much data containing my domain-specific terms I need. I would want to fine-tune the language model, since my audio files contain far more common words than domain-specific ones - I would not want to create a model from scratch. But there are a number of domain-specific terms I want the model to recognize and transcribe properly.
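Concretely, my rough plan (hypothetical file names, not from the docs) would be to append my domain text to a general corpus before running the LM-building steps described in the documentation:

```shell
# Hypothetical file names - combine general and domain-specific text
# into one corpus before building the language model from it.
cat librispeech_corpus.txt domain_sentences.txt > combined_corpus.txt
wc -l combined_corpus.txt
```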
I did something similar in Vosk, and it was very simple: I only had to add my domain-specific words to a separate text file, and the build script combined them. My understanding is that in Vosk this works more like a dictionary of valid words than a language model.
So it seems that for Coqui STT, I will need a decent amount of text so the model can learn my domain-specific words in sentence context. Is that right?
It also seems the domain text will need to be frequent enough to "stand out" above the high repetition of common words in the LibriSpeech dataset.
Finally, the command in the doc takes a `top_k` parameter. This `top_k` parameter makes me worry that my words will be ignored, because they might not fall within the `top_k` most frequent words if I don't have enough domain content - though it could be that I am simply misunderstanding this parameter. Thanks for any clarification!
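My understanding of `top_k` (which may be wrong - this is just how I read the docs) is that only the k most frequent words survive into the vocabulary, which a toy sketch makes easy to see:

```python
from collections import Counter

def top_k_vocab(corpus_words, k):
    """Keep only the k most frequent words - roughly what I understand
    the top_k option to do when filtering the corpus vocabulary."""
    counts = Counter(corpus_words)
    return {word for word, _ in counts.most_common(k)}

words = ["the"] * 100 + ["and"] * 80 + ["model"] * 5 + ["ketorolac"]
vocab = top_k_vocab(words, k=2)

# A rare domain word falls outside the top-k cutoff and is dropped.
print("ketorolac" in vocab)
```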