Dictionary and text corpus #66

skalinin · 2022-12-09T13:50:06Z

Hi! Thanks for the repo!

I see in the documentation that there is no argument for a dictionary; there is only an argument for the text corpus, which is used to create both the dictionary and the word LM:

Text (corpus): is given as a UTF8 encoded string. The operation creates its dictionary and (optionally) LM from it

So my lexicon is restricted to the words in my text corpus. Am I understanding that correctly? Or is there a way to pass both a dictionary and a corpus to CTCWordBeamSearch?

weinman · 2022-12-09T14:40:11Z

> So my lexicon is restricted with words from my text corpus.

Yes, I believe that's right (unless you modify the code somehow).

Since the language model is built from the corpus of words, if a dictionary were included, the language model would not have information about any dictionary words not present in the corpus.

Of course, there are standard methods for dealing with this that one could implement (e.g., smoothing, back-off) if it were essential. The code does seem to support smoothing for query n-grams not present in the training corpus, so it might not be too difficult to modify it to add non-corpus words to the lexicon without using them in the bigram calculation.
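To illustrate the point about dictionary-only words, here is a minimal sketch of a bigram estimate with add-k smoothing. It is not the repository's implementation; the function `bigram_prob`, the toy corpus, and the extra word "dog" are all made up for illustration. Without smoothing, a word that never appears in the corpus gets probability zero after any context; with a small k it gets a small non-zero probability.

```python
from collections import Counter


def bigram_prob(prev, word, unigrams, bigrams, vocab_size, k=0.0):
    """Add-k smoothed estimate of P(word | prev); k=0.0 means no smoothing."""
    denom = unigrams[prev] + k * vocab_size
    if denom == 0:
        return 0.0
    return (bigrams[(prev, word)] + k) / denom


# Toy corpus and a hypothetical lexicon entry that is absent from it.
corpus = "the cat sat on the mat".split()
dictionary_only = ["dog"]

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus) | set(dictionary_only)

# "cat" follows "the" once out of two occurrences of "the".
p_cat_raw = bigram_prob("the", "cat", unigrams, bigrams, len(vocab))
# "dog" is never seen, so the unsmoothed estimate is zero.
p_dog_raw = bigram_prob("the", "dog", unigrams, bigrams, len(vocab))
# With add-k smoothing the unseen word gets a small non-zero probability.
p_dog_smooth = bigram_prob("the", "dog", unigrams, bigrams, len(vocab), k=0.1)
```

Whether the smoothing in the actual code behaves like this for words outside the corpus would need to be checked against the source.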

But if you don't need a language model, you can just pass in a dictionary as a giant string of words using the "Words" option for the Scoring mode (lm_type).
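As a rough sketch of the "giant string of words" idea, the snippet below turns a word list into a single whitespace-separated, UTF-8 encoded string. The helper `build_dictionary_string` and the sample words are hypothetical. The commented-out constructor call reflects my recollection of the Python binding and should be checked against the README before use.

```python
def build_dictionary_string(words):
    """Join unique, non-empty words with single spaces, keeping first-seen order."""
    seen = set()
    unique = []
    for w in words:
        w = w.strip()
        if w and w not in seen:
            seen.add(w)
            unique.append(w)
    return " ".join(unique)


# Hypothetical lexicon with a duplicate, stray whitespace, and an empty entry.
lexicon = ["hello", "world", "hello", " beam ", ""]
dictionary_text = build_dictionary_string(lexicon)

# The documentation quoted above says the text is given as a UTF8 encoded string.
dictionary_bytes = dictionary_text.encode("utf8")

# Assumed usage (verify the argument order against the README):
# wbs = WordBeamSearch(beam_width, "Words", 0.0, dictionary_bytes,
#                      chars.encode("utf8"), word_chars.encode("utf8"))
```

In "Words" mode only the set of words matters, so duplicates and word order in the string should not change the result.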

skalinin · 2022-12-09T14:58:47Z

Thanks for the detailed answer!

I'll use only a dictionary, without a language model, for now.

skalinin closed this as completed Dec 9, 2022
