I see in the documentation that there is no argument for a dictionary. There is only an argument for a text corpus, which is used to create both the dictionary and the word LM:
Text (corpus): is given as a UTF8 encoded string. The operation creates its dictionary and (optionally) LM from it
So my lexicon is restricted to the words in my text corpus. Am I understanding this correctly? Or is there a way to pass both a dictionary and a corpus to CTCWordBeamSearch?
> So my lexicon is restricted to the words in my text corpus.
Yes, I believe that's right, unless you modify the code.
Since the language model is built from the corpus of words, if a dictionary were included, the language model would not have information about any dictionary words not present in the corpus.
Of course, there are standard methods for dealing with this (e.g., smoothing or back-off) that one could implement if it were essential. The code does seem to support smoothing for query n-grams that are not present in the training corpus, so it might not be too difficult to modify it to add non-corpus words to the lexicon without using them in the bigram calculation.
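To make the idea concrete, here is a minimal sketch of a bigram model whose lexicon includes dictionary words that never occur in the corpus. It is not the repo's implementation: the function name `build_bigram_model`, the add-k smoothing scheme, and the value of `k` are all assumptions chosen for illustration.

```python
from collections import Counter


def build_bigram_model(corpus_words, extra_dictionary_words, k=0.1):
    """Hypothetical helper: bigram counts come from the corpus only,
    but the lexicon also contains the extra dictionary words."""
    lexicon = set(corpus_words) | set(extra_dictionary_words)
    # How often each word appears as the left side of a bigram.
    context_counts = Counter(corpus_words[:-1])
    bigram_counts = Counter(zip(corpus_words, corpus_words[1:]))
    vocab_size = len(lexicon)

    def prob(prev, word):
        # Add-k smoothing (an assumption, not necessarily what the repo does):
        # unseen pairs, including dictionary-only words, get a small
        # nonzero probability instead of zero.
        return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

    return lexicon, prob


corpus = "the cat sat on the mat".split()
dictionary = ["dog", "cat"]  # "dog" never occurs in the corpus
lexicon, prob = build_bigram_model(corpus, dictionary)
```

With this setup, `prob("the", "dog")` is small but nonzero, while `prob("the", "cat")` is larger because that pair was actually observed in the corpus.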
But if you don't need a language model, you can pass in the dictionary as one long string of words and set the scoring mode (lm_type) to "Words".
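As a rough sketch of that workaround, a word list can be flattened into a single UTF-8 encoded string. The helper name `dictionary_to_corpus` is made up for this example, and it assumes that a space is not one of your word characters, so that spaces separate the words.

```python
def dictionary_to_corpus(words):
    """Hypothetical helper: join dictionary words into one
    space-separated, UTF-8 encoded string."""
    # Strip whitespace, drop empty entries, and remove duplicates.
    unique_words = sorted({w.strip() for w in words if w.strip()})
    return " ".join(unique_words).encode("utf8")


corpus_bytes = dictionary_to_corpus(["hello", "world", "hello", " wörld "])

# corpus_bytes would then be passed as the text (corpus) argument,
# with the scoring mode (lm_type) set to "Words". The exact call
# depends on which version of the decoder you are using.
```

Duplicates are dropped here only to keep the string short; in "Words" mode no word statistics are needed, so repeated entries should not matter.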