Dictionary and text corpus #66

Closed
skalinin opened this issue Dec 9, 2022 · 2 comments

skalinin commented Dec 9, 2022

Hi! Thanks for the repo!

I see in the documentation that there is no argument for a dictionary; there is only an argument for the text corpus, which is used to create both the dictionary and the word LM:

> Text (corpus): is given as a UTF8 encoded string. The operation creates its dictionary and (optionally) LM from it

So my lexicon is restricted with words from my text corpus. Am I getting it right? Or is there a way to pass both a dictionary and corpus in CTCWordBeamSearch?

weinman commented Dec 9, 2022

> So my lexicon is restricted with words from my text corpus.

Yes, I believe that's right (unless you modify the code somehow).

Since the language model is built from the corpus of words, if a dictionary were included, the language model would not have information about any dictionary words not present in the corpus.

Of course, there are standard methods one could implement to deal with this (e.g., smoothing, back-off) if it were essential. The code does seem to support smoothing for query n-grams not present in the training corpus, so it might not be too difficult to modify it for your needs, adding non-corpus words to the lexicon without using them in the bigram calculation.
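To make the point about missing dictionary words concrete, here is a minimal sketch in plain Python. It is not the repo's code: the function names, the toy corpus, and the choice of add-k smoothing are all assumptions made for illustration, and the library's own smoothing may work differently. It shows that a dictionary-only word gets zero bigram probability from corpus counts alone, and a small nonzero probability once smoothing is applied over the combined lexicon.

```python
from collections import Counter


def train_bigram_counts(corpus: str):
    """Count contexts and bigrams in a whitespace-separated corpus."""
    words = corpus.split()
    contexts = Counter(words[:-1])            # how often each word starts a bigram
    bigrams = Counter(zip(words, words[1:]))  # how often each word pair occurs
    return contexts, bigrams


def bigram_prob(w1, w2, contexts, bigrams, vocab_size, k=0.0):
    """Add-k smoothed P(w2 | w1); k=0 gives the unsmoothed estimate."""
    denom = contexts[w1] + k * vocab_size
    if denom == 0:
        return 0.0
    return (bigrams[(w1, w2)] + k) / denom


# Hypothetical toy data: "bird" is in the dictionary but not in the corpus.
corpus = "a cat sat a dog sat"
extra_dictionary = ["bird"]

contexts, bigrams = train_bigram_counts(corpus)
lexicon = set(corpus.split()) | set(extra_dictionary)
V = len(lexicon)

# Without smoothing, the LM knows nothing about "bird".
unsmoothed = bigram_prob("a", "bird", contexts, bigrams, V, k=0.0)

# With add-k smoothing over the combined lexicon, "bird" becomes reachable.
smoothed = bigram_prob("a", "bird", contexts, bigrams, V, k=0.1)

print(unsmoothed, smoothed)
```

With smoothing, the probabilities for a given context still sum to one over the combined lexicon, so the extra dictionary words take a small share of the mass from the words actually seen in the corpus.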

But if you don't need a language model, you can just pass in a dictionary as one long string of words, using the "Words" option for the scoring mode (lm_type).
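A minimal sketch of building such a string from a word list, assuming the words are separated by spaces and the result is UTF-8 encoded as the documentation quoted above describes. The helper name and the example words are hypothetical, and the exact constructor call is left out on purpose; check the repo's README for it.

```python
def build_dictionary_corpus(words):
    """Join unique, non-empty words with single spaces, keeping first-seen order."""
    seen = set()
    unique = []
    for w in words:
        w = w.strip()
        if w and w not in seen:
            seen.add(w)
            unique.append(w)
    return " ".join(unique)


# Hypothetical dictionary; in practice this could be read from a word-list file.
dictionary_words = ["cat", "dog", "bird", "dog"]

corpus = build_dictionary_corpus(dictionary_words)
corpus_bytes = corpus.encode("utf8")  # the corpus is expected as a UTF-8 encoded string

# corpus_bytes would then be passed as the text corpus argument,
# with lm_type set to "Words" so that no language model is built from it.
print(corpus)
```

Deduplicating is optional here, since word frequencies should not matter when no LM is built; it only keeps the string short.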

skalinin commented Dec 9, 2022

Thanks for the detailed answer!

I'll use only a dictionary without a language model for now.

skalinin closed this as completed Dec 9, 2022