
Language model at word level #3

Closed

marcoleewow opened this issue Nov 16, 2017 · 4 comments

Comments

marcoleewow commented Nov 16, 2017

Hi, did you add a word-level language model for beam search?

Currently it's easy to add a character-level bigram LM, but I find it much harder to add a word-level one. I tried the CTC token passing algorithm, but it's just way too slow compared to beam search.

githubharald (Owner) commented Nov 16, 2017

  1. Just checking whether the words exist would be the easiest way to go (see the sketch after this list):
    A. You could check whether the words in a beam exist in your dictionary. Each time a labelling gets extended by a whitespace in the function calcExtPr, you could check whether the last word exists; if yes, assign a probability of 1, and 0 otherwise.
    B. Or you could build a dictionary of prefixes of the dictionary words (e.g. Hello -> H, He, Hel, ...) by using a prefix tree. Then you know which beams can be extended by which characters.

  2. Using a word-level bigram LM is not that easy: you can only score neighbouring words with a bigram after both words have been fully added to the beam. But you could give it a try: score the last two words of a beam as soon as it is possible, i.e. each time a whitespace completes a word. This would at least remove beams that represent nonsense from the LM's point of view, even if the scoring happens a bit late. I think a clever combination of a word-level LM and a prefix tree could give good results and would be fast (it reduces the number of beams).
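
For illustration, here is a minimal standalone sketch of ideas 1.A and 1.B: a prefix tree that constrains which characters may extend a beam, plus the word-existence check that could be applied when a whitespace is appended (e.g. inside calcExtPr). The class and method names are made up for this sketch and are not taken from the repository.

```python
class PrefixTree:
    """Prefix tree (trie) over the dictionary words (idea 1.B)."""

    def __init__(self, words):
        self.root = {}  # nested dicts; '$' marks the end of a word
        for word in words:
            node = self.root
            for c in word:
                node = node.setdefault(c, {})
            node['$'] = True

    def _find(self, prefix):
        """Walk the tree along `prefix`; return the reached node or None."""
        node = self.root
        for c in prefix:
            node = node.get(c)
            if node is None:
                return None
        return node

    def next_chars(self, prefix):
        """Characters by which a beam ending in `prefix` may be extended."""
        node = self._find(prefix)
        return set() if node is None else {c for c in node if c != '$'}

    def is_word(self, prefix):
        """Idea 1.A: check whether the last word of a beam is a dictionary word."""
        node = self._find(prefix)
        return node is not None and node.get('$', False)


tree = PrefixTree(['Hello', 'He', 'cat'])
print(tree.next_chars('He'))  # {'l'} -> only 'l' extends 'He' towards 'Hello'
print(tree.is_word('He'))     # True  -> probability 1 when a whitespace follows
print(tree.is_word('Hel'))    # False -> probability 0 when a whitespace follows
```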

marcoleewow (Author) commented

I have done 1.A together with a long-word penalty, but this method has no word-level bigram prior knowledge, which means it is only an autocorrect.

Example: "milk the cous" are all words in the dictionary but it does not make sense, whereas the true label we want is "milk the cows".

For 2, I have tried giving bigram scores whenever I see a whitespace label, but then beams get pushed out of the beam width, and what I get is a single long word a lot of the time.
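
To make the delayed scoring from point 2 above concrete, here is a minimal sketch: each time a whitespace completes a word, the beam's last two words are rescored with a bigram LM. All names (bigram_prob, score_on_whitespace, lm_weight) are hypothetical; in particular, down-weighting the LM score via an exponent lm_weight is one common way to keep the LM from pushing beams out of the beam width, not something from the repository.

```python
def bigram_prob(prev_word, word, bigrams, unigrams, eps=1e-6):
    """P(word | prev_word) from raw counts, with naive additive smoothing."""
    return (bigrams.get((prev_word, word), 0) + eps) / (unigrams.get(prev_word, 0) + eps)

def score_on_whitespace(beam_text, beam_prob, bigrams, unigrams, lm_weight=0.5):
    """Rescore a beam once a whitespace completes its last word."""
    words = beam_text.split()
    if len(words) < 2:
        return beam_prob  # nothing to score yet
    p_lm = bigram_prob(words[-2], words[-1], bigrams, unigrams)
    return beam_prob * (p_lm ** lm_weight)  # blend LM into the beam score

# toy counts: "milk the cows" occurs in the corpus, "milk the cous" does not
unigrams = {'milk': 1, 'the': 2, 'cows': 1}
bigrams = {('milk', 'the'): 1, ('the', 'cows'): 1}
print(score_on_whitespace('milk the cows ', 0.01, bigrams, unigrams))  # ~0.007
print(score_on_whitespace('milk the cous ', 0.01, bigrams, unigrams))  # much lower
```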

Currently I am reading up on WFSTs (weighted finite-state transducers) and trying to implement a CTC decoder using WFSTs so that I can include a word-level bigram LM. Have you tried these methods?

githubharald (Owner) commented

No, I haven't tried WFSTs yet.

githubharald (Owner) commented

I've implemented an algorithm that uses beam search at the word level (dictionary, unigrams/bigrams) and runs faster than token passing: https://github.com/githubharald/CTCWordBeamSearch
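
For readers landing here later, a rough usage sketch of that repository's Python interface follows. The constructor arguments, the 'Words' mode, and the TxBxC matrix shape below are based on my reading of the repository's README; the interface has changed over time, so treat all of this as an assumption and check the repo for the current API.

```python
import numpy as np
from word_beam_search import WordBeamSearch  # module name per the repo's README (assumption)

corpus = 'milk the cows'                     # text used to build the dictionary and LM
chars = ' abcdefghijklmnopqrstuvwxyz'        # characters the RNN can output, in RNN order
word_chars = 'abcdefghijklmnopqrstuvwxyz'    # characters that form words

# beam width 25, 'Words' mode (dictionary only, no LM), LM smoothing 0.0;
# argument order and types follow the README and may have changed
wbs = WordBeamSearch(25, 'Words', 0.0, corpus.encode('utf8'),
                     chars.encode('utf8'), word_chars.encode('utf8'))

# mat: softmax output of shape TxBxC (time, batch, len(chars) + 1),
# with the CTC blank expected as the last entry
T, B, C = 100, 1, len(chars) + 1
mat = np.random.rand(T, B, C).astype(np.float32)
mat /= mat.sum(axis=2, keepdims=True)  # normalize to a probability distribution

label_str = wbs.compute(mat)  # one list of label indices per batch element
print(''.join(chars[label] for label in label_str[0]))
```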
