
Language model at word level #3

Closed

marcoleewow opened this issue Nov 16, 2017 · 4 comments

Comments

marcoleewow commented Nov 16, 2017

Hi, did you add a word-level language model for beam search?

Currently it's easy to add a character-level bigram LM, but I find it much harder to add a word-level one. I tried the CTC token passing algorithm, but it's just way too slow compared to beam search.

githubharald (Owner) commented Nov 16, 2017

  1. Just checking whether the words exist would be the easiest way to go (see the sketch after this list):
    A. You could check whether the words in a beam exist in your dictionary. Each time a labelling gets extended by a whitespace in the function calcExtPr, you could check whether the last word exists; if yes, assign a probability of 1, and 0 otherwise.
    B. Or you could build a dictionary of prefixes of the dictionary words (e.g. Hello -> H, He, Hel, ...) by using a prefix tree. Then you know which beams can be extended by which characters.

  2. Using a word-level bigram LM is not that easy: you can only score neighbouring words with a bigram after both words have been fully added to the beam. But you could give it a try: score the last two words of a beam as soon as it is possible, i.e. each time a whitespace completes a word. This would at least remove beams that represent nonsense from the LM's point of view, even if the scoring happens a bit late. I think a clever combination of a word-level LM and a prefix tree could give good results and would be fast (it reduces the number of beams).
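
For illustration, here is a minimal standalone sketch of ideas 1.A and 1.B: a prefix tree that constrains which characters may extend a beam, plus the word-existence check that could be applied when a whitespace is appended (e.g. inside calcExtPr). The class and method names are made up for this sketch and are not taken from the repository.

```python
class PrefixTree:
    """Prefix tree (trie) over the dictionary words (idea 1.B)."""

    def __init__(self, words):
        self.root = {}  # nested dicts; '$' marks the end of a word
        for word in words:
            node = self.root
            for c in word:
                node = node.setdefault(c, {})
            node['$'] = True

    def _find(self, prefix):
        """Walk the tree along `prefix`; return the reached node or None."""
        node = self.root
        for c in prefix:
            node = node.get(c)
            if node is None:
                return None
        return node

    def next_chars(self, prefix):
        """Characters by which a beam ending in `prefix` may be extended."""
        node = self._find(prefix)
        return set() if node is None else {c for c in node if c != '$'}

    def is_word(self, prefix):
        """Idea 1.A: check whether the last word of a beam is a dictionary word."""
        node = self._find(prefix)
        return node is not None and node.get('$', False)


tree = PrefixTree(['Hello', 'He', 'cat'])
print(tree.next_chars('He'))  # {'l'} -> only 'l' extends 'He' towards 'Hello'
print(tree.is_word('He'))     # True  -> probability 1 when a whitespace follows
print(tree.is_word('Hel'))    # False -> probability 0 when a whitespace follows
```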

marcoleewow (Author) commented

I have done 1.A together with a long-word penalty, but this method has no word-level bigram prior knowledge, which means it is only an autocorrect.

Example: "milk the cous" are all words in the dictionary but it does not make sense, whereas the true label we want is "milk the cows".

For 2, I have tried giving bigram scores whenever I see a whitespace label, but then beams get pushed out of the beam width, and what I get is a single long word a lot of the time.
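
To make the delayed scoring from point 2 above concrete, here is a minimal sketch: each time a whitespace completes a word, the beam's last two words are rescored with a bigram LM. All names (bigram_prob, score_on_whitespace, lm_weight) are hypothetical; in particular, down-weighting the LM score via an exponent lm_weight is one common way to keep the LM from pushing beams out of the beam width, not something from the repository.

```python
def bigram_prob(prev_word, word, bigrams, unigrams, eps=1e-6):
    """P(word | prev_word) from raw counts, with naive additive smoothing."""
    return (bigrams.get((prev_word, word), 0) + eps) / (unigrams.get(prev_word, 0) + eps)

def score_on_whitespace(beam_text, beam_prob, bigrams, unigrams, lm_weight=0.5):
    """Rescore a beam once a whitespace completes its last word."""
    words = beam_text.split()
    if len(words) < 2:
        return beam_prob  # nothing to score yet
    p_lm = bigram_prob(words[-2], words[-1], bigrams, unigrams)
    return beam_prob * (p_lm ** lm_weight)  # blend LM into the beam score

# toy counts: "milk the cows" occurs in the corpus, "milk the cous" does not
unigrams = {'milk': 1, 'the': 2, 'cows': 1}
bigrams = {('milk', 'the'): 1, ('the', 'cows'): 1}
print(score_on_whitespace('milk the cows ', 0.01, bigrams, unigrams))  # ~0.007
print(score_on_whitespace('milk the cous ', 0.01, bigrams, unigrams))  # much lower
```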

Currently I am reading up on WFSTs (weighted finite-state transducers) and trying to implement a CTC decoder using WFSTs so that I can include a word-level bigram LM. Have you tried these methods?

githubharald (Owner) commented

No, I haven't tried WFSTs yet.

githubharald (Owner) commented

I've implemented an algorithm that uses beam search at the word level (dictionary, unigrams/bigrams) and runs faster than token passing: https://github.com/githubharald/CTCWordBeamSearch
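
For readers landing here later, a rough usage sketch of that repository's Python interface follows. The constructor arguments, the 'Words' mode, and the TxBxC matrix shape below are based on my reading of the repository's README; the interface has changed over time, so treat all of this as an assumption and check the repo for the current API.

```python
import numpy as np
from word_beam_search import WordBeamSearch  # module name per the repo's README (assumption)

corpus = 'milk the cows'                     # text used to build the dictionary and LM
chars = ' abcdefghijklmnopqrstuvwxyz'        # characters the RNN can output, in RNN order
word_chars = 'abcdefghijklmnopqrstuvwxyz'    # characters that form words

# beam width 25, 'Words' mode (dictionary only, no LM), LM smoothing 0.0;
# argument order and types follow the README and may have changed
wbs = WordBeamSearch(25, 'Words', 0.0, corpus.encode('utf8'),
                     chars.encode('utf8'), word_chars.encode('utf8'))

# mat: softmax output of shape TxBxC (time, batch, len(chars) + 1),
# with the CTC blank expected as the last entry
T, B, C = 100, 1, len(chars) + 1
mat = np.random.rand(T, B, C).astype(np.float32)
mat /= mat.sum(axis=2, keepdims=True)  # normalize to a probability distribution

label_str = wbs.compute(mat)  # one list of label indices per batch element
print(''.join(chars[label] for label in label_str[0]))
```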
