Training text on a single line. #77
Is there a compelling reason why you need that feature? And would it be […]
It's so this LM can be used with […]
Hm.

Dan
With my current experiments pocolm still seems to be worth it. Do you think […]

Also there is the licensing point of view; I use SRILM just for reference […]
What kind of perplexity improvements are you seeing versus SRILM, and in […]
Btw, until recently I didn't know this: http://www.speech.sri.com/pipermail/srilm-user/2010q3/000928.html
In our experiments we did not see that the --prune-history-lm was that […]
I have trained a trigram on one training set with 1.5G words, and I prune it to about 1M ngrams. On the test sets I get: […]

@vince62s I did some tests with that at some point; IIRC it didn't bring me much improvement with the level of pruning I used.
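For reference, a minimal sketch of the kind of pruning being discussed here, using SRILM's ngram tool; the file names and the 1e-7 threshold are placeholders, and the second command shows the -prune-history-lm variant mentioned above.

```sh
# Hedged sketch: entropy-prune an existing ARPA trigram with SRILM's ngram tool.
# File names and the threshold are placeholders; tune the threshold to reach the
# target size (e.g. ~1M n-grams).
ngram -order 3 -lm full_trigram.arpa -prune 1e-7 -write-lm pruned_trigram.arpa

# Variant with -prune-history-lm: take the history marginals from a separate LM
# (e.g. a Good-Turing model) instead of the Kneser-Ney model being pruned.
ngram -order 3 -lm kn_trigram.arpa -prune 1e-7 \
      -prune-history-lm gt_trigram.arpa -write-lm pruned_kn_trigram.arpa
```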
Btw, when it doesn't print […]
Remi, under what circumstances does it not print the <s>? And were those SRILM results with Good-Turing smoothing or Kneser-Ney?
It's when I have the whole training text on one line, and then prune the LM.
Can you please see if that PR fixes the issue? It will only be necessary to re-run format_arpa_lm.py or whatever it's called, after compiling.
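A quick way to check whether the fix took effect after recompiling and regenerating the ARPA output; the file name lm.arpa is a placeholder, and the exact format_arpa_lm.py invocation may differ from what is shown in the comment.

```sh
# Placeholder file name: regenerate the ARPA file first (e.g. via format_arpa_lm.py),
# then confirm that <s> is back in the model.
grep -c "<s>"  lm.arpa   # should be at least 1 (it appears in the \1-grams: section)
grep -c "</s>" lm.arpa
```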
I've tried putting my training text on a single line to simulate SRI's continuous-ngram-count, and it worked fine to create an ARPA LM (though I can't split the counts then). However, after pruning, <s> doesn't appear in the ARPA LM (</s> is still there though), which is annoying in some applications.
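For reference, one way to reproduce the single-line setup described above with standard Unix tools; the file names are placeholders.

```sh
# Collapse all training text onto a single line to mimic continuous n-gram counting;
# train.txt and train_oneline.txt are placeholder names.
{ tr '\n' ' ' < train.txt; echo; } > train_oneline.txt   # echo restores the final newline
wc -l train_oneline.txt                                  # should report 1
```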