Training text on a single line. #77

Open
francisr opened this issue Oct 13, 2016 · 12 comments

@francisr (Contributor)

I've tried putting my training text on a single line to simulate SRI's continuous-ngram-count, and it worked fine for creating an ARPA LM (though I can't split the counts then). However, after pruning, <s> doesn't appear in the ARPA LM (</s> is still there, though), which is annoying in some applications.
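(A minimal sketch of the preprocessing described above, i.e. collapsing the whole training corpus onto a single line before counting; the file names are placeholders, not from this thread.)

```sh
# Flatten all sentences onto one line so that n-gram counting sees a single
# continuous word stream with no internal sentence boundaries.
tr '\n' ' ' < train.txt > train_oneline.txt
```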

@danpovey (Owner)

Is there a compelling reason why you need that feature? And would it be
feasible to merge into medium-length lines, like 100 words?
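(A minimal sketch of the merging suggested here, assuming plain whitespace-separated text; the 100-word limit and file names are only illustrative.)

```sh
# Re-wrap the corpus so that each output line holds at most 100 words.
awk '{ for (i = 1; i <= NF; i++) printf "%s%s", $i, (++n % 100 ? " " : "\n") }
     END { if (n % 100) print "" }' train.txt > train_100w.txt
```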

@francisr (Contributor, Author)

It's so this LM can be used with http://www.speech.sri.com/projects/srilm/manpages/hidden-ngram.1.html to add end-of-sentence markers.
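(A hedged sketch of that use case: the flags are from the hidden-ngram(1) man page linked above, but the exact invocation and file names here are illustrative, not from this thread. hidden.vocab would list the hidden event to be recovered, e.g. </s>.)

```sh
# Tag likely sentence boundaries in unsegmented text using the trained LM.
hidden-ngram -lm lm.arpa -hidden-vocab hidden.vocab -text unsegmented.txt > segmented.txt
```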

@danpovey (Owner)

Hm.
My feeling is that if it's part of a pipeline involving other SRILM tools, it might be better to stay within the SRILM universe.
Our more recent experiments have actually failed to show a super-compelling improvement of pocolm versus SRILM.
The place where there was originally a compelling improvement was in highly pruned models, but it turns out that if you use Good-Turing estimation in SRILM, then the highly pruned SRILM models are almost as good as the pocolm ones. Now, Good-Turing doesn't work as well with unpruned or lightly pruned models, but in that case you can use SRILM. There is a region in the middle [moderately pruned models] where pocolm is a fair bit better than SRILM's Kneser-Ney or Good-Turing, but that may not be enough to justify the added hassle.

Dan
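(For reference, a minimal sketch of the SRILM side of the comparison described above; file names and the pruning threshold are placeholders. ngram-count uses Good-Turing discounting by default, and ngram -prune performs entropy-based pruning.)

```sh
# Train a trigram LM with SRILM's default Good-Turing discounting.
ngram-count -order 3 -text train.txt -lm lm_gt.arpa

# Prune it down to the desired size.
ngram -order 3 -lm lm_gt.arpa -prune 1e-7 -write-lm lm_gt_pruned.arpa
```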

@francisr (Contributor, Author)

With my current experiments pocolm still seems to be worth it. Do you think that the efficiency can depend on the size of the training set?

Also, there is the licensing point of view: I use SRILM just for reference; I have another set of tools to do the same things.

@danpovey (Owner)

What kind of perplexity improvements are you seeing versus SRILM, and in
what kind of scenario (e.g. how many training sets; how much data; what
level of pruning?)

@vince62s (Contributor)

By the way, until recently I didn't know about this: http://www.speech.sri.com/pipermail/srilm-user/2010q3/000928.html, but it works fine.
E.g., pocolm and SRILM are in line in the scenario out-of-domain = Cantab text, in-domain = TED corpus.
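(A hedged sketch of the trick described in that srilm-user post, based on the -prune-history-lm option mentioned later in this thread; file names and the threshold are placeholders: prune a Kneser-Ney model while taking the history marginals from a separately estimated Good-Turing model.)

```sh
# Prune the KN model, but supply history statistics from a GT model
# (option name as discussed in this thread; check your SRILM version).
ngram -order 3 -lm lm_kn.arpa -prune 1e-7 -prune-history-lm lm_gt.arpa -write-lm lm_kn_pruned.arpa
```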

@danpovey (Owner)

In our experiments we did not see that the --prune-history-lm was that helpful; we found it was best to just use Good-Turing LMs throughout. But it could be we did something wrong.
Dan

@francisr (Contributor, Author)

I have trained a trigram on one training set with 1.5G words, and I prune it to about 1M n-grams. On the test sets I get:
pocolm: 153 ppl with 1,310,647 n-grams
pocolm: 159 ppl with 1,034,962 n-grams
SRILM: 160 ppl with 1,263,941 n-grams
I haven't yet tried using multiple training sets.

@vince62s I did some tests with that at some point; IIRC it didn't bring much improvement at the level of pruning I used.

@francisr (Contributor, Author)

Btw, when it doesn't print <s> in the ARPA file, it still prints the ngram 1= count in the header as if it were there.
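(To illustrate the mismatch described above, a made-up ARPA fragment with hypothetical counts and probabilities: the \data\ header still claims four unigrams, but only three entries follow in the \1-grams: section because <s> was dropped by pruning.)

```
\data\
ngram 1=4
ngram 2=17

\1-grams:
-1.2	</s>
-1.5	foo	-0.3
-1.7	bar	-0.4
```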

@danpovey (Owner)

Remi, under what circumstances does it not print the <s> in the unigram
section of the arpa file?

And were those SRILM results with Good-Turing smoothing or Kneser-Ney?

@francisr (Contributor, Author)

It's when I have the whole training text on one line and then prune the LM.
The SRILM results are with Good-Turing.

@danpovey (Owner)

Can you please see if that PR fixes the issue? It will only be necessary to re-run format_arpa_lm.py or whatever it's called, after compiling.
