Training text on a single line. #77

Open
francisr opened this issue Oct 13, 2016 · 12 comments

@francisr (Contributor)

I've tried putting my training text on a single line to simulate SRI's continuous-ngram-count, and it worked fine for creating an ARPA LM (though I can't split the counts then). However, after pruning, <s> doesn't appear in the ARPA LM (</s> is still there, though), which is annoying in some applications.
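(A minimal sketch of the preprocessing described above, i.e. collapsing the whole training corpus onto a single line before counting; the file names are placeholders, not from this thread.)

```sh
# Flatten all sentences onto one line so that n-gram counting sees a single
# continuous word stream with no internal sentence boundaries.
tr '\n' ' ' < train.txt > train_oneline.txt
```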

@danpovey (Owner)

Is there a compelling reason why you need that feature? And would it be
feasible to merge into medium-length lines, like 100 words?
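(A minimal sketch of the merging suggested here, assuming plain whitespace-separated text; the 100-word limit and file names are only illustrative.)

```sh
# Re-wrap the corpus so that each output line holds at most 100 words.
awk '{ for (i = 1; i <= NF; i++) printf "%s%s", $i, (++n % 100 ? " " : "\n") }
     END { if (n % 100) print "" }' train.txt > train_100w.txt
```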

@francisr (Contributor, Author)

It's so this LM can be used with http://www.speech.sri.com/projects/srilm/manpages/hidden-ngram.1.html to add end-of-sentence markers.
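(A hedged sketch of that use case: the flags are from the hidden-ngram(1) man page linked above, but the exact invocation and file names here are illustrative, not from this thread. hidden.vocab would list the hidden event to be recovered, e.g. </s>.)

```sh
# Tag likely sentence boundaries in unsegmented text using the trained LM.
hidden-ngram -lm lm.arpa -hidden-vocab hidden.vocab -text unsegmented.txt > segmented.txt
```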

@danpovey (Owner)

Hm.
My feeling is that if it's part of a pipeline involving other SRILM tools, it might be better to stay within the SRILM universe.
Our more recent experiments have actually failed to show a super-compelling improvement of pocolm versus SRILM.
The place where there was originally a compelling improvement was in highly pruned models, but it turns out that if you use Good-Turing estimation in SRILM, then the highly pruned SRILM models are almost as good as the pocolm ones. Now, Good-Turing doesn't work as well with unpruned or lightly pruned models, but in that case you can use SRILM. There is a region in the middle [moderately pruned models] where pocolm is a fair bit better than SRILM's Kneser-Ney or Good-Turing, but that may not be enough to justify the added hassle.

Dan
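(For reference, a minimal sketch of the SRILM side of the comparison described above; file names and the pruning threshold are placeholders. ngram-count uses Good-Turing discounting by default, and ngram -prune performs entropy-based pruning.)

```sh
# Train a trigram LM with SRILM's default Good-Turing discounting.
ngram-count -order 3 -text train.txt -lm lm_gt.arpa

# Prune it down to the desired size.
ngram -order 3 -lm lm_gt.arpa -prune 1e-7 -write-lm lm_gt_pruned.arpa
```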

@francisr (Contributor, Author)

With my current experiments pocolm still seems to be worth it. Do you think that the efficiency can depend on the size of the training set?

Also, there is the licensing point of view: I use SRILM just for reference; I have another set of tools to do the same things.

@danpovey (Owner)

What kind of perplexity improvements are you seeing versus SRILM, and in
what kind of scenario (e.g. how many training sets; how much data; what
level of pruning?)

@vince62s (Contributor)

By the way, until recently I didn't know about this: http://www.speech.sri.com/pipermail/srilm-user/2010q3/000928.html, but it works fine.
E.g., pocolm and SRILM are in line in the scenario out-of-domain = Cantab text, in-domain = TED corpus.
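(A hedged sketch of the trick described in that srilm-user post, based on the -prune-history-lm option mentioned later in this thread; file names and the threshold are placeholders: prune a Kneser-Ney model while taking the history marginals from a separately estimated Good-Turing model.)

```sh
# Prune the KN model, but supply history statistics from a GT model
# (option name as discussed in this thread; check your SRILM version).
ngram -order 3 -lm lm_kn.arpa -prune 1e-7 -prune-history-lm lm_gt.arpa -write-lm lm_kn_pruned.arpa
```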

@danpovey (Owner)

In our experiments we did not see that the --prune-history-lm was that helpful; we found it was best to just use Good-Turing LMs throughout. But it could be we did something wrong.
Dan

@francisr (Contributor, Author)

I have trained a trigram on one training set with 1.5G words, and I prune it to about 1M n-grams. On the test sets I get:
pocolm: 153 ppl with 1,310,647 n-grams
pocolm: 159 ppl with 1,034,962 n-grams
SRILM: 160 ppl with 1,263,941 n-grams
I haven't yet tried using multiple training sets.

@vince62s I did some tests with that at some point; IIRC it didn't bring much improvement at the level of pruning I used.

@francisr (Contributor, Author)

Btw, when it doesn't print <s> in the ARPA file, it still prints the ngram 1= count in the header as if it were there.
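(To illustrate the mismatch described above, a made-up ARPA fragment with hypothetical counts and probabilities: the \data\ header still claims four unigrams, but only three entries follow in the \1-grams: section because <s> was dropped by pruning.)

```
\data\
ngram 1=4
ngram 2=17

\1-grams:
-1.2	</s>
-1.5	foo	-0.3
-1.7	bar	-0.4
```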

@danpovey (Owner)

Remi, under what circumstances does it not print the <s> in the unigram
section of the arpa file?

And were those SRILM results with Good-Turing smoothing or Kneser-Ney?

@francisr (Contributor, Author)

It's when I have the whole training text on one line and then prune the LM.
The SRILM results are with Good-Turing.

@danpovey (Owner)

Can you please see if that PR fixes the issue? It will only be necessary to re-run format_arpa_lm.py or whatever it's called, after compiling.
