StartPos and EndPos for ngram log probability #19

GoogleCodeExporter · 2015-07-16T16:15:54Z

Hi

I am a bit confused about how to find the log probabilities of n-grams. In 
PerplexityTest.java the code looks like this:

    for (i = 1; i <= sent.length - lm_.getLmOrder(); ++i) {
        final float score = lm_.getLogProb(sent, i, i + lm_.getLmOrder());
        sentScore += score;
    }

What I don't understand is why the loop starts from 1, why the end position is 
i + lm_.getLmOrder(), and why sent has only (number of words in the line) + 2 
entries.

I was expecting sent to have (number of words in the line) + 3 entries. So for 
the sentence "Hello how are you", sent would be START START Hello how are you 
STOP, and the first trigram would be "START START Hello". To find the log 
probability of the first trigram I would then use startpos 0 and endpos 2. The 
last trigram would be "are you STOP", with startpos 4 and endpos 6.

Obviously I am making some assumptions here. I dug through the code to check 
them, but unfortunately could not work out how this part is meant to behave.

I will be grateful for any help on this.

Regards
Deb

Original issue reported on code.google.com by db12...@my.bristol.ac.uk on 20 Mar 2014 at 9:03


GoogleCodeExporter · 2015-07-16T16:15:54Z

This is by design. We actually generate the first word given START, so the 
first generation is a bigram. This is the way SRILM models things, and it is 
generally accepted, as far as I know. 

If the model were a 5-gram, then the first word would be generated as a bigram, 
the second as a trigram, and so on.

Original comment by adpa...@gmail.com on 20 Mar 2014 at 6:30
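
To make the scheme described above concrete, here is a minimal sketch that only enumerates index pairs and does not call the library. It assumes a sentence array with one start marker at position 0 and one end marker at the last position, and it assumes the end position is exclusive. The class NgramSpans and the method spans are hypothetical names made up for this illustration; they are not part of berkeleylm.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramSpans {
    /**
     * Returns one {startPos, endPos} pair per generated token, with endPos
     * exclusive. Hypothetical helper for illustration only.
     */
    public static List<int[]> spans(int sentLength, int order) {
        List<int[]> result = new ArrayList<>();
        // Position 0 holds the start marker, which is never generated itself,
        // so the first generated token is at position 1.
        for (int target = 1; target < sentLength; ++target) {
            int endPos = target + 1;                    // one past the generated token
            int startPos = Math.max(0, endPos - order); // context is clipped at the start marker
            result.add(new int[] {startPos, endPos});
        }
        return result;
    }

    public static void main(String[] args) {
        // Six entries: start marker, "Hello", "how", "are", "you", end marker.
        for (int[] span : spans(6, 3)) {
            System.out.println(span[0] + "," + span[1]);
        }
    }
}
```

For a six-entry sentence and order 3 this prints 0,2 then 0,3 then 1,4 then 2,5 then 3,6: the first span covers two tokens (a bigram), and every later span is a full trigram window.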

Changed state: WontFix

GoogleCodeExporter · 2015-07-16T16:15:54Z

OK, thanks a lot for that info. A few questions on the design:

1. Can you let me know why the loop starts from i=1 rather than i=0? 

2. Also, why is it i + lm_.getLmOrder()? Is the endPos non-inclusive? 
E.g. if i is 1 and getLmOrder() is 3, startpos and endpos will be 1 and 4. If I 
am looking for trigrams, I would have thought startpos and endpos should be 1 
and 3.

3. I am loading the Google Books binary in my code and need only trigram log 
probabilities. Going by your explanation of using a bigram for the start of the 
sentence, should I pass 0,1 as the startpos and endpos of the first gram, then 
0,2 as the startpos and endpos of the next gram, and then follow it up with 
1,3?

I would be grateful again for your help on this.

Original comment by db12...@my.bristol.ac.uk on 21 Mar 2014 at 5:36

GoogleCodeExporter · 2015-07-16T16:15:54Z

1. The sentence already has start markers, and we don't generate start.

2. endPos is non-inclusive, you're right.

3. I'm confused about why you are writing your own code. What exactly is the 
score you want to compute? Why can't you just use 
ComputeLogProbabilityOfTextStream?

Original comment by adpa...@gmail.com on 7 Sep 2014 at 7:05
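
As a small illustration of the non-inclusive endPos convention confirmed in point 2, the sketch below prints the tokens that a (startPos, endPos) pair covers when endPos is exclusive. It uses only the standard library. The class HalfOpenSpan, the method window, and the marker strings "<s>" and "</s>" are assumptions chosen for this example, not names taken from berkeleylm.

```java
import java.util.Arrays;

public class HalfOpenSpan {
    /**
     * Joins the tokens in the half-open range [startPos, endPos).
     * Hypothetical helper for illustration only.
     */
    public static String window(String[] sent, int startPos, int endPos) {
        // Arrays.copyOfRange also treats its upper bound as exclusive.
        return String.join(" ", Arrays.copyOfRange(sent, startPos, endPos));
    }

    public static void main(String[] args) {
        // Assumed marker strings; the real start and end symbols may differ.
        String[] sent = {"<s>", "Hello", "how", "are", "you", "</s>"};
        System.out.println(window(sent, 0, 2)); // bigram that generates the first word
        System.out.println(window(sent, 1, 4)); // trigram: endPos - startPos == 3
        System.out.println(window(sent, 3, 6)); // trigram that generates the end marker
    }
}
```

Under this convention the span 1,4 covers three tokens ("Hello how are"), so endPos - startPos equals the n-gram length, and the bigram for the first word would be 0,2 rather than 0,1.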

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Jul 16, 2015

GoogleCodeExporter closed this as completed Jul 16, 2015
