
StartPos and EndPos for ngram log probability #19

Closed
GoogleCodeExporter opened this issue Jul 16, 2015 · 3 comments

Comments

@GoogleCodeExporter

Hi

I am a bit confused about how to find the log probabilities of ngrams. The code from
PerplexityTest.java looks like this:
for (i = 1; i <= sent.length - lm_.getLmOrder(); ++i) {
    final float score = lm_.getLogProb(sent, i, i + lm_.getLmOrder());
    sentScore += score;
}
What I am not getting is why the loop starts from 1, why the end position is i + 
lm_.getLmOrder(), and why sent is only the number of words in the line + 2. 

Ideally I was expecting sent to be the number of words in the line + 3. So if I have the 
sentence "Hello how are you", sent should be START START Hello how are you STOP. 
The first trigram would then be "START START Hello", so to find its log probability 
I would use startPos 0 and endPos 2. The last trigram would be "are you STOP", with 
startPos 4 and endPos 6.

Obviously I am making some assumptions here. I tried to dig into the code to prove 
myself wrong, but unfortunately could not glean much in this context.

I will be grateful for any help on this.

Regards
Deb

Original issue reported on code.google.com by db12...@my.bristol.ac.uk on 20 Mar 2014 at 9:03

@GoogleCodeExporter
Author

This is by design. We actually generate the first word given START, so the 
first generation is a bigram. This is the way SRILM models things, and it is 
generally accepted, as far as I know. 

If the model were a 5-gram, then the first word would be generated as a bigram, 
the second as a trigram, and so on. 
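For concreteness, the ramp-up described above can be sketched as a small standalone example. (ContextRamp and contextLen are hypothetical names, not part of the library; this only illustrates the indexing under the stated assumption of a single START marker at position 0.)

```java
// A minimal sketch (not BerkeleyLM API) of the context ramp-up described
// above: with an n-gram order N, the word at position i is conditioned on
// min(i, N - 1) preceding tokens, so the first word is scored as a bigram,
// the second as a trigram, and so on.
public class ContextRamp {

    // Number of preceding tokens used as context for the word at
    // position i (position 0 holds the START marker).
    static int contextLen(int i, int order) {
        return Math.min(i, order - 1);
    }

    public static void main(String[] args) {
        String[] sent = {"<s>", "Hello", "how", "are", "you", "</s>"};
        int order = 5; // a 5-gram model
        for (int i = 1; i < sent.length; i++) {
            System.out.println(sent[i] + " is conditioned on "
                    + contextLen(i, order) + " preceding token(s)");
        }
    }
}
```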

Original comment by adpa...@gmail.com on 20 Mar 2014 at 6:30

  • Changed state: WontFix

@GoogleCodeExporter
Author

OK - thanks a lot for that info. A few questions on the design:

1. Can you let me know why the loop starts from i=1 rather than i=0?

2. Also, why is it i + lm_.getLmOrder()? Is the endPos non-inclusive? 
E.g. if i is 1 and getLmOrder() is 3, startPos and endPos will be 1 and 4. If I 
am looking for trigrams, I would have thought startPos and endPos should be 1 
and 3.

3. I am loading the Google Books binary in my code and need only trigram log 
probabilities. Going by your explanation of using a bigram at the start of the 
sentence, should I pass 0,1 as the startPos and endPos of the first n-gram? And 
then 0,2 as the startPos and endPos of the next n-gram, followed by 1,3?

Would be grateful again for your help on this.

Original comment by db12...@my.bristol.ac.uk on 21 Mar 2014 at 5:36

@GoogleCodeExporter
Author

1. The sentence already has start markers, and we don't generate START.

2. endPos is non-inclusive, you're right.

3. I'm confused about why you are writing your own code. What exactly is the score 
you want to compute? Why can't you just use ComputeLogProbabilityOfTextStream?
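Since endPos is confirmed to be non-inclusive above, the PerplexityTest-style loop can be sketched as enumerating half-open [startPos, endPos) windows. (WindowDemo and windows are hypothetical names for illustration; only the loop bounds mirror the snippet quoted at the top of the thread.)

```java
import java.util.ArrayList;
import java.util.List;

public class WindowDemo {

    // Enumerate the half-open [startPos, endPos) windows that the
    // PerplexityTest-style loop passes to getLogProb: each window covers
    // exactly `order` tokens, because endPos is exclusive.
    static List<int[]> windows(int sentLength, int order) {
        List<int[]> out = new ArrayList<>();
        for (int i = 1; i <= sentLength - order; i++) {
            out.add(new int[]{i, i + order});
        }
        return out;
    }

    public static void main(String[] args) {
        // <s> Hello how are you </s>  ->  sentLength 6, trigram model
        for (int[] w : windows(6, 3)) {
            System.out.println("window [" + w[0] + ", " + w[1]
                    + ") covers " + (w[1] - w[0]) + " tokens");
        }
    }
}
```

With sentLength 6 and order 3 this yields the windows [1, 4), [2, 5), and [3, 6), which is why a window of "1 and 4" still denotes a trigram.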

Original comment by adpa...@gmail.com on 7 Sep 2014 at 7:05
