Hi
I am a bit confused about how to find the log probabilities of n-grams. The
code in PerplexityTest.java looks like this:
for (i = 1; i <= sent.length - lm_.getLmOrder(); ++i) {
final float score = lm_.getLogProb(sent, i, i + lm_.getLmOrder());
sentScore += score;
}
What I do not understand is why the loop starts from 1, why the end position is
i + lm_.getLmOrder(), and why sent has only (number of words in the line + 2)
entries. I was expecting sent to have (number of words in the line + 3)
entries. So for the sentence "Hello how are you", sent would be
START START Hello how are you STOP. The first trigram would then be
START START Hello, and to find its log probability I would use startpos 0 and
endpos 2. The last trigram would be "are you STOP", with startpos 4 and
endpos 6.
Obviously I am making some assumptions here. I dug through the code to check
them, but unfortunately could not find anything that settles the question.
I will be grateful for any help on this.
Regards
Deb
Original issue reported on code.google.com by db12...@my.bristol.ac.uk on 20 Mar 2014 at 9:03
This is by design. We actually generate the first word given START, so the
first generation is a bigram. This is the way SRILM models things, and it is
generally accepted, as far as I know.
If the model were a 5-gram, then the first word would be generated as a bigram,
the second as a trigram, and so on.
Original comment by adpa...@gmail.com on 20 Mar 2014 at 6:30
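To make the reply above concrete, here is a minimal sketch of which index spans would be scored if the first word is generated as a bigram and the context then grows up to the model order. It does not call BerkeleyLM at all: the class NgramSpans and the method spans are hypothetical names used only for illustration, the sentence is assumed to carry one START and one STOP marker, and end positions are treated as exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of BerkeleyLM: it only enumerates index spans.
class NgramSpans {
    /**
     * Returns half-open [start, end) spans, one per generated token.
     * Position 0 is assumed to hold START, which is never generated,
     * so the first span ends at 2 (START plus the first word).
     */
    static List<int[]> spans(int sentLength, int order) {
        List<int[]> result = new ArrayList<>();
        for (int end = 2; end <= sentLength; ++end) {
            int start = Math.max(0, end - order); // context is clipped at START
            result.add(new int[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sent = {"START", "Hello", "how", "are", "you", "STOP"};
        for (int[] s : spans(sent.length, 3)) {
            String ngram = String.join(" ", Arrays.copyOfRange(sent, s[0], s[1]));
            System.out.println(s[0] + "," + s[1] + " -> " + ngram);
        }
    }
}
```

For a trigram model this lists one bigram span (START Hello) followed by trigram spans; with order 5 the spans would grow from two tokens up to five before sliding.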
Ok - thanks a lot for that info. A few questions on the design:
1. Can you let me know why the loop starts from i=1, rather than i=0?
2. Also, why is it i + lm_.getLmOrder()? Is the endPos non-inclusive?
E.g., if i is 1 and getLmOrder() is 3, startpos and endpos will be 1 and 4. If I
am looking for trigrams, I would have thought startpos and endpos should be 1
and 3.
3. I am loading the Google Books binary in my code and need only trigram log
probabilities. Going by your explanation of using a bigram for the start of the
sentence, should I pass 0,1 as the startpos and endpos of the first gram? And
then 0,2 as the startpos and endpos of the next gram, followed by 1,3?
I would be grateful again for your help on this.
Original comment by db12...@my.bristol.ac.uk on 21 Mar 2014 at 5:36
1. The sentence already has start markers, and we don't generate start.
2. endPos is non-inclusive, you're right.
3. I'm confused about why you are writing your own code. What exactly is the
score you want to compute? Why can't you just use
ComputeLogProbabilityOfTextStream?
Original comment by adpa...@gmail.com on 7 Sep 2014 at 7:05
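As a small illustration of the non-inclusive endPos confirmed above, the sketch below replays only the index arithmetic of the loop quoted in the original question, using plain strings instead of word ids. The class QuotedLoopSpans and the method window are hypothetical, the variable order stands in for lm_.getLmOrder(), and nothing here calls the library. Arrays.copyOfRange is used because its end index is also exclusive.

```java
import java.util.Arrays;

// Hypothetical demo class: it mirrors the quoted loop's indices, nothing more.
class QuotedLoopSpans {
    /** Returns the tokens in the half-open range [startPos, endPos). */
    static String[] window(String[] sent, int startPos, int endPos) {
        return Arrays.copyOfRange(sent, startPos, endPos); // endPos is exclusive
    }

    public static void main(String[] args) {
        // Four words plus one START and one STOP marker: length = words + 2.
        String[] sent = {"START", "Hello", "how", "are", "you", "STOP"};
        int order = 3; // stand-in for lm_.getLmOrder()
        for (int i = 1; i <= sent.length - order; ++i) {
            String[] ngram = window(sent, i, i + order);
            // endPos - startPos equals the number of tokens in the window.
            System.out.println(i + "," + (i + order) + " -> " + String.join(" ", ngram));
        }
    }
}
```

With an exclusive end, the pair 1,4 covers exactly three tokens, which is why a trigram window is i to i + 3 rather than i to i + 2. How the shorter sentence-initial n-grams are scored is not visible in the quoted snippet, so this sketch makes no claim about it.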