Get raw ngram count in addition to logProb #3

Closed
GoogleCodeExporter opened this issue Mar 22, 2015 · 3 comments

Comments

@GoogleCodeExporter

Feature request: when Google n-gram data is used as the back end, provide a way 
to obtain the raw count of an n-gram in addition to its log probability.

Original issue reported on code.google.com by torsten....@gmail.com on 14 Jul 2011 at 7:14

@GoogleCodeExporter

Do you need this access to be fast? There is existing functionality you can 
reach by constructing:
 new NgramMapWrapper<W, LongRef>(lm.getNgramMap(), lm.getWordIndexer());

on a StupidBackoffLm. This gives a Map from List<W> to LongRefs. However, this 
interface is slow due to all the boxing/unboxing. 

Original comment by adpa...@gmail.com on 14 Jul 2011 at 5:39

@GoogleCodeExporter

Of course, fast is always better :)

However, it seems I have not fully understood the way the library works.
Two questions:
1) The Javadoc says that getLogProb() is slow. What is a fast way to get the 
log probability of a given phrase?

2) How is this probability computed given the raw counts in the Google web1t 
corpus? It seems to me there should be an easy way to just invert the process.

thanks for your help,
Torsten

Original comment by torsten....@gmail.com on 15 Jul 2011 at 7:52

@GoogleCodeExporter

1) NgramLanguageModel.getLogProb(List<W>) is "slow" because it has to turn the 
List<W> into an int[] first. Note that it is not actually "slow", just slow 
relative to the efficient accessors in 
ArrayEncodedNgramLanguageModel.getLogProb(int[]) and 
ContextEncodedNgramLanguageModel.getLogProb. I have added additional comments 
that direct you towards those calls so others are not confused by this. 

2) The probability is computed using Stupid Backoff. I have added a call to 
StupidBackoffLm that grabs the count, and will be releasing a new version of 
the code with this fix shortly. 
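
For readers wondering how a Stupid Backoff score relates to raw counts, here is a minimal sketch of the scheme as it is usually described: use the relative frequency of the n-gram if it was seen, otherwise drop the leftmost context word and multiply by a fixed factor. This is an illustration only and not the library's code. The class name StupidBackoffSketch, the method score, the count map keyed by word lists, and the backoff factor 0.4 are all assumptions made for the example, and it returns the plain score rather than a logarithm.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative Stupid Backoff scorer over raw counts (hypothetical, not berkeleylm code). */
public class StupidBackoffSketch {
    /** Backoff factor applied each time the context is shortened (assumed value). */
    static final double ALPHA = 0.4;

    /**
     * Scores the last word of ngram given the preceding words.
     * counts maps each n-gram (as a word list) to its raw count;
     * totalTokens is the number of tokens in the corpus (unigram denominator).
     */
    static double score(Map<List<String>, Long> counts, long totalTokens, List<String> ngram) {
        double penalty = 1.0;
        // Try the full n-gram first, then drop the leftmost context word each round.
        for (int start = 0; start < ngram.size(); start++) {
            List<String> gram = ngram.subList(start, ngram.size());
            long count = counts.getOrDefault(gram, 0L);
            if (count > 0) {
                if (gram.size() == 1) {
                    // Unigram case: relative frequency in the whole corpus.
                    return penalty * count / (double) totalTokens;
                }
                long contextCount = counts.getOrDefault(gram.subList(0, gram.size() - 1), 0L);
                if (contextCount > 0) {
                    // Seen n-gram: count of the n-gram over count of its context.
                    return penalty * count / (double) contextCount;
                }
            }
            penalty *= ALPHA; // unseen at this order: back off with a fixed penalty
        }
        return 0.0; // the word itself was never seen
    }

    /** Tiny made-up count table used only to exercise the sketch. */
    static Map<List<String>, Long> toyCounts() {
        Map<List<String>, Long> counts = new HashMap<>();
        counts.put(List.of("the"), 10L);
        counts.put(List.of("cat"), 4L);
        counts.put(List.of("sat"), 5L);
        counts.put(List.of("the", "cat"), 2L);
        return counts;
    }
}
```

With the toy table and a corpus size of 100, score for "the cat" is the ratio of the bigram count to the count of "the", while "cat sat" falls back to the unigram "sat" scaled by the backoff factor. Because the score is a ratio of two counts, a score alone does not determine the raw count, which is why a separate count accessor is needed.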

Original comment by adpa...@gmail.com on 15 Jul 2011 at 6:19

  • Changed state: Fixed
