Speed up phrase suggestion scoring #5396

Closed
wants to merge 2 commits into
from

Conversation

Projects
None yet
4 participants
@nik9000
Contributor

nik9000 commented Mar 12, 2014

Two changes:

  1. In the StupidBackoffScorer only look for the trigram if there is a bigram.
  2. Cache the frequencies in WordScorer rather so we don't look them up
    again and again and again.

This provides a pretty substantial speedup when there are many candidates.

Closes #5395

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 12, 2014

Contributor

I'd really appreciate someone having a look at this. I think it could save me some time. I'll run it on a more real dataset in the morning and have more to report.

I'd also like to be able to use the candidate frequency rather than look up the term frequency again given that it is already there. I think that would work in most cases but it doesn't seem right when you use one of the transformers. I don't have a good grasp of those at the moment so I'm going to leave it as is.

Contributor

nik9000 commented Mar 12, 2014

I'd really appreciate someone having a look at this. I think it could save me some time. I'll run it on a more real dataset in the morning and have more to report.

I'd also like to be able to use the candidate frequency rather than look up the term frequency again given that it is already there. I think that would work in most cases but it doesn't seem right when you use one of the transformers. I don't have a good grasp of those at the moment so I'm going to leave it as is.

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 12, 2014

Contributor

@s1monw, is the phrase suggester still your baby?

Contributor

nik9000 commented Mar 12, 2014

@s1monw, is the phrase suggester still your baby?

@s1monw

View changes

src/main/java/org/elasticsearch/search/suggest/phrase/CandidateScorer.java
@@ -92,6 +92,7 @@ public void findCandidates(CandidateSet[] candidates, Candidate[] path, int ord,
private void updateTop(CandidateSet[] candidates, Candidate[] path, PriorityQueue<Correction> corrections, double cutoffScore, double score)
throws IOException {
score = Math.exp(score);
+ // This assertion totally throws off performance numbers.... It makes caching look better then it is. It is still good though.

This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

run your benchmarks not with -ea ;)

@s1monw

s1monw Mar 12, 2014

Contributor

run your benchmarks not with -ea ;)

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

I was more trying to make a unit test that made the problem I see in production more obvious. In production I see 300ms per phrase suggester and I'm hoping to cut that by an order of magnitude. Here I had to make it bad enough to go from 5-10 seconds to 300ms so the test can actually catch stuff.... And this assertion caught me off guard there.

@nik9000

nik9000 Mar 12, 2014

Contributor

I was more trying to make a unit test that made the problem I see in production more obvious. In production I see 300ms per phrase suggester and I'm hoping to cut that by an order of magnitude. Here I had to make it bad enough to go from 5-10 seconds to 300ms so the test can actually catch stuff.... And this assertion caught me off guard there.

@@ -54,18 +55,18 @@ protected double scoreBigram(Candidate word, Candidate w_1) throws IOException {
@Override
protected double scoreTrigram(Candidate w, Candidate w_1, Candidate w_2) throws IOException {
+ // First see if there are bigrams. If there aren't then skip looking up the trigram. This saves lookups

This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

+1 for this optimization!

@s1monw

s1monw Mar 12, 2014

Contributor

+1 for this optimization!

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

Seemed like a simple win.

@nik9000

nik9000 Mar 12, 2014

Contributor

Seemed like a simple win.

@s1monw

View changes

src/test/java/org/elasticsearch/test/ElasticsearchIntegrationTest.java
+ * Index the provided builders in random order, chunked up to size. Use this if you are indexing many documents and don't want
+ * to accidentally index them all at once.
+ */
+ public void indexRandomChunked(boolean forceRefresh, int size, List<IndexRequestBuilder> builders) throws InterruptedException, ExecutionException {

This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

I don't get what this one does - the indexRandom method doesn't index everything at once unless you do bulk which is randomly done. We also do sometimes async and sometimes sync there too

@s1monw

s1monw Mar 12, 2014

Contributor

I don't get what this one does - the indexRandom method doesn't index everything at once unless you do bulk which is randomly done. We also do sometimes async and sometimes sync there too

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

I remember a few days ago I was trying to indexRandom a couple hundred records and it would blow up at me. I'll see if I can reproduce it or just scrap this.

@nik9000

nik9000 Mar 12, 2014

Contributor

I remember a few days ago I was trying to indexRandom a couple hundred records and it would blow up at me. I'll see if I can reproduce it or just scrap this.

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

If I try using indexRandom instead of chunking the input I end up with a bunch of:


@nik9000

nik9000 Mar 12, 2014

Contributor

If I try using indexRandom instead of chunking the input I end up with a bunch of:


This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

ah yeah that is likely a problem but we should fix indexRandom here rahter than having a custom method

@s1monw

s1monw Mar 12, 2014

Contributor

ah yeah that is likely a problem but we should fix indexRandom here rahter than having a custom method

@s1monw

View changes

src/main/java/org/elasticsearch/search/suggest/phrase/WordScorer.java
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet;
import java.io.IOException;
//TODO public for tests
-public abstract class WordScorer {

This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

ok so I think the idea is great but lemme explain why I think we shouldn't do this the way we do it here....
I think this is too specialized and we should simply have a CachingTermEnumWrapper that we can use elsewhere and test randomly as well. I think this wrapper should just use a simple HashMap like new HashMap<BytesRef, TermState>() for now to simplify things. It should be single threaded and have a very simple usage like:

public final class CachingTermsEnum extends FilterTermsEnum {
 public boolean seekExact(BytesRef text) throws IOException {
   TermState state = cache.get(text)
   if (state == null) {
     if (delegate.seekExact(text)) {
        state = delegate.termState();
     } else {
       state = NOT_FOUND_STATE;
     }
     cache.put(BytesRef.deepCopyOf(term), state);
   } else if (state != NOT_FOUND_STATE) {
       return delegate.seekExact(text, state); // this should be fast but we can also cache the state
   }
   return false;
}
}

makes sense? that way we can just wrap the existing one and be least intrusive here?

@s1monw

s1monw Mar 12, 2014

Contributor

ok so I think the idea is great but lemme explain why I think we shouldn't do this the way we do it here....
I think this is too specialized and we should simply have a CachingTermEnumWrapper that we can use elsewhere and test randomly as well. I think this wrapper should just use a simple HashMap like new HashMap<BytesRef, TermState>() for now to simplify things. It should be single threaded and have a very simple usage like:

public final class CachingTermsEnum extends FilterTermsEnum {
 public boolean seekExact(BytesRef text) throws IOException {
   TermState state = cache.get(text)
   if (state == null) {
     if (delegate.seekExact(text)) {
        state = delegate.termState();
     } else {
       state = NOT_FOUND_STATE;
     }
     cache.put(BytesRef.deepCopyOf(term), state);
   } else if (state != NOT_FOUND_STATE) {
       return delegate.seekExact(text, state); // this should be fast but we can also cache the state
   }
   return false;
}
}

makes sense? that way we can just wrap the existing one and be least intrusive here?

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

I'll give it a shot.

@nik9000

nik9000 Mar 12, 2014

Contributor

I'll give it a shot.

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

It doesn't look like seeking with the state there is any faster then without it. If I use a TermEnum that just caches the frequency then that caching will be less intrusive, I think, but still pretty quick.

@nik9000

nik9000 Mar 12, 2014

Contributor

It doesn't look like seeking with the state there is any faster then without it. If I use a TermEnum that just caches the frequency then that caching will be less intrusive, I think, but still pretty quick.

This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

well I think we can make the actual seek lazy then and cache the freq and ttf?

@s1monw

s1monw Mar 12, 2014

Contributor

well I think we can make the actual seek lazy then and cache the freq and ttf?

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

See if the new version I sent makes sense. It need documentation and its own test, but I kindof do that. I cache the freq you ask for and only that freq so if you ask again for the other one you get the first one. Lame, but it was pretty simple....

@nik9000

nik9000 Mar 12, 2014

Contributor

See if the new version I sent makes sense. It need documentation and its own test, but I kindof do that. I cache the freq you ask for and only that freq so if you ask again for the other one you get the first one. Lame, but it was pretty simple....

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 12, 2014

Contributor

@nik9000 that's a great patch - lets get this in soon it can speed things u dramatically I guess - I left some comments impl wise

Contributor

s1monw commented Mar 12, 2014

@nik9000 that's a great patch - lets get this in soon it can speed things u dramatically I guess - I left some comments impl wise

@s1monw s1monw self-assigned this Mar 12, 2014

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 12, 2014

Contributor

Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.

Contributor

nik9000 commented Mar 12, 2014

Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.

@s1monw

View changes

...main/java/org/elasticsearch/common/lucene/FrequencyCachingTermsEnum.java
+
+import java.io.IOException;
+
+public class FrequencyCachingTermsEnum extends FilterTermsEnum {

This comment has been minimized.

@s1monw

s1monw Mar 12, 2014

Contributor

I mean this is trivial but I don't want to add this to ES. It's pretty much broken unless you exactly know how it works internally. What I mean is that you can't call #next() or docs() since it doesn't seek. I think we can work towards this but it should have all the functions supported. I think we should cache both ttf and df and put a boolean if a real seek is needed and then do the seek if one of the other methods i called. makes sense? Other than that I think this is the way to go

@s1monw

s1monw Mar 12, 2014

Contributor

I mean this is trivial but I don't want to add this to ES. It's pretty much broken unless you exactly know how it works internally. What I mean is that you can't call #next() or docs() since it doesn't seek. I think we can work towards this but it should have all the functions supported. I think we should cache both ttf and df and put a boolean if a real seek is needed and then do the seek if one of the other methods i called. makes sense? Other than that I think this is the way to go

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

Let me have a look at that.

@nik9000

nik9000 Mar 12, 2014

Contributor

Let me have a look at that.

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 12, 2014

Contributor

Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.

I like the change and I agree we should make it a sep. commit but not need to open a different PR. We can just keep this a sep commit, makes sense? I can port that commit then to 1.0 since this is a n improvement so it will only go into 1.x and master

Contributor

s1monw commented Mar 12, 2014

Added a commit to improve indexRandom for this patch. I think, though, that I should send it in another pull request and make this one dependent on it.

I like the change and I agree we should make it a sep. commit but not need to open a different PR. We can just keep this a sep commit, makes sense? I can port that commit then to 1.0 since this is a n improvement so it will only go into 1.x and master

@nik9000

View changes

...va/org/elasticsearch/common/lucene/FrequencyCachingTermsEnumWrapper.java
+ * To serve requests that can't come from the cache just perform the seek if you skipped the last one.
+ * We could do fancier things if all enums supported ord.
+ */
+public class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum {

This comment has been minimized.

@nik9000

nik9000 Mar 12, 2014

Contributor

Now that I think I have a working TermsEnum I'd love to have a unit test to prove it. @s1monw, do you think the best way to go about this is to grab something from Lucene?

@nik9000

nik9000 Mar 12, 2014

Contributor

Now that I think I have a working TermsEnum I'd love to have a unit test to prove it. @s1monw, do you think the best way to go about this is to grab something from Lucene?

This comment has been minimized.

@s1monw

s1monw Mar 13, 2014

Contributor

hard to tell to be honest.. I'd just write a test that randomly iterates over a terms enum and calls the same things on a wrapped and unwrapped instance and checks if everything is the same?

@s1monw

s1monw Mar 13, 2014

Contributor

hard to tell to be honest.. I'd just write a test that randomly iterates over a terms enum and calls the same things on a wrapped and unwrapped instance and checks if everything is the same?

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 13, 2014

Contributor

Caching helps real suggest calls with real data in the index. Improvements get better as the phrase gets longer. 1 word phrases break even. 9 word phrases are ~75% faster.

How I figured that out:
I wanted to give this a shot with more "real world" data. I made a script to load titles from English Wikipedia. I didn't use the river because I only want the titles. Anyway, then I hit it with some searches with another script. I turned off the caching and hit is again with the same script. That spat out some data. I summarize it above.

Contributor

nik9000 commented Mar 13, 2014

Caching helps real suggest calls with real data in the index. Improvements get better as the phrase gets longer. 1 word phrases break even. 9 word phrases are ~75% faster.

How I figured that out:
I wanted to give this a shot with more "real world" data. I made a script to load titles from English Wikipedia. I didn't use the river because I only want the titles. Anyway, then I hit it with some searches with another script. I turned off the caching and hit is again with the same script. That spat out some data. I summarize it above.

Make indexRandom handle many documents better
* Index one at a time only rarely if doing more then 300.
* When launching async actions, take some care to make sure you don't already
have more then 150 other async actions in flight.
* When indexing in bulk split into chunks of 1000 documents.
@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 13, 2014

Contributor

I've cleaned up the pull request, squashing it into two commits: one for the scoring improvements and one for the indexRandom changes.

Contributor

nik9000 commented Mar 13, 2014

I've cleaned up the pull request, squashing it into two commits: one for the scoring improvements and one for the indexRandom changes.

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 14, 2014

Contributor

hey nik, I took a look at the TermsEnum you added and I think it goes in the right direction but it became more complex that I thought it would. I'd like to get this improvement in for 1.1 which is close but on the other hand I think the general TermsEnum is useful. What about splitting this up and go back to the previous version and put the enum as an inner class into WordScorer making all the methods that would require an actual seek throw UnsupportedOperationException. If it turns out that we really really need to do all the seeks etc. we can still open a sep issue? I think this would be more practical and then we can think about how to generally test it? I mean we can open a dedicated issue right away? Sorry for taking that step back now after you implemented all that stuff.

Contributor

s1monw commented Mar 14, 2014

hey nik, I took a look at the TermsEnum you added and I think it goes in the right direction but it became more complex that I thought it would. I'd like to get this improvement in for 1.1 which is close but on the other hand I think the general TermsEnum is useful. What about splitting this up and go back to the previous version and put the enum as an inner class into WordScorer making all the methods that would require an actual seek throw UnsupportedOperationException. If it turns out that we really really need to do all the seeks etc. we can still open a sep issue? I think this would be more practical and then we can think about how to generally test it? I mean we can open a dedicated issue right away? Sorry for taking that step back now after you implemented all that stuff.

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 16, 2014

Contributor

I have to admit I'm having trouble working up the willpower to back it out and work it back in in another issue. What about slapping more warnings on it? Just moving it to a subclass as is?

Contributor

nik9000 commented Mar 16, 2014

I have to admit I'm having trouble working up the willpower to back it out and work it back in in another issue. What about slapping more warnings on it? Just moving it to a subclass as is?

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 16, 2014

Contributor

@nik9000 I can do it though no worries. Yet the reasoning here is that we don't need all the DocsEnum etc methods so no real seek is needed. I could have made up my mind earlier, hindsight is 20/20.... I am worries about the usecase for this general class since it's very limited and second I don't have a good answer to the testing question so I suggest make it simpler and throw UnsupportedOperationException where appropriate. No need for a sep issue..

Contributor

s1monw commented Mar 16, 2014

@nik9000 I can do it though no worries. Yet the reasoning here is that we don't need all the DocsEnum etc methods so no real seek is needed. I could have made up my mind earlier, hindsight is 20/20.... I am worries about the usecase for this general class since it's very limited and second I don't have a good answer to the testing question so I suggest make it simpler and throw UnsupportedOperationException where appropriate. No need for a sep issue..

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 16, 2014

Contributor

The test I added isn't too bad.... Anyway, I'll switch it back to the old implementation tonight.

Sent from my iPhone

On Mar 16, 2014, at 1:53 PM, Simon Willnauer notifications@github.com wrote:

@nik9000 I can do it though no worries. Yet the reasoning here is that we don't need all the DocsEnum etc methods so no real seek is needed. I could have made up my mind earlier, hindsight is 20/20.... I am worries about the usecase for this general class since it's very limited and second I don't have a good answer to the testing question so I suggest make it simpler and throw UnsupportedOperationException where appropriate. No need for a sep issue..


Reply to this email directly or view it on GitHub.

Contributor

nik9000 commented Mar 16, 2014

The test I added isn't too bad.... Anyway, I'll switch it back to the old implementation tonight.

Sent from my iPhone

On Mar 16, 2014, at 1:53 PM, Simon Willnauer notifications@github.com wrote:

@nik9000 I can do it though no worries. Yet the reasoning here is that we don't need all the DocsEnum etc methods so no real seek is needed. I could have made up my mind earlier, hindsight is 20/20.... I am worries about the usecase for this general class since it's very limited and second I don't have a good answer to the testing question so I suggest make it simpler and throw UnsupportedOperationException where appropriate. No need for a sep issue..


Reply to this email directly or view it on GitHub.

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 17, 2014

Contributor

Done. Let me know what else I need to do. I reran the (not scientific) performance tests I was running and it is marginally faster. I assume because it is maintaining less stuff in the cache.

Contributor

nik9000 commented Mar 17, 2014

Done. Let me know what else I need to do. I reran the (not scientific) performance tests I was running and it is marginally faster. I assume because it is maintaining less stuff in the cache.

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 17, 2014

Contributor

don't get me wrong I think you test was ok! I didn't want to make you upset or anything I really really appreciate your work here that's for sure!

Contributor

s1monw commented Mar 17, 2014

don't get me wrong I think you test was ok! I didn't want to make you upset or anything I really really appreciate your work here that's for sure!

@s1monw

View changes

src/main/java/org/elasticsearch/search/suggest/phrase/WordScorer.java
+ * that is requested and will always return that one. Also only works with seekExact(BytesRef), not next, not seekCeil. Because of this
+ * it really only makes sense in this context.
+ */
+ private static class FrequencyCachingTermsEnumWrapper extends FilterTermsEnum {

This comment has been minimized.

@s1monw

s1monw Mar 17, 2014

Contributor

this looks great. I think we should throw UnsupportedOperationException on the methods that are not correct if they are called?
I also think we should rather use a struct to hold both the ttf and the docFreq - it seems hard to understand what is returned at this point . Other than than I think we are ready to go ie. you can squash and we push it in!

@s1monw

s1monw Mar 17, 2014

Contributor

this looks great. I think we should throw UnsupportedOperationException on the methods that are not correct if they are called?
I also think we should rather use a struct to hold both the ttf and the docFreq - it seems hard to understand what is returned at this point . Other than than I think we are ready to go ie. you can squash and we push it in!

This comment has been minimized.

@nik9000

nik9000 Mar 17, 2014

Contributor

Yes! Sorry, forgot about that while I was doing this.

@nik9000

nik9000 Mar 17, 2014

Contributor

Yes! Sorry, forgot about that while I was doing this.

@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Mar 17, 2014

Member

Few questions on the caching:

Member

kimchy commented Mar 17, 2014

Few questions on the caching:

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 17, 2014

Contributor

Few questions on the caching:

is it correct to have the last freq work as it does today, convoluting the doc freq with totalTermFreq? it might be very tricky down the road... . See the below point, but 2 long arrays for each freq would work well?

It works in this context because the phrase suggester is configured to use one of the two and uses that one the whole time. @s1monw also asked that I switch back to holding both so I'll have a look at that.

Can we move the caching data structure to be BytesRefHash based (hopefully using paged data structures)? Something similar to what we ended up doing in significant terms: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactory.java#L164.

I was using that a few days ago but it required modifying a dozen classes to pass BigArrays around and handle Releasable. I think that'd be OK but many of the classes are valid plugin extension points so maybe that isn't OK so late in the 1.1 release cycle.

Contributor

nik9000 commented Mar 17, 2014

Few questions on the caching:

is it correct to have the last freq work as it does today, convoluting the doc freq with totalTermFreq? it might be very tricky down the road... . See the below point, but 2 long arrays for each freq would work well?

It works in this context because the phrase suggester is configured to use one of the two and uses that one the whole time. @s1monw also asked that I switch back to holding both so I'll have a look at that.

Can we move the caching data structure to be BytesRefHash based (hopefully using paged data structures)? Something similar to what we ended up doing in significant terms: https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTermsAggregatorFactory.java#L164.

I was using that a few days ago but it required modifying a dozen classes to pass BigArrays around and handle Releasable. I think that'd be OK but many of the classes are valid plugin extension points so maybe that isn't OK so late in the 1.1 release cycle.

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 17, 2014

Contributor

Another note while I think about it, just in case folks need to further speed this process up:

Using the default configuration with suggest_mode = always, performance is now dominated by drawing candidates. This is because the default only scores the top five candidates per term. Even on many many term queries scoring is now pretty fast. So if you wanted to make this even faster:

  1. You could get a quick win on the worst case by skipping looking at edit distance one. Even though generating the automata for edit distance two is expensive, on my dataset edit distance one never seems to cut it. The improvement is 10%ish or so on my tests but I can't be sure if they are truly representative here.
  2. Beyond that I don't see getting any faster without abandoning DirectSpellChecker for something that pregenerates all the candidates. It'd be a pain to deal with, I think.
Contributor

nik9000 commented Mar 17, 2014

Another note while I think about it, just in case folks need to further speed this process up:

Using the default configuration with suggest_mode = always, performance is now dominated by drawing candidates. This is because the default only scores the top five candidates per term. Even on many many term queries scoring is now pretty fast. So if you wanted to make this even faster:

  1. You could get a quick win on the worst case by skipping looking at edit distance one. Even though generating the automata for edit distance two is expensive, on my dataset edit distance one never seems to cut it. The improvement is 10%ish or so on my tests but I can't be sure if they are truly representative here.
  2. Beyond that I don't see getting any faster without abandoning DirectSpellChecker for something that pregenerates all the candidates. It'd be a pain to deal with, I think.
@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 17, 2014

Contributor

Add commit to cache both ttf and df AND to throw UnsupportedOperationExceptions. Less hacky now.

Contributor

nik9000 commented Mar 17, 2014

Add commit to cache both ttf and df AND to throw UnsupportedOperationExceptions. Less hacky now.

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 17, 2014

Contributor

LGTM can you maybe squash the commits together such that we have two commits, one for Make indexRandom handle many documents better ... and one for the speeups? Thanks Nik

Contributor

s1monw commented Mar 17, 2014

LGTM can you maybe squash the commits together such that we have two commits, one for Make indexRandom handle many documents better ... and one for the speeups? Thanks Nik

Speed up phrase suggestion scoring
Two changes:
1.  In the StupidBackoffScorer only look for the trigram if there is a bigram.
2.  Cache the frequencies in WordScorer so we don't look them up again and
again and again.  This is implemented by wrapping the TermsEnum in a special
purpose wrapper that really only works in context of the WordScorer.

This provides a pretty substantial speedup when there are many candidates.

Closes #5395
@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Mar 17, 2014

Member

++, we can work on a common frequency caching terms enum and reuse between both the spell checker and significant terms /cc @markharwood, @jpountz

Member

kimchy commented Mar 17, 2014

++, we can work on a common frequency caching terms enum and reuse between both the spell checker and significant terms /cc @markharwood, @jpountz

@nik9000

This comment has been minimized.

Show comment
Hide comment
@nik9000

nik9000 Mar 17, 2014

Contributor

Squashed.

Contributor

nik9000 commented Mar 17, 2014

Squashed.

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Mar 18, 2014

Contributor

pushed to 1.x and master

Contributor

s1monw commented Mar 18, 2014

pushed to 1.x and master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment