Optimize disjunction counts. #12415
Conversation
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates apache#12358
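The core trick can be sketched outside Lucene. The class and method names below are illustrative assumptions, not Lucene's actual code: `count()` sums popcounts over the window's bitset instead of visiting every matching doc.

```java
// Hypothetical sketch of the optimization described above: instead of
// iterating every matching doc in a 2048-doc window, counting just sums
// Long.bitCount over the window's bitset. Names are illustrative only.
public class WindowCountSketch {

    // One bit per doc in the window, like BooleanScorer's `matching` array.
    static int countMatches(long[] matching) {
        int count = 0;
        for (long bits : matching) {
            count += Long.bitCount(bits); // one popcount covers 64 docs
        }
        return count;
    }

    public static void main(String[] args) {
        long[] window = new long[2048 / 64];
        window[0] = 0b1011L;  // docs 0, 1, 3 match
        window[5] = 1L << 7;  // doc 327 matches (5 * 64 + 7)
        System.out.println(countMatches(window)); // prints 4
    }
}
```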
Note: this is just a proof of concept to discuss the idea of integrating at the collector level; more work is needed to add more tests, more docs, and integration in the test framework (…
To me a big question with this API is whether we should consider methods on the …
I added documentation and tests; it's ready for review. I settled on making consumption of …
Wow! I'll try to review soon. Thanks @jpountz!
This looks great @jpountz! Thank you! It's wonderful to see cross fertilization / inspiration from the ongoing Tantivy <-> Lucene comparison resulting in optimizations like this. I'd love to see the other direction too (Tantivy porting over 2-phase iteration, or pulsing in terms dictionary or so).
Sorry for the slow review.
I'm trying to add `count(...)` to nightly benchmarks -- it's regolding now. I'd love to start benchmarking charts for these before we land this opto so we can fully appreciate / document the "pop" :)

I wonder what other queries could (later) benefit from `DocIdStream` bulk collection ...
```diff
-  final Bucket[] buckets = new Bucket[SIZE];
+  // One bucket per doc ID in the window, non-null if scores are needed or if frequencies need to be
+  // counted
+  final Bucket[] buckets;
```
I wonder if switching this to parallel arrays, for maybe better CPU cache locality, would show any speedup (separate issue!). Or maybe "structs" (value objects) when Java finally gets them.
Though, the inlined `matching` bitset is sort of already a parallel array and maybe gets most of the gains.
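A minimal sketch of the parallel-array layout being suggested (hypothetical field names, assuming each window slot tracks a partial score and a clause-match frequency):

```java
// Hypothetical sketch contrasting the object-per-slot Bucket[] layout with
// parallel primitive arrays for better CPU cache locality. Names are
// illustrative, not Lucene's actual fields.
public class ParallelBucketsSketch {
    static final int SIZE = 2048; // window size, as in BooleanScorer

    // Parallel arrays: scores[i] and freqs[i] replace buckets[i].score / .freq
    final double[] scores = new double[SIZE];
    final int[] freqs = new int[SIZE];

    void collect(int slot, double score) {
        scores[slot] += score; // accumulate partial score for this doc
        freqs[slot]++;         // count how many clauses matched it
    }

    void reset(int slot) {
        scores[slot] = 0;
        freqs[slot] = 0;
    }
}
```

The primitive arrays keep each field contiguous in memory, whereas a `Bucket[]` scatters one small object per slot across the heap.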
It's been this way for a very (very very very) long time, but I agree it would probably perform better with parallel arrays!
LOL +1 to the extra `very` instances above!
```java
@Override
public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
  long[] matching = BooleanScorer.this.matching;
  Bucket[] buckets = BooleanScorer.this.buckets;
```
We should (later!) maybe rename `Bucket` to `OneHit` or `DocHit` or so, to make it clear it represents details of a single doc hit.
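The `forEach` above walks the window's `matching` bitset; a self-contained sketch of that set-bit decoding (assumed names, not Lucene's exact code) looks like this:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of decoding a window bitset into doc IDs, in the
// spirit of the forEach shown above. `base` is the window's first doc ID.
public class BitsForEachSketch {
    static List<Integer> decode(long[] matching, int base) {
        List<Integer> docs = new ArrayList<>();
        for (int idx = 0; idx < matching.length; idx++) {
            long bits = matching[idx];
            while (bits != 0L) {
                int ntz = Long.numberOfTrailingZeros(bits);
                docs.add(base + (idx << 6) + ntz); // doc ID for this set bit
                bits ^= 1L << ntz;                 // clear the lowest set bit
            }
        }
        return docs;
    }
}
```

Each word is consumed in `O(set bits)` rather than `O(64)`, which is what makes the popcount-only `count()` path (which skips this decoding entirely) such a win.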
```diff
@@ -154,19 +207,20 @@ public void collect(int doc) throws IOException {
       throw new IllegalArgumentException(
           "This scorer can only be used with two scorers or more, got " + scorers.size());
     }
-    for (int i = 0; i < buckets.length; i++) {
-      buckets[i] = new Bucket();
+    if (needsScores || minShouldMatch > 1) {
```
Might this also optimize other cases, where we are using `BooleanScorer` in non-scoring cases (`MUST_NOT` or `FILTER`)? Or do we never use `BooleanScorer` for these clauses and it's really just the `count` API that we are accelerating here?
Yes, we also use `BooleanScorer` when there is a mix of SHOULD and MUST_NOT clauses. But not when there are FILTER clauses.
OK. Maybe if the `FILTER` clause is high enough cardinality, at some point BS1 becomes worth it. Restrictive (low-cardinality) filters are where BS2 should win.
```java
if (needsScores == false) {
  // OrCollector calls score() all the time so we have to explicitly
  // disable scoring in order to avoid decoding useless norms
  scorer = BooleanWeight.disableScoring(scorer);
```
Nice -- this change is a more effective way to disable scoring than wrapping in a no-op / fake scorer!
```java
/** Like {@link IntConsumer}, but may throw checked exceptions. */
@FunctionalInterface
public interface CheckedIntConsumer<T extends Exception> {
```
Darned ubiquitous `IOException` all throughout Lucene!!
It's viral!
How about `s/throws IOException//g` and `s/IOException/IOExceptionUnchecked/g`?
I would like `LeafCollector#collect` to be a valid method reference that implements this functional interface, and I don't want to change the signature of `LeafCollector#collect`. If we remove the exception here, it would force the default implementation of `LeafCollector#collect(DocIdStream)` to change from this:
```java
default void collect(DocIdStream stream) throws IOException {
  stream.forEach(this::collect);
}
```

to this:

```java
default void collect(DocIdStream stream) throws IOException {
  stream.forEach(doc -> {
    try {
      collect(doc);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  });
}
```
which I like less than introducing this functional interface.
My suggestion was global search and replace throughout Lucene, only half-serious
Sooooo tempting. This `IOException` pollution has been so irritating over the years ... we could maybe make all the entry points (`IndexSearcher#search`, `#count`, etc.) throw `IOException` so callers know "yes you are searching an on-disk index still so stuff could go badly wrong with those transistors", but internally use the unchecked form maybe. Though, that just pushes the virus "up" to our users ...
```diff
@@ -119,6 +121,7 @@ public BulkScorerAndDoc get(int i) {
   final Score score = new Score();
   final int minShouldMatch;
   final long cost;
+  final boolean needsScores;
```

```java
final class OrCollector implements LeafCollector {
```
Do we really only use this `BooleanScorer` for pure disjunctive cases now? I wonder if it might be faster than BS2 for certain conjunctive cases, e.g. if the clauses all have "similar" cost. (Separate issue.)
Indeed, we never use it for conjunctions. It's probably faster than BS2 for conjunctions at times; it would be interesting to find a good heuristic.
Query optimization is so tricky!
```java
 * @see LeafCollector#collect(DocIdStream)
 * @lucene.experimental
 */
public abstract class DocIdStream {
```
I'm not sure where to document this, but this stream is not in general (though it could be) holding ALL matching hits for a given collection situation (query), right? As used from `BooleanScorer` it is just one window's worth of hits (a 2048-doc chunk of docid space) at once? I guess the right place to make this clear is in the new `collect(DocIdStream)` method?
```diff
@@ -83,6 +83,13 @@ public interface LeafCollector {
    */
   void collect(int doc) throws IOException;
 
+  /**
+   * Bulk-collect doc IDs. The default implementation calls {@code stream.forEach(this::collect)}.
```
Can we note that this might be a chunk/window of docids, and that it's always sequential/in-order with respect to other calls to collect (e.g. `collect(int doc)`)? Is it valid for a caller to mix and match calls to both `collect` methods here? I would think so, but we are not yet doing that, since this change will always collect with one or the other.
```java
        context.reader().getLiveDocs(),
        0,
        DocIdSetIterator.NO_MORE_DOCS);
    assertEquals(expectedCount[0], actualCount[0]);
```
Nice! So we count both the fast (`DocIdStream#count`) and slow (one-by-one) way and confirm they agree.
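The invariant the test checks can be illustrated standalone (hypothetical helper names; this is a sketch of the idea, not the Lucene test itself):

```java
// Illustrative version of the agreement the test verifies: the popcount-based
// fast count and the one-by-one slow count must match on any window bitset.
public class CountAgreementSketch {
    static int fastCount(long[] matching) {
        int c = 0;
        for (long bits : matching) c += Long.bitCount(bits);
        return c;
    }

    static int slowCount(long[] matching) {
        int c = 0;
        for (long bits : matching) {
            for (int b = 0; b < 64; b++) {
                if ((bits & (1L << b)) != 0L) c++; // visit every candidate doc
            }
        }
        return c;
    }
}
```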
```java
@Override
public void collect(DocIdStream stream) throws IOException {
  docIdStream[0] = true;
  LeafCollector.super.collect(stream);
```
This then forwards to our `collect(int doc)` method below, right? So we are forcing counting "the slow way" (one by one).
Correct
+1, I'll wait for a few data points before merging.
I tried to think about this too.
Queries that produce bitsets could also implement a similar optimization, e.g. (numeric) term queries could theoretically return a `DocIdStream` per block of 128 doc IDs, where decoding would happen lazily at the beginning of …

And like you already suggested, we could handle some conjunctions if we ran them through BS1.

In general, deletions will tend to disable this optimization. (BS1 is a notable case when deletions would not disable this optimization.)

It might help to have a …
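As a hedged sketch of the deletions point: a bitset-backed window like BS1's can keep counting fast under deletions when live docs are themselves a bitset, by ANDing before the popcount (illustrative code, not Lucene's):

```java
// Hypothetical sketch: when liveness is available as a parallel bitset, a
// window-based count can intersect matching bits with live bits and still
// use popcount, which is why a BS1-style window need not lose the win.
public class LiveDocsCountSketch {
    static int countLive(long[] matching, long[] liveBits) {
        int count = 0;
        for (int i = 0; i < matching.length; i++) {
            // only matches that survive deletions contribute to the count
            count += Long.bitCount(matching[i] & liveBits[i]);
        }
        return count;
    }
}
```

When live docs are exposed only as a `Bits` interface rather than raw words, the intersection degrades to a per-doc check, which is the case where deletions tend to disable the optimization.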
+1, this is a neat idea! Often deletes are sparse (apps try hard to merge them away) ... at Amazon product search we have insanely aggressively asked …
Thanks @jpountz! What an exciting change! And I love that it comes from cross-fertilizing from Tantivy's awesome search implementation/optimizations.
This is a subset of #12415, which I'm extracting to its own pull request in order to have separate data points in nightly benchmarks. Results on `wikimedium10m` and `wikinightly` counting tasks:

```
CountTerm         4624.91  (6.4%)  4581.34  (6.4%)  -0.9% ( -12% - 12%) 0.640
CountAndHighMed    280.03  (4.5%)   280.15  (4.4%)   0.0% (  -8% -  9%) 0.974
CountPhrase          7.22  (3.0%)     7.24  (1.8%)   0.3% (  -4% -  5%) 0.728
CountAndHighHigh    52.84  (4.9%)    53.12  (5.6%)   0.5% (  -9% - 11%) 0.755
PKLookup           232.01  (3.6%)   235.45  (2.8%)   1.5% (  -4% -  8%) 0.144
CountOrHighHigh     42.37  (6.1%)    56.04  (9.1%)  32.3% (  16% - 50%) 0.000
CountOrHighMed     30.56  (6.5%)    40.46  (9.8%)  32.4% (  15% - 52%) 0.000
```
After merging a subset of this PR in #12475, there remains a ~25% speedup when counting hits on …
Counting tasks after integrating #12488: