
Optimize disjunction counts. #12415

Merged (15 commits, Aug 11, 2023)

Conversation

@jpountz (Contributor) commented Jul 5, 2023

This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits).

Relates #12358
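The windowed bit-count idea can be sketched roughly as follows. This is a simplified illustration, not Lucene's actual code: `DocIdStreamSketch`, `BitSetDocIdStream`, and their members are hypothetical names; the real `DocIdStream` and `BooleanScorer` differ in detail. The point is that `count()` can popcount a whole window's bit set instead of iterating matches one by one.

```java
import java.util.function.IntConsumer;

abstract class DocIdStreamSketch {
    abstract void forEach(IntConsumer consumer);

    // Default: count by iterating every match; subclasses may do better.
    int count() {
        int[] n = new int[1];
        forEach(doc -> n[0]++);
        return n[0];
    }
}

final class BitSetDocIdStream extends DocIdStreamSketch {
    private final long[] matching; // bit set of matches in the current window
    private final int base;        // first doc ID of the window

    BitSetDocIdStream(long[] matching, int base) {
        this.matching = matching;
        this.base = base;
    }

    @Override
    void forEach(IntConsumer consumer) {
        for (int i = 0; i < matching.length; i++) {
            long bits = matching[i];
            while (bits != 0L) {
                int ntz = Long.numberOfTrailingZeros(bits);
                consumer.accept(base + (i << 6) + ntz);
                bits ^= 1L << ntz;
            }
        }
    }

    @Override
    int count() {
        // The optimization: popcount the whole window in one pass,
        // without materializing individual doc IDs.
        int count = 0;
        for (long bits : matching) {
            count += Long.bitCount(bits);
        }
        return count;
    }
}
```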
@jpountz (Author) commented Jul 5, 2023

Note: this is just a proof of concept to discuss the idea of integrating at the collector level, more work is needed to add more tests, more docs, integrating in the test framework (AssertingLeafCollector), etc.

@jpountz (Author) commented Jul 6, 2023

To me a big question with this API is whether we should consider methods on the DocIdStream terminal or not. If we do this, then this may enable more optimizations later on, e.g. it would be legal to create such objects that are backed by an iterator. But on the other hand, you wouldn't be able to propagate this optimization in a MultiLeafCollector since it would be illegal for each sub LeafCollector to consume the DocIdStream independently.
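The tension described here can be illustrated with a small sketch. All names below are hypothetical (`LeafCollectorSketch`, `DemoDocIdStream`, `MultiLeafCollectorSketch`), not Lucene's classes: assuming terminal, single-use streams, a fan-out collector must consume the stream once itself and redistribute doc by doc, since handing the same stream to each sub-collector would consume it multiple times.

```java
import java.io.IOException;
import java.util.List;

// Single-abstract-method stream so implementations can be lambdas.
interface DemoDocIdStream {
    interface CheckedIntConsumer { void accept(int doc) throws IOException; }
    void forEach(CheckedIntConsumer consumer) throws IOException;
}

interface LeafCollectorSketch {
    void collect(int doc) throws IOException;

    // Default bulk collection simply unrolls the stream.
    default void collect(DemoDocIdStream stream) throws IOException {
        stream.forEach(this::collect);
    }
}

final class MultiLeafCollectorSketch implements LeafCollectorSketch {
    private final List<LeafCollectorSketch> subs;

    MultiLeafCollectorSketch(List<LeafCollectorSketch> subs) {
        this.subs = subs;
    }

    @Override
    public void collect(int doc) throws IOException {
        for (LeafCollectorSketch sub : subs) {
            sub.collect(doc);
        }
    }

    @Override
    public void collect(DemoDocIdStream stream) throws IOException {
        // With terminal streams, calling sub.collect(stream) on every sub
        // would illegally consume the stream several times. The only safe
        // option is to consume it once here and fan out per doc, which
        // forfeits any bulk count() optimization for the subs.
        stream.forEach(this::collect);
    }
}
```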

@jpountz jpountz marked this pull request as ready for review July 7, 2023 16:21
@jpountz (Author) commented Jul 7, 2023

I added documentation and tests; it's ready for review. I settled on making consumption of DocIdStreams terminal, on the basis that it wouldn't add much value to use this optimization in a MultiCollector anyway. I also removed some overhead that is mostly unrelated to this change, and counting title OR 12 is now 80% faster compared to main.

@mikemccand (Member) commented:

> counting title OR 12 is now 80% faster compared to main.

Wow! I'll try to review soon. Thanks @jpountz!

@mikemccand (Member) left a review comment:

This looks great @jpountz! Thank you! It's wonderful to see cross-fertilization / inspiration from the ongoing Tantivy <-> Lucene comparison resulting in optimizations like this. I'd love to see the other direction too (Tantivy porting over 2-phase iteration, or pulsing in the terms dictionary or so).

Sorry for the slow review.

I'm trying to add count(...) to nightly benchmarks -- it's regolding now. I'd love to start benchmarking charts for these before we land this opto so we can fully appreciate / document the "pop" :)

I wonder what other queries could (later) benefit from DocIdStream bulk collection ...

    final Bucket[] buckets = new Bucket[SIZE];
    // One bucket per doc ID in the window, non-null if scores are needed or if frequencies need to be
    // counted
    final Bucket[] buckets;

Member:

I wonder if switching this to parallel arrays, for maybe better CPU cache locality, would show any speedup (separate issue!). Or maybe "structs" (value objects) when Java finally gets them.

Though, the inlined matching bitset is sort of already a parallel array and maybe gets most of the gains.
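For illustration, the parallel-arrays layout floated above (a separate issue, not part of this PR) might look like the following. This is a hedged sketch: the `Bucket` fields shown and the `ParallelBuckets` class are hypothetical, not the PR's code. The idea is that two contiguous primitive arrays are generally friendlier to the CPU cache than an array of small heap objects.

```java
// Current shape: one small heap object per slot in the scoring window.
final class Bucket {
    double score;
    int freq;
}

// Parallel-array shape: the same per-slot data as two primitive arrays.
final class ParallelBuckets {
    final double[] scores;
    final int[] freqs;

    ParallelBuckets(int windowSize) {
        this.scores = new double[windowSize];
        this.freqs = new int[windowSize];
    }

    // Accumulate a clause's score contribution for a given window slot.
    void add(int slot, double score) {
        scores[slot] += score;
        freqs[slot]++;
    }

    // Reset a slot after its doc has been collected.
    void clear(int slot) {
        scores[slot] = 0;
        freqs[slot] = 0;
    }
}
```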

jpountz (Author):

It's been this way for a very (very very very) long time, but I agree it would probably perform better with parallel arrays!

Member:

LOL +1 to the extra very instances above!

    @Override
    public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
      long[] matching = BooleanScorer.this.matching;
      Bucket[] buckets = BooleanScorer.this.buckets;

Member:

We should (later!) maybe rename Bucket to OneHit or DocHit or so, to make it clear it represents details of a single doc hit.

    @@ -154,19 +207,20 @@ public void collect(int doc) throws IOException {
      throw new IllegalArgumentException(
          "This scorer can only be used with two scorers or more, got " + scorers.size());
    }
    for (int i = 0; i < buckets.length; i++) {
      buckets[i] = new Bucket();
    if (needsScores || minShouldMatch > 1) {

Member:

Might this also optimize other cases, where we are using BooleanScorer in non-scoring cases (MUST_NOT or FILTER)? Or do we never use BooleanScorer for these clauses and it's really just the count API that we are accelerating here?

jpountz (Author):

Yes, we also use BooleanScorer when there is a mix of SHOULD and MUST_NOT clauses. But not when there are FILTER clauses.

Member:

OK. Maybe if the FILTER clause is high enough cardinality, at some point BS1 becomes worth it. Restrictive (low cardinality) filters is where BS2 should win.

    if (needsScores == false) {
      // OrCollector calls score() all the time so we have to explicitly
      // disable scoring in order to avoid decoding useless norms
      scorer = BooleanWeight.disableScoring(scorer);

Member:

Nice -- this change is a more effective way to disable scoring than wrapping in a no-op / fake scorer!


    /** Like {@link IntConsumer}, but may throw checked exceptions. */
    @FunctionalInterface
    public interface CheckedIntConsumer<T extends Exception> {

Member:

Darned ubiquitous IOException all throughout Lucene!!

jpountz (Author):

It's viral!

Contributor:

How about s/throws IOException//g and s/IOException/IOExceptionUnchecked/g ?

jpountz (Author):

I would like LeafCollector#collect to be a valid method reference that implements this functional interface, and I don't want to change the signature of LeafCollector#collect. If we remove the exception here, it would force the default implementation of LeafCollector#collect(DocIdStream) to change from this:

  default void collect(DocIdStream stream) throws IOException {
    stream.forEach(this::collect);
  }

to this:

  default void collect(DocIdStream stream) throws IOException {
    stream.forEach(doc -> {
      try {
        collect(doc);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

which I like less than introducing this functional interface.

Contributor:

My suggestion was a global search and replace throughout Lucene, only half-serious.

Member:

Sooooo tempting. This IOException pollution has been so irritating over the years ... we could maybe make all the entry points (IndexSearcher#search, #count, etc.) throw IOException so callers know "yes, you are still searching an on-disk index, so stuff could go badly wrong with those transistors", but internally use the unchecked form. Though that just pushes the virus "up" to our users ...

    @@ -119,6 +121,7 @@ public BulkScorerAndDoc get(int i) {
    final Score score = new Score();
    final int minShouldMatch;
    final long cost;
    final boolean needsScores;

    final class OrCollector implements LeafCollector {

Member:

Do we really only use this BooleanScorer for pure disjunctive cases now? I wonder if it might be faster than BS2 for certain conjunctive cases, e.g. if the clauses all have "similar" cost. (Separate issue).

jpountz (Author):

Indeed, we never use it for conjunctions. It's probably faster than BS2 for conjunctions at times; it would be interesting to find a good heuristic.

Member:

Query optimization is so tricky!

     * @see LeafCollector#collect(DocIdStream)
     * @lucene.experimental
     */
    public abstract class DocIdStream {

Member:

I'm not sure where to document this, but this stream is not in general (though could be) holding ALL matching hits for a given collection situation (query) right? As used from BooleanScorer it is just one window's worth of hits (a 2048 chunk of docid space) at once? I guess the right place to make this clear is in the new collect(DocIdStream) method?

    @@ -83,6 +83,13 @@ public interface LeafCollector {
       */
      void collect(int doc) throws IOException;

      /**
       * Bulk-collect doc IDs. The default implementation calls {@code stream.forEach(this::collect)}.

Member:

Can we note that this might be a chunk/window of docids, and it's always sequential/in-order with respect to other calls to collect (e.g. collect(int doc)). Is it valid for a caller to mix & match calls to both collect methods here? I would think so, but we are not yet doing that since this change will always collect with one or the other.

    context.reader().getLiveDocs(),
    0,
    DocIdSetIterator.NO_MORE_DOCS);
    assertEquals(expectedCount[0], actualCount[0]);

Member:

Nice! So we count both fast (DocIdStream#count) and slow (one by one) way and confirm they agree.

    @Override
    public void collect(DocIdStream stream) throws IOException {
      docIdStream[0] = true;
      LeafCollector.super.collect(stream);

Member:

This then forwards to our collect(int doc) method below right? So we are forcing counting "the slow way" (one by one).

jpountz (Author):

Correct

@jpountz (Author) commented Jul 27, 2023

> I'd love to start benchmarking charts for these before we land this opto so we can fully appreciate / document the "pop"

+1, I'll wait for a few data points before merging.

> I wonder what other queries could (later) benefit from DocIdStream bulk collection ...

I tried to think about this too.

MatchAllDocsQuery is an obvious candidate, but it's already optimized differently using Weight#count. It's probably still a good idea to implement this API on MatchAllDocsQuery so that it would help pure negations (a MatchAllDocsQuery in a MUST/FILTER/SHOULD clause, and one or more MUST_NOT clauses), as this will trigger usage of ReqExclBulkScorer which will delegate to the MatchAllDocsQuery BulkScorer.

Queries that produce bitsets could also implement a similar optimization, e.g. (numeric) range, prefix, wildcard or geo queries. I expect the cost of building the bitset to dominate the overall execution time, but it will probably still yield a noticeable speedup. A question I'm wondering there is whether we should pass the entire BitSet as a single DocIdStream or if there are reasons why we should split it anyway.

Term queries could theoretically return a DocIdStream per block of 128 doc IDs, where decoding would happen lazily at the beginning of DocIdStream#forEach and DocIdStream#count would return 128 without even decoding postings. This would require more intimate integration with the codec as we don't have the right APIs to do this at the moment.

And like you already suggested, we could handle some conjunctions if we ran them through BS1.

In general, deletions will tend to disable this optimization (BS1 is a notable case where deletions would not disable it). It might help to have a nextClearBit on Bits to be able to still apply this optimization. E.g. MatchAllDocsQuery could use Bits#nextClearBit on live docs to create a DocIdStream for every sequence of adjacent non-deleted doc IDs to speed up counting under sparse deletions.
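The nextClearBit idea can be sketched as follows. This is a hypothetical illustration using `java.util.BitSet` (set bit = live doc) in place of Lucene's `Bits`; `LiveRunCounter` and `countLive` are made-up names. With a nextClearBit-style operation, a MatchAllDocsQuery count can walk runs of adjacent non-deleted docs and add each run's length in one step, costing O(runs) rather than O(docs) when deletions are sparse.

```java
import java.util.BitSet;

final class LiveRunCounter {
    // Count live docs in [0, maxDoc) by jumping between runs of set bits.
    static int countLive(BitSet liveDocs, int maxDoc) {
        int count = 0;
        int doc = liveDocs.nextSetBit(0);
        while (doc >= 0 && doc < maxDoc) {
            int end = liveDocs.nextClearBit(doc); // first deleted doc after the run
            if (end > maxDoc) {
                end = maxDoc;
            }
            count += end - doc; // whole run of adjacent live docs counted at once
            doc = liveDocs.nextSetBit(end);
        }
        return count;
    }
}
```

Each run `[doc, end)` could equally be exposed as one DocIdStream whose `count()` is just `end - doc`, which is the shape of the suggestion above.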

@mikemccand (Member) commented:

> It might help to have a nextClearBit on Bits to be able to still apply this optimization. E.g. MatchAllDocsQuery could use Bits#nextClearBit on live docs to create a DocIdStream for every sequence of adjacent non-deleted doc IDs to speed up counting under sparse deletions.

+1, this is a neat idea! Often deletes are sparse (apps try hard to merge them away) ... at Amazon product search we have insanely aggressively asked TieredMP to merge away deletions now, at a tremendous increase in indexing cost and a wee bit faster searching; that's the right tradeoff when using NRT segment replication to efficiently/incrementally index once and distribute the copy to many replicas for searching. Maybe open a follow-on issue for this?

@mikemccand (Member) left a comment:

Thanks @jpountz! What an exciting change! And I love that it comes from cross-fertilizing from Tantivy's awesome search implementation/optimizations.

jpountz added a commit to jpountz/lucene that referenced this pull request Jul 31, 2023
This is a subset of apache#12415, which I'm extracting to its own pull request in
order to have separate data points in nightly benchmarks.

Results on `wikimedium10m` and `wikinightly` counting tasks:

```
                       CountTerm     4624.91      (6.4%)     4581.34      (6.4%)   -0.9% ( -12% -   12%) 0.640
                 CountAndHighMed      280.03      (4.5%)      280.15      (4.4%)    0.0% (  -8% -    9%) 0.974
                     CountPhrase        7.22      (3.0%)        7.24      (1.8%)    0.3% (  -4% -    5%) 0.728
                CountAndHighHigh       52.84      (4.9%)       53.12      (5.6%)    0.5% (  -9% -   11%) 0.755
                        PKLookup      232.01      (3.6%)      235.45      (2.8%)    1.5% (  -4% -    8%) 0.144
                 CountOrHighHigh       42.37      (6.1%)       56.04      (9.1%)   32.3% (  16% -   50%) 0.000
                  CountOrHighMed       30.56      (6.5%)       40.46      (9.8%)   32.4% (  15% -   52%) 0.000
```
jpountz added a commit that referenced this pull request Aug 3, 2023
jpountz added a commit that referenced this pull request Aug 3, 2023
@jpountz (Author) commented Aug 4, 2023

After merging a subset of this PR in #12475, there remains a ~25% speedup when counting hits on title OR 12 (same query as mentioned earlier).

@jpountz (Author) commented Aug 5, 2023

Counting tasks after integrating #12488:

```
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                     CountPhrase       12.47      (3.2%)       12.49      (3.9%)    0.1% (  -6% -    7%) 0.909
                        PKLookup      240.85      (4.1%)      241.22      (3.6%)    0.2% (  -7% -    8%) 0.897
                       CountTerm     9110.81      (3.9%)     9163.88      (2.8%)    0.6% (  -5% -    7%) 0.586
                CountAndHighHigh       51.81      (3.5%)       52.36      (2.5%)    1.1% (  -4% -    7%) 0.274
                 CountAndHighMed      196.28      (3.8%)      198.66      (2.3%)    1.2% (  -4% -    7%) 0.222
                 CountOrHighHigh       50.41      (9.3%)       64.37     (16.3%)   27.7% (   1% -   58%) 0.000
                  CountOrHighMed       79.16      (8.7%)      105.40     (16.4%)   33.2% (   7% -   63%) 0.000
```

@jpountz jpountz added this to the 9.8.0 milestone Aug 11, 2023
@jpountz jpountz merged commit 4d26cb2 into apache:main Aug 11, 2023
4 checks passed
@jpountz jpountz deleted the optimized_disjunction_count branch August 11, 2023 20:37
jpountz added a commit that referenced this pull request Aug 11, 2023
jpountz added a commit that referenced this pull request Aug 11, 2023