Better skipping for multi-term queries with a FILTER rewrite. #12055
Here is what luceneutil gives on wikimedium10m: (benchmark table not reproduced here)

For the record, the reason why we're seeing a speedup here is that prefix and wildcard queries produce constant scores, so the query can early-terminate once 1,000 hits have been collected. Before the change, we would always create a bitset of all matches, which forced evaluating the query against the entire doc ID space up front. Evaluation is lazier now: only low-frequency postings are evaluated up front, while high-frequency postings are evaluated lazily.
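The lazy-versus-eager trade-off described above can be illustrated with a small stand-alone simulation (plain Java, not Lucene code; the merge and counting logic is invented for illustration): a top-N collector over a lazy disjunction only touches postings up to the N-th hit, while materializing a bitset up front touches every posting of every matching term.

```java
import java.util.List;

public class EarlyTerminationSketch {

  /** Lazy k-way merge over postings lists; returns how many postings were touched. */
  static int lazyCollect(List<int[]> postings, int topN) {
    int[] pos = new int[postings.size()];
    int collected = 0, touched = 0, lastDoc = -1;
    while (collected < topN) {
      // find the list whose current doc is smallest
      int minDoc = Integer.MAX_VALUE, minIdx = -1;
      for (int i = 0; i < postings.size(); i++) {
        if (pos[i] < postings.get(i).length && postings.get(i)[pos[i]] < minDoc) {
          minDoc = postings.get(i)[pos[i]];
          minIdx = i;
        }
      }
      if (minIdx == -1) break; // all lists exhausted
      pos[minIdx]++;
      touched++;
      if (minDoc != lastDoc) { // count each distinct doc once
        collected++;
        lastDoc = minDoc;
      }
    }
    return touched;
  }

  /** Eager bitset-style build: every posting of every term is read up front. */
  static int eagerCollect(List<int[]> postings) {
    int touched = 0;
    for (int[] p : postings) {
      touched += p.length;
    }
    return touched;
  }

  public static void main(String[] args) {
    int[] dense = new int[100_000];
    for (int i = 0; i < dense.length; i++) dense[i] = i;
    List<int[]> postings = List.of(dense, new int[] {3, 7, 500});
    System.out.println("postings touched, lazy top-1000: " + lazyCollect(postings, 1_000));
    System.out.println("postings touched, eager bitset:  " + eagerCollect(postings));
  }
}
```

With one dense term and one sparse term, the lazy path stops after roughly a thousand postings while the eager path reads all of them, which is the source of the constant-score speedup described in the comment.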
```diff
-      docs = termsEnum2.postings(docs, PostingsEnum.NONE);
-      builder.add(docs);
+      PostingsEnum postings = termsEnum2.postings(null, PostingsEnum.NONE);
+      highFrequencyTerms.add(postings);
```
Rather than just blindly add terms to the PQ, should we have a constant minimum cost threshold (e.g. 256, 1024, whatever) to even consider it? Otherwise go directly to `otherTerms`. The skipping stuff isn't going to be useful for the long tail of low-cost terms (the majority, if we are thinking zipf). Ideally we wouldn't waste our time unless it has skip data? And we want to be careful about the performance of these queries when there are jazillions of jazillions of matching low-frequency terms.
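The threshold idea can be sketched in a few lines (a hypothetical sketch, not Lucene code; `MIN_COST_FOR_PQ`, `TermCost`, and `partition` are invented names): terms that clear a fixed cost floor become candidates for the priority queue, and the zipf long tail goes straight to the accumulator.

```java
import java.util.ArrayList;
import java.util.List;

public class CostThresholdSketch {
  // Only terms at least this costly are worth a standalone postings enum.
  static final int MIN_COST_FOR_PQ = 512; // e.g. 256, 1024, whatever

  record TermCost(String term, long cost) {}

  record Split(List<String> pqCandidates, List<String> otherTerms) {}

  static Split partition(List<TermCost> terms) {
    List<String> pqCandidates = new ArrayList<>();
    List<String> otherTerms = new ArrayList<>();
    for (TermCost t : terms) {
      if (t.cost() >= MIN_COST_FOR_PQ) {
        pqCandidates.add(t.term()); // skipping may pay off for this term
      } else {
        otherTerms.add(t.term()); // long tail: accumulate eagerly
      }
    }
    return new Split(pqCandidates, otherTerms);
  }

  public static void main(String[] args) {
    List<TermCost> terms =
        List.of(new TermCost("the", 1_000_000), new TermCost("zyzzyva", 2),
                new TermCost("lucene", 5_000), new TermCost("rareterm", 17));
    Split split = partition(terms);
    System.out.println("PQ candidates: " + split.pqCandidates());
    System.out.println("other terms:   " + split.otherTerms());
  }
}
```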
```diff
 do {
-  docs = termsEnum.postings(docs, PostingsEnum.NONE);
+  postings = termsEnum.postings(postings, PostingsEnum.NONE);
```
I don't understand how this is safe at all: we are reusing PostingsEnum instances yet also stuffing them into a priority queue.
+1 to this approach in general. I do wonder if the distribution assumptions generally hold if we start looking at "term in set" queries, though. That's sort of irrelevant right now since that implementation is still separate.

I also wonder if we could be more aggressive with the number of clauses we build into a disjunction.

Just a couple of thoughts, but certainly nothing blocking or anything that needs to be included as part of this PR. Just wanted to toss them out there.
There's no reason to duplicate a bunch of code just because of minor changes to a rewrite method. We can have more than one or two of these rewrite methods and use different ones for different queries: this fact seems to have been forgotten here. Maybe we should have e.g. a simple FILTER rewrite that just does that, and this one with lots of magic could have a different name.
Thanks @rmuir, you're right. The rewrite methods are decoupled from the core term intersection logic.
```java
    return a.cost() < b.cost();
  }
};
DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);
```
Minor: could we define `otherTerms` closer to where it first gets used (e.g., L:207)?
```java
List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
for (PostingsEnum pe : highFrequencyTerms) {
  disis.add(pe);
}
disis.add(otherTerms.build().iterator());
DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
for (DocIdSetIterator disi : disis) {
  subs.add(new DisiWrapper(disi));
}
```
Maybe I'm overlooking something silly, but can't we just do one pass like this?
```diff
-List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
-for (PostingsEnum pe : highFrequencyTerms) {
-  disis.add(pe);
-}
-disis.add(otherTerms.build().iterator());
-DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
-for (DocIdSetIterator disi : disis) {
-  subs.add(new DisiWrapper(disi));
-}
+DisiPriorityQueue subs = new DisiPriorityQueue(highFrequencyTerms.size() + 1);
+for (DocIdSetIterator disi : highFrequencyTerms) {
+  subs.add(new DisiWrapper(disi));
+}
+subs.add(new DisiWrapper(otherTerms.build().iterator()));
```
Also, it would be nice if we could get direct access to the underlying array backing `highFrequencyTerms`; then we could leverage `DisiPriorityQueue#addAll` to heapify everything at once.
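Both suggestions can be illustrated with `java.util.PriorityQueue` standing in for Lucene's `DisiPriorityQueue` (a generic sketch, not Lucene code; the costs are made-up numbers): the one-pass version skips the intermediate list entirely, and building a heap from a whole collection at once heapifies in O(n), which is what an `addAll`-style bulk load buys over n individual O(log n) adds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class OnePassHeapSketch {

  /** Bulk build: the Collection constructor heapifies all elements at once. */
  static int bulkMin(List<Integer> costs) {
    return new PriorityQueue<>(costs).peek();
  }

  public static void main(String[] args) {
    List<Integer> highFrequencyCosts = List.of(900, 120, 45_000); // stand-in costs
    int otherTermsCost = 7; // stand-in for otherTerms.build().iterator()

    // One pass: add each sub directly, no intermediate "disis" list.
    PriorityQueue<Integer> subs = new PriorityQueue<>(highFrequencyCosts.size() + 1);
    subs.addAll(highFrequencyCosts);
    subs.add(otherTermsCost);
    System.out.println("min via one pass: " + subs.peek());

    // Bulk heapify of everything at once.
    List<Integer> all = new ArrayList<>(highFrequencyCosts);
    all.add(otherTermsCost);
    System.out.println("min via bulk heapify: " + bulkMin(all));
  }
}
```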
I didn't really mean you specifically. I meant in the file itself.

But it doesn't need to be the default for all queries. I think the skipping logic proposed here makes total sense for stuff like WildcardQuery, PrefixQuery, RegexQuery, etc. But for TermInSetQuery? I think the rewrite method there need not optimize for "dense" terms at all, but could focus on "PK lookups" (e.g. lots of terms with one posting), since we all know people use it as a primary-key filter.
@jpountz I've found that applying this same idea to `TermInSetQuery` helps as well. Do you mind if I extend this PR to include a similar change to it?
My attention has moved to a few other things; feel free to do whatever you want with this PR, I'll be happy to review.

+1 on the nice property of gradually moving from a lazy disjunction to eager evaluation; this is what I was after with this change!
Thanks @jpountz. I pushed my changes.
Currently multi-term queries with a filter rewrite internally rewrite to a disjunction if 16 or fewer terms match the query. Otherwise, postings lists of matching terms are collected into a `DocIdSetBuilder`. This change replaces the latter with a mixed approach where a disjunction is created between the 16 terms that have the highest document frequency and an iterator produced from the `DocIdSetBuilder` that collects all other terms. On fields that have a zipfian distribution, it's quite likely that no high-frequency terms make it to the `DocIdSetBuilder`. This provides two main benefits:

- Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
- Queries are better at skipping or early terminating.

On the other hand, queries that need to consume most or all matching documents may get a slowdown. The slowdown is regrettable, but my gut feeling is that this change still has more pros than cons.
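A minimal model of this blended approach in plain Java (not Lucene code; `TermPostings`, `blend`, and the data are all illustrative): a min-heap keeps the k terms with the highest doc freq as standalone postings, and whenever a more frequent term shows up, the cheapest of the current top-k is evicted into a shared bitset that stands in for the `DocIdSetBuilder`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

public class BlendedRewriteSketch {

  record TermPostings(String term, int[] docs) {}

  /** Returns the k highest-doc-freq terms; all other docs land in otherDocs. */
  static List<TermPostings> blend(int k, List<TermPostings> terms, BitSet otherDocs) {
    // min-heap ordered by doc freq (postings length)
    PriorityQueue<TermPostings> topK =
        new PriorityQueue<>((a, b) -> Integer.compare(a.docs().length, b.docs().length));
    for (TermPostings t : terms) {
      topK.add(t);
      if (topK.size() > k) {
        TermPostings evicted = topK.poll(); // lowest doc freq among k+1
        for (int doc : evicted.docs()) {
          otherDocs.set(doc); // accumulate eagerly, like a DocIdSetBuilder
        }
      }
    }
    return new ArrayList<>(topK);
  }

  public static void main(String[] args) {
    List<TermPostings> terms =
        List.of(new TermPostings("foo", new int[] {0, 4, 9, 12}),
                new TermPostings("foobar", new int[] {7}),
                new TermPostings("food", new int[] {1, 4, 30}));
    BitSet otherDocs = new BitSet();
    List<TermPostings> highFreq = blend(2, terms, otherDocs); // k=2 for the demo
    for (TermPostings t : highFreq) {
      System.out.println("standalone postings: " + t.term());
    }
    System.out.println("accumulated docs: " + otherDocs);
  }
}
```

With a zipfian term distribution, almost all terms are low-frequency and get folded into the bitset, while the handful of heavy hitters stay lazy, which matches the two benefits listed above.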
Force-pushed 2746a8a to 0b4f57e
OK, I think I've addressed the previous feedback and also brought in the same changes to `TermInSetQuery`.

On some internal benchmarks (Amazon product search), we see throughput increases ranging from ~7-63% (depending on a number of factors). We have some situations where pre-processing of long postings (in the existing TiS implementation) takes up a large share of CPU time (these tend to be cases where the actual matches are sparse but the TiS terms match a large number of docs). Being able to "hold back" these long postings helps significantly in those cases.

I also re-ran luceneutil.
I don't see any of my feedback addressed. I'll repeat what I said before:
I think it would help, a lot, to look through the history and see how "constant score auto rewrite" was implemented years ago, and then its removal, before adding it back again. But I'm firmly against doing this crap inside the filter rewrite. If we really have good reason to "add back" "auto rewrite", it would be good to add it as a separate thing like before, and probably good to look at how it was done before.
Also, dropping the postings reuse is going to cause big performance degradation in many situations. For example, with NIOFSDirectory, a new postings reader means IndexInput.clone() calls, buffer refills, etc. It translates into real I/O.

But the reuse isn't dropped in all cases, and we're still putting "reused enums" into things like PQs. Sorry, none of this looks even half-way baked at all.
I suggest an easy path to success here:
@rmuir thanks for the feedback. Let me see if I can respond to all of it here:
Can you help me with where you see this as a problem? I went through the code and didn't see any obvious mistakes, though I'm sure I'm just missing it. The only tricky one is the do/while loop in MTQCSW starting on line 214, but I think this is safe since we reassign `postings` on each iteration.
Thanks for raising this. I wasn't aware of the history here so I'll have to go do some digging. I'll take that on as a next step.
I'd looked at this in the past and found performance regressions with this approach, which I believe were caused by the different term intersection approach (seekCeil vs. seekExact). I don't recall the amount of regression, though. I'll see if I can dig that up; hopefully I documented it somewhere, and if not, I'll try to re-benchmark. Maybe the regressions don't justify the separate implementation.
Found the issue where the "constant score auto rewrite" implementation was removed: LUCENE-5938. If I'm understanding the history, it seems like the auto rewrite logic was trying to balance two things:

1. Keeping the number of clauses in the rewritten boolean query low, since many terms are inefficient to manage in a boolean query.
2. Avoiding the cost of allocating a dense bitset over the full doc ID space.
It looks like this rewrite method was removed in LUCENE-5938 since it introduced the idea of a sparse bitset, which removed the issue with #2 above. In my opinion, #1 is still a very valid trade-off (many terms are inefficient to manage in a boolean query due to the associated PQ). This, of course, is what the current rewrite method takes into consideration (with a threshold of 16 terms).

What I still don't like about the existing implementation is how it completely changes behavior to a full bitset rewrite after passing 16 terms. I do think it's a nice win overall to rewrite in the way proposed by this PR. As far as I can tell, the former implementation never "incrementally" pre-processed postings into a filter bitset; it was an "all or nothing" approach. I think the key benefit of this PR is to allow for "incremental" processing. But I also recognize it might not be applicable in all cases, for all users, and/or for all file systems.

I think it's really good feedback to introduce this idea as a new rewrite option instead of modifying the existing one in place. I'll look into that as a next step.

@rmuir / @jpountz / @mikemccand - since you all were involved in LUCENE-5938 and the earlier implementation of the "auto rewrite," please let me know if I'm missing anything. I did the best "digital archeology" I could, but it's very possible I'm missing something. Thanks again for the feedback!
As one more update, I just opened a new PR for the related change.
Thanks Greg for sharing more info about how it helped on Amazon product search. Do your queries early-terminate somehow? In that case I'd expect this change to help the most, since it can skip evaluating the tail of long postings.

I like the idea of having multiple rewrite methods, possibly including an auto-style rewrite.

Reuse of postings enums looks OK to me. We could improve naming and add more comments to make it more obviously OK, but we only create up to 16 postings enums from scratch, reuse otherwise, and make sure to never reuse a postings enum that is in the priority queue.

The threshold of 16 looks conservative to me, so I wouldn't worry about NIOFSDirectory; if we had a problem with NIOFSDirectory and this threshold of 16, then many simple boolean queries would have problems too, which I don't think is the case in practice. The threshold on the minimum document frequency should also help here, e.g. a near-PK field would only accumulate hits into a DocIdSetBuilder and not pull postings enums.
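The reuse invariant described here can be modeled in a few lines of plain Java (not Lucene code; `FakePostingsEnum`, `process`, and the threshold handling are invented for illustration): one scratch object is reused for every low-frequency term, while anything handed to the priority queue is freshly created and never recycled, so the queue can never hold an enum that later gets clobbered by reuse.

```java
import java.util.ArrayList;
import java.util.List;

public class EnumReuseSketch {
  // Minimum doc freq for a term to be "promoted" to its own postings enum.
  static final int DOC_FREQ_THRESHOLD = 512;

  static class FakePostingsEnum {
    int docFreq;

    FakePostingsEnum reset(int docFreq) { // stands in for termsEnum.postings(reuse, ...)
      this.docFreq = docFreq;
      return this;
    }
  }

  /** Returns {enums created, enums queued} for the given term doc freqs. */
  static int[] process(int[] termDocFreqs) {
    List<FakePostingsEnum> queue = new ArrayList<>(); // stands in for the PQ
    FakePostingsEnum scratch = new FakePostingsEnum();
    int created = 1; // the scratch enum itself
    for (int df : termDocFreqs) {
      if (df >= DOC_FREQ_THRESHOLD) {
        // promote: must be a fresh instance because the queue keeps a reference
        queue.add(new FakePostingsEnum().reset(df));
        created++;
      } else {
        scratch = scratch.reset(df); // safe: consumed immediately, then reused
      }
    }
    return new int[] {created, queue.size()};
  }

  public static void main(String[] args) {
    int[] counts = process(new int[] {3, 40_000, 7, 900, 1, 2});
    System.out.println("enums created: " + counts[0] + ", queued: " + counts[1]);
  }
}
```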
OK, I think I better understand the concern around the slowness with NIOFSDirectory now. With a single PostingsEnum getting reused, a single BufferedIndexInput refill would buffer postings lists for multiple terms at once, so subsequent reads would be served from the buffer.

We're losing this property by reusing across up to 16+1 PostingsEnums, though the set of 16 postings enums that have the highest doc freq should get relatively stable after some time, i.e. most terms would have a lower doc freq. Plus, the threshold of docFreq=512 should help too: we're always reusing the last postings enum in that case, and postings lists that have more matches would likely need to refill their buffer multiple times to read all matching docs anyway.
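This buffering concern can be made concrete with a toy model (plain Java, not Lucene code; `BufferedInput` and the offsets are invented): a buffered input refills in fixed-size chunks, so one reused reader can serve several short, adjacent postings lists from a single refill, while a fresh clone per term starts with a cold buffer and refills immediately.

```java
public class BufferRefillSketch {

  static class BufferedInput {
    final int bufferSize;
    long bufferStart;
    boolean filled = false; // nothing buffered yet
    int refills = 0;

    BufferedInput(int bufferSize) { this.bufferSize = bufferSize; }

    void read(long offset) {
      if (!filled || offset < bufferStart || offset >= bufferStart + bufferSize) {
        bufferStart = offset; // simulated refill: one real I/O
        filled = true;
        refills++;
      }
    }
  }

  /** Counts refills for reading each term's postings, with or without reuse. */
  static int refills(long[][] termOffsets, boolean reuse, int bufferSize) {
    BufferedInput shared = new BufferedInput(bufferSize);
    int total = 0;
    for (long[] postings : termOffsets) {
      // without reuse, each term gets a cold input, like a fresh clone
      BufferedInput in = reuse ? shared : new BufferedInput(bufferSize);
      for (long off : postings) {
        in.read(off);
      }
      if (!reuse) total += in.refills;
    }
    return reuse ? shared.refills : total;
  }

  public static void main(String[] args) {
    // four short postings lists at adjacent file offsets
    long[][] adjacentTerms = { {0, 8}, {16, 24}, {32, 40}, {48, 56} };
    System.out.println("refills with reuse:    " + refills(adjacentTerms, true, 1024));
    System.out.println("refills without reuse: " + refills(adjacentTerms, false, 1024));
  }
}
```

Under this model, reuse lets all four adjacent short lists ride on one refill, while fresh inputs pay at least one refill per term, which matches the stable-top-16 and docFreq-threshold mitigations described in the comment.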
Yes, our queries do terminate early. +1 to this explaining most of the large difference between the approaches in our case.
Interesting. Thanks for describing the issue a bit more. I'll try to spend a little time digging into this to make sure I understand the concern better. I'll post another revision here soon that introduces this rewrite logic as a new rewrite method (as discussed) and will also remove the changes to `TermInSetQuery`.
…TrackingQueryCachingPolicy
Updated this PR in the following ways:
I believe this addresses all the previous feedback and is ready for another look.
Apparently I can't approve since I opened the PR initially, but the change looks good to me!
```java
this.twoPhaseView = null;
this.approximation = iterator;
this.matchCost = 0;
}
```
I wonder if we can keep the API a bit cleaner and change callers to do `new DisiWrapper(new ConstantScoreScorer(iterator))` instead.
Yeah, that's reasonable I think. We'd be doing a little unnecessary object creation with the "dummy" scorer wrapper, but that's alright. I'll work this suggestion in. Honestly, I'm not super happy with either option, but I think this is a good suggestion for now.
```java
 * PrefixQuery}, {@link WildcardQuery} or {@link TermRangeQuery}. This implementation is generally
 * preferable because it a) Runs faster b) Does not have the scarcity of terms unduly influence
 * score c) avoids any {@link org.apache.lucene.search.IndexSearcher.TooManyClauses} exception.
 * However, if your application really needs to use the old-fashioned {@link BooleanQuery}
```
Maybe mention the non-blended constant-score rewrite as another alternative?
Good idea, thanks!
lucene/CHANGES.txt

```diff
@@ -277,6 +277,9 @@ Improvements
 * GITHUB#12070: Compound file creation is no longer subject to merge throttling.
   (Adrien Grand)

+* GITHUB#12055: Better skipping support for multi-term queries that have a
+  FILTER rewrite. (Adrien Grand, Greg Miller)
```
Maybe update this change log to mention that a new rewrite method was introduced and what queries specifically switched to this new rewrite method as a default?
Thanks, I missed this. I also realized the entry is in the 9.5 block. Updating the wording and moving under 9.6.
Ha! Thanks for having another look. This expanded a little in scope by keeping the "old" behavior along with the new, so thanks for having patience with it. I made a couple minor changes based on your feedback and am going to go ahead and merge given your "approval." If you (or anyone else) have additional comments, I'm happy to continue addressing them. Thanks again.
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term queries that act as a filter.

Currently, multi-term queries with a filter rewrite internally rewrite to a disjunction if 16 or fewer terms match the query. Otherwise postings lists of matching terms are collected into a `DocIdSetBuilder`. This change replaces the latter with a mixed approach where a disjunction is created between the 16 terms that have the highest document frequency and an iterator produced from the `DocIdSetBuilder` that collects all other terms. On fields that have a zipfian distribution, it's quite likely that no high-frequency terms make it to the `DocIdSetBuilder`. This provides two main benefits:

- Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
- Queries are better at skipping or early terminating.

On the other hand, queries that need to consume most or all matching documents may get a slowdown, so users can still opt in to the "full filter rewrite" functionality by overriding the rewrite method.

This is the new default for PrefixQuery, WildcardQuery and TermRangeQuery.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Nightly bench runs have refreshed and show some nice improvements. I'll try to add some annotations soon.
Woohoo!
Unless I'm missing something, the CONSTANT_SCORE_REWRITE method is now very under-tested. I don't suspect a problem today, but future changes might introduce a regression that we don't have a test for. Ideally, tests would randomly pick one of the two rewrite methods, since they are semantically equivalent, but that would be hard to do. Alternatively, update some general tests like TestPrefixQuery to be parameterized by the rewrite method.