Avoid unnecessary DisjunctionMaxBulkScorer overhead#15659
Avoid unnecessary DisjunctionMaxBulkScorer overhead#15659shimpeko wants to merge 3 commits intoapache:mainfrom
Conversation
56cf380 to
ecc05df
Compare
|
I realized that we cannot call ScorerSupplier.get() more than once. Need another solutions. |
a3b772a to
160a235
Compare
This change inspects clause bulk scorers up front and only uses DisjunctionMaxBulkScorer if at least one clause provides a non-default BulkScorer. Otherwise, we fall back to the scorer-based path. c88f933 made DisjunctionMaxQuery use DisjunctionMaxBulkScorer when tieBreakerMultiplier == 0 and scoreMode == TOP_SCORES. However, this bulk path primarily pays off when at least one clause implements a specialized BulkScorer. When all clauses return DefaultBulkScorer, the bulk windowing and replay logic adds overhead while preventing effective use of minCompetitiveScore and block-max optimizations that the scorer-based DisjunctionMaxScorer supports in TOP_SCORES mode. In such cases, falling back to the scorer-based path typically results in better performance and restores competitive-score-based skipping. Fixes: apache#15658 Related PR: apache#14040
160a235 to
6231d37
Compare
I updated the code not to call ScorerSupplier.get() more than once. |
|
Hi, it would be nice to not always force push, this makes reviewing hard. I have no idea if you have fixed the linter warnings. Squashing the commits is done on our side while merging, please don't do it yourself. Otherwise the change looks fine to me. but I leave it to @jpountz as he understand the code much better. |
|
Please fix the linter warning by running "./gradlew tidy" and commit, but don't sqash! |
It's true that the windowing+replay logic isn't free but I remember that it was still better than using a Let's benchmark this change with luceneutil and our dismax tasks (https://github.com/mikemccand/luceneutil/blob/main/tasks/wikinightly.tasks#L257-L276)? |
|
I tried to run wikinightly but it didn't finish after 10 hours on my laptop so just picked some dismax tasks. Looking at the result, this change (edit: This PR) basically has no effect on performance of those dismax tasks including I'll dig around deeper. Commands to run bench and task detail |
|
cont. #15659 (comment)
|
|
I added test case that are closer to my query which has dismax + constant_score as https://github.com/shimpeko/luceneutil/pull/1/changes. Looking at the following benchmark result, I think I can say that the changes on this PR has significant positive impact on the performance of specifc type of query.
This seems true, so I don't cleary understand why using DisjunctionBulkMaxScorer causing regression in this paticular case, yet. @jpountz do you have any idea. Edit: I'll look deeper into DisjunctionBulkMaxScorer but happy if I could get feedbacks/suggestions, for this PR after having the benchmark result. Result (50 warmups, 50 iter) Commands to run bench |
This change inspects clause bulk scorers up front and only uses DisjunctionMaxBulkScorer if at least one clause provides a non-default BulkScorer. Otherwise, we fall back to the scorer-based path.
c88f933 made DisjunctionMaxQuery use DisjunctionMaxBulkScorer when tieBreakerMultiplier == 0 and scoreMode == TOP_SCORES. However, this bulk path primarily pays off when at least one clause implements a specialized BulkScorer.
When all clauses return DefaultBulkScorer, the bulk windowing and replay logic adds overhead while preventing effective use of minCompetitiveScore and block-max optimizations that the scorer-based DisjunctionMaxScorer supports in TOP_SCORES mode.
In such cases, falling back to the scorer-based path typically results in better performance and restores competitive-score-based skipping.
Fixes: #15658
Related PR: #14040