Fix exponential runtime for Boolean#rewrite #12072

Merged
merged 6 commits into from
Jan 12, 2023

benwtrent
Member

@benwtrent benwtrent commented Jan 10, 2023

When #672 was introduced, it added many nice rewrite optimizations. However, when many nested Boolean queries sit under a top-level Boolean#filter clause, rewrite runtime grows exponentially.

The key issue was that BooleanQuery#rewriteNoScoring redirected yet again to ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was already called in ConstantScoreQuery#rewrite, and then BooleanQuery#rewriteNoScoring is called again, recursively.

This causes exponential growth in rewrite time with query depth. The change here short-circuits that, so rewrite time grows only (near) linearly, by calling BooleanQuery#rewriteNoScoring directly instead of attempting to redirect through ConstantScoreQuery#rewrite.
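
To make the growth rates concrete, here is a minimal toy model of the two recursion shapes. It is not Lucene code: the class `RewriteCallModel` and its methods are hypothetical, and the model assumes one nested Boolean query per level, with the old path doing two recursive passes per level (one through rewrite, one through the no-scoring rewrite) and the new path doing one.

```java
/** Toy call-count model; a sketch under the assumptions stated above, not Lucene's implementation. */
final class RewriteCallModel {

  /**
   * Old shape: each level triggers two recursive passes over the nested level,
   * so the count follows T(d) = 1 + 2 * T(d - 1), which is exponential in d.
   */
  static long callsBefore(int depth) {
    if (depth == 0) {
      return 1;
    }
    return 1 + 2 * callsBefore(depth - 1);
  }

  /**
   * New shape: each level recurses directly into the nested level once,
   * so the count follows T(d) = 1 + T(d - 1), which is linear in d.
   */
  static long callsAfter(int depth) {
    if (depth == 0) {
      return 1;
    }
    return 1 + callsAfter(depth - 1);
  }

  public static void main(String[] args) {
    for (int depth : new int[] {3, 22, 30}) {
      System.out.println(
          "depth=" + depth
              + " before=" + callsBefore(depth)
              + " after=" + callsAfter(depth));
    }
  }
}
```

In this model, depth 22 gives 8,388,607 modeled calls for the old shape against 23 for the new one. The real constant factors differ, but the doubling per level is the reason a depth of 30 becomes impractical.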

The absolute worst case I was able to test is many nested SHOULD clauses with a depth of 22. This ran for over 7 seconds without my change. With my change it took less than 70ms.

I had to cancel the test (without my change) when the depth was 30. It was simply taking too long. With my change, it was around 100ms.

closes: #12069

@benwtrent
Member Author

@jpountz You probably want to review this one as it relates to your original optimizations.

Contributor

@javanna javanna left a comment

thanks @benwtrent the fix looks good to me, I left a couple of comments.

// this causes
// exponential growth of runtime.
if (query instanceof BooleanQuery booleanQuery) {
  rewritten = booleanQuery.rewriteNoScoring(indexSearcher);
Contributor

I would make the same change to the main rewrite method. That will further reduce the time we spend needlessly rewriting boolean queries.

@javanna
Contributor

javanna commented Jan 11, 2023

I did some additional testing to understand the impact of the regression. In light of that, I view this as a bug rather than a performance regression: we end up performing many needless rewrite steps that waste resources, and rewrite can take tens of seconds depending on the depth of the boolean query.

I added some logging to the rewrite methods involved to more clearly show the issue. The following is the output from a boolean query with depth 3.

    // Fragment from a test extending LuceneTestCase, which provides newSearcher(), newDirectory() and random().
    IndexSearcher searcher = newSearcher(new MultiReader());
    Directory dir = newDirectory();
    (new RandomIndexWriter(random(), dir)).close();
    IndexReader r = DirectoryReader.open(dir);

    int depth = 3;
    // Build a chain of nested MUST clauses, one BooleanQuery per level.
    BooleanQuery.Builder bq = new BooleanQuery.Builder().add(new TermQuery(new Term("field", "value")), Occur.MUST);
    for (int i = 0; i < depth; i++) {
      bq = new BooleanQuery.Builder().add(bq.build(), Occur.MUST).add(new TermQuery(new Term("depth" + (depth - i), "value")), Occur.MUST);
    }
    // Wrap the chain in a top-level FILTER clause so that scores are not needed.
    BooleanQuery booleanQuery = new BooleanQuery.Builder().add(bq.build(), Occur.FILTER).build();
    Query rewritten = searcher.rewrite(booleanQuery);
    System.out.println("final rewritten query: " + rewritten);
    r.close();
    dir.close();

Output before the fix (87 lines):

boolean query rewrite: #(+(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value) -> (ConstantScore(+(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value))^0.0
---------------------------------------------- top level rewrite round completed
+ constant score query rewriting inner query: +(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value
boolean query rewrite: +field:value -> field:value
boolean query rewrite: +(+field:value) +depth3:value -> +field:value +depth3:value
boolean query rewrite: +(+(+field:value) +depth3:value) +depth2:value -> +(+field:value +depth3:value) +depth2:value
boolean query rewrite: +(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value -> +(+(+field:value +depth3:value) +depth2:value) +depth1:value
+ constant score query rewriting inner boolean query no scoring: +(+(+field:value +depth3:value) +depth2:value) +depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: +(+field:value +depth3:value) +depth2:value
boolean query rewrite: +field:value +depth3:value -> +field:value +depth3:value
boolean query rewrite: +(+field:value +depth3:value) +depth2:value -> +(+field:value +depth3:value) +depth2:value
+ constant score query rewriting inner boolean query no scoring: +(+field:value +depth3:value) +depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: +field:value +depth3:value
boolean query rewrite: +field:value +depth3:value -> +field:value +depth3:value
+ constant score query rewriting inner boolean query no scoring: +field:value +depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
boolean query rewrite no scoring: +field:value +depth3:value -> #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
boolean query rewrite no scoring: +(+field:value +depth3:value) +depth2:value -> #(#field:value #depth3:value) #depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth1:value
boolean query rewrite no scoring: +(+(+field:value +depth3:value) +depth2:value) +depth1:value -> #(#(#field:value #depth3:value) #depth2:value) #depth1:value
---------------------------------------------- top level rewrite round completed
+ constant score query rewriting inner query: #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ constant score query rewriting inner query: #(#field:value #depth3:value) #depth2:value
+ constant score query rewriting inner query: #field:value #depth3:value
+ constant score query rewriting inner query: field:value
+ constant score query rewriting inner query: depth3:value
boolean query rewrite: #field:value #depth3:value -> #field:value #depth3:value
+ constant score query rewriting inner boolean query no scoring: #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ constant score query rewriting inner query: depth2:value
boolean query rewrite: #(#field:value #depth3:value) #depth2:value -> #(#field:value #depth3:value) #depth2:value
+ constant score query rewriting inner boolean query no scoring: #(#field:value #depth3:value) #depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: #field:value #depth3:value
+ constant score query rewriting inner query: field:value
+ constant score query rewriting inner query: depth3:value
boolean query rewrite: #field:value #depth3:value -> #field:value #depth3:value
+ constant score query rewriting inner boolean query no scoring: #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
+ constant score query rewriting inner query: depth1:value
boolean query rewrite: #(#(#field:value #depth3:value) #depth2:value) #depth1:value -> #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ constant score query rewriting inner boolean query no scoring: #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: #(#field:value #depth3:value) #depth2:value
+ constant score query rewriting inner query: #field:value #depth3:value
+ constant score query rewriting inner query: field:value
+ constant score query rewriting inner query: depth3:value
boolean query rewrite: #field:value #depth3:value -> #field:value #depth3:value
+ constant score query rewriting inner boolean query no scoring: #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ constant score query rewriting inner query: depth2:value
boolean query rewrite: #(#field:value #depth3:value) #depth2:value -> #(#field:value #depth3:value) #depth2:value
+ constant score query rewriting inner boolean query no scoring: #(#field:value #depth3:value) #depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: #field:value #depth3:value
+ constant score query rewriting inner query: field:value
+ constant score query rewriting inner query: depth3:value
boolean query rewrite: #field:value #depth3:value -> #field:value #depth3:value
+ constant score query rewriting inner boolean query no scoring: #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth1:value
final rewritten query: (ConstantScore(#(#(#field:value #depth3:value) #depth2:value) #depth1:value))^0.0

Output after the fix (52 lines):

boolean query rewrite: #(+(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value) -> (ConstantScore(+(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value))^0.0
---------------------------------------------- top level rewrite round completed
+ constant score query rewriting inner query: +(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value
boolean query rewrite: +field:value -> field:value
boolean query rewrite: +(+field:value) +depth3:value -> +field:value +depth3:value
boolean query rewrite: +(+(+field:value) +depth3:value) +depth2:value -> +(+field:value +depth3:value) +depth2:value
boolean query rewrite: +(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value -> +(+(+field:value +depth3:value) +depth2:value) +depth1:value
+ constant score query rewriting inner boolean query no scoring: +(+(+field:value +depth3:value) +depth2:value) +depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
boolean query rewrite no scoring: +field:value +depth3:value -> #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
boolean query rewrite no scoring: +(+field:value +depth3:value) +depth2:value -> #(#field:value #depth3:value) #depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth1:value
boolean query rewrite no scoring: +(+(+field:value +depth3:value) +depth2:value) +depth1:value -> #(#(#field:value #depth3:value) #depth2:value) #depth1:value
---------------------------------------------- top level rewrite round completed
+ constant score query rewriting inner query: #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ constant score query rewriting inner query: #(#field:value #depth3:value) #depth2:value
+ constant score query rewriting inner query: #field:value #depth3:value
+ constant score query rewriting inner query: field:value
+ constant score query rewriting inner query: depth3:value
boolean query rewrite: #field:value #depth3:value -> #field:value #depth3:value
+ constant score query rewriting inner boolean query no scoring: #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ constant score query rewriting inner query: depth2:value
boolean query rewrite: #(#field:value #depth3:value) #depth2:value -> #(#field:value #depth3:value) #depth2:value
+ constant score query rewriting inner boolean query no scoring: #(#field:value #depth3:value) #depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
+ constant score query rewriting inner query: depth1:value
boolean query rewrite: #(#(#field:value #depth3:value) #depth2:value) #depth1:value -> #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ constant score query rewriting inner boolean query no scoring: #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth1:value
final rewritten query: (ConstantScore(#(#(#field:value #depth3:value) #depth2:value) #depth1:value))^0.0

Output after shortcutting the main rewrite method to rewriteNoScoring instead of wrapping a boolean query with constant score query (39 lines):

boolean query rewrite: #(+(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value) -> (ConstantScore(+(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value))^0.0
---------------------------------------------- top level rewrite round completed
+ constant score query rewriting inner query: +(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value
boolean query rewrite: +field:value -> field:value
boolean query rewrite: +(+field:value) +depth3:value -> +field:value +depth3:value
boolean query rewrite: +(+(+field:value) +depth3:value) +depth2:value -> +(+field:value +depth3:value) +depth2:value
boolean query rewrite: +(+(+(+field:value) +depth3:value) +depth2:value) +depth1:value -> +(+(+field:value +depth3:value) +depth2:value) +depth1:value
+ constant score query rewriting inner boolean query no scoring: +(+(+field:value +depth3:value) +depth2:value) +depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
boolean query rewrite no scoring: +field:value +depth3:value -> #field:value #depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
boolean query rewrite no scoring: +(+field:value +depth3:value) +depth2:value -> #(#field:value #depth3:value) #depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth1:value
boolean query rewrite no scoring: +(+(+field:value +depth3:value) +depth2:value) +depth1:value -> #(#(#field:value #depth3:value) #depth2:value) #depth1:value
---------------------------------------------- top level rewrite round completed
+ constant score query rewriting inner query: #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
+ constant score query rewriting inner query: depth1:value
boolean query rewrite: #(#(#field:value #depth3:value) #depth2:value) #depth1:value -> #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ constant score query rewriting inner boolean query no scoring: #(#(#field:value #depth3:value) #depth2:value) #depth1:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: field:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth3:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth2:value
+ Artificially wrapping into constant score query to rewrite
+ constant score query rewriting inner query: depth1:value
final rewritten query: (ConstantScore(#(#(#field:value #depth3:value) #depth2:value) #depth1:value))^0.0

Obviously the effect of this grows further as the depth of the query increases.

Contributor

@jpountz jpountz left a comment

Thanks for looking into it! I like that the fix is contained. I left a suggestion for a fix that I think is a bit more robust, and another suggestion for the test.

rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
if (rewritten instanceof ConstantScoreQuery constantScoreQuery) {
  rewritten = constantScoreQuery.getQuery();
}
Contributor

I'm contemplating replacing your fix with this one which is similar:

diff --git a/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java b/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
index 0354280eb01..be323bf0e4c 100644
--- a/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
+++ b/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java
@@ -203,7 +203,13 @@ public class BooleanQuery extends Query implements Iterable<BooleanClause> {
 
     for (BooleanClause clause : clauses) {
       Query query = clause.getQuery();
-      Query rewritten = new ConstantScoreQuery(query).rewrite(indexSearcher);
+      // NOTE: rewriteNoScoring() should not call rewrite(), otherwise this
+      // method could run in exponential time with the depth of the query as
+      // every new level would rewrite 2x more than its parent level.
+      Query rewritten = query;
+      if (rewritten instanceof BoostQuery) {
+        rewritten = ((BoostQuery) query).getQuery();
+      }
       if (rewritten instanceof ConstantScoreQuery) {
         rewritten = ((ConstantScoreQuery) rewritten).getQuery();
       }

I think it would be a bit more robust, as it would keep working if there are ConstantScoreQuery wrappers between the inner levels of BooleanQuerys.
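
The idea in the diff above can be sketched with toy types: peel off score-only wrappers in place and recurse directly into nested boolean nodes, without ever calling a general rewrite. Everything here is hypothetical and unrelated to Lucene's classes: `UnwrapSketch`, `Leaf`, `ScoreWrapper` (standing in for both BoostQuery and ConstantScoreQuery) and `Bool` are illustration-only names, and the sketch loops over wrappers where the diff unwraps at most one of each kind.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of "unwrap, then recurse directly"; not Lucene code. */
final class UnwrapSketch {
  interface Node {}

  static final class Leaf implements Node {
    final String term;
    Leaf(String term) { this.term = term; }
  }

  /** Hypothetical stand-in for a wrapper that only affects scores. */
  static final class ScoreWrapper implements Node {
    final Node inner;
    ScoreWrapper(Node inner) { this.inner = inner; }
  }

  static final class Bool implements Node {
    final List<Node> clauses;
    Bool(List<Node> clauses) { this.clauses = clauses; }
  }

  /** Number of stripScoring invocations, used to check the growth rate. */
  static int visits = 0;

  /** Removes score-only wrappers and descends into nested Bool nodes once each. */
  static Node stripScoring(Node node) {
    visits++;
    while (node instanceof ScoreWrapper) {
      node = ((ScoreWrapper) node).inner; // unwrap in place, no rewrite call
    }
    if (node instanceof Bool) {
      List<Node> stripped = new ArrayList<>();
      for (Node clause : ((Bool) node).clauses) {
        stripped.add(stripScoring(clause)); // direct recursion, one pass per clause
      }
      return new Bool(stripped);
    }
    return node;
  }

  /** Builds a chain of nested Bool nodes with a wrapper between every level. */
  static Node nested(int depth) {
    Node node = new Leaf("field");
    for (int i = 0; i < depth; i++) {
      List<Node> clauses = new ArrayList<>();
      clauses.add(new ScoreWrapper(node));
      clauses.add(new Leaf("depth" + i));
      node = new Bool(clauses);
    }
    return node;
  }

  public static void main(String[] args) {
    visits = 0;
    stripScoring(nested(22));
    System.out.println("visits at depth 22: " + visits);
  }
}
```

Because each node is visited once, the visit count in this sketch is 2 * depth + 1, and wrappers sitting between the nested levels do not change that.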

@benwtrent benwtrent requested review from jpountz and javanna and removed request for jpountz and javanna January 11, 2023 17:57
@benwtrent
Member Author

@jpountz applied your suggestions

@javanna I added a rewrite count check and split the should & must test.

@benwtrent benwtrent requested review from jpountz and javanna and removed request for jpountz January 11, 2023 21:21
@benwtrent benwtrent requested review from javanna and jpountz and removed request for javanna January 11, 2023 21:22
@benwtrent benwtrent requested review from javanna and jpountz and removed request for jpountz and javanna January 11, 2023 21:22
Contributor

@jpountz jpountz left a comment

This looks great, thank you.

Contributor

@javanna javanna left a comment

LGTM

@javanna
Contributor

javanna commented Jan 12, 2023

@benwtrent could you add a changelog entry too?

@javanna javanna merged commit 59b1745 into apache:main Jan 12, 2023
@javanna
Contributor

javanna commented Jan 12, 2023

Thanks @benwtrent !

javanna pushed a commit that referenced this pull request Jan 12, 2023
@javanna javanna added this to the 9.5.0 milestone Jan 13, 2023
@benwtrent benwtrent deleted the test/replicate-long-rewrite branch March 13, 2024 13:32
Successfully merging this pull request may close these issues.

Long rewrite times for deeply nested, non-scoring Boolean queries
3 participants