Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156

Merged
merged 2 commits into from
Mar 1, 2023

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented Feb 19, 2023

Description

TermInSetQuery's implementation is more-or-less an exactly copy of MultiTermQuery + MultiTermQueryConstantScoreWrapper. This PR removes the custom TermInSetQuery implementation in favor of extending MultiTermQuery with MultiTermQueryConstantScoreWrapper as the default rewrite behavior.

One nice benefit of this (beyond code cleanup) is that different rewrite methods can be provided for different behavior. Specifically, we can leverage DocValuesRewriteMethod to completely replace SortedSetDocValuesSetQuery (I would propose doing that in a follow up PR once we're happy with this change).

A couple notes about the change:

  1. I needed to give MultiTermQueryConstantScoreWrapper a ScoreSupplier implementation so TermInSetQuery can still be used efficiently with IndexOrDocValuesQuery. This also required a new (optional) public API in MultiTermQuery where queries can expose their number of terms (where applicable).
  2. Retaining the existing TermInSetQuery#rewrite behavior was slightly non-trivial. I don't really like how I've done this in the PR, but I also can't think of a better solution. I'm not actually convinced we need to retain this, but I'm not sure if client code might rely on the existing behavior in places. Even if that's the case, it might not be a good enough argument to keep it, but I did find a breaking unit test if I didn't keep it. (TestPresearcherMatchCollector.testMatchCollectorShowMatches assumes the query will get rewritten to a BQ). I'm not familiar enough with the monitor package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here. If someone is more familiar and/or has opinions on the need to retain the existing "rewrite" behavior, I'd love some feedback.

Performance

When this has been discussed in the past, there's been an open question around performance since the term intersection happens a bit differently (seekCeil vs. seekExact). I ran some benchmarks using a one-off tool (similar to #12151) and found no noticeable regressions/issues. Here's the output of my tool (which you can check out here:
TiSBench.java.txt )

All Country Code Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 206.12 203.00 201.95 119.97
TiS MTQ 193.67 190.54 189.43 117.02

Medium Cardinality + High Cost Country Code Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 132.42 127.95 126.15 77.80
TiS MTQ 130.49 125.97 124.25 77.31

Low Cardinality + High Cost Country Code Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 239.00 110.95 36.51 75.18
TiS MTQ 242.64 113.94 38.71 76.00

Medium Cardinality + Low Cost Country Code Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 2.51 1.99 1.67 0.25
TiS MTQ 2.54 2.01 1.67 0.28

Low Cardinality + Low Cost Country Code Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 2.34 2.07 1.88 0.44
TiS MTQ 2.20 1.91 1.68 0.41

High Cardinality PK Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 26.41 26.19 25.96 6.36
TiS MTQ 27.78 27.65 27.32 6.69

Medium Cardinality PK Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 2.19 2.13 2.15 0.49
TiS MTQ 2.33 2.28 2.25 0.50

Low Cardinality PK Filter Terms

Approach Large Lead Terms Medium Lead Terms Small Lead Terms No Lead Terms
TiS Original 1.76 1.74 1.74 0.34
TiS MTQ 1.32 1.29 1.28 0.22

@romseygeek
Copy link
Contributor

I'm not familiar enough with the monitor package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here.

It's fine to just change the test case here, we're only checking that the debug functionality on the monitor reports that a given query has matched and the specific string output isn't important.

@gsmiller
Copy link
Contributor Author

It's fine to just change the test case here, we're only checking that the debug functionality on the monitor reports that a given query has matched and the specific string output isn't important.

OK thanks. I've removed the BQ rewrite logic and updated this test. I'm not convinced we need to actually rewrite to a BQ during the query rewrite, but that's the one big difference with this implementation.

@MarcusSorealheis
Copy link
Contributor

MarcusSorealheis commented Feb 19, 2023

@gsmiller this is a great simplification. Thank you. I'm going to share it with our team.

gsmiller added a commit to jpountz/lucene that referenced this pull request Feb 23, 2023
@gsmiller
Copy link
Contributor Author

I've rebased this work on top of main, picking up the significant changes in the MultiTermQuery space (#12055) and switching the default rewrite method for TermInSetQuery to the newly added CONSTANT_SCORE_BLENDED_REWRITE. I re-ran my simple benchmarking tooling (mentioned above) and results still look good.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I am not fully sure what default rewrite method is best here.

@gsmiller
Copy link
Contributor Author

gsmiller commented Mar 1, 2023

Thanks @uschindler.

I am not fully sure what default rewrite method is best here.

The nice thing is it's easy to control now (bitset rewrite, boolean scoring, doc values post-filtering, etc.). Based on the benchmark wins in #12055 for other multi-term queries, I thought it would be good to use the same rewrite by default, at least for now. We can change the default easily if we learn more.

@gsmiller gsmiller merged commit 3809106 into apache:main Mar 1, 2023
@gsmiller gsmiller deleted the GH/tis-extend-mtq branch March 1, 2023 13:20
gsmiller added a commit that referenced this pull request Mar 1, 2023
@gsmiller gsmiller added this to the 9.6.0 milestone Mar 1, 2023
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Jun 5, 2024
MultiTermQuery return null for ScoreSupplier if there are no terms in an index that
 match query terms.

With the introduction of PR apache#12156 we saw degradation in performance of bool queries
where one of the mandatory clauses is a TermInSetQuery with query terms not present in
the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which
would shortcut the whole bool query.

This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field
doesn't contain any query terms.

Relates to PR apache#12156
mayya-sharipova added a commit that referenced this pull request Jun 6, 2024
MultiTermQuery return null for ScoreSupplier if there are no terms in an index that
 match query terms.

With the introduction of PR #12156 we saw degradation in performance of bool queries
where one of the mandatory clauses is a TermInSetQuery with query terms not present in
the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which
would shortcut the whole bool query.

This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field
doesn't contain any query terms.

Relates to PR #12156
mayya-sharipova added a commit that referenced this pull request Jun 6, 2024
MultiTermQuery return null for ScoreSupplier if there are no terms in an index that
 match query terms.

With the introduction of PR #12156 we saw degradation in performance of bool queries
where one of the mandatory clauses is a TermInSetQuery with query terms not present in
the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which
would shortcut the whole bool query.

This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field
doesn't contain any query terms.

Relates to PR #12156
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants