Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156

gsmiller · 2023-02-19T00:37:38Z

Description

TermInSetQuery's implementation is more-or-less an exactly copy of MultiTermQuery + MultiTermQueryConstantScoreWrapper. This PR removes the custom TermInSetQuery implementation in favor of extending MultiTermQuery with MultiTermQueryConstantScoreWrapper as the default rewrite behavior.

One nice benefit of this (beyond code cleanup) is that different rewrite methods can be provided for different behavior. Specifically, we can leverage DocValuesRewriteMethod to completely replace SortedSetDocValuesSetQuery (I would propose doing that in a follow up PR once we're happy with this change).

A couple notes about the change:

I needed to give MultiTermQueryConstantScoreWrapper a ScoreSupplier implementation so TermInSetQuery can still be used efficiently with IndexOrDocValuesQuery. This also required a new (optional) public API in MultiTermQuery where queries can expose their number of terms (where applicable).
Retaining the existing TermInSetQuery#rewrite behavior was slightly non-trivial. I don't really like how I've done this in the PR, but I also can't think of a better solution. I'm not actually convinced we need to retain this, but I'm not sure if client code might rely on the existing behavior in places. Even if that's the case, it might not be a good enough argument to keep it, but I did find a breaking unit test if I didn't keep it. (TestPresearcherMatchCollector.testMatchCollectorShowMatches assumes the query will get rewritten to a BQ). I'm not familiar enough with the monitor package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here. If someone is more familiar and/or has opinions on the need to retain the existing "rewrite" behavior, I'd love some feedback.

Performance

When this has been discussed in the past, there's been an open question around performance since the term intersection happens a bit differently (seekCeil vs. seekExact). I ran some benchmarks using a one-off tool (similar to #12151) and found no noticeable regressions/issues. Here's the output of my tool (which you can check out here:
TiSBench.java.txt )

All Country Code Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	206.12	203.00	201.95	119.97
TiS MTQ	193.67	190.54	189.43	117.02

Medium Cardinality + High Cost Country Code Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	132.42	127.95	126.15	77.80
TiS MTQ	130.49	125.97	124.25	77.31

Low Cardinality + High Cost Country Code Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	239.00	110.95	36.51	75.18
TiS MTQ	242.64	113.94	38.71	76.00

Medium Cardinality + Low Cost Country Code Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	2.51	1.99	1.67	0.25
TiS MTQ	2.54	2.01	1.67	0.28

Low Cardinality + Low Cost Country Code Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	2.34	2.07	1.88	0.44
TiS MTQ	2.20	1.91	1.68	0.41

High Cardinality PK Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	26.41	26.19	25.96	6.36
TiS MTQ	27.78	27.65	27.32	6.69

Medium Cardinality PK Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	2.19	2.13	2.15	0.49
TiS MTQ	2.33	2.28	2.25	0.50

Low Cardinality PK Filter Terms

Approach	Large Lead Terms	Medium Lead Terms	Small Lead Terms	No Lead Terms
TiS Original	1.76	1.74	1.74	0.34
TiS MTQ	1.32	1.29	1.28	0.22

romseygeek · 2023-02-19T14:32:33Z

I'm not familiar enough with the monitor package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here.

It's fine to just change the test case here, we're only checking that the debug functionality on the monitor reports that a given query has matched and the specific string output isn't important.

gsmiller · 2023-02-19T17:54:27Z

It's fine to just change the test case here, we're only checking that the debug functionality on the monitor reports that a given query has matched and the specific string output isn't important.

OK thanks. I've removed the BQ rewrite logic and updated this test. I'm not convinced we need to actually rewrite to a BQ during the query rewrite, but that's the one big difference with this implementation.

MarcusSorealheis · 2023-02-19T19:16:24Z

@gsmiller this is a great simplification. Thank you. I'm going to share it with our team.

…opefully; see apache#12156)

…tiTermQuery

gsmiller · 2023-02-28T23:48:31Z

I've rebased this work on top of main, picking up the significant changes in the MultiTermQuery space (#12055) and switching the default rewrite method for TermInSetQuery to the newly added CONSTANT_SCORE_BLENDED_REWRITE. I re-ran my simple benchmarking tooling (mentioned above) and results still look good.

uschindler

Looks good to me! I am not fully sure what default rewrite method is best here.

gsmiller · 2023-03-01T13:20:18Z

Thanks @uschindler.

I am not fully sure what default rewrite method is best here.

The nice thing is it's easy to control now (bitset rewrite, boolean scoring, doc values post-filtering, etc.). Based on the benchmark wins in #12055 for other multi-term queries, I thought it would be good to use the same rewrite by default, at least for now. We can change the default easily if we learn more.

…tiTermQuery (#12156)

MultiTermQuery return null for ScoreSupplier if there are no terms in an index that match query terms. With the introduction of PR apache#12156 we saw degradation in performance of bool queries where one of the mandatory clauses is a TermInSetQuery with query terms not present in the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which would shortcut the whole bool query. This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field doesn't contain any query terms. Relates to PR apache#12156

MultiTermQuery return null for ScoreSupplier if there are no terms in an index that match query terms. With the introduction of PR #12156 we saw degradation in performance of bool queries where one of the mandatory clauses is a TermInSetQuery with query terms not present in the field. Before for such cases TermsInSetQuery returned null for ScoreSupplier which would shortcut the whole bool query. This PR adds ability for MultiTermQuery to return null for ScoreSupplier if a field doesn't contain any query terms. Relates to PR #12156

gsmiller mentioned this pull request Feb 19, 2023

Better skipping for multi-term queries with a FILTER rewrite. #12055

Merged

gsmiller added a commit to jpountz/lucene that referenced this pull request Feb 23, 2023

remove custom impl for TiSQuery in favor of having TiSQ extend MTQ (h…

1a59f66

…opefully; see apache#12156)

Remove custom TermInSetQuery implementation in favor of extending Mul…

aee4262

…tiTermQuery

gsmiller force-pushed the GH/tis-extend-mtq branch from feb3c0b to aee4262 Compare February 28, 2023 22:22

changes

e17b7d8

uschindler approved these changes Mar 1, 2023

View reviewed changes

gsmiller merged commit 3809106 into apache:main Mar 1, 2023

gsmiller deleted the GH/tis-extend-mtq branch March 1, 2023 13:20

gsmiller added a commit that referenced this pull request Mar 1, 2023

Remove custom TermInSetQuery implementation in favor of extending Mul…

fa1d050

…tiTermQuery (#12156)

gsmiller added this to the 9.6.0 milestone Mar 1, 2023

gsmiller mentioned this pull request Mar 1, 2023

Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod #12175

Merged

gsmiller mentioned this pull request Mar 10, 2023

DRAFT: TermInSetQuery refactored to extend MultiTermsQuery #11739

Closed

mayya-sharipova mentioned this pull request Jun 5, 2024

MultiTermQuery return null for ScoreSupplier #13454

Merged

carlosdelest mentioned this pull request Jun 19, 2024

terms queries performance degradation elastic/elasticsearch#109913

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156

Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156

gsmiller commented Feb 19, 2023 •

edited

Loading

romseygeek commented Feb 19, 2023

gsmiller commented Feb 19, 2023

MarcusSorealheis commented Feb 19, 2023 •

edited

Loading

gsmiller commented Feb 28, 2023

uschindler left a comment •

edited

Loading

gsmiller commented Mar 1, 2023

Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156

Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery #12156

Conversation

gsmiller commented Feb 19, 2023 • edited Loading

Description

Performance

All Country Code Filter Terms

Medium Cardinality + High Cost Country Code Filter Terms

Low Cardinality + High Cost Country Code Filter Terms

Medium Cardinality + Low Cost Country Code Filter Terms

Low Cardinality + Low Cost Country Code Filter Terms

High Cardinality PK Filter Terms

Medium Cardinality PK Filter Terms

Low Cardinality PK Filter Terms

romseygeek commented Feb 19, 2023

gsmiller commented Feb 19, 2023

MarcusSorealheis commented Feb 19, 2023 • edited Loading

gsmiller commented Feb 28, 2023

uschindler left a comment • edited Loading

Choose a reason for hiding this comment

gsmiller commented Mar 1, 2023

gsmiller commented Feb 19, 2023 •

edited

Loading

MarcusSorealheis commented Feb 19, 2023 •

edited

Loading

uschindler left a comment •

edited

Loading