
[DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery #12312

Draft · wants to merge 2 commits into main
Conversation

gsmiller (Contributor)

Description

I started experimenting with #12176 to see if we can get any benefits out of having TermInSetQuery extend AutomatonQuery instead of MultiTermQuery. I'm opening this only as a draft for now so that I can "show my work" alongside some benchmark results as I have them.

@rmuir (Member)

rmuir commented May 19, 2023

thanks for getting this started! Will be interested to see how the use of Terms.intersect impacts the performance.

@gsmiller (Contributor, Author)

gsmiller commented May 19, 2023

Here's what I'm seeing so far in benchmarking...

I took a custom benchmarking approach for this, similar to #12151 and other related issues. I did this because (1) we don't really have benchmark coverage in luceneutil (I know, we should address this!), and (2) it lets me test a number of specific scenarios. Honestly, it was just the easiest path for me to get some numbers. My benchmark is here:
TiSBench.java.txt. If you're interested in running it, you'll need a little setup. Most of it is obvious, but I'm happy to answer questions if helpful.

The benchmark indexes geonames data (~12MM records). It includes an ID field and a Country Code field (both postings and doc values for each). The ID field is a primary key. The benchmark tasks break down into:

  1. Term disjunctions over the country code field, with varying cardinality: "all" country codes (254 terms), "medium cardinality" country codes (20 different terms), and "low cardinality" country codes (10 different terms). The medium/low cardinality tasks also break down into a set with high and low costs (i.e., common terms and rare terms).
  2. Term disjunctions over the ID field. There are high/medium/low versions of this task (500 terms, 20 terms, 10 terms).

For each task, there are four runs:

  1. Typical postings-based approach with the current TermInSetQuery that extends MultiTermQuery (using prefix coding of terms and the ping-pong intersection with index terms)
  2. Postings-based approach with a new TermInSetQuery that extends AutomatonQuery (delegating to Terms#intersect to intersect query terms with index terms)
  3. DocValues-based approach with the current TermInSetQuery
  4. DocValues-based approach with the new TermInSetQuery

I ran postings- and docvalues-based approaches since the term dictionary implementations are different.
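To make the first strategy concrete, here's a rough stdlib-only sketch of the "ping-pong" (leapfrog) intersection that the current MultiTermQuery-based implementation performs between the sorted query terms and the index terms dictionary. A `TreeSet` stands in for the terms dictionary's seek support — the real code works against Lucene's `TermsEnum`/`BytesRef` APIs — so treat this purely as an illustration of the access pattern; the automaton approach instead hands the whole term set to `Terms#intersect` and lets the codec drive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch of the "ping-pong" term intersection: the sorted query terms and
// the index terms dictionary alternately seek past each other, emitting
// only terms present in both. TreeSet#ceiling stands in for TermsEnum#seekCeil.
public class PingPongIntersect {
  static List<String> intersect(TreeSet<String> indexTerms, TreeSet<String> queryTerms) {
    List<String> hits = new ArrayList<>();
    String q = queryTerms.isEmpty() ? null : queryTerms.first();
    while (q != null) {
      String t = indexTerms.ceiling(q);   // "seekCeil" on the index terms
      if (t == null) break;               // exhausted the index side
      if (t.equals(q)) {
        hits.add(t);
        q = queryTerms.higher(t);         // advance the query side past the match
      } else {
        q = queryTerms.ceiling(t);        // let the index side pull the query side forward
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    TreeSet<String> index = new TreeSet<>(List.of("apple", "banana", "cherry", "fig", "grape"));
    TreeSet<String> query = new TreeSet<>(List.of("banana", "date", "grape", "kiwi"));
    System.out.println(intersect(index, query)); // [banana, grape]
  }
}
```

Either side can skip arbitrarily far ahead, which is why this pattern is cheap even when the query terms are sparse relative to the dictionary.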

In general, the two approaches demonstrate similar latency characteristics, but the automaton approach is a bit worse on the PK field. I dug in a bit with a profiler and I think we're just seeing the overhead of building the automaton. I think this overhead is showing up because the PK query processing is so cheap in general vs. the other tasks that can "hide" the overhead.

So... as of now, I don't see any performance benefits to moving to this approach, and maybe see some regressions. On the other hand, it would be nice to move to this implementation so we could have codec-dependent intersection techniques, which would help address issues like #12280 (bloom filter implementation could have a specific intersection implementation that leverages the bloom filter).
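As a toy illustration of that codec-dependent idea (every name here is invented for the sketch — none of this is Lucene API): a terms dictionary that carries a Bloom filter could answer "definitely not present" for most absent query terms without a dictionary seek, falling through to the authoritative lookup only for candidates that pass the filter.

```java
import java.util.BitSet;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: a Bloom-filter pre-check in front of a terms dictionary.
// A term whose filter bits aren't all set is definitely absent; a term that
// passes the filter might still be absent (false positive), so we confirm
// against the real dictionary.
public class BloomPrefilterSketch {
  private static final int BITS = 1 << 16;
  private final BitSet bloom = new BitSet(BITS);
  private final TreeSet<String> terms = new TreeSet<>();

  void add(String term) {
    terms.add(term);
    bloom.set(hash1(term));
    bloom.set(hash2(term));
  }

  // Intersection hook: cheap, possibly-false-positive check first,
  // authoritative dictionary lookup only for candidates that pass.
  boolean contains(String term) {
    if (!bloom.get(hash1(term)) || !bloom.get(hash2(term))) {
      return false; // definitely absent: no dictionary seek needed
    }
    return terms.contains(term);
  }

  private static int hash1(String s) { return (s.hashCode() & 0x7fffffff) % BITS; }
  private static int hash2(String s) { return ((s.hashCode() * 31 + 7) & 0x7fffffff) % BITS; }

  public static void main(String[] args) {
    BloomPrefilterSketch dict = new BloomPrefilterSketch();
    for (String t : List.of("apple", "banana", "cherry")) dict.add(t);
    System.out.println(dict.contains("banana")); // true
    System.out.println(dict.contains("kiwi"));   // false
  }
}
```

The point is just that only the codec knows such a structure exists, so only a codec-owned intersection can exploit it.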

I'll try to run some benchmarks on our Amazon product search application next week just to gather some additional data points. Maybe we can learn more about how this technique might behave on another benchmark data set.

Here are the benchmark results (numbers are query time in ms):

| Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
|---|---:|---:|---:|---:|
| All Country Code Filter Terms | 507.47 | 506.98 | 82.43 | 104.75 |
| Medium Cardinality + High Cost Country Code Filter Terms | 328.24 | 328.16 | 167.88 | 167.87 |
| Low Cardinality + High Cost Country Code Filter Terms | 70.85 | 70.93 | 169.59 | 169.71 |
| Medium Cardinality + Low Cost Country Code Filter Terms | 0.31 | 0.27 | 73.97 | 73.99 |
| Low Cardinality + Low Cost Country Code Filter Terms | 0.42 | 0.41 | 73.85 | 73.99 |
| High Cardinality PK Filter Terms | 6.34 | 8.57 | 127.34 | 129.63 |
| Medium Cardinality PK Filter Terms | 0.46 | 0.64 | 110.12 | 110.51 |
| Low Cardinality PK Filter Terms | 0.19 | 0.31 | 66.41 | 66.51 |

@rmuir (Member)

rmuir commented May 19, 2023

hmm, disappointing. Was hoping to see gains on the terms dictionary since it optimizes intersect. wonder what is going on.

Of course docvalues impl doesn't optimize intersect in any way, so it isn't surprising that we don't see gains there, but it is something we could improve later.
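For context on what the doc-values path does today (a stdlib-only stand-in with invented names — arrays here play the role of SortedDocValues, and `matchingDocs` is not a real API): query terms get mapped to ordinals up front, and matching is then a flat per-document ordinal check, so there is no term-dictionary intersection to optimize.

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of ordinal-based doc-values filtering: resolve each query term to
// its ordinal in the sorted value dictionary (the lookupTerm analogue), then
// test every document's ordinal against that set.
public class DocValuesTermSetFilter {
  static BitSet matchingDocs(String[] sortedDict, int[] docOrds, String[] queryTerms) {
    BitSet queryOrds = new BitSet(sortedDict.length);
    for (String term : queryTerms) {
      int ord = Arrays.binarySearch(sortedDict, term); // lookupTerm analogue
      if (ord >= 0) queryOrds.set(ord);
    }
    BitSet docs = new BitSet(docOrds.length);
    for (int doc = 0; doc < docOrds.length; doc++) {
      if (queryOrds.get(docOrds[doc])) docs.set(doc);  // per-doc ordinal check
    }
    return docs;
  }

  public static void main(String[] args) {
    String[] dict = {"DE", "FR", "US"};   // sorted country-code dictionary
    int[] docOrds = {2, 0, 1, 2, 0};      // one ordinal per doc
    System.out.println(matchingDocs(dict, docOrds, new String[] {"FR", "US"})); // {0, 2, 3}
  }
}
```

The linear per-doc scan is why the DV columns above barely move between the MTQ and automaton variants: the term-set representation only affects the cheap setup step.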

@gsmiller (Contributor, Author)

gsmiller commented May 19, 2023

OK, here's a method profiler diff for the "High Cardinality PK" task, comparing two postings approaches—one that's using the current MultiTermQuery version, and one using AutomatonQuery ("TermInSetQueryLegacy" is our current version extending MultiTermQuery). The green colored frames show where the AutomatonQuery is spending less time compared to the MultiTermQuery version, and red is the opposite. Eyeballing this, it seems to confirm that it's a little more expensive to build the automaton than to do the prefix-coding, but a little less expensive to do the term intersection with the automaton approach (down in the codec's optimized intersect). The net/net though is that the AutomatonQuery approach in this case is slower by ~35%.
[Screenshot: method profiler diff, 2023-05-19]

@gsmiller (Contributor, Author)

Hmm... not sure if I've got something set up incorrectly with my JFR settings, but trying to dig into the other tasks, I can't even get the relevant methods to show up in the profiled calls. I attached a debugger and made sure the call flow is what I expect, but I'm not seeing any of the term intersection logic in what's getting profiled. My theory is that the term intersection is so trivial relative to the actual work of computing the postings disjunction that it's just not showing up in the samples, but maybe there's a setting I'm missing and I'm not sampling at a fine enough grain? Not sure. But as far as I can tell, with the non-PK tasks, the term intersection may just be such a trivial part of the query cost that it doesn't matter which approach we take.
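For what it's worth, one knob that might explain the missing frames: JFR's execution sampler runs at a fairly coarse default period, so very cheap methods can vanish from the samples entirely. The `profile` settings sample more aggressively than the `default` settings, and the sampler period can be lowered further by editing a copy of the bundled `.jfc` file. A possible invocation (the jar and main-class names are placeholders for the benchmark setup; verify the flags against your JDK version):

```shell
# Placeholder jar/class names. "settings=profile" uses the higher-frequency
# sampling configuration; stackdepth raises the default stack-trace cutoff.
# To go finer-grained still, copy $JAVA_HOME/lib/jfr/profile.jfc and lower
# the period on the jdk.ExecutionSample event, then pass settings=/path/to/copy.jfc.
java -XX:StartFlightRecording=duration=60s,settings=profile,filename=tisbench.jfr \
     -XX:FlightRecorderOptions=stackdepth=256 \
     -cp tisbench.jar TiSBench
```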
