Make TermsQuery considered costly. #16851

Merged
merged 1 commit into elastic:2.x from jpountz:enhancement/TermsQuery_is_costly on Mar 1, 2016

Conversation

@jpountz
Contributor

jpountz commented Feb 29, 2016

This will be fixed in Lucene 6. This commit aims at backporting the change to
elasticsearch 2.3 which will be on Lucene 5.

See https://issues.apache.org/jira/browse/LUCENE-7050.

@synhershko

Contributor

synhershko commented Feb 29, 2016

This is somewhat surprising - I've worked on plenty of systems where TermsQuery is perfectly cacheable and low-hanging fruit when it comes to optimisations.

To my understanding, this will prevent TermsQuery from being cached in many cases - based on heuristics, yes, but still biased towards not caching.

Can we have a way to opt in to caching a query manually, similar to pre-2.x ES?

@jpountz

Contributor Author

jpountz commented Feb 29, 2016

Sorry, maybe my explanation was not clear, but the goal is to make TermsQuery cached more aggressively, not less. The queries that are classified as "costly" in that case are queries that need to evaluate documents on the whole index, regardless of whether they are intersected with a selective query or not (e.g. TermsQuery, PrefixQuery, NumericRangeQuery).
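
For illustration, here is a rough sketch of the kind of heuristic this implies; the class name and the thresholds are made up for the sketch, only the query types come from the list above:

```java
import org.apache.lucene.queries.TermsQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Hypothetical sketch: queries of these types evaluate candidate documents
// across the whole index regardless of how selective the rest of the query
// is, so a caching policy can afford to cache them more aggressively.
final class CostlyQueryHeuristic {

    static boolean isCostly(Query query) {
        return query instanceof TermsQuery
                || query instanceof PrefixQuery
                || query instanceof NumericRangeQuery;
    }

    // A policy could use a lower reuse threshold for costly queries, so they
    // end up in the query cache after fewer repetitions (numbers illustrative).
    static int minFrequencyToCache(Query query) {
        return isCostly(query) ? 2 : 5;
    }
}
```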

@jimczi

Member

jimczi commented Feb 29, 2016

LGTM, though I wonder if we could check the size of the terms query in order to decide whether it's a costly query or not. If the number of terms is below 16 then the query is rewritten into a "cheap" boolean query.
I wonder if we should also add GeoPointTermQueryConstantScoreWrapper (not in this PR of course).
IMO we should add a simple check for any query that would be able to tell beforehand whether the query will be executed on the whole index. This could make the caching more efficient: we could directly extract the bitset created by this query instead of relying on advance to build the bitset for the cache.

@jpountz

Contributor Author

jpountz commented Feb 29, 2016

If the number of terms is below 16 then the query is rewritten into a "cheap" boolean query.

I think this is fine: the query cache only sees rewritten queries, so if a TermsQuery that contains fewer than 16 terms is passed to IndexSearcher, the query cache will actually see a BooleanQuery.

IMO we should add a simple check for any query that would be able to tell if the query will be executed on the whole index beforehand.

I have been avoiding it so far as I would really like queries to be unaware of caching as much as possible.

@jimczi

Member

jimczi commented Feb 29, 2016

I think this is fine: the query cache only sees rewritten queries, so if a TermsQuery that contains fewer than 16 terms is passed to IndexSearcher, the query cache will actually see a BooleanQuery.

No, the rewrite is a no-op; the resolution is done in the weight when the scorer is created. It's the same for any MultiTermQuery: the rewrite does not check anything, it just returns a new class which applies the heuristic to create a scorer based on a boolean query or on an already-built doc id set.

I have been avoiding it so far as I would really like queries to be unaware of caching as much as possible.

I agree, but we should have a way to detect if a costly query is really costly ;). Maybe we could have a post-query callback which checks the cost of the query; if the cost is low then there is no need to add the query to the frequency-tracking ring buffer?
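
Roughly, the idea could look like the following sketch; the callback, the cost source and the threshold are all hypothetical, not an existing Lucene API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.search.Query;

// Hypothetical sketch: only record a query in the frequency-tracking history
// if its measured cost turned out to be high.
final class CostAwareUsageTracker {

    private static final int HISTORY_SIZE = 256;       // made-up size
    private static final long MIN_COST_TO_TRACK = 100; // made-up threshold

    // Ring buffer of recently used queries that were actually expensive.
    private final Deque<Query> recentlyUsed = new ArrayDeque<>(HISTORY_SIZE);

    // Post-query callback: 'cost' would come from the scorer after execution,
    // e.g. an estimate or the number of documents it actually visited.
    synchronized void onQueryExecuted(Query query, long cost) {
        if (cost < MIN_COST_TO_TRACK) {
            return; // cheap in practice, no need to consider it for caching
        }
        if (recentlyUsed.size() == HISTORY_SIZE) {
            recentlyUsed.removeFirst();
        }
        recentlyUsed.addLast(query);
    }
}
```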

@jpountz

Contributor Author

jpountz commented Mar 1, 2016

TermsQuery actually rewrites both in rewrite and per segment: rewrite generates a BooleanQuery when there are fewer than 16 terms. If there are more than 16 terms, it is also rewritten per segment when fewer than 16 of the query's terms are actually contained in that segment's terms dict. So I think that's fine?
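
For illustration, a small sketch of that rewrite behavior, assuming Lucene 5.x APIs (TermsQuery from the queries module); the field and terms are made up:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsQuery;
import org.apache.lucene.search.Query;

final class TermsQueryRewriteExample {

    static void printRewrittenClass(IndexReader reader) throws Exception {
        Query terms = new TermsQuery(
                new Term("user", "alice"),
                new Term("user", "bob"),
                new Term("user", "carol"));

        // With fewer than 16 terms the query should rewrite into a cheap
        // BooleanQuery before the caching policy ever looks at it.
        Query rewritten = terms.rewrite(reader);
        System.out.println(rewritten.getClass().getSimpleName()); // expected: BooleanQuery
    }
}
```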

Regarding an API for detecting costly queries, don't get me wrong, I understand the need and wish we had one. But having such an API only for the purpose of caching feels wrong to me, so I'd rather wait and see if we can find other use-cases for it, like better query execution (regardless of caching).

@jimczi

Member

jimczi commented Mar 1, 2016

So I think that's fine?

Sorry, I checked MultiTermQuery, and TermsQuery does not inherit from it.

Anyway, this PR should help a lot for use cases like #16031, so again LGTM.

@synhershko

Contributor

synhershko commented Mar 1, 2016

@jpountz 👍, thanks for explaining.

As a side note, and as a power user with many requirements for low-latency querying, the lack of manual control over caching pains me quite a bit. While you are doing great work here, there will always be those corner cases and bugs, and for those it'd be really helpful to have a "force_cache": true option for queries in the filter context.

@jpountz

Contributor Author

jpountz commented Mar 1, 2016

If the common filters are known beforehand, wouldn't it be better to add a flag at index time to know whether those queries match, in order to be able to run fast term queries at search time?

For your low-latency requirements, I think something we should explore would be to automatically warm up the cache for new segments based on the popular filters from the history (we don't do it today, and these popular filters will likely only get cached the first time they are used on these new segments, potentially increasing latency if the filter is costly or matches many docs).

This will be fixed in Lucene 6. This commit aims at backporting the change to
elasticsearch 2.3 which will be on Lucene 5.
@jpountz jpountz force-pushed the jpountz:enhancement/TermsQuery_is_costly branch to 0d102ee Mar 1, 2016
@jpountz jpountz merged commit 0d102ee into elastic:2.x Mar 1, 2016
1 check passed
CLA: Commit author is a member of Elasticsearch
@jpountz jpountz deleted the jpountz:enhancement/TermsQuery_is_costly branch Mar 1, 2016
@jpountz jpountz removed the review label Mar 1, 2016