Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Do not cache term filters by default. #7583

Closed

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Sep 4, 2014

These filters are directly based on postings lists and can already
iterate/advance quickly. It is still possible to opt-in for caching.

These filters are directly based on postings lists and can already
iterate/advance quickly. It is still possible to opt-in for caching.
@rjernst
Copy link
Member

rjernst commented Sep 5, 2014

LGTM.

@s1monw
Copy link
Contributor

s1monw commented Sep 5, 2014

LGTM too :) good one

@clintongormley
Copy link

This change concerns me a bit, based on the numbers in #7577 (comment)

A lot of the time, we use term filters on dense values like:

  • _type: "event"
  • status: "active"
  • gender: "male"
  • user: "customer_1" (think filtered aliases)

These common use cases are all dense and often combined in the same filter. According to the benchmarks, each of them would be 50% slower than they are today.

The other concern I have is that caching term filters allows us to completely skip segments that don't contain that term. Without that caching, we're going to have to do a term lookup on potentially thousands of segments which we would previously have skipped. What impact does that have on performance?

The numbers shown in #7577 (comment) for Leapfrog look promising, but the bool filter won't do that today so users will experience a drop in performance out of the box.

This change (and #7577) seem like they could be very useful in the longer term, but they require more (as yet unwritten) supporting changes to provide a net out-of-the-box benefit. I'd vote for not putting them into 1.4 but delaying them until we can implement a more complete solution.

(Apologies if I have misunderstood the impact of these changes)

@jpountz
Copy link
Contributor Author

jpountz commented Oct 2, 2014

@clintongormley Your analysis is correct but I think it underestimates the cost of caching:

  1. unnecessary garbage
  2. the iterator needs to be fully consumed
  3. is due to the fact that we need to load things into memory and put it in a cache. The fact that it goes to a cache makes it very likely to end up in the old generation. For that reason, I think we should generally not cache filters maybe unless we already have something that is cacheable (eg. terms filter) or we know the filter is going to be reused.
  4. is bound to the fact that our postings lists have good skip lists. Imagine a user who would use term filters that are never reused. If they are cached, elasticsearch would need to iterate over all documents that match the filter and put them into a bit set, and then use that bit set to intersect the query. On the contrary if the filter is not cached, the postings list would be directly intersected with the query and thanks to the leap-frog approach, very few documents could be consumed from the postings list (it would just jump efficiently over non-matching docs using the skip list). This makes non-cached filters potentially much faster when there is not enough reuse.

I agree that #7577 needs more work in order to be better in all cases and not only the sparse case, but I think this change is good? Maybe something that needs to be improved is the default value of the cache parameter. For example, I believe it would not be too hard to have caching disabled by default in general unless it applies to the _type field or if we are in the context of an alias filter (the latter would probably also apply to other filters that we don't cache today by default?).

@dakrone
Copy link
Member

dakrone commented Oct 2, 2014

I think it might also be useful to cache term filters used in aliases, since creating an alias with a term filter has a high chance of being re-used (for future work, just an idea).

@jpountz
Copy link
Contributor Author

jpountz commented Nov 12, 2014

We just had a discussion about this issue and could not reach to an agreement about what the best default _cache value is for term filters.

A better solution could be to make the decision based on usage as outlined in #8449.

@jpountz jpountz closed this Nov 12, 2014
@bobrik
Copy link
Contributor

bobrik commented Jan 8, 2015

@jpountz
Copy link
Contributor Author

jpountz commented Jan 8, 2015

Oops, that's probably because I forgot to remove the labels when closing. Thanks for the report, I'll try to get it removed from the blog post as well.

@jpountz
Copy link
Contributor Author

jpountz commented Jan 8, 2015

OK I just removed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants