Aggregations: delegation of setNextReader calls #9098

Closed
jpountz opened this issue Dec 30, 2014 · 2 comments

Comments

@jpountz
Contributor

commented Dec 30, 2014

This is a follow-up to #6477. Calls to setNextReader are currently centralized by the AggregationContext class. The conclusion on #6477 was that this behaviour is better than having setNextReader delegate to sub-aggregations, as it is more efficient with deeply nested aggregation trees (with several aggregators sharing the same source of values). However, I'm more and more convinced this is not the right approach:

  • it makes aggregations hard to unit test
  • it makes defer/replay sub-optimal as the whole context is replayed instead of just what is needed
  • it makes it hard to migrate to Lucene 5-style collectors (with one leaf collector per segment), which would allow us to have better-optimized aggregations (especially for the single-valued case).

Ideally I would like aggregators to be as close as possible to Lucene collectors in terms of API. So even if it would initially be worse for deeply nested trees, I think we should revive #6477 and think about other ways to make deeply nested trees faster.
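
To make the trade-off concrete, here is a minimal sketch of the two styles. It does not use the real Elasticsearch or Lucene classes: `Agg`, `RecordingAgg` and `CentralContext` are hypothetical names invented for illustration, and a segment is reduced to a plain string. It only shows why delegation lets a single sub-tree be driven (for a unit test or a replay) without touching unrelated aggregators.

```java
import java.util.ArrayList;
import java.util.List;

final class SetNextReaderSketch {
    /** Hypothetical minimal contract; a segment is represented by its name only. */
    interface Agg {
        void setNextReader(String segment);
    }

    /** Records every segment change it sees, then forwards to its children (if any). */
    static final class RecordingAgg implements Agg {
        private final String name;
        private final List<String> log;
        private final List<Agg> children = new ArrayList<>();

        RecordingAgg(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        RecordingAgg withChild(Agg child) {
            children.add(child);
            return this;
        }

        @Override
        public void setNextReader(String segment) {
            log.add(name + "@" + segment);
            for (Agg child : children) {
                child.setNextReader(segment);
            }
        }
    }

    /** Centralized style: one flat registry that can only be notified as a whole. */
    static final class CentralContext {
        private final List<Agg> registered = new ArrayList<>();

        void register(Agg agg) {
            registered.add(agg);
        }

        void setNextReader(String segment) {
            for (Agg agg : registered) {
                agg.setNextReader(segment);
            }
        }
    }

    /** Centralized: moving to a segment notifies every registered aggregator. */
    static List<String> centralized() {
        List<String> log = new ArrayList<>();
        CentralContext context = new CentralContext();
        context.register(new RecordingAgg("terms", log));
        context.register(new RecordingAgg("avg", log));
        context.register(new RecordingAgg("max", log));
        context.setNextReader("seg0");
        return log;
    }

    /** Delegating: the "terms" sub-tree can be driven alone; "max" is never touched. */
    static List<String> delegatedSubtree() {
        List<String> log = new ArrayList<>();
        RecordingAgg terms = new RecordingAgg("terms", log).withChild(new RecordingAgg("avg", log));
        RecordingAgg max = new RecordingAgg("max", log); // built but deliberately not notified
        terms.setNextReader("seg0");
        return log;
    }
}
```

In this sketch the centralized variant has no way to notify only `terms` and `avg`, whereas the delegating variant needs nothing beyond the sub-tree root.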

cc @colings86

@colings86

Member

commented Jan 5, 2015

+1

The problem I found in #6477 was with aggregations (such as the terms aggregation) that create buckets dynamically. Because we have no idea up front how many buckets will be needed, we currently create a new instance of the aggregator for each bucket. Each instance needs to be reader- and scorer-aware. We use the anonymous Aggregator class in AggregatorFactories [1] to create and manage these instances. With the approaches I tried in #6477, this anonymous class always ends up iterating through a collection of Aggregator instances (one per bucket) on every call to setNextReader(), and it is this iteration that kills performance: in nested terms aggregations (terms aggregations with sub-terms aggregations) we end up creating an iterator for every parent term bucket in order to iterate through its sub-term buckets. With the current approach of registering each instance with the AggregationContext, there is only a single list of instances on which setNextReader() needs to be called, so only one iterator is required rather than nested iterators.
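
A rough sketch of the iteration pattern described here, under heavy simplification: `Instance`, `Stats`, `nested` and `flat` are hypothetical names, not Elasticsearch classes, and the "loops started" counter is only a stand-in for iterators created. It measures nothing about real performance. It only shows that both layouts make the same number of setNextReader calls while the delegating layout starts one loop per parent bucket on top of the outer one.

```java
import java.util.ArrayList;
import java.util.List;

final class PerBucketIterationSketch {
    /** Counters: setNextReader calls made and loops (iterators) started. */
    static final class Stats {
        int calls;
        int loops;
    }

    /** Hypothetical reader-aware instance; may own one child instance per bucket. */
    static final class Instance {
        private final Stats stats;
        private final List<Instance> perBucket = new ArrayList<>();

        Instance(Stats stats) {
            this.stats = stats;
        }

        void setNextReader() {
            stats.calls++;
            if (!perBucket.isEmpty()) {
                stats.loops++; // one iterator per wrapper that owns per-bucket instances
                for (Instance child : perBucket) {
                    child.setNextReader();
                }
            }
        }
    }

    /** Delegating layout: each parent bucket owns a wrapper that loops over its sub-buckets. */
    static Stats nested(int parentBuckets, int subBuckets) {
        Stats stats = new Stats();
        Instance root = new Instance(stats);
        for (int p = 0; p < parentBuckets; p++) {
            Instance parent = new Instance(stats);
            for (int s = 0; s < subBuckets; s++) {
                parent.perBucket.add(new Instance(stats));
            }
            root.perBucket.add(parent);
        }
        root.setNextReader();
        return stats;
    }

    /** Centralized layout: the same number of instances, registered in one flat list. */
    static Stats flat(int parentBuckets, int subBuckets) {
        Stats stats = new Stats();
        List<Instance> registered = new ArrayList<>();
        int total = 1 + parentBuckets + parentBuckets * subBuckets;
        for (int i = 0; i < total; i++) {
            registered.add(new Instance(stats));
        }
        stats.loops++; // the single loop over the flat list
        for (Instance instance : registered) {
            instance.setNextReader();
        }
        return stats;
    }
}
```

With 3 parent buckets and 4 sub-buckets each, both layouts make 16 calls per segment change, but `nested` starts 4 loops where `flat` starts 1.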

If we could somehow remove the need for BucketAggregationMode.PER_BUCKET aggregators, we would have only one instance of the ReaderContextAware and ScorerAware class (the Aggregator), regardless of the number of buckets the Aggregator creates.
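
One way to picture a single instance serving every bucket is to key its state by a bucket ordinal. The sketch below is an assumption about what that could look like, not the actual Elasticsearch design: `OrdinalSumSketch`, its `collect(doc, bucketOrd)` signature and the `double[]` standing in for a segment's values are all hypothetical.

```java
import java.util.Arrays;

final class OrdinalSumSketch {
    private double[] sums = new double[1];   // one slot per bucket ordinal, grown on demand
    private double[] values = new double[0]; // per-document values of the current segment
    private int readerChanges = 0;

    /** Called once per segment, however many buckets exist. */
    void setNextReader(double[] segmentValues) {
        values = segmentValues;
        readerChanges++;
    }

    /** Adds the value of doc to the running sum of the given bucket. */
    void collect(int doc, int bucketOrd) {
        if (bucketOrd >= sums.length) {
            sums = Arrays.copyOf(sums, Math.max(bucketOrd + 1, sums.length * 2));
        }
        sums[bucketOrd] += values[doc];
    }

    /** Buckets that never received a document report a sum of zero. */
    double sum(int bucketOrd) {
        return bucketOrd < sums.length ? sums[bucketOrd] : 0.0;
    }

    int readerChanges() {
        return readerChanges;
    }
}
```

Here the cost of a segment change is independent of the bucket count, since only the one instance is notified and new buckets just grow an array.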

[1] https://github.com/elasticsearch/elasticsearch/blob/610ce078fb3c84c47d6d32aff7d77ba850e28f9d/src/main/java/org/elasticsearch/search/aggregations/AggregatorFactories.java#L78-146

@jpountz

Contributor Author

commented Jun 15, 2015

Fixed through #9544
