Aggregations: delegation of setNextReader calls #9098
This is a follow-up to #6477. Calls to setNextReader are today centralized by the
Ideally I would like aggregators to be as close as possible to Lucene collectors in terms of API. So even if it would initially be worse for deeply nested trees, I think we should revive #6477 and think about other ways to make deeply nested trees faster.
The problem I found in #6477 was with aggregations that create buckets dynamically, such as the terms aggregation. Because we have no idea up front how many buckets will be needed, we currently create a new instance of the aggregator for each bucket. This instance needs to be reader and scorer aware. We use the anonymous Aggregator class in AggregatorFactories to create and manage these instances. With the approaches I tried in #6477, this anonymous class always ends up iterating through a collection of Aggregator instances (one for each bucket) on every call to setNextReader(), and it is this iteration that kills performance: in nested terms aggregations (terms aggregations with sub-terms aggregations) we end up creating an iterator for every parent term bucket to iterate through the sub-term buckets. With the current way of registering the instance with the AggregationContext, there is only a single list of instances on which setNextReader() needs to be called, so only one iterator is required rather than nested iterators.
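To make the difference concrete, here is a minimal sketch of the two styles. It is not the Elasticsearch code: every class and method name below (ReaderAware, Leaf, Delegating, FlatRegistry, nestedLoops, flatLoops) is hypothetical, and the only thing it measures is how many loops (and therefore iterators) get started for a single segment change.

```java
import java.util.ArrayList;
import java.util.List;

// All names here are hypothetical stand-ins, not Elasticsearch classes.
final class SetNextReaderSketch {

    /** Stand-in for a reader-aware component. */
    interface ReaderAware {
        void setNextReader(int readerOrd);
    }

    /** One per-bucket aggregator instance; it only records the current reader. */
    static final class Leaf implements ReaderAware {
        int currentReader = -1;

        @Override
        public void setNextReader(int readerOrd) {
            currentReader = readerOrd;
        }
    }

    /** Delegating style: a parent forwards the call to its own per-bucket children. */
    static final class Delegating implements ReaderAware {
        final List<ReaderAware> children = new ArrayList<>();
        final int[] loops; // shared counter of loops (iterators) started

        Delegating(int[] loops) {
            this.loops = loops;
        }

        @Override
        public void setNextReader(int readerOrd) {
            loops[0]++; // the for-each below creates one iterator per parent instance
            for (ReaderAware child : children) {
                child.setNextReader(readerOrd);
            }
        }
    }

    /** Centralized style: every instance is registered in one flat list. */
    static final class FlatRegistry {
        final List<ReaderAware> registered = new ArrayList<>();
        int loops = 0;

        void setNextReader(int readerOrd) {
            loops++; // a single iterator, however many instances are registered
            for (ReaderAware r : registered) {
                r.setNextReader(readerOrd);
            }
        }
    }

    /** Loops started for one segment change when calls are delegated down the tree. */
    static int nestedLoops(int parentBuckets, int subBuckets) {
        int[] loops = {0};
        Delegating root = new Delegating(loops);
        for (int p = 0; p < parentBuckets; p++) {
            Delegating parent = new Delegating(loops);
            for (int s = 0; s < subBuckets; s++) {
                parent.children.add(new Leaf());
            }
            root.children.add(parent);
        }
        root.setNextReader(0);
        return loops[0];
    }

    /** Loops started for one segment change when all instances share one list. */
    static int flatLoops(int parentBuckets, int subBuckets) {
        FlatRegistry registry = new FlatRegistry();
        for (int i = 0; i < parentBuckets * subBuckets; i++) {
            registry.registered.add(new Leaf());
        }
        registry.setNextReader(0);
        return registry.loops;
    }

    public static void main(String[] args) {
        System.out.println("nested loops: " + nestedLoops(1000, 10));
        System.out.println("flat loops:   " + flatLoops(1000, 10));
    }
}
```

In this sketch the delegating variant starts one loop per parent bucket plus one for the root, while the flat registry starts exactly one, which is the effect described above. The real cost in Elasticsearch depends on details the sketch leaves out.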
If we could somehow remove the need for BucketAggregationMode.PER_BUCKET aggregators, we would have only one instance of the ReaderContextAware and ScorerAware class (the Aggregator), regardless of the number of buckets the Aggregator creates.
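One way this could look, purely as an illustration: a single aggregator instance that keeps its per-bucket state in an array indexed by a bucket ordinal, so setNextReader() is called once per segment no matter how many buckets exist. The class and method names (CountAggregator, collect, count) and the growth strategy are assumptions made for the sketch, not the Elasticsearch API.

```java
import java.util.Arrays;

// Hypothetical sketch: one aggregator instance serving every bucket.
final class SingleInstanceSketch {

    /** Counts documents per bucket; state lives in an array indexed by bucket ordinal. */
    static final class CountAggregator {
        private long[] counts = new long[8];
        private int currentReader = -1;
        int setNextReaderCalls = 0;

        /** Called once per segment, independent of the number of buckets. */
        void setNextReader(int readerOrd) {
            currentReader = readerOrd; // a real aggregator would load per-segment state here
            setNextReaderCalls++;
        }

        /** Collects one document into the bucket identified by bucketOrd. */
        void collect(int doc, int bucketOrd) {
            if (bucketOrd >= counts.length) {
                // Grow on demand, since the number of buckets is not known up front.
                counts = Arrays.copyOf(counts, Math.max(bucketOrd + 1, counts.length * 2));
            }
            counts[bucketOrd]++;
        }

        long count(int bucketOrd) {
            return bucketOrd < counts.length ? counts[bucketOrd] : 0L;
        }
    }

    public static void main(String[] args) {
        CountAggregator agg = new CountAggregator();
        agg.setNextReader(0);
        agg.collect(0, 0);
        agg.collect(1, 20);
        agg.setNextReader(1);
        agg.collect(0, 20);
        System.out.println("bucket 20 count: " + agg.count(20));
        System.out.println("setNextReader calls: " + agg.setNextReaderCalls);
    }
}
```

The trade-off in a design like this is that every sub-aggregator has to accept a bucket ordinal when collecting, instead of relying on one instance per bucket.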