Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram aggs: add empty buckets only in the final reduce step #35921

Merged

Conversation

javanna
Copy link
Member

@javanna javanna commented Nov 26, 2018

Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This should allow us to later remove the max buckets limit when performing non final reduction.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

if (minDocCount == 0) {
addEmptyBuckets(reducedBuckets, reduceContext);
}
if (InternalOrder.isKeyDesc(order)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a few minutes to convince myself that this works. :)

On first glance the behavior changes because sorting is only done on the final reduction now (instead of every reduction), which I thought might break reduceBuckets() as that relies on consistent ordering. But histos/date_histos sort their shard results by key:asc, so this isn't actually a problem.

Could we add a test to DateHistogramAggregatorTests/HistogramAggregatorTests that randomly chooses a sort order + min_doc_count: 0 to ensure this internal contract never changes in the future? It looks like none of those tests set an order so we might not notice otherwise.

Copy link
Member Author

@javanna javanna Nov 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I changed anything around sorting. That if was hard to read, I rewrote it but the only actual change is that empty buckets are only added in the final reduction phase. I am working on tests, I will add what you suggest, I had other additions as well in mind but I hit some roadblocks (see #36004). Basically empty buckets were not tested in our unit tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted about this in slack, and I'm just bad at boolean logic. All good here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check to the base class which verifies that no buckets are added in non final reduce phases. Now that we test adding empty buckets for histogram and date histogram aggs, this check makes some sense I think. These are the only aggs that add buckets as part of reduce right? I wonder if I need to check whether there is some other missing test coverage somewhere.

Also, I worked a bit on increasing test coverage in DateHistogramAggregatorTests like you suggested, and I think I prefer doing it in a follow-up if still necessary.

javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 28, 2018
In this test we were randomizing different values but minDocCount was hardcoded to 1. It's important to test other values, especially `0` as it's the default. The test needed some adapting in the way buckets are randomly generated: all aggs need to share the same interval, minDocCount and emptyBucketInfo. Also assertions need to take into account that more (or less) buckets are expected depending on minDocCount.

This was originated by elastic#35921 and its need to test adding empty buckets as part of the reduce phase.
Also relates to elastic#26856 as one more key comparison needed to use `Double.compare` to properly handle `NaN` values, this was triggered by the increased test coverage.
javanna added a commit that referenced this pull request Nov 28, 2018
In `InternalHistogramTests` we were randomizing different values but `minDocCount` was hardcoded to `1`. It's important to test other values, especially `0` as it's the default. To make this possible, the test needed some adapting in the way buckets are randomly generated: all aggs need to share the same `interval`, `minDocCount` and `emptyBucketInfo`. Also assertions need to take into account that more (or less) buckets are expected depending on `minDocCount`.

This was originated by #35921 and its need to test adding empty buckets as part of the reduce phase.

Also relates to #26856 as one more key comparison needed to use `Double.compare` to properly handle `NaN` values, which was triggered by the increased test coverage.
javanna and others added 2 commits November 30, 2018 14:18
Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This should allow us to later remove the max buckets limit when performing non final reduction.
@javanna javanna force-pushed the enhancement/histo_empty_buckets_final_reduce branch from fe11866 to 9f55e45 Compare November 30, 2018 16:21
@javanna javanna requested a review from jimczi November 30, 2018 19:18
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @javanna

initialBucketCount += countInnerBucket(internalAggregation);
}
int reducedBucketCount = countInnerBucket(reduced);
//check that non final reduction never adds buckets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@javanna javanna merged commit 0ebc177 into elastic:master Nov 30, 2018
javanna added a commit to javanna/elasticsearch that referenced this pull request Dec 3, 2018
Given that we check the max buckets limit on each shard when collecting the buckets, and that non final reduction cannot add buckets (see elastic#35921), there is no point in counting and checking the number of buckets as part of non final reduction phases.

Such check is still needed though in the final reduction phases to make sure that the number of returned buckets is not above the allowed threshold.

Relates somehow to elastic#32125 as we will make use of non final reduction phases in CCS alternate execution mode and that increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
javanna added a commit that referenced this pull request Dec 3, 2018
Given that we check the max buckets limit on each shard when collecting the buckets, and that non final reduction cannot add buckets (see #35921), there is no point in counting and checking the number of buckets as part of non final reduction phases.

Such check is still needed though in the final reduction phases to make sure that the number of returned buckets is not above the allowed threshold.

Relates somehow to #32125 as we will make use of non final reduction phases in CCS alternate execution mode and that increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
javanna added a commit that referenced this pull request Dec 3, 2018
In `InternalHistogramTests` we were randomizing different values but `minDocCount` was hardcoded to `1`. It's important to test other values, especially `0` as it's the default. To make this possible, the test needed some adapting in the way buckets are randomly generated: all aggs need to share the same `interval`, `minDocCount` and `emptyBucketInfo`. Also assertions need to take into account that more (or less) buckets are expected depending on `minDocCount`.

This was originated by #35921 and its need to test adding empty buckets as part of the reduce phase.

Also relates to #26856 as one more key comparison needed to use `Double.compare` to properly handle `NaN` values, which was triggered by the increased test coverage.
javanna added a commit that referenced this pull request Dec 3, 2018
Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This will allow us to later remove the max buckets limit when performing non final reduction.
javanna added a commit that referenced this pull request Dec 3, 2018
Given that we check the max buckets limit on each shard when collecting the buckets, and that non final reduction cannot add buckets (see #35921), there is no point in counting and checking the number of buckets as part of non final reduction phases.

Such check is still needed though in the final reduction phases to make sure that the number of returned buckets is not above the allowed threshold.

Relates somehow to #32125 as we will make use of non final reduction phases in CCS alternate execution mode and that increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants