Histogram aggs: add empty buckets only in the final reduce step #35921

javanna · 2018-11-26T21:36:38Z

Empty buckets don't need to be added when performing an incremental reduction step; they can be added later, in the final reduction step. This should allow us to later remove the max buckets limit when performing non-final reductions.
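As a rough illustration of the idea (not the actual Elasticsearch code), the sketch below merges per-shard bucket maps and fills gaps only when a hypothetical `isFinalReduce` flag is set. The class name, the method signature and the `TreeMap` representation are assumptions made for this example.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for a histogram reduce; not the Elasticsearch classes.
final class EmptyBucketReduceSketch {

    /**
     * Merges (key -> docCount) maps. Gaps are filled with empty buckets only
     * on the final reduce. Assumes interval > 0.
     */
    static TreeMap<Long, Long> reduce(List<TreeMap<Long, Long>> partials, long interval,
                                      long minDocCount, boolean isFinalReduce) {
        TreeMap<Long, Long> merged = new TreeMap<>();
        for (TreeMap<Long, Long> partial : partials) {
            // Sum doc counts of buckets that share the same key.
            partial.forEach((key, docCount) -> merged.merge(key, docCount, Long::sum));
        }
        if (isFinalReduce && minDocCount == 0 && !merged.isEmpty()) {
            long lastKey = merged.lastKey();
            for (long key = merged.firstKey(); key < lastKey; key += interval) {
                merged.putIfAbsent(key, 0L); // add an empty bucket for each missing key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> shard1 = new TreeMap<>();
        shard1.put(0L, 2L);
        shard1.put(30L, 1L);
        TreeMap<Long, Long> shard2 = new TreeMap<>();
        shard2.put(0L, 1L);

        // Incremental (non-final) step: no buckets are added.
        TreeMap<Long, Long> partial = reduce(List.of(shard1, shard2), 10, 0, false);
        System.out.println(partial); // {0=3, 30=1}

        // Final step: the gaps at 10 and 20 are filled.
        System.out.println(reduce(List.of(partial), 10, 0, true)); // {0=3, 10=0, 20=0, 30=1}
    }
}
```

In this sketch the partial result never grows beyond the buckets the shards actually produced, which is the property that would make a bucket limit unnecessary in non-final steps.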

elasticmachine · 2018-11-26T21:36:39Z

Pinging @elastic/es-analytics-geo

polyfractal · 2018-11-27T15:57:57Z

.../main/java/org/elasticsearch/search/aggregations/bucket/histogram/InternalDateHistogram.java

+            if (minDocCount == 0) {
+                addEmptyBuckets(reducedBuckets, reduceContext);
+            }
+            if (InternalOrder.isKeyDesc(order)) {


It took me a few minutes to convince myself that this works. :)

At first glance the behavior changes because sorting is now done only on the final reduction (instead of on every reduction), which I thought might break reduceBuckets(), as that relies on consistent ordering. But histos/date_histos sort their shard results by key:asc, so this isn't actually a problem.

Could we add a test to DateHistogramAggregatorTests/HistogramAggregatorTests that randomly chooses a sort order + min_doc_count: 0 to ensure this internal contract never changes in the future? It looks like none of those tests set an order so we might not notice otherwise.
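To make the ordering argument concrete, here is a minimal sketch, assuming every shard returns its buckets sorted by key ascending. It merges the shard lists with a priority queue and applies a descending order only at the very end. The class, the `long[]{key, docCount}` representation and the method names are hypothetical and are not taken from `InternalHistogram`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge key-ascending shard results, order only in the final step.
final class SortedMergeSketch {

    /** Merges lists of {key, docCount} that are each sorted by key ascending. */
    static List<long[]> mergeAscending(List<List<long[]>> shardResults) {
        // Queue entries are {key, docCount, listIndex, positionInList}, ordered by key.
        PriorityQueue<long[]> queue = new PriorityQueue<>((x, y) -> Long.compare(x[0], y[0]));
        for (int i = 0; i < shardResults.size(); i++) {
            List<long[]> list = shardResults.get(i);
            if (!list.isEmpty()) {
                queue.add(new long[] {list.get(0)[0], list.get(0)[1], i, 0});
            }
        }
        List<long[]> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            long[] top = queue.poll();
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[0] == top[0]) {
                last[1] += top[1]; // same key as the previous bucket: sum doc counts
            } else {
                merged.add(new long[] {top[0], top[1]});
            }
            // Advance within the list the polled entry came from.
            List<long[]> source = shardResults.get((int) top[2]);
            int next = (int) top[3] + 1;
            if (next < source.size()) {
                queue.add(new long[] {source.get(next)[0], source.get(next)[1], top[2], next});
            }
        }
        return merged;
    }

    /** Applies the requested key order once, after all merging is done. */
    static List<long[]> finalOrder(List<long[]> merged, boolean keyDescending) {
        List<long[]> result = new ArrayList<>(merged);
        if (keyDescending) {
            Collections.reverse(result);
        }
        return result;
    }
}
```

The merge is only correct while its inputs stay key-ascending, which is why a test that randomizes the order together with `min_doc_count: 0` would guard the contract.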

javanna · Nov 29, 2018 (edited)

I don't think I changed anything around sorting. That `if` was hard to read, so I rewrote it, but the only actual change is that empty buckets are now added only in the final reduction phase. I am working on tests and will add what you suggest. I had other additions in mind as well, but I hit some roadblocks (see #36004): basically, empty buckets were not tested in our unit tests.

polyfractal · Nov 29, 2018

We chatted about this in Slack, and I'm just bad at boolean logic. All good here :)

javanna · Nov 30, 2018

I added a check to the base class which verifies that no buckets are added in non-final reduce phases. Now that we test adding empty buckets for histogram and date histogram aggs, I think this check makes sense. These are the only aggs that add buckets as part of reduce, right? I wonder whether I need to check for missing test coverage elsewhere.

Also, I worked a bit on increasing test coverage in DateHistogramAggregatorTests like you suggested, and I think I prefer doing it in a follow-up if still necessary.
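The kind of check described above could look roughly like the following. This is a generic sketch, not the code added to `InternalAggregationTestCase`: the helper name, its parameters and the use of functional interfaces are assumptions for illustration.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.ToIntFunction;

// Hypothetical test helper: a non-final reduce must never produce more buckets than it was given.
final class NonFinalReduceCheckSketch {

    static <A> A reduceAndCheck(List<A> inputs,
                                Function<List<A>, A> nonFinalReduce,
                                ToIntFunction<A> bucketCounter) {
        int initialBucketCount = 0;
        for (A input : inputs) {
            initialBucketCount += bucketCounter.applyAsInt(input);
        }
        A reduced = nonFinalReduce.apply(inputs);
        int reducedBucketCount = bucketCounter.applyAsInt(reduced);
        // Merging may collapse buckets with equal keys, so fewer buckets is fine; more is not.
        if (reducedBucketCount > initialBucketCount) {
            throw new AssertionError("non-final reduce added buckets: "
                + initialBucketCount + " -> " + reducedBucketCount);
        }
        return reduced;
    }
}
```

The comparison is `>` rather than `!=` because buckets that share a key across shards are expected to be merged into one.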

In `InternalHistogramTests` we were randomizing different values, but `minDocCount` was hardcoded to `1`. It's important to test other values, especially `0`, as it's the default. To make this possible, the test needed some adapting in the way buckets are randomly generated: all aggs need to share the same `interval`, `minDocCount` and `emptyBucketInfo`. Assertions also need to take into account that more (or fewer) buckets are expected depending on `minDocCount`. This originated from #35921 and its need to test adding empty buckets as part of the reduce phase. It also relates to #26856, as one more key comparison needed to use `Double.compare` to properly handle `NaN` values; this was triggered by the increased test coverage.
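For readers unfamiliar with the `NaN` issue, the snippet below shows why a key comparison written with `==` differs from one written with `Double.compare`. The class and method names are made up for the example; only the behavior of `==` and `Double.compare` is standard Java.

```java
// Hypothetical helper names; the comparison semantics are standard Java.
final class NaNKeyCompareSketch {

    static boolean sameKeyWithOperator(double a, double b) {
        return a == b; // NaN is never equal to anything, including itself
    }

    static boolean sameKeyWithCompare(double a, double b) {
        // Double.compare treats NaN as equal to itself and greater than any other value.
        // Note that it also orders -0.0 before 0.0, unlike the == operator.
        return Double.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameKeyWithOperator(Double.NaN, Double.NaN)); // false
        System.out.println(sameKeyWithCompare(Double.NaN, Double.NaN));  // true
    }
}
```

With the operator form, two buckets keyed by `NaN` would never be recognized as the same bucket during a reduce.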


jimczi

LGTM, thanks @javanna

jimczi · 2018-11-30T19:27:34Z

test/framework/src/main/java/org/elasticsearch/test/InternalAggregationTestCase.java

+                initialBucketCount += countInnerBucket(internalAggregation);
+            }
+            int reducedBucketCount = countInnerBucket(reduced);
+            //check that non final reduction never adds buckets


Given that we check the max buckets limit on each shard when collecting the buckets, and that non-final reduction cannot add buckets (see #35921), there is no point in counting and checking the number of buckets as part of non-final reduction phases. The check is still needed in the final reduction phase, though, to make sure that the number of returned buckets does not exceed the allowed threshold. This is loosely related to #32125: we will make use of non-final reduction phases in the CCS alternate execution mode, which increases the chance that this check trips for nothing when reducing aggs in each remote cluster.
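A minimal sketch of that policy, assuming a hypothetical `checkBuckets` helper and a plain `IllegalStateException` in place of whatever exception and consumer the real code uses:

```java
// Hypothetical sketch: enforce the bucket limit only when the reduction is final.
final class MaxBucketsSketch {

    static void checkBuckets(int bucketCount, int maxBuckets, boolean isFinalReduce) {
        if (!isFinalReduce) {
            return; // shards already enforced the limit and non-final reduces add no buckets
        }
        if (bucketCount > maxBuckets) {
            throw new IllegalStateException("too many buckets: " + bucketCount
                + " exceeds the limit of " + maxBuckets);
        }
    }

    public static void main(String[] args) {
        checkBuckets(50, 20, false); // ignored: intermediate step, e.g. on a remote cluster
        checkBuckets(20, 20, true);  // passes: exactly at the limit
        try {
            checkBuckets(50, 20, true);
        } catch (IllegalStateException e) {
            System.out.println("rejected in final reduce");
        }
    }
}
```

The numbers are arbitrary and chosen only to show the three cases.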


javanna added >enhancement :Analytics/Aggregations Aggregations v7.0.0 v6.6.0 labels Nov 26, 2018

polyfractal reviewed Nov 27, 2018

javanna mentioned this pull request Nov 28, 2018

Increase InternalHistogramTests coverage #36004

Merged

javanna and others added 2 commits November 30, 2018 14:18

Add empty buckets only in the final reduce step

5f9b40d

Empty buckets don't need to be added when performing an incremental reduction step, they can be added later in the final reduction step. This should allow us to later remove the max buckets limit when performing non final reduction.

add check that no aggs add empty buckets in their non final reduce phase

9f55e45

javanna force-pushed the enhancement/histo_empty_buckets_final_reduce branch from fe11866 to 9f55e45 Compare November 30, 2018 16:21

javanna requested a review from jimczi November 30, 2018 19:18

jimczi approved these changes Nov 30, 2018

javanna merged commit 0ebc177 into elastic:master Nov 30, 2018

javanna added the backport pending label Nov 30, 2018

javanna mentioned this pull request Dec 3, 2018

Enforce max_buckets limit only in the final reduction phase #36152

Merged

javanna removed the backport pending label Dec 3, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

javanna commented Nov 26, 2018

elasticmachine commented Nov 26, 2018

polyfractal Nov 27, 2018

javanna Nov 29, 2018 • edited

polyfractal Nov 29, 2018

javanna Nov 30, 2018

jimczi left a comment

jimczi Nov 30, 2018
