Merge histograms without losing precision #93704
Conversation
For unmapped histogram fields we instantiate, by default, a DoubleHistogram with 3 significant digits of precision. The test generates a random value for the number of significant digits in the range [0, 5]. As a result, if the test runs with 4 or 5 significant value digits while the HdrHistogram sketch only uses 3, checking errors on results will fail. Here we change the maximum value for the significant value digits to 3 if the query involves an index with unmapped fields.
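To make the mismatch concrete, here is a small standalone sketch (not code from this PR; the class name and values are made up) showing how folding a 5-digit HdrHistogram DoubleHistogram into a 3-digit one drops precision:

import org.HdrHistogram.DoubleHistogram;

public class PrecisionMismatchDemo {
    public static void main(String[] args) {
        // Shard result recorded at the precision the test asked for.
        DoubleHistogram fiveDigits = new DoubleHistogram(5);
        for (int i = 0; i < 100_000; i++) {
            fiveDigits.recordValue(1.0 + i * 1e-5);
        }

        // Placeholder for an unmapped/empty shard, hard-coded to 3 digits.
        DoubleHistogram threeDigits = new DoubleHistogram(3);

        // Merging into the 3-digit histogram rounds every value to
        // 3 significant digits, so percentile errors can exceed what a
        // 5-digit sketch guarantees and the test's error checks fail.
        threeDigits.add(fiveDigits);

        System.out.println("p99 at 5 digits: " + fiveDigits.getValueAtPercentile(99.0));
        System.out.println("p99 at 3 digits: " + threeDigits.getValueAtPercentile(99.0));
    }
}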
Hi @salvatore-campagna, I've created a changelog YAML for you.
Pinging @elastic/es-analytics-geo (Team:Analytics)
I think I can try to just instantiate the DoubleHistogram with the correct number of significant value digits.
@@ -55,6 +55,6 @@ public double metric(String name, long bucketOrd) {

     @Override
     public InternalAggregation buildEmptyAggregation() {
-        return new InternalHDRPercentiles(name, keys, null, keyed, format, metadata());
+        return new InternalHDRPercentiles(name, keys, new DoubleHistogram(numberOfSignificantValueDigits), keyed, format, metadata());
Note that in TDigestPercentilesAggregator#buildEmptyAggregation we do something similar:
return new InternalTDigestPercentiles(name, keys, new TDigestState(compression), keyed, formatter, metadata());
Creating a DoubleHistogram with a high numberOfSignificantValueDigits consumes a lot of memory. Can we avoid doing this here? Also, serialising it would be non-trivial.
Can we maybe create the right DoubleHistogram instance in the AbstractInternalHDRPercentiles#reduce(...) method? Right now it always uses an EMPTY_HISTOGRAM in case of an empty bucket. Maybe we create another instance if numberOfSignificantValueDigits != 3?
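A rough sketch of that suggestion (illustrative only, not the actual Elasticsearch reduce code; the helper below is hypothetical): allocate the merged histogram lazily, at the precision of the first non-empty input, and skip empty buckets entirely.

import org.HdrHistogram.DoubleHistogram;
import java.util.List;

class ReduceSketch {
    static DoubleHistogram reduce(List<DoubleHistogram> states) {
        DoubleHistogram merged = null;
        for (DoubleHistogram state : states) {
            if (state == null || state.getTotalCount() == 0) {
                continue; // nothing useful in an empty bucket
            }
            if (merged == null) {
                // allocate lazily, at the precision actually used by the data
                merged = new DoubleHistogram(state.getNumberOfSignificantValueDigits());
            }
            merged.add(state);
        }
        return merged; // may be null if every input was empty
    }
}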
I think the last change to this code made the test fail.
I left a comment.
Normally we would create a histogram with the correct number of digits, but then for empty histograms we would end up serializing and deserializing large arrays for empty aggregations. Here we have a kind of workaround where we use just 3 digits for empty histograms and, at reduce time, we always merge using the larger number of digits among all histograms.
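A simplified sketch of that merge rule (not the exact code in the PR), assuming plain HdrHistogram DoubleHistogram instances:

import org.HdrHistogram.DoubleHistogram;

class MergeSketch {
    // Always fold the lower-precision histogram into the higher-precision
    // one, so a 3-digit (or even 0-digit) empty placeholder can never
    // downgrade the precision of the final result.
    static DoubleHistogram merge(DoubleHistogram left, DoubleHistogram right) {
        if (left.getNumberOfSignificantValueDigits() >= right.getNumberOfSignificantValueDigits()) {
            left.add(right);
            return left;
        }
        right.add(left);
        return right;
    }
}

With this rule the outcome no longer depends on the order in which shard results arrive, which is exactly what made the test flaky.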
I wonder if we should do the same for TDigest too.
When an empty aggregation uses a TDigest object, the underlying arrays used by the AVLTree TDigest implementation are eagerly allocated. If the aggregation produces no result, we serialize and deserialize an array which might be large if the compression value is large (about 5 * compression centroids are tracked). Here we use a null value for empty aggregations while building the result and, later on, use a static empty TDigest object at reduce time and merge it with the non-empty results.
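Roughly, in terms of the underlying t-digest library (the actual code uses Elasticsearch's internal TDigestState wrapper, so this is only an illustration of the idea; the compression value 100.0 is arbitrary):

import com.tdunning.math.stats.AVLTreeDigest;

class EmptyTDigestSketch {
    // One shared placeholder substituted at reduce time for shard results
    // that were serialized as null, instead of eagerly allocating
    // ~5 * compression centroids per empty aggregation.
    static final AVLTreeDigest EMPTY = new AVLTreeDigest(100.0);

    static AVLTreeDigest reduce(AVLTreeDigest left, AVLTreeDigest right) {
        AVLTreeDigest a = (left == null) ? EMPTY : left;
        AVLTreeDigest b = (right == null) ? EMPTY : right;
        if (a.size() == 0) {
            return b; // never mutate the shared empty digest
        }
        a.add(b); // fold b's centroids into a (a no-op if b is empty)
        return a;
    }
}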
This change will end up in another PR focused on improving memory usage for TDigest percentiles aggregations.
In another PR I will address the issue of using a "zero" histogram object also for TDigest, similar to what we do with HDR.
LGTM
For empty histogram aggregations we instantiate by default a DoubleHistogram with 3 significant digits of precision. The test generates a random value for the number of significant digits in the range [0, 5]. As a result, if the test runs with 4 or 5 significant value digits but the HdrHistogram sketch only uses 3, checking errors on results fails since all computations are done with lower than expected precision.

The issue happens at reduction time in AbstractInternalHDRPercentiles when merging histograms coming from different shards. If a shard returns no data, the sketch is empty but uses 3 significant digits, while for non-empty results the correct number of digits is used. Depending on the order in which sketches are merged it might happen, for instance, that we merge a sketch using 4 or 5 significant digits into one using 3 significant digits (used for the empty result). The result will then use whatever precision is used by the first "merged" object created. This sometimes leads to a correct result and sometimes not.

Here, when merging histograms, we always use the one with the higher value of numberOfSignificantValueDigits, so as to avoid reducing the precision of the result. Note that, as a result of this merging strategy, we can even use just 0 digits of precision for empty results and save on some serialization/deserialization for empty histograms.

Resolves #92822