
Merge histograms without losing precision #93704

Conversation

@salvatore-campagna (Contributor) commented Feb 10, 2023:

For empty histogram aggregations we instantiate by default a DoubleHistogram
with 3 significant digits of precision. The test generates a random value for
the number of significant digits in the range [0, 5]. As a result, if the
test runs with 4 or 5 significant value digits but the HdrHistogram sketch only
uses 3, checking errors on the results fails, since all computations are done
with lower-than-expected precision.
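
To make the mismatch concrete, here is a minimal, standalone illustration using the HdrHistogram library directly (the values and the printed percentile are arbitrary; this is not the test code):

import org.HdrHistogram.DoubleHistogram;

public class PrecisionMismatch {
    public static void main(String[] args) {
        DoubleHistogram coarse = new DoubleHistogram(3); // default used for empty results
        DoubleHistogram fine = new DoubleHistogram(5);   // precision the test may pick
        for (double v : new double[] { 1.23456, 2.34567, 3.45678 }) {
            coarse.recordValue(v);
            fine.recordValue(v);
        }
        // 3 significant digits bounds the relative error at roughly 0.1% per value,
        // while 5 digits bounds it at roughly 0.001%, so an error check calibrated
        // for 5 digits can fail against the 3-digit sketch.
        System.out.println(coarse.getValueAtPercentile(50.0));
        System.out.println(fine.getValueAtPercentile(50.0));
    }
}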

The issue happens at reduction time in AbstractInternalHDRPercentiles when
merging histograms coming from different shards. If a shard returns no data,
the sketch is empty but uses 3 significant digits, while for non-empty results
the correct number of digits is used. Depending on the order in which sketches
are merged, it might happen, for instance, that we merge a sketch using 4 or 5
significant digits into one using 3 significant digits (the one used for the
empty result). The result will then use whatever precision the first "merged"
object was created with, which sometimes produces correct results and sometimes
does not.

Here, when merging histograms, we always merge into the one with the higher
numberOfSignificantValueDigits, so as to avoid reducing the precision of the result.

Note that, as a result of this merging strategy, we could even use 0 digits of
precision for empty results and save on some serialization/deserialization
for empty histograms.
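
A minimal sketch of this merge rule, using the HdrHistogram library's public API directly (the actual change lives in AbstractInternalHDRPercentiles#reduce; the class and helper names here are illustrative):

import org.HdrHistogram.DoubleHistogram;

class MergePrecisionSketch {
    // Merge two sketches while keeping the higher of the two precisions,
    // so an empty low-precision histogram never downgrades the result.
    static DoubleHistogram mergeKeepingPrecision(DoubleHistogram a, DoubleHistogram b) {
        DoubleHistogram target =
            a.getNumberOfSignificantValueDigits() >= b.getNumberOfSignificantValueDigits() ? a : b;
        DoubleHistogram source = target == a ? b : a;
        target.add(source); // values from the lower-precision sketch are re-bucketed
        return target;
    }
}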

Resolves #92822

For unmapped histogram fields we instantiate by default a DoubleHistogram
with 3 significant digits precision. The test generates a random value
for the number of significant digits that is in the range [0, 5].
As a result, if the test runs with 4 or 5 significant value digits but
the HdrHistogram sketch only uses 3, checking errors on results will fail.

Here we change the maximum value for the significant value digits to 3
if the query involves an index with unmapped fields.
@salvatore-campagna salvatore-campagna self-assigned this Feb 10, 2023
@elasticsearchmachine added the needs:triage and v8.8.0 labels Feb 10, 2023
@elasticsearchmachine added the Team:Analytics label Feb 10, 2023
@elasticsearchmachine (Collaborator) commented:

Hi @salvatore-campagna, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine removed the needs:triage label Feb 10, 2023
@salvatore-campagna (Contributor, Author) commented:

I think I can try to just instantiate the DoubleHistogram using the correct/non-default number of significant digits.

@salvatore-campagna salvatore-campagna changed the title Use max value 3 for numberOfSignificantValueDigits on unmapped index User correct value for numberOfSignificantValueDigits on empty results Feb 10, 2023
@@ -55,6 +55,6 @@ public double metric(String name, long bucketOrd) {

 @Override
 public InternalAggregation buildEmptyAggregation() {
-    return new InternalHDRPercentiles(name, keys, null, keyed, format, metadata());
+    return new InternalHDRPercentiles(name, keys, new DoubleHistogram(numberOfSignificantValueDigits), keyed, format, metadata());
@salvatore-campagna (Contributor, Author) commented:

Note that in TDigestPercentilesAggregator#buildEmptyAggregation we do something similar

return new InternalTDigestPercentiles(name, keys, new TDigestState(compression), keyed, formatter, metadata());

@martijnvg (Member) commented:


Creating a DoubleHistogram with a high numberOfSignificantValueDigits consumes a lot of memory. Can we avoid doing this here? Also, serialising it would be non-trivial.

Can we maybe create the right DoubleHistogram instance in the AbstractInternalHDRPercentiles#reduce(...) method? Right now it always uses an EMPTY_HISTOGRAM in case of an empty bucket. Maybe we create another instance if numberOfSignificantValueDigits != 3?
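
A hedged sketch of what that suggestion might look like (EMPTY_HISTOGRAM follows the comment above; the helper itself is illustrative, not the code that was merged):

private static final DoubleHistogram EMPTY_HISTOGRAM = new DoubleHistogram(3);

// Reuse the shared 3-digit instance in the common case; only allocate a new
// (potentially much larger) histogram when the configured precision differs.
private static DoubleHistogram emptyHistogramFor(int numberOfSignificantValueDigits) {
    return numberOfSignificantValueDigits == 3
        ? EMPTY_HISTOGRAM
        : new DoubleHistogram(numberOfSignificantValueDigits);
}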

@salvatore-campagna salvatore-campagna changed the title User correct value for numberOfSignificantValueDigits on empty results Use correct value for numberOfSignificantValueDigits on empty results Feb 10, 2023
@martijnvg (Member) left a review:

I think the last change to this code made the test fail. I left a comment.


Normally we would create a histogram with the correct number of digits,
but for empty histograms we would end up serializing and deserializing large
arrays for empty aggregations. Here we have a kind of workaround where we
use just 3 digits for empty histograms and, at reduce time, we always
merge using the larger number of digits among all histograms.
@salvatorecampagna commented:

I wonder if we should do the same for tdigest too.

@salvatore-campagna salvatore-campagna changed the title Use correct value for numberOfSignificantValueDigits on empty results Merging histograms without precision loss Feb 10, 2023
@salvatore-campagna salvatore-campagna changed the title Merging histograms without precision loss Merging histograms without losing precision Feb 10, 2023
@salvatore-campagna salvatore-campagna changed the title Merging histograms without losing precision Merge histograms without losing precision Feb 10, 2023
When an empty aggregation uses a TDigest object, the underlying
arrays used by the AVL-tree TDigest implementation are eagerly
allocated. If the aggregation produces no result, we serialize and
deserialize an array which might be large if the compression value
is large (about 5 * compression centroids are tracked). Here we
use a null value for empty aggregations while building the result,
and later, at reduce time, use a static empty TDigest object and
merge it with the non-empty results.
This change will end up in another PR focused on improving
memory usage for TDigest percentiles aggregations.
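
A hypothetical sketch of that approach, reusing the names from the TDigestPercentilesAggregator snippet quoted earlier (the reduce-time substitution of the static empty digest is described above, not shown here):

@Override
public InternalAggregation buildEmptyAggregation() {
    // Pass null instead of new TDigestState(compression): an empty AVL-tree
    // digest would otherwise eagerly allocate (and serialize) arrays sized
    // for about 5 * compression centroids.
    return new InternalTDigestPercentiles(name, keys, null, keyed, formatter, metadata());
}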
@salvatore-campagna (Contributor, Author) commented:

In another PR I will address the issue of using a "zero" histogram object also for TDigest, similar to what we do with HDR.

@martijnvg (Member) left a review:

LGTM

@salvatore-campagna salvatore-campagna merged commit bc7ac10 into elastic:main Feb 13, 2023
salvatore-campagna added a commit to salvatore-campagna/elasticsearch that referenced this pull request Feb 16, 2023
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this pull request Apr 10, 2023
Labels: :Analytics/Aggregations, >bug, Team:Analytics, v8.8.0
Projects: None yet

Development: Successfully merging this pull request may close these issues: Failure in HDRPercentilesIT.testScriptSingleValued
4 participants