Merge histograms without losing precision #93704
Conversation
For unmapped histogram fields we instantiate, by default, a DoubleHistogram with 3 significant digits of precision. The test generates a random value for the number of significant digits in the range [0, 5]. As a result, if the test runs with 4 or 5 significant value digits while the HdrHistogram sketch only uses 3, checking errors on results will fail. Here we change the maximum value for the significant value digits to 3 if the query involves an index with unmapped fields.
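To make the mismatch concrete, here is a small standalone sketch (not code from this PR; the class name and values are made up) showing how folding a 5-digit HdrHistogram DoubleHistogram into a 3-digit one drops precision:

import org.HdrHistogram.DoubleHistogram;

public class PrecisionMismatchDemo {
    public static void main(String[] args) {
        // Shard result recorded at the precision the test asked for.
        DoubleHistogram fiveDigits = new DoubleHistogram(5);
        for (int i = 0; i < 100_000; i++) {
            fiveDigits.recordValue(1.0 + i * 1e-5);
        }

        // Placeholder for an unmapped/empty shard, hard-coded to 3 digits.
        DoubleHistogram threeDigits = new DoubleHistogram(3);

        // Merging into the 3-digit histogram rounds every value to
        // 3 significant digits, so percentile errors can exceed what a
        // 5-digit sketch guarantees and the test's error checks fail.
        threeDigits.add(fiveDigits);

        System.out.println("p99 at 5 digits: " + fiveDigits.getValueAtPercentile(99.0));
        System.out.println("p99 at 3 digits: " + threeDigits.getValueAtPercentile(99.0));
    }
}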
Hi @salvatore-campagna, I've created a changelog YAML for you.
Pinging @elastic/es-analytics-geo (Team:Analytics)
I think I can try to just instantiate the DoubleHistogram with the correct number of significant value digits.
@@ -55,6 +55,6 @@ public double metric(String name, long bucketOrd) {

     @Override
     public InternalAggregation buildEmptyAggregation() {
-        return new InternalHDRPercentiles(name, keys, null, keyed, format, metadata());
+        return new InternalHDRPercentiles(name, keys, new DoubleHistogram(numberOfSignificantValueDigits), keyed, format, metadata());
Note that in TDigestPercentilesAggregator#buildEmptyAggregation we do something similar:
return new InternalTDigestPercentiles(name, keys, new TDigestState(compression), keyed, formatter, metadata());
Creating a DoubleHistogram with a high numberOfSignificantValueDigits consumes a lot of memory. Can we avoid doing this here? Also, serialising it would be non-trivial.
Can we maybe create the right DoubleHistogram instance in the AbstractInternalHDRPercentiles#reduce(...) method? Right now it always uses an EMPTY_HISTOGRAM in case of an empty bucket. Maybe we create another instance if numberOfSignificantValueDigits != 3?
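A rough sketch of that suggestion (illustrative only, not the actual Elasticsearch reduce code; the helper below is hypothetical): allocate the merged histogram lazily, at the precision of the first non-empty input, and skip empty buckets entirely.

import org.HdrHistogram.DoubleHistogram;
import java.util.List;

class ReduceSketch {
    static DoubleHistogram reduce(List<DoubleHistogram> states) {
        DoubleHistogram merged = null;
        for (DoubleHistogram state : states) {
            if (state == null || state.getTotalCount() == 0) {
                continue; // nothing useful in an empty bucket
            }
            if (merged == null) {
                // allocate lazily, at the precision actually used by the data
                merged = new DoubleHistogram(state.getNumberOfSignificantValueDigits());
            }
            merged.add(state);
        }
        return merged; // may be null if every input was empty
    }
}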
I think the last change to this code made the test fail.
I left a comment.
Normally we would create a histogram with the correct number of digits, but then for empty histograms we would end up serializing and deserializing large arrays for empty aggregations. Here we have a kind of workaround where we use just 3 digits for empty histograms and, at reduce time, we always merge using the larger number of digits among all histograms.
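A simplified sketch of that merge rule (not the exact code in the PR), assuming plain HdrHistogram DoubleHistogram instances:

import org.HdrHistogram.DoubleHistogram;

class MergeSketch {
    // Always fold the lower-precision histogram into the higher-precision
    // one, so a 3-digit (or even 0-digit) empty placeholder can never
    // downgrade the precision of the final result.
    static DoubleHistogram merge(DoubleHistogram left, DoubleHistogram right) {
        if (left.getNumberOfSignificantValueDigits() >= right.getNumberOfSignificantValueDigits()) {
            left.add(right);
            return left;
        }
        right.add(left);
        return right;
    }
}

With this rule the outcome no longer depends on the order in which shard results arrive, which is exactly what made the test flaky.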
I wonder if we should do the same for TDigest too.
When an empty aggregation uses a TDigest object, the underlying arrays used by the AVLTree TDigest implementation are eagerly allocated. If the aggregation produces no result, we serialize and deserialize an array which might be large if the compression value is large (about 5 * compression centroids are tracked). Here we use a null value for empty aggregations while building the result and, later on, use a static empty TDigest object at reduce time and merge it with the non-empty results.
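Roughly, in terms of the underlying t-digest library (the actual code uses Elasticsearch's internal TDigestState wrapper, so this is only an illustration of the idea; the compression value 100.0 is arbitrary):

import com.tdunning.math.stats.AVLTreeDigest;

class EmptyTDigestSketch {
    // One shared placeholder substituted at reduce time for shard results
    // that were serialized as null, instead of eagerly allocating
    // ~5 * compression centroids per empty aggregation.
    static final AVLTreeDigest EMPTY = new AVLTreeDigest(100.0);

    static AVLTreeDigest reduce(AVLTreeDigest left, AVLTreeDigest right) {
        AVLTreeDigest a = (left == null) ? EMPTY : left;
        AVLTreeDigest b = (right == null) ? EMPTY : right;
        if (a.size() == 0) {
            return b; // never mutate the shared empty digest
        }
        a.add(b); // fold b's centroids into a (a no-op if b is empty)
        return a;
    }
}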
This change will end up in another PR focused on improving memory usage for TDigest percentiles aggregations.
In another PR I will address the issue of using a "zero" histogram object also for TDigest, similar to what we do with HDR.
LGTM
For empty histogram aggregations we instantiate by default a DoubleHistogram with 3 significant digits of precision. The test generates a random value for the number of significant digits in the range [0, 5]. As a result, if the test runs with 4 or 5 significant value digits but the HdrHistogram sketch only uses 3, checking errors on results fails since all computations are done with lower than expected precision.

The issue happens at reduction time in AbstractInternalHDRPercentiles when merging histograms coming from different shards. If a shard returns no data, the sketch is empty but uses 3 significant digits, while for non-empty results the correct number of digits is used. Depending on the order in which sketches are merged it might happen, for instance, that we merge a sketch using 4 or 5 significant digits into one using 3 significant digits (used for the empty result). The result will then use whatever precision is used by the first "merged" object created. This sometimes leads to a correct result and sometimes not.

Here, when merging histograms, we always use the one with the higher value of numberOfSignificantValueDigits, so as to avoid reducing the precision of the result. Note that, as a result of this merging strategy, we can even use just 0 digits of precision for empty results and save on some serialization/deserialization for empty histograms.

Resolves #92822