@@ -70,6 +70,45 @@ public static double getQuantile(ExponentialHistogram histo, double quantile) {
return removeNegativeZero(result);
}

/**
* Estimates the rank of a given value in the distribution represented by the histogram.
* In other words, returns the number of values which are less than (or less than or equal to, if {@code inclusive} is true)
* the provided value.
*
* @param histo the histogram to query
* @param value the value to estimate the rank for
* @param inclusive if true, counts values equal to the given value as well
* @return the number of elements less than (or less than or equal to, if {@code inclusive} is true) the given value
*/
public static long estimateRank(ExponentialHistogram histo, double value, boolean inclusive) {
if (value >= 0) {
long rank = histo.negativeBuckets().valueCount();
if (value > 0 || inclusive) {
rank += histo.zeroBucket().count();
}
rank += estimateRank(histo.positiveBuckets().iterator(), value, inclusive, histo.max());
return rank;
} else {
long numValuesGreater = estimateRank(histo.negativeBuckets().iterator(), -value, inclusive == false, -histo.min());
return histo.negativeBuckets().valueCount() - numValuesGreater;
}
}

private static long estimateRank(BucketIterator buckets, double value, boolean inclusive, double maxValue) {
long rank = 0;
while (buckets.hasNext()) {
double bucketMidpoint = ExponentialScaleUtils.getPointOfLeastRelativeError(buckets.peekIndex(), buckets.scale());
bucketMidpoint = Math.min(bucketMidpoint, maxValue);
if (bucketMidpoint < value || (inclusive && bucketMidpoint == value)) {
rank += buckets.peekCount();
buckets.advance();
Comment on lines +100 to +104
Member:

For buckets where the value is between the lower and the upper boundary, I'm wondering whether we should add a count proportional to where the value falls into the bucket. It seems like that could increase the accuracy of the estimate.

In other words, we'd increment the rank by (value - lowerBound) / (upperBound - lowerBound) * count. We can have an optimized special case for value > upperBound where we increment by count.
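
A rough sketch of how that proposal could look, assuming hypothetical getLowerBucketBoundary/getUpperBucketBoundary helpers for the bucket boundaries (placeholders, not the real ExponentialScaleUtils API), just to illustrate the idea:

// Hypothetical interpolating variant of the rank loop (illustration only).
private static long estimateRankInterpolating(BucketIterator buckets, double value) {
    long rank = 0;
    while (buckets.hasNext()) {
        double lowerBound = getLowerBucketBoundary(buckets.peekIndex(), buckets.scale());
        double upperBound = getUpperBucketBoundary(buckets.peekIndex(), buckets.scale());
        if (value >= upperBound) {
            // bucket lies entirely below the value: count it fully
            rank += buckets.peekCount();
        } else if (value > lowerBound) {
            // value falls inside the bucket: count a fraction proportional to its position
            double fraction = (value - lowerBound) / (upperBound - lowerBound);
            rank += Math.round(fraction * buckets.peekCount());
            break; // all remaining buckets lie entirely above the value
        } else {
            break;
        }
        buckets.advance();
    }
    return rank;
}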

Contributor Author:

So the algorithm currently assumes that all values in a bucket lie on the point of least relative error, just like for the percentiles algorithm. This ensures that we minimize the relative error of percentile( rank(someValue) / valueCount), meaning that the returned percentile is as close as possible to someValue.

If we now change the assumption of how values are distributed in a bucket, I think we'd need to do the same for the percentiles algorithm. While this would smooth out the values, it would also increase the worst-case relative error.

Changing this assumption would probably also mean that we should get rid of upscaling in the exponential histogram merging algorithm: the upscaling there happens to make sure that misconfigured SDKs (e.g. a way too low bucket count) don't drag down the accuracy of the overall aggregation.
While the upscaling barely moves the point of least relative error of buckets, it greatly reduces their size.

So with your proposed change this can lead to the weird behaviour where the rank of a given value shifts by a large margin before and after merging histograms.

So I'd propose to stay with the "mathematically most correct" way of assuming that all points in a bucket lie on a single point. In practice buckets should be small enough that this is not really noticeable.

Member:

Good points. Definitely agree that we want to keep the percentile ranks implementation in sync with the percentile implementation. Is there a specific percentiles implementation suggested by OTel that uses midpoints?

Maybe add some commentary on why we're using midpoints rather than interpolation.

Contributor Author:

There is nothing in OTel, but this implementation is what is used in the DDSketch and UDDSketch papers, which provide proofs for the worst-case relative error.

Prometheus does this differently for their native histograms (which are actually exponential histograms):

The worst case is an estimation at one end of a bucket where the actual value is at the other end of the bucket. Therefore, the maximum possible error is the whole width of a bucket. Not doing any interpolation and using some fixed midpoint within a bucket (for example the arithmetic mean or even the harmonic mean) would minimize the maximum possible error (which would then be half of the bucket width in case of the arithmetic mean), but in practice, the linear interpolation yields an error that is lower on average. Since the interpolation has worked well over many years of classic histogram usage, interpolation is also applied for native histograms.

Therefore, PromQL uses exponential extrapolation for the standard schemas, which models the assumption that dividing a bucket into two when increasing the schema number by one (i.e. doubling the resolution) will on average see similar populations in both new buckets. A more detailed explanation can be found in the PR implementing the interpolation method.

(Source)

So in other words, they assume an exponential distribution within the bucket (fewer values on the border towards zero, more on the further-away border). We could adopt that approach, which means we would have to drop the upscaling and make converting explicit-bucket histograms more expensive and inaccurate.
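
For illustration, a minimal sketch of what such an exponential within-bucket interpolation could look like (the helper and its name are made up for this example, not taken from Prometheus' code):

// Fraction of a bucket's population assumed to lie below `value` when the values are
// assumed to be uniformly distributed on a log scale, so that splitting the bucket in
// half when doubling the resolution yields similar populations in both halves.
// Assumes 0 < lowerBound < value <= upperBound.
static double exponentialInterpolationFraction(double value, double lowerBound, double upperBound) {
    return (Math.log(value) - Math.log(lowerBound)) / (Math.log(upperBound) - Math.log(lowerBound));
}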

I also noticed after thinking about it further that what I said above is wrong:

we minimize the relative error of percentile( rank(someValue) / valueCount), meaning that the returned percentile is as close as possible to someValue.
If we now change the assumption of how values are distributed in a bucket, I think we'd need to do the same for the percentiles algorithm.

It doesn't matter if we return the rank of the first or last element within a bucket, the resulting percentile would be the same with our current algorithm.

Member:

I see. I'd say let's leave it as-is for now and add an issue to re-think midpoint vs. interpolation. Should we decide to switch the algorithm, it should be done consistently both for percentile rank and percentile. It's probably also a matter of how strictly we want to be compliant with Prometheus and whether we actually want to convert explicit-bounds histograms to exponential histograms long-term, or whether we want to have a dedicated type for it.

} else {
break;
}
}
return rank;
}

private static double removeNegativeZero(double result) {
return result == 0.0 ? 0.0 : result;
}
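
For reference, a short usage sketch of the new method (the histogram here is built with the createAutoReleasedHistogram helper from the test base class shown below; the returned ranks are estimates, since values are attributed to bucket midpoints):

// Build a histogram over a handful of raw values.
ExponentialHistogram histo = createAutoReleasedHistogram(100, new double[] { -3.0, 0.0, 1.5, 2.0, 2.0, 7.5 });

// Estimated number of values strictly less than 2.0 (the exact answer would be 3: -3.0, 0.0, 1.5).
long exclusiveRank = ExponentialHistogramQuantile.estimateRank(histo, 2.0, false);

// Estimated number of values less than or equal to 2.0 (the exact answer would be 5).
long inclusiveRank = ExponentialHistogramQuantile.estimateRank(histo, 2.0, true);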
@@ -0,0 +1,90 @@
/*
* Copyright Elasticsearch B.V., and/or licensed to Elasticsearch B.V.
* under one or more license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch B.V. licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*
* This file is based on a modification of https://github.com/open-telemetry/opentelemetry-java which is licensed under the Apache 2.0 License.
*/

package org.elasticsearch.exponentialhistogram;

import java.util.Arrays;
import java.util.stream.DoubleStream;

import static org.hamcrest.Matchers.equalTo;

public class RankAccuracyTests extends ExponentialHistogramTestCase {

public void testRandomDistribution() {
int numValues = randomIntBetween(10, 10_000);
double[] values = new double[numValues];

int valuesGenerated = 0;
while (valuesGenerated < values.length) {
double value;
if (randomDouble() < 0.01) { // 1% chance of exact zero
value = 0;
} else {
value = randomDouble() * 2_000_000 - 1_000_000;
}
// Add some duplicates
for (int i = 0; i < randomIntBetween(1, 10) && valuesGenerated < values.length; i++) {
values[valuesGenerated++] = value;
}
}

int numBuckets = randomIntBetween(4, 400);
ExponentialHistogram histo = createAutoReleasedHistogram(numBuckets, values);

Arrays.sort(values);
double min = values[0];
double max = values[values.length - 1];

double[] valuesRoundedToBucketCenters = DoubleStream.of(values).map(value -> {
if (value == 0) {
return 0;
}
long index = ExponentialScaleUtils.computeIndex(value, histo.scale());
double bucketCenter = Math.signum(value) * ExponentialScaleUtils.getPointOfLeastRelativeError(index, histo.scale());
return Math.clamp(bucketCenter, min, max);
}).toArray();

// Test the values at exactly the bucket center for exclusivity correctness
for (double v : valuesRoundedToBucketCenters) {
long inclusiveRank = getRank(v, valuesRoundedToBucketCenters, true);
assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, true), equalTo(inclusiveRank));
long exclusiveRank = getRank(v, valuesRoundedToBucketCenters, false);
assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, false), equalTo(exclusiveRank));
}
// Also test the original values, so that we cover values in between bucket centers
for (double v : values) {
long inclusiveRank = getRank(v, valuesRoundedToBucketCenters, true);
assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, true), equalTo(inclusiveRank));
long exclusiveRank = getRank(v, valuesRoundedToBucketCenters, false);
assertThat(ExponentialHistogramQuantile.estimateRank(histo, v, false), equalTo(exclusiveRank));
}

}

// Brute-force reference implementation: rank of `value` within the sorted array.
private static long getRank(double value, double[] sortedValues, boolean inclusive) {
for (int i = 0; i < sortedValues.length; i++) {
if (sortedValues[i] > value || (inclusive == false && sortedValues[i] == value)) {
return i;
}
}
return sortedValues.length;
}
}