Improvements to the `histogram` field and its interactions with aggregations #74213

benwtrent · 2021-06-16T20:01:42Z

In the documentation of histogram fields we have the following snippet:

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the histogram was constructed. It is important to consider the percentiles aggregation mode that will be used to build it.
...<snip> description of T-Digest and HDRHistogram</snip>
The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.

This is very flexible, but in the long run it may produce worse aggregation results.

An example of this is the range aggregation.

A naive way (and possibly the best way, unsure) to implement a range aggregation over a histogram field is to:

1. Iterate the histogram values
2. Check if the bucket value is in a range
3. Increment the range with that count
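
For illustration, the naive steps above could be sketched as follows. This is a minimal standalone sketch, not the actual aggregator code: the class and method names are made up, and treating a range as half-open (`from` inclusive, `to` exclusive) is an assumption here.

```java
/** Hypothetical sketch of the naive approach: a bucket is either inside the range or not. */
final class NaiveHistogramRange {

    /**
     * Sums the counts of all histogram buckets whose value lies in [from, to).
     * The half-open interval is an assumption, not something the histogram field dictates.
     */
    static long count(double[] values, long[] counts, double from, double to) {
        if (values.length != counts.length) {
            throw new IllegalArgumentException("values and counts must have the same length");
        }
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            // No interpolation: the whole bucket count is taken or skipped.
            if (values[i] >= from && values[i] < to) {
                total += counts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] values = {0.2, 0.4, 0.6, 0.8};
        long[] counts = {4, 3, 5, 10};
        System.out.println(count(values, counts, 0.3, 0.7)); // prints 8 (buckets 0.4 and 0.6)
        System.out.println(count(values, counts, 0.3, 0.4)); // prints 0 (0.4 is excluded)
    }
}
```

Note that a range whose bounds fall between two histogram values gets nothing from the neighbouring buckets, which is the case any interpolation would try to address.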

A different way would be an attempt to rebuild the appropriate statistical distribution from the histogram results. If we knew the histogram was built utilizing the HDR structure, could we implement ranges similarly?

1. Iterate the histogram values
2. Add values to an HDR data structure
3. Look in the HDR structure by seeing the count "between values" for the ranges

The HDR methodology MAY provide better results. I am specifically thinking of the following situation:

histogram: {
values: [0.2, 0.4, 0.6, 0.8],
counts: [4, 3, 5, 10]
}

With a range like:

range: {
ranges: [{from: 0.3, to: 0.4}]
}

Should this range return values? Would an interpolation of values or seeing where the range values would fit in the HDR help?
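
To make the question concrete, one possible interpolation (not HDR-specific) is sketched below. It assumes each histogram value is the upper edge of a bucket that starts at the previous value, and that counts are spread uniformly inside a bucket. Both are assumptions for illustration, as are the class name, the method name, and the `lowerBound` parameter; nothing in the histogram field guarantees either. Under these assumptions the range from 0.3 to 0.4 covers half of the bucket ending at 0.4 and so gets roughly 1.5 of its 3 counts, whereas a strict in-range check with an exclusive upper bound would return 0.

```java
/** Hypothetical sketch: linear interpolation of histogram counts over a range. */
final class InterpolatedHistogramRange {

    /**
     * Estimates how many recorded values fall in [from, to).
     * Assumption: values[i] is the upper edge of bucket i, the bucket starts at
     * values[i - 1] (or at lowerBound for the first bucket), and counts are
     * distributed uniformly within each bucket.
     */
    static double estimate(double[] values, long[] counts, double lowerBound, double from, double to) {
        if (values.length != counts.length) {
            throw new IllegalArgumentException("values and counts must have the same length");
        }
        double total = 0.0;
        double bucketLow = lowerBound;
        for (int i = 0; i < values.length; i++) {
            double bucketHigh = values[i];
            double width = bucketHigh - bucketLow;
            // Length of the intersection between the bucket and the requested range.
            double overlap = Math.min(to, bucketHigh) - Math.max(from, bucketLow);
            if (width > 0 && overlap > 0) {
                total += counts[i] * (overlap / width);
            }
            bucketLow = bucketHigh;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] values = {0.2, 0.4, 0.6, 0.8};
        long[] counts = {4, 3, 5, 10};
        // Roughly half of the 3 counts in the bucket ending at 0.4.
        System.out.println(estimate(values, counts, 0.0, 0.3, 0.4));
    }
}
```

Whether an estimate like this is any closer to the true count than the strict check depends entirely on how the indexed values were actually distributed.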


elasticmachine · 2021-06-16T20:01:44Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

benwtrent · 2021-06-21T20:17:13Z

In an effort to see if this would be worth it (specifically for range aggregations; it may still be valuable for other reasons, like automatically choosing the correct percentiles config), here is some data and tests I have run.
Terminology:

naive: no interpolation of range values. If the histogram's mapped value is in the range, we count its doc_count.
hdr: the aggregation rebuilds an HDR histogram and attempts to interpolate counts for range bounds that do not line up exactly with the histogram values.

Methodology and data

I built multiple "ranges" of random double values. Then I ran multiple test passes of hdr and naive range bucketing over the raw range values and the histogram values. I created one histogram doc for each double "range".

To compare, I checked the difference in document count between the range buckets computed over the raw docs and the range buckets computed over the histogram, for the same ranges.
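
The comparison step could look roughly like this. It is a sketch only: the class and method names are hypothetical, and the counts in `main` are made-up numbers for illustration, not the test data.

```java
/** Hypothetical sketch: per-bucket absolute error between raw and histogram-based range counts. */
final class RangeErrorComparison {

    /** Returns |raw - histogram| for each range bucket. */
    static long[] absoluteErrors(long[] rawCounts, long[] histogramCounts) {
        if (rawCounts.length != histogramCounts.length) {
            throw new IllegalArgumentException("both inputs must cover the same ranges");
        }
        long[] errors = new long[rawCounts.length];
        for (int i = 0; i < rawCounts.length; i++) {
            errors[i] = Math.abs(rawCounts[i] - histogramCounts[i]);
        }
        return errors;
    }

    /** Averages the per-bucket absolute errors; smaller is better. */
    static double meanAbsoluteError(long[] rawCounts, long[] histogramCounts) {
        long[] errors = absoluteErrors(rawCounts, histogramCounts);
        if (errors.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (long error : errors) {
            sum += error;
        }
        return (double) sum / errors.length;
    }

    public static void main(String[] args) {
        long[] raw = {10, 20, 30};       // illustrative counts over raw docs
        long[] fromHisto = {9, 22, 30};  // illustrative counts over the histogram doc
        System.out.println(meanAbsoluteError(raw, fromHisto)); // prints 1.0
    }
}
```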

Here is the resulting data

Here is the gist of the test files

The results suggest there is NO significant difference between the way I am interpolating the histogram values and the naive way. If anything, the hdr interpolation gives WORSE results than the naive implementation in all the test cases (though the difference is small).

This indicates that interpolation is not useful for range aggs over histogram fields.

Let me know if any of this seems off...

Visualization of absolute error for each range bucket for all the test runs. Smaller is better:

The key method for the HDR interpolation is (this is not particularly production ready, I was just trying to put something together to see if interpolation gave us better results):

public InternalAggregation[] buildAggregations(long[] owningBucketOrds) throws IOException {
    InternalAggregation[] results = new InternalAggregation[owningBucketOrds.length];
    for (int owningOrdIdx = 0; owningOrdIdx < owningBucketOrds.length; owningOrdIdx++) {
        List<org.elasticsearch.search.aggregations.bucket.range.Range.Bucket> buckets = new ArrayList<>(ranges.length);
        // NOTE: this always reads the histogram at ordinal 0, which was enough for the experiment;
        // a general version would presumably look up owningBucketOrds[owningOrdIdx] instead.
        DoubleHistogram hdrHisto = hdrHistos.get(0);
        final double min = hdrHisto.getMinValue();
        final double max = hdrHisto.getMaxValue();
        for (Range range : ranges) {
            long count = 0;
            try {
                if (range.getFrom() <= min && range.getTo() >= max) {
                    // The range covers every recorded value.
                    count = hdrHisto.getTotalCount();
                } else if (range.getFrom() > max || range.getTo() < min) {
                    // The range lies entirely outside the recorded values: count stays 0.
                } else if (range.getFrom() != range.getTo()) {
                    // Partial overlap: clamp the range to [min, max]. Empty ranges (from == to) keep count 0.
                    double from = Math.max(range.getFrom(), min);
                    double to = Math.min(range.getTo(), max);
                    // HDR buckets that sit fully inside the clamped range.
                    double fromNext = hdrHisto.highestEquivalentValue(from);
                    double toDown = hdrHisto.lowestEquivalentValue(to);
                    double fullyCapturedBuckets = hdrHisto.getCountBetweenValues(fromNext, toDown);

                    // Fraction of the lower edge bucket that the range covers.
                    double fromSize = hdrHisto.sizeOfEquivalentValueRange(from);
                    double fromIntersection = (fromNext - from) / fromSize;
                    double fromIntersectionCount = Math.max(
                        (hdrHisto.getCountAtValue(from) - hdrHisto.getCountAtValue(fromNext)) * fromIntersection,
                        0.0
                    );

                    // Fraction of the upper edge bucket that the range covers.
                    double toSize = hdrHisto.sizeOfEquivalentValueRange(to);
                    double toIntersection = (to - toDown) / toSize;
                    double toIntersectionCount = Math.max(
                        (hdrHisto.getCountAtValue(to) - hdrHisto.getCountAtValue(toDown)) * toIntersection,
                        0.0
                    );

                    // I am not sure why I continually have fence post errors
                    // Without this, the aggregated value is usually too high :(
                    count = Math.round((fullyCapturedBuckets + fromIntersectionCount + toIntersectionCount)) - 1;
                }
            } catch (ArrayIndexOutOfBoundsException ex) {
                //???
                count = 0L;
            }
            buckets.add(rangeFactory.createBucket(
                range.getKey(),
                range.getFrom(),
                range.getTo(),
                count,
                InternalAggregations.EMPTY, keyed, format)
            );
        }
        results[owningOrdIdx] = rangeFactory.create(name, buckets, format, keyed, metadata());
    }
    return results;
}

Folks who might be interested:

@tveasey @csoulios

wchaparro · 2022-02-22T19:33:27Z

@benwtrent given your analysis (thanks btw) showing no real advantage to interpolation... we could close this one out - or would you like to have a team-discuss on this? thx

benwtrent · 2022-02-22T20:36:39Z

@wchaparro interpolation when it comes to range is probably not useful.

But, something that would be useful is percentiles automatically applying the appropriate settings based on some indexed values.

Right now, to make sure the histogram values return sane results, you have to make sure that:

Your percentiles agg is the right kind (t-digest vs. hdr)
and the internal settings are the same

This is problematic as the USER of the histogram data may not be the same individual/org that set it up and may not know the internals of how it was created.

axw · 2022-02-23T01:04:38Z

++ recording the histogram field params in field metadata, or something along those lines, would be helpful for APM. Histograms might come from some third-party instrumentation which a user won't know the details of.

This would help provide a sensible default for Lens, maybe making the UI selection in elastic/kibana#98499 unnecessary.

benwtrent added the >enhancement and :Analytics/Aggregations labels Jun 16, 2021

elasticmachine added the Team:Analytics label Jun 16, 2021

axw mentioned this issue Jun 17, 2021

New mapping parameters to annotate dimensions and metrics in timeseries data #74014

Closed

martijnvg mentioned this issue May 2, 2024

Add algorithm attribute to histogram field mapper. #108208

Open
