@tveasey tveasey commented Aug 1, 2022

Backport #2367.

…maly detection (elastic#2367)

We use a histogram sketch to estimate data quantiles and to compute the time bucket median for anomaly
detection.

Once data are merged into buckets, we have to make assumptions about how values are distributed within each bucket.
Previously, we assumed data were uniformly distributed on buckets whose endpoints are the midpoints between the
bucket "centres" we track. In fact, the points we track are the weighted means of the values in each bucket, so it is
more accurate to assume that roughly half the data in a bucket falls on either side of this point.

This change keeps the same endpoints but updates the interpolation scheme to incorporate the new assumption. This gives
a useful additional property: a quantile which falls between two buckets each containing a single value is computed
exactly. An immediate corollary is that all quantiles are exact if the data size is less than the sketch size. Previously, we
used piecewise constant interpolation for estimating the median because it is exact in this case. We now switch to
linear interpolation. This is attractive because, for large data with a smooth distribution, linear interpolation is
significantly more accurate than piecewise constant interpolation for fixed memory usage.
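
To make the new assumption concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the code in this change: the function name `sketch_quantile`, its signature, and the choice to clamp the outermost endpoints to the first and last centres are all assumptions made for the example. Half of each bucket's count is placed on either side of its centre, the bucket endpoints are the midpoints between adjacent centres, and the cumulative count is interpolated linearly between those points.

```python
from bisect import bisect_left


def sketch_quantile(centres, counts, q):
    """Estimate the q-quantile from sorted bucket centres and their counts.

    Hypothetical helper for illustration. Assumes `centres` is sorted
    ascending, every count is positive, and 0 <= q <= 1.
    """
    if not centres or len(centres) != len(counts):
        raise ValueError("centres and counts must be non-empty and equal length")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")

    target = q * float(sum(counts))

    # Knots of the piecewise linear cumulative count: for each bucket we add
    # (centre, count before + half its count) and (right endpoint, count
    # through the bucket). The outermost endpoints are clamped to the first
    # and last centres (an assumption of this sketch).
    xs = [centres[0]]
    cs = [0.0]
    cumulative = 0.0
    for i, (centre, count) in enumerate(zip(centres, counts)):
        xs.append(centre)
        cs.append(cumulative + 0.5 * count)
        cumulative += count
        is_last = i + 1 == len(centres)
        right = centre if is_last else 0.5 * (centre + centres[i + 1])
        xs.append(right)
        cs.append(cumulative)

    # Find the first knot whose cumulative count reaches the target and
    # interpolate linearly within that segment.
    j = bisect_left(cs, target)
    if j == 0:
        return xs[0]
    c0, c1 = cs[j - 1], cs[j]
    x0, x1 = xs[j - 1], xs[j]
    if c1 == c0:
        return x1
    t = (target - c0) / (c1 - c0)
    return x0 + t * (x1 - x0)


# Every bucket holds a single value, so the median is recovered exactly.
print(sketch_quantile([1.0, 2.0, 4.0, 8.0], [1, 1, 1, 1], 0.5))  # 3.0
print(sketch_quantile([1.0, 2.0, 4.0], [1, 1, 1], 0.5))          # 2.0
```

When every bucket holds one value, the interpolant in this sketch is a straight line between consecutive values placed at cumulative counts `i - 0.5`, which is why the median comes out exact in both calls. Other quantiles then follow that particular plotting convention, which may or may not match the definition used in the actual implementation.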

Closes elastic#2364.
@tveasey tveasey merged commit 81d425a into elastic:7.17 Aug 3, 2022
@tveasey tveasey deleted the port/2367 branch August 3, 2022 15:37