Conversation
@tveasey tveasey commented Jul 19, 2022

We use a histogram sketch for estimating data quantiles and also computing the time bucket median for anomaly detection.

Once data are merged into buckets we have to make some assumptions about how values are distributed within each bucket. Previously, we assumed data are uniformly distributed over buckets whose endpoints are the midpoints between the bucket "centres" we track. In fact, the points we track are the weighted means of the values in each bucket, so it is more accurate to assume that roughly half the data in a bucket fall on either side of this point.
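As a toy illustration (not the ml-cpp code; the function name is hypothetical), the tracked "centre" of a bucket is the weighted mean of the values merged into it, so merging one more value updates it like this:

```python
# Hypothetical sketch of bucket-centre tracking: the "centre" is the
# weighted mean of the values merged into the bucket.
def merge_value(centre, count, value):
    """Merge one value into a bucket, returning the updated (centre, count)."""
    new_count = count + 1
    new_centre = (centre * count + value) / new_count
    return new_centre, new_count

# Merging 3.0 and 5.0 gives a centre of 4.0 with exactly one value
# (half the bucket's mass) on either side of it.
centre, count = 3.0, 1
centre, count = merge_value(centre, count, 5.0)
```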

This changes the interpolation scheme to use the same endpoints but incorporate this new assumption. This gives a very nice additional property: a quantile which falls between two buckets, each containing a single value, is computed exactly. An immediate corollary is that all quantiles are exact if the data size is less than the sketch size. Previously, we used piecewise constant interpolation for estimating the median because it is exact in this case. We now cut over to linear interpolation. This is attractive because, for large data, if the data distribution is smooth, linear interpolation is significantly more accurate for fixed memory usage than piecewise constant interpolation.
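A minimal sketch of the scheme described above (illustrative Python, not the ml-cpp implementation): the cumulative mass at each bucket centre is the mass to its left plus half the bucket's own mass, and quantiles are linearly interpolated between centres. With one value per bucket this reproduces quantiles exactly, as claimed.

```python
def estimate_quantile(buckets, q):
    """Estimate the q-quantile from a sorted list of (centre, count) pairs.

    Assumes roughly half of each bucket's mass lies on either side of its
    weighted-mean centre, and interpolates linearly between centres.
    """
    total = sum(count for _, count in buckets)
    target = q * total  # target cumulative mass
    # Cumulative mass at each centre: everything to the left plus half the
    # bucket's own mass (the "half either side" assumption).
    cum, running = [], 0.0
    for centre, count in buckets:
        cum.append((running + 0.5 * count, centre))
        running += count
    # Clamp into the first/last bucket centre outside the interpolation range.
    if target <= cum[0][0]:
        return cum[0][1]
    if target >= cum[-1][0]:
        return cum[-1][1]
    # Linearly interpolate between the bracketing centres.
    for (m0, c0), (m1, c1) in zip(cum, cum[1:]):
        if m0 <= target <= m1:
            return c0 + (target - m0) / (m1 - m0) * (c1 - c0)

# With one value per bucket every quantile is exact: the median of
# {1, 2, 3, 4} comes out as 2.5.
median = estimate_quantile([(1.0, 1), (2.0, 1), (3.0, 1), (4.0, 1)], 0.5)
```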

Closes #2364.

@droberts195 droberts195 left a comment

LGTM

@tveasey tveasey merged commit 6523700 into elastic:main Jul 25, 2022
@tveasey tveasey deleted the quantile-estimation branch July 25, 2022 09:26
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Aug 1, 2022
…maly detection (elastic#2367)

Successfully merging this pull request may close these issues.

[ML] Investigate approximate percentile accuracy for anomaly detection