Conversation
@tveasey tveasey commented Jul 19, 2022

We use a histogram sketch for estimating data quantiles and also computing the time bucket median for anomaly detection.

Once data are merged into buckets we have to make some assumptions about how values are distributed within each bucket. Previously, we assumed data are uniformly distributed over buckets whose endpoints are the midpoints between the bucket "centres" we track. In fact, the points we track are the weighted means of the values in each bucket, so it is more accurate to assume that roughly half the data in a bucket fall on either side of this point.
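As a toy illustration (not the ml-cpp code; the function name is hypothetical), the tracked "centre" of a bucket is the weighted mean of the values merged into it, so merging one more value updates it like this:

```python
# Hypothetical sketch of bucket-centre tracking: the "centre" is the
# weighted mean of the values merged into the bucket.
def merge_value(centre, count, value):
    """Merge one value into a bucket, returning the updated (centre, count)."""
    new_count = count + 1
    new_centre = (centre * count + value) / new_count
    return new_centre, new_count

# Merging 3.0 and 5.0 gives a centre of 4.0 with exactly one value
# (half the bucket's mass) on either side of it.
centre, count = 3.0, 1
centre, count = merge_value(centre, count, 5.0)
```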

This changes the interpolation scheme to use the same endpoints but incorporate this new assumption. This gives a very nice additional property: a quantile which falls between two buckets, each containing a single value, is computed exactly. An immediate corollary is that all quantiles are exact if the data size is less than the sketch size. Previously, we used piecewise constant interpolation for estimating the median because it is exact in this case. We now cut over to linear interpolation. This is attractive because, for large data, if the data distribution is smooth, linear interpolation is significantly more accurate for fixed memory usage than piecewise constant interpolation.
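A minimal sketch of the scheme described above (illustrative Python, not the ml-cpp implementation): the cumulative mass at each bucket centre is the mass to its left plus half the bucket's own mass, and quantiles are linearly interpolated between centres. With one value per bucket this reproduces quantiles exactly, as claimed.

```python
def estimate_quantile(buckets, q):
    """Estimate the q-quantile from a sorted list of (centre, count) pairs.

    Assumes roughly half of each bucket's mass lies on either side of its
    weighted-mean centre, and interpolates linearly between centres.
    """
    total = sum(count for _, count in buckets)
    target = q * total  # target cumulative mass
    # Cumulative mass at each centre: everything to the left plus half the
    # bucket's own mass (the "half either side" assumption).
    cum, running = [], 0.0
    for centre, count in buckets:
        cum.append((running + 0.5 * count, centre))
        running += count
    # Clamp into the first/last bucket centre outside the interpolation range.
    if target <= cum[0][0]:
        return cum[0][1]
    if target >= cum[-1][0]:
        return cum[-1][1]
    # Linearly interpolate between the bracketing centres.
    for (m0, c0), (m1, c1) in zip(cum, cum[1:]):
        if m0 <= target <= m1:
            return c0 + (target - m0) / (m1 - m0) * (c1 - c0)

# With one value per bucket every quantile is exact: the median of
# {1, 2, 3, 4} comes out as 2.5.
median = estimate_quantile([(1.0, 1), (2.0, 1), (3.0, 1), (4.0, 1)], 0.5)
```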

Closes #2364.

@droberts195 droberts195 left a comment

LGTM

@tveasey tveasey merged commit 6523700 into elastic:main Jul 25, 2022
@tveasey tveasey deleted the quantile-estimation branch July 25, 2022 09:26
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Aug 1, 2022
…maly detection (elastic#2367)

Successfully merging this pull request may close these issues.

[ML] Investigate approximate percentile accuracy for anomaly detection