Support Cormode-Muthukrishnan percentiles #283

Open
mheffner opened this Issue · 7 comments

5 participants

@mheffner

For timers, provide the option to calculate percentiles using the Cormode-Muthukrishnan algorithm which doesn't require storing all timer samples. This would be more performant for timers that receive lots of samples.

See implementation from statsite: https://github.com/armon/statsite/blob/master/src/cm_quantile.c

@msiebuhr

For the curious, the algorithm is described more formally in the paper Effective Computation of Biased Quantiles over Data Streams (PDF).

It is basically a way to keep accurate track of given percentiles without storing the complete dataset; quoting the introduction:

[...] To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? [...]
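To make the quoted guarantee concrete: for targeted quantiles given as (quantile, error) pairs, the paper defines an invariant bounding the allowed rank error, which shrinks near the targeted high percentiles. The sketch below is my reading of that invariant and should be checked against the paper; the function name and target list are illustrative, not from statsd or statsite.

```javascript
// Targets matching the paper's example: finer error at higher ranks.
const targets = [
  [0.5, 0.05],   // 50th percentile, 5% error
  [0.9, 0.01],   // 90th percentile, 1% error
  [0.99, 0.001], // 99th percentile, 0.1% error
];

// Allowed rank error at rank r out of n observed items (hypothetical
// transcription of the targeted-quantiles invariant from the paper).
function allowedRankError(r, n) {
  let f = Infinity;
  for (const [phi, eps] of targets) {
    const e = r >= phi * n
      ? (2 * eps * r) / phi              // at or above the target rank
      : (2 * eps * (n - r)) / (1 - phi); // below the target rank
    f = Math.min(f, e);
  }
  return f;
}
```

With n = 1000, the allowed slack near the 99th percentile is only a couple of ranks, while near the median it is two orders of magnitude looser, which is exactly the "finer error guarantees at higher ranks" trade the paper describes.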

@msiebuhr

Another option would be to limit the size of the array and simply begin to replace random elements when the maximum size has been reached.

A straightforward implementation will obviously bias towards older samples, but it would be quite simple to do and would fix the problem of storing a lot of samples.
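The unbiased version of this idea is reservoir sampling (Vitter's Algorithm R, which comes up later in this thread): instead of always replacing once the array is full, replace with a probability that decays as more samples are seen, so every sample is equally likely to survive. A minimal sketch (function names are illustrative, not statsd API):

```javascript
// Keep a uniform random sample of at most maxSize values without
// storing every timer sample.
function makeReservoir(maxSize) {
  return { samples: [], seen: 0, maxSize };
}

function reservoirPush(res, value) {
  res.seen += 1;
  if (res.samples.length < res.maxSize) {
    // Still warming up: keep everything.
    res.samples.push(value);
  } else {
    // Replace a random slot with probability maxSize / seen, which
    // keeps every value seen so far equally likely to be in the sample.
    const j = Math.floor(Math.random() * res.seen);
    if (j < res.maxSize) res.samples[j] = value;
  }
}
```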

@lukevenediger

@msiebuhr would you do anything to choose an array size that will give you more accurate results, or doesn't that matter?

@msiebuhr

@lukevenediger I honestly have no idea.

Any ideas, @jlouis?

@jlouis

I get good results in boundary/folsom, which uses a 1028-entry array with random replacement. Of course, this only works when your number of samples is "high enough", for some value of high enough. If your samples arrive at a very low rate, it won't work. We usually process a few thousand samples per second into the array, and we push the result of the percentile calculation further on. More advanced solutions, like the paper's, will probably perform better.
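Whatever the sampling scheme, the flush side still has to turn the fixed-size array into percentiles. A simple nearest-rank sketch (hypothetical helper, not statsd's actual timer code) would be:

```javascript
// Nearest-rank percentile over a fixed-size sample array.
function percentileFromSample(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}
```

With a uniform sample the result approximates the percentile of the full stream, with accuracy depending on the array size, which is presumably why 1028 entries work well at thousands of samples per second.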

@cyberrodent
Owner

We would love to implement this but are short on time. Pull requests would be very welcome.

@cyberrodent cyberrodent added the Feature label
@jlouis

For the record: the algorithm I was referring to is called "Vitter's Algorithm R", and it does require you to remove old entries yourself, for instance by running 12 buckets on five-second intervals covering the last minute. The Cormode-Muthukrishnan algorithm can avoid this, but on the other hand it is probably not as simple to implement as Algorithm R.
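The bucketed aging described above can be sketched as a ring of per-interval arrays, rotated on a timer; names and structure here are illustrative only:

```javascript
// 12 buckets x 5 seconds = one minute of history.
const BUCKET_COUNT = 12;

function makeWindow() {
  return { buckets: [[]] }; // newest bucket is last
}

function rotateWindow(win) {
  // Call every five seconds (e.g. from setInterval): open a fresh
  // bucket and drop any bucket older than a minute.
  win.buckets.push([]);
  if (win.buckets.length > BUCKET_COUNT) win.buckets.shift();
}

function windowPush(win, value) {
  win.buckets[win.buckets.length - 1].push(value);
}

function windowSamples(win) {
  // All samples from the last minute, oldest bucket first;
  // percentiles would be computed over this combined array.
  return [].concat(...win.buckets);
}
```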
