Removing Outliers #22

Closed
adamsmd opened this Issue Feb 19, 2013 · 2 comments

adamsmd commented Feb 19, 2013

When collecting benchmarking data, system noise occasionally makes a few iterations take much longer than the others. If one iteration out of a thousand does this and is ten times slower than the rest, this inflates the average by about 1% and can wreak havoc with the standard deviation. If one iteration out of a hundred does this, the average is inflated by about 9%, and the average and standard deviation become useless even though the data is meaningful if you exclude these outliers.
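The effect of a few slow iterations on the mean and standard deviation can be checked with synthetic numbers. This is only an illustrative sketch in Python: the helper `with_outliers`, the base time of 1.0, and the slowdown factor of 10 are assumptions made up for the example, and none of it is part of Criterion.

```python
from statistics import mean, stdev

def with_outliers(n, slow_every, base=1.0, factor=10.0):
    """Return n synthetic timings; every `slow_every`-th one is `factor` times slower.

    Hypothetical helper for illustration only.
    """
    return [base * factor if i % slow_every == 0 else base for i in range(n)]

# 1000 iterations with a single slow one, and 1000 iterations with ten slow ones.
one_in_thousand = with_outliers(1000, 1000)
one_in_hundred = with_outliers(1000, 100)

for name, data in [("1 in 1000", one_in_thousand), ("1 in 100", one_in_hundred)]:
    # Without the slow iterations the mean would be 1.0 and the stdev 0.0.
    print(name, "mean:", mean(data), "stdev:", stdev(data))
```

With otherwise identical timings, any nonzero standard deviation here comes entirely from the slow iterations.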

I would like the ability to instruct Criterion to omit such outliers from its calculations. Even something as simple as removing the best and worst 10% of samples would often be sufficient.
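Removing the best and worst 10% of samples amounts to computing statistics over a trimmed sample. A minimal sketch of that idea, assuming the timings are available as a plain list of numbers (the function `trim` and the sample data are hypothetical, not a Criterion feature):

```python
from statistics import mean, stdev

def trim(samples, fraction=0.10):
    """Drop the lowest and highest `fraction` of samples; return the rest, sorted.

    The number dropped from each end is len(samples) * fraction, rounded down.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(samples)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]

# Synthetic data: 99 ordinary iterations and one that is ten times slower.
timings = [1.0] * 99 + [10.0]
kept = trim(timings)

print("raw mean:", mean(timings), "raw stdev:", stdev(timings))
print("trimmed mean:", mean(kept), "trimmed stdev:", stdev(kept))
```

Trimming symmetrically also discards the fastest samples, so the trimmed standard deviation understates the real spread; that is the trade-off the request accepts.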

Of course, this could be abused (e.g., the standard deviation doesn't mean much if you remove the best and worst 49% of samples). But for benchmarks such as the CPU time of code that does no IO or external calculation, it can be very useful: benchmark numbers often need to reflect the performance of the code being benchmarked rather than whatever system noise happened to kick in.

Possible extensions of this idea include reporting the median and/or mode. Another possibility is to fit the sampling data to some distribution that is flexible enough to account for system noise (e.g., a Poisson, bimodal(*), or mixture-based distribution) and then report the parameters of that distribution (e.g., the location of the peak or peaks of the Poisson or bimodal distribution, rather than just the mean).

(*) The second peak represents when the system noise kicks up.
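The median and mode suggestion can be illustrated without fitting any distribution. The sketch below uses made-up timings with a small second cluster standing in for the noisy peak; `binned_mode` and the bin width of 0.2 are assumptions chosen for the example, since continuous timings rarely repeat exactly and need binning before a mode is meaningful.

```python
from collections import Counter
from statistics import mean, median

def binned_mode(samples, bin_width):
    """Return the centre of the most populated bin of width `bin_width`.

    Hypothetical helper: a crude histogram-based estimate of the main peak.
    """
    bins = Counter(int(x // bin_width) for x in samples)
    busiest, _count = bins.most_common(1)[0]
    return (busiest + 0.5) * bin_width

# Six ordinary iterations near 1.1 and two noisy ones near 10.
timings = [1.10, 1.12, 1.08, 1.11, 1.09, 1.13, 9.70, 10.10]

print("mean:  ", mean(timings))              # pulled towards the noisy cluster
print("median:", median(timings))            # stays with the main cluster
print("mode:  ", binned_mode(timings, 0.2))  # centre of the busiest bin
```

The result of `binned_mode` depends on the bin width, so it is a rough location estimate rather than a fitted parameter.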

letmaik commented Apr 9, 2013

+1. I would have thought that Criterion already does that; it seems a fairly standard technique.

Owner
bos commented Sep 11, 2013

Rejecting outliers from a sample is not statistically sound when the distribution being sampled is not understood. There should be enough of the machinery exposed that if you want to remove outliers yourself, it should be easy to do so - but this is not something I plan to add to criterion itself.
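For readers who do want to remove outliers themselves from exported timings, one common heuristic is Tukey's fences: discard samples more than 1.5 interquartile ranges outside the quartiles. The sketch below is an assumption-laden illustration in Python, not Criterion's API; `tukey_filter` is a hypothetical name, and the caveat above still applies, because the filter is only as sound as your understanding of the underlying distribution.

```python
from statistics import quantiles

def tukey_filter(samples, k=1.5):
    """Keep samples within k interquartile ranges of the first and third quartiles.

    Needs at least two samples; quartiles use the default method of
    statistics.quantiles (Python 3.8+).
    """
    q1, _q2, q3 = quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in samples if low <= x <= high]

# Synthetic timings: eight ordinary iterations and one slow one.
timings = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 10.0]
kept = tukey_filter(timings)
print("kept", len(kept), "of", len(timings), "samples")
```

Reporting how many samples were discarded alongside the filtered statistics keeps the result honest.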

bos closed this Sep 11, 2013