
Statistical soundness of removing outliers from a heavy tailed distribution? #1256

kqr opened this issue Sep 30, 2019 · 5 comments
kqr commented Sep 30, 2019

Latency values (such as those from performance time measurements) often follow a heavy-tailed distribution -- especially when there are spontaneous delays involved, e.g. when garbage collection enters the picture.

It is a common mistake to assume a normal distribution for latency/performance measurements. Sometimes the normality assumption holds, but it is not safe to rely on without testing it first.

The central limit theorem helps, of course, but sums of draws from severely fat-tailed distributions converge to a normal distribution very, very slowly. This means that even when looking at a sum of measurements (which is what computing the mean involves) normality should not be blindly assumed.
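
(To make "testing it first" concrete, here is the kind of check I have in mind: a minimal, self-contained Jarque–Bera-style test built from sample skewness and kurtosis -- the same statistics BenchmarkDotNet already prints. The names and the threshold below are illustrative, not any library's API.)

    using System;
    using System.Linq;

    static class NormalityCheck
    {
        // Jarque–Bera statistic: under normality,
        // JB = n/6 * (S^2 + (K - 3)^2 / 4) is approximately chi-squared with 2 df,
        // so a value far above 5.99 (the 95% quantile) is strong evidence against normality.
        public static bool LooksNormal(double[] xs)
        {
            int n = xs.Length;
            double mean = xs.Average();
            double m2 = xs.Sum(x => Math.Pow(x - mean, 2)) / n;
            double m3 = xs.Sum(x => Math.Pow(x - mean, 3)) / n;
            double m4 = xs.Sum(x => Math.Pow(x - mean, 4)) / n;
            double skewness = m3 / Math.Pow(m2, 1.5);
            double kurtosis = m4 / (m2 * m2);
            double jb = n / 6.0 * (skewness * skewness + Math.Pow(kurtosis - 3, 2) / 4);
            return jb < 5.99;
        }
    }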

Under a heavy-tailed distribution, the sample mean (and even the high centiles) tends to underestimate the corresponding true values for the population. Imagine my surprise when I ran a benchmark of a computation with clearly heavy-tailed running times, and BenchmarkDotNet reported the sample mean as the primary measure of interest. To make matters worse, it reported that outliers had been discarded! The very values that are supposed to at least help a tiny bit in reducing the underestimation -- removed!

When dealing with heavy-tailed distributions, the majority of samples don't tell us a whole lot about the true underlying distribution. The strongest signal is in the "outliers" -- far from deserving to be ignored, they are what I would focus on! I can't in good faith use whatever values BenchmarkDotNet keeps when it reports having ignored the strongest signal.
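
To see the bias in isolation, here is a quick self-contained simulation (not BenchmarkDotNet code; the sample size, trim count, and seed are arbitrary) using a heavy-tailed Pareto distribution with a known true mean:

    using System;
    using System.Linq;

    class MeanUnderestimation
    {
        static void Main()
        {
            var r = new Random(42);
            // Pareto(m = 1, a = 1.12) via inverse-transform sampling; true mean = a/(a-1) ≈ 9.33.
            double[] xs = Enumerable.Range(0, 100)
                .Select(_ => 1.0 / Math.Pow(1 - r.NextDouble(), 1 / 1.12))
                .ToArray();
            double mean = xs.Average();
            // Dropping the largest values (as outlier removal does) only worsens the bias.
            double trimmed = xs.OrderBy(x => x).Take(92).Average();
            Console.WriteLine($"sample mean: {mean:F2}, after dropping top 8: {trimmed:F2}, true mean: 9.33");
        }
    }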

In my case, BenchmarkDotNet was friendly enough to report what range the removed outliers were in, so I could report that value to my co-worker.

However, if BenchmarkDotNet does not verify normality before computing means and standard deviations (beyond just guessing an n that "should" suffice), it seems misleading to emphasise these values -- especially considering BenchmarkDotNet's position as a black-box tool (which in all other ways is VERY GOOD in that regard).

My suggestion in this issue is to emphasise some other measure in cases where normality cannot be confirmed.

A simple and more robust measure is the maximum value, but I'm interested in what other alternatives there are to consider.

@AndreyAkinshin AndreyAkinshin self-assigned this Sep 30, 2019

kqr commented Oct 1, 2019

As a meta-benchmark, one could create a method that sleeps a number of milliseconds drawn at random from a Pareto(1, a) distribution where 1 < a < 2. This has a known true mean a/(a-1) but all higher moments (including variance) are infinite.
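
For reference, the inverse-transform algebra behind that sampler, with u drawn from Uniform(0, 1) (just the standard Pareto CDF, written out):

    F(x) = 1 - \left(\frac{m}{x}\right)^a \quad (x \ge m), \qquad
    u = F(x) \;\Rightarrow\; x = \frac{m}{(1-u)^{1/a}}, \qquad
    \mathbb{E}[X] = \frac{a\,m}{a - 1} \quad (a > 1)

With m = 1 this gives the mean a/(a-1) quoted above.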

Doing this should be fairly easy and I can probably take a look at it later today.


kqr commented Oct 1, 2019

The code attached at the bottom works as such a meta-benchmark.

I tried running it with a = 1.12 (which has a finite mean of about 9.33) and BenchmarkDotNet consistently underestimates the mean (the attached output from one of the runs suggests around 5 -- roughly half the true mean).

(It also appears that BenchmarkDotNet falsely flags Pareto-distributed run times as bimodal very frequently. This is probably harder to do something about, but if BenchmarkDotNet recognised heavy-tailed distributions for what they are, the bimodality warning might not be needed as often as it is now.)

// * Detailed results *
Benchmark.Pareto: DefaultJob
Runtime = .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT; GC = Concurrent Workstation
Mean = 5.1712 ms, StdErr = 0.1859 ms (3.59%); N = 92, StdDev = 1.7830 ms
Min = 2.7432 ms, Q1 = 3.8996 ms, Median = 4.6586 ms, Q3 = 5.9657 ms, Max = 10.1838 ms
IQR = 2.0660 ms, LowerFence = 0.8006 ms, UpperFence = 9.0647 ms
ConfidenceInterval = [4.5390 ms; 5.8033 ms] (CI 99.9%), Margin = 0.6322 ms (12.22% of Mean)
Skewness = 1.13, Kurtosis = 3.6, MValue = 2.23
-------------------- Histogram --------------------
[2.670 ms ;  3.364 ms) | @@@@@@@@
[3.364 ms ;  4.079 ms) | @@@@@@@@@@@@@@@@@@
[4.079 ms ;  4.770 ms) | @@@@@@@@@@@@@@@@@@@@@@@@
[4.770 ms ;  5.477 ms) | @@@@@@@@@@@
[5.477 ms ;  6.168 ms) | @@@@@@@@@@@@
[6.168 ms ;  6.548 ms) | @
[6.548 ms ;  7.239 ms) | @@@@@@@
[7.239 ms ;  7.729 ms) |
[7.729 ms ;  8.420 ms) | @@@@
[8.420 ms ;  8.806 ms) |
[8.806 ms ;  9.600 ms) | @@@@
[9.600 ms ; 10.291 ms) | @@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14.6 (18G95) [Darwin 18.7.0]
Intel Core i7-7820HQ CPU 2.90GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT


| Method |     Mean |     Error |   StdDev |   Median |
|------- |---------:|----------:|---------:|---------:|
| Pareto | 5.171 ms | 0.6322 ms | 1.783 ms | 4.659 ms |

// * Hints *
Outliers
  Benchmark.Pareto: Default -> 8 outliers were removed (12.31 ms..49.45 ms)
    using System;
    using BenchmarkDotNet.Attributes;

    public class Benchmark
    {
        const double m = 1;    // Pareto scale parameter (minimum sleep, in ms)
        const double a = 1.12; // Pareto shape; 1 < a < 2 gives a finite mean but infinite variance
        private Random r;

        [GlobalSetup]
        public void Setup()
        {
            r = new Random();
        }

        [Benchmark]
        public int Pareto()
        {
            // Inverse-transform sampling: X = m / (1 - U)^(1/a) ~ Pareto(m, a)
            var sleepMs = (int) (m / Math.Pow(1 - r.NextDouble(), 1 / a));
            System.Threading.Thread.Sleep(sleepMs);
            return sleepMs;
        }
    }
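
For completeness, this is run with the standard entry point:

    using BenchmarkDotNet.Running;

    class Program
    {
        static void Main(string[] args) => BenchmarkRunner.Run<Benchmark>();
    }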

@AndreyAkinshin
Member

@kqr thanks for the report! In fact, you described two important problems that we have.

  • Default set of columns
    I agree 100% that Mean/StdDev metrics may be misleading in some cases. Currently, I'm working on a better version of the summary table that doesn't include misleading metrics. It's a pretty hard task, but it seems I have found a solution. Right now I'm working on a proposal for the improved summary table (I'm going to publish a series of blog posts that highlight all the problems and show how we can resolve them). Next, we are going to discuss it. If everything is OK, I'm going to release a new version of BenchmarkDotNet with the new summary table.
  • Default outlier policy
    Originally, BenchmarkDotNet was designed for CPU-bound microbenchmarks. That's why it removes upper outliers by default (you can customize this behavior for your benchmarks and keep the outliers untouched). I believe that it's a good strategy because it reduces noise from other applications and the OS. Unfortunately, this approach doesn't work well for I/O-bound benchmarks (when we have some disk or network operations). In this case, outliers are pretty important and we shouldn't exclude them from the summary. Currently, you have to manually ask BenchmarkDotNet not to touch the outliers (a sketch of this opt-out follows below). I think that we can introduce automatic logic that checks benchmarks for I/O operations and sets the most appropriate mode for outlier processing. What do you think?


kqr commented Oct 8, 2019

Thank you for your thorough and well-written reply!

Currently, I'm working on a better version of the summary table that doesn't include misleading metrics. It's a pretty hard task, but it seems I have found a solution. Right now I'm working on a proposal for the improved summary table (I'm going to publish a series of blog posts that highlight all the problems and show how we can resolve them).

Very exciting! Please keep me updated on the blog posts; I'm looking forward to reading them! :)

Originally, BenchmarkDotNet was designed for CPU-bound microbenchmarks. I believe that it's a good strategy because it reduces noise from other applications and the OS.

Ah, interesting context. That said, I'm still not sure it's sensible as the default. If noise from other applications and the OS causes one benchmark to have significantly worse outlier problems than another, is that not something one would want to include in the analysis?

I guess what I'm saying is that while I see there could be reasons to filter out large values, the circumstances under which that is meaningful are so few that it doesn't seem sensible to make it the default and let users opt out. It seems much more sensible, in that case, to do no large-value filtering by default and let users opt into it. What am I missing?

Unfortunately, this approach doesn't work well for I/O-bound benchmarks (when we have some disk or network operations). In this case, outliers are pretty important and we shouldn't exclude them from the summary. I think that we can introduce automatic logic that checks benchmarks for I/O operations and sets the most appropriate mode for outlier processing. What do you think?

For this to work, you'd have to define I/O very broadly. Off the top of my head, there are many more subtle things that can give rise to heavy-tailed latencies besides disk and network operations:

  • memory access through various layers of caches
  • multithreading with locks
  • garbage collection

While one could automatically try to detect the presence of these and use that information to handle outliers differently, I think the utility is limited: after all, what benchmark does not access memory?

mkosina commented Jan 16, 2020

Great discussion and wonderful tool - just discovered benchmark.net after rolling my own (mostly with perf counters / HdrHistogram) and the simplicity of defining common test drivers is amazing.

I do understand the value of allowing users to drop outliers in some cases; however, I am with kqr on this: making it the default only reinforces the general computing public's misconception that performance profiles follow normal distributions, and that ignoring jitter is automatically valid. I believe this should be a conscious decision (i.e. "I do not care that one in every 100 or 1000 requests results in a terrible experience for my user, or exposes me to great risk", for example). As we all know, it is one thing to design a system that performs a task under a millisecond at the median; it is quite another accomplishment (and possibly a radically different implementation) to hold it to at most 10 ms at the four nines (99.99th percentile).

I think there is an opportunity to educate a large audience about that, given the tool's adoption - and continue to elevate its status beyond micro-benchmarking to bigger-picture performance mindset.

I would suggest the default be outlier inclusion, with a statistics display of min / median / p99 / p99.9 / p99.99 / max (is it possible to configure such a display now, by the way? I only saw constants up to StatisticsColumn.P95).
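
In the meantime, high centiles are easy to compute by hand from the raw measurements; a minimal nearest-rank sketch (all names mine, not a BenchmarkDotNet API):

    using System;
    using System.Linq;

    static class Percentiles
    {
        // Nearest-rank percentile: the smallest sample such that
        // at least p% of all samples are less than or equal to it.
        public static double Percentile(double[] xs, double p)
        {
            var sorted = xs.OrderBy(x => x).ToArray();
            int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
            return sorted[Math.Max(rank - 1, 0)];
        }
    }

    // Usage: Percentile(ms, 50), Percentile(ms, 99), Percentile(ms, 99.9), ms.Max()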

Best and thanks again for a great tool !

Martin
