Statistical soundness of removing outliers from a heavy tailed distribution? #1256
As a meta-benchmark, one could create a method that sleeps a number of milliseconds drawn at random from a Pareto(1, a) distribution where 1 < a < 2. This has a known true mean a/(a-1), but all higher moments (including the variance) are infinite. Doing this should be fairly easy and I can probably take a look at it later today.
The code attached at the bottom works as such a meta-benchmark. I tried running it with a = 1.12 (which gives a finite mean of about 9.3), and BenchmarkDotNet consistently underestimates the mean (attached output from one of the runs suggests around 5, almost half the true mean). (It also appears that BenchmarkDotNet falsely flags Pareto-distributed run times as bimodal very frequently. This is probably harder to do something about, but if BenchmarkDotNet would start recognising heavy-tailed distributions for what they are, the bimodality warning might not be required as often as it is now.)
```csharp
public class Benchmark
{
    const double m = 1;
    const double a = 1.12;

    private Random r;

    [GlobalSetup]
    public void Setup()
    {
        r = new Random();
    }

    [Benchmark]
    public int Pareto()
    {
        // Inverse-transform sample from Pareto(m, a), used as sleep time in ms.
        var sleepMs = (int) (m / Math.Pow(1 - r.NextDouble(), 1 / a));
        System.Threading.Thread.Sleep(sleepMs);
        return sleepMs;
    }
}
```
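The `Pareto()` method uses inverse-transform sampling: if U ~ Uniform(0, 1), then m / (1 - U)^(1/a) follows a Pareto(m, a) distribution. A quick sanity check of that transform, sketched here in stdlib Python for brevity (the seed and sample size are arbitrary illustration choices), compares the empirical median against the closed-form median m·2^(1/a):

```python
import random

# Sanity check of the inverse-transform sampler from the C# snippet above:
# if U ~ Uniform(0, 1), then m / (1 - U)**(1/a) ~ Pareto(m, a).
m, a = 1.0, 1.12
random.seed(1)
samples = sorted(m / (1 - random.random()) ** (1 / a) for _ in range(100_000))

empirical_median = samples[len(samples) // 2]
theoretical_median = m * 2 ** (1 / a)  # solve 1 - (m/x)**a = 0.5 for x

# The median is a robust quantile: it converges quickly even though the
# distribution's variance is infinite.
assert abs(empirical_median - theoretical_median) / theoretical_median < 0.05
```

The median converges quickly even though the variance is infinite, which is exactly why robust statistics are attractive for this kind of data.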
@kqr thanks for the report! In fact, you described two important problems that we have.
Thank you for your thought-through and well-written reply!
Very exciting! Please keep me updated on the blog posts; I'm looking forward to reading them! :)
Ah, interesting context. That said, I'm still not sure it's sensible to use as the default. If noise from other applications and the OS causes one benchmark to have significantly worse outlier problems than another, is that not something one would want to include in the analysis? I guess what I'm saying is this: while I see there could be reasons to filter out large values, the circumstances under which that is meaningful are so few that filtering by default and letting users opt out doesn't seem sensible. It seems much more sensible to do no large-value filtering by default and allow users to opt into it. What am I missing?
For this to work, you'd have to define I/O very broadly. Off the top of my head, besides disk and network operations there are many more subtle things that can give rise to heavy-tailed latencies: garbage collection pauses and ordinary memory access (cache misses, page faults), to name just two.
While one could automatically try to detect the presence of these and use that information to handle outliers differently, I think the utility is limited: after all, what benchmark does not access memory?
Great discussion and wonderful tool - just discovered BenchmarkDotNet after rolling my own (mostly with perf counters / HdrHistogram), and the simplicity of defining common test drivers is amazing.

I do understand the value of allowing outliers to be dropped in some cases. However, I am with kqr on this: making it the default only reinforces the general computing public's misconception that performance profiles follow normal distributions, and that ignoring jitter is automatically valid. I believe this should be a conscious decision (i.e. "I do not care that one in every 100 or 1000 requests results in a terrible experience for my user, or exposes me to a great risk", for example).

As we all know, it is one thing to design a system that performs a task under a millisecond at the median; it is quite an accomplishment (and possibly a radically different implementation) to hold on to at least 10 ms at the 4-9's. I think there is an opportunity to educate a large audience about that, given the tool's adoption, and to continue to elevate its status beyond micro-benchmarking to a bigger-picture performance mindset.

I would suggest the default to be outlier inclusion, plus a statistics display of min / median / 99 / 99.9 / 99.99 / max (is it possible to configure such a display now, by the way? I only saw constants up to StatisticsColumn.P95).

Best and thanks again for a great tool!
Martin
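The suggested min / median / 99 / 99.9 / 99.99 / max display is cheap to compute from raw measurements. A minimal sketch in stdlib Python (the nearest-rank method and the simulated data are illustrative assumptions, not BenchmarkDotNet's implementation):

```python
import random

def percentile(sorted_xs, p):
    """Nearest-rank percentile of pre-sorted data, 0 <= p <= 100."""
    k = min(len(sorted_xs) - 1, int(p / 100 * len(sorted_xs)))
    return sorted_xs[k]

# Simulated heavy-tailed latencies (Pareto(1, 1.12)) standing in for real
# measurements; the summary keeps the tail visible instead of hiding it.
random.seed(7)
xs = sorted(1 / (1 - random.random()) ** (1 / 1.12) for _ in range(100_000))

summary = {p: percentile(xs, p) for p in (0, 50, 99, 99.9, 99.99, 100)}
# A heavy tail shows up as huge gaps between the median and the top centiles.
assert summary[50] < summary[99] < summary[99.9] < summary[99.99] <= summary[100]
```

For heavy-tailed data the interesting information sits precisely in the spread between those top centiles, which a mean-and-standard-deviation display collapses away.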
Latency values (such as those from performance time measurements) often follow a heavy-tailed distribution -- especially when there are spontaneous delays involved, e.g. when garbage collection enters the picture.
It is a common mistake to assume a normal distribution for latency/performance measurements. Sometimes this normality assumption holds, but this is not a safe assumption without testing it first.
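One cheap way to test it, sketched in stdlib Python (a crude skewness check, not a full normality test such as Shapiro-Wilk; the sample sizes and parameters are illustrative):

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: roughly 0 for normal data, huge for heavy tails."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)

random.seed(0)
normal_data = [random.gauss(10, 2) for _ in range(10_000)]
pareto_data = [1 / (1 - random.random()) ** (1 / 1.12) for _ in range(10_000)]

assert abs(skewness(normal_data)) < 0.2   # consistent with normality
assert skewness(pareto_data) > 2          # clearly not normal
```

Even a check this crude would be enough to flag the Pareto-style measurements above as unsuitable for a plain mean-and-standard-deviation summary.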
The central limit theorem helps, of course, but severely fat-tailed distributions only converge to a normal distribution very, very slowly. This means that even when looking at a sum of measurements (which is the case when computing the mean, of course) normality should not be blindly assumed.
Under a heavy-tailed distribution, the sample mean (and even the high centiles) tends to underestimate the corresponding true values for the population. Imagine my surprise when I ran a benchmark of a computation with clearly heavy-tailed running times, and BenchmarkDotNet reported the sample mean as the primary measure of interest. To make matters worse, it reported that outliers had been discarded! The very values that are supposed to at least help a tiny bit in reducing the underestimation -- removed!
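The underestimation is easy to reproduce in a simulation (stdlib Python for brevity; the Pareto parameters mirror the meta-benchmark discussed above, and the run/iteration counts are arbitrary):

```python
import random
import statistics

m, a = 1.0, 1.12
true_mean = a * m / (a - 1)  # about 9.33

random.seed(3)
def run_mean(n):
    """Sample mean of one simulated 'run' of n Pareto(m, a) measurements."""
    return statistics.fmean(m / (1 - random.random()) ** (1 / a) for _ in range(n))

# The typical (median) sample mean over many 100-iteration runs sits well
# below the true mean: the rare, huge observations that pull the mean up
# are simply absent from most runs.
means = [run_mean(100) for _ in range(2_000)]
assert statistics.median(means) < 0.8 * true_mean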
When dealing with heavy-tailed distributions, the majority of samples don't tell us a whole lot about the true underlying distribution. The strongest signal is found in the "outliers" -- these should not be ignored; if anything, they're what I would focus on! I can't in good faith use whatever values BenchmarkDotNet keeps when it reports having ignored the strongest signal.
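A sketch of why trimming makes this worse, assuming a Tukey-fence style rule (values beyond Q3 + 1.5·IQR dropped; this is a simplification for illustration, not necessarily BenchmarkDotNet's exact logic):

```python
import random
import statistics

random.seed(5)
a = 1.12
xs = [1 / (1 - random.random()) ** (1 / a) for _ in range(10_000)]

# Tukey's fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an "outlier".
q1, _, q3 = statistics.quantiles(xs, n=4)
iqr = q3 - q1
kept = [x for x in xs if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]

# On right-skewed heavy-tailed data only the upper tail gets trimmed, so the
# post-trim mean drops even further below the true mean (about 9.33) than the
# already-too-low untrimmed sample mean.
assert statistics.mean(kept) < statistics.mean(xs)
assert min(xs) in kept  # the lower fence removes nothing here
```

The trimmed values are not measurement errors; they are legitimate draws from the tail, which is exactly where the distribution's mean lives.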
In my case, BenchmarkDotNet was friendly enough to report what range the removed outliers were in, so I could report that value to my co-worker.
However, if BenchmarkDotNet does not verify normality before computing means and standard deviations (beyond just guessing an n that "should" suffice), it seems misleading to emphasise these values, especially considering BenchmarkDotNet's position as a black-box tool (which in all other ways is VERY GOOD in that regard).
My suggestion in this issue is to emphasise some other measure in the cases where normality cannot be confirmed.
A simple and more robust measure is the maximum value, but I'm interested in what other alternatives there are to consider.