SR-4597: Benchmark results have wrong MEAN, MEDIAN and SD
The compare_perf_tests.py script performs statistically questionable analysis when deciding which regressions and improvements are significant. This produces noisy results and forces reviewers to make more judgment calls than necessary.
In its current state, compare_perf_tests.py compares MIN and MAX values to detect significant performance changes, but this is misguided. We take multiple samples of every performance test precisely to eliminate one-off measurement aberrations (which dominate MIN and MAX) and to converge on the truer value: the MEAN. We should instead use the standard deviation (SD) to evaluate whether the difference between the new and old MEAN values represents a meaningful change in Swift's performance.
To be fair, the MEAN and SD values were probably ignored because Benchmark_Driver generates them incorrectly. That is SR-4597.