
[benchmark] More Robust Benchmark_Driver #19910

Merged
merged 17 commits on Oct 29, 2018

Conversation

@palimondo (Collaborator) commented Oct 16, 2018

Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal.

This PR improves the benchmark measurement methodology used by Benchmark_Driver, as originally outlined in Towards Robust Performance Measurement. Building on Benchmark_O's quantile report format from PR #19328, it now collects a representative sub-sample of the benchmark measurements (exact measurements up to a sample size of 21; ventiles plus min and max for larger sample sizes). This sub-sample is filtered to exclude outliers (values higher than Q3 + 1.5 * IQR) and combined into one empirical distribution by merging the results from -i independent samples. The long form of this argument is renamed from the misleading --iterations to --independent-samples. The log output format is changed to report a five-number summary, replacing mean and standard deviation (not applicable, because the underlying distribution is non-normal) with quartiles.
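The sub-sampling itself happens on the Benchmark_O (Swift) side; the following Python sketch only illustrates the idea under the assumptions stated above (exact values up to 21 samples, otherwise min, the 19 ventiles, and max) and is not the actual implementation:

```python
# Illustrative sketch of the representative sub-sample described above.
def representative_subsample(runtimes, exact_limit=21):
    ordered = sorted(runtimes)
    n = len(ordered)
    if n <= exact_limit:
        return ordered  # small samples are reported exactly
    # min (q0), 19 ventiles (q1..q19) and max (q20): 21 order statistics
    return [ordered[round(j * (n - 1) / 20)] for j in range(21)]
```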

To facilitate this change, Benchmark_O was modified to reuse the auto-scaling mechanism previously used to determine the number of iterations: when --num-iters=1, it now collects a variable number of samples based on the --sample-time argument. This type of measurement, without averaging multiple values into the reported runtimes, is now used by Benchmark_Driver. Benchmark_O now also reports the time it takes to run the setUpFunction in --verbose mode.

Additional changes required to enable the above:

  • LogParser support for parsing:
    • the benchmark report summary in quantile format,
    • setup function timing and yields from the --verbose output of Benchmark_O.
  • BenchmarkDoctor validation:
    • a new rule to enforce sensible runtimes for the setUpFunction,
    • recommended runtime lowered to <1 ms; runtimes >10 ms are reported as errors.

I've structured the PR as a series of small, incremental changes with detailed descriptions in the commit messages; it is therefore best reviewed sequentially, commit by commit.

Measure the duration of the `setUpFunction` and report it in verbose mode.

This will be used by `BenchmarkDoctor` to ensure there isn't an unreasonably big imbalance between the time it takes to set up a benchmark and the time it takes to run it.
Gather all samples published in the benchmark summary from the `Benchmark_O --quantile` output format.
Support for reading the delta-encoded quantiles format.
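As a rough illustration of the delta encoding (not the exact LogParser code; assuming each quantile column stores the difference from the previous value, with an empty field standing for a zero delta), the absolute values are recovered with a running sum:

```python
# Hedged sketch: decode delta-encoded quantile fields into absolute values.
def decode_delta_quantiles(fields):
    absolute, total = [], 0
    for field in fields:
        total += int(field) if field else 0  # empty field encodes a zero delta
        absolute.append(total)
    return absolute

decode_delta_quantiles(['269', '', '1', '', '2'])  # -> [269, 269, 270, 270, 272]
```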
When measuring with a specified number of iterations (generally, `--num-iters=1` makes sense), automatically determine the number of samples to take, so that the overall measurement duration comes close to `--sample-time`.

This is the same technique that was used to scale `num-iters` before, now applied to `num-samples`.
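Conceptually, the scaling amounts to estimating the benchmark's runtime from one measurement and dividing the sample time by it. The actual logic lives in Benchmark_O's Swift code; the names below are illustrative:

```python
# Illustrative sketch of auto-scaling the sample count to --sample-time.
def scaled_num_samples(first_sample_us, sample_time_s=1.0):
    runtime_s = first_sample_us / 1_000_000
    if runtime_s <= 0:
        return 1
    return max(1, int(sample_time_s / runtime_s))

scaled_num_samples(250)  # a 250 us benchmark yields ~4000 samples per second
```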
Renamed Benchmark_Driver's `iterations` argument to `independent-samples` to clarify its true meaning and disambiguate it from the concept of `num-iters` used in Benchmark_O. The short form of the argument, `-i`, remains unchanged.
Switch the measurement technique from gathering `i` independent samples characterized by their mean values to a finer-grained characterization of these measurements using quantiles.

The distribution of benchmark measurements is non-normal, with outliers that significantly inflate the mean and standard deviation due to the presence of an uncontrolled variable: the system load. Therefore MEAN and SD were the wrong statistics to characterize the benchmark measurements.

Benchmark_Driver now gathers more individual measurements from Benchmark_O. It is executed with `--num-iters=1`, because we don't want to average the runtimes; we want raw data. This collects a variable number of measurements gathered in about 1 second. Using `--quantile=20`, we get up to 20 measured values from each independent run that properly characterize the empirical distribution of the benchmark. The measurements from `i` independent executions are combined to form the final empirical distribution, which is reported as a five-number summary (MIN, Q1, MEDIAN, Q3, MAX).
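A minimal sketch of the merging step, assuming the parsed quantile values from each run are available as lists of runtimes (the real Benchmark_Driver code differs in detail):

```python
# Combine samples from i independent runs and compute the five-number summary.
def five_number_summary(independent_samples):
    combined = sorted(t for sample in independent_samples for t in sample)
    def q(p):  # nearest-rank quantile over the combined empirical distribution
        return combined[round(p * (len(combined) - 1))]
    return {'MIN': combined[0], 'Q1': q(0.25), 'MEDIAN': q(0.5),
            'Q3': q(0.75), 'MAX': combined[-1]}

five_number_summary([[10, 11, 12, 30], [10, 10, 11, 13]])
# -> {'MIN': 10, 'Q1': 10, 'MEDIAN': 11, 'Q3': 12, 'MAX': 30}
```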
When num_samples is less than quantile + 1, some of the measurements are repeated in the report summary. Parsed samples should strive to be a true reflection of the measured distribution, so we correct this by discarding the repeated artifacts from the quantile estimation.

This avoids introducing a bias from this oversampling into the empirical distribution obtained from merging independent samples.

See also:
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
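A hedged sketch of that correction, under the assumption that the reported quantiles are order statistics of the sorted measurements (indices and names are illustrative, not the actual LogParser logic):

```python
# Drop the oversampling artifacts: with fewer samples than quantile slots,
# the same measurement occupies several adjacent slots; knowing num_samples
# lets us pick the slots that map back to distinct order statistics.
def strip_quantile_artifacts(quantiles, num_samples):
    if num_samples >= len(quantiles):
        return quantiles  # nothing was oversampled
    if num_samples <= 1:
        return quantiles[:1]
    last = len(quantiles) - 1
    return [quantiles[round(i * last / (num_samples - 1))]
            for i in range(num_samples)]
```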
Use the box-plot inspired technique for filtering out outlier measurements. Values that are higher than the top inner fence (TIF = Q3 + IQR * 1.5) are excluded from the sample.
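A small sketch of this filter (the quantile helper is illustrative; the driver computes Q1 and Q3 from its own sample representation):

```python
# Exclude measurements above the top inner fence, TIF = Q3 + 1.5 * IQR.
def exclude_outliers(measurements):
    ordered = sorted(measurements)
    def q(p):
        return ordered[round(p * (len(ordered) - 1))]
    q1, q3 = q(0.25), q(0.75)
    top_inner_fence = q3 + 1.5 * (q3 - q1)
    return [m for m in ordered if m <= top_inner_fence]

exclude_outliers([10, 10, 11, 11, 12, 13, 95])  # the 95 outlier is dropped
```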
Since the meaning of some columns has changed but their overall number remained the same, let's include the header in the CSV log to make it clear that we are now reporting MIN, Q1, MEDIAN, Q3, MAX, MAX_RSS, instead of the old MIN, MAX, MEAN, SD, MEDIAN, MAX_RSS format.
* Lowered the threshold for a healthy benchmark runtime to under 1000 μs.
* Offer a suitable divisor that is a power of 10, in addition to the one that's a power of 2 (see the sketch after this list).
* Expanded the motivation in the docstring.
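Roughly, the suggestion picks the smallest power of each base that brings the runtime under the threshold. This is a sketch with assumed names, not BenchmarkDoctor's actual code:

```python
import math

# Suggest power-of-2 and power-of-10 workload divisors for an overlong runtime.
def suggested_divisors(runtime_us, threshold_us=1000):
    factor = runtime_us / threshold_us  # how many times over the threshold
    return (2 ** math.ceil(math.log2(factor)),
            10 ** math.ceil(math.log10(factor)))

suggested_divisors(2500)  # -> (4, 10)
```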
Add a check against unreasonably long setup times for benchmarks that do their initialization work in the `setUpFunction`. Given that a typical benchmark measurement lasts about 1 second, it's reasonable to expect the setup to take at most 20% extra on top of that: 200 ms.

The `DictionaryKeysContains*` benchmarks are an instance of this mistake. The setup of `DictionaryKeysContainsNative` takes 3 seconds on my machine to prepare a dictionary for the run function, whose typical runtime is 90 μs. The setup of the Cocoa version takes 8 seconds! It is trivial to rewrite these with much smaller dictionaries that demonstrate the point of these benchmarks perfectly well, without the need to wait for ages to set up these benchmarks.
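A minimal sketch of such a rule, assuming the setup time parsed from the --verbose output is available in microseconds (the 200 ms limit comes from the commit message above; names are illustrative, not BenchmarkDoctor's actual API):

```python
SETUP_OVERHEAD_LIMIT_US = 200_000  # 20% of a ~1 second measurement

def check_setup_time(name, setup_us):
    # Flag benchmarks whose setUpFunction dwarfs the measurement itself.
    if setup_us > SETUP_OVERHEAD_LIMIT_US:
        print("error: '%s' setup took %d us; shrink the fixture so that setup "
              "stays under %d us." % (name, setup_us, SETUP_OVERHEAD_LIMIT_US))
```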
@palimondo (Collaborator, Author) commented Oct 16, 2018

@eeckstein Please review. Please also run the full test, to exercise the benchmark tests (if they're not part of the smoke test yet...).

@palimondo (Collaborator, Author) commented:

I'm working on another PR that should land in sync with this one, since this changes the measurement methodology used in Benchmark_Driver to always measure with --num-iters=1, and benchmarks that exhibit setup overhead would report a different minimum value because the overhead is no longer amortized over multiple iterations.

@eeckstein (Member) commented:

@swift-ci benchmark

@swift-ci (Collaborator) commented:

Build comment file:

No performance and code size changes

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@eeckstein (Member) commented:

@swift-ci smoke test

@palimondo (Collaborator, Author) commented:

@eeckstein I've pushed the fix for the linter nitpicks that stopped the last build. Please re-test. (Are benchmark tests now part of the smoke test, too, or should we run the full test as we did before?)

@eeckstein (Member) commented:

@swift-ci smoke test

1 similar comment
@eeckstein (Member) commented:

@swift-ci smoke test
