
[benchmark] More Robust Benchmark_Driver #19910

Merged
merged 17 commits on Oct 29, 2018

Conversation

@palimondo (Collaborator) commented Oct 16, 2018

Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal.

This PR improves the benchmark measurement methodology used by Benchmark_Driver, as originally outlined in Towards Robust Performance Measurement. Building on Benchmark_O's quantile report format from PR #19328, it now collects a representative sub-sample of the benchmark measurements (exact measurements up to a sample size of 21; ventiles plus min and max for larger sample sizes). This sub-sample is filtered to exclude outliers (values higher than Q3 + 1.5 * IQR) and combined into one empirical distribution by merging the results from -i independent samples. The long form of this argument is renamed from the misleading --iterations to --independent-samples. The log output format is changed to report a five-number summary, replacing mean and standard deviation (not applicable, because the underlying distribution is non-normal) with quartiles.
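The sub-sampling itself happens on the Benchmark_O (Swift) side; the following Python sketch only illustrates the idea under the assumptions stated above (exact values up to 21 samples, otherwise min, the 19 ventiles, and max) and is not the actual implementation:

```python
# Illustrative sketch of the representative sub-sample described above.
def representative_subsample(runtimes, exact_limit=21):
    ordered = sorted(runtimes)
    n = len(ordered)
    if n <= exact_limit:
        return ordered  # small samples are reported exactly
    # min (q0), 19 ventiles (q1..q19) and max (q20): 21 order statistics
    return [ordered[round(j * (n - 1) / 20)] for j in range(21)]
```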

To facilitate this change, Benchmark_O was modified to reuse the auto-scaling mechanism previously used to determine the number of iterations: when --num-iters=1, it now collects a variable number of samples based on the --sample-time argument. This type of measurement, without averaging multiple values into the reported runtimes, is now used by Benchmark_Driver. Benchmark_O now also reports the time it takes to run the setUpFunction in --verbose mode.

Additional changes required to enable the above:

  • LogParser support for parsing:
    • the benchmark report summary in quantile format,
    • setup function timing and yields from the --verbose output of Benchmark_O.
  • BenchmarkDoctor validation:
    • a new rule to enforce sensible runtimes for the setUpFunction,
    • recommended runtime lowered to <1 ms; runtimes >10 ms are reported as errors.

I've structured the PR as a series of small, incremental changes with detailed descriptions in the commit messages; it is therefore best reviewed sequentially, commit by commit.

Measure the duration of the `setUpFunction` and report it in verbose mode.

This will be used by `BenchmarkDoctor` to ensure there isn't an unreasonably big imbalance between the time it takes to set up a benchmark and the time it takes to run it.
Gather all samples published in the benchmark summary from the `Benchmark_O --quantile` output format.
Support for reading the delta-encoded quantiles format.
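As a rough illustration of the delta encoding (not the exact LogParser code; assuming each quantile column stores the difference from the previous value, with an empty field standing for a zero delta), the absolute values are recovered with a running sum:

```python
# Hedged sketch: decode delta-encoded quantile fields into absolute values.
def decode_delta_quantiles(fields):
    absolute, total = [], 0
    for field in fields:
        total += int(field) if field else 0  # empty field encodes a zero delta
        absolute.append(total)
    return absolute

decode_delta_quantiles(['269', '', '1', '', '2'])  # -> [269, 269, 270, 270, 272]
```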
When measuring with a specified number of iterations (generally, `--num-iters=1` makes sense), automatically determine the number of samples to take, so that the overall measurement duration comes close to `--sample-time`.

This is the same technique that was used to scale `num-iters` before, now applied to `num-samples`.
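Conceptually, the scaling amounts to estimating the benchmark's runtime from one measurement and dividing the sample time by it. The actual logic lives in Benchmark_O's Swift code; the names below are illustrative:

```python
# Illustrative sketch of auto-scaling the sample count to --sample-time.
def scaled_num_samples(first_sample_us, sample_time_s=1.0):
    runtime_s = first_sample_us / 1_000_000
    if runtime_s <= 0:
        return 1
    return max(1, int(sample_time_s / runtime_s))

scaled_num_samples(250)  # a 250 us benchmark yields ~4000 samples per second
```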
Renamed Benchmark_Driver's `iterations` argument to `independent-samples` to clarify its true meaning and disambiguate it from the concept of `num-iters` used in Benchmark_O. The short form of the argument, `-i`, remains unchanged.
Switch the measurement technique from gathering `i` independent samples characterized by their mean values to a finer-grained characterization of these measurements using quantiles.

The distribution of benchmark measurements is non-normal, with outliers that significantly inflate the mean and standard deviation due to the presence of an uncontrolled variable: the system load. Therefore MEAN and SD were the wrong statistics to characterize the benchmark measurements.

Benchmark_Driver now gathers more individual measurements from Benchmark_O. It is executed with `--num-iters=1`, because we don't want to average the runtimes; we want raw data. This collects a variable number of measurements gathered in about 1 second. Using `--quantile=20`, we get up to 20 measured values from each independent run that properly characterize the empirical distribution of the benchmark. The measurements from `i` independent executions are combined to form the final empirical distribution, which is reported as a five-number summary (MIN, Q1, MEDIAN, Q3, MAX).
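A minimal sketch of the merging step, assuming the parsed quantile values from each run are available as lists of runtimes (the real Benchmark_Driver code differs in detail):

```python
# Combine samples from i independent runs and compute the five-number summary.
def five_number_summary(independent_samples):
    combined = sorted(t for sample in independent_samples for t in sample)
    def q(p):  # nearest-rank quantile over the combined empirical distribution
        return combined[round(p * (len(combined) - 1))]
    return {'MIN': combined[0], 'Q1': q(0.25), 'MEDIAN': q(0.5),
            'Q3': q(0.75), 'MAX': combined[-1]}

five_number_summary([[10, 11, 12, 30], [10, 10, 11, 13]])
# -> {'MIN': 10, 'Q1': 10, 'MEDIAN': 11, 'Q3': 12, 'MAX': 30}
```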
When num_samples is less than quantile + 1, some of the measurements are repeated in the report summary. Parsed samples should strive to be a true reflection of the measured distribution, so we correct this by discarding the repeated artifacts from the quantile estimation.

This avoids introducing a bias from this oversampling into the empirical distribution obtained from merging independent samples.

See also:
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
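A hedged sketch of that correction, under the assumption that the reported quantiles are order statistics of the sorted measurements (indices and names are illustrative, not the actual LogParser logic):

```python
# Drop the oversampling artifacts: with fewer samples than quantile slots,
# the same measurement occupies several adjacent slots; knowing num_samples
# lets us pick the slots that map back to distinct order statistics.
def strip_quantile_artifacts(quantiles, num_samples):
    if num_samples >= len(quantiles):
        return quantiles  # nothing was oversampled
    if num_samples <= 1:
        return quantiles[:1]
    last = len(quantiles) - 1
    return [quantiles[round(i * last / (num_samples - 1))]
            for i in range(num_samples)]
```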
Use the box-plot inspired technique for filtering out outlier measurements. Values that are higher than the top inner fence (TIF = Q3 + IQR * 1.5) are excluded from the sample.
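A small sketch of this filter (the quantile helper is illustrative; the driver computes Q1 and Q3 from its own sample representation):

```python
# Exclude measurements above the top inner fence, TIF = Q3 + 1.5 * IQR.
def exclude_outliers(measurements):
    ordered = sorted(measurements)
    def q(p):
        return ordered[round(p * (len(ordered) - 1))]
    q1, q3 = q(0.25), q(0.75)
    top_inner_fence = q3 + 1.5 * (q3 - q1)
    return [m for m in ordered if m <= top_inner_fence]

exclude_outliers([10, 10, 11, 11, 12, 13, 95])  # the 95 outlier is dropped
```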
Since the meaning of some columns has changed but their overall number remained the same, let's include the header in the CSV log to make it clear that we are now reporting MIN, Q1, MEDIAN, Q3, MAX, MAX_RSS, instead of the old MIN, MAX, MEAN, SD, MEDIAN, MAX_RSS format.
* Lowered the threshold for a healthy benchmark runtime to under 1000 μs.
* Offer a suitable divisor that is a power of 10, in addition to the one that's a power of 2 (see the sketch after this list).
* Expanded the motivation in the docstring.
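Roughly, the suggestion picks the smallest power of each base that brings the runtime under the threshold. This is a sketch with assumed names, not BenchmarkDoctor's actual code:

```python
import math

# Suggest power-of-2 and power-of-10 workload divisors for an overlong runtime.
def suggested_divisors(runtime_us, threshold_us=1000):
    factor = runtime_us / threshold_us  # how many times over the threshold
    return (2 ** math.ceil(math.log2(factor)),
            10 ** math.ceil(math.log10(factor)))

suggested_divisors(2500)  # -> (4, 10)
```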
Add a check against unreasonably long setup times for benchmarks that do their initialization work in the `setUpFunction`. Given that a typical benchmark measurement lasts about 1 second, it's reasonable to expect the setup to take at most 20% extra on top of that: 200 ms.

The `DictionaryKeysContains*` benchmarks are an instance of this mistake. The setup of `DictionaryKeysContainsNative` takes 3 seconds on my machine to prepare a dictionary for the run function, whose typical runtime is 90 μs. The setup of the Cocoa version takes 8 seconds! It is trivial to rewrite these with much smaller dictionaries that demonstrate the point of these benchmarks perfectly well, without the need to wait for ages to set up these benchmarks.
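A minimal sketch of such a rule, assuming the setup time parsed from the --verbose output is available in microseconds (the 200 ms limit comes from the commit message above; names are illustrative, not BenchmarkDoctor's actual API):

```python
SETUP_OVERHEAD_LIMIT_US = 200_000  # 20% of a ~1 second measurement

def check_setup_time(name, setup_us):
    # Flag benchmarks whose setUpFunction dwarfs the measurement itself.
    if setup_us > SETUP_OVERHEAD_LIMIT_US:
        print("error: '%s' setup took %d us; shrink the fixture so that setup "
              "stays under %d us." % (name, setup_us, SETUP_OVERHEAD_LIMIT_US))
```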
@palimondo (Collaborator, Author) commented Oct 16, 2018

@eeckstein Please review. Please also run the full test, to exercise the benchmark tests (if they're not part of the smoke test yet...).

@palimondo (Collaborator, Author) commented:

I'm working on another PR that should land in sync with this one, since this changes the measurement methodology used in Benchmark_Driver to always measure with --num-iters=1, and benchmarks that exhibit setup overhead would report a different minimum value because the overhead is no longer amortized over multiple iterations.

@eeckstein (Member) commented:

@swift-ci benchmark

@swift-ci (Collaborator) commented:

Build comment file:

No performance and code size changes

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@eeckstein (Member) commented:

@swift-ci smoke test

@palimondo (Collaborator, Author) commented:

@eeckstein I've pushed the fix for the linter nitpicks that stopped the last build. Please re-test. (Are benchmark tests now part of the smoke test, too, or should we run the full test as we did before?)

@eeckstein (Member) commented:

@swift-ci smoke test

1 similar comment
@eeckstein (Member) commented:

@swift-ci smoke test
