
[benchmark] Driver Improvements #26303

Merged
Merged 4 commits into apple:master on Jul 24, 2019

Conversation

palimondo
Collaborator

@palimondo palimondo commented Jul 23, 2019

This PR adds 3 features to the benchmark driver:

  1. --min-samples option to specify the minimum number of samples to gather.
  2. --meta option to log measurement metadata in the benchmark summary (along with corresponding support for parsing such log format):
    • number of memory pages used,
    • number of involuntary context switches,
    • number of voluntary yields.
  3. Support for running benchmarks using substring filters:
    • Positional arguments prefixed with a single + or - sign are interpreted as benchmark name filters. Executes all benchmarks whose names include any of the strings prefixed with a plus sign but none of the strings prefixed with a minus sign.

Example:

$ ./Benchmark_O --num-iters=1 --quantile=4 --meta +Angry -Small
#,TEST,SAMPLES,MIN(μs),Q1(μs),Q2(μs),Q3(μs),MAX(μs),PAGES,ICS,YIELD
2,AngryPhonebook,200,6762,6846,6846,6881,17178,11,161,22
3,AngryPhonebook.ASCII,200,1840,1859,1866,1885,2929,18,355,50
5,AngryPhonebook.Armenian,200,678,684,690,697,1290,46,384,15
7,AngryPhonebook.Cyrillic,200,668,674,682,701,1804,53,552,15
9,AngryPhonebook.Strasse,200,610,612,621,628,4360,56,632,14

Total performance tests executed: 5
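The filter semantics in the example above can be sketched as follows (an illustrative Python sketch, not the actual Swift driver code; the function names `parse_filters` and `matches` are hypothetical):

```python
def parse_filters(args):
    """Split positional args like ['+Angry', '-Small'] into include/exclude substrings."""
    include = [a[1:] for a in args if a.startswith('+')]
    exclude = [a[1:] for a in args if a.startswith('-')]
    return include, exclude

def matches(name, include, exclude):
    """A benchmark runs if its name contains any '+' substring and no '-' substring."""
    if any(e in name for e in exclude):
        return False
    return not include or any(i in name for i in include)

benchmarks = ["AngryPhonebook", "AngryPhonebook.Small.ASCII",
              "AngryPhonebook.Strasse", "Histogram"]
inc, exc = parse_filters(["+Angry", "-Small"])
print([b for b in benchmarks if matches(b, inc, exc)])
# → ['AngryPhonebook', 'AngryPhonebook.Strasse']
```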

Motivation

The support for running benchmarks using substring filter is meant as an alternative to running benchmarks using tags (and skip-tags). Maybe if we consistently applied the benchmark naming convention to the whole Swift Benchmark Suite, the tags could be retired? Let's see how we'll like that in practice…

I'd like to further experiment with the measurement process to find a better balance between runtime and robustness of measurement by doing parallel benchmarking. The --meta option surfaces the key data about the quality of measurement environment, without the need to resort to the much more expensive --verbose mode. The --min-samples controls the minimal sample size for a proper statistical analysis, so that we could rely more on --sample-time (when measuring with --num-samples=1), without needing to set it too high in order to get enough data from the worst cases (slow benchmarks).

Best reviewed by individual commits.

Support for gathering a minimum number of samples per benchmark, using the optional `--min-samples` argument, which overrides the automatically computed number of samples per `sample-time` when that number is lower.
Added support for running benchmarks using substring filters. Positional arguments prefixed with a single + or - sign are interpreted as benchmark name filters.

Executes all benchmarks whose names include any of the strings prefixed with a plus sign but none of the strings prefixed with a minus sign.
Added --meta option to log measurement metadata:

* PAGES – number of memory pages used
* ICS – number of involuntary context switches
* YIELD – number of voluntary yields

(Pages and ICS were previously available only in --verbose mode.)
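Parsing the summary log with the extra metadata columns can be sketched like this (illustrative only, using a fragment of the example output above; this is not the driver's actual log parser):

```python
import csv
import io

log = """#,TEST,SAMPLES,MIN(μs),Q1(μs),Q2(μs),Q3(μs),MAX(μs),PAGES,ICS,YIELD
2,AngryPhonebook,200,6762,6846,6846,6881,17178,11,161,22
3,AngryPhonebook.ASCII,200,1840,1859,1866,1885,2929,18,355,50"""

# The summary is plain CSV; the first row supplies the column names.
rows = list(csv.DictReader(io.StringIO(log)))
for r in rows:
    print(r["TEST"], r["MIN(μs)"], r["PAGES"], r["ICS"], r["YIELD"])
# → AngryPhonebook 6762 11 161 22
# → AngryPhonebook.ASCII 1840 18 355 50
```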
@palimondo palimondo requested a review from eeckstein July 23, 2019 18:43
@palimondo
Collaborator Author

@swift-ci please benchmark

@palimondo
Collaborator Author

@swift-ci please test

@swift-ci
Collaborator

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 199 +28.4% 0.78x (?)
Dictionary4OfObjects 197 234 +18.8% 0.84x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4267 3672 -13.9% 1.16x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
DriverUtils.o 140129 155317 +10.8% 0.90x

Performance: -Osize

Regression OLD NEW DELTA RATIO
SortLettersInPlace 393 423 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
MapReduceShortString 13 12 -7.7% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
DriverUtils.o 120610 134806 +11.8% 0.89x

Performance: -Onone

Code size: -swiftlibs

How to read the data: the tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@palimondo
Collaborator Author

@swift-ci please benchmark

@palimondo
Collaborator Author

palimondo commented Jul 24, 2019

I’m re-running the benchmarks to see whether the reported false changes are stable across runs or random (I hope the latter). I suspect the instability is caused by the small sample size, since run_smoke_bench is in the end doing just --num-samples=10. My working theory is that this is too small a sample, hence the need to gather larger ones. It’s possible some benchmarks do not reach a steady performance state during such brief sampling.
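The small-sample theory can be illustrated with a toy simulation (illustrative only; the exponential noise model and the numbers are assumptions, not measurements from the suite):

```python
import random
import statistics

random.seed(42)

def measure(n):
    # Simulate n noisy latency samples around a true minimum of 100μs
    # and report the MIN estimate, as the benchmark summary does.
    return min(100 + random.expovariate(1 / 20) for _ in range(n))

# Repeat the whole measurement many times and compare how stable
# the MIN estimate is for small vs. large sample sizes.
small = [measure(10) for _ in range(100)]
large = [measure(200) for _ in range(100)]
print(f"n=10:  stdev of MIN = {statistics.stdev(small):.2f}")
print(f"n=200: stdev of MIN = {statistics.stdev(large):.2f}")
```

With only 10 samples per run, the reported minimum wanders far more between runs than with 200, which is the kind of run-to-run instability hypothesized above.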

@swift-ci
Collaborator

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 199 +28.4% 0.78x (?)
Dictionary4OfObjects 198 235 +18.7% 0.84x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4011 3719 -7.3% 1.08x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
DriverUtils.o 140129 155317 +10.8% 0.90x

Performance: -Osize

Code size: -Osize

Regression OLD NEW DELTA RATIO
DriverUtils.o 120610 134806 +11.8% 0.89x

Performance: -Onone

Code size: -swiftlibs


@palimondo
Collaborator Author

Hrm… stable change. @eeckstein, do you have some theory that could explain this?

@eeckstein
Member

Some benchmarks are just unstable. We should consider disabling them.

@palimondo
Collaborator Author

I have #20552 open that would address FlattenListFlatMap, FWIW. But I have a hard time accepting the “some benchmarks are just unstable” thesis without understanding the reason behind it. I’ll try to examine Dictionary4 in more detail once I’m back from a dog walk. (You’re responding at an atypical time for you… vacation?)

@palimondo palimondo merged commit 4c45b51 into apple:master Jul 24, 2019
@palimondo
Collaborator Author

Looking at the stability of benchmark results during measurement, I would not call Dictionary4 unstable in any way, but it seems to pop up from time to time in CI reports: most recently here above, then 16 days ago, 20 days ago, and before that back in May...

@lorentey, maybe you have some idea to explain the 23–26% jumps in Dictionary4 performance? It seems to be triggered by some code alignment shift when the binary size changes. The benchmark seems to stress the automatic Hashable conformance of a large key.
