
[benchmark] Driver Improvements #26303

Merged
Merged 4 commits into apple:master on Jul 24, 2019

Conversation

palimondo
Collaborator

@palimondo palimondo commented Jul 23, 2019

This PR adds 3 features to the benchmark driver:

  1. --min-samples option to specify the minimum number of samples to gather.
  2. --meta option to log measurement metadata in the benchmark summary (along with corresponding support for parsing such log format):
    • number of memory pages used,
    • number of involuntary context switches,
    • number of voluntary yields.
  3. Support for running benchmarks using substring filters:
    • Positional arguments prefixed with a single + or - sign are interpreted as benchmark name filters. Executes all benchmarks whose names include any of the strings prefixed with a plus sign but none of the strings prefixed with a minus sign.

Example:

$ ./Benchmark_O --num-iters=1 --quantile=4 --meta +Angry -Small
#,TEST,SAMPLES,MIN(μs),Q1(μs),Q2(μs),Q3(μs),MAX(μs),PAGES,ICS,YIELD
2,AngryPhonebook,200,6762,6846,6846,6881,17178,11,161,22
3,AngryPhonebook.ASCII,200,1840,1859,1866,1885,2929,18,355,50
5,AngryPhonebook.Armenian,200,678,684,690,697,1290,46,384,15
7,AngryPhonebook.Cyrillic,200,668,674,682,701,1804,53,552,15
9,AngryPhonebook.Strasse,200,610,612,621,628,4360,56,632,14

Total performance tests executed: 5
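The filter semantics in the example above can be sketched as follows (an illustrative Python sketch, not the actual Swift driver code; the function names `parse_filters` and `matches` are hypothetical):

```python
def parse_filters(args):
    """Split positional args like ['+Angry', '-Small'] into include/exclude substrings."""
    include = [a[1:] for a in args if a.startswith('+')]
    exclude = [a[1:] for a in args if a.startswith('-')]
    return include, exclude

def matches(name, include, exclude):
    """A benchmark runs if its name contains any '+' substring and no '-' substring."""
    if any(e in name for e in exclude):
        return False
    return not include or any(i in name for i in include)

benchmarks = ["AngryPhonebook", "AngryPhonebook.Small.ASCII",
              "AngryPhonebook.Strasse", "Histogram"]
inc, exc = parse_filters(["+Angry", "-Small"])
print([b for b in benchmarks if matches(b, inc, exc)])
# → ['AngryPhonebook', 'AngryPhonebook.Strasse']
```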

Motivation

The support for running benchmarks using substring filter is meant as an alternative to running benchmarks using tags (and skip-tags). Maybe if we consistently applied the benchmark naming convention to the whole Swift Benchmark Suite, the tags could be retired? Let's see how we'll like that in practice…

I'd like to further experiment with the measurement process to find a better balance between runtime and robustness of measurement by doing parallel benchmarking. The --meta option surfaces the key data about the quality of measurement environment, without the need to resort to the much more expensive --verbose mode. The --min-samples controls the minimal sample size for a proper statistical analysis, so that we could rely more on --sample-time (when measuring with --num-samples=1), without needing to set it too high in order to get enough data from the worst cases (slow benchmarks).

Best reviewed by individual commits.

Support for gathering a minimum number of samples per benchmark, using the optional `--min-samples` argument, which overrides the automatically computed number of samples per `sample-time` when that number is lower.
Added support for running benchmarks using substring filters. Positional arguments prefixed with a single + or - sign are interpreted as benchmark name filters.

Executes all benchmarks whose names include any of the strings prefixed with a plus sign but none of the strings prefixed with a minus sign.
Added --meta option to log measurement metadata:

* PAGES – number of memory pages used
* ICS – number of involuntary context switches
* YIELD – number of voluntary yields

(Pages and ICS were previously available only in --verbose mode.)
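Parsing the summary log with the extra metadata columns can be sketched like this (illustrative only, using a fragment of the example output above; this is not the driver's actual log parser):

```python
import csv
import io

log = """#,TEST,SAMPLES,MIN(μs),Q1(μs),Q2(μs),Q3(μs),MAX(μs),PAGES,ICS,YIELD
2,AngryPhonebook,200,6762,6846,6846,6881,17178,11,161,22
3,AngryPhonebook.ASCII,200,1840,1859,1866,1885,2929,18,355,50"""

# The summary is plain CSV; the first row supplies the column names.
rows = list(csv.DictReader(io.StringIO(log)))
for r in rows:
    print(r["TEST"], r["MIN(μs)"], r["PAGES"], r["ICS"], r["YIELD"])
# → AngryPhonebook 6762 11 161 22
# → AngryPhonebook.ASCII 1840 18 355 50
```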
@palimondo palimondo requested a review from eeckstein July 23, 2019 18:43
@palimondo
Collaborator Author

@swift-ci please benchmark

@palimondo
Collaborator Author

@swift-ci please test

@swift-ci
Collaborator

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 199 +28.4% 0.78x (?)
Dictionary4OfObjects 197 234 +18.8% 0.84x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4267 3672 -13.9% 1.16x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
DriverUtils.o 140129 155317 +10.8% 0.90x

Performance: -Osize

Regression OLD NEW DELTA RATIO
SortLettersInPlace 393 423 +7.6% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
MapReduceShortString 13 12 -7.7% 1.08x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
DriverUtils.o 120610 134806 +11.8% 0.89x

Performance: -Onone

Code size: -swiftlibs

How to read the data: the tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@palimondo
Collaborator Author

@swift-ci please benchmark

@palimondo
Collaborator Author

palimondo commented Jul 24, 2019

I’m re-running the benchmarks to see whether the reported false changes are stable across runs or random (I hope the latter). I suspect the instability is caused by the small sample size, since run_smoke_bench is in the end doing just --num-samples=10. My working theory is that this is too small a sample, hence the need to gather larger ones. It’s possible some benchmarks do not reach a steady performance state during such brief sampling.
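The small-sample theory can be illustrated with a toy simulation (illustrative only; the exponential noise model and the numbers are assumptions, not measurements from the suite):

```python
import random
import statistics

random.seed(42)

def measure(n):
    # Simulate n noisy latency samples around a true minimum of 100μs
    # and report the MIN estimate, as the benchmark summary does.
    return min(100 + random.expovariate(1 / 20) for _ in range(n))

# Repeat the whole measurement many times and compare how stable
# the MIN estimate is for small vs. large sample sizes.
small = [measure(10) for _ in range(100)]
large = [measure(200) for _ in range(100)]
print(f"n=10:  stdev of MIN = {statistics.stdev(small):.2f}")
print(f"n=200: stdev of MIN = {statistics.stdev(large):.2f}")
```

With only 10 samples per run, the reported minimum wanders far more between runs than with 200, which is the kind of run-to-run instability hypothesized above.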

@swift-ci
Collaborator

Performance: -O

Regression OLD NEW DELTA RATIO
Dictionary4 155 199 +28.4% 0.78x (?)
Dictionary4OfObjects 198 235 +18.7% 0.84x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4011 3719 -7.3% 1.08x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
DriverUtils.o 140129 155317 +10.8% 0.90x

Performance: -Osize

Code size: -Osize

Regression OLD NEW DELTA RATIO
DriverUtils.o 120610 134806 +11.8% 0.89x

Performance: -Onone

Code size: -swiftlibs


@palimondo
Collaborator Author

Hrm… stable change. @eeckstein, do you have some theory that could explain this?

@eeckstein
Member

Some benchmarks are just unstable. We should consider disabling them.

@palimondo
Collaborator Author

I have #20552 open that would address FlattenListFlatMap, FWIW. But I have a hard time accepting the “some benchmarks are just unstable” thesis without understanding the reason behind it. I’ll try to examine Dictionary4 in more detail once I’m back from a dog walk. (You’re responding at an atypical time for you… vacation?)

@palimondo palimondo merged commit 4c45b51 into apple:master Jul 24, 2019
@palimondo
Collaborator Author

Looking at the stability of benchmark results during measurement, I would not call Dictionary4 unstable in any way, but it seems to pop up from time to time in CI reports: most recently here above, then 16 days ago, 20 days ago, and before that back in May...

@lorentey, maybe you have some idea to explain the 23–26% jumps in Dictionary4 performance? It seems to be triggered by some code alignment shift when the binary size changes. The benchmark seems to stress the automatic Hashable conformance of a large key.
