Benchmark statistics #493

pv · 2016-12-09T22:16:32Z

Record more benchmark samples, and compute, display, and use statistics based on them.

- Change default settings to record more (but shorter) samples. ~~Changes meaning of goal_time~~ Makes determination of goal_time more accurate; however, there are no big changes to methodology, so there's room to improve here.
- Do a warmup before the benchmark. This also seems to matter on CPython (although possibly not due to CPython itself but to some OS/CPU effects).
- Display the spread of measurements (median +/- half of the interquartile range).
- Estimate a confidence interval and use it to decide what is not significant in asv compare.
- Optionally, save samples to the JSON files.
- ~~Switch to gzipped files.~~
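
As an illustration of the "median +/- half of interquartile range" display, here is a minimal sketch using only the standard library. The function name `summarize` and the choice of the "inclusive" quantile method are assumptions made for this example; they are not taken from the asv code.

```python
import statistics


def summarize(samples):
    """Return (median, half of the interquartile range) of timing samples.

    Hypothetical helper for illustration only, not the asv implementation.
    """
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q2, (q3 - q1) / 2


if __name__ == "__main__":
    timings = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up sample values, in seconds
    median, spread = summarize(timings)
    print(f"{median:.3g} +/- {spread:.3g}")
```

The median and interquartile range are less sensitive to occasional outlier samples than the mean and standard deviation, which is one reason to prefer them for timing data.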

The statistical confidence estimates are a somewhat tricky point, because timing samples usually have strong autocorrelation (multimodality, stepwise changes in location, etc.), which often makes simple approaches misleading. The current code mitigates this somewhat: it tries to regress the timing sample time series looking for steps, and adds those to the CI. Not rigorous, but probably better than nothing.

The problem is that there's low-frequency noise in the measurement, so measuring for a couple of seconds does not give a good idea of the full distribution.
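
To make the idea of "looking for steps" concrete, the following is a rough sketch of fitting a single step (two constant levels) to a series of timing samples by brute force. It is a simplified stand-in, not the regression used in this PR; the name `best_single_step` and the least-squares criterion are assumptions for the example.

```python
import statistics


def best_single_step(samples):
    """Fit two constant levels to the series and return (split_index, step_size).

    Tries every split point and keeps the one with the smallest total squared
    error. Hypothetical helper for illustration; returns (0, 0.0) when there
    are fewer than two samples.
    """
    n = len(samples)
    if n < 2:
        return 0, 0.0
    best = None
    for k in range(1, n):
        left, right = samples[:k], samples[k:]
        mean_left = statistics.fmean(left)
        mean_right = statistics.fmean(right)
        err = (sum((x - mean_left) ** 2 for x in left)
               + sum((x - mean_right) ** 2 for x in right))
        if best is None or err < best[0]:
            best = (err, k, abs(mean_right - mean_left))
    return best[1], best[2]


if __name__ == "__main__":
    # Made-up series whose level jumps from 1.0 to 2.0 halfway through.
    series = [1.0] * 5 + [2.0] * 5
    print(best_single_step(series))
```

A detected step size could then be added to the width of a naively computed interval, so that a series that jumps between levels is reported with a wider uncertainty than its within-level scatter alone would suggest.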

todo:

- more tests, e.g. printing and formatting are mostly untested
- do we really need to gzip?
- update documentation
- tuning for PyPy

coveralls · 2016-12-09T22:44:31Z

Coverage increased (+0.7%) to 72.988% when pulling f953513 on pv:timing-estimator6 into cfe3d3e on spacetelescope:master.

coveralls · 2016-12-09T23:46:52Z

Coverage increased (+0.6%) to 72.971% when pulling bd4ddca on pv:timing-estimator6 into cfe3d3e on spacetelescope:master.

coveralls · 2016-12-10T00:10:58Z

Coverage increased (+0.6%) to 72.971% when pulling bd4ddca on pv:timing-estimator6 into cfe3d3e on spacetelescope:master.

coveralls · 2016-12-10T14:36:47Z

Coverage increased (+0.6%) to 72.941% when pulling 364e0ed on pv:timing-estimator6 into cfe3d3e on spacetelescope:master.

wrwrwr · 2016-12-12T12:19:59Z

Changing goal_time to mark a single repeat time is nice, but do I read it right that setting repeat = 5 will take 5 samples if number is fixed and 50 samples if it is left at zero? Why not make the number of samples equal to repeat in the adaptive case too (with a single setup per sample and a higher default)?

pv · 2016-12-12T12:49:59Z

Yes, I've got something like that in the works. This is partly orthogonal to this PR, however. The benchmark methodology may still need tuning. One issue faced here that perf does not face is that benchmarks that run later run under different CPU/system load/heating conditions than benchmarks that run earlier. So the sampling probably should use multiple processes, and be done in A B C ... A' B' C' ... A'' B'' C'' ... order.
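
The A B C ... A' B' C' ... ordering can be sketched as a schedule that goes through all benchmarks once per round, instead of finishing all samples of one benchmark before starting the next. The function name `interleaved_schedule` and the benchmark names are hypothetical; this is not code from asv.

```python
def interleaved_schedule(benchmarks, rounds):
    """Return (benchmark, round) pairs in A B C ... A' B' C' ... order.

    Hypothetical helper for illustration only.
    """
    # The outer loop is over rounds, so each benchmark is sampled once
    # per round and its samples are spread over the whole run.
    return [(name, r) for r in range(rounds) for name in benchmarks]


if __name__ == "__main__":
    for name, r in interleaved_schedule(["A", "B", "C"], rounds=3):
        print(name, r)
```

With this ordering, slow drifts in machine state affect all benchmarks in a similar way, rather than only the ones that happen to run late.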

wrwrwr · 2016-12-12T13:41:30Z

Interleaving benchmarks would surely be good; comparability across separate import benchmarks was something I was aiming for. The whole samples/repeats logic would probably need to be moved up to realize that, though (with a "sample" rather than a "benchmark" script?).

coveralls · 2016-12-16T23:17:24Z

Coverage increased (+0.7%) to 73.009% when pulling 8c41b73 on pv:timing-estimator6 into 390145a on spacetelescope:master.

coveralls · 2016-12-17T00:04:30Z

Coverage increased (+0.7%) to 72.991% when pulling 8c8fbe0 on pv:timing-estimator6 into 390145a on spacetelescope:master.

coveralls · 2016-12-17T19:32:18Z

Coverage increased (+0.9%) to 73.191% when pulling 2c5b78e on pv:timing-estimator6 into 390145a on spacetelescope:master.

pv · 2016-12-17T23:17:08Z

I think this is more or less finished now. Some fine-tuning may be useful to do later on.

The changes in benchmarking parameter default values may be annoying, but these should give better accuracy. The default runtime also becomes shorter. Big improvements in accuracy will probably need things such as splitting the benchmark runs into several parts, run in an interleaved order, to ensure sampling time spans long enough to capture the low-frequency noise properly. https://github.com/pv/asv/tree/many-proc

coveralls · 2016-12-17T23:29:02Z

Coverage increased (+0.9%) to 73.203% when pulling 26a42ba on pv:timing-estimator6 into 390145a on spacetelescope:master.

coveralls · 2016-12-18T17:21:22Z

Coverage increased (+0.9%) to 73.187% when pulling fc2045d on pv:timing-estimator6 into 390145a on spacetelescope:master.

Use a more uniform API format for parameterized and non-parameterized benchmark data (non-parameterized ~ parameterized with 1 combination). Move the API access to methods, to decouple the API from the file format. Also, the accessors now do the work of compatible_results, so that it doesn't have to be done explicitly.

coveralls · 2017-02-11T22:47:19Z

Coverage increased (+0.9%) to 73.233% when pulling 347bb53 on pv:timing-estimator6 into e581104 on spacetelescope:master.

coveralls · 2017-02-18T16:56:28Z

Coverage increased (+0.9%) to 73.253% when pulling 50ddeb8 on pv:timing-estimator6 into e581104 on spacetelescope:master.

coveralls · 2017-02-18T17:42:04Z

Coverage increased (+0.9%) to 73.268% when pulling 10fdd49 on pv:timing-estimator6 into e581104 on spacetelescope:master.

pv · 2017-02-18T19:08:45Z

After some dogfooding, I think this is ready to go.

pv mentioned this pull request Dec 9, 2016

Measure statistics of benchmark results #486

Closed

pv force-pushed the timing-estimator6 branch 2 times, most recently from 6500cb3 to bd4ddca Compare December 9, 2016 23:37

pv force-pushed the timing-estimator6 branch from bd4ddca to 364e0ed Compare December 10, 2016 14:04

pv force-pushed the timing-estimator6 branch 2 times, most recently from 76012d6 to 8c41b73 Compare December 16, 2016 22:38

pv force-pushed the timing-estimator6 branch from 8c41b73 to 8c8fbe0 Compare December 16, 2016 23:33

pv force-pushed the timing-estimator6 branch from 8c8fbe0 to 2c5b78e Compare December 17, 2016 19:00

pv changed the title ~~WIP: Benchmark statistics~~ Benchmark statistics Dec 17, 2016

pv force-pushed the timing-estimator6 branch from 2c5b78e to 26a42ba Compare December 17, 2016 22:57

pv added 7 commits February 11, 2017 22:30

Add benchmark result analysis routines

5d92049

Implement formatting +/- errors + showing significant digits

06dd018

Record samples in timing benchmarks, analyze in main process + adjust…

db5cc30

… methodology Makes `goal_time` to be more accurate, and changes the default values for `repeat` and `goal_time`. Adds `warmup_time`.

Move mean_na, is_na, etc to asv.util

0e013c4

Fix up asv compare/continuous to make use of statistics

96094eb

Make recording measurement samples optional in asv run

5225b45

pv added 4 commits February 11, 2017 22:30

Add --quick and --record-samples to asv continuous

e1781a0

Update documentation vs. error estimates

8c8c853

Fix issue in test

9fb9bda

The number of repeats may be smaller if too_slow() condition was encountered.

Fix documentation of 'repeat' parameter

347bb53

pv force-pushed the timing-estimator6 branch from fc2045d to 347bb53 Compare February 11, 2017 22:01

Take timeout into account in early termination of sampling

50ddeb8

TST: relax check of sample numbers

10fdd49

pv merged commit f3d7935 into airspeed-velocity:master Feb 18, 2017

astrojuanlu mentioned this pull request Jul 15, 2018

Big time difference for benchmarks that require high warmup #677

Closed

pv deleted the timing-estimator6 branch August 19, 2018 20:07
