Consider using the trivial `id` benchmark on an example output-type value as a baseline or lower measurement sanity bound #77

Open
jberryman opened this Issue Mar 5, 2015 · 2 comments


@jberryman

I just started experimenting with a tiny function that does some bit twiddling (compiled to a handful of instructions, ideally). It looks like

innerLoop :: Word8 -> Word32

Then for testing purposes I have hand-unrolled variants that just do the same operation 4 times, 8 times, etc.

innerLoop4 :: (Word8, Word8, Word8, Word8) -> Word32  -- and so on
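
For concreteness, here is a hypothetical sketch of the shape of these functions. The issue doesn't include the actual bit twiddling, so the bodies below are invented stand-ins for "a handful of instructions":

import Data.Word (Word8, Word32)
import Data.Bits (shiftL, xor)

-- Hypothetical body (the real bit twiddling isn't shown in the issue).
innerLoop :: Word8 -> Word32
innerLoop w = (fromIntegral w `shiftL` 3) `xor` 0x9E3779B9

-- Hand-unrolled variant: the same operation applied four times.
innerLoop4 :: (Word8, Word8, Word8, Word8) -> Word32
innerLoop4 (a, b, c, d) =
  innerLoop a `xor` innerLoop b `xor` innerLoop c `xor` innerLoop d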

A test run gives output something like:

         innerLoop  =  9.37
         innerLoop4 = 12.66
         innerLoop8 = 17.83

So it looks like there's overhead somewhere that's not from the precise code I'm interested in. Obviously at this scale, that's not unexpected. But I was curious whether I could measure this overhead directly, so I added an `id` benchmark on a value of the expected output type of the functions I was testing:

  -- `id` might work too, or maybe some rewriting will happen that we don't want:
, bench "baseline" $ nf (\x-> x) (0 :: Word32) 
]
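
For context, a minimal sketch of the full bench list this fragment could sit in, assuming the hypothetical innerLoop definitions sketched above:

import Criterion.Main (bench, defaultMain, nf)
import Data.Word (Word32)

main :: IO ()
main = defaultMain
  [ bench "innerLoop"  $ nf innerLoop  0xAB
  , bench "innerLoop4" $ nf innerLoop4 (0xAB, 0xCD, 0xEF, 0x01)
    -- `id` might work too, or maybe some rewriting will happen that we don't want:
  , bench "baseline"   $ nf (\x -> x) (0 :: Word32)
  ]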

Which gave:

benchmarking baseline
time                 9.494 ns   (9.471 ns .. 9.519 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.495 ns   (9.464 ns .. 9.536 ns)
std dev              121.5 ps   (99.28 ps .. 163.7 ps)
variance introduced by outliers: 15% (moderately inflated)

If I subtract this measured "overhead" from my other measurements (posted above), the results seem sensible: they're in line with what I'd expect from the compiled code (I haven't looked at the assembly yet), and the time increases roughly linearly (innerLoop4: 12.66 - 9.49 ≈ 3.2; innerLoop8: 17.83 - 9.49 ≈ 8.3, which is roughly double).

I then tried a baseline benchmark on `()`, like `bench "baseline" $ nf (\x -> x) ()`, which was a little faster, suggesting this is measuring the overhead of `nf` itself rather than just, say, the cost of a function call per se:

benchmarking baseline()
time                 6.781 ns   (6.753 ns .. 6.818 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 6.810 ns   (6.774 ns .. 6.864 ns)
std dev              141.1 ps   (103.7 ps .. 202.2 ps)
variance introduced by outliers: 33% (moderately inflated)

So my proposal: I wonder if it makes sense for criterion to do some or all of the following (a rough user-level sketch of point 1 follows the list):

  1. for each `bench` line, require the user to provide an example value of the result type (or use a Default type class or something), and run the `id` benchmark on that
  2. use the resulting value as a baseline for the other measurements
  3. complain when the measured time is too close to the baseline to be meaningful (e.g. as in my innerLoop above)
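
As a rough sketch of point 1 at the user level: benchWithBaseline below is an invented name, not part of criterion's API, and innerLoop is the hypothetical function from earlier.

import Criterion.Main (Benchmark, bench, bgroup, defaultMain, nf)
import Control.DeepSeq (NFData)
import Data.Word (Word32)

-- Hypothetical helper, not part of criterion: pair a benchmark with an
-- `id` baseline on a user-supplied example value of the result type.
benchWithBaseline :: NFData b => String -> (a -> b) -> a -> b -> Benchmark
benchWithBaseline name f x example = bgroup name
  [ bench "real"     $ nf f x
  , bench "baseline" $ nf (\v -> v) example
  ]

main :: IO ()
main = defaultMain
  [ benchWithBaseline "innerLoop" innerLoop 0xAB (0 :: Word32) ]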

I wonder whether it's even possible for criterion to do this internally, i.e. whether this overhead can only be measured "accurately" in the way shown above. If that's the case, it would still be possible to use Template Haskell: a new defaultMain-style splice that inserts the extra baseline stanzas (and then processes the output, combining each benchmark with its baseline).

@jberryman jberryman changed the title several times (minor rewordings, settling on the current one) Mar 5, 2015
@jberryman

(sorry about the title changes; I'm not having a stroke, I promise)

@ivanperez-keera

It would be nice to parameterise defaultMain over a result-processing function that determines how the results are actually calculated and what gets reported.

Use cases:

  • Introduce base benchmarks whose running times can be subtracted, to get more reliable information.
  • Divide results by sample size.
  • Compare semantically equivalent functions, tagging one as X% faster/slower.
  • Equalize units, to facilitate visual identification of faster/slower implementations.
  • Output data in a machine-readable format for post-analysis.
  • Load results from previous benchmarks to do regression benchmarking. To obtain more reliable data, this could be combined with a script (not part of criterion) that checks out two different commits (HEAD and HEAD^), runs the benchmarks on each, outputs the data in a machine-readable format, and compares the two.

I'm going to check what can already be done with the existing facilities (templates?). Otherwise, I'll fork to try and implement this feature.
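
For reference, a rough sketch of the kind of post-processing that already seems possible by calling criterion's benchmark' directly and inspecting the resulting Report. Module and field names are as in criterion 1.x (Criterion and Criterion.Types); the exact home of estPoint varies across criterion/statistics versions, and innerLoop is again the hypothetical function sketched earlier, so treat this as an illustration rather than code checked against a particular release.

import Criterion (benchmark', nf)
import Criterion.Types (Report(..), SampleAnalysis(..))
import Statistics.Types (estPoint)  -- location of estPoint varies by version
import Data.Word (Word32)

-- Extract the mean time estimate (in seconds) from a criterion Report,
-- so results can be post-processed by hand, e.g. subtracting a baseline.
meanTime :: Report -> Double
meanTime = estPoint . anMean . reportAnalysis

main :: IO ()
main = do
  base <- benchmark' (nf (\x -> x) (0 :: Word32))  -- the id-style baseline
  real <- benchmark' (nf innerLoop 0xAB)           -- hypothetical innerLoop from above
  putStrLn $ "baseline-adjusted mean: " ++ show (meanTime real - meanTime base) ++ " s"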
