
Add CPU time measurements to Benchmark.ips output #9846

Open · jgaskins wants to merge 1 commit into master

Conversation

jgaskins
Contributor

Wall-clock time is a fantastic approximation of scalability if the CPU is never yielded inside the block. Unfortunately, in measuring execution time for database queries, HTTP requests, or things like that, you need a lot of iterations over the block to understand how well it scales in terms of the application. Benchmark.ips gives us a really nice way to measure time spent per iteration without having to dial in the number of iterations ourselves, but we can't get CPU usage information from it.
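
For reference, the kind of data in question can be sampled in Crystal via Process.times; here is a minimal sketch (not this PR's actual implementation) of capturing wall-clock, user, and system time around a block:

# Minimal sketch: sample wall-clock and CPU time deltas around a block.
def measure_cpu
  wall_start = Time.monotonic
  cpu_start  = Process.times

  yield

  cpu_stop = Process.times
  wall     = Time.monotonic - wall_start

  {
    real: wall.total_seconds,
    user: cpu_stop.utime - cpu_start.utime,
    sys:  cpu_stop.stime - cpu_start.stime,
  }
end

# A sleeping block yields the CPU, so user/sys stay near zero while real time
# reflects the full wait.
puts measure_cpu { sleep 10.milliseconds }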

This PR adds that to the Benchmark.ips report:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
     raw 207.99  (  4.81ms real, 528.26µs user,   1.14ms sys) (± 2.80%)  2.34kB/op  20.77× slower
pipeline   4.32k (231.53µs real,  40.31µs user,  30.75µs sys) (± 0.98%)  21.6kB/op        fastest
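
For context, the bench file follows the usual Benchmark.ips shape; the block bodies below are stand-ins, since the real bench/pipeline_vs_raw.cr issues Redis commands one at a time ("raw") versus batched ("pipeline"):

require "benchmark"

# Stand-in bodies for illustration only; the real benchmark talks to Redis.
Benchmark.ips do |x|
  x.report("raw")      { 100.times { String.build { |io| io << "PING\r\n" } } }
  x.report("pipeline") { String.build { |io| 100.times { io << "PING\r\n" } } }
end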

This will probably need a little more work (for example, I used the simple form of humanize instead of the block form as was used elsewhere, which may not be ideal), but I'm curious what people's thoughts are before putting more time into this. 🙂

If folks are good with this, since we'd be outputting so much information here in a quasi-tabular format, would it make sense to make it an actual table?

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|---------|---------|----------|----------|----------|------------|--------------
     raw  | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest

@straight-shoota
Member

I like the idea 👍
However, I'm a bit sceptical of the amount of data this prints to the terminal. Having short and concise output is always preferred. So the question is whether the additional data is relevant enough in a large number of use cases to warrant more convoluted output.
Maybe this could be optional? Also the table formatting.

It's also worth considering that inserting additional time metrics at the proposed location disturbs the visual correlation between variance and the data it's based on (ips / real time).

@asterite
Member

Maybe this could be optional?

I also like this, but I think it should be enabled by passing an argument to the ips method.
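
One possible call-site shape for that (the cpu flag and the report bodies below are hypothetical, not something this PR defines, and this wouldn't compile against the current stdlib):

# Hypothetical opt-in flag next to the existing calculation/warmup arguments:
Benchmark.ips(warmup: 2, calculation: 5, cpu: true) do |x|
  x.report("raw")      { run_raw }       # run_raw / run_pipelined are placeholders
  x.report("pipeline") { run_pipelined }
end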

@Fryguy
Contributor

Fryguy commented Oct 30, 2020

Love the table formatting. If you could just slightly change it to left-align the data in the first column, it would also support markdown rendering, making it easy to copy/paste into a GitHub issue/PR.

Example:


Before:

Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|---------|---------|----------|----------|----------|------------|--------------
     raw  | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest
(Rendered as markdown on GitHub, this doesn't format cleanly: the header row becomes a table, but the right-padded data rows show up as plain text with literal pipes.)


After:

Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|---------|---------|----------|----------|----------|------------|--------------
raw       | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest
(Rendered as markdown on GitHub, this produces a proper table.)

If you wanted to keep the right alignment of the other columns:

Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|--------:|--------:|---------:|---------:|---------:|-----------:|-------------:
raw       | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest
(Rendered as markdown on GitHub, this produces a table with right-aligned columns.)

@bcardiff
Member

FYI, the Crystal::TablePrint helper might be helpful here.

@jgaskins
Contributor Author

Apologies for the delay in response. Been a wild few weeks.

@straight-shoota

So the question is, whether the additional data is relevant enough in a large number of use cases to warrant more convoluted output.

Whether it's useful in enough cases, I'm not sure. Anecdotally, I work primarily on distributed systems, so I/O is a huge component of what I do. For anything that deals with I/O (even to localhost), wall-clock time remains a good latency metric, but its usefulness for understanding throughput drops to very little as soon as concurrency comes into play.

With that said, though, I think it makes sense to keep ips as iterations per wall-clock second. It's still a good baseline for how long the person running the benchmark should expect to wait. Iterations per CPU second in an I/O-heavy scenario could take hours. 🙂

Can you expand on what you think is convoluted, though? I think we may have different interpretations of that word in mind.


@asterite

I think it should be enabled by passing an argument to the ips method.

I think it makes a lot of sense to put the distinction between user and system CPU time behind a flag since that's a very specific type of insight. But the CPU-time metric can be just as useful as the heap-allocations metric included in the output by default.

Part of this comes from a concern about the ways our tools influence how we measure. If we ignore CPU time by default, I wonder if anyone will ever look at it.


@bcardiff

FYI, the Crystal::TablePrint helper might be helpful here.

Whoa, there's already a thing for this?? If we end up going this route, would it make sense to document that helper/move it out of the Crystal namespace or keep it there since this is an internal thing?

@bcardiff
Member

It should remain internal. There are more feature-complete shards for others to use. But if you need to expand things to get better formatting or alignment, feel free to change that helper.

@straight-shoota
Member

Can you expand on what you think is convoluted, though? I think we may have different interpretations of that word in mind.

I mean that the number of data points per row in Benchmark.ips output nearly doubles from four to seven, and most users will probably be unsure what to do with such an overload of information.

Additionally, the line length grows enormously. In your examples it's 98 characters. The table format is a bit shorter because you can leave out the units, but not by much. With so much data, chances are the line will wrap if the terminal is not wide enough, and broken lines are going to look terrible. That may be acceptable if you really need the additional data, but I assume most use cases don't.

Bottom line, adding more measurements makes the output harder to read because it presents data most users likely didn't even ask for.

@jgaskins
Contributor Author

I mean that the number of data points per row in Benchmark.ips output nearly doubles from four to seven

Not quite, I only added 2 metrics. 🙂 In v0.35.1, Benchmark.ips shows 5. The two I added can be coalesced into a single metric for aggregate CPU time, though, because distinguishing between userland and kernel time is unlikely to be useful to most people. So it'd go from 5->6. Here's what that looks like:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
     raw 179.90  (  5.56ms real,   1.99ms cpu) (± 1.88%)  2.34kB/op  23.78× slower
pipeline   4.28k (233.72µs real,  72.00µs cpu) (± 0.83%)  21.6kB/op        fastest
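
The aggregation itself is straightforward wherever the samples come from; here is a sketch assuming Process.times deltas, which may or may not match this PR's internals:

# Sketch: collapse user + system deltas into one per-iteration CPU figure.
iterations = 1_000
cpu_start  = Process.times
iterations.times { Fiber.yield }   # stand-in for the measured block
cpu_stop   = Process.times

cpu = (cpu_stop.utime - cpu_start.utime) + (cpu_stop.stime - cpu_start.stime)
puts "cpu: #{cpu / iterations * 1_000_000} µs/op"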

Your point about line length is important and it's something I didn't even think about, so I appreciate you pointing that out. In this example, the lines are 84 characters. We could probably shave off a few by using 3 significant digits instead of 2 decimal places (because 5 significant digits isn't exactly useful) and trim down a couple other places. Here's a mockup for that:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
     raw 179.90  (5.56ms real/1.99ms cpu) (± 1.88%) 2.34kB/op  23.78× slower
pipeline   4.28k ( 233µs real/72.0µs cpu) (± 0.83%) 21.6kB/op        fastest

This comes in at 79 characters wide with an 8-character benchmark name. If we go with the table:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
Benchmark |     ips |   real |    CPU |  stddev |      heap |        result
----------|---------|--------|--------|---------|-----------|--------------
     raw  | 207.99  | 4.81ms |  528µs | ± 2.80% | 2.34kB/op | 20.77× slower
pipeline  |   4.32k |  231µs | 40.3µs | ± 0.98% | 21.6kB/op |       fastest

Surprisingly not much shorter. 😂 But it might be easier to read.

Bottom line, adding more measurements makes the output harder to read because it presents data most users likely didn't even ask for.

How many people asked for heap allocations in the output? Judging from the PR where it was added, it doesn't look like a lot. I'd be willing to bet most people who use it to tune their code regularly didn't even know getting the memory usage of a block could feasibly be done. I know I didn't, but now I use it all the time. 🙂

In the same way, I'm trying to give people something they need, rather than something they asked for. And I do think anyone trying to optimize any kind of code which performs I/O benefits from this, even if they don't realize it right now.

@straight-shoota
Member

I'm on board with adding only CPU time by default.

Maybe we can find a way to more unambiguously communicate that ips and stddev relate to real time measurement.

@straight-shoota added the kind:feature, pr:needs-work, and topic:stdlib labels on Jan 10, 2021