
Add CPU time measurements to Benchmark.ips output #9846

Open · jgaskins wants to merge 1 commit into master

Conversation

jgaskins
Contributor

Wall-clock time is a fantastic approximation of scalability if the CPU is never yielded inside the block. Unfortunately, in measuring execution time for database queries, HTTP requests, or things like that, you need a lot of iterations over the block to understand how well it scales in terms of the application. Benchmark.ips gives us a really nice way to measure time spent per iteration without having to dial in the number of iterations ourselves, but we can't get CPU usage information from it.
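
For reference, the kind of data in question can be sampled in Crystal via Process.times; here is a minimal sketch (not this PR's actual implementation) of capturing wall-clock, user, and system time around a block:

# Minimal sketch: sample wall-clock and CPU time deltas around a block.
def measure_cpu
  wall_start = Time.monotonic
  cpu_start  = Process.times

  yield

  cpu_stop = Process.times
  wall     = Time.monotonic - wall_start

  {
    real: wall.total_seconds,
    user: cpu_stop.utime - cpu_start.utime,
    sys:  cpu_stop.stime - cpu_start.stime,
  }
end

# A sleeping block yields the CPU, so user/sys stay near zero while real time
# reflects the full wait.
puts measure_cpu { sleep 10.milliseconds }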

This PR adds that to the Benchmark.ips report:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
     raw 207.99  (  4.81ms real, 528.26µs user,   1.14ms sys) (± 2.80%)  2.34kB/op  20.77× slower
pipeline   4.32k (231.53µs real,  40.31µs user,  30.75µs sys) (± 0.98%)  21.6kB/op        fastest
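
For context, the bench file follows the usual Benchmark.ips shape; the block bodies below are stand-ins, since the real bench/pipeline_vs_raw.cr issues Redis commands one at a time ("raw") versus batched ("pipeline"):

require "benchmark"

# Stand-in bodies for illustration only; the real benchmark talks to Redis.
Benchmark.ips do |x|
  x.report("raw")      { 100.times { String.build { |io| io << "PING\r\n" } } }
  x.report("pipeline") { String.build { |io| 100.times { io << "PING\r\n" } } }
end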

This will probably need a little more work (for example, I used the simple form of humanize instead of the block form as was used elsewhere, which may not be ideal), but I'm curious what people's thoughts are before putting more time into this. 🙂

If folks are good with this, since we'd be outputting so much information here in a quasi-tabular format, would it make sense to make it an actual table?

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|---------|---------|----------|----------|----------|------------|--------------
     raw  | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest

@straight-shoota
Member

I like the idea 👍
However, I'm a bit sceptical of the amount of data this prints to the terminal. Having short and concise output is always preferred. So the question is whether the additional data is relevant enough in a large number of use cases to warrant more convoluted output.
Maybe this could be optional? Also the table formatting.

It's also worth considering that inserting additional time metrics at the proposed location disturbs the visual correlation between variance and the data it's based on (ips / real time).

@asterite
Member

Maybe this could be optional?

I also like this, but I think it should be enabled by passing an argument to the ips method.
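
One possible call-site shape for that (the cpu flag and the report bodies below are hypothetical, not something this PR defines, and this wouldn't compile against the current stdlib):

# Hypothetical opt-in flag next to the existing calculation/warmup arguments:
Benchmark.ips(warmup: 2, calculation: 5, cpu: true) do |x|
  x.report("raw")      { run_raw }       # run_raw / run_pipelined are placeholders
  x.report("pipeline") { run_pipelined }
end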

@Fryguy
Contributor

Fryguy commented Oct 30, 2020

Love the table formatting. If you could just slightly change it to left-align the data in the first column, it would also support markdown rendering, making it easy to copy/paste into a GitHub issue/PR.

Example:


Before:

Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|---------|---------|----------|----------|----------|------------|--------------
     raw  | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest
(Rendered as markdown on GitHub, this doesn't format cleanly: the header row becomes a table, but the right-padded data rows show up as plain text with literal pipes.)


After:

Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|---------|---------|----------|----------|----------|------------|--------------
raw       | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest
(Rendered as markdown on GitHub, this produces a proper table.)

If you wanted to keep the right alignment of the other columns:

Benchmark |     ips |    real |     user |   system | variance |       heap |        result
----------|--------:|--------:|---------:|---------:|---------:|-----------:|-------------:
raw       | 207.99  |  4.81ms | 528.26µs |   1.14ms |  ± 2.80% |  2.34kB/op | 20.77× slower
pipeline  |   4.32k |231.53µs |  40.31µs |  30.75µs |  ± 0.98% |  21.6kB/op |       fastest
(Rendered as markdown on GitHub, this produces a table with right-aligned columns.)

@bcardiff
Member

FYI, the Crystal::TablePrint helper might be helpful here.

@jgaskins
Contributor Author

Apologies for the delay in response. Been a wild few weeks.

@straight-shoota

So the question is, whether the additional data is relevant enough in a large number of use cases to warrant more convoluted output.

Whether it's useful in enough cases, I'm not sure. Anecdotally, I work primarily on distributed systems, so I/O is a huge component of what I do. For anything that deals with I/O (even to localhost), wall-clock time remains a good latency metric, but its usefulness for understanding throughput drops to very little as soon as concurrency comes into play.

With that said, though, I think it makes sense to keep ips as iterations per wall-clock second. It's still a good baseline for how long the person running the benchmark should expect to wait. Iterations per CPU second in an I/O-heavy scenario could take hours. 🙂

Can you expand on what you think is convoluted, though? I think we may have different interpretations of that word in mind.


@asterite

I think it should be enabled by passing an argument to the ips method.

I think it makes a lot of sense to put the distinction between user and system CPU time behind a flag since that's a very specific type of insight. But the CPU-time metric can be just as useful as the heap-allocations metric included in the output by default.

Part of this comes from a concern about the ways our tools influence how we measure. If we ignore CPU time by default, I wonder if anyone will ever look at it.


@bcardiff

FYI, the Crystal::TablePrint helper might be helpful here.

Whoa, there's already a thing for this?? If we end up going this route, would it make sense to document that helper/move it out of the Crystal namespace or keep it there since this is an internal thing?

@bcardiff
Member

It should remain internal. There are more feature-complete shards for others to use. But if you need to expand things to get better formatting or alignment, feel free to change that helper.

@straight-shoota
Member

Can you expand on what you think is convoluted, though? I think we may have different interpretations of that word in mind.

I mean that the number of data points per row in Benchmark.ips output nearly doubles from four to seven, and most users will probably be unsure what to do with such an overload of information.

Additionally, the line length grows enormously. In your examples it's 98 characters. The table format is a bit shorter because you can leave out the units, but not by much. With so much data, chances are the line will wrap if the terminal is not wide enough, and broken lines are going to look terrible. That may be acceptable if you really need the additional data, but I assume most use cases don't.

Bottom line, adding more measurements makes the output harder to read because it presents data most users likely didn't even ask for.

@jgaskins
Contributor Author

I mean that the number of data points per row in Benchmark.ips output nearly doubles from four to seven

Not quite, I only added 2 metrics. 🙂 In v0.35.1, Benchmark.ips shows 5. The two I added can be coalesced into a single metric for aggregate CPU time, though, because distinguishing between userland and kernel time is unlikely to be useful to most people. So it'd go from 5->6. Here's what that looks like:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
     raw 179.90  (  5.56ms real,   1.99ms cpu) (± 1.88%)  2.34kB/op  23.78× slower
pipeline   4.28k (233.72µs real,  72.00µs cpu) (± 0.83%)  21.6kB/op        fastest
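
The aggregation itself is straightforward wherever the samples come from; here is a sketch assuming Process.times deltas, which may or may not match this PR's internals:

# Sketch: collapse user + system deltas into one per-iteration CPU figure.
iterations = 1_000
cpu_start  = Process.times
iterations.times { Fiber.yield }   # stand-in for the measured block
cpu_stop   = Process.times

cpu = (cpu_stop.utime - cpu_start.utime) + (cpu_stop.stime - cpu_start.stime)
puts "cpu: #{cpu / iterations * 1_000_000} µs/op"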

Your point about line length is important and it's something I didn't even think about, so I appreciate you pointing that out. In this example, the lines are 84 characters. We could probably shave off a few by using 3 significant digits instead of 2 decimal places (because 5 significant digits isn't exactly useful) and trim down a couple other places. Here's a mockup for that:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
     raw 179.90  (5.56ms real/1.99ms cpu) (± 1.88%) 2.34kB/op  23.78× slower
pipeline   4.28k ( 233µs real/72.0µs cpu) (± 0.83%) 21.6kB/op        fastest

This comes in at 79 characters wide with an 8-character benchmark name. If we go with the table:

➜  redis git:(master) ✗ crystal run --release bench/pipeline_vs_raw.cr
Benchmark |     ips |   real |    CPU |  stddev |      heap |        result
----------|---------|--------|--------|---------|-----------|--------------
     raw  | 207.99  | 4.81ms |  528µs | ± 2.80% | 2.34kB/op | 20.77× slower
pipeline  |   4.32k |  231µs | 40.3µs | ± 0.98% | 21.6kB/op |       fastest

Surprisingly not much shorter. 😂 But it might be easier to read.

Bottom line, adding more measurements makes the output harder to read because it presents data most users likely didn't even ask for.

How many people asked for heap allocations in the output? Judging from the PR where it was added, it doesn't look like a lot. I'd be willing to bet most people who use it to tune their code regularly didn't even know getting the memory usage of a block could feasibly be done. I know I didn't, but now I use it all the time. 🙂

In the same way, I'm trying to give people something they need, rather than something they asked for. And I do think anyone trying to optimize any kind of code which performs I/O benefits from this, even if they don't realize it right now.

@straight-shoota
Member

I'm on board with adding only CPU time by default.

Maybe we can find a way to more unambiguously communicate that ips and stddev relate to real time measurement.

@straight-shoota added the kind:feature, pr:needs-work, and topic:stdlib labels on Jan 10, 2021