
CSV Report #147

Closed
nickbabcock opened this Issue May 23, 2018 · 7 comments

nickbabcock commented May 23, 2018

While the Criterion HTML report is decent for most situations, there are occasions when a different graph allows for an easier interpretation. Instead of making the HTML report endlessly configurable, I propose exposing the data in a CSV format. That way, anyone can easily create their own graphs in R, Python, etc. While the JSON in target/criterion/$BENCHMARK/$FUNCTION is a good start, pulling everything together into (preferably) a single file would greatly lower the bar.

For instance, a $cargobenchname-raw.csv could hold the raw sampling data with headers like:

| group id | id | parameter | elapsed (ns) |
|----------|----|-----------|--------------|
|          | a  | b         | 100          |
|          | a  | b         | 150          |

The group id and parameter columns can be left blank for benchmarks where they don't apply. I'm torn on whether throughput should be included; I'm leaning towards no, as I believe it's calculable. Users of the raw data can compute any additional aggregations they are interested in (like percentiles).
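
To illustrate the "it's calculable" point, here is a rough std-only sketch of computing a mean and a p95 from such a raw file. The file name and column positions are just placeholders for this proposal, not anything Criterion produces today:

```rust
// Rough sketch only: reads a hypothetical mybench-raw.csv with the columns
// proposed above (group id, id, parameter, elapsed (ns)) and computes a
// mean and a p95. Nothing here reflects an actual Criterion.rs output file.
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = fs::read_to_string("mybench-raw.csv")?; // placeholder file name
    let mut elapsed: Vec<f64> = text
        .lines()
        .skip(1) // skip the header row
        .filter_map(|line| line.split(',').nth(3)) // 4th column: elapsed (ns)
        .filter_map(|field| field.trim().parse::<f64>().ok())
        .collect();
    if elapsed.is_empty() {
        return Ok(());
    }
    elapsed.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mean = elapsed.iter().sum::<f64>() / elapsed.len() as f64;
    let p95 = elapsed[(elapsed.len() - 1) * 95 / 100];
    println!("mean = {:.1} ns, p95 = {:.1} ns", mean, p95);
    Ok(())
}
```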

Alternatively (or in addition), the aggregated values could be written to $cargobenchname.csv with headers like:

| group id | id | parameter | elapsed (ns) x̅ | x̅ p95 upper | x̅ p95 lower | σ |
|----------|----|-----------|-----------------|--------------|--------------|---|

More than just the mean statistics would be included.

For comparison, C#'s BenchmarkDotNet outputs a CSV like this, which can be slurped up to create nice-looking graphs.

Maybe this could help people transition to presenting Rust benchmark results in graphical form. I know that my Rust projects don't get enough love in the dataviz department 😄

bheisler commented May 25, 2018

Hey, thanks for the suggestion and sorry for the slow response.

There have been requests for a stable machine-readable output format (eg. for lolbench). The current JSON storage format is an internal implementation detail. I was leaning more towards defining a stable JSON interchange format that could be different from the storage format, rather than CSV, but I could be convinced otherwise. Why did you choose to go with CSV over JSON?

That said, I don't agree that a single large file is the way to go, for a lot of reasons. The most important one is that it's actually really difficult to have a complete list of the benchmarks for a given repository - some benchmarks might have been filtered out, some might be run with a different instance of the Criterion struct, some might be in a totally different binary. The HTML report works around this to generate the index by walking the directory tree, which is kind of an ugly hack and often involves overwriting the file multiple times. I'd prefer not to repeat that hack if possible. I don't see the need for combining all of the output into a single file anyway; anyone who is writing a script to parse these files can surely handle reading multiple files.

Also if Criterion.rs does add something like this, the output format should include the number of iterations in each sample. That will be necessary for the consumer to calculate summary statistics correctly.

It's not a bad idea, but I think it needs some more work.

nickbabcock commented May 25, 2018

I didn't mean to pump and dump an issue and PR (the PR was to communicate my thinking in case the issue wasn't clear 😄). So we can have the discussion here.

> Why did you choose to go with CSV over JSON?

A CSV seems to be the lowest bar for data analysis (including BI tools). However, I can be convinced too 😄

My use case involves graphing benchmark data using R/Python, so CSV or JSON doesn't make a difference to me. I did the CSV PR because it is easier to append to a CSV after each measurement.

> some benchmarks might have been filtered out

Only include benchmarks that finished. That's how Google Benchmark, BenchmarkDotNet, and JMH do it, and it seems like a reasonable philosophy to follow: if a benchmark didn't run, don't include it in the output.

> some might be run with a different instance of the Criterion struct

While switching up the sample size could conceivably affect the number of records in the CSV / JSON, I don't see how it invalidates writing to a single file. If you're worried about reporting differences in warmup, confidence level, etc., you can always note these values as CSV columns or JSON entries.

> some might be in a totally different binary

I haven't done criterion binary benchmarking (though I think it is a cool feature!). Does it not fall under the same philosophy of warmup, samples, and confidence? If only a subset of those apply, you can always mark those values as null or missing.

Er -- if this is in reference to having two benchmark files, bencha.rs and benchb.rs, then I still think it prudent to strive towards combining the outputs into a single file. You could have a column (e.g. "suite") that contains "bencha" / "benchb". I couldn't figure it out for the PR (maybe it's not possible!?). Alternatively, you could name the files bencha.json / bencha.csv to avoid clobbering. It's easier to concat bencha.csv and benchb.csv than 100 different files for individual functions.

> I don't see the need for combining all of the output into a single file anyway; anyone who is writing a script to parse these files can surely handle reading multiple files

Right, it's not a question of technical ability. It's a question of convenience.

  • I don't want to have to know criterion's directory hierarchy to parse out the group, id, and parameter values! Considering that the full id is ambiguous with respect to what it is composed of, this is an easy mistake to make (e.g. taking the function id for the group).
  • I don't want to decipher new/sample.json (a sketch of that deciphering follows this list). It's two arrays that pair together to give iterations and elapsed nanoseconds. Creating my PR took a tad longer than anticipated because I was confused by these numbers: they appeared semi-sorted, so I thought they were cumulative. These units can be labeled. Google Benchmark and JMH have a separate column for the units of the recorded observation (e.g. "ops/msec" / "ns"). BenchmarkDotNet puts the units in the column header (e.g. "Elapsed (ns)") or inline with the value (e.g. "42 ns") -- not my preference.
  • I don't want to merge all the JSON files.
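
For reference, here is the sketch of that deciphering mentioned above. The "iters" and "times" field names and the benchmark path are assumptions on my part (the file is an internal format and may differ); it just shows how the two arrays pair up into per-iteration times:

```rust
// Sketch of pairing up the two parallel arrays in new/sample.json.
// Field names and the benchmark path below are assumptions, not a
// documented format.
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // placeholder benchmark directory under target/criterion
    let text = std::fs::read_to_string("target/criterion/my_bench/new/sample.json")?;
    let v: Value = serde_json::from_str(&text)?;
    // assumed field names for the two parallel arrays
    let iters = v["iters"].as_array().cloned().unwrap_or_default();
    let times = v["times"].as_array().cloned().unwrap_or_default();
    for (i, t) in iters.iter().zip(times.iter()) {
        let n = i.as_f64().unwrap_or(1.0);          // iterations in this sample
        let elapsed_ns = t.as_f64().unwrap_or(0.0); // total elapsed ns for the sample
        // the per-iteration time is the number you actually want to aggregate
        println!("{:.2} ns/iter over {} iterations", elapsed_ns / n, n);
    }
    Ok(())
}
```

Labeled CSV columns (or a stable JSON schema) would remove that guesswork entirely.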

TLDR: I'm lazy 😅

If criterion's goal is to give Rust programmers the tools to easily interpret their benchmarks, then it also has to recognize that the CLI and HTML output can be complemented with a single file (or a file per suite) that can be slurped up into any viz tool or programming language -- especially since the canonical benchmarking frameworks for C++, Java, and C# have shown that such reports are useful.

bheisler commented May 26, 2018

I'm sorry, I wasn't clear earlier. What you're proposing is much harder to accomplish than you realize, but it's not immediately obvious why that's the case so I'll try to explain.

Where do you propose to store the results before writing them to the CSV file, and how do you decide when to overwrite (as opposed to appending to) that file?

If we store them in the Criterion struct, we run into the problem that the Criterion struct gets discarded after every benchmark group. You could write to disk at the end of the benchmark (as your current implementation does) but then each benchmark will overwrite the last. You can get around that by storing the results in some sort of global static storage. However, Cargo makes it easy to define multiple benchmark executables - what you refer to as bencha and benchb. It's fairly common for Criterion.rs users to do this, so your solution has to be able to handle that usage pattern. If you naively write your CSV file at the end of main, then the last executable to run will win, which is also not what you want.

As far as I'm aware, Cargo doesn't provide any way to know which executable we're in or what other executables there are or what benchmarks might be in them, so you can't even write to different files. I don't think you can even look at the executable path in argv[0] because Cargo mangles the names of the benchmark executables and they change every so often.

You could append to the file instead of overwriting it, but now you have a different problem - you can't tell if the existing results file was written by a previous executable in the same cargo bench invocation, or if it was written by an earlier benchmarking run (potentially benchmarking a different version of the code). Now your CSV file grows unbounded and probably contains wrong data as well.

The problem here is that we don't have any good way to determine a complete list of which benchmarks were run in a given cargo bench run - there might be filters or there might be more benchmark executables waiting in Cargo's queue. We also don't have any way to run code at the start or end of a cargo bench run (eg. to clear the CSV file at the start of a run).

As I said before, the way I work around this to generate the HTML report index is to walk the file tree to scan for stored benchmark files, but this is unsatisfactory too. It's expensive - a full directory traversal. It's wasteful - all benchmark executables will perform this traversal and write a file, only to have it overwritten by the next executable. And it's not even all that accurate - if I delete a benchmark from my code and run cargo bench again, it still shows up in the index because the stored results files are still there. In your case, there's even more potential to confuse users. If a user changes their code and then runs a subset of the benchmarks (using the --bench argument to pick a single executable or just providing a filter) then the directory scan will pick up all the other results files from benchmarks that weren't executed and may no longer be accurate.

It's probably not possible to correctly handle all of these corner cases without some kind of support from Cargo. Although I suppose you could get close, it would involve a lot of guesswork and complexity on Criterion's part and I don't want that maintenance burden.

If you're feeling particularly ambitious, you could try to convince the developers of Cargo to set environment variables when running the benchmark executables. For example, you could tell Criterion 'this is executable 1 of 3' and then Criterion could clear the CSV file. But even there you run into corner cases - what if some of the benchmark executables don't use Criterion? If the first executable uses Bencher or something instead, then you'll never clear the CSV file.

Or you could just write a separate CSV file for every benchmarked function. That's easy to do and there's no problem with different benchmarks or different executables interfering with each other. True, the user now has to do find | cat | grep | analysis.py or something like that to do their analysis. It's more work, sure, but it's explicit about the directory scan so there's less potential for confusion in odd corner cases and the user knows exactly what they're analyzing. We could make this easier by including the benchmark ID information (in separate columns to avoid ambiguity), even though it's technically redundant with the file path.
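
For what it's worth, the stitching step is not much code either. Here is a rough std-only sketch that walks target/criterion and concatenates hypothetical per-benchmark raw.csv files (the file name is made up for illustration), tagging each row with the path it came from:

```rust
// Rough std-only sketch of the "stitch the per-benchmark files together"
// step. The raw.csv file name and the target/criterion root are assumptions
// for illustration, not what Criterion.rs writes today.
use std::fs;
use std::path::Path;

fn collect(dir: &Path, rows: &mut Vec<String>) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect(&path, rows)?; // recurse into per-benchmark directories
        } else if path.file_name().and_then(|n| n.to_str()) == Some("raw.csv") {
            // keep the path so each row says which benchmark it came from
            for line in fs::read_to_string(&path)?.lines().skip(1) {
                rows.push(format!("{},{}", path.display(), line));
            }
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut rows = Vec::new();
    collect(Path::new("target/criterion"), &mut rows)?;
    println!("collected {} samples", rows.len());
    Ok(())
}
```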

I hope this rant explains some of my thinking on this issue. I've already made all of these mistakes while trying to implement the HTML report index, you see, and it was not so much fun that I want to do it again.

> I don't want to decipher new/sample.json.

Well, that's not really intended to be a user-facing format. In fact, it could change without warning (and already has at least once) so it's probably not a good idea to rely on it too heavily.

I'd also like to think some more about what should be in this hypothetical CSV file. Is it more useful to have the raw samples, or would you want to read the summary statistics instead? It's tempting to say that Criterion should save both, but that means more code to maintain. I'm kind of inclined to only expose the raw sample data to keep the API surface as minimal as possible, but then users have to compute their own summary statistics. I don't want to inflict work on users without good reason, but I also don't want to have to support something that nobody uses.

nickbabcock commented May 27, 2018

Ah, thanks for the patience! It appears that we're blocked on Cargo before we can even begin to talk specifics. I'm feeling strangely motivated to make this possible, so I'll look to open a pre-RFC thread on internals in the next couple of days and see what others think. There may be a solution that custom test harnesses could use too 🤔

> I'm kind of inclined to only expose the raw sample data to keep the API surface as minimal as possible

That would be fine with me.

> then users have to compute their own summary statistics

True, but if / when that is too much work, people will open an issue with suggestions, so you'll know what's important to others and that others are using the feature.

nickbabcock commented May 29, 2018

So I've updated the PR to take advantage of module_path!. This avoids clobbering: benchmarks under bencha.rs end up in bencha-raw.csv and benchmarks under benchb.rs end up in benchb-raw.csv, making it possible to consolidate all benchmarks in one benchmark executable into a single output file. I've tested it for my own use cases and it works.

Was module_path! what you were looking for to fix the Cargo blocker? If so, we can move on to your thoughts on defining an output format that most can agree on. No need to review the PR (again!) until we settle things here; I'm just using it as a reference for discussion.

bheisler commented Jun 1, 2018

Well, the issue there is that module_path! resolves to... well, the module path. If the user has their criterion group inside a module, and many will, that will produce a file name like "test::foo-raw.csv". Windows, at least, doesn't like to have : in file names.

nickbabcock commented Jun 2, 2018

Yes, colons can be replaced with a permissible character (like a dash), or we can just take everything after the last colon.
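
Something like this (just a sketch of the two options, not the PR's actual code):

```rust
// Sketch of the two options for turning module_path! into a file name;
// not the PR's actual code.
fn main() {
    let module = module_path!(); // e.g. "mycrate::benches::foo" when nested
    let dashed = module.replace("::", "-");                  // "mycrate-benches-foo"
    let last = module.rsplit("::").next().unwrap_or(module); // "foo"
    println!("{}-raw.csv or {}-raw.csv", dashed, last);
}
```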

bheisler added this to the Version 0.2.4 milestone on Jun 29, 2018

bheisler closed this in 94dfce5 on Jul 8, 2018
