reproducible benchmarks? #79
Ah, sorry about the noise, I found them at the bottom of the post. Thanks!

No problem. It'd be fantastic to have someone else take a shot at reproducing these. You have some nice tools, by the way!
All righty, so I got around to trying a couple of the benchmarks. I only tried out your tools, csvtk and xsv. I skipped the join benchmark because I wasn't sure how to recreate the data. To produce

I got my tsv-utils binaries from your releases. Specifically,

Overall, I really liked your benchmark. It identified a few weak spots in

I am lodging this criticism because this single design decision has wide-reaching implications for the performance of the tools you're benchmarking. To be clear, I think the comparison itself is still interesting, because it shows that folks might benefit from shoving their data into a stricter TSV format if they can. Note that I said "to a lesser extent,

This is also a critical assumption that

One other thing: the results in your blog post are somewhat difficult to read, since your tables use labels like "Toolkit 1", but as far as I can see, you never actually say what "Toolkit 1" is?

OK, with that out of the way, I figured I'd share my data. Sorry if I sound bitter! (I've spent a lot of time on making CSV parsing fast while handling all the corner cases that, say, Python's CSV parser handles.) Nevertheless, nice work on the tsv utilities, they are quite fast! :-) And they certainly make me wonder whether

For regular expression filtering:

csvtk's performance isn't too surprising, since Go's regexp engine isn't that fast. From looking at a profile of

For column selection:

This one isn't too interesting.

For summary statistics:

This is an interesting one. Your single-threaded performance is quite impressive, and from profiling, it looks like there might be room for improvement in Rust's parsing of floating point numbers. Given that you only handle strict TSV formats, it seems like you could probably benefit from parallelism on this one without any sort of indexing.

For CSV to TSV conversion:

This benchmark was interesting, because up until very recently,
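The escaping issue behind this benchmark can be sketched in a few lines. This is an illustrative Python sketch, not code from any of the tools discussed: a quote-aware parse is unavoidable on the CSV side, and since strict TSV has no escape mechanism, embedded tabs and newlines have to be handled somehow on output. Replacing them with a space is just one possible policy, assumed here for illustration.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to strict TSV (illustrative sketch).

    Quoted CSV fields may legally contain commas, quotes, tabs, and even
    newlines. Strict TSV has no escaping, so embedded tabs/newlines must
    be replaced somehow; a single space is one possible policy.
    """
    out = io.StringIO()
    for row in csv.reader(io.StringIO(csv_text)):
        cleaned = [field.replace("\t", " ").replace("\n", " ") for field in row]
        out.write("\t".join(cleaned) + "\n")
    return out.getvalue()

sample = 'name,notes\n"Smith, Jan","line one\nline two"\n'
print(csv_to_tsv(sample))
```

The point of the sketch is the asymmetry: the TSV side is a plain split and join, while the CSV side requires a stateful, quote-aware parser on every byte of input.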
Excellent! I'm glad someone tried to reproduce these. And it seems like your tools have gotten faster since I ran my benchmarks. Very good!

A big-picture thing I'd like to communicate: the purpose behind my benchmarks was not to identify which toolkit is the fastest. My reason for anonymizing the timing of what I called the "specialty toolkits" was to avoid a shootout, and especially language flame wars. Also, it appears most of the specialty toolkits are written by one or two people who, frankly, are doing a service by open-sourcing their software. There is simply no reason to bash people who are doing this.

Then what was the purpose of the benchmarks? I was doing an evaluation of D. I wanted to see what the performance would be like writing in a somewhat obvious style, using standard libraries, etc. Think of a large software team at a company. To get an idea of what might be expected, I needed some baselines. I took every natively compiled thing I could find that had equivalent functionality. And I tested these tools, including yours, after completing mine. That is, I didn't study other tools and figure out what was needed to beat them.

So, from my perspective, what was significant was that the D programs did well against so many different implementations, written by a number of people. I was shocked they finished first in every metric I tried, except for csv-to-tsv conversion, which is far slower than it should be for reasons I haven't identified yet. And some tool should be faster on the "join" metric; what I wrote can be much faster.

Now, as you point out, the different tools have different functionality choices that affect performance. The CSV tools need to handle escape characters, and even with a "TSV" mode, it's asking an awful lot to support both and optimize the performance of both. And the Awk family of tools handles an arbitrary expression stack. My tools (tsv-filter) do not. Handling an arbitrary expression stack will be slower, despite very serious attempts to optimize it (e.g. mawk). (And yes, both using TSV and not supporting arbitrary expressions are deliberate design decisions for these reasons.)

As to why the performance benchmarks page doesn't go into more detail about both issues: mainly, the page is already too long. The page does draw the conclusion that D is showing up very favorably on the performance front. And one should infer that processing TSV should be faster than CSV, but that's hardly a new observation. However, the benchmarks certainly do not conclude that D is "faster" than another language (in this case, C, Rust, or Go). The same goes for the individual tools, especially those that handle CSV and arbitrary expression trees. Perhaps, though, trying to keep the page from growing longer was a mistake; I'll take a look and see if I can add a bit more of an explanation.

By the way, to me, an interesting comparison is GNU DataMash. It should be faster, at least when data is in sorted order. It's not. This partly says that you can't draw conclusions from a single comparison point.

I'm happy to have further conversations about these topics. For the next several weeks I'm going to have trouble responding quickly, so don't take silence the wrong way.
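The expression-stack point above can be made concrete with a small, hypothetical sketch (the names and tree shape are illustrative, not how mawk or tsv-filter actually work): a specialized filter bakes one known test into code, while an awk-like tool must walk an expression tree built at runtime for every record, paying dispatch overhead at each node.

```python
# Hypothetical contrast between a fixed, compiled-in predicate and an
# interpreted expression tree evaluated per record.

def fixed_predicate(fields):
    # What a specialized filter can do: one direct, known-ahead-of-time test.
    return float(fields[2]) > 100.0

# An awk-like tool instead holds a runtime-built tree for "field[2] > 100".
expr = ("gt", ("field", 2), ("const", 100.0))

def eval_expr(node, fields):
    # Walk the tree, dispatching on node type at every step.
    op = node[0]
    if op == "const":
        return node[1]
    if op == "field":
        return float(fields[node[1]])
    if op == "gt":
        return eval_expr(node[1], fields) > eval_expr(node[2], fields)
    raise ValueError(op)

record = ["a", "b", "150"]
print(fixed_predicate(record), eval_expr(expr, record))  # both True, but the
# interpreted form pays per-node dispatch costs on every record
```

Both produce the same answer; the difference is purely the per-record cost of interpretation, which is exactly the trade-off being described.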
@jondegenhardt Thanks for the response! To clarify: I care less about "which language is faster" and a lot more about "readers should be given enough information to interpret what the benchmark results mean." My tool happened to be involved, so I'm bleating about it, that's all. :-) I think a few clarifying sentences are really all you need, although I think the best case is providing an analysis of the results. But, having done that myself for other tools, I know that's not a reasonable request because of how much time it takes. It's hard.
@BurntSushi Very good, I appreciate your being understanding. I'll add some clarifying text. A fuller analysis is hard, as you say.
Add note about computational complexity of CSV escape processing and expression trees to performance benchmarks doc (issue #79)
@BurntSushi The "join benchmark" now describes the steps to reproduce the data should you wish to run this test.
@jondegenhardt Thanks! :-)
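For readers who want a feel for what the join benchmark above exercises before reproducing it, here is a minimal hash-join sketch over a TSV key column. The filenames, column indices, and inner-join policy are illustrative assumptions for this sketch, not a description of tsv-join's actual behavior.

```python
# Minimal sketch of the kind of key-based TSV join these benchmarks exercise.
# Column indices and the inner-join policy are illustrative assumptions.

def tsv_inner_join(left_lines, right_lines, left_key=0, right_key=0):
    # Build a hash table over the "right" input, keyed on one column.
    table = {}
    for line in right_lines:
        fields = line.rstrip("\n").split("\t")
        table.setdefault(fields[right_key], []).append(fields)
    # Stream the "left" input, emitting a joined row for each match,
    # dropping the duplicated key column from the right side.
    for line in left_lines:
        fields = line.rstrip("\n").split("\t")
        for match in table.get(fields[left_key], []):
            yield "\t".join(fields + match[:right_key] + match[right_key + 1:])

left = ["k1\ta", "k2\tb"]
right = ["k2\tx", "k3\ty"]
print(list(tsv_inner_join(left, right)))  # ['k2\tb\tx']
```

Because the hash build is linear in one input and the probe is linear in the other, a tool specialized for this pattern can stream the larger input without sorting or indexing it.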
I'd like to try and reproduce some of your benchmarks. Is it possible to acquire the data used in each of the benchmarks? Thanks.