reproducible benchmarks? #79
Ah, sorry about the noise, I found them at the bottom of the post. Thanks!

No problem. It'd be fantastic to have someone else take a shot at reproducing these. You have some nice tools, by the way!
All righty, so I got around to trying a couple of the benchmarks. I only tried out your tools, csvtk and xsv. I skipped the join benchmark because I wasn't sure how to recreate the data. To produce

I got my tsv-utils binaries from your releases. Specifically,

Overall, I really liked your benchmark. It identified a few weak spots in

I am lodging this criticism because this single design decision has wide-reaching implications for the performance of the tools you're benchmarking. To be clear, I think the comparison itself is still interesting, because it shows that folks might benefit from shoving their data into a stricter TSV format if they can. Note that I said "to a lesser extent,

This is also a critical assumption that

One other thing: the results in your blog post are somewhat difficult to read, since your tables use labels like "Toolkit 1", but as far as I can see, you never actually say what "Toolkit 1" is?

OK, with that out of the way, I figured I'd share my data. Sorry if I sound bitter! (I've spent a lot of time on making CSV parsing fast while handling all the corner cases that, say, Python's CSV parser handles.) Nevertheless, nice work on the tsv utilities, they are quite fast! :-) And they certainly make me wonder whether

For regular expression filtering:

csvtk's performance isn't too surprising, since Go's regexp engine isn't that fast. From looking at a profile of

For column selection:

This one isn't too interesting.

For summary statistics:

This is an interesting one. Your single-threaded performance is quite impressive, and from profiling, it looks like there might be room for improvement in Rust's parsing of floating point numbers. Given that you only handle strict TSV formats, it seems like you could probably benefit from parallelism on this one without any sort of indexing.

For CSV to TSV conversion:

This benchmark was interesting, because up until very recently,
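The escaping issue behind this benchmark can be sketched in a few lines. This is an illustrative Python sketch, not code from any of the tools discussed: a quote-aware parse is unavoidable on the CSV side, and since strict TSV has no escape mechanism, embedded tabs and newlines have to be handled somehow on output. Replacing them with a space is just one possible policy, assumed here for illustration.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to strict TSV (illustrative sketch).

    Quoted CSV fields may legally contain commas, quotes, tabs, and even
    newlines. Strict TSV has no escaping, so embedded tabs/newlines must
    be replaced somehow; a single space is one possible policy.
    """
    out = io.StringIO()
    for row in csv.reader(io.StringIO(csv_text)):
        cleaned = [field.replace("\t", " ").replace("\n", " ") for field in row]
        out.write("\t".join(cleaned) + "\n")
    return out.getvalue()

sample = 'name,notes\n"Smith, Jan","line one\nline two"\n'
print(csv_to_tsv(sample))
```

The point of the sketch is the asymmetry: the TSV side is a plain split and join, while the CSV side requires a stateful, quote-aware parser on every byte of input.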
Excellent! I'm glad someone tried to reproduce these. And it seems like your tools have gotten faster since I ran my benchmarks. Very good!

A big-picture thing I'd like to communicate: the purpose behind my benchmarks was not to identify which toolkit is the fastest. My reason for anonymizing the timing of what I called the "specialty toolkits" was to avoid a shootout, and especially language flame wars. Also, it appears most of the specialty toolkits are written by one or two people who, frankly, are doing a service by open-sourcing their software. There is simply no reason to bash people who are doing this.

Then what was the purpose of the benchmarks? I was doing an evaluation of D. I wanted to see what the performance would be like writing in a somewhat obvious style, using standard libraries, etc. Think of a large software team at a company. To get an idea of what might be expected, I needed some baselines. I took every natively compiled thing I could find that had equivalent functionality. And I tested these tools, including yours, after completing mine. That is, I didn't study other tools and figure out what was needed to beat them.

So, from my perspective, what was significant was that the D programs did well against so many different implementations, written by a number of people. I was shocked they finished first in every metric I tried, except for csv-to-tsv conversion, which is far slower than it should be for reasons I haven't identified yet. And some tool should be faster on the "join" metric; what I wrote can be much faster.

Now, as you point out, the different tools have different functionality choices that affect performance. The CSV tools need to handle escape characters, and even with a "TSV" mode, it's asking an awful lot to support both and optimize the performance of both. And the Awk family of tools handles an arbitrary expression stack. My tools (tsv-filter) do not. Handling an arbitrary expression stack will be slower, despite very serious attempts to optimize it (e.g. mawk). (And yes, both using TSV and not supporting arbitrary expressions are deliberate design decisions for these reasons.)

As to why the performance benchmarks page doesn't go into more detail about both issues: mainly, the page is already too long. The page does draw the conclusion that D is showing up very favorably on the performance front. And one should infer that processing TSV should be faster than CSV, but that's hardly a new observation. However, the benchmarks certainly do not conclude that D is "faster" than another language (in this case, C, Rust, or Go). The same goes for the individual tools, especially those that handle CSV and arbitrary expression trees. Perhaps, though, trying to keep the page from growing longer was a mistake; I'll take a look and see if I can add a bit more of an explanation.

By the way, to me, an interesting comparison is GNU DataMash. It should be faster, at least when data is in sorted order. It's not. This partly says that you can't draw conclusions from a single comparison point.

I'm happy to have further conversations about these topics. For the next several weeks I'm going to have trouble responding quickly, so don't take silence the wrong way.
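The expression-stack point above can be made concrete with a small, hypothetical sketch (the names and tree shape are illustrative, not how mawk or tsv-filter actually work): a specialized filter bakes one known test into code, while an awk-like tool must walk an expression tree built at runtime for every record, paying dispatch overhead at each node.

```python
# Hypothetical contrast between a fixed, compiled-in predicate and an
# interpreted expression tree evaluated per record.

def fixed_predicate(fields):
    # What a specialized filter can do: one direct, known-ahead-of-time test.
    return float(fields[2]) > 100.0

# An awk-like tool instead holds a runtime-built tree for "field[2] > 100".
expr = ("gt", ("field", 2), ("const", 100.0))

def eval_expr(node, fields):
    # Walk the tree, dispatching on node type at every step.
    op = node[0]
    if op == "const":
        return node[1]
    if op == "field":
        return float(fields[node[1]])
    if op == "gt":
        return eval_expr(node[1], fields) > eval_expr(node[2], fields)
    raise ValueError(op)

record = ["a", "b", "150"]
print(fixed_predicate(record), eval_expr(expr, record))  # both True, but the
# interpreted form pays per-node dispatch costs on every record
```

Both produce the same answer; the difference is purely the per-record cost of interpretation, which is exactly the trade-off being described.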
@jondegenhardt Thanks for the response! To clarify: I care less about "which language is faster" and a lot more about "readers should be given enough information to interpret what the benchmark results mean." My tool happened to be involved, so I'm bleating about it, that's all. :-) I think a few clarifying sentences are really all you need, although I think the best case is providing an analysis of the results. But, having done that myself for other tools, I know that's not a reasonable request because of how much time it takes. It's hard.
@BurntSushi Very good, I appreciate your being understanding. I'll add some clarifying text. A fuller analysis is hard, as you say.
Add note about computational complexity of CSV escape processing and expression trees to performance benchmarks doc (issue #79)
@BurntSushi The "join benchmark" now describes the steps to reproduce the data should you wish to run this test.
@jondegenhardt Thanks! :-)
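For readers who want a feel for what the join benchmark above exercises before reproducing it, here is a minimal hash-join sketch over a TSV key column. The filenames, column indices, and inner-join policy are illustrative assumptions for this sketch, not a description of tsv-join's actual behavior.

```python
# Minimal sketch of the kind of key-based TSV join these benchmarks exercise.
# Column indices and the inner-join policy are illustrative assumptions.

def tsv_inner_join(left_lines, right_lines, left_key=0, right_key=0):
    # Build a hash table over the "right" input, keyed on one column.
    table = {}
    for line in right_lines:
        fields = line.rstrip("\n").split("\t")
        table.setdefault(fields[right_key], []).append(fields)
    # Stream the "left" input, emitting a joined row for each match,
    # dropping the duplicated key column from the right side.
    for line in left_lines:
        fields = line.rstrip("\n").split("\t")
        for match in table.get(fields[left_key], []):
            yield "\t".join(fields + match[:right_key] + match[right_key + 1:])

left = ["k1\ta", "k2\tb"]
right = ["k2\tx", "k3\ty"]
print(list(tsv_inner_join(left, right)))  # ['k2\tb\tx']
```

Because the hash build is linear in one input and the probe is linear in the other, a tool specialized for this pattern can stream the larger input without sorting or indexing it.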
I'd like to try and reproduce some of your benchmarks. Is it possible to acquire the data used in each of the benchmarks? Thanks.