2018 Benchmark Summary
The table below summarizes the results of the 2018 benchmark study. It shows the top-3 tools from each test. Times are in seconds. The tsv-utils tools are in italics. (Except for csv2tsv, tsv-utils tool names start with "tsv-".)
The benchmarks are described in detail in the Comparative Benchmark Study and the 2017 and 2018 comparative benchmark reports. These reports include goals, methodology, test details, caveats, conclusions, etc.
Top three tools in each benchmark

|Test|Platform|1st place|Time|2nd place|Time|3rd place|Time|
|---|---|---|---|---|---|---|---|
|Numeric Row Filter (4.8 GB, 7M lines)|Linux|*tsv-filter*|5.48|mawk|11.31|GNU awk|42.80|
||Mac OS|*tsv-filter*|3.35|mawk|15.05|GNU awk|24.25|
|Regex Row Filter (2.7 GB, 14M lines)|Linux|xsv|7.97|*tsv-filter*|8.80|mawk|17.74|
||Mac OS|xsv|7.03|*tsv-filter*|8.28|GNU awk|16.47|
|Column Selection (4.8 GB, 7M lines)|Mac OS|*tsv-select*|2.93|xsv|7.67|csvtk|11.00|
|Column Selection, narrow (1.7 GB, 86M lines)|Linux|GNU cut|5.60|*tsv-select*|8.26|xsv|13.60|
||Mac OS|xsv|9.22|*tsv-select*|10.18|GNU cut|10.65|
|Join Two Files (4.8 GB, 7M lines)|Linux|*tsv-join*|26.68|xsv|68.02|csvtk|98.51|
||Mac OS|*tsv-join*|21.78|xsv|60.03|csvtk|82.43|
|Summary Statistics (4.8 GB, 7M lines)|Linux|*tsv-summarize*|15.78|xsv|44.38|GNU Datamash|48.51|
||Mac OS|*tsv-summarize*|9.82|xsv|35.32|csvtk|45.92|
|CSV-to-TSV (2.7 GB, 14M lines)|Mac OS|*csv2tsv*|10.91|xsv|14.38|csvtk|32.49|
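To make the benchmark tasks concrete, here is a minimal sketch of two of them (numeric row filter and column selection) on a tiny stand-in data file. The `tsv-filter` and `tsv-select` command lines shown in comments are based on those tools' documented options; the `awk` and `cut` equivalents run with standard Unix tools.

```shell
# Create a small 3-column TSV file (header + data) to stand in for the
# multi-gigabyte benchmark inputs used in the study.
printf 'id\tscore\tname\n1\t0.2\ta\n2\t0.9\tb\n3\t0.7\tc\n' > data.tsv

# Numeric row filter: keep rows where column 2 >= 0.5.
#   tsv-utils:  tsv-filter -H --ge 2:0.5 data.tsv
# The awk version keeps the header plus the rows for ids 2 and 3.
awk -F'\t' 'NR == 1 || $2 >= 0.5' data.tsv

# Column selection: keep columns 1 and 3.
#   tsv-utils:  tsv-select -f 1,3 data.tsv
cut -f 1,3 data.tsv
```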
Comparative Benchmark Study
Performance is a key motivation for using D rather than an interpreted language like Python or Perl. It is also a consideration in choosing between D and C/C++. To gauge D's performance, benchmarks were run comparing eBay's TSV Utilities to a number of similar tools written in other native compiled programming languages. Included were traditional Unix tools as well as several specialized toolkits. The programming languages involved were C, Go, and Rust.
The larger goal was to see how D programs would compare when written in a straightforward style, as if by a team of well qualified programmers in the course of normal development. Attention was given to choosing good algorithms and identifying poorly performing code constructs, but heroic measures were not used to gain performance. D's standard library was used extensively, without writing custom versions of core algorithms or containers. Unnecessary GC allocation was avoided, but garbage collection was used rather than manual memory management. Higher-level I/O primitives were used rather than custom buffer management.
This larger goal was also the motivation for using multiple benchmarks and a variety of tools. Single points of comparison are more likely to be biased (less reliable) due to the differing goals and quality of the specific application.
The study was conducted in March 2017. An update was done in April 2018 using the fastest tools from the initial study.
The D programs performed extremely well, exceeding the author's expectations. Six benchmarks were used in the 2017 study; the D tools were the fastest on each, often by significant margins. This is impressive given that very little low-level programming was done. In the 2018 update the TSV Utilities were first or second on all benchmarks. The TSV Utilities were faster than in 2017, but several of the other tools had gotten faster as well.
As with most benchmarks, there are caveats. The tools used for comparison are not exact equivalents, and in many cases have different design goals and capabilities likely to impact performance. Tasks performed are highly I/O dependent and follow similar computational patterns, so the results may not transfer to other applications.
Despite the limitations of the benchmarks, this is certainly a good result. The benchmarks engage a fair range of programming constructs, and the comparison basis includes nine distinct implementations and several long-tenured Unix tools. As a practical matter, the performance of the tools has changed the author's personal work habits: calculations that used to take 15-20 seconds are now instantaneous, and calculations that took minutes often finish in 10 seconds or so.
LTO and PGO studies
In the fall of 2017 eBay's TSV Utilities were used as the basis for studying Link Time Optimization (LTO) and Profile Guided Optimization (PGO). In D, the LLVM versions of these technologies are made available via LDC, the LLVM-based D Compiler.
Both LTO and PGO resulted in significant performance gains. Details are on the LTO and PGO Evaluation page.
Additional information about LTO and PGO can be found on the Building with LTO and PGO page. The slide decks from presentations at Silicon Valley D Meetup (December 2017) and DConf 2018 also contain useful information, including additional references to other resources about LTO and PGO.
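For orientation, here is a sketch of how LTO and PGO are enabled with LDC, based on LDC's documented flags. The source file, output names, and training input are illustrative placeholders, not the actual build setup used in the evaluation.

```shell
# Link Time Optimization (thin variant) with ldc2:
ldc2 -O3 -release -flto=thin app.d -of=app

# Profile Guided Optimization: build an instrumented binary, run it on a
# representative training workload, merge the raw profile, then rebuild
# using the collected profile data.
ldc2 -O3 -release -fprofile-instr-generate=profile.raw app.d -of=app_instr
./app_instr < training-input.tsv > /dev/null
ldc-profdata merge profile.raw -o profile.data
ldc2 -O3 -release -fprofile-instr-use=profile.data app.d -of=app
```

The LTO and PGO Evaluation page linked above describes the actual configurations measured.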