Tuned csv Fork Benchmarks

Joel Natividad edited this page Jun 11, 2026 · 5 revisions

Tuned csv Fork — Benchmarks vs Upstream rust-csv 1.4.0

qsv patches the csv, csv-core, and csv-index crates to the dathere/rust-csv fork (branch qsv-tuned). The csv crate underpins most of qsv's functionality, so every perf tweak compounds across the whole toolkit.

This page documents how the fork performs against upstream rust-csv 1.4.0 (the release the fork is based on), measured with identical benchmark harnesses.

Bottom line: 41 of 42 benchmarks are as fast as or faster than upstream. 30 are faster by a clear margin (up to 2.05x) and the rest are at parity. The single exception is mbta/serialize at 0.93x, discussed under "Serialization / writes" below. On every other hot path qsv exercises, the fork matches or beats upstream.

What the fork changes

  • SIMD-accelerated UTF-8 validation via simdutf8; non-ASCII records are validated with a single whole-buffer pass plus per-field-boundary checks instead of per-field calls
  • Non-allocating ByteRecord and StringRecord trim
  • Skip the redundant ASCII trim on the StringRecord read path (~12% on str reads)
  • First-byte gate in deserialize_any type inference — fields that cannot possibly be numeric (first byte not in 0-9 + - . i I n N) skip up to five guaranteed-to-fail integer/float parse attempts (-21% str/-45% bytes on inference-heavy deserialization)
  • Faster float parsing via fast-float2 in the Serde deserializer (the bytes path also skips UTF-8 validation entirely)
  • Faster float serialization via zmij (replaces ryu, ~10% on serialize)
  • needs_quotes uses memchr-based SIMD scanning with a lookup-table fast path for short (≤16 byte) fields
  • scan_and_copy fast-path in read_field_dfa (upstream only has it in read_record_dfa)
  • A faster is_non_numeric() helper, Copy for Position, and assorted clippy-driven cleanups
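The needs_quotes item in the list above can be illustrated with a minimal, std-only sketch. It is not the fork's code: QuoteScanner is a hypothetical name, the set of bytes that force quoting (delimiter, quote, CR, LF) is an assumption, and the long-field branch uses a plain iterator scan as a stand-in for the memchr-based scan the fork uses.

```rust
/// Hypothetical helper: a 256-entry table marking bytes that force quoting.
pub struct QuoteScanner {
    table: [bool; 256],
}

impl QuoteScanner {
    /// Assumed special bytes: the delimiter, the quote, CR and LF.
    pub fn new(delimiter: u8, quote: u8) -> Self {
        let mut table = [false; 256];
        for b in [delimiter, quote, b'\r', b'\n'] {
            table[b as usize] = true;
        }
        QuoteScanner { table }
    }

    pub fn needs_quotes(&self, field: &[u8]) -> bool {
        if field.len() <= 16 {
            // Short field: one table lookup per byte, with no early exit,
            // so the loop stays branch-light.
            let mut hit = false;
            for &b in field {
                hit |= self.table[b as usize];
            }
            hit
        } else {
            // Long field: stand-in for a SIMD scan (the page says the fork
            // uses memchr here); this version is a plain early-exit scan.
            field.iter().any(|&b| self.table[b as usize])
        }
    }
}
```

The 16-byte cutoff mirrors the threshold quoted in the list; the idea is that a vectorized scan only pays off once a field is long enough to amortize its setup cost.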

Results

All numbers are Criterion medians (100 samples). Speedup = upstream time / fork time, so higher is better. Throughput is computed by Criterion's Throughput::Bytes over each dataset and is shown as upstream → qsv-tuned.
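As a quick sanity check on how the columns relate, the two derived quantities can be recomputed from the raw times. This is a small sketch with hypothetical function names, assuming times in seconds and MiB = 1024 × 1024 bytes.

```rust
/// Speedup as defined on this page: upstream time divided by fork time.
pub fn speedup(upstream_secs: f64, fork_secs: f64) -> f64 {
    upstream_secs / fork_secs
}

/// Throughput in MiB/s for a dataset of `dataset_bytes` processed in `secs`.
pub fn throughput_mib_per_sec(dataset_bytes: u64, secs: f64) -> f64 {
    dataset_bytes as f64 / (1024.0 * 1024.0) / secs
}
```

For example, speedup(6.084e-3, 2.971e-3) comes out at roughly 2.05, which is the pop_infer/infer_borrowed_bytes figure.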

Inference-typed deserialization (deserialize_any)

The standout. Schema-less deserialization (every field inferred as bool/int/float/string at runtime) is a common qsv access pattern.

Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s)
pop_infer/infer_borrowed_bytes | 6.084 ms | 2.971 ms | 2.05x | 150 → 307
pop_infer/infer_owned_str | 3.133 ms | 2.305 ms | 1.36x | 291 → 395
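The first-byte gate behind these numbers can be sketched as follows. This is a simplified illustration rather than the fork's deserialize_any: Inferred, may_be_numeric and infer are hypothetical names, only one integer and one float width are tried, and parsing uses the standard library instead of fast-float2.

```rust
#[derive(Debug, PartialEq)]
pub enum Inferred {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
}

/// Gate: only fields whose first byte is in `0-9 + - . i I n N` can
/// possibly parse as an integer or float (including inf and NaN).
pub fn may_be_numeric(field: &[u8]) -> bool {
    matches!(
        field.first().copied(),
        Some(b'0'..=b'9' | b'+' | b'-' | b'.' | b'i' | b'I' | b'n' | b'N')
    )
}

/// Simplified runtime inference: bool, then int, then float, else string.
pub fn infer(field: &str) -> Inferred {
    match field {
        "true" => return Inferred::Bool(true),
        "false" => return Inferred::Bool(false),
        _ => {}
    }
    // The numeric parse attempts are skipped entirely when the gate says no.
    if may_be_numeric(field.as_bytes()) {
        if let Ok(n) = field.parse::<i64>() {
            return Inferred::Int(n);
        }
        if let Ok(x) = field.parse::<f64>() {
            return Inferred::Float(x);
        }
    }
    Inferred::Str(field.to_owned())
}
```

The gate is conservative: a field such as "north" passes it and still falls through to the string case, while a field such as "Springfield" never reaches the parsers at all.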

Trimmed reads

Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s)
nfl_trimmed/iter_str | 3.934 ms | 2.131 ms | 1.85x | 331 → 611
nfl_trimmed/iter_bytes | 2.727 ms | 1.741 ms | 1.57x | 477 → 748
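One way a non-allocating trim can work is to compact the record's bytes in place instead of building a second record. The sketch below assumes a flat layout (all field bytes back to back, plus a list of end offsets); FlatRecord and its methods are hypothetical names, and the whitespace definition is the standard library's is_ascii_whitespace, which may differ from what the fork uses.

```rust
/// Hypothetical flat record: field bytes stored contiguously in `buf`,
/// with `ends[i]` holding the exclusive end offset of field `i`.
pub struct FlatRecord {
    pub buf: Vec<u8>,
    pub ends: Vec<usize>,
}

impl FlatRecord {
    pub fn from_fields(fields: &[&[u8]]) -> Self {
        let mut buf = Vec::new();
        let mut ends = Vec::with_capacity(fields.len());
        for f in fields {
            buf.extend_from_slice(f);
            ends.push(buf.len());
        }
        FlatRecord { buf, ends }
    }

    pub fn field(&self, i: usize) -> &[u8] {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.buf[start..self.ends[i]]
    }

    /// Trim ASCII whitespace from every field without allocating.
    pub fn trim_in_place(&mut self) {
        let mut read_start = 0; // start of the current field, old layout
        let mut write = 0; // next free byte in the compacted layout
        for end in self.ends.iter_mut() {
            let old_end = *end;
            let field = &self.buf[read_start..old_end];
            let lead = field.iter().take_while(|b| b.is_ascii_whitespace()).count();
            let trail = field[lead..]
                .iter()
                .rev()
                .take_while(|b| b.is_ascii_whitespace())
                .count();
            let (from, to) = (read_start + lead, old_end - trail);
            // `write` never passes `read_start`, so unread bytes are safe.
            self.buf.copy_within(from..to, write);
            write += to - from;
            *end = write;
            read_start = old_end;
        }
        self.buf.truncate(write);
    }
}
```

Because the write cursor never overtakes the read cursor, the trimmed bytes can be shifted down within the same buffer and the end offsets rewritten as the loop goes.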

Serialization / writes

Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s)
nfl_write/bytes | 812.5 µs | 730.3 µs | 1.11x | 1602 → 1782
nfl_write/record | 1.136 ms | 1.025 ms | 1.11x | 1146 → 1270
pop/serialize | 2.298 ms | 2.149 ms | 1.07x | 397 → 424
nfl/serialize | 1.278 ms | 1.195 ms | 1.07x | 1019 → 1089
game/serialize | 3.967 ms | 3.793 ms | 1.05x | 625 → 654
mbta/serialize | 725.8 µs | 782.6 µs | 0.93x | 951 → 882

mbta/serialize is the only benchmark below parity. It runs on the smallest dataset in the suite, but at 0.93x it is about 7% slower than upstream, which is outside the ±2–3% thermal-drift noise floor, so it reads as a small regression rather than a tie.

Raw reads (ByteRecord / StringRecord)

Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s)
game/read_str | 3.117 ms | 2.852 ms | 1.09x | 795 → 869
pop/read_str | 1.150 ms | 1.057 ms | 1.09x | 793 → 862
game/read_bytes | 2.781 ms | 2.595 ms | 1.07x | 892 → 955
pop/iter_str | 2.055 ms | 1.928 ms | 1.07x | 444 → 473
game/iter_str | 6.931 ms | 6.573 ms | 1.05x | 358 → 377
mbta/read_str | 643.0 µs | 612.9 µs | 1.05x | 1073 → 1126
nfl/read_str | 1.092 ms | 1.060 ms | 1.03x | 1192 → 1228
mbta/iter_str | 1.085 ms | 1.056 ms | 1.03x | 636 → 653
nfl/iter_str | 1.481 ms | 1.442 ms | 1.03x | 879 → 903
mbta/read_bytes | 599.5 µs | 584.5 µs | 1.03x | 1151 → 1181
game/iter_bytes | 6.670 ms | 6.538 ms | 1.02x | 372 → 379
pop/iter_bytes | 1.899 ms | 1.868 ms | 1.02x | 480 → 488
nfl/iter_bytes | 1.398 ms | 1.387 ms | 1.01x | 931 → 938
mbta/iter_bytes | 1.035 ms | 1.030 ms | 1.01x | 666 → 670
pop/read_bytes | 971.2 µs | 966.8 µs | 1.00x | 938 → 943
nfl/read_bytes | 946.9 µs | 944.2 µs | 1.00x | 1374 → 1378

The read_str advantage comes from the single-pass UTF-8 validation; read_bytes parity is by design (the fork's raw-read hot loop is upstream's, see "History" below).
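The single-pass idea can be sketched with the standard library alone. If the whole record buffer is valid UTF-8 and every field boundary falls on a character boundary, then every field is valid UTF-8 on its own. The code below is an illustration under assumptions, not the fork's implementation: it uses std::str::from_utf8 where the fork uses simdutf8, the function names are hypothetical, and it assumes the end offsets are increasing with the last one equal to the buffer length.

```rust
/// One pass over the whole buffer, then a cheap boundary check per field.
pub fn validate_record(buf: &[u8], ends: &[usize]) -> bool {
    match std::str::from_utf8(buf) {
        // `is_char_boundary` is a constant-time check per field end.
        Ok(text) => ends.iter().all(|&end| text.is_char_boundary(end)),
        Err(_) => false,
    }
}

/// Reference version: one validation call per field.
pub fn validate_per_field(buf: &[u8], ends: &[usize]) -> bool {
    let mut start = 0;
    for &end in ends {
        if end < start || end > buf.len() {
            return false;
        }
        if std::str::from_utf8(&buf[start..end]).is_err() {
            return false;
        }
        start = end;
    }
    true
}
```

Both functions agree on well-formed offsets; the first simply replaces many small validation calls with one large one, which is where a SIMD validator has room to pay off.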

Typed deserialization (Serde structs)

All 16 benchmarks land between 1.00x and 1.09x: modest, but uniformly at or above parity. These paths are dominated by serde dispatch; the gains come from fast-float2 and the cheaper UTF-8 validation.

Benchmark | Speedup
pop/deserialize_borrowed_str | 1.09x
game/deserialize_borrowed_bytes | 1.08x
game/deserialize_owned_bytes | 1.07x
game/deserialize_borrowed_str | 1.06x
pop/deserialize_owned_str | 1.06x
game/deserialize_owned_str | 1.05x
nfl/deserialize_owned_str | 1.04x
remaining 9 benches | 1.00–1.03x

Methodology

  • Upstream side: a git worktree at the upstream 1.4.0 release commit with the fork's Criterion bench suite ported onto it, so both sides run the exact same harness and datasets (nfl.csv, game.csv, worldcitiespop.csv, gtfs-mbta-stop-times.csv from examples/data/bench/)
  • Criterion 0.5, 100 samples per benchmark, medians from estimates.json via --save-baseline
  • Run-to-run noise floor was established by re-running identical code: ±2–3% (thermal drift), so deltas below ~2% are reported as parity
  • Environment: Apple M4 Max (aarch64), macOS 26.6, rustc 1.96.0
  • Fork revision: 45b9c4c (2026-06-11)
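The median-and-noise-floor rule in the list above can be expressed as a short sketch. The names are hypothetical, the median is the plain sample median (Criterion's own estimate may differ slightly), and the threshold is passed in as a fraction, for example 0.02 for the ~2% parity cutoff.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Faster,
    Parity,
    Slower,
}

/// Sample median; `None` for an empty slice.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// Classify the fork against upstream, treating any speedup within
/// `noise` of 1.0 as parity.
pub fn classify(upstream_time: f64, fork_time: f64, noise: f64) -> Verdict {
    let speedup = upstream_time / fork_time;
    if speedup > 1.0 + noise {
        Verdict::Faster
    } else if speedup < 1.0 - noise {
        Verdict::Slower
    } else {
        Verdict::Parity
    }
}
```

With a 0.02 threshold, nfl/read_bytes (946.9 µs vs 944.2 µs) classifies as parity and nfl_trimmed/iter_str (3.934 ms vs 2.131 ms) as faster.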

History: how benchmarking caught two fake optimizations

This comparison initially showed the fork 1.3–1.7x slower than upstream on raw reads. An automated git bisect over the fork's commits traced the damage to two earlier "optimizations" whose claimed wins were measured against invalid baselines (one changed the code and the bench harness in the same commit):

  1. Replacing upstream's byte-wise scan_and_copy loop with memchr-based SIMD scanning (+79% on nfl/read_bytes — the scan runs once per field, and typical CSV fields are too short to amortize memchr's per-call setup)
  2. Merging the DFA trans/has_output arrays into an array-of-structs (+8–10% on game/read_bytes)

Both were reverted (9c50796), restoring upstream's raw-read hot loop verbatim, which is what makes the table above all-wins-or-parity. Moral: perf claims need same-harness, interleaved A/B measurement — thermal drift alone can manufacture a false 5% "win".
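To make "interleaved A/B" concrete, here is a minimal hand-rolled sketch. It is not how Criterion works and not the harness used for this page; interleaved_ab and time_once are hypothetical names. The point is only that both variants are timed in alternation within the same run, so slow drift affects both sample sets roughly equally.

```rust
use std::time::{Duration, Instant};

/// Time a single call of `f`.
fn time_once<F: FnMut()>(f: &mut F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

/// Run `a` and `b` alternately for `rounds` rounds, swapping which one
/// goes first each round, and return the per-round timings of each.
pub fn interleaved_ab<A: FnMut(), B: FnMut()>(
    rounds: usize,
    mut a: A,
    mut b: B,
) -> (Vec<Duration>, Vec<Duration>) {
    let mut a_times = Vec::with_capacity(rounds);
    let mut b_times = Vec::with_capacity(rounds);
    for round in 0..rounds {
        if round % 2 == 0 {
            a_times.push(time_once(&mut a));
            b_times.push(time_once(&mut b));
        } else {
            b_times.push(time_once(&mut b));
            a_times.push(time_once(&mut a));
        }
    }
    (a_times, b_times)
}
```

In a real measurement the closures would wrap their results in std::hint::black_box so the optimizer cannot discard the work, and the two sample sets would be compared by median rather than by a single run.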

Reproducing

# fork side
git clone -b qsv-tuned https://github.com/dathere/rust-csv
cd rust-csv && cargo bench -- --save-baseline tuned

# upstream side: worktree at the 1.4.0 release with the same harness
git worktree add --detach /tmp/csv-upstream-140 <merge-base with master>
cp benches/bench.rs /tmp/csv-upstream-140/benches/
# add criterion dev-dep + `[[bench]] harness = false` to its Cargo.toml, then:
cd /tmp/csv-upstream-140 && cargo bench -- --save-baseline upstream

# compare medians from target/criterion/*/*/<baseline>/estimates.json

Caveat: all numbers are aarch64 (Apple Silicon). An x86_64 run is a worthwhile sanity check — memchr/SIMD trade-offs differ across architectures.
