Tuned csv Fork Benchmarks

Tuned csv Fork — Benchmarks vs Upstream rust-csv 1.4.0

qsv patches the csv, csv-core, and csv-index crates to the dathere/rust-csv fork (branch qsv-tuned). The csv crate underpins most of qsv's functionality, so every perf tweak compounds across the whole toolkit.

This page documents how the fork performs against upstream rust-csv 1.4.0 (the release the fork is based on), measured with identical benchmark harnesses.

Bottom line: 41 of 42 benchmarks are as fast as or faster than upstream: 30 by a clear margin (up to 2.05x faster) and the rest at parity. The one exception is mbta/serialize at 0.93x. The fork matches or beats upstream on every read and deserialization path qsv exercises.

What the fork changes

SIMD-accelerated UTF-8 validation via simdutf8; non-ASCII records are validated with a single whole-buffer pass plus per-field-boundary checks instead of per-field calls
Non-allocating ByteRecord and StringRecord trim
Skip the redundant ASCII trim on the StringRecord read path (~12% on str reads)
First-byte gate in deserialize_any type inference — fields that cannot possibly be numeric (first byte not in 0-9 + - . i I n N) skip up to five guaranteed-to-fail integer/float parse attempts (-21% str/-45% bytes on inference-heavy deserialization)
Faster float parsing via fast-float2 in the Serde deserializer (the bytes path also skips UTF-8 validation entirely)
Faster float serialization via zmij (replaces ryu, ~10% on serialize)
needs_quotes uses memchr-based SIMD scanning with a lookup-table fast path for short (≤16 byte) fields
scan_and_copy fast-path in read_field_dfa (upstream only has it in read_record_dfa)
A faster is_non_numeric() helper, Copy for Position, and assorted clippy-driven cleanups
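
To make the lookup-table idea behind the needs_quotes item concrete, here is a minimal sketch of a table-driven check for short fields. It is not the fork's code: the function names and the set of bytes that force quoting (delimiter, quote, CR, LF) are assumptions chosen for illustration.

```rust
/// Builds a 256-entry table: `table[b]` is true when byte `b` forces the
/// field to be quoted. The byte set used here (delimiter, quote, CR, LF)
/// is an assumption for this sketch.
fn build_quote_table(delimiter: u8, quote: u8) -> [bool; 256] {
    let mut table = [false; 256];
    for b in [delimiter, quote, b'\r', b'\n'] {
        table[b as usize] = true;
    }
    table
}

/// Short-field path: one table lookup per byte, with no per-call setup
/// cost of the kind a SIMD search routine would have.
fn needs_quotes_short(field: &[u8], table: &[bool; 256]) -> bool {
    field.iter().any(|&b| table[b as usize])
}
```

With a table built by `build_quote_table(b',', b'"')`, a field such as `a,b` is flagged and `plain` is not. A real writer would presumably switch to a memchr-style scan above some length threshold; the 16-byte cutoff mentioned above is the fork's figure, not something this sketch derives.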

Results

All numbers are Criterion medians (100 samples). Speedup = upstream time / fork time; higher is better. Throughput is per Criterion's Throughput::Bytes over each dataset.
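
As a worked check of the speedup definition, the sketch below recomputes the ratio from a pair of medians. The function names are illustrative and are not part of the bench harness.

```rust
/// Speedup as defined on this page: upstream time divided by fork time.
/// Both times must be expressed in the same unit.
fn speedup(upstream: f64, fork: f64) -> f64 {
    upstream / fork
}

/// Relative change in the fork's run time versus upstream, in percent.
/// Negative values mean the fork is faster.
fn time_delta_percent(upstream: f64, fork: f64) -> f64 {
    (fork - upstream) / upstream * 100.0
}
```

For `pop_infer/infer_borrowed_bytes`, `speedup(6.084, 2.971)` comes out at about 2.05. For `mbta/serialize`, `speedup(725.8, 782.6)` is about 0.93, which corresponds to a run-time increase of roughly 7.8%.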

Inference-typed deserialization (`deserialize_any`)

The standout. Schema-less deserialization (every field inferred as bool/int/float/string at runtime) is a common qsv access pattern.

Benchmark	upstream 1.4.0	qsv-tuned	Speedup	Throughput (MiB/s)
`pop_infer/infer_borrowed_bytes`	6.084 ms	2.971 ms	2.05x	150 → 307
`pop_infer/infer_owned_str`	3.133 ms	2.305 ms	1.36x	291 → 395
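
The first-byte gate behind these numbers can be sketched as follows. This is a simplified illustration, not the fork's deserializer: the names `may_be_numeric`, `Inferred`, and `infer` are hypothetical, bool inference is left out, and only `i64` and `f64` are tried.

```rust
/// Hypothetical first-byte gate: returns false when the field cannot
/// possibly parse as an integer or float, so the caller can skip the
/// numeric parse attempts entirely.
fn may_be_numeric(field: &[u8]) -> bool {
    match field.first().copied() {
        // Digits, sign, decimal point, and the first letters of
        // "inf"/"Infinity" and "nan"/"NaN".
        Some(b'0'..=b'9' | b'+' | b'-' | b'.' | b'i' | b'I' | b'n' | b'N') => true,
        _ => false,
    }
}

#[derive(Debug, PartialEq)]
enum Inferred<'a> {
    Int(i64),
    Float(f64),
    Str(&'a str),
}

/// Simplified inference: numeric parses run only when the gate passes.
fn infer(field: &str) -> Inferred<'_> {
    if may_be_numeric(field.as_bytes()) {
        if let Ok(i) = field.parse::<i64>() {
            return Inferred::Int(i);
        }
        if let Ok(f) = field.parse::<f64>() {
            return Inferred::Float(f);
        }
    }
    Inferred::Str(field)
}
```

A field like `Boston` fails the gate and goes straight to the string case. A field like `north` passes the gate, fails both parses, and still ends up as a string, so the gate only removes work and never changes the result.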

Trimmed reads

Benchmark	upstream 1.4.0	qsv-tuned	Speedup	Throughput (MiB/s)
`nfl_trimmed/iter_str`	3.934 ms	2.131 ms	1.85x	331 → 611
`nfl_trimmed/iter_bytes`	2.727 ms	1.741 ms	1.57x	477 → 748
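
One way to trim without allocating is to return a borrowed sub-slice of the original field. The sketch below shows that idea on a single field; `trim_ascii_field` is a hypothetical helper, and the fork's ByteRecord and StringRecord trim may be implemented differently.

```rust
/// Returns the sub-slice of `field` with leading and trailing ASCII
/// whitespace removed. The result borrows from the input, so nothing
/// is allocated or copied.
fn trim_ascii_field(field: &[u8]) -> &[u8] {
    // Index of the first non-whitespace byte, or the length if none.
    let start = field
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(field.len());
    // One past the last non-whitespace byte; equals `start` when the
    // field is empty or all whitespace, which yields an empty slice.
    let end = field
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &field[start..end]
}
```

Interior whitespace is preserved: only the two ends of the field move.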

Serialization / writes

Benchmark	upstream 1.4.0	qsv-tuned	Speedup	Throughput (MiB/s)
`nfl_write/bytes`	812.5 µs	730.3 µs	1.11x	1602 → 1782
`nfl_write/record`	1.136 ms	1.025 ms	1.11x	1146 → 1270
`pop/serialize`	2.298 ms	2.149 ms	1.07x	397 → 424
`nfl/serialize`	1.278 ms	1.195 ms	1.07x	1019 → 1089
`game/serialize`	3.967 ms	3.793 ms	1.05x	625 → 654
`mbta/serialize`	725.8 µs	782.6 µs	0.93x	951 → 882

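mbta/serialize is the only benchmark below parity. At 0.93x (782.6 µs vs 725.8 µs, about 7.8% slower) the gap is larger than the ±2–3% run-to-run noise floor described under Methodology, so it reads as a small regression rather than a tie. It is one of the shorter-running benchmarks in the suite, so it is worth re-measuring before treating the gap as settled.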

Raw reads (ByteRecord / StringRecord)

Benchmark	upstream 1.4.0	qsv-tuned	Speedup	Throughput (MiB/s)
`game/read_str`	3.117 ms	2.852 ms	1.09x	795 → 869
`pop/read_str`	1.150 ms	1.057 ms	1.09x	793 → 862
`game/read_bytes`	2.781 ms	2.595 ms	1.07x	892 → 955
`pop/iter_str`	2.055 ms	1.928 ms	1.07x	444 → 473
`game/iter_str`	6.931 ms	6.573 ms	1.05x	358 → 377
`mbta/read_str`	643.0 µs	612.9 µs	1.05x	1073 → 1126
`nfl/read_str`	1.092 ms	1.060 ms	1.03x	1192 → 1228
`mbta/iter_str`	1.085 ms	1.056 ms	1.03x	636 → 653
`nfl/iter_str`	1.481 ms	1.442 ms	1.03x	879 → 903
`mbta/read_bytes`	599.5 µs	584.5 µs	1.03x	1151 → 1181
`game/iter_bytes`	6.670 ms	6.538 ms	1.02x	372 → 379
`pop/iter_bytes`	1.899 ms	1.868 ms	1.02x	480 → 488
`nfl/iter_bytes`	1.398 ms	1.387 ms	1.01x	931 → 938
`mbta/iter_bytes`	1.035 ms	1.030 ms	1.01x	666 → 670
`pop/read_bytes`	971.2 µs	966.8 µs	1.00x	938 → 943
`nfl/read_bytes`	946.9 µs	944.2 µs	1.00x	1374 → 1378

The read_str advantage comes from the single-pass UTF-8 validation; read_bytes parity is by design (the fork's raw-read hot loop is upstream's, see "History" below).
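
The single-pass validation can be sketched like this. The layout (one contiguous buffer plus exclusive field end offsets) and the function name are assumptions for illustration, and the standard library's `std::str::from_utf8` stands in for simdutf8 so the example has no dependencies.

```rust
/// Validates every field of a record with one pass over the whole buffer.
/// `ends[i]` is the exclusive end offset of field `i` within `buf`
/// (a hypothetical layout chosen for this sketch).
fn record_is_utf8(buf: &[u8], ends: &[usize]) -> bool {
    // Pure-ASCII buffers are always valid UTF-8, and every offset in
    // them is a character boundary.
    if buf.is_ascii() {
        return true;
    }
    // One validation pass over the whole record.
    let s = match std::str::from_utf8(buf) {
        Ok(s) => s,
        Err(_) => return false,
    };
    // A multi-byte sequence could straddle two fields: the buffer as a
    // whole would be valid while the individual fields are not. Reject
    // the record if any field end falls in the middle of a character.
    ends.iter().all(|&end| s.is_char_boundary(end))
}
```

The boundary check is what keeps the whole-buffer pass equivalent to validating each field separately: the bytes `a`, `0xC3`, `0xA9`, `b` are valid as one buffer, but splitting after offset 2 would cut the two-byte character in half.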

Typed deserialization (Serde structs)

All 16 benchmarks land between 1.00x and 1.09x faster — modest but uniformly positive. These paths are dominated by serde dispatch; the gains come from fast-float2 and the cheaper UTF-8 validation.

Benchmark	Speedup
`pop/deserialize_borrowed_str`	1.09x
`game/deserialize_borrowed_bytes`	1.08x
`game/deserialize_owned_bytes`	1.07x
`game/deserialize_borrowed_str`	1.06x
`pop/deserialize_owned_str`	1.06x
`game/deserialize_owned_str`	1.05x
`nfl/deserialize_owned_str`	1.04x
remaining 9 benches	1.00–1.03x

Methodology

Upstream side: a git worktree at the upstream 1.4.0 release commit with the fork's Criterion bench suite ported onto it, so both sides run the exact same harness and datasets (nfl.csv, game.csv, worldcitiespop.csv, gtfs-mbta-stop-times.csv from examples/data/bench/)
Criterion 0.5, 100 samples per benchmark, medians from estimates.json via --save-baseline
Run-to-run noise floor, established by re-running identical code, is ±2–3% (thermal drift), so deltas below ~2% are reported as parity
Environment: Apple M4 Max (aarch64), macOS 26.6, rustc 1.96.0
Fork revision: 45b9c4c (2026-06-11)
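
The parity rule can be expressed as a small classifier. This is a sketch of the reporting convention described above, not code from the bench suite; `Verdict`, `classify`, and the `noise` parameter are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Faster,
    Parity,
    Slower,
}

/// Classifies a fork result against upstream. `noise` is the run-to-run
/// noise floor as a fraction of the upstream time (0.02 means 2%).
fn classify(upstream: f64, fork: f64, noise: f64) -> Verdict {
    // Relative change in run time; negative means the fork is faster.
    let delta = (fork - upstream) / upstream;
    if delta < -noise {
        Verdict::Faster
    } else if delta > noise {
        Verdict::Slower
    } else {
        Verdict::Parity
    }
}
```

With a 2% threshold, `nfl/read_bytes` (946.9 µs vs 944.2 µs) classifies as `Parity` and `pop_infer/infer_borrowed_bytes` (6.084 ms vs 2.971 ms) as `Faster`.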

History: how benchmarking caught two fake optimizations

This comparison initially showed the fork 1.3–1.7x slower than upstream on raw reads. An automated git bisect over the fork's commits traced the damage to two earlier "optimizations" whose claimed wins were measured against invalid baselines (one changed the code and the bench harness in the same commit):

Replacing upstream's byte-wise scan_and_copy loop with memchr-based SIMD scanning (+79% run time on nfl/read_bytes: the scan runs once per field, and typical CSV fields are too short to amortize memchr's per-call setup)
Merging the DFA trans/has_output arrays into an array-of-structs (+8–10% run time on game/read_bytes)

Both were reverted (9c50796), restoring upstream's raw-read hot loop verbatim, which is why the raw-read table above shows only wins or parity. Moral: perf claims need same-harness, interleaved A/B measurement; thermal drift alone can manufacture a false 5% "win".
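
For readers unfamiliar with the pattern, a byte-wise scan-and-copy loop looks roughly like the sketch below. It is an illustration of the general technique, not csv-core's actual code: the signature and the set of stop bytes are assumptions.

```rust
/// Copies bytes from `input` to `output` until it reaches a byte that
/// needs special handling (delimiter, quote, CR, or LF in this sketch)
/// or until either buffer runs out. Returns the number of bytes copied.
fn scan_and_copy(input: &[u8], output: &mut [u8], delimiter: u8, quote: u8) -> usize {
    let mut n = 0;
    while n < input.len() && n < output.len() {
        let b = input[n];
        if b == delimiter || b == quote || b == b'\r' || b == b'\n' {
            break;
        }
        output[n] = b;
        n += 1;
    }
    n
}
```

The loop has no setup cost, so on a field of a handful of bytes it finishes almost immediately. A SIMD search pays a fixed cost per call before it can scan in wide strides, which is consistent with the slowdown reported above for short fields.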

Reproducing

# fork side
git clone -b qsv-tuned https://github.com/dathere/rust-csv
cd rust-csv && cargo bench -- --save-baseline tuned

# upstream side: worktree at the 1.4.0 release with the same harness
git worktree add --detach /tmp/csv-upstream-140 <merge-base with master>
cp benches/bench.rs /tmp/csv-upstream-140/benches/
# add criterion dev-dep + `[[bench]] harness = false` to its Cargo.toml, then:
cd /tmp/csv-upstream-140 && cargo bench -- --save-baseline upstream

# compare medians from target/criterion/*/*/<baseline>/estimates.json

Caveat: all numbers are aarch64 (Apple Silicon). An x86_64 run is a worthwhile sanity check — memchr/SIMD trade-offs differ across architectures.
