Tuned csv Fork Benchmarks
qsv patches the csv, csv-core, and
csv-index crates to the
dathere/rust-csv fork (branch qsv-tuned).
The csv crate underpins most of qsv's functionality, so every perf tweak compounds
across the whole toolkit.
This page documents how the fork performs against upstream rust-csv 1.4.0 (the release the fork is based on), measured with identical benchmark harnesses.
Bottom line: 41 of 42 benchmarks are as fast as or faster than upstream: 30 by a clear margin (up to 2.05x faster) and the rest at parity. The one exception is `mbta/serialize` at 0.93x. On every other hot path qsv exercises, the fork wins or ties.

- SIMD-accelerated UTF-8 validation via `simdutf8`; non-ASCII records are validated with a single whole-buffer pass plus per-field-boundary checks instead of per-field calls
- Non-allocating `ByteRecord` and `StringRecord` trim
- Skip the redundant ASCII trim on the `StringRecord` read path (~12% on str reads)
- First-byte gate in `deserialize_any` type inference: fields that cannot possibly be numeric (first byte not in `0-9 + - . i I n N`) skip up to five guaranteed-to-fail integer/float parse attempts (-21% str / -45% bytes on inference-heavy deserialization)
- Faster float parsing via `fast-float2` in the Serde deserializer (the bytes path also skips UTF-8 validation entirely)
- Faster float serialization via `zmij` (replaces `ryu`, ~10% on serialize)
- `needs_quotes` uses memchr-based SIMD scanning with a lookup-table fast path for short (≤16 byte) fields
- `scan_and_copy` fast-path in `read_field_dfa` (upstream only has it in `read_record_dfa`)
- A faster `is_non_numeric()` helper, `Copy` for `Position`, and assorted clippy-driven cleanups

All numbers are Criterion medians (100 samples). Speedup = upstream time / fork time;
higher is better. Throughput is per Criterion's Throughput::Bytes over each dataset.
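
As a quick check on how the table columns relate, the arithmetic can be sketched in a few lines of Rust. The function names are illustrative only, and the dataset sizes are not listed on this page, so the throughput helper takes the byte count as a parameter rather than assuming one.

```rust
/// Speedup as defined on this page: upstream time divided by fork time.
/// Values above 1.0 mean the fork is faster.
fn speedup(upstream_ms: f64, fork_ms: f64) -> f64 {
    upstream_ms / fork_ms
}

/// Throughput in MiB/s for a dataset of `bytes` processed in `millis`.
/// The byte count is an input here; this page does not state dataset sizes.
fn throughput_mib_s(bytes: u64, millis: f64) -> f64 {
    let mib = bytes as f64 / (1024.0 * 1024.0);
    let seconds = millis / 1000.0;
    mib / seconds
}

/// Relative change in run time in percent (negative means the fork is faster).
fn time_delta_pct(upstream_ms: f64, fork_ms: f64) -> f64 {
    (fork_ms - upstream_ms) / upstream_ms * 100.0
}
```

For instance, `speedup(6.084, 2.971)` comes out at roughly 2.05, which matches the first type-inference row.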
The standout. Schema-less deserialization (every field inferred as bool/int/float/string at runtime) is a common qsv access pattern.

| Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s) |
|---|---|---|---|---|
| `pop_infer/infer_borrowed_bytes` | 6.084 ms | 2.971 ms | 2.05x | 150 → 307 |
| `pop_infer/infer_owned_str` | 3.133 ms | 2.305 ms | 1.36x | 291 → 395 |

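
The first-byte gate behind these inference numbers can be illustrated with a minimal, std-only sketch. It is not the fork's code: `may_be_numeric`, `Inferred`, and `infer` are hypothetical names, the bool handling is an assumption, and only two parse attempts are shown rather than the up to five that the gate skips in the real deserializer.

```rust
/// Returns true if `field` could start a number, i.e. its first byte is one of
/// `0-9 + - . i I n N` (digit, sign, decimal point, or the start of inf/nan).
/// An empty field can never be numeric.
fn may_be_numeric(field: &[u8]) -> bool {
    matches!(
        field.first().copied(),
        Some(b'0'..=b'9' | b'+' | b'-' | b'.' | b'i' | b'I' | b'n' | b'N')
    )
}

/// Hypothetical result type for schema-less inference.
#[derive(Debug, PartialEq)]
enum Inferred<'a> {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(&'a str),
}

/// Infer a type for one field, skipping numeric parsing when the gate fails.
fn infer(field: &str) -> Inferred<'_> {
    // Assumption: only the literals "true" and "false" count as booleans.
    match field {
        "true" => return Inferred::Bool(true),
        "false" => return Inferred::Bool(false),
        _ => {}
    }
    // The gate: one byte comparison avoids parse attempts that cannot succeed.
    if may_be_numeric(field.as_bytes()) {
        if let Ok(i) = field.parse::<i64>() {
            return Inferred::Int(i);
        }
        if let Ok(f) = field.parse::<f64>() {
            return Inferred::Float(f);
        }
    }
    Inferred::Str(field)
}
```

The gate is only a filter: a field such as `Naples` passes it because of the leading `N`, fails both parses, and still ends up as a string. A field such as `Boston` never reaches the parsers at all.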

| Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s) |
|---|---|---|---|---|
| `nfl_trimmed/iter_str` | 3.934 ms | 2.131 ms | 1.85x | 331 → 611 |
| `nfl_trimmed/iter_bytes` | 2.727 ms | 1.741 ms | 1.57x | 477 → 748 |


| Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s) |
|---|---|---|---|---|
| `nfl_write/bytes` | 812.5 µs | 730.3 µs | 1.11x | 1602 → 1782 |
| `nfl_write/record` | 1.136 ms | 1.025 ms | 1.11x | 1146 → 1270 |
| `pop/serialize` | 2.298 ms | 2.149 ms | 1.07x | 397 → 424 |
| `nfl/serialize` | 1.278 ms | 1.195 ms | 1.07x | 1019 → 1089 |
| `game/serialize` | 3.967 ms | 3.793 ms | 1.05x | 625 → 654 |
| `mbta/serialize` | 725.8 µs | 782.6 µs | 0.93x | 951 → 882 |

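`mbta/serialize` is the only benchmark below parity. At 0.93x (about 7% slower) it is also one of the shortest-running benchmarks in the suite, where run-to-run noise weighs most heavily. A 7% delta is still larger than the ±2–3% thermal-drift noise floor, so these numbers alone do not establish a tie; it is better read as a small, isolated slowdown than as a systematic regression, since every other serialize benchmark is faster.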
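
On the write path, the fork's `needs_quotes` change pairs a lookup table for short (≤16 byte) fields with memchr-based scanning for longer ones. The sketch below shows only the shape of that split using the standard library. `QuoteRules` is a hypothetical type, the set of quote-forcing bytes (delimiter, quote, CR, LF) is an assumption, and the long-field branch is a plain byte loop standing in for the memchr search.

```rust
/// Fields at or below this length take the lookup-table path.
const SHORT_FIELD_MAX: usize = 16;

/// Hypothetical holder for the quoting rules of one writer.
struct QuoteRules {
    delimiter: u8,
    quote: u8,
    /// table[b] is true when byte `b` forces the field to be quoted.
    table: [bool; 256],
}

impl QuoteRules {
    fn new(delimiter: u8, quote: u8) -> Self {
        let mut table = [false; 256];
        // Assumption: these four bytes are the ones that force quoting.
        for b in [delimiter, quote, b'\r', b'\n'] {
            table[b as usize] = true;
        }
        QuoteRules { delimiter, quote, table }
    }

    fn needs_quotes(&self, field: &[u8]) -> bool {
        if field.len() <= SHORT_FIELD_MAX {
            // Short field: one table lookup per byte, no per-call setup cost.
            field.iter().any(|&b| self.table[b as usize])
        } else {
            // Long field: stand-in for a memchr-style search over the same bytes.
            field.iter().any(|&b| {
                b == self.delimiter || b == self.quote || b == b'\r' || b == b'\n'
            })
        }
    }
}
```

The point of the split is the one made elsewhere on this page: vectorized scanning pays off only when a field is long enough to amortize its setup.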

| Benchmark | upstream 1.4.0 | qsv-tuned | Speedup | Throughput (MiB/s) |
|---|---|---|---|---|
| `game/read_str` | 3.117 ms | 2.852 ms | 1.09x | 795 → 869 |
| `pop/read_str` | 1.150 ms | 1.057 ms | 1.09x | 793 → 862 |
| `game/read_bytes` | 2.781 ms | 2.595 ms | 1.07x | 892 → 955 |
| `pop/iter_str` | 2.055 ms | 1.928 ms | 1.07x | 444 → 473 |
| `game/iter_str` | 6.931 ms | 6.573 ms | 1.05x | 358 → 377 |
| `mbta/read_str` | 643.0 µs | 612.9 µs | 1.05x | 1073 → 1126 |
| `nfl/read_str` | 1.092 ms | 1.060 ms | 1.03x | 1192 → 1228 |
| `mbta/iter_str` | 1.085 ms | 1.056 ms | 1.03x | 636 → 653 |
| `nfl/iter_str` | 1.481 ms | 1.442 ms | 1.03x | 879 → 903 |
| `mbta/read_bytes` | 599.5 µs | 584.5 µs | 1.03x | 1151 → 1181 |
| `game/iter_bytes` | 6.670 ms | 6.538 ms | 1.02x | 372 → 379 |
| `pop/iter_bytes` | 1.899 ms | 1.868 ms | 1.02x | 480 → 488 |
| `nfl/iter_bytes` | 1.398 ms | 1.387 ms | 1.01x | 931 → 938 |
| `mbta/iter_bytes` | 1.035 ms | 1.030 ms | 1.01x | 666 → 670 |
| `pop/read_bytes` | 971.2 µs | 966.8 µs | 1.00x | 938 → 943 |
| `nfl/read_bytes` | 946.9 µs | 944.2 µs | 1.00x | 1374 → 1378 |

The read_str advantage comes from the single-pass UTF-8 validation;
read_bytes parity is by design (the fork's raw-read hot loop is upstream's,
see "History" below).
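
To make the single-pass idea concrete, here is a minimal sketch under stated assumptions: a record is held as one contiguous byte buffer plus a list of exclusive field end offsets that cover the whole buffer, and `std::str::from_utf8` stands in for `simdutf8`. Both function names are hypothetical. The first validates field by field; the second validates the buffer once and then only checks that no field boundary splits a multi-byte character.

```rust
/// Baseline: one UTF-8 validation call per field.
fn valid_per_field(buf: &[u8], ends: &[usize]) -> bool {
    let mut start = 0;
    for &end in ends {
        // `get` returns None for out-of-range or reversed offsets.
        match buf.get(start..end) {
            Some(field) if std::str::from_utf8(field).is_ok() => start = end,
            _ => return false,
        }
    }
    true
}

/// Single pass: validate the whole buffer once, then check each boundary.
fn valid_single_pass(buf: &[u8], ends: &[usize]) -> bool {
    // One validation call for the entire record.
    let s = match std::str::from_utf8(buf) {
        Ok(s) => s,
        Err(_) => return false,
    };
    let mut start = 0;
    for &end in ends {
        // A boundary inside a multi-byte character would leave two invalid
        // fields even though the buffer as a whole is valid UTF-8.
        // `is_char_boundary` also returns false for offsets past the end.
        if end < start || !s.is_char_boundary(end) {
            return false;
        }
        start = end;
    }
    true
}
```

The boundary check is what keeps the two approaches equivalent: the bytes of `é` split across two fields pass a whole-buffer check but must still be rejected.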
All 16 `deserialize` benchmarks land between 1.00x and 1.09x: modest but uniformly at or above parity. These paths are dominated by serde dispatch; the gains come from `fast-float2` and the cheaper UTF-8 validation.

| Benchmark | Speedup |
|---|---|
| `pop/deserialize_borrowed_str` | 1.09x |
| `game/deserialize_borrowed_bytes` | 1.08x |
| `game/deserialize_owned_bytes` | 1.07x |
| `game/deserialize_borrowed_str` | 1.06x |
| `pop/deserialize_owned_str` | 1.06x |
| `game/deserialize_owned_str` | 1.05x |
| `nfl/deserialize_owned_str` | 1.04x |
| remaining 9 benches | 1.00–1.03x |

- Upstream side: a git worktree at the upstream `1.4.0` release commit with the fork's Criterion bench suite ported onto it, so both sides run the exact same harness and datasets (`nfl.csv`, `game.csv`, `worldcitiespop.csv`, `gtfs-mbta-stop-times.csv` from `examples/data/bench/`)
- Criterion 0.5, 100 samples per benchmark, medians from `estimates.json` via `--save-baseline`
- Run-to-run noise floor was established by re-running identical code: ±2–3% (thermal drift), so single-digit deltas below ~2% are reported as parity
- Environment: Apple M4 Max (aarch64), macOS 26.6, rustc 1.96.0
- Fork revision: `45b9c4c` (2026-06-11)

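
The parity rule in the methodology can be written down as a small classifier. This is a sketch of the reporting rule only, with hypothetical names; the threshold is passed in as a fraction (0.02 for the ~2% cutoff).

```rust
/// How a single benchmark result is reported.
#[derive(Debug, PartialEq)]
enum Verdict {
    Faster,
    Parity,
    Slower,
}

/// Classify a speedup (upstream time / fork time) against a noise threshold.
/// Results within `threshold` of 1.0 are treated as parity.
fn classify(speedup: f64, threshold: f64) -> Verdict {
    if speedup >= 1.0 + threshold {
        Verdict::Faster
    } else if speedup <= 1.0 - threshold {
        Verdict::Slower
    } else {
        Verdict::Parity
    }
}
```

With a 0.02 threshold, a 1.01x result is reported as parity and a 1.05x result as faster.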

History

This comparison initially showed the fork 1.3–1.7x slower than upstream on
raw reads. An automated git bisect over the fork's commits traced the damage to
two earlier "optimizations" whose claimed wins were measured against invalid
baselines (one changed the code and the bench harness in the same commit):

- Replacing upstream's byte-wise `scan_and_copy` loop with memchr-based SIMD scanning (+79% on `nfl/read_bytes`; the scan runs once per field, and typical CSV fields are too short to amortize memchr's per-call setup)
- Merging the DFA `trans`/`has_output` arrays into an array-of-structs (+8–10% on `game/read_bytes`)

Both were reverted (`9c50796`), restoring upstream's raw-read hot loop verbatim,
which is what makes the read table above all-wins-or-parity. Moral: perf claims need
same-harness, interleaved A/B measurement; thermal drift alone can manufacture
a false 5% "win".
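
What "interleaved" means in practice can be shown with a std-only sketch. This is not Criterion and not the harness used for the numbers on this page. `interleaved_ab` and `median` are hypothetical helpers that alternate the two candidates round by round, so slow drift such as thermal throttling affects both sides roughly equally instead of penalizing whichever ran second.

```rust
use std::time::{Duration, Instant};

/// Median of the samples (upper middle for an even count). Sorts in place.
fn median(samples: &mut [Duration]) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    samples[samples.len() / 2]
}

/// Time `a` and `b` alternately for `rounds` rounds and return their medians.
fn interleaved_ab<A: FnMut(), B: FnMut()>(
    mut a: A,
    mut b: B,
    rounds: usize,
) -> (Duration, Duration) {
    assert!(rounds > 0, "need at least one round");
    let mut times_a = Vec::with_capacity(rounds);
    let mut times_b = Vec::with_capacity(rounds);
    for _ in 0..rounds {
        // A then B inside the same round: both see similar machine state.
        let t = Instant::now();
        a();
        times_a.push(t.elapsed());

        let t = Instant::now();
        b();
        times_b.push(t.elapsed());
    }
    (median(&mut times_a), median(&mut times_b))
}
```

Running all of A and then all of B, by contrast, lets any drift during the run show up as a difference between the two.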

```bash
# fork side
git clone -b qsv-tuned https://github.com/dathere/rust-csv
cd rust-csv && cargo bench -- --save-baseline tuned

# upstream side: worktree at the 1.4.0 release with the same harness
git worktree add --detach /tmp/csv-upstream-140 <merge-base with master>
cp benches/bench.rs /tmp/csv-upstream-140/benches/
# add criterion dev-dep + `[[bench]] harness = false` to its Cargo.toml, then:
cd /tmp/csv-upstream-140 && cargo bench -- --save-baseline upstream

# compare medians from target/criterion/*/*/<baseline>/estimates.json
```

Caveat: all numbers are aarch64 (Apple Silicon). An x86_64 run is a worthwhile sanity check, since memchr/SIMD trade-offs differ across architectures.