perf(stats): three opt-in micro-optimizations (simdutf8 output, t-digest quantiles, mode-cardinality cap) by jqnatividad · Pull Request #3839 · dathere/qsv

jqnatividad · 2026-05-10T14:27:27Z

Summary

Three independently-shippable performance opportunities for qsv stats,
all opt-in or behavior-preserving on default flags. Each landed as a
discrete commit so they can be reverted or cherry-picked individually.

* 9e937e50e: simdutf8 + Cow in output. Replaces 7
  String::from_utf8_lossy sites in to_record and format_antimodes
  with a bytes_to_cow_str helper that uses SIMD-validated UTF-8 and
  returns a borrowed &str on the happy path. Byte-identical for valid
  UTF-8 (the entire test universe); the only difference is faster validation.
* e99fe9b5b: --quantile-method approx (t-digest). Opt-in
  approximate-quantile engine for --median, --quartiles, and
  --percentiles, backed by the Apache DataSketches t-digest port
  (datasketches crate). Trades exactness for O(K) memory (~200 centroids)
  and O(1) quantile reads with ~1% rank error. The default, exact, produces
  byte-identical output to today, asserted by a regression test.
  Restrictions enforced at parse time: --weight is rejected (upstream
  TDigestMut has no public weighted-update); --mad warns and is
  disabled (algorithmically incompatible with t-digest).
* b20a7a407: --mode-cardinality-cap N. Opt-in upper bound on
  mode-tracking memory. When N > 0, qsv drops the mode tracker for any
  column whose Unsorted::len() (unweighted, total samples) or
  HashMap::len() (weighted, unique keys) exceeds N, and emits
  sentinels (*HIGH_CARDINALITY / >=N) instead of crashing or
  truncating silently. Default 0 = unbounded (today's behavior).
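The parse-time restrictions on --quantile-method listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/cmd/stats.rs: the `QuantileMethod` enum, the `QuantileConfig` struct, the `validate_quantile_flags` function, and all message strings are hypothetical names invented for this sketch.

```rust
/// Hypothetical enum for the two engines named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantileMethod {
    Exact,
    Approx,
}

/// Hypothetical result of flag validation.
#[derive(Debug, PartialEq, Eq)]
struct QuantileConfig {
    method: QuantileMethod,
    mad_enabled: bool,
    /// Warning to print on stderr, if any (wording is illustrative).
    warning: Option<String>,
}

/// Parse the flag value; anything other than "exact" or "approx" is an error.
fn parse_quantile_method(value: &str) -> Result<QuantileMethod, String> {
    match value {
        "exact" => Ok(QuantileMethod::Exact),
        "approx" => Ok(QuantileMethod::Approx),
        other => Err(format!("invalid --quantile-method value: {other}")),
    }
}

/// Apply the two restrictions from the PR description:
/// approx + --weight is rejected, approx + --mad warns and disables MAD.
fn validate_quantile_flags(
    method: &str,
    weight_column: Option<&str>,
    mad_requested: bool,
) -> Result<QuantileConfig, String> {
    let method = parse_quantile_method(method)?;
    if method == QuantileMethod::Approx {
        if weight_column.is_some() {
            return Err("--quantile-method approx cannot be combined with --weight".to_string());
        }
        if mad_requested {
            return Ok(QuantileConfig {
                method,
                mad_enabled: false,
                warning: Some("--mad is disabled under --quantile-method approx".to_string()),
            });
        }
    }
    Ok(QuantileConfig { method, mad_enabled: mad_requested, warning: None })
}
```

In this sketch the exact engine passes through untouched, which mirrors the claim that default behavior is unchanged.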

Both new flags invalidate the stats cache via StatsArgs JSON sidecar
when changed, so cached output never silently mismatches the requested
algorithm.

5e6d92e39 and fc91fcba5 address LOW review findings on the t-digest
and cap commits respectively (review jobs 2002 and 2003) — comment
cleanups, the --jobs 1 determinism note, dropped unused Serde impls,
single-invocation test capture, plus added coverage for the
weighted-cap branch, the multi-chunk post-merge re-check, and the
--cardinality-only sentinel-field count.

Help docs auto-regenerated via qsv --generate-help-md.

Diff: 972 insertions / 195 deletions across 6 files; the bulk of the
deletions are formatter realignments triggered by widening the field-name
column in Args / StatsArgs / WhichStats.

Test plan

* cargo build --locked --bin qsv -F all_features
* cargo test --locked stats -F all_features: 740 passed, 0 failed
  (728 baseline + 5 t-digest + 6 mode-cap + 1 empty-numeric edge case)
* cargo clippy --locked --bin qsv -F all_features -- -D warnings: clean
* Smoke test approx mode against exact mode on uniform 1..=10_000:
  q1/q2/q3 within 2% envelope
* Smoke test cap=3 on a 10-row file: *HIGH_CARDINALITY and >=3
  emitted; cap=100 on the same file: normal output
* Smoke test parse-time rejection: --weight + --quantile-method approx
  errors out with a clear message; invalid --quantile-method nonsense
  errors out
* Reviewer: please run scripts/benchmarks.sh to quantify the perf
  delta on the NYC 311 1M-row dataset (didn't run locally because the
  script requires the dataset to be present)
* Reviewer: please confirm --quantile-method approx output reads
  sensibly under multi-thread (--jobs 4). The test suite pins
  --jobs 1 for deterministic snapshots; --jobs >= 2 is documented
  as approximate-only
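For anyone repeating the approx-vs-exact smoke test by hand, the envelope comparison might look like the sketch below. Several things here are assumptions: the helper names `exact_quantile` and `within_envelope` are hypothetical, the exact quantile uses linear interpolation between closest ranks (qsv's own method may differ), and "2% envelope" is read as 2% of the data range, which on uniform input coincides with rank error.

```rust
/// Exact quantile via linear interpolation between closest ranks.
/// `sorted` must already be in ascending order. Interpolation choice is an assumption.
fn exact_quantile(sorted: &[f64], q: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}

/// True when `approx` is within `tol` (a fraction of `range`) of `exact`.
fn within_envelope(approx: f64, exact: f64, range: f64, tol: f64) -> bool {
    (approx - exact).abs() <= tol * range
}
```

The approximate value itself would come from qsv's output under --quantile-method approx; this sketch only covers the comparison side.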

🤖 Generated with Claude Code

Replace `String::from_utf8_lossy` with a `bytes_to_cow_str` helper in the seven hot output paths in `to_record` and `format_antimodes`. The helper uses `simdutf8::basic::from_utf8` (SIMD-validated, already a dep) and returns a borrowed `&str` on the valid-UTF-8 happy path with no allocation, falling back to `from_utf8_lossy` only on invalid UTF-8.

Behavior is byte-identical for valid UTF-8 inputs (the entire test universe). On invalid UTF-8 the fallback preserves the previous lossy substitution.

Touches mode/antimode/min/max formatting in to_record + format_antimodes; the three remaining `from_utf8_lossy` call sites (header processing, header validation, stderr formatting) are not in the output hot path and are intentionally left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
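For readers skimming the commit, the helper's shape can be sketched roughly as below. This is not the code from src/cmd/stats.rs: the real helper validates with `simdutf8::basic::from_utf8`, while this std-only sketch swaps in `std::str::from_utf8` so it compiles without the crate. The function body is an assumption reconstructed from the description above.

```rust
use std::borrow::Cow;

/// Sketch of a `bytes_to_cow_str`-style helper.
/// ASSUMPTION: `std::str::from_utf8` stands in for `simdutf8::basic::from_utf8`;
/// both return `Ok(&str)` for valid input, only the validation speed differs.
fn bytes_to_cow_str(bytes: &[u8]) -> Cow<'_, str> {
    match std::str::from_utf8(bytes) {
        // Happy path: borrow the input, no allocation.
        Ok(valid) => Cow::Borrowed(valid),
        // Invalid UTF-8: keep the previous lossy substitution (U+FFFD).
        Err(_) => String::from_utf8_lossy(bytes),
    }
}
```

Because the fallback is still `from_utf8_lossy`, output on invalid bytes stays the same as before; only the validation step on the valid path changes.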

Adds an opt-in approximate-quantile engine for `--median`, `--quartiles`, and `--percentiles`, backed by the Apache DataSketches t-digest port (`datasketches` crate, based on Dunning's MergingDigest paper).

Why: the exact engine accumulates every numeric value in `Unsorted<f64>` and sorts it during `to_record`, costing O(N) memory and O(N log N) sort per numeric column. On wide numeric files this dominates the runtime. T-digest yields all four quantile readers from O(K) memory (K~200 centroids) with O(1) lookups and ~1% rank error (tighter at the tails).

Default behavior unchanged: `--quantile-method exact` is the default and produces byte-identical output to today (verified by a regression test).

A new TDigestSlot wrapper provides Clone / PartialEq / skip-Serde for the TDigestMut field on `Stats` without bleeding upstream traits.

Restrictions enforced at parse time:
* `--quantile-method approx` + `--weight` is rejected. The upstream TDigestMut::update is unweighted only and exposes no public weighted-update API; weighted approx will need an upstream change to land.
* `--quantile-method approx` + `--mad` (or `--everything`) prints a warning and disables MAD; MAD requires `median(|x - median|)` which a t-digest cannot produce in a single pass.

Determinism note: TDigestMut::merge is associative but not chunk-count-invariant. Output may shift by ~1% across runs with different `--jobs` values. New tests pin `--jobs 1` to avoid flakes.

Tests added (all passing):
* approx quartiles within ±2% envelope on N=10_000 uniform input
* approx + --weight is rejected with clear error
* approx + --mad warns and drops MAD column
* invalid --quantile-method value is rejected
* default and explicit-exact produce byte-identical output

`stats.csv.json` cache key includes flag_quantile_method, so changing the engine invalidates the cache.

Help doc regenerated via `qsv --generate-help-md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
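To illustrate why `--mad` is dropped in approx mode, here is a minimal exact-MAD sketch. It needs the median first and then a second pass over the raw values, which a t-digest no longer retains. The function names `median` and `mad` are illustrative and not taken from qsv, and the even-length median convention (mean of the two middle values) is an assumption.

```rust
/// Median of a slice; sorts in place. Returns None on empty input.
fn median(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let n = values.len();
    Some(if n % 2 == 1 {
        values[n / 2]
    } else {
        (values[n / 2 - 1] + values[n / 2]) / 2.0
    })
}

/// Median absolute deviation: median(|x - median(x)|).
fn mad(values: &[f64]) -> Option<f64> {
    // Pass 1: the median of the raw values.
    let mut sorted = values.to_vec();
    let center = median(&mut sorted)?;
    // Pass 2: revisit every raw value to build the deviations,
    // which is exactly what a centroid summary cannot replay.
    let mut deviations: Vec<f64> = values.iter().map(|x| (x - center).abs()).collect();
    median(&mut deviations)
}
```

For example, `mad(&[1.0, 1.0, 2.0, 2.0, 4.0, 6.0, 9.0])` has median 2 and deviations whose median is 1.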

Adds an opt-in upper bound on mode-tracking memory for high-cardinality columns. When `--mode-cardinality-cap N` is set (default 0 = unbounded, preserving today's behavior), qsv drops the mode tracker for any column whose `Unsorted::len()` (unweighted) or `HashMap::len()` (weighted) exceeds N during ingestion or merge.

Output emits sentinels for the affected column rather than crashing or truncating silently:
* mode / antimode columns: "*HIGH_CARDINALITY"
* cardinality column: ">=<N>" (the ">=" prefix DOES break downstream parsers expecting a plain integer; the cap is opt-in only).

Why: stats currently allocates `Unsorted<Vec<u8>>` of capacity = record_count for every cardinality/mode-tracked column. On wide tables with ID/UUID/timestamp columns this is largely wasted work: the column is unique-per-row, the eventual `*ALL` short-circuit fires at output time, but ingestion has already paid full cost (per-cell `sample.to_vec()` heap-allocs plus the eventual sort during `cardinality()`). With a cap, the tracker is dropped early and the sentinel signals "we gave up; treat as approximate."

Implementation:
* `flag_mode_cardinality_cap: u64` plumbed through Args / StatsArgs / WhichStats. Cache invalidates on cap change (StatsArgs JSON sidecar).
* `Stats.modes_dropped: bool` flag set when the cap fires; sticky across parallel-chunk merges. Post-merge re-checks the cap to catch chunks that individually stayed below it but combined cross it.
* `to_record` short-circuits to sentinel emission when `modes_dropped` is true, sitting BEFORE the existing weighted/unweighted dispatch.
* HLL-style probabilistic alternatives intentionally rejected: they would make exact `cardinality` non-deterministic and break the stats cache validity model (downstream commands like `schema` and `tojsonl` consume cardinality as exact). See planning notes.

Tests added (all passing):
* default (cap=0) is byte-identical to omitting the flag
* cap trips on a high-cardinality column → sentinels emitted
* cap does not trip on a low-cardinality column → exact values

Help doc regenerated via `qsv --generate-help-md`.

The cache-hint skip half of the original plan (read prior cache to skip mode-tracking from the start) is deferred. The runtime cap delivers the memory bound without needing any cross-run state, and profiling can decide whether the cache-hint path is worth the additional complexity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
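The sticky `modes_dropped` flag and the post-merge re-check described above can be sketched as follows. This is a simplified illustration, not the qsv implementation: `ModeTracker` and its methods are hypothetical, a plain `Vec<Vec<u8>>` stands in for `Unsorted<Vec<u8>>`, only the unweighted path is shown, and the `> cap` comparison follows the "exceeds N" wording of the commit message (the real comparison may differ).

```rust
/// Sentinel text taken from the commit message above.
const HIGH_CARDINALITY: &str = "*HIGH_CARDINALITY";

/// Hypothetical stand-in for the unweighted mode tracker on `Stats`.
struct ModeTracker {
    /// `None` once the tracker has been dropped.
    samples: Option<Vec<Vec<u8>>>,
    /// Sticky flag: once set, it stays set across merges.
    modes_dropped: bool,
}

impl ModeTracker {
    /// Build a tracker from one chunk's values (illustrative helper).
    fn from_chunk(values: &[&str]) -> Self {
        ModeTracker {
            samples: Some(values.iter().map(|v| v.as_bytes().to_vec()).collect()),
            modes_dropped: false,
        }
    }

    /// Merge another chunk's tracker, then re-check the cap on the combined size.
    fn merge(&mut self, other: ModeTracker, cap: u64) {
        if let (Some(mine), Some(theirs)) = (self.samples.as_mut(), other.samples) {
            mine.extend(theirs);
        }
        self.modes_dropped |= other.modes_dropped;
        // Chunks that stayed below the cap on their own can cross it together.
        let over_cap = cap > 0
            && self.samples.as_ref().map_or(false, |s| s.len() as u64 > cap);
        if self.modes_dropped || over_cap {
            self.modes_dropped = true;
            self.samples = None;
        }
    }

    /// Cardinality field: exact distinct count, or the ">=N" sentinel.
    fn cardinality_field(&self, cap: u64) -> String {
        match &self.samples {
            Some(samples) if !self.modes_dropped => {
                let mut sorted = samples.clone();
                sorted.sort();
                sorted.dedup();
                sorted.len().to_string()
            }
            _ => format!(">={cap}"),
        }
    }

    /// Mode field: the sentinel when dropped. Real mode computation is omitted.
    fn mode_field(&self) -> Option<&'static str> {
        if self.modes_dropped { Some(HIGH_CARDINALITY) } else { None }
    }
}
```

With cap=3, two chunks of two values each stay under the cap individually but trip it once merged, which is the case the post-merge re-check exists for.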

1. TDigestSlot: drop Serialize/Deserialize impls. The Stats::tdigest field uses #[serde(skip)] so they were dead code; updated the doc comment to explain the arrangement.
2. stats_quantile_method_approx_disables_mad: capture stdout+stderr from a single cmd.output() invocation and parse headers from the captured bytes. Previously the test ran the command twice (once for stderr, once via wrk.read_stdout for stdout), which doubled runtime and split the stderr/stdout assertions across two separate runs.
3. Approx percentiles: short-circuit on td.is_empty() up front instead of propagating None from any single quantile() failure mid-loop. This matches the exact path's contract that None means "no data," not "partial data." Added stats_quantile_method_approx_percentiles_empty_numeric_column to lock the all-string-column edge case.
4. Stats::merge t-digest comment: clarify that --jobs 1 routes through sequential_stats (no merge invoked, fully reproducible) and that determinism for --jobs >= 2 is intentionally not guaranteed.

737 stats tests pass (was 736); bin clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(job 2003)

1. Help text: --mode-cardinality-cap now documents the asymmetric semantic explicitly. The cap measures total samples added (unweighted) or unique-key count (weighted); both are direct memory bounds on their respective tracker, not the same logical quantity.
2. Test coverage: added 3 tests filling the gaps the reviewer flagged:
   - stats_mode_cardinality_cap_weighted_path: --weight + cap exercises the HashMap::len() branch in update_modes.
   - stats_mode_cardinality_cap_multi_chunk_post_merge_check: indexes the file so qsv routes to parallel_stats, then verifies the post-merge cap re-check fires when chunks individually stay below cap but their merged result crosses it.
   - stats_mode_cardinality_cap_cardinality_only_emits_two_fields: locks the contract that --cardinality without --mode produces 2 sentinel fields, not 8.
3. to_record comment: replaced the misleading "fall through to the no-tracker path" wording (the modes_dropped branch is mutually exclusive with the else-if, so there is no fall-through) with a one-line "sentinels emitted; else-if chain is skipped."
4. Stats::merge: replaced the modes_dropped_now temp with the direct `self.modes_dropped |= other.modes_dropped`, and rewrote the accompanying comment to explain the actual ordering reason (clearing modes/weighted_modes happens AFTER the standard Option::merge calls so the merge plumbing stays simple) rather than the inaccurate "Commute trait happy" remark.

740 stats tests pass (was 737; 3 new); bin clippy clean. Help regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-05-10T14:27:57Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

Copilot

Pull request overview

This PR adds three opt-in, performance-focused enhancements to qsv stats: faster UTF-8 handling in output formatting, an approximate quantile engine using t-digest, and an optional cap to bound mode/cardinality tracking memory on high-cardinality columns—while keeping default behavior byte-identical.

Changes:

* Introduces --quantile-method {exact|approx} with an approx t-digest backend (including cache invalidation via serialized StatsArgs).
* Adds --mode-cardinality-cap <n> to drop mode tracking once a column exceeds a configured size and emit sentinel outputs.
* Optimizes stats output string decoding by using SIMD-validated UTF-8 and borrowing via Cow<str> when possible.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_stats.rs | Adds integration coverage for `--quantile-method approx` and `--mode-cardinality-cap`, including rejection/compatibility cases and sentinel behavior. |
| src/util.rs | Updates the internal `stats::Args` construction used by stats-cache plumbing to include the new flags with defaults. |
| src/cmd/stats.rs | Implements quantile-method parsing/validation, integrates t-digest, adds mode-cap drop/sentinel logic, and introduces `bytes_to_cow_str` for faster UTF-8 output. |
| docs/help/stats.md | Regenerates help markdown to document the two new flags. |
| Cargo.toml | Adds the `datasketches` dependency for t-digest support. |
| Cargo.lock | Locks `datasketches` and updates the dependency graph accordingly. |

1. update_modes: tighten the cap check to `>= cap` BEFORE insert (was `> cap`, which allowed the tracker to briefly hit cap+1 before being dropped). Peak retained size is now exactly `cap`. For the weighted path, the check moves into the new-key branch so existing-key updates (which don't grow the map) skip the cap entirely and the get_mut fast-path is preserved.
2. TDigestSlot doc comment: replace the incorrect "Args::run()" reference with "module-level run(argv: ...) function". There is no Args::run() method; validation lives in the free run function.
3. stats_quantile_method_approx_quartiles_within_envelope: replace the wrk.assert_success + wrk.read_stdout double-run with a single cmd.output() capture. Headers and the data row are parsed from stdout bytes; success is asserted on the captured output (the second run's status was previously unchecked).
4. stats_quantile_method_approx_percentiles_empty_numeric_column: same single-invocation refactor.

740 stats tests pass; bin clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
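Item 1 above is easiest to see in code. The sketch below shows a check-before-insert cap for both paths, under stated assumptions: `update_unweighted` and `update_weighted` are hypothetical names (the commit only names `update_modes`), a `Vec<Vec<u8>>` stands in for `Unsorted<Vec<u8>>`, and `f64` weights are assumed for the weighted map.

```rust
use std::collections::HashMap;

/// Unweighted path: check `len() >= cap` BEFORE pushing, so the tracker
/// never holds more than `cap` samples. `cap == 0` means unbounded.
fn update_unweighted(tracker: &mut Option<Vec<Vec<u8>>>, sample: &[u8], cap: usize) {
    let at_cap = cap > 0 && tracker.as_ref().map_or(false, |s| s.len() >= cap);
    if at_cap {
        *tracker = None; // drop the tracker; the caller would set modes_dropped
        return;
    }
    if let Some(samples) = tracker.as_mut() {
        samples.push(sample.to_vec());
    }
}

/// Weighted path: existing keys take the get_mut fast path and skip the cap,
/// since they do not grow the map. Only a new key can trip the cap.
fn update_weighted(
    tracker: &mut Option<HashMap<Vec<u8>, f64>>,
    sample: &[u8],
    weight: f64,
    cap: usize,
) {
    let map = match tracker.as_mut() {
        Some(map) => map,
        None => return, // tracker already dropped
    };
    if let Some(total) = map.get_mut(sample) {
        *total += weight;
        return;
    }
    if cap > 0 && map.len() >= cap {
        *tracker = None;
        return;
    }
    map.insert(sample.to_vec(), weight);
}
```

With the earlier `> cap` check placed after the insert, the tracker would have reached cap+1 entries before being dropped; checking first keeps the peak at exactly `cap`.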

jqnatividad and others added 5 commits May 10, 2026 09:19

jqnatividad requested a review from Copilot May 10, 2026 14:29

Copilot started reviewing on behalf of jqnatividad May 10, 2026 14:30

Copilot AI reviewed May 10, 2026

Comment thread src/cmd/stats.rs Outdated

Comment thread src/cmd/stats.rs Outdated

Comment thread tests/test_stats.rs Outdated

Comment thread tests/test_stats.rs Outdated

jqnatividad merged commit e454200 into master May 10, 2026
17 checks passed

jqnatividad deleted the stats-microoptimize-202605 branch May 10, 2026 14:50

jqnatividad merged 6 commits into master from stats-microoptimize-202605
