
perf(stats): three opt-in micro-optimizations (simdutf8 output, t-digest quantiles, mode-cardinality cap)#3839

Merged
jqnatividad merged 6 commits into master from stats-microoptimize-202605
May 10, 2026

@jqnatividad
Collaborator

Summary

Three independently-shippable performance opportunities for qsv stats,
all opt-in or behavior-preserving on default flags. Each landed as a
discrete commit so they can be reverted or cherry-picked individually.

  • 9e937e50e simdutf8 + Cow in output: replaces 7
    String::from_utf8_lossy sites in to_record and format_antimodes
    with a bytes_to_cow_str helper that uses SIMD-validated UTF-8 and
    returns a borrowed &str on the happy path. Byte-identical for valid
    UTF-8 (the entire test universe); only difference is faster validation.
  • e99fe9b5b --quantile-method approx (t-digest): opt-in
    approximate-quantile engine for --median, --quartiles, and
    --percentiles, backed by the Apache DataSketches t-digest port
    (datasketches crate). Trades exactness for O(K) memory (~200 centroids)
    and O(1) quantile reads with ~1% rank error. Default exact produces
    byte-identical output to today, asserted by a regression test.
    Restrictions enforced at parse time: --weight is rejected (upstream
    TDigestMut has no public weighted-update); --mad warns and is
    disabled (algorithmically incompatible with t-digest).
  • b20a7a407 --mode-cardinality-cap N: opt-in upper bound on
    mode-tracking memory. When N > 0, qsv drops the mode tracker for any
    column whose Unsorted::len() (unweighted, total samples) or
    HashMap::len() (weighted, unique keys) exceeds N, and emits
    sentinels (*HIGH_CARDINALITY / >=N) instead of crashing or
    truncating silently. Default 0 = unbounded (today's behavior).

Both new flags invalidate the stats cache via StatsArgs JSON sidecar
when changed, so cached output never silently mismatches the requested
algorithm.

5e6d92e39 and fc91fcba5 address LOW review findings on the t-digest
and cap commits respectively (review jobs 2002 and 2003) — comment
cleanups, the --jobs 1 determinism note, dropped unused Serde impls,
single-invocation test capture, plus added coverage for the
weighted-cap branch, the multi-chunk post-merge re-check, and the
--cardinality-only sentinel-field count.

Help docs auto-regenerated via qsv --generate-help-md.

Diff: 972 insertions / 195 deletions across 6 files; the bulk of the
deletions are formatter realignments triggered by widening the field-name
column in Args / StatsArgs / WhichStats.

Test plan

  • cargo build --locked --bin qsv -F all_features
  • cargo test --locked stats -F all_features — 740 passed, 0 failed
    (728 baseline + 5 t-digest + 6 mode-cap + 1 empty-numeric edge case)
  • cargo clippy --locked --bin qsv -F all_features -- -D warnings — clean
  • Smoke test approx mode against exact mode on uniform 1..=10_000:
    q1/q2/q3 within 2% envelope
  • Smoke test cap=3 on a 10-row file: *HIGH_CARDINALITY and >=3
    emitted; cap=100 on the same file: normal output
  • Smoke test parse-time rejection: --weight + --quantile-method approx
    errors out with a clear message; invalid --quantile-method nonsense
    errors out
  • Reviewer: please run scripts/benchmarks.sh to quantify the perf
    delta on the NYC 311 1M-row dataset (didn't run locally because the
    script requires the dataset to be present)
  • Reviewer: please confirm --quantile-method approx output reads
    sensibly under multi-thread (--jobs 4) — the test suite pins
    --jobs 1 for deterministic snapshots; --jobs >= 2 is documented
    as approximate-only

🤖 Generated with Claude Code

jqnatividad and others added 5 commits May 10, 2026 09:19
Replace `String::from_utf8_lossy` with a `bytes_to_cow_str` helper in the
seven hot output paths in `to_record` and `format_antimodes`. The helper
uses `simdutf8::basic::from_utf8` (SIMD-validated, already a dep) and
returns a borrowed `&str` on the valid-UTF-8 happy path with no
allocation, falling back to `from_utf8_lossy` only on invalid UTF-8.

Behavior is byte-identical for valid UTF-8 inputs (the entire test
universe). On invalid UTF-8 the fallback preserves the previous lossy
substitution.

Touches mode/antimode/min/max formatting in to_record + format_antimodes;
the three remaining `from_utf8_lossy` call sites (header processing,
header validation, stderr formatting) are not in the output hot path and
are intentionally left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
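
The borrow-on-valid, lossy-on-invalid shape described in this commit can be sketched as follows. This is a minimal illustration, not the code in `src/cmd/stats.rs`: it swaps in the standard library's `std::str::from_utf8` so the block has no external dependencies, whereas the commit uses `simdutf8::basic::from_utf8` for the validation step. The function body is an assumption about how the helper might look.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the `bytes_to_cow_str` helper.
/// Assumption: std's validator is used here in place of
/// `simdutf8::basic::from_utf8`, which the commit uses.
fn bytes_to_cow_str(bytes: &[u8]) -> Cow<'_, str> {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8: borrow the input directly, no allocation.
        Ok(s) => Cow::Borrowed(s),
        // Invalid UTF-8: fall back to the previous lossy substitution
        // (invalid sequences become U+FFFD).
        Err(_) => String::from_utf8_lossy(bytes),
    }
}
```

On valid input the result is `Cow::Borrowed`, so callers that only read the string pay for validation and nothing else; the owned fallback is reached only for invalid bytes.
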
Adds an opt-in approximate-quantile engine for `--median`, `--quartiles`,
and `--percentiles`, backed by the Apache DataSketches t-digest port
(`datasketches` crate, based on Dunning's MergingDigest paper).

Why: the exact engine accumulates every numeric value in `Unsorted<f64>`
and sorts it during `to_record`, costing O(N) memory and O(N log N) sort
per numeric column. On wide numeric files this dominates the runtime.
T-digest yields all four quantile readers from O(K) memory (K~200
centroids) with O(1) lookups and ~1% rank error (tighter at the tails).
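
To make the cost of the exact path concrete, here is a minimal sketch of a sort-then-index quantile plus an envelope check of the kind the tests describe. Both functions are hypothetical: `exact_quantile` uses the nearest-rank definition, which may differ from the interpolation qsv's exact engine applies, and `within_envelope` reads the "±2% envelope" as relative error on the value, which is only one plausible interpretation.

```rust
/// Nearest-rank quantile over a fully materialized column.
/// Illustrative only; qsv's exact engine may interpolate differently.
fn exact_quantile(values: &mut [f64], q: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    // O(N log N): the whole column is sorted before any quantile is read,
    // and all N values must be held in memory to do so.
    values.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest 1-based rank covering fraction `q`.
    let rank = (q * values.len() as f64).ceil() as usize;
    Some(values[rank.saturating_sub(1)])
}

/// True if `approx` lies within `pct` percent of `exact` (relative error).
/// Assumption: this is one way to express the test envelope.
fn within_envelope(approx: f64, exact: f64, pct: f64) -> bool {
    (approx - exact).abs() <= exact.abs() * pct / 100.0
}
```
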

Default behavior unchanged: `--quantile-method exact` is the default and
produces byte-identical output to today (verified by a regression test).
A new TDigestSlot wrapper provides Clone / PartialEq / skip-Serde for the
TDigestMut field on `Stats` without bleeding upstream traits.

Restrictions enforced at parse time:
* `--quantile-method approx` + `--weight` is rejected. The upstream
  TDigestMut::update is unweighted only and exposes no public weighted-
  update API; weighted approx will need an upstream change to land.
* `--quantile-method approx` + `--mad` (or `--everything`) prints a
  warning and disables MAD; MAD requires `median(|x - median|)` which a
  t-digest cannot produce in a single pass.
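
The two restrictions above amount to a small validation step that runs before any data is read. A rough sketch follows; the `Flags` struct, the `validate` function, and the message strings are all hypothetical and only mirror the rules listed here, not the actual parsing code.

```rust
#[derive(Debug, PartialEq)]
enum QuantileMethod {
    Exact,
    Approx,
}

/// Hypothetical subset of the parsed command-line flags.
struct Flags {
    quantile_method: String,
    weight: Option<String>,
    mad: bool, // assumed to already be true when --everything is given
}

/// Returns the chosen method and whether MAD stays enabled,
/// or an error message if the flag combination is rejected.
fn validate(flags: &Flags) -> Result<(QuantileMethod, bool), String> {
    let method = match flags.quantile_method.as_str() {
        "exact" => QuantileMethod::Exact,
        "approx" => QuantileMethod::Approx,
        other => return Err(format!("invalid --quantile-method value: {other}")),
    };
    if method == QuantileMethod::Approx {
        // Hard error: no weighted update is available for the sketch.
        if flags.weight.is_some() {
            return Err("--quantile-method approx cannot be used with --weight".to_string());
        }
        // Soft restriction: warn and continue with MAD disabled.
        if flags.mad {
            eprintln!("warning: MAD is disabled with --quantile-method approx");
            return Ok((method, false));
        }
    }
    Ok((method, flags.mad))
}
```
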

Determinism note: TDigestMut::merge is associative but not chunk-count-
invariant. Output may shift by ~1% across runs with different `--jobs`
values. New tests pin `--jobs 1` to avoid flakes.

Tests added (all passing):
* approx quartiles within ±2% envelope on N=10_000 uniform input
* approx + --weight is rejected with clear error
* approx + --mad warns and drops MAD column
* invalid --quantile-method value is rejected
* default and explicit-exact produce byte-identical output

`stats.csv.json` cache key includes flag_quantile_method, so changing
the engine invalidates the cache.

Help doc regenerated via `qsv --generate-help-md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in upper bound on mode-tracking memory for high-cardinality
columns. When `--mode-cardinality-cap N` is set (default 0 = unbounded,
preserving today's behavior), qsv drops the mode tracker for any column
whose `Unsorted::len()` (unweighted) or `HashMap::len()` (weighted)
exceeds N during ingestion or merge. Output emits sentinels for the
affected column rather than crashing or truncating silently:

* mode / antimode columns: "*HIGH_CARDINALITY"
* cardinality column: ">=<N>" (the ">=" prefix DOES break downstream
  parsers expecting a plain integer; the cap is opt-in only).

Why: stats currently allocates `Unsorted<Vec<u8>>` of capacity =
record_count for every cardinality/mode-tracked column. On wide tables
with ID/UUID/timestamp columns this is largely wasted work — the column
is unique-per-row, the eventual `*ALL` short-circuit fires at output
time, but ingestion has already paid full cost (per-cell
`sample.to_vec()` heap-allocs plus the eventual sort during
`cardinality()`). With a cap, the tracker is dropped early and the
sentinel signals "we gave up; treat as approximate."

Implementation:
* `flag_mode_cardinality_cap: u64` plumbed through Args / StatsArgs /
  WhichStats. Cache invalidates on cap change (StatsArgs JSON sidecar).
* `Stats.modes_dropped: bool` flag set when the cap fires; sticky across
  parallel-chunk merges. Post-merge re-checks the cap to catch chunks
  that individually stayed below it but combined cross it.
* `to_record` short-circuits to sentinel emission when `modes_dropped`
  is true, sitting BEFORE the existing weighted/unweighted dispatch.
* HLL-style probabilistic alternatives intentionally rejected: they
  would make exact `cardinality` non-deterministic and break the stats
  cache validity model (downstream commands like `schema` and `tojsonl`
  consume cardinality as exact). See planning notes.
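
The sticky flag and the post-merge re-check can be illustrated with a simplified tracker. `ModeTracker` and its methods are hypothetical stand-ins for the relevant part of `Stats`; the sketch covers only the unweighted path and uses the `> cap` comparison as described in this commit.

```rust
/// Hypothetical, simplified stand-in for the mode-tracking part of `Stats`.
struct ModeTracker {
    samples: Option<Vec<Vec<u8>>>, // None once the tracker has been dropped
    modes_dropped: bool,
}

impl ModeTracker {
    fn new() -> Self {
        Self { samples: Some(Vec::new()), modes_dropped: false }
    }

    /// Record one cell, then drop the tracker if the cap is exceeded.
    fn update(&mut self, cell: &[u8], cap: u64) {
        if let Some(samples) = self.samples.as_mut() {
            samples.push(cell.to_vec());
        }
        self.enforce_cap(cap);
    }

    /// Merge another chunk's tracker into this one.
    fn merge(&mut self, other: ModeTracker, cap: u64) {
        // Sticky: once any chunk dropped its tracker, the result is dropped.
        self.modes_dropped |= other.modes_dropped;
        if let (Some(mine), Some(theirs)) = (self.samples.as_mut(), other.samples) {
            mine.extend(theirs);
        }
        if self.modes_dropped {
            self.samples = None;
        }
        // Re-check: chunks that stayed below the cap may cross it combined.
        self.enforce_cap(cap);
    }

    fn enforce_cap(&mut self, cap: u64) {
        let len = self.samples.as_ref().map_or(0, |s| s.len() as u64);
        // cap == 0 means unbounded.
        if cap > 0 && len > cap {
            self.samples = None;
            self.modes_dropped = true;
        }
    }

    /// Sentinel for the cardinality column, if the cap fired.
    fn cardinality_sentinel(&self, cap: u64) -> Option<String> {
        self.modes_dropped.then(|| format!(">={cap}"))
    }
}
```
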

Tests added (all passing):
* default (cap=0) is byte-identical to omitting the flag
* cap trips on a high-cardinality column → sentinels emitted
* cap does not trip on a low-cardinality column → exact values

Help doc regenerated via `qsv --generate-help-md`.

The cache-hint skip half of the original plan (read prior cache to skip
mode-tracking from the start) is deferred — the runtime cap delivers the
memory bound without needing any cross-run state, and profiling can
decide whether the cache-hint path is worth the additional complexity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. TDigestSlot: drop Serialize/Deserialize impls. The Stats::tdigest field
   uses #[serde(skip)] so they were dead code; updated the doc comment to
   explain the arrangement.

2. stats_quantile_method_approx_disables_mad: capture stdout+stderr from a
   single cmd.output() invocation and parse headers from the captured bytes.
   Previously the test ran the command twice (once for stderr, once via
   wrk.read_stdout for stdout), which doubled runtime and split the
   stderr/stdout assertions across two separate runs.

3. Approx percentiles: short-circuit on td.is_empty() up front instead of
   propagating None from any single quantile() failure mid-loop. This
   matches the exact path's contract that None means "no data," not
   "partial data." Added stats_quantile_method_approx_percentiles_empty_numeric_column
   to lock the all-string-column edge case.

4. Stats::merge t-digest comment: clarify that --jobs 1 routes through
   sequential_stats (no merge invoked, fully reproducible) and that
   determinism for --jobs >= 2 is intentionally not guaranteed.

737 stats tests pass (was 736); bin clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(job 2003)

1. Help text: --mode-cardinality-cap now documents the asymmetric semantic
   explicitly. The cap measures total samples added (unweighted) or
   unique-key count (weighted) — both are direct memory bounds on their
   respective tracker, not the same logical quantity.

2. Test coverage: added 3 tests filling the gaps the reviewer flagged:
   - stats_mode_cardinality_cap_weighted_path: --weight + cap exercises
     the HashMap::len() branch in update_modes.
   - stats_mode_cardinality_cap_multi_chunk_post_merge_check: indexes the
     file so qsv routes to parallel_stats, then verifies the post-merge
     cap re-check fires when chunks individually stay below cap but their
     merged result crosses it.
   - stats_mode_cardinality_cap_cardinality_only_emits_two_fields: locks
     the contract that --cardinality without --mode produces 2 sentinel
     fields, not 8.

3. to_record comment: replaced the misleading "fall through to the
   no-tracker path" wording (the modes_dropped branch is mutually
   exclusive with the else-if, so there is no fall-through) with a
   one-line "sentinels emitted; else-if chain is skipped."

4. Stats::merge: replaced the modes_dropped_now temp with the direct
   `self.modes_dropped |= other.modes_dropped`, and rewrote the
   accompanying comment to explain the actual ordering reason (clearing
   modes/weighted_modes happens AFTER the standard Option::merge calls
   so the merge plumbing stays simple) rather than the inaccurate
   "Commute trait happy" remark.

740 stats tests pass (was 737; 3 new); bin clippy clean. Help regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 10, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


Contributor

Copilot AI left a comment

Pull request overview

This PR adds three opt-in, performance-focused enhancements to qsv stats: faster UTF-8 handling in output formatting, an approximate quantile engine using t-digest, and an optional cap to bound mode/cardinality tracking memory on high-cardinality columns—while keeping default behavior byte-identical.

Changes:

  • Introduces --quantile-method {exact|approx} with an approx t-digest backend (including cache invalidation via serialized StatsArgs).
  • Adds --mode-cardinality-cap <n> to drop mode tracking once a column exceeds a configured size and emit sentinel outputs.
  • Optimizes stats output string decoding by using SIMD-validated UTF-8 and borrowing via Cow<str> when possible.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_stats.rs | Adds integration coverage for --quantile-method approx and --mode-cardinality-cap, including rejection/compatibility cases and sentinel behavior. |
| src/util.rs | Updates the internal stats::Args construction used by stats-cache plumbing to include the new flags with defaults. |
| src/cmd/stats.rs | Implements quantile-method parsing/validation, integrates t-digest, adds mode-cap drop/sentinel logic, and introduces bytes_to_cow_str for faster UTF-8 output. |
| docs/help/stats.md | Regenerates help markdown to document the two new flags. |
| Cargo.toml | Adds the datasketches dependency for t-digest support. |
| Cargo.lock | Locks datasketches and updates the dependency graph accordingly. |

Comment thread src/cmd/stats.rs Outdated
Comment thread src/cmd/stats.rs Outdated
Comment thread tests/test_stats.rs Outdated
Comment thread tests/test_stats.rs Outdated
1. update_modes: tighten the cap check to `>= cap` BEFORE insert (was
   `> cap`, which allowed the tracker to briefly hit cap+1 before being
   dropped). Peak retained size is now exactly `cap`. For the weighted
   path, the check moves into the new-key branch so existing-key updates
   (which don't grow the map) skip the cap entirely and the get_mut
   fast-path is preserved.

2. TDigestSlot doc comment: replace the incorrect "Args::run()" reference
   with "module-level run(argv: ...) function" — there is no Args::run()
   method; validation lives in the free run function.

3. stats_quantile_method_approx_quartiles_within_envelope: replace the
   wrk.assert_success + wrk.read_stdout double-run with a single
   cmd.output() capture. Headers and the data row are parsed from
   stdout bytes; success is asserted on the captured output (the
   second run's status was previously unchecked).

4. stats_quantile_method_approx_percentiles_empty_numeric_column: same
   single-invocation refactor.

740 stats tests pass; bin clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
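
Item 1 of this commit, the weighted-path change, can be sketched as below. `update_weighted` is a hypothetical free function, not the `update_modes` method itself, and the `f64` weight type is an assumption; the point is the ordering: existing keys take the `get_mut` fast path, and only a new key triggers the `>= cap` check before insertion.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the weighted branch of the cap check.
/// Returns false if the tracker is (or has just been) dropped.
fn update_weighted(
    modes: &mut Option<HashMap<Vec<u8>, f64>>,
    cell: &[u8],
    weight: f64,
    cap: u64,
) -> bool {
    let Some(map) = modes.as_mut() else {
        return false; // tracker already dropped
    };
    // Fast path: an existing key does not grow the map, so no cap check.
    if let Some(total) = map.get_mut(cell) {
        *total += weight;
        return true;
    }
    // New key: check BEFORE inserting so the map never exceeds `cap` entries.
    if cap > 0 && map.len() as u64 >= cap {
        *modes = None;
        return false;
    }
    map.insert(cell.to_vec(), weight);
    true
}
```

With `cap = 2`, two distinct keys are accepted, repeated updates to either key keep working, and a third distinct key drops the tracker without ever holding three entries.
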
@jqnatividad jqnatividad merged commit e454200 into master May 10, 2026
17 checks passed
@jqnatividad jqnatividad deleted the stats-microoptimize-202605 branch May 10, 2026 14:50

2 participants