perf(stats): three opt-in micro-optimizations (simdutf8 output, t-digest quantiles, mode-cardinality cap)#3839
Merged
Merged
Conversation
Replace `String::from_utf8_lossy` with a `bytes_to_cow_str` helper in the seven hot output paths in `to_record` and `format_antimodes`. The helper uses `simdutf8::basic::from_utf8` (SIMD-validated, already a dep) and returns a borrowed `&str` on the valid-UTF-8 happy path with no allocation, falling back to `from_utf8_lossy` only on invalid UTF-8. Behavior is byte-identical for valid UTF-8 inputs (the entire test universe). On invalid UTF-8 the fallback preserves the previous lossy substitution. Touches mode/antimode/min/max formatting in to_record + format_antimodes; the three remaining `from_utf8_lossy` call sites (header processing, header validation, stderr formatting) are not in the output hot path and are intentionally left alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in approximate-quantile engine for `--median`, `--quartiles`, and `--percentiles`, backed by the Apache DataSketches t-digest port (`datasketches` crate, based on Dunning's MergingDigest paper). Why: the exact engine accumulates every numeric value in `Unsorted<f64>` and sorts it during `to_record`, costing O(N) memory and O(N log N) sort per numeric column. On wide numeric files this dominates the runtime. T-digest yields all four quantile readers from O(K) memory (K~200 centroids) with O(1) lookups and ~1% rank error (tighter at the tails). Default behavior unchanged: `--quantile-method exact` is the default and produces byte-identical output to today (verified by a regression test). A new TDigestSlot wrapper provides Clone / PartialEq / skip-Serde for the TDigestMut field on `Stats` without bleeding upstream traits. Restrictions enforced at parse time: * `--quantile-method approx` + `--weight` is rejected. The upstream TDigestMut::update is unweighted only and exposes no public weighted- update API; weighted approx will need an upstream change to land. * `--quantile-method approx` + `--mad` (or `--everything`) prints a warning and disables MAD; MAD requires `median(|x - median|)` which a t-digest cannot produce in a single pass. Determinism note: TDigestMut::merge is associative but not chunk-count- invariant. Output may shift by ~1% across runs with different `--jobs` values. New tests pin `--jobs 1` to avoid flakes. Tests added (all passing): * approx quartiles within ±2% envelope on N=10_000 uniform input * approx + --weight is rejected with clear error * approx + --mad warns and drops MAD column * invalid --quantile-method value is rejected * default and explicit-exact produce byte-identical output `stats.csv.json` cache key includes flag_quantile_method, so changing the engine invalidates the cache. Help doc regenerated via `qsv --generate-help-md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in upper bound on mode-tracking memory for high-cardinality columns. When `--mode-cardinality-cap N` is set (default 0 = unbounded, preserving today's behavior), qsv drops the mode tracker for any column whose `Unsorted::len()` (unweighted) or `HashMap::len()` (weighted) exceeds N during ingestion or merge. Output emits sentinels for the affected column rather than crashing or truncating silently: * mode / antimode columns: "*HIGH_CARDINALITY" * cardinality column: ">=<N>" (the ">=" prefix DOES break downstream parsers expecting a plain integer; the cap is opt-in only). Why: stats currently allocates `Unsorted<Vec<u8>>` of capacity = record_count for every cardinality/mode-tracked column. On wide tables with ID/UUID/timestamp columns this is largely wasted work — the column is unique-per-row, the eventual `*ALL` short-circuit fires at output time, but ingestion has already paid full cost (per-cell `sample.to_vec()` heap-allocs plus the eventual sort during `cardinality()`). With a cap, the tracker is dropped early and the sentinel signals "we gave up; treat as approximate." Implementation: * `flag_mode_cardinality_cap: u64` plumbed through Args / StatsArgs / WhichStats. Cache invalidates on cap change (StatsArgs JSON sidecar). * `Stats.modes_dropped: bool` flag set when the cap fires; sticky across parallel-chunk merges. Post-merge re-checks the cap to catch chunks that individually stayed below it but combined cross it. * `to_record` short-circuits to sentinel emission when `modes_dropped` is true, sitting BEFORE the existing weighted/unweighted dispatch. * HLL-style probabilistic alternatives intentionally rejected: they would make exact `cardinality` non-deterministic and break the stats cache validity model (downstream commands like `schema` and `tojsonl` consume cardinality as exact). See planning notes. Tests added (all passing): * default (cap=0) is byte-identical to omitting the flag * cap trips on a high-cardinality column → sentinels emitted * cap does not trip on a low-cardinality column → exact values Help doc regenerated via `qsv --generate-help-md`. The cache-hint skip half of the original plan (read prior cache to skip mode-tracking from the start) is deferred — the runtime cap delivers the memory bound without needing any cross-run state, and profiling can decide whether the cache-hint path is worth the additional complexity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. TDigestSlot: drop Serialize/Deserialize impls. The Stats::tdigest field uses #[serde(skip)] so they were dead code; updated the doc comment to explain the arrangement. 2. stats_quantile_method_approx_disables_mad: capture stdout+stderr from a single cmd.output() invocation and parse headers from the captured bytes. Previously the test ran the command twice (once for stderr, once via wrk.read_stdout for stdout), which doubled runtime and split the stderr/stdout assertions across two separate runs. 3. Approx percentiles: short-circuit on td.is_empty() up front instead of propagating None from any single quantile() failure mid-loop. This matches the exact path's contract that None means "no data," not "partial data." Added stats_quantile_method_approx_percentiles_empty_numeric_column to lock the all-string-column edge case. 4. Stats::merge t-digest comment: clarify that --jobs 1 routes through sequential_stats (no merge invoked, fully reproducible) and that determinism for --jobs >= 2 is intentionally not guaranteed. 737 stats tests pass (was 736); bin clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(job 2003)
1. Help text: --mode-cardinality-cap now documents the asymmetric semantic
explicitly. The cap measures total samples added (unweighted) or
unique-key count (weighted) — both are direct memory bounds on their
respective tracker, not the same logical quantity.
2. Test coverage: added 3 tests filling the gaps the reviewer flagged:
- stats_mode_cardinality_cap_weighted_path: --weight + cap exercises
the HashMap::len() branch in update_modes.
- stats_mode_cardinality_cap_multi_chunk_post_merge_check: indexes the
file so qsv routes to parallel_stats, then verifies the post-merge
cap re-check fires when chunks individually stay below cap but their
merged result crosses it.
- stats_mode_cardinality_cap_cardinality_only_emits_two_fields: locks
the contract that --cardinality without --mode produces 2 sentinel
fields, not 8.
3. to_record comment: replaced the misleading "fall through to the
no-tracker path" wording (the modes_dropped branch is mutually
exclusive with the else-if, so there is no fall-through) with a
one-line "sentinels emitted; else-if chain is skipped."
4. Stats::merge: replaced the modes_dropped_now temp with the direct
`self.modes_dropped |= other.modes_dropped`, and rewrote the
accompanying comment to explain the actual ordering reason (clearing
modes/weighted_modes happens AFTER the standard Option::merge calls
so the merge plumbing stays simple) rather than the inaccurate
"Commute trait happy" remark.
740 stats tests pass (was 737; 3 new); bin clippy clean. Help regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds three opt-in, performance-focused enhancements to qsv stats: faster UTF-8 handling in output formatting, an approximate quantile engine using t-digest, and an optional cap to bound mode/cardinality tracking memory on high-cardinality columns—while keeping default behavior byte-identical.
Changes:
- Introduces
--quantile-method {exact|approx}with anapproxt-digest backend (including cache invalidation via serializedStatsArgs). - Adds
--mode-cardinality-cap <n>to drop mode tracking once a column exceeds a configured size and emit sentinel outputs. - Optimizes stats output string decoding by using SIMD-validated UTF-8 and borrowing via
Cow<str>when possible.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_stats.rs | Adds integration coverage for --quantile-method approx and --mode-cardinality-cap, including rejection/compatibility cases and sentinel behavior. |
| src/util.rs | Updates the internal stats::Args construction used by stats-cache plumbing to include the new flags with defaults. |
| src/cmd/stats.rs | Implements quantile-method parsing/validation, integrates t-digest, adds mode-cap drop/sentinel logic, and introduces bytes_to_cow_str for faster UTF-8 output. |
| docs/help/stats.md | Regenerates help markdown to document the two new flags. |
| Cargo.toml | Adds the datasketches dependency for t-digest support. |
| Cargo.lock | Locks datasketches and updates the dependency graph accordingly. |
1. update_modes: tighten the cap check to `>= cap` BEFORE insert (was `> cap`, which allowed the tracker to briefly hit cap+1 before being dropped). Peak retained size is now exactly `cap`. For the weighted path, the check moves into the new-key branch so existing-key updates (which don't grow the map) skip the cap entirely and the get_mut fast-path is preserved. 2. TDigestSlot doc comment: replace the incorrect "Args::run()" reference with "module-level run(argv: ...) function" — there is no Args::run() method; validation lives in the free run function. 3. stats_quantile_method_approx_quartiles_within_envelope: replace the wrk.assert_success + wrk.read_stdout double-run with a single cmd.output() capture. Headers and the data row are parsed from stdout bytes; success is asserted on the captured output (the second run's status was previously unchecked). 4. stats_quantile_method_approx_percentiles_empty_numeric_column: same single-invocation refactor. 740 stats tests pass; bin clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three independently-shippable performance opportunities for
qsv stats,all opt-in or behavior-preserving on default flags. Each landed as a
discrete commit so they can be reverted or cherry-picked individually.
9e937e50e—simdutf8+Cowin output: replaces 7String::from_utf8_lossysites into_recordandformat_antimodeswith a
bytes_to_cow_strhelper that uses SIMD-validated UTF-8 andreturns a borrowed
&stron the happy path. Byte-identical for validUTF-8 (the entire test universe); only difference is faster validation.
e99fe9b5b—--quantile-method approx(t-digest): opt-inapproximate-quantile engine for
--median,--quartiles, and--percentiles, backed by the Apache DataSketches t-digest port(
datasketchescrate). Trades exact for O(K) memory (~200 centroids)and O(1) quantile reads with ~1% rank error. Default
exactproducesbyte-identical output to today, asserted by a regression test.
Restrictions enforced at parse time:
--weightis rejected (upstreamTDigestMuthas no public weighted-update);--madwarns and isdisabled (algorithmically incompatible with t-digest).
b20a7a407—--mode-cardinality-cap N: opt-in upper bound onmode-tracking memory. When
N > 0, qsv drops the mode tracker for anycolumn whose
Unsorted::len()(unweighted, total samples) orHashMap::len()(weighted, unique keys) exceedsN, and emitssentinels (
*HIGH_CARDINALITY/>=N) instead of crashing ortruncating silently. Default
0= unbounded (today's behavior).Both new flags invalidate the stats cache via
StatsArgsJSON sidecarwhen changed, so cached output never silently mismatches the requested
algorithm.
5e6d92e39andfc91fcba5address LOW review findings on the t-digestand cap commits respectively (review jobs 2002 and 2003) — comment
cleanups, the
--jobs 1determinism note, dropped unused Serde impls,single-invocation test capture, plus added coverage for the
weighted-cap branch, the multi-chunk post-merge re-check, and the
--cardinality-only sentinel-field count.Help docs auto-regenerated via
qsv --generate-help-md.Diff: 972 insertions / 195 deletions across 6 files; the bulk of the
deletions are formatter realignments triggered by widening the field-name
column in
Args/StatsArgs/WhichStats.Test plan
cargo build --locked --bin qsv -F all_featurescargo test --locked stats -F all_features— 740 passed, 0 failed(728 baseline + 5 t-digest + 6 mode-cap + 1 empty-numeric edge case)
cargo clippy --locked --bin qsv -F all_features -- -D warnings— cleanq1/q2/q3 within 2% envelope
*HIGH_CARDINALITYand>=3emitted; cap=100 on the same file: normal output
--weight+--quantile-method approxerrors out with a clear message; invalid
--quantile-method nonsenseerrors out
scripts/benchmarks.shto quantify the perfdelta on the NYC 311 1M-row dataset (didn't run locally because the
script requires the dataset to be present)
--quantile-method approxoutput readssensibly under multi-thread (
--jobs 4) — the test suite pins--jobs 1for deterministic snapshots;--jobs >= 2is documentedas approximate-only
🤖 Generated with Claude Code