feat(stats,frequency): opt-in Apache DataSketches modes — HLL cardinality, Frequent Items top-K#3840
Merged
Merged
Conversation
…lity, Frequent Items top-K
Two new opt-in sketch-backed modes that reuse the existing `datasketches`
crate dependency. Both default to the existing exact behavior.
`stats --cardinality-method approx` routes the cardinality column through a
HyperLogLog sketch (lg_k=12, ~1.5% RSE, ~5KB/column). The `--mode-cardinality-cap`
no longer corrupts the cardinality output under approx — the `>=<cap>` sentinel
is never emitted; mode/antimode columns still respect the cap. `--infer-boolean`
forces exact with a warning (boolean inference needs `cardinality == 2`
exactness). Merging uses `HllUnion` (associative + order-invariant, fully
reproducible across `--jobs`).
`frequency --sketch-method frequent_items` routes through a Misra-Gries
heavy-hitters sketch (`--sketch-map-size`, default 4096). Rejects flag
combinations that don't compose with a streaming top-K sketch: `--asc`,
`--weight`, `--ignore-case`, `--no-trim`, `--frequency-jsonl`, `--stats-filter`,
`--json/--pretty-json/--toon`. Cache is bypassed; counts are estimates from
`frequent_items(NoFalsePositives)`. The "Other" row is computed from
`total_weight − Σ(top_k_estimates)`.
Cache invalidation rides on the `StatsArgs` derived `PartialEq`; old cache
files deserialize cleanly via `get_str_or("flag_cardinality_method", "exact")`.
Tests: +4 in `tests/test_stats.rs` (envelope accuracy, no-cap-sentinel,
infer-boolean fallback, invalid value), +5 in `tests/test_frequency.rs` (Zipfian
top-K accuracy within 5%, --asc/--weight rejections, invalid method, invalid
map size). 744 stats + 165 frequency tests pass; clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…des (job 2007) * (Medium) frequency `run_frequent_items`: render the "Other" row's rank with itoa instead of the zmij float formatter, matching the per-item rank rendering (regression guard added in tests). * (Low) frequency null label: resolve from `self.flag_null_text` directly rather than the `NULL_VAL` OnceLock, so the FI dispatch (which runs before `NULL_VAL.set`) sees the same docopt-default `(NULL)` as the exact path. * (Low) frequency: replace `get_unchecked[_mut]` in `run_frequent_items` with safe indexing — the bounds checks vanish in practice and removing the `unsafe` hides a footgun for future changes. * (Low) stats: hoist the HLL `lg_k=12` magic literal to a module-level `const HLL_LG_K: u8 = 12;` and use it at both `Stats::new` and the `Commute::merge` `HllUnion` site so the two sketch precisions can never silently diverge. * (Low) tests: new `frequency_sketch_method_frequent_items_other_row_emission` exercises the Other row directly, asserting both that its rank renders as the integer `"4"` (the regression guard) and that its count is within 5% of the expected `total - top_k`. 744 stats + 166 frequency tests pass; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* (Low) tests/test_frequency.rs: replace `.expect("... {parsed:?}")` with
`.unwrap_or_else(|| panic!(...))` so the format string is actually
interpolated on failure — `expect` takes a plain `&str` and would print
the placeholder verbatim.
* (Low) src/cmd/frequency.rs `run_frequent_items`: strengthen the null-label
comment with an explicit invariant — the only `NULL_VAL.set(...)` site in
this module is at the top of `run()` and it sets `NULL_VAL` from
`args.flag_null_text.as_bytes().to_vec()`, so reading `self.flag_null_text`
here is byte-identical to the exact path. The invariant is grep-verifiable.
744 stats + 166 frequency tests pass; clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Contributor
There was a problem hiding this comment.
Pull request overview
Adds two opt-in, sketch-backed execution paths to improve scalability of stats cardinality and frequency top-K on high-cardinality inputs while keeping existing “exact” behavior as the default.
Changes:
stats: introduce--cardinality-method {exact|approx}with an HLL-based approximate cardinality path and HLL union merging.frequency: introduce--sketch-method {exact|frequent_items}with a streaming Frequent Items sketch path and new--sketch-map-sizevalidation.- Add integration tests covering approximate accuracy envelopes, cap interactions, invalid flag values, and sketch-mode rejections.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_stats.rs | Adds tests for HLL approximate cardinality behavior, cap interactions, infer-boolean fallback, and invalid method handling. |
| tests/test_frequency.rs | Adds tests for Frequent Items sketch top-K accuracy, “Other” row behavior, and invalid/unsupported flag combinations. |
| src/util.rs | Extends stats-args defaults to include the new flag_cardinality_method value for stats-record generation. |
| src/cmd/stats.rs | Implements --cardinality-method validation, wiring into WhichStats, HLL sketch tracking, and deterministic union merging; updates CLI help text. |
| src/cmd/schema.rs | Updates the embedded frequency::Args construction to include the newly added sketch-related flags. |
| src/cmd/frequency.rs | Implements --sketch-method dispatch/validation and adds a streaming Frequent Items sketch execution path. |
Four Copilot review comments on PR #3840: * `src/cmd/stats.rs` (--mode-cardinality-cap help): drop the "precise estimate" wording — HLL is approximate (~1.5% RSE). * `src/cmd/stats.rs` (--cardinality-method approx help): the "Results may differ slightly across --jobs values" claim was wrong — `Commute::merge` uses `HllUnion`, which is associative + order-invariant, so chunk completion order does not affect the final estimate. Updated the help to state reproducibility explicitly. * `src/cmd/frequency.rs` (FI mode): `--other-sorted` and `--null-sorted` now error out under `--sketch-method frequent_items` (the sketch always emits Other at the end and ranks NULL inline by estimate). `--rank-strategy`, `--lmt-threshold`, and `--unq-limit` are documented as no-ops in the --sketch-method help (the sketch's natural ordering is top-K by estimate descending). * `src/cmd/frequency.rs` (FI Other row): match the exact path's rank-0 sentinel for synthetic rows, and document the label divergence (FI emits the bare --other-text without the unique-count suffix because the sketch cannot recover the count of distinct items not in the top-K). Test updated accordingly: rank now expected to be "0" instead of "4". Tests: +2 (`--other-sorted` rejection, `--null-sorted` rejection). 744 stats + 168 frequency tests pass; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two new opt-in sketch-backed modes that reuse the existing
datasketchescrate dependency. Both default to the existing exact behavior — no migration required.stats --cardinality-method approxroutes the cardinality column through a HyperLogLog sketch (HLL_LG_K = 12, ~1.5% RSE, ~5KB/column). The--mode-cardinality-capno longer corrupts cardinality output under approx — the>=<cap>sentinel is never emitted; mode/antimode columns still respect the cap.--infer-booleanforces exact with a warning (boolean inference needscardinality == 2exactness). Merging usesHllUnion(associative + order-invariant, fully reproducible across--jobs).frequency --sketch-method frequent_itemsroutes through a Misra-Gries heavy-hitters sketch (--sketch-map-size, default 4096). Rejects flag combinations that don't compose with a streaming top-K sketch:--asc,--weight,--ignore-case,--no-trim,--frequency-jsonl,--stats-filter,--json/--pretty-json/--toon. Cache is bypassed; counts are estimates fromfrequent_items(NoFalsePositives). The "Other" row is computed fromtotal_weight − Σ(top_k_estimates).Cache invalidation rides on the
StatsArgsderivedPartialEq; old cache files deserialize cleanly viaget_str_or("flag_cardinality_method", "exact").Plan A (HLL) and Plan B (Frequent Items) are sequenced so they could land independently — both go through their own opt-in flag.
Tests
tests/test_stats.rs: +4 tests — envelope accuracy on 100k rows / 10k unique (within 3% of true), no>=capsentinel under cap+approx,--infer-booleanfallback to exact, invalid value rejection.tests/test_frequency.rs: +6 tests — Zipfian top-K identity & 5% accuracy, "Other" row emission with integer rank rendering,--ascrejection,--weightrejection, invalid method rejection, invalid map-size rejection.Test plan
cargo build --locked --bin qsv -F all_featurescargo t stats -F all_features(744 passed, 3 ignored)cargo t frequency -F all_features(166 passed)cargo clippy -F all_features --bin qsv(clean)Survey of remaining sketch opportunities (not in scope)
For follow-up: Theta in
moarstats(set ops on join cardinality), per-group t-digest inpivotp, Bloom filter indedup/extdedup(likely skip — users want exact). Documented in the plan file.🤖 Generated with Claude Code