feat(stats,frequency): opt-in Apache DataSketches modes — HLL cardinality, Frequent Items top-K by jqnatividad · Pull Request #3840 · dathere/qsv

jqnatividad · 2026-05-10T16:00:18Z

Summary

Two new opt-in sketch-backed modes that reuse the existing datasketches crate dependency. Both default to the existing exact behavior — no migration required.

stats --cardinality-method approx routes the cardinality column through a HyperLogLog sketch (HLL_LG_K = 12, ~1.5% RSE, ~5KB/column). The --mode-cardinality-cap no longer corrupts cardinality output under approx — the >=<cap> sentinel is never emitted; mode/antimode columns still respect the cap. --infer-boolean forces exact with a warning (boolean inference needs cardinality == 2 exactness). Merging uses HllUnion (associative + order-invariant, fully reproducible across --jobs).
frequency --sketch-method frequent_items routes through a Misra-Gries heavy-hitters sketch (--sketch-map-size, default 4096). Rejects flag combinations that don't compose with a streaming top-K sketch: --asc, --weight, --ignore-case, --no-trim, --frequency-jsonl, --stats-filter, --json/--pretty-json/--toon. Cache is bypassed; counts are estimates from frequent_items(NoFalsePositives). The "Other" row is computed from total_weight − Σ(top_k_estimates).
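To illustrate the Frequent Items path described above, here is a minimal sketch of a textbook Misra-Gries summary. It is not the `datasketches` crate's implementation: the `MisraGries` type, its `capacity` field, and the `other_count` helper are hypothetical names chosen for this example. Only the "Other" arithmetic (`total_weight − Σ(top_k_estimates)`) is taken from the description.

```rust
use std::collections::HashMap;

/// Textbook Misra-Gries summary that tracks at most `capacity` counters.
/// Hypothetical stand-in for illustration; not the datasketches API.
struct MisraGries {
    capacity: usize,
    counters: HashMap<String, u64>,
    total_weight: u64,
}

impl MisraGries {
    fn new(capacity: usize) -> Self {
        Self { capacity, counters: HashMap::new(), total_weight: 0 }
    }

    /// Observe one item with weight 1.
    fn update(&mut self, item: &str) {
        self.total_weight += 1;
        if let Some(c) = self.counters.get_mut(item) {
            *c += 1;
        } else if self.counters.len() < self.capacity {
            self.counters.insert(item.to_string(), 1);
        } else {
            // Table full: decrement every counter and drop those that reach zero.
            // The new item is not inserted, so estimates are lower bounds.
            self.counters.retain(|_, c| {
                *c -= 1;
                *c > 0
            });
        }
    }

    /// Top-k items by estimated count, descending (ties broken by item name).
    fn top_k(&self, k: usize) -> Vec<(String, u64)> {
        let mut v: Vec<(String, u64)> =
            self.counters.iter().map(|(s, c)| (s.clone(), *c)).collect();
        v.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        v.truncate(k);
        v
    }
}

/// "Other" row: total weight minus the sum of the reported estimates.
fn other_count(total_weight: u64, top: &[(String, u64)]) -> u64 {
    let reported: u64 = top.iter().map(|(_, c)| *c).sum();
    total_weight.saturating_sub(reported)
}
```

Because the summary never stores more than `capacity` counters, memory stays bounded regardless of how many distinct values the column holds. That is the trade-off the opt-in mode makes against exact counts.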

Cache invalidation rides on the StatsArgs derived PartialEq; old cache files deserialize cleanly via get_str_or("flag_cardinality_method", "exact").
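The backward-compatibility behavior can be pictured roughly as follows. This sketch assumes the cached arguments are available as a string map; the `get_str_or` function here is a hypothetical free function that only mirrors the name in the description, and `CachedArgs` is a cut-down stand-in for `StatsArgs`. The real signatures in qsv may differ.

```rust
use std::collections::HashMap;

/// Hypothetical helper: return the stored value for `key`, or `default`
/// when the key is absent (as it would be in an older cache file).
fn get_str_or<'a>(
    record: &'a HashMap<String, String>,
    key: &str,
    default: &'a str,
) -> &'a str {
    record.get(key).map(String::as_str).unwrap_or(default)
}

/// Cut-down stand-in for the cached arguments (hypothetical single field).
#[derive(Debug, PartialEq)]
struct CachedArgs {
    flag_cardinality_method: String,
}

/// Build the args from a cache record, defaulting the new flag to "exact".
fn load_cached_args(record: &HashMap<String, String>) -> CachedArgs {
    CachedArgs {
        flag_cardinality_method: get_str_or(record, "flag_cardinality_method", "exact")
            .to_string(),
    }
}

/// The cache is reusable only when the stored args equal the current ones,
/// which is what a derived `PartialEq` gives for free.
fn cache_is_valid(cached: &CachedArgs, current: &CachedArgs) -> bool {
    cached == current
}
```

Under these assumptions, an old cache written before the flag existed compares equal to a current `exact` run and unequal to an `approx` run, so only the latter triggers recomputation.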

Plan A (HLL) and Plan B (Frequent Items) are sequenced so they could land independently — both go through their own opt-in flag.

Tests

tests/test_stats.rs: +4 tests — envelope accuracy on 100k rows / 10k unique (within 3% of true), no >=cap sentinel under cap+approx, --infer-boolean fallback to exact, invalid value rejection.
tests/test_frequency.rs: +6 tests — Zipfian top-K identity & 5% accuracy, "Other" row emission with integer rank rendering, --asc rejection, --weight rejection, invalid method rejection, invalid map-size rejection.
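The test bodies are not shown on this page. As a rough idea of what an "envelope accuracy" assertion could look like, the following helper checks that an estimate falls within a relative tolerance of the true value. The function name and signature are assumptions made for this example, not code from the PR.

```rust
/// True if `estimate` is within `tolerance` (e.g. 0.03 for 3%) of `truth`.
/// Hypothetical helper for illustration only.
fn within_relative_error(estimate: u64, truth: u64, tolerance: f64) -> bool {
    if truth == 0 {
        // Avoid dividing by zero: only an exact zero estimate is acceptable.
        return estimate == 0;
    }
    let err = (estimate as f64 - truth as f64).abs() / truth as f64;
    err <= tolerance
}
```

For the stats case above (10k unique values, 3% envelope), an estimate of 10,250 would pass and 10,400 would fail.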

Test plan

cargo build --locked --bin qsv -F all_features
cargo t stats -F all_features (744 passed, 3 ignored)
cargo t frequency -F all_features (166 passed)
cargo clippy -F all_features --bin qsv (clean)
CI: full test matrix + lints

Survey of remaining sketch opportunities (not in scope)

For follow-up: Theta in moarstats (set ops on join cardinality), per-group t-digest in pivotp, Bloom filter in dedup/extdedup (likely skip — users want exact). Documented in the plan file.

🤖 Generated with Claude Code

…lity, Frequent Items top-K

Two new opt-in sketch-backed modes that reuse the existing `datasketches` crate dependency. Both default to the existing exact behavior.

`stats --cardinality-method approx` routes the cardinality column through a HyperLogLog sketch (lg_k=12, ~1.5% RSE, ~5KB/column). The `--mode-cardinality-cap` no longer corrupts the cardinality output under approx: the `>=<cap>` sentinel is never emitted; mode/antimode columns still respect the cap. `--infer-boolean` forces exact with a warning (boolean inference needs `cardinality == 2` exactness). Merging uses `HllUnion` (associative + order-invariant, fully reproducible across `--jobs`).

`frequency --sketch-method frequent_items` routes through a Misra-Gries heavy-hitters sketch (`--sketch-map-size`, default 4096). Rejects flag combinations that don't compose with a streaming top-K sketch: `--asc`, `--weight`, `--ignore-case`, `--no-trim`, `--frequency-jsonl`, `--stats-filter`, `--json/--pretty-json/--toon`. Cache is bypassed; counts are estimates from `frequent_items(NoFalsePositives)`. The "Other" row is computed from `total_weight − Σ(top_k_estimates)`.

Cache invalidation rides on the `StatsArgs` derived `PartialEq`; old cache files deserialize cleanly via `get_str_or("flag_cardinality_method", "exact")`.

Tests: +4 in `tests/test_stats.rs` (envelope accuracy, no-cap-sentinel, infer-boolean fallback, invalid value), +5 in `tests/test_frequency.rs` (Zipfian top-K accuracy within 5%, --asc/--weight rejections, invalid method, invalid map size).

744 stats + 165 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…des (job 2007)

* (Medium) frequency `run_frequent_items`: render the "Other" row's rank with itoa instead of the zmij float formatter, matching the per-item rank rendering (regression guard added in tests).
* (Low) frequency null label: resolve from `self.flag_null_text` directly rather than the `NULL_VAL` OnceLock, so the FI dispatch (which runs before `NULL_VAL.set`) sees the same docopt-default `(NULL)` as the exact path.
* (Low) frequency: replace `get_unchecked[_mut]` in `run_frequent_items` with safe indexing. The bounds checks vanish in practice, and dropping the `unsafe` removes a footgun for future changes.
* (Low) stats: hoist the HLL `lg_k=12` magic literal to a module-level `const HLL_LG_K: u8 = 12;` and use it at both `Stats::new` and the `Commute::merge` `HllUnion` site so the two sketch precisions can never silently diverge.
* (Low) tests: new `frequency_sketch_method_frequent_items_other_row_emission` exercises the Other row directly, asserting both that its rank renders as the integer `"4"` (the regression guard) and that its count is within 5% of the expected `total - top_k`.

744 stats + 166 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* (Low) tests/test_frequency.rs: replace `.expect("... {parsed:?}")` with `.unwrap_or_else(|| panic!(...))` so the format string is actually interpolated on failure. `expect` takes a plain `&str` and would print the placeholder verbatim.
* (Low) src/cmd/frequency.rs `run_frequent_items`: strengthen the null-label comment with an explicit invariant. The only `NULL_VAL.set(...)` site in this module is at the top of `run()`, and it sets `NULL_VAL` from `args.flag_null_text.as_bytes().to_vec()`, so reading `self.flag_null_text` here is byte-identical to the exact path. The invariant is grep-verifiable.

744 stats + 166 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
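The `expect` versus `unwrap_or_else` point can be shown with a small, self-contained sketch. The `lookup` function and the map contents are hypothetical; the behavior of `Option::expect` (no formatting of its `&str` argument) is standard library behavior.

```rust
use std::collections::HashMap;

/// Looks up `key`, panicking with the whole map in the message on failure.
/// Hypothetical helper for illustration only.
fn lookup(parsed: &HashMap<String, u64>, key: &str) -> u64 {
    // `.expect("... {parsed:?}")` would print the braces literally, because
    // `expect` takes a plain `&str` and performs no formatting.
    // `panic!` inside `unwrap_or_else` formats its arguments, and the closure
    // only runs on the failure path.
    *parsed
        .get(key)
        .unwrap_or_else(|| panic!("key {:?} missing; parsed = {:?}", key, parsed))
}
```

Since the closure is only evaluated when the value is `None`, the formatting cost is not paid on the success path.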

codacy-production · 2026-05-10T16:01:14Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 0 complexity

Metric Results

Complexity 0


Copilot

Pull request overview

Adds two opt-in, sketch-backed execution paths to improve scalability of stats cardinality and frequency top-K on high-cardinality inputs while keeping existing “exact” behavior as the default.

Changes:

stats: introduce --cardinality-method {exact|approx} with an HLL-based approximate cardinality path and HLL union merging.
frequency: introduce --sketch-method {exact|frequent_items} with a streaming Frequent Items sketch path and new --sketch-map-size validation.
Add integration tests covering approximate accuracy envelopes, cap interactions, invalid flag values, and sketch-mode rejections.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

File	Description
tests/test_stats.rs	Adds tests for HLL approximate cardinality behavior, cap interactions, infer-boolean fallback, and invalid method handling.
tests/test_frequency.rs	Adds tests for Frequent Items sketch top-K accuracy, “Other” row behavior, and invalid/unsupported flag combinations.
src/util.rs	Extends stats-args defaults to include the new `flag_cardinality_method` value for stats-record generation.
src/cmd/stats.rs	Implements `--cardinality-method` validation, wiring into `WhichStats`, HLL sketch tracking, and deterministic union merging; updates CLI help text.
src/cmd/schema.rs	Updates the embedded `frequency::Args` construction to include the newly added sketch-related flags.
src/cmd/frequency.rs	Implements `--sketch-method` dispatch/validation and adds a streaming Frequent Items sketch execution path.

Four Copilot review comments on PR #3840:

* `src/cmd/stats.rs` (--mode-cardinality-cap help): drop the "precise estimate" wording, since HLL is approximate (~1.5% RSE).
* `src/cmd/stats.rs` (--cardinality-method approx help): the "Results may differ slightly across --jobs values" claim was wrong. `Commute::merge` uses `HllUnion`, which is associative + order-invariant, so chunk completion order does not affect the final estimate. Updated the help to state reproducibility explicitly.
* `src/cmd/frequency.rs` (FI mode): `--other-sorted` and `--null-sorted` now error out under `--sketch-method frequent_items` (the sketch always emits Other at the end and ranks NULL inline by estimate). `--rank-strategy`, `--lmt-threshold`, and `--unq-limit` are documented as no-ops in the --sketch-method help (the sketch's natural ordering is top-K by estimate descending).
* `src/cmd/frequency.rs` (FI Other row): match the exact path's rank-0 sentinel for synthetic rows, and document the label divergence (FI emits the bare --other-text without the unique-count suffix because the sketch cannot recover the count of distinct items not in the top-K). Test updated accordingly: rank now expected to be "0" instead of "4".

Tests: +2 (`--other-sorted` rejection, `--null-sorted` rejection).

744 stats + 168 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
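The reproducibility argument above rests on the union being associative and order-invariant. A minimal sketch of why that holds, assuming the union reduces to an element-wise maximum over HLL registers (the standard HyperLogLog merge): the `Registers` type and its methods are hypothetical and leave out everything a real `HllUnion` does beyond that (storage modes, estimators, differing precisions). Only the `HLL_LG_K` constant mirrors the PR.

```rust
/// Precision named in the PR; everything else here is illustrative.
const HLL_LG_K: u8 = 12;

/// Toy register array with one `u8` per slot (2^HLL_LG_K slots).
/// Hypothetical stand-in for an HLL sketch, not the datasketches type.
#[derive(Clone, Debug, PartialEq)]
struct Registers(Vec<u8>);

impl Registers {
    fn new() -> Self {
        Registers(vec![0; 1usize << HLL_LG_K])
    }

    /// Record an observation: each slot keeps the largest rank it has seen.
    fn observe(&mut self, slot: usize, rank: u8) {
        let r = &mut self.0[slot];
        *r = (*r).max(rank);
    }

    /// Union as element-wise maximum. `max` is commutative and associative,
    /// so the result does not depend on the order in which chunks are merged.
    fn union(&self, other: &Registers) -> Registers {
        Registers(
            self.0
                .iter()
                .zip(&other.0)
                .map(|(a, b)| *a.max(b))
                .collect(),
        )
    }
}
```

Merging three per-chunk sketches as `(a ∪ b) ∪ c`, `c ∪ (a ∪ b)`, or `a ∪ (b ∪ c)` yields identical registers, which is why the number of `--jobs` and the chunk completion order should not change the estimate under this model.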

jqnatividad and others added 3 commits May 10, 2026 11:41

jqnatividad requested a review from Copilot May 10, 2026 16:01

Copilot started reviewing on behalf of jqnatividad May 10, 2026 16:01

Copilot AI reviewed May 10, 2026

jqnatividad merged commit debbc2c into master May 10, 2026
17 checks passed

jqnatividad deleted the datasketches-suitewide-202605 branch May 10, 2026 16:59

jqnatividad merged 4 commits into master from datasketches-suitewide-202605

2 participants