feat(stats,frequency): opt-in Apache DataSketches modes — HLL cardinality, Frequent Items top-K #3840

Merged
jqnatividad merged 4 commits into master from datasketches-suitewide-202605 on May 10, 2026

Conversation

@jqnatividad
Collaborator

Summary

Two new opt-in sketch-backed modes that reuse the existing datasketches crate dependency. Both default to the existing exact behavior — no migration required.

  • stats --cardinality-method approx routes the cardinality column through a HyperLogLog sketch (HLL_LG_K = 12, ~1.5% RSE, ~5KB/column). The --mode-cardinality-cap no longer corrupts cardinality output under approx — the >=<cap> sentinel is never emitted; mode/antimode columns still respect the cap. --infer-boolean forces exact with a warning (boolean inference needs cardinality == 2 exactness). Merging uses HllUnion (associative + order-invariant, fully reproducible across --jobs).
  • frequency --sketch-method frequent_items routes through a Misra-Gries heavy-hitters sketch (--sketch-map-size, default 4096). Rejects flag combinations that don't compose with a streaming top-K sketch: --asc, --weight, --ignore-case, --no-trim, --frequency-jsonl, --stats-filter, --json/--pretty-json/--toon. Cache is bypassed; counts are estimates from frequent_items(NoFalsePositives). The "Other" row is computed from total_weight − Σ(top_k_estimates).
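For readers unfamiliar with heavy-hitters sketches, here is a minimal, std-only sketch of the textbook Misra-Gries idea and of how an "Other" row can fall out of `total_weight − Σ(top_k_estimates)`. It is an illustration under simplifying assumptions, not the `datasketches` crate's implementation: the `MisraGries` type, its methods, and the decrement-by-one eviction are all hypothetical stand-ins.

```rust
use std::collections::HashMap;

/// Toy Misra-Gries summary that keeps at most `capacity` counters.
/// Illustrative only; the real sketch uses a different eviction scheme.
struct MisraGries {
    capacity: usize,
    counters: HashMap<String, u64>,
    total_weight: u64,
}

impl MisraGries {
    fn new(capacity: usize) -> Self {
        MisraGries { capacity, counters: HashMap::new(), total_weight: 0 }
    }

    fn update(&mut self, item: &str) {
        self.total_weight += 1;
        if let Some(count) = self.counters.get_mut(item) {
            *count += 1;
        } else if self.counters.len() < self.capacity {
            self.counters.insert(item.to_string(), 1);
        } else {
            // Table full: decrement every counter and drop those that reach zero.
            // The incoming item is not stored, so estimates can only undercount.
            self.counters.retain(|_, count| {
                *count -= 1;
                *count > 0
            });
        }
    }

    /// Top-k items by estimate (descending, ties broken by key) plus the
    /// remainder that would be reported as "Other".
    fn top_k_with_other(&self, k: usize) -> (Vec<(String, u64)>, u64) {
        let mut items: Vec<(String, u64)> = self
            .counters
            .iter()
            .map(|(item, count)| (item.clone(), *count))
            .collect();
        items.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        items.truncate(k);
        let shown: u64 = items.iter().map(|(_, count)| *count).sum();
        // Counters never exceed true counts, so this subtraction cannot underflow.
        (items, self.total_weight - shown)
    }
}

fn main() {
    let mut sketch = MisraGries::new(2);
    for item in ["a", "a", "a", "b", "c", "a"] {
        sketch.update(item);
    }
    let (top, other) = sketch.top_k_with_other(1);
    println!("top = {:?}, other = {}", top, other);
}
```

Because the counters are estimates, the "Other" figure absorbs both the items outside the top-K and any undercount of the items inside it.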

Cache invalidation rides on the StatsArgs derived PartialEq; old cache files deserialize cleanly via get_str_or("flag_cardinality_method", "exact").
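The backward-compatibility point can be sketched as a lookup with a default. The helper below only borrows the name `get_str_or`; its signature and the `HashMap`-based record are assumptions made for illustration, not qsv's actual code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a lookup-with-default helper.
/// Returns the stored value for `key`, or `default` when the key is absent.
fn get_str_or<'a>(
    record: &'a HashMap<String, String>,
    key: &str,
    default: &'a str,
) -> &'a str {
    record.get(key).map(String::as_str).unwrap_or(default)
}

fn main() {
    // A cache record written before the flag existed has no such key.
    let old_cache: HashMap<String, String> = HashMap::new();

    // A record written after the change carries the flag explicitly.
    let mut new_cache: HashMap<String, String> = HashMap::new();
    new_cache.insert("flag_cardinality_method".to_string(), "approx".to_string());

    println!("old: {}", get_str_or(&old_cache, "flag_cardinality_method", "exact"));
    println!("new: {}", get_str_or(&new_cache, "flag_cardinality_method", "exact"));
}
```

Defaulting the missing key to `"exact"` means an older cache file is read as if it had been produced by the exact path, which matches how it was actually computed.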

Plan A (HLL) and Plan B (Frequent Items) are sequenced so they could land independently — both go through their own opt-in flag.

Tests

  • tests/test_stats.rs: +4 tests — envelope accuracy on 100k rows / 10k unique (within 3% of true), no >=cap sentinel under cap+approx, --infer-boolean fallback to exact, invalid value rejection.
  • tests/test_frequency.rs: +6 tests — Zipfian top-K identity & 5% accuracy, "Other" row emission with integer rank rendering, --asc rejection, --weight rejection, invalid method rejection, invalid map-size rejection.

Test plan

  • cargo build --locked --bin qsv -F all_features
  • cargo t stats -F all_features (744 passed, 3 ignored)
  • cargo t frequency -F all_features (166 passed)
  • cargo clippy -F all_features --bin qsv (clean)
  • CI: full test matrix + lints

Survey of remaining sketch opportunities (not in scope)

For follow-up: Theta in moarstats (set ops on join cardinality), per-group t-digest in pivotp, Bloom filter in dedup/extdedup (likely skip — users want exact). Documented in the plan file.

🤖 Generated with Claude Code

jqnatividad and others added 3 commits May 10, 2026 11:41
…lity, Frequent Items top-K

Two new opt-in sketch-backed modes that reuse the existing `datasketches`
crate dependency. Both default to the existing exact behavior.

`stats --cardinality-method approx` routes the cardinality column through a
HyperLogLog sketch (lg_k=12, ~1.5% RSE, ~5KB/column). The `--mode-cardinality-cap`
no longer corrupts the cardinality output under approx — the `>=<cap>` sentinel
is never emitted; mode/antimode columns still respect the cap. `--infer-boolean`
forces exact with a warning (boolean inference needs `cardinality == 2`
exactness). Merging uses `HllUnion` (associative + order-invariant, fully
reproducible across `--jobs`).
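Why a union-style merge is reproducible across chunk orderings can be seen in a toy model. In textbook HyperLogLog, merging two sketches takes the register-wise maximum, and `max` is commutative and associative. The `ToyHll` type below is a hypothetical, std-only illustration of that property; it is not how `HllUnion` is implemented internally and it omits the estimator entirely.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const LG_K: u32 = 12; // same precision as the PR's lg_k, used here for illustration
const K: usize = 1 << LG_K;

/// Toy HyperLogLog register array (no estimator, merge semantics only).
#[derive(Clone, Debug, PartialEq)]
struct ToyHll {
    registers: Vec<u8>,
}

impl ToyHll {
    fn new() -> Self {
        ToyHll { registers: vec![0; K] }
    }

    fn update(&mut self, value: &str) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        // The top LG_K bits select a register.
        let index = (hash >> (64 - LG_K)) as usize;
        // The remaining bits, left-aligned, give the rank (leading zeros + 1).
        let rest = hash << LG_K;
        let rank = (rest.leading_zeros() + 1).min(64 - LG_K + 1) as u8;
        if rank > self.registers[index] {
            self.registers[index] = rank;
        }
    }

    /// Register-wise maximum: commutative and associative.
    fn merge(&mut self, other: &ToyHll) {
        for (mine, theirs) in self.registers.iter_mut().zip(&other.registers) {
            *mine = (*mine).max(*theirs);
        }
    }
}

fn sketch_of(values: &[String]) -> ToyHll {
    let mut sketch = ToyHll::new();
    for value in values {
        sketch.update(value);
    }
    sketch
}

fn main() {
    let values: Vec<String> = (0..10_000).map(|i| format!("v{}", i)).collect();
    let (a, b, c) = (
        sketch_of(&values[..3_000]),
        sketch_of(&values[3_000..7_000]),
        sketch_of(&values[7_000..]),
    );

    // Merge the three chunk sketches in two different orders.
    let mut abc = a.clone();
    abc.merge(&b);
    abc.merge(&c);
    let mut cab = c.clone();
    cab.merge(&a);
    cab.merge(&b);

    println!("same registers regardless of order: {}", abc == cab);
    println!("same as single pass: {}", abc == sketch_of(&values));
}
```

Since the final registers do not depend on the order in which chunk sketches arrive, any estimate derived from them is also independent of chunk completion order.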

`frequency --sketch-method frequent_items` routes through a Misra-Gries
heavy-hitters sketch (`--sketch-map-size`, default 4096). Rejects flag
combinations that don't compose with a streaming top-K sketch: `--asc`,
`--weight`, `--ignore-case`, `--no-trim`, `--frequency-jsonl`, `--stats-filter`,
`--json/--pretty-json/--toon`. Cache is bypassed; counts are estimates from
`frequent_items(NoFalsePositives)`. The "Other" row is computed from
`total_weight − Σ(top_k_estimates)`.

Cache invalidation rides on the `StatsArgs` derived `PartialEq`; old cache
files deserialize cleanly via `get_str_or("flag_cardinality_method", "exact")`.

Tests: +4 in `tests/test_stats.rs` (envelope accuracy, no-cap-sentinel,
infer-boolean fallback, invalid value), +5 in `tests/test_frequency.rs` (Zipfian
top-K accuracy within 5%, --asc/--weight rejections, invalid method, invalid
map size). 744 stats + 165 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…des (job 2007)

* (Medium) frequency `run_frequent_items`: render the "Other" row's rank with
  itoa instead of the zmij float formatter, matching the per-item rank rendering
  (regression guard added in tests).
* (Low) frequency null label: resolve from `self.flag_null_text` directly
  rather than the `NULL_VAL` OnceLock, so the FI dispatch (which runs before
  `NULL_VAL.set`) sees the same docopt-default `(NULL)` as the exact path.
* (Low) frequency: replace `get_unchecked[_mut]` in `run_frequent_items` with
  safe indexing; the bounds checks vanish in practice, and dropping the
  `unsafe` removes a footgun for future changes.
* (Low) stats: hoist the HLL `lg_k=12` magic literal to a module-level
  `const HLL_LG_K: u8 = 12;` and use it at both `Stats::new` and the
  `Commute::merge` `HllUnion` site so the two sketch precisions can never
  silently diverge.
* (Low) tests: new `frequency_sketch_method_frequent_items_other_row_emission`
  exercises the Other row directly, asserting both that its rank renders as
  the integer `"4"` (the regression guard) and that its count is within 5% of
  the expected `total - top_k`.

744 stats + 166 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* (Low) tests/test_frequency.rs: replace `.expect("... {parsed:?}")` with
  `.unwrap_or_else(|| panic!(...))` so the format string is actually
  interpolated on failure — `expect` takes a plain `&str` and would print
  the placeholder verbatim.
* (Low) src/cmd/frequency.rs `run_frequent_items`: strengthen the null-label
  comment with an explicit invariant — the only `NULL_VAL.set(...)` site in
  this module is at the top of `run()` and it sets `NULL_VAL` from
  `args.flag_null_text.as_bytes().to_vec()`, so reading `self.flag_null_text`
  here is byte-identical to the exact path. The invariant is grep-verifiable.
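The `expect` point in the first item can be reproduced in isolation. The snippet below is a standalone demonstration, not the test code from this PR; the `panic_message` helper and the message text are made up for the example. It captures both panic messages so the difference is visible.

```rust
use std::panic;

/// Runs a closure that is expected to panic and returns the panic message.
fn panic_message<F: FnOnce() + panic::UnwindSafe>(f: F) -> String {
    let payload = panic::catch_unwind(f).expect_err("closure should have panicked");
    if let Some(message) = payload.downcast_ref::<String>() {
        message.clone()
    } else if let Some(message) = payload.downcast_ref::<&str>() {
        (*message).to_string()
    } else {
        String::from("<non-string panic payload>")
    }
}

fn main() {
    // Silence the default panic output so only the captured messages print.
    panic::set_hook(Box::new(|_| {}));

    let parsed: Vec<u32> = vec![1, 2, 3];
    let missing: Option<u32> = None;

    // `expect` takes a plain &str, so the braces are not a format placeholder.
    let literal = panic_message(|| {
        let _ = missing.expect("row not found in {parsed:?}");
    });

    // `panic!` formats its arguments, so the value is interpolated.
    let interpolated = panic_message(|| {
        let _ = missing.unwrap_or_else(|| panic!("row not found in {:?}", parsed));
    });

    println!("expect:         {}", literal);
    println!("unwrap_or_else: {}", interpolated);
}
```

The closure passed to `unwrap_or_else` only runs on the failure path, so the formatting cost is not paid when the value is present.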

744 stats + 166 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad requested a review from Copilot May 10, 2026 16:01
@codacy-production

codacy-production Bot commented May 10, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 0 complexity

Metric Results
Complexity 0

Contributor

Copilot AI left a comment

Pull request overview

Adds two opt-in, sketch-backed execution paths to improve scalability of stats cardinality and frequency top-K on high-cardinality inputs while keeping existing “exact” behavior as the default.

Changes:

  • stats: introduce --cardinality-method {exact|approx} with an HLL-based approximate cardinality path and HLL union merging.
  • frequency: introduce --sketch-method {exact|frequent_items} with a streaming Frequent Items sketch path and new --sketch-map-size validation.
  • Add integration tests covering approximate accuracy envelopes, cap interactions, invalid flag values, and sketch-mode rejections.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/test_stats.rs | Adds tests for HLL approximate cardinality behavior, cap interactions, infer-boolean fallback, and invalid method handling. |
| tests/test_frequency.rs | Adds tests for Frequent Items sketch top-K accuracy, “Other” row behavior, and invalid/unsupported flag combinations. |
| src/util.rs | Extends stats-args defaults to include the new flag_cardinality_method value for stats-record generation. |
| src/cmd/stats.rs | Implements --cardinality-method validation, wiring into WhichStats, HLL sketch tracking, and deterministic union merging; updates CLI help text. |
| src/cmd/schema.rs | Updates the embedded frequency::Args construction to include the newly added sketch-related flags. |
| src/cmd/frequency.rs | Implements --sketch-method dispatch/validation and adds a streaming Frequent Items sketch execution path. |

Comment thread src/cmd/stats.rs Outdated
Comment thread src/cmd/stats.rs Outdated
Comment thread src/cmd/frequency.rs
Comment thread src/cmd/frequency.rs
Four Copilot review comments on PR #3840:

* `src/cmd/stats.rs` (--mode-cardinality-cap help): drop the "precise
  estimate" wording — HLL is approximate (~1.5% RSE).
* `src/cmd/stats.rs` (--cardinality-method approx help): the "Results may
  differ slightly across --jobs values" claim was wrong — `Commute::merge`
  uses `HllUnion`, which is associative + order-invariant, so chunk
  completion order does not affect the final estimate. Updated the help
  to state reproducibility explicitly.
* `src/cmd/frequency.rs` (FI mode): `--other-sorted` and `--null-sorted`
  now error out under `--sketch-method frequent_items` (the sketch always
  emits Other at the end and ranks NULL inline by estimate).
  `--rank-strategy`, `--lmt-threshold`, and `--unq-limit` are documented
  as no-ops in the --sketch-method help (the sketch's natural ordering
  is top-K by estimate descending).
* `src/cmd/frequency.rs` (FI Other row): match the exact path's
  rank-0 sentinel for synthetic rows, and document the label divergence
  (FI emits the bare --other-text without the unique-count suffix because
  the sketch cannot recover the count of distinct items not in the
  top-K). Test updated accordingly: rank now expected to be "0" instead
  of "4".
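The "error out on flags that do not compose" behavior described in the third item can be sketched as an up-front validation pass. Everything below is hypothetical: the `FreqFlags` struct, its field names, the error wording, and the choice of only three rejected flags are assumptions for illustration and do not mirror the actual `frequency::Args`.

```rust
/// Hypothetical subset of the frequency flags; field names are illustrative.
#[derive(Default)]
struct FreqFlags {
    sketch_method: String,
    asc: bool,
    other_sorted: bool,
    null_sorted: bool,
}

/// Rejects flags that do not compose with the sketch path.
/// The exact path accepts every combination listed here.
fn validate(flags: &FreqFlags) -> Result<(), String> {
    if flags.sketch_method != "frequent_items" {
        return Ok(());
    }
    let rejected = [
        ("--asc", flags.asc),
        ("--other-sorted", flags.other_sorted),
        ("--null-sorted", flags.null_sorted),
    ];
    for (name, is_set) in rejected {
        if is_set {
            return Err(format!(
                "{} is not supported with --sketch-method frequent_items",
                name
            ));
        }
    }
    Ok(())
}

fn main() {
    let flags = FreqFlags {
        sketch_method: "frequent_items".to_string(),
        other_sorted: true,
        ..Default::default()
    };
    match validate(&flags) {
        Ok(()) => println!("flags accepted"),
        Err(message) => println!("error: {}", message),
    }
}
```

Failing fast on an unsupported combination is arguably safer than silently ignoring the flag, since the user would otherwise get output in an order they did not ask for.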

Tests: +2 (`--other-sorted` rejection, `--null-sorted` rejection).
744 stats + 168 frequency tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit debbc2c into master May 10, 2026
17 checks passed
@jqnatividad jqnatividad deleted the datasketches-suitewide-202605 branch May 10, 2026 16:59

2 participants