Add SelectivityTracker adaptive filter cost model#22236

Draft
adriangb wants to merge 3 commits into apache:main from adriangb:pr3-selectivity-tracker

Conversation

@adriangb
Contributor

Which issue does this PR close?

Rationale for this change

The cost model that decides where each filter conjunct runs (row-level, post-scan, or dropped) is large enough to review on its own, separate from the scan plumbing that consumes it.

What changes are included in this PR?

  • SelectivityTracker: a cross-file cost model that accumulates per-filter selectivity and throughput statistics and, using a confidence interval, partitions filter conjuncts into row-level / post-scan / dropped buckets.
  • total_compressed_bytes helper in row_filter (column-byte sizing used by the tracker).
  • A criterion benchmark for the tracker.
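
The partitioning described in the list above can be sketched roughly as follows. This is a minimal illustration and not the PR's implementation: `Placement`, `FilterStats`, `place`, the normal-approximation interval, and the single pass-rate threshold are all assumptions made for the example, and the throughput statistics the tracker also collects are left out.

```rust
/// Where a filter conjunct is evaluated (illustrative names, not the PR's API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Placement {
    RowLevel,
    PostScan,
    Dropped,
}

/// Running counts for one filter conjunct, accumulated across files.
#[derive(Debug, Default, Clone, Copy)]
pub struct FilterStats {
    rows_seen: u64,
    rows_passed: u64,
}

impl FilterStats {
    /// Record one evaluated batch.
    pub fn update(&mut self, rows_seen: u64, rows_passed: u64) {
        self.rows_seen += rows_seen;
        self.rows_passed += rows_passed;
    }

    /// Normal-approximation confidence interval for the pass rate,
    /// clamped to [0, 1]. Returns `None` until a row has been observed.
    pub fn pass_rate_interval(&self, z: f64) -> Option<(f64, f64)> {
        if self.rows_seen == 0 {
            return None;
        }
        let n = self.rows_seen as f64;
        let p = self.rows_passed as f64 / n;
        let half_width = z * (p * (1.0 - p) / n).sqrt();
        Some(((p - half_width).max(0.0), (p + half_width).min(1.0)))
    }
}

/// Assumed decision rule: act only when the whole interval lies on one
/// side of `threshold`; otherwise stay with post-scan evaluation.
pub fn place(stats: &FilterStats, optional: bool, z: f64, threshold: f64) -> Placement {
    match stats.pass_rate_interval(z) {
        // Confidently selective: even the upper bound is below the threshold.
        Some((_, upper)) if upper < threshold => Placement::RowLevel,
        // Confidently unselective and optional: safe to drop.
        Some((lower, _)) if lower > threshold && optional => Placement::Dropped,
        // No data, an inconclusive interval, or a mandatory unselective filter.
        _ => Placement::PostScan,
    }
}
```

In this sketch a filter that passes 100 of 10,000 rows is placed at row level, while one that passes 9,900 of 10,000 is dropped only if it is optional; a mandatory filter in the same situation stays post-scan.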

Nothing wires the tracker into the parquet scan yet — that is the final PR in the stack. A few pub(crate) items only exercised by that integration carry a temporary #[expect(dead_code)], removed in PR 4.

Are these changes tested?

Yes — ~45 unit tests cover the partition / promote / demote / drop logic.

Are there any user-facing changes?

New pub module `selectivity` in the datafusion-datasource-parquet crate. No behavior change: no production code path uses it yet.


Stacked PR — diff is cumulative against main. Review the top commit "feat: add SelectivityTracker adaptive filter cost model"; the commits below it are PRs #22234 and #22235.

Stack (review/merge in order):

  1. #22234: Add OptionalFilterPhysicalExpr wrapper + proto support
  2. #22235: Per-conjunct pruning statistics for PruningPredicate
  3. this PR — SelectivityTracker cost model
  4. Adaptive parquet scan integration

adriangb and others added 3 commits May 15, 2026 14:12
Introduce `OptionalFilterPhysicalExpr`, a transparent `PhysicalExpr`
wrapper that marks a filter as *optional* — droppable without affecting
query correctness. It delegates every `PhysicalExpr` method to the inner
expression, so it is behavior-neutral until a consumer explicitly checks
for the marker.

This is the foundation for adaptive filter scheduling: a scan can detect
the wrapper and drop a performance-hint filter (e.g. a hash-join dynamic
filter) when it is not cost-effective, knowing correctness is enforced
elsewhere.

Also adds proto serialization (`PhysicalOptionalFilterNode`) so physical
plans containing the wrapper round-trip faithfully.

No caller wraps anything yet — that arrives with the adaptive parquet
scan later in the stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
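
The transparent-wrapper idea in the commit message above can be illustrated with a small sketch. It uses a simplified stand-in trait, `Expr`, rather than DataFusion's actual `PhysicalExpr`; `Expr`, `GreaterThan`, `OptionalFilter`, and `is_optional` are hypothetical names chosen for the example, and the real wrapper delegates many more methods.

```rust
use std::any::Any;
use std::sync::Arc;

/// Simplified stand-in for a physical expression (NOT DataFusion's trait).
pub trait Expr {
    fn evaluate(&self, value: i64) -> bool;
    fn as_any(&self) -> &dyn Any;
}

/// A concrete predicate: `value > bound`.
pub struct GreaterThan(pub i64);

impl Expr for GreaterThan {
    fn evaluate(&self, value: i64) -> bool {
        value > self.0
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Marker wrapper: behaves exactly like `inner`, but its type signals
/// that the filter may be dropped without affecting correctness.
pub struct OptionalFilter {
    inner: Arc<dyn Expr>,
}

impl OptionalFilter {
    pub fn new(inner: Arc<dyn Expr>) -> Self {
        Self { inner }
    }
}

impl Expr for OptionalFilter {
    // Every method delegates to the inner expression.
    fn evaluate(&self, value: i64) -> bool {
        self.inner.evaluate(value)
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// A consumer opts in by checking for the marker via downcasting.
pub fn is_optional(expr: &Arc<dyn Expr>) -> bool {
    expr.as_any().downcast_ref::<OptionalFilter>().is_some()
}
```

Code that never calls `is_optional` sees identical results from the wrapped and unwrapped expression, which is what makes the wrapper behavior-neutral in this sketch.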
Add an opt-in way to learn, per individual conjunct, how effective each
predicate was during pruning — without running any extra pruning passes.

- `PruningPredicate::try_new_tagged_conjuncts` builds a predicate from
  AND-conjuncts, each carrying a caller-supplied tag.
- `PruningPredicate::prune_per_conjunct` returns the usual prune mask
  plus per-conjunct `PerConjunctPruneStats` (rows/containers seen vs.
  skipped) as a side effect of the pruning iteration that already runs.
- `RowGroupAccessPlanFilter::prune_by_statistics_with_per_conjunct_stats`
  and `PagePruningAccessPlanFilter::prune_plan_with_per_conjunct_stats`
  surface those stats for row-group and page-index pruning respectively.

The existing untagged `prune` / `prune_by_statistics` /
`prune_plan_with_page_index` paths are preserved and unchanged; the new
methods return empty stats on the untagged path. No in-tree caller uses
the tagged path yet — the adaptive parquet scan consumes it later in the
stack as a selectivity prior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
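
As a rough illustration of collecting per-conjunct statistics as a side effect of pruning, the sketch below prunes containers from min/max statistics on a single integer column. It does not use the real `PruningPredicate` API, whose signatures are not shown here; `ContainerStats`, `TaggedConjunct`, `ConjunctStats`, and `prune_with_stats` are hypothetical names for the example.

```rust
/// Min/max statistics and row count for one container (e.g. a row group).
pub struct ContainerStats {
    pub min: i64,
    pub max: i64,
    pub rows: u64,
}

/// One AND-conjunct with a caller-supplied tag. `may_match` returns false
/// only when the statistics prove that no row in the container can match.
pub struct TaggedConjunct<T> {
    pub tag: T,
    pub may_match: fn(&ContainerStats) -> bool,
}

/// Containers and rows seen vs. skipped for one conjunct.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ConjunctStats {
    pub containers_seen: u64,
    pub containers_skipped: u64,
    pub rows_seen: u64,
    pub rows_skipped: u64,
}

/// Returns the combined keep-mask plus per-conjunct statistics, both
/// computed in the same pass over the containers.
pub fn prune_with_stats<T: Clone>(
    conjuncts: &[TaggedConjunct<T>],
    containers: &[ContainerStats],
) -> (Vec<bool>, Vec<(T, ConjunctStats)>) {
    let mut stats: Vec<(T, ConjunctStats)> = conjuncts
        .iter()
        .map(|c| (c.tag.clone(), ConjunctStats::default()))
        .collect();
    let mut keep = Vec::with_capacity(containers.len());

    for container in containers {
        let mut keep_container = true;
        for (conjunct, (_, s)) in conjuncts.iter().zip(stats.iter_mut()) {
            s.containers_seen += 1;
            s.rows_seen += container.rows;
            if !(conjunct.may_match)(container) {
                s.containers_skipped += 1;
                s.rows_skipped += container.rows;
                // A container is kept only if every conjunct may match.
                keep_container = false;
            }
        }
        keep.push(keep_container);
    }
    (keep, stats)
}
```

The sketch evaluates every conjunct on every container, without short-circuiting, so each conjunct's counts are independent of the order of the conjuncts. Whether the PR makes the same choice is not stated in the commit message.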
Introduce `SelectivityTracker`, the cross-file cost model behind adaptive
filter pushdown. It accumulates per-filter selectivity and throughput
statistics and, given a confidence interval, decides whether each filter
conjunct should be evaluated at row level, deferred to post-scan, or
dropped entirely (for optional filters).

This commit adds the module, its `total_compressed_bytes` helper in
`row_filter`, a criterion benchmark, and ~45 unit tests covering the
partitioning / promote / demote / drop logic. Nothing wires it into the
parquet scan yet — that integration is the final commit in this stack.

Three `pub(crate)` items (`count_skippable_bytes`, `skip_flag`,
`is_filter_skipped`) are only exercised by that integration and carry a
temporary `#[expect(dead_code)]` until then.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the physical-expr, proto, and datasource labels May 15, 2026
@github-actions

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  43.475s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.030s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  42.628s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.028s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.223s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  88.790s] datafusion-datasource-parquet
    Building datafusion-physical-expr-common v53.1.0 (current)
       Built [  20.076s] (current)
     Parsing datafusion-physical-expr-common v53.1.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-physical-expr-common v53.1.0 (baseline)
       Built [  19.647s] (baseline)
     Parsing datafusion-physical-expr-common v53.1.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-physical-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.278s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  41.207s] datafusion-physical-expr-common
    Building datafusion-proto v53.1.0 (current)
       Built [  54.983s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.147s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  57.029s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.145s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   2.276s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/enum_variant_added.ron

Failed in:
  variant ExprType:OptionalFilter in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/prost.rs:1397
  variant ExprType:OptionalFilter in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/prost.rs:1397

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [ 116.830s] datafusion-proto
    Building datafusion-pruning v53.1.0 (current)
       Built [  37.877s] (current)
     Parsing datafusion-pruning v53.1.0 (current)
      Parsed [   0.013s] (current)
    Building datafusion-pruning v53.1.0 (baseline)
       Built [  37.354s] (baseline)
     Parsing datafusion-pruning v53.1.0 (baseline)
      Parsed [   0.013s] (baseline)
    Checking datafusion-pruning v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.097s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  76.811s] datafusion-pruning
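
To illustrate why the `enum_variant_added` lint in the log above counts as a breaking change: downstream code that matches exhaustively on a public enum stops compiling when a variant is added. The sketch below uses a hypothetical `ExprKind` enum as a stand-in; it is not the generated `ExprType` from `prost.rs`.

```rust
/// Hypothetical stand-in for a public enum without `#[non_exhaustive]`.
pub enum ExprKind {
    Column,
    Literal,
    /// Newly added variant. A downstream `match` that listed only
    /// `Column` and `Literal` would no longer compile.
    OptionalFilter,
}

/// Downstream-style code that tolerates new variants.
pub fn describe(kind: &ExprKind) -> &'static str {
    match kind {
        ExprKind::Column => "column",
        ExprKind::Literal => "literal",
        // The wildcard arm keeps this match compiling as variants are added.
        _ => "other",
    }
}
```

Marking an enum `#[non_exhaustive]` forces downstream crates to include such a wildcard arm, so later variant additions are no longer breaking.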

@github-actions github-actions Bot added the auto detected api change label May 15, 2026

Labels

  • auto detected api change: Auto detected API change
  • datasource: Changes to the datasource crate
  • physical-expr: Changes to the physical-expr crates
  • proto: Related to proto crate