Per-conjunct pruning statistics for PruningPredicate by adriangb · Pull Request #22235 · apache/datafusion

adriangb · 2026-05-15T21:54:04Z

Which issue does this PR close?

Part of #22144 ([Experiment] Adaptive filter pushdown), split into a reviewable stack. This is PR 2 of 4.

Rationale for this change

Adaptive filter scheduling needs to know how effective each individual predicate conjunct was, so it can decide where to place that conjunct. Today pruning only reports an aggregate result. This PR surfaces per-conjunct effectiveness as a free side effect of the pruning pass that already runs — no extra passes.

What changes are included in this PR?

PruningPredicate::try_new_tagged_conjuncts — build a predicate from AND-conjuncts, each carrying a caller-supplied tag.
PruningPredicate::prune_per_conjunct — returns the usual prune mask plus per-conjunct PerConjunctPruneStats.
RowGroupAccessPlanFilter::prune_by_statistics_with_per_conjunct_stats and PagePruningAccessPlanFilter::prune_plan_with_per_conjunct_stats — surface those stats for row-group and page-index pruning.

Existing untagged prune / prune_by_statistics / prune_plan_with_page_index paths are preserved and unchanged. No in-tree caller uses the tagged path yet.
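
As a rough illustration of the idea (not the code in this PR), the sketch below shows how per-conjunct counters can fall out of the same loop that computes the prune mask. Every name here (`ContainerStats`, `TaggedConjunct`, `ConjunctStats`, `prune_with_stats`) is hypothetical and stands in for the real `PruningPredicate` machinery; it also assumes each conjunct is evaluated against every container, which the real implementation may not do.

```rust
/// Min/max statistics for one container (e.g. a row group). Hypothetical type.
#[derive(Debug, Clone, Copy)]
pub struct ContainerStats {
    pub min: i64,
    pub max: i64,
    pub rows: usize,
}

/// One AND-conjunct with a caller-supplied tag. `may_match` returns false
/// only when the statistics prove that no row in the container can match.
pub struct TaggedConjunct<T> {
    pub tag: T,
    pub may_match: fn(&ContainerStats) -> bool,
}

/// Counters for a single conjunct (hypothetical shape).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ConjunctStats {
    pub containers_seen: usize,
    pub containers_skipped: usize,
    pub rows_seen: usize,
    pub rows_skipped: usize,
}

/// Returns the keep-mask (true = container must be read) together with
/// per-conjunct counters collected in the same pass over the containers.
pub fn prune_with_stats<T: Clone>(
    conjuncts: &[TaggedConjunct<T>],
    containers: &[ContainerStats],
) -> (Vec<bool>, Vec<(T, ConjunctStats)>) {
    let mut stats: Vec<(T, ConjunctStats)> = conjuncts
        .iter()
        .map(|c| (c.tag.clone(), ConjunctStats::default()))
        .collect();
    let mut mask = Vec::with_capacity(containers.len());

    for container in containers {
        let mut keep = true;
        for (conjunct, (_, s)) in conjuncts.iter().zip(stats.iter_mut()) {
            s.containers_seen += 1;
            s.rows_seen += container.rows;
            if !(conjunct.may_match)(container) {
                // This conjunct alone would have pruned the container.
                s.containers_skipped += 1;
                s.rows_skipped += container.rows;
                keep = false;
            }
        }
        mask.push(keep);
    }
    (mask, stats)
}
```

A caller could tag conjuncts with, say, a filter id and read back `rows_skipped / rows_seen` per tag as a rough selectivity signal, while the mask itself is the same as what an untagged AND of the conjuncts would produce.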

Are these changes tested?

Yes — unit tests for the tagged-conjunct constructor and per-conjunct stat accounting.

Are there any user-facing changes?

New public API on PruningPredicate. Purely additive; no behavior change.

Stacked PR — diff is cumulative against main. Review the top commit "feat: per-conjunct pruning statistics for PruningPredicate"; the commit below it is PR #22234.

Stack (review/merge in order):

#22234: Add OptionalFilterPhysicalExpr wrapper + proto support
this PR — Per-conjunct pruning statistics
SelectivityTracker cost model
Adaptive parquet scan integration

Introduce `OptionalFilterPhysicalExpr`, a transparent `PhysicalExpr` wrapper that marks a filter as *optional*, i.e. droppable without affecting query correctness. It delegates every `PhysicalExpr` method to the inner expression, so it is behavior-neutral until a consumer explicitly checks for the marker.

This is the foundation for adaptive filter scheduling: a scan can detect the wrapper and drop a performance-hint filter (e.g. a hash-join dynamic filter) when it is not cost-effective, knowing correctness is enforced elsewhere.

Also adds proto serialization (`PhysicalOptionalFilterNode`) so physical plans containing the wrapper round-trip faithfully. No caller wraps anything yet; that arrives with the adaptive parquet scan later in the stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
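
The wrapper pattern described in this commit message can be sketched roughly as follows. This is a simplified stand-in, not the actual implementation: the `Expr` trait, `GreaterThan`, `OptionalFilter`, and `is_optional` are all hypothetical names, and the real `PhysicalExpr` trait has a much larger surface than the two methods delegated here.

```rust
use std::any::Any;
use std::sync::Arc;

/// Stand-in for a physical expression trait; NOT the real DataFusion `PhysicalExpr`.
pub trait Expr: Any {
    fn as_any(&self) -> &dyn Any;
    fn evaluate(&self, value: i64) -> bool;
    fn describe(&self) -> String;
}

/// A trivial concrete expression used only for the example.
pub struct GreaterThan(pub i64);

impl Expr for GreaterThan {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn evaluate(&self, value: i64) -> bool {
        value > self.0
    }
    fn describe(&self) -> String {
        format!("x > {}", self.0)
    }
}

/// Transparent marker wrapper: every method forwards to `inner`, so wrapping
/// changes nothing unless a consumer explicitly looks for the wrapper type.
pub struct OptionalFilter {
    inner: Arc<dyn Expr>,
}

impl OptionalFilter {
    pub fn new(inner: Arc<dyn Expr>) -> Self {
        Self { inner }
    }
    pub fn inner(&self) -> &Arc<dyn Expr> {
        &self.inner
    }
}

impl Expr for OptionalFilter {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn evaluate(&self, value: i64) -> bool {
        self.inner.evaluate(value)
    }
    fn describe(&self) -> String {
        self.inner.describe()
    }
}

/// What a consumer (e.g. a scan) might do to detect the marker.
pub fn is_optional(expr: &Arc<dyn Expr>) -> bool {
    expr.as_any().is::<OptionalFilter>()
}
```

The design choice worth noting is that the marker lives in the type rather than in a flag on the expression, so existing code that only calls trait methods sees identical behavior, and only code that downcasts can tell the difference.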

Add an opt-in way to learn, per individual conjunct, how effective each predicate was during pruning, without running any extra pruning passes.

- `PruningPredicate::try_new_tagged_conjuncts` builds a predicate from AND-conjuncts, each carrying a caller-supplied tag.
- `PruningPredicate::prune_per_conjunct` returns the usual prune mask plus per-conjunct `PerConjunctPruneStats` (rows/containers seen vs. skipped) as a side effect of the pruning iteration that already runs.
- `RowGroupAccessPlanFilter::prune_by_statistics_with_per_conjunct_stats` and `PagePruningAccessPlanFilter::prune_plan_with_per_conjunct_stats` surface those stats for row-group and page-index pruning respectively.

The existing untagged `prune` / `prune_by_statistics` / `prune_plan_with_page_index` paths are preserved and unchanged; the new methods return empty stats on the untagged path. No in-tree caller uses the tagged path yet; the adaptive parquet scan consumes it later in the stack as a selectivity prior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-15T22:00:17Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  41.981s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  41.559s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.028s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.189s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  85.520s] datafusion-datasource-parquet
    Building datafusion-physical-expr-common v53.1.0 (current)
       Built [  19.150s] (current)
     Parsing datafusion-physical-expr-common v53.1.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-physical-expr-common v53.1.0 (baseline)
       Built [  18.963s] (baseline)
     Parsing datafusion-physical-expr-common v53.1.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-physical-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.266s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  39.577s] datafusion-physical-expr-common
    Building datafusion-proto v53.1.0 (current)
       Built [  52.530s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.142s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  53.624s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.147s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   2.254s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/enum_variant_added.ron

Failed in:
  variant ExprType:OptionalFilter in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/prost.rs:1397
  variant ExprType:OptionalFilter in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/prost.rs:1397

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [ 111.070s] datafusion-proto
    Building datafusion-pruning v53.1.0 (current)
       Built [  36.108s] (current)
     Parsing datafusion-pruning v53.1.0 (current)
      Parsed [   0.013s] (current)
    Building datafusion-pruning v53.1.0 (baseline)
       Built [  36.494s] (baseline)
     Parsing datafusion-pruning v53.1.0 (baseline)
      Parsed [   0.013s] (baseline)
    Checking datafusion-pruning v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.102s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  74.652s] datafusion-pruning
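
For readers unfamiliar with the `enum_variant_added` lint reported above, here is a minimal sketch of why a new variant on an exhaustive public enum is treated as a breaking change. `ExprKind` and both functions are hypothetical and are not the generated `ExprType` from `prost.rs`.

```rust
/// Hypothetical stand-in for a public enum without `#[non_exhaustive]`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExprKind {
    Column,
    Literal,
    // Newly added variant. Downstream matches written before it existed,
    // with arms only for `Column` and `Literal`, stop compiling.
    OptionalFilter,
}

/// An exhaustive match: it has to be edited every time a variant is added.
pub fn name_exhaustive(kind: ExprKind) -> &'static str {
    match kind {
        ExprKind::Column => "column",
        ExprKind::Literal => "literal",
        ExprKind::OptionalFilter => "optional_filter",
    }
}

/// A match with a wildcard arm keeps compiling when variants are added.
/// Marking an enum `#[non_exhaustive]` forces downstream crates to write
/// matches in this style.
pub fn name_with_fallback(kind: ExprKind) -> &'static str {
    match kind {
        ExprKind::Column => "column",
        ExprKind::Literal => "literal",
        _ => "unknown",
    }
}
```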

asolimando · 2026-05-16T04:51:47Z

This would also be extremely useful for bootstrapping cold stats in a catalog via execution and runtime stats. I hope this feature lands!

adriangb and others added 2 commits May 15, 2026 14:12

github-actions Bot added the physical-expr (Changes to the physical-expr crates), proto (Related to proto crate), and datasource (Changes to the datasource crate) labels May 15, 2026

This was referenced May 15, 2026

Add SelectivityTracker adaptive filter cost model #22236 (Draft)
Adaptive filter pushdown for the parquet scan #22237 (Draft)
[Experiment] Adaptive filter pushdown #22144 (Draft)

github-actions Bot added the auto detected api change Auto detected API change label May 15, 2026

adriangb wants to merge 2 commits into apache:main from adriangb:pr2-per-conjunct-pruning-stats

2 participants
