
Fix FilterExec converting Absent column stats to Exact(NULL) #20391

Open
fwojciec wants to merge 1 commit into apache:main from fwojciec:fix/filterexec-absent-stats-corruption

@fwojciec

Which issue does this PR close?

Closes #20388.

Rationale for this change

collect_new_statistics in FilterExec wraps NULL interval bounds in Precision::Exact, converting what should be Precision::Absent column statistics into Precision::Exact(ScalarValue::Int32(None)). Downstream, estimate_disjoint_inputs treats these as real bounds and incorrectly concludes join inputs are disjoint, forcing Partitioned join mode and disabling dynamic filter pushdown for Parquet row group pruning.
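
The failure mode can be illustrated with a minimal, self-contained sketch. The types below are simplified stand-ins, not DataFusion's actual `ScalarValue` and `Precision`; `looks_disjoint` is a hypothetical reduction of the kind of comparison `estimate_disjoint_inputs` performs; and the NULLs-last ordering is an assumption taken from the review discussion rather than from the source code.

```rust
use std::cmp::Ordering;

/// Simplified stand-in for `ScalarValue::Int32`; `None` models SQL NULL.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Scalar(Option<i32>);

impl Scalar {
    /// Ordering that places NULL after every concrete value
    /// (assumed behaviour, see the lead-in above).
    fn cmp_nulls_last(&self, other: &Scalar) -> Ordering {
        match (self.0, other.0) {
            (None, None) => Ordering::Equal,
            (None, Some(_)) => Ordering::Greater,
            (Some(_), None) => Ordering::Less,
            (Some(a), Some(b)) => a.cmp(&b),
        }
    }
}

/// Simplified stand-in for `Precision<ScalarValue>`.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(Scalar),
    Inexact(Scalar),
    Absent,
}

impl Precision {
    /// Returns the wrapped value, or `None` when the statistic is absent.
    fn value(&self) -> Option<Scalar> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical disjointness check: the inputs look disjoint when the left
/// minimum sorts strictly after the right maximum. Absent statistics never
/// justify that conclusion.
fn looks_disjoint(left_min: Precision, right_max: Precision) -> bool {
    match (left_min.value(), right_max.value()) {
        (Some(l), Some(r)) => l.cmp_nulls_last(&r) == Ordering::Greater,
        _ => false,
    }
}
```

In this model, a left minimum of `Precision::Exact(Scalar(None))` compared against a right maximum of `Precision::Exact(Scalar(Some(100)))` is reported as disjoint, because the NULL sorts after 100. With `Precision::Absent` on the left the check declines to conclude anything, which is the behaviour the fix restores.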

What changes are included in this PR?

Single change to collect_new_statistics in filter.rs: check is_null() on interval bounds before wrapping in Precision, mapping NULL bounds back to Absent.

Are these changes tested?

Yes — includes a regression test (test_filter_statistics_absent_columns_stay_absent) that fails on current main and passes with the fix.

Are there any user-facing changes?

No API changes. Corrects statistics propagation for tables/views with absent column statistics.

Copilot AI review requested due to automatic review settings February 16, 2026 23:34
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 16, 2026

Copilot AI left a comment

Pull request overview

Fixes incorrect statistics propagation in FilterExec where NULL interval endpoints (used to represent unbounded/unknown) were being wrapped as Precision::Exact(NULL), corrupting downstream join selectivity/disjointness estimation (issue #20388).

Changes:

  • Update collect_new_statistics to treat NULL interval bounds as Precision::Absent rather than Precision::{Exact,Inexact}(NULL).
  • Add a regression test ensuring absent column min/max statistics remain absent after FilterExec.

@neilconway neilconway left a comment

Welcome @fwojciec! Thank you for the detailed bug report and proposed fix. Both the analysis and the proposed fix seem reasonable to me.

I wonder whether we should try to proactively prevent Precision::Exact(ScalarValue::Null) (same for Inexact) from being used for min/max/sum values in ColumnStatistics -- it doesn't seem to make much sense.
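
One possible shape for such a guard is a normalizing step that collapses any NULL payload to `Absent`. The sketch below is purely illustrative: `Precision` here is a simplified stand-in that uses `Option<i32>` in place of `ScalarValue`, and the `normalized` method is a hypothetical name, not an existing DataFusion API.

```rust
/// Simplified stand-in for `Precision<ScalarValue>`; `None` models a NULL scalar.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(Option<i32>),
    Inexact(Option<i32>),
    Absent,
}

impl Precision {
    /// Hypothetical normalizing step: a NULL payload carries no usable
    /// information, so it is collapsed to `Absent`. Other values pass
    /// through unchanged.
    fn normalized(self) -> Precision {
        match self {
            Precision::Exact(None) | Precision::Inexact(None) => Precision::Absent,
            other => other,
        }
    }
}
```

Applying a step like this wherever min/max/sum statistics are constructed would make `Exact(NULL)` unrepresentable downstream, at the cost of touching every construction site.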

Comment on lines 2068 to 2072
/// Regression test: columns with Absent min/max statistics must remain
/// Absent after FilterExec, not be converted to Exact(NULL). The latter
/// causes `estimate_disjoint_inputs` to incorrectly conclude join inputs
/// are disjoint (ScalarValue's PartialOrd sorts NULLs last), producing
/// zero cardinality and forcing Partitioned join mode.

Some parts of this comment overfit to the exact scenario you ran into and other implementation details. What about something simpler like:

Suggested change
/// Columns with Absent min/max statistics should remain Absent after
/// FilterExec.

Author

Done (apologies, I should have just used "Commit suggestion"); the result is the same, though.

Comment on lines 774 to 788
let bounds_equal =
    !lower.is_null() && !upper.is_null() && lower.eq(&upper);
let min_value = if lower.is_null() {
    Precision::Absent
} else if bounds_equal {
    Precision::Exact(lower)
} else {
    Precision::Inexact(lower)
};
let max_value = if upper.is_null() {
    Precision::Absent
} else if bounds_equal {
    Precision::Exact(upper)
} else {
    Precision::Inexact(upper)

I found the revised logic a little hard to read. What about something like:

Suggested change
let is_exact =
    !lower.is_null() && !upper.is_null() && lower == upper;
let min_value = interval_bound_to_precision(lower, is_exact);
let max_value = interval_bound_to_precision(upper, is_exact);

Where we define a helper func:

/// Converts an interval bound to a [`Precision`] value. NULL bounds (which
/// represent "unbounded" in the [`Interval`] type) map to [`Precision::Absent`].
fn interval_bound_to_precision(
    bound: ScalarValue,
    is_exact: bool,
) -> Precision<ScalarValue> {
    if bound.is_null() {
        Precision::Absent
    } else if is_exact {
        Precision::Exact(bound)
    } else {
        Precision::Inexact(bound)
    }
}
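
To see how the helper behaves across the relevant cases, here is a standalone sketch that can be compiled without DataFusion. `Bound` is a simplified stand-in for `ScalarValue`, `Precision` is a non-generic stand-in for `Precision<ScalarValue>`, and `bounds_to_min_max` is a hypothetical wrapper around the call site suggested above.

```rust
/// Simplified stand-in for `ScalarValue`; `None` models a NULL (unbounded) bound.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bound(Option<i32>);

impl Bound {
    fn is_null(&self) -> bool {
        self.0.is_none()
    }
}

/// Simplified stand-in for `Precision<ScalarValue>`.
#[derive(Debug, PartialEq)]
enum Precision {
    Exact(Bound),
    Inexact(Bound),
    Absent,
}

/// Same logic as the helper proposed in the review, on the stand-in types.
fn interval_bound_to_precision(bound: Bound, is_exact: bool) -> Precision {
    if bound.is_null() {
        Precision::Absent
    } else if is_exact {
        Precision::Exact(bound)
    } else {
        Precision::Inexact(bound)
    }
}

/// Hypothetical wrapper mirroring the suggested call site: the pair is exact
/// only when both bounds are non-NULL and equal.
fn bounds_to_min_max(lower: Bound, upper: Bound) -> (Precision, Precision) {
    let is_exact = !lower.is_null() && !upper.is_null() && lower == upper;
    (
        interval_bound_to_precision(lower, is_exact),
        interval_bound_to_precision(upper, is_exact),
    )
}
```

With this model, two NULL bounds yield `(Absent, Absent)`, equal non-NULL bounds yield a pair of `Exact` values, and a half-open interval such as `Bound(Some(1))` to `Bound(None)` yields `(Inexact, Absent)`, so each bound is handled independently.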

Author

Done.

`collect_new_statistics` wraps NULL interval bounds in
`Precision::Exact`, converting what should be `Precision::Absent`
into `Precision::Exact(ScalarValue::Int32(None))`. This corrupts
downstream join cardinality estimation — `estimate_disjoint_inputs`
treats the NULL as a concrete value and incorrectly concludes join
inputs are disjoint, forcing Partitioned mode and disabling dynamic
filter pushdown for Parquet row group pruning.

The fix checks `is_null()` on interval bounds before wrapping in
`Precision`, mapping NULL bounds back to `Absent`.

Closes apache#20388
@fwojciec fwojciec force-pushed the fix/filterexec-absent-stats-corruption branch from 5c245a8 to 6761dcb Compare February 17, 2026 02:29

fwojciec commented Feb 17, 2026

> Welcome @fwojciec! Thank you for the detailed bug report and proposed fix. Both the analysis and the proposed fix seem reasonable to me.

Appreciate the super-fast review @neilconway!

> I wonder whether we should try to proactively prevent Precision::Exact(ScalarValue::Null) (same for Inexact) from being used for min/max/sum values in ColumnStatistics -- it doesn't seem to make much sense.

Makes sense to me, would prevent this entire class of bugs. A separate PR (looks like that would be a slightly bigger change, though mostly mechanical)?

@neilconway

> Makes sense to me, would prevent this entire class of bugs. A separate PR (looks like that would be a slightly bigger change, though mostly mechanical)?

Yep, deferring it to a separate PR would be better I'd say. I'm not a committer though, so maybe wait for someone else to weigh in before you spend time on it :)

Labels

physical-plan Changes to the physical-plan crate

Development

Successfully merging this pull request may close these issues.

FilterExec::collect_new_statistics converts Absent column stats to Exact(NULL), corrupting join cardinality estimation
