Fix FilterExec converting Absent column stats to Exact(NULL)#20391
Fix FilterExec converting Absent column stats to Exact(NULL)#20391fwojciec wants to merge 1 commit intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes incorrect statistics propagation in FilterExec where NULL interval endpoints (used to represent unbounded/unknown) were being wrapped as Precision::Exact(NULL), corrupting downstream join selectivity/disjointness estimation (issue #20388).
Changes:
- Update
collect_new_statisticsto treat NULL interval bounds asPrecision::Absentrather thanPrecision::{Exact,Inexact}(NULL). - Add a regression test ensuring absent column min/max statistics remain absent after
FilterExec.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Welcome @fwojciec! Thank you for the detailed bug report and proposed fix. Both the analysis and the proposed fix seem reasonable to me.
I wonder whether we should try to proactively prevent Precision::Exact(ScalarValue::Null) (same for Inexact) from being used for min/max/sum values in ColumnStatistics -- it doesn't seem to make much sense.
| /// Regression test: columns with Absent min/max statistics must remain | ||
| /// Absent after FilterExec, not be converted to Exact(NULL). The latter | ||
| /// causes `estimate_disjoint_inputs` to incorrectly conclude join inputs | ||
| /// are disjoint (ScalarValue's PartialOrd sorts NULLs last), producing | ||
| /// zero cardinality and forcing Partitioned join mode. |
There was a problem hiding this comment.
Some parts of this comment overfit to the exact scenario you ran into and other implementation details. What about something simpler like:
| /// Regression test: columns with Absent min/max statistics must remain | |
| /// Absent after FilterExec, not be converted to Exact(NULL). The latter | |
| /// causes `estimate_disjoint_inputs` to incorrectly conclude join inputs | |
| /// are disjoint (ScalarValue's PartialOrd sorts NULLs last), producing | |
| /// zero cardinality and forcing Partitioned join mode. | |
| /// Columns with Absent min/max statistics should remain Absent after | |
| /// FilterExec. |
There was a problem hiding this comment.
Done (apologies, should have just "Committed suggestion") - the result is the same though.
| let bounds_equal = | ||
| !lower.is_null() && !upper.is_null() && lower.eq(&upper); | ||
| let min_value = if lower.is_null() { | ||
| Precision::Absent | ||
| } else if bounds_equal { | ||
| Precision::Exact(lower) | ||
| } else { | ||
| (Precision::Inexact(lower), Precision::Inexact(upper)) | ||
| Precision::Inexact(lower) | ||
| }; | ||
| let max_value = if upper.is_null() { | ||
| Precision::Absent | ||
| } else if bounds_equal { | ||
| Precision::Exact(upper) | ||
| } else { | ||
| Precision::Inexact(upper) |
There was a problem hiding this comment.
I found the revised logic a little hard to read. What about something like:
| let bounds_equal = | |
| !lower.is_null() && !upper.is_null() && lower.eq(&upper); | |
| let min_value = if lower.is_null() { | |
| Precision::Absent | |
| } else if bounds_equal { | |
| Precision::Exact(lower) | |
| } else { | |
| (Precision::Inexact(lower), Precision::Inexact(upper)) | |
| Precision::Inexact(lower) | |
| }; | |
| let max_value = if upper.is_null() { | |
| Precision::Absent | |
| } else if bounds_equal { | |
| Precision::Exact(upper) | |
| } else { | |
| Precision::Inexact(upper) | |
| let is_exact = | |
| !lower.is_null() && !upper.is_null() && lower == upper; | |
| let min_value = interval_bound_to_precision(lower, is_exact); | |
| let max_value = interval_bound_to_precision(upper, is_exact); |
Where we define a helper func:
/// Converts an interval bound to a [`Precision`] value. NULL bounds (which
/// represent "unbounded" in the [`Interval`] type) map to [`Precision::Absent`].
fn interval_bound_to_precision(
bound: ScalarValue,
is_exact: bool,
) -> Precision<ScalarValue> {
if bound.is_null() {
Precision::Absent
} else if is_exact {
Precision::Exact(bound)
} else {
Precision::Inexact(bound)
}
}`collect_new_statistics` wraps NULL interval bounds in `Precision::Exact`, converting what should be `Precision::Absent` into `Precision::Exact(ScalarValue::Int32(None))`. This corrupts downstream join cardinality estimation — `estimate_disjoint_inputs` treats the NULL as a concrete value and incorrectly concludes join inputs are disjoint, forcing Partitioned mode and disabling dynamic filter pushdown for Parquet row group pruning. The fix checks `is_null()` on interval bounds before wrapping in `Precision`, mapping NULL bounds back to `Absent`. Closes apache#20388
5c245a8 to
6761dcb
Compare
Appreciate the super-fast review @neilconway!
Makes sense to me, would prevent this entire class of bugs. A separate PR (looks like that would be a slightly bigger change, though mostly mechanical)? |
Yep, deferring it to a separate PR would be better I'd say. I'm not a committer though, so maybe wait for someone else to weigh in before you spend time on it :) |
Which issue does this PR close?
FilterExec::collect_new_statisticsconverts Absent column stats to Exact(NULL), corrupting join cardinality estimation #20388.Rationale for this change
collect_new_statisticsinFilterExecwraps NULL interval bounds inPrecision::Exact, converting what should bePrecision::Absentcolumn statistics intoPrecision::Exact(ScalarValue::Int32(None)). Downstream,estimate_disjoint_inputstreats these as real bounds and incorrectly concludes join inputs are disjoint, forcing Partitioned join mode and disabling dynamic filter pushdown for Parquet row group pruning.What changes are included in this PR?
Single change to
collect_new_statisticsinfilter.rs: checkis_null()on interval bounds before wrapping inPrecision, mapping NULL bounds back toAbsent.Are these changes tested?
Yes — includes a regression test (
test_filter_statistics_absent_columns_stay_absent) that fails on current main and passes with the fix.Are there any user-facing changes?
No API changes. Corrects statistics propagation for tables/views with absent column statistics.