@adriangb adriangb commented Dec 6, 2025

This improves handling of constant expressions during pruning by trying to evaluate them in both the simplifier and the pruning machinery. This is somewhat redundant with #19129 in the simple case of our Parquet implementation, but since there may be edge cases where one path is hit and not the other, or where users rely on them independently, I thought it best to implement both approaches.

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 6, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Dec 6, 2025
.await?;
// There will be data: the filter is (null) is not null or a = 24.
// Statistics pruning doesn't handle `null is not null` so it resolves to `true or a = 24` -> `true` so no row groups are pruned
// There should be zero batches
Contributor Author

We're now able to prune this 😄
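
To illustrate why folding the constant part matters here, the following is a minimal sketch of constant folding under SQL three-valued logic. The `Pred` type and the `fold` function are hypothetical stand-ins invented for this example; they are not DataFusion's `Expr`, `PhysicalExpr`, or simplifier API.

```rust
/// Toy predicate tree (hypothetical; not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    /// Boolean literal; `None` models SQL NULL.
    Lit(Option<bool>),
    /// `column = value`, which needs data or statistics to resolve.
    ColEq(&'static str, i64),
    IsNotNull(Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Fold sub-expressions that can be decided without looking at any column.
fn fold(p: Pred) -> Pred {
    match p {
        Pred::IsNotNull(inner) => match fold(*inner) {
            // `<literal> IS NOT NULL` is always a definite true/false.
            Pred::Lit(v) => Pred::Lit(Some(v.is_some())),
            other => Pred::IsNotNull(Box::new(other)),
        },
        Pred::Or(l, r) => match (fold(*l), fold(*r)) {
            // `true OR x` is true regardless of x (even if x is NULL).
            (Pred::Lit(Some(true)), _) | (_, Pred::Lit(Some(true))) => Pred::Lit(Some(true)),
            // `false OR x` is just x.
            (Pred::Lit(Some(false)), x) | (x, Pred::Lit(Some(false))) => x,
            (l, r) => Pred::Or(Box::new(l), Box::new(r)),
        },
        // Literals and column comparisons are left untouched.
        other => other,
    }
}
```

With this sketch, the filter from the test, `(null) is not null or a = 24`, folds to `a = 24`, which is a shape that min/max statistics can work with. Without folding, treating the unknown left side as `true` makes the whole `OR` true, so nothing can be pruned.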

@github-actions github-actions bot added the datasource Changes to the datasource crate label Dec 6, 2025
adriangb commented Dec 7, 2025

@alamb I wonder if there could be any way to express that an expression is "constant" in terms of volatility? We currently have fn is_volatile(expr: &Arc<dyn PhysicalExpr>) -> bool, but that doesn't tell you whether it's a "stable" expression (like column > 123) or a "constant" expression (456 > 123).

Looking at the logical expression const evaluator it seems like it does manual matching:

fn can_evaluate(expr: &Expr) -> bool {
    // check for reasons we can't evaluate this node
    //
    // NOTE all expr types are listed here so when new ones are
    // added they can be checked for their ability to be evaluated
    // at plan time
    match expr {
        // TODO: remove the next line after `Expr::Wildcard` is removed
        #[expect(deprecated)]
        Expr::AggregateFunction { .. }
        | Expr::ScalarVariable(_, _)
        | Expr::Column(_)
        | Expr::OuterReferenceColumn(_, _)
        | Expr::Exists { .. }
        | Expr::InSubquery(_)
        | Expr::ScalarSubquery(_)
        | Expr::WindowFunction { .. }
        | Expr::GroupingSet(_)
        | Expr::Wildcard { .. }
        | Expr::Placeholder(_) => false,
        Expr::ScalarFunction(ScalarFunction { func, .. }) => {
            Self::volatility_ok(func.signature().volatility)
        }
        Expr::Literal(_, _)
        | Expr::Alias(..)
        | Expr::Unnest(_)
        | Expr::BinaryExpr { .. }
        | Expr::Not(_)
        | Expr::IsNotNull(_)
        | Expr::IsNull(_)
        | Expr::IsTrue(_)
        | Expr::IsFalse(_)
        | Expr::IsUnknown(_)
        | Expr::IsNotTrue(_)
        | Expr::IsNotFalse(_)
        | Expr::IsNotUnknown(_)
        | Expr::Negative(_)
        | Expr::Between { .. }
        | Expr::Like { .. }
        | Expr::SimilarTo { .. }
        | Expr::Case(_)
        | Expr::Cast { .. }
        | Expr::TryCast { .. }
        | Expr::InList { .. } => true,
    }
}

It looks like we have a Volatility targeted at functions:

pub enum Volatility {
    /// Always returns the same output when given the same input.
    ///
    /// DataFusion will inline immutable functions during planning.
    ///
    /// For example, the `abs` function is immutable, so `abs(-1)` will be
    /// evaluated and replaced with `1` during planning rather than invoking
    /// the function at runtime.
    Immutable,
    /// May return different values given the same input across different
    /// queries but must return the same value for a given input within a query.
    ///
    /// For example, the `now()` function is stable, because the query `select
    /// col1, now() from t1`, will return different results each time it is run,
    /// but within the same query, the output of the `now()` function has the
    /// same value for each output row.
    ///
    /// DataFusion will inline `Stable` functions when possible. For example,
    /// `Stable` functions are inlined when planning a query for execution, but
    /// not in View definitions or prepared statements.
    Stable,
    /// May change the return value from evaluation to evaluation.
    ///
    /// Multiple invocations of a volatile function may return different results
    /// when used in the same query on different rows. An example of this is the
    /// `random()` function.
    ///
    /// DataFusion can not evaluate such functions during planning or push these
    /// predicates into scans. In the query `select col1, random() from t1`,
    /// `random()` function will be evaluated for each output row, resulting in
    /// a unique random value for each row.
    Volatile,
}

But it does not have a "constant" option (nor would that really make sense for a function). Constant is not the same as immutable. Per the definition above, Immutable means:

Always returns the same output when given the same input.

Constant would be:

Always returns the same output regardless of the input.

(Generally because it has no input but I guess you could have an expression that takes an input but returns a constant / ignores the input?)

Which makes me wonder if it wouldn't make sense to add an enum along the lines of:

pub enum ExprVolatility {
    Constant,
    Immutable,
    Stable,
    Volatile,
}

And augment PhysicalExpr and Expr with methods to calculate the volatility:

impl Expr {
    pub fn node_volatility(&self) -> ExprVolatility {
        match self { ... }
    }
    pub fn volatility(&self) -> ExprVolatility {
        self.apply(...)  // recursive
    }
}
pub trait PhysicalExpr {
    fn node_volatility(&self) -> ExprVolatility {
        ExprVolatility::Volatile
    }
    fn volatility(&self) -> ExprVolatility {
        self.apply(...)  // recursive
    }
}

This seems like it would be required for PhysicalExpr to be able to participate in volatility calculations. For Expr it would be more a matter of having it all in one place that can be re-used by other optimizers, user code, etc.

The main alternative I see to this is to define "input" as "columns/data" (as opposed to e.g. now()) such that "immutable functions that do not have any columns as children are considered constant", but I'm not sure if that is correct enough.
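
One way the two ideas above could fit together is sketched below on a toy expression tree. Everything here is an assumption made for illustration: `Node` is a hypothetical stand-in for `Expr` / `PhysicalExpr`, the variant order of `ExprVolatility` is chosen so that "the worst child wins" becomes a `max`, and the rule "an immutable call over constant arguments is itself constant" is one possible reading of the alternative, not something the PR implements.

```rust
/// The proposed enum, ordered from most to least predictable so that
/// combining volatilities is a `max` (assumption of this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ExprVolatility {
    Constant,
    Immutable,
    Stable,
    Volatile,
}

/// Toy expression tree (hypothetical; not a DataFusion type).
#[allow(dead_code)]
enum Node {
    Literal(i64),
    Column(&'static str),
    /// A function or operator call, tagged with the volatility of the function itself.
    Call(ExprVolatility, Vec<Node>),
}

impl Node {
    /// Volatility of this node alone, ignoring its children.
    fn node_volatility(&self) -> ExprVolatility {
        match self {
            Node::Literal(_) => ExprVolatility::Constant,
            // Assumption: a column reference is a pure function of the input row.
            Node::Column(_) => ExprVolatility::Immutable,
            Node::Call(v, _) => *v,
        }
    }

    /// Volatility of the whole tree, computed recursively.
    fn volatility(&self) -> ExprVolatility {
        match self {
            Node::Call(v, args) => {
                // A call with no arguments has nothing that can vary with the input.
                let worst_arg = args
                    .iter()
                    .map(Node::volatility)
                    .max()
                    .unwrap_or(ExprVolatility::Constant);
                if *v == ExprVolatility::Immutable && worst_arg == ExprVolatility::Constant {
                    // Immutable function over constant inputs: the result never changes.
                    ExprVolatility::Constant
                } else {
                    (*v).max(worst_arg)
                }
            }
            leaf => leaf.node_volatility(),
        }
    }
}
```

Under these assumptions `456 > 123` comes out as `Constant`, `column > 123` as `Immutable`, a bare `now()` as `Stable`, and anything containing `random()` as `Volatile`. Whether a column should count as `Immutable` or needs its own category is exactly the open question raised above.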

@alamb alamb left a comment

Makes sense to me -- thank you @adriangb

.await?;
// There should be zero batches
assert_eq!(batches.len(), 0);
// On the other hand `b is null and a = 2` should prune only the second row group with stats only pruning
Contributor

This is pruned because now b is null gets folded to null is null AND a = 2 --> a = 2?

Contributor Author

yep!
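
Once the predicate has been reduced to `a = 2`, statistics-only pruning comes down to a range check per row group. The sketch below shows that check in isolation; `ColumnStats`, `can_prune_eq`, and `surviving_row_groups` are hypothetical names for this example, the min/max values in the usage note are made up, and DataFusion's real pruning logic is considerably more general.

```rust
/// Hypothetical min/max statistics for one integer column in one row group.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// True when `column = value` cannot match any row in the row group,
/// judging only from its min/max statistics.
fn can_prune_eq(stats: ColumnStats, value: i64) -> bool {
    value < stats.min || value > stats.max
}

/// Indices of the row groups that still have to be read for `column = value`.
fn surviving_row_groups(groups: &[ColumnStats], value: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| !can_prune_eq(**stats, value))
        .map(|(index, _)| index)
        .collect()
}
```

For example, with two row groups whose `a` ranges are `[1, 3]` and `[5, 9]`, the predicate `a = 2` keeps only the first group and prunes the second.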

@adriangb adriangb added this pull request to the merge queue Dec 7, 2025
Merged via the queue into apache:main with commit cb2f3d2 Dec 7, 2025
31 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Dec 8, 2025
## Which issue does this PR close?

- Closes #.

## Rationale for this change

`cargo fmt` is failing on main. Example
https://github.com/apache/datafusion/actions/runs/20027361509/job/57427769604

I believe it is a logical conflict between:
- #19091
- #19130


## What changes are included in this PR?

Ran `cargo fmt` and committed the result

## Are these changes tested?


## Are there any user-facing changes?
