Skip to content

Introduce the concept of expression volatility #19187

@adriangb

Description

@adriangb

Adapted from #19130 (comment):

It seems to me that we could improve the cohesiveness of the volatility API between logical and physical expressions and make it easier to write optimizer rules, etc. that take this into account if we formally introduced the concept of "expression volatility". It's somewhat implicit as of now, and how it relates to function volatility is not clear. My goal would be to be able to formally express that an expression is "constant" in terms of volatility.

For physical expressions we currently have fn is_volatile(expr: &Arc<dyn PhysicalExpression>) -> bool, but that doesn't tell you if it's a "stable" expression (like column > 123) or a "constant" expression (456 > 123).

Looking at the logical expression const evaluator it seems like it does manual matching:

fn can_evaluate(expr: &Expr) -> bool {
// check for reasons we can't evaluate this node
//
// NOTE all expr types are listed here so when new ones are
// added they can be checked for their ability to be evaluated
// at plan time
match expr {
// TODO: remove the next line after `Expr::Wildcard` is removed
#[expect(deprecated)]
Expr::AggregateFunction { .. }
| Expr::ScalarVariable(_, _)
| Expr::Column(_)
| Expr::OuterReferenceColumn(_, _)
| Expr::Exists { .. }
| Expr::InSubquery(_)
| Expr::ScalarSubquery(_)
| Expr::WindowFunction { .. }
| Expr::GroupingSet(_)
| Expr::Wildcard { .. }
| Expr::Placeholder(_) => false,
Expr::ScalarFunction(ScalarFunction { func, .. }) => {
Self::volatility_ok(func.signature().volatility)
}
Expr::Literal(_, _)
| Expr::Alias(..)
| Expr::Unnest(_)
| Expr::BinaryExpr { .. }
| Expr::Not(_)
| Expr::IsNotNull(_)
| Expr::IsNull(_)
| Expr::IsTrue(_)
| Expr::IsFalse(_)
| Expr::IsUnknown(_)
| Expr::IsNotTrue(_)
| Expr::IsNotFalse(_)
| Expr::IsNotUnknown(_)
| Expr::Negative(_)
| Expr::Between { .. }
| Expr::Like { .. }
| Expr::SimilarTo { .. }
| Expr::Case(_)
| Expr::Cast { .. }
| Expr::TryCast { .. }
| Expr::InList { .. } => true,
}
}
.

It looks like we have a Volatility targeted at functions:

pub enum Volatility {
/// Always returns the same output when given the same input.
///
/// DataFusion will inline immutable functions during planning.
///
/// For example, the `abs` function is immutable, so `abs(-1)` will be
/// evaluated and replaced with `1` during planning rather than invoking
/// the function at runtime.
Immutable,
/// May return different values given the same input across different
/// queries but must return the same value for a given input within a query.
///
/// For example, the `now()` function is stable, because the query `select
/// col1, now() from t1`, will return different results each time it is run,
/// but within the same query, the output of the `now()` function has the
/// same value for each output row.
///
/// DataFusion will inline `Stable` functions when possible. For example,
/// `Stable` functions are inlined when planning a query for execution, but
/// not in View definitions or prepared statements.
Stable,
/// May change the return value from evaluation to evaluation.
///
/// Multiple invocations of a volatile function may return different results
/// when used in the same query on different rows. An example of this is the
/// `random()` function.
///
/// DataFusion can not evaluate such functions during planning or push these
/// predicates into scans. In the query `select col1, random() from t1`,
/// `random()` function will be evaluated for each output row, resulting in
/// a unique random value for each row.
Volatile,
}

But it does not have a "constant" option (nor would that really make sense for a function). Constant is not the same as immutable, as per the definition above Immutable:

Always returns the same output when given the same input.

Constant would be:

Always returns the same output regardless of the input.

(Generally because it has no input but I guess you could have an expression that takes an input but returns a constant / ignores the input?)

Which makes me wonder if it wouldn't make sense to an enum along the lines of:

pub enum ExprVolatility {
    Constant,
    Immutable,
    Stable,
    Volatile,
}

And augment PhysicalExpression and Expr with methods to calculate the volatility:

impl Expr {
    pub fn node_volatility(&self) -> ExprVolatility {
        match self { ... }
    }
    pub fn volatility(&self) -> ExprVolatility {
        self.apply(...)  // recursive
    }
}
pub trait PhysicalExpr {
    fn node_volatility(&self) -> ExprVolatility {
        ExprVolatility::Volatilve
    }
    fn volatility(&self) -> ExprVolatility {
        self.apply(...)  // recursive
    }
}

This seems like it would be required for PhysicalExpr to be able to participate in volatility calculations. For Expr it would be more of a matter of having it all in once place that can be re-used by other optimizers, user code, etc.

The main alternative I see to this is to define "input" as "columns/data" (as opposed to e.g. now()) such that "immutable functions that do not have any columns as children are considered constant", but I'm not sure if that is correct enough. Or we can leave it implicit as it is now w/ logic for constant sub expression detection scattered.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions