Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized constant folding #1128

Closed
wants to merge 6 commits into from
Closed

Conversation

pjmore
Copy link
Contributor

@pjmore pjmore commented Oct 16, 2021

Which issue does this PR close?

Closes #1070.

What changes are included in this PR?

Added an evaluate function which evaluates literal scalar expressions using the existing physical expression implementation. Reworked ConstantFolding optimizer using the evaluate function to perform more generalized constant folding. The optimizer now traverses the expression tree depth first returning whether the expression is made up of literal expressions. When an expression that has multiple children is encountered all children expressions are checked to see if they are literals, if so then it simply returns that the expression is literal. If one of the child expressions is not literal then all literal child expressions are evaluated and then non literal is returned. For expressions that can never be evaluated in a scalar manner, such as aggregate or window expressions, the children are traversed but the expression itself will never be evaluated.

Are there any user-facing changes?

The added evaluate function is currently public to allow usage of the function from other crates in the workspace.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Oct 16, 2021
@houqp houqp added the enhancement New feature or request label Oct 17, 2021
@houqp
Copy link
Member

houqp commented Oct 17, 2021

@pjmore is this the last missing piece before you can resume the tokomak work?

@pjmore
Copy link
Contributor Author

pjmore commented Oct 18, 2021

@houqp Yep, that is everything! I'm going to start on that now.

Copy link
Contributor

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting work! Thanks a lot!

}

///Evaluates an expression if it only contains literal expressions. If a non-literal expression or non-immutable function is found then it returns an error.
pub fn evalute_const_expr(expr: &Expr) -> Result<ScalarValue> {
Copy link
Contributor

@rdettai rdettai Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to use this method in a PR soon? Otherwise don't expose this function if it is never used (YAGNI).

Same would apply to expr_is_const() that is only used here 😃.

If you do plan to use it in a follow up PR, I would say that it would be very important to test that expr_is_const() gives consistent results with rewrite_const_expr() (If I understand the meaning of the return value correctly)

}

///Evaluates an expression. Note that this will return incorrect results if the expression contains function calls which change return values call to call, such as Now().
pub fn evaluate_const_expr_unchecked(expr: &Expr) -> Result<ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is somewhat covered by the tests in constant_folding.rs, but if it is exposed publicly it should have its own set of tests.

Copy link
Contributor

@rdettai rdettai Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also it is surprising that this method uses fully qualified syntax (like std::sync::Arc) almost everywhere 😃

Copy link
Contributor

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the review in 2 parts 😅.

let dummy_df_schema = DFSchema::empty();
let dummy_input_schema = arrow::datatypes::Schema::new(vec![Field::new(
DUMMY_COL_NAME,
DataType::Float64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we maybe use the Null type here to make it clearer that it doesn't matter?

Suggested change
DataType::Float64,
DataType::Null,

expr
}

fn rewrite_const_expr(&mut self, expr: &mut Expr) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe comment on the meaning of the returned boolean value? If I understand correctly, it indicates whether the expression can be evaluated as a constant expression, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is what it is, I'll add some documentation.

false
}
(false, true) => {
self.const_fold_list_eager(list);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we run rewrite_const_expr() twice on each expression, is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well rereading the code it's expected but not intended haha.

}

///Evaluates an expression. Note that this will return incorrect results if the expression contains function calls which change return values call to call, such as Now().
pub fn evaluate_const_expr_unchecked(expr: &Expr) -> Result<ScalarValue> {
Copy link
Contributor

@rdettai rdettai Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: also it is surprising that this method uses fully qualified syntax (like std::sync::Arc) almost everywhere 😃

Copy link
Contributor

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet another question, sorry for that 😃. In case you are wondering, I am looking into to this PR thoroughly because I need something somewhat similar in #1141 to execute a partial filtering expression on the partition values: https://github.com/apache/arrow-datafusion/blob/ecb158a9fb86c52e97edb01d345d4edcb84dcefe/datafusion/src/datasource/listing.rs#L346-L428

Comment on lines 390 to +414
Expr::ScalarFunction {
fun: BuiltinScalarFunction::Now,
..
} => Expr::Literal(ScalarValue::TimestampNanosecond(Some(
self.execution_props
.query_execution_start_time
.timestamp_nanos(),
))),
Expr::ScalarFunction {
fun: BuiltinScalarFunction::ToTimestamp,
args,
} => {
if !args.is_empty() {
match &args[0] {
Expr::Literal(ScalarValue::Utf8(Some(val))) => {
match string_to_timestamp_nanos(val) {
Ok(timestamp) => Expr::Literal(
ScalarValue::TimestampNanosecond(Some(timestamp)),
),
_ => Expr::ScalarFunction {
fun: BuiltinScalarFunction::ToTimestamp,
args,
},
}
*expr = Expr::Literal(ScalarValue::TimestampNanosecond(Some(
self.execution_props
.query_execution_start_time
.timestamp_nanos(),
)));
true
}
Expr::ScalarFunction { fun, args } => {
if args.is_empty() {
false
} else {
let volatility = fun.volatility();
match volatility {
Volatility::Immutable => self.const_fold_list(args),
_ => {
self.const_fold_list_eager(args);
false
}
_ => Expr::ScalarFunction {
fun: BuiltinScalarFunction::ToTimestamp,
args,
},
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why we isolate the case of Now here instead of considering Volatility::Stable ScalarFunctions as foldable? Stable functions are supposed to be constant in the context of the query, isn't that enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have sworn that now() was implemented by rewriting the expression tree but it looks like that is wrong. You're right stable functions should be foldable.

@alamb
Copy link
Contributor

alamb commented Oct 18, 2021

I will review this later today -- thank you @pjmore !

&dummy_input_schema,
&ctx_state,
)?;
let col = {
Copy link
Contributor

@alamb alamb Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use new_null_array here from arrow:
https://docs.rs/arrow/6.0.0/arrow/array/fn.new_null_array.html

}
///This attempts to simplify expressions of the form col(Boolean) = Boolean and col(Boolean) != Boolean
/// e.g. col(Boolean) = Some(true) -> col(Boolean). It also handles == and != between two boolean literals as
/// the binary operator physical expression currently doesn't handle them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move this to the binary operator then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we take the approach on #1145, we can remove this code that handles constants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that your approach uses the expression rewriting framework that's already in place

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pjmore -- this is pretty awesome.

One thing I was thinking when looking at this PR, was that the partial evaluation / rewrites and and constant folding were somewhat intertwined.

What would you think of extracting the partial evaluator into a function that could be invoked by the constant folding pass rather than something that is part of the constant folding pass?

I sketched out what this might look like in this PR here #1145 (I am happy to finish that up if you think it is valuable)

I was thinking then we could encode rules like

false OR <expr> --> <expr>
false AND <expr> --> false

as part of the constant folder and have the constant folder invoke the evaluator.

What do you think?

}
}

///Evaluates an expression if it only contains literal expressions. If a non-literal expression or non-immutable function is found then it returns an error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the public API should be something like:

partially_evaluate_expr(expr: &Expr)

or

Expr::partially_evaluate(self) -> Self

@pjmore
Copy link
Contributor Author

pjmore commented Oct 19, 2021

Yeah I think that's is a better approach. Those special cases with Expr::Not and Expr::BinaryExpr are not pretty in my implementation. Given that the vast majority of this PR can be deleted if we go that route I'm going to close this.

@pjmore pjmore closed this Oct 19, 2021
@alamb
Copy link
Contributor

alamb commented Oct 20, 2021

Yeah I think that's is a better approach. Those special cases with Expr::Not and Expr::BinaryExpr are not pretty in my implementation. Given that the vast majority of this PR can be deleted if we go that route I'm going to close this.

Cool -- thank you @pjmore for pushing this forward. Would you like to keep working at it, or shall I polish up #1145 to get that merged?

This feature is going to really push DataFusion forward I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement General Purpose Constant Folding with the Expression Evaluator
5 participants