optimizer: add framework for the rule of pre-add cast to the literal in comparison binary #3185

Merged

Conversation

Contributor

@liukun4515 liukun4515 commented Aug 17, 2022

Which issue does this PR close?

part of #3031

Rationale for this change

  • Add the framework for this optimization.
  • Support the case where the data type is a signed integer (INT8, INT16, INT32, INT64).

TODO:

  • Support other numeric types, for example decimal.
  • Support IN lists.
  • Simplify binary comparisons such as INT32(C1) > INT64(INT32.MAX).
  • Port other features of the Spark rule that can be used in DataFusion: https://github.com/sunchao/spark/blob/1f496fbea688c7082bad7e6280c8a949fbfd31b7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala#L30

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added core Core datafusion crate optimizer Optimizer rules labels Aug 17, 2022
@liukun4515 liukun4515 requested a review from alamb August 17, 2022 05:40
@liukun4515 liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch 2 times, most recently from d5ee16b to 5fe6e26 Compare August 18, 2022 07:45
@liukun4515 liukun4515 marked this pull request as ready for review August 18, 2022 07:53
@liukun4515
Contributor Author

@alamb @andygrove PTAL

@@ -271,8 +271,8 @@ async fn csv_explain_plans() {
 let expected = vec![
 "Explain [plan_type:Utf8, plan:Utf8]",
 " Projection: #aggregate_test_100.c1 [c1:Utf8]",
-" Filter: #aggregate_test_100.c2 > Int64(10) [c1:Utf8, c2:Int32]",
-" TableScan: aggregate_test_100 projection=[c1, c2], partial_filters=[#aggregate_test_100.c2 > Int64(10)] [c1:Utf8, c2:Int32]",
+" Filter: #aggregate_test_100.c2 > Int32(10) [c1:Utf8, c2:Int32]",
Contributor Author

After optimization, the literal Int64(10) is cast to Int32(10), because the left-hand side (column c2) has type Int32.
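
As a rough illustration of the rewrite described in this comment, the sketch below casts the literal side of a comparison to the column's type when the value fits. The `Lit` and `ColType` enums and the `pre_cast_literal` function are hypothetical stand-ins for illustration only; they are not DataFusion's `ScalarValue`, `DataType`, or the rule added in this PR.

```rust
use std::convert::TryFrom;

// Toy stand-ins for a typed literal and a column type (assumed names,
// not the real DataFusion API).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Lit {
    Int32(i32),
    Int64(i64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int32,
    Int64,
}

/// Rewrites the literal in `column <op> literal` to the column's type when
/// the value fits; otherwise returns the literal unchanged.
fn pre_cast_literal(column_type: ColType, literal: Lit) -> Lit {
    match (column_type, literal) {
        // Narrowing: only safe when the value is representable as i32.
        (ColType::Int32, Lit::Int64(v)) => match i32::try_from(v) {
            Ok(narrow) => Lit::Int32(narrow),
            Err(_) => literal,
        },
        // Widening always succeeds.
        (ColType::Int64, Lit::Int32(v)) => Lit::Int64(i64::from(v)),
        // Types already agree: nothing to do.
        _ => literal,
    }
}
```

With these assumptions, `c2 > Int64(10)` on an Int32 column becomes `c2 > Int32(10)`, so the comparison no longer needs a cast on the column side, while an out-of-range literal is left alone.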

}
// TODO: optimize in list
// Expr::InList { .. } => {}
// TODO: handle other expr type and dfs visit them
Contributor Author

Other Expr types still need to be supported.

Member

Is the plan to add more types in this PR or as a follow-on? If the latter, could you file an issue and reference it here?

Contributor Author

The sub-tasks for this optimizer rule are listed in the tracking issue: #3031 (comment)

@codecov-commenter

codecov-commenter commented Aug 18, 2022

Codecov Report

Merging #3185 (8455e3c) into master (89bcfc4) will decrease coverage by 0.03%.
The diff coverage is 73.88%.

@@            Coverage Diff             @@
##           master    #3185      +/-   ##
==========================================
- Coverage   85.85%   85.81%   -0.04%     
==========================================
  Files         291      292       +1     
  Lines       52786    53111     +325     
==========================================
+ Hits        45320    45579     +259     
- Misses       7466     7532      +66     
| Impacted Files | Coverage Δ |
|---|---|
| datafusion/core/tests/sql/subqueries.rs | 94.32% <ø> (ø) |
| datafusion/core/tests/provider_filter_pushdown.rs | 70.45% <6.25%> (-14.48%) ⬇️ |
| datafusion/core/tests/sql/explain_analyze.rs | 83.39% <66.66%> (ø) |
| ...fusion/optimizer/src/pre_cast_lit_in_comparison.rs | 83.33% <83.33%> (ø) |
| datafusion/core/src/execution/context.rs | 78.21% <100.00%> (+0.15%) ⬆️ |
| ...sical-expr/src/aggregate/approx_percentile_cont.rs | 81.54% <0.00%> (-3.42%) ⬇️ |
| datafusion/expr/src/window_frame.rs | 92.43% <0.00%> (-0.85%) ⬇️ |
| .../physical-expr/src/aggregate/array_agg_distinct.rs | 79.41% <0.00%> (-0.78%) ⬇️ |
| datafusion/expr/src/operator.rs | 95.23% <0.00%> (-0.77%) ⬇️ |
| datafusion/sql/src/planner.rs | 80.44% <0.00%> (-0.76%) ⬇️ |

... and 25 more


@liukun4515 liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 5fe6e26 to 6f34a4a Compare August 19, 2022 02:25
@liukun4515 liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 6f34a4a to d3a38d1 Compare August 19, 2022 02:30
@@ -146,7 +147,20 @@ impl TableProvider for CustomProvider {
match &filters[0] {
Expr::BinaryExpr { right, .. } => {
let int_value = match &**right {
Expr::Literal(ScalarValue::Int8(i)) => i.unwrap() as i64,
Contributor

I think you might want to avoid doing this for NULL (aka None) values.

Something like:

Suggested change
- Expr::Literal(ScalarValue::Int8(i)) => i.unwrap() as i64,
+ Expr::Literal(ScalarValue::Int8(Some(i))) => i as i64,
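
A minimal sketch of the pattern suggested here, assuming a hypothetical `Scalar` enum that wraps each integer in an `Option` the way the snippets in this thread do. Matching on `Some(..)` lets NULL literals fall through to an explicit arm instead of panicking inside `unwrap()`. Note that when matching through a reference the bound value is itself a reference, so this sketch dereferences it before widening.

```rust
// Hypothetical stand-in for a nullable scalar value (not the real
// DataFusion ScalarValue type).
#[derive(Debug)]
enum Scalar {
    Int8(Option<i8>),
    Int64(Option<i64>),
}

/// Returns the value widened to i64, or None for a NULL literal.
fn as_i64(value: &Scalar) -> Option<i64> {
    match value {
        // `i` is a `&i8` here because we match through `&Scalar`.
        Scalar::Int8(Some(i)) => Some(i64::from(*i)),
        Scalar::Int64(Some(i)) => Some(*i),
        // NULL of either type: report "no value" rather than panic.
        Scalar::Int8(None) | Scalar::Int64(None) => None,
    }
}
```

The caller can then decide what a NULL filter value means, for example skipping the pushdown, without the provider crashing.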

Contributor Author

Thanks for your comments. I was just following the original implementation,
but I will apply your suggestion.

@liukun4515 liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 17d871b to eae5133 Compare August 21, 2022 06:56
@liukun4515
Contributor Author

@alamb @andygrove are there any comments on this PR?
It has been open for a while without updates.

@liukun4515 liukun4515 requested a review from alamb August 22, 2022 10:23
@andygrove
Member

@liukun4515 I will start reviewing this PR today. I am also adding a type coercion rule in #3101 and I plan on adding others.

Do you think it makes sense to have one type coercion rule or multiple? It might be more efficient to have one rule that can handle all the cases?

ScalarValue::Int32(Some(v)) => *v as i64,
ScalarValue::Int64(Some(v)) => *v as i64,
other_type => {
panic!("Invalid type and value {:?}", other_type);
Member

I created a PR against this PR to remove these panics and use DataFusionError::Internal instead.

I quite often hit panics in the code that should not be possible, so I think we should use Result where possible.

Contributor Author

I check the data type before calling this method,
so this panic cannot be hit; that is guaranteed by is_support_data_type:

fn is_support_data_type(data_type: &DataType) -> bool {
    // TODO support decimal with other data type
    matches!(
        data_type,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
    )
}

Contributor Author

But I will change the return type to Result<T> and remove the panics from the code.

ScalarValue::Int16(Some(v)) => *v as i64,
ScalarValue::Int32(Some(v)) => *v as i64,
ScalarValue::Int64(Some(v)) => *v,
_ => unimplemented!(),
Member

This method returns Result so we should return errors here rather than panic.
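
One way to read this request, sketched with hypothetical types: the catch-all arm builds an error value instead of calling `panic!` or `unimplemented!`. `Scalar` and `InternalError` below are assumed stand-ins; in the PR the error would be a `DataFusionError::Internal`, as mentioned earlier in the thread.

```rust
// Hypothetical stand-ins for the scalar and error types.
#[derive(Debug)]
enum Scalar {
    Int16(Option<i16>),
    Int32(Option<i32>),
    Int64(Option<i64>),
    Utf8(Option<String>),
}

#[derive(Debug, PartialEq)]
struct InternalError(String);

/// Widens a supported, non-NULL integer scalar to i64.
/// Unsupported types and NULLs produce an error rather than a panic.
fn try_as_i64(value: &Scalar) -> Result<i64, InternalError> {
    match value {
        Scalar::Int16(Some(v)) => Ok(i64::from(*v)),
        Scalar::Int32(Some(v)) => Ok(i64::from(*v)),
        Scalar::Int64(Some(v)) => Ok(*v),
        // Anything else is reported to the caller.
        other => Err(InternalError(format!(
            "Invalid type and value {:?}",
            other
        ))),
    }
}
```

Even if an earlier check makes the last arm unreachable today, returning an error keeps a future refactoring mistake from taking down the whole process.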

Contributor Author

Done

Contributor

Love it! I'm very excited to see datafusion doing this for new code 😃

Comment on lines 224 to 227
if lit_value >= target_min && lit_value <= target_max {
return true;
}
false
Member

Suggested change
- if lit_value >= target_min && lit_value <= target_max {
- return true;
- }
- false
+ lit_value >= target_min && lit_value <= target_max
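
For context, a self-contained sketch of how such a range check might sit inside a helper. `IntType`, `bounds`, and `can_cast_literal` are hypothetical names chosen for this example; only the final boolean expression mirrors the suggestion above.

```rust
/// Hypothetical subset of target types, mirroring the signed integers
/// this PR supports.
#[derive(Debug, Clone, Copy)]
enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
}

/// Smallest and largest value of each target type, widened to i64.
fn bounds(target: IntType) -> (i64, i64) {
    match target {
        IntType::Int8 => (i64::from(i8::MIN), i64::from(i8::MAX)),
        IntType::Int16 => (i64::from(i16::MIN), i64::from(i16::MAX)),
        IntType::Int32 => (i64::from(i32::MIN), i64::from(i32::MAX)),
        IntType::Int64 => (i64::MIN, i64::MAX),
    }
}

/// True when `lit_value` can be represented in `target` without loss.
fn can_cast_literal(lit_value: i64, target: IntType) -> bool {
    let (target_min, target_max) = bounds(target);
    // The comparison is already a bool, so it is returned directly.
    lit_value >= target_min && lit_value <= target_max
}
```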

Contributor Author

👍

@liukun4515
Contributor Author

> @liukun4515 I will start reviewing this PR today. I am also adding a type coercion rule in #3101 and I plan on adding others.
>
> Do you think it makes sense to have one type coercion rule or multiple? It might be more efficient to have one rule that can handle all the cases?

I went through your #3101, which adds an optimizer rule to do the type coercion.
The implementation feels a little weird to me, but I think it works and can migrate type coercion from the physical phase to the logical phase.

I want to investigate other systems and find out where they do type coercion for the input expr.

@andygrove @alamb

@liukun4515 liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 6bb9c53 to 8455e3c Compare August 23, 2022 08:41
@liukun4515
Contributor Author

I fixed the CI.
If it looks good to you, please approve or merge it.
After it is merged, I will move on to the follow-up issues and tasks.

@alamb @andygrove

Comment on lines +100 to +101
let left_type = left_type.unwrap();
let right_type = right_type.unwrap();
Member

I know the unwrap calls here are safe due to the previous check, but we can still use ? instead.

Suggested change
- let left_type = left_type.unwrap();
- let right_type = right_type.unwrap();
+ let left_type = left_type?;
+ let right_type = right_type?;
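
A small sketch of the difference, under the assumption of a hypothetical `get_type` lookup that can fail. With `?`, an error from either lookup is returned to the caller of `describe_comparison`; with `unwrap()`, the same error would abort the thread.

```rust
// `get_type` is a hypothetical fallible lookup used only for illustration.
fn get_type(column: &str) -> Result<&'static str, String> {
    match column {
        "c1" => Ok("Int32"),
        "c2" => Ok("Int64"),
        other => Err(format!("unknown column {}", other)),
    }
}

/// Resolves both sides of a comparison, propagating lookup errors with `?`.
fn describe_comparison(left: &str, right: &str) -> Result<String, String> {
    let left_type = get_type(left)?;
    let right_type = get_type(right)?;
    Ok(format!("{} vs {}", left_type, right_type))
}
```

This costs nothing when the earlier check already guarantees success, and it degrades to an ordinary error if that guarantee is ever broken.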

Contributor Author

@andygrove Got it, I will refine this in the next PR 👍

Comment on lines +128 to +129
)
.unwrap() =>
Member

Suggested change
- )
- .unwrap() =>
+ )? =>

if origin_value.is_null() {
// if the origin value is null, just convert to another type of null value
// The target type must be satisfied `is_support_data_type` method, we can unwrap safely
return Ok(lit(ScalarValue::try_from(target_type).unwrap()));
Member

Suggested change
- return Ok(lit(ScalarValue::try_from(target_type).unwrap()));
+ return Ok(lit(ScalarValue::try_from(target_type)?));

Member

@andygrove andygrove left a comment

LGTM. Thanks @liukun4515. I think it is better to use ? rather than unwrap even if we think unwrap is safe, so left some suggestions but these are not blockers so will go ahead and merge.

@andygrove andygrove merged commit 9ecf277 into apache:master Aug 23, 2022
@ursabot

ursabot commented Aug 23, 2022

Benchmark runs are scheduled for baseline = eedc787 and contender = 9ecf277. 9ecf277 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@liukun4515
Contributor Author

> LGTM. Thanks @liukun4515. I think it is better to use

Thanks for your time and review.
I will refine this in the follow-up PR.

@alamb
Contributor

alamb commented Aug 31, 2022

Sorry for my delayed response

In general I think the coercion logic could use some serious ❤️ in DataFusion as I find it confusing. However, it seems to be working and we are making progress so my plan is to just "go with it" until I hit some issue that gives me an excuse to improve things

/// which data type is `target_type`.
/// If this false, do nothing.
///
/// This is inspired by the optimizer rule `UnwrapCastInBinaryComparison` of Spark.
Contributor

Thank you for these comments 👍

Contributor Author

This is something I should do anyway.
Good documentation is the start of a good project and good code.
@alamb

Contributor

It really helps me review code when there are comments that explain the rationale

Labels
core Core datafusion crate optimizer Optimizer rules

6 participants