optimizer: add framework for the rule of pre-add cast to the literal in comparison binary #3185

liukun4515 · 2022-08-17T05:30:23Z

Which issue does this PR close?

part of #3031

Rationale for this change

Add the framework for this optimization.
Support the case where the data type is a signed integer (INT8, INT16, INT32, INT64).

TODO:

Support other numeric types, for example decimal

Support InList expressions

Simplify binary comparisons such as INT32(C1) > INT64(INT32.MAX)

Port other features of the Spark rule that can be used in DataFusion: https://github.com/sunchao/spark/blob/1f496fbea688c7082bad7e6280c8a949fbfd31b7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala#L30

What changes are included in this PR?

Are there any user-facing changes?

liukun4515 · 2022-08-18T08:09:27Z

@alamb @andygrove PTAL

liukun4515 · 2022-08-18T08:10:54Z

datafusion/core/tests/sql/explain_analyze.rs

@@ -271,8 +271,8 @@ async fn csv_explain_plans() {
    let expected = vec![
        "Explain [plan_type:Utf8, plan:Utf8]",
        "  Projection: #aggregate_test_100.c1 [c1:Utf8]",
-        "    Filter: #aggregate_test_100.c2 > Int64(10) [c1:Utf8, c2:Int32]",
-        "      TableScan: aggregate_test_100 projection=[c1, c2], partial_filters=[#aggregate_test_100.c2 > Int64(10)] [c1:Utf8, c2:Int32]",
+        "    Filter: #aggregate_test_100.c2 > Int32(10) [c1:Utf8, c2:Int32]",


After optimization, the literal Int64(10) is cast to Int32(10), because the left-hand side (column c2) has type Int32.
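The rewrite described in this comment can be sketched without any DataFusion dependency. The `IntLit` and `IntType` enums and the `cast_lit_to_column_type` function below are hypothetical names used only for illustration; they are not the PR's actual types (which are `ScalarValue` and `DataType`). The sketch also uses `try_from` for the fit check rather than the explicit min/max comparison used in the PR.

```rust
use std::convert::TryFrom;

/// Simplified stand-in for the signed-integer literals the rule handles.
/// (Hypothetical type for illustration; DataFusion uses `ScalarValue`.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum IntLit {
    Int8(i8),
    Int16(i16),
    Int32(i32),
    Int64(i64),
}

/// The signed-integer column types covered by this PR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
}

impl IntLit {
    /// Widen any supported literal to i64 so it can be handled uniformly.
    fn as_i64(self) -> i64 {
        match self {
            IntLit::Int8(v) => i64::from(v),
            IntLit::Int16(v) => i64::from(v),
            IntLit::Int32(v) => i64::from(v),
            IntLit::Int64(v) => v,
        }
    }
}

/// Re-type `lit` to the column's type when the value fits.
/// `None` means "leave the expression unchanged".
fn cast_lit_to_column_type(lit: IntLit, column_type: IntType) -> Option<IntLit> {
    let v = lit.as_i64();
    match column_type {
        IntType::Int8 => i8::try_from(v).ok().map(IntLit::Int8),
        IntType::Int16 => i16::try_from(v).ok().map(IntLit::Int16),
        IntType::Int32 => i32::try_from(v).ok().map(IntLit::Int32),
        IntType::Int64 => Some(IntLit::Int64(v)),
    }
}
```

With this sketch, `cast_lit_to_column_type(IntLit::Int64(10), IntType::Int32)` yields `Some(IntLit::Int32(10))`, which mirrors the change in the expected plan above, while a literal that does not fit in the column type is left alone.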

liukun4515 · 2022-08-18T08:12:16Z

datafusion/optimizer/src/pre_cast_lit_in_binary_comparison.rs

+        }
+        // TODO: optimize in list
+        // Expr::InList { .. } => {}
+        // TODO: handle other expr type and dfs visit them


Other Expr types still need to be supported.

Is the plan to add more types in this PR or as a follow-on? If the latter, could you file an issue and reference it here?

The sub-tasks for this optimizer are listed in the issue: #3031 (comment)

codecov-commenter · 2022-08-18T08:13:37Z

Codecov Report

Merging #3185 (8455e3c) into master (89bcfc4) will decrease coverage by 0.03%.
The diff coverage is 73.88%.

@@            Coverage Diff             @@
##           master    #3185      +/-   ##
==========================================
- Coverage   85.85%   85.81%   -0.04%     
==========================================
  Files         291      292       +1     
  Lines       52786    53111     +325     
==========================================
+ Hits        45320    45579     +259     
- Misses       7466     7532      +66

Impacted Files	Coverage Δ
datafusion/core/tests/sql/subqueries.rs	`94.32% <ø> (ø)`
datafusion/core/tests/provider_filter_pushdown.rs	`70.45% <6.25%> (-14.48%)`	⬇️
datafusion/core/tests/sql/explain_analyze.rs	`83.39% <66.66%> (ø)`
...fusion/optimizer/src/pre_cast_lit_in_comparison.rs	`83.33% <83.33%> (ø)`
datafusion/core/src/execution/context.rs	`78.21% <100.00%> (+0.15%)`	⬆️
...sical-expr/src/aggregate/approx_percentile_cont.rs	`81.54% <0.00%> (-3.42%)`	⬇️
datafusion/expr/src/window_frame.rs	`92.43% <0.00%> (-0.85%)`	⬇️
.../physical-expr/src/aggregate/array_agg_distinct.rs	`79.41% <0.00%> (-0.78%)`	⬇️
datafusion/expr/src/operator.rs	`95.23% <0.00%> (-0.77%)`	⬇️
datafusion/sql/src/planner.rs	`80.44% <0.00%> (-0.76%)`	⬇️
... and 25 more

alamb · 2022-08-19T10:49:11Z

datafusion/core/tests/provider_filter_pushdown.rs

@@ -146,7 +147,20 @@ impl TableProvider for CustomProvider {
        match &filters[0] {
            Expr::BinaryExpr { right, .. } => {
                let int_value = match &**right {
+                    Expr::Literal(ScalarValue::Int8(i)) => i.unwrap() as i64,


I think you might want to avoid doing this for NULL (aka None) values

Something like:

Suggested change

-                    Expr::Literal(ScalarValue::Int8(i)) => i.unwrap() as i64,
+                    Expr::Literal(ScalarValue::Int8(Some(i))) => i as i64,

Thanks for your comments. I was just following the original implementation,
but I will apply your suggestion.
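The point of the suggestion above is that a NULL literal is represented as `None`, so calling `unwrap()` on it would panic. A minimal sketch of the pattern-matching alternative follows; the `Scalar` enum and `literal_as_i64` are hypothetical, trimmed-down stand-ins for illustration, not DataFusion's `ScalarValue` or any function in the PR.

```rust
/// Hypothetical, trimmed-down stand-in for a scalar literal:
/// each variant wraps an `Option`, where `None` represents SQL NULL.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int8(Option<i8>),
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Extract an integer literal as i64.
/// NULL literals and non-integer variants yield `None` instead of panicking.
fn literal_as_i64(value: &Scalar) -> Option<i64> {
    match value {
        // Matching on `Some(i)` means a NULL never reaches this arm.
        Scalar::Int8(Some(i)) => Some(i64::from(*i)),
        Scalar::Int64(Some(i)) => Some(*i),
        // NULLs and unsupported types fall through here.
        _ => None,
    }
}
```

The caller then decides what a missing value means (skip the filter, return an error, and so on) rather than having the process abort.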

liukun4515 · 2022-08-22T01:11:24Z

@alamb @andygrove are there any comments on this PR?
It has gone too long without updates.

andygrove · 2022-08-22T15:00:34Z

@liukun4515 I will start reviewing this PR today. I am also adding a type coercion rule in #3101 and I plan on adding others.

Do you think it makes sense to have one type coercion rule or multiple? It might be more efficient to have one rule that can handle all the cases?

andygrove · 2022-08-22T15:20:47Z

datafusion/optimizer/src/pre_cast_lit_in_binary_comparison.rs

+        ScalarValue::Int32(Some(v)) => *v as i64,
+        ScalarValue::Int64(Some(v)) => *v as i64,
+        other_type => {
+            panic!("Invalid type and value {:?}", other_type);


I created a PR against this PR to remove these panics and use DataFusionError::Internal instead.

I quite often hit panics in the code that should not be possible, so I think we should use Result where possible.

I check the data type before calling this method.
is_support_data_type guarantees that the data type can never reach this panic:

fn is_support_data_type(data_type: &DataType) -> bool {
    // TODO support decimal with other data type
    matches!(
        data_type,
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
    )
}

But I will change the return type to Result<T> and remove the panics from the code.
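As a rough illustration of the change agreed on in this thread, the conversion can return an error for unexpected variants instead of panicking. In this sketch `InternalError` is a hypothetical type standing in for `DataFusionError::Internal`, and `Scalar` and `scalar_to_i64` are simplified, assumed names rather than the PR's actual code.

```rust
use std::fmt;

/// Hypothetical error type standing in for `DataFusionError::Internal`.
#[derive(Debug, PartialEq)]
struct InternalError(String);

impl fmt::Display for InternalError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "Internal error: {}", self.0)
    }
}

impl std::error::Error for InternalError {}

/// Simplified stand-in for a scalar literal; `None` represents SQL NULL.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int32(Option<i32>),
    Int64(Option<i64>),
    Float64(Option<f64>),
}

/// Convert a supported integer scalar to i64.
/// Unsupported variants and NULLs produce an error rather than a panic,
/// even if an earlier type check is expected to rule them out.
fn scalar_to_i64(value: &Scalar) -> Result<i64, InternalError> {
    match value {
        Scalar::Int32(Some(v)) => Ok(i64::from(*v)),
        Scalar::Int64(Some(v)) => Ok(*v),
        other => Err(InternalError(format!(
            "Invalid type and value {:?}",
            other
        ))),
    }
}
```

Keeping the error path costs little, and it protects against the guarding check and this function drifting apart later.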

andygrove · 2022-08-22T15:22:14Z

datafusion/core/tests/provider_filter_pushdown.rs

+                            ScalarValue::Int16(Some(v)) => *v as i64,
+                            ScalarValue::Int32(Some(v)) => *v as i64,
+                            ScalarValue::Int64(Some(v)) => *v,
+                            _ => unimplemented!(),


This method returns Result so we should return errors here rather than panic.

Love it! I'm very excited to see datafusion doing this for new code 😃

andygrove · 2022-08-22T15:25:05Z

datafusion/optimizer/src/pre_cast_lit_in_binary_comparison.rs

+    if lit_value >= target_min && lit_value <= target_max {
+        return true;
+    }
+    false


Suggested change

-    if lit_value >= target_min && lit_value <= target_max {
-        return true;
-    }
-    false
+    lit_value >= target_min && lit_value <= target_max
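For context, the range check in this suggestion can be written as a small self-contained function. The `IntType` enum and the `bounds` and `can_cast_to` names are assumptions made for this sketch; only the boolean expression itself comes from the review.

```rust
/// Signed-integer target types (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy)]
enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
}

/// Inclusive (min, max) bounds of each target type, widened to i64.
fn bounds(target: IntType) -> (i64, i64) {
    match target {
        IntType::Int8 => (i64::from(i8::MIN), i64::from(i8::MAX)),
        IntType::Int16 => (i64::from(i16::MIN), i64::from(i16::MAX)),
        IntType::Int32 => (i64::from(i32::MIN), i64::from(i32::MAX)),
        IntType::Int64 => (i64::MIN, i64::MAX),
    }
}

/// True when `lit_value` can be represented in `target` without loss.
/// The comparison is returned directly, as in the suggested change.
fn can_cast_to(lit_value: i64, target: IntType) -> bool {
    let (target_min, target_max) = bounds(target);
    lit_value >= target_min && lit_value <= target_max
}
```

Under these assumptions, `can_cast_to(10, IntType::Int32)` is true, while `can_cast_to(128, IntType::Int8)` is false, so a literal outside the column type's range is not rewritten.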

liukun4515 · 2022-08-23T02:27:44Z

> @liukun4515 I will start reviewing this PR today. I am also adding a type coercion rule in #3101 and I plan on adding others.
>
> Do you think it makes sense to have one type coercion rule or multiple? It might be more efficient to have one rule that can handle all the cases?

I went through your #3101, which adds an optimizer rule to do the type coercion.
The implementation feels a little weird to me, but I think it works and can migrate the type coercion from the physical phase to the logical phase.

I want to investigate other systems and find out where they do the type coercion for input expressions.

@andygrove @alamb

liukun4515 · 2022-08-23T08:41:50Z

I fixed the CI.
If it looks good to you, please approve or merge it.
After it is merged, I will move on to the follow-up issues and tasks.

@alamb @andygrove

andygrove · 2022-08-23T13:33:48Z

datafusion/optimizer/src/pre_cast_lit_in_comparison.rs

+            let left_type = left_type.unwrap();
+            let right_type = right_type.unwrap();


I know the unwrap calls here are safe due to the previous check, but we can still use ? instead.

Suggested change

-            let left_type = left_type.unwrap();
-            let right_type = right_type.unwrap();
+            let left_type = left_type?;
+            let right_type = right_type?;

@andygrove got it, and will refine this in the next pr 👍

andygrove · 2022-08-23T13:35:08Z

datafusion/optimizer/src/pre_cast_lit_in_comparison.rs

+                        )
+                        .unwrap() =>


Suggested change

-                        )
-                        .unwrap() =>
+                        )? =>

andygrove · 2022-08-23T13:35:55Z

datafusion/optimizer/src/pre_cast_lit_in_comparison.rs

+    if origin_value.is_null() {
+        // if the origin value is null, just convert to another type of null value
+        // The target type must be satisfied `is_support_data_type` method, we can unwrap safely
+        return Ok(lit(ScalarValue::try_from(target_type).unwrap()));


Suggested change

-        return Ok(lit(ScalarValue::try_from(target_type).unwrap()));
+        return Ok(lit(ScalarValue::try_from(target_type)?));

andygrove

LGTM. Thanks @liukun4515. I think it is better to use ? rather than unwrap even if we think unwrap is safe, so left some suggestions but these are not blockers so will go ahead and merge.
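The preference for `?` over `unwrap` can be illustrated with a small standalone sketch. Everything here (`DataType`, `PlanError`, `column_type`, `comparison_types`) is a hypothetical, simplified stand-in and not DataFusion's API; the point is only that `?` forwards the error to the caller where `unwrap` would panic.

```rust
/// Hypothetical, reduced set of data types for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct PlanError(String);

/// Hypothetical lookup of a column's type; fails for unknown columns.
fn column_type(name: &str) -> Result<DataType, PlanError> {
    match name {
        "c1" => Ok(DataType::Int32),
        "c2" => Ok(DataType::Int64),
        other => Err(PlanError(format!("unknown column {}", other))),
    }
}

/// Resolve both sides of a comparison.
/// `?` returns early with the error instead of panicking like `unwrap`,
/// so a violated assumption surfaces as an `Err` the caller can handle.
fn comparison_types(left: &str, right: &str) -> Result<(DataType, DataType), PlanError> {
    let left_type = column_type(left)?;
    let right_type = column_type(right)?;
    Ok((left_type, right_type))
}
```

Even when an earlier check makes the failure case unreachable today, `?` is no longer to write than `unwrap` and keeps the function panic-free if that check changes.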

ursabot · 2022-08-23T13:41:42Z

Benchmark runs are scheduled for baseline = eedc787 and contender = 9ecf277. 9ecf277 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liukun4515 · 2022-08-23T14:02:33Z

LGTM. Thanks @liukun4515. I think it is better to use

Thanks for your time and review.
I will refine this in a follow-up PR.

alamb · 2022-08-31T13:25:43Z

Sorry for my delayed response

In general I think the coercion logic could use some serious ❤️ in DataFusion as I find it confusing. However, it seems to be working and we are making progress so my plan is to just "go with it" until I hit some issue that gives me an excuse to improve things

alamb · 2022-08-31T13:26:14Z

datafusion/optimizer/src/pre_cast_lit_in_comparison.rs

+/// which data type is `target_type`.
+/// If this false, do nothing.
+///
+/// This is inspired by the optimizer rule `UnwrapCastInBinaryComparison` of Spark.


Thank you for these comments 👍

That is something I should be doing anyway.
Good documentation is the start of a good project or good code.
@alamb

It really helps me review code when there are comments that explain the rationale

github-actions bot added the core (Core datafusion crate) and optimizer (Optimizer rules) labels Aug 17, 2022

liukun4515 requested a review from alamb August 17, 2022 05:40

liukun4515 mentioned this pull request Aug 17, 2022

optimize/simplify the literal data type and remove unnecessary cast、try_cast #3031

Closed

6 tasks

liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch 2 times, most recently from d5ee16b to 5fe6e26 Compare August 18, 2022 07:45

liukun4515 marked this pull request as ready for review August 18, 2022 07:53

liukun4515 commented Aug 18, 2022

liukun4515 requested review from andygrove and xudong963 August 19, 2022 02:12

liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 5fe6e26 to 6f34a4a Compare August 19, 2022 02:25

add rule pre add cast to literal

d3a38d1

liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 6f34a4a to d3a38d1 Compare August 19, 2022 02:30

alamb reviewed Aug 19, 2022

address comments and fix clippy

eae5133

liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 17d871b to eae5133 Compare August 21, 2022 06:56

liukun4515 requested a review from alamb August 22, 2022 10:23

andygrove reviewed Aug 22, 2022

andygrove mentioned this pull request Aug 22, 2022

Add optimizer rule for type coercion (binary operations only) #3222

Merged

change panic to result

8455e3c

liukun4515 force-pushed the optimize_unwrapcast_binary_literals branch from 6bb9c53 to 8455e3c Compare August 23, 2022 08:41

andygrove reviewed Aug 23, 2022

andygrove approved these changes Aug 23, 2022

andygrove merged commit 9ecf277 into apache:master Aug 23, 2022

alamb reviewed Aug 31, 2022
