support cast/try_cast in prune with signed integer and decimal #3422

liukun4515 · 2022-09-09T14:50:17Z

Which issue does this PR close?

part of #3414
Closes #3377

In order to merge #3396, we should first support cast/try_cast in pruning.

cc @andygrove @alamb

Only casts from signed integer or decimal to signed integer or decimal are supported.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter · 2022-09-09T15:42:48Z

Codecov Report

Merging #3422 (2d77a8d) into master (e6378f4) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3422      +/-   ##
==========================================
+ Coverage   85.58%   85.66%   +0.07%     
==========================================
  Files         296      296              
  Lines       54252    54544     +292     
==========================================
+ Hits        46432    46723     +291     
- Misses       7820     7821       +1

Impacted Files	Coverage Δ
datafusion/core/src/physical_optimizer/pruning.rs	`95.54% <100.00%> (+0.79%)`	⬆️
datafusion/expr/src/expr_fn.rs	`91.16% <100.00%> (+0.09%)`	⬆️
benchmarks/src/bin/tpch.rs	`37.59% <0.00%> (-3.56%)`	⬇️
datafusion/physical-expr/src/planner.rs	`92.68% <0.00%> (-1.52%)`	⬇️
datafusion/optimizer/src/optimizer.rs	`90.90% <0.00%> (-1.40%)`	⬇️
datafusion/expr/src/window_frame.rs	`92.43% <0.00%> (-0.85%)`	⬇️
datafusion/sql/src/planner.rs	`81.06% <0.00%> (-0.62%)`	⬇️
datafusion/optimizer/src/reduce_outer_join.rs	`98.19% <0.00%> (-0.61%)`	⬇️
datafusion/core/src/physical_plan/planner.rs	`76.87% <0.00%> (-0.58%)`	⬇️
datafusion/proto/src/logical_plan.rs	`17.46% <0.00%> (-0.24%)`	⬇️
... and 20 more

andygrove · 2022-09-09T20:07:12Z

datafusion/core/src/physical_optimizer/pruning.rs

+            ],
+            negated: true,
+        };
+        let expected_expr = "CAST(#c1_min AS Int64) != Int64(1) OR Int64(1) != CAST(#c1_max AS Int64) AND CAST(#c1_min AS Int64) != Int64(2) OR Int64(2) != CAST(#c1_max AS Int64) AND CAST(#c1_min AS Int64) != Int64(3) OR Int64(3) != CAST(#c1_max AS Int64)";


I'm not sure I understand what is happening with the negated case. Could you add a comment here explaining this? Why is there != in here rather than <= or >=?

@andygrove I didn't change any of the logic for the IN list.
You can find the background in #2282 (comment).

This is how I understand it from the code context:
if both min_c1 and max_c1 are 1, every value in the column or chunk must be 1 or null, so c1 not in (1) can skip this column or chunk.
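
The reasoning above can be sketched in a few lines of Rust. This is only an illustration of the rewritten predicate from the test (`c1_min != v OR v != c1_max` for each listed value, with the per-value terms ANDed); the helper name `may_match_not_in` is hypothetical and is not part of DataFusion.

```rust
/// Returns true when a container (row group, chunk) may hold rows that satisfy
/// `c1 NOT IN (values)`, judging only from the container's min/max statistics.
/// Hypothetical helper for illustration; not DataFusion's API.
fn may_match_not_in(min: i64, max: i64, values: &[i64]) -> bool {
    // For each listed value v the rewritten term is `c1_min != v OR v != c1_max`.
    // It is false only when min == max == v, i.e. every non-null value equals v.
    // The container is kept only if every per-value term is true.
    values.iter().all(|&v| min != v || v != max)
}
```

With `min == max == 1` and the list `(1, 2, 3)`, the term for `1` is false, so the container is skipped. With `min = 1` and `max = 4`, every term is true and the container has to be read.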

alamb

Thank you @liukun4515 -- this is great

I will look at this code in the morning as well, but I think we need to handle the case where casting won't preserve sortedness (aka UTF8 -> numeric), as described on #3377

liukun4515 · 2022-09-10T02:44:25Z

I got your point.

We can add a rule for the cast/try_cast case.

For example, we can support cast/try_cast only from one numeric data type to another; I think numeric data types preserve ordering.

We can support other, more complex cases in a future PR.

@alamb
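
To make the sortedness concern concrete, here is a small sketch that compares the min/max of a string column (taken in UTF8 order and then cast) with the true numeric min/max. The function names are hypothetical and the code has no connection to DataFusion's statistics API; it only shows why string bounds cannot be reused after a cast to a number.

```rust
/// Min and max of a slice under the element type's natural ordering.
fn min_max<T: Ord + Clone>(values: &[T]) -> Option<(T, T)> {
    let min = values.iter().min()?.clone();
    let max = values.iter().max()?.clone();
    Some((min, max))
}

/// Returns (string-ordered min/max cast to i64, true numeric min/max).
/// Returns None if the input is empty or a value does not parse.
fn string_bounds_vs_numeric_bounds(raw: &[&str]) -> Option<((i64, i64), (i64, i64))> {
    // Statistics as a string column would record them: byte-wise string order.
    let (smin, smax) = min_max(raw)?;
    let cast_bounds = (smin.parse::<i64>().ok()?, smax.parse::<i64>().ok()?);

    // Statistics of the same values after casting every row to a number.
    let nums: Vec<i64> = raw
        .iter()
        .map(|s| s.parse::<i64>().ok())
        .collect::<Option<_>>()?;
    let true_bounds = min_max(&nums)?;

    Some((cast_bounds, true_bounds))
}
```

For the values "3", "13", "7", the string-ordered bounds cast to (13, 7), while the true numeric bounds are (3, 13). A predicate such as `CAST(c1 AS Int64) > 10` evaluated against a max of 7 would wrongly skip the container that holds 13.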

liukun4515 · 2022-09-10T09:46:03Z

cc @alamb
cast/try_cast now only supports signed integer/decimal to signed integer/decimal.

alamb · 2022-09-10T11:05:50Z

datafusion/core/src/physical_optimizer/pruning.rs

+            if support_type_for_prune(&from_type, data_type) {
+                let (left, op, right) =
+                    rewrite_expr_to_prunable(expr, op, scalar_expr, schema)?;
+                Ok((cast(left, data_type.clone()), op, right))
+            } else {
+                return Err(DataFusionError::Plan(format!(
+                    "Cast with from type {} to type {} is not supported",
+                    from_type, data_type
+                )));
+            }


We could probably avoid the duplication of the error generation by moving the error return into support_type_for_prune like:

Suggested change

-            if support_type_for_prune(&from_type, data_type) {
-                let (left, op, right) =
-                    rewrite_expr_to_prunable(expr, op, scalar_expr, schema)?;
-                Ok((cast(left, data_type.clone()), op, right))
-            } else {
-                return Err(DataFusionError::Plan(format!(
-                    "Cast with from type {} to type {} is not supported",
-                    from_type, data_type
-                )));
-            }
+            verify_support_type_for_prune(&from_type, data_type)?;
+            let (left, op, right) = rewrite_expr_to_prunable(expr, op, scalar_expr)?;
+            Ok((cast(left, data_type.clone()), op, right))

This is not needed for correctness, just a code structure suggestion
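
A self-contained sketch of the suggested structure follows. The `DataType` enum and the `String` error are stand-ins for Arrow's `DataType` and `DataFusionError::Plan`, the supported type list is an assumption based on the "signed integer/decimal" description in this PR, and `rewrite_cast` is a hypothetical caller that only shows how `?` removes the if/else from each call site.

```rust
/// Stand-in for Arrow's DataType (assumption: only the variants needed here).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Int64,
    Decimal128(u8, u8),
    Utf8,
}

/// Returns Ok(()) when min/max bounds stay ordered across the cast,
/// otherwise the error that each caller previously had to build itself.
fn verify_support_type_for_prune(from: &DataType, to: &DataType) -> Result<(), String> {
    // Assumed supported set: signed integers and decimals.
    let is_supported = |t: &DataType| {
        matches!(
            t,
            DataType::Int8
                | DataType::Int16
                | DataType::Int32
                | DataType::Int64
                | DataType::Decimal128(_, _)
        )
    };
    if is_supported(from) && is_supported(to) {
        Ok(())
    } else {
        Err(format!(
            "Cast with from type {:?} to type {:?} is not supported",
            from, to
        ))
    }
}

/// Hypothetical caller: the check collapses to one line with `?`.
fn rewrite_cast(from: &DataType, to: &DataType) -> Result<String, String> {
    verify_support_type_for_prune(from, to)?;
    Ok(format!("CAST(c1_min AS {:?})", to))
}
```

The cast and try_cast branches can then share the same one-line check instead of each repeating the error construction.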

alamb · 2022-09-10T11:06:08Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -552,6 +577,25 @@ fn is_compare_op(op: Operator) -> bool {
    )
 }

+// The pruning logic is based on the comparing the min/max bounds.
+// Must make sure the two type has order.
+// For example, casts from string to numbers is not correct.


👍 thank you for these comments

alamb · 2022-09-10T11:07:04Z

datafusion/core/src/physical_optimizer/pruning.rs

+// For example, casts from string to numbers is not correct.
+// Because the "13" is less than "3" with UTF8 comparison order.
+fn support_type_for_prune(from_type: &DataType, to_type: &DataType) -> bool {
+    // TODO: support other data type


I think it would be good to file a follow on ticket listing the other types (you did so in #3414 (comment))

alamb · 2022-09-10T11:09:17Z

datafusion/core/src/physical_optimizer/pruning.rs

+        let result =
+            rewrite_expr_to_prunable(&left_input, Operator::Gt, &right_input, df_schema);
+        assert!(result.is_err());
+    }
 }


I think we should add some tests that exercise the pruning code with actual values too

For example, perhaps you can take prune_int32_cast from #3378?
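
As a rough picture of what a test on actual values checks, the sketch below evaluates two cast predicates against per-container min/max statistics. It is not `prune_int32_cast` from #3378 and does not use `PruningPredicate`; the `Stats` type and both functions are hypothetical stand-ins.

```rust
/// Min/max statistics of an Int32 column for one container (e.g. a row group).
/// Hypothetical stand-in type for illustration.
struct Stats {
    min: i32,
    max: i32,
}

/// `CAST(c1 AS Int64) > literal`: a container may match only if
/// `CAST(c1_max AS Int64) > literal`. Returns one keep/skip flag per container.
fn prune_cast_gt(containers: &[Stats], literal: i64) -> Vec<bool> {
    containers
        .iter()
        .map(|s| i64::from(s.max) > literal)
        .collect()
}

/// `CAST(c1 AS Int64) = literal`: a container may match only if
/// `CAST(c1_min AS Int64) <= literal AND literal <= CAST(c1_max AS Int64)`.
fn prune_cast_eq(containers: &[Stats], literal: i64) -> Vec<bool> {
    containers
        .iter()
        .map(|s| i64::from(s.min) <= literal && literal <= i64::from(s.max))
        .collect()
}
```

For containers with bounds (1, 5), (6, 10) and (-3, 0) and the literal 5, the `>` predicate keeps only the second container and the `=` predicate keeps only the first. Widening Int32 to Int64 preserves order, which is why the cast bounds can be compared directly.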

alamb · 2022-09-10T11:09:30Z

datafusion/core/src/physical_optimizer/pruning.rs

+    #[test]
+    fn test_rewrite_expr_to_prunable_error() {
+        // cast string value to numeric value
+        // this cast is not supported


alamb · 2022-09-10T11:10:45Z

datafusion/core/src/physical_optimizer/pruning.rs

+                .unwrap();
+        assert_eq!(result_left, left_input);
+        assert_eq!(result_right, right_input);
+        // TODO: add test for other case and op


Rather than extending this test for other ops, I think we should add tests which actually run the pruning expressions on values (see my other comments).

Would it be helpful for me to write some more tests? I can do so tomorrow morning

I found that I have already added a test that executes with real data, in the prune_cast_column_scalar function.

You can check it; it only covers cast(col, int64) and try_cast(col, int64). If you think that is not enough, you can add more tests or give me some advice on the tests.

https://github.com/apache/arrow-datafusion/blob/4b1a4052d079c8149ad232a05e9965cceb05b5fc/datafusion/core/src/physical_optimizer/pruning.rs#L2011

https://github.com/apache/arrow-datafusion/blob/4b1a4052d079c8149ad232a05e9965cceb05b5fc/datafusion/core/src/physical_optimizer/pruning.rs#L2022

https://github.com/apache/arrow-datafusion/blob/4b1a4052d079c8149ad232a05e9965cceb05b5fc/datafusion/core/src/physical_optimizer/pruning.rs#L2028

FYI @alamb

alamb · 2022-09-10T11:29:00Z

To be explicit I think the code in this PR looks good to me, but I would like to see some actual tests with pruning expressions on data prior to merging it. I am happy to help write these tests too

liukun4515 · 2022-09-10T15:03:28Z

@alamb Thanks for your help.
It's too late for me to add tests today, but you are welcome to add tests to this PR; I can help review your tests and other changes tomorrow.
I won't make any changes until your tests or suggestions are merged.

Dandandan · 2022-09-12T06:42:40Z

datafusion/core/src/physical_optimizer/pruning.rs

+                rewrite_expr_to_prunable(expr, op, scalar_expr, schema)?;
+            Ok((try_cast(left, data_type.clone()), op, right))
+        }
+        // `col > lit()`


This comment should be removed I think?

Done
removed

alamb

I re-reviewed this @liukun4515 and I think it looks good to go. Thank you.

I wrote several more tests which I will put up in a follow on PR but they all pass and I don't think there is any reason to hold this PR.

alamb · 2022-09-12T17:00:38Z

Follow on PR is #3454

ursabot · 2022-09-12T17:44:06Z

Benchmark runs are scheduled for baseline = 3b8a20a and contender = 69d05aa. 69d05aa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liukun4515 · 2022-09-13T05:10:45Z

Thanks for your careful review. @alamb

support cast/try_cast in prune

2d77a8d

liukun4515 marked this pull request as ready for review September 9, 2022 14:50

liukun4515 requested review from alamb and andygrove and removed request for alamb September 9, 2022 15:07

github-actions bot added the core (Core DataFusion crate) and logical-expr (Logical plan and expressions) labels Sep 9, 2022

andygrove mentioned this pull request Sep 9, 2022

DataFusion 12.0.0 Release #3097

Closed

4 tasks

andygrove reviewed Sep 9, 2022

alamb reviewed Sep 9, 2022

liukun4515 changed the title from "support cast/try_cast in prune" to "support cast/try_cast in prune with signed integer and decimal" Sep 10, 2022

liukun4515 mentioned this pull request Sep 10, 2022

pruning support cast/try_cast expr #3414

Closed

alamb reviewed Sep 10, 2022

liukun4515 mentioned this pull request Sep 11, 2022

support more data type in prune for cast/try_cast #3442

Closed

4 tasks

liukun4515 force-pushed the cast_pruning_#3414 branch from 4b1a405 to eb9a540 September 11, 2022 01:36

liukun4515 requested review from alamb and andygrove September 11, 2022 03:33

Dandandan reviewed Sep 12, 2022

add bound for supported data type in the cast/try_cast prune

706357e

liukun4515 force-pushed the cast_pruning_#3414 branch from eb9a540 to 706357e September 12, 2022 12:04

alamb mentioned this pull request Sep 12, 2022

Add additional pruning tests with casts, handle unsupported predicates better #3454

Merged

alamb approved these changes Sep 12, 2022

alamb merged commit 69d05aa into apache:master Sep 12, 2022

liukun4515 deleted the cast_pruning_#3414 branch September 13, 2022 06:03

alamb mentioned this pull request Sep 19, 2022

Support casts in expression pruning (WIP) #3378

Closed

This pull request was closed.
