Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support cast/try_cast in prune with signed integer and decimal #3422

Merged
merged 2 commits into from
Sep 12, 2022

Conversation

liukun4515
Copy link
Contributor

@liukun4515 liukun4515 commented Sep 9, 2022

Which issue does this PR close?

part of #3414
Closes #3377

In order to merge this #3396, we should support the cast/try_cast in the prune first.

cc @andygrove @alamb

Just support cast from unsigned integer or decimal to unsigned integer or decimal.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@liukun4515 liukun4515 marked this pull request as ready for review September 9, 2022 14:50
@liukun4515 liukun4515 requested review from alamb and andygrove and removed request for alamb September 9, 2022 15:07
@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions labels Sep 9, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #3422 (2d77a8d) into master (e6378f4) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3422      +/-   ##
==========================================
+ Coverage   85.58%   85.66%   +0.07%     
==========================================
  Files         296      296              
  Lines       54252    54544     +292     
==========================================
+ Hits        46432    46723     +291     
- Misses       7820     7821       +1     
Impacted Files Coverage Δ
datafusion/core/src/physical_optimizer/pruning.rs 95.54% <100.00%> (+0.79%) ⬆️
datafusion/expr/src/expr_fn.rs 91.16% <100.00%> (+0.09%) ⬆️
benchmarks/src/bin/tpch.rs 37.59% <0.00%> (-3.56%) ⬇️
datafusion/physical-expr/src/planner.rs 92.68% <0.00%> (-1.52%) ⬇️
datafusion/optimizer/src/optimizer.rs 90.90% <0.00%> (-1.40%) ⬇️
datafusion/expr/src/window_frame.rs 92.43% <0.00%> (-0.85%) ⬇️
datafusion/sql/src/planner.rs 81.06% <0.00%> (-0.62%) ⬇️
datafusion/optimizer/src/reduce_outer_join.rs 98.19% <0.00%> (-0.61%) ⬇️
datafusion/core/src/physical_plan/planner.rs 76.87% <0.00%> (-0.58%) ⬇️
datafusion/proto/src/logical_plan.rs 17.46% <0.00%> (-0.24%) ⬇️
... and 20 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@andygrove andygrove mentioned this pull request Sep 9, 2022
4 tasks
],
negated: true,
};
let expected_expr = "CAST(#c1_min AS Int64) != Int64(1) OR Int64(1) != CAST(#c1_max AS Int64) AND CAST(#c1_min AS Int64) != Int64(2) OR Int64(2) != CAST(#c1_max AS Int64) AND CAST(#c1_min AS Int64) != Int64(3) OR Int64(3) != CAST(#c1_max AS Int64)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what is happening with the negated case. Could you add a comment here explaining this? Why is there != in here rather than <= or >=?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove I don't change any logical about the list.
you can get this from #2282 (comment)

I get the meanings from the code context.
if the min_c1 and max_c1 is 1, all the value in the column or chunk must be 1 or null, so the c1 not in (1) can skip this column or chunk.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liukun4515 -- this is great

I will look at this code in the morning as well, but I think we need to handle the case where casting won't preserve sortedness (aka UTF8 -> numeric), as described on #3377

@liukun4515
Copy link
Contributor Author

Thank you @liukun4515 -- this is great

I will look at this code in the morning as well, but I think we need to handle the case where casting won't preserve sortedness (aka UTF8 -> numeric), as described on #3377

I got your point.

We can add rule for cast/try_cast case.

For example, just support cast/try_cast from numeric to numeric data type, I think the data type of numeric has the order.

We can support other complex case in the future pr.

@alamb

@liukun4515 liukun4515 changed the title support cast/try_cast in prune support cast/try_cast in prune with signed integer and decimal Sep 10, 2022
@liukun4515
Copy link
Contributor Author

cc @alamb
The cast/try_cast just support signed integer/decimal to signed integer/decimal.

datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
Comment on lines 509 to 518
if support_type_for_prune(&from_type, data_type) {
let (left, op, right) =
rewrite_expr_to_prunable(expr, op, scalar_expr, schema)?;
Ok((cast(left, data_type.clone()), op, right))
} else {
return Err(DataFusionError::Plan(format!(
"Cast with from type {} to type {} is not supported",
from_type, data_type
)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably avoid the duplication of the error generation by moving the error return into support_type_for_prune like:

Suggested change
if support_type_for_prune(&from_type, data_type) {
let (left, op, right) =
rewrite_expr_to_prunable(expr, op, scalar_expr, schema)?;
Ok((cast(left, data_type.clone()), op, right))
} else {
return Err(DataFusionError::Plan(format!(
"Cast with from type {} to type {} is not supported",
from_type, data_type
)));
}
verify_support_type_for_prune(&from_type, data_type)?;
let (left, op, right) = rewrite_expr_to_prunable(expr, op, scalar_expr)?;
Ok((cast(left, data_type.clone()), op, right))

This is not needed for correctness, just a code structure suggestion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

@@ -552,6 +577,25 @@ fn is_compare_op(op: Operator) -> bool {
)
}

// The pruning logic is based on the comparing the min/max bounds.
// Must make sure the two type has order.
// For example, casts from string to numbers is not correct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thank you for these comments

// For example, casts from string to numbers is not correct.
// Because the "13" is less than "3" with UTF8 comparison order.
fn support_type_for_prune(from_type: &DataType, to_type: &DataType) -> bool {
// TODO: support other data type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to file a follow on ticket listing the other types (you did so in #3414 (comment))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let result =
rewrite_expr_to_prunable(&left_input, Operator::Gt, &right_input, df_schema);
assert!(result.is_err());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add some tests that exercise the pruning code with actual values too

For example, perhaps you can take prune_int32_cast from #3378?

#[test]
fn test_rewrite_expr_to_prunable_error() {
// cast string value to numeric value
// this cast is not supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

.unwrap();
assert_eq!(result_left, left_input);
assert_eq!(result_right, right_input);
// TODO: add test for other case and op
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than extending this test for other ops, I think we should add tests which actually run the pruning expressions on values (see my other comments).

Would it be helpful for me to write some more tests? I can do so tomorrow morning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find i have added the test for executing with real data in the prune_cast_column_scalar function.

You can check them, it just include cast(col, int64) and try_cast(col,int64). If you think it is not enough, you can add more test for it or give some advices for the test.

@alamb
Copy link
Contributor

alamb commented Sep 10, 2022

To be explicit I think the code in this PR looks good to me, but I would like to see some actual tests with pruning expressions on data prior to merging it. I am happy to help write these tests too

@liukun4515
Copy link
Contributor Author

liukun4515 commented Sep 10, 2022

To be explicit I think the code in this PR looks good to me, but I would like to see some actual tests with pruning expressions on data prior to merging it. I am happy to help write these tests too

@alamb Thanks for you help.
It's too late for me to add test today, but welcome to add test for this pr, I can help to review your test and other changes tomorrow.
I will never do any changes before merged your test or your suggestion.

rewrite_expr_to_prunable(expr, op, scalar_expr, schema)?;
Ok((try_cast(left, data_type.clone()), op, right))
}
// `col > lit()`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should be removed I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done
removed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-reviewed this @liukun4515 and I think it looks good to go. Thank you.

I wrote several more tests which I will put up in a follow on PR but they all pass and I don't think there is any reason to hold this PR.

@alamb
Copy link
Contributor

alamb commented Sep 12, 2022

Follow on PR is #3454

@alamb alamb merged commit 69d05aa into apache:master Sep 12, 2022
@ursabot
Copy link

ursabot commented Sep 12, 2022

Benchmark runs are scheduled for baseline = 3b8a20a and contender = 69d05aa. 69d05aa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@liukun4515
Copy link
Contributor Author

Thanks for your careful review. @alamb

@liukun4515 liukun4515 deleted the cast_pruning_#3414 branch September 13, 2022 06:03
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support casting expressions in pruning logic
6 participants