Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional pruning tests with casts, handle unsupported predicates better #3454

Merged
merged 5 commits into from
Sep 20, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 12, 2022

Draft as it builds on #3422 from @liukun4515 (I will rebase once that is merged)

Which issue does this PR close?

re #3377 and #3414

Rationale for this change

I wanted to add more tests in #3422 for pruning with casts -- see #3422 (comment)

While writing these tests, I also ended up cleaning up the handling of predicates without columns (aka ones that got turned into constants)

What changes are included in this PR?

  1. Additional test for pruning with statistics
  2. Support pruning predicates that require no columns (aka false or true) without error

Are there any user-facing changes?

@alamb alamb marked this pull request as draft September 12, 2022 16:59
@alamb alamb changed the title Alamb/cast pruning tests Add more cast pruning tests, handle unsupported predicates better Sep 12, 2022
@github-actions github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions labels Sep 12, 2022
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Sep 12, 2022
},
// result was a column
ColumnarValue::Scalar(ScalarValue::Boolean(v)) => {
let v = v.unwrap_or(true); // None -> true per comments above
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new code -- to handle boolean exprs (aka true)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this code and the code to make record batches with 0 columns below, some of the new negative pruning tests (the ones with invalid casts) error rather than return true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there test case for this case? get a Scalar after evaluating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- this code is covered. I verified this by applying this change locally:

--- a/datafusion/core/src/physical_optimizer/pruning.rs
+++ b/datafusion/core/src/physical_optimizer/pruning.rs
@@ -205,8 +205,7 @@ impl PruningPredicate {
             },
             // result was a column
             ColumnarValue::Scalar(ScalarValue::Boolean(v)) => {
-                let v = v.unwrap_or(true); // None -> true per comments above
-                Ok(vec![v; statistics.num_containers()])
+                unimplemented!();
             }
             other => {
                 Err(DataFusionError::Internal(format!(

And then running this command

cargo test -p datafusion -- pruning
...

failures:

---- physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast stdout ----
thread 'physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast' panicked at 'not implemented', datafusion/core/src/physical_optimizer/pruning.rs:208:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: datafusion::physical_optimizer::pruning::PruningPredicate::prune
             at ./src/physical_optimizer/pruning.rs:208:17
   4: datafusion::physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast
             at ./src/physical_optimizer/pruning.rs:1963:22
   5: datafusion::physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast::{{closure}}
             at ./src/physical_optimizer/pruning.rs:1949:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str stdout ----
thread 'physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str' panicked at 'not implemented', datafusion/core/src/physical_optimizer/pruning.rs:208:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: datafusion::physical_optimizer::pruning::PruningPredicate::prune
             at ./src/physical_optimizer/pruning.rs:208:17
   4: datafusion::physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str
             at ./src/physical_optimizer/pruning.rs:2046:22
   5: datafusion::physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str::{{closure}}
             at ./src/physical_optimizer/pruning.rs:2030:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@@ -390,8 +406,13 @@ fn build_statistics_record_batch<S: PruningStatistics>(
}

let schema = Arc::new(Schema::new(fields));
RecordBatch::try_new(schema, arrays)
.map_err(|err| DataFusionError::Plan(err.to_string()))
// provide the count in case there were no needed statistics
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, if the pruning predicate did not call for any columns (b/c it was "true" for example) this would error because in earlier versions of arrow-rs RecordBatches could not have 0 columns -- now that they can, we use that feature to build a 0 column RecordBatch when the predicate has no columns that can be transformed

@@ -237,7 +237,7 @@ async fn prune_int32_scalar_fun() {
test_prune(
Scenario::Int32,
"SELECT * FROM t where abs(i) = 1",
Some(4),
Some(0),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file are due to the fact that these predicates no longer error (instead they evaluate successfully to all true, meaning all record groups are kept)

The 3 is the number of rows that pass this predicate -- so no actual answers are changed here

@alamb alamb changed the title Add more cast pruning tests, handle unsupported predicates better Add additional pruning tests with casts, handle unsupported predicates better Sep 12, 2022
@@ -1921,6 +1946,45 @@ mod tests {
assert_eq!(result, expected_ret);
}

#[test]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests in this file are the main reason for this PR

@alamb alamb marked this pull request as ready for review September 12, 2022 20:08
@liukun4515
Copy link
Contributor

I will review it later.

Comment on lines 189 to 197
.as_any()
.downcast_ref::<BooleanArray>()
.ok_or_else(|| {
DataFusionError::Internal(format!(
"Expected pruning predicate evaluation to be BooleanArray, \
but was {:?}",
array
))
})?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably just use the downcast_value! macro here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea -- done in 5ad465c

@alamb
Copy link
Contributor Author

alamb commented Sep 19, 2022

Unless there are any objections, I plan to merge this PR tomorrow

@alamb
Copy link
Contributor Author

alamb commented Sep 20, 2022

All comments have been addressed, so merging this in

@alamb alamb merged commit c7f3a70 into apache:master Sep 20, 2022
@alamb alamb deleted the alamb/cast_pruning_tests branch September 20, 2022 09:59
@ursabot
Copy link

ursabot commented Sep 20, 2022

Benchmark runs are scheduled for baseline = 67002a0 and contender = c7f3a70. c7f3a70 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants