Add additional pruning tests with casts, handle unsupported predicates better #3454

alamb · 2022-09-12T16:59:40Z

~~Draft as it builds on #3422 from @liukun4515 (I will rebase once that is merged)~~

Which issue does this PR close?

Rationale for this change

I wanted to add more tests in #3422 for pruning with casts -- see #3422 (comment)

While writing these tests, I also cleaned up the handling of predicates that reference no columns (that is, ones that were simplified to constants)

What changes are included in this PR?

Additional tests for pruning with statistics
Support pruning predicates that require no columns (such as a constant false or true) without error

Are there any user-facing changes?

alamb · 2022-09-12T18:55:39Z

datafusion/core/src/physical_optimizer/pruning.rs

+            },
+            // result was a column
+            ColumnarValue::Scalar(ScalarValue::Boolean(v)) => {
+                let v = v.unwrap_or(true); // None -> true per comments above


This is new code to handle predicates that evaluate to a constant boolean (such as true)

Without this code and the code below that makes record batches with 0 columns, some of the new negative pruning tests (the ones with invalid casts) return an error rather than true
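For readers skimming the thread, the scalar branch described above can be sketched in isolation. This is a minimal stand-alone model, not the DataFusion code: `Evaluated` and `keep_mask` are hypothetical names standing in for `ColumnarValue` and the body of `PruningPredicate::prune`, and treating a null as "keep" in the array case is an assumption carried over from the "None -> true" comment in the diff.

```rust
/// Simplified stand-in for the result of evaluating a pruning predicate.
/// (Hypothetical type for illustration; the real code matches on `ColumnarValue`.)
#[derive(Debug)]
enum Evaluated {
    /// One nullable boolean per container.
    Array(Vec<Option<bool>>),
    /// A single nullable boolean, produced when the predicate needs no columns.
    Scalar(Option<bool>),
}

/// Returns one flag per container: `true` means "keep", `false` means "can be skipped".
fn keep_mask(result: Evaluated, num_containers: usize) -> Result<Vec<bool>, String> {
    match result {
        Evaluated::Array(values) => {
            if values.len() != num_containers {
                return Err(format!(
                    "expected {} values, got {}",
                    num_containers,
                    values.len()
                ));
            }
            // Assumption: an unknown (null) result is treated as "keep".
            Ok(values.into_iter().map(|v| v.unwrap_or(true)).collect())
        }
        // A constant predicate applies equally to every container,
        // so the single value is repeated once per container.
        Evaluated::Scalar(v) => Ok(vec![v.unwrap_or(true); num_containers]),
    }
}
```

In this model, `keep_mask(Evaluated::Scalar(None), 3)` keeps all three containers, which mirrors the "return true rather than error" behavior the new tests check for.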

liukun4515 (Sep 14, 2022): Is there a test case for this branch, where evaluation returns a Scalar?

alamb (Sep 14, 2022): Yes -- this code is covered. I verified this by applying this change locally:

--- a/datafusion/core/src/physical_optimizer/pruning.rs
+++ b/datafusion/core/src/physical_optimizer/pruning.rs
@@ -205,8 +205,7 @@ impl PruningPredicate {
             },
             // result was a column
             ColumnarValue::Scalar(ScalarValue::Boolean(v)) => {
-                let v = v.unwrap_or(true); // None -> true per comments above
-                Ok(vec![v; statistics.num_containers()])
+                unimplemented!();
             }
             other => {
                 Err(DataFusionError::Internal(format!(

And then running this command

cargo test -p datafusion -- pruning
...
failures:

---- physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast stdout ----
thread 'physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast' panicked at 'not implemented', datafusion/core/src/physical_optimizer/pruning.rs:208:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: datafusion::physical_optimizer::pruning::PruningPredicate::prune
             at ./src/physical_optimizer/pruning.rs:208:17
   4: datafusion::physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast
             at ./src/physical_optimizer/pruning.rs:1963:22
   5: datafusion::physical_optimizer::pruning::tests::prune_int32_col_lte_zero_cast::{{closure}}
             at ./src/physical_optimizer/pruning.rs:1949:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str stdout ----
thread 'physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str' panicked at 'not implemented', datafusion/core/src/physical_optimizer/pruning.rs:208:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: datafusion::physical_optimizer::pruning::PruningPredicate::prune
             at ./src/physical_optimizer/pruning.rs:208:17
   4: datafusion::physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str
             at ./src/physical_optimizer/pruning.rs:2046:22
   5: datafusion::physical_optimizer::pruning::tests::prune_int32_col_eq_zero_cast_as_str::{{closure}}
             at ./src/physical_optimizer/pruning.rs:2030:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

alamb · 2022-09-12T18:57:30Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -390,8 +406,13 @@ fn build_statistics_record_batch<S: PruningStatistics>(
    }

    let schema = Arc::new(Schema::new(fields));
-    RecordBatch::try_new(schema, arrays)
-        .map_err(|err| DataFusionError::Plan(err.to_string()))
+    // provide the count in case there were no needed statistics


Previously, if the pruning predicate did not need any columns (because it was "true", for example), this would error, because earlier versions of arrow-rs did not allow a RecordBatch with 0 columns. Now that they do, we use that feature to build a 0-column RecordBatch when the predicate has no columns that can be transformed
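The reason an explicit count is needed can be shown with a toy model. This is only a sketch: `ToyBatch` is a hypothetical type, not the arrow-rs `RecordBatch`, and its rules (infer the row count from the first column, otherwise require it to be supplied) are assumptions chosen to mirror the situation described above.

```rust
/// Toy stand-in for a record batch (hypothetical; not the arrow-rs type).
#[derive(Debug, PartialEq)]
struct ToyBatch {
    columns: Vec<Vec<i64>>,
    num_rows: usize,
}

impl ToyBatch {
    /// Builds a batch, taking the row count from `row_count` when given
    /// and otherwise inferring it from the first column.
    fn try_new(columns: Vec<Vec<i64>>, row_count: Option<usize>) -> Result<Self, String> {
        let num_rows = match (row_count, columns.first()) {
            (Some(n), _) => n,
            (None, Some(first)) => first.len(),
            // With no columns there is nothing to infer the length from.
            (None, None) => return Err("no columns and no explicit row count".to_string()),
        };
        // Every column must agree with the chosen row count.
        if columns.iter().any(|c| c.len() != num_rows) {
            return Err(format!("all columns must have {} rows", num_rows));
        }
        Ok(ToyBatch { columns, num_rows })
    }
}
```

Here `ToyBatch::try_new(vec![], Some(4))` succeeds with four rows and no columns, while the same call with `None` fails. That is the role the "provide the count" comment in the diff plays: the number of containers is supplied directly because no statistics column is available to infer it from.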

alamb · 2022-09-12T18:59:16Z

datafusion/core/tests/parquet_pruning.rs

@@ -237,7 +237,7 @@ async fn prune_int32_scalar_fun() {
    test_prune(
        Scenario::Int32,
        "SELECT * FROM t where abs(i) = 1",
-        Some(4),
+        Some(0),


The changes in this file are due to the fact that these predicates no longer error (instead they evaluate successfully to all true, meaning all row groups are kept)

The 3 is the number of rows that pass this predicate -- so no actual answers are changed here

alamb · 2022-09-12T19:01:06Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -1921,6 +1946,45 @@ mod tests {
        assert_eq!(result, expected_ret);
    }

+    #[test]


The new tests in this file are the main reason for this PR

liukun4515 · 2022-09-13T11:14:23Z

I will review it later.

andygrove · 2022-09-13T20:52:59Z

datafusion/core/src/physical_optimizer/pruning.rs

+                    .as_any()
+                    .downcast_ref::<BooleanArray>()
+                    .ok_or_else(|| {
+                        DataFusionError::Internal(format!(
+                            "Expected pruning predicate evaluation to be BooleanArray, \
+                             but was {:?}",
+                            array
+                        ))
+                    })?;


You could probably just use the downcast_value! macro here

Great idea -- done in 5ad465c
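As a rough illustration of what such a macro saves, here is a stand-alone sketch of the downcast-or-error pattern. The macro `downcast_or_err!` and the function `as_bool_slice` are hypothetical names written for this example; this is not DataFusion's `downcast_value!`, whose exact expansion and error type are not shown in this thread.

```rust
use std::any::Any;

/// Hypothetical helper macro (not DataFusion's `downcast_value!`):
/// downcasts a `&dyn Any` to `&$ty`, or produces a descriptive error string.
macro_rules! downcast_or_err {
    ($value:expr, $ty:ty) => {
        $value.downcast_ref::<$ty>().ok_or_else(|| {
            format!(
                "could not downcast value to {}",
                std::any::type_name::<$ty>()
            )
        })
    };
}

/// Interprets a type-erased value as a list of booleans, using `Vec<bool>`
/// as a stand-in for a `BooleanArray`.
fn as_bool_slice(value: &dyn Any) -> Result<&[bool], String> {
    let v = downcast_or_err!(value, Vec<bool>)?;
    Ok(v.as_slice())
}
```

Passing a `&Vec<bool>` returns the slice, and passing any other type (for example `&5_i32`) returns an `Err`, so each call site shrinks from the nine-line `ok_or_else` chain quoted above to a single line.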


alamb · 2022-09-19T11:13:03Z

Unless there are any objections, I plan to merge this PR tomorrow

alamb · 2022-09-20T09:58:54Z

All comments have been addressed, so merging this in

ursabot · 2022-09-20T10:01:46Z

Benchmark runs are scheduled for baseline = 67002a0 and contender = c7f3a70. c7f3a70 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb marked this pull request as draft September 12, 2022 16:59

alamb changed the title ~~Alamb/cast pruning tests~~ Add more cast pruning tests, handle unsupported predicates better Sep 12, 2022

alamb mentioned this pull request Sep 12, 2022

support cast/try_cast in prune with signed integer and decimal #3422

Merged

github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions labels Sep 12, 2022

alamb force-pushed the alamb/cast_pruning_tests branch from 7d27c57 to c9c475d Compare September 12, 2022 18:55

github-actions bot removed the logical-expr Logical plan and expressions label Sep 12, 2022

alamb commented Sep 12, 2022


alamb changed the title ~~Add more cast pruning tests, handle unsupported predicates better~~ Add additional pruning tests with casts, handle unsupported predicates better Sep 12, 2022

alamb commented Sep 12, 2022


Add tests for pruning, support pruning with constant expressions

15aa56d

alamb force-pushed the alamb/cast_pruning_tests branch from c9c475d to 15aa56d Compare September 12, 2022 19:28

alamb marked this pull request as ready for review September 12, 2022 20:08

alamb requested a review from liukun4515 September 13, 2022 11:08

andygrove reviewed Sep 13, 2022


alamb added 2 commits September 14, 2022 14:33

Merge remote-tracking branch 'apache/master' into alamb/cast_pruning_tests

aca4544

Use downcast_any!

5ad465c

This was referenced Sep 14, 2022

[MINOR] Change downcast_value! macro so it does not need to use use std::any::type_name; #3484

Merged

Ensure the row count is preserved when coalescing over empty records #3439

Merged

Merge remote-tracking branch 'apache/master' into alamb/cast_pruning_tests

9b11764

chore: Remove uneeded use

6bb0e96

alamb merged commit c7f3a70 into apache:master Sep 20, 2022

alamb deleted the alamb/cast_pruning_tests branch September 20, 2022 09:59
