Contributor

@alamb alamb commented Jun 3, 2021

Closes #490

This PR adds pruning support for boolean predicates such as flag_col and not flag_col, so that they can be used (on their own or within other predicates) to prune row groups from parquet files.

It does not add code to handle flag_col = true and flag_col != false, because those are simplified in the ConstantEvaluation pass. Both forms currently error in the pruning logic and continue to do so.
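
A minimal sketch of the kind of simplification the paragraph above relies on: comparisons of a boolean column against a literal collapse to the bare column or its negation before pruning ever sees them. `MiniExpr`, `simplify`, and `bool_col` are hypothetical names for illustration; they are not DataFusion's `Expr` type or its optimizer pass.

```rust
/// Hypothetical miniature expression type (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum MiniExpr {
    Column(String),
    Literal(bool),
    Not(Box<MiniExpr>),
    Equal(Box<MiniExpr>, Box<MiniExpr>),
    NotEqual(Box<MiniExpr>, Box<MiniExpr>),
}

/// Build `col` when the comparison keeps true rows, `NOT col` otherwise.
fn bool_col(name: String, keep_true: bool) -> MiniExpr {
    if keep_true {
        MiniExpr::Column(name)
    } else {
        MiniExpr::Not(Box::new(MiniExpr::Column(name)))
    }
}

/// Rewrite `col = <bool>` and `col != <bool>` into `col` or `NOT col`.
/// Any other expression is returned unchanged.
fn simplify(expr: MiniExpr) -> MiniExpr {
    use MiniExpr::{Column, Equal, Literal, NotEqual};
    match expr {
        Equal(left, right) => match (*left, *right) {
            // `col = true` keeps true rows, `col = false` keeps false rows
            (Column(c), Literal(b)) | (Literal(b), Column(c)) => bool_col(c, b),
            (l, r) => Equal(Box::new(l), Box::new(r)),
        },
        NotEqual(left, right) => match (*left, *right) {
            // `col != false` keeps true rows, `col != true` keeps false rows
            (Column(c), Literal(b)) | (Literal(b), Column(c)) => bool_col(c, !b),
            (l, r) => NotEqual(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

Under this sketch, both `flag_col = true` and `flag_col != false` reduce to the plain `flag_col` form, which is one of the two shapes this PR teaches the pruning logic to handle.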

This ended up being a larger change than I wanted, because the logic that creates the col_min and col_max references was intertwined throughout PruningExpressionBuilder.

Rationale for this change

See #490

What changes are included in this PR?

Major changes:

  1. Encapsulate stat_column_req into a new RequiredStatColumns struct
  2. Move expression reference and rewriting logic to RequiredStatColumns
  3. Add rules for boolean columns
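
As a rough illustration of items 1 and 2, the sketch below shows one way a struct could own the list of required statistics columns and hand out the col_min / col_max names used in the rewritten predicate. The field layout, the method names other than `is_stat_column_missing`, and the `_min` / `_max` naming scheme are assumptions made for this example; the real `RequiredStatColumns` in the PR may look quite different.

```rust
/// Which statistic is needed (assumed to resemble the `StatisticsType`
/// that appears in the diff).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StatisticsType {
    Min,
    Max,
}

/// Hypothetical stand-in for the PR's `RequiredStatColumns`.
#[derive(Debug, Default)]
struct RequiredStatColumns {
    /// (source column, statistic, name used in the rewritten predicate)
    columns: Vec<(String, StatisticsType, String)>,
}

impl RequiredStatColumns {
    /// True if this (column, statistic) pair has not been requested yet.
    fn is_stat_column_missing(&self, column: &str, stat: StatisticsType) -> bool {
        !self
            .columns
            .iter()
            .any(|(c, s, _)| c.as_str() == column && *s == stat)
    }

    /// Return the statistics column name, recording the request once.
    fn stat_column_name(&mut self, column: &str, stat: StatisticsType) -> String {
        let suffix = match stat {
            StatisticsType::Min => "min",
            StatisticsType::Max => "max",
        };
        let name = format!("{}_{}", column, suffix);
        if self.is_stat_column_missing(column, stat) {
            self.columns.push((column.to_string(), stat, name.clone()));
        }
        name
    }

    fn min_column_name(&mut self, column: &str) -> String {
        self.stat_column_name(column, StatisticsType::Min)
    }

    fn max_column_name(&mut self, column: &str) -> String {
        self.stat_column_name(column, StatisticsType::Max)
    }
}
```

Keeping the bookkeeping in one place means both the comparison rules and the boolean rules can request a statistics column without duplicating the "already requested?" check.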

Are there any user-facing changes?

Additional predicates (boolean columns and their negation) can now be used to prune.

@alamb alamb changed the title Add support for boolean columns in pruning logic d10273a Add support for boolean columns in pruning logic Jun 3, 2021
self.scalar_expr
}

fn is_stat_column_missing(&self, statistics_type: StatisticsType) -> bool {
Contributor Author

This logic was just refactored into RequiredStatColumns so I could reuse it

use crate::logical_plan;
let field = schema.field_with_name(column_name).ok()?;

if matches!(field.data_type(), &DataType::Boolean) {
Contributor Author

Here is the actual logic / rules
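
To make the idea behind such rules concrete, here is a small sketch of why min/max statistics are enough to prune on a boolean column. It assumes the usual ordering false < true, and that an unknown statistic must be treated as "cannot prune". `BoolStats` and both functions are hypothetical; the expression the PR actually builds is not shown in this conversation and may be phrased differently.

```rust
/// Hypothetical per-container (e.g. row group) statistics for one
/// boolean column. `None` means the statistic is unknown.
#[derive(Debug, Clone, Copy)]
struct BoolStats {
    min: Option<bool>,
    max: Option<bool>,
}

/// Predicate `flag_col`: the container may hold a matching row only if
/// its maximum is true. Unknown statistics keep the container.
fn may_match_flag(stats: BoolStats) -> bool {
    stats.max.unwrap_or(true)
}

/// Predicate `not flag_col`: the container may hold a matching row only
/// if its minimum is false. Unknown statistics keep the container.
fn may_match_not_flag(stats: BoolStats) -> bool {
    !stats.min.unwrap_or(false)
}
```

A container whose min and max are both false can be skipped for `flag_col`, and one whose min and max are both true can be skipped for `not flag_col`; every other case has to be read.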

let result = p.prune(&statistics).unwrap_err();
assert!(
    result.to_string().contains(
        "Data type Boolean not supported for scalar operation on dyn array"
Contributor Author

These aren't great messages, but they are what happens on master today, and I figured I would document them for posterity (and maybe inspire people to help fix them).

Expr::BinaryExpr { left, op, right } => (left, *op, right),
Expr::Column(name) => {
    if let Some(expr) =
        build_single_column_expr(&name, schema, required_columns, false)
Contributor

This kind of pattern probably can be written a bit shorter with some combinators. Something like:

build_single_column_expr(&name, schema, required_columns).ok().or(Ok(unhandled))

Contributor Author

An excellent idea -- I will do so
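
The combinator suggestion in this thread can be sketched as follows. The sketch assumes the helper returns an `Option`, as the `if let Some(expr) =` in the diff suggests, and that the fallback is to return an `unhandled` expression wrapped in `Ok`; neither the else branch nor the final merged code is shown here, so both are assumptions. Strings stand in for expressions, and this `build_single_column_expr` is a hypothetical single-argument stand-in rather than the real function. Note that `Option` has no `.ok()` method, so `unwrap_or` plays that role in this version.

```rust
/// Hypothetical stand-in: rewrites a supported column, else `None`.
fn build_single_column_expr(name: &str) -> Option<String> {
    if name == "flag_col" {
        Some(format!("{}_max", name))
    } else {
        None
    }
}

/// The longer `if let` shape that the review comment refers to.
fn rewrite_verbose(name: &str, unhandled: String) -> Result<String, String> {
    if let Some(expr) = build_single_column_expr(name) {
        Ok(expr)
    } else {
        Ok(unhandled)
    }
}

/// The same behavior written with an `Option` combinator.
fn rewrite_with_combinator(name: &str, unhandled: String) -> Result<String, String> {
    Ok(build_single_column_expr(name).unwrap_or(unhandled))
}
```

Both functions return the same value for every input; the combinator form only removes the explicit branching.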

codecov-commenter commented Jun 3, 2021

Codecov Report

Merging #500 (5ceb541) into master (28b0dad) will increase coverage by 0.09%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master     #500      +/-   ##
==========================================
+ Coverage   75.92%   76.02%   +0.09%     
==========================================
  Files         154      154              
  Lines       26195    26421     +226     
==========================================
+ Hits        19889    20087     +198     
- Misses       6306     6334      +28     
| Impacted Files | Coverage Δ |
|---|---|
| datafusion/src/physical_optimizer/pruning.rs | 91.52% <93.75%> (+1.44%) ⬆️ |
| datafusion/src/optimizer/utils.rs | 48.22% <0.00%> (-1.78%) ⬇️ |
| ...ta/rust/core/src/serde/physical_plan/from_proto.rs | 38.79% <0.00%> (-0.85%) ⬇️ |
| ...sta/rust/core/src/serde/logical_plan/from_proto.rs | 35.96% <0.00%> (-0.22%) ⬇️ |
| datafusion/src/logical_plan/builder.rs | 90.04% <0.00%> (-0.05%) ⬇️ |
| datafusion/src/physical_plan/planner.rs | 80.32% <0.00%> (ø) |
| datafusion/src/optimizer/projection_push_down.rs | 98.46% <0.00%> (+<0.01%) ⬆️ |
| datafusion/src/logical_plan/expr.rs | 84.60% <0.00%> (+0.07%) ⬆️ |
| ...lista/rust/core/src/serde/logical_plan/to_proto.rs | 62.48% <0.00%> (+0.15%) ⬆️ |
| datafusion/src/sql/planner.rs | 84.37% <0.00%> (+0.26%) ⬆️ |

... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@Dandandan Dandandan left a comment

Nice feature + great refactoring

@alamb alamb force-pushed the alamb/prune_bool branch from 5ceb541 to 8059dbb Compare June 4, 2021 16:51
Contributor Author

alamb commented Jun 4, 2021

I rebased this PR and added a few more tests. The code is unchanged.

Member

@jorgecarleitao jorgecarleitao left a comment

Looks great to me. 👍 Very well documented also 💯

@alamb alamb merged commit 964f494 into apache:master Jun 4, 2021
@houqp houqp added the enhancement New feature or request label Jul 31, 2021
@alamb alamb deleted the alamb/prune_bool branch October 6, 2022 18:30
Development

Successfully merging this pull request may close these issues.

Support pruning for boolean columns

5 participants