[SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters #38511

Closed
wants to merge 1 commit

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Nov 4, 2022

What changes were proposed in this pull request?

Today, Spark does column pruning in 3 steps:

  1. The rule PushDownPredicates pushes down Filters as close to the scan node as possible.
  2. The rule ColumnPruning generates Project below many operators, to prune columns before evaluating these operators. One exception is Filter. We do not generate Project below Filter as it conflicts with PushDownPredicates.
  3. After the above 2 steps, we should have a plan pattern like Project(..., Filter(..., Relation)), and we have rules (DS v1 and v2 have different rules) to match this pattern using PhysicalOperation, then apply filter pushdown and column pruning.

This works fine in most cases, but we cannot always combine adjacent Filters into one, due to non-deterministic predicates. For example, in Project(a, Filter(rand() > 0.5, Filter(rand() < 0.8, Relation))), PhysicalOperation can only match Filter(rand() < 0.8, Relation), and we can't do column pruning today.

This PR fixes this problem by adding a variant of PhysicalOperation: ScanOperation. It keeps all the adjacent Filters so that it can match more plan patterns and do column pruning better. The caller sides are also updated to restore the Filters w.r.t. their original order in the query plan.
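The difference can be sketched with a toy Python model (illustrative names and structure only, not Spark's actual Scala implementation): a `PhysicalOperation`-style walk must stop collecting once merging would involve a non-deterministic filter, while a `ScanOperation`-style walk keeps each adjacent `Filter` as a separate condition and can reach the relation:

```python
# Toy plan nodes: ("relation", cols) or ("filter", predicate, is_deterministic, child).
# This is an illustrative model, not Spark's actual implementation.

def collect_filters(plan, keep_multiple):
    """Collect adjacent Filter nodes top-down.

    keep_multiple=False models PhysicalOperation: filters are merged into one
    condition, so after the first filter only deterministic predicates can join.
    keep_multiple=True models ScanOperation: filters stay separate, so every
    adjacent Filter is collected regardless of determinism.
    """
    collected = []  # list of (predicate, is_deterministic)
    while plan[0] == "filter":
        _, pred, det, child = plan
        mergeable = det and all(d for _, d in collected)
        if collected and not keep_multiple and not mergeable:
            break
        collected.append((pred, det))
        plan = child
    return [p for p, _ in collected], plan

plan = ("filter", "rand() > 0.5", False,
        ("filter", "rand() < 0.8", False,
         ("relation", ["a", "b", "c"])))

preds, rest = collect_filters(plan, keep_multiple=False)
print(preds, rest[0])   # ['rand() > 0.5'] filter  -> pattern never reaches the relation
preds, rest = collect_filters(plan, keep_multiple=True)
print(preds, rest[0])   # ['rand() > 0.5', 'rand() < 0.8'] relation -> pruning can apply
```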

Why are the changes needed?

Apply column pruning in more cases.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests

@github-actions github-actions bot added the SQL label Nov 4, 2022
@cloud-fan cloud-fan force-pushed the column-pruning branch 2 times, most recently from db83295 to d12ecb5 on November 8, 2022 06:23
@cloud-fan cloud-fan changed the title [WIP][SPARK-41017][SQL] Do not push Filter through reference-only Project [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters Nov 8, 2022
@cloud-fan
Contributor Author

cc @viirya @sigmod @hvanhovell

@cloud-fan
Contributor Author

also cc @wangyum @ulysses-you

@sigmod
Contributor

sigmod commented Nov 11, 2022

cc @rkkorlapati-db

@@ -85,15 +72,25 @@ object PhysicalOperation extends AliasHelper with PredicateHelper {
// projects. We need to meet the following conditions to do so:
// 1) no Project collected so far or the collected Projects are all deterministic
// 2) the collected filters and this filter are all deterministic, or this is the
- //    first collected filter.
+ //    first collected filter. This condition can be relaxed if `canKeepMultipleFilters` is
Member

TBH, the comment here is hard to understand.

l @ LogicalRelation(fsRelation: HadoopFsRelation, _, table, _)) =>
// We can only push down the bottom-most filter to the relation, as `ScanOperation` decided to
// not merge these filters and we need to keep their evaluation order.
val filters = allFilters.lastOption.getOrElse(Nil)
Member

So for filter pushdown, we will use the last filter. For schema pruning, we will use all the filters.
I wonder if we should return both allFilters and pushdownFilters to make the syntax clear.
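That split can be sketched in Python (hypothetical names, not the actual API): column pruning looks at all collected filters, while only the bottom-most filter group, the one adjacent to the relation, is safe to push down without reordering non-deterministic evaluation:

```python
# Hypothetical sketch: filter groups are listed top-down, each group holding the
# predicates of one original Filter node. Illustrative only, not Spark's API.

def split_filters(all_filters):
    """Return (pruning_filters, pushdown_filters): every group is visible to
    column pruning, but only the last (bottom-most) group, adjacent to the
    relation, may be pushed into the scan, preserving evaluation order."""
    pushdown = all_filters[-1] if all_filters else []
    return all_filters, pushdown

all_filters = [["rand() > 0.5"], ["rand() < 0.8", "a IS NOT NULL"]]
pruning, pushdown = split_filters(all_filters)
print(pushdown)  # ['rand() < 0.8', 'a IS NOT NULL']
```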

val (fields, filters, child, _) = collectProjectsAndFilters(plan, alwaysInline)
Some((fields.getOrElse(child.output), filters, child))
}
protected def canKeepMultipleFilters: Boolean
Member

Nit: add a simple comment

Comment on lines 82 to 83
val canIncludeThisFilter = filters.isEmpty || {
filters.length == 1 && filters.head.forall(_.deterministic) && condition.deterministic
}
Member

Previously, this was `filters.forall(_.deterministic)`; why is it relaxed here too? I think it is not under the `canKeepMultipleFilters` condition below.

Contributor Author

This is the core change of this PR. PhysicalOperation returns a single filter condition, which means it combines filters, and we have to make sure all the filters are deterministic. ScanOperation returns multiple filter conditions and does not have this restriction.

Comment on lines 142 to 146
      val projectedFilters = filters.map(_.map(_.transformDown {
        case projectionOverSchema(expr) => expr
      }))
-     val newFilterCondition = projectedFilters.reduce(And)
-     Filter(newFilterCondition, leafNode)
+     val newFilterConditions = projectedFilters.map(_.reduce(And))
+     newFilterConditions.foldRight[LogicalPlan](leafNode)((cond, plan) => Filter(cond, plan))
Member

Hmm, is that the same as before?

This constructs a new Filter with projected predicates (reduced by And). But this change reduces all projected predicates from all adjoining Filters, which can be non-deterministic?
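What the `foldRight` does can be modeled in a few lines of Python (illustrative only, not Spark's Scala code): each collected condition group is reduced with `And` into one predicate per original `Filter`, and folding from the right wraps the leaf node from the bottom-most group outwards, so the rebuilt `Filter` chain keeps the original top-down order:

```python
from functools import reduce

# Illustrative model: a plan is a nested tuple, and filter condition groups are
# listed top-down in the order they were collected from the original plan.

def and_all(preds):
    """Reduce one group's predicates into a single conjunction string."""
    return reduce(lambda a, b: f"({a} AND {b})", preds)

def rebuild(filter_groups, leaf):
    """foldRight: start from the leaf and wrap it with Filter nodes from the
    last (bottom-most) group outwards, restoring the original order."""
    plan = leaf
    for group in reversed(filter_groups):
        plan = ("filter", and_all(group), plan)
    return plan

groups = [["rand() > 0.5"], ["rand() < 0.8"]]
plan = rebuild(groups, ("relation",))
print(plan)
# ('filter', 'rand() > 0.5', ('filter', 'rand() < 0.8', ('relation',)))
```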

@cloud-fan
Contributor Author

pushed a refactor to make the code easier to understand, please take another look, thanks!

@wangyum wangyum closed this in f3ad94d Nov 17, 2022
@wangyum
Member

wangyum commented Nov 17, 2022

Merged to master.

cloud-fan added a commit that referenced this pull request Nov 17, 2022
### What changes were proposed in this pull request?

This is a followup of #38511 to fix a mistake: we should respect the original `Filter` operator order when re-constructing the query plan.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #38684 from cloud-fan/column-pruning.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Nov 22, 2022
…nondeterministic predicates

### What changes were proposed in this pull request?

This PR fixes a regression caused by #38511 . For `FROM t WHERE rand() > 0.5 AND col = 1`, we can still push down `col = 1` because we don't guarantee the predicates evaluation order within a `Filter`.

This PR updates `ScanOperation` to consider this case and bring back the previous pushdown behavior.
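The regression fix relies on the fact that conjuncts within a single `Filter` have no guaranteed evaluation order, so the deterministic ones can still be split out and pushed down. A toy Python sketch (hypothetical helper, not Spark's API):

```python
# Toy sketch of splitting one Filter's conjunction: deterministic conjuncts may
# be pushed to the scan, the rest stay in the Filter. Illustrative only.

def split_conjuncts(conjuncts, is_deterministic):
    """Partition a single Filter's conjuncts into (pushable, remaining)."""
    pushable = [c for c in conjuncts if is_deterministic(c)]
    remaining = [c for c in conjuncts if not is_deterministic(c)]
    return pushable, remaining

# Models: FROM t WHERE rand() > 0.5 AND col = 1
conjuncts = ["rand() > 0.5", "col = 1"]
pushable, remaining = split_conjuncts(conjuncts, lambda c: "rand()" not in c)
print(pushable)   # ['col = 1']       -> pushed down to the scan
print(remaining)  # ['rand() > 0.5']  -> stays in the Filter
```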

### Why are the changes needed?

fix perf regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #38746 from cloud-fan/filter.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
dongjoon-hyun pushed a commit that referenced this pull request Jan 23, 2024
### What changes were proposed in this pull request?
This PR aims to upgrade Arrow from 14.0.2 to 15.0.0; this version fixes the compatibility issue with Netty 4.1.104.Final (GH-39265).

Additionally, since the `arrow-vector` module uses `eclipse-collections` to replace `netty-common` as a compile-level dependency, Apache Spark has added a dependency on `eclipse-collections` after upgrading to use Arrow 15.0.0.

### Why are the changes needed?
The new version brings the following major changes:

Bug Fixes
GH-34610 - [Java] Fix valueCount and field name when loading/transferring NullVector
GH-38242 - [Java] Fix incorrect internal struct accounting for DenseUnionVector#getBufferSizeFor
GH-38254 - [Java] Add reusable buffer getters to char/binary vectors
GH-38366 - [Java] Fix Murmur hash on buffers less than 4 bytes
GH-38387 - [Java] Fix JDK8 compilation issue with TestAllTypes
GH-38614 - [Java] Add VarBinary and VarCharWriter helper methods to more writers
GH-38725 - [Java] decompression in Lz4CompressionCodec.java does not set writer index

New Features and Improvements
GH-38511 - [Java] Add getTransferPair(Field, BufferAllocator, CallBack) for StructVector and MapVector
GH-14936 - [Java] Remove netty dependency from arrow-vector
GH-38990 - [Java] Upgrade to flatc version 23.5.26
GH-39265 - [Java] Make it run well with the netty newest version 4.1.104

The full release notes are as follows:

- https://arrow.apache.org/release/15.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44797 from LuciferYang/SPARK-46718.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>