[SPARK-35455][SQL] Unify empty relation optimization between normal and AQE optimizer #32602

ulysses-you · 2021-05-20T05:58:11Z

What changes were proposed in this pull request?

Remove EliminateUnnecessaryJoin and use AQEPropagateEmptyRelation instead.
Eliminate join, aggregate, limit, repartition, sort, and generate operators over empty relations, where doing so is beneficial.

Why are the changes needed?

Make the EliminateUnnecessaryJoin optimization apply in more cases.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add test.

ulysses-you · 2021-05-20T06:00:37Z

cc @maropu @cloud-fan @c21

cloud-fan · 2021-05-20T06:36:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateUnnecessaryJoin.scala

@@ -59,19 +66,33 @@ object EliminateUnnecessaryJoin extends Rule[LogicalPlan] {
        case Some(count) => hasRow == (count > 0)
        case _ => false
      }
+
+    case LocalRelation(_, data, isStreaming) if !isStreaming =>


if we want to handle LocalRelation, then it's not AQE specific and we can do it in the normal optimizer?

Yes, but currently we have no way to run the normal optimizer on the AQE side. Maybe in the future we can make some rules from Optimizer available in AQEOptimizer as well?

Since AQE is on by default, it's not a big issue but more about code cleanliness. How about this:

EliminateUnnecessaryJoin should only deal with LocalRelation, and appear in both the normal optimizer and AQE optimizer

AQE optimizer adds a new rule to turn empty query stage into empty LocalRelation

AQE optimizer adds a new rule to deal with non empty query stage for left semi/anti joins

SGTM, and I just found an existing rule, PropagateEmptyRelation.

How about this update?

Add a rule ConvertToLocalRelation to turn empty query stage into empty LocalRelation

Make PropagateEmptyRelation appear in the AQE optimizer

Reduce EliminateUnnecessaryJoin to only handle naaj/semi/anti joins

SparkQA · 2021-05-20T06:44:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43252/

SparkQA · 2021-05-20T08:14:17Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43252/

SparkQA · 2021-05-20T09:50:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43258/

SparkQA · 2021-05-20T10:14:17Z

Test build #138730 has finished for PR 32602 at commit 5097247.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-20T10:26:12Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43258/

cloud-fan · 2021-05-20T13:04:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ConvertToLocalRelation.scala

+ */
+object ConvertToLocalRelation extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case l @ LogicalQueryStage(_, stage: QueryStageExec) if stage.resultOption.get().isDefined &&


Sorry I made the wrong decision. This may change the output partitioning and is not always safe/beneficial (we may add extra shuffles in the planning phase later).

Looking at the optimizations for empty local relations, some of them are likely beneficial and we can always do: eliminate join, aggregate, limit, repartition, sort, generate

Some may not be that beneficial and we shouldn't do them: simplify union, eliminate project/filter/sample.

My new idea:

Create a new rule PropagateEmptyRelationBasic, which deals with local relations only, covers the optimizations that are not very beneficial (eliminate project, filter, etc.), and runs in the normal optimizer only

Create a new rule PropagateEmptyRelationAdvanced, which deals with both local relation and query stage, and is very beneficial (eliminate join, aggregate, etc.), and runs in both normal and AQE optimizer

The old EliminateUnnecessaryJoin and PropagateEmptyRelation rules should be removed and merged into the new rules.

PropagateEmptyRelationAdvanced may not be able to access QueryStageExec, which is in sql/core. We can let this rule take a checkRowCount function as a parameter.
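A minimal sketch of the "take checkRowCount as a parameter" idea, using a toy plan model rather than Spark's real LogicalPlan classes. Everything here (`Plan`, `Stage`, `PropagateEmptyAdvanced`, `stageRowCheck`) is hypothetical and only illustrates the dependency direction; the `(plan, hasRow)` shape of the check is an assumption modeled on the `hasRow == (count > 0)` line in the diff above.

```scala
object CheckRowCountSketch {
  // Toy plan model: NOT Spark's LogicalPlan hierarchy.
  sealed trait Plan
  case object EmptyLocal extends Plan                                  // empty local relation
  case class Scan(name: String) extends Plan                           // opaque leaf
  case class Stage(name: String, rowCount: Option[Long]) extends Plan  // stands in for a query stage
  case class Join(left: Plan, right: Plan) extends Plan                // inner join
  case class Aggregate(groupBy: Seq[String], child: Plan) extends Plan
  case class Sort(child: Plan) extends Plan

  // The rule receives the row-count check as a function, so it needs no
  // compile-time dependency on the module that defines the stage type.
  class PropagateEmptyAdvanced(checkRowCount: (Plan, Boolean) => Boolean) {
    private def isEmpty(p: Plan): Boolean =
      p == EmptyLocal || checkRowCount(p, false)

    def apply(plan: Plan): Plan = {
      // Rewrite children first (bottom-up), then the current node.
      val withNewChildren = plan match {
        case Join(l, r)      => Join(apply(l), apply(r))
        case Aggregate(g, c) => Aggregate(g, apply(c))
        case Sort(c)         => Sort(apply(c))
        case leaf            => leaf
      }
      withNewChildren match {
        case Join(l, r) if isEmpty(l) || isEmpty(r)      => EmptyLocal
        // A global aggregate (no grouping keys) still returns one row on empty input.
        case Aggregate(g, c) if g.nonEmpty && isEmpty(c) => EmptyLocal
        case Sort(c) if isEmpty(c)                       => EmptyLocal
        case other                                       => other
      }
    }
  }

  // AQE-side check: true when the stage's known row count agrees with `hasRow`.
  def stageRowCheck(plan: Plan, hasRow: Boolean): Boolean = plan match {
    case Stage(_, Some(count)) => hasRow == (count > 0)
    case _                     => false
  }
}
```

In this sketch the normal optimizer would instantiate the rule with a check that always returns false, e.g. `new PropagateEmptyAdvanced((_, _) => false)`, while the AQE side would pass `stageRowCheck`. Note that an empty stage is never replaced by a local relation on its own; only the expensive operator above it is removed.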

I see the issue. If we worry about the LocalRelation output partitioning, we can just mark LocalRelationExec output partitioning as SinglePartition to avoid an extra shuffle. But that doesn't work with optimizations like non-empty left semi/anti elimination or the multi-join case.

we can always do: eliminate join, aggregate

Not sure we can always do this if we don't want to introduce an extra shuffle. And this issue already exists in the current EliminateUnnecessaryJoin, for example with this plan:

Aggregate (same key with join) Join Inner LocalRelation xxx

Another idea: if we plan to support extra shuffles later and don't expect to introduce shuffles on the AQE optimizer side, would it be better to check the physical plan's requiredChildDistribution? We could allow only one node with a valid requiredChildDistribution (not UnspecifiedDistribution) per query stage, and skip the optimization if a query stage has two or more such nodes. Then we can run PropagateEmptyRelation safely.

If we turn a broadcast stage into local relation without changing other plan parts, seems we will broadcast the local relation again. SinglePartition can't help here.

If it's hard to avoid introducing shuffles in the AQE optimizer, how about adding an extra-shuffle check between the AQE optimizer and stage preparation? Then it won't affect the extra shuffle in stage preparation.

We can do that, but it seems not worth the complexity. It's not very helpful to turn query stage to local relation if we can't use it to eliminate expensive operators like join, agg, sort, etc.

I think my proposal is simpler and effective enough. Today's EliminateUnnecessaryJoin does not consider extra shuffles either and blindly eliminates joins (and with them the query stages) whenever possible.

SparkQA · 2021-05-20T13:18:02Z

Test build #138736 has finished for PR 32602 at commit 165077b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-21T07:35:15Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala

+ *     - Aggregate with all empty children and at least one grouping expression.
+ *     - Generate(Explode) with all empty children. Others like Hive UDTF may return results.
+ *
+ * @param checkRowCount At AQE side, we use the query stage stats to check the check.


At AQE side, we use this function to check if a plan has output rows or not

cloud-fan · 2021-05-21T07:41:47Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala

+    plan.output.map{ a => Alias(cast(Literal(null), a.dataType), a.name)(a.exprId) }
+
+  // We can not use transformUpWithPruning here since this rule is used by both normal Optimizer
+  // and AQE Optimizer. And this may only effective at AQE side.


ah good point. I think there is a way to overcome it:

Create an abstract class PropagateEmptyRelationBase that contains util functions and optimizes expensive operators such as join, aggregate, etc.

Create a rule PropagateEmptyRelation extends PropagateEmptyRelationBase that additionally optimizes project, filter, etc.

Create a rule AQEPropagateEmptyRelation extends PropagateEmptyRelationBase that overrides some util functions like isEmptyPlan.

Then these two rules can define their transformation pruning separately.
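The class layout proposed above could look roughly like the following. This is a sketch over a toy plan model, not the code merged in the PR: the names `PropagateEmptyBase`, `PropagateEmpty`, `AQEPropagateEmpty`, `commonRewrite`, and `Stage` are hypothetical stand-ins, and transformation pruning is left out entirely.

```scala
object RuleHierarchySketch {
  // Toy plan model: NOT Spark's LogicalPlan hierarchy.
  sealed trait Plan
  case object EmptyLocal extends Plan
  case class Scan(name: String) extends Plan
  case class Stage(name: String, rowCount: Option[Long]) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan

  // Shared utilities plus the rewrite of expensive operators.
  abstract class PropagateEmptyBase {
    // Subclasses may widen the notion of "empty".
    protected def isEmptyPlan(plan: Plan): Boolean = plan == EmptyLocal

    protected val commonRewrite: PartialFunction[Plan, Plan] = {
      case Join(l, r) if isEmptyPlan(l) || isEmptyPlan(r) => EmptyLocal
    }

    // Each concrete rule decides which rewrites it applies.
    protected def rewrite: PartialFunction[Plan, Plan]

    // Bottom-up traversal: children first, then the current node.
    final def apply(plan: Plan): Plan = {
      val withNewChildren = plan match {
        case Join(l, r)       => Join(apply(l), apply(r))
        case Project(cols, c) => Project(cols, apply(c))
        case leaf             => leaf
      }
      rewrite.applyOrElse(withNewChildren, identity[Plan])
    }
  }

  // Normal optimizer: also removes cheap operators over empty input.
  object PropagateEmpty extends PropagateEmptyBase {
    private val cheapRewrite: PartialFunction[Plan, Plan] = {
      case Project(_, c) if isEmptyPlan(c) => EmptyLocal
    }
    protected def rewrite: PartialFunction[Plan, Plan] =
      commonRewrite.orElse(cheapRewrite)
  }

  // AQE optimizer: a finished stage with zero rows also counts as empty.
  object AQEPropagateEmpty extends PropagateEmptyBase {
    override protected def isEmptyPlan(plan: Plan): Boolean = plan match {
      case Stage(_, Some(0L)) => true
      case other              => super.isEmptyPlan(other)
    }
    protected def rewrite: PartialFunction[Plan, Plan] = commonRewrite
  }
}
```

Because `isEmptyPlan` is looked up when the partial function runs, the join rewrite defined in the base class picks up the AQE override without being duplicated.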

SparkQA · 2021-05-21T08:18:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43316/

SparkQA · 2021-05-21T08:54:16Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43316/

ulysses-you · 2021-05-21T10:31:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala

+  }
+
+  // TODO we need use transformUpWithPruning instead of transformUp
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {


@cloud-fan, if we want to use transformUpWithPruning on the AQE optimizer side, we need to do some more work, such as adding a pattern to LogicalQueryStage. So this PR does not make that change and just uses transformUp. Do you think that's OK?

yea it's OK as it's not a regression. cc @sigmod

@cloud-fan it seems we don't need to use transformUpWithPruning here, since the AQE optimizer always runs once rather than to a fixed point?

SparkQA · 2021-05-21T11:45:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43325/

SparkQA · 2021-05-21T11:56:46Z

Test build #138793 has finished for PR 32602 at commit 7b80db0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-21T12:17:18Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43325/

SparkQA · 2021-05-21T12:40:35Z

Test build #138803 has finished for PR 32602 at commit 48dd92a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport

cloud-fan · 2021-05-24T10:35:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala

@@ -27,7 +28,9 @@ import org.apache.spark.util.Utils
 */
 class AQEOptimizer(conf: SQLConf) extends RuleExecutor[LogicalPlan] {
  private val defaultBatches = Seq(
-    Batch("Eliminate Unnecessary Join", Once, EliminateUnnecessaryJoin),
+    Batch("Propagate Empty LocalRelation", Once,


LocalRelation -> Relations?

cloud-fan · 2021-05-24T10:38:04Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

+    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
+      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+      Seq(
+        // left semi join and empty left side


We can't optimize this before this PR?

Yeah, we can't. Previously we only checked the right side for LeftSemi/LeftAnti.

Also, the test should use different columns for the filter and the join, in case InferFiltersFromConstraints makes the right side empty. Updated it.
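For reference, the relational semantics behind "empty left side" versus "empty right side" can be sketched as below. This is a toy model with hypothetical names (`Plan`, `Join`, `simplify`), not the rule from the PR; it only encodes which side being empty makes which join type collapse.

```scala
object JoinEmptinessSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType

  // Toy plan model: NOT Spark's LogicalPlan hierarchy.
  sealed trait Plan
  case object Empty extends Plan
  case class Scan(name: String) extends Plan
  case class Join(left: Plan, right: Plan, joinType: JoinType) extends Plan

  def simplify(join: Join): Plan = (join.left, join.right, join.joinType) match {
    // All three join types only ever emit rows derived from the left side.
    case (Empty, _, _)                => Empty
    // No right rows means no matches, so inner and semi joins emit nothing.
    case (_, Empty, Inner | LeftSemi) => Empty
    // No right rows means every left row is unmatched, so anti join keeps them all.
    case (left, Empty, LeftAnti)      => left
    case _                            => join
  }
}
```

With this model, `simplify(Join(Empty, Scan("r"), LeftSemi))` yields `Empty`, which corresponds to the "left semi join and empty left side" case in the test above.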

SparkQA · 2021-05-24T10:43:49Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43389/

SparkQA · 2021-05-24T13:11:03Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43392/

SparkQA · 2021-05-24T13:42:36Z

Test build #138867 has finished for PR 32602 at commit 767dd92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-24T13:46:15Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43392/

cloud-fan · 2021-05-24T14:15:14Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

@@ -1307,6 +1308,69 @@ class AdaptiveQueryExecSuite
    }
  }

+  test("SPARK-35455: Enhance EliminateUnnecessaryJoin - single join") {


let's update the test name and PR title: Unify empty relation optimization between normal and AQE optimizer

Updated it and also updated the PR title.

cloud-fan · 2021-05-24T14:15:41Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

+    }
+  }
+
+  test("SPARK-35455: Enhance EliminateUnnecessaryJoin - multi join") {


SparkQA · 2021-05-24T16:34:36Z

Test build #138870 has finished for PR 32602 at commit 0754936.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-25T02:14:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43421/

SparkQA · 2021-05-25T02:50:34Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43421/

c21

LGTM

SparkQA · 2021-05-25T05:40:37Z

Test build #138899 has finished for PR 32602 at commit 624e45e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-25T08:59:58Z

thanks, merging to master!

ulysses-you · 2021-05-25T10:05:36Z

@cloud-fan thank you for the discussion and merging ! @c21 thank you for the review !

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala

cloud-fan · 2022-01-07T16:28:25Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala

+ *    - Project/Filter/Sample with all empty children.
+ *
+ * The reason why we don't apply this rule at AQE optimizer side is: the benefit is not big enough
+ * and it may introduce extra exchanges.


After more thought, I think this is a big performance issue if we can't propagate empty relations through project/filter which are quite common. The risk of introducing new shuffles is relatively small compared to this.

@ulysses-you can we move all the logic to PropagateEmptyRelationBase? PropagateEmptyRelation should not have any extra logic.

…nd AQE optimizer

### What changes were proposed in this pull request?

* remove `EliminateUnnecessaryJoin`, using `AQEPropagateEmptyRelation` instead.
* eliminate join, aggregate, limit, repartition, sort, generate which is beneficial.

### Why are the changes needed?

Make `EliminateUnnecessaryJoin` available with more case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes apache#32602 from ulysses-you/SPARK-35455.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

ulysses-you added 2 commits May 20, 2021 13:42

Enhance EliminateUnnecessaryJoin

4514d27

fix

5097247

github-actions bot added the SQL label May 20, 2021

cloud-fan reviewed May 20, 2021


LocalRelation early

8df68d9

fix

165077b

github-actions bot added the BUILD label May 20, 2021

ulysses-you force-pushed the SPARK-35455 branch from e49b8b3 to 165077b Compare May 20, 2021 08:36

cloud-fan reviewed May 20, 2021


ulysses-you added 2 commits May 21, 2021 12:37

split PropagateEmptyRelation

2f5fa20

fix

7b80db0

ulysses-you force-pushed the SPARK-35455 branch from c27ff28 to 7b80db0 Compare May 21, 2021 06:38

cloud-fan reviewed May 21, 2021


split PropagateEmptyRelationBase

05e074c

fix

48dd92a

ulysses-you commented May 21, 2021


cloud-fan reviewed May 24, 2021


ulysses-you added 4 commits May 24, 2021 19:23

test

47e0c3a

comment

d9ca6da

Relations

2fead86

indentation

0754936

cloud-fan reviewed May 24, 2021


cloud-fan approved these changes May 24, 2021


test name

a6213ea

ulysses-you changed the title ~~[SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin~~ [SPARK-35455][SQL] Unify empty relation optimization between normal and AQE optimizer May 25, 2021

nit

624e45e

c21 approved these changes May 25, 2021


cloud-fan closed this in 631077d May 25, 2021

ulysses-you deleted the SPARK-35455 branch May 25, 2021 10:05

cloud-fan reviewed May 31, 2021


sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala

cloud-fan reviewed Jan 7, 2022


cloud-fan mentioned this pull request Jan 7, 2022

[SPARK-35442][SQL] Support propagate empty relation through aggregate #35135

Closed
