
[SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice #38626

Closed

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

This is a followup of #38557. We found that some optimizer rules can't be applied twice (those in the `Once` batch), and running the rule `OptimizeSubqueries` twice breaks them because it optimizes the subqueries a second time.

This PR partially reverts #38557 so that `OptimizeSubqueries` is again invoked from within `RowLevelOperationRuntimeGroupFiltering`, for newly added subqueries only. We don't fully revert #38557 because it is still beneficial to use an IN subquery directly instead of the DPP framework, as there is no join.
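The `Once`-batch problem can be sketched with a toy model (all types here are hypothetical stand-ins, not Spark's actual Catalyst classes): a `Once` batch assumes its rules fire exactly one time, so re-optimizing an already-optimized subquery effectively applies a non-idempotent rule twice.

```scala
// Toy model of Catalyst batch strategies (hypothetical, not Spark's API).
sealed trait Strategy
case object Once extends Strategy
final case class FixedPoint(maxIterations: Int) extends Strategy

final case class Plan(name: String, hints: Int = 0)

// A hypothetical non-idempotent rule: its output is correct only if it runs once.
def addHint(p: Plan): Plan = p.copy(hints = p.hints + 1)

def runBatch(plan: Plan, strategy: Strategy, rule: Plan => Plan): Plan =
  strategy match {
    case Once            => rule(plan)
    case FixedPoint(max) => (1 to max).foldLeft(plan)((p, _) => rule(p))
  }

val optimizedOnce  = runBatch(Plan("subquery"), Once, addHint)
// Optimizing the subquery again re-runs the Once batch over it:
val optimizedTwice = runBatch(optimizedOnce, Once, addHint)
// optimizedTwice now carries a duplicated hint, violating the Once invariant.
```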

Why are the changes needed?

Fix the optimizer.

Does this PR introduce any user-facing change?

No

How was this patch tested?

N/A

@github-actions github-actions bot added the SQL label Nov 11, 2022
@cloud-fan
Contributor Author

cc @aokolnychyi @viirya

Comment on lines +54 to +56
// We can't run `OptimizeSubqueries` in this batch, as it will optimize the subqueries
// twice which may break some optimizer rules that can only be applied once. The rule below
// only invokes `OptimizeSubqueries` to optimize newly added subqueries.
Member

Hm? This batch has only PartitionPruning and RowLevelOperationRuntimeGroupFiltering. What are the "some optimizer rules"? PartitionPruning?

Member

Oh, you mean other Once batches in SparkOptimizer.defaultBatches?

Member

But in Optimizer, where OptimizeSubqueries also runs, there are other Once batches too, and that seems fine?

Contributor Author

All the optimizer batches are optimizing the same query plan. If OptimizeSubqueries appears twice, it means the subqueries are optimized twice.

Note that most optimizer rules don't optimize subqueries; they need OptimizeSubqueries to invoke the entire optimizer on the subqueries recursively.
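A minimal sketch of that recursion, using made-up types rather than Spark's API: ordinary rules rewrite only the outer plan body, so one dedicated rule must walk into nested subquery plans and apply the full rule set to each of them.

```scala
// Hypothetical, simplified plan type: a body plus nested subquery plans.
final case class QueryPlan(body: String, subqueries: List[QueryPlan])

// "The entire optimizer": fold every rule over a plan body.
def applyRules(body: String, rules: List[String => String]): String =
  rules.foldLeft(body)((b, rule) => rule(b))

// An OptimizeSubqueries-style step: rewrite the outer body, then recurse
// so each nested subquery plan also sees the full rule set.
def optimize(plan: QueryPlan, rules: List[String => String]): QueryPlan =
  QueryPlan(applyRules(plan.body, rules),
            plan.subqueries.map(optimize(_, rules)))
```

If this `optimize` is invoked on the same plan twice, every subquery goes through the rule set twice, which is exactly the hazard discussed above.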

Contributor Author

Inspired by #38619, maybe we don't need to invoke the entire optimizer here, but just a few rules, to optimize this subquery.

Member

> All the optimizer batches are optimizing the same query plan. If OptimizeSubqueries appears twice, it means the subqueries are optimized twice.

Oh, got it: you actually mean OptimizeSubqueries is applied twice (here and in Optimizer). I thought that running OptimizeSubqueries by itself here would break some rules which cannot run twice.

Member

@viirya viirya left a comment

This partial revert looks good. Maybe we can also consider https://github.com/apache/spark/pull/38626/files#r1021139971 later.

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

@cloud-fan
Contributor Author

The failed test is known to be flaky:

SPARK-37555: spark-sql should pass last unclosed comment to backend *** FAILED *** (2 minutes, 10 seconds)
[info]   =======================
[info]   CliSuite failure output
[info]   =======================
[info]   Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/runner/work/spark/spark/target/tmp/spark-1a51c443-7a22-4f29-884f-1a3f1d02221b;create=true --hiveconf hive.exec.scratchdir=/home/runner/work/spark/spark/target/tmp/spark-10ac614b-3692-4289-b430-246582674024 --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/home/runner/work/spark/spark/target/tmp/spark-68e14c83-bf1a-4e12-a08e-b04a3255f8a6
[info]   Exception: java.util.concurrent.TimeoutException: Futures timed out after [2 minutes]

@cloud-fan cloud-fan closed this in 632784d Nov 14, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>