
[SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice #38626

Closed

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

This is a followup of #38557. We found that some optimizer rules can't be applied twice (those in the `Once` batch), and running the rule `OptimizeSubqueries` twice breaks them because it optimizes the subqueries a second time.

This PR partially reverts #38557 so that `OptimizeSubqueries` is again invoked from within `RowLevelOperationRuntimeGroupFiltering`, for newly added subqueries only. We don't fully revert #38557 because it is still beneficial to use an IN subquery directly instead of the DPP framework, as there is no join.
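The `Once`-batch problem can be sketched with a toy model (all types here are hypothetical stand-ins, not Spark's actual Catalyst classes): a `Once` batch assumes its rules fire exactly one time, so re-optimizing an already-optimized subquery effectively applies a non-idempotent rule twice.

```scala
// Toy model of Catalyst batch strategies (hypothetical, not Spark's API).
sealed trait Strategy
case object Once extends Strategy
final case class FixedPoint(maxIterations: Int) extends Strategy

final case class Plan(name: String, hints: Int = 0)

// A hypothetical non-idempotent rule: its output is correct only if it runs once.
def addHint(p: Plan): Plan = p.copy(hints = p.hints + 1)

def runBatch(plan: Plan, strategy: Strategy, rule: Plan => Plan): Plan =
  strategy match {
    case Once            => rule(plan)
    case FixedPoint(max) => (1 to max).foldLeft(plan)((p, _) => rule(p))
  }

val optimizedOnce  = runBatch(Plan("subquery"), Once, addHint)
// Optimizing the subquery again re-runs the Once batch over it:
val optimizedTwice = runBatch(optimizedOnce, Once, addHint)
// optimizedTwice now carries a duplicated hint, violating the Once invariant.
```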

Why are the changes needed?

Fix the optimizer.

Does this PR introduce any user-facing change?

No

How was this patch tested?

N/A

@github-actions github-actions bot added the SQL label Nov 11, 2022
@cloud-fan
Contributor Author

cc @aokolnychyi @viirya

Comment on lines +54 to +56
// We can't run `OptimizeSubqueries` in this batch, as it will optimize the subqueries
// twice which may break some optimizer rules that can only be applied once. The rule below
// only invokes `OptimizeSubqueries` to optimize newly added subqueries.
Member

Hm? This batch has only PartitionPruning and RowLevelOperationRuntimeGroupFiltering. What are the "some optimizer rules"? PartitionPruning?

Member

Oh, you mean other Once batches in SparkOptimizer.defaultBatches?

Member

But in Optimizer, where OptimizeSubqueries also runs, there are other Once batches too, and that seems fine?

Contributor Author

All the optimizer batches are optimizing the same query plan. If OptimizeSubqueries appears twice, it means the subqueries are optimized twice.

Note that most optimizer rules don't optimize subqueries; they need OptimizeSubqueries to invoke the entire optimizer on the subqueries recursively.
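A minimal sketch of that recursion, using made-up types rather than Spark's API: ordinary rules rewrite only the outer plan body, so one dedicated rule must walk into nested subquery plans and apply the full rule set to each of them.

```scala
// Hypothetical, simplified plan type: a body plus nested subquery plans.
final case class QueryPlan(body: String, subqueries: List[QueryPlan])

// "The entire optimizer": fold every rule over a plan body.
def applyRules(body: String, rules: List[String => String]): String =
  rules.foldLeft(body)((b, rule) => rule(b))

// An OptimizeSubqueries-style step: rewrite the outer body, then recurse
// so each nested subquery plan also sees the full rule set.
def optimize(plan: QueryPlan, rules: List[String => String]): QueryPlan =
  QueryPlan(applyRules(plan.body, rules),
            plan.subqueries.map(optimize(_, rules)))
```

If this `optimize` is invoked on the same plan twice, every subquery goes through the rule set twice, which is exactly the hazard discussed above.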

Contributor Author

Inspired by #38619, maybe we don't need to invoke the entire optimizer here, but just a few rules, to optimize this subquery.

Member

> All the optimizer batches are optimizing the same query plan. If OptimizeSubqueries appears twice, it means the subqueries are optimized twice.

Oh, got it: you actually mean OptimizeSubqueries is applied twice (here and in Optimizer). I thought that running OptimizeSubqueries by itself here would break some rules which cannot run twice.

Member

@viirya viirya left a comment

This partial revert looks good. Maybe we can also consider https://github.com/apache/spark/pull/38626/files#r1021139971 later.

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

@cloud-fan
Contributor Author

The failed test is known to be flaky:

SPARK-37555: spark-sql should pass last unclosed comment to backend *** FAILED *** (2 minutes, 10 seconds)
[info]   =======================
[info]   CliSuite failure output
[info]   =======================
[info]   Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/runner/work/spark/spark/target/tmp/spark-1a51c443-7a22-4f29-884f-1a3f1d02221b;create=true --hiveconf hive.exec.scratchdir=/home/runner/work/spark/spark/target/tmp/spark-10ac614b-3692-4289-b430-246582674024 --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/home/runner/work/spark/spark/target/tmp/spark-68e14c83-bf1a-4e12-a08e-b04a3255f8a6
[info]   Exception: java.util.concurrent.TimeoutException: Futures timed out after [2 minutes]

@cloud-fan cloud-fan closed this in 632784d Nov 14, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>