Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38959][SQL][FOLLOWUP] Optimizer batch PartitionPruning should optimize subqueries #38557

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup to #36304 to simplify RowLevelOperationRuntimeGroupFiltering. It does 3 things:

  1. run OptimizeSubqueries in the batch PartitionPruning, so that RowLevelOperationRuntimeGroupFiltering does not need to invoke it manually.
  2. skip dpp subquery in OptimizeSubqueries, to avoid the issue fixed by [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning #33664
  3. RowLevelOperationRuntimeGroupFiltering creates InSubquery instead of DynamicPruningSubquery, so that it can be optimized by OptimizeSubqueries later. This also avoids unnecessary planning overhead of DynamicPruningSubquery, as there is no join and we can only run it as a subquery.

Why are the changes needed?

code simplification

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

cc @aokolnychyi @viirya @wangyum

DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false)
}
dynamicPruningSubqueries.reduce(And)
val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any downsides of rewriting DynamicPruningSubquery into DynamicPruningExpression directly instead of relying on PlanDynamicPruningFilters and PlanAdaptiveDynamicPruningFilters?

I see some special branches for exchange reuse in those rules that would not apply now.

Copy link
Contributor Author

@cloud-fan cloud-fan Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any downside. We can only reuse broadcast if the DPP filter is derived from a join, which doesn't apply here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I was originally worried we could miss some future optimizations given that dynamic pruning for row-level operations would go through a different route compared to the normal DPP.

One alternative could be to extend DynamicPruningSubquery with a flag whether it should be optimized or not. Up to you, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My rationale is, what we really need is a subquery here. This is completely different from dynamic partition pruning. One limitation is DS v2 runtime filter pushdown only applies to DynamicPruningExpression. We can probably fix that and accept normal non-correlated subqueries as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, DS v2 runtime filtering framework is fairly limited at this point.

Comment on lines +323 to +325
// Do not optimize DPP subquery, as it was created from optimized plan and we should not
// optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse.
case d: DynamicPruningSubquery => d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Just wondering that is this particularly related to SPARK-38959?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because this PR adds OptimizeSubqueries to the batch PartitionPruning and we should not break #33664

@@ -66,7 +65,7 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic
}

// optimize subqueries to rewrite them as joins and trigger job planning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment can be removed.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to removing an explicit reference to OptimizeSubqueries. I am a bit worried we would plan dynamic pruning for row-level operations differently compared regular DPP. However, that seems safe at this point.

Thanks for looking into this, @cloud-fan!

@cloud-fan
Copy link
Contributor Author

thanks for review, merging to master!

@cloud-fan cloud-fan closed this in 865a3de Nov 10, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, late LGTM. Thank you all.

cloud-fan added a commit that referenced this pull request Nov 14, 2022
### What changes were proposed in this pull request?

This is a followup of #38557 . We found that some optimizer rules can't be applied twice (those in the `Once` batch), but running the rule `OptimizeSubqueries` twice breaks it as it optimizes subqueries twice.

This PR partially reverts #38557 to still invoke `OptimizeSubqueries` in `RowLevelOperationRuntimeGroupFiltering`. We don't fully revert #38557 because it's still beneficial to use IN subquery directly instead of using DPP framework as there is no join.

### Why are the changes needed?

Fix the optimizer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…d optimize subqueries

### What changes were proposed in this pull request?

This is a followup to apache#36304 to simplify `RowLevelOperationRuntimeGroupFiltering`. It does 3 things:
1. run `OptimizeSubqueries` in the batch `PartitionPruning`, so that `RowLevelOperationRuntimeGroupFiltering` does not need to invoke it manually.
2. skip dpp subquery in `OptimizeSubqueries`, to avoid the issue fixed by apache#33664
3. `RowLevelOperationRuntimeGroupFiltering` creates `InSubquery` instead of `DynamicPruningSubquery`, so that it can be optimized by `OptimizeSubqueries` later. This also avoids unnecessary planning overhead of `DynamicPruningSubquery`, as there is no join and we can only run it as a subquery.

### Why are the changes needed?

code simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#38557 from cloud-fan/help.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

This is a followup of apache#38557 . We found that some optimizer rules can't be applied twice (those in the `Once` batch), but running the rule `OptimizeSubqueries` twice breaks it as it optimizes subqueries twice.

This PR partially reverts apache#38557 to still invoke `OptimizeSubqueries` in `RowLevelOperationRuntimeGroupFiltering`. We don't fully revert apache#38557 because it's still beneficial to use IN subquery directly instead of using DPP framework as there is no join.

### Why are the changes needed?

Fix the optimizer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes apache#38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants