fix: descend into QueryStageExec when detecting DPP scans for shuffle fallback #3981
Closed
andygrove wants to merge 1 commit into apache:main
…-prep passes

stageContainsDPPScan used a plain s.child.exists(...) to find a FileSourceScanExec with a PlanExpression partition filter. Under AQE, once a child stage materializes, its subtree is replaced by a ShuffleQueryStageExec (a LeafExecNode whose children is Seq.empty), and .exists cannot descend through it. The DPP scan becomes invisible on the stage-prep pass, so the same shuffle that correctly fell back to Spark at initial planning gets converted to Comet the second time the rule runs, producing plan-shape inconsistencies across the two passes.

Walk the tree explicitly and descend into QueryStageExec.plan so both passes see the same subtree and reach the same decision.

Adds CometDppFallbackConsistencySuite, which wraps the DPP shuffle in a real ShuffleQueryStageExec (exactly the wrapper AQE produces) and asserts the fallback decision stays the same.
Member
Author
better fix in #3982
Which issue does this PR close?
Related to #3949.
Rationale for this change
`stageContainsDPPScan` in `CometShuffleExchangeExec` (introduced by #3879) uses a plain `s.child.exists(...)` walk to decide whether the shuffle subtree contains a DPP scan. Under AQE, once a child stage materializes, its subtree is replaced by a `ShuffleQueryStageExec` — a `LeafExecNode` whose `children` is `Seq.empty`. `.exists` cannot descend through it, so the DPP scan becomes invisible on the stage-prep pass.
The consequence: the same shuffle that correctly fell back to Spark at initial planning gets converted to Comet the second time the rule runs, because `stageContainsDPPScan` returns `false` once the inner stage has materialized. That flip produces plan-shape inconsistencies across the two passes — the suspected mechanism behind the `ColumnarToRowExec` canonicalization assertion in #3949.
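The blind spot can be illustrated with a toy model that has no Spark dependency. This is a minimal sketch, not the Comet implementation: `Plan`, `Scan`, `Project`, `QueryStage`, `DppWalk`, and `hasDppFilter` are hypothetical stand-ins, and `QueryStage` only imitates the property described above (a leaf whose subtree lives in `plan` rather than in `children`).

```scala
// Toy stand-ins for physical plan nodes. These are NOT the real Spark or Comet classes.
sealed trait Plan {
  def children: Seq[Plan]
  // Children-based search: test this node, then recurse into `children` only.
  def exists(p: Plan => Boolean): Boolean = p(this) || children.exists(_.exists(p))
}

// Hypothetical scan node; `hasDppFilter` models a dynamic-pruning partition filter.
final case class Scan(hasDppFilter: Boolean) extends Plan {
  val children: Seq[Plan] = Seq.empty
}

// Hypothetical unary operator sitting above the scan or the stage.
final case class Project(child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
}

// Models a materialized stage: a leaf that keeps its subtree in `plan`, not in `children`.
final case class QueryStage(plan: Plan) extends Plan {
  val children: Seq[Plan] = Seq.empty
}

object DppWalk {
  private def isDppScan(p: Plan): Boolean = p match {
    case Scan(hasDpp) => hasDpp
    case _            => false
  }

  // Plain walk: stops at every leaf, so anything wrapped in a QueryStage is invisible.
  def containsDppPlain(root: Plan): Boolean = root.exists(isDppScan)

  // Explicit walk: unwraps QueryStage.plan before continuing with the children.
  def containsDppDescending(root: Plan): Boolean = root match {
    case QueryStage(inner) => containsDppDescending(inner)
    case other             => isDppScan(other) || other.children.exists(containsDppDescending)
  }
}
```

In this model, `DppWalk.containsDppPlain(Project(Scan(true)))` is `true`, but the same check on `Project(QueryStage(Scan(true)))` is `false`, which is the flip between passes. `DppWalk.containsDppDescending` returns `true` for both shapes, so the decision stays stable once the stage is wrapped.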
What changes are included in this PR?
How are these changes tested?
I have not been able to reproduce the full #3949 crash even with the flip demonstrated, so this PR is positioned as a correctness fix for the decision-stability invariant. It may or may not close #3949 on its own.