
fix: make shuffle fallback decisions sticky across planning passes#3982

Merged
andygrove merged 11 commits into apache:main from andygrove:fix-sticky-dpp-fallback-via-tag
Apr 17, 2026
Conversation

@andygrove andygrove commented Apr 17, 2026

Which issue does this PR close?

Closes #3949
Closes #3870

There is a follow-on issue, #3984, to clean this up, since this was implemented quite hastily.

Rationale for this change

Comet's shuffle-support predicates (nativeShuffleSupported, columnarShuffleSupported in CometShuffleExchangeExec) are called at both initial planning and AQE stage-prep. Some fallback decisions depend on the surrounding plan shape — for example, the presence of a DPP scan below a shuffle. Between the two passes, AQE wraps already-materialized child stages in ShuffleQueryStageExec, a LeafExecNode whose children is Seq.empty, so a naive re-evaluation can flip the decision.

Concrete mechanism (the trigger for #3949): stageContainsDPPScan walks s.child.exists(...) looking for a FileSourceScanExec with a PlanExpression partition filter. When the DPP subtree sits under a materialized ShuffleQueryStageExec, .exists stops at the wrapper and the DPP scan becomes invisible. The same shuffle that correctly fell back to Spark at initial planning is converted to Comet at stage prep. The resulting plan has a CometColumnarExchange above a materialized row-mode stage whose subtree still contains a Spark-fallback DPP scan — a boundary that breaks during BroadcastExchangeExec.doCanonicalize with AssertionError at ColumnarToRowExec.<init>(Columnar.scala:70) (because child.supportsColumnar is false after canonicalization).
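The visibility loss described above reduces to a tiny model. This is a hedged sketch with made-up types (Plan, ShuffleNode, DppScanNode, StageWrapperNode; none of these are Comet or Spark classes) showing how a recursive exists walk stops at a leaf wrapper:

```scala
// Toy plan tree: `exists` recurses through `children`, exactly like a
// naive subtree walk over SparkPlan nodes.
sealed trait Plan {
  def children: Seq[Plan]
  def exists(p: Plan => Boolean): Boolean =
    p(this) || children.exists(_.exists(p))
}
case class ShuffleNode(child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
}
case object DppScanNode extends Plan {
  def children: Seq[Plan] = Seq.empty
}
// Stand-in for a materialized ShuffleQueryStageExec: a leaf whose real
// subtree is NOT reachable through `children`.
case class StageWrapperNode(hidden: Plan) extends Plan {
  def children: Seq[Plan] = Seq.empty
}
```

At initial planning the scan is reachable, so the walk sees it; after the AQE-style wrap it is hidden behind the leaf, so the same walk returns false and the fallback decision flips.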

What changes are included in this PR?

  • New CometFallback object with markForFallback / isMarkedForFallback. Distinct from CometExplainInfo.EXTENSION_INFO on purpose: the explain tag accumulates informational reasons (including rolled-up child reasons) and treating any presence as a fallback signal is too coarse — it breaks legitimate conversions (e.g. a shuffle tagged "Comet native shuffle not enabled" should still be eligible for columnar shuffle). The fallback tag exists only for decisions that must remain sticky.
  • nativeShuffleSupported and columnarShuffleSupported short-circuit on isMarkedForFallback(s) at the top.
  • The DPP branch of columnarShuffleSupported now uses markForFallback instead of withInfo, so the decision persists across AQE replanning.

Design rule (worth noting for future contributors): markForFallback must only be used for decisions that mean the whole stage falls back regardless of shuffle mode (DPP qualifies). Per-mode reasons — e.g. "unsupported data type for native only" — must keep using withInfo, because the native check runs before the columnar check and a sticky marker set in native would prevent columnar from getting a shot.
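A minimal sketch of that split, assuming simplified signatures (the real Comet code stores a tag on the SparkPlan node rather than in a map, and the boolean parameters here stand in for the remaining per-mode checks):

```scala
import scala.collection.mutable

// Hedged sketch: method names follow the PR, structure is simplified.
object CometFallbackSketch {
  // Stand-in for a per-node plan tag.
  private val fallbackReason = mutable.Map.empty[AnyRef, String]

  def markForFallback(node: AnyRef, reason: String): Unit =
    fallbackReason(node) = reason // the real code also records it via withInfo

  def isMarkedForFallback(node: AnyRef): Boolean =
    fallbackReason.contains(node)

  // Both predicates short-circuit on the sticky marker. Per-mode reasons
  // (e.g. a type unsupported for native only) must NOT set the marker, or
  // the columnar check would never get a chance after the native check.
  def nativeShuffleSupported(s: AnyRef, nativeChecksPass: Boolean): Boolean =
    !isMarkedForFallback(s) && nativeChecksPass

  def columnarShuffleSupported(s: AnyRef, columnarChecksPass: Boolean): Boolean =
    !isMarkedForFallback(s) && columnarChecksPass
}
```

Once markForFallback is called on a shuffle, both predicates return false on every later pass regardless of what the (possibly reshaped) subtree looks like.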

How are these changes tested?

Two new test suites:

  1. CometShuffleFallbackStickinessSuite — unit-level invariant:

    • Marked shuffle: both nativeShuffleSupported and columnarShuffleSupported return false.
    • Negative case: explain info alone (e.g. "Comet native shuffle not enabled") must not imply the sticky marker.
    • End-to-end-on-one-shuffle: plan a real DPP query, run the support check once to set the marker, swap the shuffle's child for an opaque LeafExecNode that hides the DPP subtree (mimicking a materialized stage), and assert the second call still returns false.
  2. CometDppFallbackRepro3949Suite — end-to-end reproduction of the crash:

    • mechanism: builds a real DPP plan and asserts the sticky marker survives an AQE-style child wrap (withNewChildren preserves tree-node tags).
    • end-to-end: runs five DPP-flavored queries across three AQE variants; captures any collect() failure and any Comet shuffle in the final plan whose subtree still contains a DPP scan.
      • Verified on main (without the fix): q4 — a UNION ALL of three DPP-using subqueries with an outer rollup aggregate — crashes under smj+aqe and smj+aqe+coalesce with the exact #3949 stack trace ([INTERNAL_ERROR] The "collect" action failed): an AssertionError through ColumnarToRowExec.<init>(Columnar.scala:70) during BroadcastExchangeExec.doCanonicalize.
      • With this PR applied: both suites pass.

The existing "DPP fallback" and "DPP fallback avoids inefficient Comet shuffle (#3874)" tests in CometExecSuite continue to pass.

…ain tags

columnarShuffleSupported and nativeShuffleSupported now short-circuit to false
if the shuffle already carries a Comet fallback tag from a prior rule pass.
This mirrors the established pattern in CometNativeScan.isSupported and
preserves the earlier decision instead of re-deriving it from the current
plan shape.

Background: Comet's shuffle-support checks run at both initial planning and
AQE stage-prep. Between those passes, AQE wraps completed child stages in
ShuffleQueryStageExec (a LeafExecNode whose children is Seq.empty). A naive
re-evaluation can therefore flip the decision — e.g., stageContainsDPPScan
uses s.child.exists(...) to find a FileSourceScanExec with a PlanExpression
partition filter, but that walk stops at the stage wrapper and the DPP scan
becomes invisible. The same shuffle then falls back to Spark at initial
planning and gets converted to Comet at stage prep, producing plan-shape
inconsistencies across the two passes.

Adds CometShuffleFallbackStickinessSuite:
- direct: tag a synthetic shuffle and assert both support predicates return
  false
- end-to-end: build a DPP query, observe pass 1 falls back and tags the
  shuffle, then swap the shuffle's child for an opaque leaf (mimicking a
  materialized stage) and assert pass 2 still falls back
Initial version used hasExplainInfo as the short-circuit condition, but that
also matches informational reasons from earlier checks (e.g. 'Comet native
shuffle not enabled' left behind by nativeShuffleSupported). That caused
legitimate columnar shuffle conversions to be blocked, regressing the
columnar shuffle suite (e.g. 'columnar shuffle on struct including nulls').

Introduce a dedicated CometFallback tag distinct from the explain-info tag:
- markForFallback(node, reason) records the decision and also writes the
  reason to the explain channel for visibility.
- isMarkedForFallback(node) is what the shuffle-support predicates check.

nativeShuffleSupported and columnarShuffleSupported now short-circuit on
isMarkedForFallback, and the DPP branch uses markForFallback instead of
withInfo so the decision sticks across AQE stage-prep passes.

Updated CometShuffleFallbackStickinessSuite to cover both the positive case
(marked node must fall back) and the regression case (informational explain
info must NOT force fallback).
Adds CometDppFallbackRepro3949Suite with two tests:

1. mechanism: builds a DPP plan, observes the initial columnarShuffleSupported
   decision is "fall back" (DPP visible), then wraps the shuffle's child in an
   opaque LeafExecNode mirroring how ShuffleQueryStageExec presents to .exists
   walks, and asserts the decision stays "fall back" because the sticky
   CometFallback marker carries over via withNewChildren.

2. end-to-end: runs five DPP-flavored queries across three AQE variants and
   looks for either a collect() failure with the apache#3949 stack-trace signature
   (AssertionError through ColumnarToRowExec.<init> during
   BroadcastExchangeExec.doCanonicalize) or a Comet shuffle in the final plan
   whose subtree still contains a DPP scan.
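The tag preservation that the mechanism test relies on (the PR notes withNewChildren preserves tree-node tags) can be sketched with a toy tree node; TaggedNode is hypothetical, not Spark's TreeNode:

```scala
// Simplified model: tags live on the node itself and are kept when the
// children are replaced, which is what makes the fallback decision sticky.
case class TaggedNode(
    name: String,
    children: Seq[TaggedNode] = Seq.empty,
    tags: Map[String, String] = Map.empty) {
  def withNewChildren(newChildren: Seq[TaggedNode]): TaggedNode =
    copy(children = newChildren) // children reshaped, tags preserved
}

val taggedShuffle = TaggedNode(
  "shuffle",
  children = Seq(TaggedNode("dppScan")),
  tags = Map("comet.fallback" -> "DPP"))
// AQE-style wrap: the DPP subtree disappears behind an opaque leaf,
// but the marker set in pass 1 is still on the shuffle node.
val rewrapped = taggedShuffle.withNewChildren(Seq(TaggedNode("stageWrapper")))
```

Even though a subtree walk on `rewrapped` can no longer find the DPP scan, the tag is still present, so the support predicates keep returning false.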

Verified on main (without the sticky-marker fix): q4 (UNION ALL of three
DPP-using subqueries with outer rollup aggregate) crashes under smj+aqe and
smj+aqe+coalesce with the exact apache#3949 stack trace, and the end-to-end test
fails. With this branch's fix applied, both tests pass.
@andygrove andygrove marked this pull request as ready for review April 17, 2026 19:05
parthchandra (Contributor) left a comment:


lgtm. Some minor comments

* re-deriving the decision from the (possibly reshaped) subtree. Also records the reason in the
* usual explain channel so it surfaces in extended explain output.
*/
def markForFallback[T <: TreeNode[_]](node: T, reason: String): T = {
parthchandra (Contributor):
This is much better than using explain info


  if (CometConf.COMET_DPP_FALLBACK_ENABLED.get() && stageContainsDPPScan(s)) {
-   withInfo(s, "Stage contains a scan with Dynamic Partition Pruning")
+   markForFallback(s, "Stage contains a scan with Dynamic Partition Pruning")
parthchandra (Contributor):
Should we add the reason to explainInfo?

andygrove (Member Author):
markForFallback does call withInfo so we still get the explain info recorded

andygrove (Member Author):
Having two approaches seems a bit hacky though. I filed a follow-on issue to clean this up: #3984.

}

// scalastyle:off println
println("=== mechanism check ===")
parthchandra (Contributor):

Use log instead of println, or perhaps you meant to remove this?

andygrove (Member Author):

I removed the debug println

@andygrove andygrove merged commit a9e219c into apache:main Apr 17, 2026
175 of 177 checks passed
@andygrove andygrove deleted the fix-sticky-dpp-fallback-via-tag branch April 17, 2026 23:44
andygrove (Member Author):

Merged. Thanks @parthchandra
