Skip to content

[SPARK-56694][SQL] Fix DynamicPruningSubquery canonicalization for buildKeys#55644

Closed
peter-toth wants to merge 2 commits intoapache:masterfrom
peter-toth:SPARK-56694-fix-dynamicpruningsubquery-canonicalization
Closed

[SPARK-56694][SQL] Fix DynamicPruningSubquery canonicalization for buildKeys#55644
peter-toth wants to merge 2 commits intoapache:masterfrom
peter-toth:SPARK-56694-fix-dynamicpruningsubquery-canonicalization

Conversation

@peter-toth
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

DynamicPruningSubquery.canonicalized now normalizes buildKeys relative to buildQuery.output using QueryPlan.normalizeExpressions instead of calling .canonicalized on each key expression independently.

Why are the changes needed?

The previous implementation called buildKeys.map(_.canonicalized), which canonicalized each key expression in isolation and therefore preserved the original ExprId values of attribute references. When two DynamicPruningSubquery instances referenced the same logical build query (e.g. different copies of a CTE branch) but with different ExprIds, their canonical buildKeys differed even though the queries were semantically identical.

QueryPlan.normalizeExpressions(key, buildQuery.output) replaces each attribute reference with ExprId(ordinal) where ordinal is the attribute's position in buildQuery.output. Two copies of the same CTE branch will place the same attribute at the same ordinal, so the canonical buildKeys become identical regardless of the original ExprId values.

Does this PR introduce any user-facing change?

No. This is an internal canonicalization fix. It may improve query plans by enabling PlanMerger to deduplicate more DynamicPruningSubquery expressions, but does not change observable query results.

How was this patch tested?

Added a unit test in DynamicPruningSubquerySuite that constructs two DynamicPruningSubquery instances with identical build query structure but fresh (distinct) ExprIds, and asserts that their canonicalized forms are equal.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

…ildKeys

### What changes were proposed in this pull request?

`DynamicPruningSubquery.canonicalized` now normalizes `buildKeys` relative to `buildQuery.output` using `QueryPlan.normalizeExpressions` instead of calling `.canonicalized` on each key expression independently.

### Why are the changes needed?

The previous implementation called `buildKeys.map(_.canonicalized)`, which canonicalized each key expression in isolation and therefore preserved the original `ExprId` values of attribute references. When two `DynamicPruningSubquery` instances referenced the same logical build query (e.g. different copies of a CTE branch) but with different `ExprId`s, their canonical `buildKeys` differed even though the queries were semantically identical.

`QueryPlan.normalizeExpressions(key, buildQuery.output)` replaces each attribute reference with `ExprId(ordinal)` where `ordinal` is the attribute's position in `buildQuery.output`. Two copies of the same CTE branch will place the same attribute at the same ordinal, so the canonical `buildKeys` become identical regardless of the original `ExprId` values.

### Does this PR introduce _any_ user-facing change?

No. This is an internal canonicalization fix. It may improve query plans by enabling `PlanMerger` to deduplicate more `DynamicPruningSubquery` expressions, but does not change observable query results.

### How was this patch tested?

Added a unit test in `DynamicPruningSubquerySuite` that constructs two `DynamicPruningSubquery` instances with identical build query structure but fresh (distinct) `ExprId`s, and asserts that their `canonicalized` forms are equal.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @peter-toth !

: +- ReusedExchange (114)
+- ReusedExchange (116)
: : +- BroadcastExchange (41)
: : +- * Project (40)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change in the golden plan that the aggregate calculation is moved to a subquery of Project (40) is unnecessary and must be related to some kind of PlanMerger issue.
Since both plans are semantically correct, let me investigate this plan change in a separate ticket and keep this PR as a canonicalization bugfix.

@dongjoon-hyun
Copy link
Copy Markdown
Member

Merged to master for Apache Spark 4.2.0 as a kind of improvement.

However, feel free to make a backporting PR if you want to have this in old release branches, @peter-toth .

@peter-toth
Copy link
Copy Markdown
Contributor Author

Merged to master for Apache Spark 4.2.0 as a kind of improvement.

Thank you @dongjoon-hyun for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants