[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset by shrirangmhalgi · Pull Request #55905 · apache/spark

shrirangmhalgi · 2026-05-15T16:31:42Z

What changes were proposed in this pull request?

Reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the "Replace Operators" optimizer batch.

Why are the changes needed?

dropDuplicates("id", "name").exceptAll(other) throws INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. The root cause is that RewriteExceptAll captures attribute references from left.output before ReplaceDeduplicateWithAggregate has replaced the Deduplicate node with an Aggregate(First(...)). The First() alias creates new exprIds that don't match what RewriteExceptAll baked into its Generate node.
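
To make the ordering dependency concrete, here is a minimal toy model in plain Python. It is not Catalyst code. It assumes, as described above, that replacing Deduplicate gives the non-key columns fresh exprIds, and it collapses the output of RewriteExceptAll into a single Generate node that records the left side's attributes. Every class and function name in it (Attr, fresh, missing_refs, and so on) is a hypothetical stand-in for illustration.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # toy stand-in for exprId allocation


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


def fresh(name):
    """Create an attribute with a new, unique expr_id."""
    return Attr(name, next(_next_id))


@dataclass
class Relation:
    output: list


@dataclass
class Deduplicate:
    keys: list
    child: object

    @property
    def output(self):
        return self.child.output


@dataclass
class Aggregate:
    output: list
    child: object


@dataclass
class ExceptAll:
    left: object
    right: object


@dataclass
class Generate:
    refs: list  # attributes captured when the node was built
    child: object


def replace_deduplicate(plan):
    """Toy rule: Deduplicate -> Aggregate; non-key columns get fresh ids."""
    if isinstance(plan, Deduplicate):
        child = replace_deduplicate(plan.child)
        out = [a if a in plan.keys else fresh(a.name) for a in child.output]
        return Aggregate(out, child)
    if isinstance(plan, ExceptAll):
        return ExceptAll(replace_deduplicate(plan.left),
                         replace_deduplicate(plan.right))
    if isinstance(plan, Generate):
        return Generate(plan.refs, replace_deduplicate(plan.child))
    return plan


def rewrite_except_all(plan):
    """Toy rule: capture left.output at rewrite time (right side omitted)."""
    if isinstance(plan, ExceptAll):
        return Generate(list(plan.left.output), plan.left)
    return plan


def missing_refs(plan):
    """Attributes a Generate refers to that its child no longer produces."""
    if isinstance(plan, Generate):
        return [a for a in plan.refs if a not in plan.child.output]
    return []


def build_plan():
    left = Relation([fresh("id"), fresh("name"), fresh("value")])
    right = Relation([fresh("id"), fresh("name"), fresh("value")])
    return ExceptAll(Deduplicate(left.output[:2], left), right)


def optimize(plan, rules):
    for rule in rules:
        plan = rule(plan)
    return plan


old_order = optimize(build_plan(), [rewrite_except_all, replace_deduplicate])
new_order = optimize(build_plan(), [replace_deduplicate, rewrite_except_all])
buggy_missing = [a.name for a in missing_refs(old_order)]
fixed_missing = [a.name for a in missing_refs(new_order)]
```

In this model the old order leaves the Generate node pointing at the pre-aggregation value attribute (buggy_missing is ["value"]), while the new order leaves nothing dangling (fixed_missing is empty).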

Does this PR introduce any user-facing change?

Yes. exceptAll (and intersectAll) now work correctly after dropDuplicates with a column subset.

How was this patch tested?

Added a test in DataFrameSetOperationsSuite verifying exceptAll, except, and intersectAll after dropDuplicates(subset).
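
For reference, the results such a test should expect can be sketched with plain Python multisets. This is only a model of the operator semantics, not the test added in this PR; the helper names and the sample rows are hypothetical. The duplicated rows are identical in every column, so which row survives deduplication does not depend on First() ordering.

```python
from collections import Counter


def drop_duplicates(rows, key_idx):
    """Keep the first row seen for each combination of key columns."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def except_all(left, right):
    """Multiset difference: each right row cancels one matching left row."""
    return sorted((Counter(left) - Counter(right)).elements())


def intersect_all(left, right):
    """Multiset intersection: keep the minimum count of each row."""
    return sorted((Counter(left) & Counter(right)).elements())


def except_distinct(left, right):
    """Set difference: distinct left rows that do not appear in right."""
    return sorted(set(left) - set(right))


# Columns: (id, name, value); the subset used for deduplication is (id, name).
left = [(1, "a", 10), (1, "a", 10), (2, "b", 20), (3, "c", 30)]
right = [(2, "b", 20)]

deduped = drop_duplicates(left, key_idx=(0, 1))
except_all_rows = except_all(deduped, right)
intersect_all_rows = intersect_all(deduped, right)
except_rows = except_distinct(deduped, right)
```

Here except_all_rows and except_rows both come out as the rows for ids 1 and 3, and intersect_all_rows is the single row for id 2.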

Was this patch authored or co-authored using generative AI tooling?

Yes.

ReplaceDeduplicateWithAggregate replaces Deduplicate with an Aggregate using First() for non-key columns, creating new attribute exprIds. When RewriteExceptAll ran first in the same optimizer batch, it captured the original exprIds in its Generate node. After ReplaceDeduplicateWithAggregate rewrote the Deduplicate, the Generate still referenced the old exprIds, causing INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. Fix: reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the Replace Operators batch so Deduplicate is already an Aggregate when RewriteExceptAll processes the plan.

shrirangmhalgi · 2026-05-15T16:40:29Z

@holdenk / @dongjoon-hyun Could you please review?

shrirangmhalgi · 2026-05-21T21:56:56Z

@holdenk I addressed your feedback: added the dependency comment and strengthened the test assertions. Could you please review whenever you get a chance?

shrirangmhalgi · 2026-05-23T23:29:57Z

@cloud-fan Would you mind taking a look at this when you get a chance? It's a one-line rule-reordering fix in the optimizer, and the review feedback from @holdenk and @acruise has been addressed. Thanks!

holdenk reviewed May 15, 2026

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

acruise reviewed May 15, 2026

Comment thread sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala Outdated

shrirangmhalgi added 2 commits May 15, 2026 12:15

Address review: Add dependency comment and strengthen test assertions

5f80a43

Use unique test data to avoid non-deterministic First() behavior

b38ca02

shrirangmhalgi wants to merge 3 commits into apache:master from shrirangmhalgi:SPARK-51262-except-all-not-working-with-drop-duplicates

3 participants
