Skip to content

[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset#55905

Open
shrirangmhalgi wants to merge 3 commits into
apache:masterfrom
shrirangmhalgi:SPARK-51262-except-all-not-working-with-drop-duplicates
Open

[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset#55905
shrirangmhalgi wants to merge 3 commits into
apache:masterfrom
shrirangmhalgi:SPARK-51262-except-all-not-working-with-drop-duplicates

Conversation

@shrirangmhalgi
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the "Replace Operators" optimizer batch.

Why are the changes needed?

dropDuplicates("id", "name").exceptAll(other) throws INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. The root cause is that RewriteExceptAll captures attribute references from left.output before ReplaceDeduplicateWithAggregate has replaced the Deduplicate node with an Aggregate(First(...)). The First() alias creates new exprIds that don't match what RewriteExceptAll baked into its Generate node.

Does this PR introduce any user-facing change?

Yes. exceptAll (and intersectAll) now work correctly after dropDuplicates with a column subset.

How was this patch tested?

Added a test in DataFrameSetOperationsSuite verifying exceptAll, except, and intersectAll after dropDuplicates(subset).

Was this patch authored or co-authored using generative AI tooling?

Yes.

ReplaceDeduplicateWithAggregate replaces Deduplicate with an Aggregate using First() for non-key columns, creating new attribute exprIds. When RewriteExceptAll ran first in the same optimizer batch, it captured the original exprIds in its Generate node. After ReplaceDeduplicateWithAggregate rewrote the Deduplicate, the Generate still referenced the old exprIds, causing INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time.

Fix: reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the Replace Operators batch so Deduplicate is already an Aggregate when RewriteExceptAll processes the plan.
@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

@holdenk / @dongjoon-hyun Could you please review

Comment thread sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala Outdated
@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

@holdenk Addressed your feedback to add the dependency comment and strengthened test assertions. Could you please review whenever you get a chance

@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

@cloud-fan Would you mind taking a look at this when you get a chance? It's a one-line rule reordering fix in the optimizer - review feedback from @holdenk and @acruise has been addressed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants