[SPARK-51262][SQL] Fix exceptAll after dropDuplicates with subset #55905
Open
shrirangmhalgi wants to merge 3 commits into
ReplaceDeduplicateWithAggregate replaces Deduplicate with an Aggregate using First() for non-key columns, creating new attribute exprIds. When RewriteExceptAll ran first in the same optimizer batch, it captured the original exprIds in its Generate node. After ReplaceDeduplicateWithAggregate rewrote the Deduplicate, the Generate still referenced the old exprIds, causing INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. Fix: reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the Replace Operators batch so Deduplicate is already an Aggregate when RewriteExceptAll processes the plan.
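The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not Catalyst code: the classes `Attr`, `Relation`, `Deduplicate`, `Aggregate`, `ExceptAll`, and `Generate`, and the helpers `fresh`, `replace_deduplicate`, `rewrite_except_all`, and `missing`, are all hypothetical stand-ins. It assumes, following the description above, that the Deduplicate rewrite gives non-key columns new exprIds, and it reduces the ExceptAll rewrite to a single Generate node that captures the left side's output.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # toy exprId generator


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


def fresh(name):
    """Create an attribute with a new exprId."""
    return Attr(name, next(_ids))


@dataclass
class Relation:
    output: list


@dataclass
class Deduplicate:
    keys: list
    child: object

    @property
    def output(self):
        return self.child.output


@dataclass
class Aggregate:
    output: list
    child: object


@dataclass
class ExceptAll:
    left: object
    right: object


@dataclass
class Generate:
    refs: list  # attributes captured when the node was built
    child: object


def replace_deduplicate(plan):
    """Toy rule: Deduplicate -> Aggregate; non-key columns get new exprIds."""
    if isinstance(plan, Deduplicate):
        keys = set(plan.keys)
        out = [a if a in keys else fresh(a.name) for a in plan.child.output]
        return Aggregate(out, plan.child)
    if isinstance(plan, Generate):
        return Generate(plan.refs, replace_deduplicate(plan.child))
    if isinstance(plan, ExceptAll):
        return ExceptAll(replace_deduplicate(plan.left),
                         replace_deduplicate(plan.right))
    return plan


def rewrite_except_all(plan):
    """Toy rule: capture left.output in a Generate (right side omitted)."""
    if isinstance(plan, ExceptAll):
        return Generate(list(plan.left.output), plan.left)
    return plan


def missing(plan):
    """Attributes the Generate references that its child no longer produces."""
    return [r for r in plan.refs if r not in plan.child.output]


def build():
    rel = Relation([fresh("id"), fresh("name"), fresh("value")])
    other = Relation([fresh("id"), fresh("name"), fresh("value")])
    return ExceptAll(Deduplicate(rel.output[:2], rel), other)


# Old order: RewriteExceptAll first, then ReplaceDeduplicateWithAggregate.
old_order = replace_deduplicate(rewrite_except_all(build()))
# New order: ReplaceDeduplicateWithAggregate first, then RewriteExceptAll.
new_order = rewrite_except_all(replace_deduplicate(build()))

print([a.name for a in missing(old_order)])
print([a.name for a in missing(new_order)])
```

In this model the old order leaves the Generate holding a stale reference to the non-key column, while the new order leaves nothing unresolved, because the Generate is built from the Aggregate's output.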
Contributor
Author
@holdenk / @dongjoon-hyun Could you please review?
holdenk
reviewed
May 15, 2026
acruise
reviewed
May 15, 2026
Contributor
Author
@holdenk I addressed your feedback: added the dependency comment and strengthened the test assertions. Could you please review when you get a chance?
Contributor
Author
@cloud-fan Would you mind taking a look at this when you get a chance? It's a one-line rule-reordering fix in the optimizer; review feedback from @holdenk and @acruise has been addressed. Thanks!
What changes were proposed in this pull request?
Reorder ReplaceDeduplicateWithAggregate before RewriteExceptAll in the "Replace Operators" optimizer batch.

Why are the changes needed?
dropDuplicates("id", "name").exceptAll(other) throws INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND at execution time. The root cause is that RewriteExceptAll captures attribute references from left.output before ReplaceDeduplicateWithAggregate has replaced the Deduplicate node with an Aggregate(First(...)). The First() alias creates new exprIds that don't match what RewriteExceptAll baked into its Generate node.

Does this PR introduce any user-facing change?
Yes. exceptAll (and intersectAll) now work correctly after dropDuplicates with a column subset.

How was this patch tested?
Added a test in DataFrameSetOperationsSuite verifying exceptAll, except, and intersectAll after dropDuplicates(subset).

Was this patch authored or co-authored using generative AI tooling?
Yes.