Skip to content

[SPARK-55848][SQL] Fix incorrect dedup results with SPJ partial clustering#54679

Closed
naveenp2708 wants to merge 4 commits intoapache:masterfrom
naveenp2708:fix/spj-partial-clustering-dedup
Closed

[SPARK-55848][SQL] Fix incorrect dedup results with SPJ partial clustering#54679
naveenp2708 wants to merge 4 commits intoapache:masterfrom
naveenp2708:fix/spj-partial-clustering-dedup

Conversation

@naveenp2708
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR fixes a data correctness bug where SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates and Window-based row_number dedup).

The root cause: KeyGroupedPartitioning.satisfies0() delegates to super.satisfies0() (from HashPartitioningLike), which also matches ClusteredDistribution and returns true — short-circuiting the isPartiallyClustered guard. This means EnsureRequirements never inserts an Exchange before downstream dedup operators when partial clustering is active.

The fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and restructures satisfies0() to check ClusteredDistribution first, returning false when partially clustered. EnsureRequirements then automatically inserts the necessary Exchange.

Why are the changes needed?

Without this fix, any Spark user running SPJ with partial clustering enabled who applies dropDuplicates() or Window-based dedup on the join output gets silently inflated results. This affects production pipelines using Iceberg, Delta Lake, or any DataSource V2 connector with bucketed tables.

Does this PR introduce any user-facing change?

Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results. An additional shuffle may be introduced for these specific query patterns. Queries without post-join dedup are unaffected.

How was this patch tested?

  • New tests in KeyGroupedPartitioningSuite: SPARK-54378 dropDuplicates, SPARK-54378 Window dedup, SPARK-55848 dropDuplicates
  • All 69 tests in KeyGroupedPartitioningSuite pass
  • All 26 tests in EnsureRequirementsSuite pass
  • All 98 tests in DistributionAndOrderingSuite pass

Was this patch authored or co-authored using generative AI tooling?

No.

@naveenp2708 naveenp2708 changed the title Fix/spj partial clustering dedup [SPARK-55848][SQL] Fix incorrect dedup results with SPJ partial clustering Mar 8, 2026
@naveenp2708
Copy link
Copy Markdown
Contributor Author

@peter-toth @szehon-ho @cloud-fan @chirag-s-db

This PR addresses the correctness issue discussed in #54378. Test cases included as requested by @peter-toth.

@peter-toth
Copy link
Copy Markdown
Contributor

Can you please revert formatting changes? Now this PR shows +2,876 -1,961 lines of diff, but most of them seems unnecessary.

…ering

When SPJ partial clustering splits a partition across multiple tasks,
post-join dedup operators (dropDuplicates, Window row_number) produce
incorrect results because KeyGroupedPartitioning.satisfies0() incorrectly
reports satisfaction of ClusteredDistribution via super.satisfies0()
short-circuiting the isPartiallyClustered guard.

This fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and
restructures satisfies0() to check ClusteredDistribution first, returning
false when partially clustered. EnsureRequirements then inserts the
necessary Exchange. Plain SPJ joins without dedup are unaffected.

Closes apache#54378
@naveenp2708
Copy link
Copy Markdown
Contributor Author

@peter-toth Done — rebased and cleaned up the formatting changes. The diff should now show only the actual fix (+192 -35).

@peter-toth
Copy link
Copy Markdown
Contributor

@peter-toth Done — rebased and cleaned up the formatting changes. The diff should now show only the actual fix (+192 -35).

Thank you. Let me review this later today or tomorrow morning.

Copy link
Copy Markdown
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add new test case similar to the existing SPARK-53322: checkpointed scans avoid shuffles for aggregates, because after this fix there should be a shuffle added when a checkpointed KeyGroupedPartitioning partitioning is isPartiallyClustered, but we have a node above the checkpoint that requires ClusteredDistribution.

}
}

test("[SPARK-54378] dropDuplicates after SPJ with partial clustering should give correct " +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please start the new test names with SPARK-55848:


Seq(true, false).foreach { partiallyClustered =>
withSQLConf(
SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> false.toString,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to turn REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION off for this test?

V2_BUCKETING_PUSH_PART_VALUES_ENABLED is enabled by default so we can omit it.

|FROM testcat.ns.$items i
|JOIN testcat.ns.$purchases p ON i.id = p.item_id
|""".stripMargin)
checkAnswer(df, Seq(Row(1), Row(2), Row(3)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please check the presence of shuffles and the number of partitons of scans are the expected?

}
}

test("[SPARK-54378] Window dedup after SPJ with partial clustering should give correct " +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

}
}

test("SPARK-55848: dropDuplicates after SPJ with partial clustering should produce " +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test different to the first one?

// requires all rows with the same key to be co-located in a single task. Without this
// guard, downstream operators such as dropDuplicates or Window functions would skip
// their required shuffle and produce incorrect results.
// See SPARK-54378 / SPARK-55848.
Copy link
Copy Markdown
Contributor

@peter-toth peter-toth Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is SPARK-54378 related to this issue?

@peter-toth
Copy link
Copy Markdown
Contributor

peter-toth commented Mar 9, 2026

@naveenp2708, I've merged #54330 to master which causes lots of conflicts in this PR.
I would suggest splitting this PR into 2 PRs. The first one should target master and add the regression tests, the second one should target branch-4.1 and add both the fix and the tests as well.

@naveenp2708 naveenp2708 requested a review from peter-toth March 10, 2026 00:13
@naveenp2708
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @peter-toth! I'll address all feedback and split into two PRs:

PR 1 (targeting master):

PR 2 (targeting branch-4.1):

  • Fix (isPartiallyClustered) + all tests

Closing this PR and opening the new ones shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants