Contributor

@ivoson ivoson commented Dec 2, 2025

What changes were proposed in this pull request?

Roll back shuffle map stages when a shuffle checksum mismatch is detected:

  • cancel and resubmit the stage if it is running;
  • clean up the shuffle status to ensure the stage is resubmitted;
  • record the rollback attemptId and ignore results from that attempt and any older ones, since they may have consumed inconsistent data.
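
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `MapStage`, `RollbackTracker`, and all of their fields are hypothetical stand-ins for the scheduler's real state, and the sketch assumes that a result is stale when its attempt id is less than or equal to the recorded rollback attempt id.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a shuffle map stage (not Spark's Stage class).
final case class MapStage(
    id: Int,
    shuffleId: Int,
    var latestAttemptId: Int = 0,
    var running: Boolean = false)

// Hypothetical tracker for the state touched by a rollback.
final class RollbackTracker {
  // shuffleId -> registered map outputs (partition -> location)
  val shuffleOutputs = mutable.Map.empty[Int, mutable.Map[Int, String]]
  // stageId -> highest attempt id whose results must be ignored
  val rolledBackUpTo = mutable.Map.empty[Int, Int]
  // ids of stages queued for resubmission
  val resubmitted = mutable.ListBuffer.empty[Int]

  def rollback(stage: MapStage): Unit = {
    // 1. Cancel a running attempt and queue the stage for resubmission.
    if (stage.running) {
      stage.running = false
      resubmitted += stage.id
    }
    // 2. Clear the registered outputs so the stage no longer looks complete.
    shuffleOutputs.get(stage.shuffleId).foreach(_.clear())
    // 3. Record the attempt id so late results from it, or older attempts, are dropped.
    rolledBackUpTo(stage.id) = stage.latestAttemptId
  }

  // A result is accepted only if it comes from an attempt newer than the rollback mark.
  def shouldAccept(stageId: Int, attemptId: Int): Boolean =
    rolledBackUpTo.get(stageId).forall(attemptId > _)
}
```

In this sketch, a completion event from attempt 2 of a stage that was rolled back at attempt 2 is rejected, while attempt 3 is accepted.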

Why are the changes needed?

To ensure that all succeeding stages are resubmitted and fully retried when a shuffle checksum mismatch is detected.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests added.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Dec 2, 2025
@ivoson ivoson marked this pull request as ready for review December 2, 2025 01:53
Contributor Author

ivoson commented Dec 9, 2025

cc @cloud-fan @mridulm

* @return Shuffle map stages which need and can be rolled back
*/
- private def abortStageWithInvalidRollBack(stagesToRollback: HashSet[Stage]): HashSet[Stage] = {
+ private def abortStagesUnableToRollback(stagesToRollback: HashSet[Stage]): HashSet[Stage] = {
Contributor

How about filterAndAbortUnrollbackableStages? It returns the rollbackable stages, which is like .filter.

Contributor Author

Sounds good. Updated. Thanks.
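
The filter-like contract behind the suggested name can be illustrated with a minimal sketch. It is not the method from this PR: `StageInfo`, its `canRollback` flag, and the `abort` callback are hypothetical simplifications, and the real method operates on `HashSet[Stage]` inside the scheduler.

```scala
object RollbackFilter {
  // Hypothetical simplified stage; `canRollback` stands in for the scheduler's real check.
  final case class StageInfo(id: Int, canRollback: Boolean)

  // Aborts every stage that cannot be rolled back and returns the ones that can,
  // so the caller sees the same result as a `.filter(_.canRollback)` call.
  def filterAndAbortUnrollbackableStages(
      stages: Set[StageInfo])(abort: StageInfo => Unit): Set[StageInfo] = {
    val (rollbackable, unrollbackable) = stages.partition(_.canRollback)
    unrollbackable.foreach(abort)
    rollbackable
  }
}
```

Naming both the side effect (abort) and the return value (the filtered set) makes it harder to misread the method as returning the aborted stages.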

log"(${MDC(STAGE_ID, stage.id)}) were aborted so this stage is not needed anymore.")
return
}
if (!sms.shuffleDep.checksumMismatchFullRetryEnabled && stage.isIndeterminate) {
Contributor

Is it possible to merge the old indeterminate-stage code path into the new framework? I think it is just a special case where we know the checksum always mismatches.

Contributor Author

Will check.
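
One way to read the suggestion in this thread is sketched below. It is only an illustration of the idea, not what was merged: `StageView` and `needsRollback` are hypothetical names, and the sketch assumes an indeterminate stage can be treated as if a mismatch had already been detected.

```scala
object RollbackDecision {
  // Hypothetical view of the two facts the scheduler would consult.
  final case class StageView(isIndeterminate: Boolean, checksumMismatchDetected: Boolean)

  // An indeterminate stage is handled as a checksum mismatch that is known in advance,
  // so both cases reach the same rollback path through a single predicate.
  def needsRollback(stage: StageView): Boolean =
    stage.isIndeterminate || stage.checksumMismatchDetected
}
```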

Contributor Author

ivoson commented Dec 15, 2025

gentle ping @mridulm in case you missed this

@cloud-fan
Contributor

thanks, merging to master/4.1!

@cloud-fan cloud-fan closed this in 0da9e05 Dec 19, 2025
cloud-fan pushed a commit that referenced this pull request Dec 19, 2025
…le checksum mismatch detected

### What changes were proposed in this pull request?
Rollback shuffle map stages when shuffle checksum mismatch detected:

- cancel and resubmit the stage if it's running;
- clean up the shuffle status to ensure it'll be resubmitted;
- mark rollback attemptId and ignore the results from these elder attempts which may consume inconsistent data;

### Why are the changes needed?
To ensure all the succeeding stages will be re-submitted and fully-retry when there is shuffle checksum mismatch detected.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT added.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53274 from ivoson/SPARK-54556.

Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0da9e05)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>