[SPARK-54556][CORE] Rollback succeeding shuffle map stages when shuffle checksum mismatch detected #53274
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
   * @return Shuffle map stages which need and can be rolled back
   */
  private def abortStageWithInvalidRollBack(stagesToRollback: HashSet[Stage]): HashSet[Stage] = {
  private def abortStagesUnableToRollback(stagesToRollback: HashSet[Stage]): HashSet[Stage] = {
How about `filterAndAbortUnrollbackableStages`? It returns the rollbackable stages, much like `.filter`.
Sounds good. Updated. Thanks.
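To illustrate the naming point, here is a minimal sketch of a filter-style helper that returns the stages that can be rolled back and aborts the rest as a side effect. The `Stage` case class, its `canRollback` flag, and the `abort` callback are hypothetical stand-ins for illustration only; they are not the real DAGScheduler types or the code merged in this PR.

```scala
import scala.collection.mutable.HashSet

// Hypothetical, simplified stand-in for Spark's Stage; not the real class.
final case class Stage(id: Int, canRollback: Boolean)

object FilterAndAbortSketch {
  // Returns the stages that can be rolled back (like `.filter`) and calls
  // `abort` on every stage that cannot. `abort` is an assumed callback.
  def filterAndAbortUnrollbackableStages(
      stagesToRollback: HashSet[Stage],
      abort: Stage => Unit): HashSet[Stage] = {
    val (rollbackable, unrollbackable) = stagesToRollback.partition(_.canRollback)
    unrollbackable.foreach(abort)
    rollbackable
  }
}
```

The return value reads like the result of `.filter`, while the name still makes the abort side effect visible to callers.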
          log"(${MDC(STAGE_ID, stage.id)}) were aborted so this stage is not needed anymore.")
        return
      }
      if (!sms.shuffleDep.checksumMismatchFullRetryEnabled && stage.isIndeterminate) {
Is it possible to merge the old indeterminate-stage code path into the new framework? I think this is just a special case where we know the checksum always mismatches.
Will check.
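One way to picture the suggestion, purely as a sketch: treat an indeterminate stage as if a checksum mismatch had already been detected, so both cases go through one decision. The `ShuffleStageInfo` type and both of its fields are hypothetical names for illustration; whether the real code paths can be merged this way is exactly the open question in this thread.

```scala
object FullRetryDecisionSketch {
  // Hypothetical inputs; these are not Spark's real fields.
  final case class ShuffleStageInfo(
      isIndeterminate: Boolean,
      checksumMismatchDetected: Boolean)

  // An indeterminate stage is handled as the special case in which the
  // checksum is assumed to always mismatch.
  def needsRollbackOfSucceedingStages(info: ShuffleStageInfo): Boolean =
    info.isIndeterminate || info.checksumMismatchDetected
}
```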
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
gentle ping @mridulm in case you missed this
thanks, merging to master/4.1!
…le checksum mismatch detected

### What changes were proposed in this pull request?
Rollback shuffle map stages when shuffle checksum mismatch detected:
- cancel and resubmit the stage if it's running;
- clean up the shuffle status to ensure it'll be resubmitted;
- mark rollback attemptId and ignore the results from these elder attempts which may consume inconsistent data;

### Why are the changes needed?
To ensure all the succeeding stages will be re-submitted and fully-retry when there is shuffle checksum mismatch detected.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT added.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53274 from ivoson/SPARK-54556.

Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0da9e05)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
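Roll back shuffle map stages when a shuffle checksum mismatch is detected:
- cancel and resubmit the stage if it is running;
- clean up the shuffle status to ensure the stage will be resubmitted;
- mark the rollback attemptId and ignore results from older attempts, which may have consumed inconsistent data.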
Why are the changes needed?
To ensure that all succeeding stages are resubmitted and fully retried when a shuffle checksum mismatch is detected.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT added.
Was this patch authored or co-authored using generative AI tooling?
No
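The three rollback steps named in the commit message can be sketched roughly as follows. This is a toy model under stated assumptions, not the DAGScheduler implementation: `MapStage`, `maxRolledBackAttemptId`, `mapOutputs`, `resubmit`, and `handleTaskCompletion` are all hypothetical names introduced for illustration.

```scala
import scala.collection.mutable

object RollbackSketch {
  // Hypothetical, simplified model of a shuffle map stage.
  final class MapStage(val id: Int) {
    var running: Boolean = false
    var latestAttemptId: Int = 0
    // Attempts with an id at or below this value count as rolled back.
    var maxRolledBackAttemptId: Int = -1
    // Stand-in for the shuffle status: partition -> attempt that wrote it.
    val mapOutputs: mutable.Map[Int, Int] = mutable.Map.empty
  }

  // Starts a fresh attempt of the stage.
  def resubmit(stage: MapStage): Unit = {
    stage.latestAttemptId += 1
    stage.running = true
  }

  def rollback(stage: MapStage): Unit = {
    // Mark the current and all earlier attempts as rolled back.
    stage.maxRolledBackAttemptId = stage.latestAttemptId
    // Clean up the shuffle status so the stage has missing outputs again.
    stage.mapOutputs.clear()
    // Cancel and resubmit only if the stage is currently running.
    if (stage.running) resubmit(stage)
  }

  // Returns true if the result was accepted; results from rolled-back
  // attempts are ignored because they may be based on inconsistent data.
  def handleTaskCompletion(stage: MapStage, attemptId: Int, partition: Int): Boolean =
    if (attemptId <= stage.maxRolledBackAttemptId) {
      false
    } else {
      stage.mapOutputs(partition) = attemptId
      true
    }
}
```

In this model a stage that is not running is not resubmitted right away; clearing its outputs is what makes a later scheduling pass see it as incomplete.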