[SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason#55994
Open
sunchao wants to merge 2 commits into
(cherry picked from commit 81dae0f3fdedb15c232adc34ccdd7bbd468d18d2)
cc @dongjoon-hyun @cloud-fan @peter-toth @viirya could you check if this PR makes sense?
What changes were proposed in this pull request?
This PR preserves the original heartbeat-timeout loss reason when Spark replaces an executor.
When HeartbeatReceiver expires an executor, Spark already creates:
ExecutorProcessLost("Executor heartbeat timed out ...")
The replacement request can later be reported back by the backend as generic ExecutorKilled,
which can replace the heartbeat-timeout reason before the scheduler records the final executor
loss. This PR keeps the heartbeat-timeout reason through that replacement flow.
The preserved reason is used only when the backend reports generic ExecutorKilled. If the
backend provides a more specific reason, such as ExecutorExited, Spark still keeps that backend
reason. The pending preserved reason is also cleared if the kill request is rejected or fails.
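The selection rule described above can be modelled in a few lines. The following is a minimal Python sketch, not Spark's Scala implementation: the LossReason dataclass, the PendingLossReasons class, and its method names are hypothetical stand-ins chosen for illustration, and only the reason names are taken from this description.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LossReason:
    # Hypothetical stand-in for Spark's executor loss reasons.
    kind: str
    message: str = ""


class PendingLossReasons:
    """Hypothetical holder for reasons recorded when a replacement is requested."""

    def __init__(self) -> None:
        self._pending: Dict[str, LossReason] = {}

    def record(self, executor_id: str, reason: LossReason) -> None:
        # Remember the reason known at the time the kill/replace request is sent.
        self._pending[executor_id] = reason

    def clear(self, executor_id: str) -> None:
        # Called when the kill request is rejected or fails.
        self._pending.pop(executor_id, None)

    def resolve(self, executor_id: str, reported: LossReason) -> LossReason:
        # The pending entry is consumed whichever reason wins.
        preserved = self._pending.pop(executor_id, None)
        # Only a generic ExecutorKilled report is overridden by the preserved reason.
        if preserved is not None and reported.kind == "ExecutorKilled":
            return preserved
        return reported


timeout = LossReason("ExecutorProcessLost", "Executor heartbeat timed out ...")
pending = PendingLossReasons()

# Backend reports generic ExecutorKilled: the heartbeat-timeout reason is kept.
pending.record("1", timeout)
generic = pending.resolve("1", LossReason("ExecutorKilled"))

# Backend reports a more specific reason: the backend reason is kept.
pending.record("2", timeout)
specific = pending.resolve("2", LossReason("ExecutorExited"))

# Kill request rejected or failed: the pending reason is cleared first.
pending.record("3", timeout)
pending.clear("3")
after_clear = pending.resolve("3", LossReason("ExecutorKilled"))
```

In this model the pending entry is removed on every resolution, so a stale heartbeat-timeout reason cannot leak into a later, unrelated loss of an executor with the same ID. Whether the actual patch structures its state this way is an assumption.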
Why are the changes needed?
Without this change, a heartbeat-timeout executor loss can be turned into generic
ExecutorKilled when the backend reports the replacement.
That changes scheduler behavior, not just the log message. Spark treats ExecutorKilled as an
executor loss that was not caused by the application, so task failures on that executor do not
count toward the task failure limit. For a real heartbeat timeout, that is the wrong
classification: the executor became unresponsive while running the application, and repeated
losses should still count toward failing the affected tasks.
If the heartbeat-timeout reason is dropped, Spark can keep retrying tasks after repeated
timeout-driven executor losses instead of eventually failing the stage. This PR preserves the
ExecutorProcessLost("Executor heartbeat timed out ...") reason that Spark already knows at the
time of replacement, unless the backend later provides a more specific reason.
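To make the scheduling consequence concrete, here is a toy Python model of failure counting. It is a sketch under stated assumptions, not Spark's scheduler code: the function name failures_until_abort is hypothetical, the limit of 4 counted failures is an illustrative value, and the model treats ExecutorKilled as the only reason that is not attributed to the application.

```python
from typing import Iterable, Optional


def failures_until_abort(loss_kinds: Iterable[str], max_failures: int = 4) -> Optional[int]:
    """Return the 1-based index of the loss that reaches the limit, or None.

    Hypothetical model: a loss counts toward the limit unless it is the
    generic ExecutorKilled reason.
    """
    counted = 0
    for index, kind in enumerate(loss_kinds, start=1):
        if kind != "ExecutorKilled":
            counted += 1
        if counted >= max_failures:
            return index
    return None


# Heartbeat timeouts misreported as ExecutorKilled never reach the limit.
misclassified = failures_until_abort(["ExecutorKilled"] * 10)

# With the reason preserved, the limit is reached on the fourth loss.
preserved = failures_until_abort(["ExecutorProcessLost"] * 10)
```

In this model, ten consecutive misclassified losses leave the task retrying indefinitely, while the same ten losses reported as ExecutorProcessLost hit the limit at the fourth one.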
This fixes SPARK-56952.
Does this PR introduce any user-facing change?
Yes.
Executor loss reporting is more specific for heartbeat-timeout removals. Cases that previously
appeared as generic ExecutorKilled can now retain:
ExecutorProcessLost("Executor heartbeat timed out ...")
If the backend provides a concrete loss reason, Spark still keeps that backend reason instead.
How was this patch tested?
Unit tests cover:
ExecutorKilled,
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex