[SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason #55994

Open: sunchao wants to merge 2 commits into apache:master from sunchao:dev/chao/codex/heartbeat-timeout-loss-reason-oss

sunchao (Member) commented May 19, 2026

What changes were proposed in this pull request?

This PR preserves the original heartbeat-timeout loss reason when Spark replaces an executor.

When HeartbeatReceiver expires an executor, Spark already creates:

ExecutorProcessLost("Executor heartbeat timed out ...")

The backend can later report the replacement request back as a generic
ExecutorKilled, which can overwrite the heartbeat-timeout reason before the scheduler records
the final executor loss. This PR keeps the heartbeat-timeout reason through that replacement
flow.

The preserved reason is used only when the backend reports generic ExecutorKilled.
If the backend provides a more specific reason, such as ExecutorExited, Spark still keeps
that backend reason. The pending preserved reason is also cleared if the kill request is
rejected or fails.
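
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `LossReason` types are simplified stand-ins for Spark's internal loss-reason classes, and `PendingLossReasons` with its method names is hypothetical.

```scala
import scala.collection.mutable

object LossReasonSketch {
  // Simplified stand-ins for Spark's internal loss-reason types (assumption).
  sealed trait LossReason
  case object ExecutorKilled extends LossReason
  final case class ExecutorExited(exitCode: Int, message: String) extends LossReason
  final case class ExecutorProcessLost(message: String) extends LossReason

  // Hypothetical tracker: remembers the timeout reason while a kill is in flight.
  final class PendingLossReasons {
    private val pending = mutable.Map.empty[String, LossReason]

    // Called when the heartbeat receiver expires an executor.
    def recordTimeout(executorId: String, reason: LossReason): Unit =
      pending(executorId) = reason

    // Called when the kill request is rejected or fails: drop the pending reason.
    def killRejectedOrFailed(executorId: String): Unit =
      pending.remove(executorId)

    // Called when the backend reports the loss. The pending entry is always
    // consumed, but it only wins over a generic ExecutorKilled.
    def resolve(executorId: String, reported: LossReason): LossReason = {
      val preserved = pending.remove(executorId)
      reported match {
        case ExecutorKilled => preserved.getOrElse(ExecutorKilled)
        case specific       => specific
      }
    }
  }
}
```

In this sketch, a pending timeout reason replaces only a generic `ExecutorKilled`. A backend-reported `ExecutorExited` passes through unchanged, and a rejected kill leaves nothing behind to apply later. A real implementation would also need to consider thread safety, which is omitted here.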

Why are the changes needed?

Without this change, a heartbeat-timeout executor loss can be turned into generic
ExecutorKilled when the backend reports the replacement.

That changes scheduler behavior, not just the log message. Spark treats ExecutorKilled as an
executor loss that was not caused by the application, so task failures on that executor do not
count toward the task failure limit. For a real heartbeat timeout, that is the wrong
classification: the executor became unresponsive while running the application, and repeated
losses should still count toward failing the affected tasks.
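
To make the scheduling impact concrete, here is a small sketch of how the loss reason could feed into failure counting. The types are simplified stand-ins, the `causedByApp` flag and the helper names are assumptions for illustration, and the real scheduler logic handles more cases than shown.

```scala
object FailureCountingSketch {
  // Simplified stand-ins for Spark's internal loss-reason types (assumption).
  sealed trait LossReason
  case object ExecutorKilled extends LossReason
  final case class ExecutorProcessLost(
      message: String,
      causedByApp: Boolean = true) extends LossReason

  // A loss counts toward the task failure limit only if it is attributed
  // to the application. ExecutorKilled is never attributed to it.
  def countsTowardTaskFailures(reason: LossReason): Boolean = reason match {
    case ExecutorKilled                      => false
    case ExecutorProcessLost(_, causedByApp) => causedByApp
  }

  // Hypothetical helper: would a task hit the limit after these losses?
  def reachesFailureLimit(losses: Seq[LossReason], maxFailures: Int): Boolean =
    losses.count(countsTowardTaskFailures) >= maxFailures
}
```

With a limit of 4 (the default for `spark.task.maxFailures`), four losses recorded as `ExecutorKilled` never reach the limit in this sketch, while four losses recorded as `ExecutorProcessLost` do.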

If the heartbeat-timeout reason is dropped, Spark can keep retrying tasks after repeated
timeout-driven executor losses instead of eventually failing the stage. This PR preserves the
ExecutorProcessLost("Executor heartbeat timed out ...") reason that Spark already knows at the
time of replacement, unless the backend later provides a more specific reason.

This fixes SPARK-56952.

Does this PR introduce any user-facing change?

Yes.

Executor loss reporting is more specific for heartbeat-timeout removals. Cases that previously
appeared as generic ExecutorKilled can now retain:

ExecutorProcessLost("Executor heartbeat timed out ...")

If the backend provides a concrete loss reason, Spark still keeps that backend reason instead.

How was this patch tested?

Unit tests cover:

  • preserving the heartbeat-timeout reason when the backend reports ExecutorKilled,
  • preserving a concrete backend-provided reason instead of overriding it,
  • clearing the pending timeout reason when executor kill is rejected.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex

(cherry picked from commit 81dae0f3fdedb15c232adc34ccdd7bbd468d18d2)
@sunchao sunchao changed the title [SPARK-56952] Preserve heartbeat timeout executor loss reason [SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason May 19, 2026
sunchao (Member, Author) commented May 22, 2026

cc @dongjoon-hyun @cloud-fan @peter-toth @viirya could you check if this PR makes sense?

cloud-fan (Contributor) commented

cc @Ngone51 @jiangxb1987
