[SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason by sunchao · Pull Request #55994 · apache/spark

sunchao · 2026-05-19T17:40:17Z

What changes were proposed in this pull request?

This PR preserves the original heartbeat-timeout loss reason when Spark replaces an executor.

When HeartbeatReceiver expires an executor, Spark already creates:

ExecutorProcessLost("Executor heartbeat timed out ...")

The backend can later report the replaced executor's loss as generic ExecutorKilled,
which can overwrite the heartbeat-timeout reason before the scheduler records the final
executor loss. This PR keeps the heartbeat-timeout reason through that replacement flow.

The preserved reason is used only when the backend reports generic ExecutorKilled.
If the backend provides a more specific reason, such as ExecutorExited, Spark still keeps
that backend reason. The pending preserved reason is also cleared if the kill request is
rejected or fails.
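
The precedence rules above can be sketched as a small toy model. This is not Spark's Scala code and is only a sketch under stated assumptions: the class PendingTimeoutReasons, its method names, and the string-tagged LossReason type are all hypothetical, and the choice to consume the pending entry on every backend report is an assumption of the sketch rather than something the PR description states.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LossReason:
    # kind is a plain string tag standing in for Spark's loss-reason classes,
    # e.g. "ExecutorKilled", "ExecutorExited", "ExecutorProcessLost".
    kind: str
    message: str = ""


class PendingTimeoutReasons:
    """Hypothetical holder for heartbeat-timeout reasons awaiting a backend report."""

    def __init__(self) -> None:
        self._pending: Dict[str, LossReason] = {}

    def record_timeout(self, executor_id: str, message: str) -> None:
        # Remember the reason that is already known when the replacement is requested.
        self._pending[executor_id] = LossReason("ExecutorProcessLost", message)

    def kill_rejected_or_failed(self, executor_id: str) -> None:
        # The replacement did not go through, so the pending reason must not linger.
        self._pending.pop(executor_id, None)

    def resolve(self, executor_id: str, backend_reason: LossReason) -> LossReason:
        # Assumption: the pending entry is consumed on any backend report.
        preserved = self._pending.pop(executor_id, None)
        # The preserved reason wins only over generic ExecutorKilled.
        if preserved is not None and backend_reason.kind == "ExecutorKilled":
            return preserved
        # Any more specific backend reason (or no pending entry) is kept as reported.
        return backend_reason
```

In this model, a report of ExecutorKilled for an executor with a pending timeout resolves to the preserved ExecutorProcessLost reason, while a report of ExecutorExited passes through unchanged.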

Why are the changes needed?

Without this change, a heartbeat-timeout executor loss can be turned into generic
ExecutorKilled when the backend reports the replacement.

That changes scheduler behavior, not just the log message. Spark treats ExecutorKilled as an
executor loss that was not caused by the application, so task failures on that executor do not
count toward the task failure limit. For a real heartbeat timeout, that is the wrong
classification: the executor became unresponsive while running the application, and repeated
losses should still count toward failing the affected tasks.

If the heartbeat-timeout reason is dropped, Spark can keep retrying tasks after repeated
timeout-driven executor losses instead of eventually failing the stage. This PR preserves the
ExecutorProcessLost("Executor heartbeat timed out ...") reason that Spark already knows at the
time of replacement, unless the backend later provides a more specific reason.
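
The effect on retries can be illustrated with a deliberately simplified toy model. It is not Spark's scheduler code: the function names are hypothetical, the rule that every reason other than ExecutorKilled counts toward the limit is a simplification made for this sketch, and the limit of 4 is used only as an illustrative value.

```python
from typing import Iterable, Optional


def counts_toward_limit(reason_kind: str) -> bool:
    # Simplifying assumption: only generic ExecutorKilled is treated as
    # "not caused by the application"; every other reason counts.
    return reason_kind != "ExecutorKilled"


def losses_until_abort(loss_kinds: Iterable[str], max_failures: int = 4) -> Optional[int]:
    """Return the 1-based index of the loss that exhausts the limit, or None."""
    counted = 0
    for index, kind in enumerate(loss_kinds, start=1):
        if counts_toward_limit(kind):
            counted += 1
            if counted >= max_failures:
                return index
    # The limit was never reached, so the task would keep being retried.
    return None
```

With ten consecutive losses all classified as ExecutorKilled, losses_until_abort returns None, which corresponds to the unbounded retrying described above. With the same ten losses classified as ExecutorProcessLost, the limit is reached on the fourth loss.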

This fixes SPARK-56952.

Does this PR introduce any user-facing change?

Yes.

Executor loss reporting is more specific for heartbeat-timeout removals. Cases that previously
appeared as generic ExecutorKilled can now retain:

ExecutorProcessLost("Executor heartbeat timed out ...")

If the backend provides a concrete loss reason, Spark still keeps that backend reason instead.

How was this patch tested?

Unit tests cover:

preserving the heartbeat-timeout reason when the backend reports ExecutorKilled,
preserving a concrete backend-provided reason instead of overriding it,
clearing the pending timeout reason when executor kill is rejected.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex

(cherry picked from commit 81dae0f3fdedb15c232adc34ccdd7bbd468d18d2)

sunchao · 2026-05-22T03:33:45Z

cc @dongjoon-hyun @cloud-fan @peter-toth @viirya, could you check whether this PR makes sense?

cloud-fan · 2026-05-22T11:51:38Z

cc @Ngone51 @jiangxb1987

Preserve heartbeat timeout executor loss reason

1e144f4

(cherry picked from commit 81dae0f3fdedb15c232adc34ccdd7bbd468d18d2)

sunchao changed the title ~~[SPARK-56952] Preserve heartbeat timeout executor loss reason~~ [SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason May 19, 2026

Tighten heartbeat timeout reason regression tests

e2d32dd

sunchao wants to merge 2 commits into apache:master from sunchao:dev/chao/codex/heartbeat-timeout-loss-reason-oss

2 participants
