
fix(HeartbeatCache): Falsely timing out alive executors when heartbeat TIME_SECS does not advance #8420

Merged
reiabreu merged 1 commit into apache:master from DiogoP98:fix-heartbeat-cache
Mar 6, 2026


@DiogoP98 DiogoP98 commented Mar 5, 2026

What is the purpose of the change

Issue: #8419

After a Nimbus restart, healthy topologies with the correct number of running workers are repeatedly rescheduled. In the Nimbus logs, the following message appears for executors that are actually running:

Executor <topo-id>:<executor> not alive

This causes Nimbus to treat the affected workers as down, which continuously triggers rescheduling for those topologies.

Root Cause

In HeartbeatCache, the internal liveness timestamp (nimbusTimeSecs) was only refreshed when the heartbeat's TIME_SECS value changed between consecutive calls.

For RPC-based heartbeats, TIME_SECS represents the worker's wall-clock send time. If two heartbeats are processed within the same second, TIME_SECS is identical, nimbusTimeSecs is not refreshed, and after nimbus.task.timeout.secs (default 30s) the executor is falsely considered dead — even though it is actively heartbeating.
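The failure mode described above can be illustrated with a minimal sketch. This is a simplified model and not the actual Storm source: the class `StaleHeartbeatSketch`, its method signatures, and the explicit `nowSecs` parameter are hypothetical and exist only to make the timing visible. The 30-second timeout mirrors the `nimbus.task.timeout.secs` default mentioned in the description.

```java
// Simplified model of the pre-fix liveness rule; NOT the actual HeartbeatCache code.
// All names and signatures here are hypothetical, chosen for illustration.
final class StaleHeartbeatSketch {
    private Integer reportedTimeSecs; // last TIME_SECS value seen in a heartbeat
    private int nimbusTimeSecs;       // Nimbus-side liveness timestamp

    StaleHeartbeatSketch(int nowSecs) {
        this.nimbusTimeSecs = nowSecs;
    }

    // Pre-fix rule: refresh liveness only when TIME_SECS differs from the previous value.
    void update(int newReportedTimeSecs, int nowSecs) {
        if (reportedTimeSecs == null || newReportedTimeSecs != reportedTimeSecs) {
            nimbusTimeSecs = nowSecs;
        }
        reportedTimeSecs = newReportedTimeSecs;
    }

    boolean isTimedOut(int nowSecs, int timeoutSecs) {
        return nowSecs - nimbusTimeSecs >= timeoutSecs;
    }

    public static void main(String[] args) {
        StaleHeartbeatSketch cache = new StaleHeartbeatSketch(100);
        cache.update(100, 100); // first heartbeat, TIME_SECS = 100
        cache.update(100, 120); // heartbeat received at t=120, TIME_SECS unchanged
        // The executor heartbeated 11 s ago, yet liveness still dates from t=100.
        System.out.println(cache.isTimedOut(131, 30)); // prints "true"
    }
}
```

In this model the second heartbeat is received but ignored for liveness purposes, so the executor is reported as timed out 31 seconds after the first heartbeat even though it reported in only 11 seconds earlier.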

Fix

The heartbeat update logic is split into two separate methods to fix the RPC path while preserving backwards compatibility with legacy ZK-based topologies:

  • updateFromRpcHb — always refreshes nimbusTimeSecs on every heartbeat, so idle-but-alive executors are never falsely timed out.
  • updateFromZkHb — retains the original behaviour, only refreshing nimbusTimeSecs when TIME_SECS advances. This preserves zombie detection for legacy topologies where TIME_SECS is stats-based and genuinely stops advancing when an executor is stuck.
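The split between the two paths could look roughly like the following sketch. Only the method names `updateFromRpcHb` and `updateFromZkHb` come from this PR; the enclosing class `SplitHeartbeatSketch`, the parameter lists, and the explicit `nowSecs` argument are assumptions made for illustration and may differ from the merged code.

```java
// Illustrative sketch of the two update paths; NOT the merged HeartbeatCache code.
// Class name, fields, and parameters are hypothetical.
final class SplitHeartbeatSketch {
    private Integer reportedTimeSecs; // last TIME_SECS value seen in a heartbeat
    private int nimbusTimeSecs;       // Nimbus-side liveness timestamp

    SplitHeartbeatSketch(int nowSecs) {
        this.nimbusTimeSecs = nowSecs;
    }

    // RPC path: every received heartbeat counts as proof of life.
    void updateFromRpcHb(int newReportedTimeSecs, int nowSecs) {
        nimbusTimeSecs = nowSecs;
        reportedTimeSecs = newReportedTimeSecs;
    }

    // ZK path: keep the original rule, refreshing only when TIME_SECS advances.
    void updateFromZkHb(int newReportedTimeSecs, int nowSecs) {
        if (reportedTimeSecs == null || newReportedTimeSecs != reportedTimeSecs) {
            nimbusTimeSecs = nowSecs;
        }
        reportedTimeSecs = newReportedTimeSecs;
    }

    boolean isTimedOut(int nowSecs, int timeoutSecs) {
        return nowSecs - nimbusTimeSecs >= timeoutSecs;
    }

    public static void main(String[] args) {
        SplitHeartbeatSketch rpc = new SplitHeartbeatSketch(100);
        rpc.updateFromRpcHb(100, 100);
        rpc.updateFromRpcHb(100, 120); // same TIME_SECS, liveness still refreshed
        System.out.println(rpc.isTimedOut(131, 30)); // prints "false"

        SplitHeartbeatSketch zk = new SplitHeartbeatSketch(100);
        zk.updateFromZkHb(100, 100);
        zk.updateFromZkHb(100, 120); // TIME_SECS stuck, treated as a possible zombie
        System.out.println(zk.isTimedOut(131, 30)); // prints "true"
    }
}
```

Keeping the two rules in separate methods means the caller chooses the semantics based on where the heartbeat came from, rather than the cache guessing from the payload.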

How was the change tested

Compiled the project and tested the new build on Nimbus machines, forcing Nimbus restarts and leadership changes.

@DiogoP98 force-pushed the fix-heartbeat-cache branch from 697819b to 165557f on March 5, 2026 15:05
@DiogoP98 force-pushed the fix-heartbeat-cache branch from 165557f to 76a9856 on March 5, 2026 15:06

@paxadax paxadax left a comment


LGTM, thank you for the contribution to this bug.


reiabreu commented Mar 5, 2026

Thanks for the contribution, will try to review it ASAP.

@reiabreu reiabreu requested a review from rzo1 March 5, 2026 16:55
@reiabreu reiabreu modified the milestone: 2.8.5 Mar 6, 2026

reiabreu commented Mar 6, 2026

Thanks for the fix. Will be merging it now.

@reiabreu reiabreu merged commit 14efb2e into apache:master Mar 6, 2026
12 checks passed

5 participants