fix(HeartbeatCache): Falsely timing out alive executors when heartbeat TIME_SECS does not advance#8420
Merged
reiabreu merged 1 commit into apache:master on Mar 6, 2026
paxadax
approved these changes
Mar 5, 2026
Contributor
paxadax
left a comment
LGTM, thank you for the contribution to this bug.
dovalealves
approved these changes
Mar 5, 2026
Contributor
Thanks for the contribution, will try to review it ASAP.
reiabreu
approved these changes
Mar 6, 2026
rzo1
approved these changes
Mar 6, 2026
Contributor
Thanks for the fix. Will be merging it now.
What is the purpose of the change
Issue: #8419
After a Nimbus restart, healthy topologies with the correct number of running workers are repeatedly rescheduled. In the Nimbus logs, the following message appears for executors that are actually running:
This causes Nimbus to treat the workers as down, which continuously triggers rescheduling of those topologies.
Root Cause
In HeartbeatCache, the internal liveness timestamp (nimbusTimeSecs) was only refreshed when the heartbeat's TIME_SECS value changed between consecutive calls.
For RPC-based heartbeats, TIME_SECS represents the worker's wall-clock send time. If two heartbeats are processed within the same second, TIME_SECS is identical, nimbusTimeSecs is not refreshed, and after nimbus.task.timeout.secs (default 30s) the executor is falsely considered dead, even though it is actively heartbeating.
Fix
The heartbeat update logic is split into two separate methods to fix the RPC path while preserving backwards compatibility with legacy ZK-based topologies:
- updateFromRpcHb: always refreshes nimbusTimeSecs on every heartbeat, so idle-but-alive executors are never falsely timed out.
- updateFromZkHb: retains the original behaviour, only refreshing nimbusTimeSecs when TIME_SECS advances. This preserves zombie detection for legacy topologies where TIME_SECS is stats-based and genuinely stops advancing when an executor is stuck.
How was the change tested
Compiled the project and tested the new version of the code on Nimbus machines, forcing Nimbus restarts and leadership changes.
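For readers who want to see the difference between the two update paths in isolation, here is a minimal sketch of the behaviour described in this PR. It is a simplified model and not the actual Storm source: the class name HeartbeatCacheSketch, the Entry type, the string executor ids, the method signatures, and the `>=` timeout comparison are all assumptions made for illustration. Only the method names updateFromRpcHb and updateFromZkHb and the field name nimbusTimeSecs come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the liveness bookkeeping described in the PR.
// This is NOT the real HeartbeatCache; types and signatures are illustrative only.
class HeartbeatCacheSketch {

    // Per-executor state (assumed layout).
    static final class Entry {
        Integer lastReportedTimeSecs; // TIME_SECS from the last heartbeat, null before the first one
        int nimbusTimeSecs;           // Nimbus-side time of the last liveness refresh
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // RPC path: refresh nimbusTimeSecs on every heartbeat, whatever TIME_SECS says.
    void updateFromRpcHb(String executorId, int reportedTimeSecs, int nowSecs) {
        Entry e = cache.computeIfAbsent(executorId, k -> new Entry());
        e.lastReportedTimeSecs = reportedTimeSecs;
        e.nimbusTimeSecs = nowSecs;
    }

    // ZK path: refresh nimbusTimeSecs only when TIME_SECS differs from the previous
    // heartbeat, so an executor whose stats-based TIME_SECS is stuck still times out.
    void updateFromZkHb(String executorId, int reportedTimeSecs, int nowSecs) {
        Entry e = cache.computeIfAbsent(executorId, k -> new Entry());
        if (e.lastReportedTimeSecs == null || e.lastReportedTimeSecs != reportedTimeSecs) {
            e.lastReportedTimeSecs = reportedTimeSecs;
            e.nimbusTimeSecs = nowSecs;
        }
    }

    // An executor counts as timed out when no refresh happened within timeoutSecs
    // (the exact comparison used by Storm is an assumption here).
    boolean isTimedOut(String executorId, int nowSecs, int timeoutSecs) {
        Entry e = cache.get(executorId);
        return e == null || nowSecs - e.nimbusTimeSecs >= timeoutSecs;
    }
}
```

In this model, two heartbeats carrying the same TIME_SECS leave nimbusTimeSecs untouched on the ZK path, so the executor is reported as timed out once the timeout elapses. The same sequence on the RPC path keeps the executor alive because every heartbeat refreshes the timestamp.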