fix(HeartbeatCache): Falsely timing out alive executors when heartbeat TIME_SECS does not advance by DiogoP98 · Pull Request #8420 · apache/storm

DiogoP98 · 2026-03-05T15:03:12Z

What is the purpose of the change

After a Nimbus restart, healthy topologies with the correct number of running workers are repeatedly rescheduled. In the Nimbus logs, the following message appears for executors that are actually running:

Executor <topo-id>:<executor> not alive

This causes Nimbus to treat the corresponding workers as down, which continuously triggers rescheduling of those topologies.

Root Cause

In HeartbeatCache, the internal liveness timestamp (nimbusTimeSecs) was only refreshed when the heartbeat's TIME_SECS value changed between consecutive calls.

For RPC-based heartbeats, TIME_SECS represents the worker's wall-clock send time. If two heartbeats are processed within the same second, TIME_SECS is identical, nimbusTimeSecs is not refreshed, and after nimbus.task.timeout.secs (default 30s) the executor is falsely considered dead — even though it is actively heartbeating.
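To make the failure mode concrete, here is a minimal sketch of the pre-fix rule as described above. It is a simplified model, not the actual HeartbeatCache code: the class name, the field lastReportedTimeSecs, and the explicit nowSecs parameter are assumptions made for illustration, and the `>=` comparison in the timeout check is likewise assumed.

```java
// Hypothetical, simplified model of the pre-fix liveness rule; not Storm's actual code.
final class LegacyLivenessModel {
    // TIME_SECS carried by the last heartbeat that changed the state (null until the first one).
    private Integer lastReportedTimeSecs;
    // Nimbus-side timestamp that the timeout check is measured against.
    private int nimbusTimeSecs;

    LegacyLivenessModel(int nowSecs) {
        this.nimbusTimeSecs = nowSecs;
    }

    // Pre-fix rule: nimbusTimeSecs is refreshed only when TIME_SECS differs
    // from the value seen on the previous call.
    void onHeartbeat(int reportedTimeSecs, int nowSecs) {
        if (lastReportedTimeSecs == null || lastReportedTimeSecs != reportedTimeSecs) {
            lastReportedTimeSecs = reportedTimeSecs;
            nimbusTimeSecs = nowSecs;
        }
    }

    // The executor is considered dead once the timeout has elapsed since the last refresh.
    boolean isTimedOut(int nowSecs, int timeoutSecs) {
        return nowSecs - nimbusTimeSecs >= timeoutSecs;
    }
}
```

In this model, an executor whose heartbeats keep arriving with the same TIME_SECS is reported as timed out once timeoutSecs have passed since the last refresh, even though a heartbeat was received on every call.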

Fix

The heartbeat update logic is split into two separate methods to fix the RPC path while preserving backwards compatibility with legacy ZK-based topologies:

updateFromRpcHb — always refreshes nimbusTimeSecs on every heartbeat, so idle-but-alive executors are never falsely timed out.
updateFromZkHb — retains the original behaviour, only refreshing nimbusTimeSecs when TIME_SECS advances. This preserves zombie detection for legacy topologies where TIME_SECS is stats-based and genuinely stops advancing when an executor is stuck.
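The split between the two paths can be sketched as follows. Only the method names updateFromRpcHb and updateFromZkHb come from the description above. The enclosing class, the fields, the explicit nowSecs parameter, and the use of "differs from the previous value" as the test for TIME_SECS advancing are assumptions for illustration, and the real methods may have different signatures.

```java
// Hypothetical sketch of the two update paths; only the method names come from the PR description.
final class ExecutorLivenessSketch {
    // TIME_SECS from the last heartbeat that was recorded (null until the first one).
    private Integer lastReportedTimeSecs;
    // Nimbus-side timestamp that the timeout check is measured against.
    private int nimbusTimeSecs;

    ExecutorLivenessSketch(int nowSecs) {
        this.nimbusTimeSecs = nowSecs;
    }

    // RPC path: every received heartbeat is treated as proof of life,
    // so nimbusTimeSecs is refreshed unconditionally.
    void updateFromRpcHb(int reportedTimeSecs, int nowSecs) {
        lastReportedTimeSecs = reportedTimeSecs;
        nimbusTimeSecs = nowSecs;
    }

    // ZK path: original behaviour, refresh only when TIME_SECS changes,
    // so an executor whose stats-based TIME_SECS is stuck still times out.
    void updateFromZkHb(int reportedTimeSecs, int nowSecs) {
        if (lastReportedTimeSecs == null || lastReportedTimeSecs != reportedTimeSecs) {
            lastReportedTimeSecs = reportedTimeSecs;
            nimbusTimeSecs = nowSecs;
        }
    }

    boolean isTimedOut(int nowSecs, int timeoutSecs) {
        return nowSecs - nimbusTimeSecs >= timeoutSecs;
    }
}
```

With this shape, repeated RPC heartbeats carrying an identical TIME_SECS keep the executor alive, while repeated ZK heartbeats with an identical TIME_SECS still let the timeout expire.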

How was the change tested

Compiled the project and tested the new build on Nimbus machines, forcing Nimbus restarts and leadership changes.

paxadax

LGTM, thank you for contributing a fix for this bug.

reiabreu · 2026-03-05T16:54:29Z

Thanks for the contribution, will try to review it ASAP.

reiabreu · 2026-03-06T11:13:22Z

Thanks for the fix. Will be merging it now.

DiogoP98 force-pushed the fix-heartbeat-cache branch from 697819b to 165557f March 5, 2026 15:05

fix(HeartbeatCache): Falsely timing out alive executors when heartbeat TIME_SECS does not advance

76a9856

DiogoP98 force-pushed the fix-heartbeat-cache branch from 165557f to 76a9856 March 5, 2026 15:06

paxadax approved these changes Mar 5, 2026

dovalealves approved these changes Mar 5, 2026

reiabreu requested a review from rzo1 March 5, 2026 16:55

reiabreu added the bug label Mar 6, 2026

reiabreu modified the milestone: 2.8.5 Mar 6, 2026

reiabreu approved these changes Mar 6, 2026

rzo1 approved these changes Mar 6, 2026

reiabreu merged commit 14efb2e into apache:master Mar 6, 2026
12 checks passed

reiabreu merged 1 commit into apache:master from DiogoP98:fix-heartbeat-cache

5 participants
