Nimbus deployment causing Topologies rebalancing #8419

@DiogoP98

Description

After a Nimbus deployment or restart, healthy topologies with the correct number of running workers are being rescheduled. This manifests as workers being killed and restarted unnecessarily, causing intermittent processing disruption until the cluster stabilizes.

To Reproduce

  1. Have one or more topologies running with low/idle tuple throughput.
  2. Deploy (or restart) Nimbus so that a leadership change occurs.
  3. Observe repeated "Executor X not alive" log entries in Nimbus for executors that are actually running.
  4. Observe topologies being rescheduled despite having the correct number of workers.

Expected behavior

After Nimbus restarts and reconnects to the cluster, healthy topologies should remain stable with no rescheduling.

Actual behavior

Nimbus marks alive executors as timed out, causing their workers to be excluded from the assignment count. Since numAssignedWorkers < numDesiredWorkers, Nimbus continuously triggers rescheduling for those topologies.

Identified root cause

In HeartbeatCache.updateFromHb(), the internal liveness timestamp (nimbusTimeSecs) is only refreshed when the executor's heartbeat stats timestamp (TIME_SECS) changes between consecutive heartbeats. For RPC-based heartbeats, TIME_SECS is the worker's wall-clock send time, so if two heartbeats are processed within the same second, the value is identical and nimbusTimeSecs is not updated. If nimbusTimeSecs goes without a refresh for nimbus.task.timeout.secs (default 30s), the executor is considered dead even though it is actively heartbeating.
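To make the failure mode concrete, here is a minimal sketch of the change-gated refresh described above. It is a simplified, hypothetical model for a single executor, not Storm's actual HeartbeatCache code: the class name ChangeGatedLiveness, the isTimedOut helper, and the explicit nowSecs parameter are all assumptions made for illustration.

```java
// Hypothetical single-executor model of the behavior described in this issue.
// It is NOT Storm's HeartbeatCache; names and signatures are illustrative only.
final class ChangeGatedLiveness {
    // Last TIME_SECS value reported by the worker (null until the first heartbeat).
    private Integer lastReportedTimeSecs;
    // Nimbus-side liveness timestamp (stands in for nimbusTimeSecs).
    private int nimbusTimeSecs;

    /** Records a heartbeat, refreshing liveness only when the reported timestamp changed. */
    void updateFromHb(int reportedTimeSecs, int nowSecs) {
        if (lastReportedTimeSecs == null || lastReportedTimeSecs != reportedTimeSecs) {
            nimbusTimeSecs = nowSecs;
        }
        lastReportedTimeSecs = reportedTimeSecs;
    }

    /** True once the timeout has elapsed without a liveness refresh. */
    boolean isTimedOut(int nowSecs, int timeoutSecs) {
        return nowSecs - nimbusTimeSecs >= timeoutSecs;
    }
}
```

In this model, any heartbeat that arrives carrying the same reported timestamp as the previous one leaves nimbusTimeSecs untouched, so an executor can be flagged as timed out while heartbeats are still arriving.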

This is most visible after a Nimbus restart when the cache is empty and all executors need to re-establish their liveness within the timeout window.

Proposed fix

Update nimbusTimeSecs on every received heartbeat rather than only when TIME_SECS changes, so that liveness is correctly tied to heartbeat arrival rather than stats freshness.
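The proposed change could look roughly like the sketch below. It is a simplified, hypothetical single-executor model, not a patch against Storm's HeartbeatCache: the class name ArrivalBasedLiveness, the isTimedOut helper, and the explicit nowSecs parameter are assumptions made for illustration.

```java
// Hypothetical single-executor model of the proposed fix.
// It is NOT Storm's HeartbeatCache; names and signatures are illustrative only.
final class ArrivalBasedLiveness {
    // Last TIME_SECS value reported by the worker; kept only for stats freshness.
    private int lastReportedTimeSecs;
    // Nimbus-side liveness timestamp (stands in for nimbusTimeSecs).
    private int nimbusTimeSecs;
    // Whether any heartbeat has been received since the cache entry was created.
    private boolean seen;

    /** Records a heartbeat and always refreshes liveness on arrival. */
    void updateFromHb(int reportedTimeSecs, int nowSecs) {
        lastReportedTimeSecs = reportedTimeSecs;
        nimbusTimeSecs = nowSecs;
        seen = true;
    }

    /** True if no heartbeat was ever seen or none arrived within the timeout. */
    boolean isTimedOut(int nowSecs, int timeoutSecs) {
        return !seen || nowSecs - nimbusTimeSecs >= timeoutSecs;
    }

    int lastReportedTimeSecs() {
        return lastReportedTimeSecs;
    }
}
```

Here liveness depends only on when a heartbeat last arrived, so repeated heartbeats with an identical reported timestamp still keep the executor alive, while an executor that stops heartbeating still times out.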

Environment

  • Apache Storm 2.8.0
  • Java 11
