/raft/node endpoint lags behind actual raft leadership state #3265

@auricom

Summary

The /raft/node HTTP endpoint returns stale leader information after a leadership change. The actual Raft election completes and the new leader begins producing blocks, but the endpoint continues to report the old leader (or unknown) for several seconds.

Observed behavior

When monitoring a 3-node cluster under cyclic leader-kill tests, the following pattern repeats consistently:

  • A leader node is killed (SIGTERM or SIGKILL)
  • New blocks appear on surviving nodes within 2–3 seconds (confirmed by querying the block store endpoint)
  • /raft/node on surviving nodes still returns the old leader ID or {"is_leader": false} / unknown
  • /raft/node reflects the new leader only 10–30 seconds after the first new block (observed up to 148s in a P2P disruption scenario)

Example from a SIGTERM test (108-cycle, 9-hour run):

t=0s    leader killed
t=3s    new block at height N+1 observed on surviving nodes  ← election already done
t=13s   /raft/node first returns new leader node_id          ← 10s lag

The gap between first new block and first correct /raft/node response was 10–28 seconds across 108 consecutive cycles. In P2P disruption tests it reached 148 seconds in one case.

Expected behavior

/raft/node should reflect the current Raft leader promptly after an election completes — ideally within one or two heartbeat intervals. Since block production already reflects the new leader, the endpoint is clearly trailing behind the internal Raft FSM state.
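
To make this expectation checkable in a test harness, here is a minimal sketch of a pass/fail helper. The function name `within_heartbeats` and the 1000 ms heartbeat in the usage line are illustrative assumptions; the actual heartbeat interval configured in the cluster is not stated in this issue.

```shell
# within_heartbeats LAG_MS HEARTBEAT_MS [MAX_INTERVALS]
# Hypothetical helper: succeeds (exit 0) when the observed endpoint lag is at
# most MAX_INTERVALS heartbeat intervals (default 2), fails (exit 1) otherwise.
within_heartbeats() {
  local lag_ms=$1 heartbeat_ms=$2 max_intervals=${3:-2}
  [ "$lag_ms" -le $(( heartbeat_ms * max_intervals )) ]
}
```

For example, with an assumed 1000 ms heartbeat, `within_heartbeats 10000 1000 || echo "endpoint lag too high"` would flag the 10s lag from the SIGTERM run above.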

Impact

  • External monitoring and orchestration tools that rely on /raft/node to detect leadership changes will see false negatives for 10–150 seconds after a failover.
  • Health checks or load balancers routing writes based on /raft/node may continue sending traffic to a dead or demoted node long after the election.
  • In our test harness, this manifests as a systematic "block/leader gap" artifact where blocks are being produced under a new leader that the endpoint doesn't yet acknowledge.

Reproduction

Run a cyclic SIGTERM or hard-reboot test against a 3-node cluster. Poll /raft/node and a block store endpoint (e.g. StoreService/GetBlock) on all nodes at 2s intervals. Compare the timestamp of the first new block after a kill versus the timestamp when /raft/node first returns the new node_id.

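# Example polling loop
while true; do
  echo "=== $(date) ==="
  for node in poc-ha-1 poc-ha-2 poc-ha-3; do
    echo -n "$node raft: "
    # Bound each request so a killed or rebooting node cannot stall the loop
    resp=$(curl -s --max-time 2 "http://$node:7331/raft/node") || { echo "unreachable"; continue; }
    echo "$resp" | jq -r '.node_id + " leader=" + (.is_leader|tostring)'
  done
  sleep 2
done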
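
The comparison step can be sketched as a small post-processing function over the poll log. This is only an illustration under stated assumptions: the log format (`<epoch_seconds> <kind> <value>`, one sample per line in chronological order, where `kind` is `block` with a height or `raft` with the leader ID the endpoint reported) and the function name `leader_lag` are hypothetical, not part of ev-node or the test harness, and the caller is assumed to know the kill time and the new leader's node ID.

```shell
#!/usr/bin/env bash
# leader_lag LOGFILE KILL_TS NEW_LEADER
# Hypothetical log format (one sample per line, chronological):
#   <epoch_seconds> block <height>
#   <epoch_seconds> raft  <leader_id_reported_by_endpoint>
# Prints "<first_new_block_ts> <first_correct_raft_ts> <lag_seconds>".
# Exits non-zero if either event is missing from the log.
leader_lag() {
  awk -v kill="$2" -v leader="$3" '
    # Track the highest block height seen up to the kill.
    $2 == "block" && $1 <= kill { if ($3 + 0 > base) base = $3 + 0 }
    # First block above that height after the kill: a new leader is producing.
    $2 == "block" && $1 > kill && block_ts == "" && $3 + 0 > base { block_ts = $1 }
    # First sample after the kill where the endpoint names the new leader.
    $2 == "raft" && $1 > kill && raft_ts == "" && $3 == leader { raft_ts = $1 }
    END {
      if (block_ts == "" || raft_ts == "") exit 1
      print block_ts, raft_ts, raft_ts - block_ts
    }
  ' "$1"
}
```

Called as `leader_lag poll.log "$kill_ts" "$new_leader_id"` once per cycle, the third output field is the block/leader gap described above.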

Environment

  • Cluster: 3-node Raft (ev-node)
  • Tests: SIGTERM cyclic (108 cycles), hard reboot (84 cycles), P2P disruption (60 cycles)
  • Poll interval: 2s
  • Consistent across all test types; worst case under P2P disruption (+148s)
