/raft/node endpoint lags behind actual raft leadership state

## Summary

The `/raft/node` HTTP endpoint returns stale leader information after a leadership change. The actual Raft election completes and the new leader begins producing blocks, but the endpoint continues to report the old leader (or `unknown`) for several seconds.

## Observed behavior

When monitoring a 3-node cluster under cyclic leader-kill tests, the following pattern repeats consistently:

- A leader node is killed (SIGTERM or SIGKILL)
- New blocks appear on surviving nodes within **2–3 seconds** (confirmed by querying the block store endpoint)
- `/raft/node` on surviving nodes still reports the old leader ID, `{"is_leader": false}`, or `unknown`
- `/raft/node` reflects the new leader only **10–30 seconds later** (observed up to 148s in a P2P disruption scenario)

Example from a SIGTERM test (108-cycle, 9-hour run):

```
t=0s    leader killed
t=3s    new block at height N+1 observed on surviving nodes  ← election already done
t=13s   /raft/node first returns new leader node_id          ← 10s lag
```

The gap between first new block and first correct `/raft/node` response was **10–28 seconds** across 108 consecutive cycles. In P2P disruption tests it reached **148 seconds** in one case.
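The range quoted above comes from reducing one gap value per cycle. A minimal sketch of that reduction follows, assuming the gaps have already been written one per line (in seconds) to a file; `gap_stats` and `gaps.txt` are hypothetical names and are not part of the test harness described in this issue.

```bash
# gap_stats: read one gap (in seconds) per line on stdin and print
# the sample count, minimum, maximum, and mean.
# Hypothetical helper; the input format is an assumption.
gap_stats() {
  awk '
    { v = $1 + 0 }                      # force numeric interpretation
    NR == 1 { min = v; max = v }
    { if (v < min) min = v; if (v > max) max = v; sum += v }
    END {
      if (NR > 0)
        printf "n=%d min=%d max=%d mean=%.1f\n", NR, min, max, sum / NR
    }
  '
}
```

Usage would look like `gap_stats < gaps.txt`. Note that with a 2s poll interval, each individual gap is only accurate to within roughly one poll period.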

## Expected behavior

`/raft/node` should reflect the current Raft leader promptly after an election completes, ideally within one or two heartbeat intervals. Since block production already reflects the new leader, the endpoint is clearly trailing behind the internal Raft FSM state.

## Impact

- External monitoring and orchestration tools that rely on `/raft/node` to detect leadership changes will see false negatives for 10–148 seconds (the observed range) after a failover.
- Health checks or load balancers routing writes based on `/raft/node` may continue sending traffic to a dead or demoted node long after the election.
- In our test harness, this manifests as a systematic "block/leader gap" artifact where blocks are being produced under a new leader that the endpoint doesn't yet acknowledge.

## Reproduction

Run a cyclic SIGTERM or hard-reboot test against a 3-node cluster. Poll `/raft/node` and a block store endpoint (e.g. `StoreService/GetBlock`) on all nodes at 2s intervals. Compare the timestamp of the first new block after a kill versus the timestamp when `/raft/node` first returns the new `node_id`.

```bash
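# Example polling loop: prints each node's /raft/node view every 2s.
# The block store endpoint has to be polled separately for the comparison.
while true; do
  echo "=== $(date) ==="
  for node in poc-ha-1 poc-ha-2 poc-ha-3; do
    echo -n "$node raft: "
    # -f and --max-time keep a killed node from stalling the loop or printing nothing
    resp=$(curl -sf --max-time 2 "http://$node:7331/raft/node") \
      && jq -r '.node_id + " leader=" + (.is_leader|tostring)' <<<"$resp" \
      || echo "unreachable or invalid response"
  done
  sleep 2
done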
```
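The timestamp comparison described above can be automated once the poll results are logged. The sketch below assumes a hypothetical log format of `<epoch_seconds> <event> <value>` lines, where the event is `kill` (with the killed node ID), `block` (logged only when a new height is first observed), or `leader` (the `node_id` of whichever node reports `is_leader` as true, or `unknown`). Neither the format nor the `leader_gap` helper comes from ev-node; both are illustrative.

```bash
# leader_gap: for each kill in the log, print the number of seconds between
# the first new block after the kill and the first /raft/node sample that
# names a leader other than the killed node.
# Assumed (hypothetical) log lines:
#   <epoch_seconds> kill   <node_id>
#   <epoch_seconds> block  <height>
#   <epoch_seconds> leader <node_id|unknown>
leader_gap() {
  awk '
    # A kill starts a new cycle and clears the previous block timestamp.
    $2 == "kill" { killed = $3; block_ts = ""; next }

    # Remember only the first new block seen after the kill.
    $2 == "block" && killed != "" && block_ts == "" { block_ts = $1; next }

    # First sample naming a real, different leader closes the cycle.
    $2 == "leader" && killed != "" && block_ts != "" \
        && $3 != killed && $3 != "unknown" {
      print $1 - block_ts
      killed = ""
    }
  ' "$1"
}
```

Calling `leader_gap poll.log` (with `poll.log` being whatever file the harness writes) prints one gap per kill cycle. Cycles in which the endpoint never reports a new leader produce no output, so they would need to be counted separately.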

## Environment

- Cluster: 3-node Raft (ev-node)
- Tests: SIGTERM cyclic (108 cycles), hard reboot (84 cycles), P2P disruption (60 cycles)
- Poll interval: 2s
- Consistent across all test types; worst case under P2P disruption (148s gap)

/raft/node endpoint lags behind actual raft leadership state #3265
