Better failure detector coverage by making heartbeat target side active #25969
Labels
1 - triaged
t:cluster
Originally proposed by @jrudolph
Each node in an Akka Cluster picks 5 (configurable) other nodes to monitor for failure detection. The heartbeat messages are request-response. The requests are sent at a fixed interval (1 second), and when the response is received the failure detector is updated on the sender side. The target side is passive: it only replies to the heartbeat requests.
Introducing a third message would allow a failure detector instance on the target side as well:
Heartbeat
HeartbeatRsp
HeartbeatRsp2
This would give better failure detector coverage in larger clusters, and it would increase the likelihood that both sides of a network partition have the same view.
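To make the coverage claim concrete, here is a rough sketch that counts how many peers a node holds failure detector observations for, with and without an active target side. The selection rule (the next `count` nodes in a sorted ring) and the names `monitored_targets` and `observed_peers` are assumptions made for illustration; the description above only says that each node picks 5 other nodes.

```python
from typing import List, Set


def monitored_targets(nodes: List[str], node: str, count: int = 5) -> List[str]:
    """Hypothetical selection rule: the next `count` nodes after `node` in a sorted ring."""
    ring = sorted(nodes)
    i = ring.index(node)
    others = [ring[(i + k) % len(ring)] for k in range(1, len(ring))]
    return others[:count]


def observed_peers(nodes: List[str], node: str, count: int = 5,
                   target_side_active: bool = False) -> Set[str]:
    """Peers for which `node` would hold a failure detector instance."""
    observed = set(monitored_targets(nodes, node, count))
    if target_side_active:
        # The target also tracks every node that sends it heartbeats.
        observed |= {
            other for other in nodes
            if other != node and node in monitored_targets(nodes, other, count)
        }
    return observed


nodes = [f"node-{i:02d}" for i in range(20)]
passive = observed_peers(nodes, "node-00")
active = observed_peers(nodes, "node-00", target_side_active=True)
print(len(passive), len(active))
```

Under these assumptions, a node in a 20-node cluster observes twice as many peers once the target side is active, with no additional heartbeat traffic in the variant that reuses the existing requests.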
Perhaps the third message isn't even necessary? The failure detector observations on the target side can be driven by the heartbeat request messages themselves, which are supposed to arrive at a fixed interval. The failure detector only tracks message inter-arrival times, not request/response latency.
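A minimal sketch of that idea, assuming a simplified deadline-style detector rather than the detector Akka actually uses. The class names `ArrivalDetector` and `TargetSide` and the `acceptable_pause` value are hypothetical. The target records when each heartbeat request arrives and suspects a sender once requests stop arriving for too long.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArrivalDetector:
    """Tracks inter-arrival times of messages from one peer (simplified sketch)."""
    expected_interval: float = 1.0  # seconds, matching the 1 second heartbeat interval
    acceptable_pause: float = 3.0   # hypothetical tolerance before suspecting the peer
    last_arrival: Optional[float] = None
    intervals: List[float] = field(default_factory=list)

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def is_available(self, now: float) -> bool:
        if self.last_arrival is None:
            return True  # nothing observed yet, so no basis for suspicion
        return now - self.last_arrival <= self.expected_interval + self.acceptable_pause


class TargetSide:
    """Replies to heartbeat requests and uses their arrival to feed a detector."""

    def __init__(self) -> None:
        self.detectors: Dict[str, ArrivalDetector] = {}

    def on_heartbeat(self, sender: str, now: float) -> str:
        self.detectors.setdefault(sender, ArrivalDetector()).heartbeat(now)
        return "HeartbeatRsp"

    def suspected(self, now: float) -> List[str]:
        return sorted(s for s, d in self.detectors.items() if not d.is_available(now))


target = TargetSide()
for t in (0.0, 1.0, 2.0):
    target.on_heartbeat("A", now=t)
print(target.suspected(now=3.0), target.suspected(now=7.0))
```

Time is passed in explicitly so the sketch stays deterministic. Note that the detector never looks at when the response was sent, which is why no third message is needed in this variant.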
The difficult part of adding this is the dynamic nature of cluster membership. When a node decides to stop monitoring another node because members have been added or removed, the failure detector instance must also be removed on the target side. Otherwise the target would see the heartbeats stop and wrongly suspect the sender. This requires some message exchange and acknowledgment before the sender stops sending heartbeats.
Heartbeat
HeartbeatRsp
Heartbeat
HeartbeatRsp
Heartbeat
HeartbeatRsp
LastHeartbeat
LastHeartbeatRsp
LastHeartbeat
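One possible reading of the exchange above, sketched under stated assumptions: the sender keeps sending `LastHeartbeat` on each tick until it receives `LastHeartbeatRsp`, and the target drops its detector for that sender on the first `LastHeartbeat` it sees. The class names `MonitoredNode` and `MonitoringNode` and the retry-until-acknowledged behavior are illustrative guesses, not a confirmed design.

```python
from typing import Dict, Optional, Set


class MonitoredNode:
    """Target side: keeps per-sender state until the sender says it is done."""

    def __init__(self) -> None:
        self.detectors: Set[str] = set()  # senders we hold a failure detector for

    def receive(self, sender: str, message: str) -> str:
        if message == "Heartbeat":
            self.detectors.add(sender)
            return "HeartbeatRsp"
        if message == "LastHeartbeat":
            # discard() is idempotent, so a retried LastHeartbeat is harmless.
            self.detectors.discard(sender)
            return "LastHeartbeatRsp"
        raise ValueError(f"unexpected message: {message}")


class MonitoringNode:
    """Sender side: keeps sending LastHeartbeat until it is acknowledged."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, str] = {}  # target -> "monitoring" or "stopping"

    def start_monitoring(self, target: str) -> None:
        self.state[target] = "monitoring"

    def stop_monitoring(self, target: str) -> None:
        if target in self.state:
            self.state[target] = "stopping"

    def next_message(self, target: str) -> Optional[str]:
        state = self.state.get(target)
        if state == "monitoring":
            return "Heartbeat"
        if state == "stopping":
            return "LastHeartbeat"
        return None  # handshake finished, nothing more to send

    def on_reply(self, target: str, reply: str) -> None:
        if reply == "LastHeartbeatRsp" and self.state.get(target) == "stopping":
            del self.state[target]
```

A usage example in which the first acknowledgment is lost:

```python
a, b = MonitoringNode("A"), MonitoredNode()
a.start_monitoring("B")
a.on_reply("B", b.receive("A", a.next_message("B")))  # Heartbeat / HeartbeatRsp
a.stop_monitoring("B")
b.receive("A", a.next_message("B"))                   # LastHeartbeat, reply lost
a.on_reply("B", b.receive("A", a.next_message("B")))  # retried LastHeartbeat, acknowledged
```

After the last line, `b` holds no detector for `A` and `a` has nothing left to send to `B`.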