Better failure detector coverage by making heartbeat target side active #25969

patriknw · 2018-11-23T07:11:21Z

Originally proposed by @jrudolph

Each node in an Akka Cluster picks 5 (configurable) other nodes to monitor for failure detection. The heartbeat messages are request-response. Requests are sent at a fixed interval (1 second), and when the response arrives the failure detector is updated on the sender side. The target side is passive: it only replies to heartbeat requests.

By introducing a third message, there could also be a failure detector instance on the target side.

A -> B: Heartbeat
B -> A: HeartbeatRsp
A -> B: HeartbeatRsp2

This would give better failure detector coverage in larger clusters, and it would increase the likelihood that both sides of a network partition have the same view.
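To make the coverage argument concrete, here is a rough sketch that counts how many peers a node holds failure detector instances for, with and without an active target side. It assumes nodes are placed on a sorted ring and each node monitors the next `n` nodes; that selection rule, and the names `monitored_by` and `detector_peers`, are illustrative assumptions and not taken from this issue.

```python
def monitored_by(nodes, node, n=5):
    """Peers that `node` sends heartbeats to: the next n nodes on a sorted ring.

    The ring ordering is an assumption made for this sketch.
    """
    ring = sorted(nodes)
    i = ring.index(node)
    count = min(n, len(ring) - 1)
    return [ring[(i + k) % len(ring)] for k in range(1, count + 1)]


def detector_peers(nodes, node, n=5, target_active=False):
    """Peers for which `node` would hold a failure detector instance."""
    peers = set(monitored_by(nodes, node, n))
    if target_active:
        # With an active target side, a node also tracks every node
        # that sends heartbeats to it.
        peers |= {m for m in nodes if node in monitored_by(nodes, m, n)}
    return peers


nodes = [f"n{i:02d}" for i in range(20)]
passive = detector_peers(nodes, "n00")
active = detector_peers(nodes, "n00", target_active=True)
print(len(passive), len(active))
```

Under these assumptions, each node in a 20-node cluster goes from observing 5 peers to observing 10, without any additional heartbeat targets.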

Perhaps the third message isn't even necessary. The failure detector observations on the target side can be driven by the heartbeat request messages, which are supposed to arrive at a fixed interval. The failure detector only tracks message inter-arrival times, not request/response latency.
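A minimal sketch of a detector that is fed only by arrival timestamps, which is all the target side would have if it observed incoming Heartbeat requests. This is a simplified phi-accrual-style calculation under a normal model, not Akka's actual implementation; the class name, the defaults (`max_samples`, `min_std_ms`, `threshold`) and the millisecond timestamps are assumptions for illustration.

```python
import math
from collections import deque


class InterArrivalDetector:
    """Simplified phi-accrual-style detector driven by arrival times only."""

    def __init__(self, max_samples=100, min_std_ms=100.0, threshold=8.0):
        self.intervals = deque(maxlen=max_samples)  # recent inter-arrival times
        self.last_arrival_ms = None
        self.min_std_ms = min_std_ms
        self.threshold = threshold

    def heartbeat(self, now_ms):
        """Record one arrival. On B this is called per Heartbeat request."""
        if self.last_arrival_ms is not None:
            self.intervals.append(now_ms - self.last_arrival_ms)
        self.last_arrival_ms = now_ms

    def phi(self, now_ms):
        """Suspicion level: grows as the silence since the last arrival grows."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), self.min_std_ms)
        elapsed = now_ms - self.last_arrival_ms
        # Probability that the next arrival comes later than `elapsed`.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # guard against log10(0)
        return -math.log10(p_later)

    def is_available(self, now_ms):
        return self.phi(now_ms) < self.threshold


# B observes Heartbeat requests from A arriving once per second.
fd = InterArrivalDetector()
for t in range(0, 6000, 1000):
    fd.heartbeat(t)
print(fd.is_available(5500), fd.is_available(10000))
```

Note that nothing here depends on a response being sent or received, so the same class could serve on either side of the exchange.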

The difficult part is the dynamic nature of cluster membership. When a node decides to stop monitoring another node because members have been added or removed, the failure detector instance must be removed on the target side as well. This would require a message exchange with acknowledgment before the sender stops sending heartbeats.

A -> B: Heartbeat
B -> A: HeartbeatRsp
A -> B: Heartbeat
B -> A: HeartbeatRsp
A -> B: Heartbeat
B -> A: HeartbeatRsp
...
A -> B: LastHeartbeat
- when B receives this it will remove the FD instance
B -> A: LastHeartbeatRsp
- when A receives this it will remove the FD instance, otherwise it will continue sending LastHeartbeat
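The exchange above can be sketched as two small state holders. This is only an illustration of the proposed handshake, not an implementation: the class names `Monitor` and `Target`, the `round_trip` helper, and the use of a plain timestamp as a stand-in for a failure detector instance are all assumptions.

```python
class Target:
    """Node B: holds one detector entry per node that heartbeats it."""

    def __init__(self):
        self.detectors = {}  # sender -> last arrival time (stand-in for an FD)

    def receive(self, sender, msg, now):
        if msg == "Heartbeat":
            self.detectors[sender] = now
            return "HeartbeatRsp"
        if msg == "LastHeartbeat":
            # Idempotent: a retried LastHeartbeat is still acknowledged.
            self.detectors.pop(sender, None)
            return "LastHeartbeatRsp"
        return None


class Monitor:
    """Node A: heartbeats one target, then drains with LastHeartbeat."""

    def __init__(self, name, target_name):
        self.name = name
        self.target_name = target_name
        self.detectors = {target_name: None}
        self.stopping = False

    def stop_monitoring(self):
        self.stopping = True

    def next_message(self):
        if self.target_name not in self.detectors:
            return None  # handshake complete, nothing more to send
        return "LastHeartbeat" if self.stopping else "Heartbeat"

    def receive(self, msg, now):
        if msg == "HeartbeatRsp" and self.target_name in self.detectors:
            self.detectors[self.target_name] = now
        elif msg == "LastHeartbeatRsp":
            self.detectors.pop(self.target_name, None)


def round_trip(a, b, now, drop_reply=False):
    """One scheduled tick: A sends, B replies, the reply may be lost."""
    msg = a.next_message()
    if msg is None:
        return None
    reply = b.receive(a.name, msg, now)
    if not drop_reply:
        a.receive(reply, now)
    return msg


a, b = Monitor("A", "B"), Target()
round_trip(a, b, now=1)
a.stop_monitoring()
round_trip(a, b, now=2, drop_reply=True)  # LastHeartbeatRsp lost, A retries
round_trip(a, b, now=3)
print(a.detectors, b.detectors)
```

The sketch does not handle reordering: a delayed Heartbeat that arrives at B after LastHeartbeat would recreate the entry on B, so a real design would need some way to tell such stragglers apart.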


patriknw added the labels "1 - triaged" (tickets that are safe to pick up for contributing in terms of likeliness of being accepted) and "t:cluster" on Nov 23, 2018

patriknw mentioned this issue Dec 10, 2018

Akka Sprint Plan 2018-11-19 akka/akka-meta#88

Closed

12 tasks
