Better failure detector coverage by making heartbeat target side active #25969

Open
patriknw opened this issue Nov 23, 2018 · 0 comments
Labels
1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:cluster

Comments

@patriknw
Member

Originally proposed by @jrudolph

Each node in an Akka Cluster picks 5 (configurable) other nodes to monitor for failure detection. The heartbeat messages are request-response. The request messages are scheduled at a fixed interval (1 second), and when the response is received the failure detector is updated on the sender side. The target side is passive and only replies to the heartbeat requests.
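To make the sender-side bookkeeping concrete, here is a minimal sketch of a detector that only looks at heartbeat inter-arrival times. It is a deliberately simplified stand-in (a mean plus `k` standard deviations threshold), not Akka's actual failure detector; the class name `InterArrivalDetector` and the parameters `k`, `min_std` and `first_interval` are assumptions made for illustration.

```python
from statistics import mean, pstdev


class InterArrivalDetector:
    """Toy detector: suspects a peer when the silence since the last
    heartbeat exceeds mean + k * stddev of observed inter-arrival times."""

    def __init__(self, k=4.0, min_std=0.1, first_interval=1.0):
        self.k = k                          # hypothetical tolerance factor
        self.min_std = min_std              # floor so a very regular history is not too strict
        self.intervals = [first_interval]   # seed with the expected heartbeat interval
        self.last_arrival = None

    def heartbeat(self, now):
        """Record that a heartbeat (or heartbeat response) arrived at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def is_available(self, now):
        """Return False once the current silence is longer than the history justifies."""
        if self.last_arrival is None:
            return True  # nothing observed yet
        std = max(pstdev(self.intervals), self.min_std)
        threshold = mean(self.intervals) + self.k * std
        return (now - self.last_arrival) <= threshold


# Sender side: update the detector each time a response arrives.
fd = InterArrivalDetector()
for arrival in [0.01, 1.01, 2.02, 3.01]:
    fd.heartbeat(arrival)
```

With responses arriving roughly once per second, `fd.is_available(3.5)` still holds, while a silence of several seconds makes it return `False`.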

By introducing a third message, there could also be a failure detector instance on the target side.

  • A -> B: Heartbeat
  • B -> A: HeartbeatRsp
  • A -> B: HeartbeatRsp2

This would give better failure detector coverage in larger clusters, and it would increase the likelihood that both sides of a network partition have the same view.

Perhaps the third message isn't even necessary? The failure detector observations on the target side can be driven by the heartbeat request messages, which are supposed to arrive at a fixed interval. The failure detector only tracks message inter-arrival times, not request/response latency.
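The two-message variant can be sketched as follows: the target records the arrival of each `Heartbeat` before replying, so both sides end up with observations about each other. This is only an illustration of the idea, not the cluster implementation; the `Node` class, the `last_seen` map, the fixed `acceptable_pause` timeout and the simulated `delay` are all hypothetical.

```python
class Node:
    def __init__(self, name, acceptable_pause=3.0):
        self.name = name
        self.acceptable_pause = acceptable_pause  # hypothetical fixed timeout in seconds
        self.last_seen = {}  # peer name -> arrival time of the latest message from that peer

    def send_heartbeat(self, target, now, delay=0.01):
        # A -> B: Heartbeat, arriving after a simulated network delay.
        target.on_heartbeat(self, now + delay, delay)

    def on_heartbeat(self, sender, now, delay):
        # Target-side observation, driven by the request itself (the proposed change).
        self.last_seen[sender.name] = now
        # B -> A: HeartbeatRsp
        sender.on_heartbeat_rsp(self, now + delay)

    def on_heartbeat_rsp(self, target, now):
        # Sender-side observation (the existing behaviour).
        self.last_seen[target.name] = now

    def suspects(self, peer, now):
        seen = self.last_seen.get(peer.name)
        return seen is not None and now - seen > self.acceptable_pause


a, b = Node("A"), Node("B")
for tick in range(5):
    a.send_heartbeat(b, float(tick))
```

After the loop, `b` has an observation for `"A"` even though it never initiated a heartbeat, so if the messages stop, both `a.suspects(b, ...)` and `b.suspects(a, ...)` eventually become true.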

The difficult part of adding this is the dynamic nature of cluster membership. When a node decides to stop monitoring another node because members have been added or removed, the failure detector instance must be removed on the target side as well. This would require some message exchange and acknowledgment before the sender stops sending heartbeats.

  • A -> B: Heartbeat
  • B -> A: HeartbeatRsp
  • A -> B: Heartbeat
  • B -> A: HeartbeatRsp
  • A -> B: Heartbeat
  • B -> A: HeartbeatRsp
  • ...
  • A -> B: LastHeartbeat
    • when B receives this it will remove the FD instance
  • B -> A: LastHeartbeatRsp
    • when A receives this it will remove the FD instance, otherwise it will continue sending LastHeartbeat
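The teardown exchange above can be sketched as two small state machines. This is a hedged illustration under simplifying assumptions (one message per tick, only replies can be lost); the classes `Sender` and `Target` and the helper `run` are hypothetical names, and only the message names come from the proposal.

```python
class Sender:
    """Side A: monitors a target and tears down with LastHeartbeat."""

    def __init__(self):
        self.has_fd = True
        self.stopping = False

    def stop_monitoring(self):
        self.stopping = True

    def next_message(self):
        if not self.has_fd:
            return None  # teardown acknowledged, nothing more to send
        return "LastHeartbeat" if self.stopping else "Heartbeat"

    def receive(self, msg):
        if msg == "LastHeartbeatRsp":
            self.has_fd = False  # acknowledged: safe to remove the FD instance
        # A HeartbeatRsp would update the failure detector here.


class Target:
    """Side B: keeps an FD instance while A keeps sending heartbeats."""

    def __init__(self):
        self.has_fd = False

    def receive(self, msg):
        if msg == "Heartbeat":
            self.has_fd = True
            return "HeartbeatRsp"
        if msg == "LastHeartbeat":
            self.has_fd = False  # idempotent, so repeated LastHeartbeat is harmless
            return "LastHeartbeatRsp"
        return None


def run(ticks, stop_at, dropped_replies=()):
    """Simulate the exchange; replies sent on ticks in `dropped_replies` are lost."""
    a, b = Sender(), Target()
    log = []
    for tick in range(ticks):
        if tick == stop_at:
            a.stop_monitoring()
        msg = a.next_message()
        if msg is None:
            break
        log.append(msg)
        reply = b.receive(msg)
        if tick not in dropped_replies:
            a.receive(reply)
    return a, b, log


a, b, log = run(ticks=10, stop_at=3, dropped_replies={3, 4})
```

In this run the first two `LastHeartbeatRsp` replies are lost, so A repeats `LastHeartbeat` until the third one is acknowledged, after which neither side holds an FD instance.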
@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:cluster labels Nov 23, 2018