Copyright (c) 2017 Tucker Barbour
Authors: Tucker Barbour (tucker.barbour@gmail.com).
An Erlang implementation of "The Phi Accrual Failure Detector" (Hayashibara, et al., 2004). This implementation is based on the implementations in Akka and Cassandra.
Add to a rebar3 project via rebar.config:
{deps, [{phi_failure_detector, {git, "https://github.com/ctbarbour/phi_failure_detector.git", {branch, "master"}}}]}.
Add to an erlang.mk project via Makefile:
DEPS = phi_failure_detector
dep_phi_failure_detector = git https://github.com/ctbarbour/phi_failure_detector.git master
To start detecting failures for a service endpoint, start the OTP application and start a new failure detector with a service label and identifier. In this case our service label is http and our identifier is {192,168,10,1}.
application:ensure_all_started(phi_failure_detector),
phi_failure_detector:new(http, {192,168,10,1}).
Start adding samples to the failure detector when you get a successful heartbeat from the service endpoint.
phi_failure_detector:heartbeat(http, {192,168,10,1}).
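One way to produce those heartbeats is a small probe that reports only successful checks. The sketch below is an illustration, not part of this library: the module name heartbeat_probe, the function probe/3, the TCP connect check, and the 1000 ms timeout are all assumptions. It takes the "record a heartbeat" action as a fun so that it does not depend on the return value of phi_failure_detector:heartbeat/2.

```erlang
-module(heartbeat_probe).
-export([probe/3]).

%% Hypothetical helper, not part of phi_failure_detector.
%% Tries to open a TCP connection to Ip:Port. On success it closes the
%% socket and calls OnSuccess (a zero-arity fun) to record the heartbeat.
%% On failure it records nothing and returns the error.
probe(Ip, Port, OnSuccess) when is_function(OnSuccess, 0) ->
    Timeout = 1000, % milliseconds; an assumed value, tune for your network
    case gen_tcp:connect(Ip, Port, [binary, {active, false}], Timeout) of
        {ok, Socket} ->
            ok = gen_tcp:close(Socket),
            _ = OnSuccess(), % return value is deliberately ignored
            ok;
        {error, Reason} ->
            {error, Reason}
    end.
```

A call might look like heartbeat_probe:probe({192,168,10,1}, 80, fun() -> phi_failure_detector:heartbeat(http, {192,168,10,1}) end), run on whatever interval suits the service. Port 80 is assumed here only because the service label is http.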
Check the φ value of the service.
phi_failure_detector:phi(http, {192,168,10,1}).
Get the φ of all endpoints with the same service label.
phi_failure_detector:phi(http).
The Phi Accrual Failure Detector is a failure detection algorithm that scales a level of suspicion dynamically, based on network conditions over time, rather than outputting a binary Up or Down result. For more detailed information I recommend reading the paper, or at least the abstract.
To dynamically scale the suspicion level of an endpoint, the Phi Accrual Failure Detector records successful heartbeats from a node and builds a distribution of the interarrival times. With this distribution we can calculate the probability that a heartbeat will still arrive at some time in the future. As network conditions change over time, so does the distribution of interarrival times. A node's suspicion level is therefore a continuous value and not just a binary one. We can make decisions based on how likely it is that a node has failed rather than thinking only in terms of failed or not failed. An application using a Phi Accrual Failure Detector can take precautionary measures when the likelihood of failure has reached a certain threshold and take more drastic measures when it has reached a higher threshold.
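The calculation described above can be sketched in a few lines of Erlang. This is a minimal illustration of the model from the paper, which defines φ as -log10 of the probability that a heartbeat arrives later than the time already elapsed, under a normal distribution of interarrival times. It is not the code of this library, which may use a different approximation. The module name phi_sketch, its functions, the 1.0 ms standard deviation floor, and the status atoms are all assumptions made for the example.

```erlang
-module(phi_sketch).
-export([intervals/1, phi/2, status/3]).

%% Hypothetical sketch, not part of phi_failure_detector.

%% Convert heartbeat arrival timestamps (ms, oldest first) into the
%% gaps between consecutive arrivals.
intervals([]) -> [];
intervals([_]) -> [];
intervals([A, B | Rest]) -> [B - A | intervals([B | Rest])].

%% phi for Elapsed ms since the last heartbeat, given the observed
%% interarrival times. Assumes those times are normally distributed.
phi(Elapsed, Intervals) when Intervals =/= [] ->
    N = length(Intervals),
    Mean = lists:sum(Intervals) / N,
    Var = lists:sum([(X - Mean) * (X - Mean) || X <- Intervals]) / N,
    %% Assumed floor of 1.0 ms so identical samples do not divide by zero.
    StdDev = max(math:sqrt(Var), 1.0),
    Y = (Elapsed - Mean) / StdDev,
    %% Probability that the next heartbeat arrives later than Elapsed:
    %% the upper tail of the normal distribution.
    PLater = 0.5 * math:erfc(Y / math:sqrt(2)),
    %% Clamp so log10 never sees 0.0; this caps phi at 300.
    -math:log10(max(PLater, 1.0e-300)).

%% Map phi onto two thresholds, as in the precautionary/drastic split.
status(Phi, _Warn, Fail) when Phi >= Fail -> failed;
status(Phi, Warn, _Fail) when Phi >= Warn -> suspect;
status(_Phi, _Warn, _Fail) -> healthy.
```

With samples of 900 ms and 1100 ms, a check made exactly one mean interval (1000 ms) after the last heartbeat gives a tail probability of 0.5, so φ is about 0.3. φ grows as the silence gets longer, and the two thresholds passed to status/3 decide when the application reacts.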
$ rebar3 do xref, dialyzer
$ rebar3 eunit
Bug reports and pull requests are welcome on GitHub at https://github.com/ctbarbour/phi_failure_detector.
Modules:
pfd_app
pfd_monitor
pfd_samples
pfd_service
pfd_service_sup
pfd_sup
phi_failure_detector