[lifeguard] concept and why? #34

Closed
zeroFruit opened this issue Nov 8, 2018 · 1 comment · Fixed by #37
Labels
proposal propose new feature or works to do suspicion

Comments

@zeroFruit
Member

zeroFruit commented Nov 8, 2018

Before moving on to Lifeguard: why the suspicion mechanism?

SWIM added the suspicion mechanism to solve the flapping problem and to make the protocol more robust to slow processes.

What is the flapping problem?

  • A healthy node is marked as failed
  • Logs show it was never actually unhealthy, but other nodes consider it failed

However, although the suspicion mechanism was shown to help with slow nodes, the protocol is still vulnerable to them because:

  • the suspicion can time out before the suspected node's alive message gets back to the node that originated the probe or the suspicion

Lifeguard comes to the rescue

The absence of an expected message can be a signal that the local member (self) is the slow node (slow message processing, slow network): "No messages... maybe I'm in trouble?!"

Lifeguard Components

Lifeguard comes into play in three situations:

L1: Dynamic Fault Detector Timeouts: "Self-Awareness"

Dynamically adjust the fault detector timeouts: start the timeouts low and increase them in response to missing replies. If acks do not come back from a probe or an indirect probe, the node treats it as a possible local problem.
So the local node (the node itself) varies two things:

  • Probe timeout: how long the probed node has to respond
  • Probe interval: time between successive probes

Introduce a Node Self-Awareness (NSA) counter

The higher the value, the worse we think we're doing.

ProbeTimeout = BaseTimeout * (NSA+1)
ProbeInterval = BaseInterval * (NSA+1)
  • Node Self-Awareness
    • Failed probe (no Ack): +1
    • Probe with missed Nack: +1
    • Refute suspicion about self: +1
    • Successful probe (get Ack): -1
    • Max NSA = 8
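
The rules above can be sketched roughly as follows. This is only an illustration, not code from our project: the class name, the method names and the base values are hypothetical, and clamping the counter at a lower bound of 0 is an assumption (the list above only states the maximum).

```python
from dataclasses import dataclass

MAX_NSA = 8  # upper bound taken from the list above


@dataclass
class SelfAwareness:
    """Tracks the Node Self-Awareness (NSA) counter of the local node."""

    base_timeout: float   # seconds; hypothetical configuration value
    base_interval: float  # seconds; hypothetical configuration value
    nsa: int = 0

    def _apply(self, delta: int) -> None:
        # Assumption: the counter is clamped to the range [0, MAX_NSA].
        self.nsa = max(0, min(MAX_NSA, self.nsa + delta))

    def on_failed_probe(self) -> None:       # no ack came back
        self._apply(+1)

    def on_missed_nack(self) -> None:        # probe with a missed nack
        self._apply(+1)

    def on_refute_self(self) -> None:        # had to refute a suspicion about self
        self._apply(+1)

    def on_successful_probe(self) -> None:   # ack received
        self._apply(-1)

    @property
    def probe_timeout(self) -> float:
        # ProbeTimeout = BaseTimeout * (NSA + 1)
        return self.base_timeout * (self.nsa + 1)

    @property
    def probe_interval(self) -> float:
        # ProbeInterval = BaseInterval * (NSA + 1)
        return self.base_interval * (self.nsa + 1)
```

With a base timeout of 0.5 s, two consecutive failures would raise the probe timeout to 1.5 s, and a later successful probe would bring it back down to 1.0 s.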

Nack for more information

Nack is not part of the SWIM protocol, but since we are now interested in learning from the absence of messages, HashiCorp thought it was a good idea to add nack messages. So what is a nack for?
Suppose A asks B and C to indirectly ping D. Without nacks, if A doesn't hear back from the indirect ping of D, it can't tell whether the problem lies with A, B, C or D. With nacks we get a hint. For example, if A receives nacks from B and C, we can tell that A, B and C can talk to each other and D is probably isolated. On the other hand, if A doesn't receive any nack back, A itself may be isolated.
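
A minimal sketch of that reasoning, as seen from the probing node. The function name, the hint strings and the way helpers are passed in are all hypothetical; it only encodes the two cases described above, plus counting the missed nacks that would feed the NSA counter.

```python
from typing import Set, Tuple


def interpret_indirect_probe(
    got_ack: bool, helpers: Set[str], nacks_from: Set[str]
) -> Tuple[str, int]:
    """Return (hint, missed_nacks) for one indirect probe round.

    helpers:    nodes that were asked to ping the target on our behalf
    nacks_from: helpers that answered with a nack (they could not reach the target)
    """
    if got_ack:
        return "target reachable", 0
    if not helpers:
        # Nobody was asked, so the missing ack carries no extra hint.
        return "no hint", 0
    missed_nacks = len(helpers - nacks_from)
    if not (helpers & nacks_from):
        # Neither ack nor any nack arrived: the problem may be local.
        return "local node may be isolated", missed_nacks
    # At least one helper is reachable and could not reach the target.
    return "target may be isolated", missed_nacks
```

For instance, `interpret_indirect_probe(False, {"B", "C"}, {"B", "C"})` points at the target, while the same call with an empty nack set points at the local node.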

L2: Dynamic Suspicion Timeouts: "Dogpile"

The principle is similar to L1:

  • a node starts with a high suspicion timeout
  • a node lowers the timeout as it receives more suspect messages (as a node receives more suspect messages, we can presume that it is healthy; on the other hand, if fewer messages come back, this node may be slow)
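
The description above does not say how fast the timeout should drop. As a sketch, the function below uses a logarithmic decay between a maximum and a minimum timeout, which is how the Lifeguard paper describes it as far as I understand; the formula, the function name and the parameters (`confirmations` for the number of suspect messages received, `k` for the number needed to reach the minimum) should be treated as assumptions here.

```python
import math


def suspicion_timeout(
    min_timeout: float, max_timeout: float, confirmations: int, k: int
) -> float:
    """Suspicion timeout after `confirmations` independent suspect messages.

    Starts at max_timeout with no confirmations and decays to min_timeout
    once k confirmations have been received.
    """
    if k < 1:
        # No confirmations are expected, so there is nothing to wait for.
        return min_timeout
    frac = math.log(confirmations + 1) / math.log(k + 1)
    return max(min_timeout, max_timeout - (max_timeout - min_timeout) * frac)
```

The logarithm makes the first few suspect messages count the most: the timeout drops quickly at first and then flattens out towards the minimum.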

L3: More Timely Refutation: "Buddy System"

The motivation for L3 is the fact that only the suspected node can refute a suspicion; in other words, only the suspected node can increment its incarnation number.
For example, there may be many "suspect" messages in the network, but only the suspected node can refute them, by incrementing its incarnation number (a message with a higher incarnation number overrides the others). So if a node knows about a suspect, it gives the suspect message priority when probing that node: "Let the suspect know that it is suspected"
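
Both halves of this idea can be sketched as below: the prober puts any suspicion about the probe target at the front of what it piggybacks, and the suspected node refutes by bumping its incarnation number. The `Gossip` type and both function names are hypothetical, and the exact comparison used in `refute` is an assumption about how incarnation numbers are handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Gossip:
    kind: str         # "alive" or "suspect"
    node: str         # the node this message is about
    incarnation: int


def order_piggyback(target: str, queue: List[Gossip]) -> List[Gossip]:
    """Put suspect messages about the probe target first (stable otherwise)."""
    return sorted(queue, key=lambda g: not (g.kind == "suspect" and g.node == target))


def refute(
    self_name: str, own_incarnation: int, msg: Gossip
) -> Tuple[int, Optional[Gossip]]:
    """Return (new_incarnation, alive_message_or_None) for the local node."""
    about_me = msg.kind == "suspect" and msg.node == self_name
    if not about_me or msg.incarnation < own_incarnation:
        # Not about us, or already overridden by our current incarnation.
        return own_incarnation, None
    # A higher incarnation number overrides the suspect message.
    new_incarnation = msg.incarnation + 1
    return new_incarnation, Gossip("alive", self_name, new_incarnation)
```

The sooner the suspect hears about the suspicion, the sooner it can send out the alive message with the higher incarnation number, which is the point of prioritizing it on the probe.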

@zeroFruit
Member Author

I wrote this issue because I'd like to suggest integrating the Lifeguard concept into our SWIM project.
