[lifeguard] concept and why? #34

zeroFruit · 2018-11-08T13:34:29Z

Before moving on to Lifeguard: why the suspicion mechanism?

SWIM added the suspicion mechanism to solve the flapping problem and to make the protocol more robust to slow processes.

Flapping problem?

A healthy node is marked as failed.
Its logs show it was never actually unhealthy, but other nodes consider it failed.

However, although the suspicion mechanism was shown to help with slow nodes, it is still vulnerable to them because:

The suspicion can time out before the suspected node's alive message gets back to the node that originated the probe or the suspicion.

Lifeguard comes to the rescue

The absence of an expected message can be a signal that the local member (self) is the slow node (slow message processing, slow network): "No messages... maybe I'm in trouble?!"

Lifeguard Components

Lifeguard consists of three components:

L1: Dynamic Fault Detector Timeouts: "Self-Awareness"

Dynamically adjust the fault detector timeouts. Timeouts start low and increase in response to the absence of replies. If no acks come back from a probe or an indirect probe, the node judges that the problem is local.
So the local node (the node itself) varies two things:

Probe Timeout: how long the probed node has to respond
Probe Interval: time between successive probes

Introduce a Node Self-Awareness (NSA) counter

The higher the value, the worse we think we're doing.

ProbeTimeout = BaseTimeout * (NSA+1)
ProbeInterval = BaseInterval * (NSA+1)

Node Self-Awareness
- Failed probe (no Ack): +1
- Probe with missed Nack: +1
- Refute suspicion about self: +1
- Successful probe (get Ack): -1
- Max NSA = 8
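
The counter and the two formulas above can be sketched as follows. This is only an illustration, not the actual implementation: the class name `NodeSelfAwareness`, its methods, the base values, and the lower bound of 0 are assumptions made for the example.

```python
from dataclasses import dataclass

MAX_NSA = 8  # upper bound from the list above

# Event deltas taken from the list above.
FAILED_PROBE = +1
MISSED_NACK = +1
REFUTED_SUSPICION = +1
SUCCESSFUL_PROBE = -1


@dataclass
class NodeSelfAwareness:
    """Hypothetical NSA counter; a higher score means we trust ourselves less."""
    score: int = 0

    def apply(self, delta: int) -> int:
        # Clamp to [0, MAX_NSA]; the lower bound of 0 is an assumption.
        self.score = max(0, min(MAX_NSA, self.score + delta))
        return self.score

    def scale(self, base: float) -> float:
        # ProbeTimeout = BaseTimeout * (NSA+1); same shape for ProbeInterval.
        return base * (self.score + 1)


nsa = NodeSelfAwareness()
nsa.apply(FAILED_PROBE)
nsa.apply(MISSED_NACK)
probe_timeout = nsa.scale(0.5)   # base timeout of 0.5 s (arbitrary)
probe_interval = nsa.scale(1.0)  # base interval of 1 s (arbitrary)
```

After one failed probe and one missed nack the score is 2, so both the timeout and the interval are three times their base values.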

Nack for more information

Nack is not part of the SWIM protocol, but since we are now interested in what the absence of messages tells us, HashiCorp thought it was a good idea to add nack messages. So what is the nack for?
Suppose A probes D indirectly through B and C. Without the nack, if A doesn't hear back from the indirect ping of D, it doesn't know whether the cause of the problem is A, B, C or D. The nack gives a hint. For example, if A receives nacks from B and C, we can tell that A, B and C can talk to each other and D is probably isolated. On the other hand, if A doesn't receive any nack back, A may be isolated.
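
The reasoning above can be written as a small decision function. This is a hedged sketch: the function name, its parameters, and the returned strings are hypothetical, and the handling of the "some nacks missing" case is an assumption based on the "Probe with missed Nack: +1" rule.

```python
def interpret_indirect_probe(ack: bool, nacks_received: int, nacks_expected: int) -> str:
    """Hypothetical helper: guess where the problem is after an indirect probe.

    ack            -- whether an ack from the target came back (directly or relayed)
    nacks_received -- how many intermediaries (B, C, ...) answered with a nack
    nacks_expected -- how many intermediaries were asked
    """
    if ack:
        return "target is alive"
    if nacks_expected == 0:
        return "no information"
    if nacks_received == nacks_expected:
        # A can talk to every intermediary, so the target is the likely culprit.
        return "target is probably isolated"
    if nacks_received == 0:
        # Nobody answered at all: the sender itself may be cut off or slow.
        return "local node may be isolated"
    # Assumption: a partially missing set of nacks still hints at a local problem.
    return "local node may be degraded"


# A asked B and C to ping D; both sent nacks, no ack arrived.
verdict = interpret_indirect_probe(ack=False, nacks_received=2, nacks_expected=2)
```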

L2: Dynamic Suspicion Timeouts: "Dogpile"

Its principle is similar to that of L1.

The node starts with a high suspicion timeout.
The node lowers the timeout as it receives more suspect messages. (Receiving more suspect messages suggests that the local node is healthy and processing messages in time; receiving fewer suggests that the local node may be slow.)
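
One way to picture the decay is the sketch below. The issue text only says the timeout goes from high to low as suspect messages arrive; the logarithmic shape used here is, as far as I recall, the one described in the Lifeguard paper, so treat the formula, the function name, and the parameters (`min_t`, `max_t`, `k`) as assumptions rather than this project's behaviour.

```python
import math


def suspicion_timeout(min_t: float, max_t: float, confirmations: int, k: int) -> float:
    """Hypothetical decay of the suspicion timeout.

    confirmations -- independent suspect messages received so far
    k             -- number of confirmations at which the timeout reaches min_t
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Fraction of the way from max_t to min_t, growing logarithmically.
    frac = math.log(confirmations + 1) / math.log(k + 1)
    # Never go below min_t, even if more than k confirmations arrive.
    return max(min_t, max_t - (max_t - min_t) * frac)


# With min 2 s, max 30 s and k = 3, each extra confirmation shortens the timeout.
timeouts = [suspicion_timeout(2.0, 30.0, c, 3) for c in range(5)]
```

With no confirmations the node waits the full maximum; once `k` confirmations have arrived it waits only the minimum.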

L3: More Timely Refutation: "Buddy System"

The motivation for L3 is the fact that only the suspected node can refute a suspicion; in other words, only the suspected node can increment its own incarnation number.
For example, there may be many "suspect" messages in the network, but only the suspected node can refute them, by incrementing its incarnation number (a message with a higher incarnation number overrides the others). So if a node knows about a suspicion, it gives priority to telling the suspect when it probes it: "Let the suspect know that it is suspected."
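
The refutation step can be sketched like this. It is a minimal illustration of the incarnation-number rule described above; the `LocalMember` class, the `on_suspect` method, and the tuple used as the alive message are hypothetical names for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LocalMember:
    """Hypothetical view of the local node's own membership state."""
    name: str
    incarnation: int = 0

    def on_suspect(self, target: str, incarnation: int) -> Optional[Tuple[str, str, int]]:
        """Return an alive message refuting the suspicion, or None."""
        if target != self.name:
            return None  # only the suspected node itself may refute
        if incarnation < self.incarnation:
            return None  # stale suspicion, already overridden earlier
        # Bump past the suspected incarnation so the alive message wins.
        self.incarnation = incarnation + 1
        return ("alive", self.name, self.incarnation)


d = LocalMember(name="D", incarnation=3)
refutation = d.on_suspect("D", 3)  # a probe told D that it is suspected
```

The sooner the suspect hears about the suspicion, the sooner this refutation can be produced, which is the point of the buddy system.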

zeroFruit · 2018-11-08T13:45:57Z

I wrote this issue because I want to suggest integrating the Lifeguard concept into our SWIM project.

zeroFruit added the proposal and suspicion labels Nov 8, 2018

This was referenced Nov 8, 2018

[docs] writing more about suspicion mechanism #32

Closed

[doc] Edit docs about Lifeguard part #37

Merged

junbeomlee closed this as completed in #37 Nov 12, 2018
