SWIM added the suspicion mechanism to solve the flapping problem and to make the protocol more robust to slow processes.
What is the flapping problem?
A healthy node is marked as failed.
Logs show that it was never actually unhealthy, yet other nodes consider it failed.
However, although the suspicion mechanism was shown to help with slow nodes, the protocol is still vulnerable to them because:
The suspicion can time out before the suspected node's alive message gets back to the node that originated the probe or the suspicion.
Lifeguard comes to the rescue
The absence of an expected message can be a signal that the local member (the node itself) is the slow one (slow message processing, slow network): "No messages... maybe I'm in trouble?!"
Dynamically adjust the fault detector timeouts. Timeouts start low and increase in response to the absence of replies. If acks do not come back from a probe or an indirect probe, the node judges that the problem is local.
So the local node (the node itself) varies two things:
Probe timeout: how long the probed node has to respond
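The "start low, increase when replies go missing" idea can be sketched as a small scaling function. This is only an illustration: the integer health score, the name `scale_timeout`, and the linear `(score + 1)` multiplier are assumptions, not something stated in these notes.

```python
def scale_timeout(base_timeout: float, score: int) -> float:
    """Stretch a base timeout according to a local health score.

    A score of 0 means the local node looks healthy, so the timeout stays
    at its low base value. A higher score (assumed to be raised elsewhere
    whenever an expected reply is missing) stretches the timeout linearly.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    return base_timeout * (score + 1)


if __name__ == "__main__":
    # Hypothetical base probe timeout, in seconds.
    probe_timeout = 0.5
    for score in range(3):
        print(score, scale_timeout(probe_timeout, score))
```

A node that believes it is slow therefore gives its peers more time before it blames them.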
Nack is not part of the SWIM protocol, but since we are now interested in what the absence of messages can tell us, HashiCorp thought it was a good idea to add nack messages. So what is a nack for?
Without nacks, if the sender does not hear back from an indirect ping of the target, it cannot tell whether the problem lies with A, B, C or D (here A is the sender, B and C relay the indirect ping, and D is the target). With nacks we get a hint. For example, if A receives nacks from B and C, we can tell that A, B and C can talk to each other and that D is probably isolated. On the other hand, if A receives no nacks, A itself may be isolated.
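The reasoning above can be written down as a tiny decision function. This is a hedged sketch: the `Diagnosis` enum and `diagnose_indirect_probe` are hypothetical names, and a real implementation would likely weigh how many nacks were expected rather than just checking for any.

```python
from enum import Enum


class Diagnosis(Enum):
    TARGET_ALIVE = "target answered"
    TARGET_SUSPECT = "helpers reachable, target silent"
    SELF_SUSPECT = "nothing came back, local node may be isolated"


def diagnose_indirect_probe(ack_received: bool, nacks_received: int) -> Diagnosis:
    """Interpret one indirect probe round from the point of view of sender A."""
    if ack_received:
        # The target D answered through a helper.
        return Diagnosis.TARGET_ALIVE
    if nacks_received > 0:
        # Helpers (B, C) answered, so A can talk to them; D is the likely problem.
        return Diagnosis.TARGET_SUSPECT
    # No ack and no nack: A itself may be slow or cut off.
    return Diagnosis.SELF_SUSPECT
```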
L2: Dynamic Suspicion Timeouts: "Dogpile"
Its principle is similar to that of L1 (dynamic fault detector timeouts).
A node starts with a high suspicion timeout.
The node lowers the timeout as it receives more suspect messages. (Receiving more suspect messages suggests that the local node is healthy and keeping up; if fewer messages come in, the local node may be slow.)
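One way to picture a timeout that starts high and shrinks with each additional suspect message is the sketch below. The logarithmic decay, the parameter names, and the example values are assumptions chosen for illustration; the notes only say that the timeout starts high and is lowered.

```python
import math


def suspicion_timeout(min_t: float, max_t: float, confirmations: int, k: int) -> float:
    """Return the suspicion timeout after a number of suspect messages.

    min_t / max_t: lower and upper bounds for the timeout (seconds).
    confirmations: suspect messages received so far about the same node.
    k: number of confirmations at which the timeout reaches min_t.
    """
    if k < 1:
        return min_t
    confirmations = max(0, confirmations)
    # Fraction of the way from max_t to min_t; early confirmations count most.
    frac = math.log(confirmations + 1) / math.log(k + 1)
    return max(min_t, max_t - (max_t - min_t) * frac)


if __name__ == "__main__":
    for c in range(5):
        print(c, suspicion_timeout(2.0, 30.0, c, 3))
```

With no confirmations the node waits the full `max_t`, which is the cautious choice when it cannot tell whether it is itself the slow party.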
L3: More Timely Refutation: "Buddy System"
The motivation for L3 is the fact that only the suspected node can refute a suspicion; in other words, only the suspected node can increment its own incarnation number.
For example, there may be many "suspect" messages in the network, but only the suspected node can refute them, by incrementing its incarnation number (a message with a higher incarnation number overrides the others). So if a node knows about a suspicion, it gives priority to informing the suspect when it probes it: "Let the suspect know that it is suspected."
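The refutation rule can be sketched as follows. The `Member` class and the tuple-shaped alive message are hypothetical; they only illustrate that a node refutes suspicions about itself by moving to a higher incarnation number.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Member:
    name: str
    incarnation: int = 0

    def on_suspect(self, suspect_name: str, suspect_incarnation: int) -> Optional[Tuple[str, str, int]]:
        """Handle a suspect message; return an alive message if we refute it."""
        if suspect_name != self.name:
            # Only the suspected node itself may refute.
            return None
        if suspect_incarnation < self.incarnation:
            # Stale suspicion: a newer alive message already overrides it.
            return None
        # Move past the suspected incarnation so the alive message wins.
        self.incarnation = suspect_incarnation + 1
        return ("alive", self.name, self.incarnation)
```

The sooner the suspect hears about the suspicion, the sooner it can send this alive message, which is why telling it directly during a probe helps.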
Before moving on to Lifeguard: why the suspicion mechanism?
Lifeguard comes to the rescue
Lifeguard Components
Lifeguard operates in three situations
L1: Dynamic Fault Detector Timeouts: "Self-Awareness"
Introduce a Node Self-Awareness (NSA) counter
The higher the value, the worse we think we're doing.
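A bounded counter of this kind might look like the sketch below. The class name, the upper bound of 8, and the idea of applying positive deltas for bad signals and negative deltas for good ones are all assumptions made for illustration.

```python
class NodeSelfAwareness:
    """Bounded counter: 0 means healthy, higher means we think we are doing worse."""

    def __init__(self, max_score: int = 8) -> None:
        self.max_score = max_score
        self.score = 0

    def apply_delta(self, delta: int) -> int:
        """Add delta (positive for bad signals, negative for good ones), clamped to [0, max_score]."""
        self.score = min(self.max_score, max(0, self.score + delta))
        return self.score


if __name__ == "__main__":
    nsa = NodeSelfAwareness()
    nsa.apply_delta(+1)   # e.g. an expected ack never arrived
    nsa.apply_delta(-1)   # e.g. a probe was answered in time
    print(nsa.score)
```

Clamping keeps one long bad spell from inflating the value without limit, so the node can recover quickly once replies start arriving again.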
Nack for more information
L2: Dynamic Suspicion Timeouts: "Dogpile"
L3: More Timely Refutation: "Buddy System"