Decode error, then cause to Member::Rename? #35

Closed
Manphil opened this issue May 29, 2024 · 9 comments

Manphil commented May 29, 2024

I run a cluster with 10 nodes, and I found that the error Decode(Found an Option discriminant that wasn't 0 or 1) occurs every 10-20 minutes on several nodes. Other nodes then receive Member::Rename (I am not sure whether the error causes it), but Member::Rename occurs frequently even when there is no error.

Here is my foca config:

probe_period: 2s
probe_rtt: 1s
num_indirect_probes: 3
remove_down_after: 24h
max_packet_size: 1400
max_transmissions: 5
notify_down_members: true
suspect_to_down_after: 6s
periodic_announce:
    frequency: 30s
    num_members: 1
periodic_announce_to_down_members:
    frequency: 65s
    num_members: 2
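
For reference, a rough sketch of how these settings might map onto foca's Config struct in Rust. The field and type names (including PeriodicParams and Config::new_wan) are assumptions based on foca's docs and may not match every version, so treat this as an outline rather than copy-paste material:

use std::num::{NonZeroU32, NonZeroU8, NonZeroUsize};
use std::time::Duration;

use foca::{Config, PeriodicParams};

// Sketch only: start from a preset sized for the cluster, then override
// the knobs from the config above. Field names are assumptions about the
// current foca::Config layout.
fn issue_config() -> Config {
    let mut cfg = Config::new_wan(NonZeroU32::new(10).expect("non-zero"));
    cfg.probe_period = Duration::from_secs(2);
    cfg.probe_rtt = Duration::from_secs(1);
    cfg.num_indirect_probes = NonZeroUsize::new(3).expect("non-zero");
    cfg.remove_down_after = Duration::from_secs(24 * 60 * 60);
    cfg.max_packet_size = NonZeroUsize::new(1400).expect("non-zero");
    cfg.max_transmissions = NonZeroU8::new(5).expect("non-zero");
    cfg.notify_down_members = true;
    cfg.suspect_to_down_after = Duration::from_secs(6);
    cfg.periodic_announce = Some(PeriodicParams {
        frequency: Duration::from_secs(30),
        num_members: NonZeroUsize::new(1).expect("non-zero"),
    });
    cfg.periodic_announce_to_down_members = Some(PeriodicParams {
        frequency: Duration::from_secs(65),
        num_members: NonZeroUsize::new(2).expect("non-zero"),
    });
    cfg
}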

caio commented May 30, 2024

Interesting!

Assuming everything is running the same version, this looks like a problem with either your identity type or your codec.

It's hard to diagnose Decode errors from afar - what you can do is store the data that caused the error and try to inspect it manually (it's a simple format, described here)

Capturing the debug logs (at least RUST_LOG=foca=debug) might help figure this out, but I'm not sure it will, as it doesn't sound like the error originates from within foca. But maybe your scenario is triggering a bug in send_message and making foca craft a broken packet somehow.
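
A minimal sketch of the "store the data" idea, assuming the application keeps the raw UDP payload it passes to handle_data; the function name and the /tmp path are made up for illustration:

use std::io::Write;

// Call this wherever handle_data returns an Err: it persists the exact
// bytes that failed to decode so they can be inspected or replayed later.
fn dump_bad_packet(buf: &[u8], error: &impl std::fmt::Debug) {
    let name = format!(
        "/tmp/foca-bad-packet-{}.bin",
        std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .map(|d| d.as_millis())
            .unwrap_or_default()
    );
    if let Ok(mut file) = std::fs::File::create(&name) {
        // Best effort only: this is a debugging aid, failures are ignored.
        let _ = file.write_all(buf);
    }
    eprintln!("decode failed ({error:?}); raw packet saved to {name}");
}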


Now, a high rate of Notification::Rename is interesting in itself. If you're not expecting it to happen (i.e.: you're not restarting nodes, you're not on a flaky network, nodes are not overloaded all the time), it may be a sign of a problem.

I would try to figure out if it's a particular node that is constantly changing its identity or if it's generalized throughout the cluster. A generalized problem could mean you need to tune the foca config (maybe probe period is too short, maybe you need to give more time on suspect-to-down, maybe enable periodic gossiping)


Manphil commented May 31, 2024

My Identity struct looks like this:

struct FatIdentity {
  pub addr: SocketAddrV4,
  pub node: Option<Node>,
  pub version: u64,
}

struct Node {
  pub addr: String,
  pub id: u64,
  pub extra: Vec<String>
}

My Codec is PostcardCodec.
The decode error occurs somewhere like:

tracing::error!(error = tracing::field::debug(error), "Ignored error");

what you can do is store the data that caused the error and try to inspect it manually (it's a simple format, described here)

It's a good idea, I will debug it.

I run the cluster in k8s, and I am not sure if the network is stable. I will confirm whether it is caused by packet loss.


I am a bit confused about periodic_gossip, which speeds up cluster updates. If I enable it and a Rename occurs, would the Rename propagate faster and appear more frequently in the cluster?

The probe period is a bit confusing to me. Intuitively, a smaller period means more frequent keepalives, so a lower probability of false-positive downtime (Rename or Down)? I also have a question: if node1 sends probe1 to node2 and does not receive a response from node2 by the end of the period, node1 marks node2 as suspect. At the next period, node1 sends probe2 to node2 again and receives the response to probe2. At that moment, will node1 still mark node2 as suspect, or change it back to Up?


caio commented Jun 1, 2024

Thanks for the info! As I suspected, the Option is within your identity. I'd try some easy-to-craft tests to figure out where the issue lies. Things like:

  • Does it still happen if you use a simpler id?
  • What about changing the codec?

The fact that it only happens occasionally is really strange, which is why I leaned towards this being something to do with the network (assuming UDP). Not so much packet loss (you wouldn't see errors), but corruption: the checksum UDP gives you is just 16 bits, it can be turned off, and it gets messed with at network borders all the time; you should have your own checksum for the data.

But then... if it were corruption you wouldn't be seeing errors only about decoding an option heh. Quite puzzling without seeing the whole code and setup
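
A minimal sketch of that checksum suggestion, assuming the crc32fast crate (any checksum or hash would do; foca itself does not mandate this): append a CRC32 to every datagram before sending, verify it on receipt, and drop anything that does not match before handing the bytes to foca.

// seal() runs on the encoded foca packet before sending; unseal() runs on
// every received datagram and returns None for corrupt or truncated data.
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 4);
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32fast::hash(payload).to_be_bytes());
    out
}

fn unseal(datagram: &[u8]) -> Option<&[u8]> {
    // Too short to even carry the trailer: treat as corrupt.
    let split = datagram.len().checked_sub(4)?;
    let (payload, tail) = datagram.split_at(split);
    let expected = u32::from_be_bytes(tail.try_into().ok()?);
    (crc32fast::hash(payload) == expected).then_some(payload)
}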


The probe period is a bit confusing to me. Intuitively, a smaller period means more frequent keepalives, so a lower probability of false-positive downtime (Rename or Down)?

I'd say it really depends on what's the cause for these false positives and how you're looking at the numbers. If there's always a chance for false positives, asking more often will mean that you'll see false positives more often.

Tuning probe period is a trade-off between cluster chatter/noise and how fast you detect that a node went down. What needs to happen once a node goes down varies heavily based on the use-case so there's no clear right approach: if you want uncoordinated, eventually consistent properties at the swim level, you're gonna have to add coordination on your side (ask other peers? try a tcp connection? wait a few seconds? it really depends on what's your goal)

Generally speaking, I'd recommend setting up the cluster with a generous probe period/rtt, ensuring things are as stable as you want (it's reasonable to expect a 10-node LAN cluster to have zero Rename/Down events in a large time window if the nodes are sitting idle; it's unreasonable to expect anything good from the cloud :P). Once things are stable, then you start tuning for speed of failure detection, otherwise you'll be tweaking many variables at once.

I am a bit confused about periodic_gossip, which speeds up cluster updates. If I enable it and a Rename occurs, would the Rename propagate faster and appear more frequently in the cluster?

Periodic gossip just changes how often a node sends its knowledge around. It speeds up how fast the whole cluster converges to the same membership state. Serf has a little simulator that helps understand how gossip fan-out and frequency influence the convergence speed: https://www.serf.io/docs/internals/simulator.html
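
Enabling it is a matter of setting the corresponding Option on the config. A sketch, assuming foca's Config exposes periodic_gossip as an Option<PeriodicParams> (the numbers are purely illustrative, not a recommendation):

use std::num::NonZeroUsize;
use std::time::Duration;

use foca::{Config, PeriodicParams};

// Periodic gossip stays off unless this Option is set.
fn enable_periodic_gossip(cfg: &mut Config) {
    cfg.periodic_gossip = Some(PeriodicParams {
        frequency: Duration::from_millis(500),
        num_members: NonZeroUsize::new(3).expect("non-zero"),
    });
}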

I also have a question: if node1 sends probe1 to node2 and does not receive a response from node2 by the end of the period, node1 marks node2 as suspect. At the next period, node1 sends probe2 to node2 again and receives the response to probe2. At that moment, will node1 still mark node2 as suspect, or change it back to Up?

The probe message contains an identifier and replies must match it. Every cycle the identifier changes, so replies to old ones are ignored. I wrote foca with the ability to ask these kinds of questions in mind as I was learning the protocol too :) You might find it useful to read and play with the probe_ tests. Starting here: https://github.com/caio/foca/blob/main/src/lib.rs#L2866
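
A toy sketch of that identifier-matching idea (names made up, not foca's actual code; see the linked probe_ tests for the real behavior): an ack only counts if it echoes the number of the probe cycle currently in flight, so a late reply to an earlier probe does not clear suspicion by itself.

struct Prober {
    current_probe_no: u64,
    awaiting_ack: bool,
}

impl Prober {
    // Called at the start of every probe period; the previous cycle not
    // having been acked in time is what leads to the target being marked
    // suspect before this new cycle starts.
    fn start_cycle(&mut self) -> u64 {
        self.current_probe_no += 1;
        self.awaiting_ack = true;
        self.current_probe_no
    }

    // Returns true only for an ack matching the current cycle; a stale ack
    // for probe1 arriving during probe2's cycle is simply ignored.
    fn handle_ack(&mut self, probe_no: u64) -> bool {
        if self.awaiting_ack && probe_no == self.current_probe_no {
            self.awaiting_ack = false;
            true
        } else {
            false
        }
    }
}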


Manphil commented Jun 3, 2024

I captured the data from Foca::handle_data when it returned the decode error, and compared it with the data logged on the node that sent it. The data is the same, so there is no packet loss. Then I created a new foca instance and passed the data into Foca::handle_data to test it. Everything was ok, no errors occurred, which is strange. The message type of the decoded header is Message::TurnUndead.

I will change the codec to bincode and see if the error still occurs.


caio commented Jun 3, 2024

Oh! Is it possible one of your nodes (the one with the decode error) is running an older foca version? That would 100% explain the decode error being strange and only happening occasionally

One of the many things foca doesn't do is version the protocol, and TurnUndead was added somewhat recently; nodes that don't know about it will work kind of ok because not much else has changed in the header.


Manphil commented Jun 3, 2024

It's unlikely that the foca versions differ: I run the cluster in k8s, and every container uses the same image. But the decode error is indeed occasional, so it's strange. It is more likely to occur when there is a certain amount of network traffic, which is why I initially suspected it was caused by UDP packet loss.


caio commented Jun 3, 2024

Managed to trigger it here; knowing the message type helped a lot - thanks!

The error is caused by TurnUndead messages containing custom broadcasts when they shouldn't.

The reason it didn't break when you tested on another node is that the header decodes correctly, and since the message is not addressed to that identity, it gets discarded there. With the correct identity, foca goes on to interpret the custom broadcast as a u16 + sequence of identities 💣
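
That also explains the exact error text: it is what postcard reports when the byte it is about to read as an Option tag is neither 0 (None) nor 1 (Some). A tiny standalone repro, assuming the postcard crate (the byte value is arbitrary):

// Deserializing an Option from a buffer whose next byte isn't 0 or 1 fails
// with the same message as in the report above.
fn main() {
    let bogus = [7u8]; // e.g. leftover broadcast bytes read as an Option tag
    let res: Result<Option<u64>, postcard::Error> = postcard::from_bytes(&bogus);
    // Prints something like: Found an Option discriminant that wasn't 0 or 1
    println!("{}", res.unwrap_err());
}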

I'll ship a fix for it tomorrow


Manphil commented Jun 4, 2024

Oh, you are right, sorry for that! Yesterday I created the new foca instance with the wrong identity, so it returned from

foca/src/lib.rs

Lines 902 to 906 in 1e8c322

if !self.accept_payload(&header) {
    #[cfg(feature = "tracing")]
    tracing::trace!("Payload not accepted");
    return Ok(());
}

I created a new foca instance with the dst identity of the captured data, and the decode error reproduced. The error is returned from the first call to decode_member:

foca/src/lib.rs

Lines 920 to 923 in 1e8c322

self.codec
    .decode_member(&mut data)
    .map_err(anyhow::Error::msg)
    .map_err(Error::Decode)?,


caio commented Jun 4, 2024

I've released v0.17.2 with a fix. Closing this issue, feel free to reopen.

Thank you very much for the report and assistance debugging!

caio closed this as completed Jun 4, 2024