Decode error, then causing Member::Rename? #35
Interesting! Assuming everything is running the same version, this looks like it's either a problem with: …

It's hard to diagnose Decode errors from afar. What you can do is store the data that caused the error and try to inspect it manually (it's a simple format, described here), and capture the debug logs (at least …).

Now, a high rate of Notification::Rename is interesting in itself. If you're not expecting it to happen (i.e.: you're not restarting nodes, you're not on a flaky network, nodes are not overloaded all the time), it may be a sign of a problem. I would try to figure out whether it's a particular node that is constantly changing its identity or whether it's generalized throughout the cluster. A generalized problem could mean you need to tune the foca config (maybe the probe period is too short, maybe you need to give more time on suspect-to-down, maybe enable periodic gossiping).
My Identity struct is like: …
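(For illustration only, since the actual struct isn't shown here: a hypothetical identity in roughly this shape, with an Option inside and an extra component for renew(), would match what the later comments describe. The names are made up.)

```rust
use serde::{Deserialize, Serialize};
use std::net::SocketAddr;

// Hypothetical identity, NOT the reporter's actual struct. The relevant
// details (per the comments below) are that it contains an Option and an
// extra component that renew() can bump, which is what produces Rename.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
struct MyId {
    addr: SocketAddr,
    // a generation counter bumped on renew(); an Option only to mirror the
    // "Option discriminant" decode error from the report
    extra: Option<u64>,
}
```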
My Codec is PostcardCodec (as in foca/examples/foca_insecure_udp_agent.rs, line 312 at 1e8c322).
It's a good idea, I will debug it. I run the cluster in k8s and I am not sure the network is stable, so I will confirm whether it is caused by packet loss.

I am a bit confused about periodic_gossip, which speeds up cluster updates. If I enable it and a Rename occurs, would the Rename propagate faster and appear more frequently in the cluster?

The probe period is also a bit confusing to me. Intuitively, a smaller period means more frequent keepalives and a lower probability of a false-positive downtime (Rename or Down)?

One more question: if node1 sends probe1 to node2 and does not receive the response by the end of the period, node1 marks node2 as suspect. In the next period, node1 sends probe2 to node2 and this time receives the response. At that moment, will node1 still mark node2 as suspect, or change it back to Up?
Thanks for the info! As I suspected, the Option is within your identity. I'd try some easy-to-craft tests to figure out where the issue lies. Things like: …
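(The original list of suggested tests isn't preserved above; one such test might simply round-trip the identity through the same serializer. A sketch, reusing the hypothetical MyId from the earlier snippet and postcard, which is what PostcardCodec builds on:)

```rust
// Round-trips the hypothetical MyId above through postcard, the serializer
// behind PostcardCodec (needs postcard's `alloc` feature for to_allocvec).
#[test]
fn identity_roundtrips_through_postcard() {
    let original = MyId {
        addr: "10.0.0.1:8787".parse().unwrap(),
        extra: Some(42),
    };
    let bytes = postcard::to_allocvec(&original).expect("encode");
    let decoded: MyId = postcard::from_bytes(&bytes).expect("decode");
    assert_eq!(original, decoded);

    // exercise the None branch too, since the error is about an Option discriminant
    let none_case = MyId { extra: None, ..original };
    let bytes = postcard::to_allocvec(&none_case).expect("encode");
    let decoded: MyId = postcard::from_bytes(&bytes).expect("decode");
    assert_eq!(none_case, decoded);
}
```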
The fact that it only happens occasionally is really strange, which is why I leaned on this being something to do with the network (assuming udp). Not so much packet loss (you wouldn't see errors), but corruption: the checksum udp gives you is just 16 bits, can be turned off, and gets messed with at borders all the time, so you should have some of your own for the data. But then... if it were corruption you wouldn't be seeing errors only about decoding an Option heh. Quite puzzling without seeing the whole code and setup.
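(A sketch of the "have your own checksum for the data" idea, assuming the crc32fast crate; foca doesn't do this for you, and this is just one way to wrap the payload:)

```rust
// Application-level integrity check on the datagram payload.
// Assumes the crc32fast crate.
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = crc32fast::hash(payload).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

fn unseal(datagram: &[u8]) -> Option<&[u8]> {
    if datagram.len() < 4 {
        return None;
    }
    let (crc, payload) = datagram.split_at(4);
    let expected = u32::from_be_bytes(crc.try_into().ok()?);
    // reject the packet instead of handing corrupted bytes to the codec
    (crc32fast::hash(payload) == expected).then_some(payload)
}
```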
I'd say it really depends on what's causing these false positives and how you're looking at the numbers. If there's always a chance for false positives, asking more often will mean that you'll see false positives more often. Tuning the probe period is a trade-off between cluster chatter/noise and how fast you detect that a node went down.

What needs to happen once a node goes down varies heavily based on the use-case, so there's no clear right approach: at the swim level you only get uncoordinated, eventually consistent properties; if you want more than that, you're gonna have to add coordination on your side (ask other peers? try a tcp connection? wait a few seconds? it really depends on what your goal is).

Generally speaking, I'd recommend setting up the cluster with a generous probe period/rtt and ensuring things are as stable as you want (it's reasonable to expect a 10-node LAN cluster to have zero Rename/Down events in a large time window if the nodes are sitting idle; it's unreasonable to expect anything good from the cloud :P). Once things are stable, then you start tuning for speed of failure detection, otherwise you'll be tweaking many variables at once.
Periodic gossip just changes how often a node sends its knowledge around. It speeds up how fast the whole cluster converges to the same membership status. Serf has a little simulator that helps understand how gossip fan-out and frequency influence the convergence speed: https://www.serf.io/docs/internals/simulator.html
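(In the same spirit as that simulator, here's a toy push-gossip simulation; it is not foca's actual dissemination logic, just an illustration of how fanout affects the number of rounds until everyone has heard an update. Assumes the rand crate, 0.8-style API.)

```rust
use rand::seq::SliceRandom;

/// Toy push-gossip simulation (not foca's dissemination logic): every node
/// that already knows an update tells `fanout` random peers each round.
/// Returns how many rounds it takes until all `n` nodes have heard it.
fn rounds_to_converge(n: usize, fanout: usize) -> usize {
    let mut rng = rand::thread_rng();
    let everyone: Vec<usize> = (0..n).collect();
    let mut informed = vec![false; n];
    informed[0] = true;
    let mut rounds = 0;

    while informed.iter().any(|&knows| !knows) {
        rounds += 1;
        let spreaders = informed.iter().filter(|&&knows| knows).count();
        for _ in 0..spreaders {
            for &peer in everyone.choose_multiple(&mut rng, fanout) {
                informed[peer] = true;
            }
        }
    }
    rounds
}

fn main() {
    // larger fanout (or more frequent gossip) => fewer rounds to converge
    for fanout in [1, 2, 3] {
        println!("n=10, fanout={fanout}: {} rounds", rounds_to_converge(10, fanout));
    }
}
```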
The probe message contains an identifier and replies must match it. Every cycle the identifier changes, so replies to old ones are ignored.

I wrote foca with the ability to ask these kinds of questions in mind, as I was learning the protocol too :) You might find it useful to read and play with the probe_ tests, starting here: https://github.com/caio/foca/blob/main/src/lib.rs#L2866
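(Sketched in plain Rust for illustration, not foca's implementation: the bookkeeping described above looks roughly like this.)

```rust
/// Each probe cycle gets a fresh number; only an ack echoing the current
/// number clears the suspicion for that cycle, so a late reply to an old
/// probe is simply ignored.
struct ProbeState {
    current_probe_no: u64,
    acked: bool,
}

impl ProbeState {
    fn start_new_cycle(&mut self) -> u64 {
        self.current_probe_no += 1;
        self.acked = false;
        self.current_probe_no // send this number along with the probe
    }

    fn handle_ack(&mut self, probe_no: u64) {
        if probe_no == self.current_probe_no {
            self.acked = true; // fresh reply: the target is fine this cycle
        }
        // an ack carrying an old number does nothing, so a late reply to
        // probe1 does not cancel the suspicion raised while waiting for it
    }

    fn cycle_failed(&self) -> bool {
        !self.acked
    }
}
```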
I captured the data from … I will change the codec to bincode and see if the errors still happen.
Oh! Is it possible one of your nodes (the one with the decode error) is running an older foca version? That would 100% explain the decode error being strange and only happening occasionally.

One of the many things foca doesn't do is version the protocol, and TurnUndead was added somewhat recently: nodes that don't know about it will work kind of ok because not much else has changed in the header.
The foca versions are unlikely to differ: I run the cluster in k8s and each container uses the same image. But the decode error is indeed occasional, so it's strange. It is more likely to occur when there is a certain amount of network traffic, which is why I initially suspected it was caused by udp packet loss.
Managed to trigger it here; knowing the message type helped, thanks a lot! The error is being caused by turnundead messages containing custom broadcasts when they shouldn't.

The reason it didn't break when you tested on another node is that the header decodes correctly; then, since the message is not addressed to the same identity, it gets discarded. With the correct identity it tries to interpret the custom broadcast as a u16 + sequence-of-identities 💣

I'll ship a fix for it tomorrow.
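(A standalone postcard example, not foca's wire format, showing the general failure mode described here: bytes written under one schema read back under another that expects an Option. The struct names are made up.)

```rust
use serde::{Deserialize, Serialize};

// What actually got written vs. what the reader expects.
#[derive(Serialize, Deserialize)]
struct Written {
    count: u16,
    payload: Vec<u8>,
}

#[derive(Serialize, Deserialize, Debug)]
struct Expected {
    tag: Option<u32>,
}

fn main() {
    let bytes = postcard::to_allocvec(&Written { count: 5, payload: vec![7, 7] }).unwrap();
    // the first byte is the varint 5, which is not a valid Option discriminant
    let err = postcard::from_bytes::<Expected>(&bytes).unwrap_err();
    println!("{err}"); // "Found an Option discriminant that wasn't 0 or 1"
}
```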
Oh, you are right, sorry for that! Yesterday I created the foca instance with the wrong identity, so it returned early from lines 902 to 906 (at 1e8c322).
I re-created a foca instance with the dst identity from the captured data, and the decode error reproduced. The error is returned from the first call to decode_member (lines 920 to 923 at 1e8c322).
I've released v0.17.2 with a fix. Closing this issue, feel free to reopen. Thank you very much for the report and assistance debugging!
I run a cluster with 10 nodes, and I found that the error Decode(Found an Option discriminant that wasn't 0 or 1) occurs every 10-20 minutes on several nodes; other nodes then receive Member::Rename (I am not sure if that error causes it), but Member::Rename happens frequently even when there is no error.

Here is my foca config: …