
Member can lose all memberships and never recover #15

Closed · jeromegn opened this issue Nov 3, 2022 · 32 comments

@jeromegn commented Nov 3, 2022

If there's a network event causing a node to lose all connectivity, all members will appear down from its point of view.

It doesn't appear to recover from this condition when the network goes back up.

Is this expected? If so, what would be a good way to remediate the situation? I assume detecting a large number of down nodes and then re-announcing to the original bootstrap list would work?

@caio (Owner) commented Nov 5, 2022

Yeah, it's expected. I should definitely make this clear in the docs. Thanks for the nudge!

I think re-announcing to the bootstrap list is a good approach (it's kind of what I do nowadays: restart the instance if its active set is empty after too long; the good old turn it off and on again).

Thinking about it now, perhaps a simple mechanism to periodically announce to known down members would work well. Something like: every X seconds it looks at the down set, picks a few random members and announces to them. Off the top of my head, the only situation this won't cover is the first join: if nobody replies to your announces, your cluster view will remain empty, so there won't be a list of down members to retry.
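
A rough sketch of that idea from the application side (everything here is illustrative: down_members is your own bookkeeping of members you saw go down, rng/runtime are whatever you already drive foca with, and the announce call may need adjusting for your setup):

use rand::seq::SliceRandom;
use std::time::Duration;

// down_members: the application's own list of members it saw go down.
// rng: any rand::Rng the application already has around.
// runtime: the Runtime impl already used to drive foca.
let mut interval = tokio::time::interval(Duration::from_secs(30));
loop {
    interval.tick().await;
    // Pick a few random down members and announce to them; if one of them is
    // actually reachable again, its Feed reply re-seeds our active set.
    for target in down_members.choose_multiple(&mut rng, 3).cloned() {
        let _ = foca.announce(target, &mut runtime);
    }
}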

I'm very open to ideas here, lemme know what you think if you have some 😃 The "just restart" approach has worked so well in practice that I haven't put any real thought into this problem yet

@jeromegn (Author) commented Nov 5, 2022

I could look for this condition myself (lost all contact with other nodes). Even on first boot it can happen that no members are found (especially if I'm restarting a bunch of nodes at the same time).

I've also noticed the following:

  • A node ran out of memory
  • This situation created a whole bunch of issues, one of them was related to the network
  • The node lost visibility into a third of the fleet (100/300 nodes)
  • It never recovered from this, I had to restart my app for it to reconnect to the lost nodes

I would've thought that SWIM would constantly be advertising other nodes to each other.

I think I should tweak some foca config settings; I do lose members quite often. There's a worst-case "normal" scenario of 300ms ping between nodes (they're spread across the globe), but this can get worse given networks aren't always stable. The default "wan" settings don't seem to work that well with our topology.

@caio (Owner) commented Nov 6, 2022

Ooh interesting! My guess is that this node started declaring members as down by itself (machine is thrashing, network i/o grinds to a halt, probe cycle "fails") and then when it got back to a working state, the cluster had decided it was down. If that was the case, its roster would slowly go down to zero since nobody would be answering its probes.

This is a SWIM "weakness": the Down state is terminal and there's no way for a member to differentiate "my packets are being ignored" from "the other nodes are down".

I think we can try to improve that a bit by extending the join sub-protocol (this is foca-specific; SWIM doesn't specify one, as the paper relies on multicast): right now we simply discard packets from dead members, but we could instead add logic so that when such a packet is a Message::Announce, we reply with a new message telling the sender that their identity is considered down and they should renew it.

I would've thought that SWIM would constantly be advertising other nodes to each other.

It doesn't :( Hashicorp's memberlist does, however. It uses a mechanism they often call "push/pull", where it periodically connects two members via TCP and does a two-way knowledge sync between them. You can implement a similar thing and use Foca::apply_many when merging the info if you like their approach.

Alternatively, you can approximate this logic by periodically announcing to a random member since the reply to an announce message (Feed) contains a sample of the active set. I think foca could offer this as a feature via config since it's simple enough and sounds super useful to help the cluster converge.

RE your config: I remember you mentioning not caring much about the failure detection part. If you don't want to blindly bump the probe period, perhaps you can just give more time for members to refute a suspicion (i.e. a higher Config::suspect_to_down_after): the larger the cluster, the longer it takes for it to converge to the "same" knowledge, and a suspicion cycle requires this to happen twice (memberA declares memberB as suspect -> memberB receives an update with the suspicion from someone else -> memberA receives B's refutation from someone else).

@jeromegn (Author) commented Nov 6, 2022

Alternatively, you can approximate this logic by periodically announcing to a random member since the reply to an announce message (Feed) contains a sample of the active set. I think foca could offer this as a feature via config since it's simple enough and sounds super useful to help the cluster converge.

That could work. Is that the equivalent of a "non-fast" rejoin?

RE your config: I remember you mentioning not caring much about the failure detection part. If you don't want to blindly bump the probe period, perhaps you can just give more time for members to refute a suspicion (i.e. a higher Config::suspect_to_down_after): the larger the cluster, the longer it takes for it to converge to the "same" knowledge, and a suspicion cycle requires this to happen twice (memberA declares memberB as suspect -> memberB receives an update with the suspicion from someone else -> memberA receives B's refutation from someone else).

Failure detection is going to become more useful very fast. If I still want somewhat reliable failure detection, but don't want to declare members down too hastily, which config should I mess with? I'm fine with blindly trying a few values until it works out!

@caio (Owner) commented Nov 7, 2022

That could work. Is that the equivalent of a "non-fast" rejoin?

Kinda. Every valid message is interpreted as a join [1]; the interesting bit about Announce is that it instructs the receiver to reply with a list of active members. It's just a very cheap way of asking peers for more peers.

The fast rejoin happens when foca learns the cluster thinks it's down: It is a change_identity followed by gossip (notice how it doesn't send an Announce). It's just functionality to save you from having to watch for the notification and doing the ceremony yourself.

Failure detection is going to become more useful very fast. If I still want somewhat reliable failure detection, but don't want to declare members down too hastily, which config should I mess with? I'm fine with blindly trying a few values until it works out!

I suspect these false positives are mostly due to the slow convergence of the long tail: the new_wan/lan values were derived from memberlist which has built-in periodic gossiping enabled. Bumping suspect_to_down_after would be a simple way of giving it more time, but since failure detection is back in the game, I'd start with emulating that: calling foca::gossip every 500ms. (this is another easy feature to implement inside foca)
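
On the application side that's just a small loop today (sketch; assumes tokio, tracing for logging, and whatever Runtime impl you already drive foca with):

use std::time::Duration;

let mut gossip_interval = tokio::time::interval(Duration::from_millis(500));
loop {
    gossip_interval.tick().await;
    // Piggybacks pending cluster updates onto a few random active members.
    if let Err(err) = foca.gossip(&mut runtime) {
        tracing::warn!("gossip failed: {err}");
    }
}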

Just to summarize the improvements we've been discussing so far:

  1. A way for members to learn that their messages are being ignored, so they can react to the situation where they've been declared down but haven't received that info through normal means
  2. Periodic announce() driven by foca, to emulate memberlist's push/pull mechanism (memberlist does it every minute, but it's two-way and synchronous; might end up going for higher frequency + fan out, say: every 30s announce to 3 members)
  3. Periodic gossip() driven by foca, to speed up cluster update propagation

2 and 3 are quite easy to implement outside of foca, but useful enough that I think it's worth bringing them into it. I think I can ship these sometime this week.

Footnotes

  1. Here is where the "join" happens: foca simply tries to update its cluster knowledge; if the sender is already known, it's a no-op. In fact, you don't need Announce to join a cluster at all: you can just tell foca that a member is alive (via foca::apply) and it will start doing its thing.

@jeromegn (Author) commented Nov 7, 2022

Thanks for the tips!

I did change our config generator over the weekend and it seems like it had the opposite effect to what I was hoping for.

(screenshot: gossip members graph)

That green line is the only node with the change in how we generate the config.

fn make_foca_config(cluster_size: NonZeroU32) -> foca::Config {
    // cluster_size is generally over 200, should be closer to 300 when everything works well
    let mut config = foca::Config::new_wan(cluster_size);
    config.max_packet_size = EFFECTIVE_CAP.try_into().unwrap(); // max payload size for udp over ipv6 wg - 1 for payload type
    // 12.0 multiplier instead of 6.0 in foca
    config.suspect_to_down_after = suspicion_duration(cluster_size, config.probe_period, 12.0);
    // that's 50 seconds I believe.
    config.remove_down_after = 10 * config.probe_period;
    config
}

fn suspicion_duration(cluster_size: NonZeroU32, probe_period: Duration, multiplier: f64) -> Duration {
    let secs = f64::max(1.0, f64::from(cluster_size.get()).log10()) * multiplier * probe_period.as_secs_f64();

    // NOTE: `Duration::from_secs_f64` is panicky, but:
    //  - multiplier is either 4 or 6
    //  - probe_period is either 1 or 5 secs
    // So we know `secs` is finite, greater than zero and won't overflow
    Duration::from_secs_f64(secs)
}

(I copied the function for calculating suspicion duration, but I'm passing in different, higher, values.)

Should I remove the change to remove_down_after?

I'd start with emulating that: calling foca::gossip every 500ms. (this is another easy feature to implement inside foca)

We're calling foca::gossip every 200ms presently.

@caio (Owner) commented Nov 8, 2022

Ooof that's pretty weird! But I think I know what's up with the low cluster size. If you revert the config change and restart just that one node, I suspect it will behave the exact same way:

The cause of this is that in many cases a Feed message will not contain every member in the list (all identities must fit in max_packet_size for that to happen). So when Green sent its initial announce to join the cluster, it learned about these ~220 members and, from then on, its roster would only change when the currently unknown nodes change their identities for whatever reason or decide to communicate (via probe / gossip) with Green. This would look even worse on a larger cluster 😅

So, the current join protocol is not good enough for discovering all members, as I failed to consider this case; it's good enough for joining the cluster. Doing a periodic announce now sounds like more of a requirement than a nice-to-have for any reasonably sized cluster heh

That said, you can probably roll a custom join mechanism that can beat any randomness-based approach: if you know which addresses should be part of the cluster, you can just periodically announce straight to the ones that are not in your active members list. So, essentially relying on SWIM for liveness/failure detection only.
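
Sketch of what I mean (names are placeholders: expected comes from your source of truth for cluster membership, active is your own live-members set, and identity_for stands for however you build an identity from a bare address):

use std::collections::HashSet;
use std::net::SocketAddr;

// Addresses that should be in the cluster but aren't in our active set.
fn missing(expected: &HashSet<SocketAddr>, active: &HashSet<SocketAddr>) -> Vec<SocketAddr> {
    expected.difference(active).copied().collect()
}

// Periodically announce straight to a few of the missing ones.
for addr in missing(&expected, &active).into_iter().take(3) {
    let _ = foca.announce(identity_for(addr), &mut runtime);
}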

Should I remove the change to remove_down_after?

Hard to say given the problem I described above. The graph seems to imply that members are going down often in your cluster, but if remove_down_after has such a noticeable impact I'd suspect something is off wrt renewing identities (i.e.: are instances restarting often? is there anything that tries to reuse a previous identity?). One way of testing this is setting the value to something very large: does the downward trend get worse?

We're calling foca::gossip every 200ms presently.

Cool. You got to this "high" frequency gossip because of the custom broadcasts, right? I wonder if gossiping often can have an impact on the recovery: a member goes down, stops receiving cluster updates, and then by the time it comes back it's unable to catch up with the cluster changes it missed because all updates have been disseminated already... This could explain one thing that memberlist does that still puzzles me: it has a feature to gossip to the dead.

In fact, as I write this, I think that fits perfectly with what we're seeing in the graph + the impact of remove_down_after: members are being declared down but it looks like they only sometimes (maybe never?) learn that info, so they keep chatting to the void for a while (declaring other members as down, as a normal part of the cycle) and when remove_down_after expires they "magically" reappear. That would explain why there's a constant sharp drop in members.

I'll give it a think- if that's the case, "gossiping to the dead" like memberlist does could be a solution; the feature to notify chatty members that they're actually dead would definitely help too.

Man, I'm sorry you're having to discover these issues on the go, but it's super helpful for me to diagnose and fix issues- Thanks a lot for the patience and detailed reports so far 😁

@jeromegn (Author) commented Nov 8, 2022

My pleasure. We don't expect anything from you of course, but we're happy if these help.

Our nodes rarely ever go down. We have our own monitoring system and a down node is rare.

Our graph of members looks like this most of the time:
(screenshot: members graph, mostly flat)

When we have fewer nodes (e.g. 120), this graph is a straight line: they all know of each other. I only know this because sometimes I have to reset my database, and when I'm bringing it back up I need to do it in a targeted manner.

I did notice the 500ms gossip interval seemed to have helped, along with the new durations in the other settings.

@jeromegn (Author) commented Nov 9, 2022

Something else that might affect gossip: my Identity has become a bit big.

#[derive(Clone, PartialEq, Eq, Deserialize, Serialize)]
pub struct Actor {
    id: ActorId,
    name: ActorName,
    addr: SocketAddr,
    region: String,
    datacenter: String,
    write_mode: WriteMode,
    instance_id: Uuid,

    // An extra field to allow fast rejoin
    bump: u16,
}

@jeromegn (Author):

Another update:

Somehow every node now thinks there are more members than there actually are and it stays that way. I'm really not sure why, but I think I'll revert my changes to the "suspect down duration".

There are ~310 nodes, and no node thinks there are fewer than 400.

Could be another change I made, I'll have to check.

I'm thinking I should reduce the size of my Identity so it can fit in more gossip payloads? It's not that big, probably something like 90 bytes, but that would make gossip slower than if it was just a SocketAddr + the bump u16, I assume.

Our gossip payload size is 1372 because it's UDP over IPv6 over WireGuard. I sure hope my calculation was correct there!

pub const FRAGMENTS_AT: usize = 1372; // 1420 MTU on wg0 - 40 bytes IPv6 header - 8 bytes UDP header

I'm technically subtracting 1 byte from it (pub const EFFECTIVE_CAP: usize = FRAGMENTS_AT - 1;) because I'm passing different kinds of messages on the same socket. Basically a prefix u8 of 0 is a foca payload, and 1 is a broadcast payload.
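
The receive path is roughly this (sketch; handle_broadcast stands in for our own handler):

match buf.split_first() {
    // 0-prefixed: strip the prefix and hand the rest to foca (SWIM traffic)
    Some((&0, rest)) => {
        if let Err(err) = foca.handle_data(rest, &mut runtime) {
            tracing::warn!("foca error: {err}");
        }
    }
    // 1-prefixed: our own broadcast handling
    Some((&1, rest)) => handle_broadcast(rest),
    // empty or unknown prefix: ignore
    _ => {}
}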

@jeromegn (Author):

Ok the members count is a different problem I think.

Our internal Members map (similar to the one in the examples) is correctly tracking the number of members, but foca.num_members() is returning a figure over 400.

What does that mean?

Here's my Identity implementation:

impl Identity for Actor {
    fn has_same_prefix(&self, other: &Self) -> bool {
        // necessary check before we know the full identity, when we need to announce to them
        if other.id.is_nil() || self.id.is_nil() {
            self.addr.eq(&other.addr)
        } else {
            self.id.eq(&other.id)
        }
    }

    fn renew(&self) -> Option<Self> {
        Some(Self {
            id: self.id,
            name: self.name.clone(),
            addr: self.addr,
            region: self.region.clone(),
            datacenter: self.datacenter.clone(),
            write_mode: self.write_mode,
            instance_id: self.instance_id,
            bump: self.bump.wrapping_add(1),
        })
    }
}

My internal Members hashmap is keyed by ActorId (UUID), so it should end up with the same set.

@caio (Owner) commented Nov 11, 2022

Hah I suspect you were looking at the side effects of an unclean restart:

When an instance leaves the cluster without notifying it (i.e.: without Foca::leave_cluster, for example), it takes a little while for its exit to be detected: somebody has to ping it, wait for probe_period, suspect it and then wait for suspect_to_down_after; Only after this the member will stop being considered active.

This means that if you restart an instance in a way that it keeps its addr but not the overall identity (say, the bump goes back to zero, or instance_id changes), the cluster will be seeing two identities with the same address at once.

This can happen during normal circumstances too (albeit rarely): an instance may learn about the new identity before it learns that the previous one is down.

That's the reason the Members hashmap in the examples uses a counter as the value (as opposed to just being a hashset): from Foca's perspective, there are multiple members but from a user perspective, the address is still active.

You can find the culprits by looking at the counters in the hashmap: anything that's above one was likely restarted recently.
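
For reference, the bookkeeping from the examples boils down to something like this (sketch; names are illustrative, and yours is keyed by ActorId instead of the address):

use std::collections::HashMap;
use std::net::SocketAddr;

// Counts active identities per address; an address is "up" while its count
// is non-zero, even if foca temporarily sees two identities for it.
#[derive(Default)]
struct Members(HashMap<SocketAddr, usize>);

impl Members {
    // Returns true when the address just became active.
    fn add(&mut self, addr: SocketAddr) -> bool {
        let count = self.0.entry(addr).or_insert(0);
        *count += 1;
        *count == 1
    }

    // Returns true when the address just became inactive.
    fn remove(&mut self, addr: SocketAddr) -> bool {
        if let Some(count) = self.0.get_mut(&addr) {
            *count -= 1;
            if *count == 0 {
                self.0.remove(&addr);
                return true;
            }
        }
        false
    }

    // Addresses with more than one live identity: likely recent unclean restarts.
    fn multi_identity(&self) -> Vec<SocketAddr> {
        self.0
            .iter()
            .filter(|(_, count)| **count > 1)
            .map(|(addr, _)| *addr)
            .collect()
    }
}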

That being said, with some changes we may be able to detect the case of an unclean exit followed by a join by making the Identity trait smarter: has_same_prefix is kept super simple because it gives flexibility, but we could instead have a method that answers whether one identity replaces another (think version checks).


Our gossip payload size is 1372 because it's UDP over IPv6 over WireGuard. I sure hope my calculation was correct there!

Looks reasonable to me, but I don't have much experience with this part. I wouldn't worry much about it, maybe just send a few fat packets from one end and see if they end up being fragmented.

Quoted this mostly to say that your identity byte size looks OK to me: a gossip packet will still be able to ship ~10 distinct updates. Minimizing id length is worthwhile from a bandwidth perspective because the id is the largest thing in the payload, but large ids shouldn't impact functionality.

I think most of the fluctuations you're seeing are due to the member's inability to find out that they've been declared dead and the consequences of thinking they're alive while everybody ignores their messages. Things should look a lot more stable once #16 lands

@jeromegn (Author):

I suspect you were looking at the side effects of an unclean restart

I do notice that when I restart the cluster (picking a few random nodes to restart in parallel), then the number of members grows temporarily.

However, this is different. Nearly 24 hours since the last restart and the number is still inflated. Looks like some nodes never get cleaned up? Not entirely sure.

@caio (Owner) commented Nov 11, 2022

Ooh that's interesting! Are you able to dump foca.iter_active() or perhaps the per-address counts? Would be helpful to know if this is a cluster-wide problem and many nodes have multiple active identities or if it's just a rogue one.

Did you start doing a periodic announce? If so, maybe it's an interaction with the problem of a node never learning that it was declared down... e.g.: NodeA has a set of members with the identities from before a restart; then it gets declared down - meanwhile, the cluster learns that some of the old ids are down and cleans them up; then NodeA comes back and feeds the old ids back to the cluster as valid members.

@jeromegn (Author):

Yes, I am periodically announcing nodes to each other every minute. I could probably remove that for now.

@caio (Owner) commented Nov 11, 2022

Great, looks like my theory was correct. Things should improve soon if my idea to notify dead members works

One way to flush out these nodes in your current setup would be to set remove_down_after to a very high number; then nobody would forget about the old ids. But don't do that: I think if you do it right now (since it lacks down awareness) your cluster size will slowly decrease to zero 😅

Thinking about this a bit, I think this is a problem foca can't solve generically: nothing prevents a node from appearing with a list of dead-nodes-that-we-forgot-about. So in the future, once we know that we can recover, I think the right way to go would be to have really high values for remove_down_after.

@caio (Owner) commented Nov 13, 2022

#16 merged, v0.6.0 released

I expect that once your whole cluster gets this update, your gossip members graph will look a lot more stable.

Config::new_wan will enable everything we've discussed: periodic announce, periodic gossip and down member notification.

If you do this change "live", the dead members that keep coming back will still appear in the roster for a while but they should slowly disappear. If a member really goes down and magically pops back up (say: it gets partitioned for a few minutes), it may still end up feeding previously known dead members to the cluster: you can bump remove_down_after to a really high number to prevent this from ever happening; But even if it does happen, if your cluster is stable, dead nodes should disappear after a while regardless of the setting.

As I imagine you'll do a rolling restart to pick these updates up, you might see a few Decode errors, as old clients will be oblivious to the new message foca may send.

Here's hoping things look saner now! 🚀

@jeromegn (Author):

I've rolled this out and it's nice to be able to get rid of both the custom announce interval and the gossip interval and have that baked in.

Looks like the number of active members doesn't reflect reality, even after 3 hours with the new version deployed.

(screenshot: active members graph after the rollout)

I'm using the default setting for everything. The main thing I do is I'm constantly reloading the configuration as members are added or removed, just so I can update the cluster size dynamically. Should I not be doing that? :)

// ...
FocaInput::ClusterSize(size) => {
    debug!("Adjusting cluster size to {size}");
    config = make_foca_config(size);
    foca.set_config(config.clone())
}
// ...

fn make_foca_config(cluster_size: NonZeroU32) -> foca::Config {
    let mut config = foca::Config::new_wan(cluster_size);

    // max payload size for udp over ipv6 wg - 1 for payload type
    config.max_packet_size = EFFECTIVE_CAP.try_into().unwrap();

    config
}

I'm pushing FocaInput::ClusterSize every time a member is really removed or really added.

@caio (Owner) commented Nov 13, 2022

That's disappointing haha

I still think it's something to do with flapping due to an interaction between remove_down_after and periodic announcing; 15s is too short of a time, especially on a large cluster I think... Can you try a very large value there? Say Duration::from_secs(60 * 60)

memberlist defaults to never cleaning up and now foca members know how to recover properly so I think it's a safe try.

I'm using the default setting for everything. The main thing I do is I'm constantly reloading the configuration as members are added or removed, just so I can update the cluster size dynamically. Should I not be doing that? :)

That's absolutely fine- you're using it exactly as I thought it would be used

@jeromegn (Author):

It's looking a lot better with a 60s remove_down_after!

(screenshot: members graph after raising remove_down_after)

I'm wondering if the issue is that there's just too much to gossip? There are a lot of members, so that means there are a lot of pings to send and acks to receive (and all the other gossip payload types). Meaning foca might not "get to" relevant messages in time? The 15s default was surely too low to go through all members.

1 hour seems a bit high :) maybe 2 minutes is enough for our current cluster size. I'll play with that value a bit.

@jeromegn (Author) commented Nov 14, 2022

With a 2 minutes remove_down_after:

(screenshot: members graph, lines overlapping)

Can't see much because they're all overlapping 😄; now there's only a difference of 1 member for some nodes.

@caio (Owner) commented Nov 14, 2022

Sweet! That's what I hoped to see after the release 😁

So we need a better default for remove_down_after now- the minimum "good" value should be somewhere around probe period + the suspicion timeout + propagation times, but large values are perfectly harmless (I'm assuming the 1h setting would end up with this beautiful flat line too after a while). I think I'll use your 2min as the default; we know it works well! haha
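
In code, that floor would be something along these lines (sketch; the 30s propagation slack is just a guess you'd tune per cluster):

use std::time::Duration;

// Give the cluster at least one full probe + suspicion cycle, plus some
// slack for the "down" update to propagate, before forgetting a member.
let propagation_slack = Duration::from_secs(30);
let floor = config.probe_period + config.suspect_to_down_after + propagation_slack;
config.remove_down_after = floor.max(Duration::from_secs(120));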

I'm wondering if the issue is that there's just too much to gossip? There are a lot of members, so that means there are a lot of pings to send and acks to receive (and all the other gossip payload types). Meaning foca might not "get to" relevant messages in time? The 15s default was surely too low to go through all members.

My understanding is that the problem with 15s was that the cluster would quickly learn that the member went down, but then quickly forget it too- So then when members announced, they had a high chance of receiving back the member they just forgot was down. The knowledge could have remained in the cluster longer with a less frequent gossip (or a higher max_transmissions) but I think the right fix is what you got now: a longer wait until forgetting about down members

@jeromegn (Author):

We often get in a situation where a single member knows about ~1-2 more/fewer members than the rest of the cluster. I do not know how this happens! It stays that way indefinitely.

The rest of the cluster is in agreement on a count of 309 members.

When I restart my program on the member with the wrong count, it fixes itself.

(screenshot: members-count graph showing the outlier node)

@caio (Owner) commented Nov 26, 2022

Now that's a head scratcher 😅

I wonder if this is the case of a member learning about its own previous identity and then taking forever to flush it out (can take up to cluster_size * (probe_period + suspect_to_down) for it to pick up a particular member to probe since it uses a round-robin+shuffle approach)...

If that's the case, there should be a sign of this happening: a (rare) Error::DataFromOurselves on your logs. I think I'm able to artificially create this scenario, but never expected to see it actually happening - perhaps I need to think through the identity renewal path more carefully.

Your config is pretty much Config::new_wan() right? What's the actual total number of nodes in your cluster? Can I get a view of your gossip members graph over a span of 24h (wondering if it's always a flat stable graph or if there's some jitter)?

@jeromegn (Author):

If that's the case, there should be a sign of this happening: a (rare) Error::DataFromOurselves on your logs

I can't find any such logs in the past few days (and the aforementioned issue definitely happened recently).

I did find a few others though:

error handling timer: BUG! Probe cycle finished without running its full course
Received data from something claiming to have an identity equal to our own

These sound like issues with how I'm encoding data on the wire:

Received data larger than maximum configured limit
UUID parsing failed: invalid length: expected 16 bytes, found 50
Invalid value (u128 range): you may have a version or configuration disagreement?
Byte 255 is treated as an extension point; it should not be encoding anything. Do you have a mismatched bincode version or configuration?

I wonder if any encoding issue could cause random data to be interpreted as a valid foca message? hmmm. Probably unlikely.

Your config is pretty much Config::new_wan() right? What's the actual total number of nodes in your cluster?

Yes. Roughly 315 nodes. Grows by a few every week.

Can I get a view of your gossip members graph over a span of 24h (wondering if it's always a flat stable graph or if there's some jitter)?

I can show you the last 12 hours, the rest moved too much since I deployed a few times.

(screenshot: gossip members graph over the last 12 hours)

It's a bit hard to tell, but right now most nodes have settled at 313. It does look like the right number is 316 and some nodes were lost.

@caio (Owner) commented Nov 27, 2022

Thanks a lot for all the info!

Received data from something claiming to have an identity equal to our own

That's the error I was looking for, nice! (DataFromOurselves is the enum discriminant, forgot that it would be rendered to something better on display)

There's a single point that creates the message headers and a single point that emits this error so I think this is a reliable datapoint.

My first guess would be that there is more than one node using the same identity.

But considering the other errors you posted, this may be a corruption thing: you're using bincode and foca recently introduced a new message (to notify chatty dead members); I wouldn't be surprised if receiving this new message on a node running a previous foca version led to mayhem. Can you verify that all your nodes are running the same version/config?

Foca could have a checksum and protocol version in the payload to help with such cases- I'm a bit hesitant to add a protocol version because the follow-up is usually very annoying to deal with: supporting multiple versions at once. I'd recommend putting a tag with a version/checksum at the beginning of the payload if you can (akin to what you're doing for your broadcasts).
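
Something like this at the edge of your socket handling would do it (sketch; crc32fast is an arbitrary checksum choice and the framing costs 6 extra bytes per packet):

const WIRE_VERSION: u8 = 1;

// Layout: [kind: u8][version: u8][crc32 of body: u32 LE][body]
fn frame(kind: u8, body: &[u8], out: &mut Vec<u8>) {
    out.push(kind);
    out.push(WIRE_VERSION);
    out.extend_from_slice(&crc32fast::hash(body).to_le_bytes());
    out.extend_from_slice(body);
}

// Returns (kind, body) only if the version matches and the checksum is valid.
fn unframe(buf: &[u8]) -> Option<(u8, &[u8])> {
    let (&kind, rest) = buf.split_first()?;
    let (&version, rest) = rest.split_first()?;
    if version != WIRE_VERSION || rest.len() < 4 {
        return None;
    }
    let (crc, body) = rest.split_at(4);
    let crc = u32::from_le_bytes(crc.try_into().ok()?);
    (crc == crc32fast::hash(body)).then_some((kind, body))
}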


Seeing the cluster settle on two different numbers (between hours 8 and 10, counts are either 316 or 314) makes me think there's some split-brain happening. I would expect more fluctuations in the numbers tho (maybe it just looks super stable because the time series db is rolling the counts up for scaling thus hiding the jitter? i.e. maybe it fluctuates minutely but we can't see it here since it had to merge values to render the graph? your other screenshots suggest that this is not the case).

I'll get a patch ready to prevent an instance from ever learning about its previous identity. Should mitigate what I'm worried about, if it's happening at all (I think we can trigger this by doing an unclean restart of the same node multiple times in a row).


As for the unfinished probe cycle: that's super puzzling! It's hard to reason about it while there's the possibility of identity clash / valid-but-corrupt messages, so I'll leave this one tabled for now. At least we know that foca gracefully recovers from these 😁

@caio (Owner) commented Nov 27, 2022

Released v0.7.0! It contains a commit to prevent foca from learning about its own (previous) id; This may have been one of the causes for an inflated members count

@jeromegn (Author) commented Nov 27, 2022

Thanks, trying it out now, will report on how it's doing soon!

As for the corrupt messages: I'm wrapping messages with my own types too. Basically sending various types of messages: SWIM, Broadcast and Priority Broadcast.

SWIM are foca-generated messages, I prepend a 0 byte before these messages.

Broadcasts work similarly to foca broadcast but are more efficient (to work around #11). 1-prefixed.

Priority Broadcasts are basically direct broadcasting (not random) to nodes particularly interested in certain kinds of data. The main difference from regular broadcasts is that these aren't re-broadcast to other nodes. 2-prefixed.

Some messages are sent via HTTP instead of UDP if they're too big to fit in a single UDP packet.

I'm reusing buffers a bunch, but it's possible I got it wrong and under certain circumstances I'm not clearing one properly.

@jeromegn (Author) commented Dec 5, 2022

I think things are better since the latest release.

I still see some divergence in the number of members in my cluster over time.

For example, every node agreed that there were 318 members. Then they all agreed there were 317 members, except 1 node.

(screenshot: members-count graph with one diverging node)

So maybe that node somehow became unreachable from the others?

I should add something to inspect the foca members. Since it's in its own loop it's a little hard (but not too hard, just have to do it).

@caio (Owner) commented Dec 10, 2022

Glad to see things seem to have improved!

I should add something to inspect the foca members. Since it's in its own loop it's a little hard (but not too hard, just have to do it).

Agreed- this is probably the best course of action to figure out what's up. It doesn't need to be super complicated, even. I'd approach it by creating an endpoint that makes a node dump its current foca.identity() and foca.iter_members() somewhere. Then when you see the situation in the graphs as above, you just poke every member to dump it.

We'd be mostly interested in:

  • who it is that the cluster thinks is down; maybe it's behaving as a bad actor (identity clash / old version / flaky network)
  • are members still learning about their own previous identities (they shouldn't, but maybe there's a bug?)
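
Something as small as this behind an HTTP endpoint (or a signal handler) would do (sketch; assumes the identity type implements Debug and that you log via tracing; the exact item yielded by iter_members depends on the foca version):

// Dump the node's own identity plus everything it currently considers active.
tracing::info!(identity = ?foca.identity(), num_members = foca.num_members(), "membership dump");
for member in foca.iter_members() {
    tracing::info!(?member, "known member");
}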

@jeromegn (Author):

Good ideas. I'll try and get to that this week.

@jeromegn (Author):

It looks like this has been resolved as I fixed another issue in our logic. A buffered channel would get full and block the whole "gossip loop".

The number of members has been stable for over a week of not touching the cluster!
