Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Membership discovery and (self) awareness improvements #16

Merged
merged 6 commits into from Nov 13, 2022
Merged

Membership discovery and (self) awareness improvements #16

merged 6 commits into from Nov 13, 2022

Conversation

caio
Copy link
Owner

@caio caio commented Nov 9, 2022

Going through features discussed on #15. I'll slowly be ticking the boxes as I complete each:

  • A way for members to learn that their messages are being ignored, so they can react to the situation where they've been declared down but haven't received that info through normal means
  • Periodic announce() to increase the speed to discover the whole cluster
  • Periodic gossip(), to speed up cluster update propagation

tracing-attributes 0.1.23 ships with a smarter `Span::record` that
allows owned values [^1], which makes

This means that previously invalid code such as:

    span.record("num_updates", num_items);

Is now valid, which makes clippy throw a warning for needless_borrow

Fixing the lint, however, means that foca would require a strict version
check for tracing-attributes (otherwise anything pulling 0.1.22 or
bellow would not compile anymore).

I don't want a strict dependency on tracing: it's tangential do foca and
I'd like users to benefit from the improvements that might be shipped
along, so I fix the lint warning by ignoring it instead.

Once we get a (at least minor) version bump on `tracing`, this can be
removed and updated to skip the borrow.

[^1]: tokio-rs/tracing#2212
When a member is declared down by the cluster, its messages get ignored
until the cluster "forgets" it went down (i.e.: after
`Config::remove_down_after` expires). This patch introduces a new
setting `Config::notify_down_members` which makes foca send
a lightweight message to said members when they're down.
we'll be repeating the pattern of picking random members and sending
a message to each
This introduces Config::periodic_announce, which allows users to
instrcut foca to periodically send a Announce message to random
members so that it learns about every cluster member faster
This introduces Config::periodic_gossip which instructs foca to
disseminate cluster updates to random members with a fixed frequency so
that the cluster learns new information faster

periodic announce paved the way, so this was mostly copy-paste :-)
@caio
Copy link
Owner Author

caio commented Nov 12, 2022

Good to go! Gonna leave this running for a while and inject a few failures to see if I can catch any obvious bugs before shipping (so maybe tomorrow).

Being cautious mostly because of 9b9dc7b (notifying chatty dead members that they're down). It's a very simple attempt at fixing the awareness problem.. can't help but think I'm missing something really obvious that makes this a bad idea.

@jeromegn
Copy link

Exciting!

What do you mean by:

Gonna leave this running for a while

Is there something fuzzing or testing this continuously? :)

@caio
Copy link
Owner Author

caio commented Nov 13, 2022

Haha I wish! Nothing as sophisticated as that at present. What I meant was that I put the code on this branch live in my LAN cluster to see how it behaves.

And by injecting failures I meant that my nodes have a little helper endpoint that lets me "drive" the cluster. Essentially, it takes a target address and does (maybe this is an interesting thing to do on your end, to see how the cluster behaves):

let found = foca.iter_members().find(|m| m.addr == target.addr).cloned();
if let Some(member) = found {
    let _ = foca.apply_many(
        std::iter::once(Member::new(member, 0, foca::State::Down)),
        &mut runtime,
    );
}

Technically, a gossip with no updates is still useful: it tells the
receiver that the sender is active so it helps with discovery and,
now, with the member gaining awareness that the cluster thinks its
dead.

It is, however, too chatty. During normal operations the updates buffer
is mostly empty; If we start sending messages with no updates too often
it won't be much different from a traditional gossip protocol with a
heartbeat.

So, since pings and announces already cover the case of talking to
the cluster unconditionally, it makes sense to make periodic gossiping
"smarter"
@caio
Copy link
Owner Author

caio commented Nov 13, 2022

Merging. Only had to fix the logic for periodic gossiping, making it only send if there are updates to be sent; Without that the cluster was super chatty (tiny packets, but still...) without any reason.

@caio caio merged commit 8be5f04 into main Nov 13, 2022
@caio caio deleted the next branch November 13, 2022 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants