-
Notifications
You must be signed in to change notification settings - Fork 78
Cluster gossip stop condition fixes and convergence improvements #699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
in order to remove it upon termination
since it may cause events, or it may cause convergence to trigger which means if we're a leader we must perform our duties
…D for eagerly rejecting
|
Getting a bit late, catching sleep for Labs :) Will do inline comments to ease review and sanity check the changes myself again tomorrow tho. |
Sources/DistributedActors/Cluster/Cluster+MembershipGossipLogic.swift
Outdated
Show resolved
Hide resolved
| func select() -> Peers | ||
| } | ||
|
|
||
| // public struct StableRandomRoundRobin<Peer: Hashable> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to clean up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, i missed this!
I would so love to have all these strategies implemented as PeerSelection (protocol) but have not had enough brain yet to figure it out... It's a minor side project thing though I guess for now, so not going to impl it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Mis-implementing peer selection yet again has bitten me and caused bugs here! Those seem trivial but are not)
| ]) | ||
|
|
||
| let ack = target.ask(for: GossipACK.self, timeout: .seconds(3)) { replyTo in | ||
| // TODO: configurable timeout? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on it.
This made me realize that if an impl was NOT ack-ing then we'd get ask timeouts unnecessarily, fixing that as configuration and adding test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implemented and tested now https://github.com/apple/swift-distributed-actors/pull/699/files#diff-8396b4fe838cabc279f1a041cc6bd45dR128
gossip settings have: style: .acknowledged(timeout: .seconds(1)),
…nexpected ACks are required
…ves ordering observed on event stream
|
I think we're green 🙀 🟢 Including 40 minutes of repeat runs... |
|
Motivation:
Cluster convergence could take arbitrary time if unlucky -- causing flaky tests (and crappy timing of members moving up etc).
Modifications:
stopping to gossip on convergence is wrong - it's way too early, we don't know if all members have seen our full seen table. we only know that the membership we see is the same
fixed by storing gossips from other nodes and gossiping to them if they differ; This can be optimized a lot (just sending digests) but is not done today.
Result:
Stable downing and any cluster tests.
Stable clustering wrt. up/down/removed events