@ktoso ktoso commented Jun 23, 2020

Motivation:

Cluster convergence could take an arbitrarily long time if unlucky, causing flaky tests (and poor timing of membership changes, such as members moving to up).

Modifications:

  • predictable and stable gossip peer selection
    • can still be improved, but that would be a nice bonus; there are TODOs in the source and I will make tickets for them
  • simulation tests that stress the Logic over many rounds; a very quick and repeatable way to get failures
  • fixed a bug in rejecting inbound ongoing handshakes: we did so too eagerly. This used to be right in the old Associations world, but no longer is. Fixed, and ensured that if we ever change it we won't miss that source location, by making it switch over the enum
  • the gossiper was not watching peers 😱 😱 😱 which would cause dead peers to be offered to the gossip logic, which could end up relying on getting information to those nodes, which is impossible. Fixed, and the fix applies to all Gossiper users, including CRDTs.
  • implemented ACKs so that they are under the control of Gossiper users: any incoming gossip may be ACKed, and we use this to great advantage.
  • gossiping now has a predictable and safe stop condition
    • stopping gossip on convergence is wrong: it is way too early, since we don't know whether all members have seen our full seen table; we only know that the membership we see is the same
    • fixed by storing gossips from other nodes and gossiping to them if they differ. This can be optimized a lot (by just sending digests), but that is not done today.
      • with gossip compression this is not too bad, and it also speeds up convergence: a gossip round is now "both ways", which halves the number of rounds needed
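
The stop condition described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `GossipStopCondition`, `record(_:from:)` and `peersToGossip(among:ours:)` are hypothetical names, and the gossip payload is reduced to any `Equatable` value. The idea is to remember the last gossip received from each peer and keep gossiping to every peer whose remembered gossip differs from ours (or from whom nothing has been received yet).

```swift
// Hypothetical sketch; none of these names are taken from the PR's source.
struct GossipStopCondition<Peer: Hashable, Gossip: Equatable> {
    /// Latest gossip payload received from each peer.
    private var lastSeen: [Peer: Gossip] = [:]

    /// Record what a peer most recently told us.
    mutating func record(_ gossip: Gossip, from peer: Peer) {
        lastSeen[peer] = gossip
    }

    /// Peers that still need gossip: those we never heard from,
    /// or whose last known gossip differs from our own state.
    func peersToGossip(among peers: [Peer], ours: Gossip) -> [Peer] {
        peers.filter { lastSeen[$0] != ours }
    }
}

// Usage: "a" already agrees with us, "b" is behind, "c" was never heard from.
var condition = GossipStopCondition<String, Int>()
condition.record(3, from: "a")
condition.record(2, from: "b")
let remaining = condition.peersToGossip(among: ["a", "b", "c"], ours: 3)
```

Once `remaining` is empty, every known peer has echoed back the same state we hold, which is a stronger signal than local convergence alone.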

Result:

Stable downing tests, and stable cluster tests in general.

Stable clustering wrt. up/down/removed events

ktoso commented Jun 23, 2020

Getting a bit late, catching sleep for Labs :)

Will do inline comments to ease review and sanity check the changes myself again tomorrow tho.

func select() -> Peers
}

// public struct StableRandomRoundRobin<Peer: Hashable> {
Member

reminder to clean up

Member Author

Thank you, I missed this!

I would so love to have all these strategies implemented as PeerSelection (protocol) but have not had enough brain yet to figure it out... It's a minor side project thing though I guess for now, so not going to impl it yet.

Member Author

(Mis-implementing peer selection yet again has bitten me and caused bugs here! Those seem trivial but are not)
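
For illustration only, here is one way a `PeerSelection` protocol with a stable round-robin strategy might look. Everything below is an assumption rather than the PR's code: the protocol shape, the `StableRoundRobin` type, and the choice of sorted order (instead of a seeded shuffle, as the name `StableRandomRoundRobin` suggests) are all hypothetical. The sketch keeps peers in a fixed order, visits each one once per round, and resumes near the same position when membership changes.

```swift
// Hypothetical protocol shape; the PR does not define this.
protocol PeerSelection {
    associatedtype Peer: Hashable
    mutating func select() -> Peer?
}

// Sorted order stands in for a stable (seeded) shuffle in this sketch.
struct StableRoundRobin<Peer: Hashable & Comparable>: PeerSelection {
    private var order: [Peer] = []
    private var next = 0

    /// Replace the peer set, resuming the rotation at (or just after)
    /// the peer that was due to be selected next.
    mutating func update(_ peers: Set<Peer>) {
        let due: Peer? = order.isEmpty ? nil : order[next]
        order = peers.sorted()
        if let due = due, let index = order.firstIndex(where: { $0 >= due }) {
            next = index
        } else {
            next = 0
        }
    }

    /// Return the next peer in the rotation, or nil if there are no peers.
    mutating func select() -> Peer? {
        guard !order.isEmpty else { return nil }
        let peer = order[next]
        next = (next + 1) % order.count
        return peer
    }
}

// Usage: every peer is visited exactly once per round, in a predictable order.
var selection = StableRoundRobin<String>()
selection.update(["c", "a", "b"])
let firstRound = [selection.select(), selection.select(), selection.select()]
```

The edge cases (empty peer set, a removed peer being the one that was due next) are exactly where this kind of code tends to go wrong, which is why they are handled explicitly here.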

])

let ack = target.ask(for: GossipACK.self, timeout: .seconds(3)) { replyTo in
// TODO: configurable timeout?
Member

would be nice

Member Author

on it.

This made me realize that if an impl was NOT ack-ing, we'd get ask timeouts unnecessarily. Fixing that by making it configurable and adding a test.

Member Author

This is implemented and tested now https://github.com/apple/swift-distributed-actors/pull/699/files#diff-8396b4fe838cabc279f1a041cc6bd45dR128

gossip settings have: style: .acknowledged(timeout: .seconds(1)),
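
As a rough sketch of how such a setting can drive behavior: only `.acknowledged(timeout:)` is quoted above, so the enum name `SpreadingStyle`, the `.unidirectional` case, the `timeoutSeconds` label and the `ackTimeout(for:)` helper are all hypothetical. The point is that a non-acknowledging implementation never waits for a reply, so it cannot hit an ask timeout, and the exhaustive `switch` forces every style to be handled.

```swift
// Hypothetical mirror of the setting quoted above; not the library's actual type.
enum SpreadingStyle: Equatable {
    /// Fire-and-forget: send gossip and never wait for a reply.
    case unidirectional
    /// Expect an ACK for each gossip within the given number of seconds.
    case acknowledged(timeoutSeconds: Double)
}

/// Returns how long to wait for an ACK, or nil when no ACK is expected
/// (in which case the sender should "tell" rather than "ask").
func ackTimeout(for style: SpreadingStyle) -> Double? {
    switch style {
    case .unidirectional:
        return nil
    case .acknowledged(let timeoutSeconds):
        return timeoutSeconds
    }
}

// Usage
let fireAndForget = ackTimeout(for: .unidirectional)
let awaited = ackTimeout(for: .acknowledged(timeoutSeconds: 1))
```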

ktoso commented Jun 24, 2020

I think we're green 🙀 🟢

Including 40 minutes of repeat runs...

@ktoso ktoso merged commit 2b07f79 into apple:master Jun 24, 2020
@ktoso ktoso deleted the convergence-cluster-improvements branch June 24, 2020 11:54