Skip to content

Conversation

@ktoso
Copy link
Member

@ktoso ktoso commented Sep 29, 2020

immediately initialize membership, without "loop" through shell as this may cause a race condition between a node
extenging a handshake to this node before the "loop through self adding
the myself Member" has a chance to run; This manifested in tests by
rejecting handshakes by "no local member" which is nonsense, there
always is a local known member after all.

resolves #724, an actual bug 😱

// always update the snapshot before emitting events
context.system.cluster.updateMembershipSnapshot(state.membership)
self.clusterEvents.publish(.membershipChange(change))
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fix™

The sending to self to avoid having another spot where we have to write "updateMembershipSnapshot and publish event" was causing a window of opportunity for another message to be handled first -- and that message would then hit us when we had NO members at all in the membership, not even "us".

This would then cause the nonsensical handshake rejection because "there is no local member"

Yet there always is a local member! It's impossible to not know the local member, it is us.

/// If, and only if, the current node is a leader it performs a set of tasks, such as moving nodes to `.up` etc.
func collectLeaderActions() -> [LeaderAction] {
guard self.membership.isLeader(self.localNode) else {
guard self.membership.isLeader(self.selfNode) else {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming consistency

reason: "Node cannot be part of cluster, no member available.",
whenHandshakeReplySent: nil
)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this rejection reason is nonsensical and was triggered in some of the tests sporadically; This is now made impossible for good.

@ktoso ktoso requested a review from yim-lee September 29, 2020 03:50
"loop" through shell as this may cause a race condition between a node
extenging a handshake to this node before the "loop through self adding
the myself Member" has a chance to run; This manifested in tests by
rejecting handshakes by "no local member" which is nonsense, there
always is a local known member after all.
Co-authored-by: Yim Lee <yim_lee@apple.com>
@ktoso
Copy link
Member Author

ktoso commented Sep 29, 2020

Weird failure #779 :/

@ktoso
Copy link
Member Author

ktoso commented Sep 29, 2020

@swift-server-bot test this please

@ktoso ktoso merged commit a620780 into apple:main Oct 5, 2020
@ktoso ktoso deleted the fix-node-rejecting branch October 5, 2020 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FAILED: test_association_shouldStayAliveWhenMessageSerializationThrowsOnSendingSide

2 participants