Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for 2137 and 2138 #2140

Closed
wants to merge 2 commits into from
Closed

Fix for 2137 and 2138 #2140

wants to merge 2 commits into from

Conversation

pawanrawal
Copy link
Contributor

@pawanrawal pawanrawal commented Feb 19, 2018

#2137 was caused because retrieveSnapshot happens in a blocking fashion, and say when it is called the first time we don't have leader information, then because membership update doesn't happen we never get the leader.

#2138 happens because we might receive a request even before a node is initialized and even if we are not the leader. Now Leader() only returns leader for a group, so if we can't find one we will update state and retry.


This change is Reviewable

@manishrjain
Copy link
Contributor

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


worker/draft.go, line 417 at r1 (raw file):

func (n *node) retrieveSnapshot() error {
	pool := groups().Leader(groups().groupId())

Why not do this in an infinite for loop?

for {
  pool := leader
  if pool != nil { break }
  if err := UpdateMembershipState(..) not nil, return err
}

worker/draft.go, line 630 at r1 (raw file):

func (n *node) joinPeers() error {
	// Get leader information for MY group.
	pl := groups().Leader(n.gid)

INF for loop here as well, as above.


Comments from Reviewable

@manishrjain
Copy link
Contributor

:lgtm:


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@pawanrawal
Copy link
Contributor Author

Merged to master with 5563bd2.

@pawanrawal pawanrawal closed this Feb 19, 2018
@pawanrawal pawanrawal deleted the pawan/2138 branch March 21, 2018 01:38
@aphyr
Copy link

aphyr commented Apr 19, 2018

Hey, uh, is there a chance this might still allow crashes to happen? You mentioned that if nodes get requests when they're not the leader, that can cause them to crash--is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet? What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

@pawanrawal
Copy link
Contributor Author

is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet?

Yes, earlier it used to return a node which was not a leader if there was no leader. Now, it returns an error so it would retry after sleeping for a second.

What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

The node receiving the request for snapshot returns an error if it is not a leader, see

if !n.AmLeader() {
.

In JoinCluster, the node receiving the request doesn't check if it is still the leader but that should be ok, as a follower could also propose for a node to joined to the cluster. Did you observe this crash again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants