Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for 2137 and 2138 #2140

Closed
wants to merge 2 commits into from
Closed

Fix for 2137 and 2138 #2140

wants to merge 2 commits into from

Conversation

@pawanrawal
Copy link
Member

@pawanrawal pawanrawal commented Feb 19, 2018

#2137 was caused because retrieveSnapshot happens in a blocking fashion, and say when it is called the first time we don't have leader information, then because membership update doesn't happen we never get the leader.

#2138 happens because we might receive a request even before a node is initialized and even if we are not the leader. Now Leader() only returns leader for a group, so if we can't find one we will update state and retry.


This change is Reviewable

pawanrawal added 2 commits Feb 19, 2018
…int all connState its useless.
@manishrjain
Copy link
Member

@manishrjain manishrjain commented Feb 19, 2018

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


worker/draft.go, line 417 at r1 (raw file):

func (n *node) retrieveSnapshot() error {
	pool := groups().Leader(groups().groupId())

Why not do this in an infinite for loop?

for {
  pool := leader
  if pool != nil { break }
  if err := UpdateMembershipState(..) not nil, return err
}

worker/draft.go, line 630 at r1 (raw file):

func (n *node) joinPeers() error {
	// Get leader information for MY group.
	pl := groups().Leader(n.gid)

INF for loop here as well, as above.


Comments from Reviewable

@manishrjain
Copy link
Member

@manishrjain manishrjain commented Feb 19, 2018

:lgtm:


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@pawanrawal
Copy link
Member Author

@pawanrawal pawanrawal commented Feb 19, 2018

Merged to master with 5563bd2.

@pawanrawal pawanrawal closed this Feb 19, 2018
@pawanrawal pawanrawal deleted the pawan/2138 branch Mar 21, 2018
@aphyr
Copy link

@aphyr aphyr commented Apr 19, 2018

Hey, uh, is there a chance this might still allow crashes to happen? You mentioned that if nodes get requests when they're not the leader, that can cause them to crash--is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet? What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

@pawanrawal
Copy link
Member Author

@pawanrawal pawanrawal commented Apr 20, 2018

is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet?

Yes, earlier it used to return a node which was not a leader if there was no leader. Now, it returns an error so it would retry after sleeping for a second.

What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

The node receiving the request for snapshot returns an error if it is not a leader, see

if !n.AmLeader() {
.

In JoinCluster, the node receiving the request doesn't check if it is still the leader but that should be ok, as a follower could also propose for a node to joined to the cluster. Did you observe this crash again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.