New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for 2137 and 2138 #2140

Closed
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@pawanrawal
Contributor

pawanrawal commented Feb 19, 2018

#2137 was caused because retrieveSnapshot happens in a blocking fashion, and say when it is called the first time we don't have leader information, then because membership update doesn't happen we never get the leader.

#2138 happens because we might receive a request even before a node is initialized and even if we are not the leader. Now Leader() only returns leader for a group, so if we can't find one we will update state and retry.


This change is Reviewable

pawanrawal added some commits Feb 19, 2018

@manishrjain

This comment has been minimized.

Show comment
Hide comment
@manishrjain

manishrjain Feb 19, 2018

Member

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


worker/draft.go, line 417 at r1 (raw file):

func (n *node) retrieveSnapshot() error {
	pool := groups().Leader(groups().groupId())

Why not do this in an infinite for loop?

for {
  pool := leader
  if pool != nil { break }
  if err := UpdateMembershipState(..) not nil, return err
}

worker/draft.go, line 630 at r1 (raw file):

func (n *node) joinPeers() error {
	// Get leader information for MY group.
	pl := groups().Leader(n.gid)

INF for loop here as well, as above.


Comments from Reviewable

Member

manishrjain commented Feb 19, 2018

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


worker/draft.go, line 417 at r1 (raw file):

func (n *node) retrieveSnapshot() error {
	pool := groups().Leader(groups().groupId())

Why not do this in an infinite for loop?

for {
  pool := leader
  if pool != nil { break }
  if err := UpdateMembershipState(..) not nil, return err
}

worker/draft.go, line 630 at r1 (raw file):

func (n *node) joinPeers() error {
	// Get leader information for MY group.
	pl := groups().Leader(n.gid)

INF for loop here as well, as above.


Comments from Reviewable

@manishrjain

This comment has been minimized.

Show comment
Hide comment
@manishrjain

manishrjain Feb 19, 2018

Member

:lgtm:


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

Member

manishrjain commented Feb 19, 2018

:lgtm:


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@pawanrawal

This comment has been minimized.

Show comment
Hide comment
@pawanrawal

pawanrawal Feb 19, 2018

Contributor

Merged to master with 5563bd2.

Contributor

pawanrawal commented Feb 19, 2018

Merged to master with 5563bd2.

@pawanrawal pawanrawal closed this Feb 19, 2018

@pawanrawal pawanrawal deleted the pawan/2138 branch Mar 21, 2018

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Apr 19, 2018

Hey, uh, is there a chance this might still allow crashes to happen? You mentioned that if nodes get requests when they're not the leader, that can cause them to crash--is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet? What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

aphyr commented Apr 19, 2018

Hey, uh, is there a chance this might still allow crashes to happen? You mentioned that if nodes get requests when they're not the leader, that can cause them to crash--is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet? What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

@pawanrawal

This comment has been minimized.

Show comment
Hide comment
@pawanrawal

pawanrawal Apr 20, 2018

Contributor

is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet?

Yes, earlier it used to return a node which was not a leader if there was no leader. Now, it returns an error so it would retry after sleeping for a second.

What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

The node receiving the request for snapshot returns an error if it is not a leader, see

if !n.AmLeader() {
.

In JoinCluster, the node receiving the request doesn't check if it is still the leader but that should be ok, as a follower could also propose for a node to joined to the cluster. Did you observe this crash again?

Contributor

pawanrawal commented Apr 20, 2018

is the change to Leader() here supposed to prevent clients from contacting nodes which aren't leaders yet?

Yes, earlier it used to return a node which was not a leader if there was no leader. Now, it returns an error so it would retry after sleeping for a second.

What happens if a node makes a request to a node it thinks is the leader, but that leader steps down, then receives the request?

The node receiving the request for snapshot returns an error if it is not a leader, see

if !n.AmLeader() {
.

In JoinCluster, the node receiving the request doesn't check if it is still the leader but that should be ok, as a follower could also propose for a node to joined to the cluster. Did you observe this crash again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment