
raft: synchronous raft conf change, safeguard on join/leave raft #131

Merged
merged 5 commits into from
Apr 5, 2016

Conversation

abronan
Contributor

@abronan abronan commented Mar 12, 2016

This is very WIP, don't merge it but I'll leave it around for now.

/cc @aaronlehmann

Signed-off-by: Alexandre Beslic alexandre.beslic@gmail.com

// ID returns the cluster ID
ID() uint64
// Members returns a slice of members sorted by their ID
Members() map[uint64]*Member
Contributor

I don't think a map can be sorted.

Collaborator

I think this is just a confusing description. It should say "indexed by their ID".

Contributor

I think it's better to return a slice here, to encourage users to use it only for iteration and to use GetMember for access by ID. But I'm okay with a map too :)
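The map-vs-slice question above can be sketched as follows. `Cluster`, `Member`, and `GetMember` here are illustrative stand-ins for the PR's types (the real struct wraps `*api.RaftNode` and a grpc client), showing the access pattern the review suggests: iterate over `Members()`, look up by ID with `GetMember`:

```go
package main

// Member is a simplified stand-in for the PR's cluster member type
// (illustrative fields only; the real struct wraps *api.RaftNode).
type Member struct {
	ID   uint64
	Addr string
}

// Cluster keeps members indexed by their raft ID, matching the wording
// suggested in review ("indexed by their ID", not "sorted").
type Cluster struct {
	members map[uint64]*Member
}

// Members returns the member map for iteration; Go maps are unordered.
func (c *Cluster) Members() map[uint64]*Member {
	return c.members
}

// GetMember is the O(1) lookup-by-ID access path suggested in review.
func (c *Cluster) GetMember(id uint64) *Member {
	return c.members[id]
}
```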

@LK4D4
Contributor

LK4D4 commented Mar 18, 2016

This adds some synchronization which will definitely help with #189

@abronan
Contributor Author

abronan commented Mar 24, 2016

Will rework this PR in multiple parts (one for the synchronous config change and better locking on add node, and the rest depending on what we decide for the safe barrier on add and remove nodes).

@aaronlehmann @aluzzardi I think we should disallow removing nodes below a safe quorum number. Even if we can, this is totally not safe from an operational perspective and this should be discouraged, or at least we should add a Huge Warning saying: You removed a Manager and you are now below the safe limit allowed for a raft cluster, please add another node ASAP.

@aaronlehmann
Collaborator

I think we should disallow removing nodes below a safe quorum number. Even if we can, this is totally not safe from an operational perspective and this should be discouraged, or at least we should add a Huge Warning saying: You removed a Manager and you are now below the safe limit allowed for a raft cluster, please add another node ASAP.

I'm okay with the warning, but my instinct is that it should be possible to have less than three managers. We want it to be easy to experiment with swarm on a developer scale, so a single-node setup should definitely be supported. And it's really weird if adding more managers means you can't remove them later. This constraint would make sense for production, but if someone's playing with swarm on a laptop, they should be able to add and remove managers at will.
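The trade-off being discussed can be sketched in a few lines: raft quorum is a simple majority, so the code below warns (rather than refuses) when a removal drops the manager count below three. `checkRemoval`, the threshold of 3, and the message text are assumptions for illustration, not swarmkit's API:

```go
package main

import "fmt"

// quorum is the raft majority for n members: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// checkRemoval returns a warning string (empty if safe) instead of an
// error, matching the discussion: small clusters stay legal, so a
// developer can scale a laptop cluster down to one manager, but the
// operator gets a loud notice.
func checkRemoval(current int) string {
	remaining := current - 1
	if remaining < 3 {
		return fmt.Sprintf(
			"cluster now has %d manager(s); quorum is %d and one more failure may make the cluster unavailable",
			remaining, quorum(remaining))
	}
	return ""
}
```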

@abronan abronan changed the title WIP raft: improved join process and safeguard against illegal add or leave nodes raft: synchronous raft conf change, safeguard on join/leave raft Mar 24, 2016
@docker-codecov-bot

Current coverage is 56.96%

Merging #131 into master will decrease coverage by 0.80% as of 118a202

@@            master    #131   diff @@
======================================
  Files           45      44     -1
  Stmts         5900    5751   -149
  Branches       869     843    -26
  Methods          0       0       
======================================
- Hit           3408    3276   -132
+ Partial        474     462    -12
+ Missed        2018    2013     -5

Review entire Coverage Diff as of 118a202


Uncovered Suggestions

  1. +0.40% via agent/session.go#187...209
  2. +0.31% via agent/agent.go#542...559
  3. +0.31% via .../test/deepcopy.pb.go#1492...1509
  4. See 7 more...

Powered by Codecov. Updated on successful CI builds.

@abronan
Contributor Author

abronan commented Mar 24, 2016

@aaronlehmann @LK4D4 I rebased, only adding the necessary synchronous conf change and the lock on Join/Leave. I think this will get rid of the race on Join in #189, which I could reproduce locally but no longer can with this change.

@LK4D4
Contributor

LK4D4 commented Mar 24, 2016

LGTM

*api.RaftNode
started bool
Contributor Author

This is residue from the rebase and the hard checks; I'll remove it.

@aaronlehmann
Collaborator

@abronan: The summary says don't merge; is that out of date?

@abronan
Contributor Author

abronan commented Mar 24, 2016

Let me rebase, one PR was merged in the meantime :)

// add them itself to its local list: grpc
// call add from the node sending the conf
// change
// TODO(aaronl): send back store snapshot after join?
Collaborator

Huh?

Contributor Author

This was an old comment but maybe outdated with your new PR, I can remove it.

Collaborator

Please do.

@aaronlehmann
Collaborator

What happens if the admin removes a node from the cluster and tries to re-add it later? It looks like that wouldn't work with the code here, because the node ID can never be deleted from removed.

@abronan
Contributor Author

abronan commented Mar 24, 2016

@aaronlehmann Yes, a node that leaves gracefully can never be added back to the cluster with the same ID, but even now, without this PR, we don't handle that case. This should be done server side, to make sure that a node that wants to join again gets its ID assigned by the node submitting the conf change or by the leader.

Anyway, seems like the tests timed out. Not sure if the timeout is from raft or if the tests were really slow 😕
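The rejoin limitation discussed above boils down to a tombstone set that is never cleared. A minimal sketch (names are illustrative, not the PR's exact API):

```go
package main

import "errors"

// ErrIDRemoved mirrors the behavior discussed above: a tombstoned ID can
// never rejoin with the same ID.
var ErrIDRemoved = errors.New("member was removed and cannot rejoin with the same ID")

type Cluster struct {
	members map[uint64]bool
	removed map[uint64]bool // tombstones; never cleared in this PR
}

func (c *Cluster) Remove(id uint64) {
	delete(c.members, id)
	c.removed[id] = true
}

// Join refuses tombstoned IDs; the fix sketched in the thread would have
// the leader assign a fresh ID to a returning node instead.
func (c *Cluster) Join(id uint64) error {
	if c.removed[id] {
		return ErrIDRemoved
	}
	c.members[id] = true
	return nil
}
```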

@LK4D4
Contributor

LK4D4 commented Mar 30, 2016

ping @abronan
would you mind rebasing?

// The leader steps down
if n.Config.ID == n.Leader() && n.Config.ID == conf.NodeID {
if n.Config.ID == n.Leader() && n.Config.ID == cc.NodeID {
n.Stop()
Collaborator

Is calling n.Stop really the right thing to do?

Contributor Author

Turns out the call is unnecessary, will remove it.

@aaronlehmann
Collaborator

I have some concerns about the locking strategy. It seems that only Join and Leave hold n.lock. This means they can't race with each other, but they can race with processConfChange.

I recommend that we move the cluster member list locking out of the cluster struct, into Node, and require that this lock always be held while querying the member list or making changes to it.
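The recommendation above can be sketched as follows: one mutex on `Node`, held by Join, Leave, and processConfChange alike, so a conf change being applied cannot race with a membership RPC. Names and signatures are illustrative, not the PR's exact API:

```go
package main

import "sync"

// Node sketches the proposed locking strategy: the member-list mutex
// lives on Node (not on the cluster struct) and every code path that
// reads or mutates the member list takes it.
type Node struct {
	mu      sync.Mutex
	members map[uint64]string // raft ID -> address
}

func (n *Node) Join(id uint64, addr string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.members[id] = addr
}

func (n *Node) Leave(id uint64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.members, id)
}

// processConfChange takes the same lock as Join/Leave, closing the race
// the review points out.
func (n *Node) processConfChange(id uint64, addr string, remove bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if remove {
		delete(n.members, id)
	} else {
		n.members[id] = addr
	}
}
```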

c.lock.Unlock()
defer c.lock.Unlock()
c.removed[id] = true
delete(c.members, id)
Collaborator

We should call Close on the grpc connection.
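A sketch of the suggested fix: close the connection before deleting the member. `Close` is kept as a func field standing in for `(*grpc.ClientConn).Close` so the example stays dependency-free; `RemoveMember` and its signature are hypothetical:

```go
package main

// Member pairs a cluster entry with its connection teardown; Close is a
// stand-in for (*grpc.ClientConn).Close on the member's client.
type Member struct {
	ID    uint64
	Close func() error
}

// RemoveMember applies the review suggestion: close the grpc connection
// first, then tombstone and drop the member, so removal does not leak
// the connection.
func RemoveMember(members map[uint64]*Member, removed map[uint64]bool, id uint64) error {
	if m, ok := members[id]; ok && m.Close != nil {
		if err := m.Close(); err != nil {
			return err
		}
	}
	removed[id] = true
	delete(members, id)
	return nil
}
```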

client, err := GetRaftClient(node.Addr, 0)
if err == nil {
n.Cluster.AddPeer(&Peer{RaftNode: node, Client: client})
err = n.cluster.AddMember(&Member{RaftNode: node, Client: client})
Collaborator

We need to be careful not to overwrite an existing member, because this will leak the grpc connection.

Contributor Author

This shouldn't happen, because Register is not called if the ID already exists in the cluster (ErrIDExists is returned to the caller instead of calling register).

Collaborator

I think you're right.

@aaronlehmann
Collaborator

Man, I spent several hours trying to understand the TestRaftRejoin failure. In the process I learned some interesting things about the raft library, and tried totally changing our approach to creating/joining clusters, but the approach I tried turned out not to work.

Anyway, I finally found the problem, and it turns out to be very simple. The test is creating a new listener on the wildcard port for the restarted node, but it's not okay for the node to change its address. Before, this wasn't caught by the test because it wasn't doing anything that required consensus after restarting the node. Now, the joins are funneled through raft, so that was getting stuck.

It also takes a bit longer for the state to converge after the restart.

The following diff fixes the test:

diff --git a/manager/state/raft_test.go b/manager/state/raft_test.go
index ccaf1ca..9959c2d 100644
--- a/manager/state/raft_test.go
+++ b/manager/state/raft_test.go
@@ -193,7 +193,7 @@ func newJoinNode(t *testing.T, join string, opts ...NewNodeOptions) *Node {
 }

 func restartNode(t *testing.T, n *Node, join string) *Node {
-       l, err := net.Listen("tcp", "127.0.0.1:0")
+       l, err := net.Listen("tcp", n.Address)
        require.NoError(t, err, "can't bind to raft service port")
        s := grpc.NewServer()

@@ -709,7 +709,7 @@ func TestRaftRejoin(t *testing.T) {
        values[1], err = proposeValue(t, nodes[1], "id2")
        assert.NoError(t, err, "failed to propose value")

-       time.Sleep(500 * time.Millisecond)
+       time.Sleep(2 * time.Second)

        // Nodes 1 and 2 should have the new value
        checkValues := func(checkNodes ...*Node) {

However, I'm afraid the change to the Listen call could make the test flaky. There's no guarantee that the same port will still be available. Ideally we would keep the same listener, but I haven't found a way to shut down a GRPC server without closing the listener. I guess since net.Listener is an interface, we can create a wrapper that stubs out Close so GRPC isn't really closing it.

@aaronlehmann
Collaborator

Recording a few notes to myself (not for fixing in this PR, but in followups):

  • Snapshot must store ConfState, the member list, and list of removed nodes.
  • Need to make it possible to rejoin a cluster without a successful Join RPC (fall back to saved member list). This is the scenario where all nodes went down (power failure, for example).
  • There is a potentially serious problem with the way we're initializing raft nodes. We always pass in a single-node cluster containing only the node that's being initialized. This makes the first index in the log ambiguous, since index 1 holds a configuration change for whichever node happened to start that log. Newly joining nodes never see a configuration change adding the first node in the cluster; they already have an index 1 that adds themselves. I think nodes other than the original leader may not know the correct quorum for the cluster, which is a big problem. I'm not 100% sure about the right way to fix this, but maybe we should only pass in an initial cluster when we're initializing a standalone node. A node that's joining a cluster should pass in an empty Peers slice, so it starts with an empty log that can be filled in by replication.
  • The issue above may have an impact on how we store member lists for recovery. If we can't fix it so that we can rely on the WAL and snapshots to get a complete list of cluster members, we may have to store this list out of band.
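The bootstrap-vs-join distinction from the third note can be sketched as a small helper; `Peer`, `initialPeers`, and the signature are illustrative, mirroring the shape of the raft library's initial-peer list rather than swarmkit's actual code:

```go
package main

// Peer mirrors the shape of the raft library's initial-peer entry.
type Peer struct{ ID uint64 }

// initialPeers sketches the proposed fix: only a node bootstrapping a
// brand-new standalone cluster seeds the log with itself; a joining node
// passes an empty peer list so its log (including the conf change that
// added the first node) is filled purely by replication.
func initialPeers(selfID uint64, joinAddr string) []Peer {
	if joinAddr == "" {
		return []Peer{{ID: selfID}} // standalone bootstrap
	}
	return nil // joining an existing cluster: empty log
}
```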

@aaronlehmann
Collaborator

@abronan: Here's a proper fix for the test that wraps net.Listener: https://github.com/aaronlehmann/swarm-v2/commit/0e659d489301ce0c5f3b3b6eb31428b87dfd8fc0

@abronan
Contributor Author

abronan commented Apr 4, 2016

Thanks @aaronlehmann, will take care of the comments and include the fix. I guess we should open issues for your notes, as we should keep track of those somewhere and make sure we address it.

Agreed that the subsequent Join calls are useless, because the only thing I do there is bypass the ConfChange. But this will require changing the join logic a bit.

abronan and others added 3 commits April 4, 2016 11:21
Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
…a member

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
type Peer struct {
// Member represents a raft cluster member
type Member struct {
l sync.RWMutex
Collaborator

mu :)

Collaborator

Where is this used?

Contributor Author

It's a remnant of the member count check (to not remove a node if doing so would result in a number of nodes that might be unsafe and could lose quorum) that was part of the original PR; I should remove it.

…n a connection to self

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
…efore checkValues

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
@aaronlehmann
Collaborator

LGTM

@aaronlehmann aaronlehmann merged commit fbebe48 into moby:master Apr 5, 2016
5 participants