
raft: synchronous raft conf change, safeguard on join/leave raft #131

Merged
merged 5 commits into from
Apr 5, 2016

Conversation

abronan
Contributor

@abronan abronan commented Mar 12, 2016

This is very WIP, don't merge it but I'll leave it around for now.

/cc @aaronlehmann

Signed-off-by: Alexandre Beslic alexandre.beslic@gmail.com

// ID returns the cluster ID
ID() uint64
// Members returns a slice of members sorted by their ID
Members() map[uint64]*Member
Contributor

I don't think a map can be sorted.

Collaborator

I think this is just a confusing description. It should say "indexed by their ID".

Contributor

I think it's better to return a slice here, to encourage users to use it only for iteration and to use GetMember for access by ID. But I'm okay with a map too :)
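The map-vs-slice question above can be sketched as follows. `Cluster`, `Member`, and `GetMember` here are illustrative stand-ins for the PR's types (the real struct wraps `*api.RaftNode` and a grpc client), showing the access pattern the review suggests: iterate over `Members()`, look up by ID with `GetMember`:

```go
package main

// Member is a simplified stand-in for the PR's cluster member type
// (illustrative fields only; the real struct wraps *api.RaftNode).
type Member struct {
	ID   uint64
	Addr string
}

// Cluster keeps members indexed by their raft ID, matching the wording
// suggested in review ("indexed by their ID", not "sorted").
type Cluster struct {
	members map[uint64]*Member
}

// Members returns the member map for iteration; Go maps are unordered.
func (c *Cluster) Members() map[uint64]*Member {
	return c.members
}

// GetMember is the O(1) lookup-by-ID access path suggested in review.
func (c *Cluster) GetMember(id uint64) *Member {
	return c.members[id]
}
```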

@LK4D4
Contributor

LK4D4 commented Mar 18, 2016

This adds some synchronization which will definitely help with #189

@abronan
Contributor Author

abronan commented Mar 24, 2016

Will rework this PR in multiple parts (one for the synchronous config change and better locking on add node, and the rest depending on what we decide for the safe barrier on add and remove nodes).

@aaronlehmann @aluzzardi I think we should disallow removing nodes below a safe quorum number. Even if we can, this is totally not safe from an operational perspective and this should be discouraged, or at least we should add a Huge Warning saying: You removed a Manager and you are now below the safe limit allowed for a raft cluster, please add another node ASAP.

@aaronlehmann
Collaborator

I think we should disallow removing nodes below a safe quorum number. Even if we can, this is totally not safe from an operational perspective and this should be discouraged, or at least we should add a Huge Warning saying: You removed a Manager and you are now below the safe limit allowed for a raft cluster, please add another node ASAP.

I'm okay with the warning, but my instinct is that it should be possible to have less than three managers. We want it to be easy to experiment with swarm on a developer scale, so a single-node setup should definitely be supported. And it's really weird if adding more managers means you can't remove them later. This constraint would make sense for production, but if someone's playing with swarm on a laptop, they should be able to add and remove managers at will.
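The trade-off being discussed can be sketched in a few lines: raft quorum is a simple majority, so the code below warns (rather than refuses) when a removal drops the manager count below three. `checkRemoval`, the threshold of 3, and the message text are assumptions for illustration, not swarmkit's API:

```go
package main

import "fmt"

// quorum is the raft majority for n members: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// checkRemoval returns a warning string (empty if safe) instead of an
// error, matching the discussion: small clusters stay legal, so a
// developer can scale a laptop cluster down to one manager, but the
// operator gets a loud notice.
func checkRemoval(current int) string {
	remaining := current - 1
	if remaining < 3 {
		return fmt.Sprintf(
			"cluster now has %d manager(s); quorum is %d and one more failure may make the cluster unavailable",
			remaining, quorum(remaining))
	}
	return ""
}
```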

@abronan abronan changed the title WIP raft: improved join process and safeguard against illegal add or leave nodes raft: synchronous raft conf change, safeguard on join/leave raft Mar 24, 2016
@docker-codecov-bot

Current coverage is 56.96%

Merging #131 into master will decrease coverage by 0.80% as of 118a202

@@            master    #131   diff @@
======================================
  Files           45      44     -1
  Stmts         5900    5751   -149
  Branches       869     843    -26
  Methods          0       0       
======================================
- Hit           3408    3276   -132
+ Partial        474     462    -12
+ Missed        2018    2013     -5

Review entire Coverage Diff as of 118a202


Uncovered Suggestions

  1. +0.40% via agent/session.go#187...209
  2. +0.31% via agent/agent.go#542...559
  3. +0.31% via .../test/deepcopy.pb.go#1492...1509
  4. See 7 more...

Powered by Codecov. Updated on successful CI builds.

@abronan
Contributor Author

abronan commented Mar 24, 2016

@aaronlehmann @LK4D4 I rebased, only adding the necessary synchronous conf change and the lock on Join/Leave. I think this will get rid of the race on Join in #189, which I could reproduce locally but no longer can with this change.

@LK4D4
Contributor

LK4D4 commented Mar 24, 2016

LGTM

*api.RaftNode
started bool
Contributor Author

This is residue from the rebase and the hard checks; I'll remove it.

@aaronlehmann
Collaborator

@abronan: The summary says don't merge; is that out of date?

@abronan
Contributor Author

abronan commented Mar 24, 2016

Let me rebase, one PR was merged in the meantime :)

// add them itself to its local list: grpc
// call add from the node sending the conf
// change
// TODO(aaronl): send back store snapshot after join?
Collaborator

Huh?

Contributor Author

This was an old comment but maybe outdated with your new PR, I can remove it.

Collaborator

Please do.

@aaronlehmann
Collaborator

What happens if the admin removes a node from the cluster and tries to re-add it later? It looks like that wouldn't work with the code here, because the node ID can never be deleted from removed.

@abronan
Contributor Author

abronan commented Mar 24, 2016

@aaronlehmann Yes, a node that leaves gracefully can never be added back to the cluster with the same ID, but even now, without this PR, we don't handle that case. This should be done server side, to make sure that a node that wants to join again gets its ID assigned by the node submitting the conf change or by the leader.

Anyway, seems like the tests timed out. Not sure if the timeout is from raft or if the tests were really slow 😕
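The rejoin limitation discussed above boils down to a tombstone set that is never cleared. A minimal sketch (names are illustrative, not the PR's exact API):

```go
package main

import "errors"

// ErrIDRemoved mirrors the behavior discussed above: a tombstoned ID can
// never rejoin with the same ID.
var ErrIDRemoved = errors.New("member was removed and cannot rejoin with the same ID")

type Cluster struct {
	members map[uint64]bool
	removed map[uint64]bool // tombstones; never cleared in this PR
}

func (c *Cluster) Remove(id uint64) {
	delete(c.members, id)
	c.removed[id] = true
}

// Join refuses tombstoned IDs; the fix sketched in the thread would have
// the leader assign a fresh ID to a returning node instead.
func (c *Cluster) Join(id uint64) error {
	if c.removed[id] {
		return ErrIDRemoved
	}
	c.members[id] = true
	return nil
}
```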

@LK4D4
Contributor

LK4D4 commented Mar 30, 2016

ping @abronan
would you mind rebasing?

// The leader steps down
if n.Config.ID == n.Leader() && n.Config.ID == conf.NodeID {
if n.Config.ID == n.Leader() && n.Config.ID == cc.NodeID {
n.Stop()
Collaborator

Is calling n.Stop really the right thing to do?

Contributor Author

Turns out the call is unnecessary, will remove it.

@aaronlehmann
Collaborator

I have some concerns about the locking strategy. It seems that only Join and Leave hold n.lock. This means they can't race with each other, but they can race with processConfChange.

I recommend that we move the cluster member list locking out of the cluster struct, into Node, and require that this lock always be held while querying the member list or making changes to it.
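The recommendation above can be sketched as follows: one mutex on `Node`, held by Join, Leave, and processConfChange alike, so a conf change being applied cannot race with a membership RPC. Names and signatures are illustrative, not the PR's exact API:

```go
package main

import "sync"

// Node sketches the proposed locking strategy: the member-list mutex
// lives on Node (not on the cluster struct) and every code path that
// reads or mutates the member list takes it.
type Node struct {
	mu      sync.Mutex
	members map[uint64]string // raft ID -> address
}

func (n *Node) Join(id uint64, addr string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.members[id] = addr
}

func (n *Node) Leave(id uint64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.members, id)
}

// processConfChange takes the same lock as Join/Leave, closing the race
// the review points out.
func (n *Node) processConfChange(id uint64, addr string, remove bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if remove {
		delete(n.members, id)
	} else {
		n.members[id] = addr
	}
}
```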

c.lock.Unlock()
defer c.lock.Unlock()
c.removed[id] = true
delete(c.members, id)
Collaborator

We should call Close on the grpc connection.
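A sketch of the suggested fix: close the connection before deleting the member. `Close` is kept as a func field standing in for `(*grpc.ClientConn).Close` so the example stays dependency-free; `RemoveMember` and its signature are hypothetical:

```go
package main

// Member pairs a cluster entry with its connection teardown; Close is a
// stand-in for (*grpc.ClientConn).Close on the member's client.
type Member struct {
	ID    uint64
	Close func() error
}

// RemoveMember applies the review suggestion: close the grpc connection
// first, then tombstone and drop the member, so removal does not leak
// the connection.
func RemoveMember(members map[uint64]*Member, removed map[uint64]bool, id uint64) error {
	if m, ok := members[id]; ok && m.Close != nil {
		if err := m.Close(); err != nil {
			return err
		}
	}
	removed[id] = true
	delete(members, id)
	return nil
}
```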

client, err := GetRaftClient(node.Addr, 0)
if err == nil {
n.Cluster.AddPeer(&Peer{RaftNode: node, Client: client})
err = n.cluster.AddMember(&Member{RaftNode: node, Client: client})
Collaborator

We need to be careful not to overwrite an existing member, because this will leak the grpc connection.

Contributor Author

This shouldn't happen, because Register is not called if the ID already exists in the cluster (ErrIDExists is returned to the caller instead of calling register).

Collaborator

I think you're right.

@aaronlehmann
Collaborator

Man, I spent several hours trying to understand the TestRaftRejoin failure. In the process I learned some interesting things about the raft library, and tried totally changing our approach to creating/joining clusters, but the approach I tried turned out not to work.

Anyway, I finally found the problem, and it turns out to be very simple. The test is creating a new listener on the wildcard port for the restarted node, but it's not okay for the node to change its address. Before, this wasn't caught by the test because it wasn't doing anything that required consensus after restarting the node. Now, the joins are funneled through raft, so that was getting stuck.

It also takes a bit longer for the state to converge after the restart.

The following diff fixes the test:

diff --git a/manager/state/raft_test.go b/manager/state/raft_test.go
index ccaf1ca..9959c2d 100644
--- a/manager/state/raft_test.go
+++ b/manager/state/raft_test.go
@@ -193,7 +193,7 @@ func newJoinNode(t *testing.T, join string, opts ...NewNodeOptions) *Node {
 }

 func restartNode(t *testing.T, n *Node, join string) *Node {
-       l, err := net.Listen("tcp", "127.0.0.1:0")
+       l, err := net.Listen("tcp", n.Address)
        require.NoError(t, err, "can't bind to raft service port")
        s := grpc.NewServer()

@@ -709,7 +709,7 @@ func TestRaftRejoin(t *testing.T) {
        values[1], err = proposeValue(t, nodes[1], "id2")
        assert.NoError(t, err, "failed to propose value")

-       time.Sleep(500 * time.Millisecond)
+       time.Sleep(2 * time.Second)

        // Nodes 1 and 2 should have the new value
        checkValues := func(checkNodes ...*Node) {

However, I'm afraid the change to the Listen call could make the test flaky. There's no guarantee that the same port will still be available. Ideally we would keep the same listener, but I haven't found a way to shut down a GRPC server without closing the listener. I guess since net.Listener is an interface, we can create a wrapper that stubs out Close so GRPC isn't really closing it.

@aaronlehmann
Collaborator

Recording a few notes to myself (not for fixing in this PR, but in followups):

  • Snapshot must store ConfState, the member list, and list of removed nodes.
  • Need to make it possible to rejoin a cluster without a successful Join RPC (fall back to saved member list). This is the scenario where all nodes went down (power failure, for example).
  • There is a potentially serious problem with the way we're initializing raft nodes. We always pass in a single-node cluster containing only the node that's being initialized. This makes the first index in the log ambiguous, since index 1 holds a configuration change for whichever node happened to start that log. Newly joining nodes never see a configuration change adding the first node in the cluster; they already have an index 1 that adds themselves. I think nodes other than the original leader may not know the correct quorum for the cluster, which is a big problem. I'm not 100% sure about the right way to fix this, but maybe we should only pass in an initial cluster when we're initializing a standalone node. A node that's joining a cluster should pass in an empty Peers slice, so it starts with an empty log that can be filled in by replication.
  • The issue above may have an impact on how we store member lists for recovery. If we can't fix it so that we can rely on the WAL and snapshots to get a complete list of cluster members, we may have to store this list out of band.
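The bootstrap-vs-join distinction from the third note can be sketched as a small helper; `Peer`, `initialPeers`, and the signature are illustrative, mirroring the shape of the raft library's initial-peer list rather than swarmkit's actual code:

```go
package main

// Peer mirrors the shape of the raft library's initial-peer entry.
type Peer struct{ ID uint64 }

// initialPeers sketches the proposed fix: only a node bootstrapping a
// brand-new standalone cluster seeds the log with itself; a joining node
// passes an empty peer list so its log (including the conf change that
// added the first node) is filled purely by replication.
func initialPeers(selfID uint64, joinAddr string) []Peer {
	if joinAddr == "" {
		return []Peer{{ID: selfID}} // standalone bootstrap
	}
	return nil // joining an existing cluster: empty log
}
```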

@aaronlehmann
Collaborator

@abronan: Here's a proper fix for the test that wraps net.Listener: https://github.com/aaronlehmann/swarm-v2/commit/0e659d489301ce0c5f3b3b6eb31428b87dfd8fc0

@abronan
Contributor Author

abronan commented Apr 4, 2016

Thanks @aaronlehmann, will take care of the comments and include the fix. I guess we should open issues for your notes, as we should keep track of those somewhere and make sure we address it.

Agreed that the subsequent Join calls are useless, because the only thing I do there is bypass the ConfChange. But this will require changing the join logic a bit.

abronan and others added 3 commits April 4, 2016 11:21
Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
…a member

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
type Peer struct {
// Member represents a raft cluster member
type Member struct {
l sync.RWMutex
Collaborator

mu :)

Collaborator

Where is this used?

Contributor Author

It's a remnant of the member count check (to not remove a node if doing so would result in a number of nodes that might be unsafe and could lose quorum) that was part of the original PR; I should remove it.

…n a connection to self

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
…efore checkValues

Signed-off-by: Alexandre Beslic <alexandre.beslic@gmail.com>
@aaronlehmann
Collaborator

LGTM

@aaronlehmann aaronlehmann merged commit fbebe48 into moby:master Apr 5, 2016
5 participants