
raft: Use TransferLeadership to make leader demotion safer #1939

Merged

LK4D4 merged 1 commit into docker:master on Feb 17, 2017

Conversation

@aaronlehmann
Collaborator

aaronlehmann commented Feb 8, 2017

When we demote the leader, we currently wait for all queued messages to be sent, as a best-effort approach to making sure the other nodes find out that the node removal has been committed, and stop treating the current leader as a cluster member. This doesn't work perfectly.

To make this more robust, use TransferLeadership when the leader is trying to remove itself. The new leader's reconciliation loop will kick in and remove the old leader.

cc @LK4D4 @cyli
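
[Editor's note: for readers unfamiliar with the raft API, here is a minimal sketch of driving a leadership transfer through etcd/raft. TransferLeadership, Status, and raft.None are real etcd/raft API; the function itself, the poll interval, and the logging are assumptions for illustration, not swarmkit's actual code.]

package raftutil

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/raft"
)

// transferAndWait asks etcd/raft to hand leadership to transferee, then
// polls Status until some other node is leader or ctx expires. This is an
// illustrative sketch, not swarmkit's implementation.
func transferAndWait(ctx context.Context, n raft.Node, self, transferee uint64) error {
	start := time.Now()
	// TransferLeadership is fire-and-forget; progress shows up in Status.
	n.TransferLeadership(ctx, self, transferee)
	tick := time.NewTicker(50 * time.Millisecond)
	defer tick.Stop()
	for {
		if lead := n.Status().Lead; lead != self && lead != raft.None {
			break // a new leader has been established
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // transfer did not finish in time
		case <-tick.C:
		}
	}
	log.Printf("leadership transferred in %v", time.Since(start))
	return nil
}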

@codecov-io

codecov-io commented Feb 8, 2017

Codecov Report

Merging #1939 into master will decrease coverage by 0.31%.
The diff coverage is 20.37%.

@@            Coverage Diff             @@
##           master    #1939      +/-   ##
==========================================
- Coverage   54.29%   53.99%   -0.31%     
==========================================
  Files         108      108              
  Lines       18586    18588       +2     
==========================================
- Hits        10092    10036      -56     
- Misses       7257     7324      +67     
+ Partials     1237     1228       -9

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 569defc...5470e07. Read the comment docs.

@aaronlehmann
Collaborator

aaronlehmann commented Feb 9, 2017

Note to self: need to confirm this won't cause any problems in mixed-version swarms. I don't think it will, but it's important to be sure.

[6 outdated review threads on manager/state/raft/raft.go]
@aaronlehmann
Collaborator

aaronlehmann commented Feb 9, 2017

Addressed comments, PTAL

@LK4D4
Contributor

LK4D4 commented Feb 9, 2017

@aaronlehmann I've tried TestDemoteToSingleManager, and not a single transfer finished successfully:

ERRO[0001] failed to leave raft cluster gracefully       error="failed to transfer leadership: context canceled"
[outdated review thread on manager/state/raft/raft.go]
@cyli
Contributor

cyli commented Feb 9, 2017

TestDemoteToSingleManager passed 10 times in a row for me when running that test alone. I am often seeing this when running the integration test suite with -race, though (although my machine may be haunted again and it may be time to restart):

--- FAIL: TestDemotePromote (75.30s)
        Error Trace:    integration_test.go:254
                        integration_test.go:278
        Error:          Received unexpected error error stop worker z2dj61jsjefe8dbunf0twdlo1: context deadline exceeded
@LK4D4
Contributor

LK4D4 commented Feb 9, 2017

Tests are passing okay, but the leadership transfer doesn't work.

@cyli
Contributor

cyli commented Feb 10, 2017

@LK4D4 Ah sorry, yes, thanks for clarifying. I'm seeing the same.

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

Thanks for pointing out that TransferLeadership was not actually succeeding in the tests.

The first problem here was that the context would be cancelled when the old leader lost the leadership. This caused the function to return an error.

I fixed this, but then it turned out that the node would get removed twice: once by the call to Leave, and again by the new leader's reconciliation loop kicking in and trying to remove the same node.

So I'm trying a new, simpler approach. RemoveNode no longer supports removing the local node. If this is attempted, an error is returned, and the node role reconciliation loop can invoke TransferLeadership instead. Once there is a new leader, that leader can remove the old leader without incident.

PTAL
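
[Editor's note: a minimal, self-contained sketch of the shape of that flow. errCannotRemoveSelf, the raftCluster interface, and both functions are illustrative names, not swarmkit's actual types.]

package raftutil

import (
	"context"
	"errors"
)

// Illustrative shapes only; swarmkit's real types differ.
var errCannotRemoveSelf = errors.New("cannot remove the local raft node")

type raftCluster interface {
	LocalID() uint64
	RemoveMember(ctx context.Context, id uint64) error
	TransferLeadership(ctx context.Context) error
}

// removeNode mirrors the new behavior: it refuses to remove the local node.
func removeNode(ctx context.Context, c raftCluster, id uint64) error {
	if id == c.LocalID() {
		return errCannotRemoveSelf
	}
	return c.RemoveMember(ctx, id)
}

// reconcileDemotion shows how a role reconciliation loop might react: when
// the node to demote is the leader itself, hand off leadership and let the
// new leader perform the removal.
func reconcileDemotion(ctx context.Context, c raftCluster, id uint64) error {
	err := removeNode(ctx, c, id)
	if err == errCannotRemoveSelf {
		return c.TransferLeadership(ctx)
	}
	return err
}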

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

...of course the tests consistently pass on my machine :/

[outdated review threads on manager/state/raft/raft.go and manager/role_manager.go]
@LK4D4
Contributor

LK4D4 commented Feb 10, 2017

I have minor comments.
Looks good overall.

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

I've made the cosmetic changes. I'm happy to add back the "transferred leadership" log message if you think it makes sense.

I had some test failures in CI with an earlier version of this, but I can't reproduce them locally. Can you try running the tests a few times to see if they are stable?

@LK4D4
Contributor

LK4D4 commented Feb 10, 2017

@aaronlehmann Sure, I'll run the tests now.
Yeah, I liked the transfer log with timings; it's kinda cool to see how fast raft is :)

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

Added back the transfer timing.

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

Adding back the n.asyncTasks.Wait() in applyRemoveNode seems to have stabilized CI. I'm not sure why, because I expected it to only be necessary when the leader removes itself. Otherwise, this case should only be reached when a node finds out its removal has been committed, and at that point it shouldn't need to communicate anymore.
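
[Editor's note: the pattern in question, sketched with assumed names. Swarmkit tracks in-flight outbound raft messages in a WaitGroup, but the sender type and method names here are illustrative, not its exact code.]

package raftutil

import "sync"

// sender tracks each asynchronous message send so that shutdown can wait
// for queued messages to drain; a sketch of the idea, not swarmkit's code.
type sender struct {
	asyncTasks sync.WaitGroup
}

func (s *sender) sendAsync(deliver func()) {
	s.asyncTasks.Add(1)
	go func() {
		defer s.asyncTasks.Done()
		deliver() // deliver one raft message
	}()
}

// shutdown flushes in-flight sends first, so a committed removal has a
// chance to reach the other peers before the transport stops.
func (s *sender) shutdown() {
	s.asyncTasks.Wait()
}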

@LK4D4
Contributor

LK4D4 commented Feb 10, 2017

Tests are quite stable for me, no leaks, no races.

@LK4D4

LK4D4 approved these changes Feb 10, 2017

LGTM

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

I'm rebuilding it a few more times in CI.

@aaronlehmann
Collaborator

aaronlehmann commented Feb 10, 2017

I can't get this to fail in CI anymore. Either the initial failures were a fluke, or adding back n.asyncTasks.Wait() made a difference for reasons I don't understand (maybe a timing issue?).

[outdated review threads on manager/role_manager.go and manager/state/raft/raft_test.go]
@cyli
Contributor

cyli commented Feb 10, 2017

Nice, I'm not getting intermittent context-deadline-exceeded errors in other integration tests anymore. I just have a few questions above, but other than that LGTM.

raft: Use TransferLeadership to make leader demotion safer
When we demote the leader, we currently wait for all queued messages to
be sent, as a best-effort approach to making sure the other nodes find
out that the node removal has been committed, and stop treating the
current leader as a cluster member. This doesn't work perfectly.

To make this more robust, use TransferLeadership when the leader is
trying to remove itself. The new leader's reconciliation loop will kick
in and remove the old leader.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Collaborator

aaronlehmann commented Feb 15, 2017

I've updated this to work slightly differently. If the leadership transfer times out, we fall back to self-demoting the old way. I've tested this with a 1.14-dev leader in a cluster with a 1.12.6 follower, and demoting the leader works properly (after the timeout elapses).

PTAL. Hoping to merge this soon if it looks good.
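
[Editor's note: a minimal sketch of that fallback. The function shape, the helper parameters, and the 10-second timeout are assumptions for illustration; only the control flow reflects the change described above.]

package raftutil

import (
	"context"
	"time"
)

// demoteSelf tries to hand off leadership first; if no follower takes over
// in time (e.g. an older 1.12.x node that never assumes leadership), it
// falls back to the old best-effort self-removal. Names are illustrative.
func demoteSelf(ctx context.Context,
	transferLeadership func(context.Context) error,
	demoteSelfLegacy func(context.Context) error) error {

	tctx, cancel := context.WithTimeout(ctx, 10*time.Second) // assumed timeout
	defer cancel()
	if err := transferLeadership(tctx); err != nil {
		// Transfer timed out or failed; use the pre-existing path.
		return demoteSelfLegacy(ctx)
	}
	// The new leader's reconciliation loop will remove this node.
	return nil
}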

@cyli
Contributor

cyli commented Feb 17, 2017

LGTM! Thank you for tracking down a different unrelated error!

LK4D4 merged commit a82deb6 into docker:master on Feb 17, 2017

3 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/project: 53.99% (target 0%)
dco-signed: All commits are signed