New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve node management. #1569

Merged
merged 6 commits into from Jan 8, 2016

Conversation

Projects
None yet
4 participants
@dongluochen
Contributor

dongluochen commented Dec 30, 2015

This change improves swarm node management according to proposal #1486.

  1. Introduce pending state. Pending nodes need validation before moving to healthy state. Resolve issues of duplicate ID and no join retry for unreachable node issues.
  2. Expose error and last update time in docker info.
  3. Use connect success/failure to drive state transition between healthy and unhealthy.

Signed-off-by: Dong Chen dongluo.chen@docker.com

Improve node management.
1. Introduce pending state. Pending nodes need validation before moving to healthy state. Resolve issues of duplicate ID and dead node drop issues.
2. Expose error and last update time in docker info.
3. Use connect success/failure to drive state transition between healthy and unhealthy.

Signed-off-by: Dong Chen <dongluo.chen@docker.com>
@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Jan 4, 2016

Contributor

Design LGTM. Will review the code and test shortly.

ping @vieux for design approval.

@dongluochen This might need a set of integration tests (or improve the existing ones) to validate that the state machine works and that a node transitions correctly from a state to another. At least for Pending to Healthy and from Healthy to Unhealthy.

Contributor

abronan commented Jan 4, 2016

Design LGTM. Will review the code and test shortly.

ping @vieux for design approval.

@dongluochen This might need a set of integration tests (or improve the existing ones) to validate that the state machine works and that a node transitions correctly from a state to another. At least for Pending to Healthy and from Healthy to Unhealthy.

@dongluochen

This comment has been minimized.

Show comment
Hide comment
@dongluochen

dongluochen Jan 4, 2016

Contributor

@abronan thanks Alex! I'll update test cases soon.

Contributor

dongluochen commented Jan 4, 2016

@abronan thanks Alex! I'll update test cases soon.

@vieux

This comment has been minimized.

Show comment
Hide comment
@vieux

vieux Jan 5, 2016

Contributor

@dongluochen small comment, but could we please put └ Error: <none> or └ Error: (none) instead of blanc ?

Contributor

vieux commented Jan 5, 2016

@dongluochen small comment, but could we please put └ Error: <none> or └ Error: (none) instead of blanc ?

Add integration test for state machine.
Signed-off-by: Dong Chen <dongluo.chen@docker.com>
@dongluochen

This comment has been minimized.

Show comment
Hide comment
@dongluochen

dongluochen Jan 5, 2016

Contributor

@vieux I updated no error to └ Error: (none). Thanks for the comment.

Contributor

dongluochen commented Jan 5, 2016

@vieux I updated no error to └ Error: (none). Thanks for the comment.

@vieux

This comment has been minimized.

Show comment
Hide comment
@vieux

vieux Jan 5, 2016

Collaborator

you can replace the sleeps and the # Verify swarm detects node failure bloc after a stop by

# Wait for Swarm to detect the node failure.
retry 5 1 eval "docker_swarm info | grep -q 'Unhealthy'"
Collaborator

vieux commented on test/integration/nodemanagement/state.bats in 52a7616 Jan 5, 2016

you can replace the sleeps and the # Verify swarm detects node failure bloc after a stop by

# Wait for Swarm to detect the node failure.
retry 5 1 eval "docker_swarm info | grep -q 'Unhealthy'"

This comment has been minimized.

Show comment
Hide comment
@vieux

vieux Jan 5, 2016

Collaborator

I'm sure we can find something similar for start

Collaborator

vieux replied Jan 5, 2016

I'm sure we can find something similar for start

@vieux

This comment has been minimized.

Show comment
Hide comment
@vieux

vieux Jan 5, 2016

Collaborator

why 40 here ?

Collaborator

vieux commented on test/integration/nodemanagement/state.bats in 52a7616 Jan 5, 2016

why 40 here ?

This comment has been minimized.

Show comment
Hide comment
@dongluochen

dongluochen Jan 5, 2016

Owner

Pending engines are validated with a minimum of 30 seconds interval. It's a static value. We may cut it to 10 secs. Each pending engine has its own gap based on failureCount.

Owner

dongluochen replied Jan 5, 2016

Pending engines are validated with a minimum of 30 seconds interval. It's a static value. We may cut it to 10 secs. Each pending engine has its own gap based on failureCount.

This comment has been minimized.

Show comment
Hide comment
@vieux

vieux Jan 5, 2016

Collaborator

ok.... :(

let's at least use a retry syntax so it might take less than 40

Collaborator

vieux replied Jan 5, 2016

ok.... :(

let's at least use a retry syntax so it might take less than 40

dongluochen added some commits Jan 5, 2016

Update integration test. Reduce pending node validation sleep interva…
…l. Each pending node has its own validation interval according to failure count. So reducing sleep interval is not increasing validation frequency for unreachable nodes.

Signed-off-by: Dong Chen <dongluo.chen@docker.com>
Fix format issue in state.bats.
Signed-off-by: Dong Chen <dongluo.chen@docker.com>
Update failureCount scenario and test cases.
Signed-off-by: Dong Chen <dongluo.chen@docker.com>
@vieux

This comment has been minimized.

Show comment
Hide comment
@vieux

vieux Jan 7, 2016

Contributor

LGTM

Contributor

vieux commented Jan 7, 2016

LGTM

@vieux vieux added this to the 1.1.0 milestone Jan 7, 2016

@vieux vieux added the priority/P1 label Jan 7, 2016

@vieux vieux assigned vieux and dongluochen and unassigned vieux Jan 7, 2016

@dongluochen

This comment has been minimized.

Show comment
Hide comment
@dongluochen

dongluochen Jan 7, 2016

Contributor

ping @docker/swarm-maintainers.

Contributor

dongluochen commented Jan 7, 2016

ping @docker/swarm-maintainers.

Show outdated Hide outdated cluster/engine.go
Show outdated Hide outdated cluster/engine.go
@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Jan 7, 2016

Contributor

Tested and seems to work as expected, just very small nits but LGTM

Contributor

abronan commented Jan 7, 2016

Tested and seems to work as expected, just very small nits but LGTM

Name constants.
Signed-off-by: Dong Chen <dongluo.chen@docker.com>
@dongluochen

This comment has been minimized.

Show comment
Hide comment
@dongluochen

dongluochen Jan 7, 2016

Contributor

Thanks @abronan! I've updated code accordingly.

Contributor

dongluochen commented Jan 7, 2016

Thanks @abronan! I've updated code accordingly.

@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Jan 8, 2016

Contributor

Sweet! Thanks @dongluochen!

Contributor

abronan commented Jan 8, 2016

Sweet! Thanks @dongluochen!

abronan added a commit that referenced this pull request Jan 8, 2016

@abronan abronan merged commit 8b173fd into docker:master Jan 8, 2016

4 of 5 checks passed

engine-master Jenkins build Swarm-PRs (engine master) 1597 has failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
docker/dco-signed All commits signed
Details
documentation success 1 tests run, 1 skipped, 0 failed.
Details
engine-stable Jenkins build Swarm-PRs (engine stable) 60 has succeeded
Details
@abronan

This comment has been minimized.

Show comment
Hide comment
@abronan

abronan Jan 8, 2016

Contributor

🎉

Contributor

abronan commented Jan 8, 2016

🎉

@abronan abronan referenced this pull request Jan 12, 2016

Closed

Proposal Node failover #755

@dongluochen dongluochen deleted the dongluochen:nodeManagement branch Feb 2, 2016

ChristianKniep pushed a commit to ChristianKniep/swarm that referenced this pull request Jul 27, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment