taskreaper: Wait for tasks to stop running #1948

Merged
merged 1 commit into docker:master on Feb 11, 2017

Conversation

Collaborator

aaronlehmann commented Feb 10, 2017

Currently, the task reaper only waits for a task's desired state to go
past "running" before deleting it. This can cause problems with rolling
updates when task-history-limit is set to 1. The old task can get
deleted by the reaper before the agent has a chance to update its actual
state to "shutdown", so the update never sees that the task has
completed and has to wait a while for a timeout.

This fixes it by making the task reaper only delete tasks whose desired
and actual states have both moved past "running". It's also necessary to
keep slots in the "dirty" list until there is only one task in that slot
with desired state or actual state <= "running", so that the old task
still gets cleaned up once the actual state moves. Finally, "deleteTasks"
is changed to a map so that a task which is both part of a dirty slot and
orphaned won't cause two delete attempts (one of which would fail).

Note that this means tasks on unavailable nodes will stay around for a
while, until the "orphaned" state is reached.

Tested by vendoring into docker and using the repro steps from moby/moby#28291.

cc @aluzzardi @dongluochen
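The deletion rule and dirty-slot bookkeeping described above can be sketched as a small, self-contained Go model. This is an illustrative sketch only, not swarmkit's actual code: the `Task` type, the numeric state ordering, and the helper names `deletable` and `slotStillDirty` are assumptions made for this example.

```go
package main

import "fmt"

// TaskState models the ordered task states mentioned in the description.
// The numeric ordering (running < shutdown < orphaned) is an assumption
// of this sketch, used only so states can be compared with < and >.
type TaskState int

const (
	StateRunning TaskState = iota
	StateShutdown
	StateOrphaned
)

// Task is a simplified stand-in for a swarmkit task: it tracks both the
// orchestrator's desired state and the agent-reported actual state.
type Task struct {
	ID           string
	DesiredState TaskState
	ActualState  TaskState
}

// deletable reports whether the reaper may remove a task: both the
// desired state *and* the actual state must have moved past "running".
func deletable(t *Task) bool {
	return t.DesiredState > StateRunning && t.ActualState > StateRunning
}

// slotStillDirty reports whether a slot must stay on the "dirty" list:
// it only becomes clean once at most one task in the slot has desired
// state or actual state <= "running".
func slotStillDirty(tasks []*Task) bool {
	n := 0
	for _, t := range tasks {
		if t.DesiredState <= StateRunning || t.ActualState <= StateRunning {
			n++
		}
	}
	return n > 1
}

func main() {
	// An old task mid-shutdown (desired past running, actual still running)
	// and its replacement in the same slot.
	old := &Task{ID: "old", DesiredState: StateShutdown, ActualState: StateRunning}
	updated := &Task{ID: "new", DesiredState: StateRunning, ActualState: StateRunning}

	// Keying pending deletions by task ID means a task that is both part
	// of a dirty slot and orphaned is only scheduled for deletion once.
	deleteTasks := map[string]struct{}{}
	for _, t := range []*Task{old, updated} {
		if deletable(t) {
			deleteTasks[t.ID] = struct{}{}
		}
	}
	fmt.Println(len(deleteTasks), slotStillDirty([]*Task{old, updated})) // prints "0 true"
}
```

In this scenario nothing is deleted yet and the slot stays dirty; once the agent reports the old task's actual state as "shutdown", `deletable` becomes true for it and the slot drops off the dirty list.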

taskreaper: Wait for tasks to stop running
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Contributor

dongluochen commented Feb 11, 2017

Flaky test?

--- FAIL: TestDemoteToSingleManager (89.78s)
	Error Trace:	integration_test.go:118
			integration_test.go:351
	Error:		Received unexpected error worker node mrqdmyhbrpofzeljref7wt4v6 should not have manager status, returned &ManagerStatus{RaftID:8991989376972981745,Addr:127.0.0.1:33826,Leader:false,Reachability:UNREACHABLE,}
			github.com/docker/swarmkit/manager/state/raft/testutils.PollFuncWithTimeout
				/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:76: polling failed
Collaborator

aaronlehmann commented Feb 11, 2017

Yeah, definitely unrelated.

#1939 might help.

codecov-io commented Feb 11, 2017

Codecov Report

Merging #1948 into master will increase coverage by 0.46%.

@@            Coverage Diff             @@
##           master    #1948      +/-   ##
==========================================
+ Coverage   54.05%   54.51%   +0.46%     
==========================================
  Files         108      108              
  Lines       18547    18548       +1     
==========================================
+ Hits        10026    10112      +86     
+ Misses       7289     7198      -91     
- Partials     1232     1238       +6

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ccdf5e0...2d575ef.

Contributor

dongluochen commented Feb 11, 2017

LGTM

Contributor

aluzzardi commented Feb 11, 2017

LGTM

@aaronlehmann aaronlehmann merged commit c7a2fb9 into docker:master Feb 11, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/project 54.51% (target 0%)
Details
dco-signed All commits are signed

@aaronlehmann aaronlehmann deleted the aaronlehmann:taskreaper-keep-running-tasks branch Feb 11, 2017
