
allocator: Avoid allocating tasks that are no longer running #2017

Merged (1 commit) on Mar 10, 2017

Conversation

aaronlehmann (Collaborator)

The allocator currently uses taskDead to check if a task should not be allocated. However, this checks that both the desired state and actual state are past "running". In the case of network attachment tasks, there is no orchestration, and the desired state is always "running". This means allocation can be reattempted on a failed attachment task indefinitely.

Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.
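
(For reference, the check being discussed looks roughly like the sketch below. The bodies are paraphrased from the description above rather than copied from master, and taskShouldAllocate is a made-up helper name used only for illustration; the real code compares t.DesiredState and t.Status.State against api.TaskStateRunning.)

// Rough sketch only; assumes the swarmkit api package for Task and the
// TaskState constants.
package allocator

import "github.com/docker/swarmkit/api"

// taskDead (current behavior, paraphrased): a task is skipped only when
// *both* the desired and the observed state are past "running". An
// attachment task with desired state "running" and observed state "failed"
// therefore never gets skipped.
func taskDead(t *api.Task) bool {
	return t.DesiredState > api.TaskStateRunning && t.Status.State > api.TaskStateRunning
}

// taskShouldAllocate (hypothetical name for the proposed condition):
// allocate only while both states are still at "running" or below.
func taskShouldAllocate(t *api.Task) bool {
	return t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
}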

cc @alexmavr @yongtang @aboch @dongluochen

codecov bot commented Mar 8, 2017

Codecov Report

Merging #2017 into master will increase coverage by 0.06%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master    #2017      +/-   ##
==========================================
+ Coverage   53.76%   53.83%   +0.06%     
==========================================
  Files         109      109              
  Lines       19037    19033       -4     
==========================================
+ Hits        10236    10247      +11     
+ Misses       7563     7557       -6     
+ Partials     1238     1229       -9


aboch commented Mar 8, 2017

The issue would potentially be exposed only during leader re-election, because at swarm init there can't be any network attachment task in store. Correct?

@aaronlehmann (Collaborator Author)

That's right. It wouldn't happen when the task first fails, because the state would be >= Pending. But after a leader election, the task permanently gets added to the unallocated list.

aboch commented Mar 8, 2017

Looks good to me

@aluzzardi (Member)

Having a hard time mentally mapping this.

Current state:

checks that both the desired state and actual state are past "running"

Proposed state:

Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

So we are completely reversing the behavior?

@aaronlehmann (Collaborator Author)

@aluzzardi: This shouldn't be reversing the behavior. I think maybe you're missing the ! in !taskRunning?

Current state:

  • checks that both the desired state and actual state are past "running"
  • Skips tasks when both the desired state and actual state are past "running"

Proposed state:

  • Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

@aluzzardi (Member)

Sorry, I don't get it :/

Aren't those two the same? Also, "Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below." -> why would we want to allocate tasks that are running?

@aaronlehmann (Collaborator Author)

Sorry, I don't get it :/ Aren't those two the same?

Basically this is changing an && to an ||. Pseudocode for before:

if desiredState > running && actualState > running {
    skip task
}

Pseudocode for after:

if desiredState > running || actualState > running {
    skip task
}

The motivation for this is that network attachment tasks never change their desired state. The tasks I saw in a dump from a cluster getting lots of log spam from repeated allocations were network attachment tasks with desired state running but actual state failed.

why would we want to allocate tasks that are running?

I had a comment in the PR description about this:

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

Basically trying to make the least invasive change that will fix the problem, but I think tightening the check to skip running tasks as well would also be correct.

@aluzzardi (Member)

Why not:

if actualState > running {
    skip task
}

If we don't desire a task to be running, but it is actually running, should we really skip it?

@aaronlehmann (Collaborator Author)

I think you're right. Paying any attention at all to desired state here seems bogus.

Suppose the orchestrator tries to start a task, but soon after, the service is updated, and it ends up changing that task to desired state = shutdown. That task might get stuck in the allocator forever if the allocator ignores it on the grounds that desired state is past running.

Should we just switch to observed state? I think we should, but I'm wondering if I'm missing anything here.

The only potential risk I see in making this change is that we won't deallocate failed tasks until the observed state changes. So if a node is down, we might not be able to free the resources until the task becomes orphaned. But I thought that was how it was supposed to work anyway! The fact that it's freeing resources based on desired state seems like a bug.
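
(A minimal sketch of what checking observed state only would look like, under the same assumptions as the sketch in the description; taskAllocatable is a made-up name, not the actual diff.)

// Sketch only; assumes the swarmkit api package.
package allocator

import "github.com/docker/swarmkit/api"

// taskAllocatable (hypothetical): eligibility is decided from observed state
// alone. Desired state is deliberately ignored, so a task the orchestrator
// has marked for shutdown but that is still observed as running keeps its
// resources until its observed state actually moves past "running".
func taskAllocatable(t *api.Task) bool {
	return t.Status.State <= api.TaskStateRunning
}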

@aaronlehmann (Collaborator Author)

@aluzzardi: I've updated this to only check observed state. I think this is the correct thing to do. It also makes the code simpler. It's a little bit more of a change from what was there before, though. I removed the taskRunning and taskDead functions because I think they're confusing and obscure what is actually being checked.

@aboch: Can you please take a second look?

The allocator currently uses taskDead to check if a task should not be
allocated. However, this checks that both the desired state and actual
state are past "running". In the case of attachment tasks, there is no
orchestration, and the desired state is always "running". This means
allocation can be reattempted on a failed attachment task indefinitely.

Change the allocator to only try to allocate tasks where both the
desired and actual state are "running" or below.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

@aaronlehmann (Collaborator Author)

The PR that added desired state checking is #435. I reviewed it, but right now I don't understand why it's necessary or beneficial. I see that the conditions inside taskRunning and taskDead have changed significantly since then. They used to check for specific desired states, and now they check for a range. Maybe this made sense at the time, but it doesn't make sense now in the context of how desired state has evolved.

Still, I'm a little concerned that we don't understand this code very well. I'm tempted to go back to the original fix to be safer, but that's just going to delay addressing the technical debt.

What about removing the desired state checking in master, and just doing the tweak in the original PR for the stable release? Not perfect either because it introduces a weird difference, but it seems worth considering.

@aaronlehmann (Collaborator Author)

FWIW, I vendored this PR (along with all the others) into docker, and all integration tests passed.

@@ -281,7 +281,7 @@ func (a *Allocator) doNetworkInit(ctx context.Context) (err error) {
 	}

 	for _, t := range tasks {
-		if taskDead(t) {
+		if t.Status.State > api.TaskStateRunning {

I think this is fine, because if the task's state is beyond running, we should not bother attempting an allocation for it, regardless of the desired state. Once a task moves beyond running, it cannot be brought back into the pipeline, right?
A new task will be created in place of it, I think.

aaronlehmann (Collaborator Author)

Once a task moves beyond running, it cannot be brought back into the pipeline, right?

That's correct.

-		if taskDead(t) || isDelete {
+		// If the task has stopped running then we should free the network
+		// resources associated with the task right away.
+		if t.Status.State > api.TaskStateRunning || isDelete {

My only concern here, as you already discussed above and in the PR which introduced the current check, is that we release the resources associated with this task while the task may still be running on an unresponsive node, and a new task coming up will be given those resources.
So we could end up with two containers in the cluster having the same IP address.

aaronlehmann (Collaborator Author)

Actually I don't think this is a problem. This is looking at observed state, not desired state. So we err on the side of keeping the resources tied up if the node is unresponsive. There is a mechanism that marks tasks as orphaned after 24 hours of the node being down, so the allocator can finally free these resources.

Cool, thanks for the clarification.
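
(Aside: a sketch of why the orphan mechanism is enough to avoid leaking resources under the observed-state check. It assumes api.TaskStateOrphaned is ordered after api.TaskStateRunning in the TaskState enum; shouldFreeResources is a made-up name, not code from this PR.)

// Sketch only; assumes the swarmkit api package and the state ordering above.
package allocator

import "github.com/docker/swarmkit/api"

func shouldFreeResources(t *api.Task, isDelete bool) bool {
	// Failed, shut-down, and orphaned tasks all have an observed state past
	// "running", so this single check eventually frees their resources. A
	// task on an unresponsive node stays at "running" and keeps its
	// resources until it is marked orphaned (roughly 24 hours later).
	return t.Status.State > api.TaskStateRunning || isDelete
}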

@dongluochen (Contributor)

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think "pending" and beyond is not enough. If I understand correctly, t.status.State > running is right here. While a task moves between pending and running, the network allocation for the task might not have propagated yet.

@aaronlehmann (Collaborator Author)

Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think "pending" and beyond is not enough. If I understand correctly, t.status.State > running is right here.

Agreed. I realized that after making that comment.

aboch commented Mar 10, 2017

Looks good to me still.

@dongluochen (Contributor)

LGTM

@aluzzardi (Member)

LGTM
