allocator: Avoid allocating tasks that are no longer running #2017

Merged
merged 1 commit into from Mar 10, 2017

Conversation

4 participants
@aaronlehmann
Collaborator

aaronlehmann commented Mar 8, 2017

The allocator currently uses taskDead to check if a task should not be allocated. However, this checks that both the desired state and actual state are past "running". In the case of network attachment tasks, there is no orchestration, and the desired state is always "running". This means allocation can be reattempted on a failed attachment task indefinitely.

Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.
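For illustration, a rough Go sketch of the behavior described above; the type and constant names are stand-ins for the real api.TaskState values, not the exact swarmkit code:

// Illustrative only: the state names and ordering stand in for api.TaskState.
type TaskState int

const (
    TaskStateNew TaskState = iota
    TaskStatePending
    TaskStateRunning
    TaskStateFailed
)

// Current check: skip allocation only when BOTH the desired and the actual
// state are past "running".
func taskDead(desired, actual TaskState) bool {
    return desired > TaskStateRunning && actual > TaskStateRunning
}

// A failed network attachment task has desired = TaskStateRunning and
// actual = TaskStateFailed, so taskDead returns false and the allocator
// keeps retrying allocation for it.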

cc @alexmavr @yongtang @aboch @dongluochen

@aaronlehmann aaronlehmann added this to the 17.03.1 milestone Mar 8, 2017

@codecov

codecov bot commented Mar 8, 2017

Codecov Report

Merging #2017 into master will increase coverage by 0.06%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master    #2017      +/-   ##
==========================================
+ Coverage   53.76%   53.83%   +0.06%     
==========================================
  Files         109      109              
  Lines       19037    19033       -4     
==========================================
+ Hits        10236    10247      +11     
+ Misses       7563     7557       -6     
+ Partials     1238     1229       -9

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5eb9c0...9de6ccd. Read the comment docs.

@aboch

aboch commented Mar 8, 2017

The issue would potentially be exposed only during leader re-election, because at swarm init there can't be any network attachment task in store. Correct?

@aaronlehmann

Collaborator

aaronlehmann commented Mar 8, 2017

That's right. It wouldn't happen when the task first fails, because the state would be >= Pending. But after a leader election, the task permanently gets added to the unallocated list.

@aboch

aboch commented Mar 8, 2017

Looks good to me

@aluzzardi

Contributor

aluzzardi commented Mar 9, 2017

Having a hard time mentally mapping this.

Current state:

checks that both the desired state and actual state are past "running"

Proposed state:

Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

So we are completely reversing the behavior?

@aaronlehmann

Collaborator

aaronlehmann commented Mar 9, 2017

@aluzzardi: This shouldn't be reversing the behavior. I think maybe you're missing the ! in !taskRunning?

Current state:

  • checks that both the desired state and actual state are past "running"
  • Skips tasks when both the desired state and actual state are past "running"

Proposed state:

  • Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.
@aluzzardi

Contributor

aluzzardi commented Mar 9, 2017

Sorry, I don't get it :/

Aren't those two the same? Also, "Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below." -> why would we want to allocate tasks that are running?

@aaronlehmann

Collaborator

aaronlehmann commented Mar 9, 2017

Sorry, I don't get it :/ Aren't those two the same?

Basically this is changing an && to an ||. Pseudocode for before:

if desiredState > running && actualState > running {
    skip task
}

Pseudocode for after:

if desiredState > running || actualState > running {
    skip task
}

The motivation for this is that network attachment tasks never change their desired state. The tasks I saw in a dump from a cluster getting lots of log spam from repeated allocations were network attachment tasks with desired state running but actual state failed.

why would we want to allocate tasks that are running?

I had a comment in the PR description about this:

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

Basically trying to make the least invasive change that will fix the problem, but I think tightening the check to skip running tasks as well would also be correct.
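To spell out the difference between the two conditions in the pseudocode above, the four possible combinations behave as follows:

desired > running | actual > running | old (&&) skips? | new (||) skips?
no                | no               | no              | no
no                | yes              | no              | yes   <- failed attachment task
yes               | no               | no              | yes
yes               | yes              | yes             | yes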

@aluzzardi

Contributor

aluzzardi commented Mar 9, 2017

Why not:

if actualState > running {
    skip task
}

If we don't desire a task to be running, but it is actually running, should we really skip it?

@aaronlehmann

Collaborator

aaronlehmann commented Mar 9, 2017

I think you're right. Paying any attention at all to desired state here seems bogus.

Suppose the orchestrator tries to start a task, but soon after, the service is updated, and it ends up changing that task to desired state = shutdown. That task might get stuck in the allocator forever if the allocator ignores it on the grounds that desired state is past running.

Should we just switch to observed state? I think we should, but I'm wondering if I'm missing anything here.

The only potential risk I see in making this change is that we won't deallocate failed tasks until the observed state changes. So if a node is down, we might not be able to free the resources until the task becomes orphaned. But I thought that was how it was supposed to work anyway! The fact that it's freeing resources based on desired state seems like a bug.
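A minimal sketch of what an observed-state-only check would look like in the allocation loop (the surrounding code is simplified and illustrative; only Status.State and api.TaskStateRunning are the real swarmkit names):

for _, t := range tasks {
    // Skip tasks whose observed state has already moved past "running";
    // the desired state is not consulted at all.
    if t.Status.State > api.TaskStateRunning {
        continue
    }
    // ... attempt network allocation for the task ...
}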

@aaronlehmann

Collaborator

aaronlehmann commented Mar 9, 2017

@aluzzardi: I've updated this to only check observed state. I think this is the correct thing to do. It also makes the code simpler. It's a little bit more of a change from what was there before, though. I removed the taskRunning and taskDead functions because I think they're confusing and obscure what is actually being checked.

@aboch: Can you please take a second look?

allocator: Avoid allocating tasks that are no longer running
The allocator currently uses taskDead to check if a task should not be
allocated. However, this checks that both the desired state and actual
state are past "running". In the case of attachment tasks, there is no
orchestration, and the desired state is always "running". This means
allocation can be reattempted on a failed attachment task indefinitely.

Change the allocator to only try to allocate tasks where both the
desired and actual state are "running" or below.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann

Collaborator

aaronlehmann commented Mar 9, 2017

The PR that added desired state checking is #435. I reviewed it, but right now I don't understand why it's necessary or beneficial. I see that the conditions inside taskRunning and taskDead have changed significantly since then. They used to check for specific desired states, and now they check for a range. Maybe this made sense at the time, but it doesn't make sense now in the context of how desired state has evolved.

Still, I'm a little concerned that we don't understand this code very well. I'm tempted to go back to the original fix to be safer, but that's just going to delay addressing the technical debt.

What about removing the desired state checking in master, and just doing the tweak in the original PR for the stable release? Not perfect either because it introduces a weird difference, but it seems worth considering.

@aaronlehmann

Collaborator

aaronlehmann commented Mar 10, 2017

FWIW, I vendored this PR (along with all the others) into docker, and all integration tests passed.

@@ -281,7 +281,7 @@ func (a *Allocator) doNetworkInit(ctx context.Context) (err error) {
 	}
 	for _, t := range tasks {
-		if taskDead(t) {
+		if t.Status.State > api.TaskStateRunning {

@aboch

aboch Mar 10, 2017

I think this is fine, because if the task's state is beyond running, we should not bother attempting an allocation for it, regardless of the desired state. Once a task moves beyond running, it cannot be brought back into the pipeline, right?
A new task will be created in place of it, I think.

@aaronlehmann

Collaborator

aaronlehmann Mar 10, 2017

Once a task moves beyond running, it cannot be brought back into the pipeline, right?

That's correct.

-		if taskDead(t) || isDelete {
+		// If the task has stopped running then we should free the network
+		// resources associated with the task right away.
+		if t.Status.State > api.TaskStateRunning || isDelete {

@aboch

aboch Mar 10, 2017

The only concern here, as you guys already discussed here and also in the PR which introduced the current check, is that we release the resources associated with this task while the task may still be running on an unresponsive node, and a new task coming up will be given those resources.
So we could end up with two containers in the cluster having the same IP address.

@aaronlehmann

Collaborator

aaronlehmann Mar 10, 2017

Actually I don't think this is a problem. This is looking at observed state, not desired state. So we err on the side of keeping the resources tied up if the node is unresponsive. There is a mechanism that marks tasks as orphaned after 24 hours of the node being down, so the allocator can finally free these resources.

@aboch

aboch Mar 10, 2017

Cool, thanks for the clarification.

@dongluochen

Contributor

dongluochen commented Mar 10, 2017

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think "pending" and beyond is not enough. If I understand correctly, t.status.State > running is right here. While a task could move between pending and running, network allocation for the task might not have propagated.

@aaronlehmann

Collaborator

aaronlehmann commented Mar 10, 2017

Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think "pending" and beyond is not enough. If I understand correctly, t.status.State > running is right here.

Agreed. I realized that after making that comment.

@aboch

aboch commented Mar 10, 2017

Looks good to me still.

@dongluochen

Contributor

dongluochen commented Mar 10, 2017

LGTM

@aluzzardi

Contributor

aluzzardi commented Mar 10, 2017

LGTM

@aluzzardi aluzzardi merged commit 0e2d9eb into docker:master Mar 10, 2017

3 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/project: 53.83% (target 0%)
dco-signed: All commits are signed

@aaronlehmann aaronlehmann deleted the aaronlehmann:allocator-dead-tasks branch Mar 10, 2017
