Proposal: Container Rescheduling #1488
The goal of this proposal is to reschedule containers automatically in case of node failure.
This is currently one of the top requested features for Swarm.
The behavior should be user controllable and disabled by default since rescheduling can have nasty effects on stateful containers.
The user can select the policy at
Possible values for
The reason this is more complicated than
Rescheduling policies will be stored as a container label:
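For illustration, a policy could be attached at `docker run` time either through an environment variable or directly as a label (the exact key and value names here follow the eventual Swarm standalone docs, but treat them as assumptions in the context of this proposal):

```shell
# Request rescheduling on node failure via an environment variable;
# Swarm standalone translates this into a container label
# (label key assumed for illustration):
docker run -d -e reschedule:on-node-failure redis

# Equivalent direct-label form (hypothetical):
docker run -d -l 'com.docker.swarm.reschedule-policies=["on-node-failure"]' redis
```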
Ideally, Swarm would store all containers (at least those that should be rescheduled) persistently. That way, the manager can figure out which containers are down and take action.
Unfortunately, we currently don't have a shared state and this feature has been postponed because of that for a long time.
Since this is one of the top requested features, I propose we take a different approach until we have shared state (shared state has been postponed for usability concerns: we don't want to make a KV store a dependency for Swarm).
By storing the rescheduling policy as a container label, we are able to reconstruct the desired state at startup time.
Since we are already storing constraints, affinities etc as container labels (exactly for this reason), the manager will have all the information it needs to perform rescheduling.
This means we can restart the manager as much as we want and it will resume rescheduling as expected.
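The reconstruction step can be sketched with plain CLI calls (the label key is an assumption for illustration): on startup, the manager lists all containers cluster-wide and filters for the rescheduling label; any such container sitting on an unhealthy node becomes a candidate for rescheduling.

```shell
# Sketch: enumerate containers carrying a rescheduling policy label
# (label key is an assumption for illustration)
docker ps -a \
  --filter 'label=com.docker.swarm.reschedule-policies' \
  --format '{{.ID}} {{.Label "com.docker.swarm.reschedule-policies"}}'
```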
However, the problem arises when a node goes down while the manager is not running: in that case, we won't "remember" that container even existed when the manager is started again.
This situation can be counter-balanced by using
Since every manager is aware of the cluster state (containers & rescheduling policy), it means that as long as at least one manager is still running we won't forget about containers.
This functionality is already provided by
Rescheduling can rely on the health status already available.
Eventually, a node may come back to life and re-join the cluster. If the node has containers that were rescheduled, we will end up with duplicates.
Swarm should monitor incoming nodes and, upon detecting a duplicate container, it should destroy the oldest one (keeping the most recently created container alive). This behavior could eventually be made configurable by the user (keep oldest, keep newest, ...), although we may want to avoid providing that option until we see a valid use case.
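The "keep the most recently created container" rule is simple to sketch. Assuming we can obtain (creation-timestamp, container-id) pairs for a set of duplicates, the survivor is the newest and everything else is destroyed (the function and input names below are hypothetical):

```shell
# Hypothetical duplicate resolution: read "created_unix_ts container_id"
# pairs on stdin, keep the most recently created container, and print
# the IDs of the older duplicates that should be destroyed.
resolve_duplicates() {
  # sort newest first (numeric, descending by timestamp),
  # drop the first line (the survivor), emit the remaining container IDs
  sort -rn | tail -n +2 | awk '{print $2}'
}

# Example: two duplicates of the same logical container
printf '%s\n' \
  '1449100000 web-old' \
  '1449200000 web-new' | resolve_duplicates
```

In real code the destroy step would run `docker rm -f` on each printed ID; printing keeps the sketch side-effect free.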
If duplicate containers were started with a
We could force all containers that have rescheduling enabled to never automatically restart. In that case, whenever a node joins, Swarm could decide to either start the containers or destroy them if they are duplicates.
However, there are many drawbacks to this approach:
Furthermore, it doesn't actually entirely solve the issue. If the node didn't actually die (e.g. it just froze for a while, there was a netsplit, or networking temporarily dropped), we will end up with duplicate containers running for a while anyway.
Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with
When rescheduling containers, Swarm must handle multi-host networking properly.
The goal is for the new container to take over the previous one.
In an overlay network setup, this may involve:
I will note that Amazon ECS (which I'm now using, because swarm did not deliver this feature in time) does rebalancing by disabling local container restarts. I also don't think having a KV store as a dependency is out of the question, as long as it is swappable.
For me this is the missing piece for running swarm in production.
I think that it's ok to have this feature dependent on a KV store, as the docker overlay network requires one and I don't see any point in running swarm without an overlay network (ok, this can be swapped, but the majority of implementations rely on a KV store).
I would add the volume driver portion in here as well. If a container that has external volumes attached is rescheduled to another host, then the same volumes should be brought to the new host.
In the case of REX-Ray (rexray/rexray#190), it now has pre-emption built into most drivers. This means that the new requesting container runtime will cause a forceful mount, detaching the volume from any host that currently has it. The setting is currently a global setting at the driver level for us, but it would be an interesting addition to the volume plugins to allow a flag on mount, used by Swarm to tell the driver to pre-empt (force mount) in the case of rescheduling. Typically we wouldn't want to enable pre-emption, since blocking mounts from multiple hosts and blocking unintentional detach/attach operations is a safety feature. cc @cpuguy83
Drivers that don't have a forceful mount or pre-emption option will cause containers requested on a new host to error, since their volume will not be able to be unmounted from the old host. The exception here depends on the storage platform. For example, EC2 and OpenStack disallow this by default. This makes sense for safety: we want people to be explicit about mounting a volume to multiple hosts or doing detach/attach operations.
How come you want to use an extra label?
Also need to account for paused containers... I have a feeling that these should not be rescheduled ever.
I'd have a question about this topic.
Currently the docker daemon handles restarting of containers. But isn't there a conflict between the daemon and the swarm manager when it comes to rescheduling?
The (already discussed) scenario I am referring to is: There's a node-failure, swarm master would reschedule containers to healthy nodes. Now the failed node gets healed...
It comes up and the daemon starts containers with
So, should restarting/rescheduling be handled exclusively by either the docker daemon or the swarm manager?
I wanted to throw in another idea re volumes here.
Volumes could also have to do with container placement. For example, if a
A second would be for the volumes-from flag. This should have similar
Otherwise the volume being requested is going to fail to start for those
@cpuguy83 I think
For instance, let's say that you start a
Unless you are using a distributed volume, you definitely DO NOT want Swarm to create a brand new
You might want to always restart but never re-schedule, or you might want to get both.
In such a case it may be better to only support the explicit case of not rescheduling containers that do have a restart policy.
Alternatively, maybe restart policies could be modified to accept conditions like
@aluzzardi In that case, I'd almost prefer to not reschedule containers with volumes (unless explicitly specified through some configuration) until we can figure out a way to make it just work with restart policies... but maybe there is no perfect world here.
Also wondering if there's a plan to have some delay, once a host is marked as unhealthy, before rescheduling happens (or maybe that's just the health check itself).
@aluzzardi from your suggestion:
Do you mean that rescheduled containers are allowed to be duplicated for some time, and eventually Swarm would delete the oldest container and keep the newest one, is that right?
Speaking of restarts, rescheduling, and eventually rebalancing seems like speaking of different functionalities.
The way I look at it after reading these few posts is that rebalancing is another world, with considerations like what to do to minimize downtime and volume access (allow duplicates or not).
Restarts and reschedules are the key features that look like high availability.
Restarts might very well be more suitable for stateful services which would require specific volumes.
In the end only the one running the full stack can say what is best.
Restarts are handled at the docker level and reschedules more likely at the swarm level. Although, if you have different tanks linked together and you add a bucket of water in one, it will naturally spill into the other ones. In this sense, docker could check with swarm whether it really is up to it to start a container, and eventually leave the job to swarm to decide (certainly not mandatory).
lastly, what i don't grasp is the network implications...
I think "make sure the new container takes over the IP of the old one" is unnecessary and may be harmful. Swarm does not specify the IP for the original container. It only attaches the container to an overlay network where the IP is dynamically assigned. How this IP is used is up to the user. The same logic applies to the new container. Generally speaking, distributed services should use names, not IPs.
Persisting IP usually happens on VM
In order to safely perform such "rebalancing", the failed node (and/or the containers on that node) first needs to be "fenced". There are several approaches to performing such fencing:
Are there plans to allow an explicit "rebalance" of all eligible containers in the cluster?
A potential use case would be that we add node(s) to the cluster and want to utilize the newly available resources without having to explicitly choose which containers to go there.
@nishanttotla Yes , rebalancing automatically when the node comes back in life . Kind of Resurrection .
Let's say if i have one manager & one worker .-
Ideally manager should automatically balance the containers when worker comes back in life by moving the oldest container back to the worker . I don't know how this can be done without downtime ( in case you have just one application container ) .
PS - I understand it's not efficient way to use swarm, we should use at least 3 managers but i caught up in this situation so i thought to get some ideas from community.
@viveky4d4v I want to confirm that you mean Docker Swarm standalone (this project
I think this issue has been implemented. See https://docs.docker.com/swarm/scheduler/rescheduling/