
Proposal: docker daemon's SPoF and hot upgrade issue #13884

Closed

Conversation

mudverma

Background:

As per the current architecture of Docker, all containers are spawned as children of the docker daemon. This parent-child relationship between the docker daemon and containers provides a straightforward way of signalling/communication between them. However, this tight coupling between the containers and the daemon results in some issues which are critical for containers' up-time, stability and high availability. A few of the issues with this approach are:

  1. The daemon's death (restart, kill, or an abrupt crash in the daemon's code) causes all the running containers to die as well.
  2. An upgrade to the daemon cannot be performed without impacting the up-time/stability of containers.

Both of these issues become even more important in production environments such as a container cloud, where different containers running on a server might belong to the same or different clients and might host highly available or stateless services. In these scenarios, container downtime caused by external factors such as the daemon's death or upgrade is highly undesirable.

The second issue was raised by @shykes in 2013 and is still open: #2658

Goals:

  1. Containers should run independently of the docker daemon and should continue to function normally even if the docker daemon dies/restarts
  2. When containers die/exit, their exit status should be appropriately communicated to the docker daemon (whenever it restarts)
  3. All commands that are used to interact with containers (start/stop/kill/exec/pause, etc.) and I/O redirection (stderr/stdout) should work normally

Findings:

Based on our investigation and experimentation with Docker, we found that once started, a container can function stand-alone and does not require the daemon's presence for the execution of the encapsulated service.

We changed the daemon's code such that upon its death, containers would become orphaned and be adopted by INIT. We ran the official mysql image, and we were able to connect to and use the mysql service even when the daemon was not running and the container had become orphaned.

The namespaces (pid, network, ipc, mnt, uts) and cgroups, which are the building blocks for container creation and execution, continue to exist and function normally because they are provided by the Linux kernel. Therefore, there does not seem to be any reason for a container to stop functioning when the daemon is not present.
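
This is easy to observe from the host: each namespace appears as a symlink under /proc/&lt;pid&gt;/ns, and those links remain valid regardless of the daemon's state. A minimal Go sketch (the container PID is hypothetical):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	pid := 4242 // hypothetical PID of a container's init process
	for _, ns := range []string{"pid", "net", "ipc", "mnt", "uts"} {
		link, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/%s", pid, ns))
		if err != nil {
			fmt.Printf("%s: %v\n", ns, err)
			continue
		}
		// Prints e.g. "pid -> pid:[4026532198]"; these kernel objects
		// live for as long as the container does, daemon or no daemon.
		fmt.Printf("%s -> %s\n", ns, link)
	}
}
```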

Proposal:

In the context of our findings and our goals, we propose two alternative design models through which containers will no longer be tightly coupled with the daemon, either at all times (proposal A) or after the daemon's death (proposal B). Both of these models require an external communication/signalling mechanism between the daemon and the containers.

  • Proposal A: The docker daemon starts containers in such a way that they are children of INIT from the very beginning and are decoupled from the daemon. All running containers and the daemon are siblings in the process tree and are not impacted by the daemon's absence.
  • Proposal B: A hybrid model where newly spawned containers are children of the docker daemon, as in the current design. However, upon an abrupt daemon failure or upgrade, all running containers become orphaned and move to INIT. When the daemon restarts, it detects that there are old running containers (the current code detects and kills such containers) and manages them by external communication means (a new feature to be added). Consequently, all containers spawned by the current instance of the daemon are its descendants, while containers started by previous daemon instances are independent processes and require an external mode of communication.

Communication between containers and the daemon:

Both of these proposals require some sort of two-way communication mechanism between the daemon and the containers. For example, how would the daemon get notified when a container finishes its execution? And how would the daemon pass commands to containers? In the current design, the daemon waits on the child process (the container) in a goroutine. This can be tackled by having a dedicated parent monitor process for each container, whose job is to wait on the container and communicate with the daemon. The communication breaks down as follows (a sketch follows the list):

  • From a container to the daemon: Once the container terminates, the waiting monitor process can communicate the exit status to the docker daemon, either by connecting through a socket, or by writing the status to a special file and sending the daemon a signal (SIGRTMIN-SIGRTMAX, SIGUSR1, SIGUSR2) that tells it to read the file.
  • From the daemon to a container: Similarly, the monitor process can make the daemon aware of the container's PID, so that the daemon can interact with the container directly using signals (SIGKILL, SIGTERM, SIGSTOP, SIGCONT, etc.).
  • For I/O: Interprocess pipes can be used to redirect stdout, stderr, etc.
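
As a concrete illustration of the container-to-daemon direction, here is a minimal Go sketch of the monitor's wait-and-report logic. The socket path, wire format, and container-init path are hypothetical placeholders, not existing Docker interfaces:

```go
package main

import (
	"fmt"
	"net"
	"os/exec"
)

func main() {
	// Spawn the container's init process (path is a placeholder).
	cmd := exec.Command("/path/to/container-init")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Wait on the container; this is the role the daemon plays today.
	code := 0
	if err := cmd.Wait(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			code = exitErr.ExitCode()
		}
	}

	// Report the exit status to the daemon; the socket path and wire
	// format are assumptions for illustration only.
	conn, err := net.Dial("unix", "/var/run/docker-monitor.sock")
	if err != nil {
		// Daemon is down: per the proposal, retry later or write the
		// status to a file the next daemon instance will read.
		return
	}
	defer conn.Close()
	fmt.Fprintf(conn, "exit %d %d\n", cmd.Process.Pid, code)
}
```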

Implementation:

Proposal A:

  • The daemon first spawns a monitor process, and that monitor process spawns the container. Once the container is fully spawned, the monitor process replaces itself (via unix exec) with an ultra-lightweight program whose only job is to wait on its child process (the container).
  • The daemon then daemonizes the monitor process so that the monitor and container become orphans and are adopted by INIT.
  • The daemon interacts with the monitor process using sockets, or sends signals directly to the container process (the monitor can communicate the container's PID to the daemon) for management (stop, start, pause, kill, etc.).
  • Once the container finishes its execution, its waiting monitor process communicates the exit status to the daemon. There are two possibilities here:
    • Daemon is alive: communication is straightforward, and the monitor process can exit after supplying the status to the daemon.
    • Daemon is dead: the monitor process waits until the daemon is up, communicates the status, and exits. Alternatively, it can write the status to a file, which the new instance of the daemon reads when it comes back up.
      (screenshot: Proposal A diagram, 2015-06-11)

Proposal B:

  • The monitor and container move to INIT (become orphaned) only after the daemon dies. This would require a change in the cleanup code.
  • New instance of daemon becomes aware of old running containers and communicates with them using the techniques described in proposal A.
  • The monitor process can look at its ppid (parent PID): if it is 1, it can assume that it is orphaned and must communicate with the daemon by other means; if not, it still has the daemon as its parent and the communication flow is similar to the current design (see the sketch below).

(screenshot: Proposal B diagram, 2015-06-11)
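
The orphan check in the last bullet is cheap to implement; a minimal sketch:

```go
package main

import (
	"fmt"
	"os"
)

// orphaned reports whether this monitor has been reparented to INIT,
// i.e. the daemon that spawned it has died (the Proposal B check).
// Note: on systems with a child subreaper configured, the new parent
// may not be PID 1, so this check is a simplification.
func orphaned() bool {
	return os.Getppid() == 1
}

func main() {
	if orphaned() {
		fmt.Println("daemon gone: switch to the socket/status-file channel")
	} else {
		fmt.Println("daemon alive: use the normal parent-child flow")
	}
}
```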

Proposal A vs Proposal B:

  • Proposal A requires daemonising the monitor just after it is spawned. (Daemonising a child process is not inherently supported by golang, which docker uses, but can be worked around by using a double fork; see the sketch after this list.)
    runtime: support for daemonize golang/go#227
  • Proposal B does not require any such functionality: upon the daemon's death, all the child monitors (now orphaned) are moved to INIT along with the containers. However,
    • Proposal B is easier to code, although slightly less clean from a design perspective, as it requires separate maintenance (different code paths/checks) for the containers the daemon owns (descendants) and the containers it merely manages (siblings).
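
For reference, the double-fork workaround can be approximated in Go by re-execing the daemon's own binary as a short-lived intermediate; the monitor binary path below is hypothetical:

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

// The daemon re-execs itself as a throwaway intermediate; the
// intermediate starts the real monitor in a new session and exits
// immediately, so INIT adopts the monitor (and the container).
func main() {
	if len(os.Args) > 1 && os.Args[1] == "intermediate" {
		monitor := exec.Command("/usr/local/bin/docker-monitor") // hypothetical binary
		monitor.SysProcAttr = &syscall.SysProcAttr{Setsid: true} // detach into a new session
		if err := monitor.Start(); err != nil {
			os.Exit(1)
		}
		os.Exit(0) // intermediate dies; the monitor is orphaned to INIT
	}

	// Daemon side: spawn the intermediate and reap it immediately.
	self, err := os.Executable()
	if err != nil {
		panic(err)
	}
	cmd := exec.Command(self, "intermediate")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cmd.Wait() // reap, so the intermediate never lingers as a zombie
}
```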

Use Cases:

  • Containers' high availability
  • Hot upgrades
  • Scalability
    • Scalability is possible because a particular daemon will no longer be the owner of containers. Therefore, multiple daemons can be allowed to run on the same host, load-balancing requests coming from different channels (CLI, remote API, etc.).
    • However, we don't know whether the docker daemon is a performance bottleneck (most likely not). The daemon seems to be just a control channel/management point and does not figure in the data/critical path.
    • Nonetheless, one could run a redundant docker daemon sharing the same DB, so that even upon upgrades or failures the daemon continues to serve requests - daemon high availability.

Commands' analysis:

The following table lists the commands that are (or are not) impacted and would require a code change.

| Docker command | Description | Impact |
| --- | --- | --- |
| exec | run a new command in a running container | impacted, doable (using setns) |
| attach | attach to a running container | impacted, doable (using pipes or by connecting to the container's stdout by other means) |
| build | build an image from a Dockerfile | not impacted, unrelated |
| commit | create a new image from a container's changes | not impacted, related (reads the file system) |
| cp | copy files from one location to another | not impacted, related (reads the file system) |
| diff | inspect changes on a container's file system | not impacted, related (reads the file system) |
| events | global daemon events | not impacted, unrelated |
| export/save | create a tar archive | not impacted, unrelated |
| history | history of an image | not impacted, unrelated |
| images | list images | not impacted, unrelated |
| import/load | tar to image | not impacted, unrelated |
| info | daemon version, build, etc. | not impacted, unrelated |
| inspect | return the container's JSON | not impacted, related (reads JSON from the file system) |
| kill | kill the running container | impacted, doable (need to pass the signal) |
| logs | fetch the logs of a container | impacted, doable (reading from a special file/fd/pipe) |
| port | show the public-facing port NATed to the private port | not impacted, related |
| pause | pause all running processes in a container | impacted, doable (need to send SIGSTOP) |
| start/unpause | start all stopped processes in a container | impacted, doable (need to send SIGCONT) |
| ps | list all containers | not impacted, unrelated |
| pull/push | pull and push images from/to a repo | not impacted, unrelated |
| restart | restart a running container | impacted, doable (kill the old container, start afresh) |
| rm | remove the container | impacted, doable |
| rmi | remove the image | not impacted, unrelated |
| run | run a command in a new container | impacted, doable (start a new container) |
| search | search the repo | not impacted, unrelated |
| stop | stop a running container | impacted, doable (need to send SIGTERM/SIGKILL) |
| tag | tag an image | not impacted, unrelated |
| top | look up the running processes of a container | impacted, doable (using setns + top) |
| version | docker version info | not impacted, unrelated |
| wait | block until a container stops and print its exit code | impacted, doable using our external communication mode |
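
For the rows marked "doable (using setns)", the daemon (or a helper process it forks) can join the target container's namespaces before running the command. A minimal sketch using golang.org/x/sys/unix; the container PID is hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"

	"golang.org/x/sys/unix"
)

// execInContainer joins a container's namespaces and runs a command
// inside it -- roughly what exec/top need once the daemon is no
// longer the container's parent. The PID would come from the monitor.
func execInContainer(pid int, argv []string) error {
	runtime.LockOSThread() // setns(2) affects only the calling thread
	defer runtime.UnlockOSThread()

	// Join mnt last so the /proc paths below still resolve on the host.
	for _, ns := range []string{"ipc", "uts", "net", "pid", "mnt"} {
		f, err := os.Open(fmt.Sprintf("/proc/%d/ns/%s", pid, ns))
		if err != nil {
			return err
		}
		err = unix.Setns(int(f.Fd()), 0) // 0 = don't restrict the ns type
		f.Close()
		if err != nil {
			return err
		}
	}

	// Entering a PID namespace only takes effect for children, so the
	// command must be forked rather than run in-process.
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// 4242 is a hypothetical container init PID.
	if err := execInContainer(4242, []string{"ps", "aux"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```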

Limitations:

We will have the overhead of extra monitor processes (one per running container).

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/docker/docker/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "master" git@github.com:mudverma/docker.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@mudverma mudverma changed the title proposal for docker daemon's SPoF issue Proposal: docker daemon's SPoF and hot upgrade issue Jun 11, 2015
Signed-off-by: Mudit Verma <mudit.f2004912@gmail.com>
@bobrik
Contributor

bobrik commented Jun 11, 2015

I like the idea of proposal A, been thinking about it for a while. Decoupling containers from the daemon API layer would be super-awesome. Having an extra monitor process doesn't look like a big issue; it's going to be pretty lightweight anyway.

Forward upgrades / rollbacks with preserved running containers should be considered a top priority.

@cpuguy83
Member

Looked at this myself as well.
This could also just be a single monitoring process for all containers instead of one for each... and that could use the proposed (#11529) evented monitoring system.

@bobrik
Contributor

bobrik commented Jun 11, 2015

@cpuguy83 can you describe how an upgrade 1.X -> 1.Y would work with a single monitor process? I can't see how new containers could take advantage of new features this way. Will it be 1 monitor per docker version?

@mudverma
Author

@cpuguy83 @bobrik One single monitor process for all containers will not work, especially during upgrades. We thought about this option but ruled it out later. We are not sure even one monitor per docker version would work without it carrying all the daemon code required to spawn a container. Let's assume that we have two processes at a given instant:

Init ----- Docker Daemon (v1)
  |------- Monitor (v1) ----> containers (A, B, C, D)

Now, if we want to spawn a new container E, how can we make the existing monitor its parent? Transferring parenthood is not allowed. In that case, the monitor itself would have to carry all the code and runtime needed to spawn a new container, which would make it another daemon in itself. The idea of the monitor being just a lightweight process that has nothing to do with what the daemon does would be lost.

Please correct me if I am wrong.

@chenchun
Contributor

+1 for proposal A.

@thaJeztah
Member

Thanks for a very well-written proposal, and reviving this topic @mudverma

@moxiegirl
Contributor

@mudverma Really well-structured proposal, definitely going to add this as "the prototype" approach for newbies. Very interesting technically too; I'm looking forward to following it.

@aidanhs
Contributor

aidanhs commented Jun 15, 2015

Isn't the architecture the relatively easy part of this? This is a nice proposal (I like A), I'm just wondering about some of the details.

When I pondered on it, the part that always brought me up short was the compatibility layer that would need to exist for handling running containers started by an older daemon and the oddities that would arise from starting a daemon with different arguments.

Thoughts on an upgraded daemon:

  • how many versions will an upgrade work for? If you put no limits then someone somewhere will expect a 20-version jump to work.
  • the problem with the above is that you can never remove some parts of old features. Let's say links get removed - you can never remove link cleanup functionality (grep for RemoveLink) because someone may have still-linked containers running that they want to stop at some point
  • somewhat similar to the above - how do you ensure that restart policies work (e.g. recreating the link for a linked container)?

Thoughts on a daemon started with different arguments:

  • how would you represent three containers running on three different storage driver backends in docker ps?
  • relatedly, how do you indicate that although container A is running on image I1, that image is only available on a different storage driver so you'll need to pull it again?
  • also related, does volumes-from work across storage drivers?
  • what do you do if a daemon was started with icc=false, then restarted with icc=true? And vice-versa?
  • ditto for an altered --dns argument? (this one is of particular interest for me - when I move between home and the office I have to alter this)

Notes that fit into both of the above:

  • linking - are you allowed to link to containers started by a previous daemon? Does this work if you've been altering daemon arguments related to iptables etc?

This is just scratching the surface!
I very much want this, but defining the minimal featureset is probably not going to be straightforward. Part of the problem is the amount of state held globally by the daemon - e.g. if the storage driver were a per-container property, the storage driver issues listed above would already be solved.

I'd start by using the proposed architecture to resolve the SPoF problem and revisit the hot upgrade problem later, enforcing this separation by killing all containers if a different daemon version is started or any daemon args are different. You're then in no worse position for upgrades, but you are resilient to daemon crashes/deadlocks.

@aidanhs
Contributor

aidanhs commented Jun 15, 2015

Related to #5305 and #6851.

@PikachuEXE

+1 for proposal A.... and I agree with @aidanhs
(and I am not an expert here :S)

@mudverma
Author

@aidanhs I understand the situation. While it is not right to expect 20-version jumps to work, it is still worthwhile to have forward/backward compatibility for at least 3-4 versions; most software has it. Otherwise, a simple patch to resolve some issue even in the "docker search" command will also bring down the running containers. The idea is to provide more flexibility to users and sysadmins. Of course, they can decide what is in their best interest - whether to bring down the containers on upgrade or not - but at least this option should be given. Disruptive shutdown of containers should be the last resort, not the first.

On your second part, we might have to do more research on this. I will take a look and get back to you. Thanks for bringing it up.

@hustcat

hustcat commented Jun 17, 2015

+1 for proposal A

@cpuguy83
Member

Please don't +1 unless you have something meaningful to add here.
It's not a question of whether we want hot upgrades - everybody does. There are some significant technical hurdles to deal with.

@mudverma
Author

#13884 #7086 #13304

Signed-off-by: Mudit Verma <mudit.f2004912@gmail.com>
Signed-off-by: Mudit Verma <mudit.f2004912@gmail.com>
@prologic
Contributor

/cc @prologic

@unclejack
Contributor

Please keep in mind that there's no need to comment in order to subscribe to notifications for an issue. There's a button for that.

@nadgowdas

It would be interesting to see whether this can be solved in the new OCF specification with runc:
https://github.com/docker/docker/pull/14211/files#r33429661

@rhatdan
Contributor

rhatdan commented Jun 29, 2015

One possible way to do this would be to use runc and have containers register with machinectl under systemd.

#13526

Then, when the docker daemon starts up, it could query systemd/machinectl for any docker containers already running. This would plug the docker daemon much more into the normal system framework.

Since systemd is becoming the default even for Ubuntu, I think this is the best way forward, rather than creating some other kind of process manager.

@cpuguy83
Member

We can't tie this to systemd; not every system has systemd, regardless of the major distros including it... maybe it wouldn't be incredibly difficult to support systemd where it exists...

Wrapping runC with something the docker daemon can attach to might go a long way.

@rhatdan
Contributor

rhatdan commented Jun 30, 2015

Well, defining the protocol for what this connects to would be better, perhaps with an example; then we could do this the systemd way. Docker not working well with systemd is, in my opinion, a major weakness.

@jessfraz jessfraz removed the dco/no label Jul 10, 2015
@tiborvass tiborvass added the status/needs-attention Calls for a collective discussion during a review session label Jul 23, 2015
@icecrime
Contributor

Collective PR review with maintainers

@mudverma Thanks for the highly detailed proposal!

What we're seeing here is that all assumptions have to be reconsidered in light of runC. We hope Docker 1.9.0 will ship with a dependency on runC for the container runtime. That means that neither proposal A (containers as children of init) nor proposal B (containers as children of the daemon) seems to be the way forward. RunC being the parent of the container process will open new doors for "hot upgrades" of the daemon.

We definitely want that feature, and we definitely appreciate your work here! Can you please reformulate this proposal in terms of runC? Thanks 👍

@bobrik
Contributor

bobrik commented Jul 23, 2015

@icecrime I don't see how anything changes with runC. Just read "runC" as "container" and it is all the same. What did I miss?

@tiborvass
Contributor

Collective review

@duglin @calavera @LK4D4 @tonistiigi @icecrime @jfrazelle

I opened opencontainers/runc#185 to keep track of the issue.
However, I don't think we can accept this PR at this time.

Here's what we suggest for the next steps:

We would greatly appreciate your help with those steps, and sorry again for our long review.

@tiborvass tiborvass closed this Aug 6, 2015
@tiborvass tiborvass removed status/1-design-review status/needs-attention Calls for a collective discussion during a review session labels Aug 6, 2015
@mudverma
Author

mudverma commented Aug 7, 2015

@tiborvass Hi, thanks for your review.

I see that, after the introduction of runC, a lot of assumptions have to be reconsidered.
However, we as outsiders are still not clear on how the docker daemon would integrate with runC (if that is planned, as suggested by @icecrime). Can you give us some brief insight into it?

@icecrime
Contributor

icecrime commented Aug 7, 2015

@mudverma Some pieces of information here and there

@mudverma
Author

@icecrime Thanks. So it is going to be:

Daemon ----> runc ---> container A
       |---> runc ---> container B
       |---> runc ---> container C

All containers will have a separate parent process (RunC), and the Daemon would be the parent of all RunC processes?

@icecrime
Contributor

All containers will have separate parent process (RunC)

Yes!

and Daemon would be the parent of all RunC processes

I'd like to say: not necessarily (especially if we want to support use cases such as restarting the daemon and "reattaching" to the existing runC processes).
