
Proposal: docker daemon's SPoF and hot upgrade issue #13884

Closed

Conversation

mudverma

Background:

As per the current architecture of Docker, all containers are spawned as children of the docker daemon. This parent-child relationship between the docker daemon and containers provides a straightforward way of signalling/communication between them. However, this tight coupling between the containers and the daemon results in some issues which are critical for containers' up-time, stability and high availability. A few of the issues with this approach are:

  1. The daemon's death (restart, kill, or an abrupt crash in the daemon's code) causes all the running containers to die as well.
  2. An upgrade to the daemon cannot be performed without impacting the up-time/stability of containers.

Both of these issues become even more important in production environments such as a container cloud, where different containers running on a server might belong to the same or different clients and might host highly available or stateless services. In these scenarios, container downtime caused by external factors such as the daemon's death or upgrade is highly undesirable.

The second issue was raised by @shykes in 2013 and is still open: #2658

Goals:

  1. Containers should run independently of the docker daemon and should continue to function normally even if the docker daemon dies/restarts
  2. When containers die/exit, their exit status should be appropriately communicated to the docker daemon (whenever it restarts)
  3. All commands that are used to interact with containers (start/stop/kill/exec/pause, etc.) and I/O redirection (stderr/stdout) should work normally

Findings:

Based on our investigation and experimentation with Docker, we found that once started, a container can function stand-alone and does not require the daemon's presence for the execution of the encapsulated service.

We changed the daemon's code such that upon its death, containers would become orphaned and be adopted by INIT. We ran the official mysql image, and we were able to connect to and use the mysql service even when the daemon was not running and the container had become orphaned.

The namespaces (pid, network, ipc, mnt, uts) and cgroups, which are the building blocks for container creation and execution, continue to exist and function normally because they are provided by the Linux kernel. Therefore, there does not seem to be any reason for a container to stop functioning when the daemon is not present.
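
This is easy to observe from the host: each namespace appears as a symlink under /proc/&lt;pid&gt;/ns, and those links remain valid regardless of the daemon's state. A minimal Go sketch (the container PID is hypothetical):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	pid := 4242 // hypothetical PID of a container's init process
	for _, ns := range []string{"pid", "net", "ipc", "mnt", "uts"} {
		link, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/%s", pid, ns))
		if err != nil {
			fmt.Printf("%s: %v\n", ns, err)
			continue
		}
		// Prints e.g. "pid -> pid:[4026532198]"; these kernel objects
		// live for as long as the container does, daemon or no daemon.
		fmt.Printf("%s -> %s\n", ns, link)
	}
}
```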

Proposal:

In the context of our findings and our goals, we propose two alternative design models through which containers will no longer be tightly coupled with the daemon, either at all times (proposal A) or after the daemon's death (proposal B). Both of these models require an external communication/signalling mechanism between the daemon and the containers.

  • Proposal A: The docker daemon starts containers in such a way that they are children of INIT from the very beginning and are decoupled from the daemon. All running containers and the daemon are siblings in the process tree and are not impacted by the daemon's absence.
  • Proposal B: A hybrid model where newly spawned containers are children of the docker daemon, as in the current design. However, upon an abrupt daemon failure or upgrade, all running containers become orphaned and move to INIT. When the daemon restarts, it detects that there are old running containers (the current code detects and kills such containers) and manages them by external communication means (a new feature to be added). Consequently, all containers spawned by the current instance of the daemon are its descendants, while containers started by previous daemon instances are independent processes and require an external mode of communication.

Communication between containers and the daemon:

Both of these proposals require some sort of two-way communication mechanism between the daemon and the containers. For example, how would the daemon get notified when a container finishes its execution? And how would the daemon pass commands to containers? In the current design, the daemon waits on the child process (the container) in a goroutine. This can be tackled by having a dedicated parent monitor process for each container, whose job is to wait on the container and communicate with the daemon. The communication breaks down as follows (a sketch follows the list):

  • From a container to the daemon: Once the container terminates, the waiting monitor process can communicate the exit status to the docker daemon, either by connecting through a socket, or by writing the status to a special file and sending the daemon a signal (SIGRTMIN-SIGRTMAX, SIGUSR1, SIGUSR2) that tells it to read the file.
  • From the daemon to a container: Similarly, the monitor process can make the daemon aware of the container's PID, so that the daemon can interact with the container directly using signals (SIGKILL, SIGTERM, SIGSTOP, SIGCONT, etc.).
  • For I/O: Interprocess pipes can be used to redirect stdout, stderr, etc.
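
As a concrete illustration of the container-to-daemon direction, here is a minimal Go sketch of the monitor's wait-and-report logic. The socket path, wire format, and container-init path are hypothetical placeholders, not existing Docker interfaces:

```go
package main

import (
	"fmt"
	"net"
	"os/exec"
)

func main() {
	// Spawn the container's init process (path is a placeholder).
	cmd := exec.Command("/path/to/container-init")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Wait on the container; this is the role the daemon plays today.
	code := 0
	if err := cmd.Wait(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			code = exitErr.ExitCode()
		}
	}

	// Report the exit status to the daemon; the socket path and wire
	// format are assumptions for illustration only.
	conn, err := net.Dial("unix", "/var/run/docker-monitor.sock")
	if err != nil {
		// Daemon is down: per the proposal, retry later or write the
		// status to a file the next daemon instance will read.
		return
	}
	defer conn.Close()
	fmt.Fprintf(conn, "exit %d %d\n", cmd.Process.Pid, code)
}
```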

Implementation:

Proposal A:

  • The daemon first spawns a monitor process, and that monitor process spawns the container. Once the container is fully spawned, the monitor process replaces itself (via unix exec) with an ultra-lightweight program whose only job is to wait on its child process (the container).
  • The daemon then daemonizes the monitor process so that the monitor and container become orphans and are adopted by INIT.
  • The daemon interacts with the monitor process using sockets, or sends signals directly to the container process (the monitor can communicate the container's PID to the daemon) for management (stop, start, pause, kill, etc.).
  • Once the container finishes its execution, its waiting monitor process communicates the exit status to the daemon. There are two possibilities here:
    • Daemon is alive: communication is straightforward, and the monitor process can exit after supplying the status to the daemon.
    • Daemon is dead: the monitor process waits until the daemon is up, communicates the status, and exits. Alternatively, it can write the status to a file, which the new instance of the daemon reads when it comes back up.
      (screenshot: Proposal A diagram, 2015-06-11)

Proposal B:

  • The monitor and container move to INIT (become orphaned) only after the daemon dies. This would require a change in the cleanup code.
  • New instance of daemon becomes aware of old running containers and communicates with them using the techniques described in proposal A.
  • The monitor process can look at its ppid (parent PID): if it is 1, it can assume that it is orphaned and must communicate with the daemon by other means; if not, it still has the daemon as its parent and the communication flow is similar to the current design (see the sketch below).

(screenshot: Proposal B diagram, 2015-06-11)
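
The orphan check in the last bullet is cheap to implement; a minimal sketch:

```go
package main

import (
	"fmt"
	"os"
)

// orphaned reports whether this monitor has been reparented to INIT,
// i.e. the daemon that spawned it has died (the Proposal B check).
// Note: on systems with a child subreaper configured, the new parent
// may not be PID 1, so this check is a simplification.
func orphaned() bool {
	return os.Getppid() == 1
}

func main() {
	if orphaned() {
		fmt.Println("daemon gone: switch to the socket/status-file channel")
	} else {
		fmt.Println("daemon alive: use the normal parent-child flow")
	}
}
```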

Proposal A vs Proposal B:

  • Proposal A requires daemonising the monitor just after it is spawned. (Daemonising a child process is not inherently supported by golang, which docker uses, but can be worked around by using a double fork; see the sketch after this list.)
    runtime: support for daemonize golang/go#227
  • Proposal B does not require any such functionality: upon the daemon's death, all the child monitors (now orphaned) are moved to INIT along with the containers. However,
    • Proposal B is easier to code, although slightly less clean from a design perspective, as it requires separate maintenance (different code paths/checks) for the containers the daemon owns (descendants) and the containers it merely manages (siblings).
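
For reference, the double-fork workaround can be approximated in Go by re-execing the daemon's own binary as a short-lived intermediate; the monitor binary path below is hypothetical:

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

// The daemon re-execs itself as a throwaway intermediate; the
// intermediate starts the real monitor in a new session and exits
// immediately, so INIT adopts the monitor (and the container).
func main() {
	if len(os.Args) > 1 && os.Args[1] == "intermediate" {
		monitor := exec.Command("/usr/local/bin/docker-monitor") // hypothetical binary
		monitor.SysProcAttr = &syscall.SysProcAttr{Setsid: true} // detach into a new session
		if err := monitor.Start(); err != nil {
			os.Exit(1)
		}
		os.Exit(0) // intermediate dies; the monitor is orphaned to INIT
	}

	// Daemon side: spawn the intermediate and reap it immediately.
	self, err := os.Executable()
	if err != nil {
		panic(err)
	}
	cmd := exec.Command(self, "intermediate")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cmd.Wait() // reap, so the intermediate never lingers as a zombie
}
```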

Use Cases:

  • Containers' high availability
  • Hot upgrades
  • Scalability
    • Scalability is possible because a particular daemon will no longer be the owner of containers. Therefore, multiple daemons can be allowed to run on the same host, load-balancing requests coming from different channels (CLI, remote API, etc.).
    • However, we don't know whether the docker daemon is a performance bottleneck (most likely not). The daemon seems to be just a control channel/management point and does not figure in the data/critical path.
    • Nonetheless, one could run a redundant docker daemon sharing the same DB, so that even upon upgrades or failures the daemon continues to serve requests - daemon high availability.

Commands' analysis:

The following table lists the commands that are (or are not) impacted and would require a code change.

| Docker command | Description | Impact |
| --- | --- | --- |
| exec | run a new command in a running container | impacted, doable (using setns) |
| attach | attach to a running container | impacted, doable (using pipes or by connecting to the container's stdout by other means) |
| build | build an image from a Dockerfile | not impacted, unrelated |
| commit | create a new image from a container's changes | not impacted, related (reads the file system) |
| cp | copy files from one location to another | not impacted, related (reads the file system) |
| diff | inspect changes on a container's file system | not impacted, related (reads the file system) |
| events | global daemon events | not impacted, unrelated |
| export/save | create a tar archive | not impacted, unrelated |
| history | history of an image | not impacted, unrelated |
| images | list images | not impacted, unrelated |
| import/load | tar to image | not impacted, unrelated |
| info | daemon version, build, etc. | not impacted, unrelated |
| inspect | return the container's JSON | not impacted, related (reads JSON from the file system) |
| kill | kill the running container | impacted, doable (need to pass the signal) |
| logs | fetch the logs of a container | impacted, doable (reading from a special file/fd/pipe) |
| port | show the public-facing port NATed to the private port | not impacted, related |
| pause | pause all running processes in a container | impacted, doable (need to send SIGSTOP) |
| start/unpause | start all stopped processes in a container | impacted, doable (need to send SIGCONT) |
| ps | list all containers | not impacted, unrelated |
| pull/push | pull and push images from/to a repo | not impacted, unrelated |
| restart | restart a running container | impacted, doable (kill the old container, start afresh) |
| rm | remove the container | impacted, doable |
| rmi | remove the image | not impacted, unrelated |
| run | run a command in a new container | impacted, doable (start a new container) |
| search | search the repo | not impacted, unrelated |
| stop | stop a running container | impacted, doable (need to send SIGTERM/SIGKILL) |
| tag | tag an image | not impacted, unrelated |
| top | look up the running processes of a container | impacted, doable (using setns + top) |
| version | docker version info | not impacted, unrelated |
| wait | block until a container stops and print its exit code | impacted, doable using our external communication mode |
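
For the rows marked "doable (using setns)", the daemon (or a helper process it forks) can join the target container's namespaces before running the command. A minimal sketch using golang.org/x/sys/unix; the container PID is hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"

	"golang.org/x/sys/unix"
)

// execInContainer joins a container's namespaces and runs a command
// inside it -- roughly what exec/top need once the daemon is no
// longer the container's parent. The PID would come from the monitor.
func execInContainer(pid int, argv []string) error {
	runtime.LockOSThread() // setns(2) affects only the calling thread
	defer runtime.UnlockOSThread()

	// Join mnt last so the /proc paths below still resolve on the host.
	for _, ns := range []string{"ipc", "uts", "net", "pid", "mnt"} {
		f, err := os.Open(fmt.Sprintf("/proc/%d/ns/%s", pid, ns))
		if err != nil {
			return err
		}
		err = unix.Setns(int(f.Fd()), 0) // 0 = don't restrict the ns type
		f.Close()
		if err != nil {
			return err
		}
	}

	// Entering a PID namespace only takes effect for children, so the
	// command must be forked rather than run in-process.
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// 4242 is a hypothetical container init PID.
	if err := execInContainer(4242, []string{"ps", "aux"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```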

Limitations:

We will have the overhead of extra monitor processes (one per running container).

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/docker/docker/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "master" git@github.com:mudverma/docker.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@mudverma mudverma changed the title proposal for docker daemon's SPoF issue Proposal: docker daemon's SPoF and hot upgrade issue Jun 11, 2015
Signed-off-by: Mudit Verma <mudit.f2004912@gmail.com>
@bobrik
Contributor

bobrik commented Jun 11, 2015

I like the idea of proposal A, been thinking about it for a while. Decoupling containers from the daemon API layer would be super-awesome. Having an extra monitor process doesn't look like a big issue; it's going to be pretty lightweight anyway.

Forward upgrades / rollbacks with preserved running containers should be considered a top priority.

@cpuguy83
Member

Looked at this myself as well.
This could also just be a single monitoring process for all containers instead of one for each... and that could use the proposed (#11529) evented monitoring system.

@bobrik
Contributor

bobrik commented Jun 11, 2015

@cpuguy83 can you describe how an upgrade 1.X -> 1.Y would work with a single monitor process? I can't see how new containers could take advantage of new features this way. Will it be 1 monitor per docker version?

@mudverma
Author

@cpuguy83 @bobrik One single monitor process for all containers will not work, especially during upgrades. We thought about this option but ruled it out later. We are not sure even one monitor per docker version would work without it carrying all the daemon code required to spawn a container. Let's assume that we have two processes at a given instant:

Init ----- Docker Daemon (v1)
  |------- Monitor (v1) ----> containers (A, B, C, D)

Now, if we want to spawn a new container E, how can we make the existing monitor its parent? Transferring parenthood is not allowed. In that case, the monitor itself would have to carry all the code and runtime needed to spawn a new container, which would make it another daemon in itself. The idea of the monitor being just a lightweight process that has nothing to do with what the daemon does would be lost.

Please correct me if I am wrong.

@chenchun
Contributor

+1 for proposal A.

@thaJeztah
Member

Thanks for a very well-written proposal, and reviving this topic @mudverma

@moxiegirl
Contributor

@mudverma Really well-structured proposal, definitely going to add this as "the prototype" approach for newbies. Very interesting technically too; I'm looking forward to following it.

@aidanhs
Contributor

aidanhs commented Jun 15, 2015

Isn't the architecture the relatively easy part of this? This is a nice proposal (I like A), I'm just wondering about some of the details.

When I pondered on it, the part that always brought me up short was the compatibility layer that would need to exist for handling running containers started by an older daemon and the oddities that would arise from starting a daemon with different arguments.

Thoughts on an upgraded daemon:

  • how many versions will an upgrade work for? If you put no limits then someone somewhere will expect a 20-version jump to work.
  • the problem with the above is that you can never remove some parts of old features. Let's say links get removed - you can never remove link cleanup functionality (grep for RemoveLink) because someone may have still-linked containers running that they want to stop at some point
  • somewhat similar to the above - how do you ensure that restart policies work (e.g. recreating the link for a linked container)?

Thoughts on a daemon started with different arguments:

  • how would you represent three containers running on three different storage driver backends in docker ps?
  • relatedly, how do you indicate that although container A is running on image I1, that image is only available on a different storage driver so you'll need to pull it again?
  • also related, does volumes-from work across storage drivers?
  • what do you do if a daemon was started with icc=false, then restarted with icc=true? And vice-versa?
  • ditto for an altered --dns argument? (this one is of particular interest for me - when I move between home and the office I have to alter this)

Notes that fit into both of the above:

  • linking - are you allowed to link to containers started by a previous daemon? Does this work if you've been altering daemon arguments related to iptables etc?

This is just scratching the surface!
I very much want this, but defining the minimal featureset is probably not going to be straightforward. Part of the problem is the amount of state held globally by the daemon - e.g. if the storage driver were a per-container property, the storage driver issues listed above would already be solved.

I'd start by using the proposed architecture to resolve the SPoF problem and revisit the hot upgrade problem later, enforcing this separation by killing all containers if a different daemon version is started or any daemon args are different. You're then in no worse position for upgrades, but you are resilient to daemon crashes/deadlocks.

@aidanhs
Contributor

aidanhs commented Jun 15, 2015

Related to #5305 and #6851.

@PikachuEXE

+1 for proposal A.... and I agree with @aidanhs
(and I am not an expert here :S)

@mudverma
Author

@aidanhs I understand the situation. While it is not right to expect 20-version jumps to work, it is still worthwhile to have forward/backward compatibility for at least 3-4 versions; most software has it. Otherwise, a simple patch to resolve some issue even in the "docker search" command will also bring down the running containers. The idea is to provide more flexibility to users and sysadmins. Of course, they can decide what is in their best interest - whether to bring down the containers on upgrade or not - but at least this option should be given. Disruptive shutdown of containers should be the last resort, not the first.

On your second part, we might have to do more research on this. I will take a look and get back to you. Thanks for bringing it up.

@hustcat

hustcat commented Jun 17, 2015

+1 for proposal A

@cpuguy83
Member

Please don't +1 unless you have something meaningful to add here.
It's not a question of whether we want hot upgrades - everybody does. There are some significant technical hurdles to deal with.

@mudverma
Author

#13884 #7086 #13304

Signed-off-by: Mudit Verma <mudit.f2004912@gmail.com>
Signed-off-by: Mudit Verma <mudit.f2004912@gmail.com>
@prologic
Contributor

/cc @prologic

@unclejack
Contributor

Please keep in mind that there's no need to comment in order to subscribe to notifications for an issue. There's a button for that.

@nadgowdas

It would be interesting to see whether this can be solved in the new OCF specification with runc:
https://github.com/docker/docker/pull/14211/files#r33429661

@rhatdan
Contributor

rhatdan commented Jun 29, 2015

One possible way to do this would be to use runc and have containers register with machinectl under systemd.

#13526

Then, when the docker daemon starts up, it could query systemd/machinectl for any docker containers already running. This would plug the docker daemon much more into the normal system framework.

Since systemd is becoming the default even for Ubuntu, I think this is the best way forward, rather than creating some other kind of process manager.

@cpuguy83
Member

We can't tie this to systemd; not every system has systemd, regardless of the major distros including it... maybe it wouldn't be incredibly difficult to support systemd where it exists...

Wrapping runC with something the docker daemon can attach to might go a long way.

@rhatdan
Contributor

rhatdan commented Jun 30, 2015

Well, defining the protocol for what this connects to would be better, perhaps with an example; then we could do this the systemd way. Docker not working well with systemd is, in my opinion, a major weakness.

@jessfraz jessfraz removed the dco/no label Jul 10, 2015
@tiborvass tiborvass added the status/needs-attention Calls for a collective discussion during a review session label Jul 23, 2015
@icecrime
Contributor

Collective PR review with maintainers

@mudverma Thanks for the highly detailed proposal!

What we're seeing here is that all assumptions have to be reconsidered in light of runC. We hope Docker 1.9.0 will ship with a dependency on runC for the container runtime. That means that neither proposal A (containers as children of init) nor proposal B (containers as children of the daemon) seems to be the way forward. RunC being the parent of the container process will open new doors for "hot upgrades" of the daemon.

We definitely want that feature, and we definitely appreciate your work here! Can you please reformulate this proposal in terms of runC? Thanks 👍

@bobrik
Contributor

bobrik commented Jul 23, 2015

@icecrime I don't see how anything changes with runC. Just read "runC" as "container" and it is all the same. What did I miss?

@tiborvass
Contributor

Collective review

@duglin @calavera @LK4D4 @tonistiigi @icecrime @jfrazelle

I opened opencontainers/runc#185 to keep track of the issue.
However, I don't think we can accept this PR at this time.

Here's what we suggest for the next steps:

We would greatly appreciate your help with those steps, and sorry again for our long review.

@tiborvass tiborvass closed this Aug 6, 2015
@tiborvass tiborvass removed status/1-design-review status/needs-attention Calls for a collective discussion during a review session labels Aug 6, 2015
@mudverma
Author

mudverma commented Aug 7, 2015

@tiborvass Hi, thanks for your review.

I see that, after the introduction of runC, a lot of assumptions have to be reconsidered.
However, we as outsiders are still not clear on how the docker daemon would integrate with runC (if that is planned, as suggested by @icecrime). Can you give us some brief insight into it?

@icecrime
Contributor

icecrime commented Aug 7, 2015

@mudverma Some pieces of information here and there

@mudverma
Author

@icecrime Thanks. So it is going to be:

Daemon ----> runc ---> container A
       |---> runc ---> container B
       |---> runc ---> container C

All containers will have a separate parent process (RunC), and the Daemon would be the parent of all RunC processes?

@icecrime
Contributor

All containers will have separate parent process (RunC)

Yes!

and Daemon would be the parent of all RunC processes

I'd like to say: not necessarily (especially if we want to support use cases such as restarting the daemon and "reattaching" to the existing runC processes).
