
Memory usage of containerd-shim #21737

Closed
ibuildthecloud opened this issue Apr 3, 2016 · 32 comments

Comments

@ibuildthecloud
Contributor

This is a question and a concern. With every container, a containerd-shim process is started. That process seems to consume between 3 and 5 MB (excluding shared memory). Doing some quick, not-so-scientific tests, I have seen the memory usage of my system grow by ~400 MB more than on 1.10.3 when launching 100 nginx containers. This was a quick, cursory test, so before I dig much deeper into performance testing on 1.11, I wanted to ask the maintainers how they envision this working going forward. I also have a concern about VM overcommit, because each containerd-shim process consumes over 200 MB of virtual memory.

I honestly don't see how this approach will scale with this intermediate process. With the current usage pattern, if I run 500 containers (a very common use case), I will be using 2 GB of memory just for Docker. Am I missing something obvious here? This doesn't seem good.
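
For anyone reproducing these numbers, one way to get a per-shim figure that excludes shared pages is to sum the Pss entries from /proc/<pid>/smaps. A rough sketch (matching processes by the `containerd-shim` comm name is an assumption, and Pss prorating is only an approximation of "excluding shared memory"):

```go
// Sum the proportional set size (Pss) of every containerd-shim process,
// so pages shared between shims are prorated rather than double-counted.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pssKB returns the total Pss of a process in kB, read from /proc/<pid>/smaps.
func pssKB(pid string) int64 {
	f, err := os.Open(filepath.Join("/proc", pid, "smaps"))
	if err != nil {
		return 0
	}
	defer f.Close()

	var total int64
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Lines look like: "Pss:                 123 kB"
		if strings.HasPrefix(s.Text(), "Pss:") {
			fields := strings.Fields(s.Text())
			if len(fields) >= 2 {
				kb, _ := strconv.ParseInt(fields[1], 10, 64)
				total += kb
			}
		}
	}
	return total
}

func main() {
	entries, _ := os.ReadDir("/proc")
	var sum int64
	for _, e := range entries {
		if _, err := strconv.Atoi(e.Name()); err != nil {
			continue // not a PID directory
		}
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil || strings.TrimSpace(string(comm)) != "containerd-shim" {
			continue
		}
		kb := pssKB(e.Name())
		fmt.Printf("pid %s: %d kB\n", e.Name(), kb)
		sum += kb
	}
	fmt.Printf("total: %d kB across all shims\n", sum)
}
```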

@icecrime icecrime added this to the 1.11.0 milestone Apr 3, 2016
@cyphar
Contributor

cyphar commented Apr 4, 2016

It should be noted that the same issue exists for Docker versions <1.11. If you start 1000 containers on Docker 1.9.1, the Docker daemon will be killed because of overcommit. We recently found out about this and have been working on a solution that doesn't require implementing unimplemented features in the Go runtime. There are two issues here:

  1. The Go runtime doesn't free memory back to the operating system in the traditional sense of the word. On Linux it uses madvise(MADV_DONTNEED), which is bad for several reasons; the gist is that the pages are never actually unmapped and simply remain as over-allocated address space (see the sketch after this list).

  2. I believe there's some kind of resource leak in the Go runtime, because even given (1) it doesn't explain why stopping all the containers and starting new ones causes the memory to keep increasing. I played around with modifying the runtime, and it looks like sysUnused pages aren't actually reused (this is a bug) because if I switch madvise for munmap the program doesn't crash (and it really should).
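
Here's a tiny standalone demo of point (1); nothing Docker-specific, just the runtime behaviour (numbers will vary by Go version and machine):

```go
// After dropping a large allocation and forcing memory back to the OS,
// HeapReleased grows but Sys (the address space obtained from the OS)
// does not shrink, because the runtime releases pages with madvise
// instead of unmapping them.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func stats(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-12s Sys=%4d MiB  HeapInuse=%4d MiB  HeapReleased=%4d MiB\n",
		label, m.Sys>>20, m.HeapInuse>>20, m.HeapReleased>>20)
}

func main() {
	stats("start")

	buf := make([]byte, 512<<20) // allocate and touch ~512 MiB
	for i := range buf {
		buf[i] = 1
	}
	stats("allocated")

	buf = nil // drop the reference and push the heap back to the OS
	runtime.GC()
	debug.FreeOSMemory()
	stats("after free")
}
```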

So this isn't an issue that's unique to Docker 1.11. I actually thought that having containerd-shim would reduce the problem. Clearly it doesn't, so we should either bring this up with the Go community or figure out some way around this limitation of the Go memory allocator.

@ibuildthecloud
Contributor Author

@cyphar Yes, I only care about overcommit because I see issues with it often on <= 1.10. I'm not sure whether containerd-shim makes it worse or better, and that was one of my questions. My guess is that ultimately it makes it worse. The problem with overcommit usually manifests itself in forking iptables, not a container. So containerd doesn't help there, and containerd-shim now consumes even more memory than before. The reason the Docker daemon climbs to >1 GB of memory usage does not seem to be related to the number of containers, but to some other kind of bug.

@cyphar
Contributor

cyphar commented Apr 4, 2016

I'm of the opinion this is a Go runtime issue (there's no resource leak visible in pprof, and the "leak" is really unused heap space that was freed with madvise). However, I noticed in my testing of 1.10.3 that the overcommitted memory stops growing for me after 8 GB (which is the amount of physical memory I have on my test box). Can you see something similar in your case, where the overcommitted overhead stops rising after a while?

The problem with overcommit usually manifests itself in forking iptables, not a container.

That's very interesting, I didn't know that. To be clear, you're referring to forking the iptables processes? Is there a chance we could switch to using syscalls (I never liked the way we do iptables stuff)? Or is this something we can't really avoid and the Go runtime needs to fix (that forking a process makes such a huge dent in the overcommitted memory)?

@ibuildthecloud
Contributor Author

AFAIK not calling iptables isn't feasible because the netlink protocol is not stable and the CLI is considered the stable interface. I don't think there is anything special about iptables; it's just that once memory gets high enough you can't fork any process, and iptables gets forked before you can launch a container.

I don't really have a reproducible use case for memory usage. If I had to guess I would say a lot of container logging has a tendency to grow the virtual size.

@crosbymichael
Contributor

I have some optimizations that I'm working on, this being one:

https://github.com/docker/containerd/pull/184

The end goal is to have the shim written in C.

@ibuildthecloud
Contributor Author

@crosbymichael What is the purpose of the shim? Why can't containerd be the parent?

@ibuildthecloud
Contributor Author

@crosbymichael you should write it in OCaml or Rust. C would be far too practical.

@crosbymichael
Contributor

@ibuildthecloud The main purpose is reattach. It's not enabled in this release, but with the shim being the parent and keeping hold of the FIFOs and the pty master, both docker and containerd can die and your containers keep running. When docker/containerd come back up, they can reattach to the container, get exit events, and attach to stdio.
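
For anyone trying to picture that, a heavily simplified, hypothetical sketch of the idea is below; the real shim does much more (pty handling, exit reporting, a control socket), and the FIFO path and runc invocation here are placeholders:

```go
// A minimal "shim-like" parent: it outlives docker/containerd, keeps the
// container's stdio FIFO open so readers can detach and reattach, and
// reaps the container to capture its exit status.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Opening the FIFO O_RDWR keeps it usable even while no reader
	// (docker/containerd) is currently attached.
	stdout, err := os.OpenFile("/run/example-shim/stdout.fifo", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer stdout.Close()

	cmd := exec.Command("runc", "run", "example-container") // placeholder invocation
	cmd.Stdout = stdout
	cmd.Stderr = stdout
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Because this process is the container's direct parent, it reaps the
	// child and can hand the exit status to whichever daemon reattaches.
	_ = cmd.Wait()
	fmt.Fprintf(os.Stderr, "container exited with status %d\n", cmd.ProcessState.ExitCode())
}
```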

@cyphar
Contributor

cyphar commented Apr 5, 2016

There are ways we can improve iptables handling. For example, we can generate an iptables-restore payload and then pipe it through to iptables-restore. That way, we only need to fork once rather than once for every rule we add. In addition, iptables-restore is atomic, which our current setup isn't. I can start working on a PR for this if you want.
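
A rough sketch of the single-fork idea (the chain and rules below are made up for illustration, not Docker's actual rules):

```go
// Build one iptables-restore payload and pipe it in, instead of forking
// iptables once per rule.
package main

import (
	"log"
	"os/exec"
	"strings"
)

func main() {
	payload := strings.Join([]string{
		"*filter",
		":EXAMPLE-CHAIN - [0:0]",
		"-A EXAMPLE-CHAIN -s 172.17.0.2/32 -p tcp --dport 80 -j ACCEPT",
		"-A EXAMPLE-CHAIN -j RETURN",
		"COMMIT",
		"",
	}, "\n")

	// --noflush keeps rules in chains we don't mention; the chains we do
	// declare are replaced, and the table is applied in one atomic step.
	cmd := exec.Command("iptables-restore", "--noflush")
	cmd.Stdin = strings.NewReader(payload)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("iptables-restore failed: %v: %s", err, out)
	}
}
```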

As for @crosbymichael's optimisations, I'm not sure whether that actually decreases the overcommit footprint. It looks like it just decreases the binary size. Or am I misunderstanding something?

@cyphar
Contributor

cyphar commented Apr 5, 2016

@crosbymichael As for your second point:

The main purpose is reattach. It's not enabled in this release, but with the shim being the parent and keeping hold of the FIFOs and the pty master, both docker and containerd can die and your containers keep running. When docker/containerd come back up, they can reattach to the container, get exit events, and attach to stdio.

That sort of begs the question why containerd was necessary at all. Surely that could've been implemented purely within Docker.

@ibuildthecloud
Contributor Author

@crosbymichael The two levels seem like overkill then. Why not just embed containerd in Docker?

@ibuildthecloud
Contributor Author

@crosbymichael I do like the idea of having mini olaf rain clouds floating around with containers. That is cool. If the shim was in C then the overhead really should be like 2k.

@mrjana
Contributor

mrjana commented Apr 6, 2016

AFAIK not calling iptables isn't feasible because the netlink protocol is not stable and the CLI is considered the stable interface.

Yes. FWIW, iptables doesn't use netlink until kernel 3.14 (when the nftables API became the core of how iptables is implemented), so programmatic access is not available in all kernels. On older kernels the kernel API is fairly undocumented, and even though there is something called libiptc, it is not something that should be your first choice for manipulating iptables.

@mrjana
Contributor

mrjana commented Apr 6, 2016

There are ways we can improve iptables handling. For example, we can generate an iptables-restore payload and then pipe it through to iptables-restore. That way, we only need to fork once rather than once for every rule we add. In addition, iptables-restore is atomic, which our current setup isn't. I can start working on a PR for this if you want.

We have been thinking about this (using iptables-restore) for some time. The one thing we want to do first is clean up Docker's iptables rules into a fairly organized set of chains and reduce the work done in the built-in chains. This also ensures we can maintain the ordering that various use-case scenarios require. Finally, wrap all of this into an iptables-restore based API.

@cyphar
Contributor

cyphar commented Apr 6, 2016

@ibuildthecloud I did some testing with Docker 1.10.3 (before containerd, because the architecture is simpler). Running the daemon with --iptables=false and the containers with --net=none (which should cause libnetwork and thus the iptables stuff to not be used at all), the overcommitted address space doesn't grow any more slowly. It still had ~6GB of overcommitted memory after starting ~500 containers.

@dqminh
Contributor

dqminh commented Apr 6, 2016

The end goal is to have the shim written in C

+1. The shim is really not that big.

That sort of begs the question why containerd was necessary at all. Surely that could've been implemented purely within Docker.

I think that separating it also facilitates more independent updates (Docker is a lot more than just running bundles, after all :p).

.. Finally, wrap all of this into an iptables-restore based API.

This is interesting. Can you explain how such an iptables-restore API would work and/or interact with existing iptables rules and other iptables manipulators on the host?

@cyphar
Contributor

cyphar commented Apr 6, 2016

That sort of begs the question why containerd was necessary at all. Surely that could've been implemented purely within Docker.

I think that separating it also facilitates more independent updates (Docker is a lot more than just running bundles, after all :p).

Right, but if the end goal is to have containers completely separated from either daemon (meaning that updating a daemon won't kill the containers), what is the point of "independent" updates, given that there isn't any other overwhelming "independence" issue AFAICS? Surely if this functionality were embedded inside Docker, it would also make updates of the Docker daemon independent of container lifetimes? And it wouldn't require packaging an extra daemon that is very intertwined with the inner workings of Docker (mountpoints are very fiddly between the two daemons).

I'm going to be honest, I never understood the benefits of splitting the daemon into two separate daemons. So I'm probably biased in this discussion, but that's because I don't see any overwhelming benefit and do see the potential for a lot of pain.

Also, I really want to know how many bugs are going to come up from the fact that there's a lot of mountpoint cleanup we must not do if containerd is to stand on its own.

@icecrime
Contributor

icecrime commented Apr 7, 2016

We did the best we could in 1.11 with the Go shim. I'm keeping the issue open and rolling it to 1.12 for when we port the shim to C.

@icecrime icecrime modified the milestones: 1.12.0, 1.11.0 Apr 7, 2016
@icecrime icecrime changed the title [1.11-rc3] Memory usage of containerd-shim Memory usage of containerd-shim Apr 7, 2016
@jetheurer

jetheurer commented Apr 18, 2016

I think I'm having a similar problem. I'm running a Flask API served by nginx and uWSGI. The containerd-shim processes use a lot of virtual memory and ultimately crash the machine.
[screenshot: process list showing containerd-shim virtual memory usage, Apr 18 2016]

Can I solve this problem by downgrading docker?

uname -a:
Linux ssrs-01 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

docker info:

Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 52
Server Version: 1.11.0
Storage Driver: devicemapper
 Pool Name: docker-202:1-1193092-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.746 GB
 Data Space Total: 107.4 GB
 Data Space Available: 17.99 GB
 Metadata Space Used: 4.78 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.143 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-85-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859 GiB
ID: FYL6:52VJ:K2HQ:TCA4:MT5F:QJRH:NSWB:JCXI:ENPV:OL2G:IH2F:3DDU
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

docker version:

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

@cpuguy83
Member

@jetheurer Why do you think the shim is the cause of the machine crashing?

@jetheurer

@cpuguy83 I thought the shim might be the issue because it's the root process. Is there a way to diagnose the issue?

@cpuguy83
Member

@jetheurer This is probably not the best place since it is not related to the topic of this issue.
If you think the issue is Docker related, please open a separate issue and provide some details, especially logs about the crash (what exactly is crashing?)

@cyphar
Contributor

cyphar commented May 25, 2016

This issue is a very big problem for us, and as far as I can tell it's a bug in the Go runtime (it doesn't fully deallocate memory; it just uses madvise(MADV_FREE) to clear the pages). This causes problems because not everyone can enable unlimited memory overcommit to get around this issue in the Go runtime.

Unfortunately, some of my interactions with the Go community have been met with "this isn't an issue because there's no limit on address space" -- missing the whole point of the issue.

So, is there a way we can make this a big enough issue for the Go community (like we did with ForkExec) to get it fixed? This poses a serious density problem for Docker (you can't run more than a few hundred containers even though you could in principle run thousands -- not to mention that stopped containers also contribute to the memory overhead).

WDYT?

/cc @cpuguy83 @crosbymichael

@mrunalp
Contributor

mrunalp commented May 26, 2016

@cyphar Could you point to issues you have created in Go so we can help push for fixes there?

@cyphar
Contributor

cyphar commented May 26, 2016

I haven't opened an issue yet; I've just had some discussion with upstream here: https://groups.google.com/forum/#!topic/golang-dev/zqFt5oVcTCY.

@ibuildthecloud
Contributor Author

FWIW, I'm working on a shimless version of containerd: https://github.com/docker/containerd/pull/227
Still working out some small kinks, but so far it's been good.

@frol

frol commented Feb 9, 2018

I beg your pardon, but has there been any progress on this issue?

I have a project which needs to run thousands of tiny processes across hundreds of containers per minute. I discovered the overhead of docker run in the first days of prototyping, so I went with docker start (via direct API calls) with some batching. Yet even then, some batches ended up spending 90% of their total execution time in docker start, so I went on to experiment with preallocated containers plus docker exec. While the execution overhead is finally OK (0.045s for docker exec, whereas docker start can take around 0.9s to boot on the same machine, and docker run gets up to 2s), the memory footprint of docker-containerd-shim is far beyond usable for this use case (~10 MB per container means my project would require 2 GB of RAM for 200 "preallocated" containers, even though the minimal entrypoint consumes only 64 KB). Any ideas for optimization?

@cpuguy83
Member

cpuguy83 commented Feb 9, 2018

Shim memory usage is much better now (compared to the previous 1.0-based shims).

[screenshot: shim memory usage figures]

This is still from containerd 1.0.1; there are more patches in 1.0.2 that further improve shim memory usage.

@frol If you are looking for raw runtime performance, using containerd directly may be better, although containerd doesn't set up networking for you, which accounts for a large amount of the time that moby takes (and moby has other inefficiencies that need to be addressed as well).
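
For reference, driving containerd directly with its Go client looks roughly like this (a sketch along the lines of the containerd getting-started docs; the namespace, image ref, and IDs are placeholders, and note that no networking is configured):

```go
// Pull an image, create a container and a task, and start it via the
// containerd client API (containerd 1.x).
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "example")

	image, err := client.Pull(ctx, "docker.io/library/nginx:alpine", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	container, err := client.NewContainer(ctx, "nginx-1",
		containerd.WithNewSnapshot("nginx-1-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}
	log.Printf("started task with pid %d", task.Pid())
}
```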

@frol

frol commented Feb 9, 2018

@cpuguy83 Thank you for the suggestion. I will consider using containerd directly.

@cpuguy83
Member

cpuguy83 commented Feb 9, 2018

FYI, in containerd create+start is ~250ms, and stop+delete adds another ~250ms.

@cyphar
Contributor

cyphar commented Feb 10, 2018

@frol Another option would be to use runc directly, which is even faster.

@atline

atline commented Oct 9, 2019

So, is there any official guidance on how many containers are recommended per 1 GB of memory, given that each shim costs around 5 MB?
