
Memory usage of containerd-shim #21737

Closed
ibuildthecloud opened this issue Apr 3, 2016 · 32 comments

Comments

@ibuildthecloud
Contributor

This is a question and a concern. With every container, a containerd-shim process is started. That process seems to consume between 3 and 5 MB (excluding shared memory). Doing some quick, not-so-scientific tests, I have seen the memory usage of my system grow by ~400 MB more than on 1.10.3 when launching 100 nginx containers. This was a quick, cursory test, so before I dig much deeper into performance testing on 1.11, I wanted to ask the maintainers how they envision this working going forward. I also have a concern about VM overcommit, because each containerd-shim process consumes over 200 MB of virtual memory.

I honestly don't see how this approach will scale with this intermediate process. With the current usage pattern, if I run 500 containers (a very common use case), I will be using 2 GB of memory just for Docker. Am I missing something obvious here? This doesn't seem good.
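
For anyone reproducing these numbers, one way to get a per-shim figure that excludes shared pages is to sum the Pss entries from /proc/<pid>/smaps. A rough sketch (matching processes by the `containerd-shim` comm name is an assumption, and Pss prorating is only an approximation of "excluding shared memory"):

```go
// Sum the proportional set size (Pss) of every containerd-shim process,
// so pages shared between shims are prorated rather than double-counted.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pssKB returns the total Pss of a process in kB, read from /proc/<pid>/smaps.
func pssKB(pid string) int64 {
	f, err := os.Open(filepath.Join("/proc", pid, "smaps"))
	if err != nil {
		return 0
	}
	defer f.Close()

	var total int64
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Lines look like: "Pss:                 123 kB"
		if strings.HasPrefix(s.Text(), "Pss:") {
			fields := strings.Fields(s.Text())
			if len(fields) >= 2 {
				kb, _ := strconv.ParseInt(fields[1], 10, 64)
				total += kb
			}
		}
	}
	return total
}

func main() {
	entries, _ := os.ReadDir("/proc")
	var sum int64
	for _, e := range entries {
		if _, err := strconv.Atoi(e.Name()); err != nil {
			continue // not a PID directory
		}
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil || strings.TrimSpace(string(comm)) != "containerd-shim" {
			continue
		}
		kb := pssKB(e.Name())
		fmt.Printf("pid %s: %d kB\n", e.Name(), kb)
		sum += kb
	}
	fmt.Printf("total: %d kB across all shims\n", sum)
}
```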

@icecrime icecrime added this to the 1.11.0 milestone Apr 3, 2016
@cyphar
Contributor

cyphar commented Apr 4, 2016

It should be noted that the same issue exists for Docker versions <1.11. If you start 1000 containers on Docker 1.9.1, the Docker daemon will be killed because of overcommit. We recently found out about this and have been working on a solution that doesn't require implementing unimplemented features in the Go runtime. There are two issues here:

  1. The Go runtime doesn't free memory back to the operating system in the traditional sense of the word. On Linux it uses madvise(MADV_DONTNEED), which is bad for several reasons; the gist is that the pages are never actually unmapped and simply remain as over-allocated address space (see the sketch after this list).

  2. I believe there's some kind of resource leak in the Go runtime, because even given (1) it doesn't explain why stopping all the containers and starting new ones causes the memory to keep increasing. I played around with modifying the runtime, and it looks like sysUnused pages aren't actually reused (this is a bug) because if I switch madvise for munmap the program doesn't crash (and it really should).
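
Here's a tiny standalone demo of point (1); nothing Docker-specific, just the runtime behaviour (numbers will vary by Go version and machine):

```go
// After dropping a large allocation and forcing memory back to the OS,
// HeapReleased grows but Sys (the address space obtained from the OS)
// does not shrink, because the runtime releases pages with madvise
// instead of unmapping them.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func stats(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-12s Sys=%4d MiB  HeapInuse=%4d MiB  HeapReleased=%4d MiB\n",
		label, m.Sys>>20, m.HeapInuse>>20, m.HeapReleased>>20)
}

func main() {
	stats("start")

	buf := make([]byte, 512<<20) // allocate and touch ~512 MiB
	for i := range buf {
		buf[i] = 1
	}
	stats("allocated")

	buf = nil // drop the reference and push the heap back to the OS
	runtime.GC()
	debug.FreeOSMemory()
	stats("after free")
}
```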

So this isn't an issue that's unique to Docker 1.11. I actually thought that having containerd-shim would reduce the problem. Clearly it doesn't, so we should either bring this up with the Go community or figure out some way around this limitation of the Go memory allocator.

@ibuildthecloud
Contributor Author

@cyphar Yes, I only care about overcommit because I see issues with it often on <= 1.10. I'm not sure whether containerd-shim makes it worse or better, and that was one of my questions. My guess is that ultimately it makes it worse. The problem with overcommit usually manifests itself in forking iptables, not a container. So containerd doesn't help there, and containerd-shim now consumes even more memory than before. The reason the Docker daemon climbs to >1 GB of memory usage does not seem to be related to the number of containers, but to some other kind of bug.

@cyphar
Contributor

cyphar commented Apr 4, 2016

I'm of the opinion this is a Go runtime issue (there's no resource leak visible in pprof, and the "leak" is really unused heap space that was freed with madvise). However, I noticed in my testing of 1.10.3 that the overcommitted memory stops growing for me after 8 GB (which is the amount of physical memory I have on my test box). Can you see something similar in your case, where the overcommitted overhead stops rising after a while?

The problem with overcommit usually manifests itself in forking iptables, not a container.

That's very interesting, I didn't know that. To be clear, you're referring to forking the iptables processes? Is there a chance we could switch to using syscalls (I never liked the way we do iptables stuff)? Or is this something we can't really avoid and the Go runtime needs to fix (that forking a process makes such a huge dent in the overcommitted memory)?

@ibuildthecloud
Contributor Author

AFAIK not calling iptables isn't feasible because the netlink protocol is not stable and the CLI is considered the stable interface. I don't think there is anything special about iptables; it's just that once memory gets high enough you can't fork any process, and iptables gets forked before you can launch a container.

I don't really have a reproducible use case for memory usage. If I had to guess I would say a lot of container logging has a tendency to grow the virtual size.

@crosbymichael
Contributor

I have some optimizations that I'm working on, this being one:

https://github.com/docker/containerd/pull/184

The end goal is to have the shim written in C.

@ibuildthecloud
Contributor Author

@crosbymichael What is the purpose of the shim? Why can't containerd be the parent?

@ibuildthecloud
Contributor Author

@crosbymichael you should write it in OCaml or Rust. C would be far too practical.

@crosbymichael
Contributor

@ibuildthecloud The main purpose is reattach. It's not enabled in this release, but with the shim being the parent and keeping hold of the FIFOs and the pty master, both docker and containerd can die and your containers keep running. When docker/containerd come back up, they can reattach to the container, get exit events, and attach to stdio.
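
For anyone trying to picture that, a heavily simplified, hypothetical sketch of the idea is below; the real shim does much more (pty handling, exit reporting, a control socket), and the FIFO path and runc invocation here are placeholders:

```go
// A minimal "shim-like" parent: it outlives docker/containerd, keeps the
// container's stdio FIFO open so readers can detach and reattach, and
// reaps the container to capture its exit status.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Opening the FIFO O_RDWR keeps it usable even while no reader
	// (docker/containerd) is currently attached.
	stdout, err := os.OpenFile("/run/example-shim/stdout.fifo", os.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer stdout.Close()

	cmd := exec.Command("runc", "run", "example-container") // placeholder invocation
	cmd.Stdout = stdout
	cmd.Stderr = stdout
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Because this process is the container's direct parent, it reaps the
	// child and can hand the exit status to whichever daemon reattaches.
	_ = cmd.Wait()
	fmt.Fprintf(os.Stderr, "container exited with status %d\n", cmd.ProcessState.ExitCode())
}
```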

@cyphar
Contributor

cyphar commented Apr 5, 2016

There are ways we can improve iptables handling. For example, we can generate an iptables-restore payload and then pipe it through to iptables-restore. That way, we only need to fork once rather than once for every rule we add. In addition, iptables-restore is atomic, which our current setup isn't. I can start working on a PR for this if you want.
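
A rough sketch of the single-fork idea (the chain and rules below are made up for illustration, not Docker's actual rules):

```go
// Build one iptables-restore payload and pipe it in, instead of forking
// iptables once per rule.
package main

import (
	"log"
	"os/exec"
	"strings"
)

func main() {
	payload := strings.Join([]string{
		"*filter",
		":EXAMPLE-CHAIN - [0:0]",
		"-A EXAMPLE-CHAIN -s 172.17.0.2/32 -p tcp --dport 80 -j ACCEPT",
		"-A EXAMPLE-CHAIN -j RETURN",
		"COMMIT",
		"",
	}, "\n")

	// --noflush keeps rules in chains we don't mention; the chains we do
	// declare are replaced, and the table is applied in one atomic step.
	cmd := exec.Command("iptables-restore", "--noflush")
	cmd.Stdin = strings.NewReader(payload)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("iptables-restore failed: %v: %s", err, out)
	}
}
```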

As for @crosbymichael's optimisations, I'm not sure whether that actually decreases the overcommit footprint. It looks like it just decreases the binary size. Or am I misunderstanding something?

@cyphar
Contributor

cyphar commented Apr 5, 2016

@crosbymichael As for your second point:

The main purpose is reattach. It's not enabled in this release, but with the shim being the parent and keeping hold of the FIFOs and the pty master, both docker and containerd can die and your containers keep running. When docker/containerd come back up, they can reattach to the container, get exit events, and attach to stdio.

That sort of begs the question why containerd was necessary at all. Surely that could've been implemented purely within Docker.

@ibuildthecloud
Contributor Author

@crosbymichael The two levels seem like overkill then. Why not just embed containerd in Docker?

@ibuildthecloud
Contributor Author

@crosbymichael I do like the idea of having mini olaf rain clouds floating around with containers. That is cool. If the shim was in C then the overhead really should be like 2k.

@mrjana
Contributor

mrjana commented Apr 6, 2016

AFAIK not calling iptables isn't feasible because the netlink protocol is not stable and the CLI is considered the stable interface.

Yes. FWIW, iptables doesn't use netlink until kernel 3.14 (when the nftables API became the core of how iptables is implemented), so programmatic access is not available in all kernels. On older kernels the kernel API is fairly undocumented, and even though there is something called libiptc, it is not something that should be your first choice for manipulating iptables.

@mrjana
Contributor

mrjana commented Apr 6, 2016

There are ways we can improve iptables handling. For example, we can generate an iptables-restore payload and then pipe it through to iptables-restore. That way, we only need to fork once rather than once for every rule we add. In addition, iptables-restore is atomic, which our current setup isn't. I can start working on a PR for this if you want.

We have been thinking about this (using iptables-restore) for some time. The one thing we want to do first is clean up Docker's iptables rules into a fairly organized set of chains and reduce the work done in the built-in chains. This also ensures we can maintain the ordering that various use-case scenarios require. Finally, wrap all of this into an iptables-restore based API.

@cyphar
Contributor

cyphar commented Apr 6, 2016

@ibuildthecloud I did some testing with Docker 1.10.3 (before containerd, because the architecture is simpler). Running the daemon with --iptables=false and the containers with --net=none (which should cause libnetwork and thus the iptables stuff to not be used at all), the overcommitted address space doesn't grow any more slowly. It still had ~6GB of overcommitted memory after starting ~500 containers.

@dqminh
Contributor

dqminh commented Apr 6, 2016

The end goal is to have the shim written in C

+1. The shim is really not that big.

That sort of begs the question why containerd was necessary at all. Surely that could've been implemented purely within Docker.

I think that separating it also facilitates more independent updates (Docker is a lot more than just running bundles, after all :p).

.. Finally, wrap all of this into an iptables-restore based API.

This is interesting. Can you explain how such an iptables-restore API would work and/or interact with existing iptables rules and other iptables manipulators on the host?

@cyphar
Contributor

cyphar commented Apr 6, 2016

That sort of begs the question why containerd was necessary at all. Surely that could've been implemented purely within Docker.

I think that separating it also facilitates more independent updates (Docker is a lot more than just running bundles, after all :p).

Right, but if the end goal is to have containers completely separated from either daemon (meaning that updating a daemon won't kill the containers), what is the point of "independent" updates, given that there isn't any other overwhelming "independence" issue AFAICS? Surely if this functionality were embedded inside Docker, it would also make updates of the Docker daemon independent of container lifetimes? And it wouldn't require packaging an extra daemon that is very intertwined with the inner workings of Docker (mountpoints are very fiddly between the two daemons).

I'm going to be honest, I never understood the benefits of splitting the daemon into two separate daemons. So I'm probably biased in this discussion, but that's because I don't see any overwhelming benefit and do see the potential for a lot of pain.

Also, I really want to know how many bugs are going to come up from the fact that there's a lot of mountpoint cleanup we must not do if containerd is to stand on its own.

@icecrime
Contributor

icecrime commented Apr 7, 2016

We did the best we could in 1.11 with the Go shim. I'm keeping the issue open and rolling it to 1.12 for when we port the shim to C.

@icecrime icecrime modified the milestones: 1.12.0, 1.11.0 Apr 7, 2016
@icecrime icecrime changed the title [1.11-rc3] Memory usage of containerd-shim Memory usage of containerd-shim Apr 7, 2016
@jetheurer

jetheurer commented Apr 18, 2016

I think I'm having a similar problem. I'm running a Flask API served by nginx and uWSGI. The containerd-shim processes use a lot of virtual memory and ultimately crash the machine.
[screenshot: process list showing containerd-shim virtual memory usage, Apr 18 2016]

Can I solve this problem by downgrading docker?

uname -a:
Linux ssrs-01 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

docker info:

Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 52
Server Version: 1.11.0
Storage Driver: devicemapper
 Pool Name: docker-202:1-1193092-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.746 GB
 Data Space Total: 107.4 GB
 Data Space Available: 17.99 GB
 Metadata Space Used: 4.78 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.143 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-85-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859 GiB
ID: FYL6:52VJ:K2HQ:TCA4:MT5F:QJRH:NSWB:JCXI:ENPV:OL2G:IH2F:3DDU
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

docker version:

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

@cpuguy83
Member

@jetheurer Why do you think the shim is the cause of the machine crashing?

@jetheurer

@cpuguy83 I thought the shim might be the issue because it's the root process. Is there a way to diagnose the issue?

@cpuguy83
Member

@jetheurer This is probably not the best place since it is not related to the topic of this issue.
If you think the issue is Docker related, please open a separate issue and provide some details, especially logs about the crash (what exactly is crashing?)

@cyphar
Contributor

cyphar commented May 25, 2016

This issue is a very big problem for us, and as far as I can tell it's a bug in the Go runtime (it doesn't fully deallocate memory; it just uses madvise(MADV_FREE) to clear the pages). This causes problems because not everyone can enable unlimited memory overcommit to get around this issue in the Go runtime.

Unfortunately, some of my interactions with the Go community have been met with "this isn't an issue because there's no limit on address space" -- missing the whole point of the issue.

So, is there a way we can make this a big enough issue for the Go community (like we did with ForkExec) to get it fixed? This poses a serious density problem for Docker (you can't run more than a few hundred containers even though you could in principle run thousands -- not to mention that stopped containers also contribute to the memory overhead).

WDYT?

/cc @cpuguy83 @crosbymichael

@mrunalp
Contributor

mrunalp commented May 26, 2016

@cyphar Could you point to issues you have created in Go so we can help push for fixes there?

@cyphar
Contributor

cyphar commented May 26, 2016

I haven't opened an issue yet; I've just had some discussion with upstream here: https://groups.google.com/forum/#!topic/golang-dev/zqFt5oVcTCY.

@ibuildthecloud
Contributor Author

FWIW, I'm working on a shimless version of containerd: https://github.com/docker/containerd/pull/227
Still working out some small kinks, but so far it's been good.

@frol

frol commented Feb 9, 2018

I beg your pardon, but has there been any progress on this issue?

I have a project which needs to run thousands of tiny processes across hundreds of containers per minute. I discovered the overhead of docker run in the first days of prototyping, so I went with docker start (via direct API calls) with some batching. Yet even then, some batches ended up spending 90% of their total execution time in docker start, so I went on to experiment with preallocated containers plus docker exec. While the execution overhead is finally OK (0.045s for docker exec, whereas docker start can take around 0.9s to boot on the same machine, and docker run gets up to 2s), the memory footprint of docker-containerd-shim is far beyond usable for this use case (~10 MB per container means my project would require 2 GB of RAM for 200 "preallocated" containers, even though the minimal entrypoint consumes only 64 KB). Any ideas for optimization?

@cpuguy83
Member

cpuguy83 commented Feb 9, 2018

Shim memory usage is much better now (compared to the previous 1.0-based shims).

[screenshot: shim memory usage figures]

This is still from containerd 1.0.1; there are more patches in 1.0.2 that further improve shim memory usage.

@frol If you are looking for raw runtime performance, using containerd directly may be better, although containerd doesn't set up networking for you, which accounts for a large amount of the time that moby takes (and moby has other inefficiencies that need to be addressed as well).
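
For reference, driving containerd directly with its Go client looks roughly like this (a sketch along the lines of the containerd getting-started docs; the namespace, image ref, and IDs are placeholders, and note that no networking is configured):

```go
// Pull an image, create a container and a task, and start it via the
// containerd client API (containerd 1.x).
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "example")

	image, err := client.Pull(ctx, "docker.io/library/nginx:alpine", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	container, err := client.NewContainer(ctx, "nginx-1",
		containerd.WithNewSnapshot("nginx-1-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}
	log.Printf("started task with pid %d", task.Pid())
}
```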

@frol

frol commented Feb 9, 2018

@cpuguy83 Thank you for the suggestion. I will consider using containerd directly.

@cpuguy83
Member

cpuguy83 commented Feb 9, 2018

FYI, in containerd create+start is ~250ms, and stop+delete adds another ~250ms.

@cyphar
Contributor

cyphar commented Feb 10, 2018

@frol Another option would be to use runc directly, which is even faster.

@atline

atline commented Oct 9, 2019

So, is there any official guidance on how many containers are recommended per 1 GB of memory, given that each shim costs around 5 MB?
