dockerd RSS usage too high #693

Open

msharbiani1 opened this issue Jun 7, 2019 · 5 comments

msharbiani1 commented Jun 7, 2019

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

dockerd RSS usage too high

Expected behavior

$ ps -aux | grep dockerd | grep -v grep
root                408183 12.1  0.0 3665184 182872 ?      Ssl  Jun05 343:01 /usr/bin/dockerd ...

Actual behavior

$ ps -aux | grep dockerd | grep -v grep
root                  1827 22.2  1.3 17000744 3511884 ?    Ssl  May24 4443:40 /usr/bin/dockerd --insecure-registry=172.31.0.0/17 --data-root=/var/lib/docker  --max-concurrent-downloads=10 --default-ulimit nofile=65000:450000 --iptables=false --dns redacted-dns-ip1 --dns redacted-dns-ip2 --dns redacted-dns-ip3 --dns-search default.svc.cluster --dns-search svc.cluster --dns-search redacted-example.com --dns-opt ndots:2 --dns-opt timeout:2 --dns-opt attempts:2

Steps to reproduce the behavior

We have not been able to reproduce the issue at will. We have tried:

  1. Having containers killed repeatedly by the OOM killer.
  2. Pulling the same images onto a healthy node.
  3. Launching multiple containers simultaneously that finish and get restarted.

Output of docker version:

# docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

# docker info
Containers: 106
 Running: 48
 Paused: 0
 Stopped: 58
Images: 490
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.19.15-<redacted>1.el7.x86_64  ## Custom compiled.
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 251.7GiB
Name: redacted.redacted-corporate.com
ID: OZ5E:P4CH:7KDI:AZQX:RA56:QNRY:ZVZH:DWYR:YA3L:PAMK:26TM:LQHJ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 189
 Goroutines: 196
 System Time: 2019-06-07T09:47:10.727964606-07:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 172.31.0.0/17
 127.0.0.0/8
Registry Mirrors:
 https://docker.redacted-corporate.com/
Live Restore Enabled: true

Docker is being controlled by kubelet 1.12.7.

I ran pprof to profile the daemon's heap; here is the result:

# ./pprof-linux -inuse_space http://localhost:8080/debug/pprof/heap
Fetching profile over HTTP from http://localhost:8080/debug/pprof/heap
Saved profile in /root/pprof/pprof.dockerd.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
File: dockerd
Build ID: 3916f89da49cb3de23a3e581f62c7cbf68162a78
Type: inuse_space
Time: Jun 7, 2019 at 9:50am (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 1372.60MB, 97.58% of 1406.57MB total
Dropped 174 nodes (cum <= 7.03MB)
Showing top 5 nodes out of 29
      flat  flat%   sum%        cum   cum%
 1339.49MB 95.23% 95.23%  1339.49MB 95.23%  runtime.malg
   33.11MB  2.35% 97.58%    33.11MB  2.35%  runtime.allgadd
         0     0% 97.58%     9.04MB  0.64%  encoding/json.(*Decoder).Decode
         0     0% 97.58%     8.54MB  0.61%  encoding/json.(*decodeState).array
         0     0% 97.58%     9.54MB  0.68%  encoding/json.(*decodeState).object
(pprof)

From what I understand, the memory attributed to runtime.malg comes from a large number of goroutines having been active simultaneously, and, again from what I have read, the Go runtime never frees those goroutine structures once they have been allocated. My question is this:

How can I avoid the conditions that cause dockerd to launch so many goroutines simultaneously?
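For illustration only, here is a minimal Go sketch (not taken from dockerd; the burst size of 200000 is an arbitrary choice) of the behaviour described above: the goroutine descriptors allocated by runtime.malg stay on the heap even after every goroutine has exited.

// malg_retention.go - demonstrates that the g descriptors allocated by
// runtime.malg are kept for reuse and are never freed by the garbage collector.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func heapInuseKiB() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse / 1024
}

func main() {
	fmt.Printf("heap in use before burst: %d KiB\n", heapInuseKiB())

	// Keep a burst of goroutines alive at the same time, so the runtime
	// must allocate a distinct g descriptor (runtime.malg) for each one.
	const burst = 200000
	release := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < burst; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release
		}()
	}
	close(release)
	wg.Wait()

	// All goroutines have exited, but their g descriptors remain referenced
	// from the runtime's internal allgs slice, so a GC cycle does not shrink
	// the heap back to its pre-burst size.
	runtime.GC()
	fmt.Printf("heap in use after burst:  %d KiB (live goroutines: %d)\n",
		heapInuseKiB(), runtime.NumGoroutine())
}

In other words, a one-time spike in concurrent goroutines leaves the heap (and therefore RSS) permanently inflated by roughly the size of those descriptors, which would match the runtime.malg dominance in the profiles.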

We have seen the same issue on another node running docker-ce 18.09.5, albeit with lower memory consumption:

File: dockerd-ce
Build ID: b4032a635c0bd70506db103a49118f0380b8530c
Type: inuse_space
Time: Jun 7, 2019 at 9:57am (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 482.63MB, 98.10% of 491.99MB total
Dropped 107 nodes (cum <= 2.46MB)
Showing top 5 nodes out of 28
      flat  flat%   sum%        cum   cum%
  466.67MB 94.85% 94.85%   466.67MB 94.85%  local.runtime.malg
   10.84MB  2.20% 97.06%    10.84MB  2.20%  local.runtime.allgadd
    4.61MB  0.94% 98.00%     4.61MB  0.94%  local.regexp.(*bitState).reset
    0.50MB   0.1% 98.10%   478.01MB 97.16%  local.runtime.systemstack
         0     0% 98.10%     2.50MB  0.51%  local.github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1
(pprof)
thaJeztah (Member) commented:

ping @kolyshkin ptal


ravilr commented Jul 8, 2019

@kolyshkin this seems very similar to #670

Additional details about our setup:

  • using the journald Docker log driver
  • running dockerd with the "live-restore": true option in docker/daemon.json

When the issue happens, the Docker daemon appears to do a lot of writes to disk (observed as page-out activity in vmstat).

tambetliiv commented:

Seems to be the same issue I'm having (moby/moby#40183).

Any workarounds/fixes proposed?

thaJeztah (Member) commented:

@tambetliiv from your comment on moby/moby#40183 (comment), this looks to be a kernel issue?

moby/moby#38352 suggested that the bug may be in the kernel.

Then I found torvalds/linux@f9c6456, which was included in the 4.19 kernel series starting with 4.19.77 (and in the newer series since 5.2.19 and 5.3.4).

Until now I had only tested this issue with kernels older than 4.19.77.

To verify that this fix really helps, I compiled kernels 4.19.76 and 4.19.77 on my Debian buster host (which uses an XFS filesystem). With kernel 4.19.76 I was able to reproduce this OOM issue right after reboot with the command

docker run --name mem_test --init -d --rm -m 50m --memory-swap 50m python python -c "import itertools; list(itertools.count())"

and after booting into kernel 4.19.77 I wasn't able to reproduce the issue anymore.

So to fix this issue, upgrade your kernel to 4.19.77, 5.2.19, or 5.3.4 (or newer).

tambetliiv commented:

If this issue shows the same symptoms as mine did, then yeah, I suggest trying kernel 4.19.77 or newer to verify that it really is the same issue.
