Increasing CPU usage from dockerd when system is idle #641

Closed
xb4ucNy opened this issue Apr 3, 2019 · 20 comments

@xb4ucNy

xb4ucNy commented Apr 3, 2019

  • [x] This is a bug report
  • [ ] This is a feature request
  • [x] I searched existing issues before opening this one

Expected behavior

dockerd uses very little CPU resources when idle.

Actual behavior

dockerd uses increasing amounts of CPU resources when idle.

Additional details

I've recently discovered concerning CPU usage coming from dockerd; it seems to be using more and more CPU as time goes on, but mostly when the system is otherwise idle.

The system is primarily running docker (18.06-ce) with node, redis, mongod, nginx, and Datadog agent containers. It sees very steady weekday traffic that drops off during the night, as shown in the chart below.

CPU usage over last 2 weeks

This graph shows the total container CPU usage (in gray) and the total system CPU usage (in orange). The difference between the container and system usage is always the dockerd process itself. Here's a sample top output from one of the more recent nights showing dockerd using 1214% of the CPU.

> top
top - 22:34:02 up 11 days, 21:36,  1 user,  load average: 3.36, 3.64, 3.29
Tasks: 583 total,   1 running, 582 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.1 us,  1.0 sy,  0.0 ni, 61.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65545100 total, 11877028 free, 34665424 used, 19002648 buff/cache
KiB Swap: 32899068 total, 32899068 free,        0 used. 29659532 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8943 root      20   0 4553336 117740  35180 S  1214  0.2  14885:33 dockerd
10251 root      20   0 4053688  35224  13608 S   1.0  0.1 248:16.61 docker-containe
13482 systemd+  20   0   27.9g  26.6g  16664 S   1.0 42.5   5854:05 mongod
18568 root      20   0 6474300 132028  29384 S   1.0  0.2 659:18.65 agent
19550 xxxxxxxx  20   0  172892   2924   1632 R   1.0  0.0   0:00.04 top
16265 100       20   0  366668 283456  11980 S   0.7  0.4 661:06.14 nginx
21509 100       20   0   72024  44640    596 S   0.7  0.1 523:53.57 nginx
 8014 root      20   0  790444 153688  15104 S   0.3  0.2 539:34.40 node
 8347 root      20   0  790208 150316  15100 S   0.3  0.2 548:11.29 node
 8354 root      20   0  793180 155120  15100 S   0.3  0.2 549:36.05 node
 8404 root      20   0  628168  43352  14332 S   0.3  0.1   1:46.84 node
 9053 root      20   0  725352  89252  15076 S   0.3  0.1   1827:09 node

And for good measure, here's a closeup of the worst night:

CPU usage on worst night

The docker daemon was restarted and updated (to 18.09-ce) last weekend and the CPU usage dropped back to normal, but it is already showing the same symptoms again.

I do not have the knowledge to figure out what dockerd is doing at these times. Other similar issues hinted at long-running log- or stat-related problems, but all our logs are capped, so I can only hope the Datadog agent is somehow causing the load.
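
For reference, a quick way to check how large the per-container json-file logs have grown (a rough sketch, assuming the default data root /var/lib/docker and the json-file logging driver; run as root):

# List the size of each container's json log file, largest last
du -sh /var/lib/docker/containers/*/*-json.log | sort -h | tail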

Output of docker version:

Client:
 Version:           18.09.4
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        d14af54266
 Built:             Wed Mar 27 18:34:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.4
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       d14af54
  Built:            Wed Mar 27 18:04:46 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 29
 Running: 29
 Paused: 0
 Stopped: 0
Images: 61
Server Version: 18.09.4
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-957.5.1.el7.x86_64
Operating System: Red Hat Enterprise Linux
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.51GiB
Name: REDACTED
ID: REDACTED
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
@lmirguet

lmirguet commented Jun 6, 2019

Hello,

I'm encountering the same problem with more or less the same config as you.

Did you find any solution to your problem?

Regards,
Laurent

@villesundell

I started to have the same kind of problem recently. My docker --version says:
Docker version 18.06.1-ce, build e68fc7a

None of the docker interaction commands work: docker ps, version, info, etc.

CPU usage is about 16% when idle, which is much more than it used to be (I don't remember even seeing dockerd in top before this started to happen).

@villesundell

For me, pruning docker manually helped; instructions here: https://coderwall.com/p/-vsmba/manually-remove-docker-containers-on-ubuntu

(I also updated my docker-ce to "Docker version 18.06.3-ce, build d7080c1", but I had to prune before reinstallation was successful)
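
For what it's worth, Docker's built-in prune commands cover roughly the same ground as the manual steps in that link (a sketch, not the exact procedure from the article; these delete stopped containers and other unused data, so use with care):

# Remove all stopped containers
docker container prune

# Also remove unused networks and dangling images
docker system prune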

@ryflow

ryflow commented Oct 23, 2019

We have exactly the same issue, as seen in the graph below:

docker-vs-host-cpu

dockerd uses progressively more resources every night. It is strange that the CPU usage is the inverse of our application workload (i.e. high at night, when there is very little traffic going through the system).

We are also running Datadog and our cluster is hosted in EKS.

Docker version:
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a215d7133c34aa18e3b72b4a21fd0c6136
 Built:             Mon Mar 4 21:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

This has become such an issue that it is now affecting production workloads. Did anyone else work out what may be causing this issue?

@cpuguy83
Collaborator

Can you take a CPU profile?
You can use pprof to do this; the profile endpoint is at /debug/pprof/profile?seconds=<int>

@msvbhat

msvbhat commented Jan 5, 2020

I am seeing similar problems in production.

One of our machines is running at a full 100% CPU, and about 96% of that is being used by the docker daemon. There are about 15 docker containers running, all of them orchestrated by HashiCorp Nomad. But there is no increase in traffic that could explain this, and even if traffic increased, the docker containers should be using the CPU, not the docker daemon.

Below are the details:
OS: Amazon Linux 2 with 4.14.138 kernel
docker version: 18.09.9-ce
This is a default installation; we haven't made any changes to the package as installed by the package manager.

Also, I'm not sure what the purpose of this "for-linux" repo is. It looks like this issue should be in the main moby project?

@jenil

jenil commented Feb 16, 2020

This is happening to me as well: when I run the Datadog agent on docker, the CPU is always high. Things seem to work fine when I pause the Datadog agent.

@chrisgray-vertex

I am seeing very similar behavior as well. No clue yet as to why. How would I determine what dockerd is spending its time on?

@anshul0915zinnia

I am seeing similar behavior; after restarting the docker daemon, the CPU usage drops back down.

@cpuguy83
Collaborator

(image attachment)

@anshul0915zinnia

Can you please provide the CPU profile command?

@anshul0915zinnia

I found the issue: it is due to the Datadog agent when log collection is enabled in the Datadog agent config.

@ryflow

ryflow commented Mar 3, 2020

That was also our suspicion... Any idea what the agent is doing? It's literally killing our prod system (but we can't turn DD logging off, as we rely on it!).

@cpuguy83
Collaborator

cpuguy83 commented Mar 3, 2020

@anshul0915

Can you please provide the CPU profile command?

curl --unix-socket /var/run/docker.sock http://./debug/pprof/profile?seconds=60
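
If it helps, a minimal sketch of capturing and then inspecting that profile (the output file name is illustrative, a local Go toolchain is assumed, and the daemon may need to be running with debug enabled for the /debug/pprof routes to be available):

# Capture a 60-second CPU profile from dockerd over its unix socket
curl --unix-socket /var/run/docker.sock -o dockerd.pprof "http://./debug/pprof/profile?seconds=60"

# List the functions where dockerd spent the most CPU time
go tool pprof -top dockerd.pprof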

@jenil

jenil commented Mar 3, 2020

FWIW, I spoke to the containers team at Datadog and they said they haven't seen any issues. They recommended I just reinstall Docker and the agent, so I reinstalled both and things are working fine 💯

@chabou

chabou commented May 25, 2020

I can confirm this issue. I reproduced it twice (in staging and production). I enabled log collection on the Datadog agent, and dockerd's CPU usage increased to 90-100%. I stopped the Datadog agent, but dockerd continued to consume 90% CPU. I waited 30-45 min without any improvement and had to restart the docker service.
Here is my 60s CPU profile during the issue:
debug.pprof.gz

I can reproduce it systematically. Do not hesitate to ask me for details.

@cpuguy83
Collaborator

@chabou Thanks! Your issue seems to be in the JSON decoding of the container logs.
I made a proposal some time ago to throttle this so that log readers couldn't exhaust CPU, but it was declined.
The only way around this is to throttle your log consumer.

You might try the "local" log driver, which isn't as CPU intensive as json-file (it uses protobuf to encode log messages), but what will probably happen is that it will just consume faster and still use up CPU.
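
For reference, a minimal sketch of switching a single container to the local driver with rotation caps (the image name and values are illustrative; the local driver accepts the same max-size/max-file options):

# Run a container with the "local" log driver and small, rotated log files
docker run -d --log-driver local --log-opt max-size=10m --log-opt max-file=3 nginx

The same defaults can also be set daemon-wide via the log-driver and log-opts keys in /etc/docker/daemon.json (this only affects newly created containers).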

@cpuguy83
Collaborator

Probably the best way to handle this is to not stream every message and instead do bulk processing, which probably requires a major change like moby/moby#40517.

@cpuguy83
Collaborator

I'm going to close this issue since it seems too generic to be actionable.

@kchernoff

I recently came across what I suspect is the same root issue here - I don't use DataDog, but we have a service that's capturing logs in a stream from the docker API for a group of 20-40 containers, and long-lived instances were showing very high CPU usage for the dockerd process combined with pretty high read I/O for the same process.

This problem seems to be exacerbated by the fact that docker does not, by default, place any constraint on the size of the .json log file for each container; as time goes on, it takes progressively more CPU and I/O time to parse the gigantic json files that result (hundreds of megs to a gig or two per container in our case, across 20-40 containers). Setting the max-file and max-size options for your containers (either via the run command or in the logging options section of your compose file) to reasonably small values seems to keep the processing dockerd needs to handle log requests to a minimum, without having to switch to a different logging driver.

As a gotcha, it appears that you must provide the max-file parameter in the compose file as a quoted string rather than a bare integer, otherwise the compose file fails to parse correctly (see the fragment below). Pretty strange, but easy enough to work around.
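
To illustrate the above, a minimal compose fragment with capped json-file logs (the service and image names are placeholders, and the values are just examples); note max-file quoted as a string:

services:
  app:
    image: nginx
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"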

shepmaster added a commit to rust-lang/simpleinfra that referenced this issue Mar 7, 2024
The default driver can be inefficient [1] as it reads / parses /
formats / writes a large JSON file over and over. Since all of the
playground's communication goes over stdin / stdout, that can be a lot
of junk logged!

The `local` driver should be more efficient.

[1]: docker/for-linux#641
shepmaster added a commit to rust-lang/simpleinfra that referenced this issue Apr 2, 2024