Increasing CPU usage from dockerd when system is idle #641

Closed
xb4ucNy opened this issue Apr 3, 2019 · 20 comments

@xb4ucNy

xb4ucNy commented Apr 3, 2019

  • [x] This is a bug report
  • [ ] This is a feature request
  • [x] I searched existing issues before opening this one

Expected behavior

dockerd uses very little CPU resources when idle.

Actual behavior

dockerd uses increasing amounts of CPU resources when idle.

Additional details

I've recently discovered concerning CPU usage coming from dockerd; it seems to be using more and more CPU as time goes on, but mostly when the system is otherwise idle.

The system is primarily running docker (18.06-ce) with node, redis, mongod, nginx, and Datadog agent containers. It sees very steady weekday traffic that drops off during the night, as shown in the chart below.

CPU usage over last 2 weeks

This graph shows the total container CPU usage (in gray) and the total system CPU usage (in orange). The difference between the container and system usage is always the dockerd process itself. Here's a sample top output from one of the more recent nights showing dockerd using 1214% of the CPU.

> top
top - 22:34:02 up 11 days, 21:36,  1 user,  load average: 3.36, 3.64, 3.29
Tasks: 583 total,   1 running, 582 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.1 us,  1.0 sy,  0.0 ni, 61.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65545100 total, 11877028 free, 34665424 used, 19002648 buff/cache
KiB Swap: 32899068 total, 32899068 free,        0 used. 29659532 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8943 root      20   0 4553336 117740  35180 S  1214  0.2  14885:33 dockerd
10251 root      20   0 4053688  35224  13608 S   1.0  0.1 248:16.61 docker-containe
13482 systemd+  20   0   27.9g  26.6g  16664 S   1.0 42.5   5854:05 mongod
18568 root      20   0 6474300 132028  29384 S   1.0  0.2 659:18.65 agent
19550 xxxxxxxx  20   0  172892   2924   1632 R   1.0  0.0   0:00.04 top
16265 100       20   0  366668 283456  11980 S   0.7  0.4 661:06.14 nginx
21509 100       20   0   72024  44640    596 S   0.7  0.1 523:53.57 nginx
 8014 root      20   0  790444 153688  15104 S   0.3  0.2 539:34.40 node
 8347 root      20   0  790208 150316  15100 S   0.3  0.2 548:11.29 node
 8354 root      20   0  793180 155120  15100 S   0.3  0.2 549:36.05 node
 8404 root      20   0  628168  43352  14332 S   0.3  0.1   1:46.84 node
 9053 root      20   0  725352  89252  15076 S   0.3  0.1   1827:09 node

And for good measure, here's a closeup of the worst night:

CPU usage on worst night

The docker daemon was restarted and updated (to 18.09-ce) last weekend and the CPU usage dropped back to normal, but it is already showing the same symptoms again.

I do not have the knowledge to figure out what dockerd is doing at these times. Other similar issues hinted at long-running log- or stat-related problems, but all our logs are capped, so I can only hope the Datadog agent is somehow causing the load.
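
For reference, a quick way to check how large the per-container json-file logs have grown (a rough sketch, assuming the default data root /var/lib/docker and the json-file logging driver; run as root):

# List the size of each container's json log file, largest last
du -sh /var/lib/docker/containers/*/*-json.log | sort -h | tail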

Output of docker version:

Client:
 Version:           18.09.4
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        d14af54266
 Built:             Wed Mar 27 18:34:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.4
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       d14af54
  Built:            Wed Mar 27 18:04:46 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 29
 Running: 29
 Paused: 0
 Stopped: 0
Images: 61
Server Version: 18.09.4
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-957.5.1.el7.x86_64
Operating System: Red Hat Enterprise Linux
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.51GiB
Name: REDACTED
ID: REDACTED
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
@lmirguet

lmirguet commented Jun 6, 2019

Hello,

I'm encountering the same problem with more or less the same config as you.

Did you find any solution to your problem?

Regards,
Laurent

@villesundell

I started to have the same kind of problem recently. My docker --version says:
Docker version 18.06.1-ce, build e68fc7a

None of the docker interaction commands work: docker ps, version, info, etc.

CPU usage is about 16% when idle, which is much more than it used to be (I don't remember even seeing dockerd in top before this started to happen).

@villesundell

For me, pruning docker manually helped; instructions here: https://coderwall.com/p/-vsmba/manually-remove-docker-containers-on-ubuntu

(I also updated my docker-ce to "Docker version 18.06.3-ce, build d7080c1", but I had to prune before reinstallation was successful)
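
For what it's worth, Docker's built-in prune commands cover roughly the same ground as the manual steps in that link (a sketch, not the exact procedure from the article; these delete stopped containers and other unused data, so use with care):

# Remove all stopped containers
docker container prune

# Also remove unused networks and dangling images
docker system prune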

@ryflow

ryflow commented Oct 23, 2019

We have exactly the same issue, as seen in the graph below:

docker-vs-host-cpu

dockerd uses progressively more resources every night. It is strange that the CPU usage is the inverse of our application workload (i.e. high at night, when there is very little traffic going through the system).

We are also running Datadog and our cluster is hosted in EKS.

Docker version:
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a215d7133c34aa18e3b72b4a21fd0c6136
 Built:             Mon Mar 4 21:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

This has become such an issue that it is now affecting production workloads. Did anyone else work out what may be causing this issue?

@cpuguy83
Collaborator

Can you take a CPU profile?
You can use pprof to do this; the profile endpoint is at /debug/pprof/profile?seconds=<int>

@msvbhat

msvbhat commented Jan 5, 2020

I am seeing similar problems in production.

One of our machines is running at a full 100% CPU, and about 96% of that is being used by the docker daemon. There are about 15 docker containers running, all of them orchestrated by HashiCorp Nomad. But there is no increase in traffic that could explain this, and even if traffic increased, the docker containers should be using the CPU, not the docker daemon.

Below are the details:
OS: Amazon Linux 2 with 4.14.138 kernel
docker version: 18.09.9-ce
This is a default installation; we haven't made any changes to the package as installed by the package manager.

Also, I'm not sure what the purpose of this "for-linux" repo is. It looks like this issue should be in the main moby project?

@jenil

jenil commented Feb 16, 2020

This is happening to me as well: when I run the Datadog agent on docker, the CPU is always high. Things seem to work fine when I pause the Datadog agent.

@chrisgray-vertex

I am seeing very similar behavior as well. No clue yet as to why. How would I determine what dockerd is spending its time on?

@anshul0915zinnia

I am seeing similar behavior; after restarting the docker daemon, the CPU usage drops back down.

@cpuguy83
Collaborator

(image attachment)

@anshul0915zinnia

Can you please provide the CPU profile command?

@anshul0915zinnia

I found the issue: it is due to the Datadog agent when log collection is enabled in the Datadog agent config.

@ryflow

ryflow commented Mar 3, 2020

That was also our suspicion... Any idea what the agent is doing? It's literally killing our prod system (but we can't turn DD logging off, as we rely on it!).

@cpuguy83
Collaborator

cpuguy83 commented Mar 3, 2020

@anshul0915

Can you please provide the CPU profile command?

curl --unix-socket /var/run/docker.sock http://./debug/pprof/profile?seconds=60
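
If it helps, a minimal sketch of capturing and then inspecting that profile (the output file name is illustrative, a local Go toolchain is assumed, and the daemon may need to be running with debug enabled for the /debug/pprof routes to be available):

# Capture a 60-second CPU profile from dockerd over its unix socket
curl --unix-socket /var/run/docker.sock -o dockerd.pprof "http://./debug/pprof/profile?seconds=60"

# List the functions where dockerd spent the most CPU time
go tool pprof -top dockerd.pprof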

@jenil

jenil commented Mar 3, 2020

FWIW, I spoke to the containers team at Datadog and they said they haven't seen any issues. They recommended I just reinstall Docker and the agent, so I reinstalled both and things are working fine 💯

@chabou

chabou commented May 25, 2020

I can confirm this issue. I reproduced it twice (in staging and production). I enabled log collection on the Datadog agent, and dockerd's CPU usage increased to 90-100%. I stopped the Datadog agent, but dockerd continued to consume 90% CPU. I waited 30-45 min without any improvement and had to restart the docker service.
Here is my 60s CPU profile during the issue:
debug.pprof.gz

I can reproduce it systematically. Do not hesitate to ask me for details.

@cpuguy83
Collaborator

@chabou Thanks! Your issue seems to be in the JSON decoding of the container logs.
I made a proposal some time ago to throttle this so that log readers couldn't exhaust CPU, but it was declined.
The only way around this is to throttle your log consumer.

You might try the "local" log driver, which isn't as CPU intensive as json-file (it uses protobuf to encode log messages), but what will probably happen is that it will just consume faster and still use up CPU.
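
For reference, a minimal sketch of switching a single container to the local driver with rotation caps (the image name and values are illustrative; the local driver accepts the same max-size/max-file options):

# Run a container with the "local" log driver and small, rotated log files
docker run -d --log-driver local --log-opt max-size=10m --log-opt max-file=3 nginx

The same defaults can also be set daemon-wide via the log-driver and log-opts keys in /etc/docker/daemon.json (this only affects newly created containers).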

@cpuguy83
Collaborator

Probably the best way to handle this is to not stream every message and instead do bulk processing, which probably requires a major change like moby/moby#40517.

@cpuguy83
Collaborator

I'm going to close this issue since it seems too generic to be actionable.

@kchernoff

I recently came across what I suspect is the same root issue here - I don't use DataDog, but we have a service that's capturing logs in a stream from the docker API for a group of 20-40 containers, and long-lived instances were showing very high CPU usage for the dockerd process combined with pretty high read I/O for the same process.

This problem seems to be exacerbated by the fact that docker does not, by default, place any constraint on the size of the .json log file for each container; as time goes on, it takes progressively more CPU and I/O time to parse the gigantic json files that result (hundreds of megs to a gig or two per container in our case, across 20-40 containers). Setting the max-file and max-size options for your containers (either via the run command or in the logging options section of your compose file) to reasonably small values seems to keep the processing dockerd needs to handle log requests to a minimum, without having to switch to a different logging driver.

As a gotcha, it appears that you must provide the max-file parameter in the compose file as a quoted string rather than a bare integer, otherwise the compose file fails to parse correctly (see the fragment below). Pretty strange, but easy enough to work around.
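
To illustrate the above, a minimal compose fragment with capped json-file logs (the service and image names are placeholders, and the values are just examples); note max-file quoted as a string:

services:
  app:
    image: nginx
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"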

shepmaster added a commit to rust-lang/simpleinfra that referenced this issue Mar 7, 2024
The default driver can be inefficient [1] as it reads / parses /
formats / writes a large JSON file over and over. Since all of the
playground's communication goes over stdin / stdout, that can be a lot
of junk logged!

The `local` driver should be more efficient.

[1]: docker/for-linux#641
shepmaster added a commit to rust-lang/simpleinfra that referenced this issue Apr 2, 2024