This repository has been archived by the owner on Nov 15, 2017. It is now read-only.

CPU spike when pulling big containers can kill nodes & the whole cluster #163

Open
namliz opened this issue Nov 17, 2016 · 3 comments

namliz (Contributor) commented Nov 17, 2016

A Very Large Container can cause a huge CPU spike.

This is hard to pinpoint exactly; it could be just docker pull working very hard, a kubelet bug, or something else.

[screenshot: CloudWatch CPU spike]

CloudWatch doesn't quite capture how bad this is: the nodes freeze up to the point where you can't SSH into them, and everything becomes totally unresponsive. Eventually (after 7 minutes in this case) the spike finally revs down and the nodes recover. The Weave pods, however, do not, and now the cluster is shot.

[screenshot: overloaded nodes]

Deleting and re-applying the DaemonSet (kubectl delete -f https://git.io/weave-kube followed by kubectl apply -f https://git.io/weave-kube) does not help.

kubectl logs weave-net-sbbsm --namespace=kube-system weave-npc

..
time="2016-11-17T04:16:44Z" level=fatal msg="add pod: ipset [add weave-k?Z;25^M}|1s7P3|H9i;*;MhG 10.40.0.2] failed: ipset v6.29: Element cannot be added to the set: it's already added\n: exit status 1"
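The "already added" error suggests weave-npc is trying to re-add an ipset entry that survived the crash; deleting and re-applying the DaemonSet manifest doesn't clear that kernel state on the node. A rough recovery sketch, assuming the weave-kube pods carry the usual name=weave-net label (worth verifying first) and using a placeholder for the hashed set name from the log:

# Verify the pod label before relying on it:
kubectl get pods --namespace=kube-system --show-labels | grep weave

# Force the DaemonSet to recreate its pods, which restarts weave-npc from scratch:
kubectl delete pod --namespace=kube-system -l name=weave-net

# If the fatal error returns, inspect the stale state directly on the node.
# Set names are hashed, so list them first; <weave-set-name> is a placeholder
# for the set shown in the log above. Flushing it disrupts policy enforcement
# until weave-npc repopulates it.
sudo ipset list -n | grep '^weave-'
sudo ipset flush <weave-set-name>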

To be fair, the nodes are t2.micro instances and have handled everything so far. Perhaps this is their natural limit; I'm retrying with larger instances.
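Since t2 instances are burstable, one thing worth ruling out is CPU credit exhaustion: a long pull pegging the CPU can drain the credit balance, after which EC2 throttles the instance hard, which would look a lot like a node freezing for minutes. A quick check, assuming the AWS CLI is configured (the instance ID and time window below are placeholders):

# Placeholder instance ID; substitute the frozen node's ID.
INSTANCE_ID=i-0123456789abcdef0

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
  --statistics Average \
  --period 300 \
  --start-time 2016-11-17T03:30:00Z \
  --end-time 2016-11-17T05:00:00Z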

namliz (Contributor, Author) commented Nov 17, 2016

Bigger instance types (m4) for the nodes did appear to help at first, but one node still froze up (faster this time, after 2 minutes) and brought down the Weave DaemonSet. It seems to be a pathological container image.

I'll try rebuilding it on a FROM debian:wheezy base. It is quite interesting that a container image can freeze a node!
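Roughly what I mean by rebuilding it, as a sketch (the image name and contents below are placeholders, not the actual application):

# Write a slimmed-down Dockerfile on the debian:wheezy base:
cat > Dockerfile.slim <<'EOF'
FROM debian:wheezy
# COPY / RUN the same application bits as the original image here
EOF

# Build and compare the resulting size against the original:
docker build -f Dockerfile.slim -t myapp:slim .
docker images | grep myapp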

namliz (Contributor, Author) commented Nov 17, 2016

The smaller image (210 MB) worked just fine. I'm not at all sure this has to do with size; something in the original image was causing very bad CPU spikes on the nodes when they pulled it, and quite possibly a kernel panic on the hosts.

I'm going to set it aside and see if anybody is interested in exploring a pathological image. It could be a Docker thing or a Kubernetes thing.
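For anyone who wants to poke at it, the layer structure is probably more interesting than the raw size; a couple of hedged starting points (the image names are placeholders):

# Per-layer sizes and the commands that produced them:
docker history --no-trunc pathological-image:tag

# Overall size next to the working 210 MB image for comparison:
docker images | grep -E 'pathological-image|smaller-image'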

jeremyeder commented

It's probably just resource contention, but perhaps we can narrow it down. On the problematic machine, can you share dmesg and the journal content from around the time of the pull? Also run perf record -a -g -F 100 while the issue is occurring; let it run for perhaps 10-20 seconds, then Ctrl+C, and then run perf report > out.txt. Let's see what the CPUs are doing. What versions of Docker and Kubernetes are you running, and which storage graph driver are you using for Docker?
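Concretely, the collection might look something like this on the affected node (the timestamps below are examples based on the log time in this issue; adjust to the window of the spike):

# Kernel messages and journal from around the time of the pull:
dmesg -T > dmesg.txt
journalctl --since "2016-11-17 04:00:00" --until "2016-11-17 04:30:00" > journal.txt

# System-wide profile while the spike is happening; stop with Ctrl+C after 10-20s:
sudo perf record -a -g -F 100
sudo perf report > out.txt

# Version and storage-driver details:
docker version
docker info | grep -i 'storage driver'
kubelet --version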
