This repository has been archived by the owner on Nov 15, 2017. It is now read-only.

CPU spike when pulling big containers can kill nodes & the whole cluster #163

Open
namliz opened this issue Nov 17, 2016 · 3 comments

namliz (Contributor) commented Nov 17, 2016

A Very Large Container can cause a huge CPU spike.

This is hard to pinpoint exactly; it could be just docker pull working very hard, a kubelet bug, or something else.

[screenshot: CloudWatch CPU spike]

CloudWatch doesn't quite capture how bad this is: the nodes freeze up to the point where you can't SSH into them, and everything becomes totally unresponsive. Eventually (after 7 minutes in this case) the spike finally revs down and the nodes recover. The Weave pods, however, do not, and now the cluster is shot.

[screenshot: overloaded nodes]

Deleting and re-applying the DaemonSet (kubectl delete -f https://git.io/weave-kube followed by kubectl apply -f https://git.io/weave-kube) does not help.

kubectl logs weave-net-sbbsm --namespace=kube-system weave-npc

..
time="2016-11-17T04:16:44Z" level=fatal msg="add pod: ipset [add weave-k?Z;25^M}|1s7P3|H9i;*;MhG 10.40.0.2] failed: ipset v6.29: Element cannot be added to the set: it's already added\n: exit status 1"
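The "already added" error suggests weave-npc is trying to re-add an ipset entry that survived the crash; deleting and re-applying the DaemonSet manifest doesn't clear that kernel state on the node. A rough recovery sketch, assuming the weave-kube pods carry the usual name=weave-net label (worth verifying first) and using a placeholder for the hashed set name from the log:

# Verify the pod label before relying on it:
kubectl get pods --namespace=kube-system --show-labels | grep weave

# Force the DaemonSet to recreate its pods, which restarts weave-npc from scratch:
kubectl delete pod --namespace=kube-system -l name=weave-net

# If the fatal error returns, inspect the stale state directly on the node.
# Set names are hashed, so list them first; <weave-set-name> is a placeholder
# for the set shown in the log above. Flushing it disrupts policy enforcement
# until weave-npc repopulates it.
sudo ipset list -n | grep '^weave-'
sudo ipset flush <weave-set-name>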

To be fair, the nodes are t2.micro instances and have handled everything so far. Perhaps this is their natural limit; I'm retrying with larger instances.
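Since t2 instances are burstable, one thing worth ruling out is CPU credit exhaustion: a long pull pegging the CPU can drain the credit balance, after which EC2 throttles the instance hard, which would look a lot like a node freezing for minutes. A quick check, assuming the AWS CLI is configured (the instance ID and time window below are placeholders):

# Placeholder instance ID; substitute the frozen node's ID.
INSTANCE_ID=i-0123456789abcdef0

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
  --statistics Average \
  --period 300 \
  --start-time 2016-11-17T03:30:00Z \
  --end-time 2016-11-17T05:00:00Z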

namliz (Contributor, Author) commented Nov 17, 2016

Bigger instance types (m4) for the nodes did appear to help at first, but one node still froze up (faster this time, after 2 minutes) and brought down the Weave DaemonSet. It seems to be a pathological container image.

I'll try rebuilding it on a FROM debian:wheezy base. It is quite interesting that a container image can freeze a node!
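Roughly what I mean by rebuilding it, as a sketch (the image name and contents below are placeholders, not the actual application):

# Write a slimmed-down Dockerfile on the debian:wheezy base:
cat > Dockerfile.slim <<'EOF'
FROM debian:wheezy
# COPY / RUN the same application bits as the original image here
EOF

# Build and compare the resulting size against the original:
docker build -f Dockerfile.slim -t myapp:slim .
docker images | grep myapp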

namliz (Contributor, Author) commented Nov 17, 2016

The smaller image (210 MB) worked just fine. I'm not at all sure this has to do with size; something in the original image was causing very bad CPU spikes on the nodes when they pulled it, and quite possibly a kernel panic on the hosts.

I'm going to set it aside and see if anybody is interested in exploring a pathological image. It could be a Docker thing or a Kubernetes thing.
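For anyone who wants to poke at it, the layer structure is probably more interesting than the raw size; a couple of hedged starting points (the image names are placeholders):

# Per-layer sizes and the commands that produced them:
docker history --no-trunc pathological-image:tag

# Overall size next to the working 210 MB image for comparison:
docker images | grep -E 'pathological-image|smaller-image'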

jeremyeder commented

It's probably just resource contention, but perhaps we can narrow it down. On the problematic machine, can you share dmesg and the journal content from around the time of the pull? Also run perf record -a -g -F 100 while the issue is occurring; let it run for perhaps 10-20 seconds, then Ctrl+C, and then run perf report > out.txt. Let's see what the CPUs are doing. What versions of Docker and Kubernetes are you running, and which storage graph driver are you using for Docker?
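Concretely, the collection might look something like this on the affected node (the timestamps below are examples based on the log time in this issue; adjust to the window of the spike):

# Kernel messages and journal from around the time of the pull:
dmesg -T > dmesg.txt
journalctl --since "2016-11-17 04:00:00" --until "2016-11-17 04:30:00" > journal.txt

# System-wide profile while the spike is happening; stop with Ctrl+C after 10-20s:
sudo perf record -a -g -F 100
sudo perf report > out.txt

# Version and storage-driver details:
docker version
docker info | grep -i 'storage driver'
kubelet --version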
