Stops resolving running containers after skydock: EOF under heavy system loads #57
Comments
I attempted to address this in #58
Just noticed the 'no-expire' branch and notes at skynetservices/skydns#84, which should just get rid of the whole issue of tweaking timers and error counters. As long as skydock is always running and catches all the notifications from the docker API, then skydns should stay in sync.
Yes, skydns is also in the middle of a rewrite to use etcd as the backend, so it should be more robust, and I will not add any TTL with skydock, so that should solve your issue. I just don't know if I should ship skydns, skydock, and etcd in 1 or 3 containers.
Thanks for the update. Multi-host will be exciting! One idea would be to use a single image with all three binaries in it. Then you would instantiate three containers using different names and command line arguments for each entry point. Pros for this would be:
Cons:
Kind of like how one jar/tar.gz vendors logstash, elasticsearch, and kibana functionality? Just some thoughts.
I was tinkering around with how to do deployments like these, one idea [...]
You could take that idea one step further by generalizing with a web-based [...]
For the logs, would it be possible to extend the concept to use a fourth [...]
For the starting up/monitoring, between shipyard, project atomic and perhaps [...]
On Tuesday, May 6, 2014, ubergarm notifications@github.com wrote:
I would prefer separate images, so you can use the images of skydns(2) and skydock in combination with CoreOS.
+1 for separate images.
Regardless of the images consideration, the solution to this original issue was taken care of in the skydock no-expire branch. I built a single image to vendor a solution, which takes the form of two running containers (using whatever method you want to start them), here: https://index.docker.io/u/ubergarm/skyservices/ Hopefully this is agnostic enough to be useful in general, but it works for me and my runit-based production system. I look forward to seeing the new skydns2 and skydock stuff when it is ready!
Executive Summary:
Running a VPS with RAM and swap maxed out can eventually cause skydns+skydock to stop resolving running containers.
Obvious Solution
A manual
docker restart skydock
will get the services back into skydns, and everything resolves happily again after that.
Preferred Solution
Create a way for the skydock+skydns stack to gracefully recover after being temporarily strangled.
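In the meantime, the obvious workaround can at least be automated with a small watchdog on the host. This is only a sketch of the idea, not part of skydock; the probe name (mysql.dev.docker) and the nameserver address (172.17.42.1) are assumptions taken from the setup described below.
#!/bin/sh
# Hypothetical watchdog: restart skydock whenever a known container name
# stops resolving through skydns. The probe name and nameserver address
# are assumptions based on the setup described later in this issue.
NAMESERVER=172.17.42.1
PROBE_NAME=mysql.dev.docker
while true; do
    if ! dig +short @"$NAMESERVER" "$PROBE_NAME" | grep -q .; then
        echo "$(date) $PROBE_NAME stopped resolving, restarting skydock"
        docker restart skydock
    fi
    sleep 30
done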
Production Setup
My production box in this case is a 2x CPU, 4GB RAM, 4GB SWAP, Digital Ocean VPS running an apache+php container, a mysql container, a skydock container, a skydns container, and nginx on the host.
Overview
I noticed problems right at the end of the month (as site usage peaks) when I started getting 500 errors which required a manual
docker restart skydock
to get skydns resolving properly again.
Correlating the logs and metrics led me to observe EOF in the skydock logs right when the 500 errors started, with graphs showing high system load.
The best reference I could find was from Docker IRC chat, where a rather overloaded VPS system threw the same errors.
The error seems like it is coming from heartbeat() -> updateService(). So perhaps the connection between skydock and skydns craps out, allowing the services to time out?
I was able to repeat it two out of two tries on a fresh VPS test install with the latest docker/skydock/skydns stack. Increasing the TTL from 30 up to 300 made the system stay up longer, but eventually it hung long enough to crap out.
Stress testing was provided by starting multiple gitlab containers: thanks, Ruby! :)
Details
Steps to reproduce skydock EOF and the loss of registered services from skydns, causing failure to resolve container names.
Test Hardware
Digital Ocean Droplet
1x CPU
~512MB RAM
~512MB Swap
Test OS
Ubuntu 12.04.4 LTS (GNU/Linux 3.8.0-29-generic x86_64)
Linux skydock-test 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Docker Version
Installed cgroup-lite and lxc-docker from
deb https://get.docker.io/ubuntu docker main
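For reference, the install would have looked roughly like this (the repository line is the one above; package names are the ones Docker shipped for Ubuntu 12.04 at the time, and adding the repository's GPG key via apt-key is omitted here):
echo "deb https://get.docker.io/ubuntu docker main" > /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install -y cgroup-lite lxc-docker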
Docker Version:
Client version: 0.10.0
Client API version: 1.10
Go version (client): go1.2.1
Git commit (client): dc9c28f
Server version: 0.10.0
Server API version: 1.10
Git commit (server): dc9c28f
Go version (server): go1.2.1
Last stable version: 0.10.0
Docker Daemon OPTS
/etc/init/docker.conf --
DOCKER_OPTS="-dns 172.17.42.1 -bip 172.17.42.1/16"
ulimit -a
ulimit -a as root:
Pull Images
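The exact pull commands weren't preserved in this copy; assuming the stock images from the skydock README of that era, they would be something like:
docker pull crosbymichael/skydns
docker pull crosbymichael/skydock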
Start Containers
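Again assuming the stock invocation from the skydock README, with the docker bridge IP from the DOCKER_OPTS above and the default 30-second TTL, the containers would be started roughly like this:
# skydns listens on the docker bridge IP so containers (and the host) can use it as a resolver
docker run -d -p 172.17.42.1:53:53/udp --name skydns crosbymichael/skydns -nameserver 8.8.8.8:53 -domain docker
# skydock watches the docker events API and registers/removes services in skydns
docker run -d -v /var/run/docker.sock:/docker.sock --name skydock crosbymichael/skydock -ttl 30 -environment dev -s /docker.sock -domain docker -name skydns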
Set up the host's resolv.conf
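On Ubuntu 12.04 with resolvconf installed, pointing the host at the docker bridge nameserver looks roughly like this (paths differ if resolvconf is not managing /etc/resolv.conf):
# add the docker bridge IP as a nameserver at the top of the host's resolv.conf
echo "nameserver 172.17.42.1" >> /etc/resolvconf/resolv.conf.d/head
resolvconf -u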
Test Base Config
Everything starts out resolving fine.
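A quick sanity check; the service names here are assumptions, since skydock derives them from the image name plus the -environment and -domain flags:
# query skydns directly for a couple of running containers
dig +short @172.17.42.1 mysql.dev.docker
dig +short @172.17.42.1 apache.dev.docker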
Introduce Heavy Load
Keep spinning up gitlab containers to thrash the machine until DNS breaks.
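The gitlab image used isn't recorded here, so the loop below uses a placeholder image name; the point is just to launch enough memory-hungry containers to exhaust RAM and swap:
# keep launching heavyweight containers until the box starts swapping hard
for i in $(seq 1 20); do
    docker run -d --name gitlab-$i some/gitlab-image   # placeholder image name
    sleep 5
done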
Skydock logs
Skydns logs
skydns goes zombie
Docker Daemon Logs
ps aux
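The captured output isn't included in this copy; the zombie can be spotted with something like:
# a leading 'Z' in the STAT column (field 8 of ps aux) marks defunct/zombie processes
ps aux | awk 'NR==1 || $8 ~ /^Z/'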
May be related to the kernel version
curl skydns
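skydns v1 exposes an HTTP API (port 8080 by default) that can be queried for a registered service by UUID; both the skydns address and the UUID below are placeholders, since the API port is not published in the run commands sketched above:
# a 404 here means the record has expired out of skydns even though the container is still running
curl -s http://<skydns-ip>:8080/skydns/services/<container-uuid>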
Repeat Everything with TTL=300 seconds
Same final result, but it took more load and time before the system hung long enough for problems to occur.
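Assuming the skydock invocation sketched under Start Containers, the only change for this round is the -ttl flag:
# replace skydock with a 300-second TTL instead of the default 30
docker stop skydock && docker rm skydock
docker run -d -v /var/run/docker.sock:/docker.sock --name skydock crosbymichael/skydock -ttl 300 -environment dev -s /docker.sock -domain docker -name skydns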
Skydock Log
Skydns Log