overlay network stops working after stack down/up cycles (possible race condition or locking issue) #2081

Open
jcmcote opened this issue Feb 19, 2018 · 8 comments

jcmcote commented Feb 19, 2018

Following these steps, you can reproduce the issue in a matter of minutes. All you need is a cluster of 2 nodes.

create a manager node

docker-machine create --driver virtualbox manager
docker-machine ssh manager

add debug setting

echo '{ "debug": true }' > /etc/docker/daemon.json

get dockerd to reload the config

kill -HUP $(pidof dockerd)

check log for releasing of overlay network

tail -f /var/log/docker.log | grep 'releasing IPv4 pools'

start another terminal and do the steps above for a worker node

start another terminal and init the swarm manager

eval $(docker-machine env manager)
docker swarm init --advertise-addr 192.168.99.103

make the worker join the swarm

eval $(docker-machine env worker)
docker swarm join --token SWMTKN-1-2duh1guir5ywynuyz2p4w2 192.168.99.103:2377

you should now have a 2 node cluster

eval $(docker-machine env manager)
docker node ls
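The swarm setup above can also be scripted so the join token does not have to be copied by hand. This is a minimal sketch, not the attached init-swarm script: it assumes both machines already exist under the names passed in, and the `DM` variable and the helper names `on_node` and `init_swarm` are hypothetical.

```shell
#!/bin/sh
# Sketch of a two-node swarm bootstrap with docker-machine.
# DM is a hypothetical override for the docker-machine binary.
DM="${DM:-docker-machine}"

# on_node NODE ARGS...: run `docker ARGS...` against NODE's daemon.
# The subshell keeps the docker-machine environment from leaking out.
on_node() {
    node="$1"
    shift
    (
        eval "$($DM env "$node")"
        docker "$@"
    )
}

# init_swarm MANAGER WORKER: init a swarm on MANAGER and join WORKER to it.
init_swarm() {
    manager_ip=$($DM ip "$1") || return 1
    on_node "$1" swarm init --advertise-addr "$manager_ip" || return 1
    # Ask the manager for the worker token instead of pasting it.
    token=$(on_node "$1" swarm join-token -q worker) || return 1
    on_node "$2" swarm join --token "$token" "${manager_ip}:2377" || return 1
    on_node "$1" node ls
}
```

Usage would be `init_swarm manager worker` after both `docker-machine create` calls have finished.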

run this until the worker log indicates it did not release the overlay network as it should

./up-and-down.sh

Monitor the nodes dockerd logs

tail -f /var/log/docker.log | grep 'releasing IPv4 pools'

You'll notice both nodes release the overlay network, but sometimes (after a few cycles) the worker node does not release it, and then you're in a state where the two nodes do not use the same overlay network ID. At this point the services are unable to ping each other.

Files needed

up-and-down.sh script brings up and down the stack
ping.sh used to ping other service in the overlay network
Dockerfile create an image and put the ping.sh script into it
docker-stack.yml services to deploy to the swarm

files.tar.gz

@jcmcote jcmcote changed the title overlay network stops working after stack down/up cycles overlay network stops working after stack down/up cycles (possible race condition or locking issue) Feb 20, 2018

jcmcote commented Feb 20, 2018

Related to #1765

@selansen
Collaborator

@jcmcote, which Docker CE version are you using? I don't see version information here.

@selansen
Collaborator

I am using 18.02 to reproduce this issue.
When I try to reproduce it, I see the error below:
failed to create service x_serva: Error response from daemon: network x_mynet not found
Creating service x_serva
I think the script needs modification. There is no delay between "docker stack down x" and "docker stack deploy -c docker-stack.yml x". In general, you need to wait until all cleanup is done before you redeploy. Log messages like the ones below indicate that cleanup takes time:
"Feb 20 14:54:06 ELANGO-CE18-2-ubuntu-0 dockerd[21841]: time="2018-02-20T14:54:06.877864376-08:00" level=debug msg="Sending kill signal 15 to container 70060435bcaa63195e5b36051eee7da01c7005676832d9f6747c219acdf08f43"
Feb 20 14:54:08 ELANGO-CE18-2-ubuntu-0 dockerd[21841]: time="2018-02-20T14:54".

@mavenugo mentioned in another issue how scripting this the right way avoids these kinds of problems. I am digging through old issues to find it and will update again soon.
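One way to add the missing wait is to poll until the stack's network is gone before redeploying. This is a minimal sketch, not the script @mavenugo referred to: the helper names and the 60-second default timeout are assumptions, and the stack name `x` and network `x_mynet` are taken from the error message above. It uses `docker stack rm`, for which `docker stack down` is an alias.

```shell
#!/bin/sh
# Sketch: wait for a stack's network to disappear before redeploying.

# wait_for_network_gone NETWORK [TIMEOUT_SECONDS]
# Returns 0 once `docker network ls` no longer lists NETWORK, 1 on timeout.
# Note: the name filter matches substrings, so pick a distinctive name.
wait_for_network_gone() {
    net="$1"
    timeout="${2:-60}"
    waited=0
    while [ -n "$(docker network ls --quiet --filter "name=${net}")" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# redeploy_stack STACK COMPOSE_FILE NETWORK
# Removes the stack, waits for its network to go away, then deploys again.
redeploy_stack() {
    docker stack rm "$1" || return 1
    wait_for_network_gone "$3" || return 1
    docker stack deploy -c "$2" "$1"
}
```

For the stack in this issue the call would be `redeploy_stack x docker-stack.yml x_mynet`. The check only covers the node the client talks to, so it would not by itself prove that the worker has finished its own cleanup.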


jcmcote commented Feb 21, 2018

@selansen I'm using Docker version 18.02.0-ce, the latest version used by docker-machine with the virtualbox driver.

When I deploy, there is an error saying the network is not yet created. That's fine: it should fail if it can't deploy just yet (after a teardown). However, it will eventually deploy with no errors, and you'll be in a state where the overlay networks are not cleaned up correctly.

I have reproduced this issue by aggressively deploying (not waiting for stack to come down) but my hunch is that this race condition issue is what we've been experiencing occasionally. Sometimes after an update to our stack (some services or network are changed) we get into a situation where some services can't ping or resolve each other's IP addresses.

I'm hoping someone will be able to use this scenario to explore potential race conditions in the docker network code that might show up occasionally under normal (less aggressive) conditions.

The point is that when we deploy aggressively after a teardown, it reports an error, which again is fine. But then the system thinks all is OK and the deploy returns successfully while leaving the overlay network in an inconsistent state. Why does the system report a successful deploy if it's not ready to deploy?

@selansen
Collaborator

May I know how long, or how many iterations, it takes for you to get into this state?

I have been running the same script for almost 45 minutes and I am still able to ping between the two containers.


jcmcote commented Feb 22, 2018

It does not take too long (about 10 minutes). But you have to monitor the releases and stop the script as soon as one is missing. If you don't, the script will bring things down and up again.

However, if you stop when you see a missing release of the overlay network, you'll notice you can't ping and never will be able to (the 2 nodes will not have the same overlay network ID).
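The mismatch can be checked directly by asking each node's daemon for the network ID. This is a minimal sketch under assumptions: the node names `manager` and `worker` and the network name `x_mynet` come from earlier in this thread, while `DM`, `network_id`, and `same_network_id` are hypothetical names.

```shell
#!/bin/sh
# Sketch: compare the overlay network ID seen by two docker-machine nodes.
# DM is a hypothetical override for the docker-machine binary.
DM="${DM:-docker-machine}"

# network_id NODE NETWORK: print NETWORK's ID as reported by NODE's daemon,
# or nothing if that daemon does not know the network.
network_id() {
    (
        eval "$($DM env "$1")"
        docker network inspect --format '{{.Id}}' "$2" 2>/dev/null
    )
}

# same_network_id NETWORK NODE_A NODE_B
# Prints both IDs and succeeds only if they are non-empty and equal.
same_network_id() {
    id_a=$(network_id "$2" "$1")
    id_b=$(network_id "$3" "$1")
    echo "$2: ${id_a:-<none>}"
    echo "$3: ${id_b:-<none>}"
    [ -n "$id_a" ] && [ "$id_a" = "$id_b" ]
}
```

Running `same_network_id x_mynet manager worker` while the stack is up would print both IDs and return non-zero when they differ. A worker only reports the network while it runs a task attached to it, so an empty result on the worker is not necessarily the bug.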


jcmcote commented Feb 22, 2018

I've modified the up-and-down script. It now counts the number of releases in the manager and the worker. If the counts are not equal it will stop.
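The counting check described here could look roughly like the following. This is a sketch and not the attached up-and-down script: the log path and the `releasing IPv4 pools` message come from the steps at the top of this issue, while `DM` and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare how many IPv4 pool releases each node has logged.
# DM is a hypothetical override for the docker-machine binary.
DM="${DM:-docker-machine}"

# count_releases: count lines on stdin that mention an IPv4 pool release.
# grep -c exits non-zero when the count is 0, so mask that status.
count_releases() {
    grep -c 'releasing IPv4 pools' || true
}

# node_release_count NODE: count releases in NODE's dockerd log.
node_release_count() {
    $DM ssh "$1" cat /var/log/docker.log | count_releases
}

# releases_match NODE_A NODE_B
# Prints both counts and succeeds only if they are equal.
releases_match() {
    count_a=$(node_release_count "$1")
    count_b=$(node_release_count "$2")
    echo "$1=$count_a $2=$count_b"
    [ "$count_a" = "$count_b" ]
}
```

A loop would call `releases_match manager worker || break` after each down/up cycle. The worker may log its release slightly later than the manager, so a short pause before comparing is probably needed to avoid false alarms.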

I'm at 22 iterations and it has not happened yet... It was much easier to reproduce a couple of days ago. I'll keep at it...

I also added an init-swarm script, which includes the steps I use to create my 2-node swarm cluster.

init-swarm.sh.txt
up-and-down.sh.txt


ashish235 commented Apr 4, 2018

I'm having this issue right now. I tried restarting Docker, creating a new swarm, and re-creating the network, but the issue persists.

Docker version:
docker --version
Docker version 17.12.0-ce, build c97c6d6

OS: Ubuntu 16.04

`"Error": "subnet sandbox join failed for "10.0.0.0/24": error creating vxlan interface: file exists",`

I can't even remove a netns file:
rm: cannot remove '/var/run/docker/netns/1-bbosggv6eg': Device or resource busy
