overlay network stops working after stack down/up cycles (possible race condition or locking issue) #2081

jcmcote · 2018-02-19T23:29:28Z

By following these steps you can reproduce the issue in a matter of minutes. All you need is a cluster of 2 nodes.

create a manager node

docker-machine create --driver virtualbox manager
docker-machine ssh manager

add debug setting

echo '{ "debug": true }' > /etc/docker/daemon.json

get dockerd to reload the config

kill -HUP $(pidof dockerd)

check log for releasing of overlay network

tail -f /var/log/docker.log | grep 'releasing IPv4 pools'

start another terminal and do the steps above for a worker node

start another terminal and init the swarm manager

eval $(docker-machine env manager)
docker swarm init --advertise-addr 192.168.99.103

make the worker join the swarm

eval $(docker-machine env worker)
docker swarm join --token SWMTKN-1-2duh1guir5ywynuyz2p4w2 192.168.99.103:2377

you should now have a 2 node cluster

eval $(docker-machine env manager)
docker node ls

run this until the worker log indicates it did not release the overlay network as it should

./up-and-down.sh

Monitor the nodes dockerd logs

tail -f /var/log/docker.log | grep 'releasing IPv4 pools'

You'll notice both nodes release the overlay network, but sometimes (after a few cycles) the worker node does not release it, and then you're in a state where the two nodes do not use the same overlay network id. At this point the services are unable to ping each other.

Files needed

up-and-down.sh script brings up and down the stack
ping.sh used to ping other service in the overlay network
Dockerfile create an image and put the ping.sh script into it
docker-stack.yml services to deploy to the swarm

files.tar.gz

jcmcote · 2018-02-20T15:57:46Z

Related to #1765

selansen · 2018-02-20T17:01:54Z

@jcmcote, what Docker CE version do you use? I don't see version information here.

selansen · 2018-02-20T23:18:21Z

I am using 18.02 to reproduce this issue.
When I try to reproduce it, I see the error below:
failed to create service x_serva: Error response from daemon: network x_mynet not found
Creating service x_serva
I think the script needs modification. There is no delay between "docker stack down x" and "docker stack deploy -c docker-stack.yml x". In general we need to wait until all cleanup is done before redeploying. Log messages like the ones below indicate that cleanup takes time.
"Feb 20 14:54:06 ELANGO-CE18-2-ubuntu-0 dockerd[21841]: time="2018-02-20T14:54:06.877864376-08:00" level=debug msg="Sending kill signal 15 to container 70060435bcaa63195e5b36051eee7da01c7005676832d9f6747c219acdf08f43"
Feb 20 14:54:08 ELANGO-CE18-2-ubuntu-0 dockerd[21841]: time="2018-02-20T14:54".
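
One way to add such a delay, as a minimal sketch rather than a tested fix: poll until the stack's network no longer shows up before redeploying. The helper name wait_gone and the WAIT_TRIES / WAIT_INTERVAL variables are made up for illustration; the stack name x and network name x_mynet are taken from the error above.

```shell
#!/bin/sh
# wait_gone CMD...: re-run CMD until it prints nothing, sleeping
# WAIT_INTERVAL seconds (default 1) between attempts and giving up
# after WAIT_TRIES attempts (default 60). Hypothetical helper.
wait_gone() {
    tries=${WAIT_TRIES:-60}
    while [ "$tries" -gt 0 ]; do
        # Empty output means the thing we are waiting on has disappeared.
        [ -z "$("$@")" ] && return 0
        tries=$((tries - 1))
        sleep "${WAIT_INTERVAL:-1}"
    done
    return 1
}

# Assumed usage with the names from the error message:
#   docker stack down x
#   wait_gone docker network ls --filter name=x_mynet -q &&
#       docker stack deploy -c docker-stack.yml x
```

This only checks what the node you are talking to reports, so it does not prove that every node has finished its own cleanup.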

@mavenugo mentioned in another issue how scripting this the right way avoids these kinds of issues. I am digging through old issues to find it and will update again soon.

jcmcote · 2018-02-21T00:58:26Z

@selansen I'm using docker version 18.02.0-ce, the latest version used by docker-machine with the virtualbox driver.

When I deploy there is an error saying the network is not yet created. That's fine. It should fail if it can't deploy just yet (after a tear down). However, it will eventually deploy with no errors and you'll be in a state where the overlay networks are not cleaned up correctly.

I have reproduced this issue by aggressively deploying (not waiting for stack to come down) but my hunch is that this race condition issue is what we've been experiencing occasionally. Sometimes after an update to our stack (some services or network are changed) we get into a situation where some services can't ping or resolve each other's IP addresses.

I'm hoping someone will be able to use this scenario to explore potential race conditions in the docker network code that might show up occasionally under normal (less aggressive) situations.

The point is that when we deploy aggressively after a tear down it reports an error, which again is fine. But then the system thinks all is ok and the deploy returns successfully while leaving the overlay network in an inconsistent state. Why does the system report a successful deploy if it's not ready to deploy?

selansen · 2018-02-21T21:34:55Z

May I know how long, or how many iterations, it takes for you to get into this state?

I have been running the same script for almost 45 mins and I am still able to ping between the two containers.

jcmcote · 2018-02-22T01:56:57Z

It does not take too long (about 10 min). But you have to monitor the releases and stop the script as soon as a release is missing. If you don't, the script will bring things down and up again.

However, if you stop when you see a missing release of the overlay network, you'll notice you can't ping and will never be able to (the 2 nodes will not have the same overlay network id).

jcmcote · 2018-02-22T03:44:36Z

I've modified the up-and-down script. It now counts the number of releases in the manager and the worker. If the counts are not equal it will stop.
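
For readers without the attachment, the counting check could look roughly like the sketch below. This is not the attached script; the function names are made up for illustration, and it assumes each node's /var/log/docker.log has already been copied to a local file.

```shell
#!/bin/sh
# Count the 'releasing IPv4 pools' lines in one dockerd debug log file.
count_releases() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask that.
    grep -c 'releasing IPv4 pools' "$1" || true
}

# Succeed only when both log files show the same number of releases.
releases_match() {
    [ "$(count_releases "$1")" -eq "$(count_releases "$2")" ]
}

# Assumed usage inside the up/down loop:
#   docker-machine ssh manager "cat /var/log/docker.log" > manager.log
#   docker-machine ssh worker  "cat /var/log/docker.log" > worker.log
#   releases_match manager.log worker.log || { echo "release missing"; exit 1; }
```

Comparing totals is a coarse check: it flags a missing release but does not say which cycle it belonged to.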

I'm at 22 iterations and it has not happened yet... It was much easier to reproduce a couple of days ago. I'll keep at it...

I also added an init-swarm script which includes the steps I use to create my 2-node swarm cluster.

init-swarm.sh.txt
up-and-down.sh.txt

ashish235 · 2018-04-04T11:18:52Z

Having this issue right now. Tried restarting docker, creating a new swarm, and re-creating the network; the issue still exists.

Using docker version -
docker --version
Docker version 17.12.0-ce, build c97c6d6

OS - Ubuntu 16.04

"Error": "subnet sandbox join failed for "10.0.0.0/24": error creating vxlan interface: file exists",

I can't even remove a netns file:
rm: cannot remove '/var/run/docker/netns/1-bbosggv6eg': Device or resource busy

jcmcote changed the title ~~overlay network stops working after stack down/up cycles~~ overlay network stops working after stack down/up cycles (possible race condition or locking issue) Feb 20, 2018

fcrisciani assigned selansen Feb 20, 2018
