Connection to VIP fails exactly every other time #26946

Closed
colinmollenhour opened this Issue Sep 27, 2016 · 19 comments


@colinmollenhour
colinmollenhour commented Sep 27, 2016 edited

I've been using a three-node cluster of bare-metal machines (packet.net) to test Docker and different versions of MySQL. After much frustration with networking "not working", I finally figured out the issue was with the VIP, and I can consistently reproduce the failed connection. Here is my setup (the steps may not be exactly right, since this all took place over several days):

Setup nodes: test1, test2, test3

test1 # docker swarm init --advertise-addr bond0:0
test2 # docker swarm join --token SWMTKN-1-5ob... --advertise-addr bond0:0 test1:2377
test3 # docker swarm join --token SWMTKN-1-5ob... --advertise-addr bond0:0 test1:2377
test1  # docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
2lhelr6zmx9421j6b512pbd1v    test3     Ready   Active
4ymw90spyob7xmb5nqnzg3rch    test2     Ready   Active
b2oo4lplrbeiei5rw7kb4tuht *  test1     Ready   Active        Leader
test1 # docker network create --driver overlay mytest
test1 # docker service create --replicas 1 --constraint 'node.hostname==test2' --name mysql55 \
  --network mytest -e MYSQL_ROOT_PASSWORD=pass mysql:5.5
test1 # docker service create --replicas 2 --constraint 'node.hostname!=test2' --name php-fpm \
  --network mytest php:fpm
# installed the dnsutils and mysql-client packages in the php-fpm containers for testing
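(A sketch of how those packages might have been installed inside each php-fpm task, assuming the Debian-based php:fpm image; the task-ID suffixes are elided just like above:)

test1 # docker exec php-fpm.2.... bash -c 'apt-get update && apt-get install -y dnsutils mysql-client'
test3 # docker exec php-fpm.1.... bash -c 'apt-get update && apt-get install -y dnsutils mysql-client'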

From test1 machine (which is swarm manager):

test1 # docker exec -it php-fpm.2.... bash
-----------------------
root@363eac7aa037:/var/www# nslookup mysql55
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   mysql55
Address: 10.0.0.9
root@363eac7aa037:/var/www# mysql -h mysql55 -ppass -Be 'select 1'
1
1
root@363eac7aa037:/var/www# mysql -h mysql55 -ppass -Be 'select 1'
1
1

From test3 machine:

test3 # docker exec -it php-fpm.1.... bash
-----------------------
root@363eac7aa037:/var/www# nslookup mysql55
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   mysql55
Address: 10.0.0.9
root@156e32ec3853:/var/www# mysql -h mysql55 -ppass -Be 'select 1'
1
1
root@156e32ec3853:/var/www# mysql -h mysql55 -ppass -Be 'select 1'
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.9' (111)
root@156e32ec3853:/var/www# mysql -h mysql55 -ppass -Be 'select 1'
1
1
root@156e32ec3853:/var/www# mysql -h mysql55 -ppass -Be 'select 1'
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.9' (111)
root@156e32ec3853:/var/www# mysql -h 10.0.0.9 -ppass -Be 'select 1'
1
1
root@156e32ec3853:/var/www# mysql -h 10.0.0.9 -ppass -Be 'select 1'
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.9' (111)
root@156e32ec3853:/var/www# mysql -h 10.0.0.9 -ppass -Be 'select 1'
1
1
root@156e32ec3853:/var/www# mysql -h 10.0.0.9 -ppass -Be 'select 1'
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.9' (111)
root@156e32ec3853:/var/www# nslookup tasks.mysql55
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   tasks.mysql55
Address: 10.0.0.12
root@156e32ec3853:/var/www# mysql -h 10.0.0.12 -ppass -Be 'select 1'
1
1
root@156e32ec3853:/var/www# mysql -h 10.0.0.12 -ppass -Be 'select 1'
1
1
root@156e32ec3853:/var/www# mysql -h 10.0.0.12 -ppass -Be 'select 1'
1
1

So the network connection fails every other time when using "mysql55" or "10.0.0.9", but it succeeds every time when using the container IP "10.0.0.12". The result is the same with a PHP script using the PHP MySQL client, so it is not the mysql client binary. DNS resolution is not the issue either, since the same failure occurs when using the IP address directly; the problem is with the VIP.
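For anyone reproducing this, a quick way to confirm that 10.0.0.9 is the service VIP and 10.0.0.12 is a task address (a sketch; names match the setup above, and the --format path is an assumption about the service inspect output):

test1 # docker service inspect --format '{{json .Endpoint.VirtualIPs}}' mysql55
root@156e32ec3853:/var/www# nslookup tasks.mysql55   # resolves to the task IPs, bypassing the VIP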

Output of docker version:

~ # docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Output of docker info:

~ # docker info
Containers: 27
 Running: 2
 Paused: 0
 Stopped: 25
Images: 16
Server Version: 1.12.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 154
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge null overlay
Swarm: active
 NodeID: b2oo4lplrbeiei5rw7kb4tuht
 Is Manager: true
 ClusterID: e23rayi2kifksru4ickovjy9r
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.99.159.129
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.2.0-36-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.784 GiB
Name: test1
ID: GUJH:7NXP:OLTV:MADS:WGKO:24JC:7QU3:CAJW:YETD:YF4Y:EFRN:4YYQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: bssllc
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=generic
 role=app
 specs=type0
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
Bare-metal on packet.net.

@cpuguy83
Contributor

I have a feeling the LB is routing between two containers and for some reason the node can't reach one of them.

ping @mrjana

@colinmollenhour

@cpuguy83 Any idea how I would go about testing your hunch?
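One way to poke at that hunch (a sketch, not something from this thread; assumes ipvsadm is installed on the host and that the overlay namespaces live under /var/run/docker/netns) is to inspect the IPVS table that backs the VIP from inside the network namespace on the failing node:

test3 # ls /var/run/docker/netns
test3 # nsenter --net=/var/run/docker/netns/<overlay-ns> ipvsadm -ln

If the table lists a real-server entry for a task that no longer exists or lives on an unreachable node, round-robin would send every other connection to the dead backend, which would match the exactly-every-other-time failure.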

@mrjana
Contributor
mrjana commented Sep 27, 2016

@colinmollenhour Can you post docker service ps mysql55 and the daemon logs from the node where you are having the issue?

@aluzzardi
Contributor
aluzzardi commented Sep 27, 2016 edited

@colinmollenhour What's bond0:0?

May we see an ifconfig bond0:0 on all machines?

Also, why haven't you used --advertise-addr bond0:0 on docker swarm init as well?

@colinmollenhour
colinmollenhour commented Sep 28, 2016 edited

@aluzzardi bond0 is the public interface and bond0:0 is the private interface; that's just how packet.net names their interfaces, but they are akin to eth0 and eth1. The exact command I used is no longer in my history, but I think I did use --advertise-addr with the private IP. The docker swarm join-token worker command does show the private IP:

~ # docker swarm join-token worker
To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-5oblfvj..... \
    10.99.159.129:2377
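As a quick sanity check (a sketch, assuming the interface naming above), the advertised address can be compared against the private interface on each node:

~ # docker info | grep 'Node Address'
~ # ifconfig bond0:0 | grep 'inet '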

I created a new network "mytest2" and tried the same steps using new services "mysql56" and "php-fpm56". I stopped the host docker service on all three machines and started them in the order test1, test2, test3. At first the test1:php-fpm56 container could reach test2:mysql56, but test3:php-fpm56 could not reach test2:mysql56 and also could not resolve "mysql56" or "tasks.mysql56". However, as I was collecting the log data and pasting it here, it seems to have healed, and now I can access test2:mysql56 from both php-fpm56 containers... I'd guess it took at least 5-10 minutes. I'll check the logs immediately after it happens if it pops up again.

test1

~ # docker service ps mysql56
ID                         NAME           IMAGE      NODE   DESIRED STATE  CURRENT STATE               ERROR
045wb5x652x5uerfh58rjk046  mysql56.1      mysql:5.6  test2  Running        Running 3 minutes ago
0y4pdujeq7dgko4o9q33vyq5y   \_ mysql56.1  mysql:5.6  test2  Shutdown       Complete 3 minutes ago
1pd9p5526b45h3rk0hnadsw11   \_ mysql56.1  mysql:5.6  test2  Shutdown       Shutdown 12 minutes ago
brjjdoz9o0yd5lby1kz1t8pg6   \_ mysql56.1  mysql:5.6  test2  Shutdown       Shutdown about an hour ago
c7xuixg7wlzlf2lg4y3an0lys   \_ mysql56.1  mysql:5.6  test2  Shutdown       Shutdown 2 hours ago

journalctl -u docker.service

Sep 27 20:29:09 test1 systemd[1]: Started docker.service.
Sep 27 20:29:09 test1 docker[11688]: time="2016-09-27T20:29:09.154731257Z" level=info msg="libcontainerd: new containerd process, pid: 11703"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.360724259Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.361696869Z" level=warning msg="Your kernel does not support swap memory limit."
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.361949624Z" level=warning msg="mountpoint for pids not found"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.362405875Z" level=info msg="Loading containers: start."
Sep 27 20:29:10 test1 docker[11688]: .......time="2016-09-27T20:29:10.378390716Z" level=info msg="Firewalld running: false"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.610138109Z" level=info msg="Removing stale sandbox 036451c0b2a84ff90faa534b301b60dac6eb43aab84e33f0443955c49206b47f (ingress-sbox)"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.630547290Z" level=info msg="Removing stale endpoint gateway_ingress-sbox (11004f5faa1d8e360a281a68a861e35a89a31dc7fccf995749ac897e603262e6)"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.633547858Z" level=info msg="Fixing inconsistent endpoint_cnt for network docker_gwbridge. Expected=0, Actual=1"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.690624047Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.768906239Z" level=info msg="Loading containers: done."
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.843640685Z" level=error msg="controller resolution failed" module=worker task.id=0h3723ptk6k48u03sf7h6tsow
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.843635745Z" level=info msg="Listening for connections" addr="[::]:2377" proto=tcp
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.843700674Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844332641Z" level=error msg="controller resolution failed" module=worker task.id=3fju13h1el7kkdem9jvc34ok1
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844383590Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844649914Z" level=error msg="controller resolution failed" module=worker task.id=4r05q0i7i3c7p1yv75gcotb6r
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844695603Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844810051Z" level=error msg="controller resolution failed" module=worker task.id=5d4t6sku5l65uxae46d321rgb
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844847740Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844961678Z" level=error msg="controller resolution failed" module=worker task.id=5i2z01eu0b1ne9glnxm27uwtt
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.844993647Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.845100885Z" level=error msg="controller resolution failed" module=worker task.id=5tvhruwlycpr3swpln797n0t1
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.845133744Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.845809941Z" level=error msg="controller resolution failed" module=worker task.id=8lyck4z90bgoyxb845e003b7f
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.845856940Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.845975097Z" level=error msg="controller resolution failed" module=worker task.id=8nyxm2py70vmxbssx4nl66009
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846019446Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846216082Z" level=error msg="controller resolution failed" module=worker task.id=8rgzt9e9mplecuciz1zsclogw
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846249822Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846367579Z" level=error msg="controller resolution failed" module=worker task.id=9agqck8n5968ds0ph7x04ywla
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846403238Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846525386Z" level=error msg="controller resolution failed" module=worker task.id=9gg3c1av6f2f2o51z41fek6pd
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.846557925Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.847113504Z" level=error msg="controller resolution failed" module=worker task.id=ejb2jofz2n2gxty62xn6rt7i9
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.847156993Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.847274440Z" level=error msg="controller resolution failed" module=worker task.id=emb7296tgnjbbm2j2ab7f7iws
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.847318700Z" level=error msg="failed to start taskManager" error="invalid volume mount source, must not be an absolute path: /var/www" module=worker
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.857545620Z" level=info msg="Listening for local connections" addr="/var/lib/docker/swarm/control.sock" proto=unix
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.952729786Z" level=info msg="667660f3b01e40a2 became follower at term 11"
Sep 27 20:29:10 test1 docker[11688]: time="2016-09-27T20:29:10.952811374Z" level=info msg="newRaft 667660f3b01e40a2 [peers: [], term: 11, commit: 6742, applied: 0, lastindex: 6742, lastterm: 11]"
Sep 27 20:29:15 test1 docker[11688]: time="2016-09-27T20:29:15.907125547Z" level=error msg="agent: session failed" error="session initiation timed out" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.843337179Z" level=info msg="667660f3b01e40a2 is starting a new election at term 11"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.843451987Z" level=info msg="667660f3b01e40a2 became candidate at term 12"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.843483106Z" level=info msg="667660f3b01e40a2 received vote from 667660f3b01e40a2 at term 12"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.843528735Z" level=info msg="667660f3b01e40a2 became leader at term 12"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.843562304Z" level=info msg="raft.node: 667660f3b01e40a2 elected leader 667660f3b01e40a2 at term 12"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.868789816Z" level=error msg="sending task status update failed" error="rpc error: code = 3 desc = session invalid" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.868892964Z" level=error msg="failed reporting status to agent" error="rpc error: code = 3 desc = session invalid" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.869943242Z" level=error msg="sending task status update failed" error="rpc error: code = 3 desc = session invalid" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.870043470Z" level=error msg="failed reporting status to agent" error="rpc error: code = 3 desc = session invalid" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.871073779Z" level=error msg="sending task status update failed" error="rpc error: code = 3 desc = session invalid" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.871161917Z" level=error msg="failed reporting status to agent" error="rpc error: code = 3 desc = session invalid" module=agent
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.874129986Z" level=info msg="Initializing Libnetwork Agent Local-addr=10.99.159.129 Adv-addr=10.99.159.129 Remote-addr ="
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.874299553Z" level=info msg="Initializing Libnetwork Agent Local-addr=10.99.159.129 Adv-addr=10.99.159.129 Remote-addr ="
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.875018598Z" level=info msg="Daemon has completed initialization"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.875090956Z" level=info msg="Docker daemon" commit=23cf638 graphdriver=aufs version=1.12.1
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.875609676Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=5tvhruwlycp
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.888439092Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=0h3723ptk6k
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.890545779Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=8nyxm2py70v
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.898430517Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=3fju13h1el7
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.902507423Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=8rgzt9e9mpl
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.903425304Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=emb7296tgnj
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.906387323Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=ejb2jofz2n2
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.907257105Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=5d4t6sku5l6
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.908124058Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=9agqck8n596
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.910126246Z" level=info msg="API listen on /var/run/docker.sock"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.910596767Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=4r05q0i7i3c
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.911108626Z" level=info msg="API listen on [::]:2376"
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.911627466Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=9gg3c1av6f2
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.913954818Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=5i2z01eu0b1
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.915615384Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=b2oo4lplrbeiei5rw7kb4tuht node.session=evfl5mwb1d5y5c37gfa3p8389 task.id=8lyck4z90bg
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.928957110Z" level=error msg="Could not open netlink handle during vni population for ns /var/run/docker/netns/1-akc4lvxut0: failed to set into network namespace 28 while creating netlink socket: in
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.966294463Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=8lyck4z90bgoyxb845e003b7f
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.968852091Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=emb7296tgnjbbm2j2ab7f7iws
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.969325021Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=5i2z01eu0b1ne9glnxm27uwtt
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.970360520Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=5tvhruwlycpr3swpln797n0t1
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.970464398Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=3fju13h1el7kkdem9jvc34ok1
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.971451608Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=0h3723ptk6k48u03sf7h6tsow
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.971963737Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=9agqck8n5968ds0ph7x04ywla
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.972352899Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=4r05q0i7i3c7p1yv75gcotb6r
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.973068095Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=8rgzt9e9mplecuciz1zsclogw
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.973824089Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=ejb2jofz2n2gxty62xn6rt7i9
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.974257390Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=9gg3c1av6f2f2o51z41fek6pd
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.974615823Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=8nyxm2py70vmxbssx4nl66009
Sep 27 20:29:16 test1 docker[11688]: time="2016-09-27T20:29:16.975058224Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=5d4t6sku5l65uxae46d321rgb
Sep 27 20:29:17 test1 docker[11688]: time="2016-09-27T20:29:17Z" level=info msg="Firewalld running: false"
Sep 27 20:29:17 test1 docker[11688]: time="2016-09-27T20:29:17.361306164Z" level=error msg="Attempting next endpoint for pull after error: unauthorized: authentication required"
Sep 27 20:29:17 test1 docker[11688]: time="2016-09-27T20:29:17.560869307Z" level=error msg="Not continuing with pull after error: Error: image luckyg/php-fpm:latest not found"
Sep 27 20:29:17 test1 docker[11688]: time="2016-09-27T20:29:17.560994835Z" level=error msg="pulling image failed" error="Error: image luckyg/php-fpm:latest not found" module=taskmanager task.id=c5fzb8l8iho9hwugrfllmpwfa
Sep 27 20:29:20 test1 docker[11688]: time="2016-09-27T20:29:20.586220780Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=2lhelr6zmx9421j6b512pbd1v node.session=7w80sq25huxdjb7v20lyddzc8 task.id=8jcdbdkp9k3
Sep 27 20:29:20 test1 docker[11688]: time="2016-09-27T20:29:20.595456851Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=8jcdbdkp9k38hckksklwauh5r
Sep 27 20:29:22 test1 docker[11688]: time="2016-09-27T20:29:22Z" level=info msg="Firewalld running: false"
Sep 27 20:29:22 test1 docker[11688]: time="2016-09-27T20:29:22Z" level=info msg="Firewalld running: false"
Sep 27 20:29:22 test1 docker[11688]: time="2016-09-27T20:29:22Z" level=info msg="Firewalld running: false"
Sep 27 20:31:16 test1 docker[11688]: time="2016-09-27T20:31:16.876593800Z" level=error msg="Bulk sync to node test2 timed out"
Sep 27 20:31:46 test1 docker[11688]: time="2016-09-27T20:31:46.878088077Z" level=error msg="Bulk sync to node test2 timed out"
Sep 27 20:33:46 test1 docker[11688]: time="2016-09-27T20:33:46.876865363Z" level=error msg="Bulk sync to node test2 timed out"
Sep 27 20:36:16 test1 docker[11688]: time="2016-09-27T20:36:16.876835396Z" level=error msg="Bulk sync to node test2 timed out"

test2

journalctl -u docker.service

Sep 27 20:29:12 test2 systemd[1]: Started docker.service.
Sep 27 20:29:12 test2 docker[21382]: time="2016-09-27T20:29:12.291887272Z" level=info msg="libcontainerd: new containerd process, pid: 21395"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.384452088Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.384572430Z" level=warning msg="Your kernel does not support swap memory limit."
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.384649155Z" level=warning msg="mountpoint for pids not found"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.384808148Z" level=info msg="Loading containers: start."
Sep 27 20:29:13 test2 docker[21382]: ........time="2016-09-27T20:29:13.391599986Z" level=info msg="Firewalld running: false"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.448416134Z" level=info msg="Removing stale sandbox 044458a2a3b92a3fea6e59a122fb78fc881a12b0228eadea35306c4740e87473 (ingress-sbox)"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.457047068Z" level=info msg="Removing stale endpoint gateway_ingress-sbox (27a996e1a253586d9f753cac619b8602cf5b8f7dba5aa85a86f63309d1534126)"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.458137953Z" level=info msg="Fixing inconsistent endpoint_cnt for network docker_gwbridge. Expected=0, Actual=1"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.471941364Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Sep 27 20:29:13 test2 docker[21382]: time="2016-09-27T20:29:13.921280376Z" level=info msg="Loading containers: done."
Sep 27 20:29:16 test2 docker[21382]: time="2016-09-27T20:29:16.875584488Z" level=info msg="Initializing Libnetwork Agent Local-addr=10.99.159.131 Adv-addr=10.99.159.131 Remote-addr =10.99.159.129"
Sep 27 20:29:16 test2 docker[21382]: time="2016-09-27T20:29:16.875628873Z" level=info msg="Daemon has completed initialization"
Sep 27 20:29:16 test2 docker[21382]: time="2016-09-27T20:29:16.875644782Z" level=info msg="Docker daemon" commit=23cf638 graphdriver=aufs version=1.12.1
Sep 27 20:29:16 test2 docker[21382]: time="2016-09-27T20:29:16.884797862Z" level=info msg="API listen on /var/run/docker.sock"
Sep 27 20:29:16 test2 docker[21382]: time="2016-09-27T20:29:16.884855362Z" level=info msg="API listen on [::]:2376"
Sep 27 20:29:16 test2 docker[21382]: time="2016-09-27T20:29:16.889540606Z" level=error msg="Could not open netlink handle during vni population for ns /var/run/docker/netns/1-akc4lvxut0: failed to set into network namespace 23 while creating netlink socket: in
Sep 27 20:29:17 test2 docker[21382]: time="2016-09-27T20:29:17Z" level=info msg="Firewalld running: false"
Sep 27 20:29:22 test2 docker[21382]: time="2016-09-27T20:29:22Z" level=info msg="Firewalld running: false"
Sep 27 20:29:22 test2 docker[21382]: time="2016-09-27T20:29:22Z" level=info msg="Firewalld running: false"
Sep 27 20:29:22 test2 docker[21382]: time="2016-09-27T20:29:22Z" level=info msg="Firewalld running: false"
Sep 27 20:31:16 test2 docker[21382]: time="2016-09-27T20:31:16.876628000Z" level=error msg="Bulk sync to node test1 timed out"
Sep 27 20:31:46 test2 docker[21382]: time="2016-09-27T20:31:46.877508826Z" level=error msg="Bulk sync to node test1 timed out"
Sep 27 20:33:46 test2 docker[21382]: time="2016-09-27T20:33:46.876946224Z" level=error msg="Bulk sync to node test1 timed out"

test3

journalctl -u docker.service

Sep 27 20:29:19 test3 systemd[1]: Started docker.service.
Sep 27 20:29:19 test3 docker[5512]: time="2016-09-27T20:29:19.127476937Z" level=info msg="libcontainerd: new containerd process, pid: 5526"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.180087920Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.180174015Z" level=warning msg="Your kernel does not support swap memory limit."
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.180261731Z" level=warning msg="mountpoint for pids not found"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.180500882Z" level=info msg="Loading containers: start."
Sep 27 20:29:20 test3 docker[5512]: ........time="2016-09-27T20:29:20.185023964Z" level=info msg="Firewalld running: false"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.239770969Z" level=info msg="Removing stale sandbox 8c22b9c57a546c262d9865e3e93e755b8d98bb8440712e242e3072b120a91604 (ingress-sbox)"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.249191957Z" level=info msg="Removing stale endpoint gateway_ingress-sbox (fedeaaaf14bca88452c218fa5f6a592f797a3b26ead3b1b989c3dda1494a5f50)"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.250177705Z" level=info msg="Fixing inconsistent endpoint_cnt for network docker_gwbridge. Expected=0, Actual=1"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.265978236Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.291612797Z" level=info msg="Loading containers: done."
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.329128207Z" level=info msg="Initializing Libnetwork Agent Local-addr=10.99.159.133 Adv-addr=10.99.159.133 Remote-addr =10.99.159.129"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.329197927Z" level=info msg="Daemon has completed initialization"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.329216730Z" level=info msg="Docker daemon" commit=23cf638 graphdriver=aufs version=1.12.1
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.331033053Z" level=error msg="network mytest2 remove failed: network mytest2 not found" module=taskmanager task.id=8jcdbdkp9k38hckksklwauh5r
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.331320476Z" level=error msg="remove task failed" error="network mytest2 not found" module=taskmanager task.id=8jcdbdkp9k38hckksklwauh5r
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.586071572Z" level=info msg="API listen on /var/run/docker.sock"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.586112026Z" level=info msg="API listen on [::]:2376"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.592113859Z" level=error msg="Could not open netlink handle during vni population for ns /var/run/docker/netns/2-akc4lvxut0: failed to set into network namespace 19 while creating netlink socket: inv
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20Z" level=info msg="Firewalld running: false"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.762660005Z" level=error msg="Attempting next endpoint for pull after error: unauthorized: authentication required"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.861001696Z" level=error msg="Not continuing with pull after error: Error: image luckyg/php-fpm:latest not found"
Sep 27 20:29:20 test3 docker[5512]: time="2016-09-27T20:29:20.861036067Z" level=error msg="pulling image failed" error="Error: image luckyg/php-fpm:latest not found" module=taskmanager task.id=0pkfehzj3tt11510fqcf8bsfw
Sep 27 20:29:25 test3 docker[5512]: time="2016-09-27T20:29:25Z" level=info msg="Firewalld running: false"
Sep 27 20:29:25 test3 docker[5512]: time="2016-09-27T20:29:25Z" level=info msg="Firewalld running: false"
Sep 27 20:29:25 test3 docker[5512]: time="2016-09-27T20:29:25Z" level=info msg="Firewalld running: false"
@dongluochen
Contributor

Libnetwork uses bulk sync to propagate network DB info between nodes. If bulk sync fails, containers on these nodes may not be able to reach each other. It's not clear to me why bulk sync fails on your nodes. If you search your logs, does it happen often? Is it consistent?

Sep 27 20:31:46 test2 docker[21382]: time="2016-09-27T20:31:46.877508826Z" level=error msg="Bulk sync to node test1 timed out"
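A quick way to gauge how often this is happening (a sketch; assumes systemd/journald logging as in the outputs above):

~ # journalctl -u docker.service --since "2016-09-27" | grep -c "Bulk sync"

A steadily growing count on a pair of nodes would point at the gossip/bulk-sync path between exactly those two hosts.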
@colinmollenhour

@dongluochen No idea why it would fail, but I was having enough issues to make it unusable on a daily basis. However, I have been using 1.12.2-rc1 for a couple of days now with no issues, so it may already be resolved somehow. I will update if it comes back on the testing version.

@aluzzardi
Contributor

@colinmollenhour Yeah, 1.12.2 is packed with networking fixes.

We were going to suggest upgrading to that version to confirm the issue has been resolved, but you're already one step ahead :)

Glad to hear the issue is fixed - could you please keep us updated if the problem ever returns with 1.12.2?

We're focused on making that release rock solid in terms of networking.

/cc @mrjana @mavenugo

@mrjana
Contributor
mrjana commented Oct 12, 2016

Fixed in 1.12.2. @colinmollenhour Feel free to reopen if you think otherwise.

@mrjana mrjana closed this Oct 12, 2016
@bigwhite

@mrjana

After upgrading to 1.12.2, the problem still seems to exist.

I started a two-node swarm cluster on Aliyun (www.aliyun.com); my environment is:

ubuntu 14.04.4

# uname -a
Linux iZ25cn4xxnvZ 3.19.0-70-generic #78~14.04.1-Ubuntu SMP Fri Sep 23 17:39:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# docker version
Client:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 17:00:50 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 17:00:50 2016
 OS/Arch:      linux/amd64

manager: 10.47.136.60
worker:  10.46.181.146

The swarm init process is below:

manager init:

# docker swarm init --advertise-addr 10.47.136.60
Swarm initialized: current node (d3ciu3niwdkgmh8kqpug9p0t6) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-0cfasg75t7ktrvebpc48sxllgxxspzh52kjnqhuxwsbaiokhyc-baj6q7xdf4v1ijtmkjdtktg81 \
    10.47.136.60:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.


 worker node join:

docker swarm join --advertise-addr 10.46.181.146 \
    --token SWMTKN-1-0cfasg75t7ktrvebpc48sxllgxxspzh52kjnqhuxwsbaiokhyc-baj6q7xdf4v1ijtmkjdtktg81 \
    --listen-addr 10.46.181.146:2377  10.47.136.60:2377 

This node joined a swarm as a worker.

create my own overlay network:

# docker network create -d overlay mynet1

root@iZ25cn4xxnvZ:~# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
a0550b0a04e8        bridge              bridge              local
278abdc5e675        docker_gwbridge     bridge              local
c01fe62bffc1        host                host                local
ezeevy9t326v        ingress             overlay             swarm
78xq69wanmpj        mynet1              overlay             swarm
b0ef5c850072        none                null                local

I started an nginx service with two replicas and a client service:

docker service create --replicas 2 --name mynginx --network mynet1 --mount type=bind,source=/root/dockertest/staticcontents,dst=/usr/share/nginx/html,ro=true  nginx:1.10.1



docker service create --replicas 1 --name myclient --network mynet1 registry.cn-hangzhou.aliyuncs.com/mioss/test tail -f /var/log/bootstrap.log

attach to the myclient container:

root@iZ25mjza4msZ:~# docker exec -it a2bafbb71c68 /bin/bash

Pinging the mynginx service works:

# ping mynginx
PING mynginx (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.197 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.094 ms

But when I access the nginx service with curl, problems occur:

root@a2bafbb71c68:/# curl mynginx
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title> title< /title>
</head>
<body>
<p>hello world, i am manager</p>
</body>
</html>
root@a2bafbb71c68:/# curl mynginx
^C

curl only gets a response from the nginx task running on the same node as myclient. When curl hits the nginx task running on the other node, it blocks and finally times out.

The container running the nginx task on the other node is unreachable by ping:

# ping 10.0.0.3
PING 10.0.0.3 (10.0.0.3) 56(84) bytes of data.
^C
--- 10.0.0.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

Pinging the VIP and the container on the same node works:

root@a2bafbb71c68:/# ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.142 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.096 ms
^C
--- 10.0.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.096/0.119/0.142/0.023 ms
root@a2bafbb71c68:/# ping 10.0.0.4
PING 10.0.0.4 (10.0.0.4) 56(84) bytes of data.
64 bytes from 10.0.0.4: icmp_seq=1 ttl=64 time=0.075 ms
^C
--- 10.0.0.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.075/0.075/0.075/0.000 ms
@mrjana
Contributor
mrjana commented Oct 13, 2016

@bigwhite I don't think you are hitting this problem. Can you please make sure you have opened up ports 7946/udp, 7946/tcp, and 4789/udp in your firewall or security group if you are running one? Basically, the gossip cluster is not up and running in your case.
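For reference, on Ubuntu hosts those ports could be opened with something like the following (a sketch; assumes ufw, adjust for whatever host firewall or cloud security group actually sits in front of the nodes):

# ufw allow 2377/tcp   # cluster management
# ufw allow 7946/tcp   # gossip control plane
# ufw allow 7946/udp
# ufw allow 4789/udp   # VXLAN data plane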

@bigwhite

@mrjana Thanks for the reply.

I'm sure there are no firewalls. All internal traffic is allowed and all ports are open. The port status is below:

manager:

root@iZ25cn4xxnvZ:~# netstat -tunlp |grep 7946
tcp6       0      0 :::7946                 :::*                    LISTEN      6220/dockerd
udp6       0      0 :::7946                 :::*                                6220/dockerd
root@iZ25cn4xxnvZ:~# netstat -tunlp |grep 4789
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           -

worker node:

root@iZ25mjza4msZ:~# netstat -an |grep 7946
tcp        0      0 10.46.181.146:51158     10.47.136.60:7946       TIME_WAIT
tcp6       0      0 :::7946                 :::*                    LISTEN
udp6       0      0 :::7946                 :::*
root@iZ25mjza4msZ:~# netstat -an |grep 4789
udp     3840      0 0.0.0.0:4789            0.0.0.0:*
root@iZ25mjza4msZ:~#

Could you tell me how to check whether the gossip cluster is up? In /var/log/syslog I have not seen words like "serf" appear.
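One hedged way to check is to grep the daemon log for the memberlist/gossip lines; on Ubuntu 14.04 with Upstart the daemon log is typically /var/log/upstart/docker.log (the path is an assumption):

# grep -iE 'memberlist|gossip' /var/log/upstart/docker.log

Lines like "Gossip cluster hostname ..." and memberlist join/suspect messages indicate whether the gossip layer ever formed.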

@bigwhite

@mrjana I checked the status of these ports with the nc tool:

on worker node:

root@iZ25mjza4msZ:~#  nc -vu 10.47.136.60 4789
Connection to 10.47.136.60 4789 port [udp/*] succeeded!
^C
root@iZ25mjza4msZ:~# nc -vu 10.47.136.60 7946
Connection to 10.47.136.60 7946 port [udp/*] succeeded!
^C
root@iZ25mjza4msZ:~# nc -v 10.47.136.60 7946
Connection to 10.47.136.60 7946 port [tcp/*] succeeded!


on manager node:

root@iZ25cn4xxnvZ:~# nc -vu 10.46.181.146 4789
Connection to 10.46.181.146 4789 port [udp/*] succeeded!

^C
root@iZ25cn4xxnvZ:~# nc -vu 10.46.181.146 7946
Connection to 10.46.181.146 7946 port [udp/*] succeeded!

^C
root@iZ25cn4xxnvZ:~# nc -v 10.46.181.146 7946
Connection to 10.46.181.146 7946 port [tcp/*] succeeded!

@mrjana
Contributor
mrjana commented Oct 13, 2016

@bigwhite Can you please attach the daemon logs from one of the problem nodes so that I can take a look?

@bigwhite

@mrjana
I restarted the daemon for a fresh environment, then:

1. Create swarm:

manager(10.47.136.60):

# docker swarm init --advertise-addr 10.47.136.60
Swarm initialized: current node (e49xf41bc6kun3um1icyrkbvd) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-4warfcowj6xcyrqzy0t4s3zchlovu667ybs9dzzggarszlil3f-55laitx3fmrgdw7lbyetferct \
    10.47.136.60:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

manager log:

time="2016-10-14T06:03:46.749058615+08:00" level=info msg="Listening for connections" addr="[::]:2377" proto=tcp
time="2016-10-14T06:03:46.749632031+08:00" level=info msg="Listening for local connections" addr="/var/lib/docker/swarm/control.sock" proto=unix
time="2016-10-14T06:03:46.752911453+08:00" level=info msg="315e7100a49d3045 became follower at term 0"
time="2016-10-14T06:03:46.753002835+08:00" level=info msg="newRaft 315e7100a49d3045 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"
time="2016-10-14T06:03:46.753044046+08:00" level=info msg="315e7100a49d3045 became follower at term 1"
time="2016-10-14T06:03:46.753119453+08:00" level=info msg="315e7100a49d3045 is starting a new election at term 1"
time="2016-10-14T06:03:46.753198593+08:00" level=info msg="315e7100a49d3045 became candidate at term 2"
time="2016-10-14T06:03:46.753235479+08:00" level=info msg="315e7100a49d3045 received vote from 315e7100a49d3045 at term 2"
time="2016-10-14T06:03:46.753280612+08:00" level=info msg="315e7100a49d3045 became leader at term 2"
time="2016-10-14T06:03:46.753317981+08:00" level=info msg="raft.node: 315e7100a49d3045 elected leader 315e7100a49d3045 at term 2"
time="2016-10-14T06:03:46.790856855+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.47.136.60 Adv-addr=10.47.136.60 Remote-addr ="
time="2016-10-14T06:03:46.791272706+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.47.136.60 Adv-addr=10.47.136.60 Remote-addr ="
time="2016-10-14T06:03:46.791388096+08:00" level=info msg="Gossip cluster hostname iZ25cn4xxnvZ-30a92b329723"
INFO[0000] Firewalld running: false

worker (10.46.181.146)log:

none

2. Worker node join

# docker swarm join --advertise-addr 10.46.181.146  \
>     --token SWMTKN-1-4warfcowj6xcyrqzy0t4s3zchlovu667ybs9dzzggarszlil3f-55laitx3fmrgdw7lbyetferct \
>     --listen-addr 10.46.181.146:2377  10.47.136.60:2377
This node joined a swarm as a worker.

manager node log:

none

worker node log:

time="2016-10-14T06:06:07.108863540+08:00" level=info msg="Waiting for TLS certificate to be issued..."
time="2016-10-14T06:06:07.115366439+08:00" level=info msg="Downloaded new TLS credentials with role: swarm-worker."
time="2016-10-14T06:06:07.159138241+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=10.46.181.146 Local-addr=10.46.181.146 Adv-addr=10.46.181.146 Remote-addr =10.47.136.60"
time="2016-10-14T06:06:07.159264954+08:00" level=info msg="Gossip cluster hostname iZ25mjza4msZ-e4780e30c9bb"
time="2016-10-14T06:06:07.168271411+08:00" level=error msg="Error getting node 12t3053c3mz0f0niq00217bfi: This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager."
time="2016-10-14T06:06:07.168320999+08:00" level=error msg="Handler for GET /v1.24/nodes/12t3053c3mz0f0niq00217bfi returned error: This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager."
INFO[0000] Firewalld running: false

3. Create the mynet1 network

docker network create -d overlay mynet1

Both the manager node and the worker node show no new log entries.

4. Create service mynginx

# docker service create --replicas 2 --name mynginx --network mynet1 --mount type=bind,source=/root/dockertest/staticcontents,dst=/usr/share/nginx/html,ro=true  nginx:1.10.1
bfde4z61hxw8uf4ld0bazl6pg

manager node:

INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false

worker node:

INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false

5. Create service myclient

root@iZ25cn4xxnvZ:~# docker service create --replicas 1 --name myclient --network mynet1 registry.cn-hangzhou.aliyuncs.com/mioss/test tail -f /var/log/bootstrap.log
9l73xy8fjb1nf0c2scg6xv0v4

manager node:

INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false

worker node:

INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false

6. Attach to the myclient container on the worker node

root@iZ25mjza4msZ:/var/log/upstart#  docker exec -it b1af90176f99  /bin/bash
root@b1af90176f99:/#

Both the manager node and the worker node show no new log entries.

7. curl the mynginx service from the myclient container

root@b1af90176f99:/# curl mynginx
^C
root@b1af90176f99:/# curl mynginx
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title> title < /title>
</head>
<body>
<p>hello world, i am manager</p>
</body>
</html>
root@b1af90176f99:/# curl mynginx
^C
root@b1af90176f99:/# curl mynginx
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title> title < /title>
</head>
<body>
<p>hello world, i am manager</p>
</body>
</html>

The same problem occurs, but both the manager node and the worker node show no new log entries.

I did some pings:

root@b1af90176f99:/# ping mynginx
PING mynginx (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.136 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.161 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.094 ms
^C
--- mynginx ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.094/0.130/0.161/0.029 ms
root@b1af90176f99:/# ping 10.0.0.3
PING 10.0.0.3 (10.0.0.3) 56(84) bytes of data.
^C
--- 10.0.0.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

root@b1af90176f99:/# ping 10.0.0.4
PING 10.0.0.4 (10.0.0.4) 56(84) bytes of data.
64 bytes from 10.0.0.4: icmp_seq=1 ttl=64 time=0.150 ms
64 bytes from 10.0.0.4: icmp_seq=2 ttl=64 time=0.120 ms
^C
--- 10.0.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.120/0.135/0.150/0.015 ms

But again no new logs appear.

After waiting for a minute:

manager log:

time="2016-10-14T06:15:30.791931594+08:00" level=warning msg="2016/10/14 06:15:30 [WARN] memberlist: Was able to reach iZ25mjza4msZ-e4780e30c9bb via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"

But I did see a lot of weird logs in yesterday's docker daemon log:

manager:

time="2016-10-13T09:08:42.553935428+08:00" level=info msg="raft.node: 717e805ce56d0f83 elected leader 717e805ce56d0f83 at term 2"
time="2016-10-13T09:08:42.808145597+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.47.136.60 Adv-addr=10.47.136.60 Remote-addr ="
time="2016-10-13T09:08:42.809587963+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.47.136.60 Adv-addr=10.47.136.60 Remote-addr ="
time="2016-10-13T09:08:42.809695681+08:00" level=info msg="Gossip cluster hostname iZ25cn4xxnvZ-d3df8ddef5d7"
INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false
INFO[0000] Firewalld running: false
time="2016-10-13T09:19:12.810640800+08:00" level=warning msg="2016/10/13 09:19:12 [WARN] memberlist: Was able to reach iZ25mjza4msZ-229ea9552cf9 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
time="2016-10-13T09:31:13.810337335+08:00" level=warning msg="2016/10/13 09:31:13 [WARN] memberlist: Was able to reach iZ25mjza4msZ-229ea9552cf9 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
time="2016-10-13T09:43:14.810540928+08:00" level=warning msg="2016/10/13 09:43:14 [WARN] memberlist: Was able to reach iZ25mjza4msZ-229ea9552cf9 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
time="2016-10-13T10:02:01.892152425+08:00" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=7ahspgfi6r8rfsny22atfmlyu
time="2016-10-13T10:02:01.894714564+08:00" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=d3ciu3niwdkgmh8kqpug9p0t6 node.session=ewzkm7jjv7l2nfh26db6zxrie task.id=7ahspgfi6r8rfsny22atfmlyu
time="2016-10-13T10:02:01.903424827+08:00" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=718x4fozzq04lteliaq9p6k5a node.session=bb1c2v0k24cz5wzz3ao9hh2mz task.id=8qg65qbiazqw20mowrln226l0
time="2016-10-13T10:02:01.905533998+08:00" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=718x4fozzq04lteliaq9p6k5a node.session=bb1c2v0k24cz5wzz3ao9hh2mz task.id=bhv773f6t377dbwi82cs97c6z
time="2016-10-13T10:02:01.931999732+08:00" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=7ahspgfi6r8rfsny22atfmlyu
time="2016-10-13T10:02:01.932102926+08:00" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=8qg65qbiazqw20mowrln226l0
time="2016-10-13T10:02:01.932132347+08:00" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=bhv773f6t377dbwi82cs97c6z
time="2016-10-13T10:02:28.345852041+08:00" level=info msg="Stopping manager"
time="2016-10-13T10:02:28.346339268+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.47.136.60 Adv-addr=10.47.136.60 Remote-addr ="
time="2016-10-13T10:02:28.346478018+08:00" level=info msg="Manager shut down"
time="2016-10-13T10:02:33.365152606+08:00" level=error msg="failed to send node leave: timed out broadcasting node event"
time="2016-10-13T11:46:05.918096101+08:00" level=error msg="Handler for GET /v1.24/images/search returned error: invalid registry endpoint https://gcr.io/v1/: Get https://gcr.io/v1/_ping: dial tcp 74.125.204.82:443: i/o timeout. If this private registry supports only HTTP or HTTPS with an unknown CA certificate, please add `--insecure-registry gcr.io` to the daemon's arguments. In the case of HTTPS, if you have access to the registry's CA certificate, no need for the flag; simply place the CA certificate at /etc/docker/certs.d/gcr.io/ca.crt"
time="2016-10-13T11:48:01.586783072+08:00" level=info msg="Processing signal 'terminated'"
INFO[10095] stopping containerd after receiving terminated

worker:

time="2016-10-13T09:55:15.586557060+08:00" level=warning msg="2016/10/13 09:55:15 [WARN] memberlist: Was able to reach iZ25cn4xxnvZ-d3df8ddef5d7 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
time="2016-10-13T10:02:01.898140147+08:00" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=bhv773f6t377dbwi82cs97c6z
time="2016-10-13T10:02:01.898697933+08:00" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=8qg65qbiazqw20mowrln226l0
INFO[0000] Firewalld running: false
time="2016-10-13T10:02:11.905699988+08:00" level=info msg="Container a2bafbb71c68257f75cc7509fc4c96044a58621a01a5411b4c5b9b8dfb5d2d58 failed to exit within 10 seconds of signal 15 - using the force"
time="2016-10-13T15:01:39.848898897+08:00" level=info msg="Waiting for TLS certificate to be issued..."
time="2016-10-13T15:01:39.856178372+08:00" level=info msg="Downloaded new TLS credentials with role: swarm-worker."
time="2016-10-13T15:01:39.886159930+08:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.46.181.146 Adv-addr=10.46.181.146 Remote-addr =10.47.136.60"
time="2016-10-13T15:01:39.886449774+08:00" level=info msg="Gossip cluster hostname iZ25mjza4msZ-678e253f934d"
time="2016-10-13T15:01:39.893167995+08:00" level=error msg="Error getting node 9xvnq9fy2kx20buk5ybvs5za0: This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager."
time="2016-10-13T15:01:39.893228097+08:00" level=error msg="Handler for GET /v1.24/nodes/9xvnq9fy2kx20buk5ybvs5za0 returned error: This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager."
INFO[0000] Firewalld running: false
time="2016-10-13T15:09:35.460617068+08:00" level=warning msg="2016/10/13 15:09:35 [ERR] memberlist: Decrypt packet failed: Unsupported encryption version 88 from=10.47.136.60:41842\n"
time="2016-10-13T15:09:35.460781705+08:00" level=warning msg="2016/10/13 15:09:35 [ERR] memberlist: Decrypt packet failed: Unsupported encryption version 88 from=10.47.136.60:41842\n"
time="2016-10-13T15:09:36.460705548+08:00" level=warning msg="2016/10/13 15:09:36 [ERR] memberlist: Decrypt packet failed: Unsupported encryption version 88 from=10.47.136.60:41842\n"
time="2016-10-13T15:09:37.460840478+08:00" level=warning msg="2016/10/13 15:09:37 [ERR] memberlist: Decrypt packet failed: Unsupported encryption version 88 from=10.47.136.60:41842\n"
time="2016-10-13T15:09:38.460920498+08:00" level=warning msg="2016/10/13 15:09:38 [ERR] memberlist: Decrypt packet failed: Unsupported encryption version 88 from=10.47.136.60:41842\n"
time="2016-10-13T15:09:39.307263013+08:00" level=warning msg="2016/10/13 15:09:39 [ERR] memberlist: Decrypt packet failed: Unsupported encryption version 10 from=10.47.136.60:41842\n"
time="2016-10-13T15:09:44.713457460+08:00" level=warning msg="2016/10/13 15:09:44 [ERR] memberlist: failed to receive: EOF from=10.47.136.60:47507\n"
time="2016-10-13T15:09:53.887432394+08:00" level=warning msg="2016/10/13 15:09:53 [WARN] memberlist: Was able to reach iZ25cn4xxnvZ-adca83651ced via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
time="2016-10-13T16:04:24.887449744+08:00" level=warning msg="2016/10/13 16:04:24 [WARN] memberlist: Was able to reach iZ25cn4xxnvZ-adca83651ced via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
time="2016-10-13T16:39:36.690443167+08:00" level=error msg="agent: session failed" error="rpc error: code = 1 desc = context canceled" module=agent
time="2016-10-13T16:39:39.523964320+08:00" level=error msg="agent: session failed" error="rpc error: code = 2 desc = grpc: the client connection is closing" module=agent
time="2016-10-13T16:39:39.887732582+08:00" level=info msg="2016/10/13 16:39:39 [INFO] memberlist: Suspect iZ25cn4xxnvZ-adca83651ced has failed, no acks received\n"
time="2016-10-13T16:39:39.888386143+08:00" level=error msg="periodic bulk sync failure for network b9iazxt3w6512hiq6x19kdi7b: bulk sync failed on node iZ25cn4xxnvZ-adca83651ced: failed to send a TCP message during bulk sync: dial tcp 10.47.136.60:7946: getsockopt: connection refused"
time="2016-10-13T16:39:41.887348099+08:00" level=info msg="2016/10/13 16:39:41 [INFO] memberlist: Suspect iZ25cn4xxnvZ-adca83651ced has failed, no acks received\n"
@mrjana
Contributor
mrjana commented Oct 13, 2016

@bigwhite It looks like there is definitely some problem getting through 7946/udp in your cluster. This log line more or less confirms that the problem is in your infrastructure:

time="2016-10-14T06:15:30.791931594+08:00" level=warning msg="2016/10/14 06:15:30 [WARN] memberlist: Was able to reach iZ25mjza4msZ-e4780e30c9bb via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"

nc is not a good tool to verify this, because it only confirms that a UDP packet was sent out; it has no way of knowing whether the packet was actually received by the remote listener. You may have to do some packet dumping (using, say, tcpdump) between your nodes to confirm that the UDP packets are not being dropped somewhere. You also need to do the same thing for 4789/udp, because that is the port the data packets are sent on.
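
For example, a minimal sketch of that kind of check (the interface name and IPs depend on your hosts; on a swarm node the 4789/udp "listener" is the kernel's VXLAN socket, so tcpdump at the NIC is the relevant evidence):

# on the receiving node: watch for the test traffic arriving at the NIC
tcpdump -ni eth0 udp port 4789

# on the sending node: capture locally, then generate a test datagram
tcpdump -ni eth0 udp port 4789 &
echo ping | nc -u <receiver-ip> 4789

If the packet shows up in the sender's capture but never in the receiver's, it is being dropped somewhere in between (host firewall, security group, or the provider's network) rather than by the listener.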

@bigwhite
bigwhite commented Oct 14, 2016 edited

@mrjana

I did some traffic-sniffing tests:

1. 4789/udp test

I ran the sniffer on both the manager node and the worker node:

manager node:
# tcpdump -i eth0 udp|grep 4789
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes

worker node:
# tcpdump -i eth0 udp|grep 4789
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes

Then, from the manager node, I sent traffic to port 4789 on the worker node:

# nc -vu 10.46.181.146 4789
Connection to 10.46.181.146 4789 port [udp/*] succeeded!

tcpdump output:

manager node:


14:01:31.508856 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:31.508886 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:32.509039 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:33.509188 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:34.509344 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.115603 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.274991 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.454160 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.872673 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:03.074618 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:03.254164 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]

worker node:

no output

Conversely, from the worker node, I sent traffic to port 4789 on the manager node:

# nc -vu 10.47.136.60  4789
Connection to 10.47.136.60 4789 port [udp/*] succeeded!

tcpdump output:

manager node:
none

worker node:


14:09:14.352300 IP iZ25mjza4msZ.58761 > 10.47.136.60.4789: [|VXLAN]
14:09:14.352320 IP iZ25mjza4msZ.58761 > 10.47.136.60.4789: [|VXLAN]
14:09:15.352502 IP iZ25mjza4msZ.58761 > 10.47.136.60.4789: [|VXLAN]
14:09:16.352674 IP iZ25mjza4msZ.58761 > 10.47.136.60.4789: [|VXLAN]
14:09:17.352849 IP iZ25mjza4msZ.58761 > 10.47.136.60.4789: [|VXLAN]

Does this prove that there is a problem with the 4789 listener on both nodes?

2. 7946/udp test

I also ran the sniffer on port 7946:

manager node:

13:54:24.160099 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 71
13:54:24.160525 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 49
13:54:24.792061 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 71
13:54:24.792693 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 49
13:54:25.159980 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 71
13:54:25.160547 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 49
13:54:26.792122 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 71
13:54:26.794426 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 49
13:54:27.160025 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 71
13:54:27.160299 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 49
13:54:28.792065 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 71
13:54:28.792664 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 49
13:54:29.160172 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 71
13:54:29.160465 IP iZ25cn4xxnvZ.7946 > 10.46.181.146.7946: UDP, length 49
13:54:30.160249 IP 10.46.181.146.7946 > iZ25cn4xxnvZ.7946: UDP, length 71
... ...


worker node:

13:54:51.159860 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 71
13:54:51.160354 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 49
13:54:51.792048 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 71
13:54:51.792333 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 49
13:54:52.159991 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 71
13:54:52.160538 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 49
13:54:53.792066 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 71
13:54:53.792340 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 49
13:54:54.159834 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 71
13:54:54.160331 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 49
13:54:54.791996 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 71
13:54:54.792318 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 49
13:54:56.159881 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 71
13:54:56.160420 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 49
13:54:56.792085 IP 10.47.136.60.7946 > iZ25mjza4msZ.7946: UDP, length 71
13:54:56.792417 IP iZ25mjza4msZ.7946 > 10.47.136.60.7946: UDP, length 49

7946/udp seems OK on both nodes.

@dongluochen
Contributor

From your tcpdump output it looks like UDP port 4789 is not open between the manager and the worker. If UDP port 4789 were open between them, you would see the same packets on both the manager and the worker. I'm not familiar with Aliyun's network firewall settings. In AWS you need to open the following TCP/UDP ports in the VPC settings; in Azure there is an equivalent network firewall setting.

Protocol    Port    Description
udp         4789    Data plane (VXLAN)
tcp/udp     7946    Control plane
manager node:

14:01:31.508856 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:31.508886 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:32.509039 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:33.509188 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:01:34.509344 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.115603 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.274991 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.454160 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:02.872673 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:03.074618 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]
14:02:03.254164 IP iZ25cn4xxnvZ.44749 > 10.46.181.146.4789: [|VXLAN]

worker node:

no output
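
If the hosts themselves run a firewall (the daemon log above reports Firewalld as not running, so this may not apply here), a minimal sketch of opening those ports with iptables on each node would be:

# allow the overlay data plane (VXLAN) and the gossip control plane
iptables -A INPUT -p udp --dport 4789 -j ACCEPT
iptables -A INPUT -p udp --dport 7946 -j ACCEPT
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT

If the host firewalls are already open, the drop is most likely happening in the provider's security group or network ACL, which needs to allow the same ports in both directions between the nodes.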
@bigwhite

@mrjana @dongluochen Thanks for your advice. I will dig deeper into whether this is a problem with the Aliyun ECS settings.
