"serf instance not initialized" in overlay netwrok on Docker Swarm cluster backed up with Consul #2268

Open
vlaskinvlad opened this Issue May 24, 2016 · 10 comments

@vlaskinvlad

vlaskinvlad commented May 24, 2016

I have a Swarm cluster that works just fine.
When I try to use overlay networking, it stops working with the message:

could not resolve peer: serf instance not initialized

In my use case I create an overlay network:

docker network create -d overlay --subnet=192.168.0.0/16 --gateway=192.168.0.100 --ip-range=192.168.1.0/24 net-stage-1

And then provision two Elasticsearch containers referring to each other in a cluster (they use overlay network IP addresses). The containers cannot reach each other via the overlay network:

Destination host unreachable
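
For what it's worth, a minimal repro along these lines fails the same way (image and container names are simplified placeholders, and 192.168.1.2 stands in for the first container's overlay IP):

docker run -d --name es1 --net=net-stage-1 elasticsearch
docker run --rm --net=net-stage-1 busybox ping -c 3 192.168.1.2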

In the AWS virtual machine's syslog:

level=debug msg="miss notification for dest IP, 192.168.1.2"
level=error msg="could not resolve peer \"192.168.1.2\": could not resolve peer: serf instance not initialized"

Details of configuration:

Configuration: 4 nodes in an AWS VPC (all ports are open for inter-node communication)
Ubuntu 16.04, kernel version: 4.4.0-22-generic

Each Docker daemon in the swarm is configured with flags like:

--dns 172.17.0.1 --dns 8.8.8.8 --dns-search service.consul --userland-proxy=false -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-store=consul://10.0.1.241:8500 --tlsverify --tlscacert=/etc/docker/tls/ca.pem --tlscert=/etc/docker/tls/server-cert.pem --tlskey=/etc/docker/tls/server-key.pem --log-level=debug
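
Side note: Docker's multi-host networking docs pair --cluster-store with a --cluster-advertise flag so the engine knows which address to announce to the overlay's serf layer. A minimal sketch of the pair (the interface name here is an assumption):

--cluster-store=consul://10.0.1.241:8500 --cluster-advertise=eth0:2375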

Each node has a Swarm container, running as follows (10.0.1.239 is the AWS machine's IP):

/swarm join --addr=10.0.1.239:2375 consul://10.0.1.239:8500/awsswarm

And a Consul container running as follows (10.0.1.239 is the AWS machine's IP):

/bin/start -advertise 10.0.1.239 -bootstrap-expect 3 -config-file /etc/consul.d/server/config.json

With the file /etc/consul.d/server/config.json being (10.0.1.241 is the AWS machine's IP):

{
    "bootstrap": false,
    "server": true,
    "datacenter": "dc_aws_eu_central",
    "client_addr": "0.0.0.0",
    "data_dir": "/data",
    "ui_dir": "/ui",
    "ports": {
      "https": 8543
    },
    "encrypt": "0rqhZz9afppaN2aVf1IgGw==",
    "ca_file": "/etc/consul.d/ssl/ca.cert",
    "cert_file": "/etc/consul.d/ssl/consul.cert",
    "key_file": "/etc/consul.d/ssl/consul.key",
    "verify_incoming": true,
    "verify_outgoing": true,
    "log_level": "INFO",
    "enable_syslog": false,
    "start_join_wan": [
                    "10.0.1.241"            ]
}
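
To sanity-check that the three Consul servers actually formed a cluster, the serf membership can be listed on any node (the container name here is an assumption):

docker exec consul consul members    # every server node should show up as alive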

(Of course there is a swarm-manager too.)
docker info looks OK, and containers can be provisioned normally on non-overlay networks:

 Containers: 20
 Running: 15
 Paused: 0
 Stopped: 5
Images: 16
Server Version: swarm/1.2.2
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 4
 ip-10-0-1-239: 10.0.1.239:2375
  └ ID: UXAE:JZFC:655I:OA33:P4IH:YE2C:EZOP:ENMG:ONB3:ZLLD:MV5X:R4SH
  └ Status: Healthy
  └ Containers: 6
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 32.99 GiB
  └ Labels: dc_aws_eu_central=1, docker_consul=1, docker_ssl=1, docker_swarm=1, executiondriver=, kernelversion=4.4.0-22-generic, n4=1, operatingsystem=Ubuntu 16.04 LTS, storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-05-24T14:40:13Z
  └ ServerVersion: 1.11.1
 ip-10-0-1-240: 10.0.1.240:2375
  └ ID: NRFU:POGO:CMLX:MSJS:NYLL:OPE2:NKQA:FKZO:KG2F:GEWZ:OUAW:XHCR
  └ Status: Healthy
  └ Containers: 4
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 32.99 GiB
  └ Labels: dc_aws_eu_central=1, docker_consul=1, docker_ssl=1, docker_swarm=1, executiondriver=, kernelversion=4.4.0-22-generic, n3=1, operatingsystem=Ubuntu 16.04 LTS, storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-05-24T14:40:28Z
  └ ServerVersion: 1.11.1
 ip-10-0-1-241: 10.0.1.241:2375
  └ ID: JOQO:LA4F:JGEA:QZ6R:3E63:N7QH:NOGK:IA47:4CSU:YR45:RQJY:HTJ7
  └ Status: Healthy
  └ Containers: 5
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 32.99 GiB
  └ Labels: dc_aws_eu_central=1, docker_consul=1, docker_ssl=1, docker_swarm=1, docker_swarm_manager=1, executiondriver=, kernelversion=4.4.0-22-generic, n1=1, operatingsystem=Ubuntu 16.04 LTS, storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-05-24T14:40:03Z
  └ ServerVersion: 1.11.1
 ip-10-0-1-242: 10.0.1.242:2375
  └ ID: 2CS5:3H2M:K4CU:SFMW:74MA:KW2C:UK74:AUGZ:SQQZ:WKMB:JNWZ:2HX7
  └ Status: Healthy
  └ Containers: 5
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 32.99 GiB
  └ Labels: dc_aws_eu_central=1, docker_consul=1, docker_ssl=1, docker_swarm=1, executiondriver=, kernelversion=4.4.0-22-generic, n2=1, operatingsystem=Ubuntu 16.04 LTS, storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-05-24T14:40:14Z
  └ ServerVersion: 1.11.1
Plugins:
 Volume:
 Network:
Kernel Version: 4.4.0-22-generic
Operating System: linux
Architecture: amd64
CPUs: 32
Total Memory: 132 GiB
Name: ip-10-0-1-241

I'm a networking newbie and have really struggled to make this work, without any result.
Some fruitless efforts found by googling around:

  • Changing the port from 2376 to 2375 (SSL to non-SSL) didn't work
  • Magic settings like com.docker.network.driver.overlay.* didn't make it happen
  • Disabling the source/dest check on the AWS nodes

Any help would be appreciated.

@abronan

abronan commented May 24, 2016 (Contributor)

Hi @vlaskinvlad, just to double-check: did you make sure you opened the following ports, which networking on AWS requires?

  • udp 4789 Data plane (VXLAN)
  • tcp/udp 7946 Control plane (Serf)

My guess is that your AWS instances' security rules are misconfigured and don't let the traffic through for the above two requirements.
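
If they are not, a rough sketch of opening them with the AWS CLI (the security group ID and source CIDR are placeholders):

aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol udp --port 4789 --cidr 10.0.1.0/24
aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 7946 --cidr 10.0.1.0/24
aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol udp --port 7946 --cidr 10.0.1.0/24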

@abronan

abronan commented May 24, 2016 (Contributor)

/cc @sanimej

abronan added kind/question and removed kind/bug labels May 24, 2016

@vlaskinvlad

vlaskinvlad commented May 24, 2016

Hi @abronan, many thanks for following up.
All machines are in one security group (in one AZ) with all traffic allowed to route through:
inbound:
[screenshot: inbound security group rules]
outbound:
[screenshot: outbound security group rules]

To double-check this, on machine A:

nc -l 4789
nc -l 7946

and from machine B this works (the UDP protocol should not be an issue):

telnet <ip-machine-A> 4789
telnet <ip-machine-A> 7946

@vlaskinvlad

vlaskinvlad commented May 24, 2016

If it helps, here is part of the dmesg output:

[188139.968815] docker_gwbridge: port 4(veth0c1eef6) entered forwarding state
[188139.968822] docker_gwbridge: port 4(veth0c1eef6) entered forwarding state
[188140.100353] eth0: renamed from veth3e94df9
[188140.120443] docker_gwbridge: port 4(veth0c1eef6) entered disabled state
[188140.120481] IPv6: ADDRCONF(NETDEV_CHANGE): veth5: link becomes ready
[188140.120504] br0: port 5(veth5) entered forwarding state
[188140.120512] br0: port 5(veth5) entered forwarding state
[188140.140362] eth1: renamed from vethe7919f9
[188140.164277] IPv6: ADDRCONF(NETDEV_CHANGE): veth0c1eef6: link becomes ready
[188140.164303] docker_gwbridge: port 4(veth0c1eef6) entered forwarding state
[188140.164310] docker_gwbridge: port 4(veth0c1eef6) entered forwarding state
[188140.664128] IPv6: eth1: IPv6 duplicate address fe80::42:acff:fe12:5 detected!
[188155.164068] br0: port 5(veth5) entered forwarding state
[188155.196076] docker_gwbridge: port 4(veth0c1eef6) entered forwarding state

@gavin-hu

gavin-hu commented Jun 23, 2016

Same problem.

@vlaskinvlad

vlaskinvlad commented Jun 28, 2016

I'd really appreciate any guesses here.

@ChristianKniep

ChristianKniep commented Jun 29, 2016

You guys are missing the -p in the nc command: nc -l -p 4789
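
A fuller check of both ports that also exercises UDP might look like this (traditional netcat syntax; BSD netcat drops the -p):

# machine A: listeners
nc -l -p 7946            # TCP control plane (Serf)
nc -u -l -p 4789         # UDP data plane (VXLAN)

# machine B: probes
echo ping | nc <ip-machine-A> 7946
echo ping | nc -u <ip-machine-A> 4789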

@pilgrim2go

pilgrim2go commented Jul 21, 2016

Any update here? I have the same problem.

My error:

docker run -it --rm --net=my-net2 --env="constraint:node==node1" busybox wget -O- http://web
Connecting to web (40.0.9.2:80)
wget: can't connect to remote host (40.0.9.2): No route to host

From the docker log:

time="2016-07-21T08:38:15.105759687Z" level=debug msg="Calling GET /v1.23/containers/json?all=1&filters=%7B%22id%22%3A%7B%22acde3e06bdb4907dc0430dc165ec4515b19f5b3a6bc5636a1422e7dd3d2ce45c%22%3Atrue%7D%7D&limit=0"
time="2016-07-21T08:38:15.303415424Z" level=error msg="could not resolve peer \"40.0.9.2\": timed out resolving peer by querying the cluster"

@sswierczyna

sswierczyna commented Jul 30, 2016

IMO related to moby/moby#25219.

@dongluochen

dongluochen commented Aug 2, 2016 (Contributor)

There might be different problems in this thread. In @vlaskinvlad's case, --cluster-store=consul://10.0.1.241:8500 is set on the daemon. I suspect the target Consul service may not have started successfully on this port. You may take a look at this doc for Consul setup.

In @pilgrim2go's case, please provide your cluster-store info and look at the docker logs to see if there are errors.
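
A quick way to check both, using the host and port from this thread (the docker/nodes key assumes the engine's default discovery prefix):

# what the engine thinks its KV store and advertise address are
docker info | grep -i cluster

# does Consul answer and have a leader on the port the daemon points at?
curl http://10.0.1.241:8500/v1/status/leader

# have the engines registered themselves in the store?
curl http://10.0.1.241:8500/v1/kv/docker/nodes?keys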
