
HA Consul leaves all Swarm nodes in Pending state #2320

Closed · krishamoud opened this issue Jun 2, 2016 · 5 comments

krishamoud commented Jun 2, 2016

I set up a 3-node Consul cluster as follows:

docker-machine create -d amazonec2 \
                    --amazonec2-access-key $AWS_ACCESS_KEY \
                    --amazonec2-secret-key $AWS_SECRET_KEY \
                    --amazonec2-vpc-id $VPC_ID \
                    --amazonec2-zone $AZ \
                    --amazonec2-instance-type="t2.small" \
                    aws-consul-$i

Then I ran the progrium/consul image and got a running cluster.
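For reference, a minimal sketch of how the progrium/consul servers can be brought up (the private IPs below are placeholders, and --net=host is just one of several possible network setups):

# on the first Consul host (placeholder private IP 172.30.0.1)
docker run -d --net=host progrium/consul \
    -server -advertise 172.30.0.1 -bootstrap-expect 3

# on each of the other two hosts, pointing back at the first
docker run -d --net=host progrium/consul \
    -server -advertise 172.30.0.2 -join 172.30.0.1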

[screenshot: the three-node Consul cluster up and running]

They are running on the private network, so I created an internal load balancer.
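(For anyone reproducing this: a sketch of creating such an internal classic ELB with the AWS CLI; the subnet, security group, and instance IDs are placeholders, and it forwards both port 80 and the Consul HTTP port 8500:)

aws elb create-load-balancer \
    --load-balancer-name internal-consul \
    --scheme internal \
    --subnets subnet-0aaa1111 \
    --security-groups sg-0bbb2222 \
    --listeners "Protocol=TCP,LoadBalancerPort=80,InstanceProtocol=TCP,InstancePort=8500" \
                "Protocol=TCP,LoadBalancerPort=8500,InstanceProtocol=TCP,InstancePort=8500"

aws elb register-instances-with-load-balancer \
    --load-balancer-name internal-consul \
    --instances i-0aaa1111 i-0bbb2222 i-0ccc3333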

Then I created 3 swarm nodes as follows:

docker-machine create -d amazonec2 \
                    --amazonec2-access-key $AWS_ACCESS_KEY \
                    --amazonec2-secret-key $AWS_SECRET_KEY \
                    --amazonec2-vpc-id $VPC_ID \
                    --amazonec2-zone $AZ \
                    --swarm-discovery="consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-store=consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-advertise=eth1:2376" \
                    --amazonec2-instance-type="m3.medium" \
                    aws-test-node$i

To make sure it was accessible I ran curl -I internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/catalog/datacenters on all of the swarm nodes and they all returned 200.
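(The check can be scripted across all three machines; a rough sketch using the node names from above:)

for node in aws-test-node0 aws-test-node1 aws-test-node2; do
    docker-machine ssh $node \
        "curl -sI internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/catalog/datacenters | head -n1"
done
# expected from each node: HTTP/1.1 200 OK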

Then I created 3 swarm managers as follows:

docker run --restart=unless-stopped -d -p 3376:3376 -t \
              -v /etc/docker:/certs:ro swarm manage -H 0.0.0.0:3376 \
              --replication --advertise $(docker-machine ip aws-test-node0):3376 \
              --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/server.pem \
              --tlskey=/certs/server-key.pem \
              consul://internal-consul-628946685.us-east-1.elb.amazonaws.com

and I joined the 3 nodes to the cluster in a similar way, running the following on each node:

docker run -d swarm join --advertise=$(docker-machine ip aws-test-node0):2375 consul://internal-consul-628946685.us-east-1.elb.amazonaws.com

Now when I run docker info against the manager I get the following:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.2.3
Role: replica
Primary: 52.91.156.153:3376
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 3
 (unknown): 54.152.253.232:2375
  └ ID:
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  └ UpdatedAt: 2016-06-02T19:15:49Z
  └ ServerVersion:
 (unknown): 54.84.112.44:2375
  └ ID:
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  └ UpdatedAt: 2016-06-02T19:15:59Z
  └ ServerVersion:
 (unknown): 52.91.156.153:2375
  └ ID:
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  └ UpdatedAt: 2016-06-02T19:15:59Z
  └ ServerVersion:
Plugins:
 Volume:
 Network:
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: ed41e5c22521

Tailing the logs of the manager, I see that a new manager is elected every minute or so. Consul knows that the nodes exist, because I can see them in the UI view:

[screenshot: Consul UI showing the swarm nodes registered in the KV store]

and I can also see that a manager has been elected:

[screenshot: Consul UI showing the elected swarm manager key]
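The same information is visible from the CLI: classic Swarm stores its discovery entries in the Consul KV store (under the docker/swarm prefix by default, if I recall correctly), so the registered nodes and the leader key can be queried directly. A sketch:

# one key per registered engine
curl -s internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/kv/docker/swarm/nodes?recurse

# the leadership lock held by the primary manager (values come back base64-encoded)
curl -s internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/kv/docker/swarm/leader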

I'm fairly sure this is a problem with HA Consul, because I have run single-node Consul with Swarm before and everything worked just fine.

I've googled just about everything I can, but I haven't been able to find a solution. Any help would be appreciated.

EDIT: I have tried the same thing with the official consul image by running this command:

docker run -d --net=host -e CONSUL_BIND_INTERFACE="eth0" \
    -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' \
    consul agent -server -client=0.0.0.0 -bind=172.30.0.77 \
    -retry-join=172.30.3.253 -bootstrap-expect=3
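(The command on the other two servers is the same apart from -bind and -retry-join; e.g. on the second server, assuming the addresses above:)

docker run -d --net=host -e CONSUL_BIND_INTERFACE="eth0" \
    -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' \
    consul agent -server -client=0.0.0.0 -bind=172.30.3.253 \
    -retry-join=172.30.0.77 -bootstrap-expect=3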

The results are the same. Swarm still has all three nodes stuck in a pending state.

krishamoud (Author) commented:

Update:

I was able to bypass this issue for the most part by creating nodes with the following command:

docker-machine create -d amazonec2 \
                    --amazonec2-access-key $access_key \
                    --amazonec2-secret-key $secret_key \
                    --amazonec2-vpc-id $vpc \
                    --amazonec2-zone e \
                    --swarm \
                    --swarm-discovery="consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-store=consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-advertise=eth0:2376" \
                    --amazonec2-instance-type="m3.medium" \
                    aws-test-node2
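(To check the result, the Docker client can be pointed at one of the manually started managers on 3376; the cert paths below are the usual docker-machine locations and are placeholders here:)

docker --tlsverify \
    --tlscacert ~/.docker/machine/machines/aws-test-node0/ca.pem \
    --tlscert ~/.docker/machine/machines/aws-test-node0/cert.pem \
    --tlskey ~/.docker/machine/machines/aws-test-node0/key.pem \
    -H tcp://$(docker-machine ip aws-test-node0):3376 info
# the nodes should now show Status: Healthy instead of Pending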

Just by adding the --swarm flag I was able to discover the nodes. For some reason though, the managers still never hold leadership for more than a minute or two. Here are the logs.

INFO[0000] Initializing discovery without TLS
INFO[0000] Listening for HTTP                            addr=0.0.0.0:3376 proto=tcp
INFO[0000] Leader Election: Cluster leadership lost
INFO[0000] New leader elected: 54.173.232.243:3376
INFO[0000] Registered Engine aws-test-node0 at 54.173.232.243:2376
INFO[0000] Registered Engine aws-test-node2 at 54.84.212.105:2376
INFO[0000] Registered Engine aws-test-node1 at 54.165.144.173:2376
INFO[0026] New leader elected: 54.84.212.105:3376
INFO[0085] Leader Election: Cluster leadership acquired
INFO[0145] Leader Election: Cluster leadership lost
INFO[0145] New leader elected: 54.84.212.105:3376
INFO[0204] Leader Election: Cluster leadership acquired
INFO[0264] Leader Election: Cluster leadership lost
INFO[0264] New leader elected: 54.173.232.243:3376
INFO[0324] Leader Election: Cluster leadership acquired
INFO[0384] Leader Election: Cluster leadership lost
INFO[0384] New leader elected: 54.173.232.243:3376
INFO[0444] Leader Election: Cluster leadership acquired
INFO[0504] Leader Election: Cluster leadership lost
INFO[0504] New leader elected: 54.173.232.243:3376
INFO[0564] Leader Election: Cluster leadership acquired
INFO[0624] Leader Election: Cluster leadership lost
INFO[0624] New leader elected: 54.84.212.105:3376
INFO[0683] New leader elected: 54.173.232.243:3376

I can now run HA Consul and HA Swarm, but I still think this is an issue and not an optimal solution.
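In case it helps anyone debugging the same flapping: Swarm's leadership is a Consul session-based lock that the primary has to keep renewing, and a load balancer in the path (round-robin, idle timeouts) can interfere with those renewals, which would match the roughly 60-second cycle in the logs above. A rough way to watch it through the Consul HTTP API:

# the Raft leader as seen through the ELB (repeat a few times; the answer should be stable)
curl -s internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/status/leader

# active sessions; the managers' lock sessions should stay alive between calls
curl -s internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/session/list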


jmzwcn commented Jun 5, 2016

Is this an issue with docker-machine?

krishamoud (Author) commented:

It's possible. The only thing I found that's similar is docker/machine#3321


mixman commented Jun 6, 2016

I had the same issue with Consul behind an ELB. Perhaps some special ELB configuration is needed? I opted for dnsmasq until it is solved.

krishamoud (Author) commented:

Thanks @mixman! The ELB seems to have been the problem. Previously I had a CNAME pointing to the ELB; I fixed it by replacing that CNAME with an A record pointing to the private IPs of the Consul nodes. Then, when starting the swarm managers/nodes, I simply ran:

docker run -d swarm join --advertise=$(docker-machine ip node2):2376 consul://consul.example.com:8500
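(For anyone replicating the fix with Route 53, it amounts to one A record carrying all of the Consul servers' private IPs; the zone name, hosted-zone ID, and third IP below are placeholders:)

aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE \
    --change-batch '{"Changes": [{"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "consul.example.com", "Type": "A", "TTL": 60,
        "ResourceRecords": [{"Value": "172.30.0.77"},
                            {"Value": "172.30.3.253"},
                            {"Value": "172.30.1.10"}]}}]}'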

Closing this, as it seems to be a Consul issue and not a Swarm issue.
