HA Consul leaves all Swarm nodes in Pending state #2320

Closed
krishamoud opened this Issue Jun 2, 2016 · 5 comments


krishamoud commented Jun 2, 2016

I set up a 3-node Consul cluster as follows:

docker-machine create -d amazonec2 \
                    --amazonec2-access-key $AWS_ACCESS_KEY \
                    --amazonec2-secret-key $AWS_SECRET_KEY \
                    --amazonec2-vpc-id $VPC_ID \
                    --amazonec2-zone $AZ \
                    --amazonec2-instance-type="t2.small" \
                    aws-consul-$i

Then I ran the progrium/consul image and got a running cluster.

[screenshot: 2016-06-02 at 12 04 27 pm]

They are running on the private network, so I created an internal load balancer.

Then I created 3 swarm nodes as follows:

docker-machine create -d amazonec2 \
                    --amazonec2-access-key $AWS_ACCESS_KEY \
                    --amazonec2-secret-key $AWS_SECRET_KEY \
                    --amazonec2-vpc-id $VPC_ID \
                    --amazonec2-zone $AZ \
                    --swarm-discovery="consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-store=consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-advertise=eth1:2376" \
                    --amazonec2-instance-type="m3.medium" \
                    aws-test-node$i

To make sure it was accessible I ran curl -I internal-consul-628946685.us-east-1.elb.amazonaws.com/v1/catalog/datacenters on all of the swarm nodes and they all returned 200.
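
The reachability check above can also be scripted. What follows is a minimal Python sketch, not part of the original report: it sends the same kind of HEAD request that `curl -I` does to `/v1/catalog/datacenters` and returns the HTTP status. The function names `catalog_url` and `head_status` are hypothetical, and plain HTTP on the default port is assumed because the curl command above gives no scheme or port.

```python
import urllib.error
import urllib.request

CATALOG_PATH = "/v1/catalog/datacenters"  # endpoint used in the curl check above


def catalog_url(host):
    """Build the URL that `curl -I <host>/v1/catalog/datacenters` requests."""
    # Assumption: plain HTTP on the default port, as in the curl command.
    return "http://" + host.rstrip("/") + CATALOG_PATH


def head_status(host, timeout=5.0):
    """Send a HEAD request (like `curl -I`) and return the HTTP status code."""
    request = urllib.request.Request(catalog_url(host), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; report their status code too.
        return err.code


# Hypothetical usage, run on each swarm node:
#   head_status("internal-consul-628946685.us-east-1.elb.amazonaws.com")
```

A 200 here only shows that the HTTP API answers through the load balancer from that node; it says nothing about whether Swarm's discovery and leader election behave correctly through the same endpoint.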

Then I created 3 swarm managers as follows:

docker run --restart=unless-stopped -d -p 3376:3376 -t \
              -v /etc/docker:/certs:ro swarm manage -H 0.0.0.0:3376 \
              --replication --advertise $(docker-machine ip aws-test-node0):3376 \
              --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/server.pem \
              --tlskey=/certs/server-key.pem \
              consul://internal-consul-628946685.us-east-1.elb.amazonaws.com

I then joined the 3 nodes to the cluster in the same way:

docker run -d swarm join --advertise=$(docker-machine ip aws-test-node0):2375 consul://internal-consul-628946685.us-east-1.elb.amazonaws.com

Now when I run docker info against the manager I get the following:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.2.3
Role: replica
Primary: 52.91.156.153:3376
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 3
 (unknown): 54.152.253.232:2375
  └ ID:
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  └ UpdatedAt: 2016-06-02T19:15:49Z
  └ ServerVersion:
 (unknown): 54.84.112.44:2375
  └ ID:
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  └ UpdatedAt: 2016-06-02T19:15:59Z
  └ ServerVersion:
 (unknown): 52.91.156.153:2375
  └ ID:
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  └ UpdatedAt: 2016-06-02T19:15:59Z
  └ ServerVersion:
Plugins:
 Volume:
 Network:
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: ed41e5c22521
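
With more than a handful of nodes, output like the above is easier to scan programmatically. The following is a rough Python sketch, not part of the original report, that pulls the per-node entries out of this style of `docker info` output and lists the ones stuck in Pending. It assumes the layout shown above (a one-space-indented `name: host:port` line followed by `└ Key: value` lines); `parse_nodes` and `pending_nodes` are hypothetical helper names, and `SAMPLE` is a shortened excerpt of the output above.

```python
import re

# One-space-indented "name: host:port" lines start a node entry.
NODE_HEADER = re.compile(r"^ (?P<name>\S.*?): (?P<addr>\S+:\d+)\s*$")
# Deeper-indented "└ Key: value" lines carry that node's details.
NODE_DETAIL = re.compile(r"^\s+└ (?P<key>[^:]+):\s*(?P<value>.*)$")


def parse_nodes(info_text):
    """Return one dict per node found in Swarm-style `docker info` output."""
    nodes = []
    for line in info_text.splitlines():
        header = NODE_HEADER.match(line)
        if header:
            nodes.append({"name": header["name"], "addr": header["addr"]})
            continue
        detail = NODE_DETAIL.match(line)
        if detail and nodes:
            # Attach the detail to the most recently seen node.
            nodes[-1][detail["key"].strip()] = detail["value"].strip()
    return nodes


def pending_nodes(nodes):
    """Keep only the nodes whose Status is Pending."""
    return [n for n in nodes if n.get("Status") == "Pending"]


# Shortened excerpt of the output above (two nodes, fewer fields).
SAMPLE = """\
Nodes: 2
 (unknown): 54.152.253.232:2375
  └ ID:
  └ Status: Pending
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
 (unknown): 54.84.112.44:2375
  └ ID:
  └ Status: Pending
  └ Error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
"""

for node in pending_nodes(parse_nodes(SAMPLE)):
    print(node["addr"], node.get("Error", "no error reported"))
```

Printing the address next to the error makes it easy to see which host and port the manager is actually trying to reach for each pending node.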

Tailing the logs of the manager I see that a new manager is elected every minute or so. Consul knows that the nodes exist because I can see them in the UI view.

[screenshot: 2016-06-02 at 12 18 30 pm]

I can also see that a manager has been elected:

[screenshot: 2016-06-02 at 12 19 33 pm]

I'm fairly sure this is a problem with HA Consul, because I have run single-node Consul with Swarm before and everything worked just fine.

I've googled just about everything I can think of, but I haven't been able to find a solution. Any help would be appreciated.

EDIT: I have tried the same thing with the official consul image by running this command:

docker run -d --net=host -e CONSUL_BIND_INTERFACE="eth0" -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' consul agent -server -client=0.0.0.0 -bind=172.30.0.77 -retry-join=172.30.3.253 -bootstrap-expect=3

The results are the same. Swarm still has all three nodes stuck in a pending state.


krishamoud commented Jun 3, 2016

Update:

I was able to bypass this issue for the most part by creating nodes with the following command:

docker-machine create -d amazonec2 \
                    --amazonec2-access-key $access_key \
                    --amazonec2-secret-key $secret_key \
                    --amazonec2-vpc-id $vpc \
                    --amazonec2-zone e \
                    --swarm \
                    --swarm-discovery="consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-store=consul://internal-consul-628946685.us-east-1.elb.amazonaws.com" \
                    --engine-opt="cluster-advertise=eth0:2376" \
                    --amazonec2-instance-type="m3.medium" \
                    aws-test-node2

Just by adding the --swarm flag I was able to discover the nodes. For some reason though, the managers still never hold leadership for more than a minute or two. Here are the logs.

INFO[0000] Initializing discovery without TLS
INFO[0000] Listening for HTTP                            addr=0.0.0.0:3376 proto=tcp
INFO[0000] Leader Election: Cluster leadership lost
INFO[0000] New leader elected: 54.173.232.243:3376
INFO[0000] Registered Engine aws-test-node0 at 54.173.232.243:2376
INFO[0000] Registered Engine aws-test-node2 at 54.84.212.105:2376
INFO[0000] Registered Engine aws-test-node1 at 54.165.144.173:2376
INFO[0026] New leader elected: 54.84.212.105:3376
INFO[0085] Leader Election: Cluster leadership acquired
INFO[0145] Leader Election: Cluster leadership lost
INFO[0145] New leader elected: 54.84.212.105:3376
INFO[0204] Leader Election: Cluster leadership acquired
INFO[0264] Leader Election: Cluster leadership lost
INFO[0264] New leader elected: 54.173.232.243:3376
INFO[0324] Leader Election: Cluster leadership acquired
INFO[0384] Leader Election: Cluster leadership lost
INFO[0384] New leader elected: 54.173.232.243:3376
INFO[0444] Leader Election: Cluster leadership acquired
INFO[0504] Leader Election: Cluster leadership lost
INFO[0504] New leader elected: 54.173.232.243:3376
INFO[0564] Leader Election: Cluster leadership acquired
INFO[0624] Leader Election: Cluster leadership lost
INFO[0624] New leader elected: 54.84.212.105:3376
INFO[0683] New leader elected: 54.173.232.243:3376

I can now run HA consul and HA swarm but I still think this is an issue and not an optimal solution.
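
To quantify how long leadership is held, the acquired/lost pairs in a log like the one above can be measured. The following is a small Python sketch, not part of the original report. It assumes the bracketed number in each `INFO[...]` line is the number of seconds since the manager started; `leadership_tenures` is a hypothetical helper name, and `LOG` is an excerpt of the log above.

```python
import re

EVENT = re.compile(
    r"INFO\[(?P<t>\d+)\]\s+Leader Election: Cluster leadership (?P<kind>acquired|lost)"
)


def leadership_tenures(log_text):
    """Return the duration of each acquired -> lost leadership period."""
    tenures = []
    acquired_at = None
    for line in log_text.splitlines():
        event = EVENT.search(line)
        if not event:
            continue  # skip "New leader elected", "Registered Engine", etc.
        t = int(event["t"])
        if event["kind"] == "acquired":
            acquired_at = t
        elif acquired_at is not None:
            # A "lost" event closes the currently open leadership period.
            tenures.append(t - acquired_at)
            acquired_at = None
    return tenures


# Excerpt of the manager log above.
LOG = """\
INFO[0000] Leader Election: Cluster leadership lost
INFO[0000] New leader elected: 54.173.232.243:3376
INFO[0085] Leader Election: Cluster leadership acquired
INFO[0145] Leader Election: Cluster leadership lost
INFO[0145] New leader elected: 54.84.212.105:3376
INFO[0204] Leader Election: Cluster leadership acquired
INFO[0264] Leader Election: Cluster leadership lost
"""

print(leadership_tenures(LOG))  # [60, 60]
```

In the full log above, every acquired/lost pair is 60 apart (85 to 145, 204 to 264, 324 to 384, 444 to 504, 564 to 624). Losses at such a regular interval look more like a fixed timeout than random failures, although nothing in this thread confirms the cause.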


jmzwcn commented Jun 5, 2016

Is this an issue with Docker Machine?


krishamoud commented Jun 5, 2016

It's possible. The only thing I found that's similar is docker/machine#3321


mixman commented Jun 6, 2016

I had the same issue with Consul behind an ELB. Perhaps some special ELB configuration is needed? I opted to use dnsmasq until this is solved.


krishamoud commented Jun 7, 2016

Thanks @mixman! The ELB seems to have been the problem. Previously I had a CNAME pointing to the ELB I had made. I seem to have fixed it by replacing that CNAME with an A record pointing to the private IPs of the Consul nodes. Then, when starting the swarm managers/nodes, I simply ran:

docker run -d swarm join --advertise=$(docker-machine ip node2):2376 consul://consul.example.com:8500

Closing this as it seems to be a consul issue and not a swarm issue.
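
To confirm that a discovery hostname resolves straight to the Consul nodes rather than to a load balancer, the DNS answer can be inspected. The following is a minimal Python sketch, not part of the original thread, using only the standard library. `resolve_ipv4` and `all_private` are hypothetical helper names, and port 8500 is taken from the join command above.

```python
import ipaddress
import socket


def resolve_ipv4(hostname, port=8500):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True if every address is in a private range (e.g. 10/8, 172.16/12)."""
    return all(ipaddress.ip_address(a).is_private for a in addresses)


# Hypothetical usage, run from a swarm node:
#   ips = resolve_ipv4("consul.example.com")
#   print(ips, all_private(ips))
```

With the A-record setup described above, the expectation would be one private address per Consul node. Note that `all_private` returns True for an empty list, so check that the list is non-empty as well.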

@krishamoud krishamoud closed this Jun 7, 2016
