
Pending status although nodes healthy #2044

Closed
byrnedo opened this issue Mar 29, 2016 · 17 comments

Comments

@byrnedo

byrnedo commented Mar 29, 2016

A few days after bringing up a swarm cluster, I end up with 'Pending' status on all nodes.

I'm running on AWS, using Consul as the KV store.

In the swarm-master logs I see this:

time="2016-03-24T16:15:40Z" level=error msg="Discovery error: Unexpected watch error" 
time="2016-03-24T18:15:42Z" level=error msg="Discovery error: Unexpected watch error" 
time="2016-03-26T12:15:42Z" level=info msg="Removed Engine aws-prod-node-2" 
time="2016-03-26T12:15:42Z" level=info msg="Removed Engine aws-prod-master" 
time="2016-03-26T12:15:42Z" level=info msg="Removed Engine aws-prod-node-1" 
time="2016-03-26T12:39:34Z" level=info msg="Removed Engine aws-prod-node-2" 
time="2016-03-26T12:39:34Z" level=info msg="Removed Engine aws-prod-master" 
time="2016-03-26T12:39:34Z" level=info msg="Removed Engine aws-prod-node-1" 
time="2016-03-28T01:15:45Z" level=info msg="Removed Engine aws-prod-node-2" 
time="2016-03-28T01:15:45Z" level=info msg="Removed Engine aws-prod-master" 
time="2016-03-28T01:15:45Z" level=info msg="Removed Engine aws-prod-node-1" 
time="2016-03-28T12:15:54Z" level=error msg="Discovery error: Unexpected watch error" 

I've individually checked nodes 1 and 2 and they're fine. Consul is also reachable from both.

Docker version on the nodes:

 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Docker version from my machine talking to swarm:

Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64
Server:
 Version:      swarm/1.1.3
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   7e9c6bd
 Built:        Wed Mar  2 00:15:12 UTC 2016
 OS/Arch:      linux/amd64
@dongluochen
Contributor

Can you show docker -H manager:port info? We should see the reason for Pending in the Error field. Make sure you can run "docker -H node:port info" successfully from the manager.
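
For example, along these lines (host names, ports, and cert paths below are placeholders for this setup, not exact commands):

# cluster-wide info from the swarm manager
docker --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem \
  -H tcp://manager:3376 info

# info from one engine directly, run from the manager host
docker --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem \
  -H tcp://node:2376 info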

@byrnedo
Author

byrnedo commented Mar 29, 2016

Completely forgot to add that.

I'm connecting using TLS. Once I reboot a host, it goes back to healthy.
All nodes are Ubuntu 14.04 LTS.
This has happened repeatedly, and also on a different VPC (all on AWS).

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.1.3
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 aws-prod-master: xx.xx.xx.xx:2376
  └ Status: Pending
  └ Containers: 5
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 1.018 GiB
  └ Labels: ec2.instance.type.t2.micro=true, executiondriver=native-0.2, kernelversion=3.13.0-79-generic, nodeindex=0, operatingsystem=Ubuntu 14.04.3 LTS, provider=generic, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-29T17:56:05Z
 aws-prod-node-1: xx.xx.xx.xx:2376
  └ Status: Pending
  └ Containers: 10
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.051 GiB
  └ Labels: ec2.instance.type.t2.small=true, executiondriver=native-0.2, kernelversion=3.13.0-79-generic, nodeindex=1, operatingsystem=Ubuntu 14.04.3 LTS, provider=generic, role.infra=true, role.service=true, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-29T17:56:07Z
 aws-prod-node-2: xx.xx.xx.xx:2376
  └ Status: Pending
  └ Containers: 9
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.051 GiB
  └ Labels: ec2.instance.type.t2.small=true, executiondriver=native-0.2, kernelversion=3.13.0-79-generic, nodeindex=2, operatingsystem=Ubuntu 14.04.3 LTS, provider=generic, role.infra=true, role.service=true, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-03-29T17:56:06Z
Plugins: 
 Volume: 
 Network: 
Kernel Version: 3.13.0-79-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: aws-prod-master

@dongluochen
Contributor

Thanks for the info. I suspect the connections to the engines are having problems. Can you restart your Swarm manager with debug enabled, i.e., swarm -l debug manage ..., and see what you get in the logs?
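
Something like the following, assuming the manager was provisioned by docker-machine as a container (the container name, image tag, and mounts are assumptions; adjust them to your setup):

# recreate the manager container with debug logging enabled
docker rm -f swarm-agent-master
docker run -d --name swarm-agent-master --restart=always \
  -p 3376:3376 -v /etc/docker:/etc/docker \
  swarm:1.1.3 -l debug manage \
  --tlsverify --tlscacert=/etc/docker/ca.pem \
  --tlscert=/etc/docker/server.pem --tlskey=/etc/docker/server-key.pem \
  -H tcp://0.0.0.0:3376 consul://yy.yy.yy.yy:8500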

@byrnedo
Author

byrnedo commented Mar 29, 2016

docker info is still showing Pending:

time="2016-03-29T18:13:03Z" level=debug msg="Initializing discovery service" name=consul uri="xx.xx.xx.xx:8500" 
time="2016-03-29T18:13:03Z" level=info msg="Initializing discovery without TLS" 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing strategy" name=spread 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing filter" name=health 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing filter" name=port 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing filter" name=dependency 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing filter" name=affinity 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing filter" name=constraint 
time="2016-03-29T18:13:03Z" level=debug msg="Initializing cluster" name=swarm 
time="2016-03-29T18:13:03Z" level=debug msg="Registering HTTP route" method=HEAD route="/containers/{name:.*}/archive" 
time="2016-03-29T18:13:03Z" level=debug msg="Registering HTTP route" method=GET route="/images/{name:.*}/get" 
...
...
time="2016-03-29T18:13:03Z" level=debug msg="Registering HTTP route" method=DELETE route="/networks/{networkid:.*}" 
time="2016-03-29T18:13:03Z" level=debug msg="Registering HTTP route" method=DELETE route="/volumes/{name:.*}" 
time="2016-03-29T18:13:03Z" level=info msg="Listening for HTTP" addr="0.0.0.0:3376" proto=tcp 
time="2016-03-29T18:13:03Z" level=debug msg="Watch triggered with 3 nodes" discovery=consul 
time="2016-03-29T18:13:03Z" level=debug msg="Start monitoring events" id="5O3I:Z7DQ:WICB:T2S6:XQOE:XMYL:WB7C:AGMX:2IU4:NX53:EH6T:CC45" name=aws-prod-master 
time="2016-03-29T18:13:03Z" level=debug msg="Start monitoring events" id="D53J:6IKO:WKGJ:ABI7:PSVV:NO4R:CJVC:67UB:GU5E:MLFP:ELXP:BXI7" name=aws-prod-node-2 
time="2016-03-29T18:13:04Z" level=debug msg="Start monitoring events" id="ME3G:Z2UU:ZP2W:7QG6:WV2Y:WIX5:6EJB:2YQH:MZJG:52NE:EBJD:ESZF" name=aws-prod-node-1 
time="2016-03-29T18:13:21Z" level=debug msg="Watch triggered with 3 nodes" discovery=consul 

@dongluochen
Contributor

I run Swarm 1.1.3 on AWS and cannot reproduce your problem. When the manager reaches the Start monitoring events step, it means it can connect to the docker engine, but somehow it cannot finish validation. We have a backoff strategy: nodes that fail validation must wait some time before the next validation attempt. I wonder if that might be the case.

Once I reboot a host, it goes back to healthy.

But I do not understand why rebooting a host would change it.

Can you post your commands to start docker manager and to start docker engine?

@byrnedo
Author

byrnedo commented Mar 29, 2016

As I said before, this happens after some time, a few days.

I used docker-machine and the 'generic' driver to provision docker on my servers. From inspecting the container, this is the swarm manage command line:

swarm -debug manage \
      --tlsverify \
      --tlscacert=/etc/docker/ca.pem \
      --tlscert=/etc/docker/server.pem \
      --tlskey=/etc/docker/server-key.pem \
      -H \
      tcp://0.0.0.0:3376 \
      --strategy \
      spread \
      --advertise \
      xx.xx.xx.xx:2376 \
      consul://yy.yy.yy.yy:8500

Here is the ps output of the daemon on the master:

docker daemon -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock \
  --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem \
  --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem \
  --label nodeindex=0 --label provider=generic \
  --cluster-store=consul://yy.yy.yy.yy:8500 --cluster-advertise=eth0:2376 \
  --dns=172.17.0.1 --label=ec2.instance.type.t2.micro=true

I run a Consul agent for DNS and bind it to the docker bridge, hence the --dns=172.17.0.1.


@byrnedo
Author

byrnedo commented Mar 29, 2016

Could it be something to do with the -H tcp://0.0.0.0:2376 along with --cluster-advertise eth0:2376?

@dongluochen
Contributor

Back to your original post: the following error means discovery (Consul) removed the nodes. If you keep seeing these messages, the discovery pipeline has a problem; either the nodes are failing to register themselves, or Consul is failing for some reason. From the manager's point of view, when Consul reports a node dropped, it removes that node from the cluster; when Consul puts it back, the node starts from the Pending state again. You may want to take a look at the Consul logs.

time="2016-03-24T16:15:40Z" level=error msg="Discovery error: Unexpected watch error" 
time="2016-03-24T18:15:42Z" level=error msg="Discovery error: Unexpected watch error" 
time="2016-03-26T12:15:42Z" level=info msg="Removed Engine aws-prod-node-2" 
time="2016-03-26T12:15:42Z" level=info msg="Removed Engine aws-prod-master" 
time="2016-03-26T12:15:42Z" level=info msg="Removed Engine aws-prod-node-1" 
time="2016-03-26T12:39:34Z" level=info msg="Removed Engine aws-prod-node-2" 
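
One way to see exactly what discovery is reporting is to read the KV store directly. A sketch, assuming the default consul discovery prefix docker/swarm/nodes (it differs if a path was appended to the consul:// URL):

# list the engine entries swarm has registered in Consul
curl -s http://yy.yy.yy.yy:8500/v1/kv/docker/swarm/nodes?recurse

Entries disappearing and reappearing here would line up with the Removed Engine messages above.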

@byrnedo
Author

byrnedo commented Mar 30, 2016

Is it possible to debug this via the values in the KV store? I have one working cluster and one failing, and I'm finding it hard to spot any differences.

From looking at the Consul logs, I do see a re-election now and again due to small network outages, maybe once a day.

When you say "when Consul puts it back, the node would start from pending", is there any confirmation I can look for? Just a "Registered node" line in the swarm master logs?

@byrnedo
Author

byrnedo commented Mar 30, 2016

This occurred again today when I restarted the master docker daemon. The master became healthy, but the other nodes were then pending. All had been healthy before I restarted the master daemon.

@byrnedo
Author

byrnedo commented Mar 30, 2016

I see this in the docker daemon logs on the master from a few days ago. It has appeared a few times since, too:

...
INFO[556348] 2016/03/28 01:15:38 [INFO] serf: EventMemberJoin: aws-prod-node-2 10.0.15.98
INFO[556352] 2016/03/28 01:15:42 [INFO] serf: EventMemberFailed: aws-prod-node-1 10.0.2.222
INFO[556359] 2016/03/28 01:15:49 [INFO] serf: EventMemberJoin: aws-prod-node-1 10.0.2.222
WARN[556361] Registering as "10.0.9.0:2376" in discovery failed: cannot set or renew session for ttl, unable to operate on sessions 

@dongluochen
Contributor

@byrnedo Thanks for the logs. I suspect there is a reachability issue, either network or program, in your cluster. How long does it take to resolve, i.e., for nodes to get back to a healthy state after a master restart? If it's long enough, I think you can debug each component separately. To test an engine, use docker -H engine:port info. To test discovery, use docker run swarm list consul://yy.yy.yy.yy:8500. Then inspect the corresponding logs.

I have a test cluster where restarting the manager causes no issues.
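
Concretely, with this thread's placeholders (TLS flags as in your manager command):

# test one engine from the manager host
docker --tlsverify --tlscacert=/etc/docker/ca.pem \
  --tlscert=/etc/docker/server.pem --tlskey=/etc/docker/server-key.pem \
  -H tcp://aws-prod-node-1:2376 info

# test discovery: list the nodes the consul backend currently holds
docker run --rm swarm list consul://yy.yy.yy.yy:8500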

@byrnedo
Author

byrnedo commented Apr 1, 2016

Right now the nodes never become healthy after a master restart. I'll have a look at what swarm list gives me.
I'm running on kernel 4.2, if that might for some reason be affecting things.

Update:
swarm list gives me the public IPs for my cluster.
I confirmed that I can telnet to those IPs and ports.

One thing I've been wondering: these are the public AWS IPs, whereas the daemon's --cluster-advertise value is set to eth0:2376. Should that option instead be set explicitly to <public ip>:2376?
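
For the reachability check, something like this against each address that swarm list prints (placeholder IP):

# verify the engine port is reachable from the manager host
nc -zv xx.xx.xx.xx 2376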

@byrnedo
Author

byrnedo commented Apr 5, 2016

I got to the bottom of this in the end. It's related to moby/moby#20686: an open-file limit gets eaten up, and then all the nodes fall into Pending (which explains why it happens after a few days). Restarting the plugin (in my case rexray) makes the nodes healthy again.

Thanks for the help @dongluochen.
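
For anyone else hitting this, a quick way to check whether the daemon is approaching its open-file limit (a hedged sketch; the pgrep pattern assumes the 1.10-era docker daemon invocation shown earlier):

# count the daemon's open file descriptors against its limit
pid=$(pgrep -f 'docker daemon' | head -n1)
sudo ls /proc/$pid/fd | wc -l
grep 'open files' "/proc/$pid/limits"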

@dongluochen
Contributor

@byrnedo Thanks for sharing the root cause!

@casertap

I think I'm having the same issue. I've opened a question on Stack Overflow because I'm not sure whether it's a bug or I'm doing something wrong.
I think it's a bug, but anyway, here it is: http://stackoverflow.com/questions/36637600/docker-swarm-tls-failed-to-validate-pending-node

Basically, when I set up my swarm cluster using aliases for my node hosts, I get this log, which is actually looking up my node1 on the wrong network.

time="2016-04-15T02:47:59Z" level=debug msg="Failed to validate pending node: lookup node1 on 10.0.2.3:53: server misbehaving" Addr="node1:2376"

It seems that docker is trying to look up the node1 alias on the wrong bridge network?
My node1 is actually:

192.168.33.10 node1

@dongluochen
Contributor

@casertap I just answered your question on Stack Overflow; I'll paste it below.

Although you define node1 in your machine's /etc/hosts, the container that the swarm manager runs in doesn't have node1 in its /etc/hosts file. By default a container doesn't share the host's file system. See https://docs.docker.com/engine/userguide/containers/dockervolumes/. The Swarm manager tries to look up node1 through its DNS resolver and fails.

There are several options to resolve this.

  1. Use a resolvable FQDN so the Swarm manager in the container can resolve the node
  2. Or provide node1's IP in the swarm join command
  3. Or pass the host's /etc/hosts file into the Swarm manager container using the -v option (see the link above and the sketch below)
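
A sketch of option 3, assuming the manager runs from the standard swarm image (container name and discovery URL are placeholders):

docker run -d --name swarm-manager \
  -v /etc/hosts:/etc/hosts:ro \
  -p 3376:3376 \
  swarm manage -H tcp://0.0.0.0:3376 consul://yy.yy.yy.yy:8500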
