Pending status although nodes healthy #2044
Comments
Can you show
Completely forgot to add that. I'm connecting using TLS. Once I reboot a host, it goes back to healthy.
Thanks for the info. I suspect the connections to the engines have a problem. Can you restart your Swarm manager with debug enabled?
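A sketch of what that could look like, assuming the manager runs as the swarm container with Consul discovery; the addresses, port, and cert paths below are placeholders, not values from this thread:

```shell
# Sketch only: addresses, port, and cert paths are placeholders.
# The swarm binary takes a global --debug flag; the TLS flags mirror the engine's.
docker run -d -p 3376:3376 \
  -v /etc/docker/certs:/certs:ro \
  swarm --debug manage \
  --tlsverify \
  --tlscacert=/certs/ca.pem \
  --tlscert=/certs/server.pem \
  --tlskey=/certs/server-key.pem \
  --host=0.0.0.0:3376 \
  consul://<consul-ip>:8500/swarm
```

With --debug on, the manager logs each node state transition, which should show why a node drops to pending.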
I run Swarm 1.1.3 on AWS and do not have a repro of your problem. When the manager reaches the step of
But I do not understand why rebooting a host would change it. Can you post the commands you use to start the Swarm manager and the Docker engine?
As I said before, this happens after some time, a few days. I used docker-machine and the 'generic' driver to provision docker on my
Here is the
I run a consul agent for DNS and bind it to the docker bridge, hence the
Is it perhaps to do with the
Back to your original post, the following error means discovery (Consul) removed the nodes. If you keep seeing these messages, the discovery pipeline has a problem: either the nodes are failing to register themselves, or Consul is failing for some reason. From the manager's point of view, when Consul reports a node dropped, it removes the node from the cluster; when Consul puts it back, the node starts from pending.
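For reference, each node registers itself in discovery with swarm join and must keep re-registering; if the heartbeat stops for longer than the TTL, Consul drops the entry and the manager removes the node. A minimal sketch (addresses are placeholders, and the interval values are illustrative, not from this thread):

```shell
# Runs on each engine host; re-registers the node every --heartbeat interval.
# If registrations stop for longer than --ttl, discovery drops the node and
# the manager removes it from the cluster.
docker run -d swarm join \
  --advertise=<node-ip>:2375 \
  --heartbeat=20s \
  --ttl=60s \
  consul://<consul-ip>:8500/swarm
```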
Is it possible to debug this via the values in the kv-store? I have one working cluster and one failing, and I'm finding it hard to spot any differences. From looking at the consul logs I do see that there is a re-election now and again due to small network outages, maybe once a day. When you say that "when Consul puts it back, the node would start from pending"
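Inspecting the kv-store directly is possible through Consul's HTTP API. A sketch, assuming Consul on its default port 8500 and a swarm discovery prefix (the exact key path is an assumption; values in the API response come back base64-encoded):

```shell
# List the registered node entries (commented out; needs a reachable Consul):
#   curl -s 'http://<consul-ip>:8500/v1/kv/swarm/docker/swarm/nodes?recurse'
# Each entry's Value field is base64-encoded; decode it to see the node address.
# For example, a Value of 'MTAuMC4wLjE6MjM3NQ==' decodes to a host:port pair:
echo 'MTAuMC4wLjE6MjM3NQ==' | base64 -d
# -> 10.0.0.1:2375
```

Comparing the decoded entries between the working and failing clusters would show whether nodes are actually missing from discovery.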
This occurred again today when I restarted the master Docker daemon. The master became healthy but the other nodes were now pending. All had been healthy before I restarted the master daemon.
Looking back, this appeared in the Docker daemon logs on the master a few days ago. It has appeared a few times since too:
@byrnedo Thanks for the logs. I suspect there is a reachability issue, either network or program, in your cluster. How long does it take to resolve, i.e., for the nodes to get back to healthy? I have a test cluster where restarting the manager causes no issue.
Right now the nodes never become healthy after a master restart. Will have a look at what swarm list gives me. Update: one thing I've been wondering about is that these are the public AWS IPs, whereas the daemon's
I got to the bottom of this in the end. It's related to moby/moby#20686: some file limit is getting eaten up, and then all the nodes fall into pending (which explains why it happens after a few days). Restarting the plugin (in my case rexray) makes the nodes healthy again. Thanks for the help @dongluochen.
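For anyone hitting this, a quick way to check whether a process is approaching its file-descriptor limit; shown here against the current shell for illustration, so on an affected host substitute the Docker daemon's PID:

```shell
# Compare the fd limit with current fd usage for a given PID (Linux).
pid=$$   # illustration only; use the docker daemon's PID on the affected host
grep 'Max open files' "/proc/$pid/limits"
echo "open fds: $(ls "/proc/$pid/fd" | wc -l)"
```

If the open-fd count sits near the soft limit, new connections to the engines will start failing, which matches nodes dropping to pending.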
@byrnedo Thanks for sharing the root cause!
I am having the same issue, I think. I have opened a question on Stack Overflow because I am not sure if it is a bug or if I am doing something wrong. Basically, when I set up my swarm cluster using aliases for my node hosts, I get this log, which is actually looking up my node1 on the wrong network.
It seems that docker is trying to look up the node1 alias on the wrong bridge network?
@casertap I just answered your question on Stack Overflow; I paste it below. Although you define node1 in your machine's /etc/hosts, the container that the Swarm manager runs in doesn't have node1 in its /etc/hosts file. By default a container doesn't share the host's file system. See https://docs.docker.com/engine/userguide/containers/dockervolumes/. The Swarm manager tries to look up node1 through its DNS resolver and fails. There are several options to resolve this.
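One such option, sketched with placeholder values: pass the alias to the manager container with docker run's --add-host flag, which writes the mapping into that container's own /etc/hosts:

```shell
# Placeholder IP and ports; --add-host injects 'node1 -> 10.0.0.10' into the
# manager container's /etc/hosts, so the lookup never has to go through DNS.
docker run -d -p 4000:4000 \
  --add-host node1:10.0.0.10 \
  swarm manage consul://<consul-ip>:8500/swarm
```

Alternatively, registering the nodes in discovery by IP address instead of by alias avoids the resolution problem entirely.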
A few days after bringing a swarm cluster up, I end up with 'pending' status on all nodes.
I'm running on AWS using Consul as the kv store.
In the swarm-master logs I see this:
I've individually checked nodes 1 and 2 and they're fine. Also consul is reachable from both.
Docker version on the nodes:
Docker version from my machine talking to swarm: