Daemon can't be started after swarm certificates expire #24132

tonistiigi · 2016-06-29T17:27:41Z

Swarm certificates renew automatically and have a 90-day expiry period by default. Still, if you don't start the daemon during that time, the certificates will expire and starting the daemon will fail with:

time="2016-06-29T17:18:06.165656736Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"

I think refusing to start and not ignoring this error is correct. We could provide --reset-swarm option to leave swarm so the user doesn't need to remove the state dir manually. The problem is that the user must remember to remove this option afterwards; otherwise it would clear the state on every subsequent restart as well.

Maybe a good enough solution would be to add instructions for removing the state directory in the error message.

@nathanleclaire

cpuguy83 · 2016-06-29T17:40:34Z

Could provide a dockerd subcommand to handle cleanup.

oflebbe · 2016-09-29T18:29:56Z

My docker for mac suddenly didn't start any more.

Luckily I found the $HOME/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/log/docker.log file, which states that swarm-node.crt within the docker image is not valid any more. I was able to move the Docker.qcow2 image to a Linux box, mount it, remove the swarm-node.crt file within the container and move the image back, and docker works again. But how is an average user supposed to fix that issue? IMHO this urgently has to be fixed.

Could you please elaborate on "how to remove the state directory"?

tonistiigi · 2016-09-29T21:29:25Z

@oflebbe Did you change the certificate expiry time? When did you last use docker before this happened? Any chance some clock skew happened that would have caused this? Can you share the daemon logs with us?

cc @diogomonica

oflebbe · 2016-09-30T06:39:28Z

@diogomonica @tonistiigi No, I didn't change the expiry time.

Timeline: I have used docker for mac Beta almost on a daily basis since Jun 16th. I tested swarm shortly after swarm integration was announced (Jul 29th?). The time stamp of the swarm-root-ca.crt is

-rw-r--r-- 1 root root 684 Jun 28 21:50 swarm-root-ca.crt

I looked at the expiration date of the swarm-node.crt and it was Sep 26th. My mac crashed on Sep 28th and docker didn't start any more. (Jun 28th + 90 == Sep 26th, btw)
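The date arithmetic in the comment above can be checked with a short sketch. The year 2016 is an assumption taken from the thread's timestamps, and the variable names are illustrative only.

```python
from datetime import date, timedelta

# Dates from the comment above; the year is assumed from the thread timestamps.
issued = date(2016, 6, 28)          # timestamp of swarm-root-ca.crt
validity = timedelta(days=90)       # default expiry period mentioned in the issue
crash = date(2016, 9, 28)           # day the Mac crashed and docker stopped starting

expires = issued + validity
print(expires)                      # 2016-09-26
print((crash - expires).days)       # 2 (days past expiry at the time of the crash)
```

So the certificate had already been expired for two days when the daemon was next started.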

I did a factory reset on my mac but backed up Docker.qcow2 beforehand. So the original logs are gone, unfortunately.

What I do have:

I have the Docker.qcow2. I can send you the original /lib/docker/swarm dir, for instance (minus swarm-node.*). Maybe tasks.db is of interest to you.
A docker log from the failed boot of the Docker.qcow2 (last lines are):
time="2016-09-29T18:06:29.220768897Z" level=debug msg="/usr/local/sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]"
time="2016-09-29T18:06:30.093105697Z" level=debug msg="successfully loaded the Root CA: /var/lib/docker/swarm/certificates/swarm-root-ca.crt"
time="2016-09-29T18:06:30.094457479Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"
time="2016-09-29T18:21:24.066271462Z" level=debug msg="docker group found. gid: 50"
time="2016-09-29T18:21:24.066842256Z" level=debug msg="Listener created for HTTP on unix (/var/run/docker.sock)"
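For anyone trying to confirm they are hitting the same failure, here is a minimal sketch of picking the fatal certificate error out of a docker.log line. The helper names (parse_fields, is_expired_cert_error) are made up for this example; only the log line itself comes from the output above.

```python
import re

# The fatal line from the log above, copied verbatim.
LINE = ('time="2016-09-29T18:06:30.094457479Z" level=fatal '
        'msg="Error creating cluster component: error while loading TLS Certificate in '
        '/var/lib/docker/swarm/certificates/swarm-node.crt: '
        'x509: certificate has expired or is not yet valid"')

# key=value pairs where the value is either a quoted string or a bare word.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

def parse_fields(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, value in PAIR.findall(line):
        if value.startswith('"'):
            value = value[1:-1]  # drop surrounding quotes; escapes are left as-is
        fields[key] = value
    return fields

def is_expired_cert_error(line):
    """True if the line is a fatal entry about an expired certificate."""
    fields = parse_fields(line)
    return (fields.get("level") == "fatal"
            and "certificate has expired" in fields.get("msg", ""))

print(is_expired_cert_error(LINE))  # True
```

This only matches the message text seen in this thread; other daemon versions may word the error differently.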

Seems like the renewal of certificates never happened, turning it into a time bomb.

oflebbe · 2016-09-30T06:45:21Z

Find the swarm dir at http://oflebbe.de/docker_swarm.tgz

tonistiigi · 2016-09-30T20:49:52Z

@oflebbe Thanks! I created moby/swarmkit#1594 to fix the issue that might have caused it and it should be included in v1.12.2. It's possible this was caused by the combination of renewal and daemon restart timings.

thaJeztah · 2016-09-30T23:28:37Z

It's included in #27077, which is for 1.12.2

thaJeztah · 2016-10-01T10:58:50Z

Closed by #27077 for 1.12.2

tonistiigi · 2016-10-03T17:50:25Z

This shouldn't be closed as the fix is unrelated to the original issue. The same message just appeared but for a different reason.

diogomonica · 2016-10-05T09:14:52Z

@tonistiigi any ideas here?

also:

"I think refusing to start and not ignoring this error is correct. We could provide --reset-swarm option to leave swarm so the user doesn't need to remove the state dir manually."

Doesn't docker swarm leave remove all state anyway? Why would --reset-swarm be needed?

tonistiigi · 2016-10-05T17:07:27Z

@diogomonica The problem is that the daemon doesn't start, so you have no way to execute swarm leave.

I think our options are:

1. Add --reset-swarm and risk someone accidentally using it and losing their swarm state. Also, this has limited benefits for docker4mac unless they add it in the UI.
2. Log the error and start the daemon without swarm. If you run docker info you will see the certificate error. You can then do docker swarm leave or maybe docker swarm init --force-new-cluster.
3. As this is mostly for dev machines, which are unlikely to have multiple nodes, we could just use the CA key that is locally available to generate a new certificate.
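To make the second option concrete, here is a rough sketch of the control flow it describes: the certificate failure is logged and recorded, and the engine keeps running. This is not the daemon's actual code; CertificateExpiredError, start_daemon and load_swarm are hypothetical names used only for illustration.

```python
import logging

class CertificateExpiredError(Exception):
    """Hypothetical stand-in for the x509 expiry error the daemon reports."""

def start_daemon(load_swarm):
    """Start the engine and treat an expired swarm certificate as non-fatal.

    load_swarm is a hypothetical callable that raises CertificateExpiredError
    when the node certificate can no longer be loaded.
    """
    state = {"running": True, "swarm": "inactive", "swarm_error": None}
    try:
        load_swarm()
        state["swarm"] = "active"
    except CertificateExpiredError as err:
        # Log and remember the error instead of exiting, so the user can
        # still see it (e.g. in an info-style status) and leave the swarm.
        logging.error("swarm component not started: %s", err)
        state["swarm"] = "error"
        state["swarm_error"] = str(err)
    return state

def expired():
    raise CertificateExpiredError("x509: certificate has expired or is not yet valid")

print(start_daemon(expired)["swarm"])  # error
```

The point of the design is that the engine stays usable for non-swarm work while the swarm state stays on disk until the user decides what to do with it.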

diogomonica · 2016-10-10T11:26:49Z

@tonistiigi I think solution 2 is the right one.

Just making sure: do you believe there are other issues that might be preventing certificate rotation?

thaJeztah · 2016-10-10T11:32:30Z

I think solution 2 is the right one.

Do we know what happens if I re-join the swarm on that node? IIUC, joining again gives the node a new identity in the swarm; will the manager stop and remove existing tasks when joining again, or will it "not recognize" the old tasks and keep them running?

diogomonica · 2016-10-10T11:45:45Z

This only happens when the node is already down. From the PoV of the manager the node is already dead, and all the tasks have already been scheduled somewhere else. Rejoining will give the node a new identity in the cluster, yes, but all the local tasks that might still be running will be killed.

banjocat · 2016-11-05T10:15:49Z

I ran into this issue. I ended up removing Docker.qcow2.

diogomonica · 2016-11-05T18:30:19Z

@tonistiigi any followup on 2?

robbyoconnor · 2016-11-07T09:26:56Z

I...just got bit by this...

robbyoconnor · 2016-11-07T09:41:42Z

This makes me very uneasy -- This should not prevent the engine from starting. It should make swarm features not work, but the engine should start.

thaJeztah · 2016-11-07T09:49:03Z

@robbyoconnor try renaming / removing the /var/lib/docker/swarm directory, which should disable swarm mode
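A minimal sketch of the rename variant of this workaround, which keeps the old state around instead of deleting it. The function name and the backup suffix are made up for this example; it assumes the daemon is stopped and that the script has permission to rename the directory.

```python
import time
from pathlib import Path

def set_aside_swarm_state(swarm_dir="/var/lib/docker/swarm"):
    """Rename the swarm state directory so the daemon no longer finds it.

    Returns the backup path, or None if the directory does not exist.
    The ".bak-<timestamp>" suffix is an arbitrary choice for this sketch.
    """
    src = Path(swarm_dir)
    if not src.is_dir():
        return None
    dst = src.with_name(f"{src.name}.bak-{int(time.time())}")
    src.rename(dst)
    return dst
```

Calling set_aside_swarm_state() with no arguments targets the path mentioned above. Note that the node loses its swarm membership this way, so it would need to join or init a swarm again afterwards.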

thaJeztah · 2016-11-07T09:50:06Z

@robbyoconnor also, did you run into this because that daemon hadn't been running for a long period, or did something else happen?

robbyoconnor · 2016-11-07T09:57:04Z

@thaJeztah, I haven't used docker in a while

robbyoconnor · 2016-11-07T10:01:13Z

That fixed it -- but why did I run into this? Hilariously, this is just after I read a piece on Docker issues in production -- some of which I see...some of which I feel are blown out of proportion.

thaJeztah · 2016-11-07T10:21:49Z

@robbyoconnor it's a security mechanism; swarm mode automatically regenerates certificates on a periodic interval so this issue never happens in production. Certificates should expire, to prevent (for example) a stolen or retired node from rejoining. The default certificate expiration time is 90 days, which means that it'll only expire after a daemon has been shut down for three months.

thaJeztah · 2016-11-07T10:43:55Z

Perhaps I should add that the expiration time is configurable through the --cert-expiry flag; https://github.com/docker/docker/blob/v1.12.3/docs/reference/commandline/swarm_init.md

@diogomonica I think we should add a section to the Swarm admin guide (https://docs.docker.com/engine/swarm/admin_guide/) to explain the (dis)advantages of setting a longer/shorter expiration time (I recall some people setting it to a really short time, e.g. 1 hour)

oflebbe · 2016-11-07T10:54:56Z

I am feeling very uneasy: IMHO the automatic ticket regeneration does not seem to work. Why doesn't docker regenerate the ticket by itself when it is expired?

thaJeztah · 2016-11-07T11:06:46Z

Docker automatically regenerates the certificate before it expires. Once it has expired, it should never regenerate the certificate. Regenerating a certificate after it has expired would defeat the security that a certificate provides.
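The rule described here (renew while the certificate is still valid, never after it has expired) can be sketched as a small decision function. This is only an illustration: renewal_action and the renew_fraction parameter are hypothetical, and the thread does not say at which point in the validity window the real rotation is scheduled.

```python
from datetime import datetime, timedelta

def renewal_action(not_before, not_after, now, renew_fraction=0.5):
    """Decide what to do with a certificate at time `now`.

    renew_fraction is an assumed knob: the share of the validity window
    after which renewal is attempted. It is not taken from the source.
    """
    if now >= not_after:
        return "expired"  # never renewed automatically once past expiry
    renew_at = not_before + (not_after - not_before) * renew_fraction
    return "renew" if now >= renew_at else "wait"

issued = datetime(2016, 6, 28)
expiry = issued + timedelta(days=90)
print(renewal_action(issued, expiry, datetime(2016, 7, 1)))   # wait
print(renewal_action(issued, expiry, datetime(2016, 8, 20)))  # renew
print(renewal_action(issued, expiry, datetime(2016, 9, 28)))  # expired
```

A daemon that is not running anywhere between the renewal point and the expiry never gets to take the "renew" branch, which is how a node ends up in the "expired" state on its next start.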

oflebbe · 2016-11-07T11:31:20Z

Are you absolutely sure that the regeneration algorithm is correct?

thaJeztah · 2016-11-07T12:23:38Z

@oflebbe it should be, but possibly @tonistiigi @diogomonica are aware of reasons why it could fail (you mention your daemon has been running, and certificates didn't rotate?)

thaJeztah · 2016-11-07T15:59:54Z

I raised the priority; we should get a fix in 1.13 so that the daemon can start, allowing people to docker swarm leave to resolve this situation

oflebbe · 2016-11-07T19:12:53Z

That would be a convenient way to handle the situation. +1

robbyoconnor · 2016-11-07T22:25:03Z

👍 -- But it shouldn't completely fail to start the engine -- it SHOULD however disable swarm features.....

tonistiigi · 2016-11-08T00:36:43Z

@oflebbe

Are you absolutely sure that the regeneration algorithm is correct?

There was an issue that could delay renewal with certain restart intervals. That was fixed in v1.12.2.

tonistiigi · 2016-11-09T02:07:27Z

@aaronlehmann @diogomonica 3090be9 is the fix on top of #27967 . I think we should wait for #27967 first because otherwise one of them would need a bad rebase.

diogomonica · 2016-11-09T09:18:53Z

@tonistiigi looks good.

tonistiigi added priority/P3 Best effort: those are nice to have / minor issues. area/swarm labels Jun 29, 2016

vdemeester added this to the 1.12.0 milestone Jun 29, 2016

tonistiigi removed this from the 1.12.0 milestone Jun 29, 2016

thaJeztah added this to the 1.13.0 milestone Jul 15, 2016

thaJeztah modified the milestones: 1.12.2, 1.13.0 Sep 30, 2016

LK4D4 mentioned this issue Sep 30, 2016

vendor: update swarmkit for 1.12.2 #27077

Merged

thaJeztah closed this as completed Oct 1, 2016

tonistiigi reopened this Oct 3, 2016

thaJeztah modified the milestones: 1.13.0, 1.12.2 Oct 14, 2016

thaJeztah added priority/P1 Important: P1 issues are a top priority and a must-have for the next release. and removed priority/P3 Best effort: those are nice to have / minor issues. labels Nov 7, 2016

tonistiigi self-assigned this Nov 7, 2016

tonistiigi mentioned this issue Nov 10, 2016

Start daemon if certificates have been expired #28228

Merged

lowenna closed this as completed in #28228 Nov 10, 2016

ijc mentioned this issue Nov 18, 2016

Docker for mac doesn't restart after upgrade to Version 1.12.3 (13776) docker/for-mac#954

Closed

tonistiigi mentioned this issue Dec 8, 2016

[1.13-rc3] getting "certification expired" error on 'docker inspect' #29242

Closed

grahamhoyes mentioned this issue Jul 7, 2021

Swarm certificate expired grahamhoyes/django-docker-swarm-example#2

Open
