Daemon can't be started after swarm certificates expire #24132

Closed
tonistiigi opened this issue Jun 29, 2016 · 34 comments · Fixed by #28228
Labels: area/swarm, priority/P1 (Important: P1 issues are a top priority and a must-have for the next release.)

@tonistiigi
Member

Swarm certificates renew automatically and have a 90-day expiry period by default. Still, if you don't start the daemon during that time, the certificates will expire and starting the daemon will fail with:

time="2016-06-29T17:18:06.165656736Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"

I think refusing to start and not ignoring this error is correct. We could provide a --reset-swarm option to leave the swarm so the user doesn't need to remove the state dir manually. The problem is that the user must remember to remove this option, as otherwise it would clear the state on every subsequent restart as well.

Maybe a good enough solution would be to add instructions for removing the state directory in the error message.

@nathanleclaire

@tonistiigi tonistiigi added priority/P3 Best effort: those are nice to have / minor issues. area/swarm labels Jun 29, 2016
@vdemeester vdemeester added this to the 1.12.0 milestone Jun 29, 2016
@cpuguy83
Member

Could provide a dockerd subcommand to handle cleanup.

@tonistiigi tonistiigi removed this from the 1.12.0 milestone Jun 29, 2016
@thaJeztah thaJeztah added this to the 1.13.0 milestone Jul 15, 2016
@oflebbe

oflebbe commented Sep 29, 2016

My docker for mac suddenly didn't start any more.

Luckily I found the $HOME/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/log/docker.log file, which states that swarm-node.crt within the docker image is no longer valid. I was able to move the Docker.qcow2 image to a Linux box, mount it, remove the swarm-node.crt file inside it, and move the image back, and docker works again. But how is an average user supposed to fix that issue? IMHO this urgently has to be fixed.

Could you please elaborate on "how to remove the state directory"?

@tonistiigi
Member Author

@oflebbe Did you change the certificate expiry time? When did you last use docker before this happened? Any chance some clock skew happened that would have caused this? Can you share the daemon logs with us?

cc @diogomonica

@oflebbe

oflebbe commented Sep 30, 2016

@diogomonica @tonistiigi No I didn't change the expiry time.

Timeline: I have used docker for mac Beta on an almost daily basis since Jun 16th. I tested swarm shortly after swarm integration was announced (Jul 29th?). The time stamp of swarm-root-ca.crt is

-rw-r--r-- 1 root root 684 Jun 28 21:50 swarm-root-ca.crt

I looked at the expiration date of the swarm-node.crt and it was Sep 26th. My mac crashed on Sep 28th and docker didn't start any more. (Jun 28th + 90 == Sep 26th, btw)
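
The date arithmetic in the parenthetical can be checked with a few lines of Python. This is only an illustrative sketch: it assumes the certificate was issued on the same day as the swarm-root-ca.crt timestamp shown above and that the default 90-day lifetime applied.

```python
from datetime import date, timedelta

# Assumption: the node certificate was issued on the day shown in the
# swarm-root-ca.crt timestamp (Jun 28, 2016).
issued = date(2016, 6, 28)

# Default lifetime mentioned in this thread: 90 days.
lifetime = timedelta(days=90)

expires = issued + lifetime
print(expires.isoformat())  # prints 2016-09-26
```

The result matches the Sep 26th expiration date read from swarm-node.crt.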

I did a factory reset on my mac but secured Docker.qcow2 before. So the original logs are gone, unfortunately.

What I do have:

  • I have the Docker.qcow2. I can send you the original /lib/docker/swarm dir for instance (minus swarm-node.*) Maybe tasks.db is of interest for you.
  • A docker log from the failed boot of the Docker.qcow2 (Last Lines are)
    time="2016-09-29T18:06:29.220768897Z" level=debug msg="/usr/local/sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]"
    time="2016-09-29T18:06:30.093105697Z" level=debug msg="successfully loaded the Root CA: /var/lib/docker/swarm/certificates/swarm-root-ca.crt"
    time="2016-09-29T18:06:30.094457479Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"
    time="2016-09-29T18:21:24.066271462Z" level=debug msg="docker group found. gid: 50"
    time="2016-09-29T18:21:24.066842256Z" level=debug msg="Listener created for HTTP on unix (/var/run/docker.sock)"
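
For anyone digging through a similar docker.log, the fatal entry can be picked out mechanically. The following is a rough sketch, assuming one log entry per line in the time="..." level=... msg="..." shape quoted above; the function name fatal_messages and the regular expression are illustrative and not part of Docker.

```python
import re

# Matches the entry shape seen in this thread:
#   time="..." level=<word> msg="..."
LINE_RE = re.compile(
    r'time="(?P<time>[^"]*)"\s+level=(?P<level>\w+)\s+msg="(?P<msg>.*)"\s*$'
)


def fatal_messages(lines):
    """Return the msg text of every entry logged at level=fatal."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "fatal":
            found.append(match.group("msg"))
    return found


# Two entries copied from the log excerpt above.
sample_log = [
    'time="2016-09-29T18:21:24.066271462Z" level=debug msg="docker group found. gid: 50"',
    'time="2016-09-29T18:06:30.094457479Z" level=fatal msg="Error creating cluster '
    'component: error while loading TLS Certificate in '
    '/var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has '
    'expired or is not yet valid"',
]

for message in fatal_messages(sample_log):
    print(message)
```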

It seems the renewal of certificates never happened, turning this into a time bomb.

@oflebbe

oflebbe commented Sep 30, 2016

Find the swarm dir at http://oflebbe.de/docker_swarm.tgz

@tonistiigi
Member Author

@oflebbe Thanks! I created moby/swarmkit#1594 to fix the issue that might have caused it and it should be included in v1.12.2. It's possible this was caused by the combination of renewal and daemon restart timings.

@thaJeztah
Member

It's included in #27077, which is for 1.12.2

@thaJeztah
Member

Closed by #27077 for 1.12.2

@tonistiigi tonistiigi reopened this Oct 3, 2016
@tonistiigi
Member Author

This shouldn't be closed as the fix is unrelated to the original issue. The same message just appeared but for a different reason.

@diogomonica
Contributor

diogomonica commented Oct 5, 2016

@tonistiigi any ideas here?

also:
I think refusing to start and not ignoring this error is correct. We could provide --reset-swarm option to leave swarm so the user doesn't need to remove the state dir manually.

Doesn't docker swarm leave remove all state anyway? Why would --reset-swarm be needed?

@tonistiigi
Member Author

@diogomonica Problem is that daemon doesn't start so you have no way to execute swarm leave.

I think our options are:

  • Add --reset-swarm and risk someone accidentally using it and losing their swarm state. Also, this has limited benefits for docker4mac unless they add it in the UI.
  • Log the error and start the daemon without swarm. If you run docker info you will see the certificate error. You can now do docker swarm leave or maybe docker swarm init --force-new-cluster.
  • As this is mostly for dev machines, which are unlikely to have multiple nodes, we could just use the CA key that is locally available to generate a new certificate.
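
The second option above can be sketched as follows. This is a toy model in Python, not Docker's actual (Go) startup code: start_daemon, load_swarm_node, and CertificateExpiredError are hypothetical names chosen for illustration. It only shows the control flow: a certificate error is recorded instead of being fatal, so the daemon comes up with swarm disabled and the error stays visible.

```python
class CertificateExpiredError(Exception):
    """Hypothetical error type standing in for the x509 failure."""


def load_swarm_node(cert_expired):
    # Hypothetical stand-in for loading the node's TLS certificate.
    if cert_expired:
        raise CertificateExpiredError(
            "x509: certificate has expired or is not yet valid"
        )
    return "swarm-node"


def start_daemon(cert_expired):
    """Start the (toy) daemon; a swarm failure is logged, not fatal."""
    state = {"running": True, "swarm": None, "swarm_error": None}
    try:
        state["swarm"] = load_swarm_node(cert_expired)
    except CertificateExpiredError as err:
        # Keep the daemon up; surface the error (e.g. via an info command).
        state["swarm_error"] = str(err)
    return state


print(start_daemon(cert_expired=True))
```

With the daemon running, the user can then leave the swarm or force a new cluster, as the option describes.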

@diogomonica
Contributor

@tonistiigi I think solution 2 is the right one.

Just making sure: do you believe there are other issues that might be preventing certificate rotation?

@thaJeztah
Member

I think solution 2 is the right one.

Do we know what happens if I re-join the swarm on that node? IIUC, joining again, gives the node a new identity in the swarm; will the manager stop and remove existing tasks when joining again, or "not recognize" the old tasks, and keep them running?

@diogomonica
Contributor

diogomonica commented Oct 10, 2016

This only happens when the node is already down. From the PoV of the manager the node is already dead, and all the tasks have already been scheduled somewhere else. Rejoining will give the node a new identity in the cluster, yes, but all the local tasks that might still be running will be killed.

@thaJeztah thaJeztah modified the milestones: 1.13.0, 1.12.2 Oct 14, 2016
@banjocat

banjocat commented Nov 5, 2016

I ran into this issue. I ended up removing Docker.qcow2.

@diogomonica
Contributor

@tonistiigi any followup on 2?

@robbyoconnor

I...just got bit by this...

@robbyoconnor

This makes me very uneasy -- This should not prevent the engine from starting. It should make swarm features not work, but the engine should start.

@thaJeztah
Member

@robbyoconnor try renaming / removing the /var/lib/docker/swarm directory, which should disable swarm mode
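
A minimal sketch of the rename step, assuming the daemon is stopped and the script runs with enough privileges to touch the directory. The helper name set_aside_state_dir and the ".bak" suffix are illustrative choices, not anything Docker provides; renaming rather than deleting keeps the old state around in case it is needed later.

```python
from pathlib import Path


def set_aside_state_dir(state_dir):
    """Rename state_dir to '<name>.bak' and return the new path.

    Returns None if state_dir does not exist, and refuses to overwrite
    an existing backup.
    """
    src = Path(state_dir)
    if not src.is_dir():
        return None
    dst = src.with_name(src.name + ".bak")
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    src.rename(dst)
    return dst


# Example call (path taken from the comment above; run with the daemon stopped):
#   set_aside_state_dir("/var/lib/docker/swarm")
```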

@thaJeztah
Member

@robbyoconnor also, did you run into this because the daemon hadn't been running for a longer period, or did something else happen?

@robbyoconnor

@thaJeztah, I haven't used docker in a while

@robbyoconnor

That fixed it -- but why did I run into this? Hilariously, this is just after I read a piece on Docker issues in production -- some of which I see...some of which I feel are blown out of proportion.

@thaJeztah
Member

@robbyoconnor it's a security mechanism; swarm mode automatically regenerates certificates on a periodic interval so this issue never happens in production. Certificates should expire, to prevent (for example) a stolen or retired node from rejoining. The default certificate expiration time is 90 days, which means that it'll only expire after a daemon has been shut down for three months.

@thaJeztah
Member

Perhaps I should add that the expiration time is configurable through the --cert-expiry flag; https://github.com/docker/docker/blob/v1.12.3/docs/reference/commandline/swarm_init.md
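
The flag takes a duration, so the 90-day default corresponds to 2160 hours. As a small sketch, the helper below (cert_expiry_args is a made-up name) converts a number of days into an hours-based value for --cert-expiry; passing the result to docker swarm init is left to the reader.

```python
def cert_expiry_args(days):
    """Build the --cert-expiry argument pair for a lifetime given in days."""
    if days <= 0:
        raise ValueError("days must be positive")
    hours = days * 24
    return ["--cert-expiry", f"{hours}h"]


# The 90-day default discussed in this thread.
print(cert_expiry_args(90))  # prints ['--cert-expiry', '2160h']
```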

@diogomonica I think we should add a section to the Swarm admin guide (https://docs.docker.com/engine/swarm/admin_guide/) to explain the (dis)advantages of setting a longer/shorter expiration time (I recall some people setting it to a really short time, e.g. 1 hour)

@oflebbe

oflebbe commented Nov 7, 2016

I am feeling very uneasy: IMHO the automatic certificate regeneration seems not to work. Why doesn't docker regenerate the certificate by itself when it has expired?

@thaJeztah
Member

Docker automatically regenerates the certificate before it expires. Once it has expired, it should never regenerate the certificate. Regenerating a certificate after it has expired would defeat the security that a certificate provides.
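
The rule described here (renew while still valid, never after expiry) can be sketched as a small decision function. This is an illustration only: renewal_action and the renew_fraction parameter are assumptions made for the example and do not reflect the schedule swarm mode actually uses.

```python
from datetime import datetime, timedelta, timezone


def renewal_action(not_before, not_after, now, renew_fraction=0.5):
    """Return 'expired', 'renew', or 'ok' for a certificate at time now.

    renew_fraction is a hypothetical knob: the share of the lifetime
    after which renewal is attempted.
    """
    if now >= not_after:
        # An expired certificate is never renewed automatically.
        return "expired"
    lifetime = not_after - not_before
    renew_at = not_before + lifetime * renew_fraction
    return "renew" if now >= renew_at else "ok"


issued = datetime(2016, 6, 28, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

for offset_days in (10, 60, 92):
    moment = issued + timedelta(days=offset_days)
    print(offset_days, renewal_action(issued, expires, moment))
```

A daemon that stays off past not_after only ever reaches the 'expired' branch, which is the situation reported in this issue.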

@oflebbe

oflebbe commented Nov 7, 2016

Are you absolutely sure that the regeneration algorithm is correct?

@thaJeztah
Member

@oflebbe it should be, but possibly @tonistiigi @diogomonica are aware of reasons why it could fail (you mention your daemon has been running, and certificates didn't rotate?)

@thaJeztah thaJeztah added priority/P1 Important: P1 issues are a top priority and a must-have for the next release. and removed priority/P3 Best effort: those are nice to have / minor issues. labels Nov 7, 2016
@thaJeztah
Member

I raised the priority; we should get a fix in 1.13 so that the daemon can start, allowing people to docker swarm leave to resolve this situation

@oflebbe

oflebbe commented Nov 7, 2016

That would be a convenient way to handle the situation. +1

@tonistiigi tonistiigi self-assigned this Nov 7, 2016
@robbyoconnor

robbyoconnor commented Nov 7, 2016

👍 -- But it shouldn't completely fail to start the engine -- it SHOULD however disable swarm features.....

@tonistiigi
Member Author

@oflebbe

Are you absolutely sure that the regeneration algorithm is correct?

There was an issue that could delay renewal with certain restart intervals. That was fixed in v1.12.2.

@tonistiigi
Member Author

@aaronlehmann @diogomonica 3090be9 is the fix on top of #27967 . I think we should wait for #27967 first because otherwise one of them would need a bad rebase.

@diogomonica
Contributor

@tonistiigi looks good.
