Daemon can't be started after swarm certificates expire #24132
Comments
We could provide a dockerd subcommand to handle cleanup. |
My Docker for Mac suddenly stopped starting. Luckily I found the $HOME/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/log/docker.log file, which stated that swarm-node.crt within the Docker image is no longer valid. I was able to move the Docker.qcow2 image to a Linux box, mount it, remove the swarm-node.crt file within the container, and move the image back, and Docker works again. But how is an average user supposed to fix that issue? IMHO this urgently has to be fixed. Could you please elaborate on "how to remove the state directory"? |
@oflebbe Did you change the certificates' expiry time? When did you last use Docker before this happened? Any chance some clock skew happened that would have caused this? Can you share the daemon logs with us? cc @diogomonica |
@diogomonica @tonistiigi No, I didn't change the expiry time. Timeline: I used the Docker for Mac beta almost daily since Jun 16th. I tested swarm shortly after the swarm integration was announced (Jul 29th?). The time stamp of the swarm-root-ca.pem is -rw-r--r-- 1 root root 684 Jun 28 21:50 swarm-root-ca.crt I looked at the expiration date of the swarm-node.crt and it was Sep 26th. My Mac crashed on Sep 28th and Docker didn't start any more. (Jun 28th + 90 == Sep 26th, btw) I did a factory reset on my Mac but saved Docker.qcow2 beforehand. So the original logs are gone, unfortunately. What I do have:
Seems like the renewal of the certificates never happened, turning this into a time bomb. |
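The date arithmetic in the timeline above can be double-checked with a few lines of Python. This is only a sanity check of "Jun 28th + 90 days"; the year 2016 is an assumption taken from the log line quoted in this issue, and the variable names are illustrative.

```python
from datetime import date, timedelta

# Assumption: the year is 2016, matching the timestamp of the
# log line quoted in this issue.
ca_created = date(2016, 6, 28)          # mtime of swarm-root-ca.crt
default_validity = timedelta(days=90)   # default expiry mentioned in this thread

expiry = ca_created + default_validity
print(expiry.isoformat())  # 2016-09-26
```

The result matches the Sep 26th expiration date read from swarm-node.crt, which is consistent with the certificate never having been renewed after it was first issued.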
Find the swarm dir at http://oflebbe.de/docker_swarm.tgz |
@oflebbe Thanks! I created moby/swarmkit#1594 to fix the issue that might have caused it and it should be included in |
It's included in #27077, which is for 1.12.2 |
Closed by #27077 for 1.12.2 |
This shouldn't be closed as the fix is unrelated to the original issue. The same message just appeared but for a different reason. |
@tonistiigi any ideas here? also: |
@diogomonica Problem is that daemon doesn't start so you have no way to execute I think our options are:
|
@tonistiigi I think solution 2 is the right one. Just making sure: do you believe there are other issues that might be preventing certificate rotation? |
Do we know what happens if I re-join the swarm on that node? IIUC, joining again gives the node a new identity in the swarm; will the manager stop and remove existing tasks when joining again, or "not recognize" the old tasks and keep them running? |
This only happens when the node is already down. From the PoV of the manager the node is already dead, and all the tasks have already been scheduled somewhere else. Rejoining will give the node a new identity in the cluster, yes, but all the local tasks that might still be running will be killed. |
I ran into this issue. I ended up removing Docker.qcow2. |
@tonistiigi any followup on 2? |
I...just got bit by this... |
This makes me very uneasy -- This should not prevent the engine from starting. It should make swarm features not work, but the engine should start. |
@robbyoconnor try renaming / removing the |
@robbyoconnor also, did you run into this because that daemon hadn't been running for a long period, or did something else happen? |
@thaJeztah, I haven't used docker in a while |
That fixed it -- but why did I run into this? Hilariously, this is just after I read a piece on Docker issues in production -- some of which I see... some of which I feel are blown out of proportion. |
@robbyoconnor it's a security mechanism; swarm mode automatically regenerates certificates at a periodic interval, so this issue should not happen on a node that keeps running in production. Certificates should expire, to prevent (for example) a stolen or retired node from rejoining. The default certificate expiration time is 90 days, which means that a certificate will only expire after a daemon has been shut down for three months. |
Perhaps I should add that the expiration time is configurable through the @diogomonica I think we should add a section to the Swarm admin guide (https://docs.docker.com/engine/swarm/admin_guide/) to explain the (dis)advantages of setting a longer/shorter expiration time (I recall some people setting it to a really short time, e.g. 1 hour) |
I am feeling very uneasy: IMHO the automatic certificate regeneration seems not to work. Why doesn't Docker regenerate the certificate by itself when it has expired? |
Docker automatically regenerates the certificate before it expires. Once it has expired, it should never regenerate the certificate. Regenerating a certificate after it has expired would defeat the security that a certificate provides. |
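The rule described above (renew while the certificate is still valid, never after it has expired) can be illustrated with a minimal sketch. This is not swarmkit's actual renewal code: the function name `certificate_action` and the `renew_fraction` parameter are hypothetical, and the point in the lifetime at which renewal starts is an assumption chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

def certificate_action(not_before, not_after, now, renew_fraction=0.5):
    """Classify a certificate as 'ok', 'renew', or 'expired'.

    Hypothetical policy: renewal becomes due once `renew_fraction` of the
    validity period has elapsed. An expired certificate is never renewed,
    because it can no longer prove the node's identity.
    """
    if now >= not_after:
        return "expired"
    lifetime = not_after - not_before
    renew_at = not_before + lifetime * renew_fraction
    return "renew" if now >= renew_at else "ok"

# Example with the 90-day default mentioned in this thread.
issued = datetime(2016, 6, 28, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

print(certificate_action(issued, expires, issued + timedelta(days=10)))  # ok
print(certificate_action(issued, expires, issued + timedelta(days=60)))  # renew
print(certificate_action(issued, expires, issued + timedelta(days=92)))  # expired
```

Under such a policy, a daemon that is offline for the whole window between `renew_at` and `not_after` skips straight from "ok" to "expired", which is the situation reported in this issue.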
Are you absolutely sure that the regeneration algorithm is correct? |
@oflebbe it should be, but possibly @tonistiigi @diogomonica are aware of reasons why it could fail (you mention your daemon has been running, and certificates didn't rotate?) |
I raised the priority; we should get a fix in 1.13 so that the daemon can start, allowing people to |
That would be a convenient way to handle the situation. +1 |
👍 -- But it shouldn't completely fail to start the engine -- it SHOULD however disable swarm features..... |
There was an issue that may delay renewal in certain restart intervals. That was fixed in |
@aaronlehmann @diogomonica 3090be9 is the fix on top of #27967 . I think we should wait for #27967 first because otherwise one of them would need a bad rebase. |
@tonistiigi looks good. |
Swarm certificates automatically renew and have a 90-day expiry period by default. Still, if you don't start the daemon during that time, the certificates will expire and starting the daemon will fail with
time="2016-06-29T17:18:06.165656736Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"
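For anyone scanning a daemon log for this failure, a small helper can flag the relevant line. This is a sketch that matches only on substrings visible in the message above; the function name is hypothetical and the exact log format may differ between Docker versions.

```python
# Substring taken verbatim from the fatal log message quoted in this issue.
EXPIRED_MARKER = "x509: certificate has expired or is not yet valid"

def is_expired_swarm_cert_error(log_line):
    """Return True if a daemon log line reports an expired swarm node certificate."""
    return (
        "level=fatal" in log_line
        and "swarm-node.crt" in log_line
        and EXPIRED_MARKER in log_line
    )

sample = (
    'time="2016-06-29T17:18:06.165656736Z" level=fatal '
    'msg="Error creating cluster component: error while loading TLS Certificate in '
    '/var/lib/docker/swarm/certificates/swarm-node.crt: '
    'x509: certificate has expired or is not yet valid"'
)
print(is_expired_swarm_cert_error(sample))  # True
```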
I think refusing to start and not ignoring this error is correct. We could provide a
--reset-swarm
option to leave the swarm so the user doesn't need to remove the state dir manually. The problem is that the user must remember to remove this option, as otherwise it would clear the state on every subsequent restart as well. Maybe a good enough solution would be to add instructions for removing the state directory to the error message.
@nathanleclaire
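Until such instructions exist, the manual workaround discussed in this thread (removing the swarm state directory) could be scripted roughly as follows. This is a sketch under stated assumptions, not an official procedure: the default path `/var/lib/docker/swarm` is inferred from the certificate path in the error message, the function name is hypothetical, and the directory is renamed rather than deleted so the old state can still be inspected. The daemon should be stopped first, and root privileges are typically required. As noted in the comments, the node loses its swarm identity and has to rejoin.

```python
import shutil
import time
from pathlib import Path

def move_swarm_state_aside(state_dir="/var/lib/docker/swarm"):
    """Rename the swarm state directory so the daemon starts without swarm state.

    Assumption: `state_dir` is where the daemon keeps its swarm state; the
    default is inferred from the path in the error message, not from docs.
    Returns the backup path, or None if the directory does not exist.
    """
    src = Path(state_dir)
    if not src.is_dir():
        return None
    # Timestamped suffix keeps the old certificates around for inspection.
    backup = src.with_name(f"{src.name}.bak-{int(time.time())}")
    shutil.move(str(src), str(backup))
    return backup

# Usage (with the daemon stopped, as root):
#   backup = move_swarm_state_aside()
#   print("old swarm state kept at", backup)
```

Keeping a backup instead of deleting is a deliberate choice here: it leaves the expired swarm-node.crt available for debugging why renewal did not happen.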