Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cert: bad certs that prevent node startup should log much louder #24621

Closed
nvanbenschoten opened this issue Apr 10, 2018 · 9 comments
Closed

cert: bad certs that prevent node startup should log much louder #24621

nvanbenschoten opened this issue Apr 10, 2018 · 9 comments

Comments

@nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Apr 10, 2018

Node certificates that do not fit an expected format can prevent a cluster from successfully forming. For instance, a user on gitter recently had a difficult time starting a cluster while using certs generated by https://github.com/cloudflare/cfssl because we failed to document that the subject line of the node certificates had to be O=Cockroach, CN=node. This was confusing because very little was logged while the nodes were struggling to connect. We can improve documentation to avoid specific issues like this, but in general, we should also log these startup-preventing errors much louder.

@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Apr 10, 2018
@nvanbenschoten nvanbenschoten self-assigned this Apr 10, 2018
@mberhault
Copy link
Contributor

@mberhault mberhault commented Apr 10, 2018

On the docs side, it's deep within the create certificates, openssl tab which has all the details needed to generate the same certs created by cockroach cert. Any tool other than cockroach cert should adapt the openssl docs.

As for the error, there's unfortunately not a whole lot get can get out of bad certificates, this is deep within the tls code. Just about the only way to get certificate details before handshake is to set InsecureSkipVerify and do all the cert checking ourselves (which is just asking for trouble).

@zllovesuki
Copy link

@zllovesuki zllovesuki commented Apr 10, 2018

@mberhault I think the subsection should have a red background instead of blue background to be honest

@bdarnell
Copy link
Member

@bdarnell bdarnell commented Apr 10, 2018

Some errors are out of our hands (including CA or SAN failure), but the issue here was the common name field, and isn't that part under our control? (Even the failures that are coming from within the TLS library should return errors that we ought to be able to log, although they may not be plumbed all the way through cmux/grpc).

@bdarnell
Copy link
Member

@bdarnell bdarnell commented Apr 10, 2018

Also, we should be able to validate the certs for this particular issue without even going over the network. The server should check at startup that $(certs-dir)/node.crt has the correct common name, etc, and crash if not.

@mberhault
Copy link
Contributor

@mberhault mberhault commented Apr 10, 2018

Good point. Filed #24632 to perform additional validation. It wouldn't crash, just exit with error since we load certificates before starting anything (eg: try running cockroach start --certs-dir=not-a-real-dir).

Also filed #24630 for additional (online and offline) debugging tools.

@mberhault
Copy link
Contributor

@mberhault mberhault commented Apr 10, 2018

I've also filed cockroachdb/docs#2899 to look into and maybe document cert generation using cfssl.

@knz knz added the C-enhancement label Apr 27, 2018
@wrouesnel
Copy link

@wrouesnel wrouesnel commented May 9, 2018

Can I +1 this loudly? I just lost about 16 hours of messing around with certificates which looked and validated correctly, but which would not allow a node to join at all.

The logs for this need to scream that this is the problem and why a cluster node is not joining. This is not at all obvious nor was it mentioned in any of the getting started/securing a cluster documentation.

@tbg tbg added this to On the horizon in KV Backlog May 14, 2018
@tbg tbg added the A-kv-security label May 15, 2018
@tbg tbg moved this from On the horizon to Milestone 2 in KV Backlog May 23, 2018
@tbg
Copy link
Member

@tbg tbg commented May 24, 2018

@mberhault could you summarize the state of the art here?

because we failed to document that the subject line of the node certificates had to be O=Cockroach, CN=node.

Also, we should be able to validate the certs for this particular issue without even going over the network. The server should check at startup that $(certs-dir)/node.crt has the correct common name, etc, and crash if not.

These were addressed in #24632, right?

What's left here? Can we close this for #24630?

@mberhault
Copy link
Contributor

@mberhault mberhault commented May 24, 2018

Sure. #24630 is a more focused issue and describes the work left to be done better. The common issues detectable without connection have been addressed.

@mberhault mberhault closed this May 24, 2018
KV Backlog automation moved this from Milestone 2 to Finished (milestone 1, ends 5/28) May 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.