New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cert: bad certs that prevent node startup should log much louder #24621

Closed
nvanbenschoten opened this Issue Apr 10, 2018 · 9 comments

Comments

Projects
None yet
7 participants
@nvanbenschoten
Member

nvanbenschoten commented Apr 10, 2018

Node certificates that do not fit an expected format can prevent a cluster from successfully forming. For instance, a user on gitter recently had a difficult time starting a cluster while using certs generated by https://github.com/cloudflare/cfssl because we failed to document that the subject line of the node certificates had to be O=Cockroach, CN=node. This was confusing because very little was logged while the nodes were struggling to connect. We can improve documentation to avoid specific issues like this, but in general, we should also log these startup-preventing errors much louder.

@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Apr 10, 2018

@nvanbenschoten nvanbenschoten self-assigned this Apr 10, 2018

@mberhault

This comment has been minimized.

Contributor

mberhault commented Apr 10, 2018

On the docs side, it's deep within the create certificates, openssl tab which has all the details needed to generate the same certs created by cockroach cert. Any tool other than cockroach cert should adapt the openssl docs.

As for the error, there's unfortunately not a whole lot get can get out of bad certificates, this is deep within the tls code. Just about the only way to get certificate details before handshake is to set InsecureSkipVerify and do all the cert checking ourselves (which is just asking for trouble).

@zllovesuki

This comment has been minimized.

zllovesuki commented Apr 10, 2018

@mberhault I think the subsection should have a red background instead of blue background to be honest

@bdarnell

This comment has been minimized.

Member

bdarnell commented Apr 10, 2018

Some errors are out of our hands (including CA or SAN failure), but the issue here was the common name field, and isn't that part under our control? (Even the failures that are coming from within the TLS library should return errors that we ought to be able to log, although they may not be plumbed all the way through cmux/grpc).

@bdarnell

This comment has been minimized.

Member

bdarnell commented Apr 10, 2018

Also, we should be able to validate the certs for this particular issue without even going over the network. The server should check at startup that $(certs-dir)/node.crt has the correct common name, etc, and crash if not.

@mberhault

This comment has been minimized.

Contributor

mberhault commented Apr 10, 2018

Good point. Filed #24632 to perform additional validation. It wouldn't crash, just exit with error since we load certificates before starting anything (eg: try running cockroach start --certs-dir=not-a-real-dir).

Also filed #24630 for additional (online and offline) debugging tools.

@mberhault

This comment has been minimized.

Contributor

mberhault commented Apr 10, 2018

I've also filed cockroachdb/docs#2899 to look into and maybe document cert generation using cfssl.

@knz knz added the C-enhancement label Apr 27, 2018

@wrouesnel

This comment has been minimized.

wrouesnel commented May 9, 2018

Can I +1 this loudly? I just lost about 16 hours of messing around with certificates which looked and validated correctly, but which would not allow a node to join at all.

The logs for this need to scream that this is the problem and why a cluster node is not joining. This is not at all obvious nor was it mentioned in any of the getting started/securing a cluster documentation.

@tbg tbg added this to On the horizon in Core Features & Stability May 14, 2018

@tbg tbg added the A-core-security label May 15, 2018

@tbg tbg moved this from On the horizon to Milestone 2 in Core Features & Stability May 23, 2018

@tbg

This comment has been minimized.

Member

tbg commented May 24, 2018

@mberhault could you summarize the state of the art here?

because we failed to document that the subject line of the node certificates had to be O=Cockroach, CN=node.

Also, we should be able to validate the certs for this particular issue without even going over the network. The server should check at startup that $(certs-dir)/node.crt has the correct common name, etc, and crash if not.

These were addressed in #24632, right?

What's left here? Can we close this for #24630?

@mberhault

This comment has been minimized.

Contributor

mberhault commented May 24, 2018

Sure. #24630 is a more focused issue and describes the work left to be done better. The common issues detectable without connection have been addressed.

@mberhault mberhault closed this May 24, 2018

Core Features & Stability automation moved this from Milestone 2 to Finished (milestone 1, ends 5/28) May 24, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment