Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/sqlproxyccl: adding cert manager to allow cert rotation on SIGHUP #63616

Merged
merged 1 commit into from
Apr 15, 2021

Conversation

darinpp
Copy link
Contributor

@darinpp darinpp commented Apr 14, 2021

Previously the proxy didn't process SIGHUP and wasn't reloading the
certificates when SIGHUP is sent to the process.
This is a problem as rotating certificates requires restart and
this desirable.
This PR adds a certification manager that monitors for SIGHUP and
reloads the managed certificates automatically. A similar code exists
for the cockroach executable but the code there is very specific for the
set of certificates that cockroach db uses. The one provided in this PR
is generic and can be used as a building block in any application. The
code that utilized the cert manager is not part of this PR and will be
committed with the changes in the main proxy code. This is in separate PR
to make the review easier.

We currently have a cert manager that cockroachdb used in pkg/security.
This PR can be used as a base of the existing code. The first step of
refactoring the existing code is pulling aside a generic version of a
cert manager that can do part of what pkg/security needs and can also
do what proxy needs. So this PR does this. The second step however is
to change the existing code to create a instantiate the generic cert
manager, put in it the set of certs applicable to cockroach db and
remove the current signal handler. This can happen in another PR.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create a certmanager package, with a certmanager.New method? I could imagine the existing security.NewCertificateManager becoming something like crdbcerts.New or security.NewCrdbCertManager, etc. Someone from the Server team should weigh in on where this code might go longer term.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @knz)


pkg/ccl/sqlproxyccl/security/cert.go, line 17 at r1 (raw file):

// Cert is a generic presentation on managed certificate that can be reloaded.
type Cert interface {
	Reload(ctx context.Context)

Need comments for these methods. Our pattern is to put the detailed comments here on the interface, and then use comments like, // Reload implements the Cert interface. on the implementation methods.

Also, add more detail for the interface comment, along lines of what you put in the commit message.


pkg/ccl/sqlproxyccl/security/cert.go, line 18 at r1 (raw file):

type Cert interface {
	Reload(ctx context.Context)
	Err() error

What's the purpose of having the Err and ClearErr methods? Why not just a single Reload(ctx context.Context) error method?


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 69 at r1 (raw file):

// Registers a signal handler that triggers on SIGHUP and reloads the
// certificates. The handler will shutdown when the context is done.
func (cm *CertManager) startMonitorUnlocked() {

I think this and the below method should be called startMonitorLocked rather than Unlocked, because you expect the lock to have already been acquired by the caller.

Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fairly easy to to rename the package and the constructor or move the code somewhere else. I'll leave it as is for now because I only need this for the proxy but if people feel that it should go somewhere else - I'll do it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @knz)


pkg/ccl/sqlproxyccl/security/cert.go, line 17 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Need comments for these methods. Our pattern is to put the detailed comments here on the interface, and then use comments like, // Reload implements the Cert interface. on the implementation methods.

Also, add more detail for the interface comment, along lines of what you put in the commit message.

Added more detailed comments on the interface.


pkg/ccl/sqlproxyccl/security/cert.go, line 18 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

What's the purpose of having the Err and ClearErr methods? Why not just a single Reload(ctx context.Context) error method?

Also added comments about this. But the main reason is that Reload is called by the cert manager's go routine and an error during the reload will not be passed back to the client code. So instead - the cert manager will store the error and let the user act on it (by fetching a new file for example). Once the user eliminates the reason for the error and clears the error, the manager will continue with the reloads.


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 69 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

I think this and the below method should be called startMonitorLocked rather than Unlocked, because you expect the lock to have already been acquired by the caller.

Duh. Fixed.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Darin for the PR.
I agree we'll eventually want to move this out of the ccl tree. It's generic code that's not tied to a licensed feature.

More generally, I'm interested to know why you created a new cert manager from scratch, instead of using the cert manager that's already implemented in pkg/security? What shortcomings did the existing code have that prevented you from using it?

Also I'd like to invite @itsbilal to have a look as well since Bilal was active in this area recently.
Also @aaron-crl since Aaron is currently investigating how to improve our TLS cert management elsewhere in CockroachDB.

Reviewed 11 of 13 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @darinpp, and @knz)


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 79 at r2 (raw file):

			log.Ops.Infof(ctx, "received signal %q, triggering certificate reload", sig)
			cm.Reload()
		case <-ctx.Done():

This will need to be connected to a stop.Stopper and wait on stopper.ShouldQuiesce().

Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested to know why you created a new cert manager from scratch, instead of using the cert manager that's already implemented in pkg/security

As I wrote in the comments, the one that is in pkg/security is hard coded to serve the use case that cockroach db has. None of this is applicable to the proxy for example:

	// Set of certs. These are swapped in during Load(), and never mutated afterwards.
	caCert         *CertInfo // default CA certificate
	clientCACert   *CertInfo // optional: certificate to verify client certificates
	uiCACert       *CertInfo // optional: certificate to verify UI certficates
	nodeCert       *CertInfo // certificate for nodes (always server cert, sometimes client cert)
	nodeClientCert *CertInfo // optional: client certificate for 'node' user. Also included in 'clientCerts'
	uiCert         *CertInfo // optional: server certificate for the admin UI.
	clientCerts    map[SQLUsername]*CertInfo

	// Certs only used with multi-tenancy.
	tenantClientCACert, tenantClientCert *CertInfo

This is not a general purpose cert manager and so it doesn't work for me. The code in this PR is general purpose, reusable and could be used as a base of pkg/security code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @darinpp, and @knz)

@knz
Copy link
Contributor

knz commented Apr 14, 2021

the one that is in pkg/security is hard coded to serve the use case that cockroach db has. None of this is applicable to the proxy

The immediate question this comment raises is -- why not modify and update the existing code to do what you need, instead of producing entirely new code? Was there an obstacle here? Is there something we can learn from?

This is not a general purpose cert manager and so it doesn't work for me. The code in this PR is general purpose, reusable and could be used as a base of pkg/security code.

Well then at the very least

  1. the motivation for this choice should be explained in the commit message.
  2. if you are clear that "reusable and could be used as a base of pkg/security code" then it would also be useful to advertise this fact in the commit message, and possibly file an issue with some suggestions as to how to achieve that.

@knz knz added this to In progress in DB Server & Security via automation Apr 14, 2021
Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about this PR as a start of modifying the existing code. The first step of refactoring the existing code is pulling aside a generic version of a cert manager that can do part of what pkg\security needs and can also do what proxy needs. So this PR does this. The second step however is to change the existing code to create a new generic cert manager, put in it the set of certs applicable to cockroach db and remove the current signal handler. I'm not sure that I have time to do this. Maybe after the proxy is migrated to the DB repo and we are unblocked on the k8s operator side.

I'll add this as a comment in the PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @knz)


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 79 at r2 (raw file):

Previously, knz (kena) wrote…

This will need to be connected to a stop.Stopper and wait on stopper.ShouldQuiesce().

The code doesn't use stop.Stopper at all. There is little benefit here to just add stopper as an extra argument so we can cancel also on stopper.ShouldQuiesce(). A client that uses the cert manager and uses the stopper can simply pass the stopper.CancelOnQuiesce() cancelable context and achieve the same thing without having an extra argument nor the dependency on stop.Stopper. I'll add comment.

Copy link

@aaron-crl aaron-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the one that is in pkg/security is hard coded to serve the use case that cockroach db has. None of this is applicable to the proxy

I very much wanted to refactor the cert manager code in pkg/security during the last release process but didn't have time. It is high on my list of priorities this cycle due to the relation with certificate rotation and management. It's implementation isn't terribly suitable for that either.

The second step however is to change the existing code to create a new generic cert manager, put in it the set of certs applicable to cockroach db and remove the current signal handler. I'm not sure that I have time to do this. Maybe after the proxy is migrated to the DB repo and we are unblocked on the k8s operator side.

I think this PR makes sense. As someone who recently tangled with the old cert manager refactoring it will likely be a largish lift and shouldn't block this work. I've had a quick read through of this PR and think this creates a reasonable base for that refactor without adding much tech debt.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @knz)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure of the code is OK.
I love the tests, they are very good.

Just a few nits here and there and this is good to go.

Reviewed 5 of 5 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @darinpp, and @knz)


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 97 at r3 (raw file):

	errCount := 0
	for _, cert := range cm.certs {

How many certs do you envision a manager to handle? If more than a dozen, it may be worth checking for cancellation in-between iterations.


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 100 at r3 (raw file):

		cert.Reload(cm.ctx)
		if cert.Err() != nil {
			errCount++

May be worth logging the failure here.


pkg/ccl/sqlproxyccl/security/file_cert.go, line 53 at r3 (raw file):

	certBytes, err := ioutil.ReadFile(fc.certFile)
	if err != nil {
		log.Ops.Warningf(ctx, "could not reload cert file %s: %v", fc.certFile, err)

May be worth saying fc.err = errors.Wrapf(err, "could not reload cert file %s", fc.certFile)

then have the logging occur in the caller of Reload()

This ensures that the responsibility for logging is just in one place.

Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @knz)


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 97 at r3 (raw file):

Previously, knz (kena) wrote…

How many certs do you envision a manager to handle? If more than a dozen, it may be worth checking for cancellation in-between iterations.

Good suggestion. I added a check after each cert Reload to terminate the loop if the context gets canceled while reloading. Added also a test that verifies that it works.


pkg/ccl/sqlproxyccl/security/cert_manager.go, line 100 at r3 (raw file):

Previously, knz (kena) wrote…

May be worth logging the failure here.

Added.


pkg/ccl/sqlproxyccl/security/file_cert.go, line 53 at r3 (raw file):

Previously, knz (kena) wrote…

May be worth saying fc.err = errors.Wrapf(err, "could not reload cert file %s", fc.certFile)

then have the logging occur in the caller of Reload()

This ensures that the responsibility for logging is just in one place.

Good suggestion. I removed the logging from here and moved into the manager.

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @knz)


pkg/ccl/sqlproxyccl/security/cert.go, line 18 at r1 (raw file):

Previously, darinpp wrote…

Also added comments about this. But the main reason is that Reload is called by the cert manager's go routine and an error during the reload will not be passed back to the client code. So instead - the cert manager will store the error and let the user act on it (by fetching a new file for example). Once the user eliminates the reason for the error and clears the error, the manager will continue with the reloads.

It might be a cleaner design to make Reload return an error and then have the cert manager store that error rather than the Cert implementation. You'd change the map to be map[string]cachedCert, where cachedCert would be a struct with Cert and Err fields. Doing it this way would make implementing the Cert interface simpler and more intuitive.

Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @knz)


pkg/ccl/sqlproxyccl/security/cert.go, line 18 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

It might be a cleaner design to make Reload return an error and then have the cert manager store that error rather than the Cert implementation. You'd change the map to be map[string]cachedCert, where cachedCert would be a struct with Cert and Err fields. Doing it this way would make implementing the Cert interface simpler and more intuitive.

Unless you feel very strongly about that - I prefer to keep it as is now. I agree that the interface is simpler but I like storing the reload state (the err) in the Cert itself vs the manager. Right now, I can manually reload a cert and if there is an error (imagine timeout waiting to get the cert from a network or something) the error will be visible to the manager so the manager won't attempt to reload on SIGHUP (and incur another timeout for example). Also I don't need to keep reference to the manager to be able to check the last err generated when that manager tried reloading. I just need the cert. I can also have multiple managers managing distinct sets of certs (some overlapping) that also share the err info. A manager that used inotify for example comes to mind. So keeping the err with the cert seems to me the more flexible setup even though it isn't as clean.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit message: you wrote pkg\security two times instead of pkg/security.

Also now that the code is all right, it may be time to move the code to pkg/security (alongside the existing code) and update the license headers.

Reviewed 4 of 4 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @darinpp)


pkg/security/certmgr/cert.go, line 11 at r5 (raw file):

// licenses/APL.txt.

//go:generate mockgen -package=certmgr -destination=mocks_generated.go -source=cert.go . Cert

nit: we generally place the go:generate lines somewhere under the package directive, as the spot above it is reserved for package-level documentation.

Copy link
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code has been moved into certmgr directory under pkg/security, the license headers are updated and the commit message is fixed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @darinpp, and @knz)

Previously the proxy didn't process SIGHUP and wasn't reloading the
certificates when SIGHUP is sent to the process.
This is a problem as rotating certificates requires restart and
this desirable.
This PR adds a certification manager that monitors for SIGHUP and
reloads the managed certificates automatically. A similar code exists
for the cockroach executable but the code there is very specific for the
set of certificates that cockroach db uses. The one provided in this PR
is generic and can be used as a building block in any application. The
code that utilized the cert manager is not part of this PR and will be
committed with the changes in the main proxy code. This is in separate PR
to make the review easier.

We currently have a cert manager that cockroachdb used in `pkg/security`.
This PR can be used as a base of the existing code. The first step of
refactoring the existing code is pulling aside a generic version of a
cert manager that can do part of what `pkg/security` needs and can also
do what proxy needs. So this PR does this. The second step however is
to change the existing code to create a instantiate the generic cert
manager, put in it the set of certs applicable to cockroach db and
remove the current signal handler. This can happen in another PR.

Release note: None
@darinpp
Copy link
Contributor Author

darinpp commented Apr 15, 2021

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 15, 2021

Build succeeded:

@craig craig bot merged commit b225345 into cockroachdb:master Apr 15, 2021
DB Server & Security automation moved this from In progress to Done 21.2 Apr 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
DB Server & Security
  
Done 21.2
Development

Successfully merging this pull request may close these issues.

None yet

5 participants