
Invalid certificate update results in bad_certificate requests #8

Closed
jcvrabo opened this issue Feb 7, 2023 · 0 comments · Fixed by #12

jcvrabo commented Feb 7, 2023

We have noticed a situation where, occasionally, a single instance of an app will start raising bad_certificate exceptions when calling other apps in Cloud Foundry.

The error is caused by the go-router refusing the client certificate, and it affects only apps that do not explicitly use mTLS, i.e. the ones doing mTLS with the go-router using the instance identity certificate.

We have thoroughly investigated this incident. Looking inside the app instance container shows a proper key/certificate pair with proper validity. Restarting the app instance eliminates the issue (at least for the following 24 hours). The errors do not start after the certificate expires, but after the certificate is renewed. Touching the certificate and/or key file in the mounted volume (through the diego-cell, as root) triggers the filewatcher and also resolves the issue.

Our conclusion is that this is a race condition. The certificate file is always updated first and the key file follows. Most of the time this update is almost instantaneous, but occasionally there is a millisecond or two of difference. Since the watcher threads do not update the keystore in a synchronised block, we think the following happens: the first file-watcher thread that is triggered reads a mismatching private key file, but finishes after the second file-watcher thread (which reads and updates the keystore with a matching key/certificate pair), so the keystore ends up with a mismatched key/certificate pair.

Such a situation will also match the observed behaviour (raised bad_certificate exceptions).
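The interleaving described above can be modelled deterministically. The sketch below is a hypothetical minimal model (not the actual watcher code): two threads each take a snapshot of the certificate/key pair and overwrite a shared map entry without coordinating. The thread that read the stale key is delayed so that it writes last, leaving the mismatched pair in place.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the file-watcher race: two watcher threads each read
// the (certificate, key) pair and overwrite the keystore entry without
// synchronisation. If the thread that read a stale key finishes last, the
// keystore ends up with a mismatched pair.
public class WatcherRaceSketch {

    // Stands in for the real KeyStore entry.
    static final Map<String, String> keystore = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        // Thread 1: triggered by the certificate update; it reads the new
        // certificate together with the old key, is then delayed, and writes last.
        Thread certWatcher = new Thread(() -> {
            String cert = "cert-v2", key = "key-v1"; // mismatched snapshot
            sleep(50);                               // finishes after keyWatcher
            keystore.put("instance", cert + "/" + key);
        });
        // Thread 2: triggered by the key update; it reads the matching pair.
        Thread keyWatcher = new Thread(() -> {
            String cert = "cert-v2", key = "key-v2"; // matching snapshot
            keystore.put("instance", cert + "/" + key);
        });
        certWatcher.start();
        keyWatcher.start();
        certWatcher.join();
        keyWatcher.join();
        System.out.println(keystore.get("instance")); // prints "cert-v2/key-v1"
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The deliberate delay makes the "wrong thread finishes last" ordering reproducible; in production the same ordering depends on scheduling, which is why the bug appears in under 1% of instances.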

We have built our own custom security provider to try to log this behaviour (and other potential root causes), but (un)fortunately, all apps using our custom security provider always work. We do notice an occasional thread reading a mismatched key/certificate pair, but always in the proper order: the thread reading the mismatched pair finishes first, and the keystore is then properly updated by the second thread, which reads a properly matched key/certificate pair.

This started occurring roughly 6 months ago. We run two foundations with 1600+ apps each and more than 3000 instances per foundation. The issue can potentially hit less than 1% of the instances on any given day. The main changes around that time were switching to bionic stemcells and to larger, newer VMs for the diego-cells in Azure (we also have AWS foundations, but they are not affected).

Since this has become a real nuisance for us, we could not wait for our logging to produce hard proof of the race condition; it seems to be a typical heisenbug, where the extra log entries and check operations might add just enough delay for the thread to always read a properly matched key/certificate pair.

I will add a pull request with a solution for this situation. While a simple synchronized block would already fix the issue by guaranteeing that the second thread always finishes last, I prefer to update the keystore only if the private key matches the certificate. That also closes the theoretical window of a few microseconds in which a request could be served from a keystore holding a mismatched key/certificate pair.
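The guard could look roughly like the following. This is a hypothetical sketch, not the contents of the actual pull request: for RSA instance-identity credentials, the private key matches the certificate exactly when the modulus of the certificate's public key (obtained via `cert.getPublicKey()`) equals the modulus of the private key. The class and method names are illustrative.

```java
import java.math.BigInteger;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.interfaces.RSAPrivateKey;
import java.security.interfaces.RSAPublicKey;

// Hypothetical sketch of the proposed guard: before replacing the keystore
// entry, verify that the freshly read private key matches the certificate's
// public key. On a mismatch the watcher keeps the last good pair and waits
// for the next file event.
public class KeyCertMatch {

    public static boolean matches(PrivateKey priv, PublicKey pub) {
        if (priv instanceof RSAPrivateKey && pub instanceof RSAPublicKey) {
            BigInteger privModulus = ((RSAPrivateKey) priv).getModulus();
            BigInteger pubModulus = ((RSAPublicKey) pub).getModulus();
            return privModulus.equals(pubModulus);
        }
        // Unknown key type: refuse the update rather than risk a bad pair.
        return false;
    }
}
```

With this check in place, the ordering of the two watcher threads no longer matters: a thread holding a mismatched snapshot simply declines to update the keystore, regardless of when it finishes.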
