how to handle ssl certificate rotation scenario from client #2868

Open
4 of 7 tasks
adinigam opened this issue May 4, 2020 · 27 comments

Comments

@adinigam
Contributor

adinigam commented May 4, 2020

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

Description

How is an application supposed to use librdkafka to handle an SSL certificate rotation scenario?

How to reproduce

Below is the flow of events:

  1. The client fetches an SSL certificate and gives it to the librdkafka producer.
  2. It sends a few messages.
  3. The SSL certificate initially presented expires. At this point we would like to present the rotated certificate to the server/broker. How do we do that today?

IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 1.2

  • Apache Kafka version: 2.4

  • librdkafka client configuration: metadata.broker.list=broker1:9093, security.protocol=ssl, ssl.certificate.pem=std::string, ssl.key.password=std::string, enable.ssl.certificate.verification=true, config->set_ssl_cert(RdKafka::CERT_PRIVATE_KEY, RdKafka::CERT_ENC_PKCS12, const void *buffer, size_t size, std::string &errstr), ssl_cert_verify_cb=SecurityVerifyCallback

    class SecurityVerifyCallback : public RdKafka::SslCertificateVerifyCb
    {
    public:
        bool ssl_cert_verify_cb(std::string const &brokerName, int32_t brokerId,
                                int *error, int depth, const char *buffer,
                                size_t size, std::string &errstr) override;
    };

  • Operating system: windows

  • Provide logs (with debug=.. as necessary) from librdkafka

  • Provide broker log excerpts

  • Critical issue

@edenhill
Contributor

edenhill commented May 8, 2020

There is no client-side support for this, and I'm not sure whether the broker supports it either, but if the broker were to support TLS PHA (post-handshake authentication) we could enable it in the client as well with SSL_CTX_set_post_handshake_auth():

https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_post_handshake_auth.html

@adinigam
Contributor Author

adinigam commented May 13, 2020

@edenhill Thanks for responding. The broker has connections.max.reauth.ms: "The broker will disconnect any such connection that is not re-authenticated within the session lifetime and that is then subsequently used for any purpose other than re-authentication"
https://docs.confluent.io/current/installation/configuration/broker-configs.html#connections.max.reauth.ms

Ideally it would be good for a client application to be a good citizen and present the rotated certificate to the server before the connection is forcibly disconnected.
Will it take librdkafka support to call SSL_CTX_set_post_handshake_auth, or does the application have a way to call it?

@edenhill
Contributor

From what I understand, Kafka session re-authentication is for the SASL layer only, not SSL.

librdkafka currently does not call SSL_CTX_set_post_handshake_auth. It could, but then it would need to provide a certificate callback to deliver the new certificate to the server, and this has implications for the librdkafka-based clients in other languages.
So until there is a definitive use-case for this feature I'd say we hold off.

@crazed

crazed commented May 14, 2020

I think there is a pretty good argument that dynamic SSL rotation is a must-have these days. Many systems out there support this, and companies have introduced short-lived certificates as a way to identify applications.

From a Kafka server standpoint, I have not tried this yet, but it does appear that brokers support dynamic TLS reconfiguration. With certificate rotation it becomes possible to plug in existing PKI infrastructure as a source for ACLs.

@edenhill
Contributor

We'll add support for this in the clients when the broker supports it.

@crazed

crazed commented May 14, 2020

The brokers already do support this though:

Brokers may be configured with SSL key stores with short validity periods to reduce the risk of compromised certificates. Key stores may be updated dynamically without restarting the broker. The configuration name must be prefixed with the listener prefix listener.name.{listenerName}. so that only the key store configuration of a specific listener is updated. The following configurations may be updated in a single alter request at per-broker level:

  • ssl.keystore.type
  • ssl.keystore.location
  • ssl.keystore.password
  • ssl.key.password

If the listener is the inter-broker listener, then the update is allowed only if the new key store is trusted by the trust store configured for that listener. For other listeners, no trust validation is performed on the key store by the broker. Certificates must be signed by the same certificate authority that signed the old certificate to avoid any client authentication failures.

It was added as part of KIP-226

@edenhill
Contributor

I believe that is not enforced for existing connections, just new ones:

Keystore will be updated by reconfiguring the channel builder to create a new SslFactory. Existing connections will not be affected, new connections will use the new keystore.

@gopalrajpurohit

We are very much interested in this feature, as our infrastructure rotates out the keys/certificates of all hosts every couple of hours. Hence there are a few questions about the behavior of the librdkafka consumer w.r.t. certificates/keys provided from a file location. Does librdkafka store the certificates/keys read from file and re-use them for other connections, or does it always read them afresh from disk when establishing new connections to Kafka brokers?

More clarity on how the keys/certificates are used when provided in-memory / from file, the assumptions regarding the lifecycle of keys/certificates, the errors received by clients when the server fails to authenticate the client, etc. will be helpful for implementing their use in consumers correctly.

Also, what happens when an already-authenticated client connection is closed by the broker to induce the consumer to re-authenticate? Will the consumer in this case reload keys/certificates from disk, or will it re-use them from the previously established connection?

@edenhill
Contributor

librdkafka will load certificates, et al., once per instance at rd_kafka_new() instantiation, and they will be stored in memory on the SSL_CTX; the SSL_CTX is then reused for all broker connections the client instance makes.

I think the SSL cert reauth can be done without reconnecting using the PHA extension.

@gopalrajpurohit

Does it need to be implemented in librdkafka, or do we have to do it in client code?

Moreover, how do we handle this in client code if we do not get an error while reading messages?

Also, does it handle both scenarios, i.e. keys/certs rotated not only for clients/consumers but also for the brokers?

@edenhill
Contributor

We'll need to wait for the broker implementation before addressing this in librdkafka.

@KJTsanaktsidis

Currently, we use TLS client auth and nothing else for our Kafka clients. If I've understood this correctly, the certificate validity is checked once by the broker when a connection is first made, and then never again; a connected client whose certificate expires can stay connected. This is what the broker supporting TLS post-handshake auth would fix, right?

At the moment, we use short-lived TLS client certificates anyway. They will eventually expire and stay connected, and I'm fine with that. But, if librdkafka decides to open a new broker connection after the certificate expires, that will of course fail. Then, we have to tear down the whole producer/consumer and restart it with up-to-date credentials.

Ideally, it'd be great if librdkafka supported passing in a callback to set the client cert/key instead, so that when asked, the application can always respond with an up-to-date credential. rd_kafka_conf_set_ssl_cert_verify_cb seems to provide this capability for CA certificates, so it'd be cool if there was a rd_kafka_conf_set_ssl_cert_get_cb or something equivalent as well.
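To make the idea concrete, here is a rough sketch of what such a hook might look like. None of these names exist in librdkafka today; the typedef, the callback shape, and the PEM strings are all hypothetical placeholders:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback shape -- NOT a real librdkafka API. The library
 * would invoke this on each (re)connect attempt; the application copies
 * its freshest PEM credential into the supplied buffers. Returns 0 on
 * success, -1 on failure. */
typedef int (*ssl_cert_get_cb_t)(const char *broker_name,
                                 char *cert_pem, size_t cert_pem_size,
                                 char *key_pem, size_t key_pem_size,
                                 void *opaque);

/* Example application callback: "opaque" points at whatever state holds
 * the current (possibly rotated) certificate, e.g. re-read from disk. */
static int my_ssl_cert_get_cb(const char *broker_name,
                              char *cert_pem, size_t cert_pem_size,
                              char *key_pem, size_t key_pem_size,
                              void *opaque) {
    const char *current_cert = (const char *)opaque;
    const char *current_key = "-----BEGIN PRIVATE KEY-----\n...placeholder...";

    (void)broker_name; /* could be used to pick per-broker credentials */

    if (strlen(current_cert) >= cert_pem_size ||
        strlen(current_key) >= key_pem_size)
        return -1; /* buffer too small: signal failure to the library */

    strcpy(cert_pem, current_cert);
    strcpy(key_pem, current_key);
    return 0;
}
```

Because the application is asked for credentials at connect time, rotation needs no coordination: whatever is current when the callback fires is what the handshake uses.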

@KJTsanaktsidis

And since "it would be great if..." isn't that productive in open-source, I guess the question really should be "would you accept a patch that..." instead :)

@KJTsanaktsidis

@edenhill @adinigam I had a look at what would be involved in asking librdkafka to re-read certificates off disk, and came up with this diff: master...zendesk:ssl_fork

Curious if you have any thoughts? Do you think it's worthwhile turning this approach into a PR? What other things would need to be supported?

@edenhill
Contributor

Thanks for sharing your suggested solution.
In a previous comment you seem to favour a callback-based approach to retrieve certificates, while the code uses a more broad ssl_refresh() approach.
Which one would you prefer?

I think from a usability perspective it would be easier on the application to have a cert-retrieve callback, so it doesn't have to keep track of certificate lifetimes; this retrieval would be invoked on each (re)connect attempt.
I'm guessing librdkafka could even track the lifetime of the cert to know when to refresh it, or only trigger the callback if the SSL handshake fails, or a combination of the two?

@adinigam
Contributor Author

Great work @KJTsanaktsidis.
I would be inclined towards librdkafka triggering the callback when the SSL handshake fails.
Even if librdkafka keeps track of the cert lifetime, that is still not enough, because certs are renewed in production environments well before they expire. In many cases they are even renewed manually for various reasons.
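For comparison, lifetime tracking alone would look something like the sketch below: refresh once a fixed fraction of the validity window has elapsed. This is a common production heuristic, not anything librdkafka does today, and not_before/not_after merely stand in for the X.509 validity fields:

```c
#include <time.h>

/* Refresh once two thirds of the validity window has elapsed, mirroring
 * the practice of renewing well before expiry. Purely illustrative. */
static time_t next_refresh_at(time_t not_before, time_t not_after) {
    return not_before + (not_after - not_before) * 2 / 3;
}

/* Returns nonzero once the refresh deadline has passed. */
static int should_refresh(time_t now, time_t not_before, time_t not_after) {
    return now >= next_refresh_at(not_before, not_after);
}
```

As the comment above notes, a deadline like this can never capture out-of-band renewals, which is why a handshake-failure trigger (or a callback consulted on every connect) is the more robust design.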

@edenhill
Contributor

If we constructed the cert-refresh/retrieve callback (called on each connect attempt) with an error parameter that, if set, indicates that the last connection attempt failed, we would allow the application to distinguish between the cases where a refresh is required and where it is not. That is, if the callback is triggered without an error set, the application could simply ignore it and have librdkafka use the current cert.
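A tiny model of that contract (hypothetical names, no real librdkafka types): last_error == 0 means the previous attempt succeeded, so the application declines and the library keeps the currently configured cert; a nonzero last_error prompts it to supply a fresh one.

```c
#include <string.h>

/* Hypothetical refresh callback contract: return 1 if a new cert was
 * written into cert_pem, 0 to keep the currently configured cert.
 * "opaque" points at the application's freshest PEM credential. */
static int refresh_cb(int last_error, char *cert_pem, size_t size,
                      void *opaque) {
    const char *fresh = (const char *)opaque;
    if (last_error == 0)
        return 0; /* previous connect succeeded: no refresh needed */
    strncpy(cert_pem, fresh, size - 1);
    cert_pem[size - 1] = '\0';
    return 1;
}
```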

@KJTsanaktsidis

I think a callback approach is probably more flexible, and avoids questions like "why can I get updating certificates with ssl.certificate.location and not ssl.certificate.pem". I'll be honest, the branch I posted here was pretty much just the minimum viable hacking I could do to see if I could make something work for our use case.

To put my Go glasses on, the callback interface is in fact what would be required to support having confluent-kafka-go accept a Golang stdlib tls.Config object (since callbacks-for-certificates is the model exposed there as well). The callback also lines up nicely with any hypothetical work around PHA, since the callback could be invoked again and a new certificate supplied there.

The only tricky thing about the callback is that it requires a little bit more surgery; at the moment, there's one SSL_CTX shared for the whole rdkafka object, but if we go with a callback-based approach, we'd have to construct a new SSL_CTX per connection. Requires re-jigging the lifecycle a bit, and I'm not sure if there are any other (resource-utilisation?) downsides to not re-using OpenSSL context objects.

I might play around with adding a callback & benchmarking it next time I get a few spare cycles.

@edenhill
Contributor

edenhill commented Sep 3, 2020

IIRC it is problematic to call into Go (from C) from a thread that the Go runtime does not know about, in this case one of the librdkafka broker threads. That's also why the Go client uses the event queue interface, which we could also use for this, but that would block the broker thread for a pretty long time waiting for the Go event to be handled, and that'll lead to other problems.

@KJTsanaktsidis

@edenhill @adinigam I did a bit of experimentation around what a callback interface for librdkafka fetching certificates might look like, and how that can be integrated with confluent-kafka-go.

It turns out OpenSSL can actually do most of the heavy lifting for us: there's SSL_CTX_set_cert_cb (https://www.openssl.org/docs/man1.1.1/man3/SSL_set_cert_cb.html), which works pretty much like the verify callback we're already using.

Basically, we can implement the verify and fetch-client-cert callbacks in terms of a golang tls.Config, which means any certificate rotation can be handled by just returning a new certificate from GetClientCertificate.
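To make the flow concrete, here is a dependency-free model of the cert_cb contract. The structs below are stand-ins for OpenSSL's SSL* and for the application's credential store, not real OpenSSL types; the real callback is installed with SSL_CTX_set_cert_cb and returns 1 to continue the handshake or 0 to abort it:

```c
#include <string.h>

/* Stand-in for the per-connection SSL object. */
struct conn {
    char cert_pem[256]; /* certificate installed for this handshake */
};

/* Application state holding the freshest credential. */
struct cred_store {
    const char *current_cert_pem;
};

/* Model of the cert_cb: runs during the handshake, installs whatever
 * certificate is current right now, returns 1 on success, 0 to abort.
 * This per-handshake lookup is what makes rotation transparent. */
static int cert_cb(struct conn *c, void *arg) {
    struct cred_store *creds = arg;
    if (!creds->current_cert_pem)
        return 0; /* no credential available: fail the handshake */
    strncpy(c->cert_pem, creds->current_cert_pem, sizeof(c->cert_pem) - 1);
    c->cert_pem[sizeof(c->cert_pem) - 1] = '\0';
    return 1;
}
```

In the confluent-kafka-go mapping described above, the body of cert_cb would call back into Go's GetClientCertificate, so rotation is entirely driven by the application's tls.Config.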

Thoughts? An approach worth pursuing?

@KJTsanaktsidis

So this took a little while, but I put together a PR for how this could work: #3180

@vctoriawu

@KJTsanaktsidis @edenhill I'm looking to implement a similar feature to handle client SSL cert rotation, and am also wondering if using the OpenSSL SSL_CTX_set_cert_cb function is the best approach for this? Thanks!

@KJTsanaktsidis

@vctoriawu well, we're planning to send my patch ☝️ it to production next week I think, so.... I'll let you know? 😂

@YuFengxin229

Hi @KJTsanaktsidis, I'm looking forward to the solution for client SSL cert rotation; when might it land in the master branch? Thanks.

@KJTsanaktsidis

Merging is obviously something that’s up to the librdkafka maintainers. However, we deployed the change to production at Zendesk last week and it’s been working great.

One hitch: TLS added an extra ~40 ms of latency to every publish acknowledgement, but that turned out to be related to Nagle's algorithm on the socket. Enabling the socket.nagle.disable option in our producer configs fixed this nicely!
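For reference, socket.nagle.disable is a standard librdkafka configuration property that sets TCP_NODELAY on broker sockets; the fix above amounts to this one-line config fragment:

```
# librdkafka producer configuration: disable Nagle's algorithm (TCP_NODELAY)
socket.nagle.disable=true
```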

@StephanDollberg

Any update on this?

One thing that I find a bit odd already in the current state is that errors from failed client auth are seemingly marked as non-fatal.

This means that, because the library treats them as resolvable, you basically get stuck once client certs expire and librdkafka tries to reconnect under the hood.

@Atheuz

Atheuz commented Oct 24, 2022

Any update on this?

One thing that I find a bit odd already in the current state is that errors from failed client auth are seemingly marked as non-fatal.

This means that, because the library treats them as resolvable, you basically get stuck once client certs expire and librdkafka tries to reconnect under the hood.

We've run into this: we had an Azure client secret that expired, and because authentication errors are non-fatal, the library just kept going. Then I reproduced it with a bad pair of certs for authentication, and again it just treats them as non-fatal and keeps retrying.

I wish we could mark authentication errors as fatal.


9 participants