Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface a metric to enable users to alert on when a client's cert age is close to expiring #100699

Closed
kevin-v-ngo opened this issue Apr 5, 2023 · 10 comments · Fixed by #103592
Closed
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@kevin-v-ngo
Copy link

kevin-v-ngo commented Apr 5, 2023

This issue tracks adding a metric that allows users to alert when client's cert age is close to expiring.

Per Sean-

Hypothetically, a server-side VerifyConnection function such as https://pkg.go.dev/crypto/tls#example-Config-VerifyConnection will allow us to poke at the peer's ConnectionState, which has the NotAfter attribute, and we can use time.Now().Sub(crt.NotAfter) to emit an aggregate time.Duration that can be polled via a prom endpoint that customers can alert on now that this information has been exposed._

Related issue: #99422

Jira issue: CRDB-26585

@kevin-v-ngo kevin-v-ngo added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 5, 2023
@kevin-v-ngo kevin-v-ngo added this to Triage in Cluster Observability via automation Apr 5, 2023
@kevin-v-ngo kevin-v-ngo added this to Triage in SQL Sessions - Deprecated via automation Apr 5, 2023
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Apr 5, 2023
@kevin-v-ngo kevin-v-ngo removed this from Triage in SQL Sessions - Deprecated Apr 5, 2023
@kevin-v-ngo
Copy link
Author

@abhinavg6, does identity have a project board on github?

@abhinavg6
Copy link
Contributor

I don't think so Kevin.

@zaneteh-crl - This is a CRDB specific certs related issue which is being escalated by a key SH customer. And based on internal discussions between @duoclikebook and others, Identity team maybe on the hook for such things. So Kevin is asking if he could assign it to a Identity team board on Github.

@zaneteh-crl
Copy link
Contributor

I'm not aware of a project board on github for Identity. We can track things through the JIRA ticket.

@maryliag
Copy link
Contributor

maryliag commented Apr 6, 2023

Some pointers for the Identity team:

There are already some metrics created for some types of certificates here:
https://github.com/xinhaoz/cockroach/blob/77d9d01aff6e42e58c9ea9b47e4be81a66a852a7/pkg/security/certificate_manager.go#L29

but there are no metrics for this one, which I assume if the ask for this issue:

clientCerts map[username.SQLUsername]*CertInfo

You who is working on this might start on that.

@rafiss rafiss added this to Triage in SQL Sessions - Deprecated via automation Apr 7, 2023
@cameronnunez
Copy link
Contributor

cameronnunez commented Apr 14, 2023

The pointers look right to me, I imagine it should look something like:

metaUserClientExpiration = metric.Metadata{
    Name:        "security.certificate.expiration.user-client",
    Measurement: "Certificate Expiration",
    ...
}

Looks like the mapping clientCerts has the necessary info. We'll need to ensure this metric gets exposed (maybe this is done through metric.PrometheusExporter?) and is on the list of exported metrics for self-hosted. Observability Infra might be able to provide more insight on this part @abarganier

@cameronnunez
Copy link
Contributor

The tricky part is we'll need to partition the metric by SQL username it seems.

The amount of dev work seems light, so I'd estimate it is ~0.5–1 week

@knz
Copy link
Contributor

knz commented Apr 19, 2023

Folk you are forgetting something. The client cert file(s) are stored on the client, not the server. In most installs, the server simply won't see them at all.

So we can't get a reliable metric server-side.

The best we can do is each time a client connects and presents a client cert look at the expiration and "do something" with it.

Then we have another bit of complexity. For any given username, different clients can use different certs. What if different certs have different expiry? What would we want to report in that case?

I'd be wary to consider an implementation here until we properly identify what problem needs to be solved and what would be the UX of a solution.

@knz
Copy link
Contributor

knz commented Apr 19, 2023

In any case that sounds like at least 1-2 weeks of analysis and discussion with PM and the customer, plus some extra time for implementation, the complexity of which will depend on the desired semantics.

@bdarnell
Copy link
Contributor

Any monitoring we do on the server side will be flawed, but it still seems worth making an effort here if we can get something "good enough" that is substantially easier than setting up monitoring on the client side (where I'm not aware of common tools or practices that we can easily advocate)

Here are a few straw proposals to consider:

  1. The server keeps metadata about all client certificates used in the last 24h and exports a metric for the minimum time-to-expiration in that set. This is a conservative approach to find certificates that are near expiration as reliably as possible, but it may have false positives since we will remember the old certificate until the 24h window expires (because even when we see a newer certificate for the same username, we cannot be certain that there are not other client processes that are still holding on to the old one)
  2. The server keeps a map from username to maximum expiration time of any certificate used by that user. This strategy avoids false positives by assuming that the application's deployment discipline is reliable enough to replace all certificates for a user at once, but if a stray client process were missed it would not be detected.
  3. The server keeps a map from username to the expiration of the most recent certificate used for that user. This is an ugly method that produces a metric that jumps around a lot, but can probabilistically detect one client out of the pool that has an out-of-date cert.
  4. Use logging instead of metrics: when a client authenticates with a certificate that is close to expiration, log a message on a suitable channel. "Close to expiration" probably needs to be configurable given the wide range of practices seen in the wild, although maybe we could come up with some heuristics (I believe both the creation and expiration timestamp of a certificate are known so we could perhaps do something based on a fraction of the certs lifespan)

Another thing worth considering is to encourage some sort of "client instance ID" in a certificate so that we could tell reliably when one certificate completely replaces another instead of having to make guesses about whether or not other copies of the certificate may still be in use.

@knz
Copy link
Contributor

knz commented Apr 24, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

7 participants