Add CFP-22215 mTLS Authentication document with images #5
|
||
#### Other options for SPIFFE identity | ||
|
||
##### Cert-manager CSI driver |
##### Cert-manager CSI driver | |
##### cert-manager CSI driver |
cert-manager is always lower case even when breaking backwards compatibility with the English language :D
|
||
### mTLS, SPIFFE, and SPIRE | ||
|
||
When building an mTLS solution, SPIFFE, the API for having workloads request an attested identity with a clear chain of custody, and SPIRE, the implementation of that API, are the state of the art. The SPIFFE project has spent a lot of time thinking deeply about how you can avoid common identity-sharing pitfalls like spoofing identities, getting identity details (commonly, but not always, X.509 TLS keypairs) to the workload securely, and being able to easily rotate and revoke these identities when required. |
Revocation is still in development:
spiffe/spire#1934 with work being put into the local authority api mostly: spiffe/spire-api-sdk#37
Rotation and revocation already exist in SPIRE (they happen automatically, depending on the authorities' expirations).
That proposal is to allow forcing that to happen on demand.
What does it mean to "automatically" do it? What is the responsibility of the SDK vs. Cilium agent code?
The datapath needs to react to the same revocation event somehow, so either the SDK exposes some event notification mechanism through the control plane to inform it, or we pre-build something like timers in and then each component can independently write their own logic to ensure everything is in sync. (Latter doesn't make sense to me for revocation cases, we might get away with it for rotation)
The SPIRE server will, by default, re-key SVIDs a reasonable period of time before their expiry (I think the default is half their lifetime). The existing Cilium implementation will receive an "update" event including the new key when the SPIRE server is done (via the local SPIRE agent), and @meyskens is already working on handling that event in cilium/cilium#24300, and kicking off a re-auth and associated update of the auth table after that.
The linked proposal for SPIRE is to have a way to force revocation at any time, which would then generate a rekey event, which would end up in the Cilium agent via the SPIRE agent, and be processed in the same way.
Without a forced revocation event in SPIRE, it's really important to ensure that the certificate lifetime is short enough to limit the exposure window in the event of a keypair compromise.
Thanks for the explanation, makes sense. The main reason I raise this is that again, if we assume that non-SPIRE implementations will eventually implement this CFP, it will be useful to know that the design assumes that the auth mechanism provides such "update events", so that is a key integration point that must be implemented.
1. All cilium-agents running a workload mentioned in an auth-enabled policy connect to their local spire-agent, and watch a set of SPIFFE labels that includes the labels for the relevant SVID. The cilium-agent identity is allowed to do Delegated Identity requests to the Workload API (WL API), so this is allowed. | ||
1. The spire-agent watches SVIDs on the SPIRE server, sees any updates to all relevant SVIDs, and passes them back to the cilium-agent. | ||
1. The cilium-agent there figures out where the request is coming from, and connects to the cilium-agent on the source node to perform the mTLS handshake. Because this is a mutual TLS, if it succeeds, then the workloads are authenticated. | ||
1. The mTLS handlers in each cilium-agent then pass the auth success to their local dataplane, by telling it that identity A is authenticated to identity B for some period of time (the lifetime of the certificates in the mTLS handshake). |
For SPIFFE the expire lifetime is short (1h default), do we want to artificially cap some time so we don't end up with a 10 year valid datapath entry in case somebody configured that?
I actually think we should leave this up to the user, and strongly recommend the use of short-lifetime certificates. As long as we have a reasonably short default, and you need to take active action to end up with a 10-year rule, then that's something we'll need to let users do (sadly).
|
||
#### SPIFFE installation steps and flow | ||
|
||
1. Cluster gets created, without CNI, per usual Cilium install |
Is this "install Cilium" or the CLI `cilium install`?
This just means "the usual Cilium install process". I've updated, thanks.
1. mTLS enabled install also includes per-node SPIRE agent. The SPIRE agent talks to the SPIRE server over the network, but all other communication with the Cilium Agent is via domain sockets shared on the host's filesystem. | ||
1. Cilium starts up as normal, acts as CNI. | ||
1. Cilium also contacts the local SPIRE agent at startup (via a domain socket shared on the host's filesystem) to watch the Delegated Identity API and gets its own SPIFFE identity via the SPIRE workload API. | ||
1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium (could be agent or operator) also records the SPIFFE identity. |
could be agent or operator
Maybe we should note the multiple spire server setup that needs a mechanism for Cilium to only generate IDs on the node where the spire server is present?
This is listed below I see now, maybe we need to refer to that here?
Because the SPIFFE ID is deterministic, based only on the Identity number, everyone should be able to generate it, and it shouldn't matter who wins that race. In terms of creating the SPIRE server entry, yeah, that's only the operator. But figuring out the SPIFFE ID string is simpler.
I've had a crack at clarifying.
|
||
This design has a few wins here: | ||
- The Authentication only requires that the TLS Handshake succeed, it does not require that we keep the connection open, so we will be making short lived connections for each authentication check, and this will most likely need to be to a new, dedicated port. | ||
- Because we will be controlling both ends of the TLS handshake, we can use any field we like in the certificates for authentication - SPIFFE certificates issued by SPIRE have the full SPIFFE ID URI as a SAN field, so we can use that for choosing both client and server certificates - the TLS request from the client will be requesting the full SPIFFE ID as its URI. |
Per https://datatracker.ietf.org/doc/html/rfc6066#section-3, the SNI can only be a DNS hostname, not a full URL. However, we can send the Cilium ID as its hostname and control it that way.
Hmm, right. We may need to settle on a hostname standard we can use for these requests, something like `<sourceidnumber><destinationidnumber>.<trustdomain>` or something.
I've updated, thanks!
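To make the hostname idea above concrete, here is a minimal Go sketch. It assumes a simpler convention than the one floated in the thread: a single numeric identity as the leftmost DNS label, followed by the trust domain. The function names and this exact layout are hypothetical and are not taken from the CFP.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sniForIdentity builds a DNS-style server name from a numeric security
// identity and a trust domain. The "<id>.<trustdomain>" layout is an
// assumption made for this sketch only.
func sniForIdentity(identity uint32, trustDomain string) string {
	return fmt.Sprintf("%d.%s", identity, trustDomain)
}

// identityFromSNI reverses sniForIdentity. It rejects names outside the
// expected trust domain and names whose leftmost part is not a number.
func identityFromSNI(sni, trustDomain string) (uint32, error) {
	suffix := "." + trustDomain
	if !strings.HasSuffix(sni, suffix) {
		return 0, fmt.Errorf("SNI %q is not under trust domain %q", sni, trustDomain)
	}
	label := strings.TrimSuffix(sni, suffix)
	id, err := strconv.ParseUint(label, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("SNI %q does not start with a numeric identity: %w", sni, err)
	}
	return uint32(id), nil
}
```

Because both ends of the handshake are controlled by the same software, the server side could use the parsed identity to pick which certificate to present.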
* disabled / spiffe / other | ||
* mesh-auth-spire-server: `____` | ||
* Configure the spire server | ||
* mesh-auth-spire-agent: `unix://var/run/cilium/spiffe/admin/admin.sock` |
Sorry for a late nit: in the PR I named this `mesh-auth-spire-admin-socket`, as the agent itself does not use the admin.sock for workload attestation. So while it is the agent, I renamed it so it is not confused with the usual agent socket.
Yeah, okay, that's fair, I'll update.
One nit I found a few hours ago, but overall an LGTM for me.
Okay, I think I've resolved all the comments now. @mhofstetter, if you could rereview, that would be great. |
* mesh-auth-spire-server: `____` | ||
* Configure the spire server | ||
* mesh-auth-spire-admin-socket: `unix://var/run/cilium/spiffe/admin/admin.sock` | ||
* Configure the SPIRE agent socket | ||
* mesh-auth-spire-server: `unix://var/run/cilium/spiffe/server/server.sock` | ||
* Configure the SPIRE Server socket for the operator |
I think that we would benefit from a `mesh-auth-opt` + `spire-config` setting similar to the `kvstore-opt` / `etcd-config` settings, where all of the values inside are passed as an object to the pluggable auth mechanism. Then each mechanism can have its own parameters/servers/sockets/etc and we don't have to add new flags to the agent for each specific pluggable implementation.
* mesh-auth-spire-server: `____` | |
* Configure the spire server | |
* mesh-auth-spire-admin-socket: `unix://var/run/cilium/spiffe/admin/admin.sock` | |
* Configure the SPIRE agent socket | |
* mesh-auth-spire-server: `unix://var/run/cilium/spiffe/server/server.sock` | |
* Configure the SPIRE Server socket for the operator | |
* `mesh-auth-opt: '{"spiffe.config": "/var/lib/cilium/spiffe.config"}'` | |
* Configure the authentication options | |
* ``` | |
spire-config: |- | |
spire-server: unix://var/run/cilium/spiffe/server/server.sock | |
spire-admin-socket: unix://var/run/cilium/spiffe/admin/admin.sock | |
``` | |
* Configure the SPIRE agent and server sockets |
A couple of further notes on the above:
- I'd imagine that the `spire-config`, if necessary, is primarily a Helm structure which maybe just writes out the `/var/lib/cilium/spiffe.config` config file that could then be read by cilium-agent. This way we don't need to encode the actual auth settings directly into the agent CLI; the agent would just read the generic auth configuration on startup and pass that to the auth module to decide how to instantiate the auth plugin depending on the configuration.
- If these settings are static and users don't need to configure them, then maybe they shouldn't be exposed at all.
- I switched some wording here SPIRE -> SPIFFE. Not sure the degree to which we're relying on SPIRE implementation specifically here vs. SPIFFE.
I didn't know we did the inline-json thing for structs, that probably does make more sense, although inlining json in a string feels gross. 😄
We expect that we'll need to allow folks to bring-their-own SPIRE server in the future, so it makes sense to make this modular and configurable to some extent now.
That said, I anticipate this section will get another update as we work on #23806 - I want this document to be an as-built design once we ship this, not to be a representation of how we thought we might build it before we started.
@meyskens, ping that we'll probably end up updating what you have in the in-progress code to something more like this. Nothing to do for now, but we'll need to update this with whatever we end up doing.
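As a rough illustration of the generic-options idea in this thread, the Go sketch below decodes an inline JSON object such as the suggested `mesh-auth-opt` value into a string map that could be handed to whichever auth plugin is selected. The function name `parseAuthOpts` and the choice of a flat string-to-string map are assumptions for this example, not the agent's actual option handling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseAuthOpts decodes an inline JSON object of auth options into a
// map that a pluggable auth mechanism could interpret on its own.
// Hypothetical helper: the real flag format was still undecided here.
func parseAuthOpts(raw string) (map[string]string, error) {
	opts := map[string]string{}
	if raw == "" {
		// No options given: return an empty, usable map.
		return opts, nil
	}
	if err := json.Unmarshal([]byte(raw), &opts); err != nil {
		return nil, fmt.Errorf("invalid mesh-auth-opt value: %w", err)
	}
	return opts, nil
}
```

With the value from the suggestion, `parseAuthOpts` would yield a map whose `spiffe.config` key points at the config file path, leaving the contents of that file entirely to the SPIFFE plugin.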
|
||
In order to mitigate this issue, the Cilium agent will periodically sync the state of authenticated peers to a longer-lived store, that is, the auth map, and upon startup, reconstitute this state in the agent. At this time if any peers had previously been authenticated but timeouts had occurred, re-authentication of the sessions will be initiated. | ||
|
||
Extended agent downtime could impact the ability to properly terminate session authentication for peers, depending on the datapath implementation. N-tuple match on connection endpoints in new auth map discusses how timestamps could be integrated into the datapath to ensure connection termination when mutual authentication reauthentication periods elapse. If these timestamps are not integrated into the datapath, then we would have to accept that authenticated sessions may remain authenticated indefinitely during agent downtime. |
It's a bit premature to discuss here, but this sort of notice should go into the upgrade guide to help users reason about their upgrade SLAs in order to minimize disruption when they've got auth enabled.
Agreed, I'll make a note.
I'm getting more and more into the weeds with these comments, mostly pretty trivial discussions so that's probably a good sign that we're moving away from key things that the CFP doesn't express and more just into little nits about how to best communicate the design to readers. I may take another look after this, time-boxing the review for now.
1. User creates a cluster, without CNI, per usual Cilium install process. | ||
1. User installs Cilium, with mTLS enabled, SPIFFE auth will not work until the following conditions are met: | ||
* SPIRE Server must be installed somewhere and configured correctly. (Bring-your-own-SPIRE-Server is an anticipated future need, but the initial implementation will use an in-cluster SPIRE server.) | ||
* Per-node SPIRE Agent must be installed in the cluster and configured correctly. The SPIRE agent talks to the SPIRE server over the network, but all other communication with the Cilium Agent is via domain sockets shared on the host's filesystem. |
This is all accurate to my understanding of our previous discussions, though I'll note that I'm assuming that:
- SPIRE Server would be an independent helm deployment
- SPIRE Agent would be integrated into the Cilium helm charts as an additional sidecar/container.
(Just being explicit here since the text wasn't entirely clear on this, though again the text is fine, it's prescribing what needs to be in place, then it's up to the implementation to provide these in some way)
Though all that said, I guess it's also a reasonable option to independently deploy the SPIRE agents on the nodes, that part doesn't necessarily need to be embedded in the Cilium daemonset. I guess we should be having this discussion on the linked issue rather than in the CFP ;)
Haha yeah, I'm hoping to dig into the installation more this week.
* (This document will be updated after [#23806](https://github.com/cilium/cilium/issues/23806) is done.) | ||
1. Cilium agent starts up as normal, acts as CNI. | ||
1. Cilium agent also contacts the local SPIRE agent at startup (via a domain socket shared on the host's filesystem) to watch the Delegated Identity API and gets its own SPIFFE identity via the SPIRE workload API. | ||
1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/cilium-id/1337`). |
nit: double cilium is redundant in the suggested form of the SPIFFE identity. The security identities here are typically referred to in the Cilium community either by the full name, "security identity", or by the shorthand "identity". Occasionally in the datapath it may be "sec-id". The danger here is that "id" can end up colliding between multiple concepts, for instance Cilium Endpoints also have a node-local identifier known as the "endpoint id".
As such I'd suggest perhaps:
1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/cilium-id/1337`). | |
1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/identity/1337`). |
or:
1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/cilium-id/1337`). | |
1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/security-identity/1337`). |
(or a plural form of one of the above if we think that seems more appropriate)
I like the bare "identity" personally, I'll update to that.
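Since the SPIFFE ID is described earlier in the thread as deterministic and based only on the identity number, any component can derive it locally. A minimal Go sketch, assuming the trust domain from the example (`spiffe.cilium.io`) and the bare `identity` path segment agreed on here; the function name is hypothetical:

```go
package main

import "fmt"

// spiffeIDForIdentity derives the SPIFFE ID string for a numeric Cilium
// security identity. The "/identity/<n>" path follows the form settled
// on in this review thread; the final path may differ.
func spiffeIDForIdentity(trustDomain string, identity uint32) string {
	return fmt.Sprintf("spiffe://%s/identity/%d", trustDomain, identity)
}
```

Because the mapping is a pure function of its inputs, it does not matter whether the agent or the operator computes the string first.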
* Trigger on CiliumEndpoint deletion even for CiliumEndpoints on remote nodes, delete authentication state corresponding to that peer. Including userspace authentication table entries, datapath CT/map entries. | ||
|
||
##### Timer-based | ||
* Do we need periodic garbage collection, e.g. of authenticated sessions? Because the auth table has an expiry time, we have built-in garbage collection - the userspace will be responsible for pruning expired auth table entries if the identity has been deleted and the certificate validity period has passed. |
nit: The question is answered by saying that garbage collection is "built-in", but I don't quite understand where that assumption comes from. The current section is "Local Agent Events" -> "Timer-based". When does this "built-in" garbage collection run exactly?
(I assumed there would be a list sorted by expiry time, then a GC thread that just looks at the list, sleeps until the next expiry, wakes up, performs GC, then sleeps again until the next expiry time, maybe with a minimum sleep period or something)
If this expiry is "built-in" by some sort of event, such as waiting on an external system to trigger an expiry timer, then maybe we should rework this "Timer-based" section to say "Upon auth expiry event". The important part there is that we are then relying on the authentication module (SPIFFE libs?) to perform that callback.
I've updated, but the idea here is that because the auth table entry includes an expiry time (after which the data path will drop packets that match it), garbage collection is not as pressing a concern. Currently, we're planning on having userspace remove entries once the identity goes away.
Makes sense yep, I'm just asking "How exactly will userspace remove entries?" - what event causes userspace to evaluate which entries to remove and what do we need to do to track those. I'm sure we'll figure this out over time.
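One way to picture the userspace pruning discussed above is the Go sketch below. It models the auth table as a map from an identity pair to an expiry time in Unix seconds and removes entries that have expired or that refer to an identity that no longer exists. The types, field names, and the "expired or orphaned" rule are assumptions for illustration; the thread leaves open what event actually triggers the pass.

```go
package main

// authKey identifies one authenticated identity pair (hypothetical layout).
type authKey struct {
	localIdentity  uint32
	remoteIdentity uint32
}

// authTable maps an identity pair to its expiry time in Unix seconds.
type authTable map[authKey]uint64

// prune deletes entries whose expiry has passed or whose identities are
// no longer known, and returns how many entries were removed.
// identityExists is supplied by the caller, for example backed by the
// agent's identity cache (an assumption of this sketch).
func (t authTable) prune(now uint64, identityExists func(uint32) bool) int {
	removed := 0
	for key, expiry := range t {
		expired := expiry <= now
		orphaned := !identityExists(key.localIdentity) || !identityExists(key.remoteIdentity)
		if expired || orphaned {
			// Deleting the current key while ranging over a map is safe in Go.
			delete(t, key)
			removed++
		}
	}
	return removed
}
```

A caller could run `prune` from a periodic timer, from an identity deletion handler, or both; the sketch does not depend on which.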
I think this is pretty close if not done, although I should note that I'm planning on coming back and updating as we get further through the design. I'd kinda prefer to merge this as-is, and then iterate to fix any further things (for example, some notes about how big the auth table can get once we're sure we have the mechanics locked down). |
I think that this one is basically done, with further changes coming as we work on it. Any chance of that tick please @joestringer? |
I think there are details yet to be figured out, as there always will be, but the core seems well enough nailed down to accept in the current form.
|
||
Cons: | ||
* We pick up a dependency on SPIRE. We’ll need to be running a SPIRE server in the cluster somewhere, and a SPIRE agent on each node (the agent is required for the attestation process to work properly). |
Some community members are pushing on image sizes, so worth keeping an eye out for how this impacts that aspect. I'm sure we can figure out a solution either way for this, but it'd be nice to know rather than trampling on each others' toes.
tbh I think it's likely that we will run all of SPIRE as a separate set of pods, probably a StatefulSet for the SPIRE server, and another Daemonset for the SPIRE agent, with a host mount of the socket directory shared with the Cilium Agent. That decouples the lifecycle. I'm planning on updating these details once I've had a look at it more under cilium/cilium#23806.
Signed-off-by: Nick Young <nick@isovalent.com>
|
||
#### Connections | ||
|
||
##### Between Cilium Operator instances and the SPIFFE server |
@youngnick not sure what your plans are regarding this CFP, but I noted that the diagrams in the CFP don't include the operator in the architecture. If you plan to update the CFP with the final design, it would be helpful for readers to see this interaction in one of the diagrams.
I have been hoping to come back and do a final update pass, yes. Thanks for the tip.
This converts the mTLS Authentication CFP from Google Doc to Markdown for storage here, updating the design along the way.
The document has had a lot of edits to reflect a couple of key changes: