Add CFP-22215 mTLS Authentication document with images #5

youngnick · 2023-03-03T03:09:07Z

This converts the mTLS Authentication CFP from Google Doc to Markdown for storage here, updating the design along the way.

The document has had a lot of edits to reflect a couple of key changes:

* The mTLS solution will now use a BPF authentication map, not the connection table method it did originally.
* SPIFFE and SPIRE will now be the first authentication type supported (although we will build support for more types).
* SPIRE server will by default be installed into the cluster for you when you enable the feature.

meyskens · 2023-03-03T09:04:46Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+#### Other options for SPIFFE identity
+
+##### Cert-manager CSI driver


Suggested change

-##### Cert-manager CSI driver
+##### cert-manager CSI driver

cert-manager is always lower case even when breaking backwards compatibility with the English language :D

meyskens · 2023-03-03T09:08:18Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+### mTLS, SPIFFE, and SPIRE
+
+When building an mTLS solution, SPIFFE, the API for having workloads request an attested identity with a clear chain of custody, and SPIRE, the implementation of that API, are the state of the art. The SPIFFE project has spent a lot of time thinking deeply about how you can avoid common identity-sharing pitfalls like spoofing identities, getting identity details (commonly, but not always, X.509 TLS keypairs) to the workload securely, and being able to easily rotate and revoke these identities when required.


Revocation is still in development: spiffe/spire#1934, with most of the work going into the local authority API: spiffe/spire-api-sdk#37

Rotation and revocation already exist in SPIRE (they happen automatically, depending on the authorities' expirations).
That proposal is to allow forcing them to happen on demand.

What does it mean to "automatically" do it? What is the responsibility of the SDK vs. Cilium agent code?

The datapath needs to react to the same revocation event somehow, so either the SDK exposes some event notification mechanism through the control plane to inform it, or we build in something like timers and then each component can independently write its own logic to ensure everything is in sync. (The latter doesn't make sense to me for revocation cases; we might get away with it for rotation.)

The SPIRE server will, by default, re-key SVIDs a reasonable period of time before their expiry (I think the default is half their lifetime). The existing Cilium implementation will receive an "update" event including the new key when the SPIRE server is done (via the local SPIRE agent), and @meyskens is already working on handling that event in cilium/cilium#24300, and kicking off a re-auth and associated update of the auth table after that.

The linked proposal for SPIRE is to have a way to force revocation at any time, which would then generate a rekey event, which would end up in the Cilium agent via the SPIRE agent, and be processed in the same way.

Without a forced revocation event in SPIRE, it's really important to ensure that the certificate lifetime is short enough to limit the exposure window in the event of a keypair compromise.

Thanks for the explanation, makes sense. The main reason I raise this is that again, if we assume that non-SPIRE implementations will eventually implement this CFP, it will be useful to know that the design assumes that the auth mechanism provides such "update events", so that is a key integration point that must be implemented.
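To make the "update event" integration point concrete, here is a minimal sketch, assuming the auth table can be modelled as a mapping from identity pairs to expiry timestamps. The names `on_svid_update` and `reauth` are hypothetical and do not come from the Cilium or SPIRE code; `reauth` stands in for whatever performs the mTLS handshake and returns the new expiry.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model of the auth table:
# (local identity, remote identity) -> expiry as a Unix timestamp.
AuthTable = Dict[Tuple[int, int], float]


def on_svid_update(
    table: AuthTable,
    identity: int,
    reauth: Callable[[int, int], float],
) -> List[Tuple[int, int]]:
    """Handle a re-key event for `identity` from the auth mechanism.

    Every pair involving the identity is re-authenticated, and the table
    entry is updated with the expiry returned by the new handshake.
    """
    affected = [pair for pair in table if identity in pair]
    for local, remote in affected:
        table[(local, remote)] = reauth(local, remote)
    return affected
```

Any non-SPIRE mechanism would need to deliver an equivalent event so that a handler like this can run.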

meyskens · 2023-03-03T09:21:36Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+1. All cilium-agents running a workload mentioned in an auth-enabled policy connect to their local spire-agent, and watch a set of SPIFFE labels that includes the labels for the relevant SVID. The cilium-agent identity is allowed to do Delegated Identity requests to the Workload API (WL API), so this is allowed.
+1. The spire-agent watches SVIDs on the SPIRE server, sees any updates to all relevant SVIDs, and passes them back to the cilium-agent.
+1. The cilium-agent there figures out where the request is coming from, and connects to the cilium-agent on the source node to perform the mTLS handshake. Because this is a mutual TLS, if it succeeds, then the workloads are authenticated.
+1. The mTLS handlers in each cilium-agent then pass the auth success to their local dataplane, by telling it that identity A is authenticated to identity B for some period of time (the lifetime of the certificates in the mTLS handshake).


For SPIFFE, the expiry lifetime is short (1h default). Do we want to artificially cap the time so we don't end up with a 10-year valid datapath entry in case somebody configured that?

I actually think we should leave this up to the user, and strongly recommend the use of short-lifetime certificates. As long as we have a reasonably short default, and you need to take active action to end up with a 10-year rule, then that's something we'll need to let users do (sadly).
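For illustration only, the two options discussed here can be sketched as a single function, assuming expiry is tracked as a Unix timestamp. `auth_entry_expiry` and its `max_lifetime` parameter are hypothetical; with `max_lifetime=None` the entry simply follows the certificate lifetime, which is the behaviour preferred in the reply above.

```python
from typing import Optional


def auth_entry_expiry(
    cert_not_after: float,
    now: float,
    max_lifetime: Optional[float] = None,
) -> float:
    """Return the expiry timestamp to store in the auth table entry.

    Without a cap, the entry lives exactly as long as the certificate.
    With a cap, it lives for at most `max_lifetime` seconds from `now`.
    """
    if max_lifetime is None:
        return cert_not_after
    return min(cert_not_after, now + max_lifetime)


def is_authenticated(expiry: float, now: float) -> bool:
    """An entry only counts as authenticated until its expiry."""
    return now < expiry
```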

meyskens · 2023-03-03T09:30:49Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+#### SPIFFE installation steps and flow
+
+1. Cluster gets created, without CNI, per usual Cilium install


Is this "install Cilium" or the CLI `cilium install`?

This just means "the usual Cilium install process". I've updated, thanks.

meyskens · 2023-03-03T09:35:15Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+1. mTLS enabled install also includes per-node SPIRE agent. The SPIRE agent talks to the SPIRE server over the network, but all other communication with the Cilium Agent is via domain sockets shared on the host's filesystem.
+1. Cilium starts up as normal, acts as CNI.
+1. Cilium also contacts the local SPIRE agent at startup (via a domain socket shared on the host's filesystem) to watch the Delegated Identity API and gets its own SPIFFE identity via the SPIRE workload API.
+1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium (could be agent or operator) also records the SPIFFE identity.


> could be agent or operator

Maybe we should note the multiple spire server setup that needs a mechanism for Cilium to only generate IDs on the node where the spire server is present?

This is listed below I see now, maybe we need to refer to that here?

Because the SPIFFE ID is deterministic, based only on the Identity number, everyone should be able to generate it, and it shouldn't matter who wins that race. In terms of creating the SPIRE server entry, yeah, that's only the operator. But figuring out the SPIFFE ID string is simpler.

I've had a crack at clarifying.

meyskens · 2023-03-03T09:56:00Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+This design has a few wins here:
+- The Authentication only requires that the TLS Handshake succeed, it does not require that we keep the connection open, so we will be making short lived connections for each authentication check, and this will most likely need to be to a new, dedicated port.
+- Because we will be controlling both ends of the TLS handshake, we can use any field we like in the certificates for authentication - SPIFFE certificates issued by SPIRE have the full SPIFFE ID URI as a SAN field, so we can use that for choosing both client and server certificates - the TLS request from the client will be requesting the full SPIFFE ID as its URI.


Per https://datatracker.ietf.org/doc/html/rfc6066#section-3, the SNI can only be a DNS hostname, not a full URL. However, we can send the Cilium ID as its hostname and control it that way.

Hmm, right. Maybe we need to settle on a hostname standard we can use for these requests, something like `<sourceidnumber><destinationidnumber>.<trustdomain>`.

I've updated, thanks!
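A rough sketch of what such a hostname convention could look like, assuming both identities are numeric and the trust domain is itself a valid DNS name. The `-` separator between the two identity numbers is an assumption added here so the two numbers can be told apart; the thread does not settle on a separator, and `sni_hostname` is a hypothetical helper.

```python
import re

# One DNS label: 1 to 63 characters, letters/digits/hyphens,
# not starting or ending with a hyphen.
_DNS_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def sni_hostname(source_id: int, destination_id: int, trust_domain: str) -> str:
    """Build an SNI-safe hostname from two security identities.

    The '-' separator is an illustrative assumption, not part of the CFP.
    """
    host = f"{source_id}-{destination_id}.{trust_domain}".lower()
    labels = host.split(".")
    if len(host) > 253 or not all(_DNS_LABEL.fullmatch(label) for label in labels):
        raise ValueError(f"not a valid DNS hostname: {host!r}")
    return host
```

A full SPIFFE ID URI such as `spiffe://spiffe.cilium.io/identity/1337` would be rejected by this check, which is the constraint raised in the comment above.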

meyskens · 2023-03-07T11:35:22Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+  * disabled / spiffe / other
+* mesh-auth-spire-server: `____`
+  * Configure the spire server
+* mesh-auth-spire-agent: `unix://var/run/cilium/spiffe/admin/admin.sock`


Sorry for the late nit: in the PR I named this `mesh-auth-spire-admin-socket`, as the agent itself does not use admin.sock for workload attestation. So while it is the agent, I renamed it so it is not confused with the usual agent socket.

Yeah, okay, that's fair, I'll update.

meyskens

One nit I found a few hours ago, but overall it's an LGTM from me.

youngnick · 2023-03-15T03:47:49Z

Okay, I think I've resolved all the comments now. @mhofstetter, if you could rereview, that would be great.

joestringer · 2023-03-20T20:55:16Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+* mesh-auth-spire-server: `____`
+  * Configure the spire server
+* mesh-auth-spire-admin-socket: `unix://var/run/cilium/spiffe/admin/admin.sock`
+  * Configure the SPIRE agent socket
+* mesh-auth-spire-server: `unix://var/run/cilium/spiffe/server/server.sock`
+  * Configure the SPIRE Server socket for the operator


I think that we would benefit from a mesh-auth-opt + spire-config setting similar to the kvstore-opt / etcd-config settings, where all of the values inside are passed as an object to the pluggable auth mechanism. Then each mechanism can have its own parameters/servers/sockets/etc and we don't have to add new flags to the agent for each specific pluggable implementation.

Suggested change

-* mesh-auth-spire-server: `____`
-  * Configure the spire server
-* mesh-auth-spire-admin-socket: `unix://var/run/cilium/spiffe/admin/admin.sock`
-  * Configure the SPIRE agent socket
-* mesh-auth-spire-server: `unix://var/run/cilium/spiffe/server/server.sock`
-  * Configure the SPIRE Server socket for the operator
+* `mesh-auth-opt: '{"spiffe.config": "/var/lib/cilium/spiffe.config"}'`
+  * Configure the authentication options
+* ```
+  spire-config: |-
+    spire-server: unix://var/run/cilium/spiffe/server/server.sock
+    spire-admin-socket: unix://var/run/cilium/spiffe/admin/admin.sock
+  ```
+  * Configure the SPIRE agent and server sockets

A couple of further notes on the above:

* I'd imagine that the spire-config, if necessary, is primarily a Helm structure which maybe just writes out the /var/lib/cilium/spiffe.config config file that could then be read by cilium-agent. This way we don't need to encode the actual auth settings directly into the agent CLI; the agent would just read the generic auth configuration on startup and pass that to the auth module to decide how to instantiate the auth plugin depending on the configuration.
* If these settings are static and users don't need to configure them, then maybe they shouldn't be exposed at all.
* I switched some wording here SPIRE -> SPIFFE. Not sure the degree to which we're relying on SPIRE implementation specifically here vs. SPIFFE.

I didn't know we did the inline-json thing for structs, that probably does make more sense, although inlining json in a string feels gross. 😄

We expect that we'll need to allow folks to bring-their-own SPIRE server in the future, so it makes sense to make this modular and configurable to some extent now.

That said, I anticipate this section will get another update as we work on #23806 - I want this document to be an as-built design once we ship this, not to be a representation of how we thought we might build it before we started.

@meyskens, ping that we'll probably end up updating what you have in the in-progress code to something more like this. Nothing to do for now, but we'll need to update this with whatever we end up doing.
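As a sketch of the idea only: the agent could parse the generic option as JSON and hand the referenced file to the auth plugin without knowing its contents. This assumes the config file holds simple `key: value` lines as in the suggestion above; the helper names `parse_mesh_auth_opt` and `parse_spire_config` are hypothetical and the final format was still open at this point in the thread.

```python
import json
from typing import Dict


def parse_mesh_auth_opt(raw: str) -> Dict[str, str]:
    """Parse the inline-JSON value of the proposed mesh-auth-opt setting."""
    opts = json.loads(raw)
    if not isinstance(opts, dict):
        raise ValueError("mesh-auth-opt must be a JSON object")
    return opts


def parse_spire_config(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines, skipping blank lines and '#' comments.

    The file format is an assumption made for this sketch.
    """
    config: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"malformed config line: {line!r}")
        config[key] = value
    return config
```

The agent itself would only ever see the opaque dictionary, so adding another pluggable mechanism would not require new agent flags.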

joestringer · 2023-03-20T22:32:13Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+In order to mitigate this issue, the Cilium agent will periodically sync the state of authenticated peers to a longer-lived store, that is, the auth map, and upon startup, reconstitute this state in the agent. At this time if any peers had previously been authenticated but timeouts had occurred, re-authentication of the sessions will be initiated.
+
+Extended agent downtime could impact the ability to properly terminate session authentication for peers, depending on the datapath implementation. N-tuple match on connection endpoints in new auth map discusses how timestamps could be integrated into the datapath to ensure connection termination when mutual authentication reauthentication periods elapse. If these timestamps are not integrated into the datapath, then we would have to accept that authenticated sessions may remain authenticated indefinitely during agent downtime.


It's a bit premature to discuss here, but this sort of notice should go into the upgrade guide to help users reason about their upgrade SLAs in order to minimize disruption when they've got auth enabled.

Agreed, I'll make a note.
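For readers reasoning about restart behaviour, the restore step described in the quoted text can be sketched roughly as follows, assuming the persisted state is a mapping from identity pairs to expiry timestamps. `restore_auth_state` is a hypothetical name, not a function from the Cilium codebase.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def restore_auth_state(
    persisted: Dict[Pair, float], now: float
) -> Tuple[Dict[Pair, float], List[Pair]]:
    """Rebuild agent state from the longer-lived store after a restart.

    Entries that are still valid are kept as-is; entries whose timeout
    passed while the agent was down are returned for re-authentication.
    """
    live: Dict[Pair, float] = {}
    needs_reauth: List[Pair] = []
    for pair, expiry in persisted.items():
        if expiry > now:
            live[pair] = expiry
        else:
            needs_reauth.append(pair)
    return live, needs_reauth
```

The longer the agent is down, the more entries land in the second list, which is why the downtime window matters for upgrade planning.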

joestringer

I'm getting more and more into the weeds with these comments, mostly pretty trivial discussions so that's probably a good sign that we're moving away from key things that the CFP doesn't express and more just into little nits about how to best communicate the design to readers. I may take another look after this, time-boxing the review for now.

joestringer · 2023-03-24T22:01:04Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+1. User creates a cluster, without CNI, per usual Cilium install process. 
+1. User installs Cilium, with mTLS enabled, SPIFFE auth will not work until the following conditions are met:
+    * SPIRE Server must be installed somewhere and configured correctly. (Bring-your-own-SPIRE-Server is an anticipated future need, but the initial implementation will use an in-cluster SPIRE server.)
+    * Per-node SPIRE Agent must be installed in the cluster and configured correctly. The SPIRE agent talks to the SPIRE server over the network, but all other communication with the Cilium Agent is via domain sockets shared on the host's filesystem.


This is all accurate to my understanding of our previous discussions, though I'll note that I'm assuming that:

* SPIRE Server would be an independent Helm deployment.
* SPIRE Agent would be integrated into the Cilium Helm charts as an additional sidecar/container.

(Just being explicit here since the text wasn't entirely clear on this, though again the text is fine, it's prescribing what needs to be in place, then it's up to the implementation to provide these in some way)

Though all that said, I guess it's also a reasonable option to independently deploy the SPIRE agents on the nodes, that part doesn't necessarily need to be embedded in the Cilium daemonset. I guess we should be having this discussion on the linked issue rather than in the CFP ;)

Haha yeah, I'm hoping to dig into the installation more this week.

joestringer · 2023-03-24T22:06:20Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+    * (This document will be updated after [#23806](https://github.com/cilium/cilium/issues/23806) is done.)
+1. Cilium agent starts up as normal, acts as CNI.
+1. Cilium agent also contacts the local SPIRE agent at startup (via a domain socket shared on the host's filesystem) to watch the Delegated Identity API and gets its own SPIFFE identity via the SPIRE workload API.
+1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/cilium-id/1337`).


nit: double cilium is redundant in the suggested form of the SPIFFE identity. The security identities here are typically referred to in the Cilium community either by the full name, "security identity", or by the shorthand "identity". Occasionally in the datapath it may be "sec-id". The danger here is that "id" can end up colliding between multiple concepts; for instance, Cilium Endpoints also have a node-local identifier known as the "endpoint id".

As such I'd suggest perhaps:

Suggested change

-1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/cilium-id/1337`).
+1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/identity/1337`).

or:

Suggested change

-1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/cilium-id/1337`).
+1. When generating Cilium Security Identities for identities with mTLS auth enabled, Cilium Operator in SPIFFE mode also records the SPIFFE identity (that is, the string `spiffe://spiffe.cilium.io/security-identity/1337`).

(or a plural form of one of the above if we think that seems more appropriate)

I like the bare "identity" personally, I'll update to that.
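Since the SPIFFE ID is derived only from the identity number, any component can compute it. A minimal sketch, using the trust domain and the `identity` path segment from the example string in this thread; the function name `spiffe_id_for_identity` is hypothetical.

```python
def spiffe_id_for_identity(identity: int, trust_domain: str = "spiffe.cilium.io") -> str:
    """Derive the SPIFFE ID for a Cilium security identity.

    The mapping is deterministic, so the agent and the operator would
    compute the same string independently.
    """
    if identity < 0:
        raise ValueError("security identity must be non-negative")
    return f"spiffe://{trust_domain}/identity/{identity}"
```

For the identity used in the example above, this yields `spiffe://spiffe.cilium.io/identity/1337`.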

joestringer · 2023-03-24T22:11:35Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+  * Trigger on CiliumEndpoint deletion even for CiliumEndpoints on remote nodes, delete authentication state corresponding to that peer. Including userspace authentication table entries, datapath CT/map entries.
+
+##### Timer-based
+* Do we need periodic garbage collection, e.g. of authenticated sessions? Because the auth table has an expiry time, we have built-in garbage collection - the userspace will be responsible for pruning expired auth table entries if the identity has been deleted and the certificate validity period has passed.


nit: The question is answered by saying that garbage collection is "built-in", but I don't quite understand where that assumption comes from. The current section is "Local Agent Events" -> "Timer-based". When exactly does this "built-in" garbage collection run?

(I assumed there would be a list sorted by expiry time, then a GC thread that just looks at the list, sleeps until the next expiry, wakes up, performs GC, then sleeps again until the next expiry time, maybe with a minimum sleep period or something)

If this expiry is "built-in" by some sort of event, such as waiting on an external system to trigger an expiry timer, then maybe we should rework this "Timer-based" section to say "Upon auth expiry event". The important part there is that we are then relying on the authentication module (SPIFFE libs?) to perform that callback.

I've updated, but the idea here is that because the auth table entry includes an expiry time (after which the data path will drop packets that match it), garbage collection is not as pressing a concern. Currently, we're planning on having userspace remove entries once the identity goes away.

Makes sense yep, I'm just asking "How exactly will userspace remove entries?" - what event causes userspace to evaluate which entries to remove and what do we need to do to track those. I'm sure we'll figure this out over time.
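One possible shape for the expiry-ordered approach floated earlier in this thread (a structure sorted by expiry that a GC loop consults before sleeping) is sketched below. This is an illustration of that idea, not the design that was agreed; the class `AuthGC` and its methods are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]


class AuthGC:
    """Userspace auth table with an expiry-ordered heap for pruning."""

    def __init__(self) -> None:
        self._table: Dict[Pair, float] = {}
        self._heap: List[Tuple[float, Pair]] = []

    def upsert(self, pair: Pair, expiry: float) -> None:
        """Record or refresh an authenticated pair."""
        self._table[pair] = expiry
        heapq.heappush(self._heap, (expiry, pair))

    def collect(self, now: float) -> List[Pair]:
        """Remove and return every pair whose expiry has passed."""
        removed: List[Pair] = []
        while self._heap and self._heap[0][0] <= now:
            expiry, pair = heapq.heappop(self._heap)
            # Skip heap items superseded by a later upsert of the same pair.
            if self._table.get(pair) == expiry:
                del self._table[pair]
                removed.append(pair)
        return removed

    def next_wakeup(self) -> Optional[float]:
        """Earliest time the GC loop needs to run again, if any.

        This may be earlier than strictly necessary when the heap still
        holds a superseded item; the loop then simply collects nothing.
        """
        return self._heap[0][0] if self._heap else None
```

A GC loop would call `collect`, then sleep until `next_wakeup`, possibly with a minimum sleep period.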

youngnick · 2023-03-27T06:39:14Z

I think this is pretty close if not done, although I should note that I'm planning on coming back and updating as we get further through the design. I'd kinda prefer to merge this as-is, and then iterate to fix any further things (for example, some notes about how big the auth table can get once we're sure we have the mechanics locked down).

youngnick · 2023-03-28T05:03:33Z

I think that this one is basically done, with further changes coming as we work on it. Any chance of that tick please @joestringer?

joestringer

I think there are details yet to be figured out, as there always will be, but the core seems well enough pinned down to accept in the current form.

joestringer · 2023-03-28T19:00:53Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+### mTLS, SPIFFE, and SPIRE
+
+When building an mTLS solution, SPIFFE, the API for having workloads request an attested identity with a clear chain of custody, and SPIRE, the implementation of that API, are the state of the art. The SPIFFE project has spent a lot of time thinking deeply about how you can avoid common identity-sharing pitfalls like spoofing identities, getting identity details (commonly, but not always, X.509 TLS keypairs) to the workload securely, and being able to easily rotate and revoke these identities when required.


What does it mean to "automatically" do it? What is the responsibility of the SDK vs. Cilium agent code?

The datapath needs to react to the same revocation event somehow, so either the SDK exposes some event notification mechanism through the control plane to inform it, or we build in something like timers and then each component can independently write its own logic to ensure everything is in sync. (The latter doesn't make sense to me for revocation cases; we might get away with it for rotation.)

joestringer · 2023-03-28T19:01:48Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+Cons:
+* We pick up a dependency on SPIRE. We’ll need to be running a SPIRE server in the cluster somewhere, and a SPIRE agent on each node (the agent is required for the attestation process to work properly).


Some community members are pushing on image sizes, so it's worth keeping an eye out for how this impacts that aspect. I'm sure we can figure out a solution either way for this, but it'd be nice to know rather than trampling on each other's toes.

cf. cilium/cilium#24371

tbh I think it's likely that we will run all of SPIRE as a separate set of pods, probably a StatefulSet for the SPIRE server, and another Daemonset for the SPIRE agent, with a host mount of the socket directory shared with the Cilium Agent. That decouples the lifecycle. I'm planning on updating these details once I've had a look at it more under cilium/cilium#23806.

Signed-off-by: Nick Young <nick@isovalent.com>

joestringer · 2023-05-22T16:45:35Z

cilium/CFP-22215-mutual-auth-for-service-mesh.md

+
+#### Connections
+
+##### Between Cilium Operator instances and the SPIFFE server


@youngnick not sure what your plans are regarding this CFP, but I noted that the diagrams in the CFP don't include the operator in the architecture. If you plan to update the CFP with the final design, it would be helpful for readers to see this interaction in one of the diagrams.

I have been hoping to come back and do a final update pass, yes. Thanks for the tip.

mhofstetter reviewed Mar 3, 2023


meyskens reviewed Mar 3, 2023


youngnick force-pushed the mtls-cfp branch from c30bb3f to 10c26fc Compare March 6, 2023 09:54

meyskens reviewed Mar 6, 2023

meyskens reviewed Mar 7, 2023

meyskens approved these changes Mar 7, 2023

mhofstetter reviewed Mar 8, 2023

youngnick force-pushed the mtls-cfp branch from f04d0e6 to f0c0858 Compare March 15, 2023 03:47

mhofstetter approved these changes Mar 15, 2023


joestringer mentioned this pull request Mar 20, 2023

CFP: Mutual Authentication for Service Mesh cilium/cilium#22215

Closed

35 tasks

joestringer reviewed Mar 20, 2023


youngnick mentioned this pull request Mar 24, 2023

Ensure mutual auth and Policy interact in safe ways cilium/cilium#24552

Closed

joestringer reviewed Mar 24, 2023

joestringer approved these changes Mar 28, 2023

youngnick mentioned this pull request Mar 29, 2023

Check scaling sizes for auth map design cilium/cilium#24617

Closed

youngnick force-pushed the mtls-cfp branch from 08b1eca to 3e6128d Compare March 29, 2023 03:47

Add CFP-22215 mTLS Authentication with images

9e59dab

Signed-off-by: Nick Young <nick@isovalent.com>

youngnick force-pushed the mtls-cfp branch from 3e6128d to 9e59dab Compare March 29, 2023 03:49

meyskens reviewed Mar 29, 2023

howardjohn reviewed Mar 29, 2023

Fix next round of PR comments

5f6bc58

Signed-off-by: Nick Young <nick@isovalent.com>

xmulligan added the ready-to-merge label Mar 30, 2023

xmulligan merged commit cad07c3 into cilium:main Mar 30, 2023

joestringer reviewed May 22, 2023

