endpoint: don't hold the endpoint lock while generating policy #26242
Conversation
/test
Nice work! A couple of comments.
```go
// TODO: GH-7515: This should be triggered closer to policy change
// handlers, but for now let's just update it here.
if err := repo.GetPolicyCache().UpdatePolicy(e.SecurityIdentity); err != nil {
```
This todo is not actually resolved. The todo is basically saying that policy computation can be done outside of the endpoint completely. While this overall PR gets us closer in that direction, we are technically still in the context of the endpoint.
Going forward, we can resolve #7515 when we can calculate policy for endpoints (outside of the context of a single endpoint) when
- policies change
- identities change
- endpoints change
and "change" here means added or removed.
Concretely, what this could look like is some `func CalculatePolicy(identity *identity.Identity)` which we call on each of the above listed occurrences. Then the only logic necessary in the endpoint should just be a lookup like

```go
selectorPolicy := repo.PolicyCache().Lookup(e.SecurityIdentity)
...
if forcePolicyCompute {
    repo.PolicyCache().UpdatePolicy(e.SecurityIdentity)
}
```
I toyed with this in squeed/cilium@policy-allocate-when-used-v2...christarazi:pr/christarazi/separate-policy-calc-ep-regen, but it obviously needs more work. Just wanted to share more context.
aha, I saw that the issue was closed and didn't really go much further than that. If we wanna do it, we shouldn't'a closed the issue :-).
Not to get too distracted from the current issue, but I'm not sure that just extracting policy computation is that interesting. More useful would be the ability to fully push policy to bpf & Envoy outside of a regeneration. I don't know how interleaved they are, but it seems like it could be doable.
> Not to get too distracted from the current issue, but I'm not sure that just extracting policy computation is that interesting. More useful would be the ability to fully push policy to bpf & Envoy outside of a regeneration. I don't know how interleaved they are, but it seems like it could be doable.
Agreed, and I think we need to extract/decouple policy calculation from endpoint regeneration in order to be able to push policy to BPF & Envoy outside of a regeneration, and to simplify the model. Today, the logic is a bit interleaved and harder to reason about.
A regeneration is considered a full policy calculation, while `ApplyPolicyMapChanges` is incremental. If we separate policy calculation completely from the endpoint, then that should in theory allow us to always perform incremental policy updates, which should improve performance, reduce overhead, and reduce potential interleaving of critical sections. It will also put the code into a structure where it's easier to introduce `pkg/stream` infrastructure.
The way I see it, we can have a model where `CalculatePolicy(identity)` returns a policy revision and the desired computed policy, and hands that to each endpoint that has `identity`. Each endpoint with that `identity` will then take its realized, current policy, compute the diff against the desired policy from `CalculatePolicy`, and implement the diff. The endpoints would then be considered to be implementing the returned policy revision.
In contrast, today each endpoint attempts to compute the policy diff: `N` endpoints -> compute policies `N` times (there's some caching, of course). We sort of "invert" this relationship with `CalculatePolicy` by doing: policy computed once -> `N` endpoints updated.
This makes the model much simpler to think about and should enable us to always have incremental policy updates.
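As a rough sketch of that inversion (the `CalculatePolicy`, `Endpoint`, and `DesiredPolicy` shapes here are hypothetical illustrations, not Cilium's actual API): policy is computed once per identity, and each endpoint only diffs its realized state against the shared result.

```go
package main

import "fmt"

// Hypothetical types illustrating the "compute once, fan out to N
// endpoints" model described above; not Cilium's real API.
type Identity struct{ ID uint32 }

type DesiredPolicy struct {
	Revision uint64
	Allowed  map[string]bool // port/proto key -> allowed
}

type Endpoint struct {
	Name     string
	Realized map[string]bool
	Revision uint64
}

// CalculatePolicy computes the desired policy for an identity exactly once.
func CalculatePolicy(id Identity, rev uint64) DesiredPolicy {
	return DesiredPolicy{
		Revision: rev,
		Allowed:  map[string]bool{"80/TCP": true},
	}
}

// Apply diffs the endpoint's realized policy against the desired one,
// implements only the difference, then records the revision.
func (e *Endpoint) Apply(d DesiredPolicy) {
	for k, v := range d.Allowed {
		if e.Realized[k] != v {
			e.Realized[k] = v // plumb the delta (datapath update elided)
		}
	}
	for k := range e.Realized {
		if !d.Allowed[k] {
			delete(e.Realized, k)
		}
	}
	e.Revision = d.Revision
}

func main() {
	id := Identity{ID: 1001}
	eps := []*Endpoint{
		{Name: "ep1", Realized: map[string]bool{}},
		{Name: "ep2", Realized: map[string]bool{"22/TCP": true}},
	}
	// Policy computed once -> N endpoints updated.
	desired := CalculatePolicy(id, 7)
	for _, e := range eps {
		e.Apply(desired)
	}
	fmt.Println(eps[0].Revision, eps[1].Revision) // 7 7
	fmt.Println(len(eps[1].Realized))             // 1 (22/TCP removed, 80/TCP added)
}
```

Each endpoint ends up "implementing" the returned revision without ever recomputing the policy itself.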
All of this is a fine idea - and would indeed be cleaner - except for the specter of named ports. This is basically what forces policy calculation to be done in two phases, an identity-generic phase and an endpoint-specific phase.
Ultimately, we don't currently detect the "edge" that is an Identity being assigned to an Endpoint on the node. If we build such a thing into the EndpointManager, then that seems like the right place to move some of these pure control-plane computations.
This reminds me of an aspect of named ports being resolved per-endpoint after a shared `selectorPolicy` has been computed: it is possible for a named port to resolve to the same port that is used in the policy by number, and in the extreme case it would be possible for both of them to specify L7 policies for different proxy parsers. Which one gets plumbed to the datapath/Envoy is a matter of map iteration order. Otherwise I think we handle this gracefully, e.g.:
- the policy with a deny rule gets to the datapath; if both are denies, that is not a problem
- if one is a redirect while the other is not, the one with the redirect takes precedence
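Those precedence rules can be sketched as a tiny merge function (the `MapEntry` type and `merge` helper here are illustrative stand-ins, not Cilium's actual datapath types):

```go
package main

import "fmt"

// MapEntry is an illustrative stand-in for a per-port policy entry;
// not Cilium's real datapath map entry.
type MapEntry struct {
	Deny     bool
	Redirect bool
}

// merge picks which of two entries for the same port number wins,
// following the precedence described above: a deny beats everything,
// and a redirect beats a plain allow.
func merge(a, b MapEntry) MapEntry {
	if a.Deny || b.Deny {
		return MapEntry{Deny: true} // the deny rule gets to the datapath
	}
	if a.Redirect {
		return a // the entry with the redirect takes precedence
	}
	return b
}

func main() {
	allow := MapEntry{}
	redir := MapEntry{Redirect: true}
	deny := MapEntry{Deny: true}
	fmt.Println(merge(allow, redir)) // {false true}: redirect wins
	fmt.Println(merge(redir, deny))  // {true false}: deny wins
}
```

The unresolved case described above (two L7 policies for different proxy parsers) is exactly the one this kind of deterministic merge does not cover, which is why map iteration order currently decides it.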
> All of this is a fine idea - and would indeed be cleaner - except for the specter of named ports. This is basically what forces policy calculation to be done in two phases, an identity-generic phase and an endpoint-specific phase.
Agreed, good point. It would be really nice to make that two-phase policy computation (identity-based & named port-based) very clear in the code. I think it's very doable.
One way to start would be to identify which named port-specific code exists in the policy computation and see how we can extract it out.
I'm also thinking of another possibility where we could potentially even compute the named port policy within the identity-based computation. Consider that identities can map to multiple endpoints, and one of those endpoints might have a named port policy. So you could imagine that the identity-based computation detects this case, generates the named port policy as well, and then provides it to the endpoint that needs it. When endpoints apply the policy that was generated for them, they can perform a lookup to detect whether they need named ports policy, and apply it as well if so.
> Ultimately, we don't currently detect the "edge" that is an Identity being assigned to an Endpoint on the node. If we build such a thing into the EndpointManager, then that seems like the right place to move some of these pure control-plane computations.
Yep, we can definitely take steps to improve that. 👍
The tests found a bug in this code! It was possible for the …
I've written a lot of commentary around locks, fields, and the requirements around them. As part of doing so, I found a bug in the code, so fixed that too. The only serious change since the last review pass was discovering a case where `policyRevision` could go backwards; I've fixed that.
Is the bug fix the same as the …
I should have been more precise; there was a bug in a previous version of this PR, which tests caught. No bugs in the existing codebase.
/test
Maybe this PR should be labeled as blocked due to feature freeze? IMO we should not rush fundamental changes around locking into a release at the last moment (1.14 in this case).
That was my idea at first, but now it seems this fixes a critical bug in v1.13, so...
This PR has, basically, two parts. For more details, see the individual commit messages.
As preparation for further refactors of the policy engine, it would be desirable to stop holding the whole endpoint lock while we compute policy. To do this, we need to read some values, unlock, compute policy, then lock and apply. More subtly, we need to make sure that regeneration succeeds before unlocking again.
Why is this safe?
Good question.
Endpoint regeneration has a complicated locking story. There are two interesting locks here, both of which combine to protect certain fields that are modified both internally and externally.
Regeneration is serialized by `ep.buildMutex`, which ensures we never regenerate and compute policy in parallel. That means we are free to read and write the internal policy fields freely, since we never have to protect against write collisions.

Other fields accessed externally may be protected behind `ep.mutex`. However, this is too heavy-weight. Why? Because during policy computation, the policy engine needs to read back into the endpoint. Additionally, the ipcache needs to provide policy deltas to the existing policy without being blocked.

The solution is to determine, exactly, which fields need to be protected by which mutex. The critical observation is that some fields need both the mutex and the buildMutex to be written to. This ensures consistency in the face of regeneration and parallel operations.
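The "hold both mutexes to write" invariant can be sketched in a few lines (field and method names here are illustrative, not the actual endpoint code): because any writer must acquire both locks, a reader holding either one observes a stable value. The guard against the revision going backwards mirrors the `policyRevision` fix mentioned in the review thread.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is an illustrative stand-in for the real struct. The
// invariant: policyRevision may only be WRITTEN while holding BOTH
// mutex and buildMutex, so a reader holding EITHER sees a stable value.
type endpoint struct {
	mutex      sync.RWMutex
	buildMutex sync.Mutex

	policyRevision uint64
}

// setPolicyRevision requires both locks: neither a regeneration
// (buildMutex) nor an external reader/writer (mutex) can race with it.
func (e *endpoint) setPolicyRevision(rev uint64) {
	e.buildMutex.Lock()
	e.mutex.Lock()
	if rev > e.policyRevision { // never let the revision go backwards
		e.policyRevision = rev
	}
	e.mutex.Unlock()
	e.buildMutex.Unlock()
}

// getPolicyRevision only needs ep.mutex, since any writer would also
// have to acquire it.
func (e *endpoint) getPolicyRevision() uint64 {
	e.mutex.RLock()
	defer e.mutex.RUnlock()
	return e.policyRevision
}

func main() {
	e := &endpoint{}
	e.setPolicyRevision(5)
	e.setPolicyRevision(3) // stale update: ignored, revision must not regress
	fmt.Println(e.getPolicyRevision()) // 5
}
```

Note the fixed lock ordering (buildMutex before mutex); taking them in a consistent order everywhere is what keeps the two-lock scheme deadlock-free.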
Additionally, the critical observation is that during the policy calculation process, the L4Policies themselves register for incremental policy updates. That means that we have no chance of missing an incremental update during policy calculation, and it is safe to let the ipcache make progress. Specifically, policy calculation creates the `L4Filters`, which allows them to start reading incremental updates.

Since incremental updates are "idempotent", it is safe to say that we won't miss any incremental updates, even if the ipcache makes progress during the policy computation / endpoint regeneration process.
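The "idempotent" property can be illustrated with a toy delta application (the `policyMap` and `delta` types are hypothetical): replaying a delta that was already applied leaves the state unchanged, which is why it is harmless for an update to be observed both during the full computation and again afterwards.

```go
package main

import "fmt"

// policyMap is a toy stand-in for an endpoint's policy state,
// keyed by numeric identity.
type policyMap map[uint32]bool

// delta is a toy incremental update: identities to add and delete.
type delta struct {
	adds, dels []uint32
}

// apply is idempotent: applying the same delta twice yields exactly
// the same map as applying it once.
func (p policyMap) apply(d delta) {
	for _, id := range d.adds {
		p[id] = true
	}
	for _, id := range d.dels {
		delete(p, id)
	}
}

func main() {
	p := policyMap{100: true}
	d := delta{adds: []uint32{200}, dels: []uint32{100}}
	p.apply(d)
	p.apply(d) // replay: no change
	fmt.Println(len(p), p[200]) // 1 true
}
```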
Relevant fields and their mutexes:
- `ep.desiredPolicy` -- READ by the ipcache, WRITTEN by regeneration. Protected by `ep.mutex`
- `ep.policyRevision` -- READ by everyone, WRITTEN by regeneration AND the policy engine. Protected by `ep.mutex` and `ep.buildMutex`
- `ep.selectorPolicy` -- READ by regeneration, WRITTEN by regeneration and the identity resolution loop. Protected by `ep.mutex` and `ep.identityRevision`
- `ep.realizedRedirects` -- READ by external policy calculation, WRITTEN by regeneration. Protected by `ep.mutex` and `ep.buildMutex`
Processes running in parallel:
- Regeneration takes `ep.buildMutex`, does policy calculations, then takes `ep.mutex`. Commits and drops all locks at the end.
- The ipcache calls `ep.ApplyPolicyMapChanges`, which requires `ep.mutex`
- The identity resolution loop updates `ep.selectorPolicy`, holds `ep.mutex`
- The policy engine calls `ep.SetPolicyRevision()`, which requires `ep.mutex` and `ep.buildMutex`.
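Putting the pieces together, the regeneration flow described above (read inputs under the lock, drop it for the expensive computation, re-lock to commit) can be sketched as follows. This is a simplified illustration with made-up field names, not the actual cilium code:

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint sketches the locking choreography described above: hold
// buildMutex for the whole regeneration, but drop mutex while the
// (potentially slow) policy computation runs. Illustrative only.
type endpoint struct {
	mutex      sync.Mutex
	buildMutex sync.Mutex

	identity       uint32
	desiredPolicy  map[string]bool
	policyRevision uint64
}

func (e *endpoint) regenerate(rev uint64) {
	// Serialize regenerations: no parallel policy computation for
	// this endpoint.
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	// 1. Lock, read the inputs we need, unlock.
	e.mutex.Lock()
	id := e.identity
	e.mutex.Unlock()

	// 2. Compute policy WITHOUT holding ep.mutex, so the ipcache can
	// keep delivering incremental updates in the meantime.
	computed := computePolicy(id)

	// 3. Re-lock and commit the results atomically; both locks are
	// now held, satisfying the write requirements listed above.
	e.mutex.Lock()
	e.desiredPolicy = computed
	if rev > e.policyRevision { // revision must never go backwards
		e.policyRevision = rev
	}
	e.mutex.Unlock()
}

// computePolicy stands in for the expensive policy engine call.
func computePolicy(identity uint32) map[string]bool {
	return map[string]bool{"80/TCP": true}
}

func main() {
	e := &endpoint{identity: 1001}
	e.regenerate(42)
	fmt.Println(e.policyRevision, len(e.desiredPolicy)) // 42 1
}
```

Since `buildMutex` is held throughout, step 3 can safely assume the inputs read in step 1 were not invalidated by a competing regeneration, even though `ep.mutex` was released in between.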