
endpoint: don't hold the endpoint lock while generating policy #26242

Merged (6 commits, Jun 26, 2023)

Conversation

@squeed (Contributor) commented Jun 14, 2023

This PR has, basically, two parts. For more details, see the individual commit messages.

  1. Make some endpoint accessor methods lock-free. Some fields are moved to atomics (sync/atomic); others are actually immutable and don't need locking at all (see the sketch below).
  2. Calculate an endpoint's policy without holding the endpoint lock.
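
To make point 1 concrete, here is a minimal Go sketch (the field and method names are hypothetical, not the actual Cilium code) of turning a mutex-guarded accessor into an atomic one:

package endpoint

import "sync/atomic"

type Endpoint struct {
  // state is read far more often than it is written; an atomic load
  // avoids contending on the endpoint mutex entirely.
  state atomic.Int32
}

// GetState is now lock-free: a plain atomic load.
func (e *Endpoint) GetState() int32 { return e.state.Load() }

// SetState is a plain atomic store; writers still serialize elsewhere.
func (e *Endpoint) SetState(s int32) { e.state.Store(s) }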

As preparation for further refactors of the policy engine, it would be desirable to stop holding the whole endpoint lock while we compute policy. To do this, we need to read some values, unlock, compute policy, then lock and apply. More subtly, we need to make sure that regeneration succeeds before unlocking again.
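
A minimal sketch of that flow, assuming hypothetical names everywhere except buildMutex, mutex, and desiredPolicy, which appear in this description:

package endpoint

import "sync"

type policy struct{} // stand-in for the computed policy

type Endpoint struct {
  mutex         sync.RWMutex // protects externally visible fields
  buildMutex    sync.Mutex   // serializes regenerations
  identity      uint32
  desiredPolicy *policy
}

// computePolicy stands in for the (expensive) policy engine call.
func computePolicy(identity uint32) (*policy, error) { return &policy{}, nil }

func (e *Endpoint) regenerate() error {
  e.buildMutex.Lock() // only one regeneration at a time
  defer e.buildMutex.Unlock()

  e.mutex.RLock()
  id := e.identity // snapshot the inputs we need
  e.mutex.RUnlock()

  // Compute with the endpoint unlocked: the ipcache and the policy
  // engine can still read the endpoint while this runs.
  p, err := computePolicy(id)
  if err != nil {
    return err // bail out before re-locking; nothing was applied
  }

  e.mutex.Lock()
  e.desiredPolicy = p // lock again only to commit the result
  e.mutex.Unlock()
  return nil
}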

Why is this safe?

Good question.

Endpoint regeneration has a complicated locking story. There are two interesting locks here, both of which combine to protect certain fields that are modified both internally and externally.

Regeneration is serialized by the ep.buildMutex, which ensures we never regenerate and compute policy in parallel. That means regeneration can read and write the internal policy fields without restriction, since it never has to protect against write collisions.

Other fields accessed externally are protected by ep.mutex. However, holding that lock for the whole computation is too heavy-weight. Why? Because during policy computation, the policy engine needs to read back into the endpoint. Additionally, the ipcache needs to provide policy deltas to the existing policy without being blocked.

The solution is to determine, exactly, which fields must be protected by which mutex. The critical observation is that some fields require both the mutex and the buildMutex to be held for writes. This ensures consistency in the face of regeneration and parallel operations.

The second key observation is that, during the policy calculation process, the L4Policies themselves register for incremental policy updates. That means we have no chance of missing an incremental update during policy calculation, and it is safe to let the ipcache make progress. Specifically, the order is (see the sketch after the list):

  1. Create CachedSelectors. These subscribe for incremental updates
  2. Call UpdatePolicy() -> resolvePolicyLocked(), which creates the L4Filters and allows them to start reading incremental updates
  3. Call Consume() -> DistillPolicy(), which evaluates the cached selectors for a list of applicable identities.
  4. Lock the endpoint, which blocks ipcache progress
  5. Apply the policy, then call ApplyPolicyMapChanges, which consumes any incremental updates that accumulated between steps 3 and 4.
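
Schematically, in Go (UpdatePolicy, Consume/DistillPolicy, and ApplyPolicyMapChanges are the names from the steps above; the surrounding types and locking details are simplified, and this is not the literal PR code):

// Schematic regeneration flow; types and plumbing omitted.
func (e *Endpoint) regeneratePolicy(repo *Repository) error {
  // Steps 1-2: building the selector policy creates the CachedSelectors,
  // which are subscribed to incremental updates from this point on.
  if err := repo.GetPolicyCache().UpdatePolicy(e.SecurityIdentity); err != nil {
    return err
  }

  // Step 3: Consume() -> DistillPolicy() evaluates the cached selectors
  // into this endpoint's policy. The endpoint is still unlocked here.
  desired := repo.GetPolicyCache().Consume(e)

  // Step 4: take the endpoint lock, which blocks further ipcache progress.
  e.mutex.Lock()
  defer e.mutex.Unlock()

  // Step 5: apply, then drain any incremental updates that accumulated
  // between steps 3 and 4. Updates are idempotent, so replaying one that
  // is already folded into the desired policy is harmless.
  e.desiredPolicy = desired
  e.ApplyPolicyMapChanges() // simplified: assumed safe under the lock here
  return nil
}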

Since incremental updates are idempotent, we can be sure we won't miss any, even if the ipcache makes progress during the policy computation / endpoint regeneration process.

Relevant fields and their mutexes:

  • ep.desiredPolicy -- READ by the ipcache, WRITTEN by regeneration. Protected by ep.mutex
  • ep.policyRevision -- READ by everyone, WRITTEN by regeneration AND the policy engine. Protected by ep.mutex and ep.buildMutex
  • ep.selectorPolicy -- READ by regeneration, WRITTEN by regeneration and the identity resolution loop. Protected by ep.mutex and ep.identityRevision
  • ep.realizedRedirects -- READ by external policy calculation, WRITTEN by regeneration. Protected by ep.mutex and ep.buildMutex

Processes running in parallel

  • Endpoint regeneration. Takes ep.buildMutex, does policy calculations, then takes ep.mutex. Commits and drops all locks at the end.
  • ipcache. Needs to call ep.ApplyPolicyMapChanges, which requires ep.mutex
  • identity resolution. Clears ep.selectorPolicy, holds ep.mutex
  • policy ingestion. When a policy is a no-op for an endpoint, calls ep.SetPolicyRevision() which requires ep.mutex and ep.buildMutex.

@squeed squeed requested review from a team as code owners June 14, 2023 20:47
@squeed squeed requested a review from jrajahalme June 14, 2023 20:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 14, 2023
@squeed squeed requested a review from christarazi June 14, 2023 20:47
@squeed squeed force-pushed the endpoint-calculate-policy-unlocked branch from 58db360 to 2d746f4 on June 15, 2023 13:11
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Jun 15, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 15, 2023
@squeed squeed force-pushed the endpoint-calculate-policy-unlocked branch 2 times, most recently from 3433367 to 4127f7e on June 15, 2023 20:18
@squeed (Contributor, Author) commented Jun 15, 2023

/test

@christarazi (Member) left a comment

Nice work! A couple of comments.

pkg/endpoint/bpf.go (review comment: outdated, resolved)
pkg/endpoint/policy.go (review comment: outdated, resolved)
Comment on lines -195 to -219
// TODO: GH-7515: This should be triggered closer to policy change
// handlers, but for now let's just update it here.
if err := repo.GetPolicyCache().UpdatePolicy(e.SecurityIdentity); err != nil {
@christarazi (Member):
This todo is not actually resolved. The todo is basically saying that policy computation can be done outside of the endpoint completely. While this overall PR gets us closer in that direction, we are technically still in the context of the endpoint.

Going forward, we can resolve #7515 once we can calculate policy for endpoints (outside of the context of a single endpoint) when

  1. policies change
  2. identities change
  3. endpoints change

and "change" here means added or removed.

Concretely, this could look like some func CalculatePolicy(identity *identity.Identity) that we call on each of the occurrences listed above. Then the only logic necessary in the endpoint would just be a lookup like:

// Cheap per-endpoint path: reuse the policy already computed for this identity.
selectorPolicy := repo.PolicyCache().Lookup(e.SecurityIdentity)
...
if forcePolicyCompute {
  // Rare path: recompute the identity's policy on demand.
  repo.PolicyCache().UpdatePolicy(e.SecurityIdentity)
}

I toyed with this in squeed/cilium@policy-allocate-when-used-v2...christarazi:pr/christarazi/separate-policy-calc-ep-regen, but it obviously needs more work. Just wanted to share more context.

@squeed (Contributor, Author) — Jun 16, 2023:

aha, I saw that the issue was closed and didn't really go much further than that. If we wanna do it, we shouldn't'a closed the issue :-).

Not to get too distracted from the current issue, but I'm not sure that just extracting policy computation is that interesting. More useful would be the ability to fully push policy to bpf & Envoy outside of a regeneration. I don't know how interleaved they are, but it seems like it could be doable.

Member:

Not to get too distracted from the current issue, but I'm not sure that just extracting policy computation is that interesting. More useful would be the ability to fully push policy to bpf & Envoy outside of a regeneration. I don't know how interleaved they are, but it seems like it could be doable.

Agreed - and I think we need to extract / decouple policy calculation from endpoint regeneration to be able to push policy to BPF & Envoy outside of a regeneration, and to simplify the model. Today, the logic is a bit interleaved and harder to reason about.

A regeneration is considered a full policy calculation, while ApplyPolicyMapChanges is incremental. If we separate policy calculation completely from the endpoint, then that should in theory allow us to always perform incremental policy updates, which should improve performance, reduce overhead, and reduce potential interleaving of critical sections. It will also put the code into a structure where it's easier to introduce pkg/stream infrastructure.

The way I see it, we can have a model where CalculatePolicy(identity) returns a policy revision and the desired computed policy, and hands that to each endpoint that has that identity. Each endpoint with that identity then takes its realized, current policy, computes the diff against the desired policy from CalculatePolicy, and implements the diff. The endpoints would then be considered to be implementing the returned policy revision.

In contrast, today each endpoint attempts to compute the policy itself: N endpoints -> policy computed N times (with some caching, of course).

We sort of "invert" this relationship with CalculatePolicy by doing: Policy computed once -> N endpoints updated.

This makes the model much simpler to think about and should enable us to always have incremental policy updates.
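
A hypothetical sketch of that inverted model (all names are illustrative, not existing APIs):

package policy

type Identity struct{ ID uint32 }
type Policy struct{} // stand-in for the desired computed policy

type Endpoint struct {
  realized       *Policy
  policyRevision uint64
}

func (p *Policy) Diff(desired *Policy) *Policy { return desired } // stub
func (e *Endpoint) implement(diff *Policy)     {}                 // stub

// CalculatePolicy computes the full policy for an identity exactly once.
func CalculatePolicy(id *Identity) (uint64, *Policy) { return 1, &Policy{} }

// updateEndpoints fans the result out: policy computed once -> N endpoints.
func updateEndpoints(id *Identity, eps []*Endpoint) {
  rev, desired := CalculatePolicy(id)
  for _, ep := range eps {
    diff := ep.realized.Diff(desired) // endpoint-local diff only
    ep.implement(diff)
    ep.policyRevision = rev // the endpoint now realizes this revision
  }
}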

@squeed (Contributor, Author):

All of this is a fine idea - and would indeed be cleaner - except for the specter of named ports. This is basically what forces policy calculation to be done in two phases, an identity-generic phase and an endpoint-specific phase (see the sketch at the end of this comment).

Ultimately, we don't currently detect the "edge" that is an Identity being assigned to an Endpoint on the node. If we build such a thing in to the EndpointManager, then that seems like the right place to move some of these pure control-plane computations.
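
A hypothetical sketch of those two phases (DistillPolicy matches the endpoint-specific step named earlier in this PR; everything else is illustrative):

package policy

// Phase 1 output: shared per identity; named ports are still unresolved.
type SelectorPolicy struct{ rules []rule }

type rule struct {
  portName string // e.g. "http"; empty if the rule used a numeric port
  port     uint16
}

// Phase 2 output: per endpoint, with named ports resolved to numbers.
type EndpointPolicy struct{ rules []rule }

// DistillPolicy is the endpoint-specific phase: it resolves named ports
// against this particular endpoint's own port mapping.
func (sp *SelectorPolicy) DistillPolicy(namedPorts map[string]uint16) *EndpointPolicy {
  out := &EndpointPolicy{}
  for _, r := range sp.rules {
    if r.portName != "" {
      p, ok := namedPorts[r.portName]
      if !ok {
        continue // this endpoint does not expose the named port
      }
      r.port = p // e.g. "http" -> 8080 on this endpoint
    }
    out.rules = append(out.rules, r)
  }
  return out
}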

Member:

This reminds me of an aspect of named ports being resolved per-endpoint after a shared selectorPolicy has been computed: it is possible for a named port to resolve to the same port that is used in the policy by number, and in the extreme case both of them could specify L7 policies for different proxy parsers. Which one gets plumbed to the datapath/Envoy is then a matter of map iteration order. Otherwise I think we handle this gracefully (see the sketch after this list), e.g.:

  • the entry with a deny rule gets to the datapath; if both are denies, that is not a problem
  • if one is a redirect while the other is not, the one with the redirect takes precedence
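
A sketch of those precedence rules (the entry type and merge helper are illustrative, not the actual datapath code):

package policy

// entry is a simplified stand-in for a policy-map value.
type entry struct {
  deny       bool
  isRedirect bool
}

// merge decides which of two colliding entries wins, per the rules above:
// a deny always wins; among allows, a redirect beats a plain allow.
func merge(a, b entry) entry {
  if a.deny || b.deny {
    return entry{deny: true}
  }
  if a.isRedirect {
    return a
  }
  return b
}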

Member:

All of this is a fine idea - and would indeed be cleaner - except for the specter of named ports. This is basically what forces policy calculation to be done in two phases, an identity-generic phase and an endpoint-specific phase.

Agreed, good point. It would be really nice to make that two-phase policy computation (identity-based & named port-based) very clear in the code. I think it's very doable.

One way to start would be to identify which named port-specific code exists in the policy computation and see how we can extract it out.

I'm also thinking of another possibility where we could compute the named-port policy within the identity-based computation. Consider that an identity can map to multiple endpoints, and one of those endpoints might have a named-port policy. So you could imagine that the identity-based computation detects this case, generates the named-port policy as well, and then provides it to the endpoint that needs it. When endpoints apply the policy that was generated for them, they can perform a lookup to detect whether they need the named-port policy, and apply it as well if so.

Ultimately, we don't currently detect the "edge" that is an Identity being assigned to an Endpoint on the node. If we build such a thing in to the EndpointManager, then that seems like the right place to move some of these pure control-plane computations.

Yep, we can definitely take steps to improve that. 👍

pkg/policy/distillery.go Outdated Show resolved Hide resolved
@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. area/daemon Impacts operation of the Cilium daemon. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Jun 15, 2023
@squeed squeed force-pushed the endpoint-calculate-policy-unlocked branch from 4127f7e to d3e1c66 on June 16, 2023 10:38
@squeed (Contributor, Author) commented Jun 16, 2023

The tests found a bug in this code! It was possible for ep.policyRevision to go backwards if the repo's policy revision was bumped while regeneration was in progress. Updated to require the buildMutex in order to bump the policy revision; this preserves existing behavior (see the sketch below).
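
A sketch of the shape of that fix (hypothetical; the actual guard in the commit may differ): bumping the revision takes both locks, and the revision only moves forward:

package endpoint

import "sync"

type Endpoint struct {
  mutex          sync.RWMutex
  buildMutex     sync.Mutex
  policyRevision uint64
}

// SetPolicyRevision requires the buildMutex, so it cannot race with an
// in-progress regeneration, and the revision can only move forward.
func (e *Endpoint) SetPolicyRevision(rev uint64) {
  e.buildMutex.Lock()
  defer e.buildMutex.Unlock()

  e.mutex.Lock()
  defer e.mutex.Unlock()
  if rev > e.policyRevision { // never let the revision go backwards
    e.policyRevision = rev
  }
}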

@squeed squeed force-pushed the endpoint-calculate-policy-unlocked branch 2 times, most recently from e0c91af to cc53321 on June 16, 2023 13:16
@squeed (Contributor, Author) commented Jun 16, 2023

I've written a lot of commentary around locks, fields, and the requirements around them. As part of doing so, I found a bug in the code, so fixed that too.

The only serious change since the last review pass was discovering a case where policyRevision could go backwards; I've fixed that.

@christarazi (Member) commented:

I've written a lot of commentary around locks, fields, and the requirements around them. As part of doing so, I found a bug in the code, so fixed that too.

The only serious change since the last review pass was discovering a case where policyRevision could go backwards; I've fixed that.

Is the bug fix the same as the policyRevision-going-backwards fix? I'm thinking it would be good to separate bug fixes into their own commits so that we can fix and backport them without tying them to the overall change of removing the lock dependency while generating policy. However, from your message it's not clear to me whether there are separate bug fixes or whether they all come from the overall change.

@squeed (Contributor, Author) commented Jun 19, 2023

Is the bug fix the same as the policyRevision-going-backwards fix?

I should have been more precise; there was a bug in a previous version of this PR, which tests caught. No bugs in the existing codebase.

@squeed (Contributor, Author) commented Jun 19, 2023

/test

@jrajahalme (Member) commented:

Maybe this PR should be labeled as blocked due to the feature freeze? IMO we should not rush fundamental changes around locking into a release at the last moment (1.14 in this case).

@squeed (Contributor, Author) commented Jun 19, 2023

Maybe this PR should be labeled as blocked due to the feature freeze? IMO we should not rush fundamental changes around locking into a release at the last moment (1.14 in this case).

That was my idea at first, but now it seems this fixes a critical bug in v1.13, so........

@squeed squeed deleted the endpoint-calculate-policy-unlocked branch June 27, 2023 08:50
@squeed squeed added backport/author The backport will be carried out by the author of the PR. needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jun 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.5 Jun 28, 2023
@squeed squeed added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jul 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.5 Jul 10, 2023
@gandro gandro mentioned this pull request Jul 18, 2023
2 tasks
@aanm aanm added needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Jul 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Needs backport from main in 1.13.5 Jul 26, 2023
@gentoo-root gentoo-root added this to Needs backport from main in 1.13.6 Jul 26, 2023
@gentoo-root gentoo-root removed this from Needs backport from main in 1.13.5 Jul 26, 2023
@aanm aanm added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jul 26, 2023
@nebril nebril added this to Backport pending to v1.13 in 1.13.7 Aug 10, 2023
@nebril nebril removed this from Needs backport from main in 1.13.6 Aug 10, 2023
@joestringer joestringer added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Aug 14, 2023
@joestringer joestringer moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.7 Aug 25, 2023
@github-actions github-actions bot added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Nov 29, 2023