endpoint: don't hold the endpoint lock while generating policy #26242
Conversation
/test
Nice work! A couple of comments.
```go
// TODO: GH-7515: This should be triggered closer to policy change
// handlers, but for now let's just update it here.
if err := repo.GetPolicyCache().UpdatePolicy(e.SecurityIdentity); err != nil {
```
This todo is not actually resolved. The todo is basically saying that policy computation can be done outside of the endpoint completely. While this overall PR gets us closer in that direction, we are technically still in the context of the endpoint.
Going forward, we can resolve #7515 when we can calculate policy for endpoints (outside of the context of a single endpoint) when
- policies change
- identities change
- endpoints change
and "change" here means added or removed.
Concretely, what this could look like is some `func CalculatePolicy(identity *identity.Identity)` which we call on each of the above listed occurrences. Then the only logic necessary in the endpoint should just be a lookup like

```go
selectorPolicy := repo.PolicyCache().Lookup(e.SecurityIdentity)
...
if forcePolicyCompute {
    repo.PolicyCache().UpdatePolicy(e.SecurityIdentity)
}
```
I toyed with this in squeed/cilium@policy-allocate-when-used-v2...christarazi:pr/christarazi/separate-policy-calc-ep-regen, but it obviously needs more work. Just wanted to share more context.
aha, I saw that the issue was closed and didn't really go much further than that. If we wanna do it, we shouldn't'a closed the issue :-).
Not to get too distracted from the current issue, but I'm not sure that just extracting policy computation is that interesting. More useful would be the ability to fully push policy to bpf & Envoy outside of a regeneration. I don't know how interleaved they are, but it seems like it could be doable.
> Not to get too distracted from the current issue, but I'm not sure that just extracting policy computation is that interesting. More useful would be the ability to fully push policy to bpf & Envoy outside of a regeneration. I don't know how interleaved they are, but it seems like it could be doable.
Agreed, and I think we need to extract/decouple policy calculation from endpoint regeneration in order to be able to push policy to BPF & Envoy outside of a regeneration, and to simplify the model. Today, the logic is a bit interleaved and harder to reason about.
A regeneration is considered a full policy calculation, while `ApplyPolicyMapChanges` is incremental. If we separate policy calculation completely from the endpoint, then that should in theory allow us to always perform incremental policy updates, which should improve performance, reduce overhead, and reduce potential interleaving of critical sections. It will also put the code into a structure where it's easier to introduce `pkg/stream` infrastructure.
The way I see it, we can have a model where `CalculatePolicy(identity)` returns a policy revision and the desired computed policy, and hands that to each endpoint that has `identity`. Each endpoint with that `identity` will then take its realized, current policy, compute the diff against the desired policy from `CalculatePolicy`, and implement the diff. The endpoints would then be considered to be implementing the returned policy revision.
In contrast, today each endpoint attempts to compute the policy diff: `N` endpoints -> compute policies `N` times (there's some caching, of course). We sort of "invert" this relationship with `CalculatePolicy` by doing: policy computed once -> `N` endpoints updated.
This makes the model much simpler to think about and should enable us to always have incremental policy updates.
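As a rough sketch of that inversion (the `CalculatePolicy`, `Endpoint`, and `DesiredPolicy` shapes here are hypothetical illustrations, not Cilium's actual API): policy is computed once per identity, and each endpoint only diffs its realized state against the shared result.

```go
package main

import "fmt"

// Hypothetical types illustrating the "compute once, fan out to N
// endpoints" model described above; not Cilium's real API.
type Identity struct{ ID uint32 }

type DesiredPolicy struct {
	Revision uint64
	Allowed  map[string]bool // port/proto key -> allowed
}

type Endpoint struct {
	Name     string
	Realized map[string]bool
	Revision uint64
}

// CalculatePolicy computes the desired policy for an identity exactly once.
func CalculatePolicy(id Identity, rev uint64) DesiredPolicy {
	return DesiredPolicy{
		Revision: rev,
		Allowed:  map[string]bool{"80/TCP": true},
	}
}

// Apply diffs the endpoint's realized policy against the desired one,
// implements only the difference, then records the revision.
func (e *Endpoint) Apply(d DesiredPolicy) {
	for k, v := range d.Allowed {
		if e.Realized[k] != v {
			e.Realized[k] = v // plumb the delta (datapath update elided)
		}
	}
	for k := range e.Realized {
		if !d.Allowed[k] {
			delete(e.Realized, k)
		}
	}
	e.Revision = d.Revision
}

func main() {
	id := Identity{ID: 1001}
	eps := []*Endpoint{
		{Name: "ep1", Realized: map[string]bool{}},
		{Name: "ep2", Realized: map[string]bool{"22/TCP": true}},
	}
	// Policy computed once -> N endpoints updated.
	desired := CalculatePolicy(id, 7)
	for _, e := range eps {
		e.Apply(desired)
	}
	fmt.Println(eps[0].Revision, eps[1].Revision) // 7 7
	fmt.Println(len(eps[1].Realized))             // 1 (22/TCP removed, 80/TCP added)
}
```

Each endpoint ends up "implementing" the returned revision without ever recomputing the policy itself.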
All of this is a fine idea - and would indeed be cleaner - except for the specter of named ports. This is basically what forces policy calculation to be done in two phases, an identity-generic phase and an endpoint-specific phase.
Ultimately, we don't currently detect the "edge" that is an Identity being assigned to an Endpoint on the node. If we build such a thing into the EndpointManager, then that seems like the right place to move some of these pure control-plane computations.
This reminds me of an aspect of named ports being resolved per-endpoint after a shared `selectorPolicy` has been computed: it is possible for a named port to resolve to the same port that is used in the policy by number, and in the extreme case it would be possible for both of them to specify L7 policies for different proxy parsers. Which one gets plumbed to the datapath/Envoy is a matter of map iteration order. Otherwise I think we handle this gracefully, e.g.:
- the policy with a deny rule gets to the datapath; if both are denies, that is not a problem
- if one is a redirect while the other is not, the one with the redirect takes precedence
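Those precedence rules can be sketched as a tiny merge function (the `MapEntry` type and `merge` helper here are illustrative stand-ins, not Cilium's actual datapath types):

```go
package main

import "fmt"

// MapEntry is an illustrative stand-in for a per-port policy entry;
// not Cilium's real datapath map entry.
type MapEntry struct {
	Deny     bool
	Redirect bool
}

// merge picks which of two entries for the same port number wins,
// following the precedence described above: a deny beats everything,
// and a redirect beats a plain allow.
func merge(a, b MapEntry) MapEntry {
	if a.Deny || b.Deny {
		return MapEntry{Deny: true} // the deny rule gets to the datapath
	}
	if a.Redirect {
		return a // the entry with the redirect takes precedence
	}
	return b
}

func main() {
	allow := MapEntry{}
	redir := MapEntry{Redirect: true}
	deny := MapEntry{Deny: true}
	fmt.Println(merge(allow, redir)) // {false true}: redirect wins
	fmt.Println(merge(redir, deny))  // {true false}: deny wins
}
```

The unresolved case described above (two L7 policies for different proxy parsers) is exactly the one this kind of deterministic merge does not cover, which is why map iteration order currently decides it.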
> All of this is a fine idea - and would indeed be cleaner - except for the specter of named ports. This is basically what forces policy calculation to be done in two phases, an identity-generic phase and an endpoint-specific phase.
Agreed, good point. It would be really nice to make that two-phase policy computation (identity-based & named port-based) very clear in the code. I think it's very doable.
One way to start would be to identify which named port-specific code exists in the policy computation and see how we can extract it out.
I'm also thinking of another possibility where we could potentially even compute the named port policy within the identity-based computation. Consider that identities can map to multiple endpoints, and one of those endpoints might have a named port policy. So you could imagine that the identity-based computation detects this case, generates the named port policy as well, and then provides it to the endpoint that needs it. When endpoints apply the policy that was generated for them, they can perform a lookup to detect whether they need named ports policy, and apply it as well if so.
> Ultimately, we don't currently detect the "edge" that is an Identity being assigned to an Endpoint on the node. If we build such a thing into the EndpointManager, then that seems like the right place to move some of these pure control-plane computations.
Yep, we can definitely take steps to improve that. 👍
The tests found a bug in this code! It was possible for the …
I've written a lot of commentary around locks, fields, and the requirements around them. As part of doing so, I found a bug in the code, so fixed that too. The only serious change since the last review pass was discovering a case where `policyRevision` could go backwards; I've fixed that.
Is the bug fix the same as the …
I should have been more precise; there was a bug in a previous version of this PR, which tests caught. No bugs in the existing codebase.
/test
Maybe this PR should be labeled as blocked due to feature freeze? IMO we should not rush fundamental changes around locking into a release at the last moment (1.14 in this case).
That was my idea at first, but now it seems this fixes a critical bug in v1.13, so...
This PR has, basically, two parts. For more details, see the individual commit messages.
As preparation for further refactors of the policy engine, it would be desirable to stop holding the whole endpoint lock while we compute policy. To do this, we need to read some values, unlock, compute policy, then lock and apply. More subtly, we need to make sure that regeneration succeeds before unlocking again.
Why is this safe?
Good question.
Endpoint regeneration has a complicated locking story. There are two interesting locks here, both of which combine to protect certain fields that are modified both internally and externally.
Regeneration is serialized by `ep.buildMutex`, which ensures we never regenerate and compute policy in parallel. That means we are free to read and write the internal policy fields freely, since we never have to protect against write collisions.

Other fields accessed externally may be protected behind `ep.mutex`. However, this is too heavy-weight. Why? Because during policy computation, the policy engine needs to read back into the endpoint. Additionally, the ipcache needs to provide policy deltas to the existing policy without being blocked.

The solution is to determine, exactly, which fields need to be protected by which mutex. The critical observation is that some fields need both the mutex and the buildMutex to be written to. This ensures consistency in the face of regeneration and parallel operations.
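The "hold both mutexes to write" invariant can be sketched in a few lines (field and method names here are illustrative, not the actual endpoint code): because any writer must acquire both locks, a reader holding either one observes a stable value. The guard against the revision going backwards mirrors the `policyRevision` fix mentioned in the review thread.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is an illustrative stand-in for the real struct. The
// invariant: policyRevision may only be WRITTEN while holding BOTH
// mutex and buildMutex, so a reader holding EITHER sees a stable value.
type endpoint struct {
	mutex      sync.RWMutex
	buildMutex sync.Mutex

	policyRevision uint64
}

// setPolicyRevision requires both locks: neither a regeneration
// (buildMutex) nor an external reader/writer (mutex) can race with it.
func (e *endpoint) setPolicyRevision(rev uint64) {
	e.buildMutex.Lock()
	e.mutex.Lock()
	if rev > e.policyRevision { // never let the revision go backwards
		e.policyRevision = rev
	}
	e.mutex.Unlock()
	e.buildMutex.Unlock()
}

// getPolicyRevision only needs ep.mutex, since any writer would also
// have to acquire it.
func (e *endpoint) getPolicyRevision() uint64 {
	e.mutex.RLock()
	defer e.mutex.RUnlock()
	return e.policyRevision
}

func main() {
	e := &endpoint{}
	e.setPolicyRevision(5)
	e.setPolicyRevision(3) // stale update: ignored, revision must not regress
	fmt.Println(e.getPolicyRevision()) // 5
}
```

Note the fixed lock ordering (buildMutex before mutex); taking them in a consistent order everywhere is what keeps the two-lock scheme deadlock-free.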
Additionally, the critical observation is that during the policy calculation process, the L4Policies themselves register for incremental policy updates. That means that we have no chance of missing an incremental update during policy calculation, and it is safe to let the ipcache make progress. Specifically, policy calculation creates the `L4Filters`, which allows them to start reading incremental updates.

Since incremental updates are "idempotent", it is safe to say that we won't miss any incremental updates, even if the ipcache makes progress during the policy computation / endpoint regeneration process.
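The "idempotent" property can be illustrated with a toy delta application (the `policyMap` and `delta` types are hypothetical): replaying a delta that was already applied leaves the state unchanged, which is why it is harmless for an update to be observed both during the full computation and again afterwards.

```go
package main

import "fmt"

// policyMap is a toy stand-in for an endpoint's policy state,
// keyed by numeric identity.
type policyMap map[uint32]bool

// delta is a toy incremental update: identities to add and delete.
type delta struct {
	adds, dels []uint32
}

// apply is idempotent: applying the same delta twice yields exactly
// the same map as applying it once.
func (p policyMap) apply(d delta) {
	for _, id := range d.adds {
		p[id] = true
	}
	for _, id := range d.dels {
		delete(p, id)
	}
}

func main() {
	p := policyMap{100: true}
	d := delta{adds: []uint32{200}, dels: []uint32{100}}
	p.apply(d)
	p.apply(d) // replay: no change
	fmt.Println(len(p), p[200]) // 1 true
}
```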
Relevant fields and their mutexes:
- `ep.desiredPolicy` -- READ by the ipcache, WRITTEN by regeneration. Protected by `ep.mutex`
- `ep.policyRevision` -- READ by everyone, WRITTEN by regeneration AND the policy engine. Protected by `ep.mutex` and `ep.buildMutex`
- `ep.selectorPolicy` -- READ by regeneration, WRITTEN by regeneration and the identity resolution loop. Protected by `ep.mutex` and `ep.identityRevision`
- `ep.realizedRedirects` -- READ by external policy calculation, WRITTEN by regeneration. Protected by `ep.mutex` and `ep.buildMutex`
Processes running in parallel:
- Regeneration takes `ep.buildMutex`, does policy calculations, then takes `ep.mutex`. Commits and drops all locks at the end.
- The ipcache calls `ep.ApplyPolicyMapChanges`, which requires `ep.mutex`
- The identity resolution loop updates `ep.selectorPolicy`, holds `ep.mutex`
- The policy engine calls `ep.SetPolicyRevision()`, which requires `ep.mutex` and `ep.buildMutex`.
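Putting the pieces together, the regeneration flow described above (read inputs under the lock, drop it for the expensive computation, re-lock to commit) can be sketched as follows. This is a simplified illustration with made-up field names, not the actual cilium code:

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint sketches the locking choreography described above: hold
// buildMutex for the whole regeneration, but drop mutex while the
// (potentially slow) policy computation runs. Illustrative only.
type endpoint struct {
	mutex      sync.Mutex
	buildMutex sync.Mutex

	identity       uint32
	desiredPolicy  map[string]bool
	policyRevision uint64
}

func (e *endpoint) regenerate(rev uint64) {
	// Serialize regenerations: no parallel policy computation for
	// this endpoint.
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	// 1. Lock, read the inputs we need, unlock.
	e.mutex.Lock()
	id := e.identity
	e.mutex.Unlock()

	// 2. Compute policy WITHOUT holding ep.mutex, so the ipcache can
	// keep delivering incremental updates in the meantime.
	computed := computePolicy(id)

	// 3. Re-lock and commit the results atomically; both locks are
	// now held, satisfying the write requirements listed above.
	e.mutex.Lock()
	e.desiredPolicy = computed
	if rev > e.policyRevision { // revision must never go backwards
		e.policyRevision = rev
	}
	e.mutex.Unlock()
}

// computePolicy stands in for the expensive policy engine call.
func computePolicy(identity uint32) map[string]bool {
	return map[string]bool{"80/TCP": true}
}

func main() {
	e := &endpoint{identity: 1001}
	e.regenerate(42)
	fmt.Println(e.policyRevision, len(e.desiredPolicy)) // 42 1
}
```

Since `buildMutex` is held throughout, step 3 can safely assume the inputs read in step 1 were not invalidated by a competing regeneration, even though `ep.mutex` was released in between.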