[v1.12] Backport of #26242 (don't hold EP lock while generating policy) #29408
[ upstream commit 3ca309d ]

[ backporter's notes:
1. adjust types.NamedPortMap -> policy.NamedPortMap
2. EndpointInfoSource is in proxy/logger/epinfo.go instead of proxy/endpoint/endpoint.go
3. Go1.18 compat: atomic.Pointer -> atomic.Value ]

This function is called deep within the policy generation hierarchy and is at risk of causing deadlocks. Since it only reads a pointer to a never-mutated map, we can safely stash the map behind an atomic pointer and remove the lock.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
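The lock-free read described in this commit can be illustrated with a small sketch. This is not the Cilium code: `portMap`, `endpoint`, `setNamedPorts`, and `getNamedPorts` are hypothetical names chosen for the example. It uses `atomic.Value`, as in the backporter's Go 1.18 note, and assumes that a map is never mutated after it has been stored.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// portMap is a hypothetical stand-in for policy.NamedPortMap.
type portMap map[string]uint16

// endpoint is a hypothetical, stripped-down endpoint that only tracks named ports.
type endpoint struct {
	namedPorts atomic.Value // only ever holds a portMap
}

// setNamedPorts publishes a new map. The caller must not mutate m afterwards;
// an update is done by building a fresh map and storing it.
func (e *endpoint) setNamedPorts(m portMap) {
	e.namedPorts.Store(m)
}

// getNamedPorts returns the most recently published map without taking a lock.
// It returns nil if nothing has been stored yet.
func (e *endpoint) getNamedPorts() portMap {
	m, _ := e.namedPorts.Load().(portMap)
	return m
}

func main() {
	e := &endpoint{}
	fmt.Println("entries before first store:", len(e.getNamedPorts()))

	e.setNamedPorts(portMap{"http": 80})
	fmt.Println("http port:", e.getNamedPorts()["http"])
}
```

Because readers never take the endpoint lock, a call made deep inside policy generation cannot add an edge to the lock dependency graph.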
[ upstream commit e20b16d ]

[ backporter's notes:
1. manager_test: we don't yet have ipcache test infrastructure, use the real thing instead (as other tests do)
2. go1.18 compat: atomic.Pointer -> atomic.Value ]

It turns out that most of the endpoint identities, e.g. pod name / namespace, are actually immutable. So there's no need to grab a lock before reading them.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

[ upstream commit b63115b ]

These methods are no longer used; remove them from the EndpointInfoSource interface.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

[ upstream commit f048a6a ]

As preparation for other refactors of the policy engine, no longer hold the endpoint lock while calculating policy. This is safe to do, since the only input is the endpoint's security identity. Furthermore, if policy were somehow calculated in parallel, we can reject an update if its revision is too old.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

[ upstream commit 9623641 ]

This ensures the generated ID works like IDs allocated normally, i.e. that their LabelArray is set.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

[ upstream commit 8e163a9 ]

This adds a small test that ensures incremental updates are never lost, even in the face of significant identity churn. It simulates a churning ipcache flinging identities into the policy engine, while policy is constantly recalculated.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
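The commit that stops holding the endpoint lock during policy calculation relies on two ideas: the calculation needs only the security identity, and a result with a too-old revision can be rejected. A minimal sketch of that pattern follows. All names here (`computed`, `endpoint`, `calculate`, `regenerate`) are hypothetical and do not reflect the actual Cilium types or functions.

```go
package main

import (
	"fmt"
	"sync"
)

// computed is a hypothetical result of a policy calculation.
type computed struct {
	revision uint64
	rules    string
}

// endpoint is a hypothetical endpoint guarded by a single mutex.
type endpoint struct {
	mu       sync.Mutex
	identity uint32
	current  computed
}

// calculate stands in for the expensive policy generation. Its only
// endpoint-derived input is the security identity.
func calculate(identity uint32, revision uint64) computed {
	return computed{
		revision: revision,
		rules:    fmt.Sprintf("rules for identity %d at revision %d", identity, revision),
	}
}

// regenerate snapshots the identity under the lock, calculates policy with
// the lock released, and re-acquires the lock only to install the result.
// It reports whether the result was installed.
func (e *endpoint) regenerate(revision uint64) bool {
	e.mu.Lock()
	id := e.identity
	e.mu.Unlock()

	result := calculate(id, revision) // endpoint lock is not held here

	e.mu.Lock()
	defer e.mu.Unlock()
	if result.revision < e.current.revision {
		return false // a newer result was already installed; drop the stale one
	}
	e.current = result
	return true
}

func main() {
	e := &endpoint{identity: 7}
	fmt.Println("revision 5 installed:", e.regenerate(5))
	fmt.Println("revision 3 installed:", e.regenerate(3))
	fmt.Println("current:", e.current.rules)
}
```

The revision comparison at install time is what makes a parallel calculation harmless in this sketch: whichever result carries the older revision is discarded.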
/test-backport-1.12
CI triage:
/test-1.24-net-next
/test-1.16-4.9
I recall seeing the error message
I did a lot of debugging, and I determined that the error message popped up because the endpoint in question (actually the local node's endpoint) changed labels precisely during regeneration. Bad luck. I believe it is safe to ignore this error message in the tests.
Rationale
In the following, the notation SelectorCache > NameManager means that the SelectorCache lock is held while the NameManager lock is being acquired. Holding one lock while acquiring another creates an edge in the graph of lock dependencies. If that graph contains a cycle, a wait cycle (deadlock) is possible.
In this case, the wait cycle in question is the following:
IPCache > EP > NameManager > IPCache
The IPCache > EP link stems from the following stack. Note that EP here stands for any endpoint, as they are all locked in sequence. I believe this is related to replacing the kube API server.
The next link in the chain, EP > NameManager, occurs on the EP regeneration path when dealing with FQDN selectors. Specifically, the SelectorCache wants to lock the NameManager early (due to lock ordering constraints, cf. 0a07c9b).
Finally, NameManager > IPCache stems from the incremental policy path used by FQDN. UpdateGenerateDNS locks the NameManager for the duration of the method, and calls into IPCache for CIDR allocation.
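To make the reasoning above concrete, the three edges can be fed into a small lock dependency graph and checked for a cycle with a depth-first search. This is only an illustrative sketch: `lockGraph`, `addEdge`, and `findCycle` are hypothetical names, not part of Cilium or of any lock-debugging tool.

```go
package main

import (
	"fmt"
	"strings"
)

// lockGraph maps a held lock to the locks acquired while it is held.
type lockGraph map[string][]string

// addEdge records "held > acquired" in the notation used above.
func (g lockGraph) addEdge(held, acquired string) {
	g[held] = append(g[held], acquired)
}

// findCycle returns one cycle as a list of lock names whose first and last
// elements are equal, or nil if the graph is acyclic.
func (g lockGraph) findCycle() []string {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var path []string

	var visit func(n string) []string
	visit = func(n string) []string {
		state[n] = inProgress
		path = append(path, n)
		for _, next := range g[n] {
			switch state[next] {
			case inProgress:
				// Back edge: next is already on the current path, so the
				// path from next to n plus this edge forms a cycle.
				for i, p := range path {
					if p == next {
						cycle := append([]string{}, path[i:]...)
						return append(cycle, next)
					}
				}
			case unvisited:
				if c := visit(next); c != nil {
					return c
				}
			}
		}
		path = path[:len(path)-1]
		state[n] = done
		return nil
	}

	for n := range g {
		if state[n] == unvisited {
			if c := visit(n); c != nil {
				return c
			}
		}
	}
	return nil
}

func main() {
	g := lockGraph{}
	g.addEdge("IPCache", "EP")
	g.addEdge("EP", "NameManager")
	g.addEdge("NameManager", "IPCache")

	if c := g.findCycle(); c != nil {
		fmt.Println("possible deadlock:", strings.Join(c, " > "))
	}
}
```

Since Go map iteration order is unspecified, the reported cycle may start at any of the three locks. Removing any single edge, for example EP > NameManager, makes `findCycle` return nil in this sketch.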