[v1.13] endpoint: don't hold the endpoint lock while generating policy #26735
Conversation
/test-backport-1.13
Note: a hotfix build has been provided to the end user who initially reported the issue, and they confirmed that it solved their problem.
Travis seems to call out a genuine bug.
@squeed if this isn't a straightforward cherry-pick, could you please ask one of the reviewers of the upstream PR to have a closer look? Thank you!
Force-pushed from 9122c3e to ae337cb.
@julianwiedmann good catch, oops. Fixed that. The backport is, well, medium-trivial: it can't be automated only because of some behavior-unaffecting method signature changes in tests.
Force-pushed from 0d1573e to ad6d081.
/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/78/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.18-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.18-kernel-4.19/78/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.
The test failures are not flakes. Specifically, the error-message checker is complaining that this error message was logged:
This error is not critical; we handle this case in the code, trigger regeneration again, and everything proceeds as expected. I'm going to consider ignoring this error in the tests. I don't immediately understand why this is not an issue in v1.14 and later.
Force-pushed from ad6d081 to b723eec.
OK, added an exception for this error message. I have never seen it happen in v1.14 and later; I'm investigating why that is the case. Requested an endpoint reviewer.
Chatted with @joestringer a bit about this; it seems the error message is coming from regenerating node endpoints (since the node labels, in v1.13, seem to arrive while the node's endpoint is first being regenerated). I'll see if there's a reasonable way to prevent the error from being output.
I wasn't able to find a good way to filter that error message; it's just too far away from its context. I think this PR as written is OK. I'll rebase.
[ upstream commit 3ca309d ] This function is called deep within the policy generation hierarchy and is at risk of causing deadlocks. Given that it's just reading a pointer to a never-mutated map, we can safely stash this behind an atomic Pointer and remove the lock. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
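For illustration, a minimal sketch of that pattern (the `labelCache` type and its names are made up, not the actual Cilium code): the map is stored once behind a `sync/atomic` pointer and read lock-free afterwards.

```go
package example

import "sync/atomic"

// labelCache sketches the pattern: the map is written exactly once, at
// construction, and only ever read afterwards, so readers can do a
// lock-free atomic pointer load instead of taking the endpoint lock.
type labelCache struct {
	labels atomic.Pointer[map[string]string] // never mutated after Store
}

func newLabelCache(m map[string]string) *labelCache {
	c := &labelCache{}
	c.labels.Store(&m)
	return c
}

// Lookup can safely be called from deep inside policy generation
// without holding any other lock, removing the deadlock risk.
func (c *labelCache) Lookup(key string) (string, bool) {
	m := c.labels.Load()
	if m == nil {
		return "", false
	}
	v, ok := (*m)[key]
	return v, ok
}
```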
[ upstream commit e20b16d ] It turns out that most of the endpoint identities, e.g. pod name / namespace, are actually immutable. So, there's no need to grab a lock before reading them. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
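A sketch of why that is safe (hypothetical type, not the real Endpoint struct): fields assigned once at construction and never reassigned can be exposed through plain accessors, while the mutex keeps guarding the mutable state.

```go
package example

import "sync"

// endpointMeta is a stand-in for the endpoint: podName and namespace are
// set in the constructor and never reassigned, so reading them needs no
// locking; mu still protects the mutable fields (not shown).
type endpointMeta struct {
	mu        sync.Mutex
	podName   string // immutable after construction
	namespace string // immutable after construction
}

func (e *endpointMeta) PodName() string   { return e.podName }
func (e *endpointMeta) Namespace() string { return e.namespace }
```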
[ upstream commit b63115b ] These methods are no longer used; remove them from the EndpointInfoSource interface. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
[ upstream commit f048a6a ] As preparation for other refactors of the policy engine, no longer hold the endpoint lock while calculating policy. This is safe to do, since the only input is the endpoint's security identity. Furthermore, if, somehow, policy were to be calculated in parallel, we can reject an update if its revision is too old. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
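A rough sketch of that shape (hypothetical names, not the Cilium implementation): snapshot the identity and revision, compute policy without the lock, and only apply the result if nothing changed underneath.

```go
package example

import (
	"errors"
	"sync"
)

var errStale = errors.New("identity or revision changed during policy calculation")

// policyResult and computePolicy are placeholders for the real policy engine.
type policyResult struct{ allow bool }

func computePolicy(identity uint32) policyResult { return policyResult{allow: identity != 0} }

type endpoint struct {
	mu       sync.Mutex
	identity uint32
	revision uint64
	current  policyResult
}

func (e *endpoint) regeneratePolicy() error {
	// Snapshot the only input (the security identity) and the revision.
	e.mu.Lock()
	id, rev := e.identity, e.revision
	e.mu.Unlock()

	// Potentially slow work, done without holding the endpoint lock.
	result := computePolicy(id)

	e.mu.Lock()
	defer e.mu.Unlock()
	if e.identity != id || e.revision != rev {
		// Something changed while we were computing; drop the stale
		// result and let the caller retry.
		return errStale
	}
	e.current = result
	e.revision++
	return nil
}
```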
[ upstream commit 9623641 ] This ensures the generated ID works like normally allocated IDs, i.e. its LabelArray is set. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
[ upstream commit 8e163a9 ] This adds a small test that ensures incremental updates are never lost, even in the face of significant identity churn. It simulates a churning ipcache flinging identities in to the policy engine, and similarly recalculates policy constantly. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
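The test described is roughly of this shape (a sketch with made-up names, not the committed test): one goroutine churns identities in, the main loop keeps recalculating, and at the end every injected identity must show up in the final result.

```go
package example

import (
	"sync"
	"testing"
)

func TestNoLostIncrementalUpdates(t *testing.T) {
	var (
		mu      sync.Mutex
		pending []uint32            // incremental updates queued by the fake ipcache
		seen    = map[uint32]bool{} // identities the fake policy engine has consumed
	)
	const total = 10000
	done := make(chan struct{})

	// Churning "ipcache": flings identities at the policy engine.
	go func() {
		defer close(done)
		for id := uint32(1); id <= total; id++ {
			mu.Lock()
			pending = append(pending, id)
			mu.Unlock()
		}
	}()

	// "Policy engine": constantly recalculates, draining the queue.
	drain := func() {
		mu.Lock()
		defer mu.Unlock()
		for _, id := range pending {
			seen[id] = true
		}
		pending = pending[:0]
	}
	for {
		select {
		case <-done:
			drain() // pick up anything queued after the last pass
			if len(seen) != total {
				t.Fatalf("lost incremental updates: got %d, want %d", len(seen), total)
			}
			return
		default:
			drain()
		}
	}
}
```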
It is possible for an endpoint's identity to change during policy calculation. If this happens, we need to error out and try again. Unfortunately we don't have a good way to prevent this error message, since it is logged far away from its context. So, sadly, we have to ignore it. Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Force-pushed from b723eec to 6016d7f.
/test
/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/112/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.21-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.19/113/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.
/test-1.21-4.19
I manually backported this PR to generate the hotfix build, so I might as well save the backporter some bother and submit it as well.
Once this PR is merged, you can update the PR labels via: