v1.10 backports 2022-04-26 #19574
Conversation
/test-backport-1.10
/ci-gke-1.10
1 similar comment
/ci-gke-1.10
Force-pushed from f657a45 to 15f5494
[ upstream commit b61a347 ]

ipcache SupportDump() and SupportsDelete() open the map to probe for support if the map is not already open, and also schedule the bpf-map-sync-cilium_ipcache controller. If the controller runs before initMaps(), initMaps() fails, because the controller leaves the map open and initMaps() assumes this is not the case.

Solve this by not trying to detect dump support at all, but instead attempting the dump and checking whether it succeeds. This fixes a Cilium agent crash on kernels that do not support ipcache dump operations, when certain Cilium features are enabled on slow machines that caused the scheduled controller to run too soon.

Fixes: 19360
Fixes: 19495

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Cilium Maintainers <maintainer@cilium.io>
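The pattern the commit describes is "try the operation and handle failure" instead of probing for support ahead of time, which avoids opening the map before initMaps(). A minimal Go sketch of that pattern, with hypothetical names (dumpEntries, syncIPCache) standing in for the real Cilium map calls:

```go
package main

import (
	"errors"
	"fmt"
)

// errDumpNotSupported stands in for the error an old kernel would return
// when the ipcache map cannot be dumped. All names here are illustrative,
// not the actual Cilium symbols.
var errDumpNotSupported = errors.New("kernel does not support ipcache dump")

// dumpEntries represents the bpf map dump call that may fail on old kernels.
func dumpEntries() (map[string]string, error) {
	return nil, errDumpNotSupported
}

// syncIPCache attempts the dump directly and falls back when it fails,
// rather than probing for support up front (which required opening the
// map early and racing with initMaps()).
func syncIPCache() error {
	entries, err := dumpEntries()
	if err != nil {
		// Dump unsupported (or failed): fall back to rebuilding the map
		// from the userspace cache instead of crashing the agent.
		fmt.Println("ipcache dump unavailable, falling back:", err)
		return nil
	}
	fmt.Printf("synced %d ipcache entries\n", len(entries))
	return nil
}

func main() {
	if err := syncIPCache(); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```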
Force-pushed from 15f5494 to 1de5aad
/test-backport-1.10

Job 'Cilium-PR-Runtime-net-next' failed.
Job 'Cilium-PR-K8s-1.19-kernel-5.4' failed.
Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed.
Cannot parse JSON error on test-1.19-5.4, restarting
/test-1.19-5.4
/test-1.20-4.19
The runtime test failure is due to this backport not working correctly; I'll have to investigate it. Flipping this to draft in the meantime.
[ upstream commit 2e5f35b ]

Move local identity allocator initialization to NewCachingIdentityAllocator() so that it is initialized when the allocator is returned to the caller. Also create the events channel and start the watcher in NewCachingIdentityAllocator(). Close() will no longer GC the local identity allocator or stop the watcher; now that locally allocated identities are persisted via the bpf ipcache map across restarts, recycling them at runtime via Close() would be inappropriate.

This is then used in daemon bootstrap to restore locally allocated identities before new policies can be received via the Cilium API or k8s API. This fixes the issue where CIDR policies were received from k8s before locally allocated (CIDR) identities were restored, causing the identities derived from the received policy to be newly allocated with different numeric identity values, ultimately causing policy drops during Cilium restart.

Fixes: cilium#19360

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
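The key change is that the constructor returns a fully initialized allocator (events channel created, watcher already running), so the daemon can restore the persisted local identities before any policy is processed. A rough Go sketch of that ordering, using invented types rather than the actual Cilium API:

```go
package main

import "fmt"

type identityEvent struct{ id uint32 }

// CachingIdentityAllocator is a stand-in for the real allocator; the point
// is only the initialization order, not the real data structures.
type CachingIdentityAllocator struct {
	events chan identityEvent
	local  map[string]uint32 // locally allocated (e.g. CIDR) identities
}

// NewCachingIdentityAllocator returns an allocator that is ready for use:
// the events channel exists and the watcher goroutine is already running,
// instead of being wired up later in a separate init step.
func NewCachingIdentityAllocator() *CachingIdentityAllocator {
	a := &CachingIdentityAllocator{
		events: make(chan identityEvent, 16),
		local:  make(map[string]uint32),
	}
	go a.watch() // watcher started here, not during a later init call
	return a
}

func (a *CachingIdentityAllocator) watch() {
	for ev := range a.events {
		fmt.Println("identity event:", ev.id)
	}
}

// RestoreLocalIdentities re-registers identities persisted in the bpf
// ipcache so that later policy updates resolve to the same numeric values.
func (a *CachingIdentityAllocator) RestoreLocalIdentities(persisted map[string]uint32) {
	for cidr, id := range persisted {
		a.local[cidr] = id
	}
}

func main() {
	alloc := NewCachingIdentityAllocator()
	// Restore before any k8s/API policy is processed, so derived identities
	// keep their previous numeric values and traffic is not dropped.
	alloc.RestoreLocalIdentities(map[string]uint32{"10.0.0.0/8": 16777217})
	fmt.Println(alloc.local["10.0.0.0/8"])
}
```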
Force-pushed from 1de5aad to 5a2811b
Had to apply this for the v1.10 backport:

This was not necessary on v1.11 due to the […]
/test-backport-1.10

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed.
/test-1.19-4.9
/test-1.16-netnext
Restarted test-1.19-4.9 and test-1.16-netnext due to unrelated but familiar-looking flakes. The aws-cni test is currently broken.
Why is the policy engine not initialized when something calls into […]? If the above concerns you, we'd just want to ensure that after […].

I agree that replacing the mechanism with policy.Updater in 4881234 changed the calculation here, because the trigger is now initialized earlier in bootstrap, which is probably what ensures that this nil dereference doesn't occur with this patch on newer releases. And I don't think it makes sense to backport that commit/PR, as it's pretty invasive and risky to backport.

On the other hand... if we are now triggering policy updates much earlier on newer versions (successfully), is it at all possible that this early trigger of policy updates could cause the policy to be evaluated before we've fully synced the k8s policy resources? That seems like it could introduce a whole different cause for policy drops during restart.
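For illustration, a minimal Go sketch (hypothetical names, not Cilium code) of the nil-trigger window being discussed: if the policy update trigger is only created late in bootstrap, anything that fires it earlier dereferences nil, which is what initializing the trigger earlier in bootstrap avoids.

```go
package main

import "fmt"

// Trigger is a toy stand-in for a policy update trigger.
type Trigger struct{ fn func() }

// Fire panics if the trigger has not been initialized yet, mirroring the
// nil dereference the review comment refers to.
func (t *Trigger) Fire() {
	if t == nil {
		panic("policy update triggered before trigger was initialized")
	}
	t.fn()
}

// Daemon is a toy stand-in for the agent's bootstrap state.
type Daemon struct{ policyTrigger *Trigger }

func main() {
	d := &Daemon{}

	// Late initialization leaves a window where an early caller would panic:
	// d.policyTrigger.Fire() // nil dereference here

	// Creating the trigger early in bootstrap closes that window; in practice
	// it is still a no-op until endpoints/k8s resources are synced.
	d.policyTrigger = &Trigger{fn: func() { fmt.Println("regenerating policy") }}
	d.policyTrigger.Fire()
}
```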
@jrajahalme pointed out to me out-of-band that these triggers will not find any endpoints until all k8s resources are synced, so my point above is moot. LGTM.
@jrajahalme The PR needs to be rebased; looks like some other v1.10 backport PR was merged in between.
There aren't any conflicts, so we can technically merge this, but it does come with some risk if we think that these changes may cause breakage in combination with the other patches on the v1.10 branch (since we haven't actually tested all of this PR's changes together with all of the recent changes in the upstream branch):
There are some DNS/FQDN code changes going on here, but most of the changes are in helm, CI, docs, and dependencies, which should have no impact. I'm OK with merging this as-is. To be clear, the simple answer to this kind of question is always to force a rebase and re-run the tests, but that takes additional effort and time, and I don't think it is likely to reveal much important information in this case. As the merger, I'll take responsibility to follow up if it happens to break something.
Once this PR is merged, you can update the PR labels via: