
v1.10 backports 2022-04-26 #19574

Merged: 2 commits into cilium:v1.10 on Apr 29, 2022

Conversation

@jrajahalme (Member)

Once this PR is merged, you can update the PR labels via:

$ for pr in 19501 19556; do contrib/backporting/set-labels.py $pr done 1.10; done

@jrajahalme added the kind/backports and backport/1.10 labels Apr 26, 2022
@jrajahalme requested a review from a team as a code owner April 26, 2022 10:13
@jrajahalme marked this pull request as draft April 26, 2022 10:14
@jrajahalme (Member Author)

/test-backport-1.10

@aanm (Member) commented Apr 26, 2022

/ci-gke-1.10

1 similar comment

[ upstream commit b61a347 ]

ipcache SupportsDump() and SupportsDelete() open the map to probe for
support if it is not already open, and also schedule the
bpf-map-sync-cilium_ipcache controller. If the controller runs before
initMaps(), initMaps() fails, as the controller leaves the map open
while initMaps() assumes it is not.

Solve this by not trying to detect dump support up front, but by
attempting the dump and checking whether it succeeds.

This fixes a Cilium agent crash on kernels that do not support ipcache dump
operations, seen when certain Cilium features are enabled on slow machines,
which caused the scheduled controller to run too soon.

Fixes: 19360
Fixes: 19495
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Cilium Maintainers <maintainer@cilium.io>
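
For illustration only, here is a minimal Go sketch of the "attempt the dump instead of probing for support" approach described in the commit message above; the names (dumpIPCache, errDumpNotSupported, syncIPCache) are hypothetical stand-ins, not Cilium's actual ipcache API.

// A hypothetical sketch: try the dump and treat an error as "not supported",
// instead of probing support (and opening the map) ahead of time.
package main

import (
	"errors"
	"fmt"
)

// errDumpNotSupported stands in for the error a kernel without map dump
// support would return.
var errDumpNotSupported = errors.New("ipcache dump not supported")

// dumpIPCache pretends to walk the ipcache BPF map, invoking cb per entry.
// On old kernels it simply fails instead of requiring an up-front probe.
func dumpIPCache(cb func(key, value string)) error {
	// ... a real implementation would iterate the BPF map here ...
	return errDumpNotSupported
}

// syncIPCache tries the dump directly; if the kernel cannot dump the map,
// it skips the sync instead of crashing or leaving the map open early.
func syncIPCache() error {
	err := dumpIPCache(func(key, value string) {
		fmt.Printf("ipcache entry: %s -> %s\n", key, value)
	})
	if errors.Is(err, errDumpNotSupported) {
		fmt.Println("kernel does not support ipcache dump; skipping sync")
		return nil
	}
	return err
}

func main() {
	if err := syncIPCache(); err != nil {
		fmt.Println("ipcache sync failed:", err)
	}
}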
@jrajahalme marked this pull request as ready for review April 27, 2022 19:06
@jrajahalme (Member Author) commented Apr 27, 2022

/test-backport-1.10

Job 'Cilium-PR-Runtime-net-next' failed:

Test Name: RuntimeFQDNPolicies DNS proxy policy works if Cilium stops

Failure Output: FAIL: CIDR identities were not restored correctly: Unified diff:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-Runtime-net-next so I can create one.

Job 'Cilium-PR-K8s-1.19-kernel-5.4' failed:

Test Name: K8sDatapathConfig AutoDirectNodeRoutes Check connectivity with sockops and direct routing

Failure Output: FAIL: Error creating resource /home/jenkins/workspace/Cilium-PR-K8s-1.19-kernel-5.4/src/github.com/cilium/cilium/test/k8sT/manifests/l3-policy-demo.yaml: Cannot retrieve cilium pod cilium-ksvc6 policy revision: cannot get revision from json: could not parse JSON from command "kubectl exec -n kube-system cilium-ksvc6 -- cilium policy get -o json"

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-5.4 so I can create one.

Job 'Cilium-PR-K8s-1.20-kernel-4.19' failed:

Test Name: K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests NodePort with externalTrafficPolicy=Local

Failure Output: FAIL: Request from k8s1 to service http://[fd04::12]:32310 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.20-kernel-4.19 so I can create one.

@jrajahalme (Member Author)

"Cannot parse JSON" error on test-1.19-5.4; restarting.

@jrajahalme (Member Author)

/test-1.19-5.4

@jrajahalme (Member Author)

/test-1.20-4.19

@jrajahalme (Member Author)

The runtime test failure is due to this backport not working correctly; I'll have to investigate it and flip this to draft in the meantime.

@jrajahalme marked this pull request as draft April 28, 2022 04:25
[ upstream commit 2e5f35b ]

Move local identity allocator initialization into
NewCachingIdentityAllocator() so that the allocator is fully initialized
by the time it is returned to the caller. Also create the events channel
and start the watcher in NewCachingIdentityAllocator(). Close() no longer
GCs the local identity allocator or stops the watcher: now that locally
allocated identities are persisted across restarts via the bpf ipcache map,
recycling them at runtime via Close() would be inappropriate.

This is then used in daemon bootstrap to restore locally allocated
identities before new policies can be received via the Cilium API or the
k8s API.

This fixes the issue where CIDR policies were received from k8s before the
locally allocated (CIDR) identities were restored, causing the identities
derived from the received policies to be newly allocated with different
numeric values, ultimately causing policy drops during Cilium restart.

Fixes: cilium#19360
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
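
As a rough illustration of the bootstrap ordering this commit establishes, the sketch below uses simplified, hypothetical types (cachingIdentityAllocator, restoreLocalIdentities), not Cilium's real ones: the constructor returns a ready-to-use allocator, and locally allocated identities are restored before any policy sources are consumed.

package main

import "fmt"

// cachingIdentityAllocator is a stand-in for the real allocator; only the
// shape of the initialization matters here.
type cachingIdentityAllocator struct {
	events          chan string
	localIdentities map[string]int // CIDR -> numeric identity
}

// newCachingIdentityAllocator creates the events channel as part of
// construction, so the returned allocator is immediately usable and no
// separate initialization step is needed for local identities.
func newCachingIdentityAllocator() *cachingIdentityAllocator {
	return &cachingIdentityAllocator{
		events:          make(chan string, 16),
		localIdentities: make(map[string]int),
	}
}

// restoreLocalIdentities re-registers identities persisted across restarts
// (conceptually, the contents of the bpf ipcache map) so that their numeric
// values remain stable.
func (a *cachingIdentityAllocator) restoreLocalIdentities(persisted map[string]int) {
	for cidr, id := range persisted {
		a.localIdentities[cidr] = id
	}
}

func main() {
	alloc := newCachingIdentityAllocator()

	// Restore before any k8s/API policy watchers start, so identities
	// derived from incoming CIDR policies reuse the restored numeric values.
	alloc.restoreLocalIdentities(map[string]int{"10.0.0.0/8": 16777217})
	fmt.Println("restored local identities:", alloc.localIdentities)

	// ... only after this point would the daemon begin receiving policies ...
}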
@jrajahalme (Member Author) commented Apr 28, 2022

Had to apply this for the v1.10 backport:

diff --git a/daemon/cmd/policy.go b/daemon/cmd/policy.go
index b9f1cbe2d1..cff454b087 100644
--- a/daemon/cmd/policy.go
+++ b/daemon/cmd/policy.go
@@ -81,6 +81,11 @@ func (d *Daemon) policyUpdateTrigger(reasons []string) {
 // in agent configuration, changes in endpoint labels, and change of security
 // identities.
 func (d *Daemon) TriggerPolicyUpdates(force bool, reason string) {
+	// CIDR identity restoration may trigger early when endpoints can not yet be regenerated
+	if d.policyTrigger == nil {
+		return
+	}
+
 	if force {
 		log.Debugf("Artificially increasing policy revision to enforce policy recalculation")
 		d.policy.BumpRevision()

This was not necessary on v1.11 due to the policyTrigger mechanism being replaced with policy.Updater.

@jrajahalme marked this pull request as ready for review April 28, 2022 04:38
@jrajahalme (Member Author) commented Apr 28, 2022

/test-backport-1.10

Job 'Cilium-PR-K8s-1.19-kernel-4.9' failed:

Test Name: K8sServicesTest Checks service across nodes Tests NodePort (kube-proxy)

Failure Output: FAIL: Request from k8s1 to service tftp://[fd04::12]:31408/hello failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.19-kernel-4.9 so I can create one.

@jrajahalme (Member Author)

/test-1.19-4.9

@jrajahalme (Member Author)

/test-1.16-netnext

@jrajahalme (Member Author)

Restarted test-1.19-4.9 and test-1.16-netnext due to unrelated but familiar-looking flakes.

The aws-cni test is currently broken.

@joestringer (Member)

Had to apply this for the v1.10 backport:

diff --git a/daemon/cmd/policy.go b/daemon/cmd/policy.go
index b9f1cbe2d1..cff454b087 100644
--- a/daemon/cmd/policy.go
+++ b/daemon/cmd/policy.go
@@ -81,6 +81,11 @@ func (d *Daemon) policyUpdateTrigger(reasons []string) {
// in agent configuration, changes in endpoint labels, and change of security
// identities.
func (d *Daemon) TriggerPolicyUpdates(force bool, reason string) {
+	// CIDR identity restoration may trigger early when endpoints can not yet be regenerated
+	if d.policyTrigger == nil {
+		return
+	}
+
	if force {
		log.Debugf("Artificially increasing policy revision to enforce policy recalculation")
		d.policy.BumpRevision()

This was not necessary on v1.11 due to the policyTrigger mechanism being replaced with policy.Updater.

Why is the policy engine not initialized when something calls into TriggerPolicyUpdates()? Presumably something that got rearranged is now attempting to trigger policy updates before that code is ready to handle them? If we can guarantee that this logic will be called again soon, then I think we're "OK" here. If it doesn't trigger again then this is potentially avoiding handling the policy update event altogether. Then we'd need another event to come along later to make sure that the trigger eventually happens. On average I guess that's the case, but I just don't know if in some case it could get "stuck" and policy won't get recalculated due to this logic.

If the above concerns you, we'd just want to ensure that after d.policyTrigger is initialized, we then trigger policy updates. This could be a static call at the d.policyTrigger initialization point, or it could be triggered directly after k8s is fully synced, or we could introduce a sync.Once into this TriggerPolicyUpdates case iff d.policyTrigger == nil, then run a goroutine to wait until d.policyTrigger is not nil, and then call the trigger. Alternatively, if you're fairly confident that we'll get & handle another policy update trigger soon anyway then maybe we just don't worry about it.

I agree that replacing the mechanism with policy.Updater in 4881234 changed the calculation here, because the trigger is now initialized earlier in bootstrap, which is probably what ensures that this nil dereference doesn't occur with this patch on newer releases. And I don't think it makes sense to backport that commit/PR, as it's pretty invasive and risky to backport. On the other hand... if we are now triggering policy updates much earlier on newer versions (successfully), is it at all possible that this early trigger of policy updates could cause the policy to be evaluated before we've fully synced the k8s policy resources? That seems like it could introduce a whole different cause for policy drops during restart.
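
To make the deferral idea above concrete, here is a rough, hypothetical sketch of the sync.Once variant Joe mentions, using simplified daemon and trigger types rather than the real Cilium ones: if the trigger is not yet installed, a single goroutine waits for it and then fires the deferred update, so the early call is not silently dropped.

package main

import (
	"fmt"
	"sync"
	"time"
)

type trigger struct{}

func (t *trigger) TriggerWithReason(reason string) {
	fmt.Println("policy recalculation triggered:", reason)
}

type daemon struct {
	mu            sync.Mutex
	policyTrigger *trigger
	deferredOnce  sync.Once
}

func (d *daemon) getTrigger() *trigger {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.policyTrigger
}

// TriggerPolicyUpdates defers the trigger instead of dropping it when the
// policy trigger has not been initialized yet.
func (d *daemon) TriggerPolicyUpdates(reason string) {
	if t := d.getTrigger(); t != nil {
		t.TriggerWithReason(reason)
		return
	}
	// Arrange exactly one catch-up trigger for the pre-initialization case.
	d.deferredOnce.Do(func() {
		go func() {
			for d.getTrigger() == nil { // wait until bootstrap installs it
				time.Sleep(100 * time.Millisecond)
			}
			d.getTrigger().TriggerWithReason("deferred: " + reason)
		}()
	})
}

func main() {
	d := &daemon{}
	d.TriggerPolicyUpdates("early CIDR identity restoration") // too early: deferred

	time.Sleep(200 * time.Millisecond)
	d.mu.Lock()
	d.policyTrigger = &trigger{} // bootstrap installs the trigger later
	d.mu.Unlock()

	time.Sleep(300 * time.Millisecond) // give the deferred goroutine time to fire
}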

@joestringer (Member) left a comment

@jrajahalme pointed out to me out-of-band that these triggers will not find any endpoints until all k8s resources are synced, so my point above is moot. LGTM.

@jrajahalme added the ready-to-merge label Apr 29, 2022
@jrajahalme (Member Author)

@jrajahalme pointed out to me out-of-band that these triggers will not find any endpoints until all k8s resources are synced, so my point above is moot. LGTM.

Labeled as ready-to-merge based on Joe's approval.

@aditighag (Member)

@jrajahalme The PR needs to be rebased; looks like some other v1.10 backport PR was merged in between.

@aditighag added the dont-merge/needs-rebase label Apr 29, 2022
@joestringer (Member) commented Apr 29, 2022

There aren't any conflicts, so we can technically merge this, but it does come with some risk if we think that these changes may cause breakage in combination with the other patches on the v1.10 branch (since we haven't actually tested all of this PR's changes together with all of the recent changes in the upstream branch):

$ git lo HEAD..upstream/v1.10
55e076cebabd (upstream/v1.10) docs: fix version warning URL to point to docs.cilium.io
9e95af3436cf fqdn: Limit the number of zombies per host
955ba8232ff5 fqdn: Refactor zombie sort function
1ee37c1b736c make: check that Go major/minor version matches required version
ee0049f691f5 build(deps): bump docker/setup-buildx-action from 1.6.0 to 1.7.0
fec2ba7f9180 (v1.10) build(deps): bump actions/checkout from 3.0.1 to 3.0.2
28713df2579a docs: Update section title to improve readability
67f0d7723fda pkg/redirectpolicy: Improve error logs
7e425d49b63b docs: improve description for session affinity with KPR
d7ea906209bc pkg/bpf: add map name in error message for OpenParallel
56a1c4566ae6 jenkinsfiles: Increase VM boot timeout
43824d853975 install: Update image digests for v1.10.10
8a77a4a22ab3 (tag: v1.10.10, tag: 1.10.10) Prepare for release v1.10.10
3f3fec152876 build(deps): bump actions/checkout from 3.0.0 to 3.0.1
47e81a6828a6 envoy: Limit accesslog socket permissions
266183d379b1 Add a 'Limitations' section to 'External Workloads'.
9228eed8c470 ci: run documentation workflow on workflow changes
698afebc2d36 ci: pin down image for documentation workflow
adff281aa9fc Use RLock while reading data in DNS proxy
4f5148cf7ee3 wireguard: Reject duplicate public keys
c5fa809f93bb fix(helm): do not override CHECKPOINT_PATH
a77f324fabb8 fix(helm): use a directory to store bootstrap time

There are some DNS/FQDN code changes going on here, but most of the changes are in helm, CI, docs, and dependencies, which should have no impact. I'm OK with merging this as-is. To be clear, the simple answer to this kind of question is always to force a rebase and re-run the tests, but that takes additional effort and time, and I don't think it is likely to reveal much important information in this case. As the merger, I'll take responsibility to follow up if it happens to break something.

@joestringer merged commit f892cde into cilium:v1.10 Apr 29, 2022
@joestringer removed the dont-merge/needs-rebase label Apr 29, 2022
Labels: kind/backports, ready-to-merge