
ci: Restart pods when toggling KPR switch #18031

Merged
merged 1 commit into from Nov 29, 2021

Conversation

@brb (Member) commented Nov 26, 2021

Previously, in the graceful backend termination test we switched to
KPR=disabled without restarting CoreDNS. Before the switch, the
CoreDNS@k8s2 -> kube-apiserver@k8s1 connection was handled by the
socket-lb, so the outgoing packet was $CORE_DNS_IP ->
$KUBE_API_SERVER_NODE_IP and should have been BPF-masqueraded. After the
switch, the BPF masquerading is no longer in place, so packets from
CoreDNS are subject to iptables masquerading (they can either be dropped
by the invalid-state rule or masqueraded to some other port). Combined
with CoreDNS being unable to recover from connectivity errors [1],
CoreDNS could no longer receive updates from the kube-apiserver,
resulting in NXDOMAIN errors for the new service name.

To avoid such flakes, forcefully restart the DNS pods whenever a change
in the KPR setting is detected.

[1]: #18018

Fix #17881
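The logic described above can be sketched as a small shell helper. This is a hypothetical illustration, not the actual CI code (which lives in Cilium's Ginkgo test framework); the function names and the commented-out kubectl command are assumptions:

```shell
#!/bin/sh
# Hypothetical sketch: restart CoreDNS only when the kube-proxy-replacement
# (KPR) setting differs between the previous and the new Cilium deployment.

# Returns success (0) when the KPR setting changed.
kpr_changed() {
    old="$1"; new="$2"
    [ "$old" != "$new" ]
}

# Restart the DNS pods only on a detected KPR change, so stale socket-lb
# translations don't linger in long-lived CoreDNS connections.
restart_dns_if_needed() {
    old="$1"; new="$2"
    if kpr_changed "$old" "$new"; then
        echo "KPR changed ($old -> $new): restarting DNS pods"
        # In a real cluster this could be something like:
        # kubectl -n kube-system delete pods -l k8s-app=kube-dns
    else
        echo "KPR unchanged ($old): keeping DNS pods"
    fi
}

restart_dns_if_needed "strict" "disabled"
restart_dns_if_needed "disabled" "disabled"
```

The point of gating the restart on a detected change (rather than restarting unconditionally) is to keep test runtimes down while still clearing any connections that were established through the old datapath.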

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels Nov 26, 2021
@brb brb requested a review from a team as a code owner November 26, 2021 19:35
@brb brb requested a review from tklauser November 26, 2021 19:35
@brb (Member, Author) commented Nov 26, 2021

/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so it may be related to your PR:


Test Name

K8sHealthTest cilium-health Checks status between nodes

Failure Output

FAIL: Expected

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@brb (Member, Author) commented Nov 27, 2021

test-1.22-4.19

@brb (Member, Author) commented Nov 27, 2021

k8s-1.21-kernel-5.4 hit #17010 (comment)

@brb (Member, Author) commented Nov 27, 2021

test-1.21-5.4

@brb (Member, Author) commented Nov 27, 2021

gke-stable hit #6728

@brb (Member, Author) commented Nov 27, 2021

test-gke

@brb (Member, Author) commented Nov 27, 2021

k8s-1.22-kernel-4.19 hit #18014 (currently investigating)

@brb (Member, Author) commented Nov 27, 2021

test-1.22-4.19

@brb (Member, Author) commented Nov 29, 2021

k8s-1.22-kernel-4.19 hit #18014.

@brb brb added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. needs-backport/1.11 labels Nov 29, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.0 Nov 29, 2021
@qmonnet qmonnet merged commit 06d9441 into master Nov 29, 2021
@qmonnet qmonnet deleted the pr/brb/ci-fix-graceful-termination-flake branch November 29, 2021 16:10
@qmonnet qmonnet mentioned this pull request Nov 30, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.12 Dec 1, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.6 Dec 1, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.0 Dec 1, 2021
@nathanjsweet nathanjsweet added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.11 in 1.11.0 Dec 2, 2021
@joestringer (Member) commented

@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 Dec 6, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.12 Dec 6, 2021
@brb (Member, Author) commented Dec 8, 2021

@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?

IIRC @aanm has already documented it somewhere. Unfortunately, I cannot find it. @aanm ?

@aanm (Member) commented Dec 9, 2021

@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?

IIRC @aanm has already documented it somewhere. Unfortunately, I cannot find it. @aanm ?

@joestringer @brb Here

@joestringer joestringer moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 Dec 10, 2021
@nbusseneau (Member) commented

optionChangeRequiresPodRedeploy does not exist in v1.9 because #16767 was not backported to v1.9, preventing this from being backported in #18147.

Are we hitting this issue in v1.9?

@pchaigno (Member) commented

I think we should backport both. I can't think of a reason why the v1.9 tests wouldn't be affected by the bugs fixed in those two PRs. Note it's a bit hard to check because the issue can manifest in several different tests and it depends, at least partially, on the order in which tests are executed.

If we backport #16767, we'll also need to backport #16835.

@nbusseneau (Member) commented

OK, then I'm adding both of these to the backport PR.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.12 Dec 15, 2021