
ci: Restart pods when toggling KPR switch #18031

Merged
merged 1 commit into from Nov 29, 2021

Conversation

@brb (Member) commented Nov 26, 2021

Previously, in the graceful backend termination test we switched to
KPR=disabled without restarting CoreDNS. Before the switch, the
CoreDNS@k8s2 -> kube-apiserver@k8s1 connection was handled by the
socket-lb, so the outgoing packet was $CORE_DNS_IP ->
$KUBE_API_SERVER_NODE_IP and should have been BPF-masqueraded. After the
switch, the BPF masquerading is no longer in place, so packets from
CoreDNS are subject to iptables masquerading (they can either be dropped
by the invalid-state rule or masqueraded to some other port). Combined
with CoreDNS being unable to recover from connectivity errors [1],
CoreDNS could no longer receive updates from the kube-apiserver,
resulting in NXDOMAIN errors for the new service name.

To avoid such flakes, forcefully restart the DNS pods whenever a change
in the KPR setting is detected.

[1]: #18018

Fix #17881
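The logic described above can be sketched as a small shell helper. This is a hypothetical illustration, not the actual CI code (which lives in Cilium's Ginkgo test framework); the function names and the commented-out kubectl command are assumptions:

```shell
#!/bin/sh
# Hypothetical sketch: restart CoreDNS only when the kube-proxy-replacement
# (KPR) setting differs between the previous and the new Cilium deployment.

# Returns success (0) when the KPR setting changed.
kpr_changed() {
    old="$1"; new="$2"
    [ "$old" != "$new" ]
}

# Restart the DNS pods only on a detected KPR change, so stale socket-lb
# translations don't linger in long-lived CoreDNS connections.
restart_dns_if_needed() {
    old="$1"; new="$2"
    if kpr_changed "$old" "$new"; then
        echo "KPR changed ($old -> $new): restarting DNS pods"
        # In a real cluster this could be something like:
        # kubectl -n kube-system delete pods -l k8s-app=kube-dns
    else
        echo "KPR unchanged ($old): keeping DNS pods"
    fi
}

restart_dns_if_needed "strict" "disabled"
restart_dns_if_needed "disabled" "disabled"
```

The point of gating the restart on a detected change (rather than restarting unconditionally) is to keep test runtimes down while still clearing any connections that were established through the old datapath.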

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels Nov 26, 2021
@brb brb requested a review from a team as a code owner November 26, 2021 19:35
@brb brb requested a review from tklauser November 26, 2021 19:35
@brb (Member, Author) commented Nov 26, 2021

/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so it may be related to your PR:


Test Name

K8sHealthTest cilium-health Checks status between nodes

Failure Output

FAIL: Expected

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@brb (Member, Author) commented Nov 27, 2021

test-1.22-4.19

@brb (Member, Author) commented Nov 27, 2021

k8s-1.21-kernel-5.4 hit #17010 (comment)

@brb (Member, Author) commented Nov 27, 2021

test-1.21-5.4

@brb (Member, Author) commented Nov 27, 2021

gke-stable hit #6728

@brb (Member, Author) commented Nov 27, 2021

test-gke

@brb (Member, Author) commented Nov 27, 2021

k8s-1.22-kernel-4.19 hit #18014 (currently investigating)

@brb (Member, Author) commented Nov 27, 2021

test-1.22-4.19

@brb (Member, Author) commented Nov 29, 2021

k8s-1.22-kernel-4.19 hit #18014.

@brb brb added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. needs-backport/1.11 labels Nov 29, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.11.0 Nov 29, 2021
@qmonnet qmonnet merged commit 06d9441 into master Nov 29, 2021
@qmonnet qmonnet deleted the pr/brb/ci-fix-graceful-termination-flake branch November 29, 2021 16:10
@qmonnet qmonnet mentioned this pull request Nov 30, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.12 Dec 1, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.6 Dec 1, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.0 Dec 1, 2021
@nathanjsweet nathanjsweet added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.11 in 1.11.0 Dec 2, 2021
@joestringer (Member) commented

@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 Dec 6, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.12 Dec 6, 2021
@brb (Member, Author) commented Dec 8, 2021

@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?

IIRC @aanm has already documented it somewhere. Unfortunately, I cannot find it. @aanm ?

@aanm (Member) commented Dec 9, 2021

@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?

IIRC @aanm has already documented it somewhere. Unfortunately, I cannot find it. @aanm ?

@joestringer @brb Here

@joestringer joestringer moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 Dec 10, 2021
@nbusseneau (Member) commented

optionChangeRequiresPodRedeploy does not exist in v1.9 because #16767 was not backported to v1.9, preventing this from being backported in #18147.

Are we hitting this issue in v1.9?

@pchaigno (Member) commented

I think we should backport both. I can't think of a reason why the v1.9 tests wouldn't be affected by the bugs fixed in those two PRs. Note it's a bit hard to check because the issue can manifest in several different tests and it depends, at least partially, on the order in which tests are executed.

If we backport #16767, we'll also need to backport #16835.

@nbusseneau (Member) commented

OK, then I'm adding both of these to the backport PR.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.12 Dec 15, 2021