ci: Restart pods when toggling KPR switch #18031
Conversation
Previously, in the graceful backend termination test we switched to KPR=disabled and we didn't restart CoreDNS. Before the switch, CoreDNS@k8s2 -> kube-apiserver@k8s1 was handled by the socket-lb, so the outgoing packet was $CORE_DNS_IP -> $KUBE_API_SERVER_NODE_IP and was BPF-masqueraded. After the switch, the BPF masquerading is no longer in place, so the packets from CoreDNS are subject to iptables masquerading (they can either be dropped by the invalid rule or masqueraded to some other port). Combined with CoreDNS being unable to recover from connectivity errors [1], CoreDNS was no longer able to receive updates from the kube-apiserver, which resulted in NXDOMAIN errors for the new service name.

To avoid such flakes, forcefully restart the DNS pods whenever a change in the KPR setting is detected.

[1]: #18018

Fix #17881

Signed-off-by: Martynas Pumputis <m@lambda.lt>
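To illustrate the idea, here is a minimal sketch of what "restart the DNS pods if the KPR setting change is detected" amounts to. This is not the actual helper added by this PR; the function name `restartCoreDNSIfKPRChanged`, the use of `kubectl rollout restart`, and the KPR values passed in `main` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os/exec"
)

// restartCoreDNSIfKPRChanged is a hypothetical helper: if the
// kube-proxy-replacement (KPR) setting changed between two Cilium
// deployments, CoreDNS may still hold connections that were translated by
// the old datapath and cannot recover, so the DNS pods are restarted to
// force fresh connections to the kube-apiserver.
func restartCoreDNSIfKPRChanged(prevKPR, newKPR string) error {
	if prevKPR == newKPR {
		// The KPR setting did not change, so DNS traffic is handled the
		// same way as before the redeployment; nothing to do.
		return nil
	}
	// Recreate the CoreDNS pods so they re-establish their watch
	// connections to the kube-apiserver through the new datapath.
	out, err := exec.Command("kubectl", "-n", "kube-system",
		"rollout", "restart", "deployment/coredns").CombinedOutput()
	if err != nil {
		return fmt.Errorf("restarting CoreDNS: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Example: the test suite switched from KPR=strict to KPR=disabled.
	if err := restartCoreDNSIfKPRChanged("strict", "disabled"); err != nil {
		fmt.Println(err)
	}
}
```

Restarting the deployment (rather than waiting for CoreDNS to retry) sidesteps the connectivity-recovery bug tracked in #18018.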
/test

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR.

test-1.22-4.19

test-1.21-5.4

test-gke

test-1.22-4.19
@brb should we document somewhere that users must restart all pods to properly apply that feature if they change the setting?
Are we hitting this issue in v1.9?
I think we should backport both. I can't think of a reason why the v1.9 tests wouldn't be affected by the bugs fixed in those two PRs. Note that it's a bit hard to check, because the issue can manifest in several different tests and it depends, at least partially, on the order in which the tests are executed.
OK, then I'm adding both of these to the backport PR.