
test/contrib: Bump CoreDNS version to 1.8.3 #18018

Merged · 1 commit · Nov 29, 2021

Conversation

@brb (Member) commented on Nov 26, 2021

As reported in [1], Go's HTTP/2 client in Go versions before 1.16 had some
serious bugs which could result in lost connections to kube-apiserver.
Worse, the client could not recover from them.
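
Side note, not something this PR does: Go's HTTP/2 client support can also be
disabled entirely via a GODEBUG knob, which can act as a stopgap for workloads
that cannot yet be rebuilt with a fixed Go version. A rough sketch, assuming a
standard coredns deployment in kube-system:

    # Force plain HTTP/1.1 connections to the kube-apiserver by disabling
    # HTTP/2 in Go's net/http client for the CoreDNS container.
    kubectl -n kube-system set env deployment/coredns GODEBUG=http2client=0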

In the case of CoreDNS, the loss of connectivity to kube-apiserver was not
even logged. I validated this by adding the following rule on the node
running the CoreDNS pod (port 6443, since the socket-lb was doing the
service translation):

    iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP

After upgrading CoreDNS to a version compiled with Go >= 1.16, the pod not
only logged the errors, but was also able to recover from them quickly.
An example of such an error:

    W1126 12:45:08.403311       1 reflector.go:436]
    pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch
    of *v1.Endpoints ended with: an error on the server ("unable to
    decode an event from the watch stream: http2: client connection
    lost") has prevented the request from succeeding

To determine the minimum version to bump to, I used the following:

    for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do
        docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" \
            --version
    done

    CoreDNS-1.7.0
    linux/amd64, go1.14.4, f59c03d
    CoreDNS-1.7.1
    linux/amd64, go1.15.2, aa82ca6
    CoreDNS-1.8.0
    linux/amd64, go1.15.3, 054c9ae
    k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown:
    k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown:
    CoreDNS-1.8.3
    linux/amd64, go1.16, 4293992
    CoreDNS-1.8.4
    linux/amd64, go1.16.4, 053c4d5

Hopefully, the bumped version will fix the CI flakes in which a service
domain name is still not resolvable after 7 minutes. In other words,
CoreDNS cannot resolve the name because it has not received the update
for that service from the kube-apiserver.
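
For reference, a rough sketch of how such a flake can be checked by hand
(k8s-app=kube-dns is the usual CoreDNS pod label, and foo.bar is a
hypothetical service/namespace pair):

    # Look for watch/connection errors in the CoreDNS logs.
    kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100

    # Try to resolve the service name from a throwaway pod.
    kubectl run dns-check --rm -it --restart=Never --image=busybox:1.34 -- \
        nslookup foo.bar.svc.cluster.local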

[1]: kubernetes/kubernetes#87615 (comment)

Fix #17401

@brb added the area/CI (Continuous Integration testing issue or flake) and release-note/ci (This PR makes changes to the CI.) labels on Nov 26, 2021
@brb requested a review from aanm on November 26, 2021 13:05
@brb requested review from a team as code owners on November 26, 2021 13:05
@brb requested a review from tklauser on November 26, 2021 13:05
@aanm (Member) commented on Nov 26, 2021

@brb have you looked at #17489?

@brb (Member, Author) commented on Nov 26, 2021

have you looked at #17489?

@aanm Yep, why? This one is a different bug.

@aanm (Member) commented on Nov 26, 2021

@brb I thought it was the same bug. I mean, it might be the same bug, but the fix is only available in conjunction with both PRs. Nice find. I assume we will need to backport this to all branches?

@brb (Member, Author) commented on Nov 26, 2021

/test

maintainer-s-little-helper bot added this to Needs backport from master in 1.11.0 on Nov 26, 2021
maintainer-s-little-helper bot added this to Needs backport from master in 1.10.6 on Nov 26, 2021
maintainer-s-little-helper bot added this to Needs backport from master in 1.9.12 on Nov 26, 2021
@tklauser (Member) left a comment

Nice find!

tklauser added a commit to cilium/packer-ci-build that referenced this pull request Nov 26, 2021
cilium/cilium#18018 updated the tests in
cilium/cilium to consistently use CoreDNS v1.8.3. Thus we no longer need
to pre-pull any other CoreDNS image versions.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser (Member) commented:

FYI: opened cilium/packer-ci-build#301 to drop all but k8s.gcr.io/coredns/coredns:v1.8.3 from the pre-pulled images in the test VMs.

@pchaigno (Member) commented:

To determine the minimum version to bump to, I used the following:

Naïve question: why can't we simply use the latest version?

@aanm (Member) commented on Nov 26, 2021

To determine the minimum version to bump to, I used the following:

Naïve question: why can't we simply use the latest version?

@pchaigno we should use the same CoreDNS version that is released with each k8s version, as described here. Of course, since that CoreDNS version has bugs, we need to update the version in our CI.
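
For reference, one way to check which CoreDNS image a given Kubernetes release ships with (the Kubernetes version below is only an example):

    # kubeadm lists the default control-plane images, including CoreDNS,
    # for the requested Kubernetes version.
    kubeadm config images list --kubernetes-version v1.21.5 | grep coredns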

@aanm added the dont-merge/needs-rebase (This PR needs to be rebased because it has merge conflicts.) label on Nov 26, 2021
@brb (Member, Author) commented on Nov 26, 2021

CI is failing:

2021-11-26T14:04:29.959324049Z E1126 14:04:29.958463       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
2021-11-26T14:04:31.367303882Z E1126 14:04:31.367133       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
2021-11-26T14:04:33.649383857Z E1126 14:04:33.649116       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
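
This error typically means the system:coredns ClusterRole does not yet grant the discovery.k8s.io/endpointslices permission that newer CoreDNS releases expect. A rough sketch of how the missing rule could be appended by hand (system:coredns is the kubeadm default ClusterRole name; the actual fix in the test manifests may look different):

    # Append list/watch on EndpointSlices to the CoreDNS ClusterRole.
    kubectl patch clusterrole system:coredns --type=json -p='[
      {"op": "add", "path": "/rules/-", "value": {
        "apiGroups": ["discovery.k8s.io"],
        "resources": ["endpointslices"],
        "verbs": ["list", "watch"]
      }}
    ]'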

joestringer pushed a commit that referenced this pull request Dec 2, 2021
[ upstream commit 06d9441 ]

Previously, in the graceful backend termination test we switched to
KPR=disabled and we didn't restart CoreDNS. Before the switch,
CoreDNS@k8s2 -> kube-apiserver@k8s1 was handled by the socket-lb, so the
outgoing packet was $CORE_DNS_IP -> $KUBE_API_SERVER_NODE_IP. The packet
should have been BPF masq-ed. After the switch, the BPF masq is no
longer in place, so the packets from CoreDNS are subject to the
iptables' masquerading (they can be either dropped by the invalid rule
or masqueraded to some other port). Combined with CoreDNS being unable to
recover from connectivity errors [1], CoreDNS could no longer receive
updates from the kube-apiserver, hence NXDOMAIN errors for the new
service name.

To avoid such flakes, forcefully restart the DNS pods if the KPR setting
change is detected.

[1]: #18018

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
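
A rough sketch of what "forcefully restart the DNS pods" could look like (the label selector and deployment name are the usual CoreDNS defaults; the actual helper in the Cilium test suite may differ):

    # Delete the CoreDNS pods so the deployment recreates them with the
    # new KPR setting in effect, then wait for the rollout to settle.
    kubectl -n kube-system delete pod -l k8s-app=kube-dns
    kubectl -n kube-system rollout status deployment/coredns --timeout=2m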
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.0 on Dec 2, 2021
@joestringer added the backport-done/1.11 label and removed the backport-pending/1.11 label on Dec 3, 2021
maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.11 in 1.11.0 on Dec 3, 2021
nathanjsweet pushed a commit that referenced this pull request Dec 6, 2021
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 on Dec 6, 2021
nathanjsweet pushed a commit that referenced this pull request Dec 6, 2021
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.12 on Dec 6, 2021
joestringer pushed a commit that referenced this pull request Dec 10, 2021
maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 on Dec 10, 2021
nbusseneau pushed a commit that referenced this pull request Dec 14, 2021
tklauser pushed a commit that referenced this pull request Dec 15, 2021
maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.12 on Dec 15, 2021
Labels
area/CI: Continuous Integration testing issue or flake
backport-done/1.11: The backport for Cilium 1.11.x for this PR is done.
release-note/ci: This PR makes changes to the CI.
Projects
No open projects
1.10.6
Backport done to v1.10
1.11.0
Backport done to v1.11
1.9.12
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

CI: K8sBookInfoDemoTest Bookinfo Demo Tests bookinfo demo: DNS entry is not ready after timeout
7 participants