
test/contrib: Bump CoreDNS version to 1.8.3 #18018

Merged · 1 commit · Nov 29, 2021

Conversation

@brb (Member) commented on Nov 26, 2021

As reported in [1], Go's HTTP/2 client in Go versions before 1.16 had some
serious bugs which could result in lost connections to kube-apiserver.
Worse, the client could not recover from them.
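
Side note, not something this PR does: Go's HTTP/2 client support can also be
disabled entirely via a GODEBUG knob, which can act as a stopgap for workloads
that cannot yet be rebuilt with a fixed Go version. A rough sketch, assuming a
standard coredns deployment in kube-system:

    # Force plain HTTP/1.1 connections to the kube-apiserver by disabling
    # HTTP/2 in Go's net/http client for the CoreDNS container.
    kubectl -n kube-system set env deployment/coredns GODEBUG=http2client=0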

In the case of CoreDNS, the loss of connectivity to kube-apiserver was not
even logged. I validated this by adding the following rule on the node
running the CoreDNS pod (port 6443, since the socket-lb was doing the
service translation):

    iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP

After upgrading CoreDNS to a version compiled with Go >= 1.16, the pod not
only logged the errors, but was also able to recover from them quickly.
An example of such an error:

    W1126 12:45:08.403311       1 reflector.go:436]
    pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch
    of *v1.Endpoints ended with: an error on the server ("unable to
    decode an event from the watch stream: http2: client connection
    lost") has prevented the request from succeeding

To determine the minimum version to bump to, I used the following:

    for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do
        docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" \
            --version
    done

    CoreDNS-1.7.0
    linux/amd64, go1.14.4, f59c03d
    CoreDNS-1.7.1
    linux/amd64, go1.15.2, aa82ca6
    CoreDNS-1.8.0
    linux/amd64, go1.15.3, 054c9ae
    k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown:
    k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown:
    CoreDNS-1.8.3
    linux/amd64, go1.16, 4293992
    CoreDNS-1.8.4
    linux/amd64, go1.16.4, 053c4d5

Hopefully, the bumped version will fix the CI flakes in which a service
domain name is still not resolvable after 7 minutes. In other words,
CoreDNS cannot resolve the name because it has not received the update
for that service from the kube-apiserver.
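
For reference, a rough sketch of how such a flake can be checked by hand
(k8s-app=kube-dns is the usual CoreDNS pod label, and foo.bar is a
hypothetical service/namespace pair):

    # Look for watch/connection errors in the CoreDNS logs.
    kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100

    # Try to resolve the service name from a throwaway pod.
    kubectl run dns-check --rm -it --restart=Never --image=busybox:1.34 -- \
        nslookup foo.bar.svc.cluster.local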

[1]: kubernetes/kubernetes#87615 (comment)

Fix #17401

@brb added the area/CI (Continuous Integration testing issue or flake) and release-note/ci (This PR makes changes to the CI.) labels on Nov 26, 2021
@brb requested a review from aanm on November 26, 2021 13:05
@brb requested review from a team as code owners on November 26, 2021 13:05
@brb requested a review from tklauser on November 26, 2021 13:05
@aanm (Member) commented on Nov 26, 2021

@brb have you looked at #17489?

@brb (Member, Author) commented on Nov 26, 2021

have you looked at #17489?

@aanm Yep, why? This one is a different bug.

@aanm (Member) commented on Nov 26, 2021

@brb I thought it was the same bug. I mean, it might be the same bug, but the fix is only available in conjunction with both PRs. Nice find. I assume we will need to backport this to all branches?

@brb (Member, Author) commented on Nov 26, 2021

/test

maintainer-s-little-helper bot added this to Needs backport from master in 1.11.0 on Nov 26, 2021
maintainer-s-little-helper bot added this to Needs backport from master in 1.10.6 on Nov 26, 2021
maintainer-s-little-helper bot added this to Needs backport from master in 1.9.12 on Nov 26, 2021
@tklauser (Member) left a comment

Nice find!

tklauser added a commit to cilium/packer-ci-build that referenced this pull request Nov 26, 2021
cilium/cilium#18018 updated the tests in
cilium/cilium to consistently use CoreDNS v1.8.3. Thus we no longer need
to pre-pull any other CoreDNS image versions.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser (Member) commented:

FYI: opened cilium/packer-ci-build#301 to drop all but k8s.gcr.io/coredns/coredns:v1.8.3 from the pre-pulled images in the test VMs.

@pchaigno (Member) commented:

To determine the minimum version to bump to, I used the following:

Naïve question: why can't we simply use the latest version?

@aanm (Member) commented on Nov 26, 2021

To determine the minimum version to bump to, I used the following:

Naïve question: why can't we simply use the latest version?

@pchaigno we should use the same CoreDNS version that is released with each k8s version, as described here. Of course, since that CoreDNS version has bugs, we need to update the version in our CI.
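
For reference, one way to check which CoreDNS image a given Kubernetes release ships with (the Kubernetes version below is only an example):

    # kubeadm lists the default control-plane images, including CoreDNS,
    # for the requested Kubernetes version.
    kubeadm config images list --kubernetes-version v1.21.5 | grep coredns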

@aanm added the dont-merge/needs-rebase (This PR needs to be rebased because it has merge conflicts.) label on Nov 26, 2021
@brb (Member, Author) commented on Nov 26, 2021

CI is failing:

2021-11-26T14:04:29.959324049Z E1126 14:04:29.958463       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
2021-11-26T14:04:31.367303882Z E1126 14:04:31.367133       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
2021-11-26T14:04:33.649383857Z E1126 14:04:33.649116       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
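
This error typically means the system:coredns ClusterRole does not yet grant the discovery.k8s.io/endpointslices permission that newer CoreDNS releases expect. A rough sketch of how the missing rule could be appended by hand (system:coredns is the kubeadm default ClusterRole name; the actual fix in the test manifests may look different):

    # Append list/watch on EndpointSlices to the CoreDNS ClusterRole.
    kubectl patch clusterrole system:coredns --type=json -p='[
      {"op": "add", "path": "/rules/-", "value": {
        "apiGroups": ["discovery.k8s.io"],
        "resources": ["endpointslices"],
        "verbs": ["list", "watch"]
      }}
    ]'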

joestringer pushed a commit that referenced this pull request Dec 2, 2021
[ upstream commit 06d9441 ]

Previously, in the graceful backend termination test we switched to
KPR=disabled and we didn't restart CoreDNS. Before the switch,
CoreDNS@k8s2 -> kube-apiserver@k8s1 was handled by the socket-lb, so the
outgoing packet was $CORE_DNS_IP -> $KUBE_API_SERVER_NODE_IP. The packet
should have been BPF masq-ed. After the switch, the BPF masq is no
longer in place, so the packets from CoreDNS are subject to the
iptables' masquerading (they can be either dropped by the invalid rule
or masqueraded to some other port). Combined with CoreDNS being unable to
recover from connectivity errors [1], CoreDNS could no longer receive
updates from the kube-apiserver, hence NXDOMAIN errors for the new
service name.

To avoid such flakes, forcefully restart the DNS pods if the KPR setting
change is detected.

[1]: #18018

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
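
A rough sketch of what "forcefully restart the DNS pods" could look like (the label selector and deployment name are the usual CoreDNS defaults; the actual helper in the Cilium test suite may differ):

    # Delete the CoreDNS pods so the deployment recreates them with the
    # new KPR setting in effect, then wait for the rollout to settle.
    kubectl -n kube-system delete pod -l k8s-app=kube-dns
    kubectl -n kube-system rollout status deployment/coredns --timeout=2m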
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.0 on Dec 2, 2021
@joestringer added the backport-done/1.11 label and removed the backport-pending/1.11 label on Dec 3, 2021
maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.11 in 1.11.0 on Dec 3, 2021
nathanjsweet pushed a commit that referenced this pull request Dec 6, 2021
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 on Dec 6, 2021
nathanjsweet pushed a commit that referenced this pull request Dec 6, 2021
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.12 on Dec 6, 2021
joestringer pushed a commit that referenced this pull request Dec 10, 2021
maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 on Dec 10, 2021
nbusseneau pushed a commit that referenced this pull request Dec 14, 2021
tklauser pushed a commit that referenced this pull request Dec 15, 2021
maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.12 on Dec 15, 2021
Labels
area/CI: Continuous Integration testing issue or flake
backport-done/1.11: The backport for Cilium 1.11.x for this PR is done.
release-note/ci: This PR makes changes to the CI.
Projects
No open projects
1.10.6
Backport done to v1.10
1.11.0
Backport done to v1.11
1.9.12
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

CI: K8sBookInfoDemoTest Bookinfo Demo Tests bookinfo demo: DNS entry is not ready after timeout
7 participants