
v1.11 backports 2021-12-02 #18109

Merged 6 commits into v1.11 from pr/nathanjsweet/v1.11-backport-2021-12-02 on Dec 3, 2021

Conversation

@nathanjsweet (Member) commented Dec 2, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 18018 18087 18104 18091 18092; do contrib/backporting/set-labels.py $pr done 1.11; done

or with

$ make add-label branch=v1.11 issues=18018,18087,18104,18091,18092

brb and others added 4 commits on December 2, 2021 at 17:34
[ upstream commit 398d55c ]

As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which
could result in lost connections to kube-apiserver. Worse than this was
that the client couldn't recover.

In the case of CoreDNS, the loss of connectivity to the kube-apiserver was not even logged. I validated this by adding the following rule on the node which was running the CoreDNS pod (port 6443, since the socket-lb was doing the service translation):

    iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP
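
Once the failure has been observed, the rule can be deleted again to restore connectivity. A minimal sketch, assuming the same $CORE_DNS_POD_IP as above:

    # hypothetical cleanup: remove the matching DROP rule inserted above
    iptables -D FORWARD -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP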

After upgrading CoreDNS to a version compiled with Go >= 1.16, the pod not only logged the errors, but was also able to recover from them quickly. An example of such an error:

    W1126 12:45:08.403311       1 reflector.go:436]
    pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch
    of *v1.Endpoints ended with: an error on the server ("unable to
    decode an event from the watch stream: http2: client connection
    lost") has prevented the request from succeeding

To determine the minimum version bump, I used the following:

    for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do
        docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" \
            --version
    done

    CoreDNS-1.7.0
    linux/amd64, go1.14.4, f59c03d
    CoreDNS-1.7.1
    linux/amd64, go1.15.2, aa82ca6
    CoreDNS-1.8.0
    linux/amd64, go1.15.3, 054c9ae
    k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown:
    k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown:
    CoreDNS-1.8.3
    linux/amd64, go1.16, 4293992
    CoreDNS-1.8.4
    linux/amd64, go1.16.4, 053c4d5

Hopefully, the bumped version will fix the CI flakes in which a service domain name is still not resolvable after 7 minutes. In other words, CoreDNS is unable to resolve the name, which means it hasn't received an update from the kube-apiserver for the service.
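
The check that fails in those flakes is essentially an in-cluster lookup of a service name. A rough sketch of such a lookup (the pod and service names here are placeholders, not taken from the CI suite):

    # resolve a service FQDN from a test pod; this times out while CoreDNS
    # is stuck without updates from the kube-apiserver
    kubectl exec -n default testclient -- \
        nslookup echo-service.default.svc.cluster.local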

[1]: kubernetes/kubernetes#87615 (comment)

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 6c432fb ]

This reverts commit bb6ef27.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 75fbebb ]

Since we only update the Kubernetes version tested in our CI when the first RC is announced, we should use that binary instead of the `.0` one, as the `.0` release is not yet available at the time rc.0 is released.
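
For illustration, fetching the RC binary looks roughly like the following; the version string is a placeholder, but the dl.k8s.io layout is the same for RCs and final releases:

    # download kubectl for the announced RC, since the .0 binary does not
    # exist yet at that point
    curl -LO "https://dl.k8s.io/release/v1.23.0-rc.0/bin/linux/amd64/kubectl"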

Fixes: 6181255 ("test: ensure kubectl version is available for test run")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 854bb86 ]

Commit 398d55c didn't add permissions for the `endpointslices` resource to the CoreDNS `clusterrole` on k8s < 1.20. As a result, CoreDNS deployments failed on these versions with the following error:

`2021-11-30T14:09:43.349414540Z E1130 14:09:43.349292 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope`
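
The missing permission corresponds to an RBAC rule granting read access to EndpointSlices. A minimal sketch of such a rule, applied as a JSON patch (the role name system:coredns is the upstream default and may differ from the manifest this commit touches):

    # append list/watch on discovery.k8s.io endpointslices to the CoreDNS ClusterRole
    kubectl patch clusterrole system:coredns --type=json -p='[
      {"op": "add", "path": "/rules/-", "value": {
        "apiGroups": ["discovery.k8s.io"],
        "resources": ["endpointslices"],
        "verbs": ["list", "watch"]}}
    ]'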

Fixes: 398d55c
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
@nathanjsweet requested review from a team as code owners on December 2, 2021 at 23:41
@maintainer-s-little-helper bot added the labels backport/1.11 (This PR represents a backport for Cilium 1.11.x of a PR that was merged to main) and kind/backports (This PR provides functionality previously merged into master) on Dec 2, 2021
@aanm (Member) left a comment

Looks good for my commits. Thanks.

[ upstream commit 2d7602e ]

See issue 18072 for more details about the flaky test.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 0c7fe95 ]

This test has been flaky for well over a year now, see issue 11560.
Track re-enablement in https://github.com/cilium/cilium/projects/173

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
@nathanjsweet force-pushed the pr/nathanjsweet/v1.11-backport-2021-12-02 branch from 9fef380 to 0bdabbf on December 2, 2021 at 23:51
@joestringer (Member) left a comment

LGTM, thanks!

@joestringer (Member) commented Dec 3, 2021

/test-backport-1.11

Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Test Name: K8sDemosTest Tests Star Wars Demo

Failure Output: FAIL: Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-4.9 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Test Name: K8sConformance Portmap Chaining Check one node connectivity-check compliance with portmap chaining

Failure Output: FAIL: connectivity-check pods are not ready after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9 so I can create a new GitHub issue to track it.

@aditighag (Member)

LGTM, thanks!

@brb (Member) left a comment

My changes LGTM, thanks!

@pchaigno (Member) left a comment

Reviewed because a review was requested from cilium/ci-structure. I didn't spot anything that should be Cilium-version-dependent in the test changes, so LGTM.
