
v1.11 backports 2021-12-02 #18109

Merged 6 commits into v1.11 from pr/nathanjsweet/v1.11-backport-2021-12-02 on Dec 3, 2021

Conversation

@nathanjsweet (Member) commented Dec 2, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 18018 18087 18104 18091 18092; do contrib/backporting/set-labels.py $pr done 1.11; done

or with

$ make add-label branch=v1.11 issues=18018,18087,18104,18091,18092

brb and others added 4 commits on December 2, 2021 at 17:34
[ upstream commit 398d55c ]

As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which
could result in lost connections to kube-apiserver. Worse than this was
that the client couldn't recover.

In the case of CoreDNS, the loss of connectivity to the kube-apiserver was not even logged. I validated this by adding the following rule on the node which was running the CoreDNS pod (port 6443, since the socket-lb was doing the service translation):

    iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP
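
Once the failure has been observed, the rule can be deleted again to restore connectivity. A minimal sketch, assuming the same $CORE_DNS_POD_IP as above:

    # hypothetical cleanup: remove the matching DROP rule inserted above
    iptables -D FORWARD -m tcp --proto tcp --src $CORE_DNS_POD_IP \
        --dport=6443 -j DROP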

After upgrading CoreDNS to a version compiled with Go >= 1.16, the pod not only logged the errors, but was also able to recover from them quickly. An example of such an error:

    W1126 12:45:08.403311       1 reflector.go:436]
    pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch
    of *v1.Endpoints ended with: an error on the server ("unable to
    decode an event from the watch stream: http2: client connection
    lost") has prevented the request from succeeding

To determine the minimum version bump, I used the following:

    for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do
        docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" \
            --version
    done

    CoreDNS-1.7.0
    linux/amd64, go1.14.4, f59c03d
    CoreDNS-1.7.1
    linux/amd64, go1.15.2, aa82ca6
    CoreDNS-1.8.0
    linux/amd64, go1.15.3, 054c9ae
    k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown:
    k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown:
    CoreDNS-1.8.3
    linux/amd64, go1.16, 4293992
    CoreDNS-1.8.4
    linux/amd64, go1.16.4, 053c4d5

Hopefully, the bumped version will fix the CI flakes in which a service domain name is still not resolvable after 7 minutes. In other words, CoreDNS is unable to resolve the name, which means it hasn't received an update from the kube-apiserver for the service.
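
The check that fails in those flakes is essentially an in-cluster lookup of a service name. A rough sketch of such a lookup (the pod and service names here are placeholders, not taken from the CI suite):

    # resolve a service FQDN from a test pod; this times out while CoreDNS
    # is stuck without updates from the kube-apiserver
    kubectl exec -n default testclient -- \
        nslookup echo-service.default.svc.cluster.local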

[1]: kubernetes/kubernetes#87615 (comment)

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 6c432fb ]

This reverts commit bb6ef27.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 75fbebb ]

Since we only update the Kubernetes version tested in our CI when the first RC is announced, we should use that binary instead of the `.0` one, as the `.0` release is not yet available at the time rc.0 is released.
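
For illustration, fetching the RC binary looks roughly like the following; the version string is a placeholder, but the dl.k8s.io layout is the same for RCs and final releases:

    # download kubectl for the announced RC, since the .0 binary does not
    # exist yet at that point
    curl -LO "https://dl.k8s.io/release/v1.23.0-rc.0/bin/linux/amd64/kubectl"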

Fixes: 6181255 ("test: ensure kubectl version is available for test run")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 854bb86 ]

Commit 398d55c didn't add permissions for the `endpointslices` resource to the CoreDNS `clusterrole` on k8s < 1.20. As a result, CoreDNS deployments failed on these versions with the following error:

`2021-11-30T14:09:43.349414540Z E1130 14:09:43.349292 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope`
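
The missing permission corresponds to an RBAC rule granting read access to EndpointSlices. A minimal sketch of such a rule, applied as a JSON patch (the role name system:coredns is the upstream default and may differ from the manifest this commit touches):

    # append list/watch on discovery.k8s.io endpointslices to the CoreDNS ClusterRole
    kubectl patch clusterrole system:coredns --type=json -p='[
      {"op": "add", "path": "/rules/-", "value": {
        "apiGroups": ["discovery.k8s.io"],
        "resources": ["endpointslices"],
        "verbs": ["list", "watch"]}}
    ]'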

Fixes: 398d55c
Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
@nathanjsweet requested review from a team as code owners on December 2, 2021 at 23:41
@maintainer-s-little-helper bot added the labels backport/1.11 (This PR represents a backport for Cilium 1.11.x of a PR that was merged to main) and kind/backports (This PR provides functionality previously merged into master) on Dec 2, 2021
@aanm (Member) left a comment

Looks good for my commits. Thanks.

[ upstream commit 2d7602e ]

See issue 18072 for more details about the flaky test.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 0c7fe95 ]

This test has been flaky for well over a year now, see issue 11560.
Track re-enablement in https://github.com/cilium/cilium/projects/173

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
@nathanjsweet force-pushed the pr/nathanjsweet/v1.11-backport-2021-12-02 branch from 9fef380 to 0bdabbf on December 2, 2021 at 23:51
@joestringer (Member) left a comment

LGTM, thanks!

@joestringer (Member) commented Dec 3, 2021

/test-backport-1.11

Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Test Name: K8sDemosTest Tests Star Wars Demo

Failure Output: FAIL: Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-4.9 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so may be related to your PR:

Test Name: K8sConformance Portmap Chaining Check one node connectivity-check compliance with portmap chaining

Failure Output: FAIL: connectivity-check pods are not ready after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.17-kernel-4.9 so I can create a new GitHub issue to track it.

@aditighag (Member)

LGTM, thanks!

@brb (Member) left a comment

My changes LGTM, thanks!

@pchaigno (Member) left a comment

Reviewed because a review was requested from cilium/ci-structure. I didn't spot anything that should be Cilium-version-dependent in the test changes, so LGTM.
