v1.11 backports 2021-12-02 #18109
Conversation
[ upstream commit 398d55c ]

As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which could result in lost connections to kube-apiserver. Worse, the client couldn't recover. In the case of CoreDNS, the loss of connectivity to kube-apiserver was not even logged. I validated this by adding the following rule on the node which was running the CoreDNS pod (port 6443, as the socket-lb was doing the service xlation):

```
iptables -I FORWARD 1 -m tcp --proto tcp --src $CORE_DNS_POD_IP \
    --dport=6443 -j DROP
```

After upgrading CoreDNS to a version compiled with Go >= 1.16, the pod not only logged the errors, but also recovered from them quickly. An example of such an error:

```
W1126 12:45:08.403311       1 reflector.go:436] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
```

To determine the minimum version bump, I used the following:

```
for i in 1.7.0 1.7.1 1.8.0 1.8.1 1.8.2 1.8.3 1.8.4; do
    docker run --rm -ti "k8s.gcr.io/coredns/coredns:v$i" --version
done

CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
CoreDNS-1.7.1
linux/amd64, go1.15.2, aa82ca6
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae
k8s.gcr.io/coredns/coredns:v1.8.1 not found: manifest unknown:
k8s.gcr.io/coredns/coredns:v1.8.2 not found: manifest unknown:
CoreDNS-1.8.3
linux/amd64, go1.16, 4293992
CoreDNS-1.8.4
linux/amd64, go1.16.4, 053c4d5
```

Hopefully, the bumped version will fix the CI flakes in which a service domain name is not available after 7 min. In other words, CoreDNS is not able to resolve the name, which means that it hasn't received an update from kube-apiserver for the service.

[1]: kubernetes/kubernetes#87615 (comment)

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 75fbebb ]

Since we only update the Kubernetes version tested in our CI when the first RC is announced, we should use that RC binary instead of the `.0` one, as the `.0` release is not yet available at the time rc.0 is released.

Fixes: 6181255 ("test: ensure kubectl version is available for test run")

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
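For illustration, fetching an RC kubectl binary follows the usual release URL layout on `dl.k8s.io`; the version string and variable name below are hypothetical examples, not taken from the Cilium CI scripts:

```shell
#!/bin/sh
# Sketch: build the download URL for an RC kubectl binary instead of
# the not-yet-published .0 release. K8S_VERSION is a placeholder value.
K8S_VERSION="v1.23.0-rc.0"
URL="https://dl.k8s.io/release/${K8S_VERSION}/bin/linux/amd64/kubectl"
echo "${URL}"
# A CI script would then fetch it, e.g.: curl -LO "${URL}"
```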
[ upstream commit 854bb86 ]

Commit 398d55c didn't add permissions for the `endpointslices` resource to the coredns `clusterrole` on k8s < 1.20. As a result, CoreDNS deployments failed on these versions with the error:

```
2021-11-30T14:09:43.349414540Z E1130 14:09:43.349292       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
```

Fixes: 398d55c

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
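For reference, the missing permission corresponds to an RBAC rule along these lines in the coredns ClusterRole (a sketch of the kind of rule the fix adds; the exact manifest in the repo may differ):

```yaml
# Sketch: RBAC rule letting CoreDNS list/watch EndpointSlices,
# needed on clusters where discovery.k8s.io serves v1beta1 (k8s < 1.20).
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
```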
Looks good for my commits. Thanks
[ upstream commit 2d7602e ] See issue 18072 for more details about the flaky test. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
[ upstream commit 0c7fe95 ] This test has been flaky for well over a year now, see issue 11560. Track re-enablement in https://github.com/cilium/cilium/projects/173 Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: nathanjsweet <nathanjsweet@pm.me>
Force-pushed from 9fef380 to 0bdabbf.
LGTM thanks!
/test-backport-1.11

Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed and has not been observed before, so it may be related to your PR.

Job 'Cilium-PR-K8s-1.17-kernel-4.9' failed and has not been observed before, so it may be related to your PR.

If it is a flake, comment
LGTM, thanks!
My changes LGTM, thanks!
Reviewed since a review was requested from cilium/ci-structure. I didn't spot anything Cilium-version-dependent in the test changes, so LGTM.