CI: k8s v1.19 or older: Kubernetes DNS did not become ready in time (all tests) #18086

Closed
joestringer opened this issue Dec 1, 2021 · 7 comments · Fixed by #18104
@joestringer
Member

PR #18018 appears to have broken DNS provisioning in CI on master for k8s versions < 1.20.

Examples:
https://jenkins.cilium.io/job/cilium-master-k8s-1.19-kernel-4.9/247/execution/node/130/log/?consoleFull
https://jenkins.cilium.io/job/cilium-master-k8s-1.17-kernel-4.9/288/execution/node/130/log/?consoleFull

@joestringer joestringer added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Dec 1, 2021
@joestringer joestringer added this to To quarantine in 1.11 CI via automation Dec 1, 2021
@joestringer joestringer moved this from To quarantine to Unassigned in 1.11 CI Dec 1, 2021
@joestringer joestringer assigned brb and aanm and unassigned aanm Dec 1, 2021
@joestringer joestringer changed the title CI: v1.19: Kubernetes DNS did not become ready in time (all tests) CI: k8s v1.19 or older: Kubernetes DNS did not become ready in time (all tests) Dec 1, 2021
@joestringer
Member Author

I suggest that we either locate and resolve the problem in the short term, or revert the PR on master and propose it again, this time running the k8s 1.19 CI regularly to gain confidence in the changes.

@joestringer joestringer moved this from Unassigned to To triage in 1.11 CI Dec 1, 2021
@aditighag
Member

The CoreDNS deployment is failing because of this error:

2021-11-30T14:09:43.349414540Z E1130 14:09:43.349292 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:coredns" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope

As for why we think the v1.10 branch is failing to provision those DNS manifests: this seems specific to k8s 1.19, which doesn't seem to be run automatically on master.

I think we need to introduce test/provision/manifest/1.19/coredns_deployment.yaml in order to add the necessary resource permissions (endpointslices in this case) to the coredns ClusterRole?
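
A minimal sketch of the kind of rule that appears to be missing, based on the "cannot list endpointslices.discovery.k8s.io" error above. The ClusterRole name and the existing rules follow the kubeadm defaults; the actual fix in #18104 may differ:

```yaml
# Sketch only: coredns ClusterRole with an added rule for EndpointSlices.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:coredns
rules:
  - apiGroups:
      - ""
    resources:
      - endpoints
      - services
      - pods
      - namespaces
    verbs:
      - list
      - watch
  - apiGroups:
      - discovery.k8s.io   # added: allows CoreDNS to list/watch EndpointSlices
    resources:
      - endpointslices
    verbs:
      - list
      - watch
```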

@joestringer
Member Author

@aditighag that may work for 1.19 (assuming we are auto-enabling EndpointSlices on that version), but we will also still need a solution for k8s 1.18 and below, which do not have support for EndpointSlices.

@aditighag
Member

| Feature | Default | Stage | Since | Until |
| --- | --- | --- | --- | --- |
| EndpointSlice | false | Alpha | 1.16 | 1.16 |
| EndpointSlice | false | Beta | 1.17 | 1.17 |
| EndpointSlice | true | Beta | 1.18 | 1.20 |
| EndpointSlice | true | GA | 1.21 | - |

Based on the k8s reference docs, EndpointSlice is supported since 1.16. Also, we enable the feature gate on test clusters from 1.18 onwards.
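
For reference, on versions where the gate is not on by default (1.16/1.17), it would have to be switched on explicitly. A minimal kubeadm sketch of what that looks like; this is an illustration, not the actual test provisioning scripts under test/provision:

```yaml
# Sketch only: explicitly enabling the EndpointSlice feature gate via kubeadm
# for k8s versions where it is off by default (pre-1.18).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "EndpointSlice=true"
controllerManager:
  extraArgs:
    feature-gates: "EndpointSlice=true"
```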

@joestringer
Member Author

joestringer commented Dec 2, 2021

👍 ah great, not sure where I was reading that the minimum support was 1.19. Still, we'll need to figure out a solution down to v1.16 (Cilium v1.10 branch) and down to v1.12 (Cilium v1.9 branch), based on the Cilium CI Matrix.

@joestringer
Member Author

joestringer commented Dec 2, 2021

^^ Related question: is it possible to just add the EndpointSlice support on v1.16, or will that cause other issues given its alpha status?

@joestringer
Member Author

Looking back over the failures from the v1.9 backports, it looks like I was assuming that all k8s versions below 1.20 were affected, but the failures did not occur on k8s 1.15, for example. So @aditighag's proposal above sounds good. It should be applied for all versions that have EndpointSlices enabled, on master first and then backported to all branches along with the original PR.

1.11 CI automation moved this from To triage to Evaluate to exit quarantine Dec 2, 2021
@brb brb assigned aditighag and unassigned brb Dec 3, 2021
@joestringer joestringer moved this from Evaluate to exit quarantine to Done in 1.11 CI Dec 6, 2021