CI: K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications #17590

Closed
ti-mo opened this issue Oct 13, 2021 · 2 comments
Labels
area/CI: Continuous Integration testing issue or flake
ci/flake: This is a known failure that occurs in the tree. Please investigate me!
sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments


ti-mo commented Oct 13, 2021

Test output:

15:20:01 STEP: Performing K8s service preflight check
15:20:03 STEP: Waiting for cilium-operator to be ready
FAIL: unable to retrieve all nodes with 'kubectl get nodes -o json | jq '.items | length'': Exitcode: -1 
Err: signal: killed
Stdout:
 	 2
	 
Stderr:
 	 

=== Test Finished at 2021-10-12T15:20:13Z====
15:20:13 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathConfig
FAIL: Found 1 io.cilium/app=operator logs matching list of errors that must be investigated:
level=error
===================== TEST FAILED =====================
15:20:25 STEP: Running AfterFailed block for EntireTestsuite K8sDatapathConfig
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE           NAME                               READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
	 cilium-monitoring   grafana-5747bcc8f9-ftfkh           1/1     Running   0          2m16s   10.0.1.21       k8s2   <none>           <none>
	 cilium-monitoring   prometheus-655fb888d7-qhrxl        1/1     Running   0          2m16s   10.0.1.146      k8s2   <none>           <none>
	 kube-system         cilium-cqjw7                       1/1     Running   0          64s     192.168.36.12   k8s2   <none>           <none>
	 kube-system         cilium-operator-687c69586d-77fxn   1/1     Running   0          64s     192.168.36.11   k8s1   <none>           <none>
	 kube-system         cilium-operator-687c69586d-xgpj5   0/1     Error     0          64s     192.168.36.12   k8s2   <none>           <none>
...

Haven't seen a Pod in status Error before, but there are operator logs. They indicate:

2021-10-12T15:20:09.775659439Z level=debug msg="Controller func execution time: 1.468µs" name=update-cilium-nodes-pod-cidr subsys=controller uuid=9e125a1b-7b5c-453c-93c9-dfcb51923a65
2021-10-12T15:20:13.244328728Z error retrieving resource lock kube-system/cilium-operator-resource-lock: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock": context deadline exceeded
2021-10-12T15:20:13.244492883Z Failed to release lock: resource name may not be empty
2021-10-12T15:20:13.244701389Z level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Get \"https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock\": context deadline exceeded" subsys=klog
2021-10-12T15:20:13.244716419Z level=info msg="Leader election lost" operator-id=k8s2-xBerCvjWDi subsys=cilium-operator-generic
2021-10-12T15:20:13.244884133Z level=info msg="failed to renew lease kube-system/cilium-operator-resource-lock: timed out waiting for the condition" subsys=klog
2021-10-12T15:20:13.244898894Z level=error msg="Failed to release lock: resource name may not be empty" subsys=klog

Looks like the k8s apiserver became unresponsive and caused a cascading failure: the operator on k8s2 couldn't renew its leader-election lease before the deadline, logged a level=error message (which is what trips the log check above), and exited, leaving its Pod in Error status.
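
For context on that log sequence: the operator's leadership is standard client-go Lease-based leader election on kube-system/cilium-operator-resource-lock (the object named in the logs). The sketch below is not the cilium-operator code, just a minimal client-go example with an illustrative identity and illustrative timing values, showing how a stalled apiserver produces exactly this failure mode: the Lease update misses the renew deadline with "context deadline exceeded", OnStoppedLeading fires ("Leader election lost"), and the process exits.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The same Lease object the operator logs complain about:
	// kube-system/cilium-operator-resource-lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "kube-system",
			Name:      "cilium-operator-resource-lock",
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			// Illustrative identity; the logs above show one of the
			// form "<node>-<random>", e.g. k8s2-xBerCvjWDi.
			Identity: "example-operator-identity",
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		// If the apiserver cannot serve the Lease get/update within this
		// window, renewal fails with "context deadline exceeded" and the
		// candidate gives up leadership.
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Operator work would run here while the lease is held.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Mirrors "Leader election lost" in the logs: exiting here
				// is consistent with the replica on k8s2 showing up in
				// Error status in the pod listing above.
				log.Println("leader election lost, exiting")
				os.Exit(1)
			},
		},
	})
}
```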

Zip: test_results_Cilium-PR-K8s-1.21-kernel-4.9_1603_BDD-Test-PR.zip

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/1603/
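
The earlier preflight failure (the `kubectl get nodes -o json | jq '.items | length'` pipeline ending with "signal: killed" even though it printed 2) would fit the same picture if the apiserver was slow enough for the harness to presumably time the command out. As a rough equivalent, here is a minimal client-go version of that node-count check with an explicit deadline; the kubeconfig handling is illustrative and not how the test framework actually invokes it:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location (illustrative;
	// the real test shells out to kubectl | jq instead).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Bound the call: if the apiserver does not answer in time, this
	// surfaces as a deadline error instead of a node count.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatalf("unable to retrieve all nodes: %v", err)
	}
	fmt.Println(len(nodes.Items)) // the check expects 2 on this two-node cluster
}
```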

ti-mo added the area/CI and ci/flake labels Oct 13, 2021
github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label Feb 22, 2022

brb commented May 6, 2022

Haven't seen this failure in a while. Closing.

brb closed this as completed May 6, 2022
brb added the sig/datapath label May 6, 2022