CI: k8s-1.24-kernel-5.4: multiple test failures due to NotReady node #25811

Closed
giorio94 opened this issue Jun 1, 2023 · 2 comments
Labels: area/CI, ci/flake, stale

giorio94 commented Jun 1, 2023

CI failure

Observed in #25554

Multiple tests failed with:

Timeout while waiting for Cilium to become ready
Expected
    <*errors.errorString | 0xc000505690>:
    only 1 of 2 desired pods are ready
    {
        s: "only 1 of 2 desired pods are ready",
    }
to be nil

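The "only 1 of 2 desired pods are ready" message refers to the cilium DaemonSet (one agent pod per node). A quick way to see which agent is stuck when this timeout fires is to query the DaemonSet directly; the commands below are a hypothetical manual check against the test cluster (DaemonSet name cilium and label k8s-app=cilium assumed, as installed by the generated helm manifest; the same information is available in the sysdump):

# List the Cilium agent pods and the node each one is scheduled on.
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Compare desired vs. ready pods for the DaemonSet and inspect the
# events of the pod that is not becoming ready.
kubectl -n kube-system rollout status ds/cilium --timeout=30s
kubectl -n kube-system describe pod -l k8s-app=cilium
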
Looking at one of the sysdumps, it appears that the connectivity from the API server to one of the nodes was lost:

NAME   STATUS     ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION            CONTAINER-RUNTIME
k8s1   Ready      control-plane   15m   v1.24.4   192.168.56.11   <none>        Ubuntu 20.04.6 LTS   5.4.240-0504240-generic   containerd://1.6.4
k8s2   NotReady   <none>          11m   v1.24.4   192.168.56.12   <none>        Ubuntu 20.04.6 LTS   5.4.240-0504240-generic   containerd://1.6.4
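
The NotReady status itself can be confirmed, and roughly time-boxed, with standard kubectl queries; this is a hypothetical manual check (k8s2 being the affected node from the table above), the equivalent data is in the sysdump:

# The Conditions section typically shows "Kubelet stopped posting node status"
# once the API server loses contact with the node's kubelet.
kubectl describe node k8s2

# Node-scoped events help narrow down when connectivity was lost.
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=k8s2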

Likely culprits seem to be either K8sPolicyTestExtended.Validate toEntities KubeAPIServer.Denies connection to KubeAPIServer (during its clean-up phase) or K8sUpdates.Tests upgrade and downgrade from a Cilium stable image to master (the first test that appears to have failed):

16:54:57  K8sPolicyTestExtended Validate toEntities KubeAPIServer 
16:54:57    Denies connection to KubeAPIServer
16:54:57    /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:54:57  14:54:57 STEP: Installing allow-all egress policy
16:55:05  14:55:05 STEP: Installing toEntities KubeAPIServer
16:55:11  14:55:11 STEP: Verifying policy correctness
16:55:11  14:55:11 STEP: Checking ingress connectivity from k8s1 pod to k8s2 pod
16:55:11  14:55:11 STEP: Bypassing check for ingress connectivity for host, which cannot be done in non-managed environments
16:55:11  14:55:11 STEP: Bypassing check for ingress connectivity for remote-node, which cannot be done in a two-node cluster
16:55:11  14:55:11 STEP: Verifying KubeAPIServer connectivity is denied
16:55:17  === Test Finished at 2023-05-31T14:55:16Z====
16:55:17  14:55:16 STEP: Running JustAfterEach block for EntireTestsuite K8sPolicyTestExtended
16:55:17  14:55:16 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:55:17  14:55:16 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTestExtended
16:55:17  14:55:16 STEP: Running AfterEach for block EntireTestsuite
16:55:17  <Checks>
16:55:17  Number of "context deadline exceeded" in logs: 0
16:55:17  Number of "level=error" in logs: 0
16:55:17  Number of "level=warning" in logs: 0
16:55:17  Number of "Cilium API handler panicked" in logs: 0
16:55:17  Number of "Goroutine took lock for more than" in logs: 0
16:55:17  No errors/warnings found in logs
16:55:17  Number of "context deadline exceeded" in logs: 0
16:55:17  Number of "level=error" in logs: 0
16:55:17  Number of "level=warning" in logs: 0
16:55:17  Number of "Cilium API handler panicked" in logs: 0
16:55:17  Number of "Goroutine took lock for more than" in logs: 0
16:55:17  No errors/warnings found in logs
16:55:17  Number of "context deadline exceeded" in logs: 0
16:55:17  Number of "level=error" in logs: 0
16:55:17  Number of "level=warning" in logs: 0
16:55:17  Number of "Cilium API handler panicked" in logs: 0
16:55:17  Number of "Goroutine took lock for more than" in logs: 0
16:55:17  No errors/warnings found in logs
16:55:17  
16:55:17  </Checks>
16:55:17  
16:55:17  14:55:16 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:55:17  14:55:17 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTestExtended
16:55:17  14:55:17 STEP: Removing Cilium installation using generated helm manifest
16:55:18  
16:55:18  • [SLOW TEST:20.971 seconds]
16:55:18  K8sPolicyTestExtended
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
16:55:18    Validate toEntities KubeAPIServer
16:55:18    /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
16:55:18      Denies connection to KubeAPIServer
16:55:18      /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  ------------------------------
16:55:18  K8sDatapathBGPTests
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  === Test Finished at 2023-05-31T14:55:18Z====
16:55:18  14:55:18 STEP: Running AfterEach for block EntireTestsuite
16:55:18  <Checks>
16:55:18  
16:55:18  </Checks>
16:55:18  
16:55:18  
16:55:18  S [SKIPPING] [0.000 seconds]
16:55:18  K8sDatapathBGPTests [It]
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  
16:55:18  skipping due to unmet condition
16:55:18  
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:668
16:55:18  ------------------------------
16:55:18  S
16:55:18  ------------------------------
16:55:18  K8sUpdates 
16:55:18    Tests upgrade and downgrade from a Cilium stable image to master
16:55:18    /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  14:55:18 STEP: Running BeforeAll block for EntireTestsuite K8sUpdates
16:55:18  14:55:18 STEP: Ensuring the namespace kube-system exists
16:55:18  14:55:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
16:55:19  14:55:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
16:55:25  14:55:25 STEP: Waiting for pods to be terminated
16:55:31  14:55:31 STEP: Deleting Cilium and CoreDNS
16:55:31  14:55:31 STEP: Waiting for pods to be terminated
16:55:31  14:55:31 STEP: Cleaning Cilium state (74e365941a7e181332431dbab90a9df1f7b80e84)
16:55:31  14:55:31 STEP: Cleaning up Cilium components
16:55:33  14:55:33 STEP: Waiting for Cilium to become ready
16:59:46  FAIL: Timed out after 240.001s.
16:59:46  Cilium "1.14.0-dev" did not become ready in time
16:59:46  Expected
16:59:46      <*errors.errorString | 0xc00260d150>: 
16:59:46      only 1 of 2 desired pods are ready
16:59:46      {
16:59:46          s: "only 1 of 2 desired pods are ready",
16:59:46      }
16:59:46  to be nil
16:59:46  === Test Finished at 2023-05-31T14:59:33Z====
16:59:46  14:59:33 STEP: Running JustAfterEach block for EntireTestsuite K8sUpdates
17:00:06  ===================== TEST FAILED =====================
17:00:06  15:00:03 STEP: Running AfterFailed block for EntireTestsuite K8sUpdates
17:02:50  cmd: kubectl get pods -o wide --all-namespaces
17:02:50  Exitcode: 0 
17:02:50  Stdout:
17:02:50   	 NAMESPACE           NAME                           READY   STATUS              RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
17:02:50  	 cilium-monitoring   grafana-84476dcf4b-gzcqx       0/1     Running             0          12m    10.0.0.228      k8s1   <none>           <none>
17:02:50  	 cilium-monitoring   prometheus-7dbb447479-fnglj    1/1     Running             0          12m    10.0.0.166      k8s1   <none>           <none>
17:02:50  	 kube-system         cilium-9brnl                   1/1     Running             0          7m4s   192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         cilium-sqxsj                   0/1     Init:4/6            0          7m4s   192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         coredns-6b775575b5-2zhsj       0/1     ContainerCreating   0          75s    <none>          k8s1   <none>           <none>
17:02:50  	 kube-system         coredns-6b775575b5-84m65       1/1     Terminating         0          11m    10.0.1.176      k8s2   <none>           <none>
17:02:50  	 kube-system         etcd-k8s1                      1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-apiserver-k8s1            1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-controller-manager-k8s1   1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-proxy-8b72v               1/1     Running             0          13m    192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         kube-proxy-xch9k               1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-scheduler-k8s1            1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         log-gatherer-2s9dr             1/1     Running             0          13m    192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         log-gatherer-ttdr4             1/1     Running             0          13m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         registry-adder-2j6p4           1/1     Running             0          13m    192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         registry-adder-7c7ds           1/1     Running             0          13m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 
17:02:50  Stderr:
17:02:50   	 
17:02:50  
17:02:50  Fetching command output from pods [cilium-9brnl cilium-sqxsj]
17:08:19  cmd: kubectl exec -n kube-system cilium-9brnl -c cilium-agent -- cilium endpoint list
17:08:19  Exitcode: 1 
17:08:19  Err: exit status 1
17:08:19  Stdout:
17:08:19   	 
17:08:19  Stderr:
17:08:19   	 Error: cannot get endpoint list: Get "http:///var/run/cilium/cilium.sock/v1/endpoint": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
17:08:19  	 Is the agent running?
17:08:19  	 
17:08:19  	 command terminated with exit code 1
17:08:19  	 
17:08:19  
17:08:19  cmd: kubectl exec -n kube-system cilium-sqxsj -c cilium-agent -- cilium endpoint list
17:08:19  Exitcode: 1 
17:08:19  Err: exit status 1
17:08:19  Stdout:
17:08:19   	 
17:08:19  Stderr:
17:08:19   	 Error from server: error dialing backend: dial tcp 192.168.56.12:10250: i/o timeout
17:08:19  	 
17:08:19  
17:08:19  ===================== Exiting AfterFailed =====================
17:08:19  15:08:09 STEP: Running AfterEach for block EntireTestsuite K8sUpdates
17:12:16  15:12:09 STEP: Cleaning up Cilium components
17:16:13  FAIL: terminating containers are not deleted after timeout
17:16:13  Expected
17:16:13      <*fmt.wrapError | 0xc0006eb4e0>: 
17:16:13      Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:16:13      {
17:16:13          msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:16:13          err: <*errors.errorString | 0xc0004d4ef0>{
17:16:13              s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:16:13          },
17:16:13      }
17:16:13  to be nil
17:16:13  15:16:10 STEP: Waiting for Cilium to become ready
17:20:11  FAIL: Timed out after 240.000s.
17:20:11  Cilium "1.14.0-dev" did not become ready in time
17:20:11  Expected
17:20:11      <*errors.errorString | 0xc001a6b850>: 
17:20:11      only 1 of 2 desired pods are ready
17:20:11      {
17:20:11          s: "only 1 of 2 desired pods are ready",
17:20:11      }
17:20:11  to be nil
17:24:23  FAIL: terminating containers are not deleted after timeout
17:24:23  Expected
17:24:23      <*fmt.wrapError | 0xc0007dd4c0>: 
17:24:23      Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:24:23      {
17:24:23          msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:24:23          err: <*errors.errorString | 0xc0006b60f0>{
17:24:23              s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:24:23          },
17:24:23      }
17:24:23  to be nil
17:28:20  FAIL: terminating containers are not deleted after timeout
17:28:20  Expected
17:28:20      <*fmt.wrapError | 0xc000691d80>: 
17:28:20      Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:28:20      {
17:28:20          msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:28:20          err: <*errors.errorString | 0xc000e5dc90>{
17:28:20              s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:28:20          },
17:28:20      }
17:28:20  to be nil
17:28:20  15:28:11 STEP: Running AfterEach for block EntireTestsuite
17:28:20  <Checks>
17:28:20  Number of "context deadline exceeded" in logs: 0
17:28:20  Number of "level=error" in logs: 0
17:28:20  Number of "level=warning" in logs: 0
17:28:20  Number of "Cilium API handler panicked" in logs: 0
17:28:20  Number of "Goroutine took lock for more than" in logs: 0
17:28:20  No errors/warnings found in logs
17:28:20  Number of "context deadline exceeded" in logs: 0
17:28:20  Number of "level=error" in logs: 0
17:28:20  Number of "level=warning" in logs: 0
17:28:20  Number of "Cilium API handler panicked" in logs: 0
17:28:20  Number of "Goroutine took lock for more than" in logs: 0
17:28:20  No errors/warnings found in logs
17:28:20  Number of "context deadline exceeded" in logs: 0
17:28:20  Number of "level=error" in logs: 0
17:28:20  Number of "level=warning" in logs: 0
17:28:20  Number of "Cilium API handler panicked" in logs: 0
17:28:20  Number of "Goroutine took lock for more than" in logs: 0
17:28:20  No errors/warnings found in logs
17:28:20  Cilium pods: [cilium-9brnl cilium-sqxsj]
17:28:20  Netpols loaded: 
17:28:20  CiliumNetworkPolicies loaded: 
17:28:20  Endpoint Policy Enforcement:
17:28:20  Pod   Ingress   Egress
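
The endpoint-list failure above ("dial tcp 192.168.56.12:10250: i/o timeout") shows that the API server could not reach the kubelet on k8s2 (port 10250), which matches the NotReady status. If this flake reoccurs, it may be worth also capturing the kubelet side on the affected VM; a hypothetical manual check on k8s2 (not part of the CI job; the API server address/port is assumed to be 192.168.56.11:6443):

# Is the kubelet still running, and what do its recent logs say?
systemctl status kubelet
journalctl -u kubelet -n 200 --no-pager

# Can the node still reach the API server at all?
curl -k https://192.168.56.11:6443/healthz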

Link: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/283/
Sysdumps: sysdumps.zip (I've dropped the ones for subsequent failures, to reduce the size)

giorio94 added the area/CI and ci/flake labels on Jun 1, 2023

github-actions bot commented Aug 1, 2023

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Aug 1, 2023
github-actions bot commented Aug 16, 2023

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned on Aug 16, 2023