CI: k8s-1.24-kernel-5.4: multiple test failures due to NotReady node #25811

Closed
giorio94 opened this issue Jun 1, 2023 · 2 comments
Labels: area/CI, ci/flake, stale

giorio94 commented Jun 1, 2023

CI failure

Observed in #25554

Multiple tests failed with:

Timeout while waiting for Cilium to become ready
Expected
    <*errors.errorString | 0xc000505690>:
    only 1 of 2 desired pods are ready
    {
        s: "only 1 of 2 desired pods are ready",
    }
to be nil

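The "only 1 of 2 desired pods are ready" message refers to the cilium DaemonSet (one agent pod per node). A quick way to see which agent is stuck when this timeout fires is to query the DaemonSet directly; the commands below are a hypothetical manual check against the test cluster (DaemonSet name cilium and label k8s-app=cilium assumed, as installed by the generated helm manifest; the same information is available in the sysdump):

# List the Cilium agent pods and the node each one is scheduled on.
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Compare desired vs. ready pods for the DaemonSet and inspect the
# events of the pod that is not becoming ready.
kubectl -n kube-system rollout status ds/cilium --timeout=30s
kubectl -n kube-system describe pod -l k8s-app=cilium
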
Looking at one of the sysdumps, it appears that the connectivity from the API server to one of the nodes was lost:

NAME   STATUS     ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION            CONTAINER-RUNTIME
k8s1   Ready      control-plane   15m   v1.24.4   192.168.56.11   <none>        Ubuntu 20.04.6 LTS   5.4.240-0504240-generic   containerd://1.6.4
k8s2   NotReady   <none>          11m   v1.24.4   192.168.56.12   <none>        Ubuntu 20.04.6 LTS   5.4.240-0504240-generic   containerd://1.6.4
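
The NotReady status itself can be confirmed, and roughly time-boxed, with standard kubectl queries; this is a hypothetical manual check (k8s2 being the affected node from the table above), the equivalent data is in the sysdump:

# The Conditions section typically shows "Kubelet stopped posting node status"
# once the API server loses contact with the node's kubelet.
kubectl describe node k8s2

# Node-scoped events help narrow down when connectivity was lost.
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=k8s2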

Likely culprits seem to be either K8sPolicyTestExtended.Validate toEntities KubeAPIServer.Denies connection to KubeAPIServer (during its clean-up phase) or K8sUpdates.Tests upgrade and downgrade from a Cilium stable image to master (the first test that appears to have failed):

16:54:57  K8sPolicyTestExtended Validate toEntities KubeAPIServer 
16:54:57    Denies connection to KubeAPIServer
16:54:57    /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:54:57  14:54:57 STEP: Installing allow-all egress policy
16:55:05  14:55:05 STEP: Installing toEntities KubeAPIServer
16:55:11  14:55:11 STEP: Verifying policy correctness
16:55:11  14:55:11 STEP: Checking ingress connectivity from k8s1 pod to k8s2 pod
16:55:11  14:55:11 STEP: Bypassing check for ingress connectivity for host, which cannot be done in non-managed environments
16:55:11  14:55:11 STEP: Bypassing check for ingress connectivity for remote-node, which cannot be done in a two-node cluster
16:55:11  14:55:11 STEP: Verifying KubeAPIServer connectivity is denied
16:55:17  === Test Finished at 2023-05-31T14:55:16Z====
16:55:17  14:55:16 STEP: Running JustAfterEach block for EntireTestsuite K8sPolicyTestExtended
16:55:17  14:55:16 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:55:17  14:55:16 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTestExtended
16:55:17  14:55:16 STEP: Running AfterEach for block EntireTestsuite
16:55:17  <Checks>
16:55:17  Number of "context deadline exceeded" in logs: 0
16:55:17  Number of "level=error" in logs: 0
16:55:17  Number of "level=warning" in logs: 0
16:55:17  Number of "Cilium API handler panicked" in logs: 0
16:55:17  Number of "Goroutine took lock for more than" in logs: 0
16:55:17  No errors/warnings found in logs
16:55:17  Number of "context deadline exceeded" in logs: 0
16:55:17  Number of "level=error" in logs: 0
16:55:17  Number of "level=warning" in logs: 0
16:55:17  Number of "Cilium API handler panicked" in logs: 0
16:55:17  Number of "Goroutine took lock for more than" in logs: 0
16:55:17  No errors/warnings found in logs
16:55:17  Number of "context deadline exceeded" in logs: 0
16:55:17  Number of "level=error" in logs: 0
16:55:17  Number of "level=warning" in logs: 0
16:55:17  Number of "Cilium API handler panicked" in logs: 0
16:55:17  Number of "Goroutine took lock for more than" in logs: 0
16:55:17  No errors/warnings found in logs
16:55:17  
16:55:17  </Checks>
16:55:17  
16:55:17  14:55:16 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:55:17  14:55:17 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTestExtended
16:55:17  14:55:17 STEP: Removing Cilium installation using generated helm manifest
16:55:18  
16:55:18  • [SLOW TEST:20.971 seconds]
16:55:18  K8sPolicyTestExtended
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
16:55:18    Validate toEntities KubeAPIServer
16:55:18    /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
16:55:18      Denies connection to KubeAPIServer
16:55:18      /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  ------------------------------
16:55:18  K8sDatapathBGPTests
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  === Test Finished at 2023-05-31T14:55:18Z====
16:55:18  14:55:18 STEP: Running AfterEach for block EntireTestsuite
16:55:18  <Checks>
16:55:18  
16:55:18  </Checks>
16:55:18  
16:55:18  
16:55:18  S [SKIPPING] [0.000 seconds]
16:55:18  K8sDatapathBGPTests [It]
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  
16:55:18  skipping due to unmet condition
16:55:18  
16:55:18  /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:668
16:55:18  ------------------------------
16:55:18  S
16:55:18  ------------------------------
16:55:18  K8sUpdates 
16:55:18    Tests upgrade and downgrade from a Cilium stable image to master
16:55:18    /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18  14:55:18 STEP: Running BeforeAll block for EntireTestsuite K8sUpdates
16:55:18  14:55:18 STEP: Ensuring the namespace kube-system exists
16:55:18  14:55:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
16:55:19  14:55:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
16:55:25  14:55:25 STEP: Waiting for pods to be terminated
16:55:31  14:55:31 STEP: Deleting Cilium and CoreDNS
16:55:31  14:55:31 STEP: Waiting for pods to be terminated
16:55:31  14:55:31 STEP: Cleaning Cilium state (74e365941a7e181332431dbab90a9df1f7b80e84)
16:55:31  14:55:31 STEP: Cleaning up Cilium components
16:55:33  14:55:33 STEP: Waiting for Cilium to become ready
16:59:46  FAIL: Timed out after 240.001s.
16:59:46  Cilium "1.14.0-dev" did not become ready in time
16:59:46  Expected
16:59:46      <*errors.errorString | 0xc00260d150>: 
16:59:46      only 1 of 2 desired pods are ready
16:59:46      {
16:59:46          s: "only 1 of 2 desired pods are ready",
16:59:46      }
16:59:46  to be nil
16:59:46  === Test Finished at 2023-05-31T14:59:33Z====
16:59:46  14:59:33 STEP: Running JustAfterEach block for EntireTestsuite K8sUpdates
17:00:06  ===================== TEST FAILED =====================
17:00:06  15:00:03 STEP: Running AfterFailed block for EntireTestsuite K8sUpdates
17:02:50  cmd: kubectl get pods -o wide --all-namespaces
17:02:50  Exitcode: 0 
17:02:50  Stdout:
17:02:50   	 NAMESPACE           NAME                           READY   STATUS              RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
17:02:50  	 cilium-monitoring   grafana-84476dcf4b-gzcqx       0/1     Running             0          12m    10.0.0.228      k8s1   <none>           <none>
17:02:50  	 cilium-monitoring   prometheus-7dbb447479-fnglj    1/1     Running             0          12m    10.0.0.166      k8s1   <none>           <none>
17:02:50  	 kube-system         cilium-9brnl                   1/1     Running             0          7m4s   192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         cilium-sqxsj                   0/1     Init:4/6            0          7m4s   192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         coredns-6b775575b5-2zhsj       0/1     ContainerCreating   0          75s    <none>          k8s1   <none>           <none>
17:02:50  	 kube-system         coredns-6b775575b5-84m65       1/1     Terminating         0          11m    10.0.1.176      k8s2   <none>           <none>
17:02:50  	 kube-system         etcd-k8s1                      1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-apiserver-k8s1            1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-controller-manager-k8s1   1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-proxy-8b72v               1/1     Running             0          13m    192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         kube-proxy-xch9k               1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         kube-scheduler-k8s1            1/1     Running             0          17m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         log-gatherer-2s9dr             1/1     Running             0          13m    192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         log-gatherer-ttdr4             1/1     Running             0          13m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 kube-system         registry-adder-2j6p4           1/1     Running             0          13m    192.168.56.12   k8s2   <none>           <none>
17:02:50  	 kube-system         registry-adder-7c7ds           1/1     Running             0          13m    192.168.56.11   k8s1   <none>           <none>
17:02:50  	 
17:02:50  Stderr:
17:02:50   	 
17:02:50  
17:02:50  Fetching command output from pods [cilium-9brnl cilium-sqxsj]
17:08:19  cmd: kubectl exec -n kube-system cilium-9brnl -c cilium-agent -- cilium endpoint list
17:08:19  Exitcode: 1 
17:08:19  Err: exit status 1
17:08:19  Stdout:
17:08:19   	 
17:08:19  Stderr:
17:08:19   	 Error: cannot get endpoint list: Get "http:///var/run/cilium/cilium.sock/v1/endpoint": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
17:08:19  	 Is the agent running?
17:08:19  	 
17:08:19  	 command terminated with exit code 1
17:08:19  	 
17:08:19  
17:08:19  cmd: kubectl exec -n kube-system cilium-sqxsj -c cilium-agent -- cilium endpoint list
17:08:19  Exitcode: 1 
17:08:19  Err: exit status 1
17:08:19  Stdout:
17:08:19   	 
17:08:19  Stderr:
17:08:19   	 Error from server: error dialing backend: dial tcp 192.168.56.12:10250: i/o timeout
17:08:19  	 
17:08:19  
17:08:19  ===================== Exiting AfterFailed =====================
17:08:19  15:08:09 STEP: Running AfterEach for block EntireTestsuite K8sUpdates
17:12:16  15:12:09 STEP: Cleaning up Cilium components
17:16:13  FAIL: terminating containers are not deleted after timeout
17:16:13  Expected
17:16:13      <*fmt.wrapError | 0xc0006eb4e0>: 
17:16:13      Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:16:13      {
17:16:13          msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:16:13          err: <*errors.errorString | 0xc0004d4ef0>{
17:16:13              s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:16:13          },
17:16:13      }
17:16:13  to be nil
17:16:13  15:16:10 STEP: Waiting for Cilium to become ready
17:20:11  FAIL: Timed out after 240.000s.
17:20:11  Cilium "1.14.0-dev" did not become ready in time
17:20:11  Expected
17:20:11      <*errors.errorString | 0xc001a6b850>: 
17:20:11      only 1 of 2 desired pods are ready
17:20:11      {
17:20:11          s: "only 1 of 2 desired pods are ready",
17:20:11      }
17:20:11  to be nil
17:24:23  FAIL: terminating containers are not deleted after timeout
17:24:23  Expected
17:24:23      <*fmt.wrapError | 0xc0007dd4c0>: 
17:24:23      Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:24:23      {
17:24:23          msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:24:23          err: <*errors.errorString | 0xc0006b60f0>{
17:24:23              s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:24:23          },
17:24:23      }
17:24:23  to be nil
17:28:20  FAIL: terminating containers are not deleted after timeout
17:28:20  Expected
17:28:20      <*fmt.wrapError | 0xc000691d80>: 
17:28:20      Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:28:20      {
17:28:20          msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:28:20          err: <*errors.errorString | 0xc000e5dc90>{
17:28:20              s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:28:20          },
17:28:20      }
17:28:20  to be nil
17:28:20  15:28:11 STEP: Running AfterEach for block EntireTestsuite
17:28:20  <Checks>
17:28:20  Number of "context deadline exceeded" in logs: 0
17:28:20  Number of "level=error" in logs: 0
17:28:20  Number of "level=warning" in logs: 0
17:28:20  Number of "Cilium API handler panicked" in logs: 0
17:28:20  Number of "Goroutine took lock for more than" in logs: 0
17:28:20  No errors/warnings found in logs
17:28:20  Number of "context deadline exceeded" in logs: 0
17:28:20  Number of "level=error" in logs: 0
17:28:20  Number of "level=warning" in logs: 0
17:28:20  Number of "Cilium API handler panicked" in logs: 0
17:28:20  Number of "Goroutine took lock for more than" in logs: 0
17:28:20  No errors/warnings found in logs
17:28:20  Number of "context deadline exceeded" in logs: 0
17:28:20  Number of "level=error" in logs: 0
17:28:20  Number of "level=warning" in logs: 0
17:28:20  Number of "Cilium API handler panicked" in logs: 0
17:28:20  Number of "Goroutine took lock for more than" in logs: 0
17:28:20  No errors/warnings found in logs
17:28:20  Cilium pods: [cilium-9brnl cilium-sqxsj]
17:28:20  Netpols loaded: 
17:28:20  CiliumNetworkPolicies loaded: 
17:28:20  Endpoint Policy Enforcement:
17:28:20  Pod   Ingress   Egress
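
The endpoint-list failure above ("dial tcp 192.168.56.12:10250: i/o timeout") shows that the API server could not reach the kubelet on k8s2 (port 10250), which matches the NotReady status. If this flake reoccurs, it may be worth also capturing the kubelet side on the affected VM; a hypothetical manual check on k8s2 (not part of the CI job; the API server address/port is assumed to be 192.168.56.11:6443):

# Is the kubelet still running, and what do its recent logs say?
systemctl status kubelet
journalctl -u kubelet -n 200 --no-pager

# Can the node still reach the API server at all?
curl -k https://192.168.56.11:6443/healthz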

Link: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/283/
Sysdumps: sysdumps.zip (I've dropped the ones for subsequent failures, to reduce the size)

giorio94 added the area/CI and ci/flake labels on Jun 1, 2023

github-actions bot commented Aug 1, 2023

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Aug 1, 2023
github-actions bot commented Aug 16, 2023

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned on Aug 16, 2023