giorio94 opened this issue on Jun 1, 2023 · 2 comments
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!), stale (The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.)
Timeout while waiting for Cilium to become ready

CI failure

Observed in #25554

Multiple tests failed with:
Expected
<*errors.errorString | 0xc000505690>:
only 1 of 2 desired pods are ready
{
s: "only 1 of 2 desired pods are ready",
}
to be nil
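For context, this message is a Gomega assertion on the error returned by the CI helper that polls the Cilium DaemonSet until all pods are ready. A minimal sketch of that pattern follows; the helper name and polling details are illustrative, not the actual cilium/cilium test code:

```go
package main

import (
	"fmt"
	"time"
)

// waitForCiliumReady is a hypothetical stand-in for the CI helper: it polls
// until the reported number of ready DaemonSet pods matches the desired
// count, or the timeout (240s in this run) expires, returning a descriptive
// error on failure.
func waitForCiliumReady(ready func() (int, int), timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if readyPods, desired := ready(); readyPods == desired {
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	readyPods, desired := ready()
	// This is the string seen in the failure output above.
	return fmt.Errorf("only %d of %d desired pods are ready", readyPods, desired)
}

func main() {
	// The CI run uses a 240s timeout; a short one keeps the sketch quick.
	err := waitForCiliumReady(func() (int, int) { return 1, 2 }, time.Second)
	fmt.Println(err) // Gomega's Expect(err).To(BeNil()) fails on this value
}
```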
Looking at one of the sysdumps, it appears that the connectivity from the API server to one of the nodes was lost:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s1 Ready control-plane 15m v1.24.4 192.168.56.11 <none> Ubuntu 20.04.6 LTS 5.4.240-0504240-generic containerd://1.6.4
k8s2 NotReady <none> 11m v1.24.4 192.168.56.12 <none> Ubuntu 20.04.6 LTS 5.4.240-0504240-generic containerd://1.6.4
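The NotReady status on k8s2 means the node's kubelet stopped posting heartbeats to the API server. A quick way to confirm when contact was lost, sketched with client-go (the node name matches this run; kubeconfig discovery is an assumption):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default path (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	node, err := clientset.CoreV1().Nodes().Get(context.TODO(), "k8s2", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// The Ready condition carries the reason and the last heartbeat time,
	// which pinpoints when the API server lost contact with the kubelet.
	for _, cond := range node.Status.Conditions {
		if cond.Type == "Ready" {
			fmt.Printf("Ready=%s reason=%q lastHeartbeat=%s\n",
				cond.Status, cond.Reason, cond.LastHeartbeatTime)
		}
	}
}
```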
Likely culprits seem to be either K8sPolicyTestExtended.Validate toEntities KubeAPIServer.Denies connection to KubeAPIServer (during the clean-up phase) or Tests upgrade and downgrade from a Cilium stable image to master (the first test which appears to have failed):
16:54:57 K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:54:57 Denies connection to KubeAPIServer
16:54:57 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:54:57 14:54:57 STEP: Installing allow-all egress policy
16:55:05 14:55:05 STEP: Installing toEntities KubeAPIServer
16:55:11 14:55:11 STEP: Verifying policy correctness
16:55:11 14:55:11 STEP: Checking ingress connectivity from k8s1 pod to k8s2 pod
16:55:11 14:55:11 STEP: Bypassing check for ingress connectivity for host, which cannot be done in non-managed environments
16:55:11 14:55:11 STEP: Bypassing check for ingress connectivity for remote-node, which cannot be done in a two-node cluster
16:55:11 14:55:11 STEP: Verifying KubeAPIServer connectivity is denied
16:55:17 === Test Finished at 2023-05-31T14:55:16Z====
16:55:17 14:55:16 STEP: Running JustAfterEach block for EntireTestsuite K8sPolicyTestExtended
16:55:17 14:55:16 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:55:17 14:55:16 STEP: Running AfterEach for block EntireTestsuite K8sPolicyTestExtended
16:55:17 14:55:16 STEP: Running AfterEach for block EntireTestsuite
16:55:17 <Checks>
16:55:17 Number of "context deadline exceeded" in logs: 0
16:55:17 Number of "level=error" in logs: 0
16:55:17 Number of "level=warning" in logs: 0
16:55:17 Number of "Cilium API handler panicked" in logs: 0
16:55:17 Number of "Goroutine took lock for more than" in logs: 0
16:55:17 No errors/warnings found in logs
16:55:17 Number of "context deadline exceeded" in logs: 0
16:55:17 Number of "level=error" in logs: 0
16:55:17 Number of "level=warning" in logs: 0
16:55:17 Number of "Cilium API handler panicked" in logs: 0
16:55:17 Number of "Goroutine took lock for more than" in logs: 0
16:55:17 No errors/warnings found in logs
16:55:17 Number of "context deadline exceeded" in logs: 0
16:55:17 Number of "level=error" in logs: 0
16:55:17 Number of "level=warning" in logs: 0
16:55:17 Number of "Cilium API handler panicked" in logs: 0
16:55:17 Number of "Goroutine took lock for more than" in logs: 0
16:55:17 No errors/warnings found in logs
16:55:17
16:55:17 </Checks>
16:55:17
16:55:17 14:55:16 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTestExtended Validate toEntities KubeAPIServer
16:55:17 14:55:17 STEP: Running AfterAll block for EntireTestsuite K8sPolicyTestExtended
16:55:17 14:55:17 STEP: Removing Cilium installation using generated helm manifest
16:55:18
16:55:18 • [SLOW TEST:20.971 seconds]
16:55:18 K8sPolicyTestExtended
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
16:55:18 Validate toEntities KubeAPIServer
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:461
16:55:18 Denies connection to KubeAPIServer
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18 ------------------------------
16:55:18 K8sDatapathBGPTests
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18 === Test Finished at 2023-05-31T14:55:18Z====
16:55:18 14:55:18 STEP: Running AfterEach for block EntireTestsuite
16:55:18 <Checks>
16:55:18
16:55:18 </Checks>
16:55:18
16:55:18
16:55:18 S [SKIPPING] [0.000 seconds]
16:55:18 K8sDatapathBGPTests [It]
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18
16:55:18 skipping due to unmet condition
16:55:18
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:668
16:55:18 ------------------------------
16:55:18 S
16:55:18 ------------------------------
16:55:18 K8sUpdates
16:55:18 Tests upgrade and downgrade from a Cilium stable image to master
16:55:18 /home/jenkins/workspace/Cilium-PR-K8s-1.24-kernel-5.4/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:515
16:55:18 14:55:18 STEP: Running BeforeAll block for EntireTestsuite K8sUpdates
16:55:18 14:55:18 STEP: Ensuring the namespace kube-system exists
16:55:18 14:55:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
16:55:19 14:55:18 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
16:55:25 14:55:25 STEP: Waiting for pods to be terminated
16:55:31 14:55:31 STEP: Deleting Cilium and CoreDNS
16:55:31 14:55:31 STEP: Waiting for pods to be terminated
16:55:31 14:55:31 STEP: Cleaning Cilium state (74e365941a7e181332431dbab90a9df1f7b80e84)
16:55:31 14:55:31 STEP: Cleaning up Cilium components
16:55:33 14:55:33 STEP: Waiting for Cilium to become ready
16:59:46 FAIL: Timed out after 240.001s.
16:59:46 Cilium "1.14.0-dev" did not become ready in time
16:59:46 Expected
16:59:46 <*errors.errorString | 0xc00260d150>:
16:59:46 only 1 of 2 desired pods are ready
16:59:46 {
16:59:46 s: "only 1 of 2 desired pods are ready",
16:59:46 }
16:59:46 to be nil
16:59:46 === Test Finished at 2023-05-31T14:59:33Z====
16:59:46 14:59:33 STEP: Running JustAfterEach block for EntireTestsuite K8sUpdates
17:00:06 ===================== TEST FAILED =====================
17:00:06 15:00:03 STEP: Running AfterFailed block for EntireTestsuite K8sUpdates
17:02:50 cmd: kubectl get pods -o wide --all-namespaces
17:02:50 Exitcode: 0
17:02:50 Stdout:
17:02:50 NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
17:02:50 cilium-monitoring grafana-84476dcf4b-gzcqx 0/1 Running 0 12m 10.0.0.228 k8s1 <none> <none>
17:02:50 cilium-monitoring prometheus-7dbb447479-fnglj 1/1 Running 0 12m 10.0.0.166 k8s1 <none> <none>
17:02:50 kube-system cilium-9brnl 1/1 Running 0 7m4s 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system cilium-sqxsj 0/1 Init:4/6 0 7m4s 192.168.56.12 k8s2 <none> <none>
17:02:50 kube-system coredns-6b775575b5-2zhsj 0/1 ContainerCreating 0 75s <none> k8s1 <none> <none>
17:02:50 kube-system coredns-6b775575b5-84m65 1/1 Terminating 0 11m 10.0.1.176 k8s2 <none> <none>
17:02:50 kube-system etcd-k8s1 1/1 Running 0 17m 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system kube-apiserver-k8s1 1/1 Running 0 17m 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system kube-controller-manager-k8s1 1/1 Running 0 17m 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system kube-proxy-8b72v 1/1 Running 0 13m 192.168.56.12 k8s2 <none> <none>
17:02:50 kube-system kube-proxy-xch9k 1/1 Running 0 17m 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system kube-scheduler-k8s1 1/1 Running 0 17m 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system log-gatherer-2s9dr 1/1 Running 0 13m 192.168.56.12 k8s2 <none> <none>
17:02:50 kube-system log-gatherer-ttdr4 1/1 Running 0 13m 192.168.56.11 k8s1 <none> <none>
17:02:50 kube-system registry-adder-2j6p4 1/1 Running 0 13m 192.168.56.12 k8s2 <none> <none>
17:02:50 kube-system registry-adder-7c7ds 1/1 Running 0 13m 192.168.56.11 k8s1 <none> <none>
17:02:50
17:02:50 Stderr:
17:02:50
17:02:50
17:02:50 Fetching command output from pods [cilium-9brnl cilium-sqxsj]
17:08:19 cmd: kubectl exec -n kube-system cilium-9brnl -c cilium-agent -- cilium endpoint list
17:08:19 Exitcode: 1
17:08:19 Err: exit status 1
17:08:19 Stdout:
17:08:19
17:08:19 Stderr:
17:08:19 Error: cannot get endpoint list: Get "http:///var/run/cilium/cilium.sock/v1/endpoint": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
17:08:19 Is the agent running?
17:08:19
17:08:19 command terminated with exit code 1
17:08:19
17:08:19
17:08:19 cmd: kubectl exec -n kube-system cilium-sqxsj -c cilium-agent -- cilium endpoint list
17:08:19 Exitcode: 1
17:08:19 Err: exit status 1
17:08:19 Stdout:
17:08:19
17:08:19 Stderr:
17:08:19 Error from server: error dialing backend: dial tcp 192.168.56.12:10250: i/o timeout
17:08:19
17:08:19
17:08:19 ===================== Exiting AfterFailed =====================
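Worth noting: the two exec failures above have different causes. On k8s1 the agent socket is simply gone because Cilium was being uninstalled, while on k8s2 the API server cannot even reach the kubelet (port 10250), consistent with the NotReady status. A small sketch separating the two checks (addresses and paths taken from this run):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// On k8s1: the agent was removed, so its unix socket no longer exists.
	// This reproduces "dial unix ... no such file or directory".
	if _, err := net.Dial("unix", "/var/run/cilium/cilium.sock"); err != nil {
		fmt.Println("agent socket:", err)
	}

	// On k8s2: kubectl exec is proxied by the API server to the kubelet on
	// port 10250; an i/o timeout here means the node itself is unreachable.
	if _, err := net.DialTimeout("tcp", "192.168.56.12:10250", 5*time.Second); err != nil {
		fmt.Println("kubelet port:", err)
	}
}
```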
17:08:19 15:08:09 STEP: Running AfterEach for block EntireTestsuite K8sUpdates
17:12:16 15:12:09 STEP: Cleaning up Cilium components
17:16:13 FAIL: terminating containers are not deleted after timeout
17:16:13 Expected
17:16:13 <*fmt.wrapError | 0xc0006eb4e0>:
17:16:13 Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:16:13 {
17:16:13 msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:16:13 err: <*errors.errorString | 0xc0004d4ef0>{
17:16:13 s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:16:13 },
17:16:13 }
17:16:13 to be nil
17:16:13 15:16:10 STEP: Waiting for Cilium to become ready
17:20:11 FAIL: Timed out after 240.000s.
17:20:11 Cilium "1.14.0-dev" did not become ready in time
17:20:11 Expected
17:20:11 <*errors.errorString | 0xc001a6b850>:
17:20:11 only 1 of 2 desired pods are ready
17:20:11 {
17:20:11 s: "only 1 of 2 desired pods are ready",
17:20:11 }
17:20:11 to be nil
17:24:23 FAIL: terminating containers are not deleted after timeout
17:24:23 Expected
17:24:23 <*fmt.wrapError | 0xc0007dd4c0>:
17:24:23 Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:24:23 {
17:24:23 msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:24:23 err: <*errors.errorString | 0xc0006b60f0>{
17:24:23 s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:24:23 },
17:24:23 }
17:24:23 to be nil
17:28:20 FAIL: terminating containers are not deleted after timeout
17:28:20 Expected
17:28:20 <*fmt.wrapError | 0xc000691d80>:
17:28:20 Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]
17:28:20 {
17:28:20 msg: "Pods are still not deleted after a timeout: 4m0s timeout expired: Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:28:20 err: <*errors.errorString | 0xc000e5dc90>{
17:28:20 s: "Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]",
17:28:20 },
17:28:20 }
17:28:20 to be nil
17:28:20 15:28:11 STEP: Running AfterEach for block EntireTestsuite
17:28:20 <Checks>
17:28:20 Number of "context deadline exceeded" in logs: 0
17:28:20 Number of "level=error" in logs: 0
17:28:20 Number of "level=warning" in logs: 0
17:28:20 Number of "Cilium API handler panicked" in logs: 0
17:28:20 Number of "Goroutine took lock for more than" in logs: 0
17:28:20 No errors/warnings found in logs
17:28:20 Number of "context deadline exceeded" in logs: 0
17:28:20 Number of "level=error" in logs: 0
17:28:20 Number of "level=warning" in logs: 0
17:28:20 Number of "Cilium API handler panicked" in logs: 0
17:28:20 Number of "Goroutine took lock for more than" in logs: 0
17:28:20 No errors/warnings found in logs
17:28:20 Number of "context deadline exceeded" in logs: 0
17:28:20 Number of "level=error" in logs: 0
17:28:20 Number of "level=warning" in logs: 0
17:28:20 Number of "Cilium API handler panicked" in logs: 0
17:28:20 Number of "Goroutine took lock for more than" in logs: 0
17:28:20 No errors/warnings found in logs
17:28:20 Cilium pods: [cilium-9brnl cilium-sqxsj]
17:28:20 Netpols loaded:
17:28:20 CiliumNetworkPolicies loaded:
17:28:20 Endpoint Policy Enforcement:
17:28:20 Pod Ingress Egress
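The repeated "terminating containers are not deleted after timeout" failures come from a poll for pod deletion whose error is wrapped with fmt.Errorf and %w, which is why the dump shows a *fmt.wrapError around a plain errorString. A minimal sketch of that error shape (the messages are copied from this run; the structure around them is illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	// The inner error produced when pods linger past the deadline.
	inner := errors.New("Pods are still terminating: [cilium-sqxsj coredns-6b775575b5-84m65]")

	// Wrapping with %w yields the *fmt.wrapError seen in the failure dump.
	err := fmt.Errorf("Pods are still not deleted after a timeout: %v timeout expired: %w",
		4*time.Minute, inner)

	fmt.Println(err)
	fmt.Println(errors.Unwrap(err) == inner) // true
}
```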
Link: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/283/
Sysdumps: sysdumps.zip (I've dropped the ones for subsequent failures, to reduce the size)