CI: conformance-ginkgo workflow fails at step "Fetch JUnits" #31040

Closed
joestringer opened this issue Feb 28, 2024 · 10 comments
Labels: area/CI, ci/flake, kind/bug/CI

@joestringer (Member):

It is possible for the conformance-ginkgo GitHub workflow to fail at the Fetch JUnits step:

https://github.com/cilium/cilium/actions/runs/8084204673/job/22089004217#step:19:1

Run mkdir -p cilium-junits
  mkdir -p cilium-junits
  cd test/
  junit_filename="E2E Test (1.29, f13-datapath-service-ns-xdp-1).xml"
  for filename in *.xml; do cp "${filename}" "../cilium-junits/${junit_filename}"; done;
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    job_name: E2E Test (1.29, f13-datapath-service-ns-xdp-1)
    QUAY_ORGANIZATION: cilium
    QUAY_ORGANIZATION_DEV: cilium
    QUAY_CHARTS_ORGANIZATION_DEV: cilium-charts-dev
    EGRESS_GATEWAY_HELM_VALUES: --helm-set=egressGateway.enabled=true
    CILIUM_CLI_RELEASE_REPO: cilium/cilium-cli
    CILIUM_CLI_VERSION: v0.15.23
    PUSH_TO_DOCKER_HUB: true
    GCP_PERF_RESULTS_BUCKET: gs://cilium-scale-results
    KIND_VERSION: v0.22.0
    KIND_K8S_IMAGE: kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245
    KIND_K8S_VERSION: v1.29.2
cp: cannot stat '*.xml': No such file or directory
Error: Process completed with exit code 1.

This looks like it may be a race condition: the previous step may not have finished writing the JUnit output to disk before this step gathers the JUnit file and uploads it as an artifact. Can we add an extra check to ensure the files are ready before proceeding to this step?
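
For illustration, a minimal sketch of what such a guard could look like if the step stays plain bash (this is not the current workflow code; the retry count and sleep interval are arbitrary):

```bash
# Hypothetical guard for the "Fetch JUnits" step: wait briefly for the ginkgo
# run to flush its JUnit report, and fail with a clear message if none appears,
# instead of letting the unexpanded '*.xml' glob reach cp.
cd test/
for _ in $(seq 1 30); do
  compgen -G '*.xml' > /dev/null && break
  sleep 2
done
shopt -s nullglob                 # an empty glob now expands to nothing, not the literal pattern
files=(*.xml)
if [ "${#files[@]}" -eq 0 ]; then
  echo "No JUnit XML files found in $(pwd); was the test step interrupted?" >&2
  exit 1
fi
# junit_filename is set earlier in the step, as in the log above.
for filename in "${files[@]}"; do
  cp "${filename}" "../cilium-junits/${junit_filename}"
done
```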

joestringer added the area/CI, ci/flake, and kind/bug/CI labels on Feb 28, 2024
@tommyp1ckles (Contributor) commented Feb 29, 2024:

@joestringer Looking at the run ginkgo step, I see that the process may have been killed (externally?) with "Error: The operation was canceled." Perhaps this is why the files are missing?

@tommyp1ckles (Contributor):

Do we have any other examples of this happening?

@joestringer (Member, Author):

I've seen this failure before, but I've lost the links to those runs. Is there a specific piece of information you'd like to see in a sysdump, or otherwise gathered from the test run, in order to investigate deeper?

One thing that crosses my mind about "The operation was canceled" is whether we're somehow crossing a timeout, and that's what triggers termination/cleanup of a process.

@joestringer (Member, Author):

Oh right, I missed that the actual test run had various failures during my initial investigation. Here's a snippet from the same failure above:

K8sDatapathServicesTest Checks N/S loadbalancing 
  Tests with XDP, direct routing, DSR with Geneve and Maglev
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515
17:28:19 STEP: Installing Cilium
17:28:26 STEP: Waiting for Cilium to become ready
FAIL: Timed out after 240.001s.
Timeout while waiting for Cilium to become ready
Expected
    <*errors.errorString | 0xc0010153c0>: 
    only 1 of 2 desired pods are ready
    {
        s: "only 1 of 2 desired pods are ready",
    }
to be nil
=== Test Finished at 2024-02-28T17:32:26Z====
17:32:26 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathServicesTest
===================== TEST FAILED =====================
17:32:27 STEP: Running AfterFailed block for EntireTestsuite K8sDatapathServicesTest
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE           NAME                                         READY   STATUS        RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
	 cilium-monitoring   grafana-6f4755f98c-qbtf6                     1/1     Running       0          27m     10.0.0.175   kind-control-plane   <none>           <none>
	 cilium-monitoring   prometheus-67fdcf4796-n4b42                  1/1     Running       0          27m     10.0.0.154   kind-control-plane   <none>           <none>
	 default             app1-9fdc9569f-bv8gc                         2/2     Running       0          25m     10.0.0.198   kind-control-plane   <none>           <none>
	 default             app1-9fdc9569f-cn2rs                         2/2     Running       0          25m     10.0.0.135   kind-control-plane   <none>           <none>
	 default             app2-bc9f7bb88-xgll9                         1/1     Running       0          25m     10.0.0.128   kind-control-plane   <none>           <none>
	 default             app3-6798875c47-z64gg                        1/1     Running       0          25m     10.0.0.5     kind-control-plane   <none>           <none>
	 default             echo-656f7b858d-c2d9k                        2/2     Running       0          25m     10.0.0.165   kind-control-plane   <none>           <none>
	 default             echo-656f7b858d-dgvtw                        2/2     Running       0          25m     10.0.1.157   kind-worker          <none>           <none>
	 default             test-k8s2-7b76864867-926k9                   2/2     Running       0          25m     10.0.1.151   kind-worker          <none>           <none>
	 default             testclient-4857h                             1/1     Running       0          25m     10.0.1.246   kind-worker          <none>           <none>
	 default             testclient-vnmtb                             1/1     Running       0          25m     10.0.0.17    kind-control-plane   <none>           <none>
	 default             testds-9p5ht                                 2/2     Running       0          25m     10.0.0.56    kind-control-plane   <none>           <none>
	 default             testds-cxxt7                                 2/2     Running       0          25m     10.0.1.239   kind-worker          <none>           <none>
	 kube-system         cilium-2x9pq                                 0/1     Pending       0          4m25s   <none>       <none>               <none>           <none>
	 kube-system         cilium-node-init-5lksk                       1/1     Running       0          4m25s   172.18.0.4   kind-worker2         <none>           <none>
	 kube-system         cilium-node-init-6ch5m                       1/1     Running       0          4m25s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         cilium-node-init-ll9ll                       1/1     Running       0          4m25s   172.18.0.3   kind-worker          <none>           <none>
	 kube-system         cilium-operator-6c8ccb64bd-6r7hq             1/1     Running       0          4m24s   172.18.0.3   kind-worker          <none>           <none>
	 kube-system         cilium-operator-6c8ccb64bd-7t855             1/1     Running       0          4m24s   172.18.0.4   kind-worker2         <none>           <none>
	 kube-system         cilium-s4gkh                                 0/1     Terminating   0          24m     172.18.0.3   kind-worker          <none>           <none>
	 kube-system         cilium-skndx                                 1/1     Running       0          4m25s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         coredns-76f75df574-ltvqg                     1/1     Running       0          25m     10.0.1.161   kind-worker          <none>           <none>
	 kube-system         coredns-76f75df574-p9zzj                     1/1     Running       0          25m     10.0.1.150   kind-worker          <none>           <none>
	 kube-system         etcd-kind-control-plane                      1/1     Running       0          27m     172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         kube-apiserver-kind-control-plane            1/1     Running       0          27m     172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         kube-controller-manager-kind-control-plane   1/1     Running       0          27m     172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         kube-scheduler-kind-control-plane            1/1     Running       0          27m     172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         log-gatherer-4mcns                           1/1     Running       0          27m     172.18.0.4   kind-worker2         <none>           <none>
	 kube-system         log-gatherer-9tj66                           1/1     Running       0          27m     172.18.0.3   kind-worker          <none>           <none>
	 kube-system         log-gatherer-h5w4k                           1/1     Running       0          27m     172.18.0.2   kind-control-plane   <none>           <none>
	 
Stderr:
 	 

Fetching command output from pods [cilium-2x9pq cilium-s4gkh cilium-skndx]
cmd: kubectl exec -n kube-system cilium-2x9pq -c cilium-agent -- cilium-dbg service list
Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error from server (BadRequest): pod cilium-2x9pq does not have a host assigned
	 

cmd: kubectl exec -n kube-system cilium-2x9pq -c cilium-agent -- cilium-dbg endpoint list
Exitcode: 1 
Err: exit status 1
Stdout:
 	 
Stderr:
 	 Error from server (BadRequest): pod cilium-2x9pq does not have a host assigned
	 

CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
Pod                           Ingress   Egress
test-k8s2-7b76864867-926k9              
testclient-4857h                        
testclient-vnmtb                        
testds-9p5ht                            
coredns-76f75df574-ltvqg                
app2-bc9f7bb88-xgll9                    
echo-656f7b858d-dgvtw                   
app1-9fdc9569f-cn2rs                    
app1-9fdc9569f-bv8gc                    
app3-6798875c47-z64gg                   
echo-656f7b858d-c2d9k                   
coredns-76f75df574-p9zzj                
prometheus-67fdcf4796-n4b42             
testds-cxxt7                            
grafana-6f4755f98c-qbtf6                
Cilium agent 'cilium-skndx': Status: Ok  Health: Ok Nodes "" ContainerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 73 Failed 0

</Checks>

17:33:17 STEP: Running AfterAll block for EntireTestsuite K8sDatapathServicesTest Checks N/S loadbalancing
Error: The operation was canceled.

I see other tests failing earlier with timeouts as well, and the test runs took 31m 34s according to GitHub. The timeout expressed in the workflow is 40m though 🤔 https://github.com/cilium/cilium/actions/runs/8084204673/workflow#L356

@joestringer (Member, Author):

Something else that might be interesting: here's the first test that was run, and Cilium failed to become ready right from the get-go, which is pretty strange:

/root/go/bin/ginkgo  --focus="K8sDatapathServicesTest Checks N/S loadbalancing Tests with XDP, direct routing, DSR|K8sDatapathServicesTest Checks N/S loadbalancing Tests with XDP, direct routing, Hybrid"  --skip=""  --seed=1679952881  -v --  -cilium.provision=false  -cilium.image=quay.io/cilium/cilium-ci  -cilium.tag=fedc3de8f1460e3e0532ce0e80fb8f7b38f45801   -cilium.operator-image=quay.io/cilium/operator  -cilium.operator-tag=fedc3de8f1460e3e0532ce0e80fb8f7b38f45801  -cilium.hubble-relay-image=quay.io/cilium/hubble-relay-ci  -cilium.hubble-relay-tag=fedc3de8f1460e3e0532ce0e80fb8f7b38f45801  -cilium.kubeconfig=/root/.kube/config  -cilium.provision-k8s=false  -cilium.operator-suffix=-ci
Using CNI_INTEGRATION="kind"
Running Suite: Suite-k8s-1.29
=============================
Random Seed: 1679952881
Will run 5 of 132 specs

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
K8sDatapathServicesTest Checks N/S loadbalancing 
  Tests with XDP, direct routing, Hybrid and Random
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515
17:05:26 STEP: Running BeforeAll block for EntireTestsuite
17:05:26 STEP: Starting tests: command line parameters: {Reprovision:false HoldEnvironment:false PassCLIEnvironment:false SSHConfig: ShowCommands:false TestScope: SkipLogGathering:false CiliumImage:quay.io/cilium/cilium-ci CiliumTag:fedc3de8f1460e3e0532ce0e80fb8f7b38f45801 CiliumOperatorImage:quay.io/cilium/operator CiliumOperatorTag:fedc3de8f1460e3e0532ce0e80fb8f7b38f45801 CiliumOperatorSuffix:-ci HubbleRelayImage:quay.io/cilium/hubble-relay-ci HubbleRelayTag:fedc3de8f1460e3e0532ce0e80fb8f7b38f45801 ProvisionK8s:false Timeout:24h0m0s Kubeconfig:/root/.kube/config KubectlPath:/tmp/kubectl RegistryCredentials: Multinode:true RunQuarantined:false Help:false} environment variables: [SHELL=/bin/bash KERNEL=net-next K8S_NODES=3 *** LOGNAME=root KUBEPROXY=0 HOME=/root LANG=C.UTF-8 NO_CILIUM_ON_NODES=kind-worker2 NETNEXT=1 SSH_CONNECTION=10.0.2.2 45348 10.0.2.15 22 INTEGRATION_TESTS=true CILIUM_NO_IPV6_OUTSIDE=true USER=root CNI_INTEGRATION=kind SHLVL=1 K8S_VERSION=1.29 SSH_CLIENT=10.0.2.2 45348 22 PATH=/root/go/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin MAIL=/var/mail/root OLDPWD=/root _=./test.test CILIUM_IMAGE=quay.io/cilium/cilium-ci CILIUM_TAG=fedc3de8f1460e3e0532ce0e80fb8f7b38f45801 CILIUM_OPERATOR_IMAGE=quay.io/cilium/operator CILIUM_OPERATOR_TAG=fedc3de8f1460e3e0532ce0e80fb8f7b38f45801 CILIUM_OPERATOR_SUFFIX=-ci HUBBLE_RELAY_IMAGE=quay.io/cilium/hubble-relay-ci HUBBLE_RELAY_TAG=fedc3de8f1460e3e0532ce0e80fb8f7b38f45801 SKIP_K8S_PROVISION=true]
17:05:26 STEP: Ensuring the namespace kube-system exists
17:05:27 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
17:05:36 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
17:05:36 STEP: Preparing cluster
17:05:37 STEP: Deleting namespace local-path-storage
17:05:45 STEP: Labelling nodes
17:05:45 STEP: Cleaning up Cilium components
17:05:47 STEP: Running BeforeAll block for EntireTestsuite K8sDatapathServicesTest
17:05:47 STEP: Ensuring the namespace kube-system exists
17:05:48 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
17:05:48 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
17:05:52 STEP: Running BeforeAll block for EntireTestsuite K8sDatapathServicesTest Checks N/S loadbalancing
17:05:52 STEP: Installing Cilium
17:05:58 STEP: Waiting for Cilium to become ready
17:07:28 STEP: Validating if Kubernetes DNS is deployed
17:07:28 STEP: Checking if deployment is ready
17:07:28 STEP: Checking if kube-dns service is plumbed correctly
17:07:28 STEP: Checking if pods have identity
17:07:28 STEP: Checking if DNS can resolve
17:07:35 STEP: Kubernetes DNS is not ready: %!s(<nil>)
17:07:35 STEP: Restarting Kubernetes DNS (-l k8s-app=kube-dns)
17:07:35 STEP: Waiting for Kubernetes DNS to become operational
17:07:35 STEP: Checking if deployment is ready
17:07:36 STEP: Kubernetes DNS is not ready yet: only 0 of 2 replicas are available
17:07:36 STEP: Checking if deployment is ready
17:07:37 STEP: Kubernetes DNS is not ready yet: only 0 of 2 replicas are available
17:07:37 STEP: Checking if deployment is ready
17:07:38 STEP: Kubernetes DNS is not ready yet: only 0 of 2 replicas are available
17:07:38 STEP: Checking if deployment is ready
17:07:39 STEP: Kubernetes DNS is not ready yet: only 0 of 2 replicas are available
17:07:39 STEP: Checking if deployment is ready
17:07:40 STEP: Kubernetes DNS is not ready yet: only 1 of 2 replicas are available
17:07:40 STEP: Checking if deployment is ready
17:07:41 STEP: Checking if kube-dns service is plumbed correctly
17:07:41 STEP: Checking if pods have identity
17:07:41 STEP: Checking if DNS can resolve
17:07:43 STEP: Kubernetes DNS is not ready yet: CiliumEndpoint does not exist
17:07:43 STEP: Checking if deployment is ready
17:07:43 STEP: Checking if kube-dns service is plumbed correctly
17:07:43 STEP: Checking if pods have identity
17:07:43 STEP: Checking if DNS can resolve
17:07:44 STEP: Validating Cilium Installation
17:07:44 STEP: Checking whether host EP regenerated
17:07:44 STEP: Performing Cilium status preflight check
17:07:44 STEP: Performing Cilium controllers preflight check
17:07:44 STEP: Performing Cilium health check
17:07:47 STEP: Performing Cilium service preflight check
17:07:47 STEP: Performing K8s service preflight check
17:07:48 STEP: Waiting for cilium-operator to be ready
17:07:48 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator")
17:07:48 STEP: WaitforPods(namespace="kube-system", filter="-l name=cilium-operator") => <nil>
17:07:52 STEP: Connectivity config:: helpers.DualStackSupported(): true
Primary Interface eth0   :: IPv4: (172.18.0.2, 172.18.0.3), IPv6: (fc00:c111::2, fc00:c111::3)
Secondary Interface enp0s9 :: IPv4: (, ), IPv6: (, )
17:07:52 STEP: WaitforPods(namespace="default", filter="")
17:08:22 STEP: WaitforPods(namespace="default", filter="") => <nil>
17:08:22 STEP: Installing Cilium
17:08:30 STEP: Waiting for Cilium to become ready
FAIL: Timed out after 240.000s.
Timeout while waiting for Cilium to become ready
Expected
    <*errors.errorString | 0xc001206e60>: 
    only 1 of 2 desired pods are ready
    {
        s: "only 1 of 2 desired pods are ready",
    }
to be nil
=== Test Finished at 2024-02-28T17:12:30Z====
17:12:30 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathServicesTest
===================== TEST FAILED =====================

@tommyp1ckles (Contributor) commented Mar 1, 2024:

I do see some errors in the events logs; it looks like that cilium Pod is being killed (intentionally or not?) during the mount-bpf-fs init stage and timing out somewhere. It's strange to see KillPodSandboxError on a hostNetwork Pod.

kube-system         23s         Warning   FailedKillPod             pod/cilium-s4gkh                                                                                kubelet, kind-worker          error killing pod: [failed to "KillContainer" for "mount-bpf-fs" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "67257da8-8994-466e-95a3-50a545ec8960" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]   23s          1       cilium-s4gkh.17b8165caa70c544
kube-system         73s         Warning   FailedKillPod             pod/cilium-s4gkh                                                                                kubelet, kind-worker          error killing pod: [failed to "KillContainer" for "mount-bpf-fs" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "67257da8-8994-466e-95a3-50a545ec8960" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]   5m16s        2       cilium-s4gkh.17b8165caa70c544
kube-system         7m14s       Warning   FailedKillPod             pod/cilium-s4gkh                                                                                kubelet, kind-worker          error killing pod: [failed to "KillContainer" for "mount-bpf-fs" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "67257da8-8994-466e-95a3-50a545ec8960" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]                                                                                                                                                                                                                                                                                                                       15m          3       cilium-s4gkh.17b8165caa70c544
kube-system         3m11s       Warning   FailedKillPod             pod/cilium-s4gkh                                                                                kubelet, kind-worker          error killing pod: [failed to "KillContainer" for "mount-bpf-fs" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "67257da8-8994-466e-95a3-50a545ec8960" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = failed to stop container \"c9a6fc768e844be5a5cdcd7022d5dfe279bf9f2ac89d3ae312e85d8ea0027650\": an error occurs during waiting for container \"c9a6fc768e844be5a5cdcd7022d5dfe279bf9f2ac89d3ae312e85d8ea0027650\" to be killed: wait container \"c9a6fc768e844be5a5cdcd7022d5dfe279bf9f2ac89d3ae312e85d8ea0027650\": context deadline exceeded"]   3m11s        1       cilium-s4gkh.17b8170621758253
kube-system         2m13s       Warning   FailedKillPod             pod/cilium-s4gkh                                                                                kubelet, kind-worker          error killing pod: [failed to "KillContainer" for "mount-bpf-fs" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "67257da8-8994-466e-95a3-50a545ec8960" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]   10m          3       cilium-s4gkh.17b8165caa70c544
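
For reference, these warning events can be pulled back out of a live cluster with a plain kubectl query; the pod name here is the one from this particular run:

```bash
# List warning events for the stuck agent pod, oldest first.
kubectl get events -n kube-system \
  --field-selector involvedObject.name=cilium-s4gkh,type=Warning \
  --sort-by=.lastTimestamp
```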

@tommyp1ckles (Contributor):

From what I can tell, Cilium is just getting stuck in the mount-bpf-fs init stage, and it's hard to say why from the sysdumps. The only thing I can think of is trying to reproduce in CI while dumping some additional information about the kind cluster: https://github.com/cilium/cilium/pull/31102/files
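
A rough idea of the sort of extra information a debug step could dump from the kind cluster (a sketch only, not necessarily what the linked PR gathers; node names assume the default kind naming used in this run):

```bash
# kind nodes are plain Docker containers, so we can exec into them directly.
kubectl get nodes -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
for node in kind-control-plane kind-worker kind-worker2; do
  echo "=== ${node} ==="
  docker exec "${node}" mount | grep bpf            # is /sys/fs/bpf mounted?
  docker exec "${node}" dmesg | tail -n 20          # recent kernel messages
  docker exec "${node}" crictl ps -a | grep cilium  # cilium container states on the node
done
```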

tommyp1ckles self-assigned this Mar 1, 2024
@joestringer (Member, Author) commented Mar 5, 2024:

I have a sneaking suspicion I hit a similar problem here:

https://github.com/cilium/cilium/actions/runs/8146839676/job/22266268950#step:16:415

K8sDatapathServicesTest
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  Checks N/S loadbalancing
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Tests with XDP, direct routing, Hybrid and Maglev
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515
------------------------------
K8sDatapathServicesTest Checks N/S loadbalancing 
  Tests with XDP, direct routing, DSR and Random
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515
21:02:24 STEP: Installing Cilium
21:02:32 STEP: Waiting for Cilium to become ready
FAIL: Timed out after 240.000s.
Timeout while waiting for Cilium to become ready
Expected
    <*errors.errorString | 0xc001fba240>: 
    only 1 of 2 desired pods are ready
    {
        s: "only 1 of 2 desired pods are ready",
    }
to be nil
=== Test Finished at 2024-03-04T21:06:32Z====
21:06:32 STEP: Running JustAfterEach block for EntireTestsuite K8sDatapathServicesTest
===================== TEST FAILED =====================
21:06:33 STEP: Running AfterFailed block for EntireTestsuite K8sDatapathServicesTest
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE           NAME                                         READY   STATUS     RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
	 cilium-monitoring   grafana-6f4755f98c-vs8xj                     1/1     Running    0          8m27s   10.0.0.150   kind-worker          <none>           <none>
	 cilium-monitoring   prometheus-67fdcf4796-g4s7m                  1/1     Running    0          8m27s   10.0.0.16    kind-worker          <none>           <none>
	 default             app1-9fdc9569f-brfv9                         2/2     Running    0          7m14s   10.0.1.9     kind-control-plane   <none>           <none>
	 default             app1-9fdc9569f-z2b96                         2/2     Running    0          7m14s   10.0.1.57    kind-control-plane   <none>           <none>
	 default             app2-bc9f7bb88-gqhvw                         1/1     Running    0          7m14s   10.0.1.204   kind-control-plane   <none>           <none>
	 default             app3-6798875c47-7hzwf                        1/1     Running    0          7m14s   10.0.1.25    kind-control-plane   <none>           <none>
	 default             echo-656f7b858d-4f6pv                        2/2     Running    0          7m11s   10.0.0.84    kind-worker          <none>           <none>
	 default             echo-656f7b858d-jv7n9                        2/2     Running    0          7m11s   10.0.1.88    kind-control-plane   <none>           <none>
	 default             test-k8s2-7b76864867-ptmg2                   2/2     Running    0          7m13s   10.0.0.191   kind-worker          <none>           <none>
	 default             testclient-c9ndj                             1/1     Running    0          7m13s   10.0.0.141   kind-worker          <none>           <none>
	 default             testclient-c9wm9                             1/1     Running    0          7m13s   10.0.1.91    kind-control-plane   <none>           <none>
	 default             testds-cxn6n                                 2/2     Running    0          7m13s   10.0.0.113   kind-worker          <none>           <none>
	 default             testds-d7btr                                 2/2     Running    0          7m13s   10.0.1.114   kind-control-plane   <none>           <none>
	 kube-system         cilium-envoy-4kmmg                           1/1     Running    0          4m27s   172.18.0.3   kind-worker2         <none>           <none>
	 kube-system         cilium-envoy-kf7xl                           1/1     Running    0          4m27s   172.18.0.4   kind-worker          <none>           <none>
	 kube-system         cilium-envoy-zjphl                           1/1     Running    0          4m27s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         cilium-node-init-42whq                       1/1     Running    0          4m27s   172.18.0.3   kind-worker2         <none>           <none>
	 kube-system         cilium-node-init-q5m5s                       1/1     Running    0          4m27s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         cilium-node-init-xgskv                       1/1     Running    0          4m27s   172.18.0.4   kind-worker          <none>           <none>
	 kube-system         cilium-operator-f6bc44d98-l75dk              1/1     Running    0          4m27s   172.18.0.3   kind-worker2         <none>           <none>
	 kube-system         cilium-operator-f6bc44d98-sjmv2              1/1     Running    0          4m27s   172.18.0.4   kind-worker          <none>           <none>
	 kube-system         cilium-qv62g                                 0/1     Init:5/7   0          4m27s   172.18.0.4   kind-worker          <none>           <none>
	 kube-system         cilium-xjpqz                                 1/1     Running    0          4m27s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         coredns-76f75df574-kfxd7                     1/1     Running    0          7m35s   10.0.1.121   kind-control-plane   <none>           <none>
	 kube-system         coredns-76f75df574-pjhfd                     1/1     Running    0          7m35s   10.0.1.242   kind-control-plane   <none>           <none>
	 kube-system         etcd-kind-control-plane                      1/1     Running    0          9m18s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         kube-apiserver-kind-control-plane            1/1     Running    0          9m18s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         kube-controller-manager-kind-control-plane   1/1     Running    0          9m18s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         kube-scheduler-kind-control-plane            1/1     Running    0          9m18s   172.18.0.2   kind-control-plane   <none>           <none>
	 kube-system         log-gatherer-bmp7p                           1/1     Running    0          8m48s   172.18.0.3   kind-worker2         <none>           <none>
	 kube-system         log-gatherer-jgmnp                           1/1     Running    0          8m48s   172.18.0.4   kind-worker          <none>           <none>
	 kube-system         log-gatherer-zzb4b                           1/1     Running    0          8m48s   172.18.0.2   kind-control-plane   <none>           <none>
	 
Stderr:

The test configuration is:

{
      "k8s-version": "1.[29](https://github.com/cilium/cilium/actions/runs/8146839676/job/22266233516#step:4:30)",
      "ip-family": "dual",
      "kube-image": "kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245",
      "kernel": "bpf-next-20240226.012633@sha256:5107cfa8f934959b38e99d5aaf5f10de5e4[30](https://github.com/cilium/cilium/actions/runs/8146839676/job/22266233516#step:4:31)000f738b236570257d49e2ce8bc"
    },

The test is:

K8sDatapathServicesTest Checks N/S loadbalancing Tests with XDP, direct routing, Hybrid and Maglev

I see a bunch of these warnings in the dmesg output in the sysdump:

2024-03-04T21:01:17,116382+00:00 netfs: Zero-sized write [R=1f1ad]
2024-03-04T21:02:23,713742+00:00 netfs: Zero-sized write [R=2cc50]
2024-03-04T21:02:39,779050+00:00 mv (42845) used greatest stack depth: 12168 bytes left
2024-03-04T21:06:32,820514+00:00 netfs: Zero-sized write [R=3bc70]
2024-03-04T21:06:32,880520+00:00 netfs: Zero-sized write [R=3bd09]
2024-03-04T21:06:49,233470+00:00 netfs: Zero-sized write [R=3e6db]
2024-03-04T21:06:49,643923+00:00 netfs: Zero-sized write [R=3eb84]
2024-03-04T21:06:50,013071+00:00 netfs: Zero-sized write [R=3ef17]
2024-03-04T21:06:52,344582+00:00 netfs: Zero-sized write [R=409cc]
2024-03-04T21:06:52,672127+00:00 netfs: Zero-sized write [R=40f60]
2024-03-04T21:06:52,754518+00:00 netfs: Zero-sized write [R=41088]
2024-03-04T21:06:52,828411+00:00 netfs: Zero-sized write [R=41191]
2024-03-04T21:06:52,831989+00:00 netfs: Zero-sized write [R=41198]
2024-03-04T21:06:52,954387+00:00 netfs: Zero-sized write [R=413a5]
2024-03-04T21:06:53,093022+00:00 netfs: Zero-sized write [R=416d6]
2024-03-04T21:06:54,906086+00:00 netfs: Zero-sized write [R=43205]
2024-03-04T21:06:55,496966+00:00 netfs: Zero-sized write [R=4369d]
2024-03-04T21:06:55,504110+00:00 netfs: Zero-sized write [R=436a4]
2024-03-04T21:06:55,761084+00:00 netfs: Zero-sized write [R=43824]

These seem to correlate with discussion on the kernel mailing lists around bpf-next changes, which may suggest that the k8s 1.29 / bpf-next test runs hit this and others don't.
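
To compare against other runs, the warning count can be grepped out of each sysdump's dmesg capture; the file name below is illustrative and depends on how the sysdump names it:

```bash
# Count the netfs warnings in a run's dmesg capture.
grep -c 'netfs: Zero-sized write' dmesg-kind-worker.log
```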

The k8s events for the Pod that seems to have slow init / startup look like this, with a strange ~4m delay during clean-cilium-state:

Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m25s  default-scheduler  Successfully assigned kube-system/cilium-qv62g to kind-worker
  Normal  Pulling    4m24s  kubelet            Pulling image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd"
  Normal  Pulled     4m24s  kubelet            Successfully pulled image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" in 232ms (232ms including waiting)
  Normal  Created    4m24s  kubelet            Created container config
  Normal  Started    4m24s  kubelet            Started container config
  Normal  Pulling    4m23s  kubelet            Pulling image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd"
  Normal  Pulled     4m23s  kubelet            Successfully pulled image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" in 141ms (154ms including waiting)
  Normal  Created    4m23s  kubelet            Created container mount-cgroup
  Normal  Started    4m22s  kubelet            Started container mount-cgroup
  Normal  Pulling    4m22s  kubelet            Pulling image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd"
  Normal  Pulled     4m22s  kubelet            Successfully pulled image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" in 178ms (178ms including waiting)
  Normal  Created    4m22s  kubelet            Created container apply-sysctl-overwrites
  Normal  Started    4m21s  kubelet            Started container apply-sysctl-overwrites
  Normal  Pulling    4m21s  kubelet            Pulling image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd"
  Normal  Pulled     4m21s  kubelet            Successfully pulled image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" in 85ms (85ms including waiting)
  Normal  Created    4m21s  kubelet            Created container mount-bpf-fs
  Normal  Started    4m21s  kubelet            Started container mount-bpf-fs
  Normal  Pulling    4m20s  kubelet            Pulling image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd"
  Normal  Pulled     4m20s  kubelet            Successfully pulled image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" in 88ms (88ms including waiting)
  Normal  Created    4m20s  kubelet            Created container wait-for-node-init
  Normal  Started    4m20s  kubelet            Started container wait-for-node-init
  Normal  Pulling    4m19s  kubelet            Pulling image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd"
  Normal  Pulled     4m19s  kubelet            Successfully pulled image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" in 129ms (129ms including waiting)
  Normal  Created    4m19s  kubelet            Created container clean-cilium-state
  Normal  Started    4m18s  kubelet            Started container clean-cilium-state
  Normal  Pulled     21s    kubelet            Container image "quay.io/cilium/cilium-ci:ce34d107189c9db504909825193bd903de6709bd" already present on machine
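
To confirm where the roughly four minutes go, the per-init-container timings can be read off the pod status while it is stuck; a minimal sketch, using the pod name from this run:

```bash
# Print name, start and finish time for each completed init container of the
# slow pod, to see which stage (e.g. clean-cilium-state) accounts for the gap.
kubectl get pod -n kube-system cilium-qv62g -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state.terminated.startedAt}{"\t"}{.state.terminated.finishedAt}{"\n"}{end}'
```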

tommyp1ckles added a commit that referenced this issue Mar 7, 2024
With issue #31040 we've been seeing strange timeouts in bpf-next.

It seems like https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca9ca1a5d5a980550db1001ea825f9fdfa550b83 may address this issue.
The latest round of lvh-images [1] for bpf-next contains this patch, so we're bumping these now to see if this resolves #31040.

[1] cilium/little-vm-helper-images#400

Addresses: #31040

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles (Contributor):

New bpf-next images were bumped here: #31230. This includes the patch that we hope will fix this issue.

Going to observe for a couple days to see if the issue stops happening.

@tommyp1ckles (Contributor):

Can't find any instances of this occurring since the image bump; closing for now.
