
kata tests are broken #6481

Closed
haircommander opened this issue Jan 3, 2023 · 11 comments · Fixed by #6603
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@haircommander
Member

Which jobs are failing?

kata-jenkins

Which tests are failing?

It looks like CRI-O has been failing to set up: http://jenkins.katacontainers.io/job/kata-containers-2-crio-PR/4525/consoleFull

Since when has it been failing?

Since before the holidays; I overrode it a number of times in order to release 1.26.0.

Testgrid link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

@haircommander haircommander added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jan 3, 2023
@haircommander
Member Author

If I had to guess, I'd say it's due to #6289

cc @fidencio @littlejawa

@amarlearning
Member

amarlearning commented Jan 4, 2023

I want to work on this issue.

@littlejawa
Contributor

If I had to guess, I'd say it's due to #6289

Right, that's probably the root cause.

@amarlearning - I'm not sure how familiar you are with kata containers, so let me give you some background. Sorry if that's redundant :-/
The CI job for kata on cri-o doesn't use the config from the cri-o repository. It uses scripts from the kata repository, shared with all other kata CI jobs.
For instance, this is how we install the cni plugins: https://github.com/kata-containers/tests/blob/main/.ci/install_cni_plugins.sh
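For reference, an install along those lines typically boils down to unpacking a containernetworking/plugins release tarball into the CNI binary directory. A hypothetical sketch follows; the version, arch, and destination are illustrative and this is not the actual contents of install_cni_plugins.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a minimal CNI plugin install; not the actual
# contents of install_cni_plugins.sh. Version, arch, and destination
# directory are illustrative defaults.
set -euo pipefail

# Build the download URL for a containernetworking/plugins release.
cni_release_url() {
    local version="$1" arch="$2"
    echo "https://github.com/containernetworking/plugins/releases/download/${version}/cni-plugins-linux-${arch}-${version}.tgz"
}

# Download and unpack the plugins into the CNI binary directory.
install_cni_plugins() {
    local version="${1:-v1.2.0}" arch="${2:-amd64}" dest="${3:-/opt/cni/bin}"
    mkdir -p "$dest"
    curl -sSL "$(cni_release_url "$version" "$arch")" | tar -xz -C "$dest"
}
```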

As for the config file, I think we reuse the default config from containerd.
Maybe that's actually the issue - the two configs used to be similar, but your PR moves away from that config, so we can't rely on it anymore?
That's just a guess on my part, at first glance.

I will do some tests on my side to understand it better. If you do anything on the kata repository, please ping me so that I can get a look :)

@amarlearning
Member

amarlearning commented Jan 4, 2023

@littlejawa yes, you are right, I have zero experience with Kata Containers. With help from the community, I believe I can do it.

Btw, I was going through the Jenkins build URL that @haircommander shared in the issue, and one of the reasons the build is failing is:

00:14:51 cp: cannot stat '/tmp/jenkins/workspace/kata-containers-2-crio-PR/go/src/github.com/cri-o/cri-o/contrib/cni/10-crio-bridge.conf': No such file or directory
00:14:51 INFO: Environment initialization failed. Clean up and try again.

I believe we need to fix this file-not-found issue.

Based on my debugging, updating this line to use the correct file extension should fix it.
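To illustrate the kind of fix being discussed: the setup could tolerate a rename of cri-o's sample CNI config from `.conf` to `.conflist` by copying whichever file exists. The helper below is a hypothetical sketch, not the actual kata setup script; the function name and paths are mine:

```shell
# Hypothetical helper, not the actual kata setup script: copy cri-o's
# sample bridge CNI config, accepting either the old .conf name or the
# newer .conflist one.
copy_crio_bridge_config() {
    local src_dir="$1" dest_dir="$2"
    mkdir -p "$dest_dir"
    if [ -f "${src_dir}/10-crio-bridge.conflist" ]; then
        cp "${src_dir}/10-crio-bridge.conflist" "${dest_dir}/"
    elif [ -f "${src_dir}/10-crio-bridge.conf" ]; then
        cp "${src_dir}/10-crio-bridge.conf" "${dest_dir}/"
    else
        echo "no crio bridge CNI config found in ${src_dir}" >&2
        return 1
    fi
}
```

Handling both names keeps the script working across cri-o versions on either side of the rename.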

@littlejawa
Contributor

Hi @amarlearning - right, this is probably the issue.
I made a pull request to fix that; let's see how the CI behaves on the kata side.
(see kata-containers/tests#5363)

@amarlearning
Member

@littlejawa Thanks 👍🏻

@littlejawa
Contributor

The change was merged, but the build is still broken.
The job gets further now: we are able to start the kubernetes tests, but then hit the following error:

00:06:43 INFO: Run tests
00:06:43 1..1
00:06:43 ok 1 Running with postStart and preStop handlers
00:07:02 1..1
00:07:02 ok 1 Block Storage Support
00:07:20 1..1
00:07:20 ok 1 Check capabilities of pod
00:07:25 1..1
00:07:26 not ok 1 ConfigMap for a pod
00:08:56 # (in test file k8s-configmap.bats, line 29)
00:08:56 #   `kubectl wait --for=condition=Ready --timeout=$timeout pod "$pod_name"' failed
00:08:56 # INFO: k8s configured to use runtimeclass
00:08:56 # configmap/test-configmap created
00:08:56 # pod/config-env-test-pod created
00:08:56 # error: timed out waiting for the condition on pods/config-env-test-pod
00:08:56 # Name:         config-env-test-pod
00:08:56 # Namespace:    default
00:08:56 # Priority:     0
00:08:56 # Node:         fedora35cloud820d50/10.0.0.5
00:08:56 # Start Time:   Tue, 10 Jan 2023 23:07:26 +0000
00:08:56 # Labels:       <none>
00:08:56 # Annotations:  <none>
00:08:56 # Status:       Pending
00:08:56 # IP:
00:08:56 # IPs:          <none>
00:08:56 # Containers:
00:08:56 #   test-container:
00:08:56 #     Container ID:
00:08:56 #     Image:         quay.io/prometheus/busybox:latest
00:08:56 #     Image ID:
00:08:56 #     Port:          <none>
00:08:56 #     Host Port:     <none>
00:08:56 #     Command:
00:08:56 #       tail
00:08:56 #       -f
00:08:56 #       /dev/null
00:08:56 #     State:          Waiting
00:08:56 #       Reason:       ContainerCreating
00:08:56 #     Ready:          False
00:08:56 #     Restart Count:  0
00:08:56 #     Environment:
00:08:56 #       KUBE_CONFIG_1:  <set to the key 'data-1' of config map 'test-configmap'>  Optional: false
00:08:56 #       KUBE_CONFIG_2:  <set to the key 'data-2' of config map 'test-configmap'>  Optional: false
00:08:56 #     Mounts:
00:08:56 #       /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jnjgz (ro)
00:08:56 # Conditions:
00:08:56 #   Type              Status
00:08:56 #   Initialized       True
00:08:56 #   Ready             False
00:08:56 #   ContainersReady   False
00:08:56 #   PodScheduled      True
00:08:56 # Volumes:
00:08:56 #   kube-api-access-jnjgz:
00:08:56 #     Type:                    Projected (a volume that contains injected data from multiple sources)
00:08:56 #     TokenExpirationSeconds:  3607
00:08:56 #     ConfigMapName:           kube-root-ca.crt
00:08:56 #     ConfigMapOptional:       <nil>
00:08:56 #     DownwardAPI:             true
00:08:56 # QoS Class:                   BestEffort
00:08:56 # Node-Selectors:              <none>
00:08:56 # Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
00:08:56 #                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
00:08:56 # Events:
00:08:56 #   Type     Reason                  Age   From               Message
00:08:56 #   ----     ------                  ----  ----               -------
00:08:56 #   Normal   Scheduled               90s   default-scheduler  Successfully assigned default/config-env-test-pod to fedora35cloud820d50
00:08:56 #   Warning  FailedCreatePodSandBox  88s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(507874bc44f883a75e626dec4c09c9a64533058de0ed56ab991c497f7b76635b): error checking pod default_config-env-test-pod for CNI network "crio": Interface veth5d9d1845 Mac doesn't match: 4a:c7:7e:b2:b3:45 not found
00:08:56 #   Warning  FailedCreatePodSandBox  76s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(e3d696e22df14bb1d8d3076c9bc35bc277da3d9b33a06255ed02b681e8517af9): error checking pod default_config-env-test-pod for CNI network "crio": Interface vethfb9a8a78 Mac doesn't match: 76:cd:df:6d:7e:81 not found
00:08:56 #   Warning  FailedCreatePodSandBox  63s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(2fbfa16d4c5ab3159b5e72f9b9f5731645ed986cd563023e73831ebd47d77e8a): error checking pod default_config-env-test-pod for CNI network "crio": Interface veth3361a47c Mac doesn't match: d2:4a:f4:a0:38:12 not found
00:08:56 #   Warning  FailedCreatePodSandBox  48s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(66395da939ef56ee49917bb95ceb32f834baeb6da92d2cb6b518dea3ab037682): error checking pod default_config-env-test-pod for CNI network "crio": Interface veth590621bf Mac doesn't match: 36:31:16:29:13:1a not found
00:08:56 #   Warning  FailedCreatePodSandBox  32s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(e167d74c5ce27367c0ce305f50b44fd114f9b8ce385e4ba29d16c7a06f8feb63): error checking pod default_config-env-test-pod for CNI network "crio": Interface veth8c9f89f1 Mac doesn't match: 1a:96:e3:38:7d:c9 not found
00:08:56 #   Warning  FailedCreatePodSandBox  20s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(f07ea500257ebaa4c74f51af3011e4307d16267b4b40c7f08706e072e4d56590): error checking pod default_config-env-test-pod for CNI network "crio": Interface veth9d0e443c Mac doesn't match: 5a:9a:dc:97:7b:bb not found
00:08:56 #   Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get network status for pod sandbox k8s_config-env-test-pod_default_44b730b1-7433-4413-a061-983df63fd5b9_0(da29622a03a0d6dc883908f2a09bc8f3e802f147a5438815709a2cace917859d): error checking pod default_config-env-test-pod for CNI network "crio": Interface veth117f329e Mac doesn't match: 3e:41:e6:1a:b9:10 not found

I am not familiar with this error. I've found previous reports relating it to using a dual-stack CNI config on a machine where only IPv4 is available, but that config used to work on this same CI job.
I'm also confused by the fact that a couple of tests pass before we hit this error.

@amarlearning - from your test on the CNI version update, does this ring a bell?
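For context, an IPv4-only bridge config of the kind discussed here (as opposed to a dual-stack one with an extra IPv6 range) would look roughly like the sketch below. The subnet and output path are illustrative, not the job's actual config:

```shell
# Hypothetical sketch: write an IPv4-only bridge CNI config using CNI
# spec version 0.3.1. A dual-stack variant would add an IPv6 subnet to
# the "ranges" list. Subnet and output path are illustrative.
write_ipv4_bridge_conflist() {
    local out="$1"
    cat > "$out" <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "crio",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "routes": [{ "dst": "0.0.0.0/0" }],
        "ranges": [[{ "subnet": "10.85.0.0/16" }]]
      }
    }
  ]
}
EOF
}
```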

@littlejawa
Contributor

Looks similar to #805

@littlejawa
Contributor

If I use the IPv4-only config file, still with CNI 0.3.1, the test passes and I can go further.

This led to another error in a specific integration test, "ctr pod lifecycle with evented pleg enabled", which was introduced by #6404.
Working on it (see: #6531).

So at this point, we need two things to get the kata-jenkins job back to normal: CNI plugins that include the fix, and a resolution for the evented-pleg test failure (#6531).

@littlejawa
Contributor

#6541 was merged, using the latest CNI plugins with the fix in them.
The kata job still hits the "Mac doesn't match" error above, though, because it is still using CNI plugins v1.1.1.
I'm trying to fix that here: kata-containers/kata-containers#6111

@littlejawa
Contributor

All CNI-related issues seem to be gone now.
What remains is the failure on "ctr pod lifecycle with evented pleg enabled".
This is still being discussed in #6531.
