CI: conformance-ginkgo workflow fails at step "Fetch JUnits" #31040
Comments
@joestringer Looking at the run ginkgo step, I see that the process may have been killed (externally?) with:
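One generic way to check whether something external (for example the kernel OOM killer) terminated the process is to inspect the runner's kernel log. This is a diagnostic sketch, not part of the workflow itself:

```bash
# Check the kernel ring buffer for evidence of an external kill,
# e.g. the OOM killer reaping the ginkgo process.
sudo dmesg -T | grep -iE 'out of memory|oom_reaper|killed process'
```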
Do we have any other examples of this happening?
I've seen this failure before, but I've lost the links. Is there some specific piece of information you'd like to see in a sysdump, or otherwise gathered from the test run, in order to investigate deeper? One thing that crosses my mind with …
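For context, a sysdump is typically collected with the Cilium CLI while the cluster is still reachable; a minimal sketch:

```bash
# Gather cluster-wide Cilium diagnostics into a local archive.
cilium sysdump
```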
Oh right, I missed that the actual test run had various failures during my initial investigation. Here's a snippet from the same failure above:
I see other tests failing earlier with timeouts as well, and the test runs took 31m 34s according to GitHub. The timeout expressed in the workflow is 40m though 🤔 https://github.com/cilium/cilium/actions/runs/8084204673/workflow#L356
Something else that might be interesting, here's the first test that was run, and Cilium failed to become ready right from the get-go, which is pretty strange:
I do see some errors in the events logs; it looks like the cilium Pod is being killed (intentionally or not?) during the mount-bpf-fs init stage and timing out somewhere. It's strange to see KillPodSandboxError on a hostNetwork Pod.
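For reference, the Pod-level events can also be pulled directly with kubectl; a sketch assuming a hypothetical Pod name of cilium-abcde:

```bash
# List events for a specific Pod, oldest first, to see the kill/timeout sequence.
kubectl -n kube-system get events \
  --field-selector involvedObject.name=cilium-abcde \
  --sort-by=.lastTimestamp

# Or view the same events inline with the Pod description.
kubectl -n kube-system describe pod cilium-abcde
```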
From what I can tell, Cilium is just getting stuck on the mount-bpf-fs init stage, and it's hard to say why from the sysdumps. The only thing I can think of is trying to reproduce in CI while dumping some additional information about the kind cluster: https://github.com/cilium/cilium/pull/31102/files
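Extra state can be dumped from a kind cluster out-of-band along these lines; a sketch assuming the default cluster name "kind" (the real CI cluster name may differ):

```bash
# Export everything kind knows about the cluster (node logs, kubelet logs, etc.).
kind export logs ./kind-logs

# Inspect a node container's kernel log directly.
docker exec kind-control-plane dmesg -T | tail -n 100
```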
I have a sneaking suspicion I hit a similar problem here: https://github.com/cilium/cilium/actions/runs/8146839676/job/22266268950#step:16:415
The test configuration is:
The test is:
I see a bunch of these warnings in the
These seem to correlate with discussion on the kernel mailing lists around bpf-next changes, which may suggest that the k8s 1.29 / bpf-next test runs hit this and others don't. The k8s events for the Pod that seems to have slow init / startup look like this, with a strange ~4m delay during startup:
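One way to pin down where that delay sits is to read the init container state timestamps off the Pod status; again assuming the hypothetical Pod name cilium-abcde:

```bash
# Print each init container's name and state (incl. startedAt/finishedAt timestamps)
# to see which stage absorbs the ~4m gap.
kubectl -n kube-system get pod cilium-abcde \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
```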
With issue #31040 we've been seeing strange timeouts in bpf-next. It seems like https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca9ca1a5d5a980550db1001ea825f9fdfa550b83 may address this issue. The latest round of lvh-images [1] bpf-next images contains this patch, so we're going to bump these now and see if this resolves #31040.

[1] cilium/little-vm-helper-images#400

Addresses: #31040
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Can't find any instances of this occurring since the image bump; closing for now.
It is possible for the conformance-ginkgo GitHub workflow to fail at the "Fetch JUnits" step: https://github.com/cilium/cilium/actions/runs/8084204673/job/22089004217#step:19:1
This seems like it may be a race condition: the previous step may not have finished writing the JUnit output to disk before the next step gathers the JUnit file and uploads it as an artifact. Can we add an extra check to ensure that the files are ready before proceeding to this step?
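As a rough illustration (not the workflow's actual code), such a check could be a small shell guard that runs before the Fetch JUnits step; the report filename and timeout below are assumptions:

```bash
# Wait up to 60s for the ginkgo JUnit report to appear and be non-empty
# before the artifact-gathering step runs.
junit_file="cilium-junit.xml"   # hypothetical filename; match the workflow's real output path
for _ in $(seq 1 60); do
  [ -s "$junit_file" ] && exit 0
  sleep 1
done
echo "JUnit report ${junit_file} not present/complete after 60s" >&2
exit 1
```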