CI: Smoke test: Timed out waiting for the condition on pods #12279
Comments
One more occurrence here https://github.com/cilium/cilium/pull/12013/checks?check_run_id=804482109
We could probably add a final step to the workflow that dumps a bunch of logs to stdout in case of error.
Change IPAM to kubernetes instead of cluster-pool Closes cilium#12279 Signed-off-by: Tam Mach <sayboras@yahoo.com>
Good point. Let me add a step to dump logs and events for the related services, so that we have more details to tackle this issue. Currently, my hunch is that there is a race condition between local-path-storage and cilium, which causes the bpf mount to fail. I am not 100% confident though :)
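A minimal sketch of such a dump step in the workflow, assuming the usual `k8s-app=cilium` label and the `kube-system` namespace (step name and selectors are assumptions, not the actual change that was merged):

```yaml
# Hypothetical final workflow step: only runs when an earlier step failed,
# and dumps pod state, logs, and cluster events to stdout for debugging.
- name: Dump logs and events on failure
  if: ${{ failure() }}
  run: |
    kubectl -n kube-system describe pods -l k8s-app=cilium
    kubectl -n kube-system logs -l k8s-app=cilium --tail=-1
    kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
```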
One more observation I have is that the conformance test (with helm) has never failed so far (maybe it will fail later :D). There are a few differences between the conformance test and the quick-install test, as below:
Will narrow this down further once I have some free time. Any input or pointers are much appreciated 👍
…esting Dump related logs and events for cilium, hubble and connectivity checks Relates cilium#12279 Signed-off-by: Tam Mach <sayboras@yahoo.com>
Failed in https://github.com/cilium/cilium/pull/12320/checks?check_run_id=818304498, I am dumping the logs below to make sure they are not lost. raw logs
Happened again in #13045 but this time we have the
We may have to bump some timeout to know if it's the DNS failing or some other packet drop.
The default value for these two fields is only 1 second. This PR is to update values to 7 seconds, which is 5 (curl connection timeout) + 2 (some buffer) Relates: cilium#12279 Signed-off-by: Tam Mach <sayboras@yahoo.com>
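In a pod spec, the change described in the commit message above would correspond to something like the following (a sketch only: the probe target and container layout are illustrative assumptions, not the actual connectivity-check manifest; only the `timeoutSeconds` values reflect the described change):

```yaml
# Hypothetical probe config: timeoutSeconds raised from the 1s default to 7s,
# i.e. the 5s curl connection timeout plus a 2s buffer.
readinessProbe:
  exec:
    command: ["curl", "--connect-timeout", "5", "http://echo-server:8080/"]  # illustrative target
  timeoutSeconds: 7
livenessProbe:
  exec:
    command: ["curl", "--connect-timeout", "5", "http://echo-server:8080/"]  # illustrative target
  timeoutSeconds: 7
```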
Yep, that could be useful 👍 We'll need to separate it from the other output, because bugtool prints quite a lot of text.
That's a good point, so I decided to make it a downloadable artifact (e.g. bugtool.tar) in the GitHub Action
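Uploading the archive as a workflow artifact might look like this (a sketch; the step name, artifact name, and file path are assumptions based on the comment above):

```yaml
# Hypothetical step: publish the bugtool archive as a downloadable artifact
# instead of flooding the job log with its output.
- name: Upload bugtool output
  if: ${{ failure() }}
  uses: actions/upload-artifact@v2
  with:
    name: bugtool
    path: bugtool.tar
```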
Crossposting the cilium-sysdump from #13090 here as well.
I can't find anything wrong with this, so I suspect it is not happening anymore. To confirm this hypothesis, I set up a scheduled job in my forked repo: 25 runs without any failure so far https://github.com/sayboras/cilium/actions?query=workflow%3A%22Smoke+Test+with+IPv6%22+event%3Aschedule
Ok, let's close then. If someone hits this again, they/we can reopen.
I hit this on PR #14679 yesterday, where the smoke-test-ipv6 failed. Attaching the sysdump. How do we get the history of this GH workflow?
I usually check this page for history. Yup, I can see a few occurrences of this check failing 🤔
Looks like pod
Surprisingly, I didn't see the failure from my PR.
I suspect we're underreporting this failure. Looking back through the workflow history, I see this has hit #17219 (link cc @joamaki ), #16926 (link cc @errordeveloper ), #17241 (link cc @brb ), #17210 (link cc @jrajahalme ), and possibly others; this is only going back 8 days.
Here's an example failure from the v1.10 backports PR #17256 (link): I haven't dug too far yet, but this is a classic sign that the connectivity check is simply failing because the pods never become ready (which is effectively what the test does: deploy the pods and rely on readiness checks to validate the different paths between pods):
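For context, the "timed out waiting for the condition" message in the issue title is what `kubectl wait` prints on timeout, so the test's wait likely amounts to something like the following (a sketch; the namespace, selector, and timeout value are assumptions):

```yaml
# Hypothetical workflow step: block until all connectivity-check pods pass
# their readiness probes, or fail with "timed out waiting for the condition".
- name: Wait for connectivity-check pods
  run: |
    kubectl wait --for=condition=Ready pods --all \
      --namespace=default --timeout=300s
```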
No failure for the last 2 weeks except a legitimate one. Discussed offline; we can close this and re-open if there is any occurrence in the future.
CI failure
Some failures have occurred in the newly introduced GitHub Actions for the conformance test (added in #11888 #12232). They're fairly rare based on my observations so far, and a simple re-run fixes them (the good thing is that the re-run doesn't take long). https://github.com/cilium/cilium/runs/802697197?check_suite_focus=true
However, the developer experience is not pleasant. It would be helpful if we could dump pod logs in the GitHub Action for quick analysis.