[sig-apps] Conformance tests being skipped. #785
Comments
https://github.com/cncf/k8s-conformance/blob/master/instructions.md Deploy a Sonobuoy pod to your cluster with:
NOTE: The --mode=certified-conformance flag is required for certification runs since Kubernetes v1.16 (and Sonobuoy v0.16). Without this flag, tests which may be disruptive to your other workloads may be skipped. A valid certification run may not skip any conformance tests. If you're setting the test focus/skip values manually, certification runs require E2E_FOCUS=[Conformance] and no value for E2E_SKIP. |
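To make the focus/skip point concrete, here is a minimal sketch of how a skip pattern can silently drop conformance tests from a run. It assumes focus and skip are applied as regular expressions against the full test name; the test names and the skip pattern below are made up for illustration and are not taken from the real suite.

```python
import re


def select_tests(names, focus, skip=""):
    """Split the tests matching `focus` into (selected, skipped).

    A test is skipped when it matches `focus` but also matches `skip`.
    An empty `skip` means nothing is skipped.
    """
    focus_re = re.compile(focus)
    skip_re = re.compile(skip) if skip else None
    selected, skipped = [], []
    for name in names:
        if not focus_re.search(name):
            continue  # not part of the focused set at all
        if skip_re and skip_re.search(name):
            skipped.append(name)
        else:
            selected.append(name)
    return selected, skipped


# Hypothetical test names, for illustration only.
TESTS = [
    "[sig-apps] StatefulSet rolling update [Conformance]",
    "[sig-apps] Daemon set rollback [Serial] [Conformance]",
    "[sig-network] Services session affinity [Disruptive] [Conformance]",
    "[sig-storage] Volume expansion [Slow]",
]

# Brackets are escaped so they match literally (assumption: regex matching).
FOCUS = r"\[Conformance\]"

# Certification-style run: focus on Conformance, no skip value.
certified, certified_skipped = select_tests(TESTS, FOCUS)

# Run with a (hypothetical) skip pattern: conformance tests get dropped.
partial, partial_skipped = select_tests(TESTS, FOCUS, skip=r"\[Serial\]|\[Disruptive\]")

print(len(certified), len(certified_skipped))
print(len(partial), len(partial_skipped))
```

With an empty skip value every conformance test is selected; with the skip pattern, two of the three conformance tests never run, which is the situation a certification run must avoid.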
So, these tests are now passing for me :), so it might have been flakiness. [[[ EDIT: see below, 3 of the 4 actually still fail ]]] |
Regardless, I think the antrea e2e test suite should be validating against a complete certified-conformance test run. @McCodeman |
Yup, I'll be working with @antoninbas to figure out why we skip the others and get this sorted. I will spend more time characterizing these failures and confirming whether they are related to my infra, antrea, or both. I have kicked off a new full test with new infra, new CIDRs, and so on, and will see what the results are. |
We run a select subset of conformance tests for every PR, and we try to keep the time it takes to run the tests to around 20 minutes. If running the full suite takes much longer on our infrastructure, we will probably not run it for every PR, but we can add a separate daily Jenkins job. |
Hi folks. Ok... I confirmed 3 out of the 4 tests I originally reported break pretty consistently..... here are results from a I think antonin, maybe a fast path forward would be (1) As a first priority, specifically adding a (2) We can work on adding a long-term full conformance suite job as well which runs nightly.
|
To go into further detail here, it seems like what's happening is that these StatefulSet health checks, which I think involve node->pod connectivity, don't turn green even after certain image updates. I'm not sure how this could be antrea's fault, but I am digging more now.
|
@jayunit100 thanks for finding this and digging into it!
This error indicated that GARPs were not sent. I think it may explain why the tests were flaky: only once enough tests had run in a cluster, so that some IPs were reused before the Node's ARP cache entries expired, would the Node fail to reach those Pods. antrea-agent is supposed to send the GARPs, but I think the routine was broken when those methods were refactored for Windows support in 0.7.0: 8bd2df8. |
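The stale-ARP explanation above can be sketched with a toy model. This is not antrea-agent code: the class and function names are hypothetical, cache expiry is deliberately not modeled, and the IP and MAC values are made up.

```python
class NodeArpCache:
    """Toy ARP cache on a Node: maps a Pod IP to the MAC it last learned."""

    def __init__(self):
        self.entries = {}

    def learn(self, ip, mac):
        self.entries[ip] = mac

    def resolve(self, ip):
        return self.entries.get(ip)


def node_can_reach(cache, ip, actual_mac):
    """The Node reaches the Pod only if its cached MAC matches the real one.

    With no cached entry, the Node would send an ARP request and learn
    the correct MAC, so the first contact succeeds.
    """
    cached = cache.resolve(ip)
    if cached is None:
        cache.learn(ip, actual_mac)
        return True
    return cached == actual_mac


def start_pod(cache, ip, mac, send_garp):
    """Bring up a Pod; a gratuitous ARP overwrites the Node's stale entry."""
    if send_garp:
        cache.learn(ip, mac)


def reuse_ip(send_garp):
    """Pod A uses an IP, is deleted, and Pod B reuses it before expiry."""
    cache = NodeArpCache()
    ip = "10.0.0.5"
    start_pod(cache, ip, "aa:aa:aa:aa:aa:aa", send_garp)
    first = node_can_reach(cache, ip, "aa:aa:aa:aa:aa:aa")
    # Pod A goes away; Pod B gets the same IP with a different MAC.
    start_pod(cache, ip, "bb:bb:bb:bb:bb:bb", send_garp)
    second = node_can_reach(cache, ip, "bb:bb:bb:bb:bb:bb")
    return first, second


print("without GARP:", reuse_ip(send_garp=False))
print("with GARP:   ", reuse_ip(send_garp=True))
```

In this model the first Pod is always reachable, and the failure only shows up once an IP is reused, which is consistent with the tests looking flaky rather than failing outright.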
great awesome thanks tnqn. Keep me posted and let me know how i can help ! |
@tnqn on a local cluster, can you run ....
That should quickly give you an indication. Alternatively, just point me at a dockerhub image to test and I'll swap it out in one of my clusters. |
@jayunit100 Appreciate your help! |
I have run
Perhaps it's due to a kernel difference, with the newer kernel handling the ARP cache more robustly. @jayunit100 if you get time to run the tests in your cluster, please apply the latest yaml directly. The "latest" image has the fix. |
0.7.1 definitely still fails for me. I'll try applying the master yaml. |
Moved detailed comments to the other issue above. There definitely seems to be a real bug here around StatefulSet IPs and restarts. |
The failure only appeared when using containerd as the CRI, and not docker, which is why @antoninbas and I couldn't reproduce it locally. With docker as the CRI, the previous Pod was deleted quickly and two CNI Del calls were made before the new Pod's CNI Add call, so everything was fine.
With containerd as the CRI, there were multiple CNI Del calls, the second of which could arrive after the new Pod's CNI Add call. antrea-agent then deleted the network interface whose name was computed from Pod namespace + Pod name, which caused the networking issue.
It might not be a new issue in 0.7 and might apply to previous versions when running with containerd. |
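The ordering problem described above can be replayed in a small simulation. This is only a sketch under stated assumptions: the agent classes, the interface naming scheme, and the container IDs are hypothetical, and the guard shown is one possible mitigation, not necessarily what the eventual fix does.

```python
def interface_name(namespace, pod_name):
    # Hypothetical naming scheme; the thread only says the name is
    # computed from Pod namespace + Pod name.
    return f"{namespace}-{pod_name}"


class NaiveAgent:
    """Deletes by interface name only, ignoring which container asks."""

    def __init__(self):
        self.interfaces = {}  # interface name -> owning container ID

    def cni_add(self, namespace, pod_name, container_id):
        self.interfaces[interface_name(namespace, pod_name)] = container_id

    def cni_del(self, namespace, pod_name, container_id):
        self.interfaces.pop(interface_name(namespace, pod_name), None)


class GuardedAgent(NaiveAgent):
    """Deletes only when the interface still belongs to the caller."""

    def cni_del(self, namespace, pod_name, container_id):
        name = interface_name(namespace, pod_name)
        if self.interfaces.get(name) == container_id:
            del self.interfaces[name]


def replay(agent, events):
    """Apply (operation, namespace, pod, container ID) events in order."""
    for op, namespace, pod_name, container_id in events:
        getattr(agent, op)(namespace, pod_name, container_id)
    return dict(agent.interfaces)


# A restarted StatefulSet Pod keeps its name, so both containers map to
# the same interface name.
POD = ("default", "web-0")

# Both Del calls for the old container land before the new Add.
DOCKER_ORDER = [
    ("cni_add", *POD, "old"),
    ("cni_del", *POD, "old"),
    ("cni_del", *POD, "old"),
    ("cni_add", *POD, "new"),
]

# The second Del for the old container lands after the new Add.
CONTAINERD_ORDER = [
    ("cni_add", *POD, "old"),
    ("cni_del", *POD, "old"),
    ("cni_add", *POD, "new"),
    ("cni_del", *POD, "old"),
]

print("naive, docker order:      ", replay(NaiveAgent(), DOCKER_ORDER))
print("naive, containerd order:  ", replay(NaiveAgent(), CONTAINERD_ORDER))
print("guarded, containerd order:", replay(GuardedAgent(), CONTAINERD_ORDER))
```

In the containerd ordering, the naive agent ends up with no interface for the new Pod, while the guarded agent ignores the late Del because the interface now belongs to a different container.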
Ok, gotcha. Is there a quick fix for this? Maybe some duct tape we can put in for 0.7.2, which can be fixed more elegantly later? |
@jayunit100 I'm working on it and will update you once we have a proper fix. |
It looks like #827 will fix the bug underlying the motivation for this issue, but I guess we should leave this issue open until conformance is running nightly. |
I have updated "Fixes" to "For" so it won't close this one. |
Describe the bug
It looks like the upstream Conformance test suites need some looking into: some tests fail (most pass) when running the full suite.
These might be new failures because, I guess, we are skipping sig-apps tests in CI...?
This one seems like it might be flaky in certain antrea clusters, but I'm not sure.
To Reproduce
sonobuoy run --e2e-focus "Conformance" --wait=600 --plugin e2e
Expected
Conformance should pass :)
Actual behavior
The above 4 tests pass.
Versions:
0.7.0