
ci: github actions job to run kubernetes upstream conformance tests #25913

Merged
merged 1 commit into from Jun 14, 2023

Conversation

@aojea (Contributor) commented Jun 5, 2023

Use kind to run the kubernetes e2e network policies jobs

This basically duplicates the existing job .github/workflows/conformance-k8s-kind.yaml and modifies the regex to also run the network policy tests.

I don't recommend running the network policy tests alone, because most issues are discovered when running them alongside other tests: since network policies should not impact those tests, a problem in the implementation commonly shows up as unrelated tests flaking.
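As a rough illustration, the duplicated job could look like the sketch below. The job name, step names, and the exact focus regex here are hypothetical and not copied from the real workflow file:

```yaml
# Hypothetical sketch of a duplicated conformance job; names and the
# focus regex are illustrative, not taken from the actual workflow.
jobs:
  kubernetes-e2e-net-conformance:
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - name: Create kind cluster
        run: kind create cluster --config=.github/kind-config.yaml
      - name: Install Cilium
        run: cilium install --wait
      - name: Run upstream e2e tests
        run: |
          # Widen the regex so NetworkPolicy specs run alongside the
          # conformance specs instead of in isolation.
          ./e2e.test \
            --ginkgo.focus='\[Conformance\]|\[Feature:NetworkPolicy\]' \
            --ginkgo.skip='\[Serial\]' \
            -kubeconfig "${HOME}/.kube/config"
```

Running the NetworkPolicy specs in the same job as conformance matches the recommendation above: policy bugs then surface as flakes in otherwise unrelated specs.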

@aojea aojea requested review from a team as code owners June 5, 2023 19:30
@aojea aojea requested a review from brlbil June 5, 2023 19:30
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 5, 2023
@aojea (Contributor Author) commented Jun 5, 2023

/assign @aanm

let's wait for the results of the CI

@christarazi (Member) left a comment

Thanks for the PR.

One comment about making the regex for the test selection equivalent to what we already have in the tree.
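For context on how such a selection regex behaves: a Ginkgo focus expression is an ordinary extended regex matched against the full spec name. A minimal sketch (the regex and spec names below are made-up examples, not the ones from the tree):

```shell
# Illustrative only: how a Ginkgo-style focus regex selects specs by name.
# The regex and the spec names are hypothetical examples.
FOCUS='\[Conformance\]|\[Feature:NetworkPolicy\]'

printf '%s\n' \
  '[sig-network] NetworkPolicyLegacy should deny ingress [Feature:NetworkPolicy]' \
  '[sig-network] Services should serve a basic endpoint [Conformance]' \
  '[sig-storage] PersistentVolumes should bind' \
  | grep -E "$FOCUS"
# prints only the first two spec names
```

Keeping the regex identical to the in-tree one means the GitHub Actions job and the Jenkins job select exactly the same set of specs.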

@aanm (Member) commented Jun 6, 2023

@aojea (Contributor Author) commented Jun 6, 2023

it seems the auto-commit does not handle the end of line well 😄, repushed

@aanm (Member) commented Jun 6, 2023

> it seems the auto commit does not handle well the end of line 😄, repushed

@aojea great! It looks like the tests are now failing in a similar way to the Jenkins build https://github.com/cilium/cilium/commit/894b921a90c9173e55594efffbfa9f9abcdfe946/checks/14054603389/logs

@christarazi christarazi added area/CI Continuous Integration testing issue or flake area/CI-improvement Topic or proposal to improve the Continuous Integration workflow release-note/ci This PR makes changes to the CI. labels Jun 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 6, 2023
@christarazi (Member) left a comment

LGTM now that the job is running the same tests as the Jenkins job.

@christarazi (Member) commented Jun 6, 2023

Just an FYI: if you put a release-note in the PR description, our release tooling will pick that up and use it as the release note in the changelog. Otherwise, if there is no release-note, the PR title is used.

I've removed the release-note as you filled it with "NONE".

@aojea (Contributor Author) commented Jun 7, 2023

Interesting, 8 failures:

2023-06-06T21:24:19.9706251Z Summarizing 8 Failures:
2023-06-06T21:24:19.9709074Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should deny ingress access to updated pod [Feature:NetworkPolicy]
2023-06-06T21:24:19.9709977Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9711533Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should enforce policy to allow traffic from pods within server namespace based on PodSelector [Feature:NetworkPolicy]
2023-06-06T21:24:19.9712405Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9861231Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should enforce policy based on PodSelector or NamespaceSelector [Feature:NetworkPolicy]
2023-06-06T21:24:19.9862118Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9863201Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should work with Ingress,Egress specified together [Feature:NetworkPolicy]
2023-06-06T21:24:19.9863802Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9864649Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy]
2023-06-06T21:24:19.9865636Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9866368Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should allow ingress access from namespace on one named port [Feature:NetworkPolicy]
2023-06-06T21:24:19.9866938Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9867672Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should enforce policy based on PodSelector and NamespaceSelector [Feature:NetworkPolicy]
2023-06-06T21:24:19.9868260Z   test/e2e/network/netpol/network_legacy.go:1944
2023-06-06T21:24:19.9868961Z   [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [BeforeEach] should allow ingress access from updated namespace [Feature:NetworkPolicy]
2023-06-06T21:24:19.9869514Z   test/e2e/network/netpol/network_legacy.go:1944

2023-06-06T21:24:19.9962735Z Ran 91 of 7207 Specs in 1894.227 seconds
2023-06-06T21:24:19.9964252Z FAIL! -- 83 Passed | 8 Failed | 0 Pending | 7116 Skipped

@aanm the Jenkins jobs I saw failing were stuck without progressing; here the tests are actually failing. Also, I would not put much trust in these network policy tests in 1.23; we did a big refactor later and added CI to prevent regressions.

Coming back to the existing failures

2023-06-06T21:18:41.7185142Z   [FAILED] Pod client-can-connect-81-rf56h should be able to connect to service svc-server, but was not able to connect.
2023-06-06T21:18:41.7185494Z   Pod logs:
2023-06-06T21:18:41.7185759Z   OTHER: dial tcp 10.96.124.65:81: connect: operation not permitted
2023-06-06T21:18:41.7186092Z   OTHER: dial tcp 10.96.124.65:81: connect: operation not permitted
2023-06-06T21:18:41.7186400Z   OTHER: dial tcp 10.96.124.65:81: connect: operation not permitted
2023-06-06T21:18:41.7186726Z   OTHER: dial tcp 10.96.124.65:81: connect: operation not permitted
2023-06-06T21:18:41.7187046Z   OTHER: dial tcp 10.96.124.65:81: connect: operation not permitted

That does not look right.

Also, the fact that the probes are not working :/

2023-06-06T21:18:41.6826368Z     Warning  Unhealthy  4m54s (x6 over 5m36s)  kubelet            Readiness probe failed: command "/agnhost connect --protocol=tcp --timeout=1s 127.0.0.1:81" timed out
2023-06-06T21:18:41.6827163Z     Warning  Unhealthy  2m26s (x6 over 5m33s)  kubelet            Readiness probe failed: command "/agnhost connect --protocol=tcp --timeout=1s 127.0.0.1:80" timed out
2023-06-06T21:18:41.6827527Z 
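For context, the upstream tests gate readiness with an exec probe on the agnhost image, roughly like the sketch below (a hedged reconstruction; the actual pod spec lives in the upstream e2e framework and the field values here are illustrative):

```yaml
# Sketch of the kind of exec readiness probe the upstream tests use;
# field values are illustrative, not copied from the e2e framework.
containers:
  - name: server-container-80
    image: registry.k8s.io/e2e-test-images/agnhost:2.43
    args: ["porter"]
    readinessProbe:
      exec:
        command: ["/agnhost", "connect", "--protocol=tcp",
                  "--timeout=1s", "127.0.0.1:80"]
      periodSeconds: 5
      timeoutSeconds: 2
```

Since the probe dials the container's own port over loopback, a timeout here suggests the server container itself is not answering, rather than being blocked by a policy on the pod's external interface.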

@aojea (Contributor Author) commented Jun 7, 2023

I'm going to repush to get a diff and check whether the test failures are random or consistent.

@maintainer-s-little-helper

Commit 637ca2ec99062897e9bb4d385bdfd1e56b6444b3 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 7, 2023
@aojea (Contributor Author) commented Jun 7, 2023

This is much better now, only one failure

2023-06-07T10:24:28.8049739Z [FAIL] [sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client [It] should deny ingress access to updated pod [Feature:NetworkPolicy]

and it seems related to #24361

2023-06-07T10:02:41.6342876Z     Type     Reason                  Age   From               Message
2023-06-07T10:02:41.6343356Z     ----     ------                  ----  ----               -------
2023-06-07T10:02:41.6344119Z     Normal   Scheduled               81s   default-scheduler  Successfully assigned network-policy-817/server-fq7xh to cilium-testing-worker
2023-06-07T10:02:41.6345756Z     Warning  FailedCreatePodSandBox  63s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f8779e3c2b2e8d5c3ac1c9bd1d2d0e0026ea9d090e75fb2643436a1badf12de9": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
2023-06-07T10:02:41.6347030Z     Normal   Pulled                  51s   kubelet            Container image "registry.k8s.io/e2e-test-images/agnhost:2.43" already present on machine
2023-06-07T10:02:41.6373329Z     Normal   Created                 51s   kubelet            Created container server-container-80
2023-06-07T10:02:41.6374923Z     Normal   Started                 51s   kubelet            Started container server-container-80
2023-06-07T10:02:41.6376503Z     Normal   Pulled                  51s   kubelet            Container image "registry.k8s.io/e2e-test-images/agnhost:2.43" already present on machine
2023-06-07T10:02:41.6377906Z     Normal   Created                 51s   kubelet            Created container server-container-81
2023-06-07T10:02:41.6378938Z     Normal   Started                 50s   kubelet            Started container server-container-81

cc: @squeed @joestringer

These tests may be very pod-intensive.

@aanm (Member) commented Jun 7, 2023

@aojea where do you see the failure? It seems that it passed, no? https://github.com/cilium/cilium/actions/runs/5202623988/jobs/9384471810?pr=25913

I see; the failure that you just saw was fixed by my suggestion here.

@aojea (Contributor Author) commented Jun 7, 2023

> @aojea where do you see the failure? It seems that it passed, no? https://github.com/cilium/cilium/actions/runs/5202623988/jobs/9384471810?pr=25913
>
> I see, that failure that you just saw was fixed by my suggestion here

oh, I may have had a stale view, thanks for noticing

@nathanjsweet nathanjsweet changed the title ci: github actions job to run kubernetes network policies ci: github actions job to run kubernetes upstream conformance tests Jun 8, 2023
@christarazi (Member) commented
Could you update the commit msg to reflect the changes that Nate suggested?

@aojea (Contributor Author) commented Jun 9, 2023

> Could you update the commit msg to reflect the changes that Nate suggested?

He only mentioned changing the names of the jobs, so I assume that is what you are referring to. I updated the commit message.

@aojea (Contributor Author) commented Jun 9, 2023

ConformanceK8sKind / kubernetes-e2e (ipv4) (pull_request) Failing after 19m

failure will be fixed in next kubernetes minor release kubernetes/kubernetes#118281

K8sUpstreamNetConformance / kubernetes-e2e-net-conformance (ipv4) (pull_request) Successful in 33m

it passed, but I want to get the logs when it fails; it seems it usually runs in 33 minutes, but when it fails it sometimes times out at 45 minutes 👀

@maintainer-s-little-helper

Commit 407361bad3691df40d06cb87b05c315852ee789a does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 9, 2023

@aojea (Contributor Author) commented Jun 9, 2023

it passed again :(

K8sUpstreamNetConformance / kubernetes-e2e-net-conformance (ipv4) (pull_request) Successful in 30m

pushing another time

Use kind to run the kubernetes upstream e2e tests for
network policies

Signed-off-by: Antonio Ojea <aojea@google.com>
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 9, 2023
@aojea (Contributor Author) commented Jun 9, 2023

We need to analyze this last failure; something is wrong there, but I can't figure out what.

copying from slack https://cilium.slack.com/archives/C2B917YHE/p1686328834976569?thread_ts=1685593436.669409&cid=C2B917YHE

I need your brain here
Job https://github.com/cilium/cilium/actions/runs/5222022497/jobs/9427006135
look for the test failing in network-policy-4339 namespace https://pipelines.actions.githubusercontent.com/serviceHosts/a38d642a-0bf8-46af-9d36-efb1e831862[…]6jRElXwYx5JEvG5aQFR9TT%2B8jJTEHnmLoap2%2FH3VYg%3D
the client-can-connect-80-8vddp on node cilium-testing-worker2
The test starts at 12:57:03.854 and times out at 13:02:47.974; it fails because the pod is not ready and shows the following errors:

dial tcp 10.96.239.95:80: connect: operation not permitted

If you check the kubelet logs on cilium-testing-worker2, kubelet considers the pod running at:

Jun 09 12:57:24 cilium-testing-worker2 kubelet[211]: I0609 12:57:24.171776     211 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="network-policy-4339/client-can-connect-80-8vddp" podStartSLOduration=21.171742317 podCreationTimestamp="2023-06-09 12:57:03 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2023-06-09 12:57:24.168398755 +0000 UTC m=+544.604540956" watchObservedRunningTime="2023-06-09 12:57:24.171742317 +0000 UTC m=+544.607884518"
but the probes keep failing, so it is not ready; the containerd logs are the part I don't fully understand:
Jun 09 12:57:14 cilium-testing-worker2 containerd[107]: time="2023-06-09T12:57:14.474258743Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:client-can-connect-80-8vddp,Uid:ae7ca6c8-b4cc-43de-9109-2ef132998b93,Namespace:network-policy-4339,Attempt:0,} returns sandbox id \"971d0493392228aa56c6b4f06c721a54b129150aa534407ccc54a0472b6a26ec\""

it tears down the network and never retries

Jun 09 12:57:44 cilium-testing-worker2 containerd[107]: time="2023-06-09T12:57:44.657826980Z" level=info msg="TearDown network for sandbox \"971d0493392228aa56c6b4f06c721a54b129150aa534407ccc54a0472b6a26ec\" successfully"
Jun 09 12:57:44 cilium-testing-worker2 containerd[107]: time="2023-06-09T12:57:44.657871881Z" level=info msg="StopPodSandbox for \"971d0493392228aa56c6b4f06c721a54b129150aa534407ccc54a0472b6a26ec\" returns successfully"

Why does it never retry? I can't fully understand where the problem is between kubelet -> containerd -> CNI -> cilium-cni, but it seems one of those paths gets lost and the pod is left in limbo without network, yet considered running (the probes fail, so it is not ready).

@joestringer joestringer added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Jun 12, 2023
@aanm (Member) commented Jun 14, 2023

We will be merging this PR so that we get enough information about the test runs for this job.

Later on we can mark this job as required.

@aanm aanm merged commit f0a8e72 into cilium:main Jun 14, 2023
50 checks passed
Labels
area/CI Continuous Integration testing issue or flake area/CI-improvement Topic or proposal to improve the Continuous Integration workflow release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/ci This PR makes changes to the CI.