Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run all ginkgo tests on GitHub actions #25713

Merged
merged 5 commits into from Jun 5, 2023
Merged

Run all ginkgo tests on GitHub actions #25713

merged 5 commits into from Jun 5, 2023

Conversation

aanm
Copy link
Member

@aanm aanm commented May 26, 2023

Description

The current CI setup runs all ginkgo tests on Jenkins. Some of the runtime tests were already transferred into Github actions #25516 and this PR is moving the missing pieces.

Testing

Example of some runs:

Pros vs Cons

Pros 馃帀

  • Instead of hours to wait for test feedback, the entire CI is executed in less than 20 minutes.
  • If a particular test is flaky, the developer can re-run the specific job for that test instead of re-running the entire suite. A good example just happened with https://github.com/cilium/cilium/actions/runs/5090956330 where attempt #1 failed with Kafka (also flaky on Jenkins) and the attempt #2 only had this Kafka test re-triggered while the remaining others were marked as successful.

Cons 馃槥

  • It might introduce some new flakes. This can be reassured by re-running the tests 10 times before this PR is merged.
  • Kind and GitHub don't have the same capabilities has a baremetal node and some changes had to be done:
    • Tests that need be disabled because Kind doesn't mount /proc/sys/net/core/default_qdisc from the host:
      • K8sDatapathBandwidthTest Checks Bandwidth Rate-Limiting Checks Pod to Pod bandwidth, direct routing
      • K8sDatapathBandwidthTest Checks Bandwidth Rate-Limiting Checks Pod to Pod bandwidth, geneve tunneling
      • K8sDatapathBandwidthTest Checks Bandwidth Rate-Limiting Checks Pod to Pod bandwidth, vxlan tunneling
    • Tests had to be modified to take into account that GitHub doesn't support IPv6 connectivity:
      • K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully
  • The oldest K8s version supported by kind is 1.19.x. We will need to bump our minimal tested version from 1.16 to 1.19.

Notes

Re-introduce the f-XX prefix?

The #25516 had a prefix (f-XX) for "focus" matrix dimension but it was removed
because it didn't make sense. However, it might make sense to introduce them in this PR because it's hard to navigate
through the web UI between jobs that have the same. For example: datapath-egress-tun-disabled-endpoint-routes- disabled, datapath-egress-tun-disabled-endpoint-routes-enabled, datapath-egress-tun-vxlan-endpoint-routes- disabled and datapath-egress-tun-vxlan-endpoint-routes-enabled.

@aanm aanm added area/CI Continuous Integration testing issue or flake kind/epic Features that are considered beyond amazing area/CI-improvement Topic or proposal to improve the Continuous Integration workflow dont-merge/preview-only Only for preview or testing, don't merge it. release-note/ci This PR makes changes to the CI. labels May 26, 2023
@aanm aanm requested review from a team as code owners May 26, 2023 13:37
@nbusseneau nbusseneau self-requested a review May 26, 2023 14:14
@aanm
Copy link
Member Author

aanm commented May 26, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully

Failure Output

FAIL: Expected

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/225/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

test/helpers/kubectl.go Outdated Show resolved Hide resolved
test/helpers/kubectl.go Outdated Show resolved Hide resolved
test/helpers/kubectl.go Show resolved Hide resolved
test/k8s/manifests/demo-customcalls.yaml Outdated Show resolved Hide resolved
@aanm
Copy link
Member Author

aanm commented May 26, 2023

/test

Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only looked at the contributing change, which is kind.sh - printing the version is fine :)

@aanm
Copy link
Member Author

aanm commented May 31, 2023

/test

Copy link
Contributor

@viktor-kurchenko viktor-kurchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

@nbusseneau nbusseneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, some questions below.

###
kafka: "K8sKafkaPolicyTest"
# K8sKafkaPolicyTest Kafka Policy Tests KafkaPolicies
###
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comments with test names will be a pain to maintain, do we really need them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is really useful to get an quick idea oh what tests are being covered by the focus regex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I think they will become obsolete very quickly as people will forget to add them when creating new tests, even though we're not supposed to it still happens xD

Comment on lines 555 to 474
case ${{ matrix.k8s-version }} in
1.27)
kube_image="kindest/node:v1.27.1@sha256:b7d12ed662b873bd8510879c1846e87c7e676a79fefc93e17b2a52989d3ff42b"
kernel="bpf-next-20230526.105339@sha256:4133d4e09b1e86ac175df8d899873180281bb4220dc43e2566c47b0241637411"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason for handling kind and kernel versions like this instead of with matrix env variables? I think the less bash processing the better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Having the kernel and kube_image as part of the matrix it so would be considered a two new dimensions in the matrix and increase the number of combinations to test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But can this not be handled via include, such that the matrix does not have 2 dimensions but the proper kernel env variables are inserted for the proper K8s version? 馃

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, also one of my first approaches :D After working with it for a couple days it was really frustrating to maintain the workflow https://github.com/cilium/cilium/actions/runs/4573014881/workflow#L175-L497

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline with @nbusseneau about this. I've tested out @nbusseneau proposal here and it works really well.

Unfortunately, GitHub workflow files lack a GitHub context that allows access to the job name during their execution. This presents a significant challenge when generating the JUnit filename because it is generated based on the job name and matrix name. On datastudio, we retrieve the job name for the GitHub API, which typically appears as something like "name": "E2E Test (1.26, datapath-bgp-custom-lrp)". However, the JUnit filename incorporates all matrix dimensions and looks like this: E2E Test (1.26, datapath-bgp-custom-lrp, 5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c).xml. This discrepancy makes it exceedingly difficult to establish a clear correlation between the JUnit filename and the job name. We need to know which JUnit file was generated by which GitHub job so that we can store it properly on datastudio.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because we set up the filenames for sysdump and JUnit with ${{ join(matrix.*, '-') }}:

./cilium-cli sysdump --output-filename cilium-sysdump-${{ join(matrix.*, '-') }}-final || true
...
junit_filename="${{ env.job_name }} (${{ join(matrix.*, ', ') }}).xml"

To get back to the same behaviour as initially proposed in the PR, we can use ${{ matrix.k8s-version }}-${{ matrix.focus }}:

./cilium-cli sysdump --output-filename cilium-sysdump-${{ matrix.k8s-version }}-${{ matrix.focus }}-final || true
...
junit_filename="${{ env.job_name }} (${{ matrix.k8s-version }}-${{ matrix.focus }}).xml"

Comment on lines 598 to 516
case ${{ matrix.focus }} in
agent-chaos)
focus="${{ env.agent-chaos }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a similar fashion, instead of matrix + env glued together with this bash processing, wouldn't it be more convenient to have matrix.focus handle this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bash hack is simply to give a better UX in terms of maintaining the page of focus that we run and visualizing the page of test results. I've tried that in the beginning and they looked like this https://github.com/cilium/cilium/actions/runs/4495666151

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, then I would take same approach as suggested in the discussion above with include.

Comment on lines +728 to +562
cid=$(docker create quay.io/cilium/cilium-cli-ci:latest ls)
docker cp $cid:/usr/local/bin/cilium ./cilium-cli
docker rm $cid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this :D

Any reason to use latest? We typically pin the CLI version in other workflows, and it's then handled by Renovate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not on me, I'm simply copy pasting what other workflows are doing 馃槄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair 馃槃

@aanm aanm force-pushed the pr/add-ginkgo-on-gha branch 2 times, most recently from c493d50 to 428dfaf Compare June 2, 2023 19:09
@joestringer
Copy link
Member

Nice work! I think the overall approach here where we roll out the new logic, get at least a week's worth of feedback, then disable the old approach is a reasonable way to mitigate the risks associated with migrating this infrastructure.

I glanced through the PR at a high level and it made sense to me, but I didn't go into any depth on specific changes.

I note that during the discussion during the community meeting, we raised that Jenkins will continue to run on older branches. This is fine for now. It may be worth exploring the backport to 1.12 and maybe 1.11, but probably not 1.10. Given the upcoming 1.14 release, that could provide a path to migrate entirely off Jenkins definitely this year if not in a couple of months.

As for the missing datapath tests, will you file issues to follow up on the reduction of coverage?

Re-introduce the f-XX prefix?

I think that having focus groups would be helpful to be able to easily narrow down groups of tests. I glanced at the four datapath groups that you listed in the PR description, and I couldn't tell the difference at first look. I think that failing tests may result in similar assumptions amongst readers unless we add some sort of numeric identifier for the groups. I'd be in favour of reintroducing that.

I'll note my approval here though I'll skip the GitHub codeowners review since I didn't look closely at the implementation.

aanm added 4 commits June 5, 2023 12:58
Not all cluster have the specific node names but can have the labels to
designated which node is which. We should use ExecInHostNetNSByLabel to
execute some commands inside a particular host by its label and not by
their name.

Signed-off-by: Andr茅 Martins <andre@cilium.io>
This The deploymentManager variable was not initialized on this file
which was causing a panic to happen if the CI was running focusing on
particular tests of this file.

stack trace of panic:
     github.com/cilium/cilium/test/ginkgo-ext.beforeEach.func1.1()
      	/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:443 +0x47
      panic({0x2318820, 0x3ecde60})
      	/opt/hostedtoolcache/go/1.20.4/x64/src/runtime/panic.go:884 +0x213
      github.com/cilium/cilium/test/k8s.DeployCiliumOptionsAndDNS(0x0, {0xc000580360?, 0x1000002a53f01?}, 0xc00206cc78?)
      	/home/runner/work/cilium/cilium/test/k8s/assertion_helpers.go:204 +0x27
      github.com/cilium/cilium/test/helpers.(*DeploymentManager).DeployCilium(0xc000291950, 0xc00193cf00?, 0x2704c57?)
      	/home/runner/work/cilium/cilium/test/helpers/manifest.go:305 +0x3c
      github.com/cilium/cilium/test/k8s.prepareHostPolicyEnforcement(0xc000964690)
      	/home/runner/work/cilium/cilium/test/k8s/datapath_configuration.go:581 +0xa5
      github.com/cilium/cilium/test/k8s.glob..func20.7.23.1()
      	/home/runner/work/cilium/cilium/test/k8s/services.go:628 +0xc6
      github.com/cilium/cilium/test/ginkgo-ext.BeforeAll.func1()
      	/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:195 +0x69
      github.com/cilium/cilium/test/ginkgo-ext.beforeEach.func1()
      	/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:447 +0xe6
      github.com/cilium/cilium/test/ginkgo-ext.applyAdvice.func1({0x3f5e378, 0x0, 0x0})
      	/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:575 +0x8a
      github.com/cilium/cilium/test.Test(0xc000031040)
      	/home/runner/work/cilium/cilium/test/test_suite_test.go:112 +0x484
      testing.tRunner(0xc000031040, 0x2825e08)
      	/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1576 +0x10b
      created by testing.(*T).Run
      	/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1629 +0x3ea

Signed-off-by: Andr茅 Martins <andre@cilium.io>
Some cluster configurations might not support any IPv6 connectivity
which makes some tests to fail. They were failing because cilium health
probe was unable to connect to the IPv6 IP addresses on Kind with
Kubernetes <= v1.19, which is the environments that we will disable
IPv6 on Cilium.

Signed-off-by: Andr茅 Martins <andre@cilium.io>
Some cluster don't have IPv6 connectivity to the outside would which
makes some tests to fail. To workaround this issue we should only run
the IPv6 connectivity to the outside by simulating with the help of a
node, when available, that is not running Cilium. Otherwise we will skip
the IPv6 connectivity test entirely.

Signed-off-by: Andr茅 Martins <andre@cilium.io>
@aanm
Copy link
Member Author

aanm commented Jun 5, 2023

/test

@aanm
Copy link
Member Author

aanm commented Jun 5, 2023

/test

@aanm
Copy link
Member Author

aanm commented Jun 5, 2023

/test

This commit adds the ability to run the ginkgo tests on GH actions.
Running the ginkgo tests on GH actions will allow us to a better
parallelization of the test runs and smaller queues which results in
faster test feedback loop for developers.

The changes performed in the test/ directory were necessary so that we
can setup the infrastructure in GitHub actions.

Unfortunately GitHub does not have IPv6 connectivity and some tests
had to be modified to take that into account by simulating an external
host with IPv6 support.

Also, because kind doesn't mount "/proc/sys/net/core/default_qdisc"
we can't test bandwidth manager tests.

Signed-off-by: Andr茅 Martins <andre@cilium.io>
@aanm aanm added dont-merge/preview-only Only for preview or testing, don't merge it. and removed dont-merge/preview-only Only for preview or testing, don't merge it. labels Jun 5, 2023
@aanm aanm removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Jun 5, 2023
@aanm
Copy link
Member Author

aanm commented Jun 5, 2023

Last tests were done with the commit from main branch afb6b2b executed in here: https://github.com/cilium/cilium/actions/runs/5176732237

All tests were passing, I had to made some changes into the workflow file that doesn't require a new /test run. The diff can be found out here

image

Merging. Thank for the comments / reviews.

@aanm aanm merged commit a81ca1c into main Jun 5, 2023
42 of 43 checks passed
@aanm aanm deleted the pr/add-ginkgo-on-gha branch June 5, 2023 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake area/CI-improvement Topic or proposal to improve the Continuous Integration workflow kind/epic Features that are considered beyond amazing release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants