ci: stabilize workflows #10457

lenaschoenburg · 2022-09-22T16:27:58Z

This PR combines multiple improvements and has two goals:

Stabilize CI
Improve maintainability

To achieve the first goal of stabilizing CI:

Use 16-core self-hosted runners. These seem to be more reliable and are less contended.
Disable Testcontainers Cloud. This removes one potential source of flakiness.
Run far fewer jobs overall. Every job had the potential to break due to external dependencies, so having fewer jobs decreases the chance of hitting those failures.
Run unit tests on self-hosted runners, where we can use our own caching Maven repository.
Fail faster. Timeouts are decreased, previous workflow runs are cancelled and integration tests fail as soon as one test fails. This frees up resources and ensures that one failing PR doesn't block other PRs for too long.
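
As a rough illustration of the fail-faster settings listed above, a workflow could combine a concurrency group with a job timeout. This is a minimal sketch, not the configuration from this PR: the workflow and job names, the group expression, the timeout value, the runner label and the Maven command are all assumptions.

```yaml
# Hypothetical sketch: names and values are illustrative, not taken from this PR.
name: Tests
on: [pull_request]

# Cancel an in-progress run of the same workflow on the same ref
# as soon as a newer run starts.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-tests:
    runs-on: self-hosted   # assumed label; real runner labels may differ
    timeout-minutes: 20    # assumed value; the job fails instead of hanging
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B verify # placeholder command
```

Keying the group on both the workflow name and the ref means a new push to one branch cancels only the older run for that same branch and leaves other PRs alone.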

To improve maintainability:

Merge code quality and test workflows. These used the same triggers anyway and it's nice to have all required checks in one file.
Run all unit tests in one job. This makes looking for failing checks in the GH UI easier.
Merge multiple integration test jobs. Figuring out which tests ran where was tricky so now we just have one job that runs integration tests.

This has a couple of drawbacks:

Time to failure increases: Previously some test failures were found in 1-2 minutes, now we need to wait for all unit tests (>15 minutes) before checks fail.
Workflow run time increased by about 3 minutes, primarily coming from the unit test job.
Less granular retry.
Increased chance of failure due to using 16-core runners for integration tests and not using Testcontainers Cloud.

megglos · 2022-09-22T16:51:05Z

.github/workflows/test.yml

      - name: Create build output log file
        run: echo "BUILD_OUTPUT_FILE_PATH=$(mktemp)" >> $GITHUB_ENV
      - name: Maven Test Build
        run: >
          mvn -B --no-snapshot-updates
          -D skipITs -D skipChecks
-          -pl ${{ matrix.project }}
+          -D junitThreadCount=16


Assuming there are Maven modules that can run in parallel, would adding -T1C, or a less eager static value such as 4 or 8, be worth it for the speed-up?

Maybe, yeah 👍 I've tried to max out CPU via JUnit first, but RandomizedRaftTest takes 5 minutes and blocks progress for all the others.

Done, it removed around 3 minutes
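
For reference, Maven's `-T` option controls how many threads the build uses across modules: `-T 1C` means one thread per CPU core, while `-T 4` is a fixed four threads. A sketch of the step with a fixed value might look like the following. The value 2 mirrors the commit "ci: test up to two modules at a time when running unit tests", but the exact flag used there is not shown in this thread, and the `verify` goal is an assumption because the diff excerpt does not include the goal.

```yaml
# Hypothetical sketch: "-T 2" and the "verify" goal are assumptions.
# The other flags are copied from the diff excerpt above.
- name: Maven Test Build
  run: >
    mvn -B --no-snapshot-updates
    -T 2
    -D skipITs -D skipChecks
    -D junitThreadCount=16
    verify
```

A small fixed value keeps module-level parallelism from competing too heavily with the 16 JUnit threads inside each module.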

github-actions · 2022-09-22T17:54:52Z

Test Results

  866 files   866 suites 46m 44s ⏱️
6 535 tests 6 525 ✔️ 10 💤 0 ❌
6 723 runs 6 713 ✔️ 10 💤 0 ❌

Results for commit f766cf4.

♻️ This comment has been updated with latest results.

npepinpe · 2022-09-23T08:07:35Z

Can we have a short chat in the TR about solutions? Generally some of these are good ideas, but I have some ideas/suggestions (e.g. split unit tests across a fixed number of groups), and questions (e.g. why is Testcontainers Cloud a source of flakiness?).

Zelldon · 2022-09-23T08:10:44Z

Some potential alternatives:

For now we could run bors with Jenkins again (657be1f).
We have self-hosted nodes in our k8s cluster.

Modifies the concurrency of the code quality, test, and deploy workflows. The deploy workflow is sequenced so that every deploy is run only after the previous push's deploy run has finished. This avoids concurrent deployments, which could overwrite each other or mess up the ordering, and ensures the latest SNAPSHOT comes from the latest successful commit. The other workflows cancel their previous runs when a new one starts, so that only one run is active at a time. Hopefully this cuts down on resource usage and perhaps auto-scaling issues.
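
The sequenced deploy behaviour described here can be approximated with a concurrency group that does not cancel running jobs. This is a sketch under assumptions, not the configuration from the referenced change; the group name is hypothetical.

```yaml
# Hypothetical sketch for a deploy workflow; the group name is an assumption.
# With cancel-in-progress set to false, a new run waits for the run that is
# currently in progress in the same group instead of cancelling it.
concurrency:
  group: deploy
  cancel-in-progress: false
```

One caveat: GitHub Actions keeps at most one running and one pending run per group, so when several pushes arrive during a deploy, only the newest pending run is kept. That still leaves the latest commit deployed last, but intermediate commits may be skipped.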

npepinpe

yolo

megglos

looks good to me let's see it in action 🤞

lenaschoenburg · 2022-09-23T13:27:50Z

bors r+

deepthidevaki · 2022-09-23T15:51:29Z

bors r-

already merged

zeebe-bors-camunda · 2022-09-23T15:51:32Z

Canceled.

lenaschoenburg force-pushed the os-merge-unit-tests branch from b0289c5 to 8376f28 Compare September 22, 2022 16:28

megglos reviewed Sep 22, 2022

lenaschoenburg added 4 commits September 23, 2022 08:07

ci: merge unit test jobs

f748a8e

ci: merge integration test jobs

57d730e

ci: use nexus cache for unit tests

b443175

ci: run randomized raft tests together with other property tests

e84b643

lenaschoenburg force-pushed the os-merge-unit-tests branch from 20a7a67 to 0640208 Compare September 23, 2022 06:07

lenaschoenburg added 3 commits September 23, 2022 08:26

ci: skip random tests

808e0b2

ci: disable Testcontainers Cloud

c90f8f8

ci: regenerate event file on retry

acf1bf2

lenaschoenburg force-pushed the os-merge-unit-tests branch from 0640208 to cfffca1 Compare September 23, 2022 06:41

lenaschoenburg changed the title from "ci: merge test jobs" to "ci: stabilize workflows" Sep 23, 2022

lenaschoenburg marked this pull request as ready for review September 23, 2022 07:51

lenaschoenburg requested review from megglos and npepinpe September 23, 2022 07:51

megglos mentioned this pull request Sep 23, 2022

chore(ci): use 16 core pool for integration test run #10453

Closed

10 tasks

This was referenced Sep 23, 2022

build(.github): configure Maven for CI timeouts #10452

Closed

build(.github): manage workflow concurrency #10456

Closed

lenaschoenburg force-pushed the os-merge-unit-tests branch 5 times, most recently from 87c6954 to efda33a Compare September 23, 2022 12:36

npepinpe and others added 4 commits September 23, 2022 14:56

ci: merge code quality workflow into test workflow

3f49e06

ci: skip integration tests after one failure

8c1e936

ci: try using 16 core nodes for integration tests

276c7be

lenaschoenburg added 4 commits September 23, 2022 14:56

ci: test up to two modules at a time when running unit tests

f14cace

ci: use self-hosted runners

0f5990c

fix: use port from SocketUtil

2cc70d4

fix: set port for container engine

f766cf4

lenaschoenburg force-pushed the os-merge-unit-tests branch from efda33a to f766cf4 Compare September 23, 2022 13:00

npepinpe approved these changes Sep 23, 2022

megglos approved these changes Sep 23, 2022

lenaschoenburg merged commit 67da38b into main Sep 23, 2022

lenaschoenburg deleted the os-merge-unit-tests branch September 23, 2022 13:29

Zelldon added the version:8.1.0 label (marks an issue as being completely or in parts released in 8.1.0) Oct 4, 2022
