Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: stabilize workflows #10457

Merged
merged 15 commits into from
Sep 23, 2022
Merged

ci: stabilize workflows #10457

merged 15 commits into from
Sep 23, 2022

Conversation

lenaschoenburg
Copy link
Member

@lenaschoenburg lenaschoenburg commented Sep 22, 2022

This PR combines multiple improvements and has two goals:

  1. Stabilize CI
  2. Improve maintainability

To achieve the first goal of stabilizing CI:

  1. Use 16-core self-hosted runners. These seem to be more reliable and are less contended.
  2. Disable Testcontainer Cloud. This removes one potential source of flakiness.
  3. Run much less jobs overall. Every job had the potential to break due to external dependencies so having less jobs decreases the chance to encounter those.
  4. Run unit tests on self hosted runners where we can use our own caching maven repository.
  5. Fail faster. Timeouts are decreased, previous workflow runs are cancelled and integration tests fail as soon as one test fails. This frees up resources and ensures that one failing PR doesn't block other PRs for too long.

To improve maintainability:

  1. Merge code quality and test workflows. These used the same triggers anyway and it's nice to have all required checks in one file.
  2. Run all unit tests in one job. This makes looking for failing checks in the GH UI easier
  3. Merge multiple integration test jobs. Figuring out which tests ran where was tricky so now we just have one job that runs integration tests.

This has a couple of drawbacks:

  1. Time to failure increases: Previously some test failures were found in 1-2 minutes, now we need to wait for all unit tests (>15 minutes) before checks fail.
  2. Workflow run time increased by about 3 minutes, primarily coming from the unit test job.
  3. Less granular retry.
  4. Increased chances of failure due to using 16-core runners for integration tests and not using Testcontainer Cloud.

- name: Create build output log file
run: echo "BUILD_OUTPUT_FILE_PATH=$(mktemp)" >> $GITHUB_ENV
- name: Maven Test Build
run: >
mvn -B --no-snapshot-updates
-D skipITs -D skipChecks
-pl ${{ matrix.project }}
-D junitThreadCount=16
Copy link
Contributor

@megglos megglos Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming there are maven modules that can run in parallel would adding -T1C or less eager a static value of 4/8 be worth it for speed-up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, yeah 👍 I've tried to max out CPU via JUnit first but RandomizedRaftTest is taking 5 minutes and blocking progress for all others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, it removed around 3 minutes

@github-actions
Copy link
Contributor

github-actions bot commented Sep 22, 2022

Test Results

   866 files     866 suites   46m 44s ⏱️
6 535 tests 6 525 ✔️ 10 💤 0
6 723 runs  6 713 ✔️ 10 💤 0

Results for commit f766cf4.

♻️ This comment has been updated with latest results.

@lenaschoenburg lenaschoenburg changed the title ci: merge test jobs ci: stabilize workflows Sep 23, 2022
@lenaschoenburg lenaschoenburg marked this pull request as ready for review September 23, 2022 07:51
@npepinpe
Copy link
Member

Can we have a short chat in the TR about solutions? Generally some of these are good ideas, but I have some ideas/suggestions (e.g. split unit tests across a fixed number of groups), and questions (e.g. why is Testcontainers Cloud a source of flakiness?).

@Zelldon
Copy link
Member

Zelldon commented Sep 23, 2022

Some potential alternative:

  • for now we could run bors with jenkins again 657be1f
  • we have selfhosted nodes in our k8

npepinpe and others added 4 commits September 23, 2022 14:56
Modifies the concurrency of the code quality, test, and deploy
workflows.

The deploy workflow is sequenced such that every deploy is ran only
after the previous push's deploy run is finished. This will avoid having
concurrent deployments which could overwrite each other or mess up the
ordering, ensuring the latest SNAPSHOT is from the latest successful
commit.

For other workflows, these will run and cancel previous runs, such that
only one run at the same time is running. Hopefully this cuts down on
resource usage and perhaps auto-scaling issues.
Copy link
Member

@npepinpe npepinpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yolo

Copy link
Contributor

@megglos megglos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me let's see it in action 🤞

@lenaschoenburg
Copy link
Member Author

bors r+

@lenaschoenburg lenaschoenburg merged commit 67da38b into main Sep 23, 2022
@lenaschoenburg lenaschoenburg deleted the os-merge-unit-tests branch September 23, 2022 13:29
@deepthidevaki
Copy link
Contributor

bors r-

already merged

@zeebe-bors-camunda
Copy link
Contributor

Canceled.

@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants