CI & Automation

This page documents how the C8 monorepo CI and related tools like Renovate work. It serves as a knowledge base, including an FAQ, for Camundi and external contributors.

Git Branches

  • main: permanent branch for feature development of next C8 minor version (GitHub default branch)
  • stable/*: long-lived branches for maintenance of past C8 minor versions (deleted on support end)
    • stable/8.7 will also receive select new features until April 2025
  • release*: short-lived branches for release activities (helps to achieve code freeze) created from main or stable/*
  • any other branch: (short-lived) branches for feature development to be merged using Pull Requests, via merge queues

Available SNAPSHOT Artifacts

Maven artifacts are available on Artifactory and Docker images are available on DockerHub:

  • Pushed commits to main branch produce:

    • Maven artifacts with version 8.8.0-SNAPSHOT for all C8 components
    • Docker images with tag SNAPSHOT for Operate, Tasklist, Zeebe
    • Docker images with tag 8-SNAPSHOT for Optimize
  • Pushed commits to stable/8.7 branch produce:

    • Maven artifacts with version 8.7.0-SNAPSHOT for Operate, Tasklist, Zeebe
    • Docker images with tag 8.7.0-SNAPSHOT for Operate, Tasklist, Zeebe
  • Pushed commits to stable/optimize-8.7 branch produce:

    • Maven artifacts with version 8.7.0-SNAPSHOT for Optimize
    • Docker images with tag 8.7.0-SNAPSHOT for Optimize
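
For example, the latest Zeebe snapshot image could be pulled like this (assuming the camunda/zeebe DockerHub repository):

docker pull camunda/zeebe:SNAPSHOT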

Issue Tracking

All problems, bugs and feature requests regarding the CI of the C8 monorepo are tracked using GitHub Issues.

For visibility and prioritization there is the Monorepo CI project board that tracks high-level issues.

New reports need to be checked against the existing GitHub Issues to avoid duplication: report new occurrences of existing issues in comments; otherwise raise a new issue labelled area/build or reach out via Slack to the Monorepo CI DRI.

Prioritization

Prioritization of issues is done by the Monorepo CI DRI according to severity which follows from these criteria:

  1. Impacted functionality:
  2. Number of users impacted:
    • Generally, severity scales with the number of affected people who interact with the monorepo (Camundi/external contributors)
    • Can be assessed via CI health metrics or anecdotally
  3. Available workarounds:
    • Severity is lower if a workaround is available, especially if that workaround is easy to use/low effort

Dealing with reported issues that are identified as urgent/high severity:

  • Communicate the degraded functionality/impact and that there is an ongoing investigation to affected people.
  • Debug problems at the GitHub Actions level yourself, or involve the stakeholder teams (via their medic) or subject matter experts for advice on technical details.
  • Try to identify a (limited) workaround to unblock users.
  • Communicate any workarounds and resolution of the problem.

FAQ

Q: What do I do when I see the CI failing with a seemingly unrelated error?

A: Search the open GitHub Issues for the failure message to see if the problem is known: if you find an issue for the same problem, leave a comment with the new occurrence. Otherwise raise a new issue labelled area/build to start tracking that CI failure, or reach out via Slack to the Monorepo CI DRI.

Q: How to deal with flaky tests that block CI?

A: Disable the flaky test(s) and comment on the existing ticket, or create a new one, stating that the flaky test needs to be re-enabled after fixing it. No single test can be more important than the stability of the remaining CI system, which impacts dozens of developers.

GitHub Merge Queue

GitHub Merge Queue helps automate the Pull Request (PR) merging process by creating a temporary branch for each batch of PRs, running checks against the latest target branch, and merging changes only if the checks pass, ensuring a more streamlined and error-free workflow.

Merge queues exist per branch (one for main, one for stable/8.5, etc.) in the C8 monorepo CI and are configured independently via branch protection rules. So different branches can have different required status checks to control which CI workflows must be green to allow merging.

FAQ

Q: Why do we use merge queues instead of manually merging PRs?

A: In repositories like the C8 monorepo, with a high number of contributing engineers and high development velocity, dozens of Pull Requests are created and merged each day. Avoiding downtimes like waiting for a window to merge PRs boosts productivity and allows us to scale.

Q: Why do we have required status checks for PRs and merge queues?

A: Automated software tests increase our confidence in delivering a working software product. Required status checks technically ensure that engineers get early feedback about potential problems. This way we only merge Pull Requests to the main branch that do not fail those automated tests, which would impact quality or other engineers. This also helps with automerging dependency updates using Renovate.

Q: What are the current required status checks for PRs and merge queue to main?

A: As a repo admin you can find the up-to-date list here:

Q: Do those required status checks of main guarantee that all commits are green?

A: Yes, for the scope of the Unified CI, except for an admin bypass of the merge queue in case of incidents.

Q: My PR had only green checks when I queued it, why was it removed from the merge queue?

A: The merge queue creates a temporary branch from the latest target branch (e.g. main) with your PR merged and then runs CI again. Your changes could be incompatible with the target branch, or CI could have failed e.g. due to flakiness. Look up the check results for details on the CI failure.

Unified CI

"Unified CI" is the name of an approach to establish one central CI pipeline that runs checks for code changes of the whole monorepo instead of multiple unrelated, side-by-side pipelines for each component in the monorepo.

Goals

This central pipeline will use change detection to run checks only when needed thus improving runtime and lowering cost. After migrating, the central CI pipeline will be the only GitHub required status check for PRs and merge queue to main thus improving UX and preventing edge cases with multiple checks and path filters.

This central pipeline will run on all Pull Requests, the merge queue to main and on push for main (and in the future other stable branches). Out of scope are scheduled and release workflows.

This topic is work-in-progress as part of #17721: remaining workflows are migrated to the central CI once they meet certain criteria (short runtimes under 10 minutes and low flakiness).

A full run of the central CI pipeline should ideally take around 15 minutes, with individual jobs taking at most 10 minutes of runtime.

GitHub Actions pipeline code should be de-duplicated for the same task and moved out of ci.yml into reusable workflows or composite actions to keep ci.yml short and lean.

Workflow Inclusion Criteria

Workflows that seek inclusion into the Unified CI (and thus GitHub required status checks) need to fulfill the following criteria and best practices:

If the required short runtime cannot be achieved, consider moving long-running tests into nightly jobs or standalone workflows that are not required status checks and don't run in the merge queue (to preserve merge velocity).

Implementation

This section explains how to include a CI check into the Unified CI as a required status check so that it is executed only when relevant files changed in a PR.

To include a workflow fitting the criteria into the Unified CI, all of the following steps have to be taken for each job of that workflow:

  1. Change Detection: Define path filters for all file changes that should trigger the new job in this composite action (see the sketch after this list). This information is relevant for the next step; make sure to:

    • Add a new output to the composite action representing the condition under which the new job should be triggered.
    • The output must have the same name as the new job it triggers.
    • The output condition should reuse existing filters and combine them as needed.
    • If no existing filter matches, add a new one in a step.
    • Adjust the detect-changes job to re-expose the new output under the same name.
  2. CI Check: Relies on the previous step to run the new job only if relevant files changed. Add the new job definition to the ci.yml file by:

    • Following this pattern:
      descriptive-job-name:
        # reuse information from change detection on whether to run this job
        if: needs.detect-changes.outputs.descriptive-job-name == 'true'
        needs: [detect-changes]
        runs-on: ubuntu-latest  # or other
        timeout-minutes: 10  # or less
        permissions: {}  # unless GITHUB_TOKEN is needed
        steps:
          - uses: actions/checkout@v4
          #
          # ...ACTUAL CI CHECK STEPS HERE...
          #
          - name: Observe build status
            if: always()
            continue-on-error: true
            uses: ./.github/actions/observe-build-status
            with:
              build_status: ${{ job.status }}
              secret_vault_address: ${{ secrets.VAULT_ADDR }}
              secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
              secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}
    • It is important to depend on the detect-changes job and use the newly defined output as a condition.
    • If the new job has many steps, you need to refactor them into a reusable workflow or composite action to keep ci.yml lean.
    • Adding observability for CI health is required.
  3. Results Check: Include the new job as needs dependency in check-results job (required status check). This is needed so that the Unified CI is marked as failure if one of its jobs fails.
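
A minimal sketch of what the change detection part (step 1) could look like inside the composite action, assuming dorny/paths-filter is used for path filtering; the filter and job names (my-component, descriptive-job-name) and the file layout are illustrative assumptions, not the monorepo's actual configuration:

outputs:
  descriptive-job-name:
    description: "Whether the descriptive-job-name job should be triggered"
    value: ${{ steps.filter.outputs.my-component }}
runs:
  using: composite
  steps:
    # hypothetical path filter; reuse or combine existing filters where possible
    - uses: dorny/paths-filter@v3
      id: filter
      with:
        filters: |
          my-component:
            - 'my-component/**'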

CI Test Files

Ownership

Each CI test file has an owning team. The owning team can be found either through the CODEOWNERS file or in the metadata in the file itself. The CODEOWNERS file is organized and broken down by team; any additions to the file should follow that convention. The metadata on a GHA workflow file is used by a scraping tool so that it is easy to gather information about the current state of CI. You can look at the metadata for a quick overview of the owning team, where the tests live, how the test is called, and a description of what the file is actually testing.

Metadata follows this structure and is placed at the beginning of a GHA workflow file:

# description: <Describes what the GHA is running and what is being tested>
# test location: <The filepath of the tests being run>
# owner: <The name of the owning team>

Legacy CI

"Legacy CI" is a name for CI tests that has not been migrated to the Unified CI. Legacy tests do not meet the inclusion criteria for Unified CI, such as running under 10 minutes.

Tests that are marked as Legacy are to be migrated to Unified CI by the owning team in the future. Once migrated, the test should live inside the ci.yml file, or be part of a workflow file that is called by it. The "Legacy" label should be removed as well.

Names for Legacy tests should be prefixed with [Legacy] <componentName> so that Legacy tests are organized and appear together when run on a PR.

Consolidated Unit Tests

The Consolidated Unit Test job in the Unified CI runs unit tests by team and component (for example, Operate tests owned by the Data Layer team). These tests are run via JUnit 5 Suites. Each suite selects which tests to run by package. This enables the CI job to run a subset of all tests in a module, so that the tests being run are relevant to the owning team. Any new test package should be added to the relevant suite.

Suite names must follow the naming convention {componentName}{team}TestSuite. The composite of the component and the team is used by the CI job to select which component and team to run the tests for. For example, OperateCoreFeaturesTestSuite is used to run Core Features tests on Operate.
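
As an illustration, such a suite could look like the following JUnit 5 Suite class; the package names are assumptions for the sketch, only the class name convention is prescribed:

// Hypothetical suite running Operate tests owned by the Core Features team
package io.camunda.operate; // assumed package

import org.junit.platform.suite.api.SelectPackages;
import org.junit.platform.suite.api.Suite;

@Suite
@SelectPackages({
    "io.camunda.operate.core.feature" // assumed test package; add new packages here
})
public class OperateCoreFeaturesTestSuite {}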

Renovate

Renovate is a bot and GitHub app that automates dependency updates in software projects by scanning the source code for outdated libraries and applications, then creating Pull Requests to upgrade them to the latest versions, which helps keep the project secure and up-to-date.

Renovate supports many package ecosystems of which we use e.g. Maven, NPM, Docker and Helm. It can scan multiple branches (e.g. main, stable/8.5) inside of one repository and raise PRs independently for those.

Renovate is configured via a JSON configuration file on the main branch. In general we allow Renovate to run and create PRs at any time to avoid lagging behind with updates.

We also want Renovate to automatically merge dependency updates when CI is green and automated tests are passing. Assuming a nearly complete test coverage, the efficiency gains outweigh the risks. This is achieved by Renovate requesting to put every Pull Request into the GitHub Merge Queue - GitHub will then ensure that required status checks pass before merging the PR.
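
A minimal sketch of such a configuration file, using documented Renovate options but not reflecting the monorepo's actual settings:

{
  "extends": ["config:recommended"],
  "baseBranches": ["main", "stable/8.7"],
  "automerge": true
}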

We additionally use the renovate-approve bot to circumvent the PR reviewer requirements.

FAQ

Q: Why do we use Renovate instead of manually looking for dependency updates?

A: We automate repetitive and error-prone tasks as much as possible to save valuable engineering time for solving problems requiring more creativity, e.g. complex major version upgrades of dependencies.

Q: Why do we use Renovate instead of Dependabot etc.?

A: Renovate is more flexible, supports more package ecosystems, offers detailed configuration and is already used successfully in other places at Camunda, so we can reuse existing experience.

Q: Why does Renovate attempt to merge a PR with failing status checks?

A: Renovate will always try to automerge dependency update PRs since it does not know about CI failures. It is GitHub's task to enforce required status checks and reject the merge attempt - as long as no PR with failing status checks actually gets merged, everything is working as intended.

Q: Why does Renovate not detect dependency XYZ?

A: Renovate parses and analyzes most well known dependency management files (e.g. pom.xml) automatically. Not detecting a dependency can be due to an unrecognized file format, a typo in the name, a bug in Renovate, or the dependency missing from the package ecosystem. This will usually be reported in the Renovate logs.

Q: How to access the Renovate logs?

A: Click on the most recent run in the Renovate Dashboard and make sure to show debug information.

Q: Why are updates for dependency XYZ ignored in the Renovate configuration file?

A: The reasons for manually ignoring certain updates should be described in comments. Using git annotate to figure out who added the ignore can also be a way to get more details.

CI Health Metrics

There are hundreds of CI jobs running each day in the C8 monorepo CI due to high development activity. This scale makes it challenging to assess whether there are any structural problems related to the "CI health" (e.g. reliability issues) that would impact developer productivity.

To assess this, we collect metrics for CI jobs like build times, build failures and information about the hardware/runner via the CI Analytics framework. See how to instrument GHA workflows for metrics collection. We use the collected data for visualizations to get an overview of the CI health.

This topic is work-in-progress as part of #18210 to achieve better coverage, collect more diverse metrics for additional insights and establish a process for dealing with the results.

Metrics Collection

Any job in any GitHub Actions workflow can be instrumented to collect information about the build status by adding one step at the end, like the following snippet shows:

jobs:
  my-solo-job-name:
    steps:
      # initial checkout is required!
      - uses: actions/checkout@v4
      # keep all other steps here, then insert final step:
      - name: Observe build status
        if: always()
        continue-on-error: true
        uses: ./.github/actions/observe-build-status
        with:
          build_status: ${{ job.status }}
          secret_vault_address: ${{ secrets.VAULT_ADDR }}
          secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
          secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}

Special handling has to be done for matrix jobs since the job name is not unique among the different matrix builds, see below:

jobs:
  my-matrix-job-name:
    strategy:
      matrix:
        identifier: [configurationA, configurationB]
    steps:
      # initial checkout is required!
      - uses: actions/checkout@v4
      # keep all other steps here, then insert final step:
      - name: Observe build status
        if: always()
        continue-on-error: true
        uses: ./.github/actions/observe-build-status
        with:
          # use the github context: default env vars like GITHUB_JOB are not available via ${{ env.* }}
          job_name: "${{ github.job }}/${{ matrix.identifier }}"
          build_status: ${{ job.status }}
          secret_vault_address: ${{ secrets.VAULT_ADDR }}
          secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
          secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}

Visualization

We visualize the collected data using an internal Grafana dashboard to analyze high build failure rates, both in general and broken down per CI job.

CI Secret Management

All GitHub Action workflows of the C8 monorepo CI must use Vault to retrieve secrets, e.g. with the Hashicorp Vault action, as a best practice. Other approaches like GitHub Action Secrets will be sunset (except for bootstrapping the connection to Vault).

Historically, different paths have been used in Vault to store secrets depending on the managing team, e.g. products/zeebe/ci or products/operate/ci. This scheme can lead to redundancies in a monorepo and should be aligned for more synergy.

Secrets for the C8 monorepo CI should be stored in Vault under the path products/camunda/ci/*. Manually managed secrets should go into products/camunda/ci/github-actions.

CI Self-Hosted Runners

GitHub allows customers to use their own machines to execute GitHub Action workflows via self-hosted runners. We use this feature when more resources are needed than GitHub can provide, or when they are available at a cheaper price. See the internal documentation for what is available.

Usage Guidelines

How to choose which runner to use for a GHA workflow:

  1. Use GitHub-hosted runners by default (free for public repositories)
  2. Use self-hosted runners (with -default name suffix) when a workflow needs:
    1. more resources (memory, CPU) than available on GitHub-hosted runners
    2. ARM CPU architecture

The -default self-hosted runners have no durability guarantees, which makes them very cheap and the default choice if GitHub-hosted runners are not sufficient. Exception: in case of reliability problems, one can use the -longrunning suffix after approval by the Monorepo CI DRI.
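
As an illustration, selecting such a runner in a GHA workflow could look like this; the runner label is a hypothetical example, actual labels are listed in the internal documentation:

jobs:
  memory-heavy-build:
    # assumption: example label following the -default suffix convention
    runs-on: gcp-core-16-default
    steps:
      - uses: actions/checkout@v4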

GitHub Actions Cache

Workflows run by GitHub Actions can avoid repeated downloads of tools and dependencies by using the GitHub Actions Cache. This can shorten or avoid download times and make workflow executions faster, more robust and cheaper.

Important facts about the GHA cache:

  • size: 10 GiB

    • this is small for a monorepo use case with Java, NodeJS, and many open PRs
  • access restrictions: workflow runs can restore caches created in either the current branch or the default branch

    • caches created for main are very useful
    • caches created on other branches/Pull Requests are of very limited use; they can only be used by subsequent builds of the same PR
  • cleaning policy: above 10 GiB, GitHub reserves the right to delete old cache entries any time

    • counter-intuitive: caches from main (more useful than those from PRs) are deleted first if they are the oldest

Observations from practice in the monorepo:

  • size: can grow to several hundred GiB
  • cleaning policy: GitHub runs one nightly cleanup around 4 AM UTC

Caching Strategy

To make the most efficient use of the limited GHA cache resources available in the monorepo and ensure consistency across many GHA workflows, we follow these guidelines:

  1. Docker/BuildKit layers: don't write to the GHA cache
  2. Java/Maven dependencies: do write to the GHA cache only from main and stable* branch builds
  3. NPM/Yarn dependencies: do write to the GHA cache only from main and stable* branch builds
  4. Golang dependencies: do write to the GHA cache only from main and stable* branch builds

Implementation:

  1. Do not use cache-from: type=gha and cache-to: type=gha parameters of docker/build-push-action.
  2. Use setup-maven-cache action.
  3. Use setup-yarn-cache action, see usage example in #21607.
  4. No implementation since Golang usage is low.
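
The "restore everywhere, write only from main/stable* builds" guideline can be sketched with the standard actions/cache restore/save split; the internal setup-maven-cache action encapsulates logic along these lines, and the path and key below are illustrative assumptions:

- uses: actions/cache/restore@v4
  with:
    path: ~/.m2/repository
    key: maven-${{ hashFiles('**/pom.xml') }}
# ...build steps here...
# write the cache only from main and stable* branch builds
- uses: actions/cache/save@v4
  if: github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/stable')
  with:
    path: ~/.m2/repository
    key: maven-${{ hashFiles('**/pom.xml') }}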

Disable cache restoration for a Pull Request

You can temporarily turn off cache restoration in a PR by using the /ci-disable-cache command as described under ChatOps. This can be useful to test GHA workflows without the caching mechanism. To restore standard functionality, issue the /ci-enable-cache command or drop the empty commit.

Note: Disabling the cache restore mechanism is only possible on PRs.

CI Security

Permissions of GITHUB_TOKEN

Every GHA workflow job is given a GITHUB_TOKEN environment variable with a valid GitHub API token by default. This token can have wide permissions that are unnecessary for most jobs and open up attack surface, reducing security.

Best Practice: All GHA workflow jobs must request only the actually required permissions on the GITHUB_TOKEN. Set permissions: {} by default and add what is needed.
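
A minimal sketch of this pattern; the job and its permission are illustrative:

permissions: {}  # workflow-level default: no permissions for any job

jobs:
  comment-on-pr:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write  # added only because this job posts a PR comment
    steps:
      - uses: actions/checkout@v4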

Usage of Third Party GitHub Actions

GitHub Actions has a large ecosystem of existing useful actions from GitHub and third parties such as other companies and individuals. While reusing existing actions avoids code duplication and maintenance effort for Camunda, it increases the attack surface should any of those actions be hacked to perform malicious tasks.

Best Practice: To balance utility with risk, all GHA workflows must follow this policy:

  • Use the same action for the same (or similar) automation task, see recipes.
  • Use actions only from trusted sources (GitHub or a small set of select 3rd parties, settings).
  • Move actions from Camundi personal accounts to camunda for long-term maintenance, or find a replacement.

For camunda/camunda GHA workflows we use a GitHub feature to technically limit which actions can be used to:

  • Allow actions created by GitHub in the actions and github organizations.
  • Allow actions in any Camunda GitHub Enterprise organization like camunda, bpmn-io, etc.
  • Allow specific actions from 3rd parties that we need (full list see below).

If you need to use a 3rd party action not on the list, create an issue explaining the motivation and tag the Monorepo CI DRI for further discussion.

List of allowed 3rd party actions and reusable workflows: EnricoMi/publish-unit-test-result-action@*, YunaBraska/java-info-action@*, asdf-vm/actions/install@*, atomicjar/testcontainers-cloud-setup-action@*, aws-actions/configure-aws-credentials@*, blombard/move-to-next-iteration@*, bobheadxi/deployments@*, browser-actions/setup-firefox@*, bufbuild/buf-action@*, codex-/return-dispatch@*, dcarbone/install-jq-action@*, deadsnakes/action@*, dlavrenuek/conventional-changelog-action@*, docker/build-push-action@*, docker/login-action@*, docker/metadata-action@*, docker/setup-buildx-action@*, docker/setup-qemu-action@*, dorny/paths-filter@*, fjogeleit/http-request-action@*, golangci/golangci-lint-action@*, google-github-actions/auth@*, google-github-actions/get-gke-credentials@*, google-github-actions/setup-gcloud@*, hadolint/hadolint-action@*, hashicorp/vault-action@*, hoverkraft-tech/compose-action@*, joelanford/go-apidiff@*, jwalton/gh-docker-logs@*, korthout/backport-action@*, marocchino/sticky-pull-request-comment@*, mavrosxristoforos/get-xml-info@*, misiekhardcore/infra-report-action@*, mshick/add-pr-comment@*, ncipollo/release-action@*, octokit/*@*, peter-evans/create-or-update-comment@*, peter-evans/find-comment@*, peter-evans/slash-command-dispatch@*, redhat-actions/oc-login@*, rodrigo-lourenco-lopes/move-to-current-iteration@*, s4u/maven-settings-action@*, s4u/setup-maven-action@*, slackapi/slack-github-action@*, snyk/actions/setup@*, stCarolas/setup-maven@*, stefanzweifel/git-auto-commit-action@*, teleport-actions/setup@*, teleport-actions/auth-k8s@*, test-summary/action@*, tibdex/github-app-token@*, vsgoulart/write-file-action@*, wagoid/commitlint-github-action@*

Preview Environments

Engineers can request Preview Environments for specific Pull Requests of the C8 monorepo, available via a designated URL, to allow more thorough testing and demonstration of product features before the feature branches are merged into the base branch. For the C8 monorepo the components Identity, Operate, Optimize, Tasklist and Zeebe will get provisioned based on the camunda-platform Helm chart.

Assign the deploy-preview label to any PR targeting the main branch to request creation of a Preview Environment. Creation may take a while; a PR comment including a URL and additional info will be sent as notification. The creation/update of a Preview Environment may fail for various reasons including:

  • compilation errors on any code in the C8 monorepo
  • Docker image build errors
  • backwards incompatible changes in the upstream camunda-platform Helm chart
  • bugs preventing successful startup of any included C8 component

Preview Environments are provisioned on cheap but sometimes less reliable hardware to be cost efficient, and can get automatically stopped after inactivity.

Backporting Guidelines

We want crucial security, stability, cost, and other CI improvements applied to all long-living Git branches in the C8 monorepo.

Why we need CI backports

Changes affecting the CI such as introducing new jobs, new observability features or stability fixes are usually developed first on the main branch. We also have several stable/* branches living for multiple years to release maintenance updates.

Due to how Git branches work, every stable/* branch has its own copy of all GHA workflows from the time of forking. Those GHA workflows receive automated Renovate updates for actions, but every human-made CI change needs to be at least considered for manual backporting to ensure that crucial improvements land on all relevant branches.

How to backport CI changes

Follow these instructions to backport PRs with CI changes.

It may be required to resolve Git conflicts when backporting CI changes.

When to backport CI changes

If the CI change matches one of the following:

  • is security-related (incl. dependency updates, permissions): MUST backport to all stable/* branches
  • is related to cost reduction, increased reliability or observability: SHOULD backport to all stable/* branches
  • is a new CI job for a new product feature or test cases: backport only if the product feature is backported
  • is a new CI feature: backport only if required in the ticket
  • is related to an on: schedule GHA workflow: no need to backport, only works on main
  • is related to Preview Environments: no need to backport, only supported on main

Slack Notifications

All CI workflows in the camunda/camunda monorepo must use the "C8 Monorepo Notifications" Slack app. Messages to Slack should be sent via webhooks. The webhook URLs are secrets and are stored in Vault per Slack channel.

If you need to send Slack messages to a channel for which no webhook URL exists yet, reach out via Slack to the Monorepo CI DRI to request one. They will then generate a new webhook URL for the "C8 Monorepo Notifications" Slack app and store it in Vault.

Webhook URL secrets can be retrieved from Vault in GitHub Actions workflows like this:

job-with-notification:
  steps:
    - uses: actions/checkout@v4

    - name: Import Secrets
      id: secrets
      uses: hashicorp/vault-action@v3
      with:
        url: ${{ secrets.VAULT_ADDR }}
        method: approle
        roleId: ${{ secrets.VAULT_ROLE_ID }}
        secretId: ${{ secrets.VAULT_SECRET_ID }}
        exportEnv: false # we rely on step outputs, no need for environment variables
        secrets: |
          secret/data/products/camunda/ci/github-actions SLACK_MYCHANNELNAME_WEBHOOK_URL;

    - name: Send notification
      uses: slackapi/slack-github-action@v2
      with:
        webhook: ${{ steps.secrets.outputs.SLACK_MYCHANNELNAME_WEBHOOK_URL }}
        webhook-type: webhook-trigger
        # For posting a rich message using Block Kit
        payload: |
          blocks:
            - type: "section"
              text:
                type: "mrkdwn"
                text: "Hello World"

ChatOps

In the camunda/camunda monorepo certain automated workflows can be triggered by posting comments with commands on GitHub Issues and/or Pull Requests. Those commands are then processed by a GitHub Actions workflow.

Available commands:

  • /ci-problems comment on a Pull Request:

    • Synopsis: Triggers a script that analyzes all CI runs related to that PR for CI failures and posts a summary as a new PR comment.
    • Use case: Can be used by any engineer to get actionable hints on how to address CI problems in a PR.
    • Capabilities:
      • detects problems with self-hosted runners (incl. links to dashboards + Kubernetes logs)
      • detects pipeline timeouts
      • detects DockerHub connection problems
      • provides deep links to GHA logs for generic job failures
  • /ci-disable-cache comment on a Pull Request:

    • Synopsis: Adds a new label ci:no-cache to the list of labels of the Pull Request and creates a new empty commit to trigger a new CI run without cache restoration.
    • Use case: Can be used by any engineer to test workflows run from scratch without cache restoration.
  • /ci-enable-cache comment on a Pull Request:

    • Synopsis: Removes the ci:no-cache label from the list of labels of the Pull Request and creates a new empty commit to trigger a new CI run.
    • Use case: Complements the /ci-disable-cache command and can be used to restore the regular CI cache restoration step.

Flaky tests

Tests can be viewed as "flaky" when they are not consistently passing although neither the source code, nor the test code, nor the environment has meaningfully changed.

We should aim to have all tests consistently passing, avoid introducing new flaky tests and use our observability tooling to detect and fix existing flaky tests. This allows for a better developer experience and smoother processes like automated dependency updates.

GitHub Action workflows that run Java tests with Maven should use the flaky-test-extractor-maven-plugin and report the resulting detailed flaky test statistics to our CI health database.

flaky-test-extractor-maven-plugin

Some Maven modules in the monorepo rerun failing Java tests multiple times (e.g. 3 times, configurable) and use the flaky-test-extractor-maven-plugin:

  • if a test succeeds at least once during the retries, it is classified as "flaky" by this plugin
  • if a test fails on all retries, it is classified as "failed"
    • this will cause the whole build to fail
    • see the FAQ on how to deal with such cases
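
The retry behaviour this classification builds on is Surefire's standard rerun feature; a minimal sketch of the relevant pom.xml setting (the extractor plugin's own coordinates and binding are omitted here):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- rerun each failing test up to 3 times; a pass on any rerun marks it as flaky -->
    <rerunFailingTestsCount>3</rerunFailingTestsCount>
  </configuration>
</plugin>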

Troubleshooting

How to deal with CI alerts that fire?

Follow the Monorepo CI medic routines and check out the available CI runbooks for each alert.

Why is my CI check failing?

There can be many factors influencing this and it is sometimes hard to find the root cause. The list below should provide guidance:

  1. Try to rerun the failing CI check(s), at least to overcome transient problems.
  2. If on a Pull Request, consider whether the code changes on that Pull Request might cause the CI check failure.
  3. Check if there is an open issue about that failing CI check, e.g. by searching for the error message.
  4. Ask Copilot about the failure using the Explain error button.
    • If this doesn't help, search the web for the error message.
  5. If the failing CI checks are not part of the Unified CI, contact their owner and see if those CI checks are known to be unstable or flaky.
    • Technically, failing CI checks outside the Unified CI do not prevent merging a PR. If the owners of those failing CI checks agree, you can still merge.
  6. If your PR is removed from the merge queue, check if concurrently there was another PR merged that changes code which your code depends on (e.g. leading to compilation errors in the merge queue).
  7. If a check from the Unified CI is failing on main or a stable branch, try to find the first build with that failing check and investigate the recently merged code changes.
    • Experience shows that most CI check failures are (indirectly) caused by camunda/camunda code changes, and not by external factors like 3rd party services or infrastructure.
    • CI Health metrics can also be used to narrow down the time range, though less precisely.
  8. Reach out on Slack for help!

How to verify that a CI check is robust and stable, not flaky?

First, create a dedicated branch YOURBRANCHNAME which can be used as a reference for running the CI check later.

If you are working on fixing a flaky test, push the code or build pipeline change(s) that you believe remove the flakiness onto that branch.

Is your CI check part of the Unified CI's ci.yml?

  1. No, but it runs on Pull Requests. Then you have to create a draft PR for YOURBRANCHNAME and manually trigger reruns of the check in question.

  2. Yes! Then you can use the GitHub CLI tool to start repeated runs.

    Optional: remove CI checks you are not interested in from the ci.yml on YOURBRANCHNAME to speed up the execution and save resources.

    Open a new terminal, go to your checkout of camunda/camunda and execute in bash shell:

    for i in {1..10}; do gh workflow run ci.yml --ref YOURBRANCHNAME; done

    This loop will take a while (1 hour or more depending on the CI check), so let it run in the background. After it has finished, visit https://github.com/camunda/camunda/actions/workflows/ci.yml?query=branch%3AYOURBRANCHNAME and check whether there are any failures (indicating a lack of robustness).
