CI Runbooks

This page collects information on what alerts affecting the C8 monorepo CI mean and how to respond to them. For general information see the CI & Automation page.

Merge Queue High Failure Rate

Unsuccessful GHA jobs in the merge queue to the main branch mean that PRs cannot get merged. Those same jobs being green is a precondition for enqueuing a PR into the merge queue, so a high failure rate over the last hours can indicate general CI instability, e.g. with infrastructure or remote network services.

This can potentially block all developers in the monorepo and needs to be investigated quickly.

Troubleshooting

Verify the high rate of unsuccessful GHA jobs (for the merge queue) over the last hours in the CI Health dashboard. Drill down into the list of recent unsuccessful jobs, check their GHA logs for common symptoms, and correlate them with known issues.
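
If the dashboard is not at hand, the same data can be pulled from the command line. A minimal sketch using the gh CLI, assuming it is authenticated against the monorepo and that merge queue runs trigger on the merge_group event; the repository slug below is a placeholder:

```shell
# List recent failed GHA runs triggered by the merge queue (merge_group event).
# The repo slug is a placeholder; substitute the actual monorepo.
gh run list --repo example-org/monorepo --event merge_group --status failure --limit 20

# Drill into the failed steps of a specific run to spot common symptoms.
gh run view <run-id> --repo example-org/monorepo --log-failed
```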

Check for GitHub Actions outages.
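
GitHub publishes its availability through the standard Statuspage API, so this check can also be scripted; a quick sketch:

```shell
# Overall GitHub status (public Statuspage endpoint, no auth required).
curl -s https://www.githubstatus.com/api/v2/status.json

# Per-component breakdown; look for the "Actions" component.
curl -s https://www.githubstatus.com/api/v2/components.json
```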

If the common symptom says that a job has been cancelled and superseded by another job, this is a false-positive alert and can be ignored. This can happen when users use the "Jump queue" option (discouraged) on their PR, which cancels all running jobs since the merge queue order changed.

Solutions

The solution depends on the kind of failure. In general, quick mitigations or workarounds that unblock developers are preferred before moving on to root-cause analysis.

Push main High Failure Rate

Unsuccessful Unified CI GHA jobs on push to the main branch mean that artifacts might not get built or uploaded. Those same jobs being green is a precondition for the merge queue, so a high failure rate over the last hours can indicate general CI instability, e.g. with infrastructure or remote network services.

This can prevent artifact uploads and needs to be investigated.

Troubleshooting

Verify the high rate of unsuccessful GHA jobs (for push to main) over the last hours in the CI Health dashboard. Drill down into the list of recent unsuccessful jobs, check their GHA logs for common symptoms, and correlate them with known issues.
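
As with the merge queue, these runs can also be listed via the gh CLI; a minimal sketch (repository slug again a placeholder):

```shell
# List recent failed GHA runs triggered by pushes to main.
gh run list --repo example-org/monorepo --branch main --event push --status failure --limit 20
```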

Check for GitHub Actions outages.

Solutions

The solution depends on the kind of failure; as with the merge queue, prefer quick mitigations that unblock developers.

Self-hosted Runner High Disconnect Rate

Disconnected self-hosted runners mean that GHA jobs could not run to completion, producing failed builds. A high disconnect rate over the last hours can indicate general CI instability, e.g. with infrastructure.

This can block developers and needs to be investigated quickly.

Troubleshooting

Verify the high rate of disconnected self-hosted runners over the last hours in the CI Health dashboard. Drill down into the list of recent jobs aborted due to self-hosted runner disconnects, check their GHA logs, and correlate them with potential Kubernetes pod problems.
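
For the Kubernetes side, a few standard kubectl checks can narrow things down. A minimal sketch, assuming the self-hosted runners run as pods in a dedicated namespace (the namespace name is a placeholder):

```shell
# Placeholder namespace; adjust to where the self-hosted runners are deployed.
NAMESPACE=github-runners

# Runner pods that are not in a healthy Running state.
kubectl get pods -n "$NAMESPACE" --field-selector=status.phase!=Running

# Recent events often reveal evictions, OOM kills, or node pressure.
kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n 20

# Inspect a suspicious pod in detail (restarts, exit codes, conditions).
kubectl describe pod <pod-name> -n "$NAMESPACE"
```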

The /ci-problems command on PRs can be helpful for getting links to relevant resources.

Solutions

The solution depends on the kind of failure.