CI Runbooks
This page collects information on what alerts affecting the C8 monorepo CI mean and how to respond to them. For general information see the CI & Automation page.
Unsuccessful GHA jobs in the merge queue to the `main` branch mean that PRs cannot get merged. Those same jobs being green is a precondition for enqueuing a PR into the merge queue, so a high failure rate over the last hours can indicate general CI instability, e.g. with infrastructure or remote network services.
This can potentially block all developers in the monorepo and needs to be investigated quickly.
Verify the high rate of unsuccessful GHA jobs (for the merge queue) over the last hours in the CI Health dashboard. Drill down into the list of recent unsuccessful jobs, check their GHA logs for common symptoms, and correlate them with known issues; a minimal API sketch for this drill-down is shown after these steps.
Check for GitHub Actions outages.
If the common symptom says that a job has been cancelled and superseded by another job, this is a false-positive alert and can be ignored. This can happen when users use the (discouraged) "Jump queue" option on their PR, which cancels all running jobs since the merge queue order changed.
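For the drill-down, the list of recent unsuccessful merge-queue runs can also be pulled straight from the GitHub REST API, since merge-queue runs are triggered by the `merge_group` event. A minimal sketch, assuming a `GITHUB_TOKEN` environment variable with read access and using `camunda/camunda` as a placeholder for the monorepo slug:

```python
import os

import requests

REPO = "camunda/camunda"  # assumption: replace with the actual monorepo slug

# List recent workflow runs triggered by the merge queue (event "merge_group")
# that ended unsuccessfully, newest first.
resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/runs",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    params={"event": "merge_group", "status": "failure", "per_page": 20},
    timeout=30,
)
resp.raise_for_status()

for run in resp.json()["workflow_runs"]:
    # html_url links straight to the GHA logs referenced above.
    print(run["created_at"], run["name"], run["conclusion"], run["html_url"])
```

Changing `status` to `cancelled` surfaces the superseded-job false positives described above instead of genuine failures.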
Mitigation depends on the kind of failure. In general, quick mitigations or workarounds are preferred to unblock developers before going into root-cause analysis.
Unsuccessful Unified CI GHA jobs on push to the `main` branch mean that artifacts might not get built or uploaded. Those same jobs being green is a precondition for the merge queue, so a high failure rate over the last hours can indicate general CI instability, e.g. with infrastructure or remote network services.
This can prevent artifact uploads and needs to be investigated.
Verify the high rate of unsuccessful GHA jobs (for push to `main`) over the last hours in the CI Health dashboard. Drill down into the list of recent unsuccessful jobs, check their GHA logs for common symptoms, and correlate them with known issues.
Check for GitHub Actions outages.
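GitHub Actions outages show up on the public GitHub status page, which also exposes a Statuspage API, so the check can be scripted. A minimal sketch (no authentication needed; component names are matched loosely):

```python
import requests

# Fetch the public GitHub status summary (Statuspage v2 API).
summary = requests.get(
    "https://www.githubstatus.com/api/v2/summary.json", timeout=30
).json()

# Report the status of the Actions component, e.g. "operational",
# "degraded_performance" or "partial_outage".
for component in summary["components"]:
    if "Actions" in component["name"]:
        print(component["name"], "->", component["status"])

# List any ongoing incidents.
for incident in summary["incidents"]:
    print("Active incident:", incident["name"], f"({incident['status']})")
```

The failed-runs listing from the merge-queue entry applies here as well; querying with `"event": "push"` and `"branch": "main"` narrows it down to push builds on `main`.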
Mitigation depends on the kind of failure.
Disconnected self-hosted runners mean that GHA jobs could not run to completion, producing failed builds. A high disconnect rate over the last hours can indicate general CI instability, e.g. with infrastructure.
This can block developers and needs to be investigated quickly.
Verify the high rate of disconnected self-hosted runners over the last hours in the CI Health dashboard. Drill down into the list of recent jobs aborted due to self-hosted runner disconnects, check their GHA logs and correlate with potential Kubernetes pod problems.
The `/ci-problems` command on PRs can be helpful to get links to resources.
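To correlate disconnects with the current runner state, the registration status of self-hosted runners can also be queried through the GitHub REST API. A minimal sketch, assuming a token with administration read access and again using `camunda/camunda` as a placeholder (runners registered at the organization level are listed via `/orgs/{org}/actions/runners` instead):

```python
import os

import requests

REPO = "camunda/camunda"  # assumption: replace with the actual monorepo slug

# List self-hosted runners registered on the repository and flag offline ones.
resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/runners",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    params={"per_page": 100},
    timeout=30,
)
resp.raise_for_status()

for runner in resp.json()["runners"]:
    state = "OFFLINE" if runner["status"] == "offline" else "online"
    busy = "busy" if runner["busy"] else "idle"
    print(state, runner["name"], busy)
```

Offline runners backed by Kubernetes pods can then be cross-checked against pod restarts or evictions in the cluster.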
Mitigation depends on the kind of failure.