release-25.4: release tooling: bundled backport of GHA-migration follow-ups #170804
Backport 1/1 commits from cockroachdb#170298 on behalf of @rail.

----

Move the release build/sign, publish, branch-cut, and pick-SHA pipelines from TeamCity-driven workflows to GitHub Actions, while preserving the existing TC shell scripts as the underlying build steps.

New GitHub Actions workflows under .github/workflows/:

- release-build-and-sign.yml: per-platform builds (linux amd64/arm64, s390x, FIPS, darwin amd64/arm64, windows), Docker multi-arch image builds, macOS notarization, IBM/GPG signing, sentry release upload, and Slack notification. Concurrency-grouped, per-job timeouts, all third-party actions pinned to commit SHAs, secrets fetched from GCP Secret Manager via WIF (no GitHub repo secrets), umask-restricted on-disk secret materialization. rcodesign installed from a SHA-pinned upstream binary into $RUNNER_TEMP/bin (no cargo, no sudo).
- release-publish.yml: promotes staged artifacts to DockerHub, the Red Hat container catalog, and opens RAFA cloud-rollout PRs. A single approve-publish job hosts the release-ops environment so the reviewer clicks approve once per dispatch and every downstream publish job inherits the gate transitively.
- release-branch-cut.yml: cuts staging branches, files Jira tickets (ADF-rendered), creates the backport label, and posts to Slack.
- release-pick-sha.yml: picks a release SHA, writes it back to the Jira ticket, dispatches release-build-and-sign, and notifies the docs release-notes API.

Both build-and-sign and publish accept a comma-separated skip_jobs input (validated by a first-stage validate-skip-jobs job) so an operator can re-dispatch after a partial infra failure without re-running already-successful jobs. Downstream jobs honor a 'skipped' upstream as success-equivalent only when the upstream is explicitly in skip_jobs, so cascade-skips from real failures don't masquerade as successful resumes.
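The skip_jobs rule above can be illustrated with a minimal Go sketch. The real logic lives in workflow `if:` expressions and the validate-skip-jobs job, not in Go; the function names and the `sign-macos` job name here are hypothetical and only serve to show the "skipped counts as success only when explicitly requested" behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// knownJobs is a hypothetical job list; the real workflows define their own names.
var knownJobs = map[string]bool{
	"build-linux":  true,
	"build-docker": true,
	"sign-macos":   true,
}

// parseSkipJobs splits a comma-separated skip_jobs value and rejects unknown
// job names, mirroring what a validation stage might do.
func parseSkipJobs(input string) (map[string]bool, error) {
	skip := make(map[string]bool)
	for _, name := range strings.Split(input, ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue
		}
		if !knownJobs[name] {
			return nil, fmt.Errorf("unknown job in skip_jobs: %q", name)
		}
		skip[name] = true
	}
	return skip, nil
}

// upstreamOK reports whether a downstream job may proceed given one upstream's
// result. A "skipped" upstream only counts as success when the operator listed
// it in skip_jobs; a cascade-skip caused by a real failure does not.
func upstreamOK(upstream, result string, skip map[string]bool) bool {
	switch result {
	case "success":
		return true
	case "skipped":
		return skip[upstream]
	default:
		return false
	}
}

func main() {
	skip, err := parseSkipJobs("build-linux, build-docker")
	if err != nil {
		panic(err)
	}
	fmt.Println(upstreamOK("build-linux", "skipped", skip)) // explicitly skipped
	fmt.Println(upstreamOK("sign-macos", "skipped", skip))  // cascade-skip
}
```

In this sketch the first call reports that the downstream job may run, while the second does not, because `sign-macos` was skipped without being named in skip_jobs.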
Companion build/github/release-*.sh wrappers translate GHA env conventions to the existing build/teamcity/internal/release/... scripts, which gain conditional WIF auth and dev-vs-prod GCS / Artifact Registry project selection so they can be invoked from either driver. TC code paths in every shared script are untouched.

Branch-cut and pick-SHA additionally build and run their release binary inside the bazel docker container (via run_bazel) so the host runner doesn't need a bazel/Go toolchain installed. The wrappers forward GITHUB_REPOSITORY into the container so the binary's defaultRepo() helper picks up the dispatching repo instead of falling back to cockroachdb/cockroach.

A new pkg/cmd/release Go CLI drives the branch-cut and pick-SHA workflows. It includes Jira (REST v3 + ADF), GitHub, Slack, and docs release-notes API clients, with unit tests for the SHA-pick and branch-cut commands. All HTTP clients are bounded by named per-call timeouts via httputil.NewClientWithTimeout so a wedged upstream API can't hang the cron run. update-versions takes --cockroach-repo and --github-username flags so the push targets aren't bound to specific literals; the dry-run override fires on isProductionRepo() rather than matching a hardcoded repo name, so a future prod-repo rename is a zero-line code change in the binary. Per-ticket summary logs in the branch-cut runner are dry-run-aware so a rehearsal run doesn't claim to have cut a branch it skipped.

Prod-vs-non-prod side effects (Slack channel selection, customer-facing publish jobs) are gated on the IS_PRODUCTION_REPO repository variable. WIF provider/SA/GCP project selection is gated on a separate USE_PROD_GCP variable so a staging-prod repo can exercise the prod control-flow paths against the dev GCP project: operators set both on the real prod repo, only IS_PRODUCTION_REPO on a rehearsal fork. Forks default to dry-run automatically and cross-repo abuse is still blocked at the WIF attribute_condition.
Per-job dispatch refs are restricted to master, release-*-rc, and staging-v* via if: allow-lists, with the release-ops environment's deployment-branches policy as the authoritative gate behind the single approve-publish job.

Release-26.1 adaptations not in the upstream PR:

- build-linux and build-per-platform-ibm in release-build-and-sign.yml gain docker/setup-buildx-action and docker/setup-qemu-action steps because the per-platform build script still does an in-job arm64 docker build on this branch (master moved that to the separate build-docker job). Without QEMU binfmt handlers the arm64 RUN steps abort with 'exec format error'.
- build-cockroach-release-cloud-only.sh drops the pre-WIF unconditional gcr_staged_credentials assignments in the upper if/else block; the WIF-aware block below already handles them, and leaving the unconditionals in place trips set -u under WIF where GCS_CREDENTIALS_PROD/DEV are unset.

Release justification: release automation changes
Epic: none
Release note: None
The release binary's isProductionRepo() check reads the IS_PRODUCTION_REPO env var to decide between the production Slack channel (#db-release-status) and the rehearsal channel (#db-release-test). The branch-cut and pick-sha workflows gate themselves on vars.IS_PRODUCTION_REPO via their job-level if:, but never exported the value to the run step or the bazel docker container. As a result, a non-dry-run on the production repo still routed Slack to #db-release-test.

Add IS_PRODUCTION_REPO to each workflow's env: block and to the docker -e passthrough list in the wrapper scripts, so the var reaches the binary on both layers.

Release note: None
Epic: none
Two unrelated bugs were preventing prod release runs from completing cleanly.
1. Build/publish workflows used a service account that does not exist.
The prod RELEASE_GCP_SERVICE_ACCOUNT in release-build-and-sign.yml and
release-publish.yml pointed at release-artifacts@releases-prod, whose
Gaia ID does not resolve. GCS uploads of release artifacts failed with
a 404 ("Gaia id not found"). Point both workflows at
gha-releases@releases-prod.iam.gserviceaccount.com, matching what
release-pick-sha.yml and release-branch-cut.yml already use.
2. pick-sha's release-notes API call wrapped its payload incorrectly.
The docs release-notes endpoint expects the payload fields at the top
level of the request body, but postReleaseNotes wrapped them in a
{"ReleaseNotesPayload": ...} envelope. The server rejected every call
with 400 ("Field validation for 'CurrentRelease' / 'ReleaseDate' /
'ReleaseSHA' failed on the 'required' tag"). Marshal the payload
directly and update the test that previously asserted the envelope.
Epic: none
Release note: None
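The difference between the enveloped and the direct request body can be shown with a small sketch. The field names are taken from the server's validation error; the string types, the absence of JSON tags, and the placeholder values are assumptions, and the helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// releaseNotesPayload uses the field names from the validation error.
// Field types and JSON naming are assumptions for this sketch.
type releaseNotesPayload struct {
	CurrentRelease string
	ReleaseDate    string
	ReleaseSHA     string
}

// marshalEnveloped reproduces the old, rejected shape: the fields are nested
// under a "ReleaseNotesPayload" key.
func marshalEnveloped(p releaseNotesPayload) ([]byte, error) {
	return json.Marshal(map[string]releaseNotesPayload{"ReleaseNotesPayload": p})
}

// marshalDirect produces the shape the endpoint expects: fields at the top level.
func marshalDirect(p releaseNotesPayload) ([]byte, error) {
	return json.Marshal(p)
}

// hasTopLevelField reports whether body is a JSON object with key at its top level.
func hasTopLevelField(body []byte, key string) bool {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(body, &fields); err != nil {
		return false
	}
	_, ok := fields[key]
	return ok
}

func main() {
	// Placeholder values, not real release data.
	p := releaseNotesPayload{CurrentRelease: "v0.0.0", ReleaseDate: "2000-01-01", ReleaseSHA: "0000000"}

	enveloped, err := marshalEnveloped(p)
	if err != nil {
		panic(err)
	}
	direct, err := marshalDirect(p)
	if err != nil {
		panic(err)
	}
	fmt.Println(hasTopLevelField(enveloped, "ReleaseSHA"), hasTopLevelField(direct, "ReleaseSHA"))
}
```

With the envelope, a server that validates top-level required fields sees none of them, which is consistent with the 400 quoted above.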
The docs release-notes endpoint is a Cloud Function that synthesizes a release-notes draft. Cold starts plus the docs work itself can easily exceed the previous 30s client timeout; pick-sha runs were failing the post-pick release-notes call with "context deadline exceeded". Raise the timeout to 2 minutes so the legitimate work has time to complete while still bounding pathological hangs.

Epic: none
Release note: None
Two more independent auth failures in the prod release pipeline.
1. macOS signing impersonated a service account that does not exist.
RELEASE_SIGNING_SERVICE_ACCOUNT in release-build-and-sign.yml pointed
at release-signing@releases-prod, whose Gaia ID does not resolve.
Fetching Apple signing secrets from Secret Manager failed with 404
("Gaia id not found"). Point the prod arm at
gha-releases@releases-prod.iam.gserviceaccount.com, the consolidated
prod identity already used by the other release workflows.
2. The IBM docker build fetched cockroach tarballs from GCS using
gsutil, which does not honor the WIF credentials file that
google-github-actions/auth writes. gsutil fell back to the runner
VM's metadata service account and got a 403 from the release-staged
bucket. Switch the two gs:// downloads in
build-cockroach-release-docker.sh to 'gcloud storage cp', which
reads CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE / GOOGLE_APPLICATION_CREDENTIALS
and uses the impersonated identity as intended.
Epic: none
Release note: None
The sentry-panic helper shelled out to 'gsutil cp' to fetch the linux-amd64 release tarball. Under Workload Identity Federation, gsutil does not honor the credentials file google-github-actions/auth writes and silently falls back to the runner VM's metadata service account, which does not have access to the release-staged bucket. The download failed with a bare 'exit status 1' because the command's stderr was discarded.

Switch to 'gcloud storage cp', which reads the WIF credentials, and wire the subcommand's stdout/stderr through to the parent so future failures surface real diagnostics instead of an empty exit code.

Epic: none
Release note: None
Three independent follow-ups, all surfaced by operating the new GHA release pipeline on release-26.2.1-rc.

1. cloud-rollout job needs Go on PATH. rafa-production's update_versions.sh shells out to `go run main.go update-versions ...`. The cockroach GHA runner doesn't ship Go, so the job died with `go: command not found` (exit 127). Add an actions/setup-go step pinned to go.mod's 1.26.2 so the rafa tool builds. The symmetric update-versions-master job in release-publish.yml doesn't need this: it runs inside the bazel container via run_bazel, which already has Go.
2. Narrow the pick-sha success Slack audience to RelEng. The 'SHA picked / build-and-sign triggered' announcement was posting to #db-release-status (or #db-release-test for fork rehearsals), but only RelEng acts on it. Route it to #release-ops (or #release-ops-staging) via two new pick-sha-specific channel constants. Branch-cut deliberately keeps the wider channels, since its notification is genuinely useful to the broader release audience.
3. Inline the generated release-notes PR in the success message. The docs release-notes API now returns the PR URL; operators previously had to dig it out of the docs ticket. postReleaseNotes now parses the response into a releaseNotesResponse, and the call is reordered to happen between the Jira SHA update and the Slack post so the PR URL can be rendered as an extra bullet in the announcement (and in the mirrored Jira comment). The reorder is safe because the Pick SHA subtask is already transitioned to Done before this point, so the idempotency gate at the top of processCandidate prevents any retry from re-firing the API and producing a duplicate docs draft.

Epic: none
Release note: None
The 'Slack notification' steps in release-build-and-sign.yml and release-publish.yml advertise themselves as 'Post to #release-ops' in their step names, but the hardcoded channel ID C08H9FA1W3C actually resolves to #db-release-test, so the notifications were landing in the wrong channel. Replace the ID with the literal '#release-ops' channel name, matching the convention the other release workflows (release-pick-sha.yml, release-branch-cut.yml) already use.

Epic: none
Release note: None
😎 Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch.

Thanks for opening a backport. Before merging, please confirm that it falls into one of the following categories (select one):
Add a brief release justification to the PR description explaining your selection. Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy. All backports must be reviewed by the TL and EM for the owning area.

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs.
Merged commit a5b0f48 into cockroachdb:release-25.4.
Backport 8/8 commits from the following release-26.1 backports on behalf of @rail:
The migration PR (#170348) sets up the new GHA release pipeline; the
remaining commits are the operational fixes surfaced while running
that pipeline on release-26.2.1-rc. Cherry-picked cleanly onto
release-25.4 with no conflicts.
Release justification: release-tooling fixes for GHA migration.
Epic: none
Release note: None