release-25.4: release tooling: bundled backport of GHA-migration follow-ups #170804

Merged: trunk-io[bot] merged 8 commits into cockroachdb:release-25.4 from rail:backport25.4-170348-170392-170657-170670-170686-170727-170765-170779 on May 22, 2026

@rail rail commented May 22, 2026

Backport 8/8 commits from the following release-26.1 backports on behalf of @rail:

The migration PR (#170348) sets up the new GHA release pipeline; the
remaining commits are the operational fixes surfaced while running
that pipeline on release-26.2.1-rc. Cherry-picked cleanly onto
release-25.4 with no conflicts.


Release justification: release-tooling fixes for GHA migration.

Epic: none
Release note: None

trunk-io Bot and others added 8 commits May 22, 2026 09:05
Backport 1/1 commits from cockroachdb#170298 on behalf of @rail.

----

Move the release build/sign, publish, branch-cut, and pick-SHA pipelines
from TeamCity-driven workflows to GitHub Actions, while preserving the
existing TC shell scripts as the underlying build steps.

New GitHub Actions workflows under .github/workflows/:

- release-build-and-sign.yml — per-platform builds (linux amd64/arm64,
  s390x, FIPS, darwin amd64/arm64, windows), Docker multi-arch image
  builds, macOS notarization, IBM/GPG signing, sentry release upload,
  and Slack notification. Concurrency-grouped, per-job timeouts, all
  third-party actions pinned to commit SHAs, secrets fetched from GCP
  Secret Manager via WIF (no GitHub repo secrets), umask-restricted
  on-disk secret materialization. rcodesign installed from a SHA-pinned
  upstream binary into $RUNNER_TEMP/bin (no cargo, no sudo).
- release-publish.yml — promotes staged artifacts to DockerHub and the
  Red Hat container catalog, and opens RAFA cloud-rollout PRs. A
  single approve-publish job hosts the release-ops environment so the
  reviewer clicks approve once per dispatch and every downstream
  publish job inherits the gate transitively.
- release-branch-cut.yml — cuts staging branches, files Jira tickets
  (ADF-rendered), creates the backport label, and posts to Slack.
- release-pick-sha.yml — picks a release SHA, writes it back to the
  Jira ticket, dispatches release-build-and-sign, and notifies the
  docs release-notes API.

Both build-and-sign and publish accept a comma-separated skip_jobs
input (validated by a first-stage validate-skip-jobs job) so an
operator can re-dispatch after a partial infra failure without re-
running already-successful jobs. Downstream jobs honor a 'skipped'
upstream as success-equivalent only when the upstream is explicitly
in skip_jobs, so cascade-skips from real failures don't masquerade as
successful resumes.
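
The skip rule above can be modeled in a few lines of Go. This is only a sketch of the logic, not the workflow's actual implementation (which lives in YAML `if:` expressions): `knownJobs`, `parseSkipJobs`, and `upstreamOK` are hypothetical names, and `sign-macos` is an invented job name used for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// knownJobs is a hypothetical job list; the real workflows define their own.
var knownJobs = map[string]bool{
	"build-linux":  true,
	"build-docker": true,
	"sign-macos":   true, // invented name, for illustration only
}

// parseSkipJobs splits the comma-separated input and rejects unknown names,
// roughly what a first-stage validation job would need to do.
func parseSkipJobs(input string) (map[string]bool, error) {
	skip := map[string]bool{}
	for _, name := range strings.Split(input, ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue
		}
		if !knownJobs[name] {
			return nil, fmt.Errorf("unknown job in skip_jobs: %q", name)
		}
		skip[name] = true
	}
	return skip, nil
}

// upstreamOK reports whether a downstream job may proceed. A "skipped"
// upstream counts as success only if the operator asked for that skip.
func upstreamOK(upstream, result string, skip map[string]bool) bool {
	switch result {
	case "success":
		return true
	case "skipped":
		return skip[upstream]
	default:
		return false
	}
}

func main() {
	skip, err := parseSkipJobs("build-linux, build-docker")
	if err != nil {
		panic(err)
	}
	// Operator-requested skip: treated as success-equivalent.
	fmt.Println(upstreamOK("build-linux", "skipped", skip))
	// Cascade-skip caused by a real failure elsewhere: not accepted.
	fmt.Println(upstreamOK("sign-macos", "skipped", skip))
}
```

Keeping the explicit-membership check separate from the result check is what stops a cascade-skip from being read as a resume.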

Companion build/github/release-*.sh wrappers translate GHA env
conventions to the existing build/teamcity/internal/release/...
scripts, which gain conditional WIF auth and dev-vs-prod GCS / Artifact
Registry project selection so they can be invoked from either driver.
TC code paths in every shared script are untouched. Branch-cut and
pick-SHA additionally build and run their release binary inside the
bazel docker container (via run_bazel) so the host runner doesn't
need a bazel/Go toolchain installed. The wrappers forward
GITHUB_REPOSITORY into the container so the binary's defaultRepo()
helper picks up the dispatching repo instead of falling back to
cockroachdb/cockroach.

A new pkg/cmd/release Go CLI drives the branch-cut and pick-SHA
workflows. It includes Jira (REST v3 + ADF), GitHub, Slack, and docs
release-notes API clients, with unit tests for the SHA-pick and
branch-cut commands. All HTTP clients are bounded by named per-call
timeouts via httputil.NewClientWithTimeout so a wedged upstream API
can't hang the cron run. update-versions takes --cockroach-repo and
--github-username flags so the push targets aren't bound to specific
literals; the dry-run override fires on isProductionRepo() rather
than matching a hardcoded repo name, so a future prod-repo rename is
a zero-line code change in the binary. Per-ticket summary logs in the
branch-cut runner are dry-run-aware so a rehearsal run doesn't claim
to have cut a branch it skipped.

Prod-vs-non-prod side effects (Slack channel selection, customer-
facing publish jobs) are gated on the IS_PRODUCTION_REPO repository
variable. WIF provider/SA/GCP project selection is gated on a separate
USE_PROD_GCP variable so a staging-prod repo can exercise the prod
control-flow paths against the dev GCP project — operators set both
on the real prod repo, only IS_PRODUCTION_REPO on a rehearsal fork.
Forks default to dry-run automatically and cross-repo abuse is still
blocked at the WIF attribute_condition.

Per-job dispatch refs are restricted to master, release-*-rc, and
staging-v* via if: allow-lists, with the release-ops environment's
deployment-branches policy as the authoritative gate behind the
single approve-publish job.
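
The allow-list can be pictured as a small glob check. The sketch below is an assumption-laden model in Go using `path.Match`; the workflows themselves express this in `if:` conditions, and `allowedRefs` and `refAllowed` are names invented here.

```go
package main

import (
	"fmt"
	"path"
)

// allowedRefs mirrors the three patterns named in the commit message.
var allowedRefs = []string{"master", "release-*-rc", "staging-v*"}

// refAllowed reports whether ref matches any allowed pattern. path.Match's
// '*' does not cross '/', so refs containing slashes never match here.
func refAllowed(ref string) bool {
	for _, pattern := range allowedRefs {
		if ok, err := path.Match(pattern, ref); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	for _, ref := range []string{"master", "release-26.2.1-rc", "staging-v26.2.1", "feature/foo"} {
		fmt.Printf("%-20s %v\n", ref, refAllowed(ref))
	}
}
```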

Release-26.1 adaptations not in the upstream PR:

- build-linux and build-per-platform-ibm in release-build-and-sign.yml
  gain docker/setup-buildx-action and docker/setup-qemu-action steps
  because the per-platform build script still does an in-job arm64
  docker build on this branch (master moved that to the separate
  build-docker job). Without QEMU binfmt handlers the arm64 RUN steps
  abort with 'exec format error'.
- build-cockroach-release-cloud-only.sh drops the pre-WIF unconditional
  gcr_staged_credentials assignments in the upper if/else block; the
  WIF-aware block below already handles them, and leaving the
  unconditionals in place trips set -u under WIF where
  GCS_CREDENTIALS_PROD/DEV are unset.

Release justification: release automation changes
Epic: none
Release note: None

The release binary's isProductionRepo() check reads the IS_PRODUCTION_REPO
env var to decide between the production Slack channel
(#db-release-status) and the rehearsal channel (#db-release-test). The
branch-cut and pick-sha workflows gate themselves on
vars.IS_PRODUCTION_REPO via their job-level if:, but never exported the
value to the run step or the bazel docker container. As a result, a
non-dry-run on the production repo still routed Slack to
#db-release-test.

Add IS_PRODUCTION_REPO to each workflow's env: block and to the docker
-e passthrough list in the wrapper scripts, so the var reaches the
binary on both layers.

Release note: None
Epic: none

Two unrelated bugs were preventing prod release runs from completing cleanly.

1. Build/publish workflows used a service account that does not exist.
   The prod RELEASE_GCP_SERVICE_ACCOUNT in release-build-and-sign.yml and
   release-publish.yml pointed at release-artifacts@releases-prod, whose
   Gaia ID does not resolve. GCS uploads of release artifacts failed with
   a 404 ("Gaia id not found"). Point both workflows at
   gha-releases@releases-prod.iam.gserviceaccount.com, matching what
   release-pick-sha.yml and release-branch-cut.yml already use.

2. pick-sha's release-notes API call wrapped its payload incorrectly.
   The docs release-notes endpoint expects the payload fields at the top
   level of the request body, but postReleaseNotes wrapped them in a
   {"ReleaseNotesPayload": ...} envelope. The server rejected every call
   with 400 ("Field validation for 'CurrentRelease' / 'ReleaseDate' /
   'ReleaseSHA' failed on the 'required' tag"). Marshal the payload
   directly and update the test that previously asserted the envelope.

Epic: none
Release note: None

The docs release-notes endpoint is a Cloud Function that synthesizes a
release-notes draft. Cold starts plus the docs work itself can easily
exceed the previous 30s client timeout; pick-sha runs were failing the
post-pick release-notes call with "context deadline exceeded".

Raise the timeout to 2 minutes so the legitimate work has time to
complete while still bounding pathological hangs.

Epic: none
Release note: None

Two more independent auth failures in the prod release pipeline.

1. macOS signing impersonated a service account that does not exist.
   RELEASE_SIGNING_SERVICE_ACCOUNT in release-build-and-sign.yml pointed
   at release-signing@releases-prod, whose Gaia ID does not resolve.
   Fetching Apple signing secrets from Secret Manager failed with 404
   ("Gaia id not found"). Point the prod arm at
   gha-releases@releases-prod.iam.gserviceaccount.com, the consolidated
   prod identity already used by the other release workflows.

2. The IBM docker build fetched cockroach tarballs from GCS using
   gsutil, which does not honor the WIF credentials file that
   google-github-actions/auth writes. gsutil fell back to the runner
   VM's metadata service account and got a 403 from the release-staged
   bucket. Switch the two gs:// downloads in
   build-cockroach-release-docker.sh to 'gcloud storage cp', which
   reads CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE / GOOGLE_APPLICATION_CREDENTIALS
   and uses the impersonated identity as intended.

Epic: none
Release note: None

The sentry-panic helper shelled out to 'gsutil cp' to fetch the
linux-amd64 release tarball. Under Workload Identity Federation,
gsutil does not honor the credentials file google-github-actions/auth
writes and silently falls back to the runner VM's metadata service
account, which does not have access to the release-staged bucket. The
download failed with a bare 'exit status 1' because the command's
stderr was discarded.

Switch to 'gcloud storage cp', which reads the WIF credentials, and
wire the subcommand's stdout/stderr through to the parent so future
failures surface real diagnostics instead of an empty exit code.

Epic: none
Release note: None

Three independent follow-ups, all surfaced by operating the new GHA
release pipeline on release-26.2.1-rc.

1. cloud-rollout job needs Go on PATH.
   rafa-production's update_versions.sh shells out to `go run main.go
   update-versions ...`. The cockroach GHA runner doesn't ship Go,
   so the job died with `go: command not found` (exit 127). Add an
   actions/setup-go step pinned to go.mod's 1.26.2 so the rafa tool
   builds. The symmetric update-versions-master job in
   release-publish.yml doesn't need this — it runs inside the bazel
   container via run_bazel, which already has Go.

2. Narrow the pick-sha success Slack audience to RelEng.
   The 'SHA picked / build-and-sign triggered' announcement was
   posting to #db-release-status (or #db-release-test for fork
   rehearsals), but only RelEng acts on it. Route it to #release-ops
   (or #release-ops-staging) via two new pick-sha-specific channel
   constants. Branch-cut deliberately keeps the wider channels — its
   notification is genuinely useful to the broader release audience.

3. Inline the generated release-notes PR in the success message.
   The docs release-notes API now returns the PR URL; operators
   previously had to dig it out of the docs ticket. postReleaseNotes
   now parses the response into a releaseNotesResponse, and the call
   is reordered to happen between the Jira SHA update and the Slack
   post so the PR URL can be rendered as an extra bullet in the
   announcement (and in the mirrored Jira comment). The reorder is
   safe because the Pick SHA subtask is already transitioned to Done
   before this point, so the idempotency gate at the top of
   processCandidate prevents any retry from re-firing the API and
   producing a duplicate docs draft.
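
Item 3 can be sketched as follows. The type name `releaseNotesResponse` comes from the commit message, but its field, the `pr_url` JSON key, the bullet wording, and `successBullets` are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// releaseNotesResponse: the field name and JSON key are assumed, not taken
// from the real API.
type releaseNotesResponse struct {
	PRURL string `json:"pr_url"`
}

// successBullets parses the API response body and returns the announcement
// bullets, adding the PR link only when the response carried one.
func successBullets(sha string, body []byte) ([]string, error) {
	var resp releaseNotesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("parsing release-notes response: %w", err)
	}
	bullets := []string{"SHA: " + sha}
	if resp.PRURL != "" {
		bullets = append(bullets, "Release notes PR: "+resp.PRURL)
	}
	return bullets, nil
}

func main() {
	body := []byte(`{"pr_url":"https://example.com/pr/1"}`) // placeholder URL
	bullets, err := successBullets("deadbeef", body)
	if err != nil {
		panic(err)
	}
	fmt.Println("• " + strings.Join(bullets, "\n• "))
}
```

Because the bullets are built after the response is parsed, the API call has to precede the Slack post, which is the reordering the commit describes.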

Epic: none
Release note: None

The 'Slack notification' steps in release-build-and-sign.yml and
release-publish.yml advertise themselves as 'Post to #release-ops'
in their step names, but the hardcoded channel ID C08H9FA1W3C
actually resolves to #db-release-test — so the notifications were
landing in the wrong channel. Replace the ID with the literal
'#release-ops' channel name, matching the convention the other
release workflows (release-pick-sha.yml, release-branch-cut.yml)
already use.

Epic: none
Release note: None

@rail rail requested a review from a team as a code owner May 22, 2026 13:06

trunk-io Bot commented May 22, 2026

😎 Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch.


blathers-crl Bot commented May 22, 2026

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes OR fixes for serious issues. Non-production includes test-only changes, build system changes, etc. Serious issues are defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to. Reference the approved ENGREQ ticket in the PR body (e.g., "Fixes ENGREQ-123").

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl Bot added the backport and T-code-systems labels May 22, 2026

blathers-crl Bot commented May 22, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@rail rail self-assigned this May 22, 2026


blathers-crl Bot commented May 22, 2026

Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link)

@trunk-io trunk-io Bot merged commit a5b0f48 into cockroachdb:release-25.4 May 22, 2026
28 of 29 checks passed
@rail rail deleted the backport25.4-170348-170392-170657-170670-170686-170727-170765-170779 branch May 22, 2026 13:56

Labels

backport, T-code-systems, target-release-25.4.11
