release: migrate release pipelines from TeamCity to GitHub Actions #170298

Merged: trunk-io[bot] merged 1 commit into cockroachdb:master from rail:rail/gha-releases on May 14, 2026
rail (Member) commented May 13, 2026

Move the release build/sign, publish, branch-cut, and pick-SHA pipelines from TeamCity-driven workflows to GitHub Actions, while preserving the existing TC shell scripts as the underlying build steps.

New GitHub Actions workflows under .github/workflows/:

  • release-build-and-sign.yml — per-platform builds (linux amd64/arm64, s390x, FIPS, darwin amd64/arm64, windows), Docker multi-arch image builds, macOS notarization, IBM/GPG signing, sentry release upload, and Slack notification. Concurrency-grouped, per-job timeouts, all third-party actions pinned to commit SHAs, secrets fetched from GCP Secret Manager via WIF (no GitHub repo secrets), umask-restricted on-disk secret materialization. rcodesign installed from a SHA-pinned upstream binary into $RUNNER_TEMP/bin (no cargo, no sudo).
  • release-publish.yml — promotes staged artifacts to DockerHub, the Red Hat container catalog, and opens RAFA cloud-rollout PRs. A single approve-publish job hosts the release-ops environment so the reviewer clicks approve once per dispatch and every downstream publish job inherits the gate transitively.
  • release-branch-cut.yml — cuts staging branches, files Jira tickets (ADF-rendered), creates the backport label, and posts to Slack.
  • release-pick-sha.yml — picks a release SHA, writes it back to the Jira ticket, dispatches release-build-and-sign, and notifies the docs release-notes API.

Both build-and-sign and publish accept a comma-separated skip_jobs input (validated by a first-stage validate-skip-jobs job) so an operator can re-dispatch after a partial infra failure without re-running already-successful jobs. Downstream jobs honor a 'skipped' upstream as success-equivalent only when the upstream is explicitly in skip_jobs, so cascade-skips from real failures don't masquerade as successful resumes.
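The resume rule described here can be sketched in Go. This is only an illustration of the decision logic, not the workflows' actual `if:` expressions: `parseSkipJobs`, `upstreamOK`, and the job names are hypothetical, and the result strings are the values GitHub Actions reports in `needs.<job>.result`.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSkipJobs splits a comma-separated skip_jobs input into a set,
// trimming whitespace and ignoring empty entries.
func parseSkipJobs(input string) map[string]bool {
	set := make(map[string]bool)
	for _, name := range strings.Split(input, ",") {
		if name = strings.TrimSpace(name); name != "" {
			set[name] = true
		}
	}
	return set
}

// upstreamOK reports whether a downstream job may proceed given the
// result of one upstream job. A "skipped" upstream counts as success
// only when the operator listed that job in skip_jobs; a skip that
// cascaded from a real failure does not.
func upstreamOK(job, result string, skip map[string]bool) bool {
	switch result {
	case "success":
		return true
	case "skipped":
		return skip[job]
	default: // "failure", "cancelled"
		return false
	}
}

func main() {
	// Hypothetical job names, for illustration only.
	skip := parseSkipJobs("build-linux, build-windows")
	fmt.Println(upstreamOK("build-linux", "skipped", skip))  // explicitly skipped
	fmt.Println(upstreamOK("build-darwin", "skipped", skip)) // cascade-skip
	fmt.Println(upstreamOK("build-darwin", "success", skip))
}
```

The key design point is that the skip set comes from operator input, so a skip the operator did not ask for is never treated as a pass.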

Companion build/github/release-*.sh wrappers translate GHA env conventions to the existing build/teamcity/internal/release/... scripts, which gain conditional WIF auth and dev-vs-prod GCS / Artifact Registry project selection so they can be invoked from either driver. TC code paths in every shared script are untouched. Branch-cut and pick-SHA additionally build and run their release binary inside the bazel docker container (via run_bazel) so the host runner doesn't need a bazel/Go toolchain installed. The wrappers forward GITHUB_REPOSITORY into the container so the binary's defaultRepo() helper picks up the dispatching repo instead of falling back to cockroachdb/cockroach.

A new pkg/cmd/release Go CLI drives the branch-cut and pick-SHA workflows. It includes Jira (REST v3 + ADF), GitHub, Slack, and docs release-notes API clients, with unit tests for the SHA-pick and branch-cut commands. All HTTP clients are bounded by named per-call timeouts via httputil.NewClientWithTimeout so a wedged upstream API can't hang the cron run. update-versions takes --cockroach-repo and --github-username flags so the push targets aren't bound to specific literals; the dry-run override fires on isProductionRepo() rather than matching a hardcoded repo name, so a future prod-repo rename is a zero-line code change in the binary. Per-ticket summary logs in the branch-cut runner are dry-run-aware so a rehearsal run doesn't claim to have cut a branch it skipped.
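As a rough illustration of the named per-call timeout idea, here is a standard-library sketch. It does not reproduce httputil.NewClientWithTimeout; `newClientWithTimeout` is a hypothetical stand-in, and the timeout constants and their values are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Named timeouts, one per upstream API. The names and values are
// assumptions for this sketch, not the values used in the PR.
const (
	jiraTimeout  = 30 * time.Second
	slackTimeout = 10 * time.Second
)

// newClientWithTimeout is a hypothetical stand-in for the helper the
// PR uses. http.Client.Timeout bounds the whole exchange: connecting,
// sending the request, and reading the response body.
func newClientWithTimeout(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

func main() {
	jira := newClientWithTimeout(jiraTimeout)
	slack := newClientWithTimeout(slackTimeout)
	fmt.Println(jira.Timeout, slack.Timeout)
}
```

With a bounded client, a request to an unresponsive server returns an error once the timeout elapses instead of blocking the run indefinitely.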

Prod-vs-non-prod side effects (Slack channel selection, customer-facing publish jobs) are gated on the IS_PRODUCTION_REPO repository variable. WIF provider/SA/GCP project selection is gated on a separate USE_PROD_GCP variable so a staging-prod repo can exercise the prod control-flow paths against the dev GCP project: operators set both on the real prod repo, and only IS_PRODUCTION_REPO on a rehearsal fork. Forks default to dry-run automatically and cross-repo abuse is still blocked at the WIF attribute_condition.
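The two-variable gating can be pictured as a small decision table. The Go sketch below is an assumption-laden illustration: `releaseEnv`, `resolveEnv`, and the comparison against the string "true" are hypothetical, and the real gating lives in the workflow files rather than in Go.

```go
package main

import "fmt"

// releaseEnv captures the two independent switches described above.
type releaseEnv struct {
	ProdSideEffects bool // prod Slack channels, customer-facing publish jobs
	ProdGCP         bool // prod WIF provider, service account, GCP project
}

// resolveEnv maps the two repository variables to switches. Repository
// variables are strings and an unset one reads as empty; comparing to
// "true" is an assumption of this sketch.
func resolveEnv(isProductionRepo, useProdGCP string) releaseEnv {
	return releaseEnv{
		ProdSideEffects: isProductionRepo == "true",
		ProdGCP:         useProdGCP == "true",
	}
}

func main() {
	fmt.Printf("real prod repo:  %+v\n", resolveEnv("true", "true"))
	fmt.Printf("rehearsal fork:  %+v\n", resolveEnv("true", ""))
	fmt.Printf("ordinary fork:   %+v\n", resolveEnv("", ""))
}
```

Keeping the switches separate is what lets a rehearsal fork walk the prod control flow while still talking to the dev GCP project.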

Per-job dispatch refs are restricted to master, release-*-rc, and staging-v* via if: allow-lists, with the release-ops environment's deployment-branches policy as the authoritative gate behind the single approve-publish job.
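For intuition, an allow-list over the patterns master, release-*-rc, and staging-v* behaves like the Go sketch below. It is an analogy only: the workflows express this with `if:` conditions, `refAllowed` is a hypothetical helper, and the sample branch names are made up for the example.

```go
package main

import (
	"fmt"
	"path"
)

// allowedRefPatterns mirrors the allow-list from the description.
var allowedRefPatterns = []string{"master", "release-*-rc", "staging-v*"}

// refAllowed reports whether a branch name matches any allowed pattern.
// path.Match's "*" matches any run of characters other than "/".
func refAllowed(ref string) bool {
	for _, pattern := range allowedRefPatterns {
		if ok, err := path.Match(pattern, ref); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical branch names, for illustration only.
	for _, ref := range []string{"master", "release-26.2-rc", "staging-v26.2.1", "release-26.2", "feature/foo"} {
		fmt.Printf("%-18s allowed=%v\n", ref, refAllowed(ref))
	}
}
```

As the description notes, a per-job check like this is a convenience layer; the environment's deployment-branches policy remains the authoritative gate.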

Epic: none
Release note: None

rail self-assigned this May 13, 2026
rail requested a review from a team as a code owner May 13, 2026 20:27
rail added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), A-release, and T-release (Release Engineering & Automation Team) labels May 13, 2026
trunk-io Bot (Contributor) commented May 13, 2026

😎 Merged successfully - details.

blathers-crl Bot commented May 13, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rail force-pushed the rail/gha-releases branch from 10e0aef to 7e523af May 13, 2026 20:30
rail requested a review from celiala May 13, 2026 21:21
celiala left a comment

This is a large, well-constructed migration. The idempotency gates (pick-SHA subtask, Baking-transition status guard, branch-existence recovery), WIF secret handling, TC compatibility blocks, and HTTP timeout discipline are all solid. A few non-blocking things to consider before merging.

Findings not tied to a specific line:

  • [tests] cutRunner.run() and pickSHARunner.run() both continue past per-candidate errors and call notifyFailure for each, but no test exercises a multi-candidate list where candidate 1 fails and candidate 2 succeeds — verifying the ops-channel Slack post fires AND the second candidate's side effects land. Two releases sharing a cut date is normal at series boundaries.
  • [tests] The notifyReleaseNotes docs-subtask-Done gate (prevents duplicate docs drafts on re-run) is never hit by any test. TestPickSHARunnerProcessCandidateIdempotent exits at the outer pick-SHA-subtask gate before reaching it. Add a variant where the pick-SHA subtask is open but the docs subtask is Done; assert the release-notes API mock receives zero calls.
  • [tests] The ops-channel notifyFailure Slack post path is untested. If it silently breaks, operators get no alert when the daily cron fails.
  • [tests] CreateLabel HTTP 422 (already-exists idempotency) is never exercised in tests. After a go-github version bump that changes the error type, re-run recovery on an existing label would fail.
  • [nit] darwin-amd64 is absent from the macOS signing loop — was this intentionally excluded from the prior TC signing scope?

(made with /review-crdb)
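To make the CreateLabel finding concrete, the idempotency behavior a test would pin down might look like the sketch below. Everything here is hypothetical: `apiError` stands in for whatever error type the GitHub client returns, and `ensureLabel` is an illustrative wrapper rather than the PR's code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical stand-in for the client library's error
// type; only the HTTP status code matters for this sketch.
type apiError struct{ StatusCode int }

func (e *apiError) Error() string {
	return fmt.Sprintf("API error: HTTP %d", e.StatusCode)
}

// ensureLabel runs create and treats HTTP 422 (label already exists)
// as success, so re-running after a partial failure is harmless.
func ensureLabel(create func() error) error {
	err := create()
	var apiErr *apiError
	if errors.As(err, &apiErr) && apiErr.StatusCode == http.StatusUnprocessableEntity {
		return nil
	}
	return err
}

func main() {
	alreadyExists := func() error {
		return fmt.Errorf("creating label: %w", &apiError{StatusCode: 422})
	}
	serverError := func() error { return &apiError{StatusCode: 500} }

	fmt.Println(ensureLabel(alreadyExists)) // wrapped 422 is swallowed
	fmt.Println(ensureLabel(serverError))   // other errors propagate
}
```

A test along these lines fails loudly if a library upgrade changes the error type so that the 422 check no longer matches.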

source "$dir/build/release/teamcity-support.sh"

secrets_dir="${RUNNER_TEMP:-.}/.secrets"

/review-crdb(suggestion): ${RUNNER_TEMP:-.} silently falls back to . if RUNNER_TEMP is unset, writing GPG private keys into the workspace root — a host-visible bind-mount that persists after the container is removed.

Suggested change
secrets_dir="${RUNNER_TEMP:?RUNNER_TEMP must be set}/.secrets"

source "$dir/build/shlib.sh"

secrets_dir="${RUNNER_TEMP:-.}/.secrets"

/review-crdb(suggestion): Same issue as release-sign-ibm.sh — Apple signing certs would land in the workspace root if RUNNER_TEMP is unset.

Suggested change
secrets_dir="${RUNNER_TEMP:?RUNNER_TEMP must be set}/.secrets"

}

msg := buildSlackMessage(details, r.dryRun)
link, err := r.slack.PostMessage(r.channel, msg)

/review-crdb(suggestion): The Slack branch-cut announcement and the Jira comment (a few lines below) both fire unconditionally — including when branchAlreadyCut=true (re-entry after a partial failure). The release team would receive a duplicate "branch has been cut" Slack message and a second Jira comment for the same branch.

Consider guarding both on !branchAlreadyCut.
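A minimal sketch of the suggested guard, under stated assumptions: `announcer`, `recorder`, and `announceCut` are hypothetical names, and a single interface stands in for both the Slack and Jira clients.

```go
package main

import "fmt"

// announcer is a hypothetical stand-in for the Slack and Jira clients.
type announcer interface {
	Announce(branch string) error
}

// recorder is a fake announcer that records every call it receives.
type recorder struct{ calls []string }

func (r *recorder) Announce(branch string) error {
	r.calls = append(r.calls, branch)
	return nil
}

// announceCut posts the branch-cut announcement unless the branch was
// already cut by an earlier run.
func announceCut(a announcer, branch string, branchAlreadyCut bool) error {
	if branchAlreadyCut {
		return nil // re-entry after a partial failure: stay quiet
	}
	return a.Announce(branch)
}

func main() {
	rec := &recorder{}
	_ = announceCut(rec, "staging-v26.2.1", false) // first run announces
	_ = announceCut(rec, "staging-v26.2.1", true)  // re-entry does not
	fmt.Println(len(rec.calls))
}
```

One consequence of this guard is that a run which cut the branch but failed before announcing would not announce on retry; the thread does not say whether that trade-off is acceptable.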

"dry_run": strconv.FormatBool(r.dryRun),
}
if err := r.gh.DispatchWorkflow(ctx, ref, r.buildWorkflow, inputs); err != nil {
return errors.Wrapf(err, "dispatching %s on %s", r.buildWorkflow, ref)

/review-crdb(suggestion): DispatchWorkflow fires before Transition(subtaskDone) (line 291). If the dispatch succeeds but the subtask transition fails, the next cron run will re-dispatch build-and-sign — a multi-hour multi-platform rebuild. The code comment acknowledges this window.

Consider swapping the order: transition the subtask to Done first, then dispatch. A transition failure with no dispatch is recoverable (next cron re-dispatches); a duplicate dispatch is expensive and non-obvious.

Name string `json:"name"`
}
if err := json.Unmarshal(raw, &s); err != nil {
return ""

/review-crdb(suggestion): statusName() silently swallows unmarshal errors with no log. When Jira changes the shape of the status field, statusName() will return "", the Baking transition will fire, and the resulting Jira API rejection ("already in this state") will be confusing to diagnose with no trace in the run log.

Suggested change
- return ""
+ log.Printf("warning: failed to parse Jira status field: %v", err)
+ return ""

defer resp.Body.Close()
if resp.StatusCode >= 300 {
respBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1024))
return errors.Newf("release-notes API returned %s: %s", resp.Status, respBody)

/review-crdb(nit): io.ReadAll error is dropped with _. If the read fails (e.g. connection reset), the returned error contains only the status code with no body context. Consistent with how jira_client.go handles this:

Suggested change
- return errors.Newf("release-notes API returned %s: %s", resp.Status, respBody)
+ respBody, readErr := io.ReadAll(io.LimitReader(resp.Body, 1024))
+ if readErr != nil {
+ return errors.Wrapf(readErr, "release-notes API returned %s (body read failed)", resp.Status)
+ }
+ return errors.Newf("release-notes API returned %s: %s", resp.Status, respBody)

celiala left a comment

A few non-blocking things to consider before merging.

approving / LGTM, since these are all non-blocking

trunk-io Bot merged commit 995abe6 into cockroachdb:master May 14, 2026
33 checks passed
rail (Member, Author) commented May 14, 2026

blathers backport 26.2

rail deleted the rail/gha-releases branch May 14, 2026 15:34
rail pushed a commit to rail/cockroach that referenced this pull request May 14, 2026
Backport 1/1 commits from cockroachdb#170298 on behalf of @rail.

----

Move the release build/sign, publish, branch-cut, and pick-SHA pipelines
from TeamCity-driven workflows to GitHub Actions, while preserving the
existing TC shell scripts as the underlying build steps.

New GitHub Actions workflows under .github/workflows/:

- release-build-and-sign.yml — per-platform builds (linux amd64/arm64,
  s390x, FIPS, darwin amd64/arm64, windows), Docker multi-arch image
  builds, macOS notarization, IBM/GPG signing, sentry release upload,
  and Slack notification. Concurrency-grouped, per-job timeouts, all
  third-party actions pinned to commit SHAs, secrets fetched from GCP
  Secret Manager via WIF (no GitHub repo secrets), umask-restricted
  on-disk secret materialization. rcodesign installed from a SHA-pinned
  upstream binary into $RUNNER_TEMP/bin (no cargo, no sudo).
- release-publish.yml — promotes staged artifacts to DockerHub, the
  Red Hat container catalog, and opens RAFA cloud-rollout PRs. A
  single approve-publish job hosts the release-ops environment so the
  reviewer clicks approve once per dispatch and every downstream
  publish job inherits the gate transitively.
- release-branch-cut.yml — cuts staging branches, files Jira tickets
  (ADF-rendered), creates the backport label, and posts to Slack.
- release-pick-sha.yml — picks a release SHA, writes it back to the
  Jira ticket, dispatches release-build-and-sign, and notifies the
  docs release-notes API.

Both build-and-sign and publish accept a comma-separated skip_jobs
input (validated by a first-stage validate-skip-jobs job) so an
operator can re-dispatch after a partial infra failure without re-
running already-successful jobs. Downstream jobs honor a 'skipped'
upstream as success-equivalent only when the upstream is explicitly
in skip_jobs, so cascade-skips from real failures don't masquerade as
successful resumes.

Companion build/github/release-*.sh wrappers translate GHA env
conventions to the existing build/teamcity/internal/release/...
scripts, which gain conditional WIF auth and dev-vs-prod GCS / Artifact
Registry project selection so they can be invoked from either driver.
TC code paths in every shared script are untouched. Branch-cut and
pick-SHA additionally build and run their release binary inside the
bazel docker container (via run_bazel) so the host runner doesn't
need a bazel/Go toolchain installed. The wrappers forward
GITHUB_REPOSITORY into the container so the binary's defaultRepo()
helper picks up the dispatching repo instead of falling back to
cockroachdb/cockroach.

A new pkg/cmd/release Go CLI drives the branch-cut and pick-SHA
workflows. It includes Jira (REST v3 + ADF), GitHub, Slack, and docs
release-notes API clients, with unit tests for the SHA-pick and
branch-cut commands. All HTTP clients are bounded by named per-call
timeouts via httputil.NewClientWithTimeout so a wedged upstream API
can't hang the cron run. update-versions takes --cockroach-repo and
--github-username flags so the push targets aren't bound to specific
literals; the dry-run override fires on isProductionRepo() rather
than matching a hardcoded repo name, so a future prod-repo rename is
a zero-line code change in the binary. Per-ticket summary logs in the
branch-cut runner are dry-run-aware so a rehearsal run doesn't claim
to have cut a branch it skipped.

Prod-vs-non-prod side effects (Slack channel selection, customer-
facing publish jobs) are gated on the IS_PRODUCTION_REPO repository
variable. WIF provider/SA/GCP project selection is gated on a separate
USE_PROD_GCP variable so a staging-prod repo can exercise the prod
control-flow paths against the dev GCP project — operators set both
on the real prod repo, only IS_PRODUCTION_REPO on a rehearsal fork.
Forks default to dry-run automatically and cross-repo abuse is still
blocked at the WIF attribute_condition.

Per-job dispatch refs are restricted to master, release-*-rc, and
staging-v* via if: allow-lists, with the release-ops environment's
deployment-branches policy as the authoritative gate behind the
single approve-publish job.

Release-26.1 adaptations not in the upstream PR:

- build-linux and build-per-platform-ibm in release-build-and-sign.yml
  gain docker/setup-buildx-action and docker/setup-qemu-action steps
  because the per-platform build script still does an in-job arm64
  docker build on this branch (master moved that to the separate
  build-docker job). Without QEMU binfmt handlers the arm64 RUN steps
  abort with 'exec format error'.
- build-cockroach-release-cloud-only.sh drops the pre-WIF unconditional
  gcr_staged_credentials assignments in the upper if/else block; the
  WIF-aware block below already handles them, and leaving the
  unconditionals in place trips set -u under WIF where
  GCS_CREDENTIALS_PROD/DEV are unset.

Release justification: release automation changes
Epic: none
Release note: None
trunk-io Bot added a commit that referenced this pull request May 22, 2026
Backport 1/1 commits from #170298 on behalf of @rail.

Labels

A-release, C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-release (Release Engineering & Automation Team), target-release-26.3.0
