ci: split orchestrator integrity into parallel jobs for faster validation#809
frostebite wants to merge 2 commits into main
Conversation
Rewrite the monolith orchestrator-integrity.yml (1110 lines, single job, 3+ hour sequential execution) into 4 parallel jobs that run on separate runners:

- k8s-tests: k3d cluster + LocalStack, 5 tests
- aws-provider-tests: LocalStack only, 10 tests
- local-docker-tests: Docker + LocalStack for S3 tests, 9 tests
- rclone-tests: rclone + LocalStack, 1 test

Key improvements:

- Wall-clock time drops from ~3h to ~1h (longest single job)
- Disk exhaustion eliminated: each job gets its own fresh 14GB runner
- Cleanup logic deduplicated via sourced shell functions instead of 15 copy-pasted 30-line blocks
- K3d node image cleanup only runs in the k8s job (where it matters)
- Light cleanup (cache + docker prune -f) between tests; heavy cleanup (prune -af --volumes) only at job boundaries
- workflow_call interface unchanged; integrity-check.yml needs no changes

Ref: #794

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

Refactors CI into parallel provider-specific jobs, centralizes reusable cleanup scripts, standardizes LocalStack and k3d lifecycle management, expands test matrices (k8s, AWS/LocalStack, local-docker, rclone), and adds explicit per-stage initialization, health checks, and teardown steps for consistent isolation and resource reclamation.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Runner as CI Runner
participant LocalStack as LocalStack
participant K3d as k3d Cluster
participant Tests as Test Suites
participant Storage as S3 / rclone
Runner->>LocalStack: start LocalStack container(s)
activate LocalStack
LocalStack-->>Runner: health OK
Runner->>Storage: create S3 buckets / configure AWS CLI
Runner->>K3d: create k3d cluster(s)
activate K3d
K3d-->>Runner: cluster ready
Runner->>Tests: run provider-specific test groups (k8s, aws, local-docker, rclone)
Tests-->>Storage: exercise S3 / rclone flows
Tests-->>K3d: deploy/validate k8s resources
Tests-->>Runner: report results
Runner->>Tests: per-test cleanup
Runner->>K3d: cleanup clusters, PVCs, Secrets
Runner->>LocalStack: stop & remove containers, volumes
deactivate K3d
deactivate LocalStack
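The LocalStack legs of this lifecycle can be sketched in shell. The image tag, port, and bucket name below are assumptions rather than values taken from the workflow, and the docker commands are guarded behind an opt-in flag so the sketch is safe to run anywhere:

```shell
#!/usr/bin/env bash
# Sketch of the LocalStack start / health-check / teardown flow from the
# diagram above. Image tag, bucket name, and port are assumptions.
# Set RUN_LOCALSTACK_DEMO=1 to execute the docker commands for real.
set -euo pipefail

start_localstack() {
  docker run -d --name localstack -p 4566:4566 localstack/localstack:3.7.2
  # Poll the health endpoint until LocalStack reports ready
  until curl -fsS http://localhost:4566/_localstack/health >/dev/null; do
    sleep 2
  done
  aws --endpoint-url http://localhost:4566 s3 mb s3://game-ci-test-bucket
}

teardown_localstack() {
  # Stop and remove the container plus its volumes, mirroring the teardown leg
  docker rm -f -v localstack
}

if [ "${RUN_LOCALSTACK_DEMO:-0}" = "1" ]; then
  start_localstack
  teardown_localstack
else
  echo "dry run: set RUN_LOCALSTACK_DEMO=1 to start LocalStack"
fi
```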
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/orchestrator-integrity.yml (2)
6-10: ⚠️ Potential issue | 🟠 Major

Honor `runGithubIntegrationTests` input for the GitHub checks suite.

`runGithubIntegrationTests` is declared (lines 6-10) but the GitHub checks test runs unconditionally (line 1025+), which changes expected behavior and runtime when callers leave the default `'false'`.

💡 Suggested guard
```diff
       - name: Run orchestrator-github-checks test (local-docker)
+        if: ${{ inputs.runGithubIntegrationTests == 'true' }}
         timeout-minutes: 30
         run: yarn run test "orchestrator-github-checks" --detectOpenHandles --forceExit --runInBand
@@
       - name: Cleanup after orchestrator-github-checks (local-docker)
-        if: always()
+        if: ${{ always() && inputs.runGithubIntegrationTests == 'true' }}
         run: |
           source /tmp/cleanup-functions.sh
           light_cleanup
```

Also applies to: 1025-1027, 1038-1040
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/orchestrator-integrity.yml around lines 6 - 10, The workflow input runGithubIntegrationTests is declared but the GitHub checks integration job/steps still run unconditionally; wrap the GitHub checks job or the specific steps (references: the input name runGithubIntegrationTests and the GitHub checks job/steps at the later block currently running unconditionally) with a conditional such as if: ${{ inputs.runGithubIntegrationTests == 'true' }} (or the equivalent expression for your workflow_call/workflow_dispatch context) so the suite only runs when the input is explicitly set to 'true'; apply the same guard to the other two occurrences you noted.
66-69: ⚠️ Potential issue | 🟠 Major

Replace `curl | bash` patterns with pinned versions and checksums.

Two instances directly execute remote installer scripts without pinning or integrity checks:
- Line 68: k3d installer from main branch
- Line 1182: rclone installer
These patterns create supply-chain risks and reduce auditability. Pin to tested versions, download separately, verify checksums, and execute locally:
Safer pattern (example)
```diff
-curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+K3D_REF="v5.8.3"  # pin to a tested ref
+curl -fsSL "https://raw.githubusercontent.com/k3d-io/k3d/${K3D_REF}/install.sh" -o /tmp/k3d-install.sh
+bash /tmp/k3d-install.sh

-curl https://rclone.org/install.sh | sudo bash
+RCLONE_VERSION="v1.67.0"  # pin to a tested release
+curl -fsSLO "https://downloads.rclone.org/${RCLONE_VERSION}/rclone-${RCLONE_VERSION}-linux-amd64.zip"
+curl -fsSLO "https://downloads.rclone.org/${RCLONE_VERSION}/SHA256SUMS"
+grep "rclone-${RCLONE_VERSION}-linux-amd64.zip" SHA256SUMS | sha256sum -c -
+unzip -q "rclone-${RCLONE_VERSION}-linux-amd64.zip" -d /tmp
+sudo install "/tmp/rclone-${RCLONE_VERSION}-linux-amd64/rclone" /usr/local/bin/rclone
```
Verify each finding against the current code and only fix it if needed. In @.github/workflows/orchestrator-integrity.yml around lines 66 - 69, The workflow currently pipes remote installers to the shell (the "Install k3d" step running "curl ... | bash" and the rclone installer later); replace these with pinned-release downloads and checksum verification: choose explicit k3d and rclone versions, fetch the release artifact (e.g., wget/curl to a file), fetch the corresponding published checksum or signature, verify the checksum/signature before executing, and then run the local installer with sh; update the step names ("Install k3d" and the rclone install step) to reflect the pinned-version approach and fail the job if checksum verification fails so the pipeline no longer runs unverified remote scripts.
🧹 Nitpick comments (1)
.github/workflows/orchestrator-integrity.yml (1)
140-140: Pin LocalStack image tag instead of `latest`.

Using `localstack/localstack:latest` makes CI non-deterministic and can introduce sudden breakage across all four jobs.

🧩 Suggested pinning approach

```diff
 env:
   AWS_STACK_NAME: game-ci-team-pipelines
+  LOCALSTACK_IMAGE: localstack/localstack:3.7.2
@@
-  localstack/localstack:latest || true
+  $LOCALSTACK_IMAGE || true
```

Apply the same replacement at each LocalStack `docker run` site.

Also applies to: 506-506, 834-834, 1139-1139
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/orchestrator-integrity.yml at line 140, Replace the unpinned LocalStack image reference "localstack/localstack:latest" with a pinned tag or workflow variable and update every docker run that uses it (the occurrences matching the string "localstack/localstack:latest" in this workflow). Add a single source of truth like an env var LOCALSTACK_VERSION (e.g., set LOCALSTACK_VERSION: "0.14.0") at the top of the workflow and change each usage to localstack/localstack:${{ env.LOCALSTACK_VERSION }} (or hardcode a specific version string) so CI is deterministic; update all other matching occurrences noted in the comment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 564d09cb-7987-4823-8cba-548dd9bc7abf
⛔ Files ignored due to path filters (1)
`dist/index.js.map` is excluded by `!**/dist/**`, `!**/*.map`
📒 Files selected for processing (1)
.github/workflows/orchestrator-integrity.yml
```yaml
# aws-provider-tests - Needs LocalStack only (no k3d). 8 tests.
# local-docker-tests - Needs Docker only (some tests also need LocalStack). 10 tests.
```
Header test counts are out of sync with actual jobs.
Lines 36-37 say AWS has 8 tests and local-docker has 10, but this workflow defines 10 for AWS and 9 for local-docker. Keeping these comments accurate will prevent maintenance confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/orchestrator-integrity.yml around lines 36 - 37, Update
the header comment counts for the job groups to match the workflow definition:
change the "aws-provider-tests - Needs LocalStack only (no k3d). 8 tests."
comment to reflect 10 tests for aws-provider-tests and change
"local-docker-tests - Needs Docker only (some tests also need LocalStack). 10
tests." to reflect 9 tests for local-docker-tests (or alternatively adjust the
actual job definitions aws-provider-tests and local-docker-tests to match the
comment); ensure the referenced job names aws-provider-tests and
local-docker-tests in the header are accurate and consistent with the workflow
job list.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #809   +/-   ##
=======================================
  Coverage   31.25%   31.25%
=======================================
  Files          84       84
  Lines        4563     4563
  Branches     1103     1103
=======================================
  Hits         1426     1426
  Misses       3137     3137
```
The monolithic orchestrator-integrity workflow runs 25+ tests sequentially in a single job, consistently hitting the 60-minute timeout on PR runs. Split into 4 parallel jobs (k8s, aws-provider, local-docker, rclone) each on its own runner, cutting wall-clock time from 3+ hours to ~1 hour and eliminating disk space exhaustion from shared runner contention. Adopts the parallel architecture from PR #809. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-tests-mac.yml:
- Line 15: Remove the job-level continue-on-error and implement a step-scoped
retry sequence mirroring Ubuntu/Windows workflows: create three build steps with
ids build-1 (always runs, continue-on-error: true), build-2 (runs only if
steps.build-1.outcome == 'failure', continue-on-error: true), and build-3 (runs
only if steps.build-2.outcome == 'failure' and has no continue-on-error so
failures propagate); add sleep steps after build-1 and build-2 conditioned on
their failure outcomes with delays 60–120s and 240s respectively; ensure step
ids and conditional expressions (steps.build-1.outcome, steps.build-2.outcome)
are used exactly as referenced so the final build-3 signals a real failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fcf43df8-2335-4a6b-9416-d0d989c7dc51
📒 Files selected for processing (1)
.github/workflows/build-tests-mac.yml
```yaml
buildForAllPlatformsMacOS:
  name: ${{ matrix.targetPlatform }} on ${{ matrix.unityVersion }}
  runs-on: macos-latest
  continue-on-error: true
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if the files exist and their structure
echo "=== Checking .github/workflows/build-tests-mac.yml ==="
head -20 .github/workflows/build-tests-mac.yml
echo -e "\n=== Checking build-tests-ubuntu.yml lines 136-195 ==="
sed -n '136,195p' .github/workflows/build-tests-ubuntu.yml
echo -e "\n=== Checking build-tests-windows.yml lines 72-143 ==="
sed -n '72,143p' .github/workflows/build-tests-windows.yml
```

Repository: game-ci/unity-builder
Length of output: 5909
🏁 Script executed:

```shell
# Get the full build-tests-mac.yml to see complete step definitions
cat .github/workflows/build-tests-mac.yml
```

Repository: game-ci/unity-builder
Length of output: 3097
Move continue-on-error to step scope with an explicit retry mechanism.
The job-level continue-on-error: true suppresses all macOS matrix failures at the job level, making the workflow pass even when builds consistently fail. This also means macOS gets zero retry attempts, unlike Ubuntu and Windows workflows which implement a three-attempt retry with escalating delays (60s/120s → 240s) between each try. The correct pattern is step-level continue-on-error on the first two attempts only, with a final third attempt without continue-on-error to preserve the failure signal. This allows transient failures to self-heal while still catching real issues.
Mirror the retry structure from .github/workflows/build-tests-ubuntu.yml:136-195 and .github/workflows/build-tests-windows.yml:72-143:
- Build attempt 1: `id: build-1`, `continue-on-error: true`, always runs
- Sleep step: runs on `steps.build-1.outcome == 'failure'` (60-120 sec delay)
- Build attempt 2: `id: build-2`, `continue-on-error: true`, conditional on build-1 failure
- Sleep step: runs on `steps.build-2.outcome == 'failure'` (240 sec delay)
- Build attempt 3: `id: build-3`, no `continue-on-error`, conditional on build-2 failure (final attempt, lets failure propagate)
Remove the job-level continue-on-error: true and implement the step-based retry pattern instead.
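A minimal sketch of that step-scoped retry pattern; the step names and the `yarn run build` command here are assumptions, not the actual steps from `build-tests-mac.yml`:

```yaml
# Sketch only: substitute the real build step from the workflow.
- name: Build (attempt 1)
  id: build-1
  continue-on-error: true
  run: yarn run build
- name: Wait before retry
  if: steps.build-1.outcome == 'failure'
  run: sleep 120
- name: Build (attempt 2)
  id: build-2
  if: steps.build-1.outcome == 'failure'
  continue-on-error: true
  run: yarn run build
- name: Wait before final attempt
  if: steps.build-2.outcome == 'failure'
  run: sleep 240
- name: Build (attempt 3)
  id: build-3
  if: steps.build-2.outcome == 'failure'
  run: yarn run build  # no continue-on-error, so a real failure propagates
```

Because only the final attempt lacks `continue-on-error`, transient failures self-heal while a consistently broken build still fails the job.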
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/build-tests-mac.yml at line 15, Remove the job-level
continue-on-error and implement a step-scoped retry sequence mirroring
Ubuntu/Windows workflows: create three build steps with ids build-1 (always
runs, continue-on-error: true), build-2 (runs only if steps.build-1.outcome ==
'failure', continue-on-error: true), and build-3 (runs only if
steps.build-2.outcome == 'failure' and has no continue-on-error so failures
propagate); add sleep steps after build-1 and build-2 conditioned on their
failure outcomes with delays 60–120s and 240s respectively; ensure step ids and
conditional expressions (steps.build-1.outcome, steps.build-2.outcome) are used
exactly as referenced so the final build-3 signals a real failure.

Summary
Rewrites the monolithic `orchestrator-integrity.yml` workflow into 4 parallel jobs, each running on its own GitHub Actions runner with an isolated 14GB disk. Wall-clock time drops from 3+ hours to ~1 hour, and the chronic disk space exhaustion that caused flaky failures is eliminated.

This is infrastructure work that benefits the entire LTS release — every open PR triggers this workflow, so making it fast and reliable unblocks all feature development.
Problem
The existing workflow runs 24 integration tests sequentially in a single job. This creates two compounding problems:
Time: With setup, infrastructure provisioning, test execution, and cleanup for each test, the job takes 3+ hours. Every push to any open PR waits for this full cycle, creating a bottleneck across all LTS development.
Disk space exhaustion: The single runner's 14GB disk must simultaneously support:
When k3d and LocalStack compete for the same disk, the runner hits capacity mid-run — causing non-deterministic failures that are difficult to diagnose and impossible to fix without architectural changes.
Solution
Split the monolith into 4 parallel jobs based on infrastructure requirements. Each job runs on its own fresh runner, so disk-hungry consumers (k3d, LocalStack) never compete for the same space.
Job Architecture
`k8s-tests` · `aws-provider-tests` · `local-docker-tests` · `rclone-tests`

Workflow Topology
Within each job, tests still run sequentially — they share Docker state within a provider strategy. The parallelism is between provider strategies, not between individual tests.
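The job split can be sketched as a workflow topology; the runner label and the empty step bodies below are placeholders, not the real job definitions:

```yaml
# Sketch: four independent jobs with no `needs:` edges between them,
# so they start in parallel on separate runners with separate disks.
jobs:
  k8s-tests:
    runs-on: ubuntu-latest   # k3d + LocalStack, 5 tests
    steps: []                # test steps elided
  aws-provider-tests:
    runs-on: ubuntu-latest   # LocalStack only, 10 tests
    steps: []
  local-docker-tests:
    runs-on: ubuntu-latest   # Docker + LocalStack for S3 tests, 9 tests
    steps: []
  rclone-tests:
    runs-on: ubuntu-latest   # rclone + LocalStack, 1 test
    steps: []
```

The absence of `needs:` between the four jobs is what makes them parallel; sequencing only exists inside each job.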
Before / After
Cleanup Function Pattern
The previous workflow had 15 near-identical 30-line cleanup blocks scattered throughout. These are now replaced with reusable shell functions sourced from a temporary script:
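The helper script itself is not reproduced here; a minimal sketch of what such sourced functions might look like follows. The `light_cleanup`/`heavy_cleanup` names and the light/heavy split match the PR description, but the exact commands inside each function are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of /tmp/cleanup-functions.sh.
cat > /tmp/cleanup-functions.sh <<'EOF'
light_cleanup() {
  # Between tests: drop build cache and stopped containers, but keep
  # pulled base images so they are not re-downloaded mid-job.
  command -v docker >/dev/null 2>&1 || return 0
  yarn cache clean >/dev/null 2>&1 || true
  docker system prune -f >/dev/null 2>&1 || true
}

heavy_cleanup() {
  # At job boundaries: also remove unused images and volumes.
  command -v docker >/dev/null 2>&1 || return 0
  docker system prune -af --volumes >/dev/null 2>&1 || true
}
EOF

# A test step then sources the helpers and calls the right tier:
source /tmp/cleanup-functions.sh
echo "cleanup helpers loaded"
```

Sourcing one file instead of pasting a 30-line block per test keeps all fifteen call sites to a single line each.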
Cleanup frequency is also reduced: light cleanup (`docker system prune -f`) runs between tests within a job, while heavy cleanup (`prune -af --volumes`) only runs at job start and end. This avoids unnecessarily re-pulling base images mid-job.

Disk Space Analysis
Why k3d eats disk
k3d creates a Kubernetes cluster using Docker containers. Each k3d node runs containerd internally, and when the orchestrator tests pull Unity Docker images into the cluster, those images are stored in containerd's content store inside the k3d node container — not in Docker's image cache. This means:

- `docker system prune` does not reclaim this space (it is inside the container)
- Only removing images from inside the node (`k3d image rm`) or destroying the cluster frees the space

Why LocalStack eats disk
LocalStack simulates AWS services locally. Each test creates CloudFormation stacks, S3 buckets with objects, ECS task definitions, and internal state. Even with cleanup between tests:
The fix
By giving each job its own runner, k3d (in `k8s-tests`) and LocalStack (in `aws-provider-tests` and others) each get a full 14GB disk instead of splitting one. The k3d node image cleanup now only runs in the `k8s-tests` job where it actually matters.

What Did NOT Change
- `workflow_call` interface — inputs, outputs, and permissions are identical. `integrity-check.yml` requires zero modifications.
- `UNITY_LICENSE`, `AWS_*`, and `GH_TOKEN` usage.
- Test commands (`npm run cli -- ...`) are unchanged.
This workflow tests itself — pushing to the PR branch triggers `integrity-check.yml`, which calls the modified `orchestrator-integrity.yml`. Verification:

- `integrity-check.yml` calls `orchestrator-integrity.yml` without modifications
- All four jobs run (`aws-provider-tests`, etc.)
- Disk usage in `k8s-tests` stays within 14GB (previously the bottleneck)
Benefits all open LTS PRs — every feature branch triggers this workflow on push. Faster, more reliable CI unblocks:
Generated with Claude Code
Tracking: