
ci: split orchestrator integrity into parallel jobs for faster validation#809

Draft
frostebite wants to merge 2 commits into main from ci/orchestrator-integrity-speedup

Conversation


@frostebite frostebite commented Mar 5, 2026

Summary

Rewrites the monolithic orchestrator-integrity.yml workflow into 4 parallel jobs, each running on its own GitHub Actions runner with an isolated 14GB disk. Wall-clock time drops from 3+ hours to ~1 hour, and the chronic disk space exhaustion that caused flaky failures is eliminated.

This is infrastructure work that benefits the entire LTS release — every open PR triggers this workflow, so making it fast and reliable unblocks all feature development.

Problem

The existing workflow runs 24 integration tests sequentially in a single job. This creates two compounding problems:

Time: With setup, infrastructure provisioning, test execution, and cleanup for each test, the job takes 3+ hours. Every push to any open PR waits for this full cycle, creating a bottleneck across all LTS development.

Disk space exhaustion: The single runner's 14GB disk must simultaneously support:

| Consumer | Disk impact |
| --- | --- |
| k3d (Kubernetes simulation) | Pulls ~3.9GB of Unity Docker images into containerd's image store. Node images accumulate across test runs. |
| LocalStack (AWS simulation) | Accumulates CloudFormation stacks, S3 objects, ECS task definitions, and internal state. Each test creates resources that are not fully reclaimed. |
| Docker build cache | Unity base images, test project layers, and intermediate build artifacts. |
| Test artifacts | Build outputs, log files, and cached assets from each of the 24 tests. |
When k3d and LocalStack compete for the same disk, the runner hits capacity mid-run — causing non-deterministic failures that are difficult to diagnose and impossible to fix without architectural changes.

Solution

Split the monolith into 4 parallel jobs based on infrastructure requirements. Each job runs on its own fresh runner, so disk-hungry consumers (k3d, LocalStack) never compete for the same space.

Job Architecture

| Job | Infrastructure | Tests | Est. time |
| --- | --- | --- | --- |
| k8s-tests | k3d cluster + LocalStack | 5 — image, kubernetes, s3-steps, e2e-caching, e2e-retaining | ~60 min |
| aws-provider-tests | LocalStack only | 10 — image, environment, s3-steps, hooks, e2e-caching, e2e-retaining, caching, locking-core, locking-get-locked, e2e-locking | ~60 min |
| local-docker-tests | Docker + LocalStack (S3 tests) | 9 — image, hooks, local-persistence, locking-core, locking-get-locked, caching, github-checks, s3-steps, e2e-caching | ~45 min |
| rclone-tests | rclone + LocalStack | 1 — rclone-steps | ~10 min |

Workflow Topology

```
integrity-check.yml (entry point)
  └── orchestrator-integrity.yml (workflow_call)
        ├── k8s-tests          ─┐
        ├── aws-provider-tests  ├── parallel (4 runners)
        ├── local-docker-tests  │
        └── rclone-tests       ─┘
```

Within each job, tests still run sequentially — they share Docker state within a provider strategy. The parallelism is between provider strategies, not between individual tests.
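As a workflow fragment, the topology above could be sketched roughly like this — job names match the table, but the step bodies are placeholders, not the real workflow contents:

```yaml
# Sketch only: the actual orchestrator-integrity.yml carries full setup,
# test, and cleanup steps per job. Jobs with no `needs:` edges between
# them run in parallel on separate runners.
name: Orchestrator Integrity
on:
  workflow_call:

jobs:
  k8s-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "k3d cluster + LocalStack, 5 tests"
  aws-provider-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "LocalStack only, 10 tests"
  local-docker-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Docker + LocalStack, 9 tests"
  rclone-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "rclone + LocalStack, 1 test"
```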

Before / After

| Metric | Before | After |
| --- | --- | --- |
| Jobs | 1 monolith | 4 parallel |
| Wall-clock time | ~3 hours | ~1 hour |
| Disk per job | 14GB shared across 24 tests | 14GB dedicated per job |
| k3d + LocalStack | Competing on same disk | Isolated to separate runners |
| Cleanup blocks | 15 copy-pasted 30-line blocks | Reusable shell functions |
| Cleanup strategy | Heavy prune after every test | Light prune between tests, heavy at job boundaries |
| File | 1109 lines | 1232 lines (+123, but cleaner structure) |

Cleanup Function Pattern

The previous workflow had 15 near-identical 30-line cleanup blocks scattered throughout. These are now replaced with reusable shell functions sourced from a temporary script:

```sh
# Each job creates /tmp/cleanup-functions.sh at startup, then:
source /tmp/cleanup-functions.sh

light_cleanup       # Remove cache dirs + docker system prune -f
full_k8s_cleanup    # K8s resources + k3d node images + light cleanup
```

Cleanup frequency is also reduced: light cleanup (docker system prune -f) runs between tests within a job, while heavy cleanup (prune -af --volumes) only runs at job start and end. This avoids unnecessarily re-pulling base images mid-job.
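For reference, a minimal sketch of what the sourced file might contain — the function bodies here are illustrative guesses based on the description above, not the actual workflow code:

```shell
# Hypothetical sketch of /tmp/cleanup-functions.sh; cache paths and
# prune flags are assumptions, not the real implementation.
cat > /tmp/cleanup-functions.sh <<'EOF'
light_cleanup() {
  # Drop per-test cache dirs, then a light prune that keeps tagged images.
  rm -rf /tmp/test-cache-* 2>/dev/null || true
  docker system prune -f >/dev/null 2>&1 || true
}

full_k8s_cleanup() {
  # Tear down k3d clusters (reclaims containerd image space inside the
  # node containers), then fall back to the light cleanup.
  k3d cluster delete --all >/dev/null 2>&1 || true
  light_cleanup
}
EOF

source /tmp/cleanup-functions.sh
```

Defining the helpers once and sourcing them in each job keeps the 15 former copy-paste sites down to one-line calls.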

Disk Space Analysis

Why k3d eats disk

k3d creates a Kubernetes cluster using Docker containers. Each k3d node runs containerd internally, and when the orchestrator tests pull Unity Docker images into the cluster, those images are stored in containerd's content store inside the k3d node container — not in Docker's image cache. This means:

  • Each Unity image pull consumes ~3.9GB inside the k3d node
  • docker system prune does not reclaim this space (it is inside the container)
  • Only deleting k3d node images (k3d image rm) or destroying the cluster frees the space
  • Previously, k3d cleanup ran in every test — even AWS and Docker tests where no k3d cluster existed
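A hedged sketch of what reclaiming that space can look like — the cluster/node naming convention and the use of crictl inside the node are assumptions; deleting the cluster is the only option guaranteed to free the space:

```shell
# Sketch: free containerd image space held inside a k3d node container.
# "integrity" and the k3d-<cluster>-server-0 naming are hypothetical.
reclaim_k3d_node_space() {
  cluster="${1:-integrity}"
  node="k3d-${cluster}-server-0"
  # Try pruning unused images via crictl inside the node's containerd...
  docker exec "$node" crictl rmi --prune >/dev/null 2>&1 || true
  # ...and fall back to destroying the cluster, which always reclaims disk.
  k3d cluster delete "$cluster" >/dev/null 2>&1 || true
}
```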

Why LocalStack eats disk

LocalStack simulates AWS services locally. Each test creates CloudFormation stacks, S3 buckets with objects, ECS task definitions, and internal state. Even with cleanup between tests:

  • CloudFormation stack state accumulates in LocalStack's internal storage
  • S3 objects written during tests are not always fully garbage-collected
  • ECS task definitions and container images create persistent layer data
  • LocalStack's own logs and temporary files grow over the run
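One blunt but effective way to discard that accumulated state at a job boundary is to recreate the container outright — container name and port here are assumptions for illustration:

```shell
# Sketch: reset LocalStack between test phases. Removing the container
# together with its anonymous volumes (-v) discards accumulated
# CloudFormation/S3/ECS state and logs.
reset_localstack() {
  name="${1:-localstack}"
  docker rm -f -v "$name" >/dev/null 2>&1 || true
  docker run -d --name "$name" -p 4566:4566 \
    localstack/localstack >/dev/null 2>&1 || true
}
```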

The fix

By giving each job its own runner, k3d (in k8s-tests) and LocalStack (in aws-provider-tests and others) each get a full 14GB disk instead of splitting one. The k3d node image cleanup now only runs in the k8s-tests job where it actually matters.

What Did NOT Change

  • workflow_call interface — Inputs, outputs, and permissions are identical. integrity-check.yml requires zero modifications.
  • Test coverage — All 24 tests are preserved across the 4 jobs. No tests were added, removed, or modified.
  • Environment variables — Same env vars, same secrets, same configuration.
  • Secrets usage — No new secrets required. Same UNITY_LICENSE, AWS_*, and GH_TOKEN usage.
  • Test execution logic — The actual test commands (npm run cli -- ...) are unchanged.

Testing

This workflow tests itself — pushing to the PR branch triggers integrity-check.yml, which calls the modified orchestrator-integrity.yml. Verification:

  • integrity-check.yml calls orchestrator-integrity.yml without modifications
  • All 4 jobs appear in the GitHub Actions UI and run in parallel
  • Each job provisions only the infrastructure it needs (no k3d in aws-provider-tests, etc.)
  • Cleanup functions source correctly in each job
  • Disk usage in k8s-tests stays within 14GB (previously the bottleneck)
  • Total wall-clock time is under 90 minutes
  • All 24 tests pass across the 4 jobs

Cross-References

Benefits all open LTS PRs — every feature branch triggers this workflow on push. Faster, more reliable CI unblocks:

| PR | Feature |
| --- | --- |
| #777 | Enterprise features — CLI providers, caching, LFS, hooks |
| #778 | GCP Cloud Run + Azure ACI providers |
| #783 | Provider load balancing |
| #784 | Unit test coverage |
| #786 | Secure git authentication |
| #787 | Premade secret sources |
| #790 | Test workflow engine |
| #791 | Hot runner protocol |
| #798 | Generic artifact system |
| #799 | Incremental sync protocol |
| #804 | Community plugin validation |
| #806 | CI dispatch providers |
| #808 | Build reliability features |

Generated with Claude Code


Tracking:

Summary by CodeRabbit

  • Chores
    • Restructured CI/CD pipeline to execute tests in parallel across multiple infrastructure providers with isolated environments
    • Established standardized cleanup procedures and health verification across all test phases
    • Enhanced macOS build robustness to prevent pipeline interruptions

…tion

Rewrite the monolith orchestrator-integrity.yml (1110 lines, single job,
3+ hour sequential execution) into 4 parallel jobs that run on separate
runners:

- k8s-tests: k3d cluster + LocalStack, 5 tests
- aws-provider-tests: LocalStack only, 10 tests
- local-docker-tests: Docker + LocalStack for S3 tests, 9 tests
- rclone-tests: rclone + LocalStack, 1 test

Key improvements:
- Wall-clock time drops from ~3h to ~1h (longest single job)
- Disk exhaustion eliminated: each job gets its own fresh 14GB runner
- Cleanup logic deduplicated via sourced shell functions instead of
  15 copy-pasted 30-line blocks
- K3d node image cleanup only runs in the k8s job (where it matters)
- Light cleanup (cache + docker prune -f) between tests; heavy cleanup
  (prune -af --volumes) only at job boundaries
- workflow_call interface unchanged; integrity-check.yml needs no changes

Ref: #794

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Refactors CI into parallel provider-specific jobs, centralizes reusable cleanup scripts, standardizes LocalStack and k3d lifecycle management, expands test matrices (k8s, AWS/LocalStack, local-docker, rclone), and adds explicit per-stage initialization, health checks, and teardown steps for consistent isolation and resource reclamation. (50 words)

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Orchestrator integrity workflow `.github/workflows/orchestrator-integrity.yml` | Splits a monolithic CI job into parallel jobs (k8s-tests, aws-provider-tests, local-docker-tests, rclone-tests); introduces reusable cleanup functions (/tmp/cleanup-functions.sh), composite cleanup routines, structured setup/teardown (LocalStack lifecycle, S3 provisioning, k3d cluster creation, connectivity checks), per-test cleanup, expanded test matrix and retry/error-handling logic. |
| macOS build job behavior `.github/workflows/build-tests-mac.yml` | Sets continue-on-error: true for the macOS build job (buildForAllPlatformsMacOS), allowing that job to fail without failing the entire workflow. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Runner as CI Runner
  participant LocalStack as LocalStack
  participant K3d as k3d Cluster
  participant Tests as Test Suites
  participant Storage as S3 / rclone

  Runner->>LocalStack: start LocalStack container(s)
  activate LocalStack
  LocalStack-->>Runner: health OK
  Runner->>Storage: create S3 buckets / configure AWS CLI
  Runner->>K3d: create k3d cluster(s)
  activate K3d
  K3d-->>Runner: cluster ready
  Runner->>Tests: run provider-specific test groups (k8s, aws, local-docker, rclone)
  Tests-->>Storage: exercise S3 / rclone flows
  Tests-->>K3d: deploy/validate k8s resources
  Tests-->>Runner: report results
  Runner->>Tests: per-test cleanup
  Runner->>K3d: cleanup clusters, PVCs, Secrets
  Runner->>LocalStack: stop & remove containers, volumes
  deactivate K3d
  deactivate LocalStack
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through jobs both near and wide,
Buckets made and clusters tied,
Cleanup carrots, tidy trail,
Parallel hops that never fail,
CI carrots—freshly supplied! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main architectural change: splitting a monolithic workflow into parallel jobs to improve validation speed. |
| Description check | ✅ Passed | The description comprehensively covers all template sections: detailed problem statement, solution architecture, before/after metrics, cleanup strategy, disk space analysis, testing plan, and cross-references. All required sections are well-populated. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/orchestrator-integrity.yml (2)

6-10: ⚠️ Potential issue | 🟠 Major

Honor runGithubIntegrationTests input for the GitHub checks suite.

runGithubIntegrationTests is declared (Line 6-Line 10) but the GitHub checks test runs unconditionally (Line 1025+), which changes expected behavior and runtime when callers leave the default 'false'.

💡 Suggested guard
```diff
       - name: Run orchestrator-github-checks test (local-docker)
+        if: ${{ inputs.runGithubIntegrationTests == 'true' }}
         timeout-minutes: 30
         run: yarn run test "orchestrator-github-checks" --detectOpenHandles --forceExit --runInBand
@@
       - name: Cleanup after orchestrator-github-checks (local-docker)
-        if: always()
+        if: ${{ always() && inputs.runGithubIntegrationTests == 'true' }}
         run: |
           source /tmp/cleanup-functions.sh
           light_cleanup
```

Also applies to: 1025-1027, 1038-1040

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/orchestrator-integrity.yml around lines 6 - 10, The
workflow input runGithubIntegrationTests is declared but the GitHub checks
integration job/steps still run unconditionally; wrap the GitHub checks job or
the specific steps (references: the input name runGithubIntegrationTests and the
GitHub checks job/steps at the later block currently running unconditionally)
with a conditional such as if: ${{ inputs.runGithubIntegrationTests == 'true' }}
(or the equivalent expression for your workflow_call/workflow_dispatch context)
so the suite only runs when the input is explicitly set to 'true'; apply the
same guard to the other two occurrences you noted.

66-69: ⚠️ Potential issue | 🟠 Major

Replace curl | bash patterns with pinned versions and checksums.

Two instances directly execute remote installer scripts without pinning or integrity checks:

  • Line 68: k3d installer from main branch
  • Line 1182: rclone installer

These patterns create supply-chain risks and reduce auditability. Pin to tested versions, download separately, verify checksums, and execute locally:

Safer pattern (example)
```diff
-          curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+          K3D_REF="v5.8.3" # pin to a tested ref
+          curl -fsSL "https://raw.githubusercontent.com/k3d-io/k3d/${K3D_REF}/install.sh" -o /tmp/k3d-install.sh
+          bash /tmp/k3d-install.sh

-          curl https://rclone.org/install.sh | sudo bash
+          RCLONE_VERSION="v1.67.0" # pin to a tested release
+          curl -fsSLO "https://downloads.rclone.org/${RCLONE_VERSION}/rclone-${RCLONE_VERSION}-linux-amd64.zip"
+          curl -fsSLO "https://downloads.rclone.org/${RCLONE_VERSION}/SHA256SUMS"
+          grep "rclone-${RCLONE_VERSION}-linux-amd64.zip" SHA256SUMS | sha256sum -c -
+          unzip -q "rclone-${RCLONE_VERSION}-linux-amd64.zip" -d /tmp
+          sudo install "/tmp/rclone-${RCLONE_VERSION}-linux-amd64/rclone" /usr/local/bin/rclone
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/orchestrator-integrity.yml around lines 66 - 69, The
workflow currently pipes remote installers to the shell (the "Install k3d" step
running "curl ... | bash" and the rclone installer later); replace these with
pinned-release downloads and checksum verification: choose explicit k3d and
rclone versions, fetch the release artifact (e.g., wget/curl to a file), fetch
the corresponding published checksum or signature, verify the checksum/signature
before executing, and then run the local installer with sh; update the step
names ("Install k3d" and the rclone install step) to reflect the pinned-version
approach and fail the job if checksum verification fails so the pipeline no
longer runs unverified remote scripts.
🧹 Nitpick comments (1)
.github/workflows/orchestrator-integrity.yml (1)

140-140: Pin LocalStack image tag instead of latest.

Using localstack/localstack:latest makes CI non-deterministic and can introduce sudden breakage across all four jobs.

🧩 Suggested pinning approach
```diff
 env:
   AWS_STACK_NAME: game-ci-team-pipelines
+  LOCALSTACK_IMAGE: localstack/localstack:3.7.2
@@
-            localstack/localstack:latest || true
+            $LOCALSTACK_IMAGE || true
```

Apply the same replacement at each LocalStack docker run site.

Also applies to: 506-506, 834-834, 1139-1139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/orchestrator-integrity.yml at line 140, Replace the
unpinned LocalStack image reference "localstack/localstack:latest" with a pinned
tag or workflow variable and update every docker run that uses it (the
occurrences matching the string "localstack/localstack:latest" in this
workflow). Add a single source of truth like an env var LOCALSTACK_VERSION
(e.g., set LOCALSTACK_VERSION: "0.14.0") at the top of the workflow and change
each usage to localstack/localstack:${{ env.LOCALSTACK_VERSION }} (or hardcode a
specific version string) so CI is deterministic; update all other matching
occurrences noted in the comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 564d09cb-7987-4823-8cba-548dd9bc7abf

📥 Commits

Reviewing files that changed from the base of the PR and between 9d47543 and 9789eb5.

⛔ Files ignored due to path filters (1)
  • dist/index.js.map is excluded by !**/dist/**, !**/*.map
📒 Files selected for processing (1)
  • .github/workflows/orchestrator-integrity.yml

Comment on lines +36 to +37

```yaml
# aws-provider-tests - Needs LocalStack only (no k3d). 8 tests.
# local-docker-tests - Needs Docker only (some tests also need LocalStack). 10 tests.
```


⚠️ Potential issue | 🟡 Minor

Header test counts are out of sync with actual jobs.

Line 36-Line 37 says AWS has 8 tests and local-docker has 10, but this workflow defines AWS 10 and local-docker 9. Keeping these comments accurate will prevent maintenance confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/orchestrator-integrity.yml around lines 36 - 37, Update
the header comment counts for the job groups to match the workflow definition:
change the "aws-provider-tests - Needs LocalStack only (no k3d). 8 tests."
comment to reflect 10 tests for aws-provider-tests and change
"local-docker-tests - Needs Docker only (some tests also need LocalStack). 10
tests." to reflect 9 tests for local-docker-tests (or alternatively adjust the
actual job definitions aws-provider-tests and local-docker-tests to match the
comment); ensure the referenced job names aws-provider-tests and
local-docker-tests in the header are accurate and consistent with the workflow
job list.

@frostebite frostebite added ci CI/CD pipeline and workflow improvements orchestrator Orchestrator module enhancement New feature or request LTS 2.0 Orchestrator LTS v2.0 milestone labels Mar 5, 2026
@github-actions

github-actions bot commented Mar 5, 2026

Cat Gif

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 31.25%. Comparing base (9d47543) to head (d21188e).

Additional details and impacted files

Impacted file tree graph

```
@@           Coverage Diff           @@
##             main     #809   +/-   ##
=======================================
  Coverage   31.25%   31.25%
=======================================
  Files          84       84
  Lines        4563     4563
  Branches     1103     1103
=======================================
  Hits         1426     1426
  Misses       3137     3137
```

frostebite added a commit that referenced this pull request Mar 5, 2026
The monolithic orchestrator-integrity workflow runs 25+ tests sequentially
in a single job, consistently hitting the 60-minute timeout on PR runs.
Split into 4 parallel jobs (k8s, aws-provider, local-docker, rclone) each
on its own runner, cutting wall-clock time from 3+ hours to ~1 hour and
eliminating disk space exhaustion from shared runner contention.

Adopts the parallel architecture from PR #809.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@frostebite frostebite mentioned this pull request Mar 5, 2026
15 tasks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-tests-mac.yml:
- Line 15: Remove the job-level continue-on-error and implement a step-scoped
retry sequence mirroring Ubuntu/Windows workflows: create three build steps with
ids build-1 (always runs, continue-on-error: true), build-2 (runs only if
steps.build-1.outcome == 'failure', continue-on-error: true), and build-3 (runs
only if steps.build-2.outcome == 'failure' and has no continue-on-error so
failures propagate); add sleep steps after build-1 and build-2 conditioned on
their failure outcomes with delays 60–120s and 240s respectively; ensure step
ids and conditional expressions (steps.build-1.outcome, steps.build-2.outcome)
are used exactly as referenced so the final build-3 signals a real failure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fcf43df8-2335-4a6b-9416-d0d989c7dc51

📥 Commits

Reviewing files that changed from the base of the PR and between 9789eb5 and d21188e.

📒 Files selected for processing (1)
  • .github/workflows/build-tests-mac.yml

```yaml
buildForAllPlatformsMacOS:
  name: ${{ matrix.targetPlatform }} on ${{ matrix.unityVersion }}
  runs-on: macos-latest
  continue-on-error: true
```


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
# First, check if the files exist and their structure
echo "=== Checking .github/workflows/build-tests-mac.yml ==="
head -20 .github/workflows/build-tests-mac.yml

echo -e "\n=== Checking build-tests-ubuntu.yml lines 136-195 ==="
sed -n '136,195p' .github/workflows/build-tests-ubuntu.yml

echo -e "\n=== Checking build-tests-windows.yml lines 72-143 ==="
sed -n '72,143p' .github/workflows/build-tests-windows.yml
```

Repository: game-ci/unity-builder

Length of output: 5909


🏁 Script executed:

```shell
# Get the full build-tests-mac.yml to see complete step definitions
cat .github/workflows/build-tests-mac.yml
```

Repository: game-ci/unity-builder

Length of output: 3097


Move continue-on-error to step scope with an explicit retry mechanism.

The job-level continue-on-error: true suppresses all macOS matrix failures at the job level, making the workflow pass even when builds consistently fail. This also means macOS gets zero retry attempts, unlike Ubuntu and Windows workflows which implement a three-attempt retry with escalating delays (60s/120s → 240s) between each try. The correct pattern is step-level continue-on-error on the first two attempts only, with a final third attempt without continue-on-error to preserve the failure signal. This allows transient failures to self-heal while still catching real issues.

Mirror the retry structure from .github/workflows/build-tests-ubuntu.yml:136-195 and .github/workflows/build-tests-windows.yml:72-143:

  • Build attempt 1: id: build-1, continue-on-error: true, always runs
  • Sleep step: runs on steps.build-1.outcome == 'failure' (60–120 sec delay)
  • Build attempt 2: id: build-2, continue-on-error: true, conditional on build-1 failure
  • Sleep step: runs on steps.build-2.outcome == 'failure' (240 sec delay)
  • Build attempt 3: id: build-3, no continue-on-error, conditional on build-2 failure (final attempt, lets failure propagate)

Remove the job-level continue-on-error: true and implement the step-based retry pattern instead.
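Sketched as workflow steps, the three-attempt pattern described above could look roughly like this — step names, sleep durations, and the `yarn run build` command are illustrative placeholders, not the actual Ubuntu/Windows step definitions:

```yaml
- name: Build (attempt 1)
  id: build-1
  continue-on-error: true
  run: yarn run build
- name: Wait before retry
  if: steps.build-1.outcome == 'failure'
  run: sleep 120
- name: Build (attempt 2)
  id: build-2
  if: steps.build-1.outcome == 'failure'
  continue-on-error: true
  run: yarn run build
- name: Wait before final retry
  if: steps.build-2.outcome == 'failure'
  run: sleep 240
- name: Build (attempt 3)   # no continue-on-error: a third failure fails the job
  if: steps.build-2.outcome == 'failure'
  run: yarn run build
```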


