ci: split orchestrator integrity into parallel jobs for faster validation#809
frostebite wants to merge 2 commits into main
Conversation
Rewrite the monolith orchestrator-integrity.yml (1110 lines, single job, 3+ hour sequential execution) into 4 parallel jobs that run on separate runners:

- k8s-tests: k3d cluster + LocalStack, 5 tests
- aws-provider-tests: LocalStack only, 10 tests
- local-docker-tests: Docker + LocalStack for S3 tests, 9 tests
- rclone-tests: rclone + LocalStack, 1 test

Key improvements:

- Wall-clock time drops from ~3h to ~1h (longest single job)
- Disk exhaustion eliminated: each job gets its own fresh 14GB runner
- Cleanup logic deduplicated via sourced shell functions instead of 15 copy-pasted 30-line blocks
- K3d node image cleanup only runs in the k8s job (where it matters)
- Light cleanup (cache + docker prune -f) between tests; heavy cleanup (prune -af --volumes) only at job boundaries
- workflow_call interface unchanged; integrity-check.yml needs no changes

Ref: #794

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

Refactors CI into parallel provider-specific jobs, centralizes reusable cleanup scripts, standardizes LocalStack and k3d lifecycle management, expands test matrices (k8s, AWS/LocalStack, local-docker, rclone), and adds explicit per-stage initialization, health checks, and teardown steps for consistent isolation and resource reclamation.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Runner as CI Runner
participant LocalStack as LocalStack
participant K3d as k3d Cluster
participant Tests as Test Suites
participant Storage as S3 / rclone
Runner->>LocalStack: start LocalStack container(s)
activate LocalStack
LocalStack-->>Runner: health OK
Runner->>Storage: create S3 buckets / configure AWS CLI
Runner->>K3d: create k3d cluster(s)
activate K3d
K3d-->>Runner: cluster ready
Runner->>Tests: run provider-specific test groups (k8s, aws, local-docker, rclone)
Tests-->>Storage: exercise S3 / rclone flows
Tests-->>K3d: deploy/validate k8s resources
Tests-->>Runner: report results
Runner->>Tests: per-test cleanup
Runner->>K3d: cleanup clusters, PVCs, Secrets
Runner->>LocalStack: stop & remove containers, volumes
deactivate K3d
deactivate LocalStack
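The LocalStack legs of this lifecycle can be sketched in shell. The image tag, port, and bucket name below are assumptions rather than values taken from the workflow, and the docker commands are guarded behind an opt-in flag so the sketch is safe to run anywhere:

```shell
#!/usr/bin/env bash
# Sketch of the LocalStack start / health-check / teardown flow from the
# diagram above. Image tag, bucket name, and port are assumptions.
# Set RUN_LOCALSTACK_DEMO=1 to execute the docker commands for real.
set -euo pipefail

start_localstack() {
  docker run -d --name localstack -p 4566:4566 localstack/localstack:3.7.2
  # Poll the health endpoint until LocalStack reports ready
  until curl -fsS http://localhost:4566/_localstack/health >/dev/null; do
    sleep 2
  done
  aws --endpoint-url http://localhost:4566 s3 mb s3://game-ci-test-bucket
}

teardown_localstack() {
  # Stop and remove the container plus its volumes, mirroring the teardown leg
  docker rm -f -v localstack
}

if [ "${RUN_LOCALSTACK_DEMO:-0}" = "1" ]; then
  start_localstack
  teardown_localstack
else
  echo "dry run: set RUN_LOCALSTACK_DEMO=1 to start LocalStack"
fi
```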
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/orchestrator-integrity.yml (2)
6-10: ⚠️ Potential issue | 🟠 Major

Honor `runGithubIntegrationTests` input for the GitHub checks suite.

`runGithubIntegrationTests` is declared (lines 6-10) but the GitHub checks test runs unconditionally (line 1025+), which changes expected behavior and runtime when callers leave the default `'false'`.

💡 Suggested guard
```diff
       - name: Run orchestrator-github-checks test (local-docker)
+        if: ${{ inputs.runGithubIntegrationTests == 'true' }}
         timeout-minutes: 30
         run: yarn run test "orchestrator-github-checks" --detectOpenHandles --forceExit --runInBand
@@
       - name: Cleanup after orchestrator-github-checks (local-docker)
-        if: always()
+        if: ${{ always() && inputs.runGithubIntegrationTests == 'true' }}
         run: |
           source /tmp/cleanup-functions.sh
           light_cleanup
```

Also applies to: 1025-1027, 1038-1040
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/orchestrator-integrity.yml around lines 6 - 10, The workflow input runGithubIntegrationTests is declared but the GitHub checks integration job/steps still run unconditionally; wrap the GitHub checks job or the specific steps (references: the input name runGithubIntegrationTests and the GitHub checks job/steps at the later block currently running unconditionally) with a conditional such as if: ${{ inputs.runGithubIntegrationTests == 'true' }} (or the equivalent expression for your workflow_call/workflow_dispatch context) so the suite only runs when the input is explicitly set to 'true'; apply the same guard to the other two occurrences you noted.
66-69: ⚠️ Potential issue | 🟠 Major

Replace `curl | bash` patterns with pinned versions and checksums.

Two instances directly execute remote installer scripts without pinning or integrity checks:
- Line 68: k3d installer from main branch
- Line 1182: rclone installer
These patterns create supply-chain risks and reduce auditability. Pin to tested versions, download separately, verify checksums, and execute locally:
Safer pattern (example)
```diff
-curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
+K3D_REF="v5.8.3"  # pin to a tested ref
+curl -fsSL "https://raw.githubusercontent.com/k3d-io/k3d/${K3D_REF}/install.sh" -o /tmp/k3d-install.sh
+bash /tmp/k3d-install.sh

-curl https://rclone.org/install.sh | sudo bash
+RCLONE_VERSION="v1.67.0"  # pin to a tested release
+curl -fsSLO "https://downloads.rclone.org/${RCLONE_VERSION}/rclone-${RCLONE_VERSION}-linux-amd64.zip"
+curl -fsSLO "https://downloads.rclone.org/${RCLONE_VERSION}/SHA256SUMS"
+grep "rclone-${RCLONE_VERSION}-linux-amd64.zip" SHA256SUMS | sha256sum -c -
+unzip -q "rclone-${RCLONE_VERSION}-linux-amd64.zip" -d /tmp
+sudo install "/tmp/rclone-${RCLONE_VERSION}-linux-amd64/rclone" /usr/local/bin/rclone
```
Verify each finding against the current code and only fix it if needed. In @.github/workflows/orchestrator-integrity.yml around lines 66 - 69, The workflow currently pipes remote installers to the shell (the "Install k3d" step running "curl ... | bash" and the rclone installer later); replace these with pinned-release downloads and checksum verification: choose explicit k3d and rclone versions, fetch the release artifact (e.g., wget/curl to a file), fetch the corresponding published checksum or signature, verify the checksum/signature before executing, and then run the local installer with sh; update the step names ("Install k3d" and the rclone install step) to reflect the pinned-version approach and fail the job if checksum verification fails so the pipeline no longer runs unverified remote scripts.
🧹 Nitpick comments (1)
.github/workflows/orchestrator-integrity.yml (1)
140-140: Pin LocalStack image tag instead of `latest`.

Using `localstack/localstack:latest` makes CI non-deterministic and can introduce sudden breakage across all four jobs.

🧩 Suggested pinning approach

```diff
 env:
   AWS_STACK_NAME: game-ci-team-pipelines
+  LOCALSTACK_IMAGE: localstack/localstack:3.7.2
@@
-  localstack/localstack:latest || true
+  $LOCALSTACK_IMAGE || true
```

Apply the same replacement at each LocalStack `docker run` site.

Also applies to: 506-506, 834-834, 1139-1139
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/orchestrator-integrity.yml at line 140, Replace the unpinned LocalStack image reference "localstack/localstack:latest" with a pinned tag or workflow variable and update every docker run that uses it (the occurrences matching the string "localstack/localstack:latest" in this workflow). Add a single source of truth like an env var LOCALSTACK_VERSION (e.g., set LOCALSTACK_VERSION: "0.14.0") at the top of the workflow and change each usage to localstack/localstack:${{ env.LOCALSTACK_VERSION }} (or hardcode a specific version string) so CI is deterministic; update all other matching occurrences noted in the comment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 564d09cb-7987-4823-8cba-548dd9bc7abf
⛔ Files ignored due to path filters (1)
`dist/index.js.map` is excluded by `!**/dist/**`, `!**/*.map`
📒 Files selected for processing (1)
.github/workflows/orchestrator-integrity.yml
```yaml
# aws-provider-tests - Needs LocalStack only (no k3d). 8 tests.
# local-docker-tests - Needs Docker only (some tests also need LocalStack). 10 tests.
```
Header test counts are out of sync with actual jobs.
Lines 36-37 say AWS has 8 tests and local-docker has 10, but this workflow defines 10 for AWS and 9 for local-docker. Keeping these comments accurate will prevent maintenance confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/orchestrator-integrity.yml around lines 36 - 37, Update
the header comment counts for the job groups to match the workflow definition:
change the "aws-provider-tests - Needs LocalStack only (no k3d). 8 tests."
comment to reflect 10 tests for aws-provider-tests and change
"local-docker-tests - Needs Docker only (some tests also need LocalStack). 10
tests." to reflect 9 tests for local-docker-tests (or alternatively adjust the
actual job definitions aws-provider-tests and local-docker-tests to match the
comment); ensure the referenced job names aws-provider-tests and
local-docker-tests in the header are accurate and consistent with the workflow
job list.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main     #809   +/-   ##
=======================================
  Coverage   31.25%   31.25%
=======================================
  Files          84       84
  Lines        4563     4563
  Branches     1103     1103
=======================================
  Hits         1426     1426
  Misses       3137     3137
```
The monolithic orchestrator-integrity workflow runs 25+ tests sequentially in a single job, consistently hitting the 60-minute timeout on PR runs. Split into 4 parallel jobs (k8s, aws-provider, local-docker, rclone) each on its own runner, cutting wall-clock time from 3+ hours to ~1 hour and eliminating disk space exhaustion from shared runner contention. Adopts the parallel architecture from PR #809. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-tests-mac.yml:
- Line 15: Remove the job-level continue-on-error and implement a step-scoped
retry sequence mirroring Ubuntu/Windows workflows: create three build steps with
ids build-1 (always runs, continue-on-error: true), build-2 (runs only if
steps.build-1.outcome == 'failure', continue-on-error: true), and build-3 (runs
only if steps.build-2.outcome == 'failure' and has no continue-on-error so
failures propagate); add sleep steps after build-1 and build-2 conditioned on
their failure outcomes with delays 60–120s and 240s respectively; ensure step
ids and conditional expressions (steps.build-1.outcome, steps.build-2.outcome)
are used exactly as referenced so the final build-3 signals a real failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fcf43df8-2335-4a6b-9416-d0d989c7dc51
📒 Files selected for processing (1)
.github/workflows/build-tests-mac.yml
```yaml
buildForAllPlatformsMacOS:
  name: ${{ matrix.targetPlatform }} on ${{ matrix.unityVersion }}
  runs-on: macos-latest
  continue-on-error: true
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if the files exist and their structure
echo "=== Checking .github/workflows/build-tests-mac.yml ==="
head -20 .github/workflows/build-tests-mac.yml
echo -e "\n=== Checking build-tests-ubuntu.yml lines 136-195 ==="
sed -n '136,195p' .github/workflows/build-tests-ubuntu.yml
echo -e "\n=== Checking build-tests-windows.yml lines 72-143 ==="
sed -n '72,143p' .github/workflows/build-tests-windows.yml
```

Repository: game-ci/unity-builder
Length of output: 5909
🏁 Script executed:

```shell
# Get the full build-tests-mac.yml to see complete step definitions
cat .github/workflows/build-tests-mac.yml
```

Repository: game-ci/unity-builder
Length of output: 3097
Move continue-on-error to step scope with an explicit retry mechanism.
The job-level continue-on-error: true suppresses all macOS matrix failures at the job level, making the workflow pass even when builds consistently fail. This also means macOS gets zero retry attempts, unlike Ubuntu and Windows workflows which implement a three-attempt retry with escalating delays (60s/120s → 240s) between each try. The correct pattern is step-level continue-on-error on the first two attempts only, with a final third attempt without continue-on-error to preserve the failure signal. This allows transient failures to self-heal while still catching real issues.
Mirror the retry structure from .github/workflows/build-tests-ubuntu.yml:136-195 and .github/workflows/build-tests-windows.yml:72-143:
- Build attempt 1: `id: build-1`, `continue-on-error: true`, always runs
- Sleep step: runs on `steps.build-1.outcome == 'failure'` (60-120 sec delay)
- Build attempt 2: `id: build-2`, `continue-on-error: true`, conditional on build-1 failure
- Sleep step: runs on `steps.build-2.outcome == 'failure'` (240 sec delay)
- Build attempt 3: `id: build-3`, no `continue-on-error`, conditional on build-2 failure (final attempt, lets failure propagate)
Remove the job-level continue-on-error: true and implement the step-based retry pattern instead.
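A minimal sketch of that step-scoped retry pattern; the step names and the `yarn run build` command here are assumptions, not the actual steps from `build-tests-mac.yml`:

```yaml
# Sketch only: substitute the real build step from the workflow.
- name: Build (attempt 1)
  id: build-1
  continue-on-error: true
  run: yarn run build
- name: Wait before retry
  if: steps.build-1.outcome == 'failure'
  run: sleep 120
- name: Build (attempt 2)
  id: build-2
  if: steps.build-1.outcome == 'failure'
  continue-on-error: true
  run: yarn run build
- name: Wait before final attempt
  if: steps.build-2.outcome == 'failure'
  run: sleep 240
- name: Build (attempt 3)
  id: build-3
  if: steps.build-2.outcome == 'failure'
  run: yarn run build  # no continue-on-error, so a real failure propagates
```

Because only the final attempt lacks `continue-on-error`, transient failures self-heal while a consistently broken build still fails the job.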
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/build-tests-mac.yml at line 15, Remove the job-level
continue-on-error and implement a step-scoped retry sequence mirroring
Ubuntu/Windows workflows: create three build steps with ids build-1 (always
runs, continue-on-error: true), build-2 (runs only if steps.build-1.outcome ==
'failure', continue-on-error: true), and build-3 (runs only if
steps.build-2.outcome == 'failure' and has no continue-on-error so failures
propagate); add sleep steps after build-1 and build-2 conditioned on their
failure outcomes with delays 60–120s and 240s respectively; ensure step ids and
conditional expressions (steps.build-1.outcome, steps.build-2.outcome) are used
exactly as referenced so the final build-3 signals a real failure.

Summary
Rewrites the monolithic `orchestrator-integrity.yml` workflow into 4 parallel jobs, each running on its own GitHub Actions runner with an isolated 14GB disk. Wall-clock time drops from 3+ hours to ~1 hour, and the chronic disk space exhaustion that caused flaky failures is eliminated.

This is infrastructure work that benefits the entire LTS release — every open PR triggers this workflow, so making it fast and reliable unblocks all feature development.
Problem
The existing workflow runs 24 integration tests sequentially in a single job. This creates two compounding problems:
Time: With setup, infrastructure provisioning, test execution, and cleanup for each test, the job takes 3+ hours. Every push to any open PR waits for this full cycle, creating a bottleneck across all LTS development.
Disk space exhaustion: The single runner's 14GB disk must simultaneously support:
When k3d and LocalStack compete for the same disk, the runner hits capacity mid-run — causing non-deterministic failures that are difficult to diagnose and impossible to fix without architectural changes.
Solution
Split the monolith into 4 parallel jobs based on infrastructure requirements. Each job runs on its own fresh runner, so disk-hungry consumers (k3d, LocalStack) never compete for the same space.
Job Architecture
`k8s-tests` · `aws-provider-tests` · `local-docker-tests` · `rclone-tests`

Workflow Topology
Within each job, tests still run sequentially — they share Docker state within a provider strategy. The parallelism is between provider strategies, not between individual tests.
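The job split can be sketched as a workflow topology; the runner label and the empty step bodies below are placeholders, not the real job definitions:

```yaml
# Sketch: four independent jobs with no `needs:` edges between them,
# so they start in parallel on separate runners with separate disks.
jobs:
  k8s-tests:
    runs-on: ubuntu-latest   # k3d + LocalStack, 5 tests
    steps: []                # test steps elided
  aws-provider-tests:
    runs-on: ubuntu-latest   # LocalStack only, 10 tests
    steps: []
  local-docker-tests:
    runs-on: ubuntu-latest   # Docker + LocalStack for S3 tests, 9 tests
    steps: []
  rclone-tests:
    runs-on: ubuntu-latest   # rclone + LocalStack, 1 test
    steps: []
```

The absence of `needs:` between the four jobs is what makes them parallel; sequencing only exists inside each job.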
Before / After
Cleanup Function Pattern
The previous workflow had 15 near-identical 30-line cleanup blocks scattered throughout. These are now replaced with reusable shell functions sourced from a temporary script:
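The helper script itself is not reproduced here; a minimal sketch of what such sourced functions might look like follows. The `light_cleanup`/`heavy_cleanup` names and the light/heavy split match the PR description, but the exact commands inside each function are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of /tmp/cleanup-functions.sh.
cat > /tmp/cleanup-functions.sh <<'EOF'
light_cleanup() {
  # Between tests: drop build cache and stopped containers, but keep
  # pulled base images so they are not re-downloaded mid-job.
  command -v docker >/dev/null 2>&1 || return 0
  yarn cache clean >/dev/null 2>&1 || true
  docker system prune -f >/dev/null 2>&1 || true
}

heavy_cleanup() {
  # At job boundaries: also remove unused images and volumes.
  command -v docker >/dev/null 2>&1 || return 0
  docker system prune -af --volumes >/dev/null 2>&1 || true
}
EOF

# A test step then sources the helpers and calls the right tier:
source /tmp/cleanup-functions.sh
echo "cleanup helpers loaded"
```

Sourcing one file instead of pasting a 30-line block per test keeps all fifteen call sites to a single line each.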
Cleanup frequency is also reduced: light cleanup (`docker system prune -f`) runs between tests within a job, while heavy cleanup (`prune -af --volumes`) only runs at job start and end. This avoids unnecessarily re-pulling base images mid-job.

Disk Space Analysis
Why k3d eats disk
k3d creates a Kubernetes cluster using Docker containers. Each k3d node runs containerd internally, and when the orchestrator tests pull Unity Docker images into the cluster, those images are stored in containerd's content store inside the k3d node container — not in Docker's image cache. This means:

- `docker system prune` does not reclaim this space (it is inside the container)
- Only removing images from inside the node (`k3d image rm`) or destroying the cluster frees the space

Why LocalStack eats disk
LocalStack simulates AWS services locally. Each test creates CloudFormation stacks, S3 buckets with objects, ECS task definitions, and internal state. Even with cleanup between tests:
The fix
By giving each job its own runner, k3d (in `k8s-tests`) and LocalStack (in `aws-provider-tests` and others) each get a full 14GB disk instead of splitting one. The k3d node image cleanup now only runs in the `k8s-tests` job where it actually matters.

What Did NOT Change
- `workflow_call` interface — inputs, outputs, and permissions are identical. `integrity-check.yml` requires zero modifications.
- `UNITY_LICENSE`, `AWS_*`, and `GH_TOKEN` usage.
- Test commands (`npm run cli -- ...`) are unchanged.
This workflow tests itself — pushing to the PR branch triggers `integrity-check.yml`, which calls the modified `orchestrator-integrity.yml`. Verification:

- `integrity-check.yml` calls `orchestrator-integrity.yml` without modifications
- All four jobs run (`aws-provider-tests`, etc.)
- Disk usage in `k8s-tests` stays within 14GB (previously the bottleneck)
Benefits all open LTS PRs — every feature branch triggers this workflow on push. Faster, more reliable CI unblocks:
Generated with Claude Code
Tracking: