ci: optimize CI sharding by NathanDrake2406 · Pull Request #1768 · cloudflare/vinext

NathanDrake2406 · 2026-06-05T15:10:56Z

What this changes

Replaces the integration suite's native Vitest sharding with balanced, provenance-backed sharding, and tunes the CI matrix for wall-clock.

Integration tests are split into weighted file groups by measured per-file timings, instead of Vitest's native --shard (which splits by file count and lets a few heavy files dominate one shard).
Per-file weights live in a committed manifest (scripts/ci-integration-timings.json), regenerated from real CI runs with full provenance: median, p75, sample count, and the source run IDs.
CI matrix: integration 10 weighted shards, unit 3-way, app-router E2E project 3-way.
Hardens the manifest into a CI gate, and fixes a workflow template-injection flagged by code scanning.

No test behavior, timeouts, reporters, or coverage policy change.

Why

CI wall-clock is the contributor feedback bottleneck, and integration was its slowest stage. Native sharding could not balance it: Vitest shards by file, and several integration files are large, indivisible scheduling units.

This repo is public, so GitHub-hosted runner minutes are free; the objective is wall-clock, not compute. Measured runs:

| Configuration | CI wall-clock | Slowest integration shard |
| --- | --- | --- |
| Baseline, native 4-way integration | 4m15s | 3m31s |
| Weighted 6 shards + unit/E2E split | 3m01s | 2m19s |
| Weighted 10 shards + unit 3-way + E2E 3-way (this PR) | ~2m32s (±10s)* | 1m32s |

* At 10 shards the integration critical path (slowest shard plus the serial report job) is ~2m08s, the lowest measured. The remaining wall is gated by create-next-app on windows-latest, a single un-shardable Windows job that ran 2m15s to 2m45s across these runs, not by anything sharding controls. Lowering the wall further means changing that job and the report tail, which is left to a follow-up.

Method

Manifest (version: 2) records, per file, the estimateMs the planner uses, plus medianMs, p75Ms, samples, and the source CI runs. The planner reads only estimateMs; the rest is provenance a reviewer can trust.
Refresh (scripts/ci-integration-timings-refresh.mjs) regenerates the manifest from Vitest blob reports downloaded from successful CI runs (gh run download -p 'blob-report-*'), using p75 per file. It fails closed if the blobs do not cover every discovered file, or do not back the claimed source runs (each file must have exactly one sample per --run).
Planner (scripts/ci-integration-shard.mjs + scripts/lib/*) packs files with longest-processing-time greedy plus a deterministic local search. --check is the CI gate and fails closed on shard-count drift, missing or stale files, malformed or non-positive weights, and bucket-coverage errors. --recommend advises on shard count from the real weights.
One source of truth for the count: manifest.shardTotal drives the matrix, --check enforces no drift, and refresh inherits the count from the manifest rather than a constant.
Template-injection fix: the integration run step passes its computed file list through an env var instead of expanding a step output into the shell, closing the vector flagged by code scanning (zizmor).
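
The packing step described in the Planner item can be illustrated with a minimal longest-processing-time sketch. This is not the code in scripts/lib/*: the real planner also runs a deterministic local search, and the function name `packShards` and the sample weights below are hypothetical.

```javascript
// Hypothetical sketch of longest-processing-time (LPT) greedy packing.
// Names and weights are illustrative, not taken from the PR.
function packShards(weights, shardTotal) {
  const shards = Array.from({ length: shardTotal }, () => ({ files: [], totalMs: 0 }));
  // Heaviest file first; tie-break on path so the plan is deterministic.
  const entries = Object.entries(weights).sort(
    ([fileA, msA], [fileB, msB]) => msB - msA || (fileA < fileB ? -1 : fileA > fileB ? 1 : 0),
  );
  for (const [file, ms] of entries) {
    // Find the currently lightest shard (linear scan; a min-heap does the same job faster).
    let lightest = shards[0];
    for (const shard of shards) {
      if (shard.totalMs < lightest.totalMs) lightest = shard;
    }
    lightest.files.push(file);
    lightest.totalMs += ms;
  }
  return shards;
}

const plan = packShards(
  { "a.test.ts": 35000, "b.test.ts": 20000, "c.test.ts": 15000, "d.test.ts": 5000 },
  2,
);
console.log(plan.map((shard) => shard.totalMs)); // [ 40000, 35000 ]
```

With these made-up weights, a split by file count could put a.test.ts and b.test.ts in the same shard (55s against 20s), which is the imbalance the weighted plan is meant to avoid.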

Validation

vp run check (format, lint, type) and vp run knip pass.
--check passes at 10 shards and fails on drift; all 59 integration files map to exactly one shard with no duplicates.
Refresh is idempotent and rejects blobs that do not back the claimed provenance.
CI is green across Check, Vitest (unit and integration), and E2E.
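
The "exactly one shard, no duplicates" property checked above could be asserted along these lines. This is a sketch under assumptions: `findCoverageErrors` and its input shape (an array of file lists, one per shard) are hypothetical and not the PR's --check implementation.

```javascript
// Hypothetical coverage check: every discovered file must appear in exactly one shard.
function findCoverageErrors(discoveredFiles, shards) {
  const counts = new Map(discoveredFiles.map((file) => [file, 0]));
  const errors = [];
  for (const [index, files] of shards.entries()) {
    for (const file of files) {
      if (!counts.has(file)) {
        errors.push(`shard ${index + 1} contains undiscovered file ${file}`);
        continue;
      }
      counts.set(file, counts.get(file) + 1);
    }
  }
  for (const [file, count] of counts) {
    if (count === 0) errors.push(`${file} is not assigned to any shard`);
    if (count > 1) errors.push(`${file} is assigned to ${count} shards`);
  }
  return errors;
}

// A dropped file and a duplicated file are both reported.
console.log(findCoverageErrors(["a", "b", "c"], [["a", "b"], ["b"]]));
```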

Follow-ups

Lower the wall below the create-next-app-windows floor (run it on push-to-main only, or speed up the Windows scaffold) and trim the serial integration report tail on PRs. Both touch other jobs and are out of scope here.

pkg-pr-new · 2026-06-05T15:11:59Z

npm i https://pkg.pr.new/@vinext/cloudflare@1768

npm i https://pkg.pr.new/vinext@1768

commit: b14215a

NathanDrake2406 · 2026-06-05T16:24:24Z

@james-elicx 3m05s

…ng manifest

…provenance

Integration shard weights lived in a hand-seeded flat path->ms map ("aggregation": "manual seed"). A reviewer could not tell a measured number from a guess, and the guesses were wrong: favicon-short-circuit was seeded at 5s but runs ~35s in CI across five runs, a 7x under-weight that mis-packed the shards. The seed had no provenance and no way to regenerate from real data.

Restructure the manifest to a v2 provenance model: per file estimateMs (the weight the planner uses), plus medianMs/p75Ms/samples and a generatedFrom.runs list, an estimator metric, and generatedAt. Add scripts/ci-integration-timings-refresh.mjs to aggregate Vitest blob reports downloaded from successful CI runs (p75 per file, nearest-rank) and rewrite the manifest deterministically, failing closed when the blobs do not cover every discovered file. The manifest here was regenerated from 5 successful runs (30 blobs); all six shards now pack to 84s.

Extract planning and blob parsing into scripts/lib/* so the fragile Vite+ blob-parser probe lives in one place. Replace the O(files*shards) lightest-group scan with an O(n log m) binary min-heap and collapse the three duplicated local-search move/swap helpers into one makespanAfter + transfer primitive. Behavior preserved: the --check gate still verifies every file lands in exactly one shard.

Harden --check to fail closed on no discovered files, missing, stale, malformed/zero/negative timings, shard-count drift, and bucket coverage.

Add an advisory --recommend mode that models the optimal shard count from real weights and flags when integration has dropped below the competing cross-job bottleneck. It is advisory only and never runs in CI; the count stays declarative in manifest.shardTotal with the matrix enforced against it.
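
As an illustration of the "p75 per file, nearest-rank" aggregation, here is a minimal sketch. The function names and the five sample durations are hypothetical, and treating the median of an even-length list as the mean of the two middle values is an assumption rather than something the commit states.

```javascript
// Nearest-rank percentile: the value at rank ceil(p/100 * n) in the sorted samples.
function percentileNearestRank(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Median; for an even count this averages the two middle values (an assumption).
function median(samples) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical per-run durations (ms) for one file across five runs.
const samplesMs = [35000, 30000, 40000, 31000, 36000];
console.log(percentileNearestRank(samplesMs, 75)); // 36000
console.log(median(samplesMs)); // 35000
```

Nearest-rank always returns one of the observed samples, so an estimate produced this way can be traced back to a specific run.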

The integration shard step expanded ${{ steps.shard.outputs.files }} directly into the run: block. That output is a list of test file paths discovered from `vp test list`, and on pull_request runs a filename is attacker-controllable: a fork PR adding a file whose name contains shell metacharacters would inject it into the runner shell. GitHub code scanning (zizmor) flagged this as template-injection, alert 163. Route the file list and the other computed values through env vars and reference them in the script, leaving $SHARD_FILES unquoted so the shell still word-splits it into separate file arguments. The shell now treats the value as data, never as script text. Verified with zizmor: the pre-fix workflow reports template-injection on this line, the fixed workflow reports no findings.

The refresh tool recorded every --run id as provenance but only checked that each discovered file had at least one timing sample. Passing five --run ids with blobs for a single complete run still produced a manifest claiming five-run provenance while every file held one sample. The manifest could claim stronger provenance than the blob directory backs. A test file runs in exactly one shard per run, so one complete run yields exactly one sample per file. Require samples === runIds.length for every discovered file: too few means a claimed run's blobs are missing, too many means the directory holds blobs beyond the claimed runs. --allow-partial relaxes the check to "at least one sample per file" for the re-run-failed-shard case while still recording the true per-file sample count.
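
The sample-count rule described above could look roughly like the following. It is a sketch only: `checkProvenance`, its input shape (discovered file path to array of samples), and the message wording are assumptions, not the refresh script's actual code.

```javascript
// Hypothetical provenance check. Strict mode requires samples === runIds.length
// per discovered file; allowPartial relaxes this to "at least one sample".
function checkProvenance(samplesByFile, runIds, { allowPartial = false } = {}) {
  const errors = [];
  for (const [file, samples] of Object.entries(samplesByFile)) {
    if (allowPartial) {
      if (samples.length < 1) errors.push(`${file}: no samples`);
    } else if (samples.length < runIds.length) {
      errors.push(`${file}: ${samples.length} sample(s) for ${runIds.length} claimed run(s); blobs missing`);
    } else if (samples.length > runIds.length) {
      errors.push(`${file}: ${samples.length} sample(s) for ${runIds.length} claimed run(s); extra blobs present`);
    }
  }
  return errors;
}

// Two claimed runs, but b.test.ts only has one sample.
const samples = { "a.test.ts": [1200, 1300], "b.test.ts": [900] };
console.log(checkProvenance(samples, ["run-1", "run-2"]));
console.log(checkProvenance(samples, ["run-1", "run-2"], { allowPartial: true }));
```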

… knee

Temporary, for benchmarking only. Repacks the same provenance weights into 5 integration shards instead of 6 (manifest shardTotal and matrix set to 5, Check gate updated to match) so the 5 vs 6 trade-off can be measured with the same weights, unit split, and E2E split. To be reverted to 6 after the run is captured.

…-router E2E)

Runner minutes are free on this public repo, so the objective is pure wall-clock. Attack the whole critical-path cluster at once: integration to 8 shards (~63s test load each, near the per-file floor), unit to 3, and the app-router E2E project to 3-way so none of them becomes the new ceiling once the others drop. Report job left as-is. Benchmarking only; final counts settle after the run lands.

Public repo, so runner minutes are free and the objective is pure wall-clock. At 10 shards each integration shard carries ~51s of test load; combined with the serial report tail this brings the integration critical path down to roughly where the un-shardable create-next-app (windows) job sits, so additional shards stop moving the overall wall. Keeps unit at 3 shards and the app-router E2E project at 3-way from the prior step. Benchmarking continues; counts can still change.

NathanDrake2406 · 2026-06-06T02:42:59Z

@codex adversarial review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a31c99c4fd

chatgpt-codex-connector · 2026-06-06T02:48:40Z

+const shardTotalRaw = args.find((a) => a.startsWith("--shard-total="));
+const shardTotal = shardTotalRaw
+  ? Number.parseInt(shardTotalRaw.slice("--shard-total=".length), 10)
+  : 6;


Set refresh default to the 10-shard manifest

When maintainers follow the documented refresh command in .github/workflows/ci.yml without passing --shard-total, this fallback writes shardTotal: 6 into scripts/ci-integration-timings.json even though the CI matrix and manifest check now require 10 shards. The next run then fails the Verify integration shard manifest step with shard-count drift, so refreshing timings via the advertised workflow produces an unusable manifest unless the operator knows to add an undocumented flag.

…constant

ci-integration-timings-refresh.mjs defaulted --shard-total to a hardcoded 6. The documented refresh command in ci.yml omits --shard-total, so once the matrix moved past 6 shards, following the advertised workflow rewrote shardTotal: 6 into the manifest and the next run failed the Verify integration shard manifest step with shard-count drift.

Default to the current manifest's shardTotal instead. manifest.shardTotal is the single source of truth for the count: the matrix mirrors it and --check enforces no drift, so a plain refresh now preserves whatever the matrix uses. An explicit --shard-total still overrides it for an intentional count change, and a missing count with no existing manifest now fails with a clear message instead of silently picking a number.

Found by Codex review on a31c99c.
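
The resolution order described in this commit (explicit flag, then existing manifest, then a hard failure) can be sketched as below. `resolveShardTotal` and the error wording are hypothetical; only the precedence is taken from the commit message.

```javascript
// Hypothetical sketch of the shard-count resolution order:
// 1. an explicit --shard-total value, 2. the existing manifest, 3. fail.
function resolveShardTotal(flagValue, existingManifest) {
  if (flagValue !== undefined) {
    const parsed = Number(flagValue);
    if (!Number.isInteger(parsed) || parsed < 1) {
      throw new Error(`Invalid --shard-total: ${flagValue}`);
    }
    return parsed;
  }
  if (existingManifest && Number.isInteger(existingManifest.shardTotal)) {
    return existingManifest.shardTotal;
  }
  throw new Error("No --shard-total given and no existing manifest to inherit it from.");
}

console.log(resolveShardTotal(undefined, { shardTotal: 10 })); // 10
console.log(resolveShardTotal("8", { shardTotal: 10 })); // 8
```

Failing when neither source is available avoids the original problem of a silent constant drifting away from the CI matrix.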

The invalid --shard-total message reconstructed the flag as '--shard-total=<value>', printing a doubled prefix ('Invalid --shard-total: --shard-total=abc'). parseFlag already returns just the value, so print it directly to match the planner CLI's wording.

NathanDrake2406 · 2026-06-06T04:34:47Z

2m32s @james-elicx
The feedback loop gonna be tight tight tight

The integration shard check fails closed when a discovered file is missing from the timing manifest, so adding one integration test reds CI until someone hand-refreshes scripts/ci-integration-timings.json. The per-file weights are only a load-balancing hint: a missing or stale weight costs a little shard balance, never test correctness or coverage. Gating on a freshness signal blocks contributors (and forks, which run the secret-free ci.yml against the committed manifest) for an imbalance worth a few seconds on one shard.

checkPlan now returns warnings separately from errors. Missing and stale files become warnings; the structural invariants (schema, shard-count drift, zero discovery, dropped or duplicated file) stay fail-closed. The check job prints warnings as ::warning:: annotations and exits 0, so the plan stays valid and a maintainer refreshes the manifest at leisure.

Separately, runShard packed into whatever N/M the workflow passed while only --check compared the manifest to --shard-total, so a future edit could drift the matrix count from the manifest and silently drop or double-run tests at the point tests are selected. Guard manifest.shardTotal against the requested total in runShard too, dying on a mismatch instead of producing a malformed plan.
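
A rough sketch of the warning/error split described above follows. It borrows the name checkPlan from the commit message, but the signature, the `manifest.files` shape, and the messages are assumptions made for illustration; the real function also covers schema and coverage checks that are omitted here.

```javascript
// Hypothetical sketch: structural problems are errors, freshness problems are warnings.
function checkPlan({ manifest, discoveredFiles, requestedShardTotal }) {
  const errors = [];
  const warnings = [];
  if (discoveredFiles.length === 0) errors.push("no integration files discovered");
  if (manifest.shardTotal !== requestedShardTotal) {
    errors.push(`shard-count drift: manifest ${manifest.shardTotal}, requested ${requestedShardTotal}`);
  }
  const discovered = new Set(discoveredFiles);
  for (const file of discoveredFiles) {
    if (!Object.hasOwn(manifest.files, file)) warnings.push(`missing timing for ${file}`);
  }
  for (const file of Object.keys(manifest.files)) {
    if (!discovered.has(file)) warnings.push(`stale timing for ${file}`);
  }
  return { errors, warnings };
}

const result = checkPlan({
  manifest: { shardTotal: 10, files: { "old.test.ts": { estimateMs: 900 } } },
  discoveredFiles: ["new.test.ts"],
  requestedShardTotal: 10,
});
// Warnings become annotations; only errors would fail the job.
for (const warning of result.warnings) console.log(`::warning::${warning}`);
console.log(result.errors.length === 0 ? "plan ok" : "plan invalid");
```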

NathanDrake2406 · 2026-06-06T05:09:57Z

@james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints.

james-elicx · 2026-06-06T12:39:18Z

/bigbonk review for issues

ask-bonk

Review: CI sharding optimization

Solid, well-documented work. The packer produces a perfectly balanced plan locally (all 10 shards at 51s), the --check gate passes, and the manifest provenance design is genuinely nice. The template-injection fix (routing steps.shard.outputs.files through SHARD_FILES env instead of inline ${{ }} expansion) is correct.

A few things worth addressing before merge, in priority order:

1. No tests for ~480 lines of new pure logic. scripts/lib/integration-shard-plan.mjs (greedy + local-search packing, manifest validation, min-heap) and vitest-blob-timings.mjs (percentile/median, path normalization) are all pure and highly testable, but ship with zero tests. AGENTS.md explicitly asks for focused unit tests when moving real behavior into helper modules. Without them, a future refactor that silently breaks balancing or drops a file from a shard would not be caught — and the failure mode (tests silently not running) is exactly the kind this PR is meant to prevent. Suggest a small tests/ file asserting: every discovered file lands in exactly one shard, validateManifest rejects each malformed case, and percentile/median edge cases (single value, even/odd length).

2. Empty shard output silently runs the entire suite. See inline comment on the CI step. Not reachable at 59 files / 10 shards, but it is an unguarded foot-gun.

3. The Codex P2 comment about --shard-total defaulting to 6 is stale — it reviewed a31c99c4fd. On the current commit, ci-integration-timings-refresh.mjs defaults to the existing manifest's shardTotal (10), so the documented refresh command produces a correct manifest. No action needed; flagging so it does not block.

Minor notes inline.

Nothing here is blocking correctness for the current file set; #1 is the one I would most want before merge given the stated goal of preventing silent test-scheduling drift.

ask-bonk · 2026-06-06T12:42:12Z

+          vp test run --project integration $COVERAGE \
+            --reporter=blob \
+            --outputFile.blob=".vitest-reports/blob-${SHARD_INDEX}-${SHARD_TOTAL}.json" \
+            $SHARD_FILES


If SHARD_FILES is ever empty, this falls through to vp test run --project integration with no file filter, which runs the entire integration suite in that shard — silently, with a green check. It is not reachable today (59 files / 10 shards), but it becomes reachable if the planner ever emits an empty shard (e.g. shardTotal > file count after a future trim, or a packing bug). Consider failing closed when the computed list is empty:

if [ -z "${SHARD_FILES// /}" ]; then
  echo "::error::shard ${SHARD_INDEX}/${SHARD_TOTAL} resolved to no files"
  exit 1
fi

This keeps a mis-scheduled shard from masquerading as a fast pass while another shard double-runs.

ask-bonk · 2026-06-06T12:42:12Z

+export function extractDurations(testFiles) {
+  return testFiles.map((testFile) => ({
+    file: normalizeTestPath(testFile.filepath),
+    durationMs: testFile.result?.duration ?? 0,


durationMs: testFile.result?.duration ?? 0 silently turns a missing duration into a 0ms sample, which would pull p75 down and under-weight a file in the planner. Since this feeds a committed weight, a 0 (or missing-result) sample is more likely a malformed blob than a genuinely instant file. Consider skipping such entries (and warning) rather than recording 0, so a bad blob does not quietly corrupt a weight.
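
One way the suggestion could be implemented is sketched below. The function name `extractDurationsStrict` and the `{ durations, skipped }` return shape are hypothetical, and the normalizeTestPath call from the original is left out so the example stays self-contained.

```javascript
// Hypothetical variant that skips entries without a positive duration
// instead of recording them as 0ms samples.
function extractDurationsStrict(testFiles) {
  const durations = [];
  const skipped = [];
  for (const testFile of testFiles) {
    const durationMs = testFile.result?.duration;
    if (typeof durationMs !== "number" || !(durationMs > 0)) {
      skipped.push(testFile.filepath);
      continue;
    }
    durations.push({ file: testFile.filepath, durationMs });
  }
  return { durations, skipped };
}

const { durations, skipped } = extractDurationsStrict([
  { filepath: "a.test.ts", result: { duration: 1250 } },
  { filepath: "b.test.ts" }, // no result: likely a malformed blob entry
  { filepath: "c.test.ts", result: { duration: 0 } },
]);
for (const file of skipped) console.warn(`skipping ${file}: no usable duration`);
console.log(durations.length); // 1
```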

ask-bonk · 2026-06-06T12:42:12Z

+    console.error(`No samples for ${uncovered.length} discovered file(s):`);
+    for (const c of uncovered) console.error(`  ${c.file}`);
+    die("Every discovered file needs at least one sample to compute a weight.");
+  }


With --allow-partial, a file may have fewer samples than runIds.length, yet generatedFrom.runs still lists every --run id. The per-file samples count stays honest, but generatedFrom.runs then overclaims which runs actually backed a given file's weight. Minor, and arguably acceptable for the re-run case, but worth a one-line note in the manifest or a comment so a future reader does not assume every listed run contributed to every file.

ask-bonk · 2026-06-06T12:42:22Z

Review posted successfully. Here's a summary of what I reviewed and found.

Review summary — PR #1768 (ci: optimize CI sharding)

I checked out the branch, ran the tooling, and verified behavior end-to-end. The implementation is solid: the packer produces a perfectly balanced 10-shard plan (all shards at 51s), the --check gate passes, and the template-injection fix correctly routes the planner's file list through a SHARD_FILES env var instead of inline ${{ }} expansion.

Findings posted (1 review, 3 inline comments):

No unit tests for ~480 lines of new pure logic (highest priority) — integration-shard-plan.mjs (greedy + local-search packing, manifest validation, min-heap) and vitest-blob-timings.mjs (percentile/median) ship with zero tests. AGENTS.md explicitly requires focused tests for extracted helper modules, and the failure mode this guards against (silently mis-scheduled tests) is exactly what the PR aims to prevent.
Empty SHARD_FILES silently runs the full suite (ci.yml:101) — not reachable at 59 files/10 shards, but an unguarded foot-gun; suggested a fail-closed guard.
durationMs ?? 0 (vitest-blob-timings.mjs:55) — a missing duration becomes a 0ms sample that silently corrupts a committed weight; suggested skipping + warning instead.
--allow-partial provenance overclaim (refresh.mjs:100) — generatedFrom.runs lists all run ids even when a file lacks samples from some; minor.

Noted as non-blocking: The Codex P2 comment about --shard-total defaulting to 6 is stale — it reviewed an earlier commit; the current code defaults to the manifest's shardTotal (10), so the documented refresh command is correct.

I also confirmed the local-search loop terminates (strict integer makespan decrease) and the matrix/manifest/check shard-count values are consistently fail-closed.

james-elicx · 2026-06-06T12:44:44Z

> @james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints.

Why do we need a manifest for running test files? Won't that be inconvenient whenever we add new test files?

NathanDrake2406 · 2026-06-06T12:47:52Z

> > @james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints.
>
> Why do we need a manifest for running test files? Won't that be inconvenient whenever we add new test files?

timing hint so the shards will automatically be balanced for speedmaxxing. It's just hints which affect 10% of the speed at most, and it won't materially change when adding a few new tests. Only hassle is needing to refresh it like once a month

ci: add fifth integration shard

c661db4

NathanDrake2406 and others added 8 commits June 6, 2026 01:27

ci: weight integration shards from timing data

d0d2c8b

ci: shard app-router e2e

76b34b4

ci: rerun optimization experiment

b9bff6c

ci: add sixth weighted integration shard

322cca2

ci: rebalance weighted integration shards

71dc226

ci: shard unit tests

b8dfe20

Merge branch 'main' into nathan/fifth-integration-shard

271ff0d

ci: rebalance integration shards from current timings

874c69b

NathanDrake2406 changed the title ~~ci: add fifth integration shard (experiment)~~ ci: optimize CI sharding (experiment) Jun 5, 2026

NathanDrake2406 marked this pull request as ready for review June 5, 2026 16:24

NathanDrake2406 marked this pull request as draft June 5, 2026 16:31

ci: move weighted integration shard list from YAML into script + timi…

f864041

…ng manifest

github-advanced-security AI found potential problems Jun 5, 2026

Comment thread .github/workflows/ci.yml Fixed

NathanDrake2406 added 7 commits June 6, 2026 02:49

ci: move weighted integration shard list from YAML into script + timi…

a19d087

…ng manifest

chatgpt-codex-connector Bot reviewed Jun 6, 2026

NathanDrake2406 added 4 commits June 6, 2026 14:12

refactor(ci): share integration shard CLI helpers

74da23c

refactor(ci): clarify shard local search

cb62f31

NathanDrake2406 changed the title ~~ci: optimize CI sharding (experiment)~~ ci: optimize CI sharding Jun 6, 2026

NathanDrake2406 marked this pull request as ready for review June 6, 2026 04:32

docs(ci): clarify missing timing warning

8d124fd

ask-bonk Bot reviewed Jun 6, 2026

NathanDrake2406 and others added 2 commits June 6, 2026 22:51

fix(ci): harden integration shard refresh

672ae8c

Merge branch 'main' into nathan/fifth-integration-shard

b14215a

james-elicx enabled auto-merge (squash) June 6, 2026 12:54

james-elicx disabled auto-merge June 6, 2026 12:54

james-elicx merged commit df066cb into cloudflare:main Jun 6, 2026
39 checks passed

3 participants