Skip to content

ci: optimize CI sharding#1768

Merged
james-elicx merged 25 commits into
cloudflare:mainfrom
NathanDrake2406:nathan/fifth-integration-shard
Jun 6, 2026
Merged

ci: optimize CI sharding#1768
james-elicx merged 25 commits into
cloudflare:mainfrom
NathanDrake2406:nathan/fifth-integration-shard

Conversation

@NathanDrake2406
Copy link
Copy Markdown
Contributor

@NathanDrake2406 NathanDrake2406 commented Jun 5, 2026

What this changes

Replaces the integration suite's native Vitest sharding with balanced, provenance-backed sharding, and tunes the CI matrix for wall-clock.

  • Integration tests are split into weighted file groups by measured per-file timings, instead of Vitest's native --shard (which splits by file count and lets a few heavy files dominate one shard).
  • Per-file weights live in a committed manifest (scripts/ci-integration-timings.json), regenerated from real CI runs with full provenance: median, p75, sample count, and the source run IDs.
  • CI matrix: integration 10 weighted shards, unit 3-way, app-router E2E project 3-way.
  • Hardens the manifest into a CI gate, and fixes a workflow template-injection flagged by code scanning.

No test behavior, timeouts, reporters, or coverage policy change.

Why

CI wall-clock is the contributor feedback bottleneck, and integration was its slowest stage. Native sharding could not balance it: Vitest shards by file, and several integration files are large, indivisible scheduling units.

This repo is public, so GitHub-hosted runner minutes are free; the objective is wall-clock, not compute. Measured runs:

Configuration CI wall-clock Slowest integration shard
Baseline, native 4-way integration 4m15s 3m31s
Weighted 6 shards + unit/E2E split 3m01s 2m19s
Weighted 10 shards + unit 3-way + E2E 3-way (this PR) ~2m32s (±10s)* 1m32s

* At 10 shards the integration critical path (slowest shard plus the serial report job) is ~2m08s, the lowest measured. The remaining wall is gated by create-next-app on windows-latest, a single un-shardable Windows job that ran 2m15s to 2m45s across these runs, not by anything sharding controls. Lowering the wall further means changing that job and the report tail, which is left to a follow-up.

Method

  • Manifest (version: 2) records, per file, the estimateMs the planner uses, plus medianMs, p75Ms, samples, and the source CI runs. The planner reads only estimateMs; the rest is provenance a reviewer can trust.
  • Refresh (scripts/ci-integration-timings-refresh.mjs) regenerates the manifest from Vitest blob reports downloaded from successful CI runs (gh run download -p 'blob-report-*'), using p75 per file. It fails closed if the blobs do not cover every discovered file, or do not back the claimed source runs (each file must have exactly one sample per --run).
  • Planner (scripts/ci-integration-shard.mjs + scripts/lib/*) packs files with longest-processing-time greedy plus a deterministic local search. --check is the CI gate and fails closed on shard-count drift, missing or stale files, malformed or non-positive weights, and bucket-coverage errors. --recommend advises on shard count from the real weights.
  • One source of truth for the count: manifest.shardTotal drives the matrix, --check enforces no drift, and refresh inherits the count from the manifest rather than a constant.
  • Template-injection fix: the integration run step passes its computed file list through an env var instead of expanding a step output into the shell, closing the vector flagged by code scanning (zizmor).

Validation

  • vp run check (format, lint, type) and vp run knip pass.
  • --check passes at 10 shards and fails on drift; all 59 integration files map to exactly one shard with no duplicates.
  • Refresh is idempotent and rejects blobs that do not back the claimed provenance.
  • CI is green across Check, Vitest (unit and integration), and E2E.

Follow-ups

  • Lower the wall below the create-next-app-windows floor (run it on push-to-main only, or speed up the Windows scaffold) and trim the serial integration report tail on PRs. Both touch other jobs and are out of scope here.

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Jun 5, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@vinext/cloudflare@1768
npm i https://pkg.pr.new/vinext@1768

commit: b14215a

@NathanDrake2406 NathanDrake2406 changed the title ci: add fifth integration shard (experiment) ci: optimize CI sharding (experiment) Jun 5, 2026
@NathanDrake2406 NathanDrake2406 marked this pull request as ready for review June 5, 2026 16:24
@NathanDrake2406
Copy link
Copy Markdown
Contributor Author

@james-elicx 3m05s

@NathanDrake2406 NathanDrake2406 marked this pull request as draft June 5, 2026 16:31
Comment thread .github/workflows/ci.yml Fixed
…provenance

Integration shard weights lived in a hand-seeded flat path->ms map
("aggregation": "manual seed"). A reviewer could not tell a measured
number from a guess, and the guesses were wrong: favicon-short-circuit
was seeded at 5s but runs ~35s in CI across five runs, a 7x under-weight
that mis-packed the shards. The seed had no provenance and no way to
regenerate from real data.

Restructure the manifest to a v2 provenance model: per file estimateMs
(the weight the planner uses), plus medianMs/p75Ms/samples and a
generatedFrom.runs list, an estimator metric, and generatedAt. Add
scripts/ci-integration-timings-refresh.mjs to aggregate Vitest blob
reports downloaded from successful CI runs (p75 per file, nearest-rank)
and rewrite the manifest deterministically, failing closed when the
blobs do not cover every discovered file. The manifest here was
regenerated from 5 successful runs (30 blobs); all six shards now pack
to 84s.

Extract planning and blob parsing into scripts/lib/* so the fragile
Vite+ blob-parser probe lives in one place. Replace the O(files*shards)
lightest-group scan with an O(n log m) binary min-heap and collapse the
three duplicated local-search move/swap helpers into one makespanAfter +
transfer primitive. Behavior preserved: the --check gate still verifies
every file lands in exactly one shard.

Harden --check to fail closed on no discovered files, missing, stale,
malformed/zero/negative timings, shard-count drift, and bucket coverage.
Add an advisory --recommend mode that models the optimal shard count from
real weights and flags when integration has dropped below the competing
cross-job bottleneck. It is advisory only and never runs in CI; the count
stays declarative in manifest.shardTotal with the matrix enforced
against it.
The integration shard step expanded ${{ steps.shard.outputs.files }}
directly into the run: block. That output is a list of test file paths
discovered from `vp test list`, and on pull_request runs a filename is
attacker-controllable: a fork PR adding a file whose name contains shell
metacharacters would inject it into the runner shell. GitHub code
scanning (zizmor) flagged this as template-injection, alert 163.

Route the file list and the other computed values through env vars and
reference them in the script, leaving $SHARD_FILES unquoted so the shell
still word-splits it into separate file arguments. The shell now treats
the value as data, never as script text. Verified with zizmor: the
pre-fix workflow reports template-injection on this line, the fixed
workflow reports no findings.
The refresh tool recorded every --run id as provenance but only checked
that each discovered file had at least one timing sample. Passing five
--run ids with blobs for a single complete run still produced a manifest
claiming five-run provenance while every file held one sample. The
manifest could claim stronger provenance than the blob directory backs.

A test file runs in exactly one shard per run, so one complete run
yields exactly one sample per file. Require samples === runIds.length for
every discovered file: too few means a claimed run's blobs are missing,
too many means the directory holds blobs beyond the claimed runs.
--allow-partial relaxes the check to "at least one sample per file" for
the re-run-failed-shard case while still recording the true per-file
sample count.
… knee

Temporary, for benchmarking only. Repacks the same provenance weights
into 5 integration shards instead of 6 (manifest shardTotal and matrix
set to 5, Check gate updated to match) so the 5 vs 6 trade-off can be
measured with the same weights, unit split, and E2E split. To be
reverted to 6 after the run is captured.
…-router E2E)

Runner minutes are free on this public repo, so the objective is pure
wall-clock. Attack the whole critical-path cluster at once: integration
to 8 shards (~63s test load each, near the per-file floor), unit to 3,
and the app-router E2E project to 3-way so none of them becomes the new
ceiling once the others drop. Report job left as-is. Benchmarking only;
final counts settle after the run lands.
Public repo, so runner minutes are free and the objective is pure
wall-clock. At 10 shards each integration shard carries ~51s of test
load; combined with the serial report tail this brings the integration
critical path down to roughly where the un-shardable create-next-app
(windows) job sits, so additional shards stop moving the overall wall.
Keeps unit at 3 shards and the app-router E2E project at 3-way from the
prior step. Benchmarking continues; counts can still change.
@NathanDrake2406
Copy link
Copy Markdown
Contributor Author

@codex adversarial review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a31c99c4fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +49 to +52
const shardTotalRaw = args.find((a) => a.startsWith("--shard-total="));
const shardTotal = shardTotalRaw
? Number.parseInt(shardTotalRaw.slice("--shard-total=".length), 10)
: 6;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Set refresh default to the 10-shard manifest

When maintainers follow the documented refresh command in .github/workflows/ci.yml without passing --shard-total, this fallback writes shardTotal: 6 into scripts/ci-integration-timings.json even though the CI matrix and manifest check now require 10 shards. The next run then fails the Verify integration shard manifest step with shard-count drift, so refreshing timings via the advertised workflow produces an unusable manifest unless the operator knows to add an undocumented flag.

Useful? React with 👍 / 👎.

…constant

ci-integration-timings-refresh.mjs defaulted --shard-total to a hardcoded
6. The documented refresh command in ci.yml omits --shard-total, so once
the matrix moved past 6 shards, following the advertised workflow rewrote
shardTotal: 6 into the manifest and the next run failed the Verify
integration shard manifest step with shard-count drift.

Default to the current manifest's shardTotal instead. manifest.shardTotal
is the single source of truth for the count: the matrix mirrors it and
--check enforces no drift, so a plain refresh now preserves whatever the
matrix uses. An explicit --shard-total still overrides it for an
intentional count change, and a missing count with no existing manifest
now fails with a clear message instead of silently picking a number.

Found by Codex review on a31c99c.
The invalid --shard-total message reconstructed the flag as
'--shard-total=<value>', printing a doubled prefix
('Invalid --shard-total: --shard-total=abc'). parseFlag already
returns just the value, so print it directly to match the planner
CLI's wording.
@NathanDrake2406 NathanDrake2406 changed the title ci: optimize CI sharding (experiment) ci: optimize CI sharding Jun 6, 2026
@NathanDrake2406 NathanDrake2406 marked this pull request as ready for review June 6, 2026 04:32
@NathanDrake2406
Copy link
Copy Markdown
Contributor Author

NathanDrake2406 commented Jun 6, 2026

2m32s @james-elicx
The feedback loop gonna be tight tight tight

The integration shard check fails closed when a discovered file is
missing from the timing manifest, so adding one integration test reds CI
until someone hand-refreshes scripts/ci-integration-timings.json. The
per-file weights are only a load-balancing hint: a missing or stale
weight costs a little shard balance, never test correctness or coverage.
Gating on a freshness signal blocks contributors (and forks, which run
the secret-free ci.yml against the committed manifest) for an imbalance
worth a few seconds on one shard.

checkPlan now returns warnings separately from errors. Missing and stale
files become warnings; the structural invariants (schema, shard-count
drift, zero discovery, dropped or duplicated file) stay fail-closed. The
check job prints warnings as ::warning:: annotations and exits 0, so the
plan stays valid and a maintainer refreshes the manifest at leisure.

Separately, runShard packed into whatever N/M the workflow passed while
only --check compared the manifest to --shard-total, so a future edit
could drift the matrix count from the manifest and silently drop or
double-run tests at the point tests are selected. Guard
manifest.shardTotal against the requested total in runShard too, dying on
a mismatch instead of producing a malformed plan.
@NathanDrake2406
Copy link
Copy Markdown
Contributor Author

NathanDrake2406 commented Jun 6, 2026

@james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints.

@james-elicx
Copy link
Copy Markdown
Member

/bigbonk review for issues

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: CI sharding optimization

Solid, well-documented work. The packer produces a perfectly balanced plan locally (all 10 shards at 51s), the --check gate passes, and the manifest provenance design is genuinely nice. The template-injection fix (routing steps.shard.outputs.files through SHARD_FILES env instead of inline ${{ }} expansion) is correct.

A few things worth addressing before merge, in priority order:

1. No tests for ~480 lines of new pure logic. scripts/lib/integration-shard-plan.mjs (greedy + local-search packing, manifest validation, min-heap) and vitest-blob-timings.mjs (percentile/median, path normalization) are all pure and highly testable, but ship with zero tests. AGENTS.md explicitly asks for focused unit tests when moving real behavior into helper modules. Without them, a future refactor that silently breaks balancing or drops a file from a shard would not be caught — and the failure mode (tests silently not running) is exactly the kind this PR is meant to prevent. Suggest a small tests/ file asserting: every discovered file lands in exactly one shard, validateManifest rejects each malformed case, and percentile/median edge cases (single value, even/odd length).

2. Empty shard output silently runs the entire suite. See inline comment on the CI step. Not reachable at 59 files / 10 shards, but it is an unguarded foot-gun.

3. The Codex P2 comment about --shard-total defaulting to 6 is stale — it reviewed a31c99c4fd. On the current commit, ci-integration-timings-refresh.mjs defaults to the existing manifest's shardTotal (10), so the documented refresh command produces a correct manifest. No action needed; flagging so it does not block.

Minor notes inline.

Nothing here is blocking correctness for the current file set; #1 is the one I would most want before merge given the stated goal of preventing silent test-scheduling drift.

Comment thread .github/workflows/ci.yml
vp test run --project integration $COVERAGE \
--reporter=blob \
--outputFile.blob=".vitest-reports/blob-${SHARD_INDEX}-${SHARD_TOTAL}.json" \
$SHARD_FILES
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If SHARD_FILES is ever empty, this falls through to vp test run --project integration with no file filter, which runs the entire integration suite in that shard — silently, with a green check. It is not reachable today (59 files / 10 shards), but it becomes reachable if the planner ever emits an empty shard (e.g. shardTotal > file count after a future trim, or a packing bug). Consider failing closed when the computed list is empty:

if [ -z "${SHARD_FILES// /}" ]; then
  echo "::error::shard ${SHARD_INDEX}/${SHARD_TOTAL} resolved to no files"; exit 1
fi

This keeps a mis-scheduled shard from masquerading as a fast pass while another shard double-runs.

Comment thread scripts/lib/vitest-blob-timings.mjs Outdated
export function extractDurations(testFiles) {
return testFiles.map((testFile) => ({
file: normalizeTestPath(testFile.filepath),
durationMs: testFile.result?.duration ?? 0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

durationMs: testFile.result?.duration ?? 0 silently turns a missing duration into a 0ms sample, which would pull p75 down and under-weight a file in the planner. Since this feeds a committed weight, a 0 (or missing-result) sample is more likely a malformed blob than a genuinely instant file. Consider skipping such entries (and warning) rather than recording 0, so a bad blob does not quietly corrupt a weight.

console.error(`No samples for ${uncovered.length} discovered file(s):`);
for (const c of uncovered) console.error(` ${c.file}`);
die("Every discovered file needs at least one sample to compute a weight.");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With --allow-partial, a file may have fewer samples than runIds.length, yet generatedFrom.runs still lists every --run id. The per-file samples count stays honest, but generatedFrom.runs then overclaims which runs actually backed a given file's weight. Minor, and arguably acceptable for the re-run case, but worth a one-line note in the manifest or a comment so a future reader does not assume every listed run contributed to every file.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Jun 6, 2026

Review posted successfully. Here's a summary of what I reviewed and found.

Review summary — PR #1768 (ci: optimize CI sharding)

I checked out the branch, ran the tooling, and verified behavior end-to-end. The implementation is solid: the packer produces a perfectly balanced 10-shard plan (all shards at 51s), the --check gate passes, and the template-injection fix correctly routes the planner's file list through a SHARD_FILES env var instead of inline ${{ }} expansion.

Findings posted (1 review, 3 inline comments):

  1. No unit tests for ~480 lines of new pure logic (highest priority) — integration-shard-plan.mjs (greedy + local-search packing, manifest validation, min-heap) and vitest-blob-timings.mjs (percentile/median) ship with zero tests. AGENTS.md explicitly requires focused tests for extracted helper modules, and the failure mode this guards against (silently mis-scheduled tests) is exactly what the PR aims to prevent.

  2. Empty SHARD_FILES silently runs the full suite (ci.yml:101) — not reachable at 59 files/10 shards, but an unguarded foot-gun; suggested a fail-closed guard.

  3. durationMs ?? 0 (vitest-blob-timings.mjs:55) — a missing duration becomes a 0ms sample that silently corrupts a committed weight; suggested skipping + warning instead.

  4. --allow-partial provenance overclaim (refresh.mjs:100) — generatedFrom.runs lists all run ids even when a file lacks samples from some; minor.

Noted as non-blocking: The Codex P2 comment about --shard-total defaulting to 6 is stale — it reviewed an earlier commit; the current code defaults to the manifest's shardTotal (10), so the documented refresh command is correct.

I also confirmed the local-search loop terminates (strict integer makespan decrease) and the matrix/manifest/check shard-count values are consistently fail-closed.

github run

@james-elicx
Copy link
Copy Markdown
Member

@james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints.

Why do we need a manifest for running test files? Won't that be inconvenient whenever we add new test files?

@NathanDrake2406
Copy link
Copy Markdown
Contributor Author

@james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints.

Why do we need a manifest for running test files? Won't that be inconvenient whenever we add new test files?

timing hint so the shards will automatically be balanced for speedmaxxing. It's just hints which affect 10% of the speed at most, and it won't materially change when adding a few new tests. Only hassle is needing to refresh it like once a month

@james-elicx james-elicx enabled auto-merge (squash) June 6, 2026 12:54
@james-elicx james-elicx disabled auto-merge June 6, 2026 12:54
@james-elicx james-elicx merged commit df066cb into cloudflare:main Jun 6, 2026
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants