ci: optimize CI sharding#1768
Conversation
commit: |
|
@james-elicx 3m05s |
…provenance
Integration shard weights lived in a hand-seeded flat path->ms map
("aggregation": "manual seed"). A reviewer could not tell a measured
number from a guess, and the guesses were wrong: favicon-short-circuit
was seeded at 5s but runs ~35s in CI across five runs, a 7x under-weight
that mis-packed the shards. The seed had no provenance and no way to
regenerate from real data.
Restructure the manifest to a v2 provenance model: per file estimateMs
(the weight the planner uses), plus medianMs/p75Ms/samples and a
generatedFrom.runs list, an estimator metric, and generatedAt. Add
scripts/ci-integration-timings-refresh.mjs to aggregate Vitest blob
reports downloaded from successful CI runs (p75 per file, nearest-rank)
and rewrite the manifest deterministically, failing closed when the
blobs do not cover every discovered file. The manifest here was
regenerated from 5 successful runs (30 blobs); all six shards now pack
to 84s.
Extract planning and blob parsing into scripts/lib/* so the fragile
Vite+ blob-parser probe lives in one place. Replace the O(files*shards)
lightest-group scan with an O(n log m) binary min-heap and collapse the
three duplicated local-search move/swap helpers into one makespanAfter +
transfer primitive. Behavior preserved: the --check gate still verifies
every file lands in exactly one shard.
Harden --check to fail closed on no discovered files, missing, stale,
malformed/zero/negative timings, shard-count drift, and bucket coverage.
Add an advisory --recommend mode that models the optimal shard count from
real weights and flags when integration has dropped below the competing
cross-job bottleneck. It is advisory only and never runs in CI; the count
stays declarative in manifest.shardTotal with the matrix enforced
against it.
The integration shard step expanded ${{ steps.shard.outputs.files }}
directly into the run: block. That output is a list of test file paths
discovered from `vp test list`, and on pull_request runs a filename is
attacker-controllable: a fork PR adding a file whose name contains shell
metacharacters would inject it into the runner shell. GitHub code
scanning (zizmor) flagged this as template-injection, alert 163.
Route the file list and the other computed values through env vars and
reference them in the script, leaving $SHARD_FILES unquoted so the shell
still word-splits it into separate file arguments. The shell now treats
the value as data, never as script text. Verified with zizmor: the
pre-fix workflow reports template-injection on this line, the fixed
workflow reports no findings.
The refresh tool recorded every --run id as provenance but only checked that each discovered file had at least one timing sample. Passing five --run ids with blobs for a single complete run still produced a manifest claiming five-run provenance while every file held one sample. The manifest could claim stronger provenance than the blob directory backs. A test file runs in exactly one shard per run, so one complete run yields exactly one sample per file. Require samples === runIds.length for every discovered file: too few means a claimed run's blobs are missing, too many means the directory holds blobs beyond the claimed runs. --allow-partial relaxes the check to "at least one sample per file" for the re-run-failed-shard case while still recording the true per-file sample count.
… knee Temporary, for benchmarking only. Repacks the same provenance weights into 5 integration shards instead of 6 (manifest shardTotal and matrix set to 5, Check gate updated to match) so the 5 vs 6 trade-off can be measured with the same weights, unit split, and E2E split. To be reverted to 6 after the run is captured.
…-router E2E) Runner minutes are free on this public repo, so the objective is pure wall-clock. Attack the whole critical-path cluster at once: integration to 8 shards (~63s test load each, near the per-file floor), unit to 3, and the app-router E2E project to 3-way so none of them becomes the new ceiling once the others drop. Report job left as-is. Benchmarking only; final counts settle after the run lands.
Public repo, so runner minutes are free and the objective is pure wall-clock. At 10 shards each integration shard carries ~51s of test load; combined with the serial report tail this brings the integration critical path down to roughly where the un-shardable create-next-app (windows) job sits, so additional shards stop moving the overall wall. Keeps unit at 3 shards and the app-router E2E project at 3-way from the prior step. Benchmarking continues; counts can still change.
|
@codex adversarial review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a31c99c4fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const shardTotalRaw = args.find((a) => a.startsWith("--shard-total=")); | ||
| const shardTotal = shardTotalRaw | ||
| ? Number.parseInt(shardTotalRaw.slice("--shard-total=".length), 10) | ||
| : 6; |
There was a problem hiding this comment.
Set refresh default to the 10-shard manifest
When maintainers follow the documented refresh command in .github/workflows/ci.yml without passing --shard-total, this fallback writes shardTotal: 6 into scripts/ci-integration-timings.json even though the CI matrix and manifest check now require 10 shards. The next run then fails the Verify integration shard manifest step with shard-count drift, so refreshing timings via the advertised workflow produces an unusable manifest unless the operator knows to add an undocumented flag.
Useful? React with 👍 / 👎.
…constant ci-integration-timings-refresh.mjs defaulted --shard-total to a hardcoded 6. The documented refresh command in ci.yml omits --shard-total, so once the matrix moved past 6 shards, following the advertised workflow rewrote shardTotal: 6 into the manifest and the next run failed the Verify integration shard manifest step with shard-count drift. Default to the current manifest's shardTotal instead. manifest.shardTotal is the single source of truth for the count: the matrix mirrors it and --check enforces no drift, so a plain refresh now preserves whatever the matrix uses. An explicit --shard-total still overrides it for an intentional count change, and a missing count with no existing manifest now fails with a clear message instead of silently picking a number. Found by Codex review on a31c99c.
The invalid --shard-total message reconstructed the flag as
'--shard-total=<value>', printing a doubled prefix
('Invalid --shard-total: --shard-total=abc'). parseFlag already
returns just the value, so print it directly to match the planner
CLI's wording.
|
2m32s @james-elicx |
The integration shard check fails closed when a discovered file is missing from the timing manifest, so adding one integration test reds CI until someone hand-refreshes scripts/ci-integration-timings.json. The per-file weights are only a load-balancing hint: a missing or stale weight costs a little shard balance, never test correctness or coverage. Gating on a freshness signal blocks contributors (and forks, which run the secret-free ci.yml against the committed manifest) for an imbalance worth a few seconds on one shard. checkPlan now returns warnings separately from errors. Missing and stale files become warnings; the structural invariants (schema, shard-count drift, zero discovery, dropped or duplicated file) stay fail-closed. The check job prints warnings as ::warning:: annotations and exits 0, so the plan stays valid and a maintainer refreshes the manifest at leisure. Separately, runShard packed into whatever N/M the workflow passed while only --check compared the manifest to --shard-total, so a future edit could drift the matrix count from the manifest and silently drop or double-run tests at the point tests are selected. Guard manifest.shardTotal against the requested total in runShard too, dying on a mismatch instead of producing a malformed plan.
|
@james-elicx the committed timing manifest still drifts as the suite grows, so it needs an occasional refresh. We could set it up so it runs automatically or do manual refresh, it is only for the 5%-10% since it's timing hints. |
|
/bigbonk review for issues |
There was a problem hiding this comment.
Review: CI sharding optimization
Solid, well-documented work. The packer produces a perfectly balanced plan locally (all 10 shards at 51s), the --check gate passes, and the manifest provenance design is genuinely nice. The template-injection fix (routing steps.shard.outputs.files through SHARD_FILES env instead of inline ${{ }} expansion) is correct.
A few things worth addressing before merge, in priority order:
1. No tests for ~480 lines of new pure logic. scripts/lib/integration-shard-plan.mjs (greedy + local-search packing, manifest validation, min-heap) and vitest-blob-timings.mjs (percentile/median, path normalization) are all pure and highly testable, but ship with zero tests. AGENTS.md explicitly asks for focused unit tests when moving real behavior into helper modules. Without them, a future refactor that silently breaks balancing or drops a file from a shard would not be caught — and the failure mode (tests silently not running) is exactly the kind this PR is meant to prevent. Suggest a small tests/ file asserting: every discovered file lands in exactly one shard, validateManifest rejects each malformed case, and percentile/median edge cases (single value, even/odd length).
2. Empty shard output silently runs the entire suite. See inline comment on the CI step. Not reachable at 59 files / 10 shards, but it is an unguarded foot-gun.
3. The Codex P2 comment about --shard-total defaulting to 6 is stale — it reviewed a31c99c4fd. On the current commit, ci-integration-timings-refresh.mjs defaults to the existing manifest's shardTotal (10), so the documented refresh command produces a correct manifest. No action needed; flagging so it does not block.
Minor notes inline.
Nothing here is blocking correctness for the current file set; #1 is the one I would most want before merge given the stated goal of preventing silent test-scheduling drift.
| vp test run --project integration $COVERAGE \ | ||
| --reporter=blob \ | ||
| --outputFile.blob=".vitest-reports/blob-${SHARD_INDEX}-${SHARD_TOTAL}.json" \ | ||
| $SHARD_FILES |
There was a problem hiding this comment.
If SHARD_FILES is ever empty, this falls through to vp test run --project integration with no file filter, which runs the entire integration suite in that shard — silently, with a green check. It is not reachable today (59 files / 10 shards), but it becomes reachable if the planner ever emits an empty shard (e.g. shardTotal > file count after a future trim, or a packing bug). Consider failing closed when the computed list is empty:
if [ -z "${SHARD_FILES// /}" ]; then
echo "::error::shard ${SHARD_INDEX}/${SHARD_TOTAL} resolved to no files"; exit 1
fiThis keeps a mis-scheduled shard from masquerading as a fast pass while another shard double-runs.
| export function extractDurations(testFiles) { | ||
| return testFiles.map((testFile) => ({ | ||
| file: normalizeTestPath(testFile.filepath), | ||
| durationMs: testFile.result?.duration ?? 0, |
There was a problem hiding this comment.
durationMs: testFile.result?.duration ?? 0 silently turns a missing duration into a 0ms sample, which would pull p75 down and under-weight a file in the planner. Since this feeds a committed weight, a 0 (or missing-result) sample is more likely a malformed blob than a genuinely instant file. Consider skipping such entries (and warning) rather than recording 0, so a bad blob does not quietly corrupt a weight.
| console.error(`No samples for ${uncovered.length} discovered file(s):`); | ||
| for (const c of uncovered) console.error(` ${c.file}`); | ||
| die("Every discovered file needs at least one sample to compute a weight."); | ||
| } |
There was a problem hiding this comment.
With --allow-partial, a file may have fewer samples than runIds.length, yet generatedFrom.runs still lists every --run id. The per-file samples count stays honest, but generatedFrom.runs then overclaims which runs actually backed a given file's weight. Minor, and arguably acceptable for the re-run case, but worth a one-line note in the manifest or a comment so a future reader does not assume every listed run contributed to every file.
|
Review posted successfully. Here's a summary of what I reviewed and found. Review summary — PR #1768 (ci: optimize CI sharding)I checked out the branch, ran the tooling, and verified behavior end-to-end. The implementation is solid: the packer produces a perfectly balanced 10-shard plan (all shards at 51s), the Findings posted (1 review, 3 inline comments):
Noted as non-blocking: The Codex P2 comment about I also confirmed the local-search loop terminates (strict integer makespan decrease) and the matrix/manifest/check shard-count values are consistently fail-closed. |
Why do we need a manifest for running test files? Won't that be inconvenient whenever we add new test files? |
timing hint so the shards will automatically be balanced for speedmaxxing. It's just hints which affect 10% of the speed at most, and it won't materially change when adding a few new tests. Only hassle is needing to refresh it like once a month |
What this changes
Replaces the integration suite's native Vitest sharding with balanced, provenance-backed sharding, and tunes the CI matrix for wall-clock.
--shard(which splits by file count and lets a few heavy files dominate one shard).scripts/ci-integration-timings.json), regenerated from real CI runs with full provenance: median, p75, sample count, and the source run IDs.No test behavior, timeouts, reporters, or coverage policy change.
Why
CI wall-clock is the contributor feedback bottleneck, and integration was its slowest stage. Native sharding could not balance it: Vitest shards by file, and several integration files are large, indivisible scheduling units.
This repo is public, so GitHub-hosted runner minutes are free; the objective is wall-clock, not compute. Measured runs:
* At 10 shards the integration critical path (slowest shard plus the serial report job) is ~2m08s, the lowest measured. The remaining wall is gated by create-next-app on
windows-latest, a single un-shardable Windows job that ran 2m15s to 2m45s across these runs, not by anything sharding controls. Lowering the wall further means changing that job and the report tail, which is left to a follow-up.Method
version: 2) records, per file, theestimateMsthe planner uses, plusmedianMs,p75Ms,samples, and the source CI runs. The planner reads onlyestimateMs; the rest is provenance a reviewer can trust.scripts/ci-integration-timings-refresh.mjs) regenerates the manifest from Vitest blob reports downloaded from successful CI runs (gh run download -p 'blob-report-*'), using p75 per file. It fails closed if the blobs do not cover every discovered file, or do not back the claimed source runs (each file must have exactly one sample per--run).scripts/ci-integration-shard.mjs+scripts/lib/*) packs files with longest-processing-time greedy plus a deterministic local search.--checkis the CI gate and fails closed on shard-count drift, missing or stale files, malformed or non-positive weights, and bucket-coverage errors.--recommendadvises on shard count from the real weights.manifest.shardTotaldrives the matrix,--checkenforces no drift, and refresh inherits the count from the manifest rather than a constant.Validation
vp run check(format, lint, type) andvp run knippass.--checkpasses at 10 shards and fails on drift; all 59 integration files map to exactly one shard with no duplicates.Follow-ups