Night Watch Vibe Test Runner
Assigned to: Cindy's Navi (cixzhang)
Goal: Run the nightly Astryx vibe test suite with fair, isolated environments.
Frequency: Once per night (not every hour like other roles).
NAVI (orchestrator)
│
├─ Phase 0: Preflight — verify CLI + environment
│ pnpm install, pnpm build (packages/cli)
│ Verify npx xds --help works in a test project dir
│
├─ Phase 1: Setup + Generate code
│ setup-nightly.mjs → creates per-agent isolated project dirs
│ Spawn 40 sub-agents: 10 Astryx + 10 Astryx+TW + 10 baseline + 10 HTML
│ Each agent works in its own project directory (no cross-contamination)
│ Output: .tsx/.json in each agent's project dir
│
├─ Phase 1.5: Collect results
│ collect-results.mjs → copies .tsx/.json to results/<iter>/results/
│
├─ Phase 2: Build previews + tsc type-checking
│ build-previews.ts → self-contained HTML with inlined CSS + build-errors.json
│ Produces per-prompt correctness data (tsc errors) for debugging
│
├─ Phase 3: Evaluate
│ universal-eval.ts → 5-dimension scoring
│ universal-compare.ts → cross-target comparison
│
├─ Phase 4: Report + Persist
│ Deploy report to gh-pages
│ Commit results to repo (branch: vibe-test/nightly-YYYY-MM-DD)
│ Append scores to the rolling wiki ledger (Vibe-Test-Scores)
│
└─ Phase 5: Verify
Confirm the wiki ledger has the new date row in both tables
Confirm report is accessible at gh-pages URL
Before running, verify these 5 invariants (see internal/vibe-tests/README.md):
- Fair evaluators — same scoring logic across targets (target-aware counting is OK)
- Only the system varies — same prompts, no system-specific coaching rules
- Never leak the answer — `expectedComponents` never appears in agent prompts
- Representative environment — each agent sees what a real consumer would see
- Context-free agents — fresh spawn per prompt, no inherited knowledge
Each agent gets an isolated project directory cloned from environments/project-{target}/:
| Target | Agent sees | Discovery path |
|---|---|---|
| Astryx | `package.json` + `node_modules/@xds/core/` (symlinked to real source) + working CLI | `ls` → `npx xds --help` or `node_modules/@xds/core/README.md` → `docs.mjs` → component docs |
| Astryx+TW | Same as Astryx + Tailwind CSS available | Same discovery + Tailwind utility classes |
| Baseline | `package.json` + `components/ui/*.tsx` + `lib/` + `README.md` | `ls` → `README.md` → `cat components/ui/button.tsx` → real shadcn source |
| HTML | `package.json` only (bare React project) | No design system — agent uses plain HTML + inline CSS |
Agent prompts say: "Your project is at <path>. Explore it to find how to look up component docs." No README paths, no CLI commands, no component names are given.
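As a rough illustration of the prompt rule above, the sketch below builds a discovery-only prompt and flags any hint that should not be in it. The helper names, the sample project path, the task text, and the idea of appending the task text after the two fixed sentences are all assumptions made for this example; the real prompt builder may differ.

```javascript
// Hypothetical helper: builds the discovery-only prompt described above.
// Appending the task text after a blank line is an assumption.
function buildAgentPrompt(projectPath, taskText) {
  return [
    `Your project is at ${projectPath}.`,
    "Explore it to find how to look up component docs.",
    "",
    taskText,
  ].join("\n");
}

// Hypothetical guard: returns every forbidden hint found in the prompt
// (case-insensitive substring match).
function findLeakedHints(prompt, forbidden) {
  const lower = prompt.toLowerCase();
  return forbidden.filter((hint) => lower.includes(hint.toLowerCase()));
}

// Made-up path and task text, for illustration only.
const samplePrompt = buildAgentPrompt(
  "/tmp/projects/agent-01",
  "Build a settings form with a save button."
);
console.log(findLeakedHints(samplePrompt, ["npx xds", "README.md", "expectedComponents"]));
```

An empty result means none of the listed hints appear in the prompt. The list of forbidden strings would need to be extended per target (for example with component names).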
# Ensure repo is up to date
cd /vercel/sandbox/repos/xds
git fetch origin main && git checkout origin/main
# Install dependencies
pnpm install
# Build the CLI package (required for npx xds to work in agent project dirs)
pnpm --filter @xds/cli build
# Verify CLI works
cd internal/vibe-tests
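# --input-type=module runs the inline script as an ES module so top-level await works
node --input-type=module -e "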
const {createAgentProject} = await import('./src/setup-environment.mjs');
const fs = await import('fs');
const path = await import('path');
const {execSync} = await import('child_process');
const testDir = '/tmp/vibe-preflight-test';
fs.mkdirSync(testDir + '/results/test/projects/preflight', {recursive: true});
// Just verify the symlink target exists and CLI binary is accessible
const cliPath = path.join('/vercel/sandbox/repos/xds/packages/cli/bin/xds.mjs');
if (!fs.existsSync(cliPath)) { console.error('CLI not found at', cliPath); process.exit(1); }
console.log('✓ CLI binary exists');
try { execSync('node ' + cliPath + ' --help', {stdio: 'pipe'}); console.log('✓ CLI runs'); }
catch(e) { console.error('✗ CLI failed:', e.message); process.exit(1); }
"

If preflight fails: STOP. Do not proceed. Log the error in state and report in the daily note. The CLI must work for Astryx agents to discover component APIs.
# Run setup (creates iterations + per-agent project dirs)
node internal/vibe-tests/src/setup-nightly.mjs --sample 10

This outputs:
- 4 iteration IDs (xds, xds-tailwind, baseline, html)
- 40 task files (10 per target)
- 40 isolated project directories (one per agent)
Before spawning: Read the generated task files and verify checker protocol:
- Same task text across all targets ✓
- No expectedComponents in any prompt ✓
- No system-specific coaching rules ✓
- Each project dir contains only what a real consumer would see ✓
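Two of the checks in this list (same task text across targets, no `expectedComponents` in any prompt) lend themselves to automation. The sketch below assumes each task file has been parsed into an object with `promptId`, `target`, `taskText`, and `prompt` fields; that shape is hypothetical and not taken from `setup-nightly.mjs`.

```javascript
// Assumed task shape (hypothetical): { promptId, target, taskText, prompt }.
function checkTasks(tasks) {
  const problems = [];
  const textByPrompt = new Map();
  for (const task of tasks) {
    // The answer key must never appear in what the agent reads.
    if (task.prompt.includes("expectedComponents")) {
      problems.push(`${task.target}/${task.promptId}: prompt leaks expectedComponents`);
    }
    // Every target must get the same task text for a given prompt ID.
    const seen = textByPrompt.get(task.promptId);
    if (seen === undefined) {
      textByPrompt.set(task.promptId, task.taskText);
    } else if (seen !== task.taskText) {
      problems.push(`${task.target}/${task.promptId}: task text differs across targets`);
    }
  }
  return problems;
}

// Made-up sample data: the baseline task text differs by one character.
const sampleTasks = [
  { promptId: "fwc-6", target: "xds", taskText: "Build a form.", prompt: "Your project is at /a. Build a form." },
  { promptId: "fwc-6", target: "baseline", taskText: "Build a form!", prompt: "Your project is at /b. Build a form!" },
];
console.log(checkTasks(sampleTasks));
```

The remaining two checks (no coaching rules, representative project contents) still need a human or agent to read the files.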
Astryx agent verification: For at least one Astryx project dir, verify:
ls <project-dir>/node_modules/.bin/xds # symlink exists
ls <project-dir>/node_modules/@xds/core/README.md # docs accessible

Spawn 40 sub-agents — each works in its own project directory on the Astryx sandbox node. Agents write output (.tsx + .json) to their project directory.
After all agents complete, copy output to the standard results path:
cd internal/vibe-tests
node src/collect-results.mjs <xds-iteration-id>
node src/collect-results.mjs <xds-tailwind-iteration-id>
node src/collect-results.mjs <baseline-iteration-id>
node src/collect-results.mjs <html-iteration-id>

Check completeness: Each iteration should have 10 .tsx + 10 .json files. If any are missing, log which prompts failed and why (agent timeout, crash, etc.).
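The completeness check could be sketched as a small pure function over a directory listing. It assumes output files are named `<promptId>.tsx` and `<promptId>.json`, which matches the `fwc-6.tsx` style seen in the tsc output but is not confirmed by `collect-results.mjs`; the function name is hypothetical.

```javascript
// Returns, per prompt ID, which expected output files are absent.
// Assumes files are named "<promptId>.tsx" and "<promptId>.json".
function findMissingOutputs(fileNames, promptIds) {
  const present = new Set(fileNames);
  const missing = [];
  for (const id of promptIds) {
    const absent = [".tsx", ".json"].filter((ext) => !present.has(id + ext));
    if (absent.length > 0) missing.push({ promptId: id, absent });
  }
  return missing;
}

// Made-up listing: sd-1 produced a component but no result JSON.
const sampleFiles = ["fwc-6.tsx", "fwc-6.json", "sd-1.tsx"];
console.log(findMissingOutputs(sampleFiles, ["fwc-6", "sd-1"]));
```

In practice the listing would come from reading the iteration's results directory, and the returned entries are what gets logged as missing prompts.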
# Run tsc AND build previews (not --tsc-only — we want the HTML previews for debugging)
npx tsx src/build-previews.ts --iterations "<xds>,<xds-tw>,<baseline>,<html>"

This produces per-iteration:
- `build-errors.json` — per-prompt tsc error details (THE key debugging artifact)
- `previews/` — standalone HTML pages for each generated component
Read build-errors.json for each iteration. Record:
- Total error count
- Which prompts are clean vs have errors
- The actual error messages (for the GH issue)
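The three items to record can be derived in one pass over the parsed file. The sketch below assumes `build-errors.json` maps each prompt ID to `{errorCount, errors[]}`, mirroring the `correctnessDetails` shape used in the state file; the real file layout may differ, and the function name and the `nav-2` prompt ID are hypothetical.

```javascript
// Assumed shape (hypothetical): { [promptId]: { errorCount, errors: string[] } }.
function summarizeBuildErrors(buildErrors) {
  const summary = { totalErrors: 0, clean: [], failing: [], messages: [] };
  for (const [promptId, entry] of Object.entries(buildErrors)) {
    summary.totalErrors += entry.errorCount;
    if (entry.errorCount === 0) {
      summary.clean.push(promptId);
    } else {
      summary.failing.push(promptId);
      summary.messages.push(...entry.errors);
    }
  }
  return summary;
}

// Sample data for illustration only.
const sampleBuildErrors = {
  "fwc-6": {
    errorCount: 1,
    errors: [`fwc-6.tsx(12,5): error TS2322: Type '"primary"' is not assignable to type '"filled" | "outlined" | "ghost"'.`],
  },
  "sd-1": {
    errorCount: 1,
    errors: ["sd-1.tsx(3,10): error TS2305: Module '@xds/core/Spinner' has no exported member 'Spinner'."],
  },
  "nav-2": { errorCount: 0, errors: [] },
};
console.log(summarizeBuildErrors(sampleBuildErrors));
```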
npx tsx src/universal-aggregate.ts --iteration <xds-id>
npx tsx src/universal-aggregate.ts --iteration <xds-tw-id>
npx tsx src/universal-aggregate.ts --iteration <baseline-id>
npx tsx src/universal-aggregate.ts --iteration <html-id>
npx tsx src/universal-compare.ts --astryx <xds-id> --baseline <baseline-id> --html <html-id>
npx tsx src/universal-compare.ts --astryx <xds-id> --baseline <xds-tw-id>

npx tsx src/deploy-report.ts --iteration <xds-id> --baseline <baseline-id> --html <html-id> --xds-tailwind <xds-tw-id>

git checkout -b vibe-test/nightly-$(date +%Y-%m-%d)
git add internal/vibe-tests/results/
git commit -m "vibe-test: nightly results $(date +%Y-%m-%d)"
git push origin HEAD

Scores are NO LONGER filed as a new GitHub issue per night. They are appended as one row per run to the rolling wiki ledger: Vibe-Test-Scores.
# Clone the wiki with push auth (repo scope covers the wiki repo)
TOKEN=$(gh auth token)
git clone "https://x-access-token:${TOKEN}@github.com/facebook/astryx.wiki.git" /tmp/astryx-wiki
cd /tmp/astryx-wiki
# Append ONE row to the "## Overall" table (newest at the bottom):
# | <date> | <astryx> | <astryx+tw or —> | <baseline> | <html> | <winner> | [#<run-issue?>](...) |
# And ONE row to the "## Astryx dimension breakdown" table:
# | <date> | <overall> | <correctness> | <a11y> | <code-quality> | <efficiency> | <maintainability> | ... |
# Insert each row immediately before the next blank line / section after the table.
# Use `—` for any target not run. Keep the 0–100 scale. Do not reformat existing rows.
git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
commit -am "vibe-test: scores $(date +%Y-%m-%d)"
git push origin master

Optional run record: the per-run detail (per-prompt table, tsc errors, iteration IDs) does NOT need its own issue. If you want a durable artifact, keep the results commit on the `vibe-test/nightly-<date>` branch (pushed above) and link its commit from the wiki row. Do not `gh issue create` for scores.
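The row-append rule from the wiki step (newest row at the bottom, inserted immediately before the blank line or section that follows the table, existing rows untouched) could be sketched as below. The function name is hypothetical, and the sample ledger uses simplified, made-up columns and scores rather than the real page layout.

```javascript
// Inserts `row` directly after the last line of the first table under `heading`.
function appendRowToTable(markdown, heading, row) {
  const lines = markdown.split("\n");
  const start = lines.indexOf(heading);
  if (start === -1) throw new Error(`heading not found: ${heading}`);
  // Skip forward to the first table line, without crossing into another section.
  let i = start + 1;
  while (i < lines.length && !lines[i].startsWith("|")) {
    if (lines[i].startsWith("## ")) break;
    i++;
  }
  if (i >= lines.length || !lines[i].startsWith("|")) {
    throw new Error(`no table under: ${heading}`);
  }
  // Advance past every contiguous table line; `i` stops at the blank line,
  // the next section, or the end of the file.
  while (i < lines.length && lines[i].startsWith("|")) i++;
  lines.splice(i, 0, row);
  return lines.join("\n");
}

// Simplified, made-up ledger for illustration.
const sampleLedger = [
  "## Overall",
  "",
  "| Date | Astryx | Baseline |",
  "|---|---|---|",
  "| 2025-01-01 | 80 | 75 |",
  "",
  "## Astryx dimension breakdown",
  "",
  "| Date | Overall | Correctness |",
  "|---|---|---|",
  "| 2025-01-01 | 80 | 90 |",
].join("\n");

const updatedLedger = appendRowToTable(sampleLedger, "## Overall", "| 2025-01-02 | 82 | 77 |");
console.log(updatedLedger);
```

Because only one line is spliced in, existing rows are never reformatted. The same function would be called a second time for the dimension breakdown table.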
API concerns are tracked in ONE rolling issue: Vibe Test — API Concerns (#3164) — never a new issue. Only the Vibe Test Debugger (the 6 AM fix job) posts here, after it confirms a concern reproduces. The Runner does not post API concerns; it only records scores. (Correctness/tsc details stay with the results branch for the Debugger to analyze.)
- Confirm the wiki ledger updated: re-fetch `Vibe-Test-Scores.md` and verify the new date row is present in both tables with sane 0–100 values.
- Confirm the wiki page is reachable: `curl -s -o /dev/null -w "%{http_code}" https://github.com/facebook/astryx/wiki/Vibe-Test-Scores` returns `200`.
- If verification fails, retry the wiki push once. If still failing, log the error in state (do NOT fall back to filing a scores issue).
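The ledger check in the first bullet could look roughly like this once `Vibe-Test-Scores.md` has been re-fetched as a string. The function names, the `scoreCount` parameter (how many cells after the date hold scores), and the sample page with its headers and values are assumptions for illustration; the only marker taken from the runbook is `—` for a target that was not run.

```javascript
// Returns the trimmed cells of the row under `heading` whose first cell is `date`, or null.
function findLedgerRow(markdown, heading, date) {
  const lines = markdown.split("\n");
  const start = lines.indexOf(heading);
  if (start === -1) return null;
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) break; // reached the next section
    if (!lines[i].startsWith("|")) continue;
    const cells = lines[i].split("|").slice(1, -1).map((c) => c.trim());
    if (cells[0] === date) return cells;
  }
  return null;
}

// A score cell is sane if it is the "not run" marker or a number in 0..100.
function isSaneScore(cell) {
  if (cell === "—") return true;
  if (!/^\d+(\.\d+)?$/.test(cell)) return false;
  const value = Number(cell);
  return value >= 0 && value <= 100;
}

// True when the dated row exists and its first `scoreCount` cells after the date are sane.
function verifyLedgerRow(markdown, heading, date, scoreCount) {
  const cells = findLedgerRow(markdown, heading, date);
  if (cells === null || cells.length < scoreCount + 1) return false;
  return cells.slice(1, scoreCount + 1).every(isSaneScore);
}

// Made-up page content for illustration.
const samplePage = [
  "## Overall",
  "",
  "| Date | Astryx | Astryx+TW | Baseline | HTML | Winner | Run |",
  "|---|---|---|---|---|---|---|",
  "| 2025-01-02 | 82 | — | 77 | 64 | Astryx | |",
].join("\n");
console.log(verifyLedgerRow(samplePage, "## Overall", "2025-01-02", 4));
```

The check would be run once per table, with `scoreCount` set to the number of score columns in that table.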
When correctness < 100%, include this section in the GH issue:
### Correctness Failures
| Prompt | Target | Errors | Root Cause |
|--------|--------|--------|------------|
| fwc-6 | Astryx | 3 | Wrong prop type for `variant` — passed "primary" but Astryx expects "filled" |
| sd-1 | Astryx | 5 | Used `Spinner` (doesn't exist), should be `ProgressCircle` |
<details>
<summary>Full tsc errors (Astryx iteration)</summary>
**fwc-6.tsx:**
fwc-6.tsx(12,5): error TS2322: Type '"primary"' is not assignable to type '"filled" | "outlined" | "ghost"'.
**sd-1.tsx:**
sd-1.tsx(3,10): error TS2305: Module '@xds/core/Spinner' has no exported member 'Spinner'.
</details>
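The table portion of this template could be generated from per-prompt failure records, as in the sketch below. The record shape (`promptId`, `target`, `errors`, `rootCause`) and the function name are hypothetical. Root causes are passed in rather than inferred, since they come from whoever triages the errors, and pipe characters are escaped because tsc messages often contain union types.

```javascript
// Renders the "Correctness Failures" heading and table from failure records.
// Assumed record shape (hypothetical): { promptId, target, errors: string[], rootCause }.
function renderCorrectnessTable(failures) {
  const header = [
    "### Correctness Failures",
    "| Prompt | Target | Errors | Root Cause |",
    "|--------|--------|--------|------------|",
  ];
  const rows = failures.map((f) => {
    // Escape "|" so union types in the text do not split the table cell.
    const cause = f.rootCause.replace(/\|/g, "\\|");
    return `| ${f.promptId} | ${f.target} | ${f.errors.length} | ${cause} |`;
  });
  return [...header, ...rows].join("\n");
}

// Made-up record for illustration.
const sampleFailures = [
  {
    promptId: "fwc-6",
    target: "Astryx",
    errors: ["fwc-6.tsx(12,5): error TS2322"],
    rootCause: 'Expected "filled" | "outlined"',
  },
];
console.log(renderCorrectnessTable(sampleFailures));
```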
| Failure | Recovery |
|---|---|
| CLI not built / `npx xds` fails | Run `pnpm --filter @xds/cli build` first. If that fails, skip Astryx iteration and report. |
| Agent didn't read docs (correctness tanks) | Check if docsRead in result .json is empty — indicates discovery failure |
| Symlinks broken | Re-run setup-nightly.mjs — it recreates project dirs fresh |
| Some agents timed out | Log missing prompts, proceed with partial results. Note in issue body. |
| `gh issue create` body mismatch | Always use `--body-file`. Read back the issue after creation to verify. |
| Results not in repo | The branch push failed — try again. Results MUST be persisted. |
These known biases slightly favor the baseline, which makes Astryx wins more credible:
- Efficiency: Tailwind's single-line `className` gets a lower styling ratio than Astryx's multi-line `stylex.create`, despite encoding more decisions
- Maintainability: Tailwind scale values (`p-4`, `text-sm`) get generous semantic credit
- Baseline docs: Hand-written README (more guidance than a real shadcn project would have)
Track in memory/xds-night-watch-state.json under vibeTestRunner key:
- `lastRun`: ISO timestamp
- `lastIterations`: `{xds, "xds-tailwind", baseline, html}` iteration IDs
- `lastIssue`: GitHub issue number (MUST be set — if null, the run didn't complete properly)
- `results`: per-target scores
- `correctnessDetails`: `{promptId: {errorCount, errors[]}}` for debugging
- `note`: human-readable summary of what happened