
Night Watch Vibe Test Runner

Cindy Zhang edited this page Jun 26, 2026 · 2 revisions

Night Watch: Vibe Test Runner Role

Assigned to: Cindy's Navi (cixzhang)

Goal: Run the nightly Astryx vibe test suite with fair, isolated environments.

Frequency: Once per night (not every hour like other roles).


Architecture

NAVI (orchestrator)
  │
  ├─ Phase 0: Preflight — verify CLI + environment
  │   pnpm install, pnpm build (packages/cli)
  │   Verify npx xds --help works in a test project dir
  │
  ├─ Phase 1: Setup + Generate code
  │   setup-nightly.mjs → creates per-agent isolated project dirs
  │   Spawn 40 sub-agents: 10 Astryx + 10 Astryx+TW + 10 baseline + 10 HTML
  │   Each agent works in its own project directory (no cross-contamination)
  │   Output: .tsx/.json in each agent's project dir
  │
  ├─ Phase 1.5: Collect results
  │   collect-results.mjs → copies .tsx/.json to results/<iter>/results/
  │
  ├─ Phase 2: Build previews + tsc type-checking
  │   build-previews.ts → self-contained HTML with inlined CSS + build-errors.json
  │   Produces per-prompt correctness data (tsc errors) for debugging
  │
  ├─ Phase 3: Evaluate
  │   universal-eval.ts → 5-dimension scoring
  │   universal-compare.ts → cross-target comparison
  │
  ├─ Phase 4: Report + Persist
  │   Deploy report to gh-pages
  │   Commit results to repo (branch: vibe-test/nightly-YYYY-MM-DD)
  │   Append scores to the wiki ledger (Vibe-Test-Scores), not a new issue
  │
  └─ Phase 5: Verify
      Confirm the wiki ledger has the new date row in both tables
      Confirm the wiki page is reachable (HTTP 200)

Checker Protocol

Before running, verify these 5 invariants (see internal/vibe-tests/README.md):

  1. Fair evaluators — same scoring logic across targets (target-aware counting is OK)
  2. Only the system varies — same prompts, no system-specific coaching rules
  3. Never leak the answer: expectedComponents never appears in agent prompts
  4. Representative environment — each agent sees what a real consumer would see
  5. Context-free agents — fresh spawn per prompt, no inherited knowledge

Agent Environments

Each agent gets an isolated project directory cloned from environments/project-{target}/:

| Target | Agent sees | Discovery path |
|--------|------------|----------------|
| Astryx | package.json + node_modules/@xds/core/ (symlinked to real source) + working CLI | ls → npx xds --help or node_modules/@xds/core/README.md → docs.mjs → component docs |
| Astryx+TW | Same as Astryx + Tailwind CSS available | Same discovery + Tailwind utility classes |
| Baseline | package.json + components/ui/*.tsx + lib/ + README.md | ls → README.md → cat components/ui/button.tsx → real shadcn source |
| HTML | package.json only (bare React project) | No design system; agent uses plain HTML + inline CSS |

Agent prompts say: "Your project is at <path>. Explore it to find how to look up component docs." No README paths, no CLI commands, no component names are given.


Nightly Checklist

Phase 0: Preflight (CRITICAL — do not skip)

# Ensure repo is up to date
cd /vercel/sandbox/repos/xds
git fetch origin main && git checkout origin/main

# Install dependencies
pnpm install

# Build the CLI package (required for npx xds to work in agent project dirs)
pnpm --filter @xds/cli build

# Verify CLI works
cd internal/vibe-tests
# --input-type=module makes the top-level await in the -e script valid
node --input-type=module -e "
const {createAgentProject} = await import('./src/setup-environment.mjs');
const fs = await import('fs');
const path = await import('path');
const {execSync} = await import('child_process');
const testDir = '/tmp/vibe-preflight-test';
fs.mkdirSync(testDir + '/results/test/projects/preflight', {recursive: true});
// Just verify the symlink target exists and CLI binary is accessible
const cliPath = path.join('/vercel/sandbox/repos/xds/packages/cli/bin/xds.mjs');
if (!fs.existsSync(cliPath)) { console.error('CLI not found at', cliPath); process.exit(1); }
console.log('✓ CLI binary exists');
try { execSync('node ' + cliPath + ' --help', {stdio: 'pipe'}); console.log('✓ CLI runs'); }
catch(e) { console.error('✗ CLI failed:', e.message); process.exit(1); }
"

If preflight fails: STOP. Do not proceed. Log the error in state and report in the daily note. The CLI must work for Astryx agents to discover component APIs.

Phase 1: Setup + Generate Code

# Run setup (creates iterations + per-agent project dirs)
node internal/vibe-tests/src/setup-nightly.mjs --sample 10

This outputs:

  • 4 iteration IDs (xds, xds-tailwind, baseline, html)
  • 40 task files (10 per target)
  • 40 isolated project directories (one per agent)

Before spawning: Read the generated task files and verify the checker protocol:

  • Same task text across all targets ✓
  • No expectedComponents in any prompt ✓
  • No system-specific coaching rules ✓
  • Each project dir contains only what a real consumer would see ✓
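
The first two checks in this list can be automated before spawning. The following is a minimal sketch, assuming each task file has been loaded into an object with the hypothetical fields `target`, `promptId`, `prompt`, and `taskText`; the real task file layout produced by setup-nightly.mjs may differ, and the coaching-rule and project-dir checks still need a human read.

```typescript
// Hypothetical task shape; the real task files may use different field names.
interface Task {
  target: string;   // e.g. "xds", "xds-tailwind", "baseline", "html"
  promptId: string; // e.g. "fwc-6"
  prompt: string;   // full text handed to the sub-agent
  taskText: string; // the target-independent task description
}

// Returns human-readable violations; an empty list means both checks passed.
function checkTasks(tasks: Task[]): string[] {
  const violations: string[] = [];
  const textByPrompt = new Map<string, string>();

  for (const task of tasks) {
    // The answer key must never reach the agent.
    if (task.prompt.includes("expectedComponents")) {
      violations.push(`${task.target}/${task.promptId}: prompt mentions expectedComponents`);
    }
    // Every target must receive the same task text for a given prompt.
    const seen = textByPrompt.get(task.promptId);
    if (seen === undefined) {
      textByPrompt.set(task.promptId, task.taskText);
    } else if (seen !== task.taskText) {
      violations.push(`${task.target}/${task.promptId}: task text differs from other targets`);
    }
  }
  return violations;
}
```

If the returned list is non-empty, treat it like a preflight failure and do not spawn agents.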

Astryx agent verification: For at least one Astryx project dir, verify:

ls <project-dir>/node_modules/.bin/xds  # symlink exists
ls <project-dir>/node_modules/@xds/core/README.md  # docs accessible

Spawn 40 sub-agents — each works in its own project directory on the Astryx sandbox node. Agents write output (.tsx + .json) to their project directory.

Phase 1.5: Collect Results

After all agents complete, copy output to the standard results path:

cd internal/vibe-tests
node src/collect-results.mjs <xds-iteration-id>
node src/collect-results.mjs <xds-tailwind-iteration-id>
node src/collect-results.mjs <baseline-iteration-id>
node src/collect-results.mjs <html-iteration-id>

Check completeness: Each iteration should have 10 .tsx + 10 .json files. If any are missing, log which prompts failed and why (agent timeout, crash, etc.).
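
The completeness check can be sketched as a pure function over a directory listing. This assumes outputs are named `<promptId>.tsx` and `<promptId>.json` (consistent with file names such as fwc-6.tsx used elsewhere on this page); `findMissingOutputs` is a hypothetical helper, not part of the repo.

```typescript
interface MissingOutput {
  promptId: string;
  missing: string[]; // extensions that were not found, e.g. [".json"]
}

// fileNames: names found in one iteration's results directory.
// promptIds: the prompt IDs sampled for this run.
function findMissingOutputs(fileNames: string[], promptIds: string[]): MissingOutput[] {
  const present = new Set(fileNames);
  const report: MissingOutput[] = [];
  for (const promptId of promptIds) {
    const missing = [".tsx", ".json"].filter((ext) => !present.has(promptId + ext));
    if (missing.length > 0) {
      report.push({ promptId, missing });
    }
  }
  return report;
}
```

An empty result means the iteration is complete; otherwise each entry names a prompt to log along with the reason it failed.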

Phase 2: Build Previews + Type Checking

# Run tsc AND build previews (not --tsc-only — we want the HTML previews for debugging)
npx tsx src/build-previews.ts --iterations "<xds>,<xds-tw>,<baseline>,<html>"

This produces per-iteration:

  • build-errors.json — per-prompt tsc error details (THE key debugging artifact)
  • previews/ — standalone HTML pages for each generated component

Read build-errors.json for each iteration. Record:

  • Total error count
  • Which prompts are clean vs have errors
  • The actual error messages (for the GH issue)
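
Recording these three items might look like the sketch below. The shape of build-errors.json is not documented on this page, so the `BuildErrors` type (prompt ID mapped to a list of error messages) is an assumption; adjust it to the real file before relying on it.

```typescript
// Assumed shape: prompt ID -> tsc error messages for that prompt.
type BuildErrors = Record<string, string[]>;

interface BuildSummary {
  totalErrors: number;
  clean: string[];   // prompts with zero errors
  failing: string[]; // prompts with at least one error
}

function summarizeBuildErrors(errors: BuildErrors): BuildSummary {
  const summary: BuildSummary = { totalErrors: 0, clean: [], failing: [] };
  for (const [promptId, messages] of Object.entries(errors)) {
    summary.totalErrors += messages.length;
    if (messages.length === 0) {
      summary.clean.push(promptId);
    } else {
      summary.failing.push(promptId);
    }
  }
  return summary;
}
```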

Phase 3: Evaluate

npx tsx src/universal-aggregate.ts --iteration <xds-id>
npx tsx src/universal-aggregate.ts --iteration <xds-tw-id>
npx tsx src/universal-aggregate.ts --iteration <baseline-id>
npx tsx src/universal-aggregate.ts --iteration <html-id>
npx tsx src/universal-compare.ts --astryx <xds-id> --baseline <baseline-id> --html <html-id>
npx tsx src/universal-compare.ts --astryx <xds-id> --baseline <xds-tw-id>

Phase 4: Report + Persist

Deploy report to gh-pages:

npx tsx src/deploy-report.ts --iteration <xds-id> --baseline <baseline-id> --html <html-id> --xds-tailwind <xds-tw-id>

Commit results to repo:

git checkout -b vibe-test/nightly-$(date +%Y-%m-%d)
git add internal/vibe-tests/results/
git commit -m "vibe-test: nightly results $(date +%Y-%m-%d)"
git push origin HEAD

Append scores to the wiki ledger (NOT a new issue):

Scores are NO LONGER filed as a new GitHub issue per night. They are appended as one row per run to the rolling wiki ledger: Vibe-Test-Scores.

# Clone the wiki with push auth (repo scope covers the wiki repo)
TOKEN=$(gh auth token)
git clone "https://x-access-token:${TOKEN}@github.com/facebook/astryx.wiki.git" /tmp/astryx-wiki
cd /tmp/astryx-wiki

# Append ONE row to the "## Overall" table (newest at the bottom):
#   | <date> | <astryx> | <astryx+tw or —> | <baseline> | <html> | <winner> | [#<run-issue?>](...) |
# And ONE row to the "## Astryx dimension breakdown" table:
#   | <date> | <overall> | <correctness> | <a11y> | <code-quality> | <efficiency> | <maintainability> | ... |
# Insert each row immediately before the next blank line / section after the table.
# Use `—` for any target not run. Keep the 0–100 scale. Do not reformat existing rows.

git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
  commit -am "vibe-test: scores $(date +%Y-%m-%d)"
git push origin master
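
The row-append step described in the comments above can be sketched as a string transform on the ledger's markdown. `appendTableRow` is a hypothetical helper; it assumes each table is a contiguous run of lines starting with `|` under its heading, and it never rewrites existing rows.

```typescript
// Inserts one row after the last row of the first table that follows `heading`.
function appendTableRow(markdown: string, heading: string, cells: string[]): string {
  const lines = markdown.split("\n");
  const isTableLine = (index: number): boolean => (lines[index] ?? "").trim().startsWith("|");

  const headingIndex = lines.findIndex((line) => line.trim() === heading);
  if (headingIndex === -1) throw new Error(`heading not found: ${heading}`);

  // Find the first table line that follows the heading.
  let index = headingIndex + 1;
  while (index < lines.length && !isTableLine(index)) index++;
  if (index === lines.length) throw new Error(`no table found under: ${heading}`);

  // Walk to the line just past the last row of that table.
  while (index < lines.length && isTableLine(index)) index++;

  lines.splice(index, 0, `| ${cells.join(" | ")} |`);
  return lines.join("\n");
}
```

Calling it once with "## Overall" and once with "## Astryx dimension breakdown" keeps the newest row at the bottom of each table. Note that a heading with no table of its own would match the next section's table, so check the headings exist as expected first.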

Optional run record: the per-run detail (per-prompt table, tsc errors, iteration IDs) does NOT need its own issue. If you want a durable artifact, keep the results commit on the vibe-test/nightly-<date> branch (pushed above) and link its commit from the wiki row. Do not gh issue create for scores.

Post API concerns to the rolling tracker (only if any):

API concerns are tracked in ONE rolling issue: Vibe Test — API Concerns (#3164) — never a new issue. Only the Vibe Test Debugger (the 6 AM fix job) posts here, after it confirms a concern reproduces. The Runner does not post API concerns; it only records scores. (Correctness/tsc details stay with the results branch for the Debugger to analyze.)

Phase 5: Verify

  1. Confirm the wiki ledger updated: re-fetch Vibe-Test-Scores.md and verify the new date row is present in both tables with sane 0–100 values.
  2. Confirm the wiki page is reachable: curl -s -o /dev/null -w "%{http_code}" https://github.com/facebook/astryx/wiki/Vibe-Test-Scores returns 200.
  3. If verification fails, retry the wiki push once. If still failing, log the error in state (do NOT fall back to filing a scores issue).
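
Step 1 can be approximated with a small check over the re-fetched markdown. This is a sketch, assuming the date is the first cell of each row and that any purely numeric cell is a score; `verifyLedger` and `cellsOf` are hypothetical names.

```typescript
// Split one markdown table line into trimmed cell strings.
function cellsOf(line: string): string[] {
  return line.trim().replace(/^\|/, "").replace(/\|$/, "").split("|").map((cell) => cell.trim());
}

// Returns a list of problems; an empty list means the ledger looks sane for `date`.
function verifyLedger(markdown: string, date: string, expectedRows = 2): string[] {
  const problems: string[] = [];
  const rows = markdown
    .split("\n")
    .filter((line) => line.trim().startsWith("|"))
    .map(cellsOf)
    .filter((cells) => cells[0] === date);

  // One row per table: "Overall" and "Astryx dimension breakdown".
  if (rows.length !== expectedRows) {
    problems.push(`expected ${expectedRows} rows for ${date}, found ${rows.length}`);
  }
  for (const cells of rows) {
    for (const cell of cells.slice(1)) {
      // Placeholders, winner names, and links are not numeric and are skipped.
      if (!/^-?\d+(\.\d+)?$/.test(cell)) continue;
      const value = Number(cell);
      if (value < 0 || value > 100) {
        problems.push(`${date}: score ${cell} is outside 0-100`);
      }
    }
  }
  return problems;
}
```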

Correctness Debugging Section (Issue Template)

When correctness < 100%, include this section in the GH issue:

### Correctness Failures

| Prompt | Target | Errors | Root Cause |
|--------|--------|--------|------------|
| fwc-6 | Astryx | 3 | Wrong prop type for `variant` — passed "primary" but Astryx expects "filled" |
| sd-1 | Astryx | 5 | Used `Spinner` (doesn't exist), should be `ProgressCircle` |

<details>
<summary>Full tsc errors (Astryx iteration)</summary>

**fwc-6.tsx:**

fwc-6.tsx(12,5): error TS2322: Type '"primary"' is not assignable to type '"filled" | "outlined" | "ghost"'.


**sd-1.tsx:**

sd-1.tsx(3,10): error TS2305: Module '@xds/core/Spinner' has no exported member 'Spinner'.

</details>
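
The Errors column in the table above can be derived from raw tsc output rather than counted by hand. A minimal sketch, assuming errors use the single-line `file(line,col): error TSxxxx: message` form shown in the template; `parseTscErrors` and `countByFile` are hypothetical helpers.

```typescript
interface TscError {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Matches lines such as: fwc-6.tsx(12,5): error TS2322: Type ...
const TSC_LINE = /^(.+?)\((\d+),(\d+)\): error (TS\d+): (.*)$/;

function parseTscErrors(output: string): TscError[] {
  const errors: TscError[] = [];
  for (const raw of output.split("\n")) {
    const match = TSC_LINE.exec(raw.trim());
    if (match === null) continue; // skip blank lines and continuation lines
    const [, file = "", line = "0", column = "0", code = "", message = ""] = match;
    errors.push({ file, line: Number(line), column: Number(column), code, message });
  }
  return errors;
}

// Error count per file, for the "Errors" column.
function countByFile(errors: TscError[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const error of errors) {
    counts.set(error.file, (counts.get(error.file) ?? 0) + 1);
  }
  return counts;
}
```

The Root Cause column still needs a human (or the Debugger) to read the messages; this only structures them.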

Failure Modes & Recovery

| Failure | Recovery |
|---------|----------|
| CLI not built / npx xds fails | Run pnpm --filter @xds/cli build first. If that fails, skip Astryx iteration and report. |
| Agent didn't read docs (correctness tanks) | Check if docsRead in result .json is empty; an empty value indicates discovery failure. |
| Symlinks broken | Re-run setup-nightly.mjs; it recreates project dirs fresh. |
| Some agents timed out | Log missing prompts, proceed with partial results. Note in issue body. |
| gh issue create body mismatch | Always use --body-file. Read back the issue after creation to verify. |
| Results not in repo | The branch push failed; try again. Results MUST be persisted. |

Known Accepted Asymmetries

These slightly favor baseline, making Astryx wins more credible:

  • Efficiency: Tailwind's single-line className gets a lower styling ratio than Astryx's multi-line stylex.create, despite encoding more decisions
  • Maintainability: Tailwind scale values (p-4, text-sm) get generous semantic credit
  • Baseline docs: Hand-written README (more guidance than a real shadcn project would have)

State

Track in memory/xds-night-watch-state.json under vibeTestRunner key:

  • lastRun: ISO timestamp
  • lastIterations: {xds, "xds-tailwind", baseline, html} iteration IDs
  • lastIssue: GitHub issue number (MUST be set — if null, the run didn't complete properly)
  • results: per-target scores
  • correctnessDetails: {promptId: {errorCount, errors[]}} for debugging
  • note: human-readable summary of what happened
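
For reference, the fields above could be typed roughly as follows. This is a sketch: the field names come from the list, but the value types (for example, `results` as target name mapped to a numeric score) are assumptions, and `validateState` is a hypothetical helper.

```typescript
interface VibeTestRunnerState {
  lastRun: string; // ISO timestamp
  lastIterations: { xds: string; "xds-tailwind": string; baseline: string; html: string };
  lastIssue: number | null;
  results: Record<string, number>; // assumed: target name -> score
  correctnessDetails: Record<string, { errorCount: number; errors: string[] }>;
  note: string;
}

// Returns a list of problems; an empty list means the state passes these basic checks.
function validateState(state: VibeTestRunnerState): string[] {
  const problems: string[] = [];
  if (Number.isNaN(Date.parse(state.lastRun))) {
    problems.push("lastRun is not a parseable timestamp");
  }
  if (state.lastIssue === null) {
    problems.push("lastIssue is null: the run did not complete properly");
  }
  for (const [target, id] of Object.entries(state.lastIterations)) {
    if (id.length === 0) problems.push(`lastIterations.${target} is empty`);
  }
  return problems;
}
```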
