Night Watch Vibe Test Runner

Night Watch: Vibe Test Runner Role

Assigned to: Cindy's Navi (cixzhang)
Goal: Run the nightly Astryx vibe test suite with fair, isolated environments.
Frequency: Once per night (not every hour like other roles).

Architecture

NAVI (orchestrator)
  │
  ├─ Phase 0: Preflight — verify CLI + environment
  │   pnpm install, pnpm build (packages/cli)
  │   Verify npx xds --help works in a test project dir
  │
  ├─ Phase 1: Setup + Generate code
  │   setup-nightly.mjs → creates per-agent isolated project dirs
  │   Spawn 40 sub-agents: 10 Astryx + 10 Astryx+TW + 10 baseline + 10 HTML
  │   Each agent works in its own project directory (no cross-contamination)
  │   Output: .tsx/.json in each agent's project dir
  │
  ├─ Phase 1.5: Collect results
  │   collect-results.mjs → copies .tsx/.json to results/<iter>/results/
  │
  ├─ Phase 2: Build previews + tsc type-checking
  │   build-previews.ts → self-contained HTML with inlined CSS + build-errors.json
  │   Produces per-prompt correctness data (tsc errors) for debugging
  │
  ├─ Phase 3: Evaluate
  │   universal-eval.ts → 5-dimension scoring
  │   universal-compare.ts → cross-target comparison
  │
  ├─ Phase 4: Report + Persist
  │   Deploy report to gh-pages
  │   Commit results to repo (branch: vibe-test/nightly-YYYY-MM-DD)
  │   Append scores to the wiki ledger (Vibe-Test-Scores), not a new GitHub issue
  │
  └─ Phase 5: Verify
      Confirm the wiki ledger has the new date row in both tables
      Confirm report is accessible at gh-pages URL

Checker Protocol

Before running, verify these 5 invariants (see internal/vibe-tests/README.md):

Fair evaluators — same scoring logic across targets (target-aware counting is OK)
Only the system varies — same prompts, no system-specific coaching rules
Never leak the answer — expectedComponents never appears in agent prompts
Representative environment — each agent sees what a real consumer would see
Context-free agents — fresh spawn per prompt, no inherited knowledge

Agent Environments

Each agent gets an isolated project directory cloned from environments/project-{target}/:

Target	Agent sees	Discovery path
Astryx	`package.json` + `node_modules/@xds/core/` (symlinked to real source) + working CLI	`ls` → `npx xds --help` or `node_modules/@xds/core/README.md` → `docs.mjs` → component docs
Astryx+TW	Same as Astryx + Tailwind CSS available	Same discovery + Tailwind utility classes
Baseline	`package.json` + `components/ui/*.tsx` + `lib/` + `README.md`	`ls` → `README.md` → `cat components/ui/button.tsx` → real shadcn source
HTML	`package.json` only (bare React project)	No design system — agent uses plain HTML + inline CSS

Agent prompts say: "Your project is at <path>. Explore it to find how to look up component docs." No README paths, no CLI commands, no component names are given.
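A quick way to sanity-check this rule is to scan each generated prompt for hints it should not contain. The sketch below is illustrative only: `findLeakedHints` and the `FORBIDDEN_HINTS` patterns are hypothetical names, not part of the vibe-test source, and the patterns may need tuning if a task legitimately mentions one of these words.

```javascript
// Hypothetical patterns for hints an agent prompt must not contain.
// Illustrative only; adjust them to the real prompt template.
const FORBIDDEN_HINTS = [
  { label: "README path", pattern: /README(\.md)?/i },
  { label: "CLI command", pattern: /\bnpx\s+xds\b/ },
  { label: "expectedComponents", pattern: /expectedComponents/ },
];

// Returns the labels of every forbidden hint found in the prompt text.
function findLeakedHints(prompt) {
  return FORBIDDEN_HINTS
    .filter(({ pattern }) => pattern.test(prompt))
    .map(({ label }) => label);
}

// A prompt in the documented shape should produce an empty list.
const samplePrompt =
  "Your project is at /tmp/agent-1. Explore it to find how to look up component docs.";
console.log(findLeakedHints(samplePrompt));
```

An empty result means the prompt gives the agent a path and nothing else, which is the intended starting point for discovery.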

Nightly Checklist

Phase 0: Preflight (CRITICAL — do not skip)

# Ensure repo is up to date
cd /vercel/sandbox/repos/xds
git fetch origin main && git checkout origin/main

# Install dependencies
pnpm install

# Build the CLI package (required for npx xds to work in agent project dirs)
pnpm --filter @xds/cli build

# Verify CLI works
cd internal/vibe-tests
# --input-type=module makes node -e treat the script as ESM, so top-level await is allowed
node --input-type=module -e "
// Importing the setup module confirms it loads; createAgentProject is not called here
await import('./src/setup-environment.mjs');
const fs = await import('node:fs');
const {execSync} = await import('node:child_process');
// Verify the symlink target exists and the CLI binary is accessible
const cliPath = '/vercel/sandbox/repos/xds/packages/cli/bin/xds.mjs';
// Compare with === false: an exclamation mark inside double quotes can trigger history expansion in interactive bash
if (fs.existsSync(cliPath) === false) { console.error('CLI not found at', cliPath); process.exit(1); }
console.log('✓ CLI binary exists');
try { execSync('node ' + cliPath + ' --help', {stdio: 'pipe'}); console.log('✓ CLI runs'); }
catch(e) { console.error('✗ CLI failed:', e.message); process.exit(1); }
"

If preflight fails: STOP. Do not proceed. Log the error in state and report in the daily note. The CLI must work for Astryx agents to discover component APIs.

Phase 1: Setup + Generate Code

# Run setup from the repo root (creates iterations + per-agent project dirs)
cd /vercel/sandbox/repos/xds
node internal/vibe-tests/src/setup-nightly.mjs --sample 10

This outputs:

4 iteration IDs (xds, xds-tailwind, baseline, html)
40 task files (10 per target)
40 isolated project directories (one per agent)

Before spawning: Read the generated task files and verify checker protocol:

Same task text across all targets ✓
No expectedComponents in any prompt ✓
No system-specific coaching rules ✓
Each project dir contains only what a real consumer would see ✓
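The first two checks in this list can be automated once the task files are loaded. The following is a minimal sketch that assumes each task file parses to an object with `promptId` and `task` fields; that shape and the name `checkTaskParity` are assumptions made for illustration, so adapt them to the real output of setup-nightly.mjs.

```javascript
// tasksByTarget: { xds: [{promptId, task}], baseline: [...], ... }
// The field names are assumed for illustration, not taken from the real task files.
function checkTaskParity(tasksByTarget) {
  const problems = [];
  const textByPrompt = new Map(); // promptId -> first task text seen

  for (const [target, tasks] of Object.entries(tasksByTarget)) {
    for (const t of tasks) {
      // The answer key must never reach the agent.
      if ("expectedComponents" in t || /expectedComponents/.test(t.task)) {
        problems.push(`${target}/${t.promptId}: leaks expectedComponents`);
      }
      // Every target must receive byte-identical task text.
      if (!textByPrompt.has(t.promptId)) {
        textByPrompt.set(t.promptId, t.task);
      } else if (textByPrompt.get(t.promptId) !== t.task) {
        problems.push(`${target}/${t.promptId}: task text differs from another target`);
      }
    }
  }
  return problems;
}
```

An empty array means both checks passed. The coaching-rule and project-contents checks still need a human read, since they depend on judgment rather than string equality.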

Astryx agent verification: For at least one Astryx project dir, verify:

ls <project-dir>/node_modules/.bin/xds  # symlink exists
ls <project-dir>/node_modules/@xds/core/README.md  # docs accessible

Spawn 40 sub-agents — each works in its own project directory on the Astryx sandbox node. Agents write output (.tsx + .json) to their project directory.

Phase 1.5: Collect Results

After all agents complete, copy output to the standard results path:

cd internal/vibe-tests
node src/collect-results.mjs <xds-iteration-id>
node src/collect-results.mjs <xds-tailwind-iteration-id>
node src/collect-results.mjs <baseline-iteration-id>
node src/collect-results.mjs <html-iteration-id>

Check completeness: Each iteration should have 10 .tsx + 10 .json files. If any are missing, log which prompts failed and why (agent timeout, crash, etc.).
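The completeness check can be sketched as a pure function over a directory listing. It assumes outputs are named `<promptId>.tsx` and `<promptId>.json`; the `.tsx` naming matches the tsc examples later on this page, while the `.json` naming and the helper name `findMissingOutputs` are assumptions.

```javascript
// promptIds: the prompt IDs sampled for this iteration.
// fileNames: names of the files collected for it (for example from fs.readdirSync).
function findMissingOutputs(promptIds, fileNames) {
  const present = new Set(fileNames);
  const missing = [];
  for (const id of promptIds) {
    // Each prompt should have produced both a component and a result record.
    const lacks = [".tsx", ".json"].filter((ext) => !present.has(id + ext));
    if (lacks.length > 0) missing.push({ promptId: id, missing: lacks });
  }
  return missing;
}

const gaps = findMissingOutputs(
  ["fwc-6", "sd-1"],
  ["fwc-6.tsx", "fwc-6.json", "sd-1.tsx"],
);
console.log(gaps); // sd-1 is missing its .json file
```

The function only reports which files are absent. The reason (agent timeout, crash, and so on) still has to come from the agent logs.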

Phase 2: Build Previews + Type Checking

# Run tsc AND build previews (not --tsc-only — we want the HTML previews for debugging)
npx tsx src/build-previews.ts --iterations "<xds>,<xds-tw>,<baseline>,<html>"

This produces per-iteration:

build-errors.json — per-prompt tsc error details (THE key debugging artifact)
previews/ — standalone HTML pages for each generated component

Read build-errors.json for each iteration. Record:

Total error count
Which prompts are clean vs have errors
The actual error messages (for the GH issue)
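The three items to record can be derived in one pass. This sketch assumes build-errors.json maps each prompt ID to an object with an `errors` array of message strings, mirroring the `correctnessDetails` shape in the State section; the real file may be structured differently, and `summarizeBuildErrors` is a hypothetical helper.

```javascript
// buildErrors: { [promptId]: { errors: string[] } } (assumed shape)
function summarizeBuildErrors(buildErrors) {
  const clean = [];
  const failing = [];
  let total = 0;

  for (const [promptId, entry] of Object.entries(buildErrors)) {
    const errors = entry.errors ?? [];
    total += errors.length;
    if (errors.length === 0) {
      clean.push(promptId);
    } else {
      // Keep the raw messages so they can be quoted verbatim later.
      failing.push({ promptId, errorCount: errors.length, errors });
    }
  }
  return { total, clean, failing };
}

const summary = summarizeBuildErrors({
  "fwc-6": { errors: ["error TS2322: Type '\"primary\"' is not assignable"] },
  "sd-1": { errors: [] },
});
console.log(summary.total, summary.clean, summary.failing.map((f) => f.promptId));
```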

Phase 3: Evaluate

npx tsx src/universal-aggregate.ts --iteration <xds-id>
npx tsx src/universal-aggregate.ts --iteration <xds-tw-id>
npx tsx src/universal-aggregate.ts --iteration <baseline-id>
npx tsx src/universal-aggregate.ts --iteration <html-id>
npx tsx src/universal-compare.ts --astryx <xds-id> --baseline <baseline-id> --html <html-id>
npx tsx src/universal-compare.ts --astryx <xds-id> --baseline <xds-tw-id>

Phase 4: Report + Persist

Deploy report to gh-pages:

npx tsx src/deploy-report.ts --iteration <xds-id> --baseline <baseline-id> --html <html-id> --xds-tailwind <xds-tw-id>

Commit results to repo:

git checkout -b vibe-test/nightly-$(date +%Y-%m-%d)
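# Resolve the path from the repo root so this works even if the shell is still in internal/vibe-tests
git add "$(git rev-parse --show-toplevel)/internal/vibe-tests/results/"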
git commit -m "vibe-test: nightly results $(date +%Y-%m-%d)"
git push origin HEAD

Append scores to the wiki ledger (NOT a new issue):

Scores are NO LONGER filed as a new GitHub issue per night. They are appended as one row per run to the rolling wiki ledger: Vibe-Test-Scores.

# Clone the wiki with push auth (repo scope covers the wiki repo)
TOKEN=$(gh auth token)
git clone "https://x-access-token:${TOKEN}@github.com/facebook/astryx.wiki.git" /tmp/astryx-wiki
cd /tmp/astryx-wiki

# Append ONE row to the "## Overall" table (newest at the bottom):
#   | <date> | <astryx> | <astryx+tw or —> | <baseline> | <html> | <winner> | [#<run-issue?>](...) |
# And ONE row to the "## Astryx dimension breakdown" table:
#   | <date> | <overall> | <correctness> | <a11y> | <code-quality> | <efficiency> | <maintainability> | ... |
# Insert each row immediately before the next blank line / section after the table.
# Use `—` for any target not run. Keep the 0–100 scale. Do not reformat existing rows.

git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
  commit -am "vibe-test: scores $(date +%Y-%m-%d)"
git push origin master
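The row insertion described in the comments above can be done programmatically instead of by hand. The sketch below is one possible approach, assuming the ledger is plain Markdown with each table sitting under its `##` heading. `appendRowToTable` is a hypothetical helper, and the sample dates and scores in the usage are made up.

```javascript
// Inserts `row` after the last line of the first table found under `heading`.
function appendRowToTable(markdown, heading, row) {
  const lines = markdown.split("\n");
  const start = lines.findIndex((l) => l.trim() === heading);
  if (start === -1) throw new Error(`Heading not found: ${heading}`);

  // Skip forward to the first table line under the heading.
  let i = start + 1;
  while (i < lines.length && !lines[i].trim().startsWith("|")) {
    if (lines[i].startsWith("## ")) throw new Error(`No table under ${heading}`);
    i++;
  }
  if (i === lines.length) throw new Error(`No table under ${heading}`);

  // Walk to the end of the contiguous table block, then insert there.
  while (i < lines.length && lines[i].trim().startsWith("|")) i++;
  lines.splice(i, 0, row);
  return lines.join("\n");
}

// Usage with made-up sample data:
const ledger = [
  "## Overall",
  "",
  "| Date | Astryx |",
  "|------|--------|",
  "| 2025-01-01 | 80 |",
  "",
  "## Astryx dimension breakdown",
].join("\n");
const updated = appendRowToTable(ledger, "## Overall", "| 2025-01-02 | 82 |");
console.log(updated);
```

Existing rows are left byte-for-byte unchanged, which matches the instruction not to reformat them.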

Optional run record: the per-run detail (per-prompt table, tsc errors, iteration IDs) does NOT need its own issue. If you want a durable artifact, keep the results commit on the vibe-test/nightly-<date> branch (pushed above) and link its commit from the wiki row. Do not run gh issue create for scores.

Post API concerns to the rolling tracker (only if any):

API concerns are tracked in ONE rolling issue: Vibe Test — API Concerns (#3164) — never a new issue. Only the Vibe Test Debugger (the 6 AM fix job) posts here, after it confirms a concern reproduces. The Runner does not post API concerns; it only records scores. (Correctness/tsc details stay with the results branch for the Debugger to analyze.)

Phase 5: Verify

Confirm the wiki ledger updated: re-fetch Vibe-Test-Scores.md and verify the new date row is present in both tables with sane 0–100 values.
Confirm the wiki page is reachable: curl -s -o /dev/null -w "%{http_code}" https://github.com/facebook/astryx/wiki/Vibe-Test-Scores returns 200.
If verification fails, retry the wiki push once. If still failing, log the error in state (do NOT fall back to filing a scores issue).
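The ledger check can be sketched as follows. It assumes every ledger row starts with `| <date> |` and treats "present in both tables" as "exactly two rows carry that date", which is a simplification. `rowHasSaneScores` and `verifyLedgerRows` are hypothetical names.

```javascript
// True when the row has at least one numeric cell and all numeric cells are in 0..100.
// Non-numeric cells (placeholders, winner names, links) are ignored.
function rowHasSaneScores(row) {
  const cells = row.split("|").slice(1, -1).map((c) => c.trim());
  const numeric = cells.slice(1).filter((c) => /^\d+(\.\d+)?$/.test(c));
  return numeric.length > 0 && numeric.every((c) => Number(c) >= 0 && Number(c) <= 100);
}

// True when exactly two rows start with the given date and both look sane.
function verifyLedgerRows(markdown, date) {
  const rows = markdown
    .split("\n")
    .filter((line) => line.trim().startsWith(`| ${date} |`));
  return rows.length === 2 && rows.every(rowHasSaneScores);
}
```

A `false` result corresponds to the retry path above: push once more, and if the check still fails, log the error in state.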

Correctness Debugging Section (Issue Template)

When correctness < 100%, include this section in the GH issue:

### Correctness Failures

| Prompt | Target | Errors | Root Cause |
|--------|--------|--------|------------|
| fwc-6 | Astryx | 3 | Wrong prop type for `variant` — passed "primary" but Astryx expects "filled" |
| sd-1 | Astryx | 5 | Used `Spinner` (doesn't exist), should be `ProgressCircle` |

<details>
<summary>Full tsc errors (Astryx iteration)</summary>

**fwc-6.tsx:**

fwc-6.tsx(12,5): error TS2322: Type '"primary"' is not assignable to type '"filled" | "outlined" | "ghost"'.


**sd-1.tsx:**

sd-1.tsx(3,10): error TS2305: Module '@xds/core/Spinner' has no exported member 'Spinner'.

</details>
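The table at the top of this template can be generated from per-prompt error counts, leaving only the root-cause column to fill in by hand. This is a sketch under the assumption that the input matches the `correctnessDetails` shape from the State section; `renderCorrectnessTable` and the `rootCauses` map are hypothetical.

```javascript
// correctnessDetails: { [promptId]: { errorCount, errors[] } }
// rootCauses: optional { [promptId]: string } written by whoever analyzed the errors.
function renderCorrectnessTable(target, correctnessDetails, rootCauses = {}) {
  const header = [
    "| Prompt | Target | Errors | Root Cause |",
    "|--------|--------|--------|------------|",
  ];
  const rows = Object.entries(correctnessDetails)
    .filter(([, detail]) => detail.errorCount > 0) // clean prompts are omitted
    .map(([promptId, detail]) => {
      const cause = rootCauses[promptId] ?? "TODO";
      return `| ${promptId} | ${target} | ${detail.errorCount} | ${cause} |`;
    });
  return [...header, ...rows].join("\n");
}
```

Prompts without an entry in `rootCauses` get a `TODO` cell, so an unanalyzed failure stays visible in the table.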

Failure Modes & Recovery

Failure	Recovery
CLI not built / `npx xds` fails	Run `pnpm --filter @xds/cli build` first. If that fails, skip Astryx iteration and report.
Agent didn't read docs (correctness tanks)	Check if `docsRead` in result .json is empty — indicates discovery failure
Symlinks broken	Re-run `setup-nightly.mjs` — it recreates project dirs fresh
Some agents timed out	Log missing prompts, proceed with partial results. Note in issue body.
`gh issue create` body mismatch	Always use `--body-file`. Read back the issue after creation to verify.
Results not in repo	The branch push failed — try again. Results MUST be persisted.
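The `docsRead` check from the table above can be run across a whole iteration at once. The sketch assumes each result .json parses to an object with a `promptId` string and a `docsRead` array; only `docsRead` is named in this page, so the rest of the shape and the helper name are assumptions.

```javascript
// results: parsed result .json objects for one iteration (shape assumed).
// Returns the prompt IDs whose agent recorded no docs reads.
function findDiscoveryFailures(results) {
  return results
    .filter((r) => !Array.isArray(r.docsRead) || r.docsRead.length === 0)
    .map((r) => r.promptId);
}

const suspects = findDiscoveryFailures([
  { promptId: "fwc-6", docsRead: ["Button"] },
  { promptId: "sd-1", docsRead: [] },
]);
console.log(suspects); // sd-1 never found the docs
```

A missing `docsRead` field is treated the same as an empty one, on the reasoning that an agent that never reported a read should be inspected either way.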

Known Accepted Asymmetries

These slightly favor baseline, making Astryx wins more credible:

Efficiency: Tailwind's single-line className gets a lower styling ratio than Astryx's multi-line stylex.create, despite encoding more decisions
Maintainability: Tailwind scale values (p-4, text-sm) get generous semantic credit
Baseline docs: Hand-written README (more guidance than a real shadcn project would have)

State

Track in memory/xds-night-watch-state.json under vibeTestRunner key:

lastRun: ISO timestamp
lastIterations: {xds, "xds-tailwind", baseline, html} iteration IDs
lastIssue: GitHub issue number (MUST be set — if null, the run didn't complete properly)
results: per-target scores
correctnessDetails: {promptId: {errorCount, errors[]}} for debugging
note: human-readable summary of what happened
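As an illustration of the fields above, a state update might look like the following. The key names come from this list; the `run` input object and the `updateRunnerState` helper are hypothetical, and the real state file may carry keys for other roles, which the sketch preserves.

```javascript
// Returns a new state object with the vibeTestRunner key replaced.
// `run` is a hypothetical summary object assembled at the end of the nightly run.
function updateRunnerState(state, run) {
  return {
    ...state, // keep entries belonging to other Night Watch roles
    vibeTestRunner: {
      lastRun: run.finishedAt.toISOString(),
      lastIterations: run.iterations, // {xds, "xds-tailwind", baseline, html}
      lastIssue: run.issueNumber ?? null,
      results: run.scores,
      correctnessDetails: run.correctnessDetails,
      note: run.note,
    },
  };
}
```

Returning a fresh object rather than mutating `state` makes it straightforward to serialize the result with `JSON.stringify` and write it back in one step.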
