Agent Init Prompt Vibe Testing

A lightweight loop for testing and improving the Astryx agent init prompt — the block that pnpm xds init injects into CLAUDE.md, .cursorrules, and AGENTS.md via generateCompressedIndex().

You supply tasks, builder agents implement the UI using only that prompt plus CLI access, and an adversarial evaluator scores the results. The prompt is then tweaked and the loop repeats until every agent scores ≥95.

This is separate from the full Vibe Tests harness — no report pipeline, no screenshots, no CI. Just you, Cursor, and a chat.

Quick Start

Paste this into Cursor agent mode:

Run a vibe test on the Astryx agent init prompt. Read /Users/joeyfarina/code/xds/vibe-test-runs/GUIDE.md and follow it exactly. Here are the tasks I want to test:

1. "Build a login page with email and password inputs, a remember me checkbox, and a sign in button"
2. "Build a pricing page with 3 plan cards (Free, Pro, Enterprise) and a monthly/annual toggle"

Run both tasks with 3 parallel agents each. Keep iterating until all 3 score >=95 on each task.

The agent reads the guide and handles everything: getting the current init prompt from generateCompressedIndex(), spawning builders that only see that prompt + CLI access, running the adversarial evaluator, diagnosing failures, tweaking the prompt in agent-docs.mjs, and iterating.

How It Works

for each task:
  while any agent scores < 95 AND iterations < 50:
    1. Get current prompt from generateCompressedIndex()
    2. Spawn 3 builder agents in parallel (same task, same prompt, CLI access)
    3. Spawn 1 adversarial evaluator (fresh agent, CLI access, verifies every prop)
    4. If all 3 ≥ 95: move to next task
    5. Else: diagnose failures, tweak prompt in agent-docs.mjs, repeat
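
The loop above can be sketched in TypeScript. This is only an illustration, not the code in GUIDE.md: the LoopHooks interface and every hook name are hypothetical stand-ins (getPrompt for generateCompressedIndex(), tweakPrompt for editing agent-docs.mjs), and the builders run sequentially here rather than in parallel.

```typescript
// Hypothetical hooks; none of these names come from the Astryx codebase.
interface LoopHooks {
  getPrompt: () => string;                            // stands in for generateCompressedIndex()
  build: (prompt: string, task: string) => string[];  // one .tsx source per builder agent
  evaluate: (files: string[]) => number[];            // one score per generated file
  tweakPrompt: (scores: number[]) => void;            // stands in for editing agent-docs.mjs
}

interface LoopResult {
  passed: boolean;
  iterations: number;
  scores: number[];
}

const PASS_SCORE = 95;      // threshold from the loop description
const MAX_ITERATIONS = 50;  // iteration cap from the loop description

function runTask(task: string, hooks: LoopHooks): LoopResult {
  let scores: number[] = [];
  for (let i = 1; i <= MAX_ITERATIONS; i++) {
    const prompt = hooks.getPrompt();
    const files = hooks.build(prompt, task);
    scores = hooks.evaluate(files);
    // Pass only when every builder reaches the threshold.
    if (scores.length > 0 && scores.every((s) => s >= PASS_SCORE)) {
      return { passed: true, iterations: i, scores };
    }
    // Otherwise diagnose and adjust the prompt before the next round.
    hooks.tweakPrompt(scores);
  }
  return { passed: false, iterations: MAX_ITERATIONS, scores };
}
```

Passing the hooks in as arguments keeps the control flow testable with stubs, without spawning real agents.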

Builder agents receive:

The exact output of generateCompressedIndex() (the init prompt being tested)
The task description
Shell access to run pnpm xds component, pnpm xds template, etc.
Instructions to return trajectory (what CLI commands they ran) + code

Evaluator agent receives:

The 3 generated .tsx files
Shell access to verify every component/prop against pnpm xds component <Name>
A deduction-based scoring rubric (start at 100, lose points per issue)

Scoring rubric (deduction-based, start at 100):

Issue	Deduction
Hallucinated prop (doesn't exist on component)	-5
Hallucinated prop value (not in enum)	-3
Raw `<div>`, `<span>`, etc. for layout/wrapping	-5
Raw `<button>`, `<input>` instead of Astryx component	-5
`style={{}}` or inline style	-5
Hardcoded color (`#hex`, `rgb()`)	-3
Hardcoded pixel value	-3
`className=` usage	-3
CSS that duplicates an existing Astryx prop	-8
Wrong import path	-3
Non-idiomatic composition	-3
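
As a rough sketch of how the rubric turns into a number: the deduction values below are copied from the table, while the IssueKind keys, the Issue shape, and the floor at 0 are assumptions made for illustration.

```typescript
// Keys are hypothetical labels; the point values come from the rubric table.
type IssueKind =
  | "hallucinated-prop"
  | "hallucinated-prop-value"
  | "raw-layout-element"
  | "raw-control-element"
  | "inline-style"
  | "hardcoded-color"
  | "hardcoded-pixel"
  | "class-name"
  | "css-duplicates-prop"
  | "wrong-import-path"
  | "non-idiomatic-composition";

const DEDUCTIONS: Record<IssueKind, number> = {
  "hallucinated-prop": 5,
  "hallucinated-prop-value": 3,
  "raw-layout-element": 5,
  "raw-control-element": 5,
  "inline-style": 5,
  "hardcoded-color": 3,
  "hardcoded-pixel": 3,
  "class-name": 3,
  "css-duplicates-prop": 8,
  "wrong-import-path": 3,
  "non-idiomatic-composition": 3,
};

// Every deduction carries the line and evidence it is based on.
interface Issue {
  kind: IssueKind;
  line: number;
  evidence: string;
}

function scoreFile(issues: Issue[]): number {
  const lost = issues.reduce((sum, issue) => sum + DEDUCTIONS[issue.kind], 0);
  // Assumption: scores are floored at 0; the rubric does not say either way.
  return Math.max(0, 100 - lost);
}
```

For example, one hallucinated prop plus one case of CSS duplicating an existing prop costs 5 + 8 = 13 points, leaving 87, which is below the passing score.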

The evaluator runs rg scans for automated detection, then verifies every prop via CLI. Each deduction must cite the specific line and evidence.
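
The automated pass can be approximated with a line-by-line regex scan. The sketch below is a TypeScript stand-in for the rg scans, not the commands from GUIDE.md; the patterns and rule names are assumptions, and prop verification still has to go through the CLI.

```typescript
interface Finding {
  line: number;     // 1-based line number in the generated file
  rule: string;     // which pattern matched
  evidence: string; // the offending line, trimmed
}

// Assumed patterns; they only cover issues detectable from text alone.
const RULES: Array<{ rule: string; pattern: RegExp }> = [
  { rule: "raw layout element", pattern: /<(div|span)\b/ },
  { rule: "raw control element", pattern: /<(button|input)\b/ },
  { rule: "inline style", pattern: /\bstyle=\{\{/ },
  { rule: "className usage", pattern: /\bclassName=/ },
  { rule: "hardcoded color", pattern: /#[0-9a-fA-F]{3,8}\b|rgba?\(/ },
  { rule: "hardcoded pixel value", pattern: /\b\d+px\b/ },
];

function scanSource(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) {
        findings.push({ line: index + 1, rule, evidence: text.trim() });
      }
    }
  });
  return findings;
}
```

The patterns are case-sensitive, so a capitalized component tag such as `<Button` is not flagged by the raw `<button>` rule. Hits are candidates only; the evaluator still decides whether each one deserves a deduction.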

Failure Diagnosis

When agents fail, classify the root cause before changing the prompt:

Category	What happened	Fix
Didn't look up	Agent never ran CLI for a component it used	Strengthen "look up EVERY component" rule
Looked up but ignored	Agent saw the right prop in CLI output but used CSS	Add "if a prop exists, use it — never replicate with CSS"
Looked up wrong thing	Agent checked wrong component or template	Fix template mappings or add explicit component callouts
CLI output was unclear	Agent ran the right command but output didn't help	File a CLI improvement issue (not a prompt fix)
Lazy/shortcut	Agent skipped the workflow entirely	Make workflow steps more imperative

This prevents whack-a-mole. For example: if agents use Layout instead of AppShell, the fix isn't "add MORE rules about no divs" — it's "explicitly name AppShell in the prompt so agents find it."
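
One way to make the classification mechanical is to reduce a builder's trajectory to a few yes/no facts and check them in order. The sketch below is an assumption: the TrajectoryFacts fields and the order of the checks are invented for illustration, and only the category names and fixes come from the table above.

```typescript
type FailureCategory =
  | "Didn't look up"
  | "Looked up but ignored"
  | "Looked up wrong thing"
  | "CLI output was unclear"
  | "Lazy/shortcut";

// Hypothetical facts extracted by hand from a builder's trajectory.
interface TrajectoryFacts {
  ranAnyCli: boolean;             // did the agent run any CLI command at all?
  lookedUpUsedComponent: boolean; // did it run the CLI for the component it used?
  checkedRightTarget: boolean;    // was that the right component or template for the task?
  cliOutputShowedAnswer: boolean; // did the CLI output contain what it needed?
}

// Fix per category, taken from the diagnosis table.
const FIXES: Record<FailureCategory, string> = {
  "Didn't look up": 'Strengthen "look up EVERY component" rule',
  "Looked up but ignored": 'Add "if a prop exists, use it, never replicate with CSS"',
  "Looked up wrong thing": "Fix template mappings or add explicit component callouts",
  "CLI output was unclear": "File a CLI improvement issue (not a prompt fix)",
  "Lazy/shortcut": "Make workflow steps more imperative",
};

function classify(facts: TrajectoryFacts): FailureCategory {
  if (!facts.ranAnyCli) return "Lazy/shortcut";
  if (!facts.lookedUpUsedComponent) return "Didn't look up";
  if (!facts.checkedRightTarget) return "Looked up wrong thing";
  if (!facts.cliOutputShowedAnswer) return "CLI output was unclear";
  // The agent saw the right information and still did something else.
  return "Looked up but ignored";
}
```

In the Layout vs AppShell example, the agent did run the CLI but checked the wrong component, so classify returns "Looked up wrong thing" and FIXES points at template mappings and explicit callouts rather than more rules about divs.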

Writing Good Tasks

Tasks should describe user experience, not components:

Bad	Good
"Use Table with sorting"	"Build a table of users with sortable columns, search, and pagination"
"Create a dialog with Dialog"	"Build row actions with a '...' dropdown and delete confirmation"
"Make a layout with AppShell"	"Build a settings dashboard with header, sidebar nav, and content area"

Cover diverse categories:

Layout: dashboard, settings page, admin panel
Forms: registration, support ticket, checkout
Data: sortable table, data grid, timeline
Overlays: dialog, dropdown, command palette
Navigation: sidebar, breadcrumbs, tabs
Composition: full app combining multiple patterns

Example Results

From the initial prompt improvement session (April 2026):

Task	Agent A	Agent B	Agent C	Pass?
Settings layout (v1 prompt)	90	86	63	❌
Settings layout (v2 prompt)	97	N/A	100	✅
Support ticket form	92	95	95	✅
Sortable data table	100	90	98	✅
Dialog + dropdown	95	100	93	✅
Full TodoTracker	95	100	95	✅

One prompt iteration fixed the core issues. The changes:

Change	Why
"Full pages → dashboard (uses AppShell)"	Agents were choosing Layout instead
"No `<div>` anywhere — not for layout, not for wrappers, not for spacing"	Agents used divs for non-layout wrapping
"Full-page shells → AppShell (not Layout). Sidebar nav → SideNav (not List)"	Agents picked wrong components
"If a component prop does what you need, use it — never replicate with CSS/stylex"	Agents wrote CSS duplicating HStack props

The session also found two design system gaps (not prompt-fixable):

Button missing xstyle prop (all agents assumed it existed)
StackItem CLI shows grow but real prop is size="fill"

Full Guide

The complete step-by-step instructions (with exact prompt templates for builders and evaluators) are in the repo:

/Users/joeyfarina/code/xds/vibe-test-runs/GUIDE.md

That file is what the Cursor agent reads when you use the quick start command. It contains:

How to get the current prompt
Exact builder prompt template (copy-paste ready)
Exact evaluator prompt template (copy-paste ready)
Decision tree for pass/fail
Logging format
