Agent Init Prompt Vibe Testing
A lightweight loop for testing and improving the Astryx agent init prompt — the block that pnpm xds init injects into CLAUDE.md, .cursorrules, and AGENTS.md via generateCompressedIndex().
You supply tasks, agents build UI using only that prompt plus CLI access, and an adversarial evaluator scores the results. The prompt is then tweaked and the cycle repeats until every agent scores ≥95.
This is separate from the full Vibe Tests harness — no report pipeline, no screenshots, no CI. Just you, Cursor, and a chat.
Paste this into Cursor agent mode:
Run a vibe test on the Astryx agent init prompt. Read /Users/joeyfarina/code/xds/vibe-test-runs/GUIDE.md and follow it exactly. Here are the tasks I want to test:
1. "Build a login page with email and password inputs, a remember me checkbox, and a sign in button"
2. "Build a pricing page with 3 plan cards (Free, Pro, Enterprise) and a monthly/annual toggle"
Run both tasks with 3 parallel agents each. Keep iterating until all 3 score >=95 on each task.
The agent reads the guide and handles everything: getting the current init prompt from generateCompressedIndex(), spawning builders that only see that prompt + CLI access, running the adversarial evaluator, diagnosing failures, tweaking the prompt in agent-docs.mjs, and iterating.
    for each task:
      while any agent scores < 95 AND iterations < 50:
        1. Get current prompt from generateCompressedIndex()
        2. Spawn 3 builder agents in parallel (same task, same prompt, CLI access)
        3. Spawn 1 adversarial evaluator (fresh agent, CLI access, verifies every prop)
        4. If all 3 ≥ 95: move to next task
        5. Else: diagnose failures, tweak prompt in agent-docs.mjs, repeat
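The loop above can be sketched in TypeScript. This is only an illustration of the control flow: in practice the Cursor agent follows GUIDE.md rather than running code like this, and every name here (`LoopDeps`, `runTask`, the callback signatures) is hypothetical. The callbacks are synchronous to keep the sketch short, whereas real builders would run in parallel.

```typescript
// Hypothetical sketch of the iterate-until-pass loop; not the actual harness.
const PASS_THRESHOLD = 95;
const MAX_ITERATIONS = 50;
const BUILDERS_PER_TASK = 3;

interface LoopDeps {
  // Stands in for reading the output of generateCompressedIndex().
  getPrompt: () => string;
  // Stands in for spawning builders; returns one generated file per agent.
  runBuilders: (prompt: string, task: string, count: number) => string[];
  // Stands in for the adversarial evaluator; returns one score per file.
  evaluate: (outputs: string[]) => number[];
  // Stands in for diagnosing failures and editing the prompt.
  tweakPrompt: (scores: number[]) => void;
}

interface LoopResult {
  passed: boolean;
  iterations: number;
  finalScores: number[];
}

function runTask(task: string, deps: LoopDeps): LoopResult {
  let finalScores: number[] = [];
  for (let iteration = 1; iteration <= MAX_ITERATIONS; iteration++) {
    const prompt = deps.getPrompt();
    const outputs = deps.runBuilders(prompt, task, BUILDERS_PER_TASK);
    finalScores = deps.evaluate(outputs);
    // Pass only when every builder produced a score at or above the threshold.
    const allPass =
      finalScores.length === BUILDERS_PER_TASK &&
      finalScores.every((score) => score >= PASS_THRESHOLD);
    if (allPass) {
      return { passed: true, iterations: iteration, finalScores };
    }
    deps.tweakPrompt(finalScores);
  }
  return { passed: false, iterations: MAX_ITERATIONS, finalScores };
}
```

The iteration cap means a task that never converges still terminates, with `passed: false` and the last set of scores available for diagnosis.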
Each builder agent receives:
- The exact output of `generateCompressedIndex()` (the init prompt being tested)
- The task description
- Shell access to run `pnpm xds component`, `pnpm xds template`, etc.
- Instructions to return trajectory (what CLI commands they ran) + code

The evaluator receives:
- The 3 generated .tsx files
- Shell access to verify every component/prop against `pnpm xds component <Name>`
- A deduction-based scoring rubric (start at 100, lose points per issue)
| Issue | Deduction |
|---|---|
| Hallucinated prop (doesn't exist on component) | -5 |
| Hallucinated prop value (not in enum) | -3 |
| Raw `<div>`, `<span>`, etc. for layout/wrapping | -5 |
| Raw `<button>`, `<input>` instead of Astryx component | -5 |
| `style={{}}` or inline style | -5 |
| Hardcoded color (`#hex`, `rgb()`) | -3 |
| Hardcoded pixel value | -3 |
| `className=` usage | -3 |
| CSS that duplicates an existing Astryx prop | -8 |
| Wrong import path | -3 |
| Non-idiomatic composition | -3 |
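As a rough illustration, the rubric can be applied mechanically once findings are collected. The sketch below uses the deduction values from the table. The issue keys, the `Finding` shape, and `scoreFile` are hypothetical names, and two behaviours are assumptions rather than anything the guide states: each occurrence is deducted separately, and the score is floored at 0.

```typescript
// Deduction values copied from the rubric table; key names are illustrative.
type IssueKind =
  | "hallucinated-prop"
  | "hallucinated-prop-value"
  | "raw-layout-element"
  | "raw-control-element"
  | "inline-style"
  | "hardcoded-color"
  | "hardcoded-pixel"
  | "class-name"
  | "css-duplicates-prop"
  | "wrong-import-path"
  | "non-idiomatic-composition";

const DEDUCTIONS: Record<IssueKind, number> = {
  "hallucinated-prop": 5,
  "hallucinated-prop-value": 3,
  "raw-layout-element": 5,
  "raw-control-element": 5,
  "inline-style": 5,
  "hardcoded-color": 3,
  "hardcoded-pixel": 3,
  "class-name": 3,
  "css-duplicates-prop": 8,
  "wrong-import-path": 3,
  "non-idiomatic-composition": 3,
};

// Every deduction carries the line and evidence it is based on.
interface Finding {
  kind: IssueKind;
  line: number;
  evidence: string;
}

// Start at 100 and subtract per finding (assumption: per occurrence, floor at 0).
function scoreFile(findings: Finding[]): number {
  const totalDeduction = findings.reduce(
    (sum, finding) => sum + DEDUCTIONS[finding.kind],
    0,
  );
  return Math.max(0, 100 - totalDeduction);
}
```

With these values, a file with one hallucinated prop, one raw `<div>`, and one case of CSS duplicating a prop loses 5 + 5 + 8 = 18 points and scores 82, which is below the pass mark.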
The evaluator runs rg scans for automated detection, then verifies every prop via CLI. Each deduction must cite the specific line and evidence.
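For the automated part, a minimal sketch of what such scans might look for is shown here in TypeScript. The evaluator is described as using `rg`; the regular expressions below are approximations made up for this example, not the patterns from GUIDE.md, and they can produce false positives (for instance a `#` followed by hex-like characters inside a string). `ScanHit`, `SCAN_PATTERNS`, and `scanSource` are hypothetical names.

```typescript
// Illustrative patterns only; the real rg scans may differ.
interface ScanPattern {
  kind: string;
  pattern: RegExp;
}

interface ScanHit {
  kind: string;
  line: number; // 1-based line number in the scanned source
  evidence: string; // the offending line, trimmed
}

const SCAN_PATTERNS: ScanPattern[] = [
  // Lowercase tags only, so <Button> and <Input> components are not flagged.
  { kind: "raw-layout-element", pattern: /<(div|span)\b/ },
  { kind: "raw-control-element", pattern: /<(button|input)\b/ },
  { kind: "inline-style", pattern: /\bstyle=\{/ },
  { kind: "class-name", pattern: /\bclassName=/ },
  { kind: "hardcoded-color", pattern: /#[0-9a-fA-F]{3,8}\b|\brgba?\(/ },
  { kind: "hardcoded-pixel", pattern: /\b\d+px\b/ },
];

// Scan line by line so every hit can cite a specific line and its evidence.
function scanSource(source: string): ScanHit[] {
  const hits: ScanHit[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { kind, pattern } of SCAN_PATTERNS) {
      if (pattern.test(text)) {
        hits.push({ kind, line: index + 1, evidence: text.trim() });
      }
    }
  });
  return hits;
}
```

Scans like these only catch surface-level issues. Hallucinated props and prop values still require checking each component against the CLI output.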
When agents fail, classify the root cause before changing the prompt:
| Category | What happened | Fix |
|---|---|---|
| Didn't look up | Agent never ran CLI for a component it used | Strengthen "look up EVERY component" rule |
| Looked up but ignored | Agent saw the right prop in CLI output but used CSS | Add "if a prop exists, use it — never replicate with CSS" |
| Looked up wrong thing | Agent checked wrong component or template | Fix template mappings or add explicit component callouts |
| CLI output was unclear | Agent ran the right command but output didn't help | File a CLI improvement issue (not a prompt fix) |
| Lazy/shortcut | Agent skipped the workflow entirely | Make workflow steps more imperative |
This prevents whack-a-mole. For example: if agents use Layout instead of AppShell, the fix isn't "add MORE rules about no divs" — it's "explicitly name AppShell in the prompt so agents find it."
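The first category, "Didn't look up", is the easiest to check from a builder's returned trajectory. The sketch below assumes the trajectory is a list of shell command strings and that lookups take the form `pnpm xds component <Name>`. The function names are hypothetical, and tag extraction is a crude regex that would also match TypeScript generics such as `useState<Foo>`.

```typescript
// Component names the agent looked up via `pnpm xds component <Name>`.
function lookedUpComponents(trajectory: string[]): string[] {
  const names: string[] = [];
  for (const command of trajectory) {
    const match = command.match(/pnpm xds component\s+([A-Z][A-Za-z0-9]*)/);
    if (match && names.indexOf(match[1]) === -1) {
      names.push(match[1]);
    }
  }
  return names;
}

// Capitalized JSX opening tags found in the generated code.
function usedComponents(code: string): string[] {
  const names: string[] = [];
  const tagPattern = /<([A-Z][A-Za-z0-9]*)/g;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(code)) !== null) {
    if (names.indexOf(match[1]) === -1) {
      names.push(match[1]);
    }
  }
  return names;
}

// Components that appear in the code but were never looked up, sorted by name.
function neverLookedUp(trajectory: string[], code: string): string[] {
  const lookedUp = lookedUpComponents(trajectory);
  return usedComponents(code)
    .filter((name) => lookedUp.indexOf(name) === -1)
    .sort();
}
```

An empty result rules out "Didn't look up" and points toward the other categories, which need a human or agent to read the CLI output the builder actually saw.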
Tasks should describe user experience, not components:
| Bad | Good |
|---|---|
| "Use Table with sorting" | "Build a table of users with sortable columns, search, and pagination" |
| "Create a dialog with Dialog" | "Build row actions with a '...' dropdown and delete confirmation" |
| "Make a layout with AppShell" | "Build a settings dashboard with header, sidebar nav, and content area" |
Cover diverse categories:
- Layout: dashboard, settings page, admin panel
- Forms: registration, support ticket, checkout
- Data: sortable table, data grid, timeline
- Overlays: dialog, dropdown, command palette
- Navigation: sidebar, breadcrumbs, tabs
- Composition: full app combining multiple patterns
From the initial prompt improvement session (April 2026):
| Task | Agent A | Agent B | Agent C | Pass? |
|---|---|---|---|---|
| Settings layout (v1 prompt) | 90 | 86 | 63 | ❌ |
| Settings layout (v2 prompt) | 97 | N/A | 100 | ✅ |
| Support ticket form | 92 | 95 | 95 | ✅ |
| Sortable data table | 100 | 90 | 98 | ✅ |
| Dialog + dropdown | 95 | 100 | 93 | ✅ |
| Full TodoTracker | 95 | 100 | 95 | ✅ |
One prompt iteration fixed the core issues. The changes:
| Change | Why |
|---|---|
| "Full pages → dashboard (uses AppShell)" | Agents were choosing Layout instead |
| "No `<div>` anywhere — not for layout, not for wrappers, not for spacing" | Agents used divs for non-layout wrapping |
| "Full-page shells → AppShell (not Layout). Sidebar nav → SideNav (not List)" | Agents picked wrong components |
| "If a component prop does what you need, use it — never replicate with CSS/stylex" | Agents wrote CSS duplicating HStack props |
The session also found two design system gaps (not prompt-fixable):
- `Button` missing `xstyle` prop (all agents assumed it existed)
- `StackItem` CLI shows `grow` but real prop is `size="fill"`
The complete step-by-step instructions (with exact prompt templates for builders and evaluators) are in the repo:
/Users/joeyfarina/code/xds/vibe-test-runs/GUIDE.md
That file is what the Cursor agent reads when you use the quick start command. It contains:
- How to get the current prompt
- Exact builder prompt template (copy-paste ready)
- Exact evaluator prompt template (copy-paste ready)
- Decision tree for pass/fail
- Logging format