Agent Init Prompt Vibe Testing

Cindy Zhang edited this page Jun 23, 2026 · 1 revision

A lightweight loop for testing and improving the Astryx agent init prompt — the block that pnpm xds init injects into CLAUDE.md, .cursorrules, and AGENTS.md via generateCompressedIndex().

The loop: you supply tasks, builder agents build UI using only that prompt plus CLI access, an adversarial evaluator scores the output, the prompt gets tweaked, and the cycle repeats until every agent scores ≥95.

This is separate from the full Vibe Tests harness — no report pipeline, no screenshots, no CI. Just you, Cursor, and a chat.


Quick Start

Paste this into Cursor agent mode:

Run a vibe test on the Astryx agent init prompt. Read /Users/joeyfarina/code/xds/vibe-test-runs/GUIDE.md and follow it exactly. Here are the tasks I want to test:

1. "Build a login page with email and password inputs, a remember me checkbox, and a sign in button"
2. "Build a pricing page with 3 plan cards (Free, Pro, Enterprise) and a monthly/annual toggle"

Run both tasks with 3 parallel agents each. Keep iterating until all 3 score >=95 on each task.

The agent reads the guide and handles everything: getting the current init prompt from generateCompressedIndex(), spawning builders that only see that prompt + CLI access, running the adversarial evaluator, diagnosing failures, tweaking the prompt in agent-docs.mjs, and iterating.


How It Works

for each task:
  while any agent scores < 95 AND iterations < 50:
    1. Get current prompt from generateCompressedIndex()
    2. Spawn 3 builder agents in parallel (same task, same prompt, CLI access)
    3. Spawn 1 adversarial evaluator (fresh agent, CLI access, verifies every prop)
    4. If all 3 ≥ 95: move to next task
    5. Else: diagnose failures, tweak prompt in agent-docs.mjs, repeat
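The control flow above can be sketched in TypeScript. This is only an illustration: the `Hooks` interface and its three callbacks are hypothetical stand-ins for work the Cursor agent performs by following GUIDE.md, and a real harness would run the builders asynchronously and in parallel rather than through one synchronous call.

```typescript
const PASS_SCORE = 95;      // threshold from the loop above
const MAX_ITERATIONS = 50;  // iteration cap from the loop above

type Outcome = { passed: boolean; iterations: number; scores: number[] };

// Hypothetical hooks; none of these names exist in the repo.
interface Hooks {
  getPrompt: () => string;                                   // stands in for generateCompressedIndex()
  buildAndScore: (task: string, prompt: string) => number[]; // 3 builders + 1 evaluator, one score per builder
  tweakPrompt: (task: string, scores: number[]) => void;     // diagnose failures, edit agent-docs.mjs
}

// A task passes only when every builder reaches the threshold.
function allPass(scores: number[]): boolean {
  return scores.length > 0 && scores.every((s) => s >= PASS_SCORE);
}

function runTask(task: string, hooks: Hooks): Outcome {
  let scores: number[] = [];
  for (let i = 1; i <= MAX_ITERATIONS; i++) {
    const prompt = hooks.getPrompt();           // step 1
    scores = hooks.buildAndScore(task, prompt); // steps 2-3
    if (allPass(scores)) {
      return { passed: true, iterations: i, scores }; // step 4
    }
    hooks.tweakPrompt(task, scores);            // step 5
  }
  return { passed: false, iterations: MAX_ITERATIONS, scores };
}
```

Re-reading the prompt at the top of every iteration matters: each round must test the prompt as it stands after the previous tweak, not a cached copy.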

Builder agents receive:

  • The exact output of generateCompressedIndex() (the init prompt being tested)
  • The task description
  • Shell access to run pnpm xds component, pnpm xds template, etc.
  • Instructions to return their trajectory (the CLI commands they ran) plus the generated code

Evaluator agent receives:

  • The 3 generated .tsx files
  • Shell access to verify every component/prop against pnpm xds component <Name>
  • A deduction-based scoring rubric (start at 100, lose points per issue)

Scoring rubric (deduction-based, start at 100):

| Issue | Deduction |
| --- | --- |
| Hallucinated prop (doesn't exist on component) | -5 |
| Hallucinated prop value (not in enum) | -3 |
| Raw <div>, <span>, etc. for layout/wrapping | -5 |
| Raw <button>, <input> instead of Astryx component | -5 |
| style={{}} or inline style | -5 |
| Hardcoded color (#hex, rgb()) | -3 |
| Hardcoded pixel value | -3 |
| className= usage | -3 |
| CSS that duplicates an existing Astryx prop | -8 |
| Wrong import path | -3 |
| Non-idiomatic composition | -3 |
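As a rough illustration of deduction-based scoring, the rubric can be expressed as a lookup table plus a sum. The issue keys and the `Finding` shape below are invented for this sketch; the point values are the ones listed in the rubric. The sketch does not clamp at zero, since the rubric does not say whether scores have a floor.

```typescript
// Issue keys are hypothetical labels for the rubric rows.
type Issue =
  | "hallucinated-prop"
  | "hallucinated-prop-value"
  | "raw-layout-element"
  | "raw-interactive-element"
  | "inline-style"
  | "hardcoded-color"
  | "hardcoded-pixel"
  | "class-name"
  | "css-duplicates-prop"
  | "wrong-import-path"
  | "non-idiomatic-composition";

// Points lost per occurrence, taken from the rubric.
const DEDUCTIONS: Record<Issue, number> = {
  "hallucinated-prop": 5,
  "hallucinated-prop-value": 3,
  "raw-layout-element": 5,
  "raw-interactive-element": 5,
  "inline-style": 5,
  "hardcoded-color": 3,
  "hardcoded-pixel": 3,
  "class-name": 3,
  "css-duplicates-prop": 8,
  "wrong-import-path": 3,
  "non-idiomatic-composition": 3,
};

// Every deduction carries the line and evidence it is based on.
interface Finding {
  issue: Issue;
  line: number;
  evidence: string;
}

// Start at 100 and subtract the deduction for each finding.
function score(findings: Finding[]): number {
  return findings.reduce((total, f) => total - DEDUCTIONS[f.issue], 100);
}
```

For example, one hallucinated prop, one inline style, and one case of CSS duplicating a prop would give 100 - 5 - 5 - 8 = 82, below the 95 threshold.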

The evaluator runs rg scans for automated detection, then verifies every prop via CLI. Each deduction must cite the specific line and evidence.
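The automated pass can be approximated with a handful of regular expressions. The patterns below are hypothetical and are not the rg commands from GUIDE.md; they only show the kind of line-level detection involved, and each hit records the line number and text so a deduction can cite its evidence.

```typescript
type Hit = { rule: string; line: number; text: string };

// Hypothetical heuristics; expect false positives (e.g. "#abc" inside a URL).
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "raw-html-element", pattern: /<(div|span|button|input)\b/ },
  { rule: "inline-style", pattern: /\bstyle=\{\{/ },
  { rule: "class-name", pattern: /\bclassName=/ },
  { rule: "hardcoded-color", pattern: /#[0-9a-fA-F]{3,8}\b|\brgba?\(/ },
  { rule: "hardcoded-pixel", pattern: /\b\d+px\b/ },
];

// Check every line of a generated .tsx source against every rule.
function scan(source: string): Hit[] {
  const hits: Hit[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) {
        hits.push({ rule, line: index + 1, text: text.trim() });
      }
    }
  });
  return hits;
}
```

A scan like this only flags candidates. Hallucinated props and prop values cannot be found by pattern matching, which is why the evaluator still verifies each prop through the CLI.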


Failure Diagnosis

When agents fail, classify the root cause before changing the prompt:

| Category | What happened | Fix |
| --- | --- | --- |
| Didn't look up | Agent never ran CLI for a component it used | Strengthen "look up EVERY component" rule |
| Looked up but ignored | Agent saw the right prop in CLI output but used CSS | Add "if a prop exists, use it — never replicate with CSS" |
| Looked up wrong thing | Agent checked wrong component or template | Fix template mappings or add explicit component callouts |
| CLI output was unclear | Agent ran the right command but output didn't help | File a CLI improvement issue (not a prompt fix) |
| Lazy/shortcut | Agent skipped the workflow entirely | Make workflow steps more imperative |

Classifying first prevents whack-a-mole prompt edits. For example, if agents use Layout instead of AppShell, the fix isn't "add MORE rules about no divs"; it's "explicitly name AppShell in the prompt so agents find it."
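The "Didn't look up" category is the easiest to check mechanically, because builders return their trajectory alongside their code. A minimal sketch, assuming the trajectory is a list of shell command strings and that lookups take the form `pnpm xds component <Name>`; the function names are hypothetical, and the JSX tag regex is a heuristic that can also match TypeScript generics.

```typescript
// Component names the agent looked up via `pnpm xds component <Name>`.
function lookedUp(trajectory: string[]): string[] {
  const names: string[] = [];
  for (const command of trajectory) {
    const match = /^pnpm xds component (\w+)/.exec(command.trim());
    const name = match ? match[1] : undefined;
    if (name && names.indexOf(name) === -1) names.push(name);
  }
  return names;
}

// Capitalized JSX tag names in the generated code, e.g. <AppShell> or <HStack ...>.
function usedComponents(code: string): string[] {
  const names: string[] = [];
  const tag = /<([A-Z]\w*)/g;
  let match: RegExpExecArray | null;
  while ((match = tag.exec(code)) !== null) {
    const name = match[1];
    if (name && names.indexOf(name) === -1) names.push(name);
  }
  return names;
}

// Components that appear in the code but were never looked up.
function neverLookedUp(trajectory: string[], code: string): string[] {
  const seen = lookedUp(trajectory);
  return usedComponents(code).filter((name) => seen.indexOf(name) === -1);
}
```

A non-empty result points at "Didn't look up". An empty result combined with wrong props points at one of the other categories, which need a human or agent to read the CLI output the builder actually saw.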


Writing Good Tasks

Tasks should describe user experience, not components:

| Bad | Good |
| --- | --- |
| "Use Table with sorting" | "Build a table of users with sortable columns, search, and pagination" |
| "Create a dialog with Dialog" | "Build row actions with a '...' dropdown and delete confirmation" |
| "Make a layout with AppShell" | "Build a settings dashboard with header, sidebar nav, and content area" |

Cover diverse categories:

  • Layout: dashboard, settings page, admin panel
  • Forms: registration, support ticket, checkout
  • Data: sortable table, data grid, timeline
  • Overlays: dialog, dropdown, command palette
  • Navigation: sidebar, breadcrumbs, tabs
  • Composition: full app combining multiple patterns

Example Results

From the initial prompt improvement session (April 2026):

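| Task | Agent A | Agent B | Agent C | Pass? |
| --- | --- | --- | --- | --- |
| Settings layout (v1 prompt) | 90 | 86 | 63 | |
| Settings layout (v2 prompt) | 97 | N/A | 100 | |
| Support ticket form | 92 | 95 | 95 | |
| Sortable data table | 100 | 90 | 98 | |
| Dialog + dropdown | 95 | 100 | 93 | |
| Full TodoTracker | 95 | 100 | 95 | |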

One prompt iteration fixed the core issues. The changes:

| Change | Why |
| --- | --- |
| "Full pages → dashboard (uses AppShell)" | Agents were choosing Layout instead |
| "No <div> anywhere — not for layout, not for wrappers, not for spacing" | Agents used divs for non-layout wrapping |
| "Full-page shells → AppShell (not Layout). Sidebar nav → SideNav (not List)" | Agents picked wrong components |
| "If a component prop does what you need, use it — never replicate with CSS/stylex" | Agents wrote CSS duplicating HStack props |

The session also found two design system gaps (not prompt-fixable):

  1. Button is missing an xstyle prop (all agents assumed it existed)
  2. The CLI output for StackItem shows grow, but the real prop is size="fill"

Full Guide

The complete step-by-step instructions (with exact prompt templates for builders and evaluators) are in the repo:

/Users/joeyfarina/code/xds/vibe-test-runs/GUIDE.md

That file is what the Cursor agent reads when you use the quick start command. It contains:

  • How to get the current prompt
  • Exact builder prompt template (copy-paste ready)
  • Exact evaluator prompt template (copy-paste ready)
  • Decision tree for pass/fail
  • Logging format
