Vibe Evaluation

Nightly and periodic evaluations that measure how well Astryx performs compared to alternatives. This is the benchmarking system — the ongoing scorecard that tells us whether the design system is getting better or worse over time.

For using vibe tests to make API decisions (choosing between API shapes, resolving naming disputes), see API Arbitration.

Why

Design systems exist to make building UIs faster and more consistent. In an AI-assisted world, that means the system needs to work well not just for humans reading docs, but for LLMs generating code from those docs.

The nightly evaluation answers: is Astryx measurably better than the alternatives? If baseline (shadcn/Tailwind) consistently outscores Astryx, something is regressing — stale docs, growing API complexity, or conventions that fight mental models.

This is the system's health check, not a design tool.

What Gets Measured

Six Dimensions

Dimension	What it measures	Why it matters
Correctness	Valid component usage, no hallucinated props or imports	The baseline — does it work at all?
Accessibility	Labels, semantics, keyboard support, ARIA attributes	Design systems should make a11y the default path
Code Quality	Complexity, patterns, TypeScript usage, readability	Does the API lead to clean consumer code?
Efficiency	Decisions per element, DRYness, conciseness	Fewer decisions = fewer chances to get it wrong
Maintainability	Semantic tokens vs magic values, dark mode support, locality	Will this code survive a theme change?
Design	Visual fidelity vs ideal reference images (layout, hierarchy, spacing, component fidelity, color/theming)	Does the output look right? Optional — only scored when ideal images exist and a vision LLM is available.

Key Metrics

Decisions/element — How many styling decisions the LLM had to make per UI element. Astryx typically scores ~1.7 vs ~2.3 for raw Tailwind. Lower is better — it means the system is absorbing complexity.
Semantic ratio — What percentage of styling uses semantic tokens vs raw values. Higher means more maintainable, more themeable.
Escape hatches — Did the LLM drop out of the design system to raw CSS/HTML? Where and why?
Dimension scores — Each of the 6 dimensions is scored 0-100. Overall score is the average across available dimensions.
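The overall score described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function name `overall_score` and the dictionary shape are assumptions, and a dimension that could not be scored (for example Design when no ideal images exist) is represented here as `None`.

```python
from statistics import mean
from typing import Optional


def overall_score(dimension_scores: dict[str, Optional[float]]) -> float:
    """Average the 0-100 scores of the dimensions that were actually scored."""
    available = [s for s in dimension_scores.values() if s is not None]
    if not available:
        raise ValueError("no dimension was scored")
    return mean(available)


# Hypothetical scores for one run; Design was skipped (no reference images).
example = {
    "correctness": 95,
    "accessibility": 90,
    "code_quality": 85,
    "efficiency": 80,
    "maintainability": 100,
    "design": None,
}
print(overall_score(example))  # 90
```

Skipping unscored dimensions, rather than counting them as zero, keeps runs with and without a vision LLM comparable.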

Targets

Target	What it uses	Docs provided
Astryx	`@xds/core` components + StyleX	Auto-generated from source via CLI
Baseline	shadcn/ui + Tailwind CSS	`.baseline-docs/`
HTML	Raw HTML + CSS	None (just HTML/CSS knowledge)
Astryx + Tailwind	`@xds/core` + Tailwind (no StyleX)	`.generated/xds-tailwind-skill.md`

Running the Nightly Evaluation

Quick Start

The simplest way is through the /vibe-test command in Claude Code, or by asking Navi:

/vibe-test 5              # 5 stratified sample prompts
"Run a vibe test with 5 samples"

This generates the skill doc from source, spawns parallel sub-agents, self-evaluates, and aggregates. Results land in results/<iteration>/.

Full Comparative Run

cd internal/vibe-tests

# 1. Run Astryx — note the iteration ID and prompt IDs
pnpm interactive --target xds --persona naive --sample 10

# 2. Run baseline with the SAME prompts (critical!)
pnpm interactive --target baseline --persona naive --prompts <comma-separated-ids>

# 3. (Optional) Run raw HTML with the same prompts
pnpm interactive --target html --persona naive --prompts <ids>

Always reuse the same prompt IDs across targets. Without this, you're comparing different prompts and the results are noise.
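A small guard can catch mismatched prompt sets before any scoring happens. The sketch below is an assumption about how such a check might look; `assert_same_prompts` and the prompt IDs are hypothetical and not part of the harness.

```python
def assert_same_prompts(prompt_ids_by_target: dict[str, list[str]]) -> None:
    """Raise ValueError if any target ran a different prompt set than the first one."""
    items = list(prompt_ids_by_target.items())
    reference_target, reference_ids = items[0]
    reference = set(reference_ids)
    for target, ids in items[1:]:
        current = set(ids)
        if current != reference:
            missing = sorted(reference - current)
            extra = sorted(current - reference)
            raise ValueError(
                f"{target} differs from {reference_target}: "
                f"missing={missing}, extra={extra}"
            )


# Hypothetical prompt IDs: both targets ran the same prompts, so this passes silently.
assert_same_prompts({
    "xds": ["faq-page", "orders-table"],
    "baseline": ["orders-table", "faq-page"],
})
```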

Evaluate with a Judge Agent

For comparative runs, use a judge agent instead of relying on per-target self-evaluation. The judge sees all targets' outputs for each prompt and scores them on the same scale. See Judge Agent Evaluation below.

When using the automated harness, pnpm universal:compare handles this. For manual runs or Navi-spawned tests, spawn a dedicated judge agent after all sub-agents complete:

Collect all result JSONs (one per target × prompt)
Group by prompt ID so each prompt has all targets' outputs together
Spawn a judge agent with the grouped results, the original prompts, and the scoring rubric
The judge writes per-prompt comparative scores and an overall analysis
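The collect-and-group steps above might look roughly like this. The field names `prompt_id` and `target` are assumptions about the result JSON, which this page does not specify; adjust them to the real schema.

```python
from collections import defaultdict


def group_by_prompt(results: list[dict]) -> dict[str, dict[str, dict]]:
    """Group result records so each prompt ID maps to {target: result}."""
    grouped: dict[str, dict[str, dict]] = defaultdict(dict)
    for result in results:
        grouped[result["prompt_id"]][result["target"]] = result
    return dict(grouped)


def incomplete_prompts(grouped: dict[str, dict[str, dict]], targets: list[str]) -> list[str]:
    """Return prompt IDs that are missing an output from at least one target."""
    expected = set(targets)
    return sorted(pid for pid, by_target in grouped.items() if set(by_target) != expected)


# Hypothetical records, as if loaded from the per-agent result JSON files.
results = [
    {"prompt_id": "faq-page", "target": "xds", "code": "..."},
    {"prompt_id": "faq-page", "target": "baseline", "code": "..."},
    {"prompt_id": "orders-table", "target": "xds", "code": "..."},
]
grouped = group_by_prompt(results)
print(incomplete_prompts(grouped, ["xds", "baseline"]))  # ['orders-table']
```

Checking for incomplete prompts before spawning the judge avoids scoring a prompt where one target has nothing to compare against.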

Aggregate, Compare, Deploy

# All-in-one: score, compare, build, deploy
pnpm report:deploy --iteration <xds-id> --baseline <baseline-id>

Or step by step:

pnpm universal --iteration <xds-id>                                # Score Astryx
pnpm universal --iteration <baseline-id>                           # Score baseline
pnpm universal:compare --astryx <xds-id> --baseline <baseline-id>    # Compare (uses judge)
pnpm report:build --iteration <xds-id> --baseline <baseline-id>   # Build HTML
pnpm report:deploy --iteration <xds-id> --baseline <baseline-id>  # Deploy to gh-pages

Reports are self-contained HTML deployed to GitHub Pages at: https://facebook.github.io/astryx/reports/<iteration-id>/

Prompt Design

Good prompts describe user experience, not components:

❌ Bad prompt	✅ Good prompt
"Create an accordion with CollapsibleGroup"	"Build a FAQ page where users can expand/collapse individual questions"
"Use Table with sorting"	"Build an admin dashboard showing recent orders with sortable columns"

Describing the experience tests whether the LLM can discover the right components from the docs; naming the component in the prompt gives the answer away.

Stratified Sampling

The prompt battery covers categories: layout, data display, forms, navigation, feedback, composition. Sampling ensures coverage across categories rather than over-testing one area.
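One simple way to get this coverage is a round-robin draw over categories. The sketch below is an assumption, not the harness's sampler: the `category` field and the function name are hypothetical, and the real implementation may weight categories differently.

```python
import random
from collections import defaultdict


def stratified_sample(prompts: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Pick up to n prompts, visiting every category once before any category repeats."""
    rng = random.Random(seed)
    by_category: dict[str, list[dict]] = defaultdict(list)
    for prompt in prompts:
        by_category[prompt["category"]].append(prompt)
    for bucket in by_category.values():
        rng.shuffle(bucket)

    picked: list[dict] = []
    # Round-robin over categories until n prompts are picked or every bucket is empty.
    while len(picked) < n and any(by_category.values()):
        for category in sorted(by_category):
            bucket = by_category[category]
            if bucket and len(picked) < n:
                picked.append(bucket.pop())
    return picked


categories = ["layout", "data display", "forms", "navigation", "feedback", "composition"]
battery = [{"id": f"{c}-{i}", "category": c} for c in categories for i in range(3)]
sample = stratified_sample(battery, 6)
print(sorted(p["category"] for p in sample))
```

With six categories and a sample of six, each category contributes exactly one prompt; the fixed seed makes the draw reproducible across targets.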

Sub-Agent Isolation

Broken sub-agent isolation is the most common cause of invalid results. Sub-agents must simulate a naive consumer who only has the skill doc, not an insider who already knows the system.

The Problem

When Navi spawns a sub-agent, the sub-agent may inherit the parent's context: SOUL.md, MEMORY.md, USER.md, conversation history, and system prompt. If the parent knows Astryx internals, the sub-agent knows them too.

This means the sub-agent isn't testing "can an LLM discover the right pattern from the doc?" — it's testing "can an LLM that already knows Astryx confirm what it knows?" Those are very different questions.

Signs Your Test Is Contaminated

Zero hallucinations across all approaches. Real naive agents hallucinate occasionally.
All approaches score nearly identically. If the agent already knows the answer, the doc naming doesn't matter.
Uniform high confidence. Real discovery involves uncertainty.
A batch spawner processed multiple approaches. Context accumulates across sequential processing.
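The first two signs are easy to check mechanically. The sketch below is illustrative only: the `hallucinations` and `score` fields and the 2-point spread threshold are assumptions, and the remaining signs (uniform confidence, batch spawning) still need a human look.

```python
def contamination_flags(runs: list[dict], score_spread_threshold: float = 2.0) -> list[str]:
    """Return warning strings for results that look suspiciously clean."""
    flags: list[str] = []
    if runs and all(run["hallucinations"] == 0 for run in runs):
        flags.append("zero hallucinations across all approaches")
    scores = [run["score"] for run in runs]
    if len(scores) > 1 and max(scores) - min(scores) <= score_spread_threshold:
        flags.append("all approaches score nearly identically")
    return flags


# Hypothetical summary of one comparative run.
suspicious = [
    {"approach": "A", "hallucinations": 0, "score": 91},
    {"approach": "B", "hallucinations": 0, "score": 92},
]
print(contamination_flags(suspicious))
```

A flag is a prompt to inspect the spawn setup, not proof of contamination.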

How to Ensure Isolation

Spawn each agent directly from the parent using spawn_agent. Don't nest.
Make the task self-contained. Include everything: skill doc path, prompt, output path, JSON schema.
Constrain explicitly. "This is your ONLY reference. Use ONLY what's documented there. Do NOT use prior knowledge of any design system."
One agent per (prompt × target). Never batch approaches in one agent.

Template

You are running a vibe test for a design system component library.

You have NO prior knowledge of any design system. Do NOT use prior knowledge
of Astryx, Tailwind, Material Design, Shadcn, Radix, or any other library.

## Skill Doc
Read the reference at: {path}
This is your ONLY reference. Use ONLY what's documented there.

## Prompt
{prompt}

## Output
Write result to: {result_path}
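Filling the template once per (prompt × target) pair keeps every task self-contained. The sketch below assumes a hypothetical `build_tasks` helper, a placeholder skill doc path for the Astryx target, and an invented output layout; none of these come from the harness.

```python
from itertools import product

TEMPLATE = """You are running a vibe test for a design system component library.

You have NO prior knowledge of any design system. Do NOT use prior knowledge
of Astryx, Tailwind, Material Design, Shadcn, Radix, or any other library.

## Skill Doc
Read the reference at: {path}
This is your ONLY reference. Use ONLY what's documented there.

## Prompt
{prompt}

## Output
Write result to: {result_path}
"""


def build_tasks(prompts: dict[str, str], skill_docs: dict[str, str], out_dir: str) -> list[dict]:
    """Return one self-contained task per (prompt, target) pair, never a batch."""
    tasks = []
    for (prompt_id, prompt), (target, doc_path) in product(prompts.items(), skill_docs.items()):
        result_path = f"{out_dir}/{target}/{prompt_id}.json"  # hypothetical layout
        tasks.append({
            "prompt_id": prompt_id,
            "target": target,
            "task": TEMPLATE.format(path=doc_path, prompt=prompt, result_path=result_path),
        })
    return tasks


tasks = build_tasks(
    prompts={"faq-page": "Build a FAQ page where users can expand/collapse individual questions"},
    skill_docs={"xds": "path/to/xds-skill.md", "baseline": ".baseline-docs/"},  # first path is a placeholder
    out_dir="results/example-iteration",
)
print(len(tasks))  # 2
```

Each task string would then be handed to its own directly spawned agent.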

Case Study: Token Comparison Bias (March 2026)

A token discoverability test compared 4 systems across 10 prompts. The first run used Navi sub-agents via spawn_agent. Astryx scored highest with zero hallucinations — but the Navi instance had MEMORY.md containing detailed Astryx token knowledge. The agent wasn't discovering tokens from the doc — it was confirming what it already knew.

Indicators it was contaminated:

Zero hallucinations for the "home" system — suspiciously perfect
Agent used --color-divider-emphasized for input borders on first try (non-obvious convention from MEMORY.md)
Tailwind agent used 136 valid classes not in the provided doc — prior knowledge
All agents uniformly confident with no hedging

Fix: Re-run with blank-slate agents (direct API calls or explicit anti-contamination prompts).

Judge Agent Evaluation

Self-evaluation is unreliable for comparative tests. Each sub-agent only sees its own output — it can't compare approaches and has natural self-scoring bias.

The Pattern

Split evaluation into two roles:

Sub-agents: generate + self-report methodology (no scores)

The code solution
What approach they took and why
What documentation they relied on
What they were uncertain about
Escape hatches or workarounds used

Judge agent: comparative evaluation (scores + analysis)

Receives all targets' outputs for each prompt side-by-side
Has the original prompt text and evaluation rubric
Scores all conditions on the same scale
Writes qualitative comparison per prompt
Produces aggregate analysis

Why This Works

The judge has:

Cross-condition visibility — sees A and B together, can compare ergonomics directly
Calibrated scoring — a 4 for A and 5 for B are relative to each other, not absolute

Judge Agent Template

You are evaluating outputs from a vibe test comparing {N} targets.

For each prompt, you'll receive:
- The original user prompt
- Generated code from each target
- Self-reported methodology notes from each agent

Evaluate EACH target on:
- Correctness (1-5): Valid API usage, no hallucinations
- Accessibility (1-5): Labels, semantics, keyboard
- Code Quality (1-5): Complexity, patterns, readability
- Efficiency (1-5): Decisions/element, DRYness
- Maintainability (1-5): Tokens vs raw values, dark mode

For each prompt, write:
- Scores for each target
- Qualitative comparison: what did each do differently and why
- Which target produced better code and why

After all prompts: overall recommendation with evidence.
Output JSON to: {result_path}
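Once the judge has written its JSON, a short script can validate the 1-5 scores and average them per target. The JSON shape used here (a `scores` object keyed by target, then by dimension) is an assumption, since the template above does not fix a schema.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("correctness", "accessibility", "code_quality", "efficiency", "maintainability")


def validate_scores(scores: dict[str, int]) -> None:
    """Raise ValueError unless every dimension has an integer score from 1 to 5."""
    for dimension in DIMENSIONS:
        value = scores.get(dimension)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dimension}: expected an integer 1-5, got {value!r}")


def mean_by_target(judgements: list[dict]) -> dict[str, float]:
    """Average each target's dimension scores across all judged prompts."""
    collected: dict[str, list[int]] = defaultdict(list)
    for judgement in judgements:
        for target, scores in judgement["scores"].items():
            validate_scores(scores)
            collected[target].extend(scores[d] for d in DIMENSIONS)
    return {target: mean(values) for target, values in collected.items()}


# Hypothetical judge output for a single prompt.
judgements = [{
    "prompt_id": "faq-page",
    "scores": {
        "xds": dict.fromkeys(DIMENSIONS, 5),
        "baseline": dict.fromkeys(DIMENSIONS, 4),
    },
}]
print(mean_by_target(judgements))
```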

When to Use a Judge vs Self-Eval

Scenario	Method
Single-target evaluation (just Astryx)	Self-eval is fine
Astryx vs baseline comparative	Judge required
Nightly trend tracking	Self-eval acceptable (same target, tracking over time)
Any cross-target comparison	Judge required

Interpreting Results

What "winning" looks like

Astryx vs baseline: Astryx should win on accessibility and maintainability (that's the value proposition). If baseline wins on correctness, the API or docs have a problem.
Escape hatches: Zero is ideal. Each one signals a gap.
Decisions/element: Lower means the system is doing more work for the consumer.

What to watch for

Token cost: Astryx docs are larger than baseline (~2.5KB vs ~600B per component). If the token cost is more than 2x higher, consider making the docs more concise.
Hallucinated props: If LLMs consistently invent props that don't exist, the API might not match expectations. Consider whether those hallucinated props should exist.
Naming confusion: If LLMs confuse Astryx components with similarly-named ones from other systems, consider renaming.

Trend Tracking

Results are posted as GitHub issues with the vibe-test label. Track trends across runs:

Date	Astryx	Baseline	HTML	Notes
Feb 23	92	88	—	Astryx wins 6/10
Feb 24	90	90	—	First tie. Astryx +7 accessibility, baseline +efficiency
Feb 26	90	91	—	Baseline edges ahead. StyleX overhead drags efficiency
Feb 28	94	92	77	Astryx bounces back. HTML added as third target

The trend matters more than any single run. If baseline is catching up, investigate — stale docs? Growing complexity? Wrong defaults?
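The trend can be read off the table as the Astryx-minus-baseline margin per run. The sketch below uses the four rows above; the "shrinking on every step within the last three runs" rule is an assumed heuristic, not a documented threshold.

```python
# (date, Astryx score, baseline score) copied from the table above.
runs = [
    ("Feb 23", 92, 88),
    ("Feb 24", 90, 90),
    ("Feb 26", 90, 91),
    ("Feb 28", 94, 92),
]


def margins(runs: list[tuple[str, int, int]]) -> list[tuple[str, int]]:
    """Return (date, Astryx minus baseline) for each run."""
    return [(date, astryx - baseline) for date, astryx, baseline in runs]


def baseline_closing_in(runs: list[tuple[str, int, int]], window: int = 3) -> bool:
    """True if the margin shrank on every step within the last `window` runs."""
    recent = [margin for _, margin in margins(runs)[-window:]]
    return len(recent) >= 2 and all(
        later < earlier for earlier, later in zip(recent, recent[1:])
    )


print(margins(runs))                  # [('Feb 23', 4), ('Feb 24', 0), ('Feb 26', -1), ('Feb 28', 2)]
print(baseline_closing_in(runs[:3]))  # True: 4 -> 0 -> -1
print(baseline_closing_in(runs))      # False: the Feb 28 run recovered
```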

Related

API Arbitration: Using vibe tests to make API design decisions (ad-hoc, per-component)
Agent Init Prompt Vibe Testing: Testing the CLI init prompt specifically
Component Lifecycle: Where evaluation fits in the overall loop
Contributing with AI Assistants: How contributors encounter vibe testing
