Vibe Evaluation

Cindy Zhang edited this page Jun 23, 2026 · 1 revision

Nightly and periodic evaluations that measure how well Astryx performs compared to alternatives. This is the benchmarking system — the ongoing scorecard that tells us whether the design system is getting better or worse over time.

For using vibe tests to make API decisions (choosing between API shapes, resolving naming disputes), see API Arbitration.


Why

Design systems exist to make building UIs faster and more consistent. In an AI-assisted world, that means the system needs to work well not just for humans reading docs, but for LLMs generating code from those docs.

The nightly evaluation answers: is Astryx measurably better than the alternatives? If baseline (shadcn/Tailwind) consistently outscores Astryx, something is regressing — stale docs, growing API complexity, or conventions that fight mental models.

This is the system's health check, not a design tool. For using vibe tests as a design tool, see API Arbitration.


What Gets Measured

Six Dimensions

| Dimension | What it measures | Why it matters |
| --- | --- | --- |
| Correctness | Valid component usage, no hallucinated props or imports | The baseline: does it work at all? |
| Accessibility | Labels, semantics, keyboard support, ARIA attributes | Design systems should make a11y the default path |
| Code Quality | Complexity, patterns, TypeScript usage, readability | Does the API lead to clean consumer code? |
| Efficiency | Decisions per element, DRYness, conciseness | Fewer decisions = fewer chances to get it wrong |
| Maintainability | Semantic tokens vs magic values, dark mode support, locality | Will this code survive a theme change? |
| Design | Visual fidelity vs ideal reference images (layout, hierarchy, spacing, component fidelity, color/theming) | Does the output look right? Optional: only scored when ideal images exist and a vision LLM is available. |

Key Metrics

  • Decisions/element — How many styling decisions the LLM had to make per UI element. Astryx typically scores ~1.7 vs ~2.3 for raw Tailwind. Lower is better — it means the system is absorbing complexity.
  • Semantic ratio — What percentage of styling uses semantic tokens vs raw values. Higher means more maintainable, more themeable.
  • Escape hatches — Did the LLM drop out of the design system to raw CSS/HTML? Where and why?
  • Dimension scores — Each of the 6 dimensions is scored 0-100. Overall score is the average across available dimensions.
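
As a rough illustration of how these metrics fit together, the sketch below computes decisions/element, the semantic ratio, and an overall score that averages only the dimensions that were actually scored. The function names and input shapes are assumptions made for this example, not the harness's real API.

```python
from statistics import mean
from typing import Optional

DIMENSIONS = ["correctness", "accessibility", "code_quality",
              "efficiency", "maintainability", "design"]


def decisions_per_element(styling_decisions: int, ui_elements: int) -> float:
    """Average number of styling decisions per UI element (lower is better)."""
    if ui_elements <= 0:
        raise ValueError("ui_elements must be positive")
    return styling_decisions / ui_elements


def semantic_ratio(semantic_uses: int, raw_uses: int) -> float:
    """Share of styling values that come from semantic tokens (0.0 to 1.0)."""
    total = semantic_uses + raw_uses
    return semantic_uses / total if total else 0.0


def overall_score(scores: dict[str, Optional[float]]) -> float:
    """Mean of the dimensions that were actually scored (0-100 each).

    A dimension set to None (for example Design when no ideal images
    exist) is skipped rather than counted as zero.
    """
    available = [scores[d] for d in DIMENSIONS if scores.get(d) is not None]
    if not available:
        raise ValueError("no dimensions were scored")
    return mean(available)
```

Skipping unscored dimensions, rather than treating them as zero, keeps runs with and without a Design score comparable on the remaining dimensions.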

Targets

| Target | What it uses | Docs provided |
| --- | --- | --- |
| Astryx | @xds/core components + StyleX | Auto-generated from source via CLI |
| Baseline | shadcn/ui + Tailwind CSS | .baseline-docs/ |
| HTML | Raw HTML + CSS | None (just HTML/CSS knowledge) |
| Astryx + Tailwind | @xds/core + Tailwind (no StyleX) | .generated/xds-tailwind-skill.md |

Running the Nightly Evaluation

Quick Start

The simplest way is through the /vibe-test command in Claude Code, or by asking Navi:

/vibe-test 5              # 5 stratified sample prompts
"Run a vibe test with 5 samples"

This generates the skill doc from source, spawns parallel sub-agents, self-evaluates, and aggregates. Results land in results/<iteration>/.

Full Comparative Run

cd internal/vibe-tests

# 1. Run Astryx — note the iteration ID and prompt IDs
pnpm interactive --target xds --persona naive --sample 10

# 2. Run baseline with the SAME prompts (critical!)
pnpm interactive --target baseline --persona naive --prompts <comma-separated-ids>

# 3. (Optional) Run raw HTML with the same prompts
pnpm interactive --target html --persona naive --prompts <ids>

Always reuse the same prompt IDs across targets. Without this, you're comparing different prompts and the results are noise.
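
One way to guard against mismatched prompts is to check the prompt IDs recorded for each target before comparing anything. The sketch below assumes you have already collected the IDs per target yourself; the helper names and example IDs are hypothetical.

```python
def check_same_prompts(prompt_ids_by_target: dict[str, list[str]]) -> list[str]:
    """Return the shared prompt IDs, or raise if any target used a different set."""
    targets = list(prompt_ids_by_target)
    if not targets:
        raise ValueError("no targets given")
    reference = set(prompt_ids_by_target[targets[0]])
    for target in targets[1:]:
        ids = set(prompt_ids_by_target[target])
        if ids != reference:
            diff = sorted(ids ^ reference)
            raise ValueError(f"{target} differs from {targets[0]} on prompts: {diff}")
    return sorted(reference)


def prompts_flag(prompt_ids: list[str]) -> str:
    """Format IDs as the comma-separated value passed to --prompts."""
    return ",".join(prompt_ids)


# Hypothetical prompt IDs, for illustration only.
shared = check_same_prompts({
    "xds": ["faq-page", "orders-dashboard"],
    "baseline": ["orders-dashboard", "faq-page"],
})
print(prompts_flag(shared))
```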

Evaluate with a Judge Agent

For comparative runs, use a judge agent instead of relying on per-target self-evaluation. The judge sees all targets' outputs for each prompt and scores them on the same scale. See Judge Agent Evaluation.

When using the automated harness, pnpm universal:compare handles this. For manual runs or Navi-spawned tests, spawn a dedicated judge agent after all sub-agents complete:

  1. Collect all result JSONs (one per target × prompt)
  2. Group by prompt ID so each prompt has all targets' outputs together
  3. Spawn a judge agent with the grouped results, the original prompts, and the scoring rubric
  4. The judge writes per-prompt comparative scores and an overall analysis
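
Steps 1 and 2 above could look roughly like the following. This is a minimal sketch that assumes each result JSON carries "prompt_id" and "target" keys; the real result schema may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_results_by_prompt(results_dir: str) -> dict[str, dict[str, dict]]:
    """Load every result JSON under results_dir and group it as
    {prompt_id: {target: result}} so the judge sees all targets side by side.

    Assumes (hypothetically) that each file has "prompt_id" and "target" keys.
    """
    grouped: dict[str, dict[str, dict]] = defaultdict(dict)
    for path in sorted(Path(results_dir).rglob("*.json")):
        result = json.loads(path.read_text(encoding="utf-8"))
        grouped[result["prompt_id"]][result["target"]] = result
    return dict(grouped)


def missing_targets(grouped: dict[str, dict[str, dict]],
                    expected_targets: list[str]) -> list[tuple[str, str]]:
    """List (prompt_id, target) pairs that have no result yet."""
    return [(pid, t)
            for pid, by_target in sorted(grouped.items())
            for t in expected_targets
            if t not in by_target]
```

Checking `missing_targets` before spawning the judge avoids scoring a prompt for which one target never produced output.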

Aggregate, Compare, Deploy

# All-in-one: score, compare, build, deploy
pnpm report:deploy --iteration <xds-id> --baseline <baseline-id>

Or step by step:

pnpm universal --iteration <xds-id>                                # Score Astryx
pnpm universal --iteration <baseline-id>                           # Score baseline
pnpm universal:compare --astryx <xds-id> --baseline <baseline-id>    # Compare (uses judge)
pnpm report:build --iteration <xds-id> --baseline <baseline-id>   # Build HTML
pnpm report:deploy --iteration <xds-id> --baseline <baseline-id>  # Deploy to gh-pages

Reports are self-contained HTML deployed to GitHub Pages at: https://facebook.github.io/astryx/reports/<iteration-id>/


Prompt Design

Good prompts describe user experience, not components:

| ❌ Bad prompt | ✅ Good prompt |
| --- | --- |
| "Create an accordion with CollapsibleGroup" | "Build a FAQ page where users can expand/collapse individual questions" |
| "Use Table with sorting" | "Build an admin dashboard showing recent orders with sortable columns" |

Naming the component in the prompt gives the answer away. Describing the experience instead tests whether the LLM can discover the right components from the docs.

Stratified Sampling

The prompt battery covers categories: layout, data display, forms, navigation, feedback, composition. Sampling ensures coverage across categories rather than over-testing one area.
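
A simple way to get this kind of coverage is a round-robin draw across categories. The sketch below is one possible approach, not the harness's actual sampler; the prompt dict shape ("id", "category") is an assumption.

```python
import random
from collections import defaultdict


def stratified_sample(prompts: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Pick n prompts, cycling through categories so each is covered
    before any category is drawn from twice.

    Each prompt is assumed to be a dict with "id" and "category" keys.
    """
    rng = random.Random(seed)
    by_category: dict[str, list[dict]] = defaultdict(list)
    for prompt in prompts:
        by_category[prompt["category"]].append(prompt)
    for bucket in by_category.values():
        rng.shuffle(bucket)

    categories = sorted(by_category)
    rng.shuffle(categories)
    sample: list[dict] = []
    # Keep making passes over the categories until we have n prompts
    # or every bucket is empty.
    while len(sample) < n and any(by_category.values()):
        for category in categories:
            if len(sample) == n:
                break
            if by_category[category]:
                sample.append(by_category[category].pop())
    return sample
```

With six categories, a sample of 5 touches five different categories, and a sample of 10 covers all six before doubling up on any.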


Sub-Agent Isolation

Broken isolation is the most common way to get invalid results. Sub-agents must simulate a naive consumer who only has the skill doc, not an insider who already knows the system.

The Problem

When Navi spawns a sub-agent, the sub-agent may inherit the parent's context: SOUL.md, MEMORY.md, USER.md, conversation history, and system prompt. If the parent knows Astryx internals, the sub-agent knows them too.

This means the sub-agent isn't testing "can an LLM discover the right pattern from the doc?" — it's testing "can an LLM that already knows Astryx confirm what it knows?" Those are very different questions.

Signs Your Test Is Contaminated

  • Zero hallucinations across all approaches. Real naive agents hallucinate occasionally.
  • All approaches score nearly identically. If the agent already knows the answer, the doc naming doesn't matter.
  • Uniform high confidence. Real discovery involves uncertainty.
  • A batch spawner processed multiple approaches. Context accumulates across sequential processing.
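
The first three signs can be checked mechanically once results are in. The following is a heuristic sketch only: the run summary fields and the spread threshold are assumptions chosen for illustration, and the fourth sign (batch spawning) has to be checked in how the run was launched, not in its output.

```python
from statistics import pstdev


def contamination_warnings(runs: list[dict],
                           score_spread_threshold: float = 1.0) -> list[str]:
    """Return human-readable warnings for a set of per-approach run summaries.

    Each run is assumed to look like:
      {"approach": str, "score": float, "hallucinations": int, "uncertainties": int}
    The threshold is an arbitrary illustrative default, not a calibrated value.
    """
    warnings: list[str] = []
    if runs and all(r["hallucinations"] == 0 for r in runs):
        warnings.append("zero hallucinations across all approaches")
    scores = [r["score"] for r in runs]
    if len(scores) > 1 and pstdev(scores) < score_spread_threshold:
        warnings.append("all approaches score nearly identically")
    if runs and all(r["uncertainties"] == 0 for r in runs):
        warnings.append("uniform high confidence (no reported uncertainty)")
    return warnings
```

A warning is a prompt to re-check isolation, not proof of contamination.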

How to Ensure Isolation

  1. Spawn each agent directly from the parent using spawn_agent. Don't nest.
  2. Make the task self-contained. Include everything: skill doc path, prompt, output path, JSON schema.
  3. Constrain explicitly. "This is your ONLY reference. Use ONLY what's documented there. Do NOT use prior knowledge of any design system."
  4. One agent per (prompt × target). Never batch approaches in one agent.

Template

You are running a vibe test for a design system component library.

You have NO prior knowledge of any design system. Do NOT use prior knowledge
of Astryx, Tailwind, Material Design, Shadcn, Radix, or any other library.

## Skill Doc
Read the reference at: {path}
This is your ONLY reference. Use ONLY what's documented there.

## Prompt
{prompt}

## Output
Write result to: {result_path}
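
To keep one agent per (prompt × target), the template can be filled once per pair before spawning. This is a minimal sketch: the result path layout and the dict shape of each task are hypothetical, and the actual spawn_agent call is left out because its signature is not documented here.

```python
from itertools import product

TEMPLATE = """\
You are running a vibe test for a design system component library.

You have NO prior knowledge of any design system. Do NOT use prior knowledge
of Astryx, Tailwind, Material Design, Shadcn, Radix, or any other library.

## Skill Doc
Read the reference at: {path}
This is your ONLY reference. Use ONLY what's documented there.

## Prompt
{prompt}

## Output
Write result to: {result_path}
"""


def build_tasks(prompts: dict[str, str], skill_docs: dict[str, str],
                out_dir: str) -> list[dict]:
    """Build one self-contained task per (prompt x target) pair.

    prompts maps prompt ID -> prompt text; skill_docs maps target -> doc path.
    The result path layout is a hypothetical convention for this sketch.
    """
    tasks = []
    for (prompt_id, prompt), (target, path) in product(prompts.items(),
                                                       skill_docs.items()):
        result_path = f"{out_dir}/{target}/{prompt_id}.json"
        tasks.append({
            "prompt_id": prompt_id,
            "target": target,
            "task": TEMPLATE.format(path=path, prompt=prompt,
                                    result_path=result_path),
        })
    return tasks
```

Each returned task string is meant to go to its own directly spawned agent, so no agent ever sees another pair's prompt or output.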

Case Study: Token Comparison Bias (March 2026)

A token discoverability test compared 4 systems across 10 prompts. The first run used Navi sub-agents via spawn_agent. Astryx scored highest with zero hallucinations — but the Navi instance had MEMORY.md containing detailed Astryx token knowledge. The agent wasn't discovering tokens from the doc — it was confirming what it already knew.

Indicators it was contaminated:

  • Zero hallucinations for the "home" system — suspiciously perfect
  • Agent used --color-divider-emphasized for input borders on first try (non-obvious convention from MEMORY.md)
  • Tailwind agent used 136 valid classes not in the provided doc — prior knowledge
  • All agents uniformly confident with no hedging

Fix: Re-run with blank-slate agents (direct API calls or explicit anti-contamination prompts).


Judge Agent Evaluation

Self-evaluation is unreliable for comparative tests. Each sub-agent only sees its own output — it can't compare approaches and has natural self-scoring bias.

The Pattern

Split evaluation into two roles:

Sub-agents: generate + self-report methodology (no scores)

  • The code solution
  • What approach they took and why
  • What documentation they relied on
  • What they were uncertain about
  • Escape hatches or workarounds used

Judge agent: comparative evaluation (scores + analysis)

  • Receives all targets' outputs for each prompt side-by-side
  • Has the original prompt text and evaluation rubric
  • Scores all conditions on the same scale
  • Writes qualitative comparison per prompt
  • Produces aggregate analysis

Why This Works

The judge has:

  1. Cross-condition visibility — sees A and B together, can compare ergonomics directly
  2. Calibrated scoring — a 4 for A and 5 for B are relative to each other, not absolute

Judge Agent Template

You are evaluating outputs from a vibe test comparing {N} targets.

For each prompt, you'll receive:
- The original user prompt
- Generated code from each target
- Self-reported methodology notes from each agent

Evaluate EACH target on:
- Correctness (1-5): Valid API usage, no hallucinations
- Accessibility (1-5): Labels, semantics, keyboard
- Code Quality (1-5): Complexity, patterns, readability
- Efficiency (1-5): Decisions/element, DRYness
- Maintainability (1-5): Tokens vs raw values, dark mode

For each prompt, write:
- Scores for each target
- Qualitative comparison: what did each do differently and why
- Which target produced better code and why

After all prompts: overall recommendation with evidence.
Output JSON to: {result_path}
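
Once the judge has written its output, the per-prompt scores can be validated and averaged per target. The sketch below assumes a hypothetical output shape (a list of entries with "prompt_id" and a "scores" mapping of target to dimension scores); the template above does not pin down the JSON schema.

```python
from collections import defaultdict
from statistics import mean

JUDGE_DIMENSIONS = ("correctness", "accessibility", "code_quality",
                    "efficiency", "maintainability")


def summarize_judge_output(per_prompt: list[dict]) -> dict[str, dict[str, float]]:
    """Validate 1-5 scores and return {target: {dimension: mean score}}.

    Assumes each entry looks like:
      {"prompt_id": str, "scores": {target: {dimension: int}}}
    """
    collected = defaultdict(lambda: defaultdict(list))
    for entry in per_prompt:
        for target, scores in entry["scores"].items():
            for dim in JUDGE_DIMENSIONS:
                value = scores[dim]
                if not 1 <= value <= 5:
                    raise ValueError(
                        f"{entry['prompt_id']}/{target}/{dim}: {value} is outside 1-5")
                collected[target][dim].append(value)
    return {target: {dim: mean(values) for dim, values in dims.items()}
            for target, dims in collected.items()}
```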

When to Use a Judge vs Self-Eval

| Scenario | Method |
| --- | --- |
| Single-target evaluation (just Astryx) | Self-eval is fine |
| Astryx vs baseline comparative | Judge required |
| Nightly trend tracking | Self-eval acceptable (same target, tracking over time) |
| Any cross-target comparison | Judge required |

Interpreting Results

What "winning" looks like

  • Astryx vs baseline: Astryx should win on accessibility and maintainability (that's the value proposition). If baseline wins on correctness, the API or docs have a problem.
  • Escape hatches: Zero is ideal. Each one signals a gap.
  • Decisions/element: Lower means the system is doing more work for the consumer.

What to watch for

  • Token cost: Astryx docs are larger than baseline (~2.5KB vs ~600B per component). If Astryx's token cost is 2x or more that of baseline, consider making the docs more concise.
  • Hallucinated props: If LLMs consistently invent props that don't exist, the API might not match expectations. Consider whether those hallucinated props should exist.
  • Naming confusion: If LLMs confuse Astryx components with similarly-named ones from other systems, consider renaming.

Trend Tracking

Results are posted as GitHub issues with the vibe-test label. Track trends across runs:

| Date | Astryx | Baseline | HTML | Notes |
| --- | --- | --- | --- | --- |
| Feb 23 | 92 | 88 | - | Astryx wins 6/10 |
| Feb 24 | 90 | 90 | - | First tie. Astryx +7 accessibility, baseline +efficiency |
| Feb 26 | 90 | 91 | - | Baseline edges ahead. StyleX overhead drags efficiency |
| Feb 28 | 94 | 92 | 77 | Astryx bounces back. HTML added as third target |

The trend matters more than any single run. If baseline is catching up, investigate — stale docs? Growing complexity? Wrong defaults?
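
To make "catching up" concrete, the margin (Astryx minus baseline) can be tracked per run. The sketch below uses the overall scores from the trend table; the "shrinking on every step" rule and the window size are illustrative assumptions, not an established alert threshold.

```python
# Overall scores copied from the trend table: (date, Astryx, baseline).
RUNS = [
    ("Feb 23", 92, 88),
    ("Feb 24", 90, 90),
    ("Feb 26", 90, 91),
    ("Feb 28", 94, 92),
]


def margins(runs: list[tuple[str, int, int]]) -> list[tuple[str, int]]:
    """Astryx minus baseline for each run, oldest first."""
    return [(date, astryx - baseline) for date, astryx, baseline in runs]


def baseline_catching_up(runs: list[tuple[str, int, int]], window: int = 3) -> bool:
    """True if the margin shrank on every step within the last `window` runs."""
    recent = [margin for _, margin in margins(runs)][-window:]
    return len(recent) >= 2 and all(b < a for a, b in zip(recent, recent[1:]))


print(margins(RUNS))
print(baseline_catching_up(RUNS[:3]), baseline_catching_up(RUNS))
```

On this data the check fires after the Feb 26 run, where the margin fell from +4 to 0 to -1, and clears again once the Feb 28 run restores a +2 lead.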

