Skill test harness with autonomous prompt optimization.
xvivo tests Claude skill files (.md prompts + Python scripts) the same way pytest tests code — with two modes: fast pass/fail testing and an autonomous optimization loop that iteratively improves skill quality.
# Install
npm install
# Run tests
xvivo run --dir ./skills
# Optimize a skill section
xvivo optimize --skill fishbone-diagram --section cause-identification --spec fishbone-diagram.optimize.yaml
# Train ML evaluators (optional)
npm run prepare:ml
xvivo train all --skill fishbone-diagram
# Check model health
xvivo models check --skill fishbone-diagramDiscovers .test.yaml specs, runs them against skill files, reports results. Supports two execution modes per test:
- Agent-driven — sends the skill prompt + test input to the Claude Agent SDK, evaluates the output
- Direct — runs a Python script with fixture inputs, evaluates stdout and output files
Designed for CI/CD. Exits 0 on all pass, 1 on any failure.
Implements the autoresearch pattern for skill improvement:
- Score the skill section against 25 boolean criteria (baseline)
- An agent applies a mutation (add constraint, tighten language, add example, etc.)
- Git commit the change
- Re-score against the same 25 criteria
- If improved → keep. If same or worse →
git revert - Repeat
Runs indefinitely or until a stopping condition is met (target score, plateau, cost limit).
| Class | Engine | Speed | Deterministic | Cost |
|---|---|---|---|---|
| A | TypeScript (contains, regex, schema) | <1ms | Yes | Free |
| B | Claude API (LLM-as-judge) | 1-5s | No | API credits |
| C | PyTorch (embeddings, classifiers) | 10-100ms | Yes (once trained) | Free (local) |
Class C evaluators are optional — the harness degrades gracefully to Class B if PyTorch is not installed.
src/
cli/ Ink TUI components and Pastel CLI commands
runner/ Test discovery, orchestration, result collection
evaluators/ Class A, B, C evaluator implementations
optimizer/ Karpathy loop: mutation, keep/revert, logging
sandbox/ Temp directory lifecycle for script execution
parsers/ Markdown section parser, YAML spec loader
types/ Zod schemas and TypeScript type definitions
scripts/ml/ Python ML evaluator scripts
models/ Trained model checkpoints (gitignored except metadata)
tests/ Unit and integration tests
docs/ Requirements and design documents
.claude/ References for Claude Code development sessions
- Node.js 20+
- Python 3.10+ (for skill scripts and ML evaluators)
ANTHROPIC_API_KEYenvironment variable (for Class B evaluators and agent-driven tests)- PyTorch + sentence-transformers (optional, for Class C evaluators)
- ML Integration Requirements — full spec for the ML evaluation pipeline
.claude/references/— best practices per technology for development