xvivo

Skill test harness with autonomous prompt optimization.

xvivo tests Claude skill files (.md prompts plus Python scripts) the way pytest tests code. It has two modes: fast pass/fail testing, and an autonomous optimization loop that iteratively improves skill quality.

Quick start

# Install
npm install

# Run tests
xvivo run --dir ./skills

# Optimize a skill section
xvivo optimize --skill fishbone-diagram --section cause-identification --spec fishbone-diagram.optimize.yaml

# Train ML evaluators (optional)
npm run prepare:ml
xvivo train all --skill fishbone-diagram

# Check model health
xvivo models check --skill fishbone-diagram

Two modes

xvivo run — fast pass/fail

Discovers .test.yaml specs, runs them against skill files, and reports results. Each test uses one of two execution modes:

  • Agent-driven — sends the skill prompt + test input to the Claude Agent SDK, evaluates the output
  • Direct — runs a Python script with fixture inputs, evaluates stdout and output files

Designed for CI/CD. Exits 0 on all pass, 1 on any failure.
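
The exit-code rule can be sketched in a few lines. This is only an illustration, not xvivo's actual code: the `TestResult` shape and the `exitCodeFor` helper are hypothetical names introduced here.

```typescript
// Hypothetical result shape; xvivo's real types may differ.
interface TestResult {
  name: string;
  passed: boolean;
}

// Map a batch of results to an exit code: 0 only if every test passed.
function exitCodeFor(results: TestResult[]): 0 | 1 {
  return results.every((r) => r.passed) ? 0 : 1;
}

const results: TestResult[] = [
  { name: "agent-driven: cause identification", passed: true },
  { name: "direct: render script", passed: false },
];

// Names of failing tests, as a CI log might list them.
const failedNames = results.filter((r) => !r.passed).map((r) => r.name);
const code = exitCodeFor(results);
```

In a Node.js CLI, a value like `code` would typically be assigned to `process.exitCode` so that a CI step fails whenever any test fails.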

xvivo optimize — Karpathy loop

Implements the autoresearch pattern for skill improvement:

  1. Score the skill section against 25 boolean criteria (baseline)
  2. An agent applies a mutation (add constraint, tighten language, add example, etc.)
  3. Git commit the change
  4. Re-score against the same 25 criteria
  5. If improved → keep. If same or worse → git revert
  6. Repeat

Runs indefinitely or until a stopping condition is met (target score, plateau, cost limit).
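
The keep/revert control flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness's implementation: `optimize`, `LoopDeps`, `LoopOptions`, and the option names are all hypothetical, and scoring, mutation, and git operations are replaced by in-memory stand-ins. A `maxIterations` cap is added only so the sketch always terminates.

```typescript
// All names below are hypothetical; they only illustrate the control flow.
type StopReason = "target" | "plateau" | "cost" | "max-iterations";

interface LoopOptions {
  targetScore: number;   // stop once this many criteria pass
  plateauLimit: number;  // stop after this many consecutive non-improving mutations
  costLimit: number;     // stop once cumulative mutation cost reaches this budget
  maxIterations: number; // safety cap for the sketch
}

interface LoopDeps {
  score: () => number;  // number of passing boolean criteria (0 to 25)
  mutate: () => number; // apply and commit one mutation, return its cost
  revert: () => void;   // undo the last committed mutation
}

interface LoopResult { best: number; kept: number; reverted: number; reason: StopReason }

function optimize(deps: LoopDeps, opts: LoopOptions): LoopResult {
  let best = deps.score(); // step 1: baseline
  let kept = 0, reverted = 0, stale = 0, spent = 0;
  for (let i = 0; i < opts.maxIterations; i++) {
    if (best >= opts.targetScore) return { best, kept, reverted, reason: "target" };
    if (stale >= opts.plateauLimit) return { best, kept, reverted, reason: "plateau" };
    if (spent >= opts.costLimit) return { best, kept, reverted, reason: "cost" };
    spent += deps.mutate();      // steps 2-3: mutate and commit
    const next = deps.score();   // step 4: re-score
    if (next > best) {           // step 5: keep only strict improvements
      best = next; kept++; stale = 0;
    } else {
      deps.revert(); reverted++; stale++;
    }
  }
  return { best, kept, reverted, reason: "max-iterations" };
}

// In-memory stand-in for a skill section: each mutation shifts the score by a fixed delta.
const deltas = [2, -1, 0, 3, 1];
const history: number[] = [];
let current = 18;
let step = 0;

const result = optimize(
  {
    score: () => current,
    mutate: () => {
      history.push(current);
      current = Math.min(25, current + deltas[step++ % deltas.length]);
      return 1;
    },
    revert: () => { current = history.pop() ?? current; },
  },
  { targetScore: 24, plateauLimit: 3, costLimit: 10, maxIterations: 50 },
);
```

With these stand-ins the loop keeps the three improving mutations, reverts the two that do not raise the score, and stops once the target of 24 is reached. Requiring a strict improvement is what makes "same" count as a revert, which keeps neutral changes from accumulating in the skill file.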

Three evaluator classes

Class  Engine                                Speed     Deterministic       Cost
A      TypeScript (contains, regex, schema)  <1ms      Yes                 Free
B      Claude API (LLM-as-judge)             1-5s      No                  API credits
C      PyTorch (embeddings, classifiers)     10-100ms  Yes (once trained)  Free (local)

Class C evaluators are optional: if PyTorch is not installed, the harness degrades gracefully to Class B.
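
To make the evaluator classes concrete, here is a rough sketch of what deterministic Class A checks and the Class C to Class B fallback might look like. The `Verdict` and `Evaluator` types and the `contains`, `matches`, and `resolveClass` helpers are assumptions made for illustration; the real evaluator interfaces in src/evaluators/ may look quite different.

```typescript
// Hypothetical shapes; not taken from the xvivo source.
type Verdict = { pass: boolean; detail: string };
type Evaluator = (output: string) => Verdict;

// Class A style check: plain substring match.
const contains = (needle: string): Evaluator => (output) => ({
  pass: output.includes(needle),
  detail: `contains "${needle}"`,
});

// Class A style check: regular expression match.
// Avoid the g flag here: RegExp.test is stateful for global patterns.
const matches = (pattern: RegExp): Evaluator => (output) => ({
  pass: pattern.test(output),
  detail: `matches ${pattern}`,
});

type EvaluatorClass = "A" | "B" | "C";

// Fall back from Class C to Class B when PyTorch is unavailable.
function resolveClass(requested: EvaluatorClass, torchAvailable: boolean): EvaluatorClass {
  return requested === "C" && !torchAvailable ? "B" : requested;
}

const sample = "Causes: Methods, Machines, Materials";
const verdicts = [contains("Methods"), matches(/^Causes:/), contains("Manpower")]
  .map((check) => check(sample));
const effectiveClass = resolveClass("C", false);
```

Checks like these are pure functions of the output string, which is why Class A can be both deterministic and effectively free to run.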

Project structure

src/
  cli/             Ink TUI components and Pastel CLI commands
  runner/          Test discovery, orchestration, result collection
  evaluators/      Class A, B, C evaluator implementations
  optimizer/       Karpathy loop: mutation, keep/revert, logging
  sandbox/         Temp directory lifecycle for script execution
  parsers/         Markdown section parser, YAML spec loader
  types/           Zod schemas and TypeScript type definitions
scripts/ml/        Python ML evaluator scripts
models/            Trained model checkpoints (gitignored except metadata)
tests/             Unit and integration tests
docs/              Requirements and design documents
.claude/           References for Claude Code development sessions

Requirements

  • Node.js 20+
  • Python 3.10+ (for skill scripts and ML evaluators)
  • ANTHROPIC_API_KEY environment variable (for Class B evaluators and agent-driven tests)
  • PyTorch + sentence-transformers (optional, for Class C evaluators)
