Autonomous experiment daemon. Point an LLM at any benchmark and it optimizes the metric in a loop.

    bun run build         # compiles binary to bin/xp + symlinks to ~/.bun/bin/

    # Start an experiment
    xp start optimize-fft \
      --metric latency --unit ms --direction min \
      --benchmark "./bench.sh" \
      --objective "reduce FFT latency" \
      --provider claude

    # Monitor
    xp status             # current state
    xp logs               # daemon output
    xp logs -f            # tail daemon output
    xp results            # all trial results
    xp results --last 5   # last 5 trials

    # Steer the agent mid-run
    xp steer "try SIMD intrinsics instead of auto-vectorization"

    # Stop
    xp stop

| Command | Description |
|---|---|
| `start <name>` | Initialize and start an experiment |
| `stop` | Stop the daemon |
| `status` | Show experiment state (`--json`) |
| `logs` | View daemon log (`-f` to follow) |
| `results` | Show trial results (`--last N`, `--json`) |
| `steer <guidance>` | Send guidance to the running experiment |


| Flag | Description | Default |
|---|---|---|
| `--metric` | Metric name to optimize | required |
| `--unit` | Metric unit | `""` |
| `--direction` | `min` or `max` | required |
| `--benchmark` | Shell command that emits `METRIC name=value` | required |
| `--objective` | What the agent should optimize | required |
| `--provider` | `claude` or `codex` | `claude` |
| `--max-iterations` | Budget cap | `50` |
| `--max-failures` | Max consecutive failures | `5` |

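The two budget flags can be read as a stopping rule. The sketch below is one plausible interpretation, not xp's actual implementation: `should_stop` and `next_failure_count` are hypothetical helpers, and it assumes that a successful trial resets the failure streak, as the word "consecutive" suggests.

```shell
# Hypothetical helpers illustrating the budget flags; not part of xp itself.

# should_stop ITERATIONS FAILURE_STREAK [MAX_ITERATIONS] [MAX_FAILURES]
# Succeeds (exit 0) once either budget is exhausted. The fallback values
# mirror the documented defaults of --max-iterations and --max-failures.
should_stop() {
  local iterations=$1 streak=$2 max_iterations=${3:-50} max_failures=${4:-5}
  [ "$iterations" -ge "$max_iterations" ] || [ "$streak" -ge "$max_failures" ]
}

# next_failure_count STREAK OUTCOME
# Assumption: an "ok" trial resets the streak; anything else extends it.
next_failure_count() {
  if [ "$2" = "ok" ]; then
    echo 0
  else
    echo $(( $1 + 1 ))
  fi
}

# Example: replay a sequence of trial outcomes and report where the loop stops.
streak=0
iterations=0
for outcome in fail fail ok fail fail fail fail fail; do
  iterations=$(( iterations + 1 ))
  streak=$(next_failure_count "$streak" "$outcome")
  if should_stop "$iterations" "$streak"; then
    echo "stop after iteration $iterations (streak=$streak)"
    break
  fi
done
```

With the default limits, the replay stops at iteration 8, when the fifth failure in a row is recorded. The early `ok` at iteration 3 resets the streak, so the first two failures do not count toward the limit.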
The benchmark command must print metrics to stdout in this format:

    METRIC latency=42.5
    METRIC throughput=1200

One `METRIC name=value` per line. The `--metric` flag selects which one to optimize.
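As a rough illustration of this contract, the sketch below shows one way a benchmark script could emit metric lines and how a single metric could be picked out of mixed output. `emit_metric`, `select_metric`, and `fake_benchmark` are hypothetical names, and the last-occurrence-wins rule for repeated metrics is an assumption; xp's own parser may behave differently.

```shell
# Hypothetical helpers; names and parsing details are assumptions, not xp's API.

# emit_metric NAME VALUE: print one metric line in the documented format.
emit_metric() {
  printf 'METRIC %s=%s\n' "$1" "$2"
}

# select_metric NAME: read benchmark output on stdin and print the value of
# the metric called NAME. Lines that do not start with METRIC are ignored.
# Assumption: if the metric appears more than once, the last value wins.
# Exits non-zero when the metric is missing.
select_metric() {
  awk -v name="$1" '
    $1 == "METRIC" {
      eq = index($2, "=")
      if (eq > 0 && substr($2, 1, eq - 1) == name) {
        value = substr($2, eq + 1)
        found = 1
      }
    }
    END {
      if (!found) exit 1
      print value
    }
  '
}

# Example: a stand-in benchmark that prints build noise plus two metrics.
fake_benchmark() {
  echo "compiling..."
  emit_metric latency 42.5
  emit_metric throughput 1200
}

fake_benchmark | select_metric latency   # prints 42.5
```

A real `bench.sh` would replace the fixed numbers with measured values; any other output it prints is simply skipped by this kind of filter.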
- Baseline: runs the benchmark on the current code to establish a starting point
- Loop: invokes the LLM agent with context (objective, best score, dead ends, user guidance); the agent makes changes in a git worktree, the benchmark runs, and the result is kept or reverted
- Persistence: all events logged to append-only JSONL, crash-safe with two-phase decisions
- Worktree isolation: experiments run in `.xp/worktree/` on an `xp/<name>` branch, so your working directory stays clean
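
The keep-or-revert step and the append-only log can be sketched roughly as follows. This is an illustration under stated assumptions rather than xp's implementation: the function names and the JSON field names (`trial`, `value`, `decision`) are hypothetical, a candidate must strictly beat the best score to be kept, and the git operations and two-phase bookkeeping are left out.

```shell
# Hypothetical sketch of the decision step; not xp's actual code or log schema.

# is_improvement DIRECTION BEST CANDIDATE
# Succeeds (exit 0) when CANDIDATE strictly beats BEST for the direction.
# awk handles the comparison because shell arithmetic is integer-only.
is_improvement() {
  awk -v dir="$1" -v best="$2" -v cand="$3" 'BEGIN {
    if (dir == "min") exit !(cand + 0 < best + 0)
    if (dir == "max") exit !(cand + 0 > best + 0)
    exit 2  # unknown direction
  }'
}

# log_event FILE TRIAL VALUE DECISION
# Appends one JSON object per line; existing lines are never rewritten.
log_event() {
  printf '{"trial":%d,"value":%s,"decision":"%s"}\n' "$2" "$3" "$4" >> "$1"
}

# record_trial FILE DIRECTION BEST TRIAL VALUE
# Decides keep vs. revert for one trial and logs the outcome.
record_trial() {
  local log=$1 dir=$2 best=$3 trial=$4 value=$5 decision=revert
  if is_improvement "$dir" "$best" "$value"; then
    decision=keep
  fi
  log_event "$log" "$trial" "$value" "$decision"
}

# Example: two trials while minimizing latency, starting from a best of 42.5.
log=$(mktemp)
record_trial "$log" min 42.5 1 40.1   # beats the best score
record_trial "$log" min 40.1 2 41.0   # worse than the new best
cat "$log"
```

The example logs `keep` for the first trial and `revert` for the second. Because each event is appended as a complete line, a reader can rebuild the current state by replaying the file from the top.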
bun run dev -- --help # run from source
bun run gate # typecheck + lint + fmt + test + build
bun test # tests only