A small autoresearch loop for prompt-harness experiments.
The repo tests whether different header/footer harnesses change how target models answer one fixed body. Each experiment stores the harness, response, score, and model-role permutation in SQLite.
Customize these two root files:
- example.md: the body/prompt you want to test.
- desired-output.md: the scoring rubric describing what a good answer should look like.
The runner always uses example.md as the body and desired-output.md as the
verifier. Keep them in sync: when you change the body, update the rubric to match.
Every generated footer is normalized to end with:
Answer with exactly one sentence.
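The footer rule above can be sketched as a small helper. This is a minimal illustration, not the repo's code: the names `REQUIRED_ENDING` and `normalize_footer` are hypothetical, and the assumption that the instruction is appended on its own line is mine.

```python
# Hypothetical helper illustrating the footer rule; not taken from src/.
REQUIRED_ENDING = "Answer with exactly one sentence."


def normalize_footer(footer: str) -> str:
    """Return the footer so that it ends with the required instruction."""
    text = footer.rstrip()
    if text.endswith(REQUIRED_ENDING):
        # Already compliant; only trailing whitespace is dropped.
        return text
    if not text:
        # An empty footer becomes just the instruction.
        return REQUIRED_ENDING
    # Assumption: the instruction goes on its own final line.
    return f"{text}\n{REQUIRED_ENDING}"


print(normalize_footer("Be concise."))
```

Written this way the helper is idempotent, so applying it to an already normalized footer leaves the text unchanged.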
This loop is designed to run autonomously inside Codex CLI's /goal feature
(Codex CLI v0.128.0+). Codex enters a self-checking loop that proposes
harnesses, scores responses against desired-output.md, and iterates until
the success signal in objective.md is met.
Once example.md, desired-output.md, and .env are in place, open Codex
in this directory and start the loop:
/goal Follow objective.md. Read first: README.md, AGENTS.md, src/, run.py, report.py. Validate after each change with: python3 run.py --all-strategies --max-permutations 1 && python3 report.py. Work in checkpoints; commit each improvement. Stop when the success signal in objective.md is met, or when further changes need human input.
See docs-slash-goal.md for the full /goal reference (4-part contract,
checkpointing, pause/resume, common failure modes). For a one-shot smoke
test instead of an autonomous loop, see start-prompt.md.
Create .env:
OPENROUTER_API_KEY=your_key_here
Run a dry smoke test:
python3 run.py --all-strategies --max-permutations 1 --dry-run
Run one live baseline:
python3 run.py --strategy baseline --max-permutations 1
Run all strategies on one role permutation:
python3 run.py --all-strategies --max-permutations 1
Summarize results:
python3 report.py
Results are written to runs/experiments.sqlite.
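To look at the results database directly, something like the sketch below may help. The table layout is not documented here, so the code assumes nothing about it: it lists whatever tables exist and counts their rows. The function name `summarize_tables` is hypothetical; only the path runs/experiments.sqlite comes from this README.

```python
import sqlite3
from pathlib import Path


def summarize_tables(db_path: str) -> dict:
    """Map each user table in the database to its row count."""
    if not Path(db_path).exists():
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists schema objects; skip SQLite's internal tables.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier, since table names are not known in advance.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    db = "runs/experiments.sqlite"
    if Path(db).exists():
        for table, rows in summarize_tables(db).items():
            print(f"{table}: {rows} rows")
    else:
        print(f"{db} not found; run an experiment first")
```

For the intended summary, python3 report.py remains the supported route; this sketch is only a quick check that a run wrote something.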
- run.py chooses target, researcher, and scorer models from models.json.
- The researcher proposes a candidate multi-turn harness.
- The target model receives the harness with example.md inserted as the final body.
- The scorer compares the response to desired-output.md and returns a score from 0.0 to 1.0.
- Winning fragments are stored and reused by later strategies.
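One way to picture a model-role permutation is as an assignment of three distinct models to the target, researcher, and scorer roles. The sketch below rests on assumptions: the structure of models.json is not shown in this README, so the model list is a placeholder, and `role_permutations` is a hypothetical name. The `max_permutations` argument only mirrors the intent of the --max-permutations flag; the runner's actual ordering may differ.

```python
from itertools import permutations
from typing import Optional

ROLES = ("target", "researcher", "scorer")


def role_permutations(models, max_permutations: Optional[int] = None):
    """Assign distinct models to the three roles, in a deterministic order."""
    assignments = [
        dict(zip(ROLES, combo))
        for combo in permutations(models, len(ROLES))
    ]
    if max_permutations is None:
        return assignments
    # Keep only the first N assignments, as a cap on how many are run.
    return assignments[:max_permutations]


# Placeholder model IDs, not taken from the repo's models.json.
models = ["model-a", "model-b", "model-c"]
for assignment in role_permutations(models, max_permutations=2):
    print(assignment)
```

With three models there are six possible assignments, so capping at one permutation keeps a smoke test to a single target, researcher, and scorer combination.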
Strategies:
- baseline: no harness.
- seeded: seed headers/footers from prompts/headers/ and prompts/footers/.
- evolve-best: mutate the strongest prior harness.
- recombine: recombine strong fragments from prior runs.
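As a rough illustration of the recombine idea, the sketch below pairs the header and the footer that each have the highest mean score across prior runs. This is an assumed selection rule, not the repo's algorithm; `Run`, `best_fragment`, and `recombine` are hypothetical names, and the sample history is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    header: str
    footer: str
    score: float  # scorer output in the range 0.0 to 1.0


def best_fragment(runs, attr):
    """Return the fragment with the highest mean score over runs using it."""
    totals = {}
    for run in runs:
        fragment = getattr(run, attr)
        total, count = totals.get(fragment, (0.0, 0))
        totals[fragment] = (total + run.score, count + 1)
    return max(totals, key=lambda f: totals[f][0] / totals[f][1])


def recombine(runs):
    """Pair the best-scoring header with the best-scoring footer."""
    if not runs:
        raise ValueError("recombine needs at least one prior run")
    return best_fragment(runs, "header"), best_fragment(runs, "footer")


# Invented history for illustration only.
history = [
    Run("You are a careful analyst.", "Be brief.", 0.8),
    Run("Think step by step.", "Explain your reasoning.", 0.7),
    Run("Think step by step.", "Be brief.", 0.2),
]
print(recombine(history))
```

In this sample the chosen header and footer never appeared together in a prior run, which is the point of recombining: the new pair is a candidate that still has to be scored.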
- example.md: your active test body.
- desired-output.md: your active scoring rubric.
- models.json: OpenRouter model list.
- src/: runner, prompt, strategy, scoring, and storage code.
- tests/: invariant tests.
- runs/: local experiment database, ignored by git.
- Do not commit real API keys.
- Do not commit private experiment databases.
- Use only test bodies you are authorized to evaluate.
