This repo includes a minimal autoresearch-style runner that can iterate on the witty emoji skill for multiple loops, benchmark each candidate, and keep or discard changes based on reasoning-gap metrics.
The implementation now follows the skill-local layout:
- `.agents/skills/emoji-witty-zh-tw/SKILL.md`
- `.agents/skills/emoji-witty-zh-tw/scripts/autoresearch.py`
- `.agents/skills/emoji-witty-zh-tw/references/emoji-witty-zh-tw/*.md`
The runner intentionally limits edits to:
- `.agents/skills/emoji-witty-zh-tw/SKILL.md`
- `.agents/skills/emoji-witty-zh-tw/references/emoji-witty-zh-tw/test-cases.md`
- `.agents/skills/emoji-witty-zh-tw/references/emoji-witty-zh-tw/target-metrics.md`
Each loop runs a fixed generator / solver / judge flow:
- Generate emoji designs for the selected targets
- Ask a full-tier solver and a mini-tier solver to explain them
- Judge both solver outputs against the generator rationale
- Aggregate a reasoning-gap objective
- Keep or discard the candidate
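The flow above can be sketched as a single function. This is a minimal illustration, not the code in `autoresearch.py`: the callables `generate`, `solve_full`, `solve_mini`, and `judge`, the `LoopResult` type, and the definition of the objective (mean full-tier score minus mean mini-tier score, with larger treated as better) are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple


@dataclass
class LoopResult:
    objective: float
    keep: bool


def reasoning_gap(full_scores: Sequence[float], mini_scores: Sequence[float]) -> float:
    # Assumed objective: how much better the full-tier solver scores than the mini-tier one.
    return mean(full_scores) - mean(mini_scores)


def run_loop(
    targets: Sequence[str],
    generate: Callable[[str], Tuple[str, str]],  # target -> (emoji design, rationale)
    solve_full: Callable[[str], str],            # emoji design -> explanation
    solve_mini: Callable[[str], str],            # emoji design -> explanation
    judge: Callable[[str, str], float],          # (rationale, explanation) -> score
    best_objective: float,
) -> LoopResult:
    full_scores, mini_scores = [], []
    for target in targets:
        emoji, rationale = generate(target)
        # Solvers only see the emoji; the judge compares their explanation to the rationale.
        full_scores.append(judge(rationale, solve_full(emoji)))
        mini_scores.append(judge(rationale, solve_mini(emoji)))
    objective = reasoning_gap(full_scores, mini_scores)
    # Assumed rule: keep the candidate only if it beats the best objective seen so far.
    return LoopResult(objective=objective, keep=objective > best_objective)
```

Passing the stages in as callables keeps the sketch independent of any particular provider, which mirrors the separation between generator, solver, and judge backends.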
The runner now defaults to:
- generator: Codex (`reasoning_effort=high`)
- judge: Codex (`reasoning_effort=high`)
- solvers:
  - Codex full (`reasoning_effort=high`)
  - Copilot full (`gpt-5.4`)
  - Gemini full (CLI default model unless overridden)
  - Copilot mini (`gpt-5.4-mini`)
This keeps generation and judging anchored to Codex, while using a multi-provider solver matrix.

    python3 .agents/skills/emoji-witty-zh-tw/scripts/autoresearch.py --iterations 10

Useful flags:
- `--baseline-only`: run only the benchmark baseline, without mutation loops
- `--cases osaka,hongkong`: choose the benchmark case pool
- `--output-dir .autoresearch-runs/emoji-witty-zh-tw`: change where artifacts are written
- `--mutation-backend codex:full:mutation::high`: choose the mutation backend
- `--generator-backend codex:full:generator::high`: choose the generator backend
- `--judge-backend codex:full:judge::high`: choose the judge backend
- `--solver-backend provider:tier:label[:model[:reasoning_effort]]`: add or replace solver backends
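The backend flags share the `provider:tier:label[:model[:reasoning_effort]]` shape, where an empty field (as in `codex:full:mutation::high`) leaves the model unset. A rough sketch of how such a spec could be parsed follows; `BackendSpec` and `parse_backend_spec` are hypothetical names for illustration, and the runner's actual parsing may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendSpec:
    provider: str
    tier: str
    label: str
    model: Optional[str] = None
    reasoning_effort: Optional[str] = None


def parse_backend_spec(spec: str) -> BackendSpec:
    parts = spec.split(":")
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"expected 3 to 5 colon-separated fields, got {len(parts)}: {spec!r}")
    # Pad the optional trailing fields so unpacking always sees five values.
    parts += [""] * (5 - len(parts))
    provider, tier, label, model, effort = parts
    if not (provider and tier and label):
        raise ValueError(f"provider, tier, and label are required: {spec!r}")
    # An empty optional field means "not specified".
    return BackendSpec(provider, tier, label, model or None, effort or None)
```

For example, `parse_backend_spec("codex:full:judge::high")` yields a spec with `model=None` and `reasoning_effort="high"`. Splitting on every colon means this sketch would reject a model name that itself contains a colon.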
Each run writes a timestamped directory under .autoresearch-runs/emoji-witty-zh-tw/ with:
- `config.json`
- `baseline/`
- `iteration-*/mutation.json`
- `iteration-*/benchmark/`
- `iteration-*/decision.json`
- `run-summary.json`
This is the repo's equivalent of an overnight autoresearch log: each iteration records the mutation hypothesis, benchmark result, and keep/discard decision.
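To review a finished run, the per-iteration decisions can be collected with a few lines of Python. This sketch relies only on the `iteration-*/decision.json` layout listed above; the helper name `load_decisions` is hypothetical, and no assumption is made about the keys inside each JSON file.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_decisions(run_dir: str) -> Dict[str, Any]:
    """Map each iteration directory name to the parsed contents of its decision.json."""
    decisions: Dict[str, Any] = {}
    # Sorting is lexicographic, so iteration-10 sorts before iteration-2.
    for path in sorted(Path(run_dir).glob("iteration-*/decision.json")):
        decisions[path.parent.name] = json.loads(path.read_text(encoding="utf-8"))
    return decisions
```

Pointing it at one timestamped directory under `.autoresearch-runs/emoji-witty-zh-tw/` returns one entry per iteration that wrote a decision file.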
More runner details live in:
.agents/skills/emoji-witty-zh-tw/references/emoji-witty-zh-tw/autoresearch.md