What's Changed
- Refresh install guidance to track latest BenchFlow by @bingran-you in #798
- fix(agents): route OpenCode-family + pi-acp agents through the LLM usage proxy by @Yiminnn in #797
- feat(eval): bind the environment (S) and config (C) axes at the CLI by @xdotli in #790
- feat(cli): move `continue` under `bench eval continue` (keep deprecated top-level alias) by @xdotli in #800
- fix(agents): resolve bare model ids to their provider so harnesses route correctly by @xdotli in #805
- ci(integration): Add tiered L0-L3 integration gates with scope planner and codex review by @Yiminnn in #802
- fix(integration): repair the full L0–L3 workflow + ready-to-merge codex auto-trigger by @Yiminnn in #806
- fix(integration): add uv.lock to plan-job sparse-checkout (setup-uv cache) by @Yiminnn in #807
- fix(integration): install codex CLI + isolate codex auth from the deepseek judge by @Yiminnn in #808
- fix(integration): pin codex model + demote false pinned-baseline parity blocker by @Yiminnn in #809
- fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote by @Yiminnn in #810
- feat(integration): codex reviewer on DeepSeek-v4-pro via Moon Bridge by @Yiminnn in #811
- fix(integration): pin moon-bridge + injection-safe key + clean fail-closed on bridge absence by @Yiminnn in #812
- fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness by @Yiminnn in #814
- Add MLE-bench adapter by @ZhengShenghan in #792
- Add adapter skill by @ZhengShenghan in #793
- fix(integration): clear residual greptile findings on the L3 gate by @Yiminnn in #817
- fix(eval): bind resolved S-axis env + C-axis overlay on the sharded and run-config paths by @xdotli in #804
- Make LiteLLM proxy mandatory for routable agents (never bypass; always capture usage/cost/trajectory) by @bingran-you in #820
- Strengthen experiment review trajectory gate by @bingran-you in #821
- fix(eval): correct verifier-error resume log by @bingran-you in #819
- fix(eval): expose context-root on eval run by @bingran-you in #816
- Preserve pi-acp model metadata through LiteLLM proxy by @bingran-you in #803
- fix(integration): avoid file-editor judge false positives by @bingran-you in #823
- fix(integration): audit summaryless result roots by @bingran-you in #824
- fix(eval): reject .git and file --source-path with a clean error (#548) by @bingran-you in #822
- fix(pi): avoid context-window retry storms by @bingran-you in #831
- fix(eval): surface provider failure cause and harden trajectory redaction by @Yiminnn in #834
- feat(agents): all-paths decouple core — manifest loader (8 ACP) + omnigent session-factory seam (additive, gated) by @Yiminnn in #825
- fix(eval): keep failure traceback consistent with surfaced error.message by @Yiminnn in #835
- test(loader): lock 3-path (acp/ ai-sdk/ omnigent/) manifest discovery by @Yiminnn in #836
- test(agents): lock core<->manifest byte-identical parity + CI gate by @Yiminnn in #837
- fix(eval): emit llm_trajectory.jsonl for streaming claude-agent-acp rollouts by @Yiminnn in #839
- Add train convert Prime SFT export by @bingran-you in #828
- chore: release v0.6.4 by @xdotli in #801
New Contributors
- @ZhengShenghan made their first contribution in #792
Full Changelog: v0.6.3...v0.6.4