What's Changed
This release fixes several issues with the eval harness. If you are evaluating on ProgramBench, we strongly recommend updating. Most fixes should not require rerunning agents. The one exception is a small loophole described in #45 and #14 (first raised by suche-ux in #14), which is fixed by the new Docker images (#46). Annotating existing agent trajectories should make it easy to flag which instances were affected.
- Fix(eval): block build-script internet for submissions by @klieret in #41
- Fix(eval): Ignore flaky and otherwise unsuitable tests by @klieret in #40
- Fix(eval): evaluate in :task_cleanroom images by @klieret in #42
- Fix(eval): default to v6 docker images by @klieret in #46
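
For the trajectory annotation mentioned above, here is a minimal sketch of one way to flag instances. Everything in it is an assumption for illustration: the trajectory layout (a mapping from instance ID to a list of shell command strings), the `flag_instances` helper, and the `NETWORK_PATTERN` regex are hypothetical and not part of the harness. The real signature of the loophole should be taken from #45 and #14.

```python
import re

# Hypothetical placeholder pattern: these notes do not say what the loophole
# looks like in a trajectory, so replace this with the signature from #45/#14.
NETWORK_PATTERN = re.compile(r"\b(curl|wget|git\s+clone|pip\s+install)\b")


def flag_instances(trajectories):
    """Return the sorted instance IDs with at least one matching command.

    `trajectories` is assumed to map instance ID -> list of shell command
    strings; adapt the loading step to your actual trajectory format.
    """
    return sorted(
        instance_id
        for instance_id, commands in trajectories.items()
        if any(NETWORK_PATTERN.search(cmd) for cmd in commands)
    )


if __name__ == "__main__":
    example = {
        "instance-a": ["make", "curl http://example.com/src.tar.gz"],
        "instance-b": ["gcc main.c -o main"],
    }
    print(flag_instances(example))
```

Flagged instances are only candidates for a rerun; a match on a pattern like this does not by itself show that an instance was affected.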
New Contributors
- @dependabot[bot] made their first contribution in #26
- @arpitjain099 made their first contribution in #24
- @klieret made their first contribution in #28
- @yurekami made their first contribution in #29
Full Changelog: v1.0.2...v1.1.0