Release v1.1.0 · facebookresearch/ProgramBench

What's Changed

This release fixes several issues with the eval harness. If you are evaluating on ProgramBench, we strongly recommend updating. Most fixes should not require rerunning agents. The exception is a small loophole described in #45 and in #14, where it was first raised by suche-ux; it is fixed by new docker images (#46). Annotating existing agent trajectories should make it easy to flag which instances were affected.

Fix(eval): block build-script internet for submissions by @klieret in #41
Fix(eval): Ignore flaky and otherwise unsuitable tests by @klieret in #40
Fix(eval): evaluate in :task_cleanroom images by @klieret in #42
Fix(eval): default to v6 docker images by @klieret in #46

New Contributors

@dependabot[bot] made their first contribution in #26
@arpitjain099 made their first contribution in #24
@klieret made their first contribution in #28
@yurekami made their first contribution in #29

Full Changelog: v1.0.2...v1.1.0
