Hi there.
Really like this benchmark. I’m building a Modal-based pipeline to evaluate ProgramBench with higher parallelism, and I’d like to validate my implementation end-to-end without running a full agent each time.
For that purpose, I’m looking for a way to mock the agent generation phase with known-good submissions. The repo part is simple, since the commit hash is known. The compile script part is trickier. Would it be possible to release the compile.sh scripts used to build the gold/reference executables, or any equivalent reference build scripts?
I understand if these cannot be shared due to benchmark integrity concerns. In that case, is there a recommended way to create a small set of known-good mock submissions for validating the evaluation pipeline?