Conversation
VERIFIER_ENV pins PYTEST_ADDOPTS=--rootdir=/app for test-node-ID anchoring across SkillsBench / TB-style tasks. Tasks whose Dockerfile WORKDIRs elsewhere (e.g. /root) and don't otherwise create /app then trip pytest's at-startup directory check: ERROR: Directory '/app' not found. Check your '--rootdir' option. Verifier aborts before reaching test_outputs.py and the trial scores 0 even when the agent produced correct output. Surfaced today by trajectory analysis of the SkillsBench Apr 2026 patch trial: pg-essay-to-audiobook wrote /root/audiobook.mp3 (29:45 MP3, valid ffmpeg output) and the verifier's pytest still aborted on the missing /app rootdir; same shape on scheduling-email-assistant. Mirrors the skill_eval Dockerfile pattern (skill_eval.py:325) which already RUNs 'mkdir -p /logs/verifier /logs/agent /logs/artifacts /app /tests'. Add /app to the verifier dir prep step in trial.py so all benchflow eval paths (not just skill_eval) get it. Idempotent: tasks that DO populate /app are unaffected — mkdir -p is a no-op when the directory exists, and we deliberately don't chmod /app (any pre-existing task content stays root-owned).
| await self._env.exec( | ||
| "rm -rf /logs/verifier && mkdir -p /logs/verifier && chmod 777 /logs/verifier", | ||
| "rm -rf /logs/verifier && mkdir -p /logs/verifier /app && " | ||
| "chmod 777 /logs/verifier", | ||
| user="root", timeout_sec=10, |
There was a problem hiding this comment.
🟡 mkdir -p /app fix applied to soft_verify but not to the full verify path's _CLEAR_VERIFIER_DIR_CMD
The PR adds mkdir -p /app to soft_verify() to prevent pytest from aborting with "Directory '/app' not found" when VERIFIER_ENV pins PYTEST_ADDOPTS=--rootdir=/app and the task's Dockerfile WORKDIRs elsewhere (e.g. /root). However, the same issue exists in the full verify() path: sdk._verify() calls harden_before_verify() at src/benchflow/sdk.py:403, which executes _CLEAR_VERIFIER_DIR_CMD at src/benchflow/_sandbox.py:717 — and that constant at src/benchflow/_sandbox.py:593-595 does not include mkdir -p /app. Since harden_before_verify() also sets PYTEST_ADDOPTS with --rootdir=/app (via VERIFIER_ENV at src/benchflow/_sandbox.py:304-308), the same pytest abort will occur during the final verify() for any task whose Dockerfile doesn't create /app.
Prompt for agents
The mkdir -p /app fix in soft_verify() at src/benchflow/trial.py:555 addresses a real issue where pytest aborts with 'Directory /app not found' when VERIFIER_ENV sets --rootdir=/app but the task Dockerfile WORKDIRs elsewhere. However, the same fix needs to be applied to the full verify() path.
The constant _CLEAR_VERIFIER_DIR_CMD at src/benchflow/_sandbox.py:593-595 is used by harden_before_verify() at line 717, which is called from sdk._verify() at src/benchflow/sdk.py:403. This constant should also include 'mkdir -p /app' to ensure /app exists before the verifier runs pytest with --rootdir=/app.
Suggested fix: Update _CLEAR_VERIFIER_DIR_CMD in src/benchflow/_sandbox.py from:
"rm -rf /logs/verifier && mkdir -p /logs/verifier && chmod 777 /logs/verifier"
to:
"rm -rf /logs/verifier && mkdir -p /logs/verifier /app && chmod 777 /logs/verifier"
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
VERIFIER_ENVpinsPYTEST_ADDOPTS="--rootdir=/app …"(see_sandbox.py:307) for test-node-ID anchoring across SkillsBench / TB-style tasks. Tasks whose DockerfileWORKDIRs elsewhere (e.g./root) and don't otherwise create/apptrip pytest's at-startup directory check:The verifier aborts before reaching
test_outputs.pyand the trial scores0even when the agent produced correct output.Surfaced today by trajectory analysis of the SkillsBench Apr 2026 patch trial:
pg-essay-to-audiobookwrote/root/audiobook.mp3(29:45 MP3, valid ffmpeg output) and the verifier's pytest still aborted on the missing/approotdir.scheduling-email-assistantand several other Trial 1 ERROR-classified tasks.skill_eval's Dockerfile augmentation already does this (src/benchflow/skill_eval.py:325runsmkdir -p /logs/verifier /logs/agent /logs/artifacts /app /tests). This PR adds the equivalent to the regular eval path so all benchflow runs get/app, not just skill_eval ones.Idempotence
mkdir -pis a no-op when the directory already exists. We deliberately don'tchmod/appso any task content (e.g. swebench-style repo at/app/<repo>) stays root-owned and undisturbed.Test plan
pg-essay-to-audiobookandscheduling-email-assistantfrom the SkillsBench Apr 2026 patch trial — verifier should now reachtest_outputs.pyinstead of aborting at pytest's rootdir check.WORKDIR /app(e.g.azure-bgp-oscillation-route-leak,dialogue-parser) should be unaffected —mkdir -pis a no-op there.skill_evalflow (it was already doing this).