Fix/openhands sandbox launch#182
Merged
xdotli merged 5 commits into benchflow-ai:dev-0.3 on Apr 25, 2026
968b126 to 1eca795
Co-authored-by: Copilot <copilot@github.com>
inner = (
    f"export HOME=/home/{sandbox_user} && {agent_launch}"
)
inner = f"export HOME=/home/{sandbox_user} && {agent_launch}"
🟡 Removal of cd from build_priv_drop_cmd creates cwd inconsistency between setpriv and su -l paths
The cd /home/{sandbox_user} was removed from the inner command in build_priv_drop_cmd, so setpriv now inherits the working directory from ContainerTransport (typically /app). However, the su -l fallback path still simulates a login shell which changes directory to /home/{sandbox_user} before running the command. This means agents will run in different working directories depending on whether the container has setpriv (Debian/Ubuntu → /app) or falls back to su -l (Alpine/BusyBox → /home/{sandbox_user}). The old explicit cd made both paths consistent.
Prompt for agents
The build_priv_drop_cmd function in src/benchflow/_sandbox.py has two privilege-drop paths: setpriv (primary) and su -l (fallback). After removing the explicit cd /home/{sandbox_user} from the inner command, the two paths produce different working directories:
- setpriv: inherits cwd from ContainerTransport (e.g. /app)
- su -l: changes to /home/{sandbox_user} because -l simulates a login
To make both paths consistent, either:
1. Change su -l to su (without -l) so it doesn't change directory, OR
2. Add an explicit cd to agent_cwd in the inner command for the su -l path only, OR
3. Restore the cd but change it to cd to the workspace (agent_cwd) instead of home
The intent of the change was to let ContainerTransport control the working directory, but the su -l fallback defeats this by overriding the cwd.
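A minimal sketch of option 3, so that both privilege-drop paths end up in the same directory. The function name `sketch_priv_drop_cmd`, its signature, and the exact shell layout are hypothetical and are not taken from src/benchflow/_sandbox.py; only the `export HOME=/home/{sandbox_user} && {agent_launch}` inner command comes from the diff above. It assumes `agent_cwd` is the workspace path that ContainerTransport would otherwise set (e.g. /app), and that the sandbox user's primary group has the same name as the user.

```python
import shlex


def sketch_priv_drop_cmd(agent_launch: str, sandbox_user: str, agent_cwd: str) -> str:
    """Hypothetical variant of build_priv_drop_cmd illustrating option 3.

    The explicit `cd` lives in the inner command, so it runs after the
    privilege drop on both paths and overrides the directory change that
    `su -l` performs when it simulates a login.
    """
    inner = (
        f"export HOME=/home/{sandbox_user} && "
        f"cd {shlex.quote(agent_cwd)} && {agent_launch}"
    )
    quoted = shlex.quote(inner)
    return (
        # Primary path: setpriv keeps the caller's cwd, then the inner cd applies.
        "if command -v setpriv >/dev/null 2>&1; then "
        f"exec setpriv --reuid={sandbox_user} --regid={sandbox_user} "
        f"--init-groups -- sh -c {quoted}; "
        # Fallback path: su -l moves to the home directory first, then the
        # same inner cd brings the agent back to agent_cwd.
        f"else exec su -l {sandbox_user} -c {quoted}; fi"
    )


if __name__ == "__main__":
    print(sketch_priv_drop_cmd("run-agent", "agent", "/app"))
```

Because the same quoted inner command is passed to both branches, the working directory no longer depends on whether the image ships setpriv. Options 1 and 2 from the list would reach the same result by changing only the fallback branch.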
# Conflicts:
#   src/benchflow/_acp_run.py
#   src/benchflow/_agent_env.py
xdotli added a commit that referenced this pull request on Apr 25, 2026
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt.

What changed:

1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules

2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)` pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003.

3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402: these are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821: uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.
Fixes #169
.venv/bin/benchflow eval create \
  -t tasks/weighted-gdp-calc \
  -a openhands \
  -m gemini-3.1-flash-lite-preview \
  -e docker \
  -o jobs/skillsbench-openhands-gemini
Task: weighted-gdp-calc
Agent: openhands (gemini-3.1-flash-lite-preview)
Reward: 0.0
Tool calls: 17