refactor: kill shim layers, single Rollout execution path (ENG-46) #268

Merged
devin-ai-integration[bot] merged 3 commits into refactor/v0.4 from devin/1778883845-eng-46-kill-shims on May 16, 2026

@devin-ai-integration devin-ai-integration Bot commented May 15, 2026

Summary

Eliminates the SDK → Runtime → Trial triple-indirection and establishes bf.run(RolloutConfig) → RolloutResult as the single execution path (ENG-46). This is Phase 2 of the v0.4 architecture refactoring.

Depends on: #261 (ENG-48 sandbox protocol) and #262 (ENG-47 unified types) — both are merged into this branch.

What changed

| Old | New | File |
|---|---|---|
| Trial class | Rollout class | rollout.py (new, from trial.py) |
| TrialConfig | RolloutConfig | rollout.py |
| RunResult | RolloutResult | models.py |
| Job class | Evaluation class | evaluation.py (new, from job.py) |
| JobConfig / JobResult | EvaluationConfig / EvaluationResult | evaluation.py |
| SDK.run() orchestration | Module-level functions in rollout.py | sdk.py → backward-compat shim |
| bf.run() from runtime.py | Also available via _run.py | _run.py (new) |

Architecture

bf.run(RolloutConfig) → RolloutResult    # single entry point
    └── Rollout.create(config)
        └── rollout.run()                 # 5-phase lifecycle preserved
            ├── setup()
            ├── install_agent()
            ├── execute()
            ├── verify()
            └── cleanup
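
To make the call chain above concrete, here is a minimal, self-contained sketch of the same shape. It does not import benchflow. The classes are stand-ins, the config and result fields are hypothetical, and making `Rollout.create` async and running cleanup in a `finally` block are assumptions rather than confirmed details of rollout.py.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RolloutConfig:
    # Hypothetical field; the real RolloutConfig has its own schema.
    task: str


@dataclass
class RolloutResult:
    # Hypothetical fields, for illustration only.
    task: str
    reward: float
    phases: list[str] = field(default_factory=list)


class Rollout:
    """Stand-in that records which lifecycle phases ran, in order."""

    def __init__(self, config: RolloutConfig) -> None:
        self.config = config
        self.phases: list[str] = []

    @classmethod
    async def create(cls, config: RolloutConfig) -> "Rollout":
        # Assumption: create() is an async factory.
        return cls(config)

    async def setup(self) -> None:
        self.phases.append("setup")

    async def install_agent(self) -> None:
        self.phases.append("install_agent")

    async def execute(self) -> None:
        self.phases.append("execute")

    async def verify(self) -> float:
        self.phases.append("verify")
        return 1.0  # placeholder reward

    async def cleanup(self) -> None:
        self.phases.append("cleanup")

    async def run(self) -> RolloutResult:
        reward = 0.0
        try:
            await self.setup()
            await self.install_agent()
            await self.execute()
            reward = await self.verify()
        finally:
            # Assumption: cleanup runs even when an earlier phase raises.
            await self.cleanup()
        return RolloutResult(self.config.task, reward, list(self.phases))


async def run(config: RolloutConfig) -> RolloutResult:
    """Single entry point: build a Rollout and drive its lifecycle."""
    rollout = await Rollout.create(config)
    return await rollout.run()


result = asyncio.run(run(RolloutConfig(task="demo")))
print(result.phases)
```

With this structure, callers only ever touch `run(config)`, so the phase ordering lives in exactly one place.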

Backward compatibility

All old names still work via aliases:

  • Trial = Rollout, TrialConfig = RolloutConfig, RunResult = RolloutResult
  • Job = Evaluation, JobConfig = EvaluationConfig, JobResult = EvaluationResult
  • from benchflow.sdk import SDK — shim delegates to rollout.py
  • from benchflow.trial import Trial — shim re-exports from rollout.py
  • from benchflow.job import Job — shim re-exports from evaluation.py
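
The alias and re-export mechanism listed above can be sketched without benchflow installed. The snippet below builds two in-memory modules under a made-up package name (`demo_pkg`) to stand in for rollout.py and the trial.py shim. The module contents are assumptions for illustration, not the actual files.

```python
import sys
import types

# Stand-in for rollout.py: the module that now owns the class.
rollout_mod = types.ModuleType("demo_pkg.rollout")


class Rollout:
    """Stand-in for the renamed class."""


rollout_mod.Rollout = Rollout

# Stand-in for the trial.py shim: it defines nothing new and only
# re-exports the new class, also under its old name.
trial_mod = types.ModuleType("demo_pkg.trial")
trial_mod.Rollout = rollout_mod.Rollout
trial_mod.Trial = rollout_mod.Rollout

# Register the fake package so ordinary import statements resolve.
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []  # marks demo_pkg as a package
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.rollout"] = rollout_mod
sys.modules["demo_pkg.trial"] = trial_mod

from demo_pkg.rollout import Rollout as NewRollout
from demo_pkg.trial import Trial  # old import path still works

# The alias is the same object, not a subclass or a copy.
print(Trial is NewRollout)
print(Trial.__name__)
```

Because the old name is bound to the same class object, `isinstance` checks and pickled references behave identically under either name. The visible difference is that `Trial.__name__` reports the new name, `Rollout`.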

SDK elimination

The 582-line sdk.py is replaced by a thin shim. All SDK static methods (_init_trial, _write_config, _resolve_prompts, _build_result, _start_env_and_upload, _run_oracle, _verify) are now module-level functions in rollout.py. SDK.run() delegates to bf.run().
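
As a rough illustration of what a thin shim of this kind can look like, the sketch below keeps an `SDK` class whose `run()` holds no logic and forwards to a module-level coroutine. The function name `bf_run`, the dict-based config, and the keyword-argument signature are all hypothetical, and treating `SDK.run()` as async is an assumption; the real shim may differ.

```python
import asyncio


async def bf_run(config: dict) -> dict:
    """Stand-in for the single entry point (hypothetical signature)."""
    return {"task": config["task"], "reward": 0.0}


class SDK:
    """Backward-compat shim: preserves the old call shape only."""

    async def run(self, **kwargs) -> dict:
        # Translate legacy keyword arguments into a config object
        # and delegate; no orchestration logic lives here anymore.
        return await bf_run(dict(kwargs))


legacy_result = asyncio.run(SDK().run(task="demo"))
new_result = asyncio.run(bf_run({"task": "demo"}))
print(legacy_result == new_result)
```

Keeping the shim this small means existing scripts keep working while any behavior change only has to be made in one module.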

Test updates

Test patch targets were updated from benchflow.trial.* → benchflow.rollout.* and from benchflow.sdk.Verifier → harbor.verifier.verifier.Verifier to match the new module structure. No test logic was changed.
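
The reason patch targets have to follow the code into its new module can be shown with a small standalone sketch. It uses two in-memory modules with made-up names (`demo_rollout`, `demo_trial`) and a hypothetical `_verify` helper. Patching the name on the re-export shim has no effect, because `run()` looks `_verify` up in the globals of the module where it was defined.

```python
import sys
import types
from unittest.mock import patch

# Stand-in for rollout.py. exec() is used so the functions' globals
# are this module's namespace, as they would be for a real file.
rollout = types.ModuleType("demo_rollout")
exec(
    "def _verify():\n"
    "    return 'real'\n"
    "\n"
    "def run():\n"
    "    return _verify()\n",
    rollout.__dict__,
)
sys.modules["demo_rollout"] = rollout

# Stand-in for the trial.py shim: re-exports only.
trial = types.ModuleType("demo_trial")
trial._verify = rollout._verify
trial.run = rollout.run
sys.modules["demo_trial"] = trial

# Old target: replaces the name on the shim, which run() never reads.
with patch("demo_trial._verify", return_value="mocked"):
    via_old_target = trial.run()

# New target: replaces the name in the module where run() is defined.
with patch("demo_rollout._verify", return_value="mocked"):
    via_new_target = trial.run()

print(via_old_target, via_new_target)
```

Here `via_old_target` is still `'real'` while `via_new_target` is `'mocked'`, which is why a pure rename of the patch target is needed even when the test logic itself does not change.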

Integration test

Verified end-to-end with bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics -a gemini -m gemini-3.1-flash-lite-preview -e daytona -c 1 -o jobs/integration-smoke/gemini — completed successfully (16 tool calls, reward=0.0).

Review & Testing Checklist for Human

  • Verify that from benchflow.sdk import SDK; SDK().run(...) still works end-to-end (backward compat for existing scripts)
  • Run bench eval create with a real agent against Daytona to verify the full Evaluation pipeline
  • Confirm that from benchflow import Rollout, RolloutConfig, RolloutResult, run resolves correctly
  • Spot-check that the 5-phase lifecycle (setup → install_agent → execute → verify → cleanup) is preserved in rollout.py

Notes

  • All 854 tests pass locally (including 13 new tests from main merge)
  • Lint (ruff check .) passes clean
  • Format (ruff format --check) passes clean
  • Typecheck (ty check src/) passes with 0 errors (no regressions from main)
  • Merged latest main to resolve conflicts (job.py had a content conflict with the ProgramBench integration, #237 "feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/")
  • The runtime.py file is preserved with its full API (Agent, Environment, Runtime classes) — only internal imports updated to use Rollout
  • Reward/verifier code is untouched (ENG-49's scope)
  • No adapters/ directory created (ENG-51's scope)

Link to Devin session: https://app.devin.ai/sessions/6ffc9c5487484894b289d3b947c9641d
Requested by: @xdotli


@devin-ai-integration

🧪 Test Results — ENG-46 Refactoring

All 9 tests passed. Tested backward compatibility, new API surface, and full integration pipeline.

Primary: Backward compat + new API + integration (9/9 passed)
| # | Test | Result |
|---|---|---|
| 1 | New public API names resolve (Rollout, RolloutConfig, RolloutResult, run, Evaluation, EvaluationConfig, EvaluationResult) | Passed |
| 2 | Backward-compat aliases are identity references (6 `is` checks: Trial is Rollout, etc.) | Passed |
| 3 | Old module import paths work (benchflow.trial, .sdk, .job, .runtime) | Passed |
| 4 | Private helpers re-exported via trial shim (4 functions, identity verified) | Passed |
| 5 | Evaluation._sdk sentinel pattern (default=_SENTINEL, mock replaces it) | Passed |
| 6 | bf.run() is an async coroutine function | Passed |
| 7 | Full test suite | 854 passed, 0 failed |
| 8 | Lint + format + typecheck | All clean (0 errors) |
| 9 | Integration: bench eval create e2e against Daytona | Reward: 1.0, 40 tool calls |

Integration test output
Task: jax-computing-basics
Agent: gemini (gemini-3.1-flash-lite-preview)
Reward: 1.0
Tool calls: 40

No escalations. All backward-compat shims resolve correctly, new API names are importable, and the full Evaluation→Rollout pipeline works end-to-end.
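
For readers unfamiliar with two of the checks above, the sentinel default (test 5) and the coroutine-function check (test 6), here is a small standalone sketch of how such checks can be written. The `Evaluation` constructor signature, the `has_injected_sdk` helper, and `FakeSDK` are hypothetical; only the `_SENTINEL` default and the `_sdk` attribute name come from the test description, and the real class may wire them differently.

```python
import inspect

# Unique marker meaning "no value supplied". Unlike None, it cannot
# collide with a value a caller might legitimately pass.
_SENTINEL = object()


class FakeSDK:
    """Hypothetical test double standing in for a mock."""


class Evaluation:
    def __init__(self, sdk=_SENTINEL):
        self._sdk = sdk

    def has_injected_sdk(self) -> bool:
        # Identity comparison is the idiomatic way to test a sentinel.
        return self._sdk is not _SENTINEL


async def run(config):
    """Stand-in for an async entry point."""
    return config


default_eval = Evaluation()
mocked_eval = Evaluation(sdk=FakeSDK())

print(default_eval.has_injected_sdk())
print(mocked_eval.has_injected_sdk())
print(inspect.iscoroutinefunction(run))
```

The sentinel keeps "not provided" distinguishable from an explicit `None`, which is useful when a test replaces the attribute with a mock.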

@devin-ai-integration

Closing: this is now included in the combined refactor branch, #272 (refactor/v0.4 → main). All changes from this PR are preserved there.

@xdotli xdotli closed this May 16, 2026
@devin-ai-integration devin-ai-integration Bot changed the base branch from main to refactor/v0.4 May 16, 2026 01:07
xdotli added 3 commits May 16, 2026 01:08
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged
@devin-ai-integration devin-ai-integration Bot force-pushed the devin/1778883845-eng-46-kill-shims branch from 1e06c4f to 8675839 Compare May 16, 2026 01:09
@devin-ai-integration devin-ai-integration Bot merged commit 75bb8f3 into refactor/v0.4 May 16, 2026
1 of 2 checks passed
@xdotli xdotli deleted the devin/1778883845-eng-46-kill-shims branch May 17, 2026 05:51