refactor: kill shim layers, single Rollout execution path (ENG-46) #268
Merged
devin-ai-integration[bot] merged 3 commits on May 16, 2026
🧪 Test Results: ENG-46 Refactoring
All 9 tests passed. Tested backward compatibility, the new API surface, and the full integration pipeline.
Primary: Backward compat + new API + integration (9/9 passed)
Integration test output
No escalations. All backward-compat shims resolve correctly, new API names are importable, and the full Evaluation → Rollout pipeline works end-to-end.
Closing: this is now included in the combined refactor branch, #272.
- Rename trial.py → rollout.py, Trial → Rollout, TrialConfig → RolloutConfig
- Rename RunResult → RolloutResult in models.py
- Create evaluation.py from job.py: Job → Evaluation, JobConfig → EvaluationConfig, JobResult → EvaluationResult
- Create _run.py with bf.run(RolloutConfig) → RolloutResult entry point
- Replace sdk.py with thin backward-compat shim delegating to rollout.py
- Replace trial.py with re-export shim from rollout.py
- Replace job.py with re-export shim from evaluation.py
- Preserve runtime.py API, update internal imports to use Rollout
- Update __init__.py: new public API + backward-compat aliases
- Update test patch targets from benchflow.trial → benchflow.rollout
- All 841 tests pass, lint clean, pre-existing typecheck errors unchanged
1e06c4f to 8675839
Summary
Eliminates the SDK → Runtime → Trial triple-indirection and establishes `bf.run(RolloutConfig) → RolloutResult` as the single execution path (ENG-46). This is Phase 2 of the v0.4 architecture refactoring.

Depends on: #261 (ENG-48 sandbox protocol) and #262 (ENG-47 unified types). Both are merged into this branch.
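A minimal sketch of what a single execution path of this shape could look like. The field names and the `execute` method are assumptions made for illustration, and the sketch is synchronous for brevity; the actual benchflow signatures may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutConfig:
    # Hypothetical fields, chosen only for illustration.
    task: str
    agent: str


@dataclass(frozen=True)
class RolloutResult:
    # Hypothetical fields, chosen only for illustration.
    task: str
    reward: float


class Rollout:
    """Stand-in for the class that owns the execution logic."""

    def __init__(self, config: RolloutConfig) -> None:
        self.config = config

    def execute(self) -> RolloutResult:
        # A real rollout would start an environment, run the agent, and verify.
        return RolloutResult(task=self.config.task, reward=0.0)


def run(config: RolloutConfig) -> RolloutResult:
    """Single entry point: config in, result out, no intermediate layers."""
    return Rollout(config).execute()


result = run(RolloutConfig(task="jax-computing-basics", agent="gemini"))
print(result)
```

The point of the shape is that callers only ever see one function and two data types; everything else stays behind `run`.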
What changed
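| Before | After | File |
| --- | --- | --- |
| `Trial` class | `Rollout` class | `rollout.py` (new, from `trial.py`) |
| `TrialConfig` | `RolloutConfig` | `rollout.py` |
| `RunResult` | `RolloutResult` | `models.py` |
| `Job` class | `Evaluation` class | `evaluation.py` (new, from `job.py`) |
| `JobConfig` / `JobResult` | `EvaluationConfig` / `EvaluationResult` | `evaluation.py` |
| `SDK.run()` orchestration | `rollout.py` | `sdk.py` → backward-compat shim |
| `bf.run()` from `runtime.py` | `_run.py` | `_run.py` (new) |

Architecture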
Backward compatibility
All old names still work via aliases:
- `Trial = Rollout`, `TrialConfig = RolloutConfig`, `RunResult = RolloutResult`
- `Job = Evaluation`, `JobConfig = EvaluationConfig`, `JobResult = EvaluationResult`
- `from benchflow.sdk import SDK`: shim delegates to `rollout.py`
- `from benchflow.trial import Trial`: shim re-exports from `rollout.py`
- `from benchflow.job import Job`: shim re-exports from `evaluation.py`

SDK elimination
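As a rough illustration of the alias-and-shim pattern, the sketch below binds an old name to the new class and keeps an old-style `SDK` wrapper that only forwards to the new entry point. The class bodies and the `task` argument are hypothetical stand-ins, not the actual benchflow code.

```python
class Rollout:
    """Stand-in for the new class defined in rollout.py."""

    def __init__(self, task: str) -> None:
        self.task = task

    def execute(self) -> str:
        return f"ran {self.task}"


def run(task: str) -> str:
    """Stand-in for the new module-level entry point."""
    return Rollout(task).execute()


# Backward-compat alias: the old name is bound to the same class object,
# so isinstance checks and subclassing keep working for existing callers.
Trial = Rollout


class SDK:
    """Thin shim: keeps the old call shape and forwards to run()."""

    def run(self, task: str) -> str:
        return run(task)


old_style = SDK().run("demo-task")
new_style = run("demo-task")
print(old_style == new_style)
```

Because the alias is the same object and the shim holds no logic of its own, old and new call sites cannot drift apart.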
The 582-line `sdk.py` is replaced by a thin shim. All SDK static methods (`_init_trial`, `_write_config`, `_resolve_prompts`, `_build_result`, `_start_env_and_upload`, `_run_oracle`, `_verify`) are now module-level functions in `rollout.py`. `SDK.run()` delegates to `bf.run()`.

Test updates
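Patch targets generally have to name the module in which the code under test looks a name up. The sketch below shows one likely reason a target on a re-export shim stops taking effect once the logic lives elsewhere. The module names `demo_rollout` and `demo_trial` are invented for the example and are built in memory so it runs on its own.

```python
import sys
import types
from unittest.mock import patch

# "demo_rollout" plays the role of the new module that owns the logic.
rollout_mod = types.ModuleType("demo_rollout")
exec(
    "def verify():\n"
    "    return 'real'\n"
    "def execute():\n"
    "    return verify()\n",
    rollout_mod.__dict__,
)
sys.modules["demo_rollout"] = rollout_mod

# "demo_trial" is a re-export shim: it only rebinds the same functions.
trial_mod = types.ModuleType("demo_trial")
trial_mod.verify = rollout_mod.verify
trial_mod.execute = rollout_mod.execute
sys.modules["demo_trial"] = trial_mod

# Patching the shim rebinds a name that execute() never looks up.
with patch("demo_trial.verify", return_value="mocked"):
    via_shim = rollout_mod.execute()

# Patching the module where the name is resolved takes effect.
with patch("demo_rollout.verify", return_value="mocked"):
    via_rollout = rollout_mod.execute()

print(via_shim, via_rollout)
```

In this sketch only the second patch changes what `execute()` sees, which is why patch targets tend to follow the code when it moves to a new module.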
Test patch targets updated from `benchflow.trial.*` → `benchflow.rollout.*` and `benchflow.sdk.Verifier` → `harbor.verifier.verifier.Verifier` to match the new module structure. No test logic was changed.

Integration test
Verified end-to-end with `bench eval create --source-repo benchflow-ai/skillsbench --source-path tasks/jax-computing-basics -a gemini -m gemini-3.1-flash-lite-preview -e daytona -c 1 -o jobs/integration-smoke/gemini`. The run completed successfully (16 tool calls, reward=0.0).

Review & Testing Checklist for Human
- `from benchflow.sdk import SDK; SDK().run(...)` still works end-to-end (backward compat for existing scripts)
- `bench eval create` with a real agent against Daytona to verify the full Evaluation pipeline
- `from benchflow import Rollout, RolloutConfig, RolloutResult, run` resolves correctly
- […] `rollout.py`

Notes
- `ruff check .` passes clean
- `ruff format --check` passes clean
- `ty check src/` passes with 0 errors (no regressions from main)
- Conflicts with `main` resolved (`job.py` had a content conflict from the ProgramBench integration, "feat: add ProgramBench integration, remove TB2, migrate .ref/ → benchmarks/" #237)
- `runtime.py` file is preserved with its full API (Agent, Environment, Runtime classes); only internal imports were updated to use Rollout
- […] `adapters/` directory created (ENG-51's scope)

Link to Devin session: https://app.devin.ai/sessions/6ffc9c5487484894b289d3b947c9641d
Requested by: @xdotli