Release benchflow 0.5.0 · benchflow-ai/benchflow

What's Changed

chore: start v0.5 development by @xdotli in #298
fix(cli): remove tasks generate short flags by @xdotli in #299
feat(opaquetoolsbench): add runnable BFCL adapter by @xdotli in #301
feat: add OpaqueToolsBench BFCL adapter (+ fix eval() security vuln) by @devin-ai-integration[bot] in #280
fix(pi-acp): preserve provider prefix for set_model by @xdotli in #300
feat(hilbench): add runnable SWE adapter by @xdotli in #302
feat: add HILBench SWE baseline adapter by @devin-ai-integration[bot] in #279
feat(clbench): add Continual Learning Bench adapter (ENG-103) by @xdotli in #303
feat: add Continual Learning Bench (CLBench) adapter by @devin-ai-integration[bot] in #283
feat(release-gate): refresh trial-ready evidence on post-adapter main by @xdotli in #304
fix(continuallearningbench): rename adapter surface by @xdotli in #305
Fix v0.5 follow-up stress regressions by @xdotli in #306
ACP provider token/cost telemetry (supersedes #289) by @xdotli in #307
[WIP] feat/Add ACP Provider Token/Cost Telemetry by @AmyTao in #289
test: guard issue #229 — deploy_skills receives effective task path by @xdotli in #308
fix(verifier): restore [verifier] pytest_plugins task.toml field (#192 bug 2) by @xdotli in #309
feat(sandbox): multi-container service selection for vulhub-style tasks (#248) by @xdotli in #310
feat: support LLM-as-judge as a first-class task verification method by @xdotli in #311
fix: address PR #309/#310 verification findings by @xdotli in #312
fix(verifier): address PR #311 llm-judge verification follow-ups by @xdotli in #314
fix(tests): remove order-dependent flake in test_config_mismatch_warning by @xdotli in #318
fix(rewards): three rewards-code audit findings by @xdotli in #319
feat(verifier): target-side test.sh verification for multi-container tasks (#248) by @xdotli in #321
fix(agents): repair acpx agent resolution; complete register_agent fields by @xdotli in #322
fix: disjoint error/verifier_error result buckets + skill_eval glob collision by @xdotli in #320
fix: traces split handling, COPY staging, exec env secrecy (audit findings) by @xdotli in #323
fix(rewards): _call_google returns '' instead of None on empty text by @xdotli in #324
test: harden async tests against event-loop order-dependence by @xdotli in #325
fix(daytona): skip host telemetry proxy for unreachable remote sandboxes by @xdotli in #327
fix(rewards): declare judge provider SDKs; surface missing-SDK as verifier error by @xdotli in #326
fix(bedrock): fail fast for Bedrock models on unreachable remote sandboxes by @xdotli in #329
fix(daytona): raise memory clamp default 8 GB -> 16 GB by @xdotli in #328
Verifier follow-ups batch (PR #320-#324) by @xdotli in #330
fix(registry): carry api_protocol/default_model through _acpx_wrap by @xdotli in #331
fix(verifier): create /logs/verifier in non-main target services by @xdotli in #332
fix(rewards): concurrency-safe judge env + restore unknown-prefix fallback by @xdotli in #335
test: name guarding PRs in regression-test docstrings by @xdotli in #333
fix: parquet fallback, non-identifier env keys, umask scoping (PR #323 follow-ups) by @xdotli in #336
fix(sandbox): POSIX sh for service exec + Modal is_dir/is_file service param by @xdotli in #334
fix(agents): inherit BENCHFLOW_PROVIDER_BASE_URL/API_KEY from host env by @xdotli in #313
fix: make JS agent installs POSIX sh compatible by @bingran-you in #423
feat: add Azure AI Foundry provider routing by @bingran-you in #422
fix: add Mintlify docs config by @bingran-you in #482
Merge v0.5 integration into main by @xdotli in #344
Add Daytona usage tracking proxy support by @bingran-you in #568
docs: task authoring by @xdotli in #574
openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable by @bingran-you in #575
fix: auto-discover task-bundled skills for oracle mode by @ElegantLin in #562
Fix no-skills task skill isolation by @bingran-you in #586
Enable Daytona sandbox-local usage proxy by @bingran-you in #587
docs: add experiment guidance to AGENTS.md; ignore .claude/handoffs by @bingran-you in #591
Add bench --version flag by @xdotli in #595
fix: upload absolute str skills_dir host dirs in scene skill activation by @Yiminnn in #594
Add Benchflow experiment review skill by @bingran-you in #596
Enable openhands SkillsBench runs for Bedrock Opus 4.8 (MAX thinking) + Gemini usage tracking + apples-to-apples token accounting by @bingran-you in #598
ci: run test workflow on merge_group events by @xdotli in #315
fix: isolate mirror tests from host env to prevent API key leak (#522) by @ElegantLin in #573
Add openai + us-openai provider routing by @EYH0602 in #593
Fix daytona sandbox cleanup/list for SDK >=0.18 iterator API by @bingran-you in #605
Fix ruff format drift on two test files (unblock CI on all open PRs) by @bingran-you in #606
Add MAI-Thinking-1 verifier-hardening checklist to experiment-review skill by @bingran-you in #604
Add Harbor-style usage metadata by @bingran-you in #607
feat: stream ACP trajectory incrementally to disk by @Yiminnn in #566
Add --prompt to bench eval create by @bingran-you in #608
Remove deprecated bench run CLI command by @bingran-you in #610
docs: verified Opus-4.8/Gemini × skill Daytona run recipe + experiment gotchas by @bingran-you in #609
Fix pip-audit aiohttp advisory by @bingran-you in #612
dashboard: add live Daytona sandboxes panel by @bingran-you in #611
Fix Claude ACP model config option setup by @bingran-you in #614
Fix OpenHands skill invocation counting by @bingran-you in #597
Make ACP model selection robust across the @agentclientprotocol family by @bingran-you in #616
Replace provider proxies with LiteLLM runtime by @bingran-you in #613
Simplify canonical task skill modes by @bingran-you in #615
Fix CI formatting after skill mode changes by @bingran-you in #618
fix: extend trajectory JSONL redaction to cover Google, AWS, Daytona, and header patterns (#537) by @ElegantLin in #585
fix: surface test-stdout.txt in verifier exception for dep_install classification (#540) by @ElegantLin in #572
fix: classify provider auth failures as non-retryable by @ElegantLin in #564
fix: persist Daytona sandbox IDs for audit and cleanup by @ElegantLin in #563
Fix usage tracking policy fallbacks by @bingran-you in #620
fix(litellm-runtime): bootstrap sandbox LiteLLM via uv (fixes Daytona runs) by @bingran-you in #623
Harden usage tracking follow-up edges by @bingran-you in #624
Add public and internal preview release workflows by @bingran-you in #621
Prepare 0.5.0 public release by @bingran-you in #626

New Contributors

@ElegantLin made their first contribution in #562

Full Changelog: v0.4.0...v0.5.0
