benchflow 0.5.0

github-actions released this 05 Jun 00:12
· 17 commits to main since this release
082659d

What's Changed

  • chore: start v0.5 development by @xdotli in #298
  • fix(cli): remove tasks generate short flags by @xdotli in #299
  • feat(opaquetoolsbench): add runnable BFCL adapter by @xdotli in #301
  • feat: add OpaqueToolsBench BFCL adapter (+ fix eval() security vuln) by @devin-ai-integration[bot] in #280
  • fix(pi-acp): preserve provider prefix for set_model by @xdotli in #300
  • feat(hilbench): add runnable SWE adapter by @xdotli in #302
  • feat: add HILBench SWE baseline adapter by @devin-ai-integration[bot] in #279
  • feat(clbench): add Continual Learning Bench adapter (ENG-103) by @xdotli in #303
  • feat: add Continual Learning Bench (CLBench) adapter by @devin-ai-integration[bot] in #283
  • feat(release-gate): refresh trial-ready evidence on post-adapter main by @xdotli in #304
  • fix(continuallearningbench): rename adapter surface by @xdotli in #305
  • Fix v0.5 follow-up stress regressions by @xdotli in #306
  • ACP provider token/cost telemetry (supersedes #289) by @xdotli in #307
  • [WIP] feat/Add ACP Provider Token/Cost Telemetry by @AmyTao in #289
  • test: guard issue #229 — deploy_skills receives effective task path by @xdotli in #308
  • fix(verifier): restore [verifier] pytest_plugins task.toml field (#192 bug 2) by @xdotli in #309
  • feat(sandbox): multi-container service selection for vulhub-style tasks (#248) by @xdotli in #310
  • feat: support LLM-as-judge as a first-class task verification method by @xdotli in #311
  • fix: address PR #309/#310 verification findings by @xdotli in #312
  • fix(verifier): address PR #311 llm-judge verification follow-ups by @xdotli in #314
  • fix(tests): remove order-dependent flake in test_config_mismatch_warning by @xdotli in #318
  • fix(rewards): three rewards-code audit findings by @xdotli in #319
  • feat(verifier): target-side test.sh verification for multi-container tasks (#248) by @xdotli in #321
  • fix(agents): repair acpx agent resolution; complete register_agent fields by @xdotli in #322
  • fix: disjoint error/verifier_error result buckets + skill_eval glob collision by @xdotli in #320
  • fix: traces split handling, COPY staging, exec env secrecy (audit findings) by @xdotli in #323
  • fix(rewards): _call_google returns '' instead of None on empty text by @xdotli in #324
  • test: harden async tests against event-loop order-dependence by @xdotli in #325
  • fix(daytona): skip host telemetry proxy for unreachable remote sandboxes by @xdotli in #327
  • fix(rewards): declare judge provider SDKs; surface missing-SDK as verifier error by @xdotli in #326
  • fix(bedrock): fail fast for Bedrock models on unreachable remote sandboxes by @xdotli in #329
  • fix(daytona): raise memory clamp default 8 GB -> 16 GB by @xdotli in #328
  • Verifier follow-ups batch (PR #320-#324) by @xdotli in #330
  • fix(registry): carry api_protocol/default_model through _acpx_wrap by @xdotli in #331
  • fix(verifier): create /logs/verifier in non-main target services by @xdotli in #332
  • fix(rewards): concurrency-safe judge env + restore unknown-prefix fallback by @xdotli in #335
  • test: name guarding PRs in regression-test docstrings by @xdotli in #333
  • fix: parquet fallback, non-identifier env keys, umask scoping (PR #323 follow-ups) by @xdotli in #336
  • fix(sandbox): POSIX sh for service exec + Modal is_dir/is_file service param by @xdotli in #334
  • fix(agents): inherit BENCHFLOW_PROVIDER_BASE_URL/API_KEY from host env by @xdotli in #313
  • fix: make JS agent installs POSIX sh compatible by @bingran-you in #423
  • feat: add Azure AI Foundry provider routing by @bingran-you in #422
  • fix: add Mintlify docs config by @bingran-you in #482
  • Merge v0.5 integration into main by @xdotli in #344
  • Add Daytona usage tracking proxy support by @bingran-you in #568
  • docs: task authoring by @xdotli in #574
  • openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable by @bingran-you in #575
  • fix: auto-discover task-bundled skills for oracle mode by @ElegantLin in #562
  • Fix no-skills task skill isolation by @bingran-you in #586
  • Enable Daytona sandbox-local usage proxy by @bingran-you in #587
  • docs: add experiment guidance to AGENTS.md; ignore .claude/handoffs by @bingran-you in #591
  • Add bench --version flag by @xdotli in #595
  • fix: upload absolute str skills_dir host dirs in scene skill activation by @Yiminnn in #594
  • Add Benchflow experiment review skill by @bingran-you in #596
  • Enable openhands SkillsBench runs for Bedrock Opus 4.8 (MAX thinking) + Gemini usage tracking + apples-to-apples token accounting by @bingran-you in #598
  • ci: run test workflow on merge_group events by @xdotli in #315
  • fix: isolate mirror tests from host env to prevent API key leak (#522) by @ElegantLin in #573
  • Add openai + us-openai provider routing by @EYH0602 in #593
  • Fix daytona sandbox cleanup/list for SDK >=0.18 iterator API by @bingran-you in #605
  • Fix ruff format drift on two test files (unblock CI on all open PRs) by @bingran-you in #606
  • Add MAI-Thinking-1 verifier-hardening checklist to experiment-review skill by @bingran-you in #604
  • Add Harbor-style usage metadata by @bingran-you in #607
  • feat: stream ACP trajectory incrementally to disk by @Yiminnn in #566
  • Add --prompt to bench eval create by @bingran-you in #608
  • Remove deprecated bench run CLI command by @bingran-you in #610
  • docs: verified Opus-4.8/Gemini × skill Daytona run recipe + experiment gotchas by @bingran-you in #609
  • Fix pip-audit aiohttp advisory by @bingran-you in #612
  • dashboard: add live Daytona sandboxes panel by @bingran-you in #611
  • Fix Claude ACP model config option setup by @bingran-you in #614
  • Fix OpenHands skill invocation counting by @bingran-you in #597
  • Make ACP model selection robust across the @agentclientprotocol family by @bingran-you in #616
  • Replace provider proxies with LiteLLM runtime by @bingran-you in #613
  • Simplify canonical task skill modes by @bingran-you in #615
  • Fix CI formatting after skill mode changes by @bingran-you in #618
  • fix: extend trajectory JSONL redaction to cover Google, AWS, Daytona, and header patterns (#537) by @ElegantLin in #585
  • fix: surface test-stdout.txt in verifier exception for dep_install classification (#540) by @ElegantLin in #572
  • fix: classify provider auth failures as non-retryable by @ElegantLin in #564
  • fix: persist Daytona sandbox IDs for audit and cleanup by @ElegantLin in #563
  • Fix usage tracking policy fallbacks by @bingran-you in #620
  • fix(litellm-runtime): bootstrap sandbox LiteLLM via uv (fixes Daytona runs) by @bingran-you in #623
  • Harden usage tracking follow-up edges by @bingran-you in #624
  • Add public and internal preview release workflows by @bingran-you in #621
  • Prepare 0.5.0 public release by @bingran-you in #626

Full Changelog: v0.4.0...v0.5.0