The benchmark and harbor commands measured Claude Code performance on generic Terminal-Bench tasks unrelated to AgentReady, so they validated nothing about the tool's core claims. They also had statistical flaws (independence violations, insufficient power for plausible effect sizes).

Also removes the unregistered eval-harness CLI and its services, which shared the same tbench-based approach and were already inaccessible dead code.

Closes #394

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| Layer / File(s) | Summary |
|---|---|
| Data Models & Configuration: src/agentready/models/harbor.py, src/agentready/services/eval_harness/harbor_config.py, repos-for-benchmark.txt | HarborConfig, model definitions, and benchmark repository list are deleted. |
| Service Layer: src/agentready/services/eval_harness/baseline.py, assessor_tester.py, batch_runner.py, aggregator.py, src/agentready/services/harbor/runner.py, comparer.py, agent_toggler.py, dashboard_generator.py, result_parser.py, src/agentready/services/harbor/__init__.py | Benchmark orchestration, result parsing, statistical comparison, and dashboard generation modules are deleted. |
| CLI Command & Integration: src/agentready/cli/benchmark.py, src/agentready/cli/main.py | Benchmark CLI command (including benchmark, _run_tbench, validate_assessor) is removed. Main CLI is rewired to remove harbor from lazy-loading and replace eager benchmark import with align; assess-batch, experiment, and extract-skills are added to lazy-command mapping. |
| Documentation & Specifications: README.md, docs/harbor-comparison-guide.md, specs/002-harbor-real-integration/* | Harbor CLI installation guide is removed from README. Harbor integration specification directory (DOUBLEAGENT_IMPACT.md, plan.md, data-model.md, spec.md, tasks.md, research.md, checklists/requirements.md, contracts/harbor-results-schema.json) is deleted. |
| Tests, Utilities & Supporting Files: tests/unit/test_cli_benchmark.py, test_cli_harbor.py, test_harbor_config.py, test_harbor_models.py, test_harbor_services.py, test_eval_harness_services.py, test_eval_harness_cli.py, tests/unit/services/harbor/test_assessor_state_toggler.py, src/agentready/utils/__init__.py, src/agentready/utils/preflight.py, src/agentready/services/eval_harness/__init__.py, src/agentready/reporters/harbor_markdown.py, src/agentready/cli/eval_harness.py, src/agentready/cli/harbor.py, src/agentready/services/eval_harness/tbench_runner.py, patches/harbor-task-filtering-fix.patch | All Harbor-related and benchmark-related test files are deleted. Preflight utility exports for Harbor CLI checking and Terminal-Bench dataset validation are removed from utils module. Harbor reporter, Harbor CLI module, eval-harness CLI module, and Harbor task-filtering patch file are removed. |
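For readers unfamiliar with the lazy-command mapping mentioned in the CLI row, the idea can be sketched roughly as below. This is a minimal stdlib-only illustration, not the code in src/agentready/cli/main.py: the `LazyCommandRegistry` class is hypothetical, and the import targets are standard-library stand-ins chosen so the snippet runs anywhere. Only the command names `assess-batch`, `experiment`, and `extract-skills` come from the walkthrough.

```python
import importlib
from typing import Callable, Dict, List, Optional


class LazyCommandRegistry:
    """Hypothetical registry that imports a command's module on first use."""

    def __init__(self, lazy_commands: Dict[str, str]) -> None:
        # Maps a command name to a "module:attribute" import target.
        self._lazy = dict(lazy_commands)
        self._loaded: Dict[str, Callable] = {}

    def list_commands(self) -> List[str]:
        # Listing names must not trigger any import.
        return sorted(self._lazy)

    def get_command(self, name: str) -> Optional[Callable]:
        if name in self._loaded:
            return self._loaded[name]
        target = self._lazy.get(name)
        if target is None:
            # Unregistered (for example, removed) commands resolve to nothing.
            return None
        module_name, attr = target.split(":")
        command = getattr(importlib.import_module(module_name), attr)
        self._loaded[name] = command
        return command


# Stand-in targets: the real mapping would point at agentready CLI modules,
# whose paths are not shown in this PR summary.
registry = LazyCommandRegistry(
    {
        "assess-batch": "json:dumps",
        "experiment": "textwrap:dedent",
        "extract-skills": "shlex:split",
    }
)

print(registry.list_commands())
print(registry.get_command("harbor"))
```

With a mapping like this, dropping a command such as harbor amounts to deleting its entry, and a lookup for the removed name simply finds nothing.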
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: remove benchmark and harbor commands' clearly and concisely describes the main change: removal of specific CLI commands, which aligns perfectly with the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the rationale for removing benchmark/harbor/eval-harness commands and referencing the associated cleanup of models, services, tests, and docs. |
| Linked Issues check | ✅ Passed | The PR successfully implements the core requirement from #394: removes benchmark and harbor commands due to measuring unrelated Terminal-Bench tasks with statistical flaws (independence violations and insufficient power for plausible effect sizes). |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: removal of benchmark/harbor/eval-harness commands, associated CLI integration changes, and cleanup of related code, services, tests, specs, and documentation as justified in #394. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
📈 Test Coverage Report
Coverage calculated from unit tests only
Summary
- Removes the `agentready benchmark`, `agentready validate-assessor`, and `agentready harbor` commands, which measured Claude Code performance on generic Terminal-Bench tasks unrelated to AgentReady and had statistical flaws (independence violations, insufficient power for plausible effect sizes)
- Also removes the unregistered `eval-harness` CLI and its services, which shared the same tbench-based approach and were already inaccessible dead code

Closes #394
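To make the "insufficient power" point concrete, here is a rough sketch of a power calculation for comparing two task pass rates. It is an illustration only: the two-proportion z-test, the pass rates (0.40 vs 0.45), and the sample sizes are assumptions, not figures from this PR or from #394, and the removed code may have used a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p_base, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two pass rates.

    Uses the normal approximation with an unpooled standard error and
    assumes every run is independent, so correlated runs would make the
    real power lower than this estimate suggests.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    std_err = sqrt(
        p_base * (1 - p_base) / n_per_arm + p_treat * (1 - p_treat) / n_per_arm
    )
    shift = abs(p_treat - p_base) / std_err
    # Probability of landing in either rejection region under the alternative.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical numbers: a 5-point improvement in pass rate at three sample sizes.
for n in (20, 200, 2000):
    print(n, round(two_proportion_power(0.40, 0.45, n), 3))
```

Under these assumed numbers, a few dozen runs per arm leave the test barely better than the false-positive rate, while reaching the conventional 0.8 power takes on the order of thousands of independent runs per arm.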
Test plan
- `agentready --help` no longer shows `benchmark` or `harbor` commands
- `agentready assess .` runs cleanly and produces a valid report
- `black`, `isort`, `ruff` all clean

🤖 Generated with Claude Code under the supervision of Bill Murdock