The core problem: the benchmark measures the wrong thing
The agentready harbor compare command measures Claude Code's performance on Terminal-Bench tasks with and without doubleagent.md enabled. doubleagent.md is a Claude Code subagent file containing AgentReady-specific context: how assessors work, project architecture, and coding conventions for this codebase.
Terminal-Bench tasks are generic coding challenges (adaptive rejection samplers, async HTTP clients, etc.) that have no relationship to AgentReady. The agent file's content is irrelevant to those tasks. The benchmark is therefore measuring noise -- and may even show a slight negative effect, consistent with the ETH Zurich finding (Feb 2026) that repo-specific context files hurt performance on unrelated tasks (-3% success rate, +20% cost).
The benchmark does not measure what AgentReady actually claims to measure: whether following AgentReady guidance makes Claude Code more effective in a given repository.
What a real benchmark would require
A meaningful benchmark would need to:
- Assemble a set of repositories that are 100% AgentReady compliant across all scored attributes
- For each attribute (e.g. CLAUDE.md presence, type annotations, runnable tests), revert that one attribute to a non-compliant state while holding all others constant
- Run Claude Code on tasks within that repository in both states and measure the difference
This is an enormous undertaking. It also has a sustainability problem: every time an assessor's guidance changes, the benchmark dataset and task suite would need to be updated to match. Without that discipline, the benchmark drifts out of sync with the assessors it is supposed to validate.
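To make the scope concrete, here is a minimal sketch of the harness such a study implies. The run_tasks and revert callables are hypothetical placeholders (running Claude Code on a task suite inside a repository, and reverting exactly one scored attribute); neither exists in AgentReady today, and building them is most of the work:

```python
from typing import Callable, Sequence

def attribute_effects(
    compliant_repo: str,
    attributes: Sequence[str],
    run_tasks: Callable[[str], Sequence[int]],  # repo path -> one 0/1 outcome per task
    revert: Callable[[str, str], str],          # (repo path, attribute) -> path to degraded copy
) -> dict[str, float]:
    """Per-attribute effect on task success, holding all other attributes constant."""
    baseline = run_tasks(compliant_repo)
    baseline_rate = sum(baseline) / len(baseline)
    effects: dict[str, float] = {}
    for attr in attributes:
        degraded_repo = revert(compliant_repo, attr)  # only this attribute made non-compliant
        degraded = run_tasks(degraded_repo)
        effects[attr] = baseline_rate - sum(degraded) / len(degraded)
    return effects
```

Each attribute requires its own full task-suite run per repository, which is where the cost comes from.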
Statistical problems with the current implementation
Even if the benchmark were measuring the right thing, the implementation has two statistical issues:
Independence violation: The code runs multiple trials per task and passes all runs as individual observations to scipy.stats.ttest_ind, which assumes independent observations. Runs on the same task are not independent -- they share task difficulty. This inflates the apparent sample size and produces p-values that are more significant than they should be. The correct approach is to aggregate per-task, then do a paired comparison across tasks.
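For illustration, a minimal sketch of that per-task aggregation and paired comparison, using scipy's paired t-test with a Wilcoxon signed-rank test as a nonparametric check. The dictionary shape, condition names, and toy data below are hypothetical, not the harness's actual data model:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical data shape: per-condition, per-task lists of 0/1 trial outcomes.
results = {
    "with_agent_file":    {"task-01": [1, 1, 0], "task-02": [1, 0, 0], "task-03": [1, 0, 1]},
    "without_agent_file": {"task-01": [1, 0, 0], "task-02": [0, 0, 0], "task-03": [1, 1, 1]},
}

tasks = sorted(results["with_agent_file"])

# Aggregate trials to one success rate per task, so each task contributes a single
# observation instead of one observation per trial.
with_rates = np.array([np.mean(results["with_agent_file"][t]) for t in tasks])
without_rates = np.array([np.mean(results["without_agent_file"][t]) for t in tasks])

# Paired comparison across tasks: each task is matched with itself across conditions.
t_stat, p_t = ttest_rel(with_rates, without_rates)
w_stat, p_w = wilcoxon(with_rates, without_rates)
print(f"mean difference: {np.mean(with_rates - without_rates):+.3f}, "
      f"paired t-test p={p_t:.3f}, Wilcoxon p={p_w:.3f}")
```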
Insufficient power for plausible effect sizes: With 89 binary-outcome tasks, the standard error of a difference in success rates between the two conditions is roughly sqrt(2 * 0.25 / 89) ~ 7.5%, so the minimum detectable effect at reasonable statistical power (80% power, alpha = 0.05) is in the 15-20% range. AgentReady's guidance items are likely to have single-digit impacts on task success -- they are incremental improvements in context and structure, not fundamental capability changes. The benchmark cannot distinguish a real 5% improvement from random variation, which means it will almost always return a non-significant result regardless of whether the guidance is actually helping.
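A quick back-of-the-envelope check of that number, assuming a standard two-sample comparison of proportions at a worst-case success rate of 50% (a paired analysis would do somewhat better, but not enough to reach single-digit effects):

```python
from scipy.stats import norm

n_tasks = 89            # binary pass/fail Terminal-Bench tasks per condition
p = 0.5                 # worst-case variance for a binary outcome
alpha, power = 0.05, 0.80

# Standard error of the difference in success rates between two arms of n_tasks each.
se_diff = (2 * p * (1 - p) / n_tasks) ** 0.5

# Minimum detectable effect: (z_{1-alpha/2} + z_{power}) * SE.
mde = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se_diff

print(f"SE of difference ~ {se_diff:.3f}, minimum detectable effect ~ {mde:.3f}")
# SE of difference ~ 0.075, minimum detectable effect ~ 0.210
```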
Recommendation
Remove the benchmark and harbor commands. If empirical validation of assessors is a goal, the requirements above describe what a proper study would take -- it is a substantial research effort and belongs outside the CLI.
Opened by Claude Code under the supervision of Bill Murdock.