From 0b91f44f350e984b260da2436de019fd1ea089f4 Mon Sep 17 00:00:00 2001 From: Nati Fridman Date: Wed, 13 May 2026 11:42:09 +0300 Subject: [PATCH 1/3] fix: correct 3 citation-claim mismatches in research report (v2.0.1) Address follow-up feedback from #374: three citations in Section 5.1 (Test Execution) didn't support the claims they were attached to. - Remove Terminal-Bench citation (agent scaffolding, not test execution) - Remove DORA 2025 marketing page (Google Cloud blog, not actual report) - Remove Salesforce Cursor article (AI writing tests, not tests helping AI) - Add Zhang et al. (arXiv:2604.11088) on positive vs negative directives - Rename title from "Comprehensive Research" to "Curated Best Practices" - Add LIMITATIONS & EVIDENCE QUALITY section - Update default-weights.yaml header attribution No tier or weight value changes. Scoring behavior unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) --- src/agentready/data/RESEARCH_REPORT.md | 65 ++++++++++++++++++------ src/agentready/data/default-weights.yaml | 2 +- 2 files changed, 50 insertions(+), 17 deletions(-) diff --git a/src/agentready/data/RESEARCH_REPORT.md b/src/agentready/data/RESEARCH_REPORT.md index e0e62068..326f19f2 100644 --- a/src/agentready/data/RESEARCH_REPORT.md +++ b/src/agentready/data/RESEARCH_REPORT.md @@ -1,19 +1,19 @@ --- -version: "2.0.0" -date: "2026-05-03" +version: "2.0.1" +date: "2026-05-13" focus: "AI coding agent optimization (Claude Code, Copilot, Cursor, Codex, and cross-tool)" attribute_count: 41 tier_count: 4 -reference_count: 55 +reference_count: 53 --- -# Agent-Ready Codebase Attributes: Comprehensive Research +# Agent-Ready Codebase Attributes: Curated Best Practices *Optimizing Codebases for AI-Assisted Development* -**Version:** 2.0.0 -**Date:** 2026-05-03 +**Version:** 2.0.1 +**Date:** 2026-05-13 **Focus:** AI coding agent optimization (cross-tool: Claude Code, GitHub Copilot, Cursor, Codex, Gemini CLI) -**Sources:** 55+ authoritative sources including Anthropic, ETH Zurich, Red Hat, GitHub, Microsoft, Google, ArXiv +**Sources:** 50+ curated sources including Anthropic, ETH Zurich, Red Hat, GitHub, Microsoft, ArXiv --- @@ -28,7 +28,7 @@ This document catalogs 27 assessed attributes (across 41 sub-topics) that make c **Key Research Findings (2025-2026):** - Auto-generated context files **reduce** agent success by ~3% and increase cost by 20-23% (ETH Zurich, Feb 2026) -- The same model scores 52.8% vs 66.5% depending on surrounding infrastructure — scaffolding matters more than the model (LangChain Terminal-Bench) +- Positive directives ("follow code style") actively hurt agent performance; only negative constraints ("do not refactor unrelated code") reliably help (Zhang et al., Apr 2026) - Anthropic identifies "giving agents a way to verify their work" as the single highest-leverage practice - Human-written context files help only marginally (+4%) — only include what agents can't discover on their own (ETH Zurich) @@ -83,8 +83,11 @@ This document catalogs 27 assessed attributes (across 41 sub-topics) that make c **Litmus test for every line:** "Would removing this cause the agent to make a mistake it wouldn't otherwise make?" If not, delete it. +**Directive Framing (Critical):** Zhang et al. (Apr 2026) found that positive directives in context files ("follow the existing code style," "use descriptive names") actively hurt agent performance, while negative constraints ("do not refactor code outside the target function," "never modify the public API") reliably improve it. 
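+Frame context file instructions as boundaries and prohibitions, not aspirational guidance.
+
+For concreteness, a minimal AGENTS.md-style sketch built from the study's own example phrasings (the grouping and comments are illustrative, not taken from the paper):
+
+```markdown
+<!-- Boundary constraints: the framing Zhang et al. found reliably helps -->
+- Do not refactor code outside the target function.
+- Never modify the public API.
+
+<!-- Aspirational directives: the framing they found actively hurts -->
+- Follow the existing code style.
+- Use descriptive names.
+```
+
+Each prohibition passes the litmus test above: removing it would let the agent make a mistake it would not otherwise make.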
+
 **Citations:**
 - ETH Zurich: "Are Repository-Level Context Files Helpful for Coding Agents?" (Feb 2026) — arxiv.org/html/2602.11988v1
+- Zhang et al.: "What Makes Repository-Level Instructions Effective for AI Agents?" (Apr 2026) — arxiv.org/html/2604.11088
 - Anthropic: Claude Code Best Practices — code.claude.com/docs/en/best-practices
 - GitHub Blog: "How to write a great agents.md — lessons from over 2,500 repositories" (2026)
 - Addy Osmani: "Stop Using /init for AGENTS.md" (Feb 2026)
@@ -535,7 +538,7 @@ project/

 **Definition:** Ability to run tests with a single command, with adequate code coverage.

-**Why It Matters:** Anthropic identifies test execution as "the single highest-leverage thing you can do" for AI agent effectiveness. Agents succeed through tight feedback loops: change code, run tests, see failures, iterate. Without runnable tests, agents fly blind. The DORA 2025 report confirms AI amplifies existing good testing practices — TDD is more critical than ever.
+**Why It Matters:** Anthropic identifies test execution as "the single highest-leverage thing you can do" for AI agent effectiveness. Agents succeed through tight feedback loops: change code, run tests, see failures, iterate. Without runnable tests, agents fly blind. Red Hat independently reached the same conclusion, placing single-command test execution as the first requirement in their repository scaffolding guide.

 **Impact on Agent Behavior:**
 - Enables self-correction through test-driven iteration
@@ -558,9 +561,7 @@
 **Citations:**
 - Anthropic: Claude Code Best Practices — "giving agents a way to verify their work" — code.claude.com/docs/en/best-practices
 - Red Hat: Repository Scaffolding for AI Coding Agents, Section 1.1 (April 2026)
-- DORA 2025 (Google Cloud): "AI amplifies existing good practices. TDD is more critical than ever."
-- LangChain Terminal-Bench: same model scored 52.8% vs 66.5% depending on scaffolding (+26%)
-- Salesforce Engineering: "How Cursor AI Cut Legacy Code Coverage Time by 85%"
+- Cursor: "Use typed languages, configure linters, and write tests. Give the agent clear signals for whether changes are correct." — cursor.com/blog/agent-best-practices

 ---

@@ -1954,6 +1955,29 @@ Where each category is 0.0-1.0 based on attribute completion.

 ---

+## LIMITATIONS & EVIDENCE QUALITY
+
+This document synthesizes findings from sources of varying rigor and generalizability:
+
+**Source type distribution:**
+- **Peer-reviewed / arXiv preprints** (ETH Zurich, Zhang et al., AGENTS.md study, others): Controlled experiments with reproducible methodology. Strongest evidence, but often narrow in scope (specific models, specific benchmarks, specific time periods).
+- **Vendor best practices** (Anthropic, Red Hat, Cursor, OpenAI, GitHub): First-party guidance from tool developers. High relevance but potentially self-serving; recommendations may reflect product design choices rather than universal truths.
+- **Engineering blog posts** (Dropbox, Stack Overflow, Factory.ai): Experience reports from specific teams. Valuable for real-world signal but not controlled experiments; survivorship bias likely.
+- **Standards & specifications** (Conventional Commits, OpenAPI, PEP 8): Consensus documents, not empirical evidence. Included for measurability, not for causal claims.
+
+**Known gaps:**
+- Most benchmarks test single-turn or short-session agent behavior. Long-running multi-session agent effectiveness is poorly studied.
+- Evidence is heavily weighted toward Python and JavaScript ecosystems. Generalization to other languages is assumed, not demonstrated. +- The field moves fast. Findings from early 2025 may not apply to models released in 2026. Weight assignments should be revisited as new evidence emerges. +- No controlled studies directly measure the impact of individual codebase attributes on agent task success in isolation. Most evidence is correlational or qualitative. + +**What this document is not:** +- Not a systematic review or meta-analysis +- Not peer-reviewed +- Not a substitute for testing practices against your specific codebase and agent + +--- + ## REFERENCES & CITATIONS ### Key Research (2025-2026) @@ -1961,8 +1985,7 @@ Where each category is 0.0-1.0 based on attribute completion. - [Anthropic: Claude Code Best Practices](https://code.claude.com/docs/en/best-practices) - [Anthropic: 2026 Agentic Coding Trends Report](https://resources.anthropic.com/2026-agentic-coding-trends-report) - Red Hat: Repository Scaffolding for AI Coding Agents (April 2026) — internal guide, evidence-based -- LangChain: Terminal-Bench — scaffolding impact measurement (+26% accuracy from infrastructure alone) -- [DORA 2025 (Google Cloud): TDD and AI Quality](https://cloud.google.com/discover/how-test-driven-development-amplifies-ai-success) +- [Zhang et al.: "What Makes Repository-Level Instructions Effective for AI Agents?"](https://arxiv.org/html/2604.11088) (Apr 2026) — positive directives hurt, negative constraints help - [AGENTS.md Empirical Study: Agent READMEs](https://arxiv.org/html/2511.12884v1) — analysis of 2,303 context files - [Configuring Agentic AI Coding Tools](https://arxiv.org/html/2602.14690v3) — systematic study across 2,923 repos @@ -1986,7 +2009,6 @@ Where each category is 0.0-1.0 based on attribute completion. ### Engineering Blogs - [Dropbox Tech Blog: "Our journey to type checking 4 million lines of Python"](https://dropbox.tech/application/our-journey-to-type-checking-4-million-lines-of-python) -- [Salesforce Engineering: "How Cursor AI Cut Legacy Code Coverage Time by 85%"](https://engineering.salesforce.com/) - [GitClear: "Coding on Copilot" whitepaper](https://www.gitclear.com/coding_on_copilot_data_shows_ais_downward_pressure_on_code_quality) ### Standards & Specifications @@ -2013,6 +2035,17 @@ Where each category is 0.0-1.0 based on attribute completion. ## VERSION HISTORY +- **v2.0.1 (2026-05-13):** Citation accuracy and evidence quality improvements + - Fixed 3 citation-claim mismatches in Section 5.1 (test_execution): + - Removed Terminal-Bench: finding was about agent scaffolding, not test execution + - Removed DORA 2025 marketing page: URL was Google Cloud blog, not actual DORA report + - Removed Salesforce Cursor article: about AI writing tests for humans, not tests helping agents + - Added Zhang et al. (arXiv:2604.11088) finding on positive vs negative directives + - Renamed title from "Comprehensive Research" to "Curated Best Practices" + - Added LIMITATIONS & EVIDENCE QUALITY section + - Updated source count (55 -> 53) + - No tier or weight changes + - **v2.0.0 (2026-05-03):** Major research update based on 2025-2026 evidence - Rebalanced priorities: verification/testing elevated to #1 (was #6) - Added 4 new attributes: single-file verification, CI quality gates, pattern references, design intent, progressive disclosure @@ -2036,4 +2069,4 @@ Where each category is 0.0-1.0 based on attribute completion. 
 **Document prepared for:** agentready tool development
 **Primary use case:** Scanning repositories for AI agent optimization
 **Target agents:** Claude Code, GitHub Copilot, Cursor, Codex, Gemini CLI, and cross-tool
-**Methodology:** Evidence-based, cited research from authoritative sources
+**Methodology:** Evidence-based, curated from vendor best practices, arXiv preprints, and engineering experience reports
diff --git a/src/agentready/data/default-weights.yaml b/src/agentready/data/default-weights.yaml
index b0506293..b3f713f3 100644
--- a/src/agentready/data/default-weights.yaml
+++ b/src/agentready/data/default-weights.yaml
@@ -2,7 +2,7 @@
 #
 # This file defines the default weights for all 28 attributes.
 # Weights are based on evidence from ETH Zurich (Feb 2026), Anthropic,
-# Red Hat best practices (April 2026), and LangChain Terminal-Bench.
+# Red Hat best practices (April 2026), and Cursor agent guidelines.
 #
 # Key finding: Verification (tests, lint, CI) has highest impact on
 # agent effectiveness. Documentation helps only marginally (+4%).

From ed3d113ed1d690c50c80a0365e424d21624d7409 Mon Sep 17 00:00:00 2001
From: Nati Fridman
Date: Wed, 13 May 2026 11:43:43 +0300
Subject: [PATCH 2/3] fix: update research template title to match report rename

The generate_template() method in research_formatter.py had the old
"Comprehensive Research" title hardcoded. Updated to match the v2.0.1
rename to "Curated Best Practices".

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 src/agentready/services/research_formatter.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/agentready/services/research_formatter.py b/src/agentready/services/research_formatter.py
index 5e57e8e3..640f16b5 100644
--- a/src/agentready/services/research_formatter.py
+++ b/src/agentready/services/research_formatter.py
@@ -21,7 +21,7 @@ def generate_template(self) -> str:
 date: {today}
 ---

-# Agent-Ready Codebase Attributes: Comprehensive Research
+# Agent-Ready Codebase Attributes: Curated Best Practices
 *Optimizing Codebases for Claude Code and AI-Assisted Development*

 **Version:** 1.0.0

From e67495395f0e47c0091236d484f331c0dc13db66 Mon Sep 17 00:00:00 2001
From: Nati Fridman
Date: Thu, 14 May 2026 10:33:46 +0300
Subject: [PATCH 3/3] fix: correct Zhang et al. citation title and broken URL (v2.0.1)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses PR #429 review: the actual title of arXiv:2604.11088 is
"Do Agent Rules Shape or Distort?", not "What Makes Repository-Level
Instructions Effective?", and the /html/ URL returns 404 — switched to /abs/.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 src/agentready/data/RESEARCH_REPORT.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/agentready/data/RESEARCH_REPORT.md b/src/agentready/data/RESEARCH_REPORT.md
index 326f19f2..79bed671 100644
--- a/src/agentready/data/RESEARCH_REPORT.md
+++ b/src/agentready/data/RESEARCH_REPORT.md
@@ -87,7 +87,7 @@ This document catalogs 27 assessed attributes (across 41 sub-topics) that make c

 **Citations:**
 - ETH Zurich: "Are Repository-Level Context Files Helpful for Coding Agents?" (Feb 2026) — arxiv.org/html/2602.11988v1
-- Zhang et al.: "What Makes Repository-Level Instructions Effective for AI Agents?" (Apr 2026) — arxiv.org/html/2604.11088
+- Zhang et al.: "Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents" (Apr 2026) — arxiv.org/abs/2604.11088
 - Anthropic: Claude Code Best Practices — code.claude.com/docs/en/best-practices
 - GitHub Blog: "How to write a great agents.md — lessons from over 2,500 repositories" (2026)
 - Addy Osmani: "Stop Using /init for AGENTS.md" (Feb 2026)
@@ -1985,7 +1985,7 @@ This document synthesizes findings from sources of varying rigor and generalizab
 - [Anthropic: Claude Code Best Practices](https://code.claude.com/docs/en/best-practices)
 - [Anthropic: 2026 Agentic Coding Trends Report](https://resources.anthropic.com/2026-agentic-coding-trends-report)
 - Red Hat: Repository Scaffolding for AI Coding Agents (April 2026) — internal guide, evidence-based
-- [Zhang et al.: "What Makes Repository-Level Instructions Effective for AI Agents?"](https://arxiv.org/html/2604.11088) (Apr 2026) — positive directives hurt, negative constraints help
+- [Zhang et al.: "Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents"](https://arxiv.org/abs/2604.11088) (Apr 2026) — positive directives hurt, negative constraints help
 - [AGENTS.md Empirical Study: Agent READMEs](https://arxiv.org/html/2511.12884v1) — analysis of 2,303 context files
 - [Configuring Agentic AI Coding Tools](https://arxiv.org/html/2602.14690v3) — systematic study across 2,923 repos