## Summary
The paper Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen, Mündler, Müller, Raychev, Vechev — ICML 2026) is the first rigorous empirical evaluation of context files (CLAUDE.md / AGENTS.md) on real-world coding tasks. The findings have direct implications for our claude-md-generator workflow and should be incorporated to ensure the files we help users create are evidence-backed, not just opinion-driven.
## Key findings from the paper
| Finding | Detail |
|---|---|
| LLM-generated context files hurt performance | Across 4 agents and 2 benchmarks, auto-generated context files reduced task success rates by 0.5–2% on average |
| Developer-written files help only marginally | +4% average improvement, and only when manually authored |
| All context files increase cost | 20–23% higher inference cost due to more steps, more reasoning tokens, broader exploration |
| Codebase overviews are ineffective | Despite being the most common recommendation, directory/structure overviews did not help agents find relevant files any faster |
| Context files are redundant with existing docs | When existing docs (README, docs/) were removed, context files did help — meaning they mostly duplicate what's already discoverable |
| Instructions are followed but make tasks harder | Agents obey context file instructions, but the extra constraints increase reasoning tokens by 14–22% |
| Only specific tooling info consistently helps | e.g., "use uv for deps", "run pytest" — concrete, repository-specific tooling that the agent wouldn't guess on its own |
| Stronger models don't generate better context files | Using GPT-5.2 to generate context files didn't consistently outperform using the default model |
## What the current workflow already gets right
The claude-md-generator already reflects several of these principles:
- "Onboard, don't configure" — aligns with the "minimal requirements" finding
- "Less is more" / under 300 lines, ideally under 60 — aligns with findings that bloat hurts
- "Don't auto-generate it" / skip `/init` — directly supported by the data
- "Don't use it as a linter" — unnecessary requirements make tasks harder
- "Only universally applicable instructions" — aligns with minimal requirements
- "Prefer pointers to copies" — reduces redundancy
## Proposed changes
1. Add research backing to BEST_PRACTICES_CLAUDE.md
Add a "Research" or "Evidence" section citing the paper and summarizing the key numbers. This gives the advice authority beyond "best practice" — it's now empirically validated.
2. De-emphasize or reframe the "Structure" section in project template
The paper found that codebase overviews (listing directories and their purposes) do not help agents find relevant files faster. The current interview asks "Key directories and their purposes? (3-5 max)" and the project template includes a ## Structure section.
Options:
- a) Remove the Structure section entirely and rely on progressive disclosure (BOOKMARKS.md)
- b) Keep it but reframe: make it optional, shorter (2-3 dirs max), and explicitly warn it's for human readers, not agent navigation
- c) Replace with a "Key entry points" section that points to 2-3 files an agent should start from (more useful than a directory tree)
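If option (c) is chosen, the replacement section could be as small as the sketch below (file paths are hypothetical, purely to illustrate the shape):

```markdown
## Key entry points

- `src/cli.py` — main CLI entry
- `src/core/engine.py` — where most feature work starts
```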
3. Strengthen emphasis on concrete tooling commands
The paper shows tooling-specific info (e.g., "use uv", "use pytest", repo-specific CLI tools) is the most consistently useful content. The Commands section already does this, but we should:
- Make it the primary focus of the interview
- Add a question about repo-specific tooling (custom CLIs, Makefiles, task runners)
- Emphasize that this is the highest-signal content in the file
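A hypothetical example of what a tooling-first Commands section might look like (the specific commands are placeholders for a repo using uv and pytest, not output of the current template):

```markdown
## Commands

- Install deps: `uv sync`
- Run tests: `uv run pytest -q`
- Lint: `uv run ruff check .`
```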
4. Add redundancy awareness
If a project already has a good README.md and docs/, the CLAUDE.md should be even shorter — potentially just commands and tooling. Add a question to the interview: "Does this repo already have a README/docs?" and adjust output length accordingly.
5. Add cost awareness messaging
A context file carries a measurable cost: roughly 20% more inference spend per task. The workflow should communicate this to users — "an unnecessary CLAUDE.md adds ~20% to the cost of every task" is a stronger motivator than "keep it short."
6. Update the "Don't auto-generate" advice
Currently: "Skip /init and follow this guide."
Improved: "Skip /init. Research shows auto-generated context files reduce task success by up to 2% while increasing cost by 20%+. Human-authored minimal files outperform LLM-generated ones."
7. Consider adding a "lint" or audit checklist
Post-generation, offer a quick audit:
- Is every line relevant to every task? (not just some tasks)
- Does this duplicate information already in README.md or docs/?
- Are commands concrete and copy-pasteable?
- Is the Structure section actually needed? (agents explore on their own)
- Is the total under 60 lines?
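The checklist above could be partially automated. A minimal sketch, assuming the generated file and the repo README are available as strings; the function name, the duplicate-detection heuristic, and its 20-character cutoff are all illustrative, and only the 60-line threshold comes from this issue's guidance:

```python
# Hypothetical post-generation audit implementing the checklist above.
# Heuristics here are illustrative, not part of the actual workflow.
def audit_claude_md(claude_md: str, readme: str = "") -> list[str]:
    """Return warnings for a generated CLAUDE.md, given the repo README."""
    warnings = []
    lines = [l for l in claude_md.splitlines() if l.strip()]
    if len(lines) > 60:
        warnings.append(f"{len(lines)} non-empty lines; target is under 60")
    if any(l.strip().lower().startswith("## structure") for l in lines):
        warnings.append("Structure section present; agents explore on their own")
    # Crude redundancy check: flag lines that already appear in the README
    readme_lines = {l.strip().lower() for l in readme.splitlines()
                    if len(l.strip()) > 20}
    dupes = [l for l in lines if l.strip().lower() in readme_lines]
    if dupes:
        warnings.append(f"{len(dupes)} line(s) duplicate the README")
    return warnings
```

Copy-pasteability and per-line relevance still need human judgment, so this would complement, not replace, the interview flow.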
## Files likely affected
- `workflows/claude-md-generator/.ambient/ambient.json` (systemPrompt, description)
- `workflows/claude-md-generator/BEST_PRACTICES_CLAUDE.md`
- `workflows/claude-md-generator/.claude/templates/project-template.md`
- `workflows/claude-md-generator/README.md`
## Acceptance criteria
- BEST_PRACTICES_CLAUDE.md cites the paper and incorporates findings
- systemPrompt interview flow reflects research (tooling-first, structure-optional)
- Project template updated to de-emphasize overview, emphasize tooling
- Redundancy awareness added to interview
- Cost awareness messaging added
- README updated to reflect changes
- No regressions to the personal CLAUDE.md flow (paper focused on project/repo context files)
## References
- Paper: https://arxiv.org/abs/2602.11988
- Benchmark code: https://github.com/eth-sri/agentbench
- Related: Chatlatanagulchai et al. (2025) — descriptive study of context file content
- Related: Nigh (2025) — GitHub's analysis of 2,500+ repos