## Summary
The paper Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen, Mündler, Müller, Raychev, Vechev — ICML 2026) is the first rigorous empirical evaluation of context files (CLAUDE.md / AGENTS.md) on real-world coding tasks. The findings have direct implications for our claude-md-generator workflow and should be incorporated to ensure the files we help users create are evidence-backed, not just opinion-driven.
## Key findings from the paper
| Finding | Detail |
|---|---|
| LLM-generated context files hurt performance | Across 4 agents and 2 benchmarks, auto-generated context files reduced task success rates by 0.5–2% on average |
| Developer-written files help only marginally | +4% average improvement, and only when manually authored |
| All context files increase cost | 20–23% higher inference cost due to more steps, more reasoning tokens, broader exploration |
| Codebase overviews are ineffective | Despite being the most common recommendation, directory/structure overviews did not help agents find relevant files any faster |
| Context files are redundant with existing docs | When existing docs (README, docs/) were removed, context files did help — meaning they mostly duplicate what's already discoverable |
| Instructions are followed but make tasks harder | Agents obey context file instructions, but the extra constraints increase reasoning tokens by 14–22% |
| Only specific tooling info consistently helps | e.g., "use uv for deps", "run pytest" — concrete, repository-specific tooling that the agent wouldn't guess on its own |
| Stronger models don't generate better context files | Using GPT-5.2 to generate context files didn't consistently outperform using the default model |
## What the current workflow already gets right
The claude-md-generator already reflects several of these principles:
- "Onboard, don't configure" — aligns with the "minimal requirements" finding
- "Less is more" / under 300 lines, ideally under 60 — aligns with findings that bloat hurts
- "Don't auto-generate it" / skip `/init` — directly supported by the data
- "Don't use it as a linter" — unnecessary requirements make tasks harder
- "Only universally applicable instructions" — aligns with minimal requirements
- "Prefer pointers to copies" — reduces redundancy
## Proposed changes
1. Add research backing to BEST_PRACTICES_CLAUDE.md
Add a "Research" or "Evidence" section citing the paper and summarizing the key numbers. This gives the advice authority beyond "best practice" — it's now empirically validated.
2. De-emphasize or reframe the "Structure" section in project template
The paper found that codebase overviews (listing directories and their purposes) do not help agents find relevant files faster. The current interview asks "Key directories and their purposes? (3-5 max)" and the project template includes a ## Structure section.
Options:
- a) Remove the Structure section entirely and rely on progressive disclosure (BOOKMARKS.md)
- b) Keep it but reframe: make it optional, shorter (2-3 dirs max), and explicitly warn it's for human readers, not agent navigation
- c) Replace with a "Key entry points" section that points to 2-3 files an agent should start from (more useful than a directory tree)
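If option (c) is chosen, the replacement section could be as small as the sketch below (file paths are hypothetical, purely to illustrate the shape):

```markdown
## Key entry points

- `src/cli.py` — main CLI entry
- `src/core/engine.py` — where most feature work starts
```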
3. Strengthen emphasis on concrete tooling commands
The paper shows tooling-specific info (e.g., "use uv", "use pytest", repo-specific CLI tools) is the most consistently useful content. The Commands section already does this, but we should:
- Make it the primary focus of the interview
- Add a question about repo-specific tooling (custom CLIs, Makefiles, task runners)
- Emphasize that this is the highest-signal content in the file
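A hypothetical example of what a tooling-first Commands section might look like (the specific commands are placeholders for a repo using uv and pytest, not output of the current template):

```markdown
## Commands

- Install deps: `uv sync`
- Run tests: `uv run pytest -q`
- Lint: `uv run ruff check .`
```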
4. Add redundancy awareness
If a project already has a good README.md and docs/, the CLAUDE.md should be even shorter — potentially just commands and tooling. Add a question to the interview: "Does this repo already have a README/docs?" and adjust output length accordingly.
5. Add cost awareness messaging
A context file carries a measurable cost: roughly 20% more inference spend per task. The workflow should communicate this to users — "an unnecessary CLAUDE.md adds ~20% to the cost of every task" is a stronger motivator than "keep it short."
6. Update the "Don't auto-generate" advice
Currently: "Skip /init and follow this guide."
Improved: "Skip /init. Research shows auto-generated context files reduce task success by up to 2% while increasing cost by 20%+. Human-authored minimal files outperform LLM-generated ones."
7. Consider adding a "lint" or audit checklist
Post-generation, offer a quick audit:
- Is every line relevant to every task? (not just some tasks)
- Does this duplicate information already in README.md or docs/?
- Are commands concrete and copy-pasteable?
- Is the Structure section actually needed? (agents explore on their own)
- Is the total under 60 lines?
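The checklist above could be partially automated. A minimal sketch, assuming the generated file and the repo README are available as strings; the function name, the duplicate-detection heuristic, and its 20-character cutoff are all illustrative, and only the 60-line threshold comes from this issue's guidance:

```python
# Hypothetical post-generation audit implementing the checklist above.
# Heuristics here are illustrative, not part of the actual workflow.
def audit_claude_md(claude_md: str, readme: str = "") -> list[str]:
    """Return warnings for a generated CLAUDE.md, given the repo README."""
    warnings = []
    lines = [l for l in claude_md.splitlines() if l.strip()]
    if len(lines) > 60:
        warnings.append(f"{len(lines)} non-empty lines; target is under 60")
    if any(l.strip().lower().startswith("## structure") for l in lines):
        warnings.append("Structure section present; agents explore on their own")
    # Crude redundancy check: flag lines that already appear in the README
    readme_lines = {l.strip().lower() for l in readme.splitlines()
                    if len(l.strip()) > 20}
    dupes = [l for l in lines if l.strip().lower() in readme_lines]
    if dupes:
        warnings.append(f"{len(dupes)} line(s) duplicate the README")
    return warnings
```

Copy-pasteability and per-line relevance still need human judgment, so this would complement, not replace, the interview flow.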
## Files likely affected
- `workflows/claude-md-generator/.ambient/ambient.json` (systemPrompt, description)
- `workflows/claude-md-generator/BEST_PRACTICES_CLAUDE.md`
- `workflows/claude-md-generator/.claude/templates/project-template.md`
- `workflows/claude-md-generator/README.md`
## Acceptance criteria
- BEST_PRACTICES_CLAUDE.md cites the paper and incorporates findings
- systemPrompt interview flow reflects research (tooling-first, structure-optional)
- Project template updated to de-emphasize overview, emphasize tooling
- Redundancy awareness added to interview
- Cost awareness messaging added
- README updated to reflect changes
- No regressions to the personal CLAUDE.md flow (paper focused on project/repo context files)
## References
- Paper: https://arxiv.org/abs/2602.11988
- Benchmark code: https://github.com/eth-sri/agentbench
- Related: Chatlatanagulchai et al. (2025) — descriptive study of context file content
- Related: Nigh (2025) — GitHub's analysis of 2,500+ repos