How File Organization Affects Claude Code Accuracy
A systematic study of Claude Code's @ reference system and context file organization, spanning 2,172 tests across two phases.
Put your files in one directory with clear, descriptive names. That's it.
| Structure | Accuracy (622K words) |
|---|---|
| Flat | 97.35% |
| Shallow (1 level) | 94.42% |
| Deep (3 levels) | 95.00% |
| Very Deep (5 levels) | 96.04% |
Flat structure wins. At 622K words the shallow structure dips lowest, but averaged across all corpus sizes deep nesting (3 levels) hurts most, with very-deep (5 levels) partially recovering (see the depth table below).
Files in a single directory achieve 100% accuracy up to 302K words, and 97.35% at 622K words.
Claude assesses relevance from filenames. If each file covers one topic with a descriptive name, Claude can select the right files without reading them all.
Example: employees-leadership-bios.md beats docs/org/people/leaders.md
| Depth | Accuracy (all scales) |
|---|---|
| 0 (flat) | 98.4% |
| 1 (shallow) | 98.1% |
| 3 (deep) | 92.5% |
| 5 (very-deep) | 95.9% |
Deep nesting (3 levels) is the worst performer. Very-deep (5 levels) partially recovers — the degradation is not linear.
| Enhancement | At 302K | At 622K |
|---|---|---|
| None (flat) | 100% | 97.35% |
| With indexes | 100% | 92.74% |
Adding keyword indexes or summaries hurts accuracy at 622K words (-4.6%). The overhead becomes noise.
First meaningful accuracy drop appears around 600K words, and it's only 2.65%. Structure matters more than size.
| Project Size | Strategy |
|---|---|
| <100K words | Anything works |
| 100-300K words | Flat, keyword index optional |
| 300-600K words | Flat, skip indexes |
| 600K+ words | Flat, consider splitting |
What worked:
- Flat file organization
- Descriptive, entity-rich filenames
- Topically cohesive files (one topic = one file)
- Minimal CLAUDE.md

What didn't:
- Deep folder nesting (3+ levels)
- Long summaries (5-sentence performed worse than 2-sentence)
- Combined enhancements at scale
- Monolith files approaching context limits
Phase 1:
- 5 structures × 6 enhancements × 3 corpus sizes
- Corpus: 120K → 302K → 622K words (Soong-Daystrom Industries, synthetic)
- Model: Claude 3.5 Haiku
- Ground-truth evaluation against 23 known-answer questions

Phase 2:
- 27 strategies × 2 datasets × 49 questions
- Corpora: Soong-v5 (120 files, synthetic) + Obsidian (213 files, real-world)
- Model: Claude Haiku 4.5
- Strategies tested: Index types (keyword, semantic, heuristic template), @-ref approaches, combinations, and I4 variant generation methods (heuristic, grep, LLM-generated)
| Structure | Nesting Depth | Description |
|---|---|---|
| Flat | 0 | All files in root directory |
| Shallow | 1 | One level of folders |
| Deep | 3 | Multiple nesting levels |
| Very-Deep | 5+ | Maximum practical nesting |
| Monolith | 0 | Single combined file |
| Enhancement | Description | Recovery Rate* |
|---|---|---|
| Keywords only | 10 keywords per file | 80% |
| 2-sentence summary | Brief summary | 60% |
| 5-sentence summary | Detailed summary | 40% |
| Summary + keywords | Combined | 80% |
*Recovery Rate = % of 5 specific questions that failed without enhancements but succeeded with them. This is a targeted metric (denominator is 5 failed questions, not all questions); the overall accuracy improvement from keywords is 3.96%.
Finding: Keywords alone match any combined approach. Summaries add no value when keywords present.
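To make the two metrics concrete, here is a minimal sketch of the distinction, not the harness's actual evaluator code; the question IDs are invented for illustration:

```python
def recovery_rate(baseline_failures: set[str], enhanced_passes: set[str]) -> float:
    """Share of previously failed questions that pass once an enhancement is added."""
    if not baseline_failures:
        return 0.0
    return len(baseline_failures & enhanced_passes) / len(baseline_failures)

# Hypothetical illustration: 5 questions failed with no enhancement,
# 4 of them pass after keywords are added -> 80% recovery rate,
# even though average accuracy over all 23 questions moves only a few points.
baseline_failures = {"Q03", "Q07", "Q11", "Q15", "Q21"}
enhanced_passes = {"Q03", "Q07", "Q11", "Q15"}  # Q21 still fails
print(recovery_rate(baseline_failures, enhanced_passes))  # 0.8
```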
context-structure-research/
├── README.md # This file
├── LICENSE # MIT License
├── report/
│ ├── README.md # Full research report
│ └── executive-summary.md # One-page summary
├── docs/
│ ├── methodology.md # Test methodology
│ ├── phase-2-plan.md # Phase 2 comprehensive plan
│ └── prior-art-research.md # Industry research
├── harness/
│ ├── questions.json # 23 test questions
│ ├── run-test.sh # Single test runner
│ ├── evaluator.py # Response evaluator
│ └── *.sh # Various test scripts
├── results/
│ └── analysis/ # Analysis reports
└── soong-daystrom/ # Test corpus
├── _source-v4/, v5/, v6/ # Source content by version
├── flat-*, deep-*, etc. # Structure variants
└── *-v5.*/ # Enhancement variants
- Claude Code CLI
- Bash, Python 3.x, jq
# Clone the repository
git clone https://github.com/davidmoneil/context-structure-research
cd context-structure-research
# Run a single test
./harness/run-test.sh --structure flat-v5 --question NAV-001
# Run full matrix
./harness/run-haiku-matrix-v5-no-monolith.sh
# Analyze results
python3 harness/evaluator.py results/v5/raw/haiku --output results/v5/analysis

Soong-Daystrom Industries (synthetic) — a fictional robotics company knowledge base:
- Employee directories and leadership bios
- Project documentation (ATLAS, ARIA, Prometheus, Hermes)
- Financial reports and governance documents
- Technical specifications and incident reports
| Version | Words | Files | Used In |
|---|---|---|---|
| V4 | 120,000 | 80 | Phase 1 |
| V5 | 302,000 | 121 | Phase 1, Phase 2 |
| V6 | 622,561 | 277 | Phase 1 |
Obsidian vault (real-world, Phase 2 only) — a personal knowledge base with 213 files across AI project notes, infrastructure docs, research, and blog drafts. Included in the repository for reproducibility, but note:
- This is a snapshot of a real personal vault, not a synthetic benchmark
- Results are specific to this vault's structure and content
- The Soong-v5 dataset provides the controlled, reproducible comparison; the Obsidian dataset validates findings against messy real-world data
Claude Code users face a practical question: How should I organize my context files?
Current guidance is limited to general best practices. No one has systematically tested what actually works. We filled that gap with controlled experiments.
This emerged as a key architectural insight:
| Approach | Result |
|---|---|
| One file with multiple topics | Claude must read entire file to assess relevance |
| One topic spread across files | Claude may miss connections |
| One topic per file | Claude can assess relevance from filename alone |
Why it works: Claude Code's discovery process lists directory contents first. If each filename clearly indicates its topic, Claude can select relevant files without reading them all.
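A toy sketch of that idea follows; it illustrates why descriptive names help, but it is not Claude Code's actual selection logic, and the directory path and query terms are hypothetical:

```python
from pathlib import Path

def candidate_files(root: str, query_terms: set[str]) -> list[Path]:
    """Pick files whose names share a token with the query, without opening any of them."""
    wanted = {t.lower() for t in query_terms}
    matches = []
    for path in sorted(Path(root).glob("*.md")):  # flat layout: one directory to scan
        tokens = set(path.stem.lower().replace("_", "-").split("-"))
        if tokens & wanted:
            matches.append(path)
    return matches

# With names like employees-leadership-bios.md, a question about leadership
# can be routed straight to the right file from the directory listing alone.
print(candidate_files("soong-daystrom/flat-v5", {"leadership", "bios"}))
```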
- Haiku-class models only — Phase 1 used Claude 3.5 Haiku and Phase 2 used Claude Haiku 4.5. Sonnet and Opus may differ.
- Synthetic corpus — Real-world codebases may behave differently. Phase 2 added a real-world Obsidian vault, but code repositories remain untested.
- Prose content — Code-heavy repos may need different strategies.
- Fixed question set — 23 questions may not cover all query patterns.
- Single domain — Corporate knowledge base; other domains untested.
- Partial credit scoring — Non-exact matches scored by keyword coverage using an arbitrary formula (0.1 + 0.6 × coverage). Keywords are matched without context, so keywords appearing in refusal statements ("I couldn't find information about X") earn credit. This affects absolute accuracy numbers but not relative rankings, since all strategies use the same scoring. A sketch of this formula appears after this list.
- @-ref annotation untested — R2.1–R2.4 strategies (which tested @-ref with various annotation levels) all exceeded the context window, producing 0% accuracy. This means the research question "Do @-ref annotations (descriptions, nesting) improve accuracy?" remains unanswered. A future test with fewer files or a smaller corpus would be needed to isolate the annotation variable.
- Obsidian dataset specificity — The Obsidian dataset is a personal vault snapshot. While included for reproducibility, results on it reflect this specific vault's structure and content. The Soong-v5 synthetic dataset provides the controlled benchmark.
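A minimal sketch of the partial-credit formula described in the scoring limitation above (the real logic lives in harness/evaluator.py; the answer text and keyword list here are invented):

```python
def partial_credit(answer: str, expected_keywords: list[str]) -> float:
    """Score a non-exact answer as 0.1 + 0.6 * keyword coverage."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return 0.1 + 0.6 * (hits / len(expected_keywords))

# Keywords are matched without context, so even a refusal earns partial credit:
print(partial_credit("I couldn't find information about project ATLAS.",
                     ["ATLAS", "propulsion"]))  # 0.4 (1 of 2 keywords matched)
```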
The next phase will test these findings against real codebases:
- Different content type: Code vs prose
- Function-level indexing: Do code indexes help?
- AST-based strategies: Auto-generated function/class maps
- Real-world validation: Test on open-source repos
See docs/phase-2-plan.md for the full plan.
MIT License. See LICENSE.
If you use this research, please cite:
Context Structure Research: How File Organization Affects Claude Code Accuracy
https://github.com/davidmoneil/context-structure-research
January 2026
- Full Report — Complete methodology, results, and analysis
- Executive Summary — One-page overview
- Methodology — Detailed test methodology
Research conducted January-February 2026. 2,172 total tests (849 Phase 1 + 1,323 Phase 2), $90.18 total API cost.