Walk nested aggregate llms.txt files recursively by SahilAujla · Pull Request #55 · agent-ecosystem/afdocs

SahilAujla · 2026-04-23T20:19:04Z

PR stack — depends on #54 (canonical llms.txt selection). To review just this PR’s incremental change, look at the second commit (fix: walk nested aggregate llms.txt files recursively) — it touches only src/helpers/get-page-urls.ts and the corresponding test file.
Recommend merging #54 first; this branch will then rebase to a clean single-commit diff.
Closes #57

Summary

The aggregate .txt walker that powers llms-txt-freshness (and any sampling that flows through llms.txt) only descended one level into nested indexes, with an explicit “skip further .txt nesting” filter on sub-links.

That works for two-level patterns (Cloudflare per-product files, Supabase aggregates) but undercounts pages on sites that use deeper progressive disclosure, where an index links to further indexes before reaching any pages.

The bug

alchemy.com/docs has a three-level structure:

/docs/llms.txt                           (6 section links)
  └─ /docs/chains/llms.txt               (~80 chain index links)
      └─ /docs/chains/ethereum/llms.txt  (eth_call, eth_chainId, … pages)
      └─ /docs/chains/solana/llms.txt
      └─ … 80 more chains

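Our walker fetched /docs/llms.txt and the aggregates it links to directly (level 1, e.g. /docs/chains/llms.txt), then stopped. The per-chain .txt files one level further down (level 2) were treated as terminal and dropped.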

So getUrlsFromCachedLlmsTxt returned only the 311 pages from the 5 sections that happened to have a flat layout (Get Started, Node, Data, Wallets, Rollups). All ~5,100 chain method pages were invisible to us.

The visible symptom was llms-txt-freshness reporting 6% coverage (311 / 5,452 sitemap pages) and failing the check, even though Alchemy’s llms.txt structure is actually exhaustive — we just weren’t walking it.

The fix

Replace the one-level walk with a bounded BFS:

- Recurse to MAX_AGGREGATE_DEPTH = 5 (deep enough for realistic trees, shallow enough to terminate on accidental loops)
- Cap total fetches at MAX_AGGREGATE_FILES = 200 to prevent unbounded HTTP traffic
- Track visited aggregates so:
  - the same .txt referenced from multiple parents is fetched once
  - cycles terminate safely
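
The three rules above can be sketched as a level-by-level BFS. This is a minimal illustration, not the code in src/helpers/get-page-urls.ts: only the two constants come from this PR, while `walkAggregates`, `fetchLinks`, `isAggregate`, the ".txt means aggregate" rule, and the choice to count the seed file against the cap are all assumptions made for the example.

```typescript
// Constants as stated in the PR description.
const MAX_AGGREGATE_DEPTH = 5;
const MAX_AGGREGATE_FILES = 200;

// Hypothetical fetcher: returns the links found in one aggregate file.
type FetchLinks = (url: string) => Promise<string[]>;

// Assumed rule: a URL whose path ends in .txt is an aggregate index.
const isAggregate = (url: string): boolean => /^[^?#]*\.txt(?:[?#]|$)/.test(url);

async function walkAggregates(seedUrl: string, fetchLinks: FetchLinks): Promise<string[]> {
  const pages = new Set<string>();
  const visited = new Set<string>([seedUrl]); // dedup + cycle guard
  let frontier: string[] = [seedUrl];
  let fetched = 0;

  // One loop iteration per level; the seed is treated as depth 0 (assumption).
  for (let depth = 0; depth <= MAX_AGGREGATE_DEPTH && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const aggregate of frontier) {
      // Stop issuing requests once the file cap is reached.
      if (fetched >= MAX_AGGREGATE_FILES) return Array.from(pages);
      fetched++;
      for (const link of await fetchLinks(aggregate)) {
        if (!isAggregate(link)) {
          pages.add(link);
        } else if (!visited.has(link)) {
          visited.add(link); // mark when queued so each file is fetched once
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return Array.from(pages);
}
```

Marking an aggregate as visited when it is queued, rather than when it is fetched, is what keeps a file referenced by several parents in the same level from being requested twice.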

The classification logic (page URL vs aggregate) is factored into a small classify() closure so the same rules apply to both seed URLs and discovered sub-links.
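
To make the closure idea concrete, here is one possible shape for it. The real classify() in this PR is not reproduced here; `createClassifier`, `QueueItem`, and the ".txt path means aggregate" rule are hypothetical stand-ins, chosen only to show how a single closure can serve seeds and sub-links alike.

```typescript
// Hypothetical queue entry: an aggregate file waiting to be fetched.
interface QueueItem {
  url: string;
  depth: number;
}

function createClassifier() {
  const pageUrls = new Set<string>();
  const visited = new Set<string>();
  const queue: QueueItem[] = [];

  // One set of rules, applied to seed URLs and discovered sub-links alike.
  const classify = (url: string, depth: number): void => {
    // Assumed rule: a path ending in .txt is an aggregate, everything else is a page.
    const looksLikeAggregate = /^[^?#]*\.txt(?:[?#]|$)/.test(url);
    if (!looksLikeAggregate) {
      pageUrls.add(url);
      return;
    }
    if (visited.has(url)) return; // already queued or fetched
    visited.add(url);
    queue.push({ url, depth });
  };

  return { classify, pageUrls, queue };
}
```

Because the closure owns the page set, the visited set, and the queue, callers cannot apply one rule to seeds and a different one to nested links by accident.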

Real-world evidence: alchemy.com/docs

Run on top of #54 (so canonical selection is correct):

| Metric | Before this PR | After this PR |
| --- | --- | --- |
| Pages discovered from `llms.txt` | 311 | 5,436 |
| `llms-txt-freshness` | FAIL (6% coverage) | PASS (100% of 5,452) |
| `markdown-content-parity` | FAIL/WARN (biased sampling) | PASS (avg 0% missing) |
| Observability category | 57 (F) | 100 (A+) |
| Overall score | 88 (B) | 92 (A) |

Combined with #54, the alchemy.com/docs scorecard moves from 68 (D) → 92 (A) without any change to their docs.
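
As a sanity check on the coverage figures, the arithmetic can be reproduced in a few lines. `coveragePercent` is a hypothetical helper, and two things here are assumptions rather than facts from the check's source: that coverage is discovered pages divided by sitemap pages, rounded to the nearest whole percent, and that the "after" figure corresponds to 5,436 / 5,452.

```typescript
// Hypothetical helper: share of sitemap pages reachable via llms.txt,
// rounded to a whole percent (rounding mode is an assumption).
function coveragePercent(discovered: number, sitemapTotal: number): number {
  if (sitemapTotal === 0) return 0;
  return Math.round((discovered / sitemapTotal) * 100);
}

const before = coveragePercent(311, 5452);  // 311 / 5452 ≈ 5.7%  -> 6
const after = coveragePercent(5436, 5452);  // 5436 / 5452 ≈ 99.7% -> 100
```

Under those assumptions the rounded values match the 6% and 100% reported in the table.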

Test plan

- 3 new tests in test/unit/helpers/get-page-urls.test.ts:
  - three-level nested walk (mirrors Alchemy’s structure)
  - same aggregate referenced from multiple parents is fetched once
  - cycle in the aggregate graph terminates
- All 849 existing tests pass (852 total with new ones)
- Lint clean
- Verified live against https://www.alchemy.com/docs

Files

src/helpers/get-page-urls.ts            | walker rewrite + depth/file caps
test/unit/helpers/get-page-urls.test.ts | nested-walk, dedup, cycle tests

When a site has both an apex `llms.txt` (often a marketing index) and a docs-section `llms.txt`, downstream checks were aggregating links from both. Sampling, size, validity, and freshness all leaked apex pages into docs runs (and vice versa), masking real signal (see issue agent-ecosystem#53).

This change introduces a single canonical pick:

- `llms-txt-exists` selects one file as canonical using a most-specific-prefix-of-the-baseUrl heuristic (`/docs/llms.txt` wins over `/llms.txt` when the user passed `/docs`, and the other way around when they passed the origin). Ties resolve to the candidate-discovery order.
- Downstream checks (`llms-txt-size`, `llms-txt-valid`, `llms-txt-links-resolve`, `llms-txt-links-markdown`, sampling via `getUrlsFromCachedLlmsTxt` / `fetchLlmsTxtUrls`, and `llms-txt-freshness`) now operate on that single file via a new `getLlmsTxtFilesForAnalysis` helper. `cache-header-hygiene` keeps probing every discovered file.
- A new `--llms-txt-url` CLI flag (and `llmsTxtUrl` config option) lets users point afdocs at an explicit canonical, bypassing the heuristic. When set, only that URL is probed and the cross-host fallback is skipped.
- The pass message now surfaces which file was chosen, and `details` exposes `canonicalUrl`, `canonicalLlmsTxt`, and `canonicalSource` for scripted consumers.

Made-with: Cursor
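
The most-specific-prefix heuristic described in this commit message can be illustrated roughly as follows. This is a sketch under stated assumptions, not the #54 implementation: `pickCanonical` and `dirOf` are hypothetical names, candidates are compared as URL paths on a single host, and falling back to the first discovered candidate when nothing matches is an assumption.

```typescript
// Directory a candidate llms.txt lives in, e.g. "/docs/llms.txt" -> "/docs/".
function dirOf(path: string): string {
  return path.slice(0, path.lastIndexOf('/') + 1);
}

// Pick the candidate whose directory is the longest prefix of the base path.
// candidatePaths is assumed to be in discovery order.
function pickCanonical(candidatePaths: string[], basePath: string): string | undefined {
  const base = basePath.endsWith('/') ? basePath : basePath + '/';
  let best: string | undefined;
  let bestLength = -1;
  for (const candidate of candidatePaths) {
    const dir = dirOf(candidate);
    // Strict ">" keeps the earlier candidate on ties (discovery order wins).
    if (base.startsWith(dir) && dir.length > bestLength) {
      best = candidate;
      bestLength = dir.length;
    }
  }
  // Assumed fallback: if no candidate directory contains the base path, use the first one.
  return best ?? candidatePaths[0];
}

// Usage: the docs-section file wins for "/docs", the apex file wins for the origin.
const forDocs = pickCanonical(['/llms.txt', '/docs/llms.txt'], '/docs');
const forOrigin = pickCanonical(['/docs/llms.txt', '/llms.txt'], '/');
```

Normalizing the base path with a trailing slash avoids treating `/docs-old` as if it sat under `/docs`.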

The previous walker only descended one level into aggregate `.txt` files referenced from llms.txt, with an explicit "skip further .txt nesting" filter on sub-links. That works for two-level patterns (Cloudflare, Supabase) but undercounts sites that use deeper progressive disclosure.

Concrete miss: alchemy.com/docs has a three-level structure (`/docs/llms.txt` → `/docs/chains/llms.txt` → `/docs/chains/{chain}/llms.txt`). The chain method pages, about 5,100 of them, never made it into the URL pool, so `llms-txt-freshness` reported 6% coverage instead of 100% and the freshness check failed for what was actually a well-organized docs site.

Replace the one-level walk with a bounded BFS:

- Recurse to `MAX_AGGREGATE_DEPTH = 5` (deep enough for realistic trees, shallow enough to terminate on accidental loops)
- Cap total fetches at `MAX_AGGREGATE_FILES = 200` so a malformed index can't trigger unbounded HTTP traffic
- Track visited aggregates so the same `.txt` referenced from multiple parents is fetched once and cycles terminate

On alchemy.com/docs this lifts the page-URL count from 311 → 5,436 and the overall scorecard from 88 (B) → 92 (A) without any change to their docs.

Made-with: Cursor

SahilAujla added 2 commits April 23, 2026 15:56

SahilAujla mentioned this pull request Apr 23, 2026

Pick a canonical llms.txt when multiple files are discovered #54 (Open, 4 tasks)

SahilAujla wants to merge 2 commits into agent-ecosystem:main from SahilAujla:fix/recursive-llms-txt-walker
