Walk nested aggregate llms.txt files recursively #55

Open
SahilAujla wants to merge 2 commits into agent-ecosystem:main from SahilAujla:fix/recursive-llms-txt-walker

SahilAujla commented Apr 23, 2026

PR stack — depends on #54 (canonical llms.txt selection). To review just this PR’s incremental change, look at the second commit (fix: walk nested aggregate llms.txt files recursively) — it touches only src/helpers/get-page-urls.ts and the corresponding test file.
Recommend merging #54 first; this branch will then rebase to a clean single-commit diff.
Closes #57

Summary

The aggregate .txt walker that powers llms-txt-freshness (and any sampling that flows through llms.txt) only descended one level into nested indexes, with an explicit “skip further .txt nesting” filter on sub-links.

That works for two-level patterns (Cloudflare per-product files, Supabase aggregates) but undercounts sites that use deeper progressive disclosure.

The bug

alchemy.com/docs has a three-level structure:

/docs/llms.txt                           (6 section links)
  └─ /docs/chains/llms.txt               (~80 chain index links)
      └─ /docs/chains/ethereum/llms.txt  (eth_call, eth_chainId, … pages)
      └─ /docs/chains/solana/llms.txt
      └─ … 80 more chains

Our walker stopped after Level 1, treating the chain .txt files at Level 2 as terminal and dropping them.

So getUrlsFromCachedLlmsTxt returned only the 311 pages from the 5 sections that happened to have a flat layout (Get Started, Node, Data, Wallets, Rollups). All ~5,100 chain method pages were invisible to us.

The visible symptom was llms-txt-freshness reporting 6% coverage (311 / 5,452 sitemap pages) and failing the check, even though Alchemy’s llms.txt structure is actually exhaustive — we just weren’t walking it.
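The coverage figures quoted here can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming coverage is simply discovered pages divided by sitemap pages, rounded to the nearest whole percent; `coveragePercent` is a hypothetical helper, and the real check may compute or round differently.

```typescript
// Hypothetical helper: share of sitemap pages that llms.txt surfaced, as a whole percent.
// Assumption: plain ratio with nearest-integer rounding; the actual check is not shown in this PR.
function coveragePercent(discovered: number, sitemapTotal: number): number {
  if (sitemapTotal <= 0) return 0; // avoid dividing by zero on an empty sitemap
  return Math.round((discovered / sitemapTotal) * 100);
}

const before = coveragePercent(311, 5452); // 5.70... rounds to 6
const after = coveragePercent(5436, 5452); // 99.70... rounds to 100
console.log(`before: ${before}%, after: ${after}%`);
```

Under that rounding assumption, the 311 pages reachable by the one-level walk give the reported 6%, while the 5,436 pages found by the deeper walk round up to 100%.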

The fix

Replace the one-level walk with a bounded BFS:

  • Recurse to MAX_AGGREGATE_DEPTH = 5 (deep enough for realistic trees, shallow enough to terminate on accidental loops)

  • Cap total fetches at MAX_AGGREGATE_FILES = 200 to prevent unbounded HTTP traffic

  • Track visited aggregates so:

    • the same .txt referenced from multiple parents is fetched once
    • cycles terminate safely

The classification logic (page URL vs aggregate) is factored into a small classify() closure so the same rules apply to both seed URLs and discovered sub-links.
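A rough sketch of the bounded BFS described above may help reviewers follow the shape of the change. It is not the code in src/helpers/get-page-urls.ts: only the two cap constants come from this PR, while `walkAggregates`, `LinkFetcher`, `route`, the `.txt`-suffix classification rule, the depth-counting convention, and the example.com fixture are all assumptions made for illustration.

```typescript
// Sketch only: the two caps come from the PR description; every other name is hypothetical.
const MAX_AGGREGATE_DEPTH = 5;
const MAX_AGGREGATE_FILES = 200;

// Returns the links listed in one aggregate .txt file (injected so the sketch runs offline).
type LinkFetcher = (url: string) => Promise<string[]>;

async function walkAggregates(
  seedLinks: string[],
  fetchLinks: LinkFetcher,
): Promise<{ pages: string[]; fetched: number }> {
  const pages = new Set<string>();
  const visited = new Set<string>();

  // Assumption: a link ending in .txt is an aggregate index; anything else is a page.
  const classify = (url: string): "aggregate" | "page" =>
    /\.txt$/i.test(url) ? "aggregate" : "page";

  // Send one link to the page pool or to the given queue, skipping aggregates already seen.
  const route = (url: string, queue: string[]): void => {
    if (classify(url) === "page") {
      pages.add(url);
    } else if (!visited.has(url)) {
      visited.add(url);
      queue.push(url);
    }
  };

  // Seeds and discovered sub-links go through the same classification.
  let frontier: string[] = [];
  for (const link of seedLinks) route(link, frontier);

  let fetched = 0;
  for (let depth = 0; depth < MAX_AGGREGATE_DEPTH && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      // Stop early once the fetch budget is spent, keeping what was found so far.
      if (fetched >= MAX_AGGREGATE_FILES) {
        return { pages: Array.from(pages), fetched };
      }
      fetched++;
      for (const link of await fetchLinks(url)) route(link, next);
    }
    frontier = next;
  }
  return { pages: Array.from(pages), fetched };
}

// Made-up fixture: a three-level tree with a cycle back to the chains index.
const site: Record<string, string[]> = {
  "https://example.com/docs/chains/llms.txt": [
    "https://example.com/docs/chains/ethereum/llms.txt",
    "https://example.com/docs/chains/solana/llms.txt",
    "https://example.com/docs/chains/llms.txt",
  ],
  "https://example.com/docs/chains/ethereum/llms.txt": [
    "https://example.com/docs/chains/ethereum/eth_call",
    "https://example.com/docs/chains/ethereum/eth_chainId",
  ],
  "https://example.com/docs/chains/solana/llms.txt": [
    "https://example.com/docs/chains/solana/getBalance",
    "https://example.com/docs/chains/llms.txt",
  ],
};

function demoWalk(): Promise<{ pages: string[]; fetched: number }> {
  const seeds = ["https://example.com/docs/get-started", "https://example.com/docs/chains/llms.txt"];
  return walkAggregates(seeds, async (url) => site[url] || []);
}
```

In this fixture `demoWalk()` fetches three aggregate files and collects four page URLs; the two links pointing back at the chains index are skipped because that URL is already in the visited set. Marking an aggregate as visited when it is queued, rather than when it is fetched, is one way to guarantee a file referenced from several parents is only requested once.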

Real-world evidence: alchemy.com/docs

Run on top of #54 (so canonical selection is correct):

| Metric | Before this PR | After this PR |
| --- | --- | --- |
| Pages discovered from llms.txt | 311 | 5,436 |
| llms-txt-freshness | FAIL (6% coverage) | PASS (100% of 5,452) |
| markdown-content-parity | FAIL/WARN (biased sampling) | PASS (avg 0% missing) |
| Observability category | 57 (F) | 100 (A+) |
| Overall score | 88 (B) | 92 (A) |

Combined with #54, the alchemy.com/docs scorecard moves from 68 (D) → 92 (A) without any change to their docs.

Test plan

  • 3 new tests in test/unit/helpers/get-page-urls.test.ts:

    • three-level nested walk (mirrors Alchemy’s structure)
    • same aggregate referenced from multiple parents is fetched once
    • cycle in the aggregate graph terminates
  • All 849 existing tests pass (852 total with new ones)

  • Lint clean

  • Verified live against https://www.alchemy.com/docs

Files

src/helpers/get-page-urls.ts            | walker rewrite + depth/file caps
test/unit/helpers/get-page-urls.test.ts | nested-walk, dedup, cycle tests

When a site has both an apex `llms.txt` (often a marketing index) and a
docs-section `llms.txt`, downstream checks were aggregating links from
both. Sampling, size, validity, and freshness all leaked apex pages into
docs runs (and vice versa), masking real signal — see issue agent-ecosystem#53.

This change introduces a single canonical pick:

- `llms-txt-exists` selects one file as canonical using a
  most-specific-prefix-of-the-baseUrl heuristic (`/docs/llms.txt` wins over
  `/llms.txt` when the user passed `/docs`, and the other way around when
  they passed the origin). Ties resolve to the candidate-discovery order.
- Downstream checks (`llms-txt-size`, `llms-txt-valid`,
  `llms-txt-links-resolve`, `llms-txt-links-markdown`, sampling via
  `getUrlsFromCachedLlmsTxt` / `fetchLlmsTxtUrls`, and `llms-txt-freshness`)
  now operate on that single file via a new `getLlmsTxtFilesForAnalysis`
  helper. `cache-header-hygiene` keeps probing every discovered file.
- A new `--llms-txt-url` CLI flag (and `llmsTxtUrl` config option) lets
  users point afdocs at an explicit canonical, bypassing the heuristic.
  When set, only that URL is probed and the cross-host fallback is
  skipped.
- The pass message now surfaces which file was chosen, and `details`
  exposes `canonicalUrl`, `canonicalLlmsTxt`, and `canonicalSource` for
  scripted consumers.

Made-with: Cursor
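
The most-specific-prefix rule from the commit message above can be sketched on URL paths alone. This is an illustration under stated assumptions, not the `llms-txt-exists` implementation: `pickCanonicalPath` is a hypothetical name, it compares paths on a single host, and it does not model the cross-host fallback or the `--llms-txt-url` override.

```typescript
// Hypothetical helper: choose the llms.txt whose directory is the longest prefix of the base path.
// Assumption: all candidates are same-host paths, listed in discovery order.
function pickCanonicalPath(basePath: string, candidatePaths: string[]): string | undefined {
  // Normalize "/docs" to "/docs/" so that "/docs/" is not treated as a prefix of "/docsify".
  const base = basePath.charAt(basePath.length - 1) === "/" ? basePath : basePath + "/";
  let best: string | undefined;
  let bestLen = -1;
  for (const candidate of candidatePaths) {
    // Directory that holds the candidate, e.g. "/docs/llms.txt" gives "/docs/".
    const dir = candidate.slice(0, candidate.lastIndexOf("/") + 1);
    if (base.indexOf(dir) !== 0) continue; // not a prefix of the base path
    // Strict comparison keeps the earlier candidate on ties (discovery order).
    if (dir.length > bestLen) {
      best = candidate;
      bestLen = dir.length;
    }
  }
  return best;
}

const candidates = ["/llms.txt", "/docs/llms.txt"];
console.log(pickCanonicalPath("/docs", candidates)); // "/docs/llms.txt"
console.log(pickCanonicalPath("/", candidates)); // "/llms.txt"
```
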
The previous walker only descended one level into aggregate `.txt` files
referenced from llms.txt, with an explicit "skip further .txt nesting"
filter on sub-links. That works for two-level patterns (Cloudflare,
Supabase) but undercounts sites that use deeper progressive disclosure.

Concrete miss: alchemy.com/docs has a three-level structure
(`/docs/llms.txt` → `/docs/chains/llms.txt` → `/docs/chains/{chain}/llms.txt`).
The chain method pages — about 5,100 of them — never made it into the
URL pool, so `llms-txt-freshness` reported 6% coverage instead of 100%
and the freshness check failed for what was actually a well-organized
docs site.

Replace the one-level walk with a bounded BFS:

- Recurse to `MAX_AGGREGATE_DEPTH = 5` (deep enough for realistic trees,
  shallow enough to terminate on accidental loops)
- Cap total fetches at `MAX_AGGREGATE_FILES = 200` so a malformed index
  can't trigger unbounded HTTP traffic
- Track visited aggregates so the same `.txt` referenced from multiple
  parents is fetched once and cycles terminate

On alchemy.com/docs this lifts the page-URL count from 311 → 5,436 and
the overall scorecard from 88 (B) → 92 (A) without any change to their
docs.

Made-with: Cursor
Successfully merging this pull request may close these issues.

llms.txt aggregate walker only descends one level, undercounts deeply-nested indexes
