Context
The aggregate .txt walker in src/helpers/get-page-urls.ts (walkAggregateLinks) only descends one level into nested llms.txt indexes. Sub-link .txt references at depth 2+ are explicitly filtered out (see “skip further .txt nesting” comment in the code).
This works for two-level patterns (Cloudflare per-product files, Supabase aggregate content files) but undercounts sites that use deeper progressive disclosure — which the spec encourages for large sites that would otherwise exceed llms-txt-size.
The visible symptom is llms-txt-freshness reporting very low coverage on sites whose llms.txt is actually exhaustive — the walker just isn’t reaching the leaves.
Concrete example
Site: alchemy.com/docs
Three-level structure (correctly organized per spec — sections split because a unified file would exceed llms-txt-size):
/docs/llms.txt                              # 6 section links, all .txt
└─ /docs/get-started/llms.txt               # .md page links [walked]
└─ /docs/node/llms.txt                      # .md page links [walked]
└─ /docs/data/llms.txt                      # .md page links [walked]
└─ /docs/wallets/llms.txt                   # .md page links [walked]
└─ /docs/rollups/llms.txt                   # .md page links [walked]
└─ /docs/chains/llms.txt                    # ~80 chain .txt links [walked, children dropped]
   └─ /docs/chains/ethereum/llms.txt        # eth_call, eth_chainId, … [NOT REACHED]
   └─ /docs/chains/solana/llms.txt          # getBalance, getSlot, … [NOT REACHED]
   └─ … 80 more chains                      [NOT REACHED]
Five sections have a flat layout, so their pages are counted (~311 page URLs total). The Chains section has another nesting level for per-chain files, and all ~5,100 method pages live at depth 2. Those never make it into the URL pool.
Verbose afdocs output:
✗ llms-txt-freshness: llms.txt covers 311/5452 sitemap doc pages (6%); 5141 missing
Fix: Your llms.txt covers less than 80% of your site's pages. ...
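As a sanity check, the figures in that output line are internally consistent. The snippet below reproduces them from the two reported counts; the rounding to a whole percent and the `>=` comparison against the 80% threshold are assumptions about how the check computes its result, not something confirmed from the afdocs source.

```typescript
// Counts copied from the verbose output; the 80% threshold comes from the fix hint.
const covered = 311;
const sitemapPages = 5452;
const threshold = 0.8;

const coverage = covered / sitemapPages;    // fraction of sitemap pages found via llms.txt
const missing = sitemapPages - covered;     // pages the walker never reached
const percent = Math.round(coverage * 100); // assumed rounding to a whole percent
const passes = coverage >= threshold;       // assumed comparison; the hint only says "less than 80%"

console.log(`${covered}/${sitemapPages} (${percent}%); ${missing} missing; passes=${passes}`);
```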
The 5,141 “missing” pages are mostly chain-specific RPC method docs:
- /docs/chains/ethereum/ethereum-api-endpoints/eth-call
- /docs/chains/solana/solana-api-endpoints/get-balance
- …
These are in the llms.txt tree — just one level deeper than the walker currently explores.
This also biases sampling for any check that flows through getUrlsFromCachedLlmsTxt:
- markdown-content-parity
- llms-txt-directive
- etc.
All of these checks sample only from the 5 flat sections, never from the ~5,100 chain method pages.
Suggested behaviors (in priority order)
- Walk recursively, with bounded depth and file count
  - Stop at e.g. depth = 5 and ≤200 total fetches
  - Deduplicate visited .txt URLs so cycles and shared sub-indexes don’t cause repeated fetches
  - Current behavior is effectively “depth = 1 hardcoded”
- Treat the aggregate walk uniformly
  The seed URLs from the canonical llms.txt and discovered sub-links should go through the same classification logic (page URL vs aggregate-to-walk), instead of the current setup where the inner loop uses a different filter than the outer one.
- Surface the walked tree in details
  When verbose:
  - List which aggregate files were fetched
  - Include depth information
  - Show whether safety caps were hit
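To make the three suggestions concrete, here is a minimal sketch of a bounded, deduplicating walk that also records what it fetched. It is not the existing walkAggregateLinks implementation: walkLlmsTxt, describeWalk, the FetchText callback, the result shape, and the "path ends in .txt" classification rule are all hypothetical names and assumptions chosen for illustration, and the example.com fixture is made up so the sketch runs without network access.

```typescript
// Hypothetical sketch: every name, type, and default here is illustrative, not afdocs code.
type FetchText = (url: string) => Promise<string | null>;
interface WalkLimits { maxDepth: number; maxFetches: number }
interface WalkedFile { url: string; depth: number; pageLinks: number }
interface WalkResult {
  pageUrls: string[];
  walked: WalkedFile[];
  capsHit: { depth: boolean; fetches: boolean };
}

// Assumption: a link whose path ends in ".txt" is an aggregate index; anything else is a page.
function isAggregate(url: string): boolean {
  return new URL(url).pathname.endsWith(".txt");
}

// Collect markdown link targets, resolved against the file that contains them.
function extractLinks(body: string, base: string): string[] {
  const re = /\]\(([^)\s]+)\)/g;
  const links: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    const target = m[1];
    if (target) links.push(new URL(target, base).toString());
  }
  return links;
}

// Breadth-first walk; links found in any file (seed or sub-index) use the same classifier.
async function walkLlmsTxt(
  seed: string,
  fetchText: FetchText,
  limits: WalkLimits = { maxDepth: 5, maxFetches: 200 },
): Promise<WalkResult> {
  const visited = new Set<string>([seed]);
  const pages = new Set<string>();
  const walked: WalkedFile[] = [];
  const capsHit = { depth: false, fetches: false };
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  let fetches = 0;

  while (queue.length > 0) {
    if (fetches >= limits.maxFetches) {
      capsHit.fetches = true; // unfetched indexes remain in the queue
      break;
    }
    const { url, depth } = queue.shift()!;
    fetches++;
    const body = await fetchText(url);
    if (body === null) continue; // unreachable index: skip it, keep walking
    let pageLinks = 0;
    for (const link of extractLinks(body, url)) {
      if (!isAggregate(link)) {
        if (!pages.has(link)) pageLinks++;
        pages.add(link);
      } else if (!visited.has(link)) {
        if (depth + 1 > limits.maxDepth) {
          capsHit.depth = true; // a deeper index exists but was not followed
        } else {
          visited.add(link); // dedupe so cycles and shared sub-indexes are fetched once
          queue.push({ url: link, depth: depth + 1 });
        }
      }
    }
    walked.push({ url, depth, pageLinks });
  }
  return { pageUrls: Array.from(pages), walked, capsHit };
}

// Render the walked tree for verbose output: one line per fetched file, indented by depth.
function describeWalk(result: WalkResult): string[] {
  const lines = result.walked.map(
    (f) => `${"  ".repeat(f.depth)}${f.url} (depth ${f.depth}, ${f.pageLinks} page links)`,
  );
  if (result.capsHit.depth) lines.push("depth cap hit: deeper indexes were not followed");
  if (result.capsHit.fetches) lines.push("fetch cap hit: some indexes were not fetched");
  return lines;
}

// In-memory stand-in for the network (made-up example.com URLs) so the sketch runs offline.
const site: Record<string, string> = {
  "https://example.com/docs/llms.txt":
    "- [Start](/docs/start/llms.txt)\n- [Chains](/docs/chains/llms.txt)",
  "https://example.com/docs/start/llms.txt":
    "- [Intro](/docs/start/intro.md)\n- [Setup](/docs/start/setup.md)",
  "https://example.com/docs/chains/llms.txt":
    "- [Eth](/docs/chains/eth/llms.txt)\n- [Sol](/docs/chains/sol/llms.txt)\n- [Up](/docs/llms.txt)",
  "https://example.com/docs/chains/eth/llms.txt":
    "- [eth_call](/docs/chains/eth/eth-call.md)\n- [eth_chainId](/docs/chains/eth/eth-chainid.md)",
  "https://example.com/docs/chains/sol/llms.txt":
    "- [getBalance](/docs/chains/sol/get-balance.md)",
};
const fakeFetch: FetchText = async (url) => site[url] ?? null;

walkLlmsTxt("https://example.com/docs/llms.txt", fakeFetch).then((r) =>
  console.log(describeWalk(r).join("\n")),
);
```

In this fixture the chains index links back to the root file, which exercises the visited set. Passing `{ maxDepth: 1, maxFetches: 200 }` approximates the current one-level behavior: the per-chain files are skipped and the depth flag is set, which is the kind of signal the verbose details could surface.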
Workarounds tried
- None viable from the user side. The site’s llms.txt is structured exactly as the spec recommends; the bug is on the consumer (afdocs) side.
- Flattening llms.txt is technically possible, but defeats the purpose of the size-based recommendation to split files.