llms.txt aggregate walker only descends one level, undercounts deeply-nested indexes #57

@SahilAujla

Context

The aggregate .txt walker in src/helpers/get-page-urls.ts (walkAggregateLinks) only descends one level into nested llms.txt indexes. Sub-link .txt references at depth 2+ are explicitly filtered out (see “skip further .txt nesting” comment in the code).

This works for two-level patterns (Cloudflare per-product files, Supabase aggregate content files) but undercounts sites that use deeper progressive disclosure — which the spec encourages for large sites that would otherwise exceed llms-txt-size.

The visible symptom is llms-txt-freshness reporting very low coverage on sites whose llms.txt is actually exhaustive — the walker just isn’t reaching the leaves.

Concrete example

Site: alchemy.com/docs

Three-level structure (correctly organized per spec — sections split because a unified file would exceed llms-txt-size):

/docs/llms.txt                           # 6 section links, all .txt
  └─ /docs/get-started/llms.txt          # .md page links     [walked]
  └─ /docs/node/llms.txt                 # .md page links     [walked]
  └─ /docs/data/llms.txt                 # .md page links     [walked]
  └─ /docs/wallets/llms.txt              # .md page links     [walked]
  └─ /docs/rollups/llms.txt              # .md page links     [walked]
  └─ /docs/chains/llms.txt               # ~80 chain .txt links [walked, children dropped]
      └─ /docs/chains/ethereum/llms.txt  # eth_call, eth_chainId, …  [NOT REACHED]
      └─ /docs/chains/solana/llms.txt    # getBalance, getSlot, …    [NOT REACHED]
      └─ … 80 more chains                                            [NOT REACHED]

Five sections have a flat layout, so their pages are counted (~311 page URLs total). The Chains section has another nesting level for per-chain files, and all ~5,100 method pages live at depth 2. Those never make it into the URL pool.

Verbose afdocs output:

✗ llms-txt-freshness: llms.txt covers 311/5452 sitemap doc pages (6%); 5141 missing
      Fix: Your llms.txt covers less than 80% of your site's pages. ...

The 5,141 “missing” pages are mostly chain-specific RPC method docs:

  • /docs/chains/ethereum/ethereum-api-endpoints/eth-call
  • /docs/chains/solana/solana-api-endpoints/get-balance

These are in the llms.txt tree — just one level deeper than the walker currently explores.

This also biases sampling for any check that flows through getUrlsFromCachedLlmsTxt:

  • markdown-content-parity
  • llms-txt-directive
  • etc.

All of these checks sample only from the five flat sections and never from the ~5,100 chain method pages.

Suggested behaviors (in priority order)

  1. Walk recursively, with bounded depth and file count

    • Stop at e.g. depth = 5 and ≤200 total fetches
    • Deduplicate visited .txt URLs so cycles and shared sub-indexes don’t cause repeated fetches
    • Current behavior is effectively “depth = 1 hardcoded”
  2. Treat the aggregate walk uniformly
    The seed URLs from the canonical llms.txt and discovered sub-links should go through the same classification logic (page URL vs aggregate-to-walk), instead of the current setup where the inner loop uses a different filter than the outer one.

  3. Surface the walked tree in details
    When verbose:

    • List which aggregate files were fetched
    • Include depth information
    • Show whether safety caps were hit
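
The three suggestions above can be sketched together as a single breadth-first walk. This is only an illustration under stated assumptions, not the afdocs implementation: `walkAggregates`, `extractLinks`, `isAggregate`, `FetchText`, and the option names are all hypothetical, links are assumed to be markdown-style `[label](target)`, and any link whose path ends in `.txt` is assumed to be an aggregate. The seed sits at depth 0, so the per-chain files in the Alchemy example are at depth 2.

```typescript
// Hypothetical sketch: none of these names come from the afdocs codebase.
type FetchText = (url: string) => Promise<string | null>;

interface WalkOptions {
  maxDepth: number;   // e.g. 5, as suggested above
  maxFetches: number; // e.g. 200, as suggested above
}

interface WalkResult {
  pageUrls: string[];                         // non-.txt links, deduplicated
  fetched: { url: string; depth: number }[];  // aggregate fetches attempted, with depth
  capsHit: { depth: boolean; fetches: boolean };
}

// Collect absolute targets of markdown-style [label](target) links.
function extractLinks(body: string, baseUrl: string): string[] {
  const links: string[] = [];
  const pattern = /\]\(([^)\s]+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(body)) !== null) {
    try {
      links.push(new URL(match[1], baseUrl).toString());
    } catch {
      // Ignore targets that cannot be resolved to a URL.
    }
  }
  return links;
}

// Assumption: a path ending in .txt marks an aggregate index to walk.
function isAggregate(url: string): boolean {
  return new URL(url).pathname.endsWith(".txt");
}

async function walkAggregates(
  seed: string,
  fetchText: FetchText,
  opts: WalkOptions,
): Promise<WalkResult> {
  const visited = new Set<string>([seed]); // guards against cycles and shared sub-indexes
  const pages = new Set<string>();
  const fetched: { url: string; depth: number }[] = [];
  const capsHit = { depth: false, fetches: false };
  const queue: { url: string; depth: number }[] = [{ url: seed, depth: 0 }];

  while (queue.length > 0) {
    if (fetched.length >= opts.maxFetches) {
      capsHit.fetches = true;
      break;
    }
    const { url, depth } = queue.shift()!;
    const body = await fetchText(url);
    fetched.push({ url, depth }); // failed fetches still count toward the cap
    if (body === null) continue;

    // The seed and every discovered sub-index go through the same classification.
    for (const link of extractLinks(body, url)) {
      if (!isAggregate(link)) {
        pages.add(link);
      } else if (visited.has(link)) {
        continue;
      } else if (depth + 1 > opts.maxDepth) {
        capsHit.depth = true;
      } else {
        visited.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return { pageUrls: Array.from(pages), fetched, capsHit };
}
```

In this sketch, `maxDepth: 1` reproduces the behavior described in this issue (sub-links of sub-indexes are dropped, and `capsHit.depth` reports that it happened), while `fetched` and `capsHit` carry the information the verbose details would need. The fetcher is injected so the walk can be exercised against in-memory fixtures; a real caller might pass something like `async (url) => { const res = await fetch(url); return res.ok ? res.text() : null; }`.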

Workarounds tried

  • None viable from the user side.
    The site’s llms.txt is structured exactly as the spec recommends; the bug is on the consumer (afdocs) side.

  • Flattening llms.txt is technically possible, but defeats the purpose of the size-based recommendation to split files.
