Path-prefix filter applied after sitemap URL cap, discarding relevant URLs #31

@dacharyc

Problem

The path-prefix filtering introduced in PR #28 is applied after the MAX_SITEMAP_URLS = 500 cap in getUrlsFromSitemap(). When a sitemap index contains sub-sitemaps that are processed in alphabetical order, the 500 slots can fill entirely with URLs from earlier sub-sitemaps, leaving zero matches after filtering.

Reproduction

# Bare domain: discovers 129 pages (all Greek, but pages are found)
npx afdocs check https://docs.djangoproject.com --sampling deterministic --max-links 10

# With path prefix: discovers only 1 page (the base URL fallback)
npx afdocs check https://docs.djangoproject.com/en/6.0/ --sampling deterministic --max-links 10

What happens

  1. getUrlsFromSitemap() fetches the sitemap index, which contains sitemap-el.xml, sitemap-en.xml, sitemap-es.xml, etc.
  2. Sub-sitemaps are iterated in order. sitemap-el.xml has thousands of URLs and fills most or all of the 500-slot cap.
  3. getPageUrls() calls filterByPathPrefix(sitemapUrls, filterBase) on the 500 collected URLs.
  4. The filter keeps only URLs under /en/6.0/. Few or none of the 500 collected URLs match, because nearly all of them are Greek.
  5. Result: 1 page (base URL fallback).
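
The steps above can be modelled with a small standalone sketch. It is not the actual afdocs source: `collectThenFilter` and the generated URLs are hypothetical stand-ins, and the Greek sub-sitemap size of 2000 is an assumed figure for "thousands". Only the 500 cap and the 658 English URLs come from this report.

```typescript
// Hypothetical, simplified model of the current behavior (cap first, filter second).
const MAX_SITEMAP_URLS = 500;

// Stand-in sub-sitemaps, in the order the sitemap index lists them.
// 2000 is an assumed size; 658 is the count reported for /en/6.0/.
const subSitemaps: string[][] = [
  Array.from({ length: 2000 }, (_, i) => `https://docs.djangoproject.com/el/6.0/page-${i}/`),
  Array.from({ length: 658 }, (_, i) => `https://docs.djangoproject.com/en/6.0/page-${i}/`),
];

function collectThenFilter(sitemaps: string[][], prefix: string): string[] {
  const collected: string[] = [];
  for (const urls of sitemaps) {
    for (const url of urls) {
      // The cap counts every URL, relevant or not.
      if (collected.length >= MAX_SITEMAP_URLS) break;
      collected.push(url);
    }
  }
  // Filtering only sees the 500 URLs that survived the cap.
  return collected.filter((url) => new URL(url).pathname.startsWith(prefix));
}

const capThenFilterResult = collectThenFilter(subSitemaps, "/en/6.0/");
console.log(capThenFilterResult.length); // 0
```

With no matches left, the caller has nothing to sample from except the base URL fallback.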

Meanwhile, sitemap-en.xml contains 658 URLs under /en/6.0/ that were never collected because the cap was already reached.

Where in the code

| Location | What it does |
| --- | --- |
| src/helpers/get-page-urls.ts:259-310 | getUrlsFromSitemap() collects up to 500 URLs without path filtering |
| src/helpers/get-page-urls.ts:335-347 | filterByPathPrefix() filters after collection |
| src/helpers/get-page-urls.ts:378-380 | getPageUrls() applies filter to already-capped results |
| src/constants.ts:35 | MAX_SITEMAP_URLS = 500 |

Proposed fix

Apply path-prefix filtering inside getUrlsFromSitemap(), before counting against the 500 cap. This way only relevant URLs consume slots, and the cap still serves its performance purpose. The filter base URL would need to be passed into getUrlsFromSitemap() (or made available on the context).
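
A minimal sketch of the filter-before-cap idea, under stated assumptions: `collectWithPrefix`, `matchesPathPrefix`, and `SITEMAP_URL_CAP` are hypothetical names for illustration (the constant mirrors MAX_SITEMAP_URLS), sub-sitemaps are modelled as in-memory arrays rather than fetched, and the real getUrlsFromSitemap() signature may differ.

```typescript
// Hypothetical sketch: apply the path-prefix check before counting against the cap.
const SITEMAP_URL_CAP = 500; // mirrors MAX_SITEMAP_URLS from this report

function matchesPathPrefix(url: string, prefix: string): boolean {
  try {
    return new URL(url).pathname.startsWith(prefix);
  } catch {
    // Skip entries that are not valid absolute URLs.
    return false;
  }
}

function collectWithPrefix(
  sitemaps: string[][],
  prefix: string,
  cap: number = SITEMAP_URL_CAP,
): string[] {
  const collected: string[] = [];
  for (const urls of sitemaps) {
    for (const url of urls) {
      // Irrelevant URLs never consume a slot.
      if (!matchesPathPrefix(url, prefix)) continue;
      collected.push(url);
      // Stop as soon as the cap is reached, so it still bounds the work.
      if (collected.length >= cap) return collected;
    }
  }
  return collected;
}

// Usage with stand-in data: 2000 Greek URLs (assumed size) listed before 658 English ones.
const greek = Array.from({ length: 2000 }, (_, i) => `https://docs.djangoproject.com/el/6.0/page-${i}/`);
const english = Array.from({ length: 658 }, (_, i) => `https://docs.djangoproject.com/en/6.0/page-${i}/`);
const filterThenCapResult = collectWithPrefix([greek, english], "/en/6.0/");
console.log(filterThenCapResult.length); // 500
```

In this model the result no longer depends on where the relevant sub-sitemap appears in the index; the cap only limits how many matching URLs are kept.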

An alternative is raising the cap (the freshness check uses MAX_FRESHNESS_SITEMAP_URLS = 50_000), but this is less targeted and doesn't address the root ordering issue.

Context

Discovered while scoring Django (docs.djangoproject.com). The site has no llms.txt, so the sitemap is the only discovery source. Passing the versioned URL (/en/6.0/) was expected to scope the sample to the current English docs, but the ordering bug prevents that from working. Related: #22 (version filtering), #30 (locale filtering on sitemap indexes).
