Problem
The path-prefix filtering introduced in PR #28 is applied after the MAX_SITEMAP_URLS = 500 cap in getUrlsFromSitemap(). When a sitemap index contains sub-sitemaps that are processed in alphabetical order, the 500 slots can fill entirely with URLs from earlier sub-sitemaps, leaving zero matches after filtering.
Reproduction
# Bare domain: discovers 129 pages (all Greek, but pages are found)
npx afdocs check https://docs.djangoproject.com --sampling deterministic --max-links 10
# With path prefix: discovers only 1 page (the base URL fallback)
npx afdocs check https://docs.djangoproject.com/en/6.0/ --sampling deterministic --max-links 10
What happens
- getUrlsFromSitemap() fetches the sitemap index, which contains sitemap-el.xml, sitemap-en.xml, sitemap-es.xml, etc.
- Sub-sitemaps are iterated in order.
- sitemap-el.xml has thousands of URLs and fills most or all of the 500-slot cap.
- getPageUrls() calls filterByPathPrefix(sitemapUrls, filterBase) on the 500 collected URLs.
- The filter keeps only URLs matching /en/6.0/. Almost none of the 500 Greek URLs match.
- Result: 1 page (base URL fallback).
Meanwhile, sitemap-en.xml contains 658 URLs under /en/6.0/ that were never collected because the cap was already reached.
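The sequence above can be reproduced in miniature. The following is a simplified sketch, not the actual afdocs code: collectCapped, filterByPrefix, the SubSitemap type, and the example.test URLs are hypothetical stand-ins, and the cap is scaled down from 500 to 5.

```typescript
// Simplified model of the cap-then-filter ordering described above.
// All names and data are illustrative; this is not the afdocs implementation.

interface SubSitemap {
  name: string;
  urls: string[];
}

const CAP = 5; // stands in for MAX_SITEMAP_URLS = 500

// Hypothetical sub-sitemaps, in the order the index lists them.
const subSitemaps: SubSitemap[] = [
  {
    name: "sitemap-el.xml",
    urls: Array.from({ length: 8 }, (_, i) => `https://example.test/el/6.0/page-${i}/`),
  },
  {
    name: "sitemap-en.xml",
    urls: Array.from({ length: 8 }, (_, i) => `https://example.test/en/6.0/page-${i}/`),
  },
];

// Collect URLs sub-sitemap by sub-sitemap, stopping once the cap is reached.
function collectCapped(sitemaps: SubSitemap[], cap: number): string[] {
  const urls: string[] = [];
  for (const sitemap of sitemaps) {
    for (const url of sitemap.urls) {
      if (urls.length >= cap) return urls;
      urls.push(url);
    }
  }
  return urls;
}

// Keep only URLs whose path starts with the given prefix.
function filterByPrefix(urls: string[], prefix: string): string[] {
  return urls.filter((u) => new URL(u).pathname.startsWith(prefix));
}

const collected = collectCapped(subSitemaps, CAP);
const matches = filterByPrefix(collected, "/en/6.0/");
console.log(collected.length, matches.length);
```

In this toy setup every slot is taken by /el/ URLs before sitemap-en.xml is read, so the filter has nothing left to keep.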
Where in the code
| Location | What it does |
| --- | --- |
| src/helpers/get-page-urls.ts:259-310 | getUrlsFromSitemap() collects up to 500 URLs without path filtering |
| src/helpers/get-page-urls.ts:335-347 | filterByPathPrefix() filters after collection |
| src/helpers/get-page-urls.ts:378-380 | getPageUrls() applies filter to already-capped results |
| src/constants.ts:35 | MAX_SITEMAP_URLS = 500 |
Proposed fix
Apply path-prefix filtering inside getUrlsFromSitemap(), before counting against the 500 cap. This way only relevant URLs consume slots, and the cap still serves its performance purpose. The filter base URL would need to be passed into getUrlsFromSitemap() (or made available on the context).
An alternative is raising the cap (the freshness check uses MAX_FRESHNESS_SITEMAP_URLS = 50_000), but this is less targeted and doesn't address the root ordering issue.
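One way the proposed ordering could look is sketched below. This is a minimal illustration under stated assumptions, not the real getUrlsFromSitemap() signature: collectWithPrefix, the SubSitemap type, and the optional pathPrefix parameter are hypothetical, and fetching sub-sitemaps is replaced with in-memory arrays.

```typescript
// Sketch: filter while collecting, so only matching URLs count toward the cap.
// Names and data are illustrative; this is not the afdocs implementation.

interface SubSitemap {
  name: string;
  urls: string[];
}

function collectWithPrefix(
  sitemaps: SubSitemap[],
  cap: number,
  pathPrefix?: string, // hypothetical parameter carrying the filter base path
): string[] {
  const urls: string[] = [];
  for (const sitemap of sitemaps) {
    for (const url of sitemap.urls) {
      if (urls.length >= cap) return urls;
      // Skip non-matching URLs before they can consume a slot.
      if (pathPrefix !== undefined && !new URL(url).pathname.startsWith(pathPrefix)) {
        continue;
      }
      urls.push(url);
    }
  }
  return urls;
}

// Synthetic data: the /el/ sub-sitemap comes first, as in the report.
const sitemaps: SubSitemap[] = [
  {
    name: "sitemap-el.xml",
    urls: Array.from({ length: 8 }, (_, i) => `https://example.test/el/6.0/page-${i}/`),
  },
  {
    name: "sitemap-en.xml",
    urls: Array.from({ length: 8 }, (_, i) => `https://example.test/en/6.0/page-${i}/`),
  },
];

const scoped = collectWithPrefix(sitemaps, 5, "/en/6.0/");
const unscoped = collectWithPrefix(sitemaps, 5);
console.log(scoped);
console.log(unscoped);
```

With the prefix supplied, the five slots go to /en/6.0/ URLs even though sitemap-el.xml is read first. Without it, the behavior matches the current unfiltered collection. In this sketch, non-matching sub-sitemaps are still read in full, and the cap bounds only the number of retained URLs.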
Context
Discovered while scoring Django (docs.djangoproject.com). The site has no llms.txt, so the sitemap is the only discovery source. Passing the versioned URL (/en/6.0/) was expected to scope the sample to the current English docs, but the ordering bug prevents it from working. Related: #22 (version filtering), #30 (locale filtering on sitemap indexes).