Sitemap discovery misses root-domain sitemap when scoring a subdirectory URL #83

@dacharyc

Summary

When the scored URL is a subdirectory (e.g. https://www.swift.org/documentation/), afdocs checks for a sitemap at <base-url>/sitemap.xml — in this case https://www.swift.org/documentation/sitemap.xml. If that path returns 404, afdocs falls back to testing only the root URL and emits a single-page-sample diagnostic, even when a valid sitemap exists at https://www.swift.org/sitemap.xml.

Steps to reproduce

npx afdocs check https://www.swift.org/documentation/ --sampling deterministic --max-links 50 --format json --score

Expected: afdocs discovers and samples pages from the sitemap at https://www.swift.org/sitemap.xml.

Actual: discoverySources: ["fallback"], testedPages: 1, single-page-sample diagnostic fires.

Root cause

The discovery sequence:

  1. Checks robots.txt at https://www.swift.org/robots.txt — found, but no Sitemap: directive
  2. Tries https://www.swift.org/documentation/sitemap.xml — 404
  3. Falls back to testing only the root URL

Step 2 is path-scoped to the base URL. It never tries https://www.swift.org/sitemap.xml — the conventional root-domain location.

Expected behavior

When the base URL is a subdirectory and the path-relative sitemap returns 404, fall back to checking <scheme>://<host>/sitemap.xml (and variants such as <scheme>://<host>/sitemap-index.xml) before giving up on sitemap discovery.

Sitemaps are almost never placed under a subdirectory path; they're nearly always at the root. The scoped path check has a low hit rate, while the root-domain fallback would recover a significant fraction of cases like this one.
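
A minimal sketch of the proposed probe order, for discussion only. The names sitemapCandidates and discoverSitemap, the injected probe callback, and the exact list of file names are assumptions made for illustration and are not taken from the afdocs source. The sketch also assumes the base URL points at a directory rather than a file.

```typescript
// Hypothetical helper, not part of afdocs: builds the ordered list of
// sitemap URLs to probe for a given base URL.
function sitemapCandidates(baseUrl: string): string[] {
  const url = new URL(baseUrl);
  // Treat the base path as a directory so candidates stay inside it.
  const dir = url.pathname.endsWith("/") ? url.pathname : url.pathname + "/";
  // Assumed file names; the issue only names sitemap.xml and sitemap-index.xml.
  const names = ["sitemap.xml", "sitemap-index.xml"];
  const scoped = names.map((n) => new URL(dir + n, url.origin).toString());
  const root = names.map((n) => new URL("/" + n, url.origin).toString());
  // Path-scoped candidates first, then root-domain ones. The filter drops
  // duplicates, which occur when the base URL is already the site root.
  return [...scoped, ...root].filter((u, i, all) => all.indexOf(u) === i);
}

// Hypothetical driver: returns the first candidate the probe accepts,
// or null so the caller can fall back to single-page testing.
async function discoverSitemap(
  baseUrl: string,
  probe: (url: string) => Promise<boolean>,
): Promise<string | null> {
  for (const candidate of sitemapCandidates(baseUrl)) {
    if (await probe(candidate)) return candidate;
  }
  return null;
}
```

With https://www.swift.org/documentation/ as the base URL, the path-scoped candidates are tried first and https://www.swift.org/sitemap.xml comes third, so existing behavior for sites that do publish a scoped sitemap would be unchanged.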

Notes

  • https://www.swift.org/sitemap.xml returns HTTP 200 with 446 URLs, 47 of which are under /documentation/
  • https://www.swift.org/documentation/sitemap.xml returns HTTP 404
  • The robots.txt (User-agent: *, Disallow: /builds/) has no Sitemap: line
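
The per-path count in the first note can be reproduced with something like the sketch below. This issue does not say whether afdocs should restrict root-sitemap entries to the scored subdirectory, so treat the filtering as an assumption. The helper name locsUnderPath is hypothetical, and the regex-based extraction is a simplification that does not decode XML entities or handle sitemap index files.

```typescript
// Hypothetical helper, not part of afdocs: pulls <loc> values out of
// sitemap XML and keeps those whose path is under the given base path.
function locsUnderPath(sitemapXml: string, basePath: string): string[] {
  // Normalize to a trailing slash so "/documentation" does not also
  // match a sibling such as "/documentation-old/".
  const prefix = basePath.endsWith("/") ? basePath : basePath + "/";
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sitemapXml)) !== null) {
    locs.push(match[1]);
  }
  // new URL throws on malformed entries; a real implementation would
  // likely want to skip those instead.
  return locs.filter((loc) => new URL(loc).pathname.startsWith(prefix));
}
```

Calling locsUnderPath with the body of the root sitemap and "/documentation/" returns the subset of entries that belong to the scored subdirectory.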
