fix: normalize .md URLs to HTML equivalents during page discovery #44

Merged
dacharyc merged 1 commit into agent-ecosystem:main from mvvmm:fix/normalize-md-urls-in-discovery on Apr 19, 2026
@mvvmm mvvmm commented Apr 19, 2026

Summary

The /index.md URLs in our llms.txt were causing links extracted from the sitemap and llms.txt to be duplicated in discoverAndSamplePages, because the same page appeared under two different URLs:

  • /r2/get-started/index.md from llms.txt
  • https://developers.cloudflare.com/r2/get-started/ from our sitemap.

Fix

Normalizes .md URLs to their HTML form in extractLinksFromLlmsTxtFiles() and walkAggregateLinks() using the existing toHtmlUrl() helper. Markdown-specific checks are unaffected because they derive .md candidates from HTML URLs via toMdUrls().
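For illustration, the normalization step could look roughly like the sketch below. `toHtmlUrlSketch` is a hypothetical stand-in, not the project's actual `toHtmlUrl()` helper, and the two rewrite rules (`/index.md` becomes a trailing slash, any other `.md` suffix is dropped) are assumptions based on the example URLs in this PR.

```typescript
// Hypothetical stand-in for the existing toHtmlUrl() helper. The real helper
// may apply different rules; this only covers the two cases named in the PR.
function toHtmlUrlSketch(rawUrl: string, origin: string): string {
  // Resolve relative llms.txt links (e.g. "/r2/get-started/index.md")
  // against the site origin so they compare equal to sitemap URLs.
  const url = new URL(rawUrl, origin);

  if (url.pathname.endsWith("/index.md")) {
    // ".../index.md" -> ".../" (keep the trailing slash)
    url.pathname = url.pathname.slice(0, -"index.md".length);
  } else if (url.pathname.endsWith(".md")) {
    // ".../page.md" -> ".../page"
    url.pathname = url.pathname.slice(0, -".md".length);
  }

  return url.toString();
}

const origin = "https://developers.cloudflare.com";
console.log(toHtmlUrlSketch("/r2/get-started/index.md", origin));
// prints "https://developers.cloudflare.com/r2/get-started/"
```

URLs that are already in HTML form pass through unchanged, so applying the function to sitemap entries is harmless.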

Observed impact (Cloudflare docs, 5 products):

  • ~50% fewer pages tested (duplicates eliminated)
  • ~3x faster audit runs
  • Correct HTML size measurements (boilerplate ratio back to ~80% from ~45%)

llms.txt links use .md URLs (e.g. /docs/guide/index.md) while sitemaps
use HTML URLs (e.g. /docs/guide/). Without normalization these appear as
duplicate pages, and checks like page-size-html that expect HTML URLs
silently fetch markdown content instead.
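The duplication described above can be reproduced with a small sketch. The names `normalizeDiscoveredUrl` and `mergeDiscoveredLinks` are hypothetical and do not appear in the project; the sketch only assumes that links from both sources end up in one deduplicated collection.

```typescript
// Hypothetical normalizer: maps ".../index.md" to ".../" and ".../page.md"
// to ".../page" after resolving the link against the site origin.
function normalizeDiscoveredUrl(rawUrl: string, origin: string): string {
  const url = new URL(rawUrl, origin);
  if (url.pathname.endsWith("/index.md")) {
    url.pathname = url.pathname.slice(0, -"index.md".length);
  } else if (url.pathname.endsWith(".md")) {
    url.pathname = url.pathname.slice(0, -".md".length);
  }
  return url.toString();
}

// Hypothetical merge step: combine links from llms.txt and the sitemap,
// keeping one entry per page once both are in HTML form.
function mergeDiscoveredLinks(
  llmsTxtLinks: string[],
  sitemapLinks: string[],
  origin: string,
): string[] {
  const seen = new Set<string>();
  for (const link of llmsTxtLinks.concat(sitemapLinks)) {
    seen.add(normalizeDiscoveredUrl(link, origin));
  }
  return Array.from(seen);
}

const pages = mergeDiscoveredLinks(
  ["/r2/get-started/index.md"],
  ["https://developers.cloudflare.com/r2/get-started/"],
  "https://developers.cloudflare.com",
);
console.log(pages.length); // prints 1
```

Without the normalization call, the two input strings differ and a set of raw links would hold two entries for the same page.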

@dacharyc dacharyc left a comment

Thanks for the PR, @mvvmm! I note there are a few test failures, but they're expected consequences of the changes here. I'll merge and then rework or remove those tests. This should be a nice performance boost. I appreciate the work here!

@dacharyc dacharyc merged commit 6bb7d6b into agent-ecosystem:main Apr 19, 2026
0 of 2 checks passed
2 participants