docs: omit .md pages from llms.txt without removing them completely #2480

Open
webrdaniel wants to merge 4 commits into master from fix/omit-pages-from-llms-txt-without-removing-them-completely

@webrdaniel
Contributor

webrdaniel commented Apr 29, 2026

Follow-up to #2470. Listing pages in the llms-txt plugin's excludeRoutes also drops their /<route>.md counterparts from the build, so URLs like https://docs.apify.com/sdk.md started returning 404 (raised in #2470 (comment)).

This PR moves the exclusion from build time to post-build:

  • docusaurus.config.js: revert excludeRoutes back to just / and /search; add a NOTE so future contributors don't re-introduce the regression.
  • scripts/joinLlmsFiles.mjs: add LLMS_INDEX_EXCLUDE_PATTERNS and a filterLlmsIndex() postbuild step that strips matching - [Title](url) entries (and now-empty ## Section headings) from the generated build/llms.txt. The .md files stay on disk and continue to serve. Also fixes a pre-existing fire-and-forget race between joinFiles() and sanitizeFile().
  • package.json: add @docusaurus/utils as a direct dependency (used for createMatcher).
  • .github/workflows/test.yaml: add regression tests asserting that /sdk.md, /open-source.md, /api/v2/actor-builds-get.md, /api/v2/dataset-get.md, and /academy/tutorials.md still serve text/markdown. Also adds assert_final_content_type so child-repo homepages (/sdk/js, /sdk/python, /api/client/{js,python}, /cli) are checked through their nginx redirects for both HTML and Accept: text/markdown responses.
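
The post-build filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code from scripts/joinLlmsFiles.mjs: the function name `filterIndexText` and the example pattern are hypothetical, and plain regular expressions stand in for the `createMatcher` helper from @docusaurus/utils so the snippet has no dependencies. The contents of the real LLMS_INDEX_EXCLUDE_PATTERNS are not shown in this description.

```javascript
// Hypothetical pattern list; the real LLMS_INDEX_EXCLUDE_PATTERNS is not shown in this PR description.
const EXAMPLE_EXCLUDE_PATTERNS = [/\/api\/v2\//];

// Sketch of a post-build filter over the text of llms.txt (assumed shape:
// "## Section" headings followed by "- [Title](url)" entries).
function filterIndexText(text, patterns = EXAMPLE_EXCLUDE_PATTERNS) {
    const entryRe = /^\s*- \[[^\]]*\]\(([^)\s]+)\)/;

    // Pass 1: drop link entries whose URL matches any exclude pattern.
    const kept = text.split('\n').filter((line) => {
        const match = entryRe.exec(line);
        return !(match && patterns.some((re) => re.test(match[1])));
    });

    // Pass 2: drop "## Section" headings that are followed only by blank
    // lines before the next "## " heading or the end of the file.
    const out = [];
    for (let i = 0; i < kept.length; i++) {
        if (kept[i].startsWith('## ')) {
            let j = i + 1;
            while (j < kept.length && kept[j].trim() === '') j++;
            const isEmpty = j >= kept.length || kept[j].startsWith('## ');
            if (isEmpty) {
                i = j - 1; // skip the heading and its trailing blank lines
                continue;
            }
        }
        out.push(kept[i]);
    }
    return out.join('\n');
}
```

Because the filter only rewrites the index text, nothing on disk other than llms.txt would be touched, which is the property this PR relies on: the per-page .md files are never part of the filtering step.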

Net effect: same llms.txt index as #2470 produced, but the per-page .md files are restored.
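
For readers unfamiliar with the fire-and-forget race mentioned in the list above, here is a small sketch of the general problem and the usual fix. The bodies of `joinFiles()` and `sanitizeFile()` below are hypothetical in-memory stand-ins, not the script's real implementations, and `postbuild` is an assumed name.

```javascript
// Hypothetical in-memory stand-ins; the real functions work on files under
// build/ and are not shown in this PR description.
const files = {};
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function joinFiles() {
    await delay(10); // simulate slow file I/O
    files['llms.txt'] = '  joined content  ';
}

async function sanitizeFile() {
    // Rejects if it runs before joinFiles() has written the file.
    if (!('llms.txt' in files)) throw new Error('llms.txt not written yet');
    files['llms.txt'] = files['llms.txt'].trim();
}

// Fire-and-forget (racy) would be: joinFiles(); sanitizeFile();
// Both calls start immediately, so sanitizeFile() can run too early.
// Awaiting each step guarantees the order.
async function postbuild() {
    await joinFiles();
    await sanitizeFile();
    return files['llms.txt'];
}
```

A filtering step added after these two would be awaited in the same way, so it only ever sees the fully joined and sanitized file.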

Test plan

  • CI passes (the new .md-counterpart and child-repo redirect assertions exercise the regression)
  • npm run build succeeds locally
  • build/llms.txt size remains under the 100K limit enforced by npm run test:llms-size
  • Manually verify a few previously-broken URLs once deployed: https://docs.apify.com/sdk.md, https://docs.apify.com/open-source.md, https://docs.apify.com/api/v2/actor-builds-get.md
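
The manual check in the last item could also be scripted. The sketch below is a hypothetical JavaScript analogue of the kind of check described for the workflow (the actual assertions live in .github/workflows/test.yaml and are not reproduced here). `checkFinalContentType` and its `fetchImpl` option are illustrative names; it assumes a runtime with a global `fetch` (for example Node 18 or later).

```javascript
// Sketch only: follows redirects and compares the final response's media type.
async function checkFinalContentType(url, expectedType, { accept, fetchImpl = globalThis.fetch } = {}) {
    const res = await fetchImpl(url, {
        redirect: 'follow', // check the response at the end of any redirect chain
        headers: accept ? { Accept: accept } : {},
    });
    if (res.status !== 200) {
        throw new Error(`${url}: expected 200, got ${res.status}`);
    }
    // Strip parameters such as "; charset=utf-8" before comparing.
    const type = (res.headers.get('content-type') || '').split(';')[0].trim().toLowerCase();
    if (type !== expectedType) {
        throw new Error(`${url}: expected ${expectedType}, got ${type || 'none'}`);
    }
    return type;
}
```

Usage would look like `await checkFinalContentType('https://docs.apify.com/sdk.md', 'text/markdown')`, or with `{ accept: 'text/markdown' }` for a child-repo homepage that is expected to serve Markdown through its redirect.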

webrdaniel requested review from TC-MO and marcel-rbro April 29, 2026 15:08
webrdaniel self-assigned this Apr 29, 2026
github-actions bot added this to the 139th sprint - Web team milestone Apr 29, 2026
github-actions bot added the t-web label (Issues with this label are in the ownership of the web team.) Apr 29, 2026
webrdaniel changed the title from "fix: omit .md pages from llms.txt without removing them completely" to "docs: omit .md pages from llms.txt without removing them completely" Apr 29, 2026
webrdaniel added the adhoc label (Ad-hoc unplanned task added during the sprint.) Apr 29, 2026
webrdaniel requested a review from B4nan April 29, 2026 15:12
@apify-service-account

Preview for this PR was built for commit 2edc700 and is ready at https://pr-2480.preview.docs.apify.com!

@jancurn
Member

jancurn commented Apr 29, 2026

Cheers. Could we please add some tests for these special pages, to ensure the .md versions work and that the HTML versions contain the <link rel="alternate" type="text/markdown" href="https://docs.apify.com/xxx.md"> tag?
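
A check for the tag requested above might look something like this. It is a rough sketch, not the test that was added in this PR: `findMarkdownAlternate` is a hypothetical helper, it works on an HTML string that has already been fetched, and it only understands double-quoted attributes.

```javascript
// Returns the href of the first <link rel="alternate" type="text/markdown">
// tag in the HTML, or null if there is none. Attribute order does not matter.
function findMarkdownAlternate(html) {
    for (const [tag] of html.matchAll(/<link\b[^>]*>/gi)) {
        const attrs = {};
        for (const [, name, value] of tag.matchAll(/([\w-]+)\s*=\s*"([^"]*)"/g)) {
            attrs[name.toLowerCase()] = value;
        }
        if (attrs.rel === 'alternate' && attrs.type === 'text/markdown') {
            return attrs.href ?? null;
        }
    }
    return null;
}
```

A test could then assert that the value returned for a page such as /sdk equals that page's own .md URL, and that the URL itself serves text/markdown.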

@apify-service-account

Preview for this PR was built for commit bc07d5b and is ready at https://pr-2480.preview.docs.apify.com!

@apify-service-account

Preview for this PR was built for commit 8380f85 and is ready at https://pr-2480.preview.docs.apify.com!

@TC-MO
Contributor

TC-MO commented Apr 30, 2026

Will the builds keep failing until we deploy this, or do we need to change the test assertions in nginx.conf?

@apify-service-account

Preview for this PR was built for commit 68594e4 and is ready at https://pr-2480.preview.docs.apify.com!

@webrdaniel
Contributor Author

The tests should now run correctly against staging.

webrdaniel marked this pull request as ready for review April 30, 2026 08:08
@jancurn
Member

jancurn commented Apr 30, 2026

Great, thank you.

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-web Issues with this label are in the ownership of the web team.

4 participants