docs: reduce llms.txt size from ~196K to ~84K characters #2470
marcel-rbro merged 6 commits into master
The llms.txt file was over 200K characters, well above the recommended 100K limit. This reduces it to ~84K by:

1. Replacing live fetching of external repo llms.txt files with a curated static file that keeps overview/guide/concept pages but drops individual class/interface/enum reference entries (~67K saved)
2. Excluding individual API endpoint pages from the plugin output via excludeRoutes, keeping only Introduction category pages (~17K saved)
3. Excluding legacy Academy content (old JS course, node-js/python tutorial collections, exercise solutions, marketing playbook deep pages) and outdated legal documents (~27K saved)

The llms-full.txt file remains unchanged - it still fetches and includes all content from external repos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
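To illustrate the route-exclusion step, here is a minimal sketch of how a list of excluded routes could be applied to a set of pages. The `isExcluded` helper, its trailing `/**` wildcard handling, and the `/example-section/**` pattern are hypothetical and only illustrate the idea; the plugin's actual matching rules for `excludeRoutes` may differ.

```javascript
// Hypothetical matcher: supports exact paths and a trailing "/**" wildcard only.
// The real plugin's excludeRoutes semantics are not shown in this PR.
function isExcluded(route, patterns) {
    return patterns.some((pattern) => {
        if (pattern.endsWith('/**')) {
            const prefix = pattern.slice(0, -3);
            return route === prefix || route.startsWith(`${prefix}/`);
        }
        return route === pattern;
    });
}

// The first three entries appear in the reviewed diff; the last one is illustrative.
const excludeRoutes = [
    '/legal/candidate-referral-program-terms',
    '/open-source',
    '/sdk',
    '/example-section/**',
];

// Keep only the routes that should still be listed in llms.txt.
function filterRoutes(routes, patterns = excludeRoutes) {
    return routes.filter((route) => !isExcluded(route, patterns));
}
```

With this sketch, `filterRoutes(['/sdk', '/platform'])` would keep only `/platform`.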
Preview for this PR was built for commit |
Also, please improve the section captions, e.g.
Are legal docs relevant for agents at all? Maybe we could trim them completely? 🤔
Not for the agents, but definitely for the companies running those agents. But let's sort the sections by importance, so that if agents trim the text, the important stuff is kept |
- Add includeOrder to put Platform docs and API before Academy/Legal
- Fix curated file: replace H1 headings with H2, move source URLs to blockquotes, improve section names (e.g. "## sdk" -> "## Apify SDK for Python")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
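The H1-to-H2 fix in the curated file could be automated along these lines. This is only a sketch: the `demoteH1` name is hypothetical, and the PR does not say whether the change was made by hand or by a script.

```javascript
// Hypothetical helper: turn top-level "# Title" headings into "## Title",
// leaving lines inside fenced code blocks untouched.
function demoteH1(markdown) {
    let inFence = false;
    return markdown
        .split('\n')
        .map((line) => {
            // Toggle fence state on lines that open or close a code fence.
            if (/^\s*(`{3}|~{3})/.test(line)) {
                inFence = !inFence;
                return line;
            }
            // Only a single "#" followed by a space and text counts as H1.
            return !inFence && /^# \S/.test(line) ? `#${line}` : line;
        })
        .join('\n');
}
```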
Move section reordering from the Docusaurus plugin (includeOrder) into joinLlmsFiles.mjs so it can control the final order after external curated content is appended. Academy and Legal are now at the very end of the file, after all SDK/client/CLI sections.

Section order: Platform -> API -> JS Client -> Python Client -> JS SDK -> Python SDK -> CLI -> Academy -> Legal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
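A minimal sketch of such a reordering step, assuming sections are delimited by `## ` headings and matched by a substring of their title. The function signature and the matching rule are assumptions, not the actual contents of joinLlmsFiles.mjs.

```javascript
// Hypothetical sketch: reorder "## " sections by a priority list of title substrings.
// Sections that match no entry keep their relative order and go last.
function reorderSections(text, order) {
    // Split in front of every line that starts with "## "; "### " does not match.
    const parts = text.split(/^(?=## )/m);
    // Anything before the first "## " heading is kept as a preamble.
    const preamble = parts[0].startsWith('## ') ? '' : parts.shift();

    const rank = (section) => {
        const title = section.split('\n', 1)[0].slice(3);
        const index = order.findIndex((key) => title.includes(key));
        return index === -1 ? order.length : index;
    };

    const sections = parts
        .map((section) => section.trimEnd())
        .sort((a, b) => rank(a) - rank(b)); // Array.prototype.sort is stable in modern Node.js

    // Joining with "\n\n" guarantees a blank line between all sections.
    return [preamble.trimEnd(), ...sections].filter(Boolean).join('\n\n') + '\n';
}
```

For example, `reorderSections(text, ['Platform', 'API', 'Academy', 'Legal'])` would move a leading Legal section behind Platform and API while keeping the preamble at the top.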
One last nit - can we add the following intro to the very top?
Let's add there:
And also, make sure we include a link to https://apify.com/pricing.md somewhere, as that's quite important too
Oh, and one more - can we add an integration test that ensures the llms.txt is below, let's say, 90k chars?
- Add Apify marketing blurb and pricing link to siteDescription
- Remove Source blockquotes from curated file (not useful for LLMs)
- Normalize list indentation to consistent 2-space nesting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I've resolved the nits but the test is a bigger change (also, purely vibecoded). Can you please review the integration test, @webrdaniel?
Add scripts/checkLlmsSize.mjs that verifies both llms.txt and llms-full.txt exist after build and checks llms.txt character count:

- Warning if over 90,000 characters
- Error (exit 1) if over 100,000 characters

Runs in CI after build via npm run test:llms-size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
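The threshold logic described in this commit can be sketched as a pure function. The two limits come from the commit message; the function name and return shape are hypothetical, and the existence check for both files is left out.

```javascript
// Thresholds from the commit message; everything else is a hypothetical sketch,
// not the actual contents of scripts/checkLlmsSize.mjs.
const WARN_LIMIT = 90_000;
const ERROR_LIMIT = 100_000;

function checkLlmsSize(content) {
    // Count Unicode code points rather than UTF-16 code units or bytes.
    const chars = [...content].length;
    if (chars > ERROR_LIMIT) return { level: 'error', chars };
    if (chars > WARN_LIMIT) return { level: 'warn', chars };
    return { level: 'ok', chars };
}
```

A script built on this would read `build/llms.txt` as UTF-8, print a warning for `warn`, and exit with code 1 for `error`. Note that byte-based tools report a larger number than this count whenever the file contains non-ASCII characters.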
- Move Changelog to end of each section's sub-items so the indent script doesn't make concept/guide pages children of Changelog
- Reorder entries to put Quick start first in each section
- Fix reorderSections to ensure blank line between all ## sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🚀 |
    '/legal/candidate-referral-program-terms',
    // Misc singleton pages
    '/open-source',
    '/sdk',
@marcel-rbro @TC-MO By including the pages here, the Markdown variants themselves were removed (e.g. https://docs.apify.com/sdk.md now returns 404).
I just want to verify whether this is ok and whether it was intended.
Good catch. Definitely not okay, all pages need to have their Markdown counterpart (+ .md) working. Can we add integration tests to ensure these pages work?
The tests are not checking the homepages of the child repos, but the setup is there already:
https://github.com/apify/apify-docs/actions/runs/25109134352/workflow#L67
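A check for the Markdown counterparts discussed in this thread could look roughly like the sketch below. The base URL and the `/sdk` example come from the comment above; the helper names and the injectable `fetchImpl` parameter are hypothetical and are not part of the existing test setup.

```javascript
// Base URL taken from the review comment; the helpers themselves are hypothetical.
const BASE_URL = 'https://docs.apify.com';

// Build the ".md" counterpart of a docs route, e.g. "/sdk" -> ".../sdk.md".
function markdownUrl(route, base = BASE_URL) {
    const trimmed = route.replace(/\/+$/, '');
    return new URL(`${trimmed}.md`, base).href;
}

// Return the routes whose Markdown counterpart does not respond with a 2xx status.
// fetchImpl defaults to the global fetch (Node.js 18+) and can be stubbed in tests.
async function findMissingMarkdown(routes, fetchImpl = globalThis.fetch) {
    const missing = [];
    for (const route of routes) {
        const response = await fetchImpl(markdownUrl(route));
        if (!response.ok) missing.push(route);
    }
    return missing;
}
```

Calling `findMissingMarkdown(['/sdk', '/open-source'])` against the deployed site and failing when the result is non-empty would catch the regression reported here.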

Summary
- Reduces `llms.txt` from ~196K to ~84K characters (57% reduction, under 100K target)
- Replaces live fetching of external repo `llms.txt` files with a curated static file (`scripts/llms-external-curated.txt`) that keeps overview, concept, and guide pages but drops ~400 individual class/interface/enum reference entries
- Excludes individual API endpoint pages via `excludeRoutes`, keeping only Introduction category pages and Getting started
- `llms-full.txt` remains unchanged - it still fetches all content from external repos

Test plan
- `npm run build` succeeds
- `wc -c build/llms.txt` is under 100K characters
- `build/llms.txt` contains API Introduction pages but no individual endpoint pages
- `build/llms.txt` contains no legacy Academy content or reference class entries
- `build/llms-full.txt` size is unchanged

🤖 Generated with Claude Code