
docs: reduce llms.txt size from ~196K to ~84K characters #2470

Merged
marcel-rbro merged 6 commits into master from feat/reduce-llms-txt-size on Apr 28, 2026


@marcel-rbro
Contributor

Summary

  • Reduce llms.txt from ~196K to ~84K characters (57% reduction, under 100K target)
  • Replace live fetching of external repo llms.txt with a curated static file (scripts/llms-external-curated.txt) that keeps overview, concept, and guide pages but drops ~400 individual class/interface/enum reference entries
  • Exclude individual API endpoint pages from the Docusaurus llms-txt plugin via excludeRoutes, keeping only Introduction category pages and Getting started
  • Exclude legacy Academy content (old JS scraping course, node-js/python tutorial collections, exercise solutions, marketing playbook deep pages) and outdated legal documents
  • llms-full.txt remains unchanged - it still fetches all content from external repos
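The curation step in the second bullet can be pictured as a line filter over an llms.txt-style listing. This is only a sketch: the PR's curated file is a static file in the repository, and the URL fragments below are hypothetical markers for per-symbol reference pages, not patterns taken from the PR.

```javascript
// Hypothetical URL fragments that mark per-symbol reference pages.
// These patterns are assumptions made for illustration only.
const REFERENCE_PATTERNS = ['/reference/class/', '/reference/interface/', '/reference/enum/'];

/**
 * Keeps headings, prose, and ordinary link entries, and drops list items
 * whose link target contains one of the given reference patterns.
 */
function dropReferenceEntries(markdown, patterns = REFERENCE_PATTERNS) {
    return markdown
        .split('\n')
        .filter((line) => {
            // Match a Markdown list item that starts with a link: "- [Title](url)".
            const link = line.match(/^\s*[-*]\s+\[[^\]]*\]\(([^)]+)\)/);
            if (!link) return true; // not a link entry, keep it
            return !patterns.some((pattern) => link[1].includes(pattern));
        })
        .join('\n');
}
```

Overview, concept, and guide entries pass through untouched because their URLs do not contain any of the assumed fragments.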

Test plan

  • npm run build succeeds
  • wc -c build/llms.txt is under 100K characters
  • build/llms.txt contains API Introduction pages but no individual endpoint pages
  • build/llms.txt contains no legacy Academy content or reference class entries
  • build/llms-full.txt size is unchanged

🤖 Generated with Claude Code

The llms.txt file was ~196K characters, well above the recommended
100K limit. This reduces it to ~84K by:

1. Replacing live fetching of external repo llms.txt files with a
   curated static file that keeps overview/guide/concept pages but
   drops individual class/interface/enum reference entries (~67K saved)

2. Excluding individual API endpoint pages from the plugin output via
   excludeRoutes, keeping only Introduction category pages (~17K saved)

3. Excluding legacy Academy content (old JS course, node-js/python
   tutorial collections, exercise solutions, marketing playbook deep
   pages) and outdated legal documents (~27K saved)

The llms-full.txt file remains unchanged - it still fetches and
includes all content from external repos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
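The three savings quoted in the commit message can be checked against the ~196K and ~84K figures in the PR title. The snippet below is just that arithmetic, with all values in thousands of characters; the variable names are illustrative.

```javascript
// Figures in thousands of characters, as quoted in the PR title and commit message.
const llmsBeforeK = 196;
const llmsSavingsK = { curatedExternal: 67, apiEndpoints: 17, legacyAcademyAndLegal: 27 };

// 67 + 17 + 27 = 111
const llmsTotalSavedK = Object.values(llmsSavingsK).reduce((sum, k) => sum + k, 0);

// 196 - 111 = 85, which matches the reported ~84K within rounding.
const llmsAfterK = llmsBeforeK - llmsTotalSavedK;

// (196 - 84) / 196 = 0.571..., i.e. the 57% reduction quoted in the summary.
const llmsReductionPct = Math.round(((llmsBeforeK - 84) / llmsBeforeK) * 100);
```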
@github-actions github-actions bot added the t-docs (Issues owned by technical writing team) label on Apr 28, 2026
@marcel-rbro marcel-rbro marked this pull request as ready for review April 28, 2026 08:11
@apify-service-account

Preview for this PR was built for commit 6ae0861 and is ready at https://pr-2470.preview.docs.apify.com!

@marcel-rbro marcel-rbro requested a review from TC-MO April 28, 2026 08:19
@TC-MO TC-MO requested a review from webrdaniel April 28, 2026 08:49
@jancurn
Member

jancurn commented Apr 28, 2026

Great stuff. Could we please move the Academy and Legal docs to the end? They're less relevant for agents, and this way we increase the chance that agents see the important bits.

Also, this part (the heading) looks a bit strange:

Screenshot 2026-04-28 at 11 20 34

Let's make the "Source:" line something other than an H1, and improve the formatting.

@jancurn
Member

jancurn commented Apr 28, 2026

Also, please improve the section captions, e.g. ## sdk => ## Apify SDK; they look a bit lame now.

@TC-MO
Contributor

TC-MO commented Apr 28, 2026

Are legal docs relevant for agents at all? Maybe we could trim them completely? 🤔

@jancurn
Member

jancurn commented Apr 28, 2026

> Are legal docs relevant for agents at all?

Not for the agents, but definitely for the companies running those agents. But let's sort the sections by importance, so that if agents trim the text, the important stuff is kept.

marcel-rbro and others added 2 commits April 28, 2026 11:52
- Add includeOrder to put Platform docs and API before Academy/Legal
- Fix curated file: replace H1 headings with H2, move source URLs to
  blockquotes, improve section names (e.g. "## sdk" -> "## Apify SDK
  for Python")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
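The heading fixes in this commit were applied to the curated file; as a rough sketch, the same transformation could be written as a small function. The caption map is hypothetical apart from the "sdk" example quoted above, and the function does not skip "# " lines inside fenced code blocks, which a real script would need to handle.

```javascript
// Hypothetical caption map. Only the "sdk" entry comes from the PR discussion.
const SECTION_TITLES = new Map([['sdk', 'Apify SDK for Python']]);

/**
 * Demotes top-level "# ..." headings to "## ..." and replaces terse
 * section captions with friendlier ones where a mapping exists.
 */
function normalizeHeadings(markdown, titles = SECTION_TITLES) {
    return markdown
        .split('\n')
        .map((line) => {
            // "# Title" becomes "## Title"; deeper headings are left alone.
            const demoted = line.replace(/^# (?=\S)/, '## ');
            const match = demoted.match(/^## (.+)$/);
            if (!match) return demoted;
            const key = match[1].trim();
            return titles.has(key) ? `## ${titles.get(key)}` : demoted;
        })
        .join('\n');
}
```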
Move section reordering from the Docusaurus plugin (includeOrder) into
joinLlmsFiles.mjs so it can control the final order after external
curated content is appended. Academy and Legal are now at the very end
of the file, after all SDK/client/CLI sections.

Section order: Platform -> API -> JS Client -> Python Client ->
JS SDK -> Python SDK -> CLI -> Academy -> Legal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
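A minimal sketch of this kind of reordering is shown below. It is not the code from joinLlmsFiles.mjs: the function name, the heading strings in SECTION_ORDER, and the choice to place unlisted sections after the listed ones are all assumptions. It also joins sections with a blank line, which is the spacing issue raised later in the review.

```javascript
// Order taken from the commit message. The exact heading text in the real
// file is not shown in the PR, so these strings are assumptions.
const SECTION_ORDER = [
    'Platform', 'API', 'JS Client', 'Python Client',
    'JS SDK', 'Python SDK', 'CLI', 'Academy', 'Legal',
];

/**
 * Splits Markdown into a preamble plus "## " sections, sorts the sections
 * by their position in `order`, and joins everything with blank lines.
 */
function reorderLlmsSections(markdown, order = SECTION_ORDER) {
    // Zero-width split: each chunk after the preamble starts with "## ".
    const chunks = markdown.split(/^(?=## )/m);
    const preamble = chunks.filter((chunk) => !chunk.startsWith('## '));
    const sections = chunks.filter((chunk) => chunk.startsWith('## '));

    const rank = (chunk) => {
        const title = chunk.split('\n', 1)[0].slice(3).trim();
        const index = order.indexOf(title);
        return index === -1 ? order.length : index; // unlisted sections sort last
    };

    const sorted = sections
        .map((chunk, position) => ({ chunk, position, rank: rank(chunk) }))
        .sort((a, b) => a.rank - b.rank || a.position - b.position)
        .map((entry) => entry.chunk);

    // Trimming each chunk and joining with "\n\n" guarantees exactly one
    // blank line before every "## " heading.
    return `${[...preamble, ...sorted].map((chunk) => chunk.trim()).join('\n\n')}\n`;
}
```

Running the reorder as the last step, after the external curated content has been appended, is what lets it control the final order of the whole file.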
@apify-service-account

Preview for this PR was built for commit aa09e428 and is ready at https://pr-2470.preview.docs.apify.com!

Contributor

@TC-MO TC-MO left a comment

LGTM

@jancurn
Member

jancurn commented Apr 28, 2026

One last nit - can we add the following intro to the very top?

Screenshot 2026-04-28 at 12 28 11

Let's add there:

Apify is the largest marketplace of tools for AI. 25,000 ready-made Actors to automate your business. Get real-time web data, track competitors, generate leads, analyze sentiment, and orchestrate your apps. Actors are created by a global community of builders earning over $1M every month. Apify takes care of infrastructure, billing, and distribution.

Also, make sure we include a link to https://apify.com/pricing.md somewhere, as that's quite important too.

@jancurn
Member

jancurn commented Apr 28, 2026

Small bug here - Unicode emojis are not rendered correctly:

Screenshot 2026-04-28 at 12 31 12

@jancurn
Member

jancurn commented Apr 28, 2026

And this part is also weird:

Screenshot 2026-04-28 at 12 32 13

@jancurn
Member

jancurn commented Apr 28, 2026

And missing newline here:

Screenshot 2026-04-28 at 12 32 45

@jancurn
Member

jancurn commented Apr 28, 2026

Oh, and one more: can we add an integration test that ensures llms.txt stays below, let's say, 90K chars?

- Add Apify marketing blurb and pricing link to siteDescription
- Remove Source blockquotes from curated file (not useful for LLMs)
- Normalize list indentation to consistent 2-space nesting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
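The last two cleanups in this commit can be sketched as one pass over the curated text. This is an illustration under assumptions: it treats any "> Source:" blockquote line as removable, expands tabs to four spaces, and infers nesting depth from the indentation widths it has seen. The PR does not show how the actual file was edited.

```javascript
/**
 * Removes "> Source: ..." blockquote lines and rewrites list indentation
 * so that each nesting level uses exactly two spaces.
 */
function normalizeListIndent(markdown) {
    const stack = []; // indentation widths of the currently open list levels
    return markdown
        .split('\n')
        .filter((line) => !/^\s*>\s*Source:/i.test(line))
        .map((line) => {
            const item = line.match(/^([ \t]*)([-*] .*)$/);
            if (!item) {
                // Any non-blank, non-list line (e.g. a heading) closes the list.
                if (line.trim() !== '') stack.length = 0;
                return line;
            }
            const width = item[1].replace(/\t/g, '    ').length;
            // Close deeper levels, then open a new one if this item is deeper.
            while (stack.length > 0 && stack[stack.length - 1] > width) stack.pop();
            if (stack.length === 0 || stack[stack.length - 1] < width) stack.push(width);
            return '  '.repeat(stack.length - 1) + item[2];
        })
        .join('\n');
}
```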
@marcel-rbro marcel-rbro requested a review from webrdaniel April 28, 2026 11:21
@apify-service-account

Preview for this PR was built for commit 029b5d2a and is ready at https://pr-2470.preview.docs.apify.com!

@marcel-rbro
Contributor Author

I've resolved the nits, but the test is a bigger change (also, purely vibecoded). Can you please review the integration test, @webrdaniel?

Add scripts/checkLlmsSize.mjs that verifies both llms.txt and
llms-full.txt exist after build and checks llms.txt character count:
- Warning if over 90,000 characters
- Error (exit 1) if over 100,000 characters

Runs in CI after build via npm run test:llms-size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
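The two thresholds described in this commit message can be expressed as a small pure function. This is a sketch rather than the contents of scripts/checkLlmsSize.mjs; the function name and return shape are assumptions, and reading the built files is left to the caller.

```javascript
/**
 * Classifies llms.txt content by character count:
 * "error" above errorAt, "warn" above warnAt, otherwise "ok".
 */
function classifyLlmsSize(text, limits = {}) {
    const { warnAt = 90000, errorAt = 100000 } = limits;
    // Array.from iterates by Unicode code point, so an emoji counts once.
    const chars = Array.from(text).length;
    let level = 'ok';
    if (chars > errorAt) level = 'error';
    else if (chars > warnAt) level = 'warn';
    return { chars, level };
}
```

A CI script built on this would read build/llms.txt, print a warning for "warn", and exit with a non-zero status for "error". Note that `wc -c` counts bytes rather than characters, so its result can be slightly higher than this count when the file contains non-ASCII text.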
@apify-service-account

Preview for this PR was built for commit 44025b03 and is ready at https://pr-2470.preview.docs.apify.com!

Contributor

@webrdaniel webrdaniel left a comment

looks good

@jancurn
Member

jancurn commented Apr 28, 2026

Thanks. Two more nits and it will be perfect :)

The Changelog probably shouldn't be the parent of the sub-items

Screenshot 2026-04-28 at 13 37 03

One more missing newline before this heading:

Screenshot 2026-04-28 at 13 37 12

- Move Changelog to end of each section's sub-items so the indent
  script doesn't make concept/guide pages children of Changelog
- Reorder entries to put Quick start first in each section
- Fix reorderSections to ensure blank line between all ## sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
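The first two bullets amount to a stable reorder of each section's entries: Quick start first, Changelog last, everything else in its original order. The sketch below assumes entries are Markdown link lines and that the titles are literally "Quick start" and "Changelog"; the function name is illustrative.

```javascript
/**
 * Returns the entries with "Quick start" moved to the front and
 * "Changelog" moved to the end, keeping the other entries in order.
 */
function orderSectionEntries(lines) {
    const titleOf = (line) => (line.match(/\[([^\]]+)\]/) || [])[1] || '';
    const weight = (line) => {
        const title = titleOf(line).toLowerCase();
        if (title === 'quick start') return -1;
        if (title === 'changelog') return 1;
        return 0;
    };
    return lines
        .map((line, position) => ({ line, position }))
        // The position tie-breaker keeps the sort stable for equal weights.
        .sort((a, b) => weight(a.line) - weight(b.line) || a.position - b.position)
        .map((entry) => entry.line);
}
```

With Changelog last, an indentation step that nests entries under the preceding item has nothing left to attach to it.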
@apify-service-account
Copy link
Copy Markdown

Preview for this PR was built for commit 0ce18ce3 and is ready at https://pr-2470.preview.docs.apify.com!

@marcel-rbro marcel-rbro merged commit d01804b into master Apr 28, 2026
15 checks passed
@marcel-rbro marcel-rbro deleted the feat/reduce-llms-txt-size branch April 28, 2026 11:54
@marcel-rbro
Contributor Author

🚀

Comment thread docusaurus.config.js
'/legal/candidate-referral-program-terms',
// Misc singleton pages
'/open-source',
'/sdk',
Contributor

@marcel-rbro @TC-MO By including the pages in this list, their Markdown variants were removed as well (e.g. https://docs.apify.com/sdk.md now returns 404).
I just want to verify whether this is OK and whether it was intended.

Member

Good catch. Definitely not okay: all pages need a working Markdown counterpart (page URL + .md). Can we add integration tests to ensure these pages work?
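One way to picture such a test is a helper that maps page routes to their .md URLs and reports the ones that are not available. This is a hypothetical sketch, not the existing CI setup: the function names are made up, the availability check is injected so the logic can be exercised without network access, and routes are assumed to be non-root paths such as /sdk.

```javascript
/**
 * Builds the Markdown URL for a non-root page route, e.g. "/sdk" -> ".../sdk.md".
 */
function toMarkdownUrl(route, baseUrl = 'https://docs.apify.com') {
    const path = route.replace(/\/+$/, ''); // drop any trailing slash
    return `${baseUrl}${path}.md`;
}

/**
 * Returns the Markdown URLs for which `isAvailable(url)` is false.
 * `isAvailable` is supplied by the caller (for example, an HTTP status check).
 */
function findMissingMarkdown(routes, isAvailable, baseUrl) {
    return routes
        .map((route) => toMarkdownUrl(route, baseUrl))
        .filter((url) => !isAvailable(url));
}
```

In a real integration test, the availability check would request each URL and treat anything other than a successful response as missing, which would have flagged the 404 on /sdk.md mentioned above.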

Member

The tests are not checking the homepages of the child repos, but the setup is there already:

https://github.com/apify/apify-docs/actions/runs/25109134352/workflow#L67

Contributor

This could fix it: #2480


Labels

t-docs Issues owned by technical writing team.


6 participants