
Add sphinx-gp-llms: LLM-friendly documentation outputs #47

Merged
tony merged 8 commits into main from feat/sphinx-gp-llms
May 25, 2026

Conversation

Member

tony commented May 25, 2026

Summary

  • Add new workspace package sphinx-gp-llms that generates four LLM-friendly output formats during the standard HTML Sphinx build
  • Add llms.txt — structured Markdown index following the llmstxt.org spec (Jeremy Howard, Answer.AI), with H1 project name, blockquote summary, and H2 sections grouped by toctree captions
  • Add llms-full.txt — concatenated full-content Markdown of all documentation pages, following the community convention adopted by Anthropic, Cloudflare, Mintlify, and GitBook
  • Add docs.json — agent-oriented manifest with agentEntrypoints, per-page markdownUrl, and heading outlines, following the Lakebed/Ping convention
  • Add per-page .md twins — source file copies alongside HTML output, following the Cloudflare "Markdown for Agents" convention
  • Update the footer's Machine-readable line to link all formats: Markdown, raw source, docs.json, llms.txt, llms-full.txt
  • Fix footer link injection to respect per-generate config flags — disabling an output no longer renders a broken link
  • Fix footer template guard so LLM links render independently of source_repository
  • Fix env.titles access to skip titleless pages instead of crashing
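To illustrate the llms.txt layout summarized above (H1 project name, blockquote summary, one H2 section per toctree caption), here is a minimal sketch. The function name `render_llms_txt`, the `Page` tuple, and the example URL are hypothetical and are not taken from the package; the real generator may format entries differently.

```python
from __future__ import annotations

from typing import NamedTuple


class Page(NamedTuple):
    """One documentation page (hypothetical shape, for illustration only)."""

    title: str
    url: str


def render_llms_txt(
    project: str,
    summary: str,
    sections: dict[str, list[Page]],
) -> str:
    """Render an llms.txt index: H1, blockquote summary, one H2 per section."""
    lines = [f"# {project}", "", f"> {summary}", ""]
    for caption, pages in sections.items():
        # Each toctree caption becomes an H2 heading followed by page links.
        lines.append(f"## {caption}")
        lines.append("")
        lines.extend(f"- [{page.title}]({page.url})" for page in pages)
        lines.append("")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"
```

Calling `render_llms_txt("Demo", "A demo project.", {"Guide": [Page("Install", "https://example.com/install.md")]})` would yield a heading, a blockquote, and a single "Guide" section with one link.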

Design decisions

  • Hook-based, not a custom Builder: Uses build-finished to generate files alongside the HTML build, matching the sphinx-gp-sitemap pattern. No separate build invocation needed — every sphinx-build automatically produces LLM outputs.
  • Toctree captions as llms.txt sections: MyST's {toctree} :caption: option maps directly to the H2 sections in the llms.txt format, so existing toctree structure translates without new configuration.
  • Silent no-op when site_url is unset: Projects without docs_url configured skip LLM output and log the skip at INFO level, matching the sitemap behavior, so the build does not break.
  • Footer links are context-injected and flag-gated: The html-page-context hook injects link URLs only when the extension is loaded and the corresponding llms_generate_* flag is True. The footer also renders LLM links independently of source_repository, falling back to an elif branch when the source-path section is unavailable.
  • Defensive title access: Generators use env.titles.get(docname) and skip pages without titles (_llms_txt.py, _docs_json.py) or fall back to the docname (_llms_full_txt.py).

Verification

Verify all output files are generated:

$ ls docs/_build/html/llms.txt docs/_build/html/llms-full.txt docs/_build/html/docs.json

Verify per-page .md twins exist:

$ ls docs/_build/html/index.md docs/_build/html/configuration.md

Verify footer links render:

$ grep -c "llms.txt" docs/_build/html/configuration/index.html
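
The file checks above could also be scripted. This is a small sketch, not part of the PR; the helper name `missing_outputs` is hypothetical, and the build directory is taken from the commands above.

```python
from __future__ import annotations

from pathlib import Path

# Site-level outputs listed in the verification commands above.
EXPECTED_OUTPUTS = ("llms.txt", "llms-full.txt", "docs.json")


def missing_outputs(
    build_dir: Path,
    names: tuple[str, ...] = EXPECTED_OUTPUTS,
) -> list[str]:
    """Return the expected output files that are absent from build_dir."""
    return [name for name in names if not (build_dir / name).is_file()]
```

For example, `missing_outputs(Path("docs/_build/html"))` returns an empty list when all three files were generated.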

Test plan

  • uv run ruff check . — no lint issues
  • uv run mypy — no type errors
  • uv run pytest tests/ packages/ --reruns 0 — all tests pass
  • just build-docs — docs build successfully with all LLM outputs
  • llms.txt has correct H1/blockquote/H2 structure with page links
  • llms-full.txt concatenates all page content with headers and separators
  • docs.json has valid schema with agentEntrypoints and page entries
  • .md twins exist alongside HTML for every content page
  • Footer shows "Machine-readable: Markdown, raw source, docs.json, llms.txt, llms-full.txt" with working links
  • Disabling a generate flag removes its footer link
  • LLM links render even without source_repository configured
  • Titleless pages don't crash the build
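
For the llms-full.txt item in the test plan (page content joined with headers and separators), a minimal sketch of the concatenation is shown here. The `PageContent` tuple, the `Source:` line, and the `---` separator are assumptions; the exact layout the package emits is not shown in this PR description.

```python
from __future__ import annotations

from typing import NamedTuple


class PageContent(NamedTuple):
    """Source text of one page (hypothetical shape, for illustration only)."""

    title: str
    url: str
    body: str


def render_llms_full_txt(pages: list[PageContent], separator: str = "---") -> str:
    """Concatenate pages, each under an H1 header followed by its source URL."""
    chunks = [
        f"# {page.title}\n\nSource: {page.url}\n\n{page.body.strip()}\n"
        for page in pages
    ]
    # A separator line, surrounded by blank lines, goes between pages only.
    return f"\n{separator}\n\n".join(chunks)
```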

why: LLM agents need machine-readable entry points to docs sites;
llms.txt, llms-full.txt, docs.json, and per-page .md twins are the
emerging conventions (llmstxt.org, Cloudflare, Mintlify, Lakebed).

what:
- New workspace package sphinx-gp-llms with Sphinx 8.1+ idioms
- llms.txt: structured Markdown index (H1/blockquote/H2 sections)
  following the llmstxt.org spec (Jeremy Howard, Answer.AI)
- llms-full.txt: concatenated full-content Markdown of all pages
  (community convention, Anthropic/Cloudflare/Mintlify/GitBook)
- docs.json: agent-oriented manifest with agentEntrypoints and
  per-page headings (Lakebed/Ping convention)
- Per-page .md twins: source file copy alongside each HTML page
  (Cloudflare "Markdown for Agents", Mintlify, Stripe, Vercel)
- Hooks into build-finished (file generation) and
  html-page-context (footer link injection)
- Config: llms_generate_txt, llms_generate_full, llms_generate_json,
  llms_generate_md_twins, llms_excludes, llms_description_length
- Silent no-op when site_url is unset (same pattern as sitemap)
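
A conf.py using these options might look like the fragment below. Only the option names come from the commit message; every value, and the assumed types of llms_excludes and llms_description_length, are illustrative guesses rather than documented defaults.

```python
# conf.py (sketch): option names come from the commit message above;
# every value shown here is an assumed example, not a documented default.
llms_generate_txt = True        # write llms.txt
llms_generate_full = True       # write llms-full.txt
llms_generate_json = False      # skip docs.json
llms_generate_md_twins = True   # copy per-page .md twins
llms_excludes = ["changelog"]   # assumed: a list of pages to leave out
llms_description_length = 160   # assumed: maximum description length
```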

codecov-commenter commented May 25, 2026

Codecov Report

❌ Patch coverage is 94.39024% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.87%. Comparing base (23bd389) to head (c569960).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ages/sphinx-gp-llms/src/sphinx_gp_llms/__init__.py | 90.47% | 6 Missing ⚠️ |
| ...es/sphinx-gp-llms/src/sphinx_gp_llms/_docs_json.py | 95.40% | 4 Missing ⚠️ |
| scripts/ci/package_tools.py | 20.00% | 4 Missing ⚠️ |
| ...phinx-gp-llms/src/sphinx_gp_llms/_llms_full_txt.py | 90.62% | 3 Missing ⚠️ |
| .../sphinx-gp-llms/src/sphinx_gp_llms/_description.py | 92.59% | 2 Missing ⚠️ |
| ...ges/sphinx-gp-llms/src/sphinx_gp_llms/_llms_txt.py | 95.23% | 2 Missing ⚠️ |
| ...ges/sphinx-gp-llms/src/sphinx_gp_llms/_md_twins.py | 91.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   91.82%   91.87%   +0.05%     
==========================================
  Files         220      233      +13     
  Lines       17776    18186     +410     
==========================================
+ Hits        16322    16709     +387     
- Misses       1454     1477      +23     


tony added 4 commits May 24, 2026 21:59
why: make the new extension available to all consumer projects
and ensure CI smoke tests cover it.

what:
- Add sphinx-gp-llms to uv workspace sources and dev deps
- Add to DEFAULT_EXTENSIONS (after sphinx_gp_sitemap)
- Add to ruff known-first-party and pytest testpaths
- Add smoke_sphinx_gp_llms runner to CI package_tools

why: the footer's Machine-readable line should link every format
the new sphinx-gp-llms extension generates.

what:
- Add Markdown, docs.json, llms.txt, llms-full.txt links
- Links appear conditionally when sphinx-gp-llms injects context
  variables via html-page-context hook
- Existing raw source link preserved

why: workspace infrastructure tests require a docs page, redirect
entry, and cluster classification for every package.

what:
- Add docs/packages/sphinx-gp-llms/index.md landing page
- Add extensions/sphinx-gp-llms redirect in redirects.txt
- Add build-seo cluster in package_reference.py
- Add to publishable packages set in test_package_reference.py

why: verify llms.txt format, llms-full.txt content, docs.json
schema, and .md twin file generation.

what:
- conftest.py with module-scoped build fixture and shared scenario
- test_importable.py: smoke test for setup() callable
- test_llms_txt.py: H1, blockquote, sections, link format
- test_llms_full_txt.py: page content, separators, source URLs
- test_docs_json.py: manifest schema, page fields, headings
- test_md_twins.py: file existence and content verification
- All use NamedTuple fixtures with test_id, strict typing
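
The fields that test_docs_json.py is described as checking (manifest schema, page fields, headings) suggest a manifest along the following lines. Only `agentEntrypoints`, `markdownUrl`, and per-page headings are named in this PR; the other keys, the URL layout, and the function names are assumptions made for this sketch.

```python
from __future__ import annotations

import json
from typing import Any


def build_docs_manifest(site_url: str, pages: list[dict[str, Any]]) -> dict[str, Any]:
    """Assemble a docs.json-style manifest (field layout partly assumed)."""
    base = site_url.rstrip("/")
    return {
        "agentEntrypoints": {
            # Key names inside agentEntrypoints are assumptions.
            "llmsTxt": f"{base}/llms.txt",
            "llmsFullTxt": f"{base}/llms-full.txt",
        },
        "pages": [
            {
                "title": page["title"],
                "url": f"{base}/{page['docname']}/",
                "markdownUrl": f"{base}/{page['docname']}.md",
                "headings": list(page.get("headings", [])),
            }
            for page in pages
        ],
    }


def dump_manifest(manifest: dict[str, Any]) -> str:
    """Serialize the manifest as indented JSON with a trailing newline."""
    return json.dumps(manifest, indent=2) + "\n"
```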
tony force-pushed the feat/sphinx-gp-llms branch from 9e5ebcc to 542870d on May 25, 2026 02:59
Member Author

tony commented May 25, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notable observations below threshold (scored 75/100 — edge cases, not blocking):

  • _inject_llms_context injects all link URLs unconditionally without checking llms_generate_* flags, so disabling an output (e.g. llms_generate_json = False) still renders its footer link (404). The sibling _write_llm_outputs does check each flag.
  • The outer {%- if theme_source_repository and page_source_suffix -%} guard in page.html wraps the LLM links too, hiding them when source_repository is unset even though they don't depend on it.
  • app.env.titles[docname] in _llms_txt.py, _llms_full_txt.py, _docs_json.py uses dict subscript — pages without a section heading would KeyError.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

tony added 3 commits May 24, 2026 22:16
why: disabling an output (e.g. llms_generate_json = False) still
rendered its footer link, producing a 404.

what:
- Check llms_generate_md_twins, llms_generate_txt,
  llms_generate_full, llms_generate_json before injecting
  each context variable in _inject_llms_context
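
The gating described in this commit could be sketched as below. This is a simplified stand-in for _inject_llms_context, not its actual code: it takes the config object and site URL directly instead of the Sphinx app, and the context variable names are hypothetical.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any

# Config flag -> (context variable, site-relative path). The flag names come
# from the commit message; the context variable names are hypothetical.
_SITE_LINKS = {
    "llms_generate_txt": ("llms_txt_url", "llms.txt"),
    "llms_generate_full": ("llms_full_txt_url", "llms-full.txt"),
    "llms_generate_json": ("llms_docs_json_url", "docs.json"),
}


def inject_links(config: Any, pagename: str, context: dict[str, Any], site_url: str) -> None:
    """Add a footer link variable only when its generate flag is enabled."""
    base = site_url.rstrip("/")
    for flag, (variable, path) in _SITE_LINKS.items():
        if getattr(config, flag, False):
            context[variable] = f"{base}/{path}"
    # The Markdown twin link is per page rather than per site.
    if getattr(config, "llms_generate_md_twins", False):
        context["llms_markdown_url"] = f"{base}/{pagename}.md"


# Usage: with docs.json disabled, no docs.json variable reaches the template.
config = SimpleNamespace(
    llms_generate_txt=True,
    llms_generate_full=True,
    llms_generate_json=False,
    llms_generate_md_twins=True,
)
page_context: dict[str, Any] = {}
inject_links(config, "configuration", page_context, "https://example.com/docs/")
```

A template that only renders a link when its variable is defined then drops the disabled format without further changes.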
why: the outer template guard required theme_source_repository,
hiding all LLM footer links for projects that set site_url but
not source_repository.

what:
- Add elif branch for LLM links when source_repository is unset
- LLM links now render independently of source_repository
- Source path and raw-source link still require source_repository

why: pages without a section heading (stubs, pure-directive pages)
may not have an entry in env.titles, causing a KeyError crash
during build-finished.

what:
- Use env.titles.get(docname) in _llms_txt.py and _docs_json.py,
  skip titleless pages
- Use env.titles.get(docname) in _llms_full_txt.py, fall back to
  docname as title since content is still useful
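
The two lookup behaviours in this commit can be sketched as follows. In Sphinx, env.titles maps docnames to title nodes that expose astext(); here a small Protocol stands in for those nodes, and the function names are hypothetical.

```python
from __future__ import annotations

from typing import Protocol


class HasText(Protocol):
    """Stand-in for a docutils title node."""

    def astext(self) -> str: ...


def index_title(titles: dict[str, HasText], docname: str) -> str | None:
    """Index-style lookup: return None so the caller can skip the page."""
    node = titles.get(docname)
    return node.astext() if node is not None else None


def full_text_title(titles: dict[str, HasText], docname: str) -> str:
    """Full-content lookup: fall back to the docname when no title exists."""
    node = titles.get(docname)
    return node.astext() if node is not None else docname
```

Using `.get()` instead of subscripting turns a missing title into an ordinary branch rather than a KeyError during build-finished.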
tony merged commit 5da67d0 into main May 25, 2026
42 checks passed
2 participants