
feat: generate llms-full.txt for AI knowledge-base ingestion #883

Merged
Jordy-Baby merged 1 commit into main from chore/llms-full-generation on Apr 24, 2026

Conversation

Collaborator

@Jordy-Baby Jordy-Baby commented Apr 23, 2026

Summary

  • Extends plugins/plugin-generate-llms.js to emit two additional files at build time
  • llms-full.txt (~1.7 MB, 325 doc pages): every docs page concatenated with source URL delimiters
  • llms-full-blog.txt (~0.5 MB, 55 posts): every blog post and case study concatenated
  • Existing llms.txt navigational index is unchanged
  • Target use cases: Claude Projects and Cursor users who paste the file into context for full-recall Q&A without mid-conversation browsing
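For illustration, a minimal sketch of the concatenation described above. The `<!-- Source: ... -->` delimiter shape, the helper names, and the inline front matter stripper are assumptions made for this example; the plugin itself parses front matter with `matter(raw)` and its exact delimiter format is not shown in this thread.

```javascript
// Hypothetical delimiter format: the PR only states that each page is
// prefixed with a source URL delimiter, not what it looks like.
function sourceDelimiter(pageUrl) {
  return `<!-- Source: ${pageUrl} -->`;
}

// Minimal front matter stripper, a stand-in for a real parser.
// It only handles a leading block fenced by `---` lines.
function stripFrontMatter(raw) {
  const match = raw.match(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/);
  return (match ? raw.slice(match[0].length) : raw).trimStart();
}

// pages: [{ pageUrl, title, raw }] where `raw` is the markdown file content.
function renderFull(header, pages) {
  const parts = [header.trim(), ""];
  for (const page of pages) {
    parts.push(sourceDelimiter(page.pageUrl)); // lets an agent cite back
    parts.push(`# ${page.title}`, "");
    parts.push(stripFrontMatter(page.raw).trimEnd(), "");
  }
  return parts.join("\n");
}

// Example input with placeholder URLs.
const output = renderFull("# Example: Full Documentation", [
  {
    pageUrl: "https://example.com/docs/a",
    title: "A",
    raw: "---\ntitle: A\n---\n\nBody of A.\n",
  },
  { pageUrl: "https://example.com/docs/b", title: "B", raw: "Body of B.\n" },
]);
console.log(output);
```

Keeping the delimiter on its own line makes it easy to count and to grep for when checking the generated file.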

Test plan

  • Local build succeeds, both files written to build/
  • Page count matches source tree (325 docs + 55 blog = 380)
  • Source URL delimiter present for every page
  • Verify preview deploy serves /llms-full.txt and /llms-full-blog.txt with 200
  • Paste llms-full.txt into Claude Projects and run a few sample doc queries
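The page-count and delimiter checks in the test plan could be automated along these lines. This is a sketch only: it assumes a `<!-- Source: URL -->` delimiter on its own line, which is a hypothetical format, and `checkDelimiters` is a name introduced here, not part of the plugin.

```javascript
// Assumed delimiter shape (hypothetical): one HTML comment per page,
// alone on its line, carrying the page URL.
const DELIMITER = /^<!-- Source: (\S+) -->$/gm;

// Compare the delimiters found in a generated file against the list of
// page URLs the build expected to emit.
function checkDelimiters(text, expectedUrls) {
  const found = [...text.matchAll(DELIMITER)].map((m) => m[1]);
  const foundSet = new Set(found);
  const missing = expectedUrls.filter((url) => !foundSet.has(url));
  return { count: found.length, missing };
}

// Example with placeholder URLs: two pages present, one expected page absent.
const sample = [
  "<!-- Source: https://example.com/docs/a -->",
  "# A",
  "",
  "Body.",
  "",
  "<!-- Source: https://example.com/docs/b -->",
  "# B",
  "",
  "Body.",
].join("\n");

const report = checkDelimiters(sample, [
  "https://example.com/docs/a",
  "https://example.com/docs/b",
  "https://example.com/docs/c",
]);
console.log(report);
```

For the real files, `text` would be the contents of `build/llms-full.txt` or `build/llms-full-blog.txt`, and the expected counts would be 325 and 55 respectively.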

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Items are now tagged by source type (docs or blog)
    • Generated output files consolidate all documentation and blog content with proper formatting, source URLs, and headers for unified content ingestion

Concatenates every docs page (325) into llms-full.txt (~1.7 MB) and
every blog post (55) into llms-full-blog.txt (~0.5 MB). Each page is
prefixed with a source URL delimiter so agents can cite back.

Lets Claude Projects and Cursor users one-shot-ingest the Envio
knowledge base without mid-conversation browsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Jordy-Baby Jordy-Baby requested a review from nikbhintade as a code owner April 23, 2026 13:14

vercel Bot commented Apr 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| envio-docs | Ready | Preview, Comment | Apr 23, 2026 1:15pm |


Contributor

coderabbitai Bot commented Apr 23, 2026

Walkthrough

The plugin-generate-llms.js file is enhanced to tag collected items with source metadata ("docs" or "blog") and introduce a rendering routine that concatenates markdown content into two dedicated build outputs (llms-full.txt and llms-full-blog.txt) with per-item source URL comments and formatted headers for LLM ingestion contexts.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| LLM Output Generation Enhancement: plugins/plugin-generate-llms.js | Added source field tagging for collected items and new rendering routine that concatenates markdown content into two aggregated output files (llms-full.txt and llms-full-blog.txt) with per-item source URLs, headings, and descriptions for LLM ingestion. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • nikbhintade
  • moose-code
  • keenbeen32

Poem

🐰 ✨
A plugin now sorts with care so keen,
Tagging docs and blogs pristine,
Concatenating wisdom into files bright,
Source comments glow through LLM night!
Content streams flow, all organized right! 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: generate llms-full.txt for AI knowledge-base ingestion' directly and clearly describes the primary change: generating a new llms-full.txt file for AI knowledge-base ingestion. It is concise, specific, and accurately represents the main objective of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/plugin-generate-llms.js (2)

108-114: ⚠️ Potential issue | 🟡 Minor

Read full-content inputs from absolute paths.

Docs items store a cwd-relative filePath, but renderLLMSFull() now reopens it directly. Capture the already-resolved fullPath during collection so this build step does not depend on the process cwd.

🛠️ Proposed fix
                         collectedDocs.push({
                             filePath: path.join(config.path, file),
+                            contentPath: fullPath,
                             title,
                             description,
                             pageUrl,
                             source: "docs",
                         });
                         collectedDocs.push({
                             filePath: fullPath,
+                            contentPath: fullPath,
                             title,
                             description,
                             pageUrl,
                             source: "blog",
                         });
                 const parts = [header.trim(), ""];
                 for (const item of items) {
-                    const raw = fs.readFileSync(item.filePath, "utf-8");
+                    const raw = fs.readFileSync(item.contentPath, "utf-8");
                     const body = matter(raw).content.trimStart();

Also applies to: 174-180, 234-238

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/plugin-generate-llms.js` around lines 108 - 114, The collectedDocs
entries currently store a cwd-relative filePath which breaks renderLLMSFull when
it reopens files; update the collection code around where collectedDocs is
populated (the block adding objects with filePath, title, description, pageUrl,
source) to include a resolved fullPath (e.g., resolve with
path.join(config.path, file) or path.resolve) and store it as fullPath on each
item; then ensure renderLLMSFull uses the new fullPath property when reopening
files. Apply the same change to the other collection sites noted (the similar
blocks at lines referenced around 174-180 and 234-238) so all doc entries
include an absolute fullPath.
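A small sketch of the idea behind this finding: keep the relative path for pattern matching and carry a separate absolute path for reading. `toCollectedItem` and its `siteDir` parameter are hypothetical names for this example; how the plugin actually computes `fullPath` is not shown in the review.

```javascript
const path = require("node:path");

// Hypothetical helper: build one collected item with both path flavours.
// `siteDir` stands in for whatever absolute base directory the build uses.
function toCollectedItem(siteDir, docsPath, file, source) {
  return {
    // Relative path, as used for include-order pattern matching.
    filePath: path.join(docsPath, file),
    // Absolute path, safe to reopen regardless of process.cwd().
    contentPath: path.resolve(siteDir, docsPath, file),
    source,
  };
}

const item = toCollectedItem("/srv/site", "docs", "intro.md", "docs");
console.log(path.isAbsolute(item.filePath)); // false
console.log(path.isAbsolute(item.contentPath)); // true
```

Reading from `contentPath` means a later `fs.readFileSync` call does not depend on the directory the build was launched from.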

190-215: ⚠️ Potential issue | 🟡 Minor

Keep legacy llms.txt and markdown copies docs-only.

collectedDocs now contains docs and blog entries. orderDocs() still scans all entries, and writeMarkdownCopies(collectedDocs) will also emit blog markdown copies. Filter legacy outputs to source === "docs" so the existing navigational index remains unchanged.

🛠️ Proposed fix
             function orderDocs(includeOrder) {
                 if (!includeOrder || includeOrder.length === 0) {
                     return [];
                 }
 
+                const docsOnly = collectedDocs.filter(
+                    (doc) => doc.source === "docs"
+                );
                 const matched = new Set();
                 const ordered = [];
                 const duplicates = new Set();
 
                 for (const pattern of includeOrder) {
-                    for (const doc of collectedDocs) {
+                    for (const doc of docsOnly) {
                         const docPath = toPosix(doc.filePath);
                         const pat = toPosix(pattern);
                 if (main) {
-                    writeMarkdownCopies(collectedDocs);
-
-                    // Generate llms-full variants: one for docs, one for blog.
-                    // Agents that cannot browse mid-conversation (Claude Projects,
-                    // Cursor) paste these into their context window for full recall.
                     const docsItems = collectedDocs.filter(
                         (d) => d.source === "docs"
                     );
                     const blogItems = collectedDocs.filter(
                         (d) => d.source === "blog"
                     );
+
+                    writeMarkdownCopies(docsItems);
+
+                    // Generate llms-full variants: one for docs, one for blog.
+                    // Agents that cannot browse mid-conversation (Claude Projects,
+                    // Cursor) paste these into their context window for full recall.

Also applies to: 299-310

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/plugin-generate-llms.js` around lines 190 - 215, orderDocs and the
legacy-markdown/llms emitters are currently iterating over collectedDocs which
includes blogs; restrict legacy outputs to docs-only by filtering for entries
where doc.source === "docs". Update orderDocs (and the similar block around
writeMarkdownCopies / llms generation at the other location) to either accept a
filtered array (e.g., collectedDocs.filter(d => d.source === "docs")) or add a
guard inside the loop (skip if doc.source !== "docs") so ordered, duplicates,
and legacy markdown/llms.txt only reflect docs.
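The filtering suggested here can be sketched as a single partition step whose docs half feeds the legacy outputs. `partitionBySource` and the sample items are illustrative assumptions; only the `source` values "docs" and "blog" come from the PR.

```javascript
// Hypothetical helper: split collected items by their `source` tag so
// legacy outputs (llms.txt, markdown copies) can be fed docs only.
function partitionBySource(items) {
  const docsItems = items.filter((d) => d.source === "docs");
  const blogItems = items.filter((d) => d.source === "blog");
  return { docsItems, blogItems };
}

// Sample data with placeholder paths.
const collected = [
  { filePath: "docs/intro.md", source: "docs" },
  { filePath: "blog/launch.md", source: "blog" },
  { filePath: "docs/guide.md", source: "docs" },
];

const { docsItems, blogItems } = partitionBySource(collected);
console.log(docsItems.map((d) => d.filePath)); // docs entries only
console.log(blogItems.map((d) => d.filePath)); // blog entries only
```

Passing `docsItems` rather than the full collection to the legacy emitters keeps blog entries out of the existing index without touching the new llms-full outputs.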
🧹 Nitpick comments (1)
plugins/plugin-generate-llms.js (1)

312-340: Derive header URLs from siteConfig.url.

The source delimiters already use siteConfig.url; doing the same in the generated headers keeps preview/staging/domain changes consistent.

♻️ Proposed refactor
                     if (docsItems.length > 0) {
+                        const siteUrl = url.replace(/\/$/, "");
                         const header =
                             `# Envio: Full Documentation for LLMs\n\n` +
-                            `> Every page of docs.envio.dev concatenated as markdown, ` +
+                            `> Every page of ${siteUrl} concatenated as markdown, ` +
                             `with per-page source URLs, for direct ingestion into ` +
-                            `LLM context windows. Pair with https://docs.envio.dev/llms.txt ` +
+                            `LLM context windows. Pair with ${siteUrl}/llms.txt ` +
                             `for the navigational index.`;
                     if (blogItems.length > 0) {
+                        const siteUrl = url.replace(/\/$/, "");
                         const header =
                             `# Envio: Full Blog and Case Studies for LLMs\n\n` +
-                            `> Every blog post and case study on docs.envio.dev ` +
+                            `> Every blog post and case study on ${siteUrl} ` +
                             `concatenated as markdown, with per-page source URLs. ` +
-                            `Pair with https://docs.envio.dev/llms-full.txt for ` +
+                            `Pair with ${siteUrl}/llms-full.txt for ` +
                             `technical documentation.`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/plugin-generate-llms.js` around lines 312 - 340, The headers for
llms-full.txt and llms-full-blog.txt hardcode docs.envio.dev; update them to
derive the base URL from the siteConfig.url value instead. In the block that
builds the header strings (around renderLLMSFull calls), reference
siteConfig.url (normalize to remove trailing slash if needed) when composing the
"Pair with ..." and any "source URLs" text so both headers use the configured
site URL; keep the rest of the header text intact and continue writing files to
context.outDir as before.
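As a rough sketch of this refactor, the header can be built from a normalized site URL instead of a hardcoded domain. `buildDocsHeader` is a hypothetical wrapper introduced for the example, and the preview URL is a placeholder; the header wording follows the proposed diff.

```javascript
// Hypothetical wrapper: derive the llms-full.txt header from the configured
// site URL so preview and staging builds reference their own domain.
function buildDocsHeader(siteConfigUrl) {
  // Drop one trailing slash so `${siteUrl}/llms.txt` has no double slash.
  const siteUrl = siteConfigUrl.replace(/\/$/, "");
  return (
    `# Envio: Full Documentation for LLMs\n\n` +
    `> Every page of ${siteUrl} concatenated as markdown, ` +
    `with per-page source URLs, for direct ingestion into ` +
    `LLM context windows. Pair with ${siteUrl}/llms.txt ` +
    `for the navigational index.`
  );
}

// Placeholder preview URL with a trailing slash.
const previewHeader = buildDocsHeader("https://preview.example.com/");
console.log(previewHeader);
```

The same normalized value could be reused for the blog header so both files stay consistent with the source delimiters.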

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c12e8f89-b973-49b5-b1d7-69cef8fdf073

📥 Commits

Reviewing files that changed from the base of the PR and between d4b28da and e7504c3.

📒 Files selected for processing (1)
  • plugins/plugin-generate-llms.js

@Jordy-Baby Jordy-Baby enabled auto-merge (squash) April 24, 2026 10:12
@Jordy-Baby Jordy-Baby merged commit 6d3bd3e into main Apr 24, 2026
3 checks passed
@Jordy-Baby Jordy-Baby deleted the chore/llms-full-generation branch April 24, 2026 11:47
2 participants