feat: generate markdown static files for LLM agent token optimization #2862
base: main
Generate both HTML and Markdown versions of each documentation page to optimize token usage for LLM crawlers and AI agents. Research shows that serving markdown instead of HTML can reduce token consumption by 60-80%, significantly improving efficiency and reducing costs for AI-powered tools accessing documentation.

Reference: https://x.com/cramforce/status/1972430376149913715

## Implementation

- Added post-build hook to convert HTML pages to clean Markdown format
- Configured nginx content negotiation to serve markdown when requested
- Added validation script to ensure markdown generation completeness
- Integrated markdown generation into CI/CD pipeline
- Added UI button with markdown icon for user access

## Usage

**Via content negotiation (for agents/crawlers):**

```bash
curl -H "Accept: text/markdown" https://ably.com/docs/channels
```

**Direct file access:**

```bash
curl https://ably.com/docs/channels/index.md
```

**Via UI:** Click the Markdown icon button in the "Open In" section on any page

## Technical Details

- Uses Turndown library for HTML to Markdown conversion
- Preserves code block language annotations
- Removes navigation, headers, footers and UI chrome
- Markdown files located at `/docs/{page-path}/index.md`
- Skips redirect pages (324 redirects detected)
- Successfully generates markdown for 209/210 content pages
- No frontmatter - clean markdown content only
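To make the content negotiation step concrete, here is a minimal TypeScript sketch of the decision it describes. The PR itself does this in nginx configuration, so this is only an illustration of the logic, not the actual implementation. The names `parseAccept`, `acceptsMarkdown` and `resolveDocPath` are hypothetical, and the `index.html` fallback path is an assumption.

```typescript
// Illustrative only: the PR performs this routing in nginx, not in TypeScript.
// All function names here are hypothetical.

interface MediaRange {
  type: string;
  q: number;
}

// Parse an Accept header into media ranges with their quality values.
const parseAccept = (header: string): MediaRange[] =>
  header
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const pieces = part.split(';').map((piece) => piece.trim());
      const type = (pieces[0] ?? '').toLowerCase();
      const qParam = pieces.slice(1).find((piece) => piece.startsWith('q='));
      const parsed = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
      // A malformed q value falls back to the default quality of 1.
      return { type, q: Number.isNaN(parsed) ? 1 : parsed };
    });

// True only when text/markdown is listed explicitly with a non-zero quality.
// Wildcards such as */* deliberately do not count in this sketch.
const acceptsMarkdown = (acceptHeader: string): boolean =>
  parseAccept(acceptHeader).some((range) => range.type === 'text/markdown' && range.q > 0);

// Map a docs URL path to the file that would be served (assumed layout).
const resolveDocPath = (urlPath: string, acceptHeader: string): string => {
  const trimmed = urlPath.replace(/\/$/, '');
  return acceptsMarkdown(acceptHeader) ? `${trimmed}/index.md` : `${trimmed}/index.html`;
};
```

Requiring an explicit `text/markdown` entry keeps ordinary browsers, which send `*/*`, on the HTML version.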
Pull Request Overview
This PR implements markdown static file generation alongside HTML files to optimize token usage for LLM agents and AI-powered tools accessing the documentation. Research indicates this can reduce token consumption by 60-80%.
- Added post-build markdown generation using Turndown library to convert HTML to clean markdown
- Implemented nginx content negotiation to serve markdown when requested via Accept header
- Added UI button and validation scripts for markdown file access and completeness checking
Reviewed Changes
Copilot reviewed 8 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/components/Layout/RightSidebar.tsx` | Adds markdown download link with icon and validation logic |
| `data/onPostBuild/generateMarkdown.ts` | Core markdown generation script that converts HTML to clean markdown |
| `data/onPostBuild/index.ts` | Integrates markdown generation into the post-build pipeline |
| `config/nginx.conf.erb` | Implements content negotiation for serving markdown files |
| `config/mime.types` | Adds text/markdown MIME type support |
| `package.json` | Adds validate-markdown script |
| `.circleci/config.yml` | Integrates markdown validation into CI pipeline |
| `README.md` | Documents the markdown generation feature |
fetch(markdownUrl, { method: 'HEAD' })
  .then((response) => {
    if (!response.ok) {
      e.preventDefault();
      alert(
        'Markdown files are only available in production builds. Run "yarn build" to generate them.',
      );
    }
  })
  .catch(() => {
    e.preventDefault();
    alert('Markdown files are only available in production builds. Run "yarn build" to generate them.');
  });
The fetch request on every click creates unnecessary network overhead. Consider checking if the environment is development and showing the alert immediately, or cache the availability status after the first check.
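The caching option mentioned in this comment could look roughly like the sketch below. It is not code from the PR: `createAvailabilityCache` and the `Probe` type are hypothetical names, and the probe is injected so the example does not depend on a browser `fetch`.

```typescript
// Hypothetical sketch: remember the result of the first availability check per URL.
type Probe = (url: string) => Promise<boolean>;

const createAvailabilityCache = (probe: Probe) => {
  const cache = new Map<string, Promise<boolean>>();

  return (url: string): Promise<boolean> => {
    let result = cache.get(url);
    if (!result) {
      // Store the promise itself so concurrent clicks share one request.
      // A rejected probe is treated as "not available".
      result = probe(url).catch(() => false);
      cache.set(url, result);
    }
    return result;
  };
};

// Possible wiring in the browser (assumption, not from the PR):
// const isAvailable = createAvailabilityCache((url) =>
//   fetch(url, { method: 'HEAD' }).then((response) => response.ok),
// );
```

Caching the promise rather than the resolved value means a second click made before the first request finishes does not trigger another request.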
data/onPostBuild/generateMarkdown.ts
Outdated
// Create markdown frontmatter (disabled - returns empty string)
const createFrontmatter = (title: string, description: string): string => {
  return '';
};

// Convert HTML content to Markdown
const convertToMarkdown = (htmlContent: string, title: string, description: string): string => {
  const turndownService = createTurndownService();

  // Add frontmatter
  const frontmatter = createFrontmatter(title, description);

  // Convert HTML to Markdown
  const markdown = turndownService.turndown(htmlContent);

  // Clean up excessive newlines
  const cleanedMarkdown = markdown.replace(/\n{3,}/g, '\n\n');

  return frontmatter + cleanedMarkdown;
This function always returns an empty string and ignores its parameters. If frontmatter is intentionally disabled, consider removing this function and its usage, or add a comment explaining why it's kept for future use.
// Convert HTML content to Markdown
const convertToMarkdown = (htmlContent: string, title: string, description: string): string => {
  const turndownService = createTurndownService();

  // Convert HTML to Markdown
  const markdown = turndownService.turndown(htmlContent);

  // Clean up excessive newlines
  const cleanedMarkdown = markdown.replace(/\n{3,}/g, '\n\n');

  return cleanedMarkdown;
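For readers unfamiliar with the cleanup step in this suggestion, here is a small standalone sketch of what the `/\n{3,}/g` replacement does. The helper name `collapseBlankLines` is hypothetical; only the regex comes from the code under review.

```typescript
// Collapse any run of three or more newlines into exactly two,
// so at most one blank line separates blocks of markdown.
const collapseBlankLines = (markdown: string): string => markdown.replace(/\n{3,}/g, '\n\n');

const before = '# Channels\n\n\n\nChannels organize messages.\n\nNext paragraph.';
const after = collapseBlankLines(before);
// Single blank lines are left untouched; only the longer run is shortened.
```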
data/onPostBuild/generateMarkdown.ts
Outdated
if (html.length < 1000 && html.includes('window.location.href')) {
  return null; // Skip redirect pages
}
Magic number 1000 should be extracted to a named constant like `REDIRECT_PAGE_MAX_SIZE` to improve code readability and maintainability.
data/onPostBuild/generateMarkdown.ts
Outdated
if (mainContent && mainContent.trim().length < 100) {
  return null; // Skip pages with minimal content
}
Magic number 100 should be extracted to a named constant like `MIN_CONTENT_LENGTH` to improve code readability and maintainability.
- Remove fetch overhead in RightSidebar by checking NODE_ENV instead
- Simplify convertToMarkdown by removing unused createFrontmatter function
- Extract magic numbers to named constants (REDIRECT_PAGE_MAX_SIZE, MIN_CONTENT_LENGTH)
- Apply constants consistently across generateMarkdown.ts and validate-markdown.ts
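As a rough illustration of the constant extraction described in this commit, the two checks could be written as below. The constant names and the values 1000 and 100 come from the review thread; the helper functions `isRedirectPage` and `hasMeaningfulContent` are hypothetical and not necessarily how the PR structures the code.

```typescript
// Thresholds taken from the reviewed hunks; the names follow the review suggestions.
const REDIRECT_PAGE_MAX_SIZE = 1000;
const MIN_CONTENT_LENGTH = 100;

// Hypothetical helper: small pages that set window.location.href are treated as redirects.
const isRedirectPage = (html: string): boolean =>
  html.length < REDIRECT_PAGE_MAX_SIZE && html.includes('window.location.href');

// Hypothetical helper: content counts as meaningful once it reaches the minimum length.
// Unlike the reviewed hunk, this sketch also treats null or empty content as not meaningful.
const hasMeaningfulContent = (mainContent: string | null): boolean =>
  mainContent !== null && mainContent.trim().length >= MIN_CONTENT_LENGTH;
```

Sharing the constants between the generator and the validator keeps both scripts agreeing on which pages are expected to have a markdown file.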
Few things to look into here. The actual MD generation looks decent, obvs it's not optimised for human reading (and is mostly to get token consumption down), and it could be cleaned up a bit more - but it's not super necessary.
Also the newly-introduced CI check fails, so that should be looked at as well.
The normal approach is to add an icon to ably ui (no markdown one exists atm unfortunately), bump the version and use that here with the `Icon` component instead of adding local assets (not sure why there's two) - I can look into that and prepare a new version of ably ui, then these assets can be removed.
The motivation is that we can have consistent icons across everything, and also standardise the assets themselves (size, fill/stroke colours etc)
> The normal approach is to add an icon to ably ui (no markdown one exists atm unfortunately), bump the version and use that here with the Icon component instead of adding local assets (not sure why there's two) - I can look into that and prepare a new version of ably ui, then these assets can be removed.

Ok, fine. Please can I leave that with you to apply to this repo once that is done?
    />
  }
>
  View in Markdown
"View in Markdown" is inconsistent with the rest of the tooltips here as the surrounding context is "Open in ...", so "Open in View in Markdown" isn't right. Just "Markdown" would be better here
I don't agree, because you're not opening in Markdown. I didn't want to fundamentally change the UI, so felt this is in fact better.
// In development mode, markdown files aren't generated, so show alert immediately
if (process.env.NODE_ENV === 'development') {
  e.preventDefault();
  alert('Markdown files are only available in production builds. Run "yarn build" to generate them.');
Alerts are pretty poor UX, but given this is just for development, it's not a view I'll push strongly here. I would clarify that `yarn serve` is needed in addition to `yarn build` to run a local "production" version
Yeh, agreed, but it's for development.
There is no point telling someone they need to run `yarn serve` in addition because they wouldn't see this if they weren't running a server.
href={`${location.pathname.replace(/\/$/, '')}/index.md`}
className="flex h-5 ui-theme-dark group/markdown-link cursor-pointer"
onClick={(e) => {
  // In development mode, markdown files aren't generated, so show alert immediately
Let's get rid of this comment, the code is descriptive enough
Fair
<a
  href={`${location.pathname.replace(/\/$/, '')}/index.md`}
  className="flex h-5 ui-theme-dark group/markdown-link cursor-pointer"
  onClick={(e) => {
Consider opening this in a new tab like the other links. It's not external, but it is a disruptive loss of context to replace the window you're on
Mm, questionable, but I don't feel strongly so will update.
data/onPostBuild/generateMarkdown.ts
Outdated
  },
});

// Remove navigation, headers, footers, and other UI elements
I know this is how Claude operates, but we should remove very obvious comments, they're just redundant.
Fine
  return null; // Skip redirect pages
}

const $ = cheerio.load(html);
I almost had to break glass in case of jQuery here 😄
data/onPostBuild/generateMarkdown.ts
Outdated
const $ = cheerio.load(html);

// Remove unwanted elements
$('nav, header, footer, script, style, noscript, .sidebar, .navigation').remove();
I don't think sidebar and navigation exist as selectors to kill, not sure if there was another intention
I don't follow what you mean.
// Check if content is meaningful (more than just whitespace/empty tags)
if (mainContent && mainContent.trim().length < MIN_CONTENT_LENGTH) {
  return null; // Skip pages with minimal content
If we skip stuff for certain pages, how do we handle this in the frontend? The Open in Markdown button is available all the time, but the pages may not be
Yeh, fair, this is arguably the reason the builds are failing; there must be one page that is not "meaningful"
data/onPostBuild/generateMarkdown.ts
Outdated
if (!htmlContent) {
  reporter.warn(`${REPORTER_PREFIX} Could not extract content for ${slug}`);
  failCount++;
Is this a failure? The asset is not generated indeed but there's a difference between "this failed" and "the content has been skipped"
I think it is a failure because we're assuming all pages do have content (your point above). So I think this is valid, unless we have conditional logic for when we show the View Markdown option
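One way to make the "failed" versus "skipped" distinction in this thread explicit is to model the outcome of each page as a tagged result and tally the outcomes separately. The sketch below is an assumption about how that could look; `PageResult`, `summarize` and `shouldFailBuild` are hypothetical names, not code from the PR.

```typescript
// Hypothetical model: every page ends in exactly one of three outcomes.
type PageResult =
  | { status: 'generated'; slug: string }
  | { status: 'skipped'; slug: string; reason: string }
  | { status: 'failed'; slug: string; error: string };

interface Summary {
  generated: number;
  skipped: number;
  failed: number;
}

// Count each outcome separately so logs can report skips without calling them failures.
const summarize = (results: PageResult[]): Summary => {
  const summary: Summary = { generated: 0, skipped: 0, failed: 0 };
  for (const result of results) {
    summary[result.status] += 1;
  }
  return summary;
};

// Only genuine failures should break the build in this sketch.
const shouldFailBuild = (summary: Summary): boolean => summary.failed > 0;
```

Whether a skipped page should also hide the markdown link in the UI is the open question raised above and is not addressed by this sketch.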
Yup, I saw that. I wanted to get feedback on this PR before I finalise any last issue (only one page fails to generate out of 200+)
Thanks @jamiehenson for the feedback, thorough review. Please see my comments, I'm keen to:
- Use REDIRECT_PAGE_MAX_SIZE constant instead of hardcoded value
- Fix slug extraction regex to properly handle root index.html file
- Ensures all 210 content pages generate markdown files successfully
- Use regex for more robust redirect page detection
- Remove unused title/description extraction (frontmatter disabled)
- Remove non-existent .sidebar and .navigation CSS selectors
- Distinguish between skipped pages and failures in logging
- Remove redundant code comments
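The commit does not show the regex it uses, so the following is only a guess at what regex-based redirect detection might look like. `REDIRECT_PATTERN` and `looksLikeRedirect` are hypothetical; the size threshold reuses the `REDIRECT_PAGE_MAX_SIZE` value of 1000 from the earlier hunk.

```typescript
// Assumed threshold, matching the value in the reviewed code.
const REDIRECT_PAGE_MAX_SIZE = 1000;

// Hypothetical pattern: match an assignment to window.location.href or a call to
// window.location.replace(...), rather than any mention of the property name.
const REDIRECT_PATTERN = /window\.location\.(?:href\s*=|replace\s*\()/;

const looksLikeRedirect = (html: string): boolean =>
  html.length < REDIRECT_PAGE_MAX_SIZE && REDIRECT_PATTERN.test(html);
```

Compared with a plain `includes('window.location.href')` check, a pattern like this avoids flagging short pages that merely mention the property in prose.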
- Change markdown file paths from /docs/thing/index.md to /docs/thing.md
- Update validator to use consistent regex for redirect detection
- Align validator markdown path logic with generation script
- Improves URL ergonomics and removes dated index convention
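The path change and the earlier slug fix can be illustrated with a few small helpers. These are hypothetical sketches, not the functions in `generateMarkdown.ts`: `slugFromHtmlPath`, `legacyMarkdownUrl` and `markdownUrl` are assumed names, and the relative-path input format is an assumption.

```typescript
// Hypothetical: derive a slug from a built HTML path relative to the docs root.
// Anchoring on (^|/) lets the root index.html map to an empty slug.
const slugFromHtmlPath = (htmlPath: string): string => htmlPath.replace(/(^|\/)index\.html$/, '');

// Old convention from the PR description: /docs/thing/index.md
const legacyMarkdownUrl = (pagePath: string): string => `${pagePath.replace(/\/$/, '')}/index.md`;

// New convention from this commit: /docs/thing.md
const markdownUrl = (pagePath: string): string => `${pagePath.replace(/\/$/, '')}.md`;
```

Keeping one helper for the path logic, imported by both the generator and the validator, is one way to keep the two aligned as the commit describes.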
Internal Slack conversation: https://ably-real-time.slack.com/archives/C07C48W7K1A/p1759170942282069
Note: there is a corresponding PR in the website which ensures the `Accept: text/markdown` header is used to route to the markdown file.