Member

@mattheworiordan mattheworiordan commented Sep 30, 2025

Generate both HTML and Markdown versions of each documentation page to optimize token usage for LLM crawlers and AI agents. Research shows that serving markdown instead of HTML can reduce token consumption by 60-80%, significantly improving efficiency and reducing costs for AI-powered tools accessing documentation.

Reference: https://x.com/cramforce/status/1972430376149913715

Internal Slack conversation: https://ably-real-time.slack.com/archives/C07C48W7K1A/p1759170942282069

Implementation

  • Added post-build hook to convert HTML pages to clean Markdown format
  • Configured nginx content negotiation to serve markdown when requested
  • Added validation script to ensure markdown generation completeness
  • Integrated markdown generation into CI/CD pipeline
  • Added UI button with markdown icon for user access (see below, common pattern in other sites)
[Screenshot: MO screenshot 2025-09-30 at 21 33 06]

Note that there is a corresponding PR in the website, which ensures that the Accept: text/markdown header is used to route requests to the markdown file.

Usage

Via content negotiation (for agents/crawlers):

curl -H "Accept: text/markdown" https://ably.com/docs/channels

Direct file access:

curl https://ably.com/docs/channels/index.md

Via UI:
Click the Markdown icon button in the "Open In" section on any page
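
As a rough illustration of the content negotiation described above, the routing decision might look like the TypeScript sketch below. This is only a model of the logic: the PR implements it in nginx configuration, and the helper names `acceptsMarkdown` and `resolveDocFile` are hypothetical. The `index.md` path convention is the one listed under Technical Details.

```typescript
// Hypothetical helper (not part of the PR): true when the Accept header lists text/markdown.
// Simplification: q-values are ignored, except that q=0 is treated as "not acceptable".
function acceptsMarkdown(acceptHeader: string | undefined): boolean {
  if (!acceptHeader) return false;
  return acceptHeader.split(',').some((entry) => {
    const [mediaType, ...params] = entry.split(';').map((part) => part.trim().toLowerCase());
    if (mediaType !== 'text/markdown') return false;
    const q = params.find((param) => param.startsWith('q='));
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}

// Hypothetical helper: maps a docs URL path to the static file that would be served.
function resolveDocFile(urlPath: string, acceptHeader?: string): string {
  const base = urlPath.replace(/\/$/, '');
  return acceptsMarkdown(acceptHeader) ? `${base}/index.md` : `${base}/index.html`;
}
```

Under these assumptions, a request for /docs/channels with Accept: text/markdown resolves to /docs/channels/index.md, and any other request falls back to the HTML file.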

Technical Details

  • Uses Turndown library for HTML to Markdown conversion
  • Preserves code block language annotations
  • Removes navigation, headers, footers and UI chrome
  • Markdown files located at /docs/{page-path}/index.md
  • Skips redirect pages (324 redirects detected)
  • Successfully generates markdown for 209/210 content pages
  • No frontmatter - clean markdown content only

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements markdown static file generation alongside HTML files to optimize token usage for LLM agents and AI-powered tools accessing the documentation. Research indicates this can reduce token consumption by 60-80%.

  • Added post-build markdown generation using Turndown library to convert HTML to clean markdown
  • Implemented nginx content negotiation to serve markdown when requested via Accept header
  • Added UI button and validation scripts for markdown file access and completeness checking

Reviewed Changes

Copilot reviewed 8 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/components/Layout/RightSidebar.tsx | Adds markdown download link with icon and validation logic |
| data/onPostBuild/generateMarkdown.ts | Core markdown generation script that converts HTML to clean markdown |
| data/onPostBuild/index.ts | Integrates markdown generation into the post-build pipeline |
| config/nginx.conf.erb | Implements content negotiation for serving markdown files |
| config/mime.types | Adds text/markdown MIME type support |
| package.json | Adds validate-markdown script |
| .circleci/config.yml | Integrates markdown validation into CI pipeline |
| README.md | Documents the markdown generation feature |

Comment on lines 330 to 342
fetch(markdownUrl, { method: 'HEAD' })
  .then((response) => {
    if (!response.ok) {
      e.preventDefault();
      alert(
        'Markdown files are only available in production builds. Run "yarn build" to generate them.',
      );
    }
  })
  .catch(() => {
    e.preventDefault();
    alert('Markdown files are only available in production builds. Run "yarn build" to generate them.');
  });

Copilot AI Sep 30, 2025

The fetch request on every click creates unnecessary network overhead. Consider checking if the environment is development and showing the alert immediately, or cache the availability status after the first check.
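
The caching option mentioned in this comment could be sketched roughly as follows. The names `HeadCheck` and `createAvailabilityCache` are hypothetical and do not appear in the PR; the check itself is injected so the sketch does not depend on a network call.

```typescript
// Hypothetical type: resolves to true when the markdown file exists at the given URL.
type HeadCheck = (url: string) => Promise<boolean>;

// Wraps a check so that each URL is probed at most once; later calls reuse the stored promise.
function createAvailabilityCache(check: HeadCheck): HeadCheck {
  const cache = new Map<string, Promise<boolean>>();
  return (url: string): Promise<boolean> => {
    let pending = cache.get(url);
    if (!pending) {
      // A rejected check is stored as "unavailable" (a simplification: it is never retried).
      pending = check(url).catch(() => false);
      cache.set(url, pending);
    }
    return pending;
  };
}

// Possible wiring in the browser (an assumption, not executed here):
// const isAvailable = createAvailabilityCache((url) =>
//   fetch(url, { method: 'HEAD' }).then((response) => response.ok));
```

Note that a cached result arrives asynchronously, so a click handler could not rely on it to call preventDefault in time; that may be one reason a synchronous environment check is the simpler option.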

Copilot uses AI. Check for mistakes.

Comment on lines 93 to 111
// Create markdown frontmatter (disabled - returns empty string)
const createFrontmatter = (title: string, description: string): string => {
  return '';
};

// Convert HTML content to Markdown
const convertToMarkdown = (htmlContent: string, title: string, description: string): string => {
  const turndownService = createTurndownService();

  // Add frontmatter
  const frontmatter = createFrontmatter(title, description);

  // Convert HTML to Markdown
  const markdown = turndownService.turndown(htmlContent);

  // Clean up excessive newlines
  const cleanedMarkdown = markdown.replace(/\n{3,}/g, '\n\n');

  return frontmatter + cleanedMarkdown;

Copilot AI Sep 30, 2025

This function always returns an empty string and ignores its parameters. If frontmatter is intentionally disabled, consider removing this function and its usage, or add a comment explaining why it's kept for future use.

Suggested change
- // Create markdown frontmatter (disabled - returns empty string)
- const createFrontmatter = (title: string, description: string): string => {
-   return '';
- };
- // Convert HTML content to Markdown
- const convertToMarkdown = (htmlContent: string, title: string, description: string): string => {
-   const turndownService = createTurndownService();
-   // Add frontmatter
-   const frontmatter = createFrontmatter(title, description);
-   // Convert HTML to Markdown
-   const markdown = turndownService.turndown(htmlContent);
-   // Clean up excessive newlines
-   const cleanedMarkdown = markdown.replace(/\n{3,}/g, '\n\n');
-   return frontmatter + cleanedMarkdown;
+ // Convert HTML content to Markdown
+ const convertToMarkdown = (htmlContent: string, title: string, description: string): string => {
+   const turndownService = createTurndownService();
+   // Convert HTML to Markdown
+   const markdown = turndownService.turndown(htmlContent);
+   // Clean up excessive newlines
+   const cleanedMarkdown = markdown.replace(/\n{3,}/g, '\n\n');
+   return cleanedMarkdown;

Comment on lines 62 to 64
if (html.length < 1000 && html.includes('window.location.href')) {
  return null; // Skip redirect pages
}

Copilot AI Sep 30, 2025

Magic number 1000 should be extracted to a named constant like REDIRECT_PAGE_MAX_SIZE to improve code readability and maintainability.

Comment on lines 82 to 84
if (mainContent && mainContent.trim().length < 100) {
  return null; // Skip pages with minimal content
}

Copilot AI Sep 30, 2025

Magic number 100 should be extracted to a named constant like MIN_CONTENT_LENGTH to improve code readability and maintainability.

- Remove fetch overhead in RightSidebar by checking NODE_ENV instead
- Simplify convertToMarkdown by removing unused createFrontmatter function
- Extract magic numbers to named constants (REDIRECT_PAGE_MAX_SIZE, MIN_CONTENT_LENGTH)
- Apply constants consistently across generateMarkdown.ts and validate-markdown.ts
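
A minimal sketch of what the extracted constants from this commit might look like when shared between the generator and the validator. The constant names come from the commit message and the values from the diff excerpts above; the helper functions `isRedirectPage` and `hasMeaningfulContent` are hypothetical. Treating a null `mainContent` as not meaningful is also an assumption that differs slightly from the excerpt, which only skips pages when content is present but short.

```typescript
// Values taken from the reviewed diff; names taken from the commit message.
const REDIRECT_PAGE_MAX_SIZE = 1000; // HTML shorter than this (in characters) may be a redirect stub
const MIN_CONTENT_LENGTH = 100; // minimum length of trimmed main content

// Hypothetical helper: mirrors the redirect check shown in the review excerpt.
function isRedirectPage(html: string): boolean {
  return html.length < REDIRECT_PAGE_MAX_SIZE && html.includes('window.location.href');
}

// Hypothetical helper: assumes missing content counts as "not meaningful".
function hasMeaningfulContent(mainContent: string | null): boolean {
  return mainContent !== null && mainContent.trim().length >= MIN_CONTENT_LENGTH;
}
```

Keeping both thresholds in one place means the generation and validation scripts cannot drift apart on what counts as a redirect or an empty page.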
@m-hulbert m-hulbert added the review-app Create a Heroku review app label Oct 1, 2025
@ably-ci ably-ci temporarily deployed to ably-docs-markdown-supp-aqgfks October 1, 2025 08:27 Inactive

Member

@jamiehenson jamiehenson left a comment

Few things to look into here. The actual MD generation looks decent, obvs it's not optimised for human reading (and is mostly to get token consumption down), and it could be cleaned up a bit more - but it's not super necessary.

Also the newly-introduced CI check fails, so that should be looked at as well.

Member

The normal approach is to add an icon to ably ui (no markdown one exists atm unfortunately), bump the version and use that here with the Icon component instead of adding local assets (not sure why there's two) - I can look into that and prepare a new version of ably ui, then these assets can be removed.

The motivation is that we can have consistent icons across everything, and also standardise the assets themselves (size, fill/stroke colours etc)

Member Author

> The normal approach is to add an icon to ably ui (no markdown one exists atm unfortunately), bump the version and use that here with the Icon component instead of adding local assets (not sure why there's two) - I can look into that and prepare a new version of ably ui, then these assets can be removed.

Ok, fine. Please can I leave that with you to apply to this repo once that is done?

/>
}
>
View in Markdown

Member

"View in Markdown" is inconsistent with the rest of the tooltips here as the surrounding context is "Open in ...", so "Open in View in Markdown" isn't right. Just "Markdown" would be better here

Member Author

I don't agree, because you're not opening in Markdown. I didn't want to fundamentally change the UI, so felt this is in fact better.

// In development mode, markdown files aren't generated, so show alert immediately
if (process.env.NODE_ENV === 'development') {
  e.preventDefault();
  alert('Markdown files are only available in production builds. Run "yarn build" to generate them.');

Member

Alerts are pretty poor UX, but given this is just for development, it's not a view I'll push strongly here. I would clarify that yarn serve is needed in addition to yarn build to run a local "production" version

Member Author

Yeh, agreed, but it's for development.

There is no point telling someone they need to run yarn serve in addition because they wouldn't see this if they weren't running a server.

href={`${location.pathname.replace(/\/$/, '')}/index.md`}
className="flex h-5 ui-theme-dark group/markdown-link cursor-pointer"
onClick={(e) => {
// In development mode, markdown files aren't generated, so show alert immediately

Member

Let's get rid of this comment, the code is descriptive enough

Member Author

Fair

<a
href={`${location.pathname.replace(/\/$/, '')}/index.md`}
className="flex h-5 ui-theme-dark group/markdown-link cursor-pointer"
onClick={(e) => {

Member

Consider opening this in a new tab like the other links. It's not external, but it is a disruptive loss of context to replace the window you're on

Member Author

Mm, questionable, but I don't feel strongly so will update.

},
});

// Remove navigation, headers, footers, and other UI elements

Member

I know this is how Claude operates, but we should remove very obvious comments, they're just redundant.

Member Author

Fine

return null; // Skip redirect pages
}

const $ = cheerio.load(html);

Member

I almost had to break glass in case of jQuery here 😄

const $ = cheerio.load(html);

// Remove unwanted elements
$('nav, header, footer, script, style, noscript, .sidebar, .navigation').remove();

Member

I don't think sidebar and navigation exist as selectors to kill, not sure if there was another intention

Member Author

I don't follow what you mean.


// Check if content is meaningful (more than just whitespace/empty tags)
if (mainContent && mainContent.trim().length < MIN_CONTENT_LENGTH) {
  return null; // Skip pages with minimal content

Member

If we skip stuff for certain pages, how do we handle this in the frontend? The Open in Markdown button is available all the time, but the pages may not be

Member Author

Yeh, fair, this is arguably the reason the builds are failing; there must be one page that is not "meaningful".


if (!htmlContent) {
  reporter.warn(`${REPORTER_PREFIX} Could not extract content for ${slug}`);
  failCount++;

Member

Is this a failure? The asset is not generated indeed but there's a difference between "this failed" and "the content has been skipped"

Member Author

I think it is a failure because we're assuming all pages do have content (your point above). So I think this is valid, unless we have conditional logic for when we show the View Markdown option
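
One way to make the skipped-versus-failed distinction discussed in this thread explicit is a tagged result per page. The sketch below is illustrative only: the `PageResult` type, its field names and the `summarise` helper are hypothetical and not taken from the PR.

```typescript
// Hypothetical result type: each page ends up in exactly one of three states.
type PageResult =
  | { status: 'generated'; slug: string; markdown: string }
  | { status: 'skipped'; slug: string; reason: 'redirect' | 'minimal-content' }
  | { status: 'failed'; slug: string; error: string };

// Counts results per status so logging and CI can treat skips and failures differently.
function summarise(results: PageResult[]): { generated: number; skipped: number; failed: number } {
  const counts = { generated: 0, skipped: 0, failed: 0 };
  for (const result of results) {
    counts[result.status]++;
  }
  return counts;
}
```

With a shape like this, a CI check could fail only when `failed` is non-zero, and the list of skipped slugs could in principle drive the conditional display of the View in Markdown link.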

@mattheworiordan
Member Author

> Also the newly-introduced CI check fails, so that should be looked at as well.

Yup, I saw that. I wanted to get feedback on this PR before I finalise any last issues (only one page fails to generate out of 200+).

@mattheworiordan
Member Author

Thanks @jamiehenson for the feedback, thorough review.

Please see my comments, I'm keen to:

  • Get your input on the language issue. However, I'd like to understand whether the site is currently discoverable by language by LLMs/crawlers anyway. I recall that some time back I recorded issues with this when trying to crawl the site for LLMs.txt. Can I get an update on the status of that, and your thoughts on what I proposed?
  • Do you know why https://ably-docs-markdown-supp-aqgfks.herokuapp.com/docs/getting-started/setup?lang=java gives an error? I'd prefer not to investigate that issue (it seems unrelated to any changes I have made), but it is a blocker to fixing the above issue (at least in testing on the staging site).
  • FWIW. Whilst I appreciate your feedback and will address it, I do think just landing this so that LLMs (not humans) can start using this is more important than getting this into a great state. I appreciate equally that you want code to improve not get worse and understand that, but I'm leaning far more towards getting shit done given the low cost of changing things with LLM, as opposed to getting shit done with code we really like for these non-critical improvements. I recognise you may not agree :)

- Use REDIRECT_PAGE_MAX_SIZE constant instead of hardcoded value
- Fix slug extraction regex to properly handle root index.html file
- Ensures all 210 content pages generate markdown files successfully

- Use regex for more robust redirect page detection
- Remove unused title/description extraction (frontmatter disabled)
- Remove non-existent .sidebar and .navigation CSS selectors
- Distinguish between skipped pages and failures in logging
- Remove redundant code comments

- Change markdown file paths from /docs/thing/index.md to /docs/thing.md
- Update validator to use consistent regex for redirect detection
- Align validator markdown path logic with generation script
- Improves URL ergonomics and removes dated index convention
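
The path change from /docs/thing/index.md to /docs/thing.md, together with the root index.html case mentioned in the commits above, might be sketched as follows. The function name `markdownPathFor` is hypothetical, and mapping the root page to `index.md` is an assumption; the thread does not say what the actual script produces for the root.

```typescript
// Hypothetical helper: derives the markdown output path from a built HTML path.
function markdownPathFor(htmlPath: string): string {
  // Strip a trailing "index.html" only when it is a whole path segment,
  // then drop any trailing slash to obtain the page slug.
  const slug = htmlPath.replace(/(^|\/)index\.html$/, '').replace(/\/$/, '');
  // Assumption: the root page has an empty slug and is written as index.md.
  return slug === '' ? 'index.md' : `${slug}.md`;
}
```

Anchoring the pattern on a path segment boundary avoids mangling a page whose file name merely ends in "index.html".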