
fix: improve markdown conversion speed #1053

Merged
amhsirak merged 2 commits into develop from fast-md on May 7, 2026

Conversation

RohitR311 (Collaborator) commented May 6, 2026

What does this PR do?

These changes make the HTML-to-markdown conversion faster by reusing a single shared converter instance across all pages instead of rebuilding it from scratch each time, and by cleaning up noisy page elements in bulk rather than one by one. The output is the same clean markdown, just produced more efficiently. The gain is most noticeable on crawl robots that process many pages in sequence, where the savings compound with each page visited.
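The "build once, reuse everywhere" idea can be sketched in isolation. This is a minimal illustration only: `Converter`, `sharedConverter`, and `toMarkdown` are hypothetical stand-ins (the PR's real converter is a TurndownService instance), and the tag-stripping body is a placeholder rather than the actual conversion logic.

```typescript
// Hypothetical stand-in for a converter that is expensive to configure.
class Converter {
  static built = 0; // counts constructions, for illustration only

  constructor() {
    Converter.built += 1;
  }

  convert(html: string): string {
    // Placeholder logic: strip tags. Not the real conversion.
    return html.replace(/<[^>]+>/g, "").trim();
  }
}

// Built once when the module loads, then reused by every call.
const sharedConverter = (() => {
  const c = new Converter();
  // One-time rule registration would go here.
  return c;
})();

function toMarkdown(html: string): string {
  return sharedConverter.convert(html);
}

const pages = ["<p>one</p>", "<p>two</p>", "<p>three</p>"];
const results = pages.map(toMarkdown);
```

After converting three pages, `Converter.built` is still 1, which is where the per-page saving would come from on a long crawl.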

Summary by CodeRabbit

Release Notes

  • Improvements
    • Enhanced HTML-to-Markdown conversion with improved link and image handling
    • Better extraction and formatting of web page content
    • More robust error handling and whitespace cleanup
    • Improved processing of various HTML structures for reliable conversion

coderabbitai Bot commented May 6, 2026

Walkthrough

This PR refactors the HTML-to-Markdown conversion pipeline in server/src/markdownify/markdown.ts by centralizing TurndownService configuration, adding selector-driven HTML tidying, introducing post-processing steps, and rewriting link-fixing logic to handle code blocks safely.
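One way to keep a link-fixing pass away from code blocks is to track fence state line by line. The sketch below is an assumption about the general technique, not the PR's `fixBrokenLinks()` implementation: `mapOutsideFences` and `isFenceLine` are hypothetical names, and toggling on every fence line is a simplification of the CommonMark rules for matching fence characters and lengths.

```typescript
// Built at runtime so this sketch never contains a literal fence line.
const FENCE = "`".repeat(3);

function isFenceLine(line: string): boolean {
  const t = line.trimStart();
  return t.startsWith(FENCE) || t.startsWith("~~~");
}

// Apply `transform` to every line that is not inside a fenced code block.
// Fence lines themselves are passed through untouched.
function mapOutsideFences(
  markdown: string,
  transform: (line: string) => string,
): string {
  let inFence = false;
  return markdown
    .split("\n")
    .map((line) => {
      if (isFenceLine(line)) {
        inFence = !inFence; // opening or closing fence
        return line;
      }
      return inFence ? line : transform(line);
    })
    .join("\n");
}

// Upper-casing stands in for a real link rewrite so the effect is easy to see.
const sample = ["see [docs](/a)", FENCE + "ts", "see [docs](/a)", FENCE, "tail"].join("\n");
const shouted = mapOutsideFences(sample, (line) => line.toUpperCase());
```

Only the first and last lines of `sample` are changed; the line between the fences is returned as written.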

Changes

HTML to Markdown Conversion Pipeline

| Layer / File(s) | Summary |
|---|---|
| Module Initialization: server/src/markdownify/markdown.ts (lines 7–39) | Global _baseUrl holder and IIFE-wrapped _turndown instance created with configured options (forceAtxHeadings, truncate-svg) and extended rules including superscript (`<sup>` rendered as `^content^`). |
| Selector-Based Helpers: server/src/markdownify/markdown.ts (lines 112–133) | Defined TECHNICAL_SELECTOR, INNER_NOISE_SELECTOR, and UI_ARTIFACTS constants to drive content pruning and noise removal across tidying steps. |
| HTML Preprocessing: server/src/markdownify/markdown.ts (lines 170–223) | New tidyHtml() function introduced with selector-driven removal of technical elements, page chrome (header, footer, nav, aside), ARIA roles, and noise. Content selection flow revised to extract $content, prune noise, strip artifacts, and derive title. |
| Public API Refactoring: server/src/markdownify/markdown.ts (lines 134–141) | parseMarkdown() reorganized to set base URL, invoke _turndown for conversion, apply post-processing pipeline (fixBrokenLinks, stripSkipLinks, stripEditLinks, cleanupExtraWhitespace), and include error handling. |
| Post-Processing Logic: server/src/markdownify/markdown.ts (lines 233–267) | fixBrokenLinks() rewritten to detect and skip code blocks and fenced sections to avoid unintended escaping; cleanupExtraWhitespace() simplified to single-pass trim of trailing spaces before newlines. |
| URL Utilities: server/src/markdownify/markdown.ts (lines 156–159) | isRelativeUrl() updated to exclude additional schemes (e.g., tel:) alongside existing checks. |
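The last two rows describe small utilities that can be approximated as follows. This is a sketch under stated assumptions: the prefix list is a guess (the summary only confirms that `tel:` was added), and `isRelativeUrlSketch` and `trimTrailingSpaces` are hypothetical names rather than the functions in markdown.ts.

```typescript
// Assumed list of prefixes that mark a URL as NOT relative.
const NON_RELATIVE_PREFIXES = [
  "http://",
  "https://",
  "//",
  "#",
  "mailto:",
  "tel:",
  "data:",
];

function isRelativeUrlSketch(url: string): boolean {
  const u = url.trim().toLowerCase();
  return u.length > 0 && !NON_RELATIVE_PREFIXES.some((p) => u.startsWith(p));
}

// One regex pass: drop spaces and tabs that sit directly before a newline.
function trimTrailingSpaces(markdown: string): string {
  return markdown.replace(/[ \t]+\n/g, "\n");
}

// Relative links are resolved against the page URL with the standard URL API.
const resolved = new URL("/docs/a", "https://example.com/guide/").toString();
```

Excluding `tel:` matters because `new URL("tel:+123", base)` would otherwise never be reached with a meaningful base, and a phone link should pass through unchanged. Note also that two trailing spaces encode a hard line break in Markdown, so a pass like `trimTrailingSpaces` would remove those breaks.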

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • getmaxun/maxun#1037: Overlapping refactors of the same file with turndown customization, HTML tidying, noise removal, and post-processing logic.

Suggested reviewers

  • amhsirak

Poem

🐰 A rabbit hops through HTML's weeds,
TurndownService now feeds,
With selectors so clean, and code blocks pristine,
Markdown flows where noise convenes—
Fenced and tidy, our conversion now gleams!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title claims to improve markdown conversion speed, but the PR summary indicates the main changes are reusing a converter instance and bulk element cleanup for efficiency, not addressing speed itself. | Consider a more accurate title like 'refactor: reuse converter instance and bulk element cleanup for markdown conversion efficiency' that reflects the actual implementation approach. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
server/src/markdownify/markdown.ts (1)

7-7: ⚡ Quick win

Shared mutable _baseUrl is fragile — bind per call instead.

Hoisting _baseUrl to module scope works today only because parseMarkdown has no await and both tidyHtml (cheerio) and _turndown.turndown are synchronous, so each call runs to completion before the next interleaves. The moment an await is introduced anywhere in parseMarkdown (e.g., async post-processing, network-aware URL resolution, instrumentation), concurrent callers (output-post-processor.ts runs parseMarkdown on every page/result and is easy to parallelize) will race on _baseUrl and produce links resolved against the wrong base — silently. It also leaks state: after the call returns, _baseUrl retains the last value, which can mislead anyone who later invokes _turndown.turndown directly.
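The race described above can be reproduced with a few lines. This is a hypothetical async variant written only to show the failure mode: the single `await` is the assumption (the merged parseMarkdown is described as fully synchronous), and `sharedBase` and `resolveWithSharedBase` are illustrative names.

```typescript
// Module-level state shared by every caller, as with _baseUrl.
let sharedBase: string | null = null;

async function resolveWithSharedBase(href: string, base: string): Promise<string> {
  sharedBase = base;
  await Promise.resolve(); // any await lets another caller run in between
  // By now another caller may have overwritten sharedBase.
  return new URL(href, sharedBase ?? undefined).toString();
}

// Two "concurrent" callers with different base URLs.
const racing = Promise.all([
  resolveWithSharedBase("/page", "https://a.example/"),
  resolveWithSharedBase("/page", "https://b.example/"),
]);
```

Both promises resolve to `https://b.example/page`: the second caller overwrites `sharedBase` before the first one resumes, so the first link is resolved against the wrong site without any error being raised.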

Consider passing the base URL through the rules without a module global, e.g. by setting it on the instance just for the call and restoring it, or by exposing a thin helper that closes over a per-call value:

♻️ One option: per-call binding via instance property
-let _baseUrl: string | null = null;
-
-const _turndown = (() => {
+const _turndown = (() => {
   const t = new TurndownService({
     headingStyle: "atx",
     codeBlockStyle: "fenced",
     bulletListMarker: "-",
   });
+  (t as any)._baseUrl = null as string | null;
   ...
-      if (_baseUrl && isRelativeUrl(href)) {
+      if ((t as any)._baseUrl && isRelativeUrl(href)) {
         try {
-          const u = new URL(href, _baseUrl);
+          const u = new URL(href, (t as any)._baseUrl);
           href = u.toString();
         } catch { }
       }
   ...
-      if (_baseUrl && isRelativeUrl(src)) {
+      if ((t as any)._baseUrl && isRelativeUrl(src)) {
         try {
-          src = new URL(src, _baseUrl).toString();
+          src = new URL(src, (t as any)._baseUrl).toString();
         } catch {}
       }
   const tidiedHtml = tidyHtml(html);
-  _baseUrl = baseUrl ?? null;
-
+  (_turndown as any)._baseUrl = baseUrl ?? null;
   try {
     let out = _turndown.turndown(tidiedHtml);
     ...
     return out.trim();
   } catch (err) {
     ...
+  } finally {
+    (_turndown as any)._baseUrl = null;
   }

A cleaner alternative is to factor the rule replacement bodies into closures created per call, but that would partially undo this PR's "build once" optimization, so the property-based binding above is the lighter-touch fix.
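For comparison, the closure-based alternative might look roughly like this. It is a sketch, not a proposal for the actual rule bodies: `makeLinkResolver` and `LinkResolver` are hypothetical names, and wiring such a resolver into the shared converter's rules is left out.

```typescript
type LinkResolver = (href: string) => string;

// Captures the base URL for one call only; no module-level state involved.
function makeLinkResolver(baseUrl: string | null): LinkResolver {
  return (href) => {
    if (!baseUrl) return href;
    try {
      return new URL(href, baseUrl).toString();
    } catch {
      return href; // leave unparseable links untouched
    }
  };
}

// Each caller gets its own resolver, so bases cannot leak between calls.
const resolveA = makeLinkResolver("https://a.example/docs/");
const resolveB = makeLinkResolver("https://b.example/");

const fromA = resolveA("guide");
const fromB = resolveB("/guide");
const untouched = makeLinkResolver(null)("guide");
```

Here `fromA` is `https://a.example/docs/guide` and `fromB` is `https://b.example/guide`, regardless of the order in which the resolvers were created or called. The cost is one small closure allocation per call, which is the trade-off against the "build once" goal mentioned above.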

Also applies to: 141-141

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/src/markdownify/markdown.ts` at line 7, The module-level mutable
_baseUrl makes parseMarkdown (and _turndown rules) unsafe for concurrent
callers; change to bind the base URL per call by removing reliance on the module
global and passing the base into the turndown rule logic for each invocation of
parseMarkdown (for example, set an instance property on the _turndown object at
the start of parseMarkdown and restore it after, or create per-call closures
that capture the base); update any uses in tidyHtml/_turndown.turndown rule
bodies to read the per-call value rather than _baseUrl so state is not shared or
leaked between calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34199449-b25d-467c-94f1-eda50c5aaab5

📥 Commits

Reviewing files that changed from the base of the PR and between 1f0da54 and 1e8081a.

📒 Files selected for processing (1)
  • server/src/markdownify/markdown.ts

@amhsirak amhsirak merged commit b2fda8d into develop May 7, 2026
1 check passed
