Conversation
WalkthroughThis PR refactors the HTML-to-Markdown conversion pipeline in ChangesHTML to Markdown Conversion Pipeline
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
server/src/markdownify/markdown.ts (1)
7-7: ⚡ Quick winShared mutable
_baseUrlis fragile — bind per call instead.Hoisting
_baseUrlto module scope works today only becauseparseMarkdownhas noawaitand bothtidyHtml(cheerio) and_turndown.turndownare synchronous, so each call runs to completion before the next interleaves. The moment anawaitis introduced anywhere inparseMarkdown(e.g., async post-processing, network-aware URL resolution, instrumentation), concurrent callers (output-post-processor.tsrunsparseMarkdownon every page/result and is easy to parallelize) will race on_baseUrland produce links resolved against the wrong base — silently. It also leaks state: after the call returns,_baseUrlretains the last value, which can mislead anyone who later invokes_turndown.turndowndirectly.Consider passing the base URL through the rules without a module global, e.g. by setting it on the instance just for the call and restoring it, or by exposing a thin helper that closes over a per-call value:
♻️ One option: per-call binding via instance property
-let _baseUrl: string | null = null; - -const _turndown = (() => { +const _turndown = (() => { const t = new TurndownService({ headingStyle: "atx", codeBlockStyle: "fenced", bulletListMarker: "-", }); + (t as any)._baseUrl = null as string | null; ... - if (_baseUrl && isRelativeUrl(href)) { + if ((t as any)._baseUrl && isRelativeUrl(href)) { try { - const u = new URL(href, _baseUrl); + const u = new URL(href, (t as any)._baseUrl); href = u.toString(); } catch { } } ... - if (_baseUrl && isRelativeUrl(src)) { + if ((t as any)._baseUrl && isRelativeUrl(src)) { try { - src = new URL(src, _baseUrl).toString(); + src = new URL(src, (t as any)._baseUrl).toString(); } catch {} }const tidiedHtml = tidyHtml(html); - _baseUrl = baseUrl ?? null; - + (_turndown as any)._baseUrl = baseUrl ?? null; try { let out = _turndown.turndown(tidiedHtml); ... - return out.trim(); + return out.trim(); } catch (err) { ... + } finally { + (_turndown as any)._baseUrl = null; }A cleaner alternative is to factor the rule replacement bodies into closures created per call, but that would partially undo this PR's "build once" optimization, so the property-based binding above is the lighter-touch fix.
Also applies to: 141-141
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/markdownify/markdown.ts` at line 7, The module-level mutable _baseUrl makes parseMarkdown (and _turndown rules) unsafe for concurrent callers; change to bind the base URL per call by removing reliance on the module global and passing the base into the turndown rule logic for each invocation of parseMarkdown (for example, set an instance property on the _turndown object at the start of parseMarkdown and restore it after, or create per-call closures that capture the base); update any uses in tidyHtml/_turndown.turndown rule bodies to read the per-call value rather than _baseUrl so state is not shared or leaked between calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@server/src/markdownify/markdown.ts`:
- Line 7: The module-level mutable _baseUrl makes parseMarkdown (and _turndown
rules) unsafe for concurrent callers; change to bind the base URL per call by
removing reliance on the module global and passing the base into the turndown
rule logic for each invocation of parseMarkdown (for example, set an instance
property on the _turndown object at the start of parseMarkdown and restore it
after, or create per-call closures that capture the base); update any uses in
tidyHtml/_turndown.turndown rule bodies to read the per-call value rather than
_baseUrl so state is not shared or leaked between calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 34199449-b25d-467c-94f1-eda50c5aaab5
📒 Files selected for processing (1)
server/src/markdownify/markdown.ts
What this PR does?
These changes make the HTML-to-markdown conversion faster by reusing a single shared converter instance across all pages instead of rebuilding it from scratch each time, and by cleaning up noisy page elements in bulk rather than one-by-one. The result is the same clean markdown output, just produced more efficiently, most noticeably on crawl robots that process many pages in sequence, where the savings compound with each page visited.
Summary by CodeRabbit
Release Notes