Summary
markdown.Document.HTML() can emit untrusted markdown as unsanitized HTML, exposing XSS when the rendered output is shown in a browser (e.g. the entity webapp). There are two related vectors that should be addressed together:
- Unvalidated URL schemes —
markdown/document.go (link ~:182, image ~:188-194) writes href/src from n.Href through html.EscapeString only. html.EscapeString does not block dangerous schemes, so javascript: / data: URIs survive into the attribute.
- Raw HTML passthrough (dominant vector) —
markdown/options.go defaults PreserveHTML: true, and markdown/document.go:226 (case "raw-html", "html") returns n.RawHTML verbatim. Raw <script>, <img onerror=>, etc. pass straight through.
Why this is deferred (not a drive-by fix)
- A scheme allowlist on
href/src alone is incomplete — the raw-HTML path is the bigger hole, so fixing only (1) gives a false sense of security.
- Simply flipping
PreserveHTML to false is behavior-breaking: it silently strips/flattens embedded HTML, breaks markdown round-trip fidelity, and breaks <details>/admonition handling (parseDetailsHTML depends on PreserveHTML: true).
Correct fix
Sanitize on render (allowlist of tags / attributes / URL schemes — e.g. bluemonday or an equivalent), keeping HTML preservation intact, rather than a piecemeal href allowlist or a default flip. Confirm the intended trust model first:
- Is
Document.HTML() expected to be safe for untrusted input, or only for trusted local CLI/PDF rendering?
- Should sanitization be opt-in (a
Sanitize option) or the default for the webapp consumer?
Context
Deferred from PR #119 CodeRabbit review (comments on markdown/document.go:194 and markdown/options.go:25). Both are the same underlying surface.
Summary
markdown.Document.HTML()can emit untrusted markdown as unsanitized HTML, exposing XSS when the rendered output is shown in a browser (e.g. the entity webapp). There are two related vectors that should be addressed together:markdown/document.go(link~:182, image~:188-194) writeshref/srcfromn.Hrefthroughhtml.EscapeStringonly.html.EscapeStringdoes not block dangerous schemes, sojavascript:/data:URIs survive into the attribute.markdown/options.godefaultsPreserveHTML: true, andmarkdown/document.go:226(case "raw-html", "html") returnsn.RawHTMLverbatim. Raw<script>,<img onerror=>, etc. pass straight through.Why this is deferred (not a drive-by fix)
href/srcalone is incomplete — the raw-HTML path is the bigger hole, so fixing only (1) gives a false sense of security.PreserveHTMLtofalseis behavior-breaking: it silently strips/flattens embedded HTML, breaks markdown round-trip fidelity, and breaks<details>/admonition handling (parseDetailsHTMLdepends onPreserveHTML: true).Correct fix
Sanitize on render (allowlist of tags / attributes / URL schemes — e.g. bluemonday or an equivalent), keeping HTML preservation intact, rather than a piecemeal
hrefallowlist or a default flip. Confirm the intended trust model first:Document.HTML()expected to be safe for untrusted input, or only for trusted local CLI/PDF rendering?Sanitizeoption) or the default for the webapp consumer?Context
Deferred from PR #119 CodeRabbit review (comments on
markdown/document.go:194andmarkdown/options.go:25). Both are the same underlying surface.