Attack Classes & Bypass History

Cure53 edited this page Jun 4, 2026 · 10 revisions

A reference for the families of HTML-, SVG-, and MathML-based attacks that an HTML sanitizer has to withstand, drawn from DOMPurify's regression test suite (test/test-suite.js). Every payload below corresponds to a class that was, at some point, a real bypass and is now covered by a regression test. The goal is defensive: to help developers understand why these inputs are dangerous, where they bite, and how to test and configure a sanitizer so they stay closed.

These are historical / fixed classes documented for understanding. Treat the payloads as test vectors for your own pipeline, not as anything to deploy. If you find a working bypass against a current release, report it privately via the project's security advisories rather than publishing it.


1. The first rule: sanitization is contextual

A sanitizer returns a string (or a DOM tree). It is only safe if the context it is re-inserted into parses the same way the sanitizer parsed it. The suite tests three different re-insertion contexts for exactly this reason:

  • element.innerHTML = clean — the normal case. The browser re-parses the string in an HTML context.
  • jQuery(...).html(clean) — jQuery does extra parsing/normalization before insertion, which historically opened a mutation window that plain innerHTML did not.
  • iframe.contentDocument.write(clean) / document.write — a fresh parsing context with its own quirks.

Lesson: a string that is inert under innerHTML is not automatically inert under every other sink. The dangerous gap between "how the sanitizer parsed it" and "how the destination re-parses it" is the root of most of what follows.


2. Mutation XSS (mXSS)

mXSS is the central class. The payload looks harmless immediately after sanitize(), but the parser mutates it into executable markup when the string is serialized and re-parsed. The sanitizer inspected one tree; the browser later built a different one.

2.1 Namespace confusion (HTML ↔ SVG ↔ MathML)

HTML, SVG, and MathML have different parsing rules. Foreign-content elements (<svg>, <math>) switch the tokenizer into a mode where the same bytes nest differently. An attacker crafts markup that the sanitizer sees as benign foreign content, but that "breaks out" into HTML on re-parse.

Canonical public example (Chrome 77 disclosure):

<svg></p><style><a id="</style><img src=1 onerror=alert(1)>"></svg>

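The stray </p> and the quoting inside <style> cause the re-parse to leave SVG content and treat <style> as an HTML raw-text element, which ends at the </style> inside the attribute value. The <img onerror> that follows becomes a live HTML element that was never present in the tree the sanitizer approved.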

MathML variant:

<math><mtext><table><mglyph><style><img src onerror=alert(1)>

Defense / test: the sanitizer must track the namespace of every node and forbid foreign-to-HTML transitions that the allow-list does not sanction. DOMPurify enforces per-node namespaces and ships the ALLOWED_NAMESPACES and NAMESPACE configuration options; its suite asserts that constructs like <svg><canvas></canvas><textarea></textarea></svg> are reduced to the safe subtree rather than allowed to re-contextualize.
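The namespace rule above can be illustrated with a minimal sketch over plain objects. It is not DOMPurify's implementation: the node shape ({ ns, tag, encoding }) and the function names are assumptions made for illustration, and the rule set is deliberately simplified. It only encodes the idea that a node may differ in namespace from its parent when the parent is HTML or an integration point, and then only via the svg or math root element.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";
const SVG_NS = "http://www.w3.org/2000/svg";
const MATHML_NS = "http://www.w3.org/1998/Math/MathML";

// Integration points named by the HTML parsing rules (tag names lower-cased).
const SVG_HTML_INTEGRATION_POINTS = new Set(["foreignobject", "desc", "title"]);
const MATHML_TEXT_INTEGRATION_POINTS = new Set(["mi", "mo", "mn", "ms", "mtext"]);

// Hypothetical helper: does this parent parse its children with HTML rules?
function acceptsHtmlParsing(parent) {
  if (parent.ns === HTML_NS) return true;
  if (parent.ns === SVG_NS) return SVG_HTML_INTEGRATION_POINTS.has(parent.tag);
  if (parent.ns === MATHML_NS) {
    if (MATHML_TEXT_INTEGRATION_POINTS.has(parent.tag)) return true;
    // annotation-xml only counts with an HTML-ish encoding attribute.
    return (
      parent.tag === "annotation-xml" &&
      /^(text\/html|application\/xhtml\+xml)$/i.test(parent.encoding ?? "")
    );
  }
  return false;
}

// Simplified check: is the child's namespace plausible under this parent?
function isTransitionAllowed(child, parent) {
  if (child.ns === parent.ns) return true; // no namespace boundary crossed
  if (!acceptsHtmlParsing(parent)) return false; // e.g. HTML <img> under SVG <style>
  if (child.ns === SVG_NS) return child.tag === "svg"; // only the root may enter SVG
  if (child.ns === MATHML_NS) return child.tag === "math"; // only the root may enter MathML
  return child.ns === HTML_NS;
}
```

In a real sanitizer the same decision would be made on live nodes (namespaceURI, localName, parentNode), and a node that fails the check would be removed together with its subtree.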

2.2 Foreign-content integration points

Some foreign elements are integration points whose children are parsed as HTML even though they sit inside SVG/MathML:

  • <math><annotation-xml encoding="text/html">…</annotation-xml></math>
  • <svg><foreignObject>…</foreignObject></svg>

Wrapping a legacy text-only element such as <xmp> inside one of these mixes parsing modes:

<math><annotation-xml encoding="text/html"><xmp><img src=x onerror=alert(1)></xmp></annotation-xml></math>
<svg><foreignobject><xmp><img src=x onerror=alert(1)></xmp></foreignobject></svg>

Defense / test: the sanitizer must parse integration-point children in the correct mode so the misnesting cannot smuggle an event handler across the boundary.

2.3 Re-contextualization via wrapper elements

A payload that is benign in one parsing context becomes live when re-inserted inside a "rawtext"/special wrapper (script, xmp, iframe, noembed, noframes, noscript). The output passes sanitize(), then mutates into an event-handler-bearing element during a second parse inside the wrapper. This is the mechanism behind the SAFE_FOR_XML rawtext advisories (see §3).


3. Rawtext / RCDATA element breakouts

Elements such as noscript, noembed, noframes, xmp, textarea, title, and style switch the tokenizer into a "text-only" mode. A closing tag embedded in an attribute value can prematurely terminate that mode and let the rest of an attribute be re-parsed as live markup.

<noscript><p title="</noscript><img src=x onerror=alert(1)>">
<noembed><img src=x onerror=alert(1)></noembed>
<style>a[href="</style><img src=x onerror=alert(1)>"]{}</style>

A subtlety: in server-side rendering (for example under jsdom), scripting is disabled, so the contents of <noscript> are parsed as HTML rather than as text. The resulting bypass does not reproduce in a scripting-enabled browser but is very real on the server.

Defense / test: the sanitizer's text-context handling must account for every rawtext/RCDATA element (this is the class the "missing rawtext elements in the SAFE_FOR_XML regex" advisories addressed) and must not let a regex over-consume a closing tag that lives inside an attribute value ("attribute breakout").
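As a rough illustration of the attribute-breakout check, the sketch below flags any attribute value that contains a closing tag for a text-mode element. The element list is taken from the lists on this page, and the constant and function names are hypothetical; this is not the regex DOMPurify ships. The match deliberately over-approximates (it does not require whitespace, "/" or ">" after the tag name), on the assumption that a false rejection is cheaper than a missed breakout.

```javascript
// Text-mode (rawtext / RCDATA-style) element names mentioned on this page.
const TEXT_MODE_ELEMENTS = [
  "script", "style", "title", "textarea", "xmp",
  "iframe", "noembed", "noframes", "noscript",
];

// Case-insensitive, no "g" flag so that test() keeps no state between calls.
const CLOSING_TEXT_MODE_TAG = new RegExp(
  "</(?:" + TEXT_MODE_ELEMENTS.join("|") + ")",
  "i"
);

// Returns true when the attribute value could terminate a text-mode element
// if the serialized markup is parsed again inside such an element.
function attributeBreaksOutOfTextMode(value) {
  return CLOSING_TEXT_MODE_TAG.test(value);
}
```

A sanitizer using a check like this would typically drop the whole attribute instead of trying to repair its value.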


4. Nesting, depth, and foster-parenting

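The HTML parser relocates ("foster-parents") content that is misplaced inside table markup, so a node can end up at a different position than where it was written. Table payloads such as a <script> directly inside <table> exercise this part of the tree builder: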

<table><script>alert(1)</script></table>

Two distinct risks live here:

  1. Content relocation — foster-parenting can move a node out of the subtree the sanitizer was inspecting. The sanitizer must evaluate the tree the parser actually produces, not the literal nesting in the string.

  2. Algorithmic blow-up (DoS) — deeply nested re-parenting structures historically caused O(n²) work:

    <table><table><table>…  (×200)

    A sanitizer needs a depth cap so a small payload cannot consume unbounded CPU. Note also that prototype pollution was able to weaken depth checks in past versions — see §8.

Defense / test: assert both that foster-parented script is removed and that deep nesting completes within a time bound.
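A depth cap can be sketched as an iterative walk that stops as soon as any node sits deeper than a limit. The tree shape (objects with a children array), the function names, and the limits used in the example are assumptions for illustration; the document does not state which limit or traversal DOMPurify uses. An explicit stack is used so that the check itself cannot overflow the call stack on hostile input.

```javascript
// Returns true if any node in the tree is nested deeper than maxDepth.
// The root counts as depth 1.
function exceedsDepth(root, maxDepth) {
  const stack = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop();
    if (depth > maxDepth) return true; // bail out early, no full traversal
    for (const child of node.children ?? []) {
      stack.push({ node: child, depth: depth + 1 });
    }
  }
  return false;
}

// Hypothetical helper that builds a chain of `levels` nested nodes,
// standing in for <table><table><table>... in a parsed tree.
function buildNestedChain(levels) {
  const root = { tag: "table", children: [] };
  let current = root;
  for (let i = 1; i < levels; i++) {
    const next = { tag: "table", children: [] };
    current.children.push(next);
    current = next;
  }
  return root;
}
```

For example, exceedsDepth(buildNestedChain(200), 100) reports the tree as too deep, while a limit of 200 accepts it.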


5. DOM clobbering

DOM clobbering uses named HTML elements (id/name) to shadow JavaScript properties via the browser's named-property lookup — no script required. The classic primitive against a sanitizer:

<form><input name="nodeName"></form>

form.nodeName now resolves to the <input> element instead of the string "FORM". Any sanitizer logic that reads node.nodeName as an instance property can be confused (or made to throw, causing a partial-sanitization DoS).

Two notes from the suite's findings:

  • Clobbering yields node references, not attacker-chosen strings, and only works on elements that expose named children (forms, document, window). It cannot, for example, make a plain <a> report a fake tag name. This bounds what the technique can achieve.
  • A clobbered form can also be reached via an external form= association (an input elsewhere in the document pointing at the form by id), which the sanitizer must account for when deciding whether a node is clobbered.

Defense / test: read security-sensitive node properties (nodeName, attributes, parentNode, …) through realm-safe cached prototype getters rather than instance properties, so a clobbered instance property cannot shadow the real value. DOMPurify centralizes this and also offers SANITIZE_DOM / SANITIZE_NAMED_PROPS to neutralize clobbering of id/name.
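The cached-getter idea can be demonstrated without a browser. The sketch below simulates clobbering on a stand-in class: FakeNode, readNodeName, and makeClobberedForm are hypothetical names, and the own-property shadowing only imitates what named-property lookup does on a real form. In a browser the corresponding getter would be taken from Node.prototype in the same way.

```javascript
// Stand-in for a DOM node: the real value lives behind a prototype getter.
class FakeNode {
  #name;
  constructor(name) {
    this.#name = name;
  }
  get nodeName() {
    return this.#name;
  }
}

// Cache the prototype getter once, before any untrusted markup is handled.
const cachedNodeNameGetter = Object.getOwnPropertyDescriptor(
  FakeNode.prototype,
  "nodeName"
).get;

// Reads nodeName through the cached getter, ignoring instance-level shadowing.
function readNodeName(node) {
  return cachedNodeNameGetter.call(node);
}

// Simulates <form><input name="nodeName"></form>: the instance property
// now resolves to another node instead of the string "FORM".
function makeClobberedForm() {
  const form = new FakeNode("FORM");
  Object.defineProperty(form, "nodeName", { value: new FakeNode("INPUT") });
  return form;
}
```

With const form = makeClobberedForm(), form.nodeName yields an object while readNodeName(form) still returns the real tag name, which is the property the defensive pattern relies on.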


6. Cross-realm and IN_PLACE node input

When the input is a DOM node (e.g. IN_PLACE: true) rather than a string, two things change:

  • The node may come from a different realm (an <iframe>'s document). Its prototype chain and constructors differ from the main realm's, so naive instanceof checks and uncached getters can misbehave. The sanitizer must still strip dangerous attributes such as href="javascript:…" from foreign-realm nodes (covered by the cross-realm regression).
  • The caller's live nodes are processed directly, so any pre-existing state on those nodes (including clobbering, or, in pathological app code, explicitly defined property getters) is in scope. The defensive answer is the same as §5: classify nodes via realm-safe getters, never trust instance properties.

Defense / test: feed the sanitizer a node built in a foreign realm and confirm dangerous attributes are removed; reject non-node objects passed where a node is expected.


7. Template-expression injection (SAFE_FOR_TEMPLATES)

Apps that feed sanitized HTML into a client-side template engine ask DOMPurify to also strip template expressions ({{…}}, ${…}, ERB tags). A subtle bypass: an expression can be split across adjacent text nodes so that no single text node matches the expression regex, yet the fragments merge into a live expression after the DOM is normalize()d:

text node 1:  "$"
text node 2:  "{constructor.constructor(\"alert(1)\")()"

An even sharper variant hides the split text nodes inside <template>.content, a separate DocumentFragment that a NodeIterator rooted at the body does not traverse — so an expression scrubber that walks only the main tree misses it entirely.

Defense / test: scrub expressions after accounting for node merging, and explicitly recurse into <template>.content (and shadow roots) rather than relying on a single top-level walk.
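The split-node problem is easy to reproduce with plain strings standing in for adjacent text nodes. The pattern and function names below are illustrative assumptions, not DOMPurify's own expressions; the point is only that a per-node test and a test on the merged text can disagree.

```javascript
// Hypothetical pattern for the start of a template expression.
const EXPRESSION_START = /\{\{|\$\{|<%/;

// Naive scrubber logic: inspect each text node on its own.
function anyNodeMatches(adjacentTextNodes) {
  return adjacentTextNodes.some((text) => EXPRESSION_START.test(text));
}

// Merge-aware logic: inspect the text as it will read after normalize().
function mergedTextMatches(adjacentTextNodes) {
  return EXPRESSION_START.test(adjacentTextNodes.join(""));
}

// The two fragments from the example above.
const splitExpression = ["$", '{constructor.constructor("alert(1)")()'];
```

Here anyNodeMatches(splitExpression) is false while mergedTextMatches(splitExpression) is true. In a real tree, the merged check would be applied to each run of adjacent text siblings, including runs found inside <template>.content.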


8. Custom elements & prototype-pollution gadget chains

With the default config, a sanitizer should reject unknown custom elements. Two related risks:

  • Permissive CUSTOM_ELEMENT_HANDLING — a too-broad tagNameCheck / attributeNameCheck lets arbitrary custom elements with arbitrary attributes (including event handlers) through. This must be opt-in and tightly scoped.
  • Prototype pollution as a force-multiplier — if an earlier gadget pollutes Object.prototype, a sanitizer that initializes config objects with {} (and reads missing keys off the prototype) can inherit attacker-controlled tagNameCheck/attributeNameCheck values, downgrading the default-deny. The defensive fix is to initialize internal config with Object.create(null) so polluted prototype keys are never inherited. PP gadgets are common in the ecosystem (lodash/jQuery/qs/merge-deep, …), which is what makes this class practically relevant rather than theoretical.

Defense / test: keep custom-element checks default-deny and prototype-safe; verify that polluting Object.prototype.tagNameCheck does not loosen output.
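The difference between {} and Object.create(null) can be shown in a few lines. The sketch pollutes Object.prototype on purpose, inside a try/finally that removes the key again; the function names are hypothetical, and the lookup stands in for whatever code path reads a missing config key.

```javascript
// Stand-in for sanitizer code that reads an optional config key.
function lookupTagNameCheck(config) {
  return config.tagNameCheck;
}

// Simulates an earlier gadget polluting Object.prototype, then compares
// a {} config with a prototype-less config. Cleans up after itself.
function demonstratePollution() {
  Object.prototype.tagNameCheck = /.*/; // attacker-controlled "allow everything"
  try {
    const plainConfig = {}; // inherits from Object.prototype
    const bareConfig = Object.create(null); // has no prototype at all
    return {
      inherited: lookupTagNameCheck(plainConfig) instanceof RegExp,
      isolated: lookupTagNameCheck(bareConfig) === undefined,
    };
  } finally {
    delete Object.prototype.tagNameCheck;
  }
}
```

Both flags come back true: the plain object picks up the polluted check, while the prototype-less object does not see it.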


9. Engine-deferred mutation (<selectedcontent>)

Newer engine features can mutate the DOM after sanitization. Chrome's <selectedcontent> mirrors the selected <option>'s subtree into its own children after parsing — so a sanitizer that inspects only the static markup misses the payload the engine later clones in.

Defense / test: forbid such elements unless explicitly opted in, and when opted in, re-walk the subtree after the engine populates it ("refresh after sanitize"). This is a general lesson: any feature that clones or defers content needs post-mutation re-inspection.


10. Configuration pitfalls (self-inflicted bypasses)

Many "bypasses" are really misconfigurations that downgrade protection:

  • ALLOW_SELF_CLOSE_IN_ATTR interacts with older jQuery's html() normalization (the jQuery />-rewriting class).
  • ADD_TAGS / ADD_ATTR widen the allow-list; predicate-function forms must not short-circuit URI validation (the javascript: URL on an allowed href class).
  • ALLOW_UNKNOWN_PROTOCOLS, ADD_URI_SAFE_ATTR, and a loosened ALLOWED_URI_REGEXP can re-admit dangerous URI schemes.
  • WHOLE_DOCUMENT, RETURN_DOM, RETURN_DOM_FRAGMENT change the output shape and the re-insertion contract.

Lesson: the secure defaults are the product; most config flags trade safety for capability and need a threat-model justification.
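To make the URI-scheme pitfall concrete, here is a minimal scheme allow-list check. The scheme list, the stripped character set, and the function name are assumptions for illustration and do not reproduce DOMPurify's ALLOWED_URI_REGEXP. The check removes whitespace and control characters before looking for a scheme, because URL parsing drops tabs and newlines, so "java\tscript:" still resolves to javascript:.

```javascript
// Hypothetical allow-list; a real deployment derives this from its threat model.
const SAFE_SCHEMES = new Set(["http", "https", "mailto", "tel"]);

// Control characters and whitespace that could hide a scheme. Stripping them
// everywhere is broader than URL parsing does, erring on the side of rejecting.
const IGNORABLE = /[\u0000-\u0020\u00A0\u1680\u180E\u2000-\u2029\u205F\u3000]/g;

// Returns true for relative URLs and for absolute URLs with an allowed scheme.
function isSafeUri(value) {
  const cleaned = value.replace(IGNORABLE, "");
  const match = /^([a-z][a-z0-9+.-]*):/i.exec(cleaned);
  if (!match) return true; // no scheme: relative or protocol-relative URL
  return SAFE_SCHEMES.has(match[1].toLowerCase());
}
```

A predicate-style ADD_ATTR or a loosened URI regex should still leave a check of this kind in the path for every URI-bearing attribute.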


How to use this page

  • Building a test corpus? Each payload above is a regression vector. Run it through your sanitizer in all three re-insertion contexts (§1) and assert on the live parsed tree, not on a substring of the output string: a substring check cannot distinguish inert, entity-encoded text from live markup, and it never sees what a second parse produces.
  • Reviewing a config? Walk §10 and require a justification for every non-default flag.
  • Auditing sanitizer internals? §5/§6 are the recurring root cause: read node identity through realm-safe getters, never instance properties, and re-inspect anything the engine clones, defers, or hides in a separate fragment (§7, §9).

Sourced from the DOMPurify regression suite. Payloads are public, fixed test vectors documented for defensive testing and education.
