Attack Classes & Bypass History
A reference for the families of HTML-, SVG-, and MathML-based attacks that an
HTML sanitizer has to withstand, drawn from DOMPurify's regression test suite
(test/test-suite.js). Every payload below corresponds to a class that was, at
some point, a real bypass and is now covered by a regression test. The goal is
defensive: to help developers understand why these inputs are dangerous,
where they bite, and how to test and configure a sanitizer so they stay
closed.
These are historical / fixed classes documented for understanding. Treat the payloads as test vectors for your own pipeline, not as anything to deploy. If you find a working bypass against a current release, report it privately via the project's security advisories rather than publishing it.
A sanitizer returns a string (or a DOM tree). It is only safe if the context it is re-inserted into parses the same way the sanitizer parsed it. The suite tests three different re-insertion contexts for exactly this reason:
- element.innerHTML = clean is the normal case. The browser re-parses the string in an HTML context.
- jQuery(...).html(clean) adds jQuery's own parsing/normalization before insertion, which historically opened a mutation window that plain innerHTML did not.
- iframe.contentDocument.write(clean) / document.write starts a fresh parsing context with its own quirks.
Lesson: a string that is inert under innerHTML is not automatically inert
under every other sink. The dangerous gap between "how the sanitizer parsed it"
and "how the destination re-parses it" is the root of most of what follows.
mXSS is the central class. The payload looks harmless immediately after
sanitize(), but the parser mutates it into executable markup when the
string is serialized and re-parsed. The sanitizer inspected one tree; the
browser later built a different one.
HTML, SVG, and MathML have different parsing rules. Foreign-content elements
(<svg>, <math>) switch the tokenizer into a mode where the same bytes nest
differently. An attacker crafts markup that the sanitizer sees as benign
foreign content, but that "breaks out" into HTML on re-parse.
Canonical public example (Chrome 77 disclosure):
<svg></p><style><a id="</style><img src=1 onerror=alert(1)>"></svg>
The </p> and the quoting inside <style> cause the re-parse to foster an
<img onerror> into the HTML namespace that was never present in the tree the
sanitizer approved.
MathML variant:
<math><mtext><table><mglyph><style><img src onerror=alert(1)>
Defense / test: the sanitizer must track the namespace of every node
and forbid foreign-to-HTML transitions that the allow-list does not sanction.
DOMPurify enforces per-node namespaces and ships an ALLOWED_NAMESPACES /
NAMESPACE configuration; its suite asserts that constructs like
<svg><canvas></canvas><textarea></textarea></svg> are reduced to the safe
subtree rather than allowed to re-contextualize.
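The namespace rule above can be pictured as a parent-to-child transition check. What follows is a simplified sketch under stated assumptions, not DOMPurify's implementation: isValidTransition and the { ns, tag } record shape are hypothetical, the integration-point sets follow the lists in the HTML parsing specification, and the encoding attribute on annotation-xml is ignored for brevity.

```javascript
// Simplified sketch of a namespace-transition check. Nodes are plain
// { ns, tag } records (a hypothetical shape); real code would read the
// namespace and tag name from DOM nodes through safe getters.
const HTML = 'http://www.w3.org/1999/xhtml';
const SVG = 'http://www.w3.org/2000/svg';
const MATHML = 'http://www.w3.org/1998/Math/MathML';

// SVG elements that act as HTML integration points.
const SVG_HTML_INTEGRATION = new Set(['foreignobject', 'desc', 'title']);
// MathML text integration points.
const MATHML_TEXT_INTEGRATION = new Set(['mi', 'mo', 'mn', 'ms', 'mtext']);

function isValidTransition(parent, child) {
  const p = parent.tag.toLowerCase();
  const c = child.tag.toLowerCase();
  if (parent.ns === child.ns) return true;

  if (child.ns === SVG) {
    // Only <svg> itself may open SVG content from another namespace.
    if (c !== 'svg') return false;
    return parent.ns === HTML || p === 'annotation-xml' || MATHML_TEXT_INTEGRATION.has(p);
  }
  if (child.ns === MATHML) {
    // Only <math> itself may open MathML content from another namespace.
    if (c !== 'math') return false;
    return parent.ns === HTML || SVG_HTML_INTEGRATION.has(p);
  }
  if (child.ns === HTML) {
    // HTML under foreign content is only legitimate at an integration point.
    if (parent.ns === SVG) return SVG_HTML_INTEGRATION.has(p);
    if (parent.ns === MATHML) return p === 'annotation-xml' || MATHML_TEXT_INTEGRATION.has(p);
  }
  return false;
}

// An HTML <style> sitting directly under <svg> is the shape the mXSS
// payloads rely on, so the check rejects it.
const styleUnderSvg = isValidTransition({ ns: SVG, tag: 'svg' }, { ns: HTML, tag: 'style' });
const xmpUnderForeignObject = isValidTransition({ ns: SVG, tag: 'foreignObject' }, { ns: HTML, tag: 'xmp' });
```

A node that fails such a check would be removed rather than repaired, since any repair is itself a mutation the destination parser might undo.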
Some foreign elements are integration points whose children are parsed as HTML even though they sit inside SVG/MathML:
<math><annotation-xml encoding="text/html">…</annotation-xml></math>
<svg><foreignObject>…</foreignObject></svg>
Wrapping a legacy text-only element such as <xmp> inside one of these mixes
parsing modes:
<math><annotation-xml encoding="text/html"><xmp><img src=x onerror=alert(1)></xmp></annotation-xml></math>
<svg><foreignobject><xmp><img src=x onerror=alert(1)></xmp></foreignobject></svg>
Defense / test: the sanitizer must parse integration-point children in the correct mode so the misnesting cannot smuggle an event handler across the boundary.
A payload that is benign in one parsing context becomes live when re-inserted
inside a "rawtext"/special wrapper (script, xmp, iframe, noembed,
noframes, noscript). The output passes sanitize(), then mutates into an
event-handler-bearing element during a second parse inside the wrapper. This is
the mechanism behind the SAFE_FOR_XML rawtext advisories (see §3).
Elements such as noscript, noembed, noframes, xmp, textarea,
title, and style switch the tokenizer into a "text-only" mode. A closing
tag embedded in an attribute value can prematurely terminate that mode and
let the rest of an attribute be re-parsed as live markup.
<noscript><p title="</noscript><img src=x onerror=alert(1)>">
<noembed><img src=x onerror=alert(1)></noembed>
<style>a[href="</style><img src=x onerror=alert(1)>"]{}</style>
A subtlety: under server-side rendering with jsdom, scripting is disabled, so the
contents of <noscript> are parsed as HTML rather than as text. That bypass
does not reproduce in a scripting-enabled browser but is very real on the
server.
Defense / test: the sanitizer's text-context handling must account for every rawtext/RCDATA element (this is the class the "missing rawtext elements in the SAFE_FOR_XML regex" advisories addressed) and must not let a regex over-consume a closing tag that lives inside an attribute value ("attribute breakout").
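The attribute-breakout check described above can be sketched as a scan of attribute values for closing tags of text-only elements. This is a minimal illustration, not DOMPurify's actual pattern: the element list is taken from this section, and hasTextModeBreakout is a hypothetical helper name.

```javascript
// Elements named in this document that switch the tokenizer to a text-only
// mode. Illustrative list, not the pattern any particular release ships.
const TEXT_MODE_ELEMENTS = [
  'noscript', 'noembed', 'noframes', 'xmp', 'textarea',
  'title', 'style', 'script', 'iframe',
];

// Matches a closing tag for any of those elements, case-insensitively.
const BREAKOUT = new RegExp(`</(?:${TEXT_MODE_ELEMENTS.join('|')})\\b`, 'i');

// Hypothetical helper: true if an attribute value could terminate a
// text-only element early when the output is re-parsed.
function hasTextModeBreakout(attributeValue) {
  return BREAKOUT.test(String(attributeValue));
}

const suspicious = hasTextModeBreakout('</noscript><img src=x onerror=alert(1)>');
const harmless = hasTextModeBreakout('a plain tooltip');
```

The sketch deliberately over-matches (for example, a value containing `</style-guide` is also flagged), on the assumption that dropping a rare legitimate attribute is cheaper than missing a breakout.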
The HTML parser "foster-parents" misplaced nodes (e.g. a <script> directly
inside <table>) out to a different position than where they were written:
<table><script>alert(1)</script></table>
Two distinct risks live here:
- Content relocation: foster-parenting can move a node out of the subtree the sanitizer was inspecting. The sanitizer must evaluate the tree the parser actually produces, not the literal nesting in the string.
- Algorithmic blow-up (DoS): deeply nested re-parenting structures historically caused O(n²) work:
<table><table><table>… (×200)
A sanitizer needs a depth cap so a small payload cannot consume unbounded CPU. Note also that prototype pollution was able to weaken depth checks in past versions (see §8).
Defense / test: assert both that foster-parented script is removed and that deep nesting completes within a time bound.
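A depth cap of the kind mentioned above can be sketched without a DOM. The node shape ({ children: [...] }), the function names, and the cap value of 100 are all assumptions made for illustration; a real sanitizer would walk parsed DOM nodes and choose its own limit.

```javascript
// Iterative depth check over a plain { children: [...] } tree (a
// hypothetical node shape). Returns true as soon as any node sits deeper
// than maxDepth, without descending further along that path.
function exceedsDepth(root, maxDepth) {
  const stack = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop();
    if (depth > maxDepth) return true;
    for (const child of node.children ?? []) {
      stack.push({ node: child, depth: depth + 1 });
    }
  }
  return false;
}

// Build a chain of `levels` nested nodes, mimicking repeated <table> nesting.
function nestedChain(levels) {
  let node = { children: [] };
  for (let i = 1; i < levels; i += 1) node = { children: [node] };
  return node;
}

const deepRejected = exceedsDepth(nestedChain(200), 100);
const shallowAccepted = !exceedsDepth(nestedChain(50), 100);
```

The explicit stack avoids recursion, so the check itself cannot overflow the call stack on the input it is meant to reject.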
DOM clobbering uses named HTML elements (id/name) to shadow JavaScript
properties via the browser's named-property lookup — no script required. The
classic primitive against a sanitizer:
<form><input name="nodeName"></form>
form.nodeName now resolves to the <input> element instead of the string
"FORM". Any sanitizer logic that reads node.nodeName as an instance
property can be confused (or made to throw, causing a partial-sanitization DoS).
Two notes from the suite's findings:
- Clobbering yields node references, not attacker-chosen strings, and only
works on elements that expose named children (forms, document, window). It
cannot, for example, make a plain <a> report a fake tag name. This bounds what the technique can achieve.
- A clobbered form can also be reached via an external form= association (an input elsewhere in the document pointing at the form by id), which the sanitizer must account for when deciding whether a node is clobbered.
Defense / test: read security-sensitive node properties (nodeName,
attributes, parentNode, …) through realm-safe cached prototype getters
rather than instance properties, so a clobbered instance property cannot shadow
the real value. DOMPurify centralizes this and also offers
SANITIZE_DOM / SANITIZE_NAMED_PROPS to neutralize clobbering of id/name.
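The cached-getter defense can be illustrated without a browser. In the sketch below, FakeNode is a hypothetical stand-in for a DOM interface whose nodeName accessor lives on the prototype (as it does on Node.prototype in browsers), and the clobbering is simulated by defining a shadowing property on the instance.

```javascript
// Stand-in for a DOM interface: nodeName is an accessor on the prototype.
class FakeNode {
  #name;
  constructor(name) {
    this.#name = name;
  }
  get nodeName() {
    return this.#name;
  }
}

// Cache the prototype getter once, before any untrusted input is handled.
const getNodeName = Object.getOwnPropertyDescriptor(FakeNode.prototype, 'nodeName').get;

const form = new FakeNode('FORM');

// Simulated clobbering: a property on the instance now shadows the
// prototype accessor, the way a named <input> shadows form.nodeName.
Object.defineProperty(form, 'nodeName', { value: { looksLike: 'an <input> element' } });

const naiveRead = form.nodeName;          // the shadowing object
const safeRead = getNodeName.call(form);  // the real value, 'FORM'
```

Calling the cached getter with the node as receiver bypasses property lookup on the instance entirely, which is why the shadowing value never comes into play.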
When the input is a DOM node (e.g. IN_PLACE: true) rather than a string,
two things change:
- The node may come from a different realm (an <iframe>'s document). Its prototype chain and constructors differ from the main realm's, so naive instanceof checks and uncached getters can misbehave. The sanitizer must still strip dangerous attributes such as href="javascript:…" from foreign-realm nodes (covered by the cross-realm regression).
- The caller's live nodes are processed directly, so any pre-existing state on those nodes (including clobbering, or, in pathological app code, explicitly defined property getters) is in scope. The defensive answer is the same as §5: classify nodes via realm-safe getters, never trust instance properties.
Defense / test: feed the sanitizer a node built in a foreign realm and confirm dangerous attributes are removed; reject non-node objects passed where a node is expected.
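Why instanceof is unreliable across realms can be shown outside a browser. The sketch below uses Node.js's built-in vm module as an analogy for an iframe: a separate V8 context has its own Array constructor, just as an iframe's document has its own Node and Element constructors. It assumes a CommonJS Node.js environment and is not taken from the suite.

```javascript
const vm = require('node:vm');

// An array created inside a separate V8 context, standing in for a node
// created inside an iframe's document.
const foreignArray = vm.runInNewContext('[1, 2, 3]');

// The main realm's Array.prototype is not on the foreign object's
// prototype chain, so instanceof reports false for a genuine array.
const viaInstanceof = foreignArray instanceof Array;

// A realm-independent check still recognizes it.
const viaBrandCheck = Array.isArray(foreignArray);
```

The DOM analogue is to classify nodes by values that do not depend on constructor identity (read through cached getters) rather than by instanceof against the main realm's constructors.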
Apps that feed sanitized HTML into a client-side template engine ask DOMPurify
to also strip template expressions ({{…}}, ${…}, ERB tags). A subtle bypass:
an expression can be split across adjacent text nodes so that no single text
node matches the expression regex, yet the fragments merge into a live
expression after the DOM is normalize()d:
text node 1: "$"
text node 2: "{constructor.constructor(\"alert(1)\")()"
An even sharper variant hides the split text nodes inside
<template>.content, a separate DocumentFragment that a NodeIterator
rooted at the body does not traverse — so an expression scrubber that walks
only the main tree misses it entirely.
Defense / test: scrub expressions after accounting for node merging, and
explicitly recurse into <template>.content (and shadow roots) rather than
relying on a single top-level walk.
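The split-node problem is easy to reproduce with plain strings standing in for adjacent text nodes. The pattern below is a hypothetical scrubber regex written for this illustration, not the one DOMPurify uses.

```javascript
// Hypothetical scrubber pattern: the opening of a ${...} or {{...}} expression.
const EXPRESSION_OPENER = /\$\{|\{\{/;

// Two adjacent text nodes, modeled as strings, taken from the payload above.
const textNodes = ['$', '{constructor.constructor("alert(1)")()'];

// Checking each node in isolation finds nothing...
const perNodeHit = textNodes.some((text) => EXPRESSION_OPENER.test(text));

// ...but the text that results from merging adjacent nodes does match.
const mergedHit = EXPRESSION_OPENER.test(textNodes.join(''));
```

Here perNodeHit is false and mergedHit is true, which is the gap a scrubber closes by matching against the merged text of adjacent nodes rather than against each node on its own.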
With the default config, a sanitizer should reject unknown custom elements. Two related risks:
- Permissive CUSTOM_ELEMENT_HANDLING: a too-broad tagNameCheck / attributeNameCheck lets arbitrary custom elements with arbitrary attributes (including event handlers) through. This must be opt-in and tightly scoped.
- Prototype pollution as a force-multiplier: if an earlier gadget pollutes Object.prototype, a sanitizer that initializes config objects with {} (and reads missing keys off the prototype) can inherit attacker-controlled tagNameCheck / attributeNameCheck values, downgrading the default-deny. The defensive fix is to initialize internal config with Object.create(null) so polluted prototype keys are never inherited. Prototype-pollution gadgets are common in the ecosystem (lodash/jQuery/qs/merge-deep, …), which is what makes this class practically relevant rather than theoretical.
Defense / test: keep custom-element checks default-deny and prototype-safe;
verify that polluting Object.prototype.tagNameCheck does not loosen output.
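The difference between {} and Object.create(null) under pollution can be demonstrated in a few lines. This sketch simulates the pollution directly and cleans up afterwards; the variable names are illustrative, and only the tagNameCheck key comes from the text above.

```javascript
// Simulate an earlier prototype-pollution gadget having run.
Object.prototype.tagNameCheck = /.*/;

let inherited;
let isolated;
try {
  const naiveConfig = {};                 // inherits from Object.prototype
  const safeConfig = Object.create(null); // has no prototype at all

  inherited = naiveConfig.tagNameCheck;   // the attacker's permissive RegExp
  isolated = safeConfig.tagNameCheck;     // undefined: nothing to inherit
} finally {
  // Always undo the simulated pollution.
  delete Object.prototype.tagNameCheck;
}
```

A regression test along these lines would pollute the prototype, run the sanitizer on an unknown custom element, and assert that the element is still removed.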
Newer engine features can mutate the DOM after sanitization. Chrome's
<selectedcontent> mirrors the selected <option>'s subtree into its own
children after parsing — so a sanitizer that inspects only the static markup
misses the payload the engine later clones in.
Defense / test: forbid such elements unless explicitly opted in, and when opted in, re-walk the subtree after the engine populates it ("refresh after sanitize"). This is a general lesson: any feature that clones or defers content needs post-mutation re-inspection.
Many "bypasses" are really misconfigurations that downgrade protection:
- ALLOW_SELF_CLOSE_IN_ATTR interacts with older jQuery's html() normalization (the jQuery />-rewriting class).
- ADD_TAGS / ADD_ATTR widen the allow-list; predicate-function forms must not short-circuit URI validation (the javascript: URL on an allowed href class).
- ALLOW_UNKNOWN_PROTOCOLS, ADD_URI_SAFE_ATTR, and a loosened ALLOWED_URI_REGEXP can re-admit dangerous URI schemes.
- WHOLE_DOCUMENT, RETURN_DOM, RETURN_DOM_FRAGMENT change the output shape and the re-insertion contract.
Lesson: the secure defaults are the product; most config flags trade safety for capability and need a threat-model justification.
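One way to make the "justify every flag" rule mechanical is a small review helper. The sketch below is hypothetical: unjustifiedFlags and the justifications map are invented for illustration, and the flag list simply repeats the options named in this section.

```javascript
// Flags this section names as widening or reshaping sanitizer behavior.
const REVIEW_FLAGS = [
  'ALLOW_SELF_CLOSE_IN_ATTR', 'ADD_TAGS', 'ADD_ATTR',
  'ALLOW_UNKNOWN_PROTOCOLS', 'ADD_URI_SAFE_ATTR', 'ALLOWED_URI_REGEXP',
  'WHOLE_DOCUMENT', 'RETURN_DOM', 'RETURN_DOM_FRAGMENT',
];

// Hypothetical review helper: returns the flags that a config sets
// explicitly but that have no entry in the justifications map.
function unjustifiedFlags(config, justifications = {}) {
  // Own-property checks only, so polluted prototypes cannot add or hide flags.
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  return REVIEW_FLAGS.filter((flag) => has(config, flag) && !has(justifications, flag));
}

// Example config and review notes (both made up for the illustration).
const config = { ADD_TAGS: ['my-widget'], ALLOW_UNKNOWN_PROTOCOLS: true, RETURN_DOM: true };
const justifications = { ADD_TAGS: 'Design-system element, reviewed by security.' };

const needsReview = unjustifiedFlags(config, justifications);
```

With the example inputs, needsReview contains ALLOW_UNKNOWN_PROTOCOLS and RETURN_DOM; a check like this could run in CI so that new flags cannot land without an accompanying rationale.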
- Building a test corpus? Each payload above is a regression vector. Run it through your sanitizer in all three re-insertion contexts (§1) and assert on the live parsed tree, not on a substring of the output string (encoded markup produces false negatives).
- Reviewing a config? Walk §10 and require a justification for every non-default flag.
- Auditing sanitizer internals? §5/§6 are the recurring root cause: read node identity through realm-safe getters, never instance properties, and re-inspect anything the engine clones, defers, or hides in a separate fragment (§7, §9).
Sourced from the DOMPurify regression suite. Payloads are public, fixed test vectors documented for defensive testing and education.