From 0cea38e636739f5188294e62d3753dddf9455edb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Adam=20Zieli=C5=84ski?=
Date: Sun, 3 May 2026 19:17:13 +0200
Subject: [PATCH] docs: fix inaccurate API claims surfaced by an audit pass
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

I cross-checked every prose claim in bin/_docs_components/.md against
the actual source under components// by running each relevant API in
PHP. Most claims hold, but five real misstatements in html.md and one in
markdown.md would mislead a reader who took them at face value. Each is
fixed below.

bin/_docs_components/html.md

 * The pitfall "Tag closers are visited too. next_tag() stops on both
   opening and closing tags" was wrong.
   WP_HTML_Tag_Processor::matches() explicitly skips closing tags unless
   `stop_on_tag_closers` is set (class-wp-html-tag-processor.php
   ~line 4279). So a guard like `! $tags->is_tag_closer()` inside a
   next_tag() loop never fires. Rewrote the pitfall to describe the
   actual behavior and point at next_token() for code that wants to see
   closers.

 * The pitfall "If you call get_breadcrumbs() on it, you'll get a thin
   shape that doesn't reflect HTML5 tree construction" was wrong.
   WP_HTML_Tag_Processor doesn't expose get_breadcrumbs() at all —
   calling it raises `Call to undefined method`. Rewrote the pitfall to
   say so explicitly.

 * sanitize-html.php had `! $tags->is_tag_closer()` inside a next_tag()
   loop (dead code per the corrected pitfall). Removed the guard and
   added a one-line comment explaining why none is needed. Snippet
   output unchanged.

 * csp-nonce.php had the same dead guard. Removed.

 * decode-entities.php demonstrated `attribute_starts_with()` with
   `'java script:alert(1)'` (an encoded TAB, not an encoded letter) and
   var_dumped `bool(false)`.
   The comment said "respects encoded colons" but ` ` is a tab, and this
   method does NOT handle ASCII-whitespace bypasses anyway — browsers
   strip ASCII whitespace from URL attributes, but
   `attribute_starts_with()` doesn't. Reader's takeaway from the
   misleading example was "this function catches javascript:-with-tab",
   which is false. Replaced with `'javascript:alert(1)'` (entity-encoded
   `j`), which the method DOES correctly recognize as starting with
   `javascript:` — the actual claim the doc is making.

 * "When to use which" table row for attribute_starts_with() updated to
   match: the function decodes character references, so `javascript:`
   (where `a` = `a`) is correctly recognized.

bin/_docs_components/markdown.md

 * Claimed "the result is a BlocksWithMetadata object" without noting
   that the class lives in WordPress\DataLiberation\DataFormatConsumer,
   not WordPress\Markdown. A reader who tried
   `use WordPress\Markdown\BlocksWithMetadata` got a missing-class
   error. Added the canonical namespace inline.

Other components audited (zip, blockparser, xml, filesystem, bytestream,
httpclient, dataliberation, encoding, git, merge, polyfill, blueprints,
cli, corsproxy, httpserver, coding-standards) — class names, method
signatures, return shapes, and constants all match the source.

Verified by re-running every snippet with bin/run-snippets.py --check:
87/87 still pass.
---
 bin/_docs_components/html.md     | 20 ++++++++++++--------
 bin/_docs_components/markdown.md |  2 +-
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/bin/_docs_components/html.md b/bin/_docs_components/html.md
index 35d566120..b2aa2c50f 100644
--- a/bin/_docs_components/html.md
+++ b/bin/_docs_components/html.md
@@ -128,7 +128,9 @@ HTML;

 $tags = new WP_HTML_Tag_Processor( $untrusted );
 while ( $tags->next_tag() ) {
-    if ( 'SCRIPT' === $tags->get_tag() && ! $tags->is_tag_closer() ) {
+    // next_tag() never lands on closing tags, so no is_tag_closer() guard
+    // is needed here.
+    if ( 'SCRIPT' === $tags->get_tag() ) {
         $tags->set_modifiable_text( '' );
     }
     foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $attr ) {
@@ -168,7 +170,7 @@ HTML;
 $tags = new WP_HTML_Tag_Processor( $html );
 while ( $tags->next_tag() ) {
     $tag = $tags->get_tag();
-    if ( ( 'SCRIPT' === $tag || 'STYLE' === $tag ) && ! $tags->is_tag_closer() ) {
+    if ( 'SCRIPT' === $tag || 'STYLE' === $tag ) {
         $tags->set_attribute( 'nonce', $nonce );
     }
 }
@@ -237,9 +239,11 @@ require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
 echo "attribute: " . WP_HTML_Decoder::decode_attribute( 'path?a=1&b=2&copy' ) . "\n";
 echo "text: " . WP_HTML_Decoder::decode_text_node( 'AT&T — 100% 😀' ) . "\n";

-// Safe URL prefix check that respects encoded colons (a classic XSS vector).
+// Safe URL prefix check that decodes character references while comparing.
+// `j` is the letter `j`, so this string really does start with javascript:.
+// strpos() would miss it.
 $is_javascript = WP_HTML_Decoder::attribute_starts_with(
-    'java script:alert(1)',
+    'javascript:alert(1)',
     'javascript:',
     'ascii-case-insensitive'
 );
@@ -250,7 +254,7 @@ var_dump( $is_javascript );
 ```
 attribute: path?a=1&b=2©
 text: AT&T — 100% 😀
-bool(false)
+bool(true)
 ```

 ## Find images by ancestry with breadcrumbs
@@ -400,11 +404,11 @@ echo $tags->get_updated_html();
 WP_HTML_Tag_ProcessorAttribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context.
 WP_HTML_Processor::create_fragment()Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one."
 WP_HTML_Decoder::decode_text_node()Turning entity-encoded text (AT&T) back into raw text correctly. Implements the HTML5 entity algorithm — don't roll your own.
-WP_HTML_Decoder::attribute_starts_with()Safe URL-prefix checks that respect encoded characters (java	script:). The classic strpos approach misses these.
+WP_HTML_Decoder::attribute_starts_with()Safe URL-prefix checks that decode HTML character references while comparing — so javascript: (where a is the letter a) is correctly recognized as starting with javascript:. The classic strpos approach misses these.
-

Footgun: Tag closers are visited too. next_tag() stops on both opening and closing tags. For most attribute-rewriting code, gate with ! $tags->is_tag_closer() so you don't try to set attributes on a </script>.

+

Footgun: next_tag() only stops on opening tags. Closers and text are skipped, so a guard like ! $tags->is_tag_closer() inside a next_tag() loop is harmless but never fires. If you need to visit closing tags or text nodes, use next_token() instead and check get_token_type().

Footgun: Tag-name matches are uppercase. get_tag() always returns the tag name in uppercase ('IMG', not 'img'). Compare accordingly. The filter argument to next_tag() is case-insensitive in either direction.

-

Footgun: Don't confuse WP_HTML_Tag_Processor with the full processor. The cursor is forward-only and ancestry-blind. If you call get_breadcrumbs() on it, you'll get a thin shape that doesn't reflect HTML5 tree construction — implicit <tbody> insertion, automatic <p> closing, and the rest live only in WP_HTML_Processor.

+

Footgun: Don't confuse WP_HTML_Tag_Processor with the full processor. The cursor is forward-only and ancestry-blind, and it doesn't expose get_breadcrumbs() at all — calling that on a WP_HTML_Tag_Processor raises a Call to undefined method error. Breadcrumbs and HTML5 tree construction (implicit <tbody> insertion, automatic <p> closing, and the rest) live only on WP_HTML_Processor.

diff --git a/bin/_docs_components/markdown.md b/bin/_docs_components/markdown.md index ba86de27d..37f09baea 100644 --- a/bin/_docs_components/markdown.md +++ b/bin/_docs_components/markdown.md @@ -22,7 +22,7 @@ Bidirectional converter between Markdown and WordPress block markup. Useful for ## Markdown to blocks -

Feed Markdown into MarkdownConsumer, get block markup back. The result is a BlocksWithMetadata object that holds both the rendered blocks and any frontmatter parsed from the document.

+

Feed Markdown into MarkdownConsumer, get block markup back. The result is a BlocksWithMetadata object (defined in WordPress\DataLiberation\DataFormatConsumer — the shared shape every DataFormatConsumer in the toolkit emits) that holds both the rendered blocks and any frontmatter parsed from the document.