HTML API: Return elements pushed and popped rather than tags read. #6348

dmsnell · 2024-04-03T08:48:26Z

Summary

Creates virtual nodes when pushing to and popping from the stack of open elements. It's these nodes that are returned by next_tag(), while subclassed methods intercept tag information, all within the HTML Processor.

Splitting time!

Add get_current_depth() to return the depth of the currently-matched element in the stack of open elements. This needs to account for marker and other not-yet-implemented items in the stack.
Add expects_closer() or similar function to indicate if the currently-matched element needs or expects a closing element.

Questions

What do we do with void tags?
- There are tradeoffs in how to represent tags without a closing tag (be they void elements or self-closing elements within foreign content or special atomic elements like SCRIPT). We can represent a unique open and close tag/event for them, or only expose the opening.
  - In exposing an open and a close there's a very straightforward and universal model of sequential open and close events for every tag.
  - It's weird to have a closing tag for an IMG.
- I think that the value in having a close tag is outweighed by the awkwardness of it. For situations where we're doing things like tracking depth, we should rely on an internal method like get_depth() from the HTML API itself instead of exporting that HTML nuance onto the caller.

Related work

Examples

Convert to Markdown	Output

Normalize HTML	Output

cc: @sirreal

github-actions · 2024-04-03T09:14:18Z

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell · 2024-05-24T01:56:44Z

@sirreal it looks like we're failing tests after seeking, which I believe might be because when jumping back to the start of the document, we wipe out the context node.

github-actions · 2024-06-02T17:07:48Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, audrasjb, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

sirreal · 2024-06-03T11:44:12Z

src/wp-includes/html-api/class-wp-html-processor.php

@@ -424,28 +428,19 @@ public function next_tag( $query = null ) {
 					continue;
 				}

-				if ( ! parent::is_tag_closer() ) {
+				if ( ! parent::is_tag_closer() || $visit_closers ) {


Should this also use $this?

Suggested change

if ( ! parent::is_tag_closer() || $visit_closers ) {

if ( ! $this->is_tag_closer() || $visit_closers ) {

Yes. Can you concoct a test case to fail for it? If not, don't worry - we can just fix it.

sirreal

There's a regression here where this HTML:

<?wp-bit hey?>

This is a PI_NODE_LOOKALIKE. Before, get_tag() here would be wp-bit - corresponding to the PI target. After this change get_tag() returns #comment.

sirreal · 2024-06-03T16:15:29Z

src/wp-includes/html-api/class-wp-html-processor.php

+		if ( isset( $this->current_element ) ) {
+			return $this->current_element->token->node_name;
+		}


This is what breaks the tag_name for PI lookalikes. The node_name is #comment, but it's a special comment type that has more handling in the Tag Processor:

wordpress-develop/src/wp-includes/html-api/class-wp-html-tag-processor.php

Lines 2640 to 2651 in fbb5020

$tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );

if ( self::STATE_MATCHED_TAG === $this->parser_state ) {

return strtoupper( $tag_name );

}

if (

self::STATE_COMMENT === $this->parser_state &&

self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()

) {

return $tag_name;

}

A potential fix is to only return here if we have a "non-special" node:

Suggested change

if ( isset( $this->current_element ) ) {

return $this->current_element->token->node_name;

}

if ( isset( $this->current_element ) && '#' !== $this->current_element->token->node_name[0] ) {

return $this->current_element->token->node_name;

}

sirreal · 2024-06-03T17:09:07Z

~~The following HTML reaches a P tag at the wrong depth:~~

EDIT: This is pre-existing behavior.

redacted

<div></p>

$p = WP_HTML_Processor::create_fragment('<div></p>');
$p->next_token();
$p->next_token();
var_dump( $p->get_tag(), $p->get_current_depth(), $p->get_breadcrumbs() );

php shell code:1:
string(1) "P"
php shell code:1:
int(3)
php shell code:1:
array(3) {
  [0] =>
  string(4) "HTML"
  [1] =>
  string(4) "BODY"
  [2] =>
  string(3) "DIV"
}

…nced-html

HTML is a kind of short-hand for a DOM structure. This means that there are many cases in HTML where an element's opening tag or closing tag is missing (or both). This is because many of the parsing rules imply creating elements in the DOM which may not exist in the text of the HTML. The HTML Processor, being the higher-level counterpart to the Tag Processor, is already aware of these nodes, but since it's inception has not paused on them when scanning through a document. Instead, these are visible when pausing on a child of such an element, but otherwise not seen. In this patch the HTML Processor starts exposing those implicitly-created nodes, including opening tags, and closing tags, that aren't foudn in the text content of the HTML input document. Previously, the sequence of matched tokens when scanning with `WP_HTML_Processor::next_token()` would depend on how the HTML document was written, but with this patch, all semantically equal HTML documents will parse and scan in the same exact manner, presenting an idealized or "perfect" view of the document the same way as would occur when traversing a DOM in a browser. Developed in #6348 Discussed in https://core.trac.wordpress.org/ticket/61348 Props audrasjb, dmsnell, gziolo, jonsurrell. Fixes #61348. git-svn-id: https://develop.svn.wordpress.org/trunk@58304 602fd350-edb4-49c9-b593-d223f7449a82

HTML is a kind of short-hand for a DOM structure. This means that there are many cases in HTML where an element's opening tag or closing tag is missing (or both). This is because many of the parsing rules imply creating elements in the DOM which may not exist in the text of the HTML. The HTML Processor, being the higher-level counterpart to the Tag Processor, is already aware of these nodes, but since it's inception has not paused on them when scanning through a document. Instead, these are visible when pausing on a child of such an element, but otherwise not seen. In this patch the HTML Processor starts exposing those implicitly-created nodes, including opening tags, and closing tags, that aren't foudn in the text content of the HTML input document. Previously, the sequence of matched tokens when scanning with `WP_HTML_Processor::next_token()` would depend on how the HTML document was written, but with this patch, all semantically equal HTML documents will parse and scan in the same exact manner, presenting an idealized or "perfect" view of the document the same way as would occur when traversing a DOM in a browser. Developed in WordPress/wordpress-develop#6348 Discussed in https://core.trac.wordpress.org/ticket/61348 Props audrasjb, dmsnell, gziolo, jonsurrell. Fixes #61348. Built from https://develop.svn.wordpress.org/trunk@58304 git-svn-id: http://core.svn.wordpress.org/trunk@57761 1a063a9b-81f0-0310-95a4-ce76da25c4cd

dmsnell · 2024-06-03T19:48:18Z

Merged in [58304]
163c3fd

HTML is a kind of short-hand for a DOM structure. This means that there are many cases in HTML where an element's opening tag or closing tag is missing (or both). This is because many of the parsing rules imply creating elements in the DOM which may not exist in the text of the HTML. The HTML Processor, being the higher-level counterpart to the Tag Processor, is already aware of these nodes, but since it's inception has not paused on them when scanning through a document. Instead, these are visible when pausing on a child of such an element, but otherwise not seen. In this patch the HTML Processor starts exposing those implicitly-created nodes, including opening tags, and closing tags, that aren't foudn in the text content of the HTML input document. Previously, the sequence of matched tokens when scanning with `WP_HTML_Processor::next_token()` would depend on how the HTML document was written, but with this patch, all semantically equal HTML documents will parse and scan in the same exact manner, presenting an idealized or "perfect" view of the document the same way as would occur when traversing a DOM in a browser. Developed in WordPress/wordpress-develop#6348 Discussed in https://core.trac.wordpress.org/ticket/61348 Props audrasjb, dmsnell, gziolo, jonsurrell. Fixes #61348. Built from https://develop.svn.wordpress.org/trunk@58304 git-svn-id: https://core.svn.wordpress.org/trunk@57761 1a063a9b-81f0-0310-95a4-ce76da25c4cd

westonruter · 2024-06-04T01:32:49Z

src/wp-includes/html-api/class-wp-html-open-elements.php

+	 *
+	 * @param Closure $handler The handler function.
+	 */
+	public function set_pop_handler( Closure $handler ) {


Is there a reason this is specifically a Closure and isn't just more generally a callable?

Closures cannot be serialized or deserialized, meaning that there's no possible way to prep a database record with user input that sets something unexpected here.

dmsnell requested review from ockham and adamziel April 3, 2024 08:48

dmsnell force-pushed the html-api/fake-balanced-html branch 2 times, most recently from 924f171 to 407a433 Compare April 3, 2024 08:54

dmsnell force-pushed the html-api/fake-balanced-html branch from 2b400b4 to e9165f6 Compare April 3, 2024 13:11

dmsnell mentioned this pull request Apr 3, 2024

HTML API: Plans for WP 6.6 WordPress/gutenberg#60324

Closed

2 tasks

dmsnell force-pushed the html-api/fake-balanced-html branch from 598fdd0 to ad7b39d Compare April 15, 2024 11:07

sirreal force-pushed the html-api/fake-balanced-html branch from 1b53221 to 250e539 Compare May 6, 2024 16:41

dmsnell added 3 commits May 20, 2024 18:33

Add WP_HTML_Stack_Event class.

06f6b94

Add push and pop handlers in stack of open elements.

8a55288

Report from push/pull instead of next_tag()

6457a0a

dmsnell force-pushed the html-api/fake-balanced-html branch from d05ddd4 to 6457a0a Compare May 21, 2024 01:48

Merge branch 'trunk' into html-api/fake-balanced-html

aef758e

dmsnell added 2 commits May 23, 2024 19:23

Examine node from top of stack instead of currently-matched node.

09dfe2e

Ensure that seek and get_attribute recognize push/pop.

bd42e03

dmsnell marked this pull request as ready for review June 2, 2024 17:07

Linting

1892de5

sirreal reviewed Jun 3, 2024

View reviewed changes

small docblock coding standard fixes

e96d7ec

dmsnell added 2 commits June 3, 2024 21:14

Update docblock comments.

d9001f0

Merge remote-tracking branch 'upstream/trunk' into html-api/fake-bala…

cc3a031

…nced-html

dmsnell closed this Jun 3, 2024

dmsnell deleted the html-api/fake-balanced-html branch June 3, 2024 19:48

westonruter reviewed Jun 4, 2024

View reviewed changes

sirreal mentioned this pull request Jun 4, 2024

HTML Processor: Ensure is_tag_closer is correctly called #6726

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML API: Return elements pushed and popped rather than tags read. #6348

HTML API: Return elements pushed and popped rather than tags read. #6348

dmsnell commented Apr 3, 2024 •

edited

Loading

github-actions bot commented Apr 3, 2024

dmsnell commented May 24, 2024

github-actions bot commented Jun 2, 2024 •

edited

Loading

sirreal Jun 3, 2024

dmsnell Jun 3, 2024

sirreal left a comment •

edited

Loading

sirreal Jun 3, 2024

sirreal commented Jun 3, 2024 •

edited

Loading

dmsnell commented Jun 3, 2024

westonruter Jun 4, 2024

dmsnell Jun 4, 2024

	if ( ! parent::is_tag_closer() \|\| $visit_closers ) {
	if ( ! $this->is_tag_closer() \|\| $visit_closers ) {

	$tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );

	if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
	return strtoupper( $tag_name );
	}

	if (
	self::STATE_COMMENT === $this->parser_state &&
	self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
	) {
	return $tag_name;
	}

HTML API: Return elements pushed and popped rather than tags read. #6348

HTML API: Return elements pushed and popped rather than tags read. #6348

Conversation

dmsnell commented Apr 3, 2024 • edited Loading

Summary

Splitting time!

Questions

Related work

Examples

github-actions bot commented Apr 3, 2024

Test using WordPress Playground

Some things to be aware of

dmsnell commented May 24, 2024

github-actions bot commented Jun 2, 2024 • edited Loading

sirreal Jun 3, 2024

Choose a reason for hiding this comment

dmsnell Jun 3, 2024

Choose a reason for hiding this comment

sirreal left a comment • edited Loading

Choose a reason for hiding this comment

sirreal Jun 3, 2024

Choose a reason for hiding this comment

sirreal commented Jun 3, 2024 • edited Loading

dmsnell commented Jun 3, 2024

westonruter Jun 4, 2024

Choose a reason for hiding this comment

dmsnell Jun 4, 2024

Choose a reason for hiding this comment

dmsnell commented Apr 3, 2024 •

edited

Loading

github-actions bot commented Jun 2, 2024 •

edited

Loading

sirreal left a comment •

edited

Loading

sirreal commented Jun 3, 2024 •

edited

Loading