Skip to content

Facilitate use of WP_HTML_Tag_Processor to normalize HTML for comparison #51275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
westonruter opened this issue Jun 6, 2023 · 10 comments
Closed
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] HTML API An API for updating HTML attributes in markup [Type] Enhancement A suggestion for improvement.

Comments

@westonruter
Copy link
Member

westonruter commented Jun 6, 2023

What problem does this address?

In writing some unit tests for core (10up/wordpress-develop#67), I needed compare actual markup with expected markup, but ignore insignificant differences like attribute order and whether double- vs single-quoted attributes were used. I turned to WP_HTML_Tag_Processor to implement a normalize_markup() method.

Source Code
function normalize_markup( $markup ) {
	$p = new WP_HTML_Tag_Processor( $markup );
	while ( $p->next_tag() ) {
		if ( $p->is_tag_closer() ) {
			continue;
		}
		$attribute_names = $p->get_attribute_names_with_prefix( '' );
		sort( $attribute_names );
		$attributes = array();
		foreach ( $attribute_names as $attribute_name ) {
			// For some reason these are considered attributes.
			if ( '<' === $attribute_name || strtoupper( $attribute_name ) === $p->get_tag() ) {
				continue;
			}
			$attributes[ $attribute_name ] = $p->get_attribute( $attribute_name );
			$p->remove_attribute( $attribute_name );
		}
		$p->get_updated_html(); // This seems to be required to "commit" the changes, otherwise re-adding them below will result in no change.
		foreach ( $attributes as $attribute_name => $attribute_value ) {
			$p->set_attribute( $attribute_name, $attribute_value );
		}
	}

	$normalized = $p->get_updated_html();

	// Normalize inside of IE conditional comments which the HTML tag processor rightfully skips over.
	$normalized = preg_replace_callback(
		'#(<!--\[[^\]]+?\]>)(.+?)(<!\[endif\]-->)#s',
		function ( $matches ) {
			return $matches[1] . normalize_markup( $matches[2] ) . $matches[3];
		},
		$normalized
	);

	return $normalized;
}

It worked, but there were some oddities:

  1. There was no method for getting all the attributes of a given tag. I had to use $p->get_attribute_names_with_prefix( '' ) instead.
  2. The get_attribute_names_with_prefix method worked, but it also unexpectedly included the "attributes" of "<" and the tag name.
  3. To normalize the attributes, I iterated over them to build up an associative array. Then I removed them all, and then re-added them in alphabetical order. In order for these mutations to stick, I had to add a call to $p->get_updated_html() after changing the attributes or else no changes would result.
  4. Re-ordering the attributes in the above way resulted in trailing whitespace inside the start tag. So if I had <script type='text/plain' id=foo> originally, then the normalized tag would be <script id="foo" type="text/plain" >.

What is your proposed solution?

  1. Add a get_attribute_names method that doesn't require passing a prefix.
  2. Prevent returning non-attribute names from get_attribute_names_with_prefix.
  3. Automatically "commit" changes to the markup if any mutation was done upon moving the pointer. (Or something?)
  4. Trim trailing whitespace in start tags.
@westonruter westonruter added [Type] Enhancement A suggestion for improvement. [Feature] Block API API that allows to express the block paradigm. labels Jun 6, 2023
@westonruter
Copy link
Member Author

cc @adamziel

@bph bph added the [Feature] HTML API An API for updating HTML attributes in markup label Jun 7, 2023
@bph
Copy link
Contributor

bph commented Jun 7, 2023

Also sending bat signal to @dmsnell

@dmsnell
Copy link
Member

dmsnell commented Jun 7, 2023

Thanks for playing with this @westonruter - I've wanted to start moving some HTML verification in tests to the Tag Processor. I'd recommend we take this a step further though and stop comparing strings against strings - we have the HTML parsing, why not rely on it?

function is_equivalent_html( $a, $b ) {
	$pa = new WP_HTML_Tag_Processor( $a );
	$pb = new WP_HTML_Tag_Processor( $b );

	while ( true ) {
		$next_a = $pa->next_tag( array( 'tag_closers' => 'visit' ) );
		$next_b = $pb->next_tag( array( 'tag_closers' => 'visit' ) );

		if ( $next_a !== $next_b ) {
			return false;
		}

		if ( false === $next_a ) {
			return true;
		}

		if ( $pa->get_tag() !== $pb->get_tag() ) {
			return false;
		}

		foreach ( $pa->get_attribute_names_with_prefix( '' ) as $name ) {
			if ( $pa->get_attribute( $name ) !== $pb->get_attribute( $name ) ) {
				return false;
			}
		}

		foreach ( $pb->get_attribute_names_with_prefix( '' ) as $name ) {
			if ( $pa->get_attribute( $name ) !== $pb->get_attribute( $name ) ) {
				return false;
			}
		}
	}
}

So here we are guaranteed that we're comparing equivalences against each other, and doing so structurally, and ensuring that our tests don't overspecify what behaviors they assert (whitespace, escaping, etc…).

There was no method for getting all the attributes of a given tag. I had to use $p->get_attribute_names_with_prefix( '' ) instead.

You are doing this exactly as you should be. We deliberated over this interface and chose this design so that the default behavior is performant and encourages people to only ask for what they need, whereas getting every attribute and allocating new data for the attributes is not generally what people need. It's easy to get only the attributes one cares about (e.g. $p->get_attributes_names_with_prefix( 'aria-' )) and then only copy/allocate strings for the values one cares about but takes thought an intention to broadly gobble up and copy all the data.

Add a get_attribute_names method that doesn't require passing a prefix.

Exposing get_attribute_names() encourages risky use of the API as a default and I'd prefer we preserve this small oddity as a self-teaching tool of the API, something to guide people naturally into how to properly use it.

it also unexpectedly included the "attributes" of "<" and the tag name.

Prevent returning non-attribute names from get_attribute_names_with_prefix.

Among the highest priority objectives of the Tag Processor is to remain compliant with the HTML5 specification. It includes these attributes because they are, by specification, attribute names, despite being parse errors. If we arbitrarily decide to exclude them we will get off track and invite corruption into the rest of the document, possibly inviting security exploits. Much of the logic in the tag processor is extremely unfamiliar, but the HTML5 specification is a large and complicated beast.

Although now I see we have no unit tests for this specific case, so I will try and add some to make sure we don't accidentally break spec compliance by "fixing" this. Good spotting! I'm curious what led you to this.

To normalize the attributes, I iterated over them to build up an associative array. Then I removed them all, and then re-added them in alphabetical order. In order for these mutations to stick, I had to add a call to $p->get_updated_html() after changing the attributes or else no changes would result.

Would be interested to better understand this. My guess is that you're hitting an odd edge because of what you're trying to do (the Tag Processor is not designed to normalize HTML).

We've got two things happening I think: the Tag Processor tries to minimize the number of operations and also the number of changes to a document in order to product an output that when parsed into the DOM results in the requested changes.

Most operations within the Tag Processor are designed to minimize the difference between an input and output document for any given change. For example, the add_class and remove_class methods preserve whitespace and the class ordering within the class attribute; and when encountering tags with duplicated attributes, the Tag Processor will leave those invalid duplicate attributes where they are but update the proper attribute which the browser will read for parsing its value. An exception to this rule is that all attribute updates store their values as double-quoted strings, meaning that attributes on input with single-quoted or unquoted values will appear in the output with double-quotes.

https://developer.wordpress.org/reference/classes/wp_html_tag_processor/#design-and-limitations

So if we ask to remove an attribute and then set that attribute, I believe it will preserve the original attribute name and only change the value, because semantically we have ended up in a state where that's the change we asked it to make.

Also if we end up asking the Tag Processor to set an attribute to a value equivalent to what it already has, I believe it will skip all updates to that attribute.

Automatically "commit" changes to the markup if any mutation was done upon moving the pointer. (Or something?)

Actually when moving the pointer via next_tag() or seek() all updates are in fact committed, but while operating on a tag we defer that computation for exactly the kind of circumstances as you are exposing here 🙃 - we assume people will end up asking to mutate a given attribute a number of times and the only thing that matters is the end value, so we bypass operations that might slow down the processor but provide no changed output.

Re-ordering the attributes in the above way resulted in trailing whitespace inside the start tag. So if I had <script type='text/plain' id=foo> originally, then the normalized tag would be <script id="foo" type="text/plain" >.

I would have liked to have removed this whitespace but the prettier appearance didn't merit the additional complexity and overhead it would require. In the parsed DOM this whitespace is meaningless, and so in order to minimize the operations and changes required, it's left in there.


In general I'd encourage you to try and embrace even more the ideas behind the Tag Processor and the HTML API; that is, use the parsed structural representation of HTML so we don't have to get stuck in the same string-based complexities and bugs we are so known for.

The Tag Processor is already parsing that HTML document, why not build our test assertions on the parsed representations with semantic assertions instead of string equality?

@westonruter
Copy link
Member Author

@dmsnell Thanks for the detailed reply and for chatting over Slack.

I'd recommend we take this a step further though and stop comparing strings against strings - we have the HTML parsing, why not rely on it?
So here we are guaranteed that we're comparing equivalences against each other, and doing so structurally, and ensuring that our tests don't overspecify what behaviors they assert (whitespace, escaping, etc…).
...
In general I'd encourage you to try and embrace even more the ideas behind the Tag Processor and the HTML API; that is, use the parsed structural representation of HTML so we don't have to get stuck in the same string-based complexities and bugs we are so known for.

The Tag Processor is already parsing that HTML document, why not build our test assertions on the parsed representations with semantic assertions instead of string equality?

The problem here is that the result is a pass/fail equivalence test. In this context of a PHPUnit test, the very nice thing about string comparison with assertSame is that we get a diff of where there is a lack of equivalence. For a developer, it's much easier to look at a blob of (normalized) HTML with a diff rather than getting a diff of specific nodes in a list outside the context of the HTML, I imagine.

You are doing this exactly as you should be. We deliberated over this interface and chose this design so that the default behavior is performant and encourages people to only ask for what they need, whereas getting every attribute and allocating new data for the attributes is not generally what people need. It's easy to get only the attributes one cares about (e.g. $p->get_attributes_names_with_prefix( 'aria-' )) and then only copy/allocate strings for the values one cares about but takes thought an intention to broadly gobble up and copy all the data.

Glad I stumbled across the right way to do it!

So if we ask to remove an attribute and then set that attribute, I believe it will preserve the original attribute name and only change the value, because semantically we have ended up in a state where that's the change we asked it to make.

Also if we end up asking the Tag Processor to set an attribute to a value equivalent to what it already has, I believe it will skip all updates to that attribute.

Thanks, I suspected this was the case.

@westonruter
Copy link
Member Author

Would be interested to better understand this. My guess is that you're hitting an odd edge because of what you're trying to do (the Tag Processor is not designed to normalize HTML).

And yeah, just discovered that WordPress/wordpress-develop@e3d345800d broke my normalization routine.

@westonruter
Copy link
Member Author

Silly me, it turns out that PHPUnit's assertEquals supports passing DOM nodes which it normalizes for comparison! So no need to use WP_HTML_Tag_Processor here.

@westonruter westonruter closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2023
@dmsnell
Copy link
Member

dmsnell commented Jun 8, 2023

supports passing DOM nodes which it normalizes for comparison!

Sweet! Just beware that DOMDocument gets a lot of basic spec compliance issues wrong 😢

If it doesn't matter for your test cases, that is fine.

I think when we build our custom PHPUnit assertion though we can print out a normalized document, and even do things to minimize differences to focus on what doesn't match. That's probably a bit more involved than what you need, so I'm glad DOMDocument is enough for now.

@westonruter
Copy link
Member Author

Yeah, the test cases we're checking involve comparing a few script tags to see if they have the same attributes. So it's fairly constrained and works great.

@westonruter
Copy link
Member Author

BTW, blogged about Comparing Markup with PHPUnit.

@dmsnell
Copy link
Member

dmsnell commented Jul 4, 2023

Nice post! It would be nice if later on we added something like this with the HTML Processor, once that has full support, due to the issues DOMDocument has with parsing normative spec-compliant HTML.

First steps are scheduled to appear in 6.4 if you want a preview - 58517.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] HTML API An API for updating HTML attributes in markup [Type] Enhancement A suggestion for improvement.
Projects
None yet
Development

No branches or pull requests

3 participants