Facilitate use of WP_HTML_Tag_Processor to normalize HTML for comparison #51275
Comments
cc @adamziel
Also sending bat signal to @dmsnell
Thanks for playing with this @westonruter - I've wanted to start moving some HTML verification in tests to the Tag Processor. I'd recommend we take this a step further though and stop comparing strings against strings - we have the HTML parsing, why not rely on it?

    function is_equivalent_html( $a, $b ) {
        $pa = new WP_HTML_Tag_Processor( $a );
        $pb = new WP_HTML_Tag_Processor( $b );

        while ( true ) {
            // Advance both processors in lockstep, visiting closing tags too.
            $next_a = $pa->next_tag( array( 'tag_closers' => 'visit' ) );
            $next_b = $pb->next_tag( array( 'tag_closers' => 'visit' ) );

            // One document ran out of tags before the other.
            if ( $next_a !== $next_b ) {
                return false;
            }

            // Both documents are exhausted without a mismatch.
            if ( false === $next_a ) {
                return true;
            }

            // Tag names must match, and an opener must not be paired with a closer.
            if ( $pa->get_tag() !== $pb->get_tag() || $pa->is_tag_closer() !== $pb->is_tag_closer() ) {
                return false;
            }

            // The attribute name list is null on tag closers, hence the array() fallback.
            foreach ( $pa->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
                if ( $pa->get_attribute( $name ) !== $pb->get_attribute( $name ) ) {
                    return false;
                }
            }

            // Check the other direction to catch attributes that exist only in $b.
            foreach ( $pb->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
                if ( $pa->get_attribute( $name ) !== $pb->get_attribute( $name ) ) {
                    return false;
                }
            }
        }
    }

So here we are guaranteed that we're comparing equivalences against each other, and doing so structurally, and ensuring that our tests don't overspecify what behaviors they assert (whitespace, escaping, etc…).
You are doing this exactly as you should be. We deliberated over this interface and chose this design so that the default behavior is performant and encourages people to only ask for what they need, whereas getting every attribute and allocating new data for the attributes is not generally what people need. It's easy to get only the attributes one cares about (e.g.
Exposing
Among the highest priority objectives of the Tag Processor is to remain compliant with the HTML5 specification. It includes these attributes because they are, by specification, attribute names, despite being parse errors. If we arbitrarily decide to exclude them we will get off track and invite corruption into the rest of the document, possibly inviting security exploits. Much of the logic in the tag processor is extremely unfamiliar, but the HTML5 specification is a large and complicated beast. Although now I see we have no unit tests for this specific case, so I will try and add some to make sure we don't accidentally break spec compliance by "fixing" this. Good spotting! I'm curious what led you to this.
Would be interested to better understand this. My guess is that you're hitting an odd edge because of what you're trying to do (the Tag Processor is not designed to normalize HTML). We've got two things happening I think: the Tag Processor tries to minimize the number of operations and also the number of changes to a document in order to produce an output that when parsed into the DOM results in the requested changes.
https://developer.wordpress.org/reference/classes/wp_html_tag_processor/#design-and-limitations So if we ask to remove an attribute and then set that attribute, I believe it will preserve the original attribute name and only change the value, because semantically we have ended up in a state where that's the change we asked it to make. Also if we end up asking the Tag Processor to set an attribute to a value equivalent to what it already has, I believe it will skip all updates to that attribute.
Actually when moving the pointer via
I would have liked to have removed this whitespace but the prettier appearance didn't merit the additional complexity and overhead it would require. In the parsed DOM this whitespace is meaningless, and so in order to minimize the operations and changes required, it's left in there. In general I'd encourage you to try and embrace even more the ideas behind the Tag Processor and the HTML API; that is, use the parsed structural representation of HTML so we don't have to get stuck in the same string-based complexities and bugs we are so known for. The Tag Processor is already parsing that HTML document, why not build our test assertions on the parsed representations with semantic assertions instead of string equality?
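One way to picture "assertions on the parsed representation" is to reduce each document to a list of tag tokens and compare the lists. The sketch below is only an illustration, not code from this thread or from core: `summarize_tags()` is a hypothetical helper name, it assumes WordPress 6.2+ so that `WP_HTML_Tag_Processor` is loaded, and it deliberately ignores text content because the Tag Processor only visits tags.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reduce markup to a list of
 * tag tokens so two documents can be compared structurally.
 * Assumes WP_HTML_Tag_Processor (WordPress 6.2+) is available.
 */
function summarize_tags( string $html ): array {
	$processor = new WP_HTML_Tag_Processor( $html );
	$tokens    = array();

	// Visit every tag, including closing tags.
	while ( $processor->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
		if ( $processor->is_tag_closer() ) {
			$tokens[] = array( 'closer' => $processor->get_tag() );
			continue;
		}

		$attributes = array();
		// An empty prefix matches every attribute name; guard against null.
		foreach ( $processor->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
			$attributes[ $name ] = $processor->get_attribute( $name );
		}
		ksort( $attributes ); // Attribute order is not significant.

		$tokens[] = array(
			'tag'        => $processor->get_tag(),
			'attributes' => $attributes,
		);
	}

	return $tokens;
}
```

In a PHPUnit test, comparing `summarize_tags( $expected )` with `summarize_tags( $actual )` via `assertSame()` would report an array diff on failure rather than a bare pass/fail, while still ignoring quoting and whitespace inside tags.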
@dmsnell Thanks for the detailed reply and for chatting over Slack.
The problem here is that the result is a pass/fail equivalence test. In this context of a PHPUnit test, the very nice thing about string comparison with
Glad I stumbled across the right way to do it!
Thanks, I suspected this was the case.
And yeah, just discovered that WordPress/wordpress-develop@e3d345800d broke my normalization routine.
Silly me, it turns out that PHPUnit's
Sweet! Just beware that DOMDocument gets a lot of basic spec compliance issues wrong 😢 If it doesn't matter for your test cases, that is fine. I think when we build our custom PHPUnit assertion though we can print out a normalized document, and even do things to minimize differences to focus on what doesn't match. That's probably a bit more involved than what you need, so I'm glad DOMDocument is enough for now.
Yeah, the test cases we're checking involve comparing a few script tags to see if they have the same attributes. So it's fairly constrained and works great.
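For a constrained case like this, a DOMDocument-based comparison might look roughly like the sketch below. It is an assumption-laden illustration rather than the code actually used in the tests: `markup_to_tree()` and `node_to_tree()` are hypothetical names, the fragment is wrapped in a full document so that libxml parses it as UTF-8 body content, and the spec-compliance caveats about DOMDocument mentioned above still apply.

```php
<?php
/**
 * Hypothetical helper: convert the children of a DOM node into nested arrays
 * with attributes sorted by name, so attribute order and quoting style drop out.
 */
function node_to_tree( DOMNode $node ): array {
	$children = array();
	foreach ( $node->childNodes as $child ) {
		if ( $child instanceof DOMElement ) {
			$attributes = array();
			foreach ( $child->attributes as $attribute ) {
				$attributes[ $attribute->nodeName ] = $attribute->nodeValue;
			}
			ksort( $attributes );
			$children[] = array(
				'tag'        => $child->tagName,
				'attributes' => $attributes,
				'children'   => node_to_tree( $child ),
			);
		} elseif ( $child instanceof DOMText && '' !== trim( $child->nodeValue ) ) {
			// Keep non-empty text, ignoring surrounding whitespace.
			$children[] = trim( $child->nodeValue );
		}
	}
	return $children;
}

/**
 * Hypothetical helper: parse an HTML fragment and return its tree.
 */
function markup_to_tree( string $markup ): array {
	$dom      = new DOMDocument();
	$previous = libxml_use_internal_errors( true ); // Silence libxml parse warnings.
	$dom->loadHTML(
		'<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>' . $markup . '</body></html>'
	);
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	return node_to_tree( $dom->getElementsByTagName( 'body' )->item( 0 ) );
}
```

Passing two such trees to PHPUnit's `assertEquals()` would give a readable diff when the script tags differ, which a plain boolean equivalence check does not.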
BTW, blogged about Comparing Markup with PHPUnit.
Nice post! It would be nice if later on we added something like this with the HTML Processor, once that has full support, due to the issues DOMDocument has with parsing normative spec-compliant HTML. First steps are scheduled to appear in 6.4 if you want a preview - 58517.
What problem does this address?
In writing some unit tests for core (10up/wordpress-develop#67), I needed to compare actual markup with expected markup, but ignore insignificant differences like attribute order and whether double- vs single-quoted attributes were used. I turned to WP_HTML_Tag_Processor to implement a normalize_markup() method.
It worked, but there were some oddities:
[…] $p->get_attribute_names_with_prefix( '' ) instead.
[…] get_attribute_names_with_prefix method worked, but it also unexpectedly included the "attributes" of "<" and the tag name.
[…] $p->get_updated_html() after changing the attributes or else no changes would result.
[…] <script type='text/plain' id=foo> originally, then the normalized tag would be <script id="foo" type="text/plain" >.
What is your proposed solution?
[…] get_attribute_names method that doesn't require passing a prefix.
[…] get_attribute_names_with_prefix.