Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KSES: Preserve some additional invalid HTML comment syntaxes. #6395

Closed

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Apr 15, 2024

Trac ticket: Core-61009

Status

The one test failure appears to be related to deprecating the outdated IE comment syntax. It's not a problem because the test enshrines a mis-parsing of special HTML comments.

This change fixes that mis-parsing and will prevent further corruption.

Description

When wp_kses_split processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.

In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:

  • HTML comments with invalid syntax, <!-->, <!-- --!>, etc…
  • HTML closing tags whose tag name is invalid </3>, </%happy>, etc…
  • Things that look like XML CDATA sections, <![CDATA[…]]>
  • Things that look like XML Processor Instruction nodes, <?include "blarg">

This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:

  • HTML comments with the incorrect closer --!>, because this one was a simple and easy change.
  • Closing tags with an invalid tag name, e.g. </%dolly>, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"

These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.

On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split() more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.

Discussion

Concerning the removed kses test, as far as I can tell there is no support for conditional comments like these after IE9, which I believe we don't support anymore. All modern browsers will interpret the entire span of content as a single comment.

Unfortunately there's a twist to the attack which follow best-practice recommendations for conditional comments, which is to add a short abruptly-closed comment after the opening.

<!--[if gte IE 4]><!-->
<SCRIPT>alert('XSS');</SCRIPT>
<![endif]-->

The addition of <!--> ends the opening comment and re-reveals the <SCRIPT> tag. However, this should not affect the security of WordPress because it will be looking for the SCRIPT tags through other code paths after the removal of this restriction.

Warning: Should verify this in a clear way.

When `wp_kses_split` processes a document it attempts to leave HTML
comments relatively alone. It makes minor adjustments, but leaves the
comments in the document in its output.

Unfortunately it only recognizes one kind of HTML comment and rejects
many other kinds which appear as the result of various invalid HTML
markup.

This patch makes a minor adjustment to the algorithm in `wp_kses_split`
to allow two additional kinds of HTML comments:

 - HTML comments with the incorrect closer `--!>`.
 - Closing tags with an invalid tag name, e.g. `</%dolly>`.

In an HTML parser these all become comments, and so leaving them in the
document should be a benign operation, improving the reliability of
detecting comments in Core. These invalid closing tags, which in a
browser are interpreted as comments, are one proposal for a placeholder
mechanism in the HTML API unlocking HTML templating, a new kind of
shortcode, and more. Having these persist in Core is a requirement for
exploring and utilizing the new syntax.
Copy link

github-actions bot commented Apr 15, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, zieladam, ellatrix, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@adamziel
Copy link
Contributor

The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:

Conditional comments are conditional statements interpreted by Microsoft Internet Explorer versions 5 through 9 in HTML source code

https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/

* it should be interpreted as an HTML comment. It extends until the
* first `>` character after the initial opening `</`.
*/
if ( 1 === preg_match( '~</[^a-zA-Z][^>]*>~', $content ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this bales out an returns $content if there is a match anywhere in the string. Is this expected? Would this kick in if that's inside of an attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call: I think this needs anchors. I'll review it and add them accordingly

src/wp-includes/kses.php Outdated Show resolved Hide resolved
@adamziel
Copy link
Contributor

adamziel commented Apr 16, 2024

Other than my preg_match question, this LGTM. I'd just remove the failing test case, conditional comments were only relevant up to IE9 which WordPress doesn't support anymore. And even if it did, the new output doesn't seem like an XSS vector. I'd still get inputs from others, cc @westonruter

@westonruter
Copy link
Member

Confirmed that Chrome parses these as comments:

<!-- comment 1 -->
foo
<!-- comment 2 --!>
bar
</%comment3>
baz

image

How common are these other invalid comments?

I presume you'll be adding tests specifically for the changes here?

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 17, 2024

How common are these other invalid comments?

I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch only addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.

I presume you'll be adding tests specifically for the changes here?

Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular.

In a local branch I was playing with replacing wp_kses_split2 using the HTML API and it provides some great affordances, but as usual, far more questions to resolve.

For instance, WordPress currently strips away <:::>, interpreting it as an invalid tag. It's not though, it's legitimate plaintext despite being invalid markup. So now what do we do?

  • Do we attempt to maintain this legacy behavior that corrupts documents that otherwise are fine?
  • Do we preserve this legitimate HTML and allow its surprising form to thread through a stack of code expecting more familiar HTML?
  • Do we eagerly replace confusing HTML like this before passing it along into WordPress, taking a performance hit to provide regularity and mitigate further misinterpretation by later code on the render chain?

There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required.

@westonruter
Copy link
Member

This patch only addresses a single kind of invalid comment leaving most other constructions alone.

Is there a list of all the possible invalid comment syntaxes?

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 18, 2024

Is there a list of all the possible invalid comment syntaxes?

They are not listed separated like this but you can find them in the parsing errors in the specification.

The Tag Processor draws its own classification which is retrievable with get_comment_type(), whereas it distinguishes:

  • Comments that are basically well-formed.
  • Comments that were abruptly closed.
  • Comments that appear as though someone wanted to write a CDATA section.
  • Comments that appear as though someone wanted to write an SGML processing instruction.
  • Comments that form from a variety of other invalid HTML constructions.

It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders.

@dmsnell
Copy link
Contributor Author

dmsnell commented Apr 19, 2024

How common are these other invalid comments?

@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at / for the domain.

invalid-comments.txt

This file has ANSI terminal codes for clearer output. You can cat it and see colors…

Screenshot 2024-04-19 at 6 47 42 PM

@dmsnell dmsnell force-pushed the html-api/allow-storing-funky-comments branch from d18593b to cc39a88 Compare June 12, 2024 13:55
( # Detect comments of various flavors before attempting to find tags.
(?:<!--.*?(--!?>|$)) # Well-formed HTML comments like `<!-- ... -->` and also with invalid `--!>` closer.
|
</[^a-zA-Z][^>]*> # Closing tags with invalid tag names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stupid question: could we somehow reuse the same part of the regular expression </[^a-zA-Z][^>]*> below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not stupid, but while we can do that, it's valuable IMO to leave this separate, as the closing tag with an invalid tag name has an incredibly simple and straightforward syntax pattern: the tag name starts with anything that isn't a - z (ASCII case-insensitive) and extends until the first >.

So by leaving this separate we can be even more careful about what we're examining. For example, opening tags with invalid tag names are not interpreted as invalid comments. The following pattern just isn't very robust in detecting normal tags let alone invalid ones, so as long as the correct approach is simpler being separate, I think it's worthwhile not joining them.

@ellatrix
Copy link
Member

Thanks for omitting the others improvements, it really helps to understand that they are not related to allowing the some additional invalid comment syntaxes and are completely separate fixes. It also reduces the risk of a revert in the unlikely that case there's a problem with the additional fixes. Let's do those in a separate PR 👍

while ( $transformed !== $content ) {
$transformed = wp_kses( $content, $allowed_html, $allowed_protocols );
$content = $transformed;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is copied from the existing comment behavior below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use an inline comment?

Copy link
Member

@ellatrix ellatrix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me know, love the changes to make it more concise.

@@ -1080,12 +1092,48 @@ function _wp_kses_split_callback( $matches ) {
function wp_kses_split2( $content, $allowed_html, $allowed_protocols ) {
$content = wp_kses_stripslashes( $content );

// It matched a ">" character.
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should wp_kses_split2 also get an updated @since?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great idea. Thanks for catching it!

Copy link
Member

@ellatrix ellatrix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe add an extra unit test case?

@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 12, 2024

Found an issue with encoding HTML comments, I'll examine.

Update: Was testing locally on another branch and got confused; all is fine.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines +989 to +990
|
</[^a-zA-Z][^>]*> # - Closing tags with invalid tag names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are essentially the only change to the regex pattern (an addition).

* first `>` character after the initial opening `</`.
*
* Preserve these comments and do not treat them like tags.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the > character ever not terminate the string? In other words, what would change if you removed the $ from the regexp?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I know now. Then we couldn’t assume bkw that the content starts at the at index two and ends at index minus one

*
* Preserve these comments and do not treat them like tags.
*/
if ( 1 === preg_match( '~^</[^a-zA-Z][^>]*>$~', $content ) ) {
Copy link
Contributor

@adamziel adamziel Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to check for any invalid comment closers? I’m on a bus and can't remember exactly, but there were cases like triple dash or cdata-lookalike closer. Or does none of that matter with these tag-closer-like comments?

Copy link
Contributor

@adamziel adamziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few notes I mean mostly as double-checking questions - I think you got everything right here and we’re good to go. Thank you!

As a side note, I’d love to see a fuzzer trying all the different risky inputs with and without every kses PR

@dmsnell dmsnell force-pushed the html-api/allow-storing-funky-comments branch from 6c708fb to 5949d4c Compare June 12, 2024 20:19
pento pushed a commit that referenced this pull request Jun 15, 2024
When `wp_kses_split` processes a document it attempts to leave HTML comments
alone. It makes minor adjustments, but leaves the comments in the document in
its output. Unfortunately it only recognizes one kind of HTML comment and
rejects many others.

This patch makes a minor adjustment to the algorithm in `wp_kses_split` to
recognize and preserve an additional kind of HTML comment: closing tags with
an invalid tag name, e.g. `</%dolly>`.

These invalid closing tags must be interpreted as comments by a browser.
This bug fix aligns the implementation of `wp_kses_split()` more closely
with its stated goal of leaving HTML comments as comments.

It doesn't attempt to fully fix the mis-parsed comments, but it does propose a
minor fix that hopefully won't break any existing code or projects.

Developed in #6395
Discussed in https://core.trac.wordpress.org/ticket/61009

Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam.
See #61009.


git-svn-id: https://develop.svn.wordpress.org/trunk@58418 602fd350-edb4-49c9-b593-d223f7449a82
@dmsnell
Copy link
Contributor Author

dmsnell commented Jun 15, 2024

Merged in [58418]
c756dfe

@dmsnell dmsnell closed this Jun 15, 2024
@dmsnell dmsnell deleted the html-api/allow-storing-funky-comments branch June 15, 2024 06:33
markjaquith pushed a commit to WordPress/WordPress that referenced this pull request Jun 15, 2024
When `wp_kses_split` processes a document it attempts to leave HTML comments
alone. It makes minor adjustments, but leaves the comments in the document in
its output. Unfortunately it only recognizes one kind of HTML comment and
rejects many others.

This patch makes a minor adjustment to the algorithm in `wp_kses_split` to
recognize and preserve an additional kind of HTML comment: closing tags with
an invalid tag name, e.g. `</%dolly>`.

These invalid closing tags must be interpreted as comments by a browser.
This bug fix aligns the implementation of `wp_kses_split()` more closely
with its stated goal of leaving HTML comments as comments.

It doesn't attempt to fully fix the mis-parsed comments, but it does propose a
minor fix that hopefully won't break any existing code or projects.

Developed in WordPress/wordpress-develop#6395
Discussed in https://core.trac.wordpress.org/ticket/61009

Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam.
See #61009.

Built from https://develop.svn.wordpress.org/trunk@58418


git-svn-id: http://core.svn.wordpress.org/trunk@57867 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to gilzow/wordpress-performance that referenced this pull request Jun 15, 2024
When `wp_kses_split` processes a document it attempts to leave HTML comments
alone. It makes minor adjustments, but leaves the comments in the document in
its output. Unfortunately it only recognizes one kind of HTML comment and
rejects many others.

This patch makes a minor adjustment to the algorithm in `wp_kses_split` to
recognize and preserve an additional kind of HTML comment: closing tags with
an invalid tag name, e.g. `</%dolly>`.

These invalid closing tags must be interpreted as comments by a browser.
This bug fix aligns the implementation of `wp_kses_split()` more closely
with its stated goal of leaving HTML comments as comments.

It doesn't attempt to fully fix the mis-parsed comments, but it does propose a
minor fix that hopefully won't break any existing code or projects.

Developed in WordPress/wordpress-develop#6395
Discussed in https://core.trac.wordpress.org/ticket/61009

Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam.
See #61009.

Built from https://develop.svn.wordpress.org/trunk@58418


git-svn-id: https://core.svn.wordpress.org/trunk@57867 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants