Wp html processor text based #2

adamziel · 2023-02-27T15:20:59Z

Attempt at WordPress#4125 but without expanding the DOM tree in memory

Supported features

"in body" insertion mode
nth_child next_sibling`
get/set inner_html and outer_html
Correctly processes non-normative markup like <ul><li>1<li>2<li>3</ul>

Missing features

Insertion modes other than in body – they actually seem important for properly treating markup like <table><tr><td><tr><td></table>
Retaining active formatting elements in code like <p><b>First<p>Second. Currently, the b tag is forcibly removed from the list of active formatting elements when the second <p> is encountered
Support for injecting tag closers into the code – applying lexical updates before the cursor position while traversing the document is problematic
Support for find( $css_selector )

Stats

Current stats for parsing the 13MB single page HTML parsing spec document:

Mem peak usage: 38.6MB
time: 11.61s

That's pretty good!

cc @dmsnell @ockham

…king a tag closer This commit marks the start of a bookmark one byte before the tag name start for tag openers, and two bytes before the tag name for tag closers. Setting a bookmark on a tag should set its "start" position before the opening "<", e.g.: ``` <div> Testing a <b>Bookmark</b> ----------------^ ``` The current calculation assumes this is always one byte to the left from $tag_name_starts_at. However, in tag closers that index points to a solidus symbol "/": ``` <div> Testing a <b>Bookmark</b> ----------------------------^ ``` The bookmark should therefore start two bytes before the tag name: ``` <div> Testing a <b>Bookmark</b> ---------------------------^ ```

…closers' into wp_html_processor

dmsnell · 2023-03-03T03:33:57Z

src/wp-includes/html-api/class-wp-html-processor.php

+			dbg( "Found {$this->current_token->tag} tag opener" );
+			switch ( $this->current_token->tag ) {
+				case 'HTML':
+					$this->drop_current_tag_token();


what's the purpose here? it seems like we're modifying a document that nobody asked to modify

This specific case is a hack I added to prevent a failure upon finding an HTML tag. It shouldn't be needed anymore once we look into other insertion modes. Now that I think about it, perhaps a return statement to ignore it would be a better hack.

As for the general drop_current_tag_token function, it used to be ignore_token. Let's make it ignore_token again – there's no need to make such uninvited updates.

dmsnell · 2023-03-03T03:34:47Z

src/wp-includes/html-api/class-wp-html-processor.php

+					if ( $this->is_element_in_button_scope( 'P' ) ) {
+						$this->close_p_element();
+					}
+					if ( in_array( $this->current_node()->tag, array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' ) ) ) {


isn't this implied by the case waterfall?

The switch checks the $this->current_token->tag while this check looks into $this->current_node() – the most recent entry in the open elements stack. In other words, this ensures that h2 in markup like <div><h1>Primary heading <h2> Secondary heading</div> does not become a child of h1

dmsnell · 2023-03-03T03:37:38Z

src/wp-includes/html-api/class-wp-html-processor.php

+		if($token !== $this->current_token) {
+			// Aesthetic choice for now.
+			// @TODO: discuss it with the team
+			$tag = strtolower($token->tag);


I'm in favor of respecting what someone sends. if they send dIv then that's fine.

sounds good to me 👍

dmsnell · 2023-03-03T03:38:52Z

src/wp-includes/html-api/class-wp-html-processor.php

+				)
+			);
+		}
+		array_push($this->open_elements, $token);


I think this doesn't matter right now, but I'm strongly opposed to offering a function which would allow people to add a partial tag, an opener without a closer. adding a void or self-closing foreign element seems fine, but leaving an open element on the stack seems to be an invitation for trouble

I agree! I didn't mean to offer this method to people. insert_element is a private API that implements the insert an HTML element for a token part of the parsing spec.

In the DOM PR, it was quite a bit more complex, but in this stream processing version it all collapses to inserting a tag opener and pushing an element to stack.

dmsnell

a thing I'm most worried about with partial HTMl support is properly indicating that we bailed. if we return false I wonder if that's enough to communicate that we didn't change anything. same goes if we return the original un-modified string.

I don't have an answer here; I'm just raising the question

luisherranz · 2023-03-03T12:01:32Z

a thing I'm most worried about with partial HTMl support is properly indicating that we bailed

I'm also been wondering what would be the ideal API for that.

Maybe throwing an error:

try {
  $p = new WP_HTML_Processor( $html_which_may_be_broken );
  // process...
} catch (Exception $e) {
  // The HTML was broken and it couldn't be processed, do whatever you need
  // to do in this situation.
}

adamziel · 2023-03-03T12:19:09Z

I have a related worry – what if we need to bale out after applying some updates already? For example:

$p = new WP_HTML_Processor( '<div></div><b><section></b></section>' );
$p->find('div');
$p->set_inner_html('Yay!');
$p->get_updated_html();
// <div>Yay!</div><b><section></b></section>

$p->find('section');
$p->set_inner_html('Yay!'); // throws? or returns false?
$p->get_updated_html(); // <div>Yay!</div><b><section></b></section>

$p->find('div'); // does this return true? or are we in unrecoverable state now?

I can only see two ways to proceed:

Apply updates in the supported part of the markup but refuse to do anything in the unsupported part
Scan the rest of the document on first get_updated_html and make sure we're dealing with a supported markup

I can't easily tell if 1 is safe. Intuitively it should be since it's dealing only with the markup that's well-formed. If it's not safe, though, we're stuck with parsing the entire markup at least once.

luisherranz · 2023-03-03T15:06:02Z

My idea of how this was supposed to work was that $p->get_updated_html() was final, and it would go all the way to the end to make sure the rest of the HTML is OK and then return the modified value. If it's not ok, it would throw.

And my throw-an-error suggestion was based on the assumption that it can fail at any time (even on the first $p->find('div')), and when it does, there's no reason to keep processing the rest.

But based on your comment, $p->get_updated_html() is not the final command, it can be called in the middle of the processing. Is that correct?

I guess the throw-an-error idea only works if you can't get any modified HTML until the very end. I mean, getting the modified HTML finalizes the processing.

adamziel · 2023-03-03T15:27:26Z

But based on your comment, $p->get_updated_html() is not the final command, it can be called in the middle of the processing. Is that correct?

In Tag Processor $p->get_updated_html() can be called anytime during processing. In fact, it's even called internally in seek() as a way of flushing the enqueuec updates.

The big reason is performance. Tag Processor won't process any markup it doesn't have to process. This means you can update the first ten tags in a very long document and never bear the cost of parsing all of it.

Performance is even more of a concern in HTML Processor but correctness matters even more. The big question is whether it's okay to make a partial update. My intuition says it should be fine, but I need to think about it more. If we need to go through the entire thing, throwing an exception sounds nice. I actually wonder – can we throw in WordPress core? I don't see many instances of that aside of external libraries that were merged at one point.

dmsnell · 2023-03-06T18:19:35Z

Maybe throwing an error:

We try really hard to avoid throwing in Core because we don't want to crash peoples' sites. As convenient as it can be at times, a corrupted site is generally preferable to a white screen.

what if we need to bale out after applying some updates already?

For the HTML processor as we talk through these things I think the only viable way to approach this at least as an initial run is to scan through the entire document on get_udpated_html(). That means we couldn't have already-processed changes and then later find unexpected inputs.

It doesn't mean we have to give up performance for this aspect. On that initial modification operation we can simply finish scanning and verify the document structure. We might even be able to set an internal flag indicating that the rest of the document is fine, letting us proceed after that on the same processor instance, without re-scanning.

Finally we can iterate to look at the impact of finding un-opened tags - I can't remember off the top of my head - because if we're at a point in the document where tags are balanced I don't see a problem making modifications there and then bailing only when the problems appear.

There is a can of worms here and I think we're closer to settling on an interface, but I bet that's still three and a half weeks away. This HTML processor operates on HTML that exists within a broader HTML document so there is the likelihood of partial input or unexpected context which would flavor what we're doing within the domain of the processor itself. i.e. Some things that look fine to us won't be inside the page just as some things that look broken to us might be fine when stitched back in to the page.

As for errors, if it's not enough to return false (which maybe it is), we might look to a Maybe/Optional type with something like these:

[true, $updated_html] / [false]
[$updated_html] / []
[$updated_html], / [false, $error_type]

dmsnell · 2023-03-07T00:49:32Z

src/wp-includes/html-api/class-wp-html-processor.php

+		$active_formatting_elements = $this->active_formatting_elements;
+
+		/**
+		 * seek() will rewing before the current tag


typo: rewing > rewind

adamziel · 2023-03-08T18:29:59Z

I think the only viable way to approach this at least as an initial run is to scan through the entire document on get_udpated_html()

@dmsnell Why would you say this is the only viable approach? I’m still on the fence about this. Is there anything wrong with modifying the well-formed part of the markup? Maybe let’s just allow it and only refuse to process things once we run into unsupported markup?

Finally we can iterate to look at the impact of finding un-opened tags - I can't remember off the top of my head - because if we're at a point in the document where tags are balanced I don't see a problem making modifications there and then bailing only when the problems appear.

Most of the time they're ignored or, IIRC, assumed to close the current tag.

As for errors, if it's not enough to return false (which maybe it is), we might look to a Maybe/Optional type with something like these:

I'd rather stick to false. It's standard in WordPress, not surprising, and simple to check for. I love optionals, but I'd rather not introduce yet another coding pattern. Alternatively, we could return a WP_Error instance.

dmsnell · 2023-03-10T20:13:45Z

Alternatively, we could return a WP_Error instance.

yeah I always forget about that one (probably because I don't like it 🙃)

Why would you say this is the only viable approach?

it was based on some things you said, which we started discussing outside of this issue. if we don't have to look out for tags later on that would change tags beforehand then what I said is wrong.

Support for find( $css_selector )

I think it's fine to not have this, or if we add it, to only add it in late once we have that in another. class. I think we can build it purely from next_sibling(), next_tag(), and next_child()

one of the things I love about this is another part of what I am nervous about: the big switch statement handling tag openings.

I'm wondering if we could blend the idea of a WP_HTML_Spec class/database to flag some of those decisions and focus the logic here on the decisions, less on the tags.

I've thought that it might help to grab the reference to a class once and then reference static flags on that class multiple times.

still the tradeoff between supporting things we can vs. the simplicity of supporting only a well-defined small subset irks me. my partner PR doesn't support single-page.html so I can't give any performance comparisons.

adamziel added 19 commits February 21, 2023 18:13

Explore HTML parsing and Adoption Agency Algorithm

7e7602c

Emit text tokens

0bdd4f6

Merge branch 'html-api-start-bookmarks-before-tag-opener-also-in-tag-…

a24349a

…closers' into wp_html_processor

Consume HTML text nodes as tokens

ff9505b

Implement DOM insertion

afbfdc5

Fix a bug in the adoption agency algorithm

9d31cb7

Correctly cose the p tags

eeea95a

Simplify HTML Processor

ddf2c73

Correct the is_element_in_scope checks

db40a94

Uncomment some test inputs

ea4f392

Document insert_node

66fd636

Simplify ignore_token()

93fea6c

Start exploring a text-based API

fd2ddcf

Doodling more

faf724e

Simplify the adoption agency algorithm

0565b6b

Get rid of next_token() logic

a287999

Remove Object-oriented logic

7430067

More advanced diff-based approach

4f6ec24

adamziel changed the base branch from trunk to wp_html_processor February 27, 2023 15:21

Remove class-wp-html-processor.php

3a0ed5f

adamziel changed the base branch from wp_html_processor to trunk February 27, 2023 15:22

adamziel added 7 commits February 28, 2023 13:59

Close the tags in a correct order

a7d76e7

Reconstruct the active formatting elements in their correct location

0663a48

Remove $reconstructed_html – always operate on the tag processor stream

7887401

Clean up the API

481fce5

Cleanup the API

140459e

Don't skip over RCData and Script tag closers

26c6f21

MVP parser capable of parsing the entire HTML spec

37659fb

adamziel force-pushed the wp_html_processor-text-based branch from 9a29748 to 37659fb Compare February 28, 2023 14:47

adamziel added 10 commits March 1, 2023 16:35

Adjust bookmarks setting to suit outer_html better

4efef0b

Correctly process outer_html() using pinned bookmarks

0e3799b

Fix processing H1-H6 tag closers

ab608e5

Simplify unit tests

5be3ba1

Add a complex use-case test

7d19b9b

Simplify nth_child and nth_sibling

cb9d35d

Fixy fix in the get_updated_html method

3b27ca2

MVP support for updates before current parsing cursor

802e1c0

Get all wp-html-processor tests to pass

13badeb

Remove debug artifacts

2918ada

adamziel mentioned this pull request Mar 2, 2023

Explore HTML parser based on the Tag Processor WordPress/wordpress-develop#4125

Closed

dmsnell reviewed Mar 3, 2023

View reviewed changes

This was referenced Mar 6, 2023

Add WP_HTML_Processor WordPress/gutenberg#47573

Closed

WP_HTML_Processor: Add set_content_inside_balanced_tags WordPress/gutenberg#47036

Closed

dmsnell reviewed Mar 7, 2023

View reviewed changes

dmsnell mentioned this pull request May 4, 2023

Tag processor/dangerously replace contents WordPress/gutenberg#46080

Closed

dmsnell mentioned this pull request Jun 17, 2023

HTML-API: Introduce minimal HTML Processor. WordPress/wordpress-develop#4519

Closed

7 tasks

dmsnell mentioned this pull request Jul 27, 2023

HTML API: Adjust code styling to Gutenberg's linter's preferences WordPress/wordpress-develop#4918

Closed

adamziel force-pushed the trunk branch from bb99260 to 5412fd5 Compare December 9, 2023 16:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wp html processor text based #2

Wp html processor text based #2

adamziel commented Feb 27, 2023 •

edited

dmsnell Mar 3, 2023

adamziel Mar 3, 2023

adamziel Mar 3, 2023

dmsnell Mar 3, 2023

adamziel Mar 3, 2023 •

edited

dmsnell Mar 3, 2023

adamziel Mar 3, 2023

dmsnell Mar 3, 2023

adamziel Mar 3, 2023 •

edited

dmsnell left a comment

luisherranz commented Mar 3, 2023

adamziel commented Mar 3, 2023 •

edited

luisherranz commented Mar 3, 2023 •

edited

adamziel commented Mar 3, 2023 •

edited

dmsnell commented Mar 6, 2023

dmsnell Mar 7, 2023

adamziel commented Mar 8, 2023 •

edited

dmsnell commented Mar 10, 2023

Wp html processor text based #2

Are you sure you want to change the base?

Wp html processor text based #2

Conversation

adamziel commented Feb 27, 2023 • edited

Supported features

Missing features

Stats

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Mar 3, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Mar 3, 2023 • edited

Choose a reason for hiding this comment

dmsnell left a comment

Choose a reason for hiding this comment

luisherranz commented Mar 3, 2023

adamziel commented Mar 3, 2023 • edited

luisherranz commented Mar 3, 2023 • edited

adamziel commented Mar 3, 2023 • edited

dmsnell commented Mar 6, 2023

Choose a reason for hiding this comment

adamziel commented Mar 8, 2023 • edited

dmsnell commented Mar 10, 2023

adamziel commented Feb 27, 2023 •

edited

adamziel Mar 3, 2023 •

edited

adamziel Mar 3, 2023 •

edited

adamziel commented Mar 3, 2023 •

edited

luisherranz commented Mar 3, 2023 •

edited

adamziel commented Mar 3, 2023 •

edited

adamziel commented Mar 8, 2023 •

edited