Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wp html processor text based #2

Open
wants to merge 43 commits into
base: trunk
Choose a base branch
from

Conversation

adamziel
Copy link
Owner

@adamziel adamziel commented Feb 27, 2023

Attempt at WordPress#4125 but without expanding the DOM tree in memory

Supported features

  • "in body" insertion mode
  • nth_child next_sibling`
  • get/set inner_html and outer_html
  • Correctly processes non-normative markup like <ul><li>1<li>2<li>3</ul>

Missing features

  • Insertion modes other than in body – they actually seem important for properly treating markup like <table><tr><td><tr><td></table>
  • Retaining active formatting elements in code like <p><b>First<p>Second. Currently, the b tag is forcibly removed from the list of active formatting elements when the second <p> is encountered
  • Support for injecting tag closers into the code – applying lexical updates before the cursor position while traversing the document is problematic
  • Support for find( $css_selector )

Stats

Current stats for parsing the 13MB single page HTML parsing spec document:

Mem peak usage: 38.6MB
time: 11.61s

That's pretty good!

cc @dmsnell @ockham

…king a tag closer

This commit marks the start of a bookmark one byte before
the tag name start for tag openers, and two bytes before
the tag name for tag closers.

Setting a bookmark on a tag should set its "start" position
before the opening "<", e.g.:

```
<div> Testing a <b>Bookmark</b>
----------------^
```

The current calculation assumes this is always one byte
to the left from $tag_name_starts_at.

However, in tag closers that index points to a solidus
symbol "/":

```
<div> Testing a <b>Bookmark</b>
----------------------------^
```

The bookmark should therefore start two bytes before the tag name:

```
<div> Testing a <b>Bookmark</b>
---------------------------^
```
@adamziel adamziel changed the base branch from trunk to wp_html_processor February 27, 2023 15:21
@adamziel adamziel changed the base branch from wp_html_processor to trunk February 27, 2023 15:22
dbg( "Found {$this->current_token->tag} tag opener" );
switch ( $this->current_token->tag ) {
case 'HTML':
$this->drop_current_tag_token();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose here? it seems like we're modifying a document that nobody asked to modify

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specific case is a hack I added to prevent a failure upon finding an HTML tag. It shouldn't be needed anymore once we look into other insertion modes. Now that I think about it, perhaps a return statement to ignore it would be a better hack.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the general drop_current_tag_token function, it used to be ignore_token. Let's make it ignore_token again – there's no need to make such uninvited updates.

if ( $this->is_element_in_button_scope( 'P' ) ) {
$this->close_p_element();
}
if ( in_array( $this->current_node()->tag, array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' ) ) ) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this implied by the case waterfall?

Copy link
Owner Author

@adamziel adamziel Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch checks the $this->current_token->tag while this check looks into $this->current_node() – the most recent entry in the open elements stack. In other words, this ensures that h2 in markup like <div><h1>Primary heading <h2> Secondary heading</div> does not become a child of h1

if($token !== $this->current_token) {
// Aesthetic choice for now.
// @TODO: discuss it with the team
$tag = strtolower($token->tag);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of respecting what someone sends. if they send dIv then that's fine.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me 👍

)
);
}
array_push($this->open_elements, $token);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't matter right now, but I'm strongly opposed to offering a function which would allow people to add a partial tag, an opener without a closer. adding a void or self-closing foreign element seems fine, but leaving an open element on the stack seems to be an invitation for trouble

Copy link
Owner Author

@adamziel adamziel Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! I didn't mean to offer this method to people. insert_element is a private API that implements the insert an HTML element for a token part of the parsing spec.

In the DOM PR, it was quite a bit more complex, but in this stream processing version it all collapses to inserting a tag opener and pushing an element to stack.

Copy link

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a thing I'm most worried about with partial HTMl support is properly indicating that we bailed. if we return false I wonder if that's enough to communicate that we didn't change anything. same goes if we return the original un-modified string.

I don't have an answer here; I'm just raising the question

@luisherranz
Copy link

a thing I'm most worried about with partial HTMl support is properly indicating that we bailed

I'm also been wondering what would be the ideal API for that.

Maybe throwing an error:

try {
  $p = new WP_HTML_Processor( $html_which_may_be_broken );
  // process...
} catch (Exception $e) {
  // The HTML was broken and it couldn't be processed, do whatever you need
  // to do in this situation.
}

@adamziel
Copy link
Owner Author

adamziel commented Mar 3, 2023

I have a related worry – what if we need to bale out after applying some updates already? For example:

$p = new WP_HTML_Processor( '<div></div><b><section></b></section>' );
$p->find('div');
$p->set_inner_html('Yay!');
$p->get_updated_html();
// <div>Yay!</div><b><section></b></section>

$p->find('section');
$p->set_inner_html('Yay!'); // throws? or returns false?
$p->get_updated_html(); // <div>Yay!</div><b><section></b></section>

$p->find('div'); // does this return true? or are we in unrecoverable state now?

I can only see two ways to proceed:

  1. Apply updates in the supported part of the markup but refuse to do anything in the unsupported part
  2. Scan the rest of the document on first get_updated_html and make sure we're dealing with a supported markup

I can't easily tell if 1 is safe. Intuitively it should be since it's dealing only with the markup that's well-formed. If it's not safe, though, we're stuck with parsing the entire markup at least once.

@luisherranz
Copy link

luisherranz commented Mar 3, 2023

My idea of how this was supposed to work was that $p->get_updated_html() was final, and it would go all the way to the end to make sure the rest of the HTML is OK and then return the modified value. If it's not ok, it would throw.

And my throw-an-error suggestion was based on the assumption that it can fail at any time (even on the first $p->find('div')), and when it does, there's no reason to keep processing the rest.

But based on your comment, $p->get_updated_html() is not the final command, it can be called in the middle of the processing. Is that correct?

I guess the throw-an-error idea only works if you can't get any modified HTML until the very end. I mean, getting the modified HTML finalizes the processing.

@adamziel
Copy link
Owner Author

adamziel commented Mar 3, 2023

But based on your comment, $p->get_updated_html() is not the final command, it can be called in the middle of the processing. Is that correct?

In Tag Processor $p->get_updated_html() can be called anytime during processing. In fact, it's even called internally in seek() as a way of flushing the enqueuec updates.

The big reason is performance. Tag Processor won't process any markup it doesn't have to process. This means you can update the first ten tags in a very long document and never bear the cost of parsing all of it.

Performance is even more of a concern in HTML Processor but correctness matters even more. The big question is whether it's okay to make a partial update. My intuition says it should be fine, but I need to think about it more. If we need to go through the entire thing, throwing an exception sounds nice. I actually wonder – can we throw in WordPress core? I don't see many instances of that aside of external libraries that were merged at one point.

@dmsnell
Copy link

dmsnell commented Mar 6, 2023

Maybe throwing an error:

We try really hard to avoid throwing in Core because we don't want to crash peoples' sites. As convenient as it can be at times, a corrupted site is generally preferable to a white screen.

what if we need to bale out after applying some updates already?

For the HTML processor as we talk through these things I think the only viable way to approach this at least as an initial run is to scan through the entire document on get_udpated_html(). That means we couldn't have already-processed changes and then later find unexpected inputs.

It doesn't mean we have to give up performance for this aspect. On that initial modification operation we can simply finish scanning and verify the document structure. We might even be able to set an internal flag indicating that the rest of the document is fine, letting us proceed after that on the same processor instance, without re-scanning.

Finally we can iterate to look at the impact of finding un-opened tags - I can't remember off the top of my head - because if we're at a point in the document where tags are balanced I don't see a problem making modifications there and then bailing only when the problems appear.

There is a can of worms here and I think we're closer to settling on an interface, but I bet that's still three and a half weeks away. This HTML processor operates on HTML that exists within a broader HTML document so there is the likelihood of partial input or unexpected context which would flavor what we're doing within the domain of the processor itself. i.e. Some things that look fine to us won't be inside the page just as some things that look broken to us might be fine when stitched back in to the page.

As for errors, if it's not enough to return false (which maybe it is), we might look to a Maybe/Optional type with something like these:

  • [true, $updated_html] / [false]
  • [$updated_html] / []
  • [$updated_html], / [false, $error_type]

$active_formatting_elements = $this->active_formatting_elements;

/**
* seek() will rewing before the current tag
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: rewing > rewind

@adamziel
Copy link
Owner Author

adamziel commented Mar 8, 2023

I think the only viable way to approach this at least as an initial run is to scan through the entire document on get_udpated_html()

@dmsnell Why would you say this is the only viable approach? I’m still on the fence about this. Is there anything wrong with modifying the well-formed part of the markup? Maybe let’s just allow it and only refuse to process things once we run into unsupported markup?

Finally we can iterate to look at the impact of finding un-opened tags - I can't remember off the top of my head - because if we're at a point in the document where tags are balanced I don't see a problem making modifications there and then bailing only when the problems appear.

Most of the time they're ignored or, IIRC, assumed to close the current tag.

As for errors, if it's not enough to return false (which maybe it is), we might look to a Maybe/Optional type with something like these:

I'd rather stick to false. It's standard in WordPress, not surprising, and simple to check for. I love optionals, but I'd rather not introduce yet another coding pattern. Alternatively, we could return a WP_Error instance.

@dmsnell
Copy link

dmsnell commented Mar 10, 2023

Alternatively, we could return a WP_Error instance.

yeah I always forget about that one (probably because I don't like it 🙃)

Why would you say this is the only viable approach?

it was based on some things you said, which we started discussing outside of this issue. if we don't have to look out for tags later on that would change tags beforehand then what I said is wrong.

Support for find( $css_selector )

I think it's fine to not have this, or if we add it, to only add it in late once we have that in another. class. I think we can build it purely from next_sibling(), next_tag(), and next_child()


one of the things I love about this is another part of what I am nervous about: the big switch statement handling tag openings.

I'm wondering if we could blend the idea of a WP_HTML_Spec class/database to flag some of those decisions and focus the logic here on the decisions, less on the tags.

I've thought that it might help to grab the reference to a class once and then reference static flags on that class multiple times.

still the tradeoff between supporting things we can vs. the simplicity of supporting only a well-defined small subset irks me. my partner PR doesn't support single-page.html so I can't give any performance comparisons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants