Explore HTML parser based on the Tag Processor #4125

adamziel · 2023-02-24T10:54:04Z

(This is a continuation of adamziel#1 but opened against wordpress-develop for more visibility. It's exploratory work and not something I'd like to merge in its current shape)

Description

Explores processing HTML as a tree to go beyond WP_HTML_Tag_Processor and enable these highly requested features:

set_content_inside_balanced_tags
Finding HTML nodes via a CSS selector
Inserting new HTML nodes at a correct location

Understanding a document tree is hard

CSS selectors, innerHTML, and similar APIs all operate on a DOM tree. However, figuring out a DOM tree from HTML markup isn't a straightforward task! Consider:

new DOMParser().parseFromString("<p>Tree <b>structure</p> isn't </b> easy", "text/html").body.innerHTML

"<p>Tree <b>structure</b></p><b> isn't </b> easy"

Let's implement the HTML parsing spec

The WHATWG HTML spec describes in detail how to handle misnested tags, unexpected markup, and other non-normative artifacts commonly seen in HTML markup. One of the fundamental ideas it describes is the adoption agency algorithm used to rewrite the DOM tree.

But let's ditch most of it

The HTML spec is huge and contains many features indispensable for web browsers. However, we're not building a browser. We only want to handle non-normative HTML markup fragments.

For example, the parsing spec demands ignoring <tr> tags found outside of a <table>. Here's what happens if we implement that part:

$p = new WP_HTML_Processor('<tr><td><b>Test!</b></td></tr>');
echo $p->get_updated_html();
// <b>Test!</b>

That's not helpful at all!

The spec describes an algorithm for parsing HTML fragments that takes an HTML string and a context element. Eventually, we could perhaps use it to treat the markup as if it was nested in a specific tag:

$p = new WP_HTML_Processor('<tr><td><b>Test!</b></td></tr>', '<table>');
echo $p->get_updated_html();
// <tr><td><b>Test!</b></td></tr>

Still, it adds a lot of complexity AND requires developers to be mindful of the $context. I suggest not implementing it at this time.

Specific things I'd like to ditch from Tree Construction

All insertion modes targeting markup outside of document body

Let's skip the following sections and assume we're already in a document body. The parser will be permissive of weird stuff like multiple doctype declarations:

All insertion modes targeting framesets

Come on, who uses framesets?

The text insertion mode

WP_HTML_Tag_Processor already implements the script nesting rules, so we can safely assume anything not classified as a tag is text. This insertion mode doesn't offer any extra value.

Tag-specific insertion modes

We're working with HTML fragments and can't assume the context in which they will be displayed. Brownie points – by ditching these insertion modes we don't have to implement foster parenting.

Let's only implement parts of the in body insertion mode

That's all we need in order to handle:

Misnested formatting elements (<p>1<b>2<i>3</b>4</i>5</p>)
Misnested blocks (<b>1<p>2</b>3</p>)
Misnested list items (<ul><li>1<li>2</ul>)

Conveniently, this means new nodes are always inserted after the last child of the target node, which allows for a simpler text-based implementation.
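As a rough illustration of the list-item case above, here's a minimal sketch (not code from this PR) of a text-based pass that closes an open <li> when the next <li> or the list closer arrives. The function name close_implied_list_items is hypothetical, and the regex tokenizer is a deliberately naive assumption: it ignores comments, void elements, and attributes containing ">", and it only looks at the innermost open element rather than applying the spec's scope rules.

```php
<?php
/**
 * Hypothetical helper: adds the </li> closers that a browser would imply.
 * Naive sketch: only checks the innermost open element.
 */
function close_implied_list_items( string $html ): string {
    // Split into tag tokens and text tokens, keeping the tags.
    $tokens = preg_split(
        '~(</?[a-zA-Z][^>]*>)~',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $open_elements = array();
    $output        = '';

    foreach ( $tokens as $token ) {
        if ( ! preg_match( '~^<(/?)([a-zA-Z][a-zA-Z0-9]*)~', $token, $m ) ) {
            $output .= $token; // Plain text, copy as-is.
            continue;
        }

        $is_closer = '/' === $m[1];
        $tag       = strtolower( $m[2] );

        // A new <li>, or the end of the list, implies closing the open <li>.
        $implies_li_end = $is_closer
            ? in_array( $tag, array( 'ul', 'ol' ), true )
            : 'li' === $tag;

        if ( $implies_li_end && 'li' === end( $open_elements ) ) {
            array_pop( $open_elements );
            $output .= '</li>';
        }

        if ( ! $is_closer ) {
            $open_elements[] = $tag;
        } elseif ( end( $open_elements ) === $tag ) {
            array_pop( $open_elements );
        }

        $output .= $token;
    }

    return $output;
}

echo close_implied_list_items( '<ul><li>1<li>2</ul>' ), "\n";
// <ul><li>1</li><li>2</li></ul>
```

Because every implied closer is appended at the current write position, the pass never has to go back and rewrite earlier output.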

Important findings so far

Object-oriented DOM representation works but is inefficient

I benchmarked it on the HTML parsing spec page itself, which is a 12MB HTML document:

Mem peak usage: 499MB
Time: 30.60s

That's awful but not surprising. This PR builds an actual document tree and uses inefficient operations such as array_splice.

A text-based version similar to WP_HTML_Tag_Processor should be much faster and more memory-efficient. Let's explore one!

Parsing HTML requires a full pass through the HTML document

The HTML spec deals with misnested formatting elements using the adoption agency algorithm. In the worst-case scenario, the entire document must be parsed to know even the second node.

Consider this markup:

<b>
  <div>
    <div><!-- 100k tags amounting to 2 MB of normative HTML --></div>
    </b> <!-- suddenly, a rogue </b> -->
  </div>
</b>

The correct DOM would be:

B
DIV
└─ B
      └─ DIV (with 100k tags)

The adoption agency algorithm makes the <div> a direct child of <html> only once we process the misnested </b>.
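To make the worst case concrete, here's a rough sketch (not code from this PR) that scans for the first closer that doesn't match the innermost open element. The helper name find_first_misnested_closer is hypothetical, and the regex tokenizer is a simplifying assumption that ignores comments and void elements. The point is only that the rogue </b> can sit arbitrarily far into the document, so the tree position of the very first <div> isn't known until the scan reaches it.

```php
<?php
/**
 * Hypothetical helper: returns the byte offset of the first closing tag
 * that does not match the innermost open element, or null if there is none.
 */
function find_first_misnested_closer( string $html ): ?int {
    preg_match_all(
        '~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>~',
        $html,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $open_elements = array();
    foreach ( $matches as $match ) {
        $tag = strtolower( $match[2][0] );

        if ( '/' !== $match[1][0] ) {
            $open_elements[] = $tag; // Tag opener.
            continue;
        }

        if ( end( $open_elements ) !== $tag ) {
            return $match[0][1]; // Byte offset of the misnested closer.
        }
        array_pop( $open_elements );
    }

    return null;
}

// 1000 well-formed paragraphs standing in for the "100k tags" above.
$filler = str_repeat( '<p>text</p>', 1000 );
$html   = '<b><div><div>' . $filler . '</div></b></div>';

echo find_first_misnested_closer( $html ), "\n";
// 11019
```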

What if we built an HTML normalizer instead?

Since the entire markup must be processed upfront, this could work just as well:

class WP_HTML_Processor {

    private $html;

    public function __construct( $html, $options = array() ) {
        // Apply HTML parsing rules first, unless explicitly asked not to
        if ( ! isset( $options['is_normative'] ) || true !== $options['is_normative'] ) {
            $html = WP_HTML_Normalizer::normalize( $html );
        }

        // From now on, we assume normative markup
        $this->html = $html;
    }

    public function next_by_css( $selector ) { /* ... */ }
    public function set_inner_html( $html ) { /* ... */ }

    // ...
}

cc @dmsnell @ockham

Open questions

What can we assume about the context? Even in the in body insertion mode, </ul> closers are ignored if there's no matching <ul> opener. I think that's fine, but I'd like to bring it up for discussion.
What to do with SVG and MathML foreign elements? If they have no conflicting tag names, then we could perhaps let them be handled by the default in body insertion rules.
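For the first question, the behavior under discussion could look roughly like the sketch below (not code from this PR): a closer with no matching opener on the stack of open elements is simply dropped. The name drop_unmatched_closers is hypothetical, the tokenizer is a naive regex, and the check is an assumption that is looser than the spec's "element in scope" rules.

```php
<?php
/**
 * Hypothetical helper: removes closing tags that have no matching opener.
 * Naive sketch: ignores comments, void elements, and scope boundaries.
 */
function drop_unmatched_closers( string $html ): string {
    $tokens = preg_split(
        '~(</?[a-zA-Z][^>]*>)~',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $open_elements = array();
    $output        = '';

    foreach ( $tokens as $token ) {
        if ( ! preg_match( '~^<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>$~', $token, $m ) ) {
            $output .= $token; // Plain text, copy as-is.
            continue;
        }

        $tag = strtolower( $m[2] );

        if ( '/' !== $m[1] ) {
            $open_elements[] = $tag;
            $output         .= $token;
            continue;
        }

        // Find the innermost open element with the same name, if any.
        $index = array_search( $tag, array_reverse( $open_elements, true ), true );
        if ( false === $index ) {
            continue; // No matching opener: ignore the closer.
        }

        // Everything opened after the match is considered closed too.
        $open_elements = array_slice( $open_elements, 0, $index );
        $output       .= $token;
    }

    return $output;
}

echo drop_unmatched_closers( '<li>1</ul><p>2</p>' ), "\n";
// <li>1<p>2</p>
```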

cc @ockham @dmsnell

Trac ticket: TODO

…king a tag closer

This commit marks the start of a bookmark one byte before the tag name start for tag openers, and two bytes before the tag name for tag closers.

Setting a bookmark on a tag should set its "start" position before the opening "<", e.g.:

```
<div> Testing a <b>Bookmark</b>
----------------^
```

The current calculation assumes this is always one byte to the left from $tag_name_starts_at. However, in tag closers that index points to a solidus symbol "/":

```
<div> Testing a <b>Bookmark</b>
----------------------------^
```

The bookmark should therefore start two bytes before the tag name:

```
<div> Testing a <b>Bookmark</b>
---------------------------^
```
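The offset rule from this commit message can be sketched as a small standalone helper. The function name bookmark_start and its parameters are hypothetical and only illustrate the arithmetic. The actual change lives inside WP_HTML_Tag_Processor and may be structured differently.

```php
<?php
/**
 * Hypothetical helper: byte offset of the "<" that starts a tag, given the
 * byte offset of its tag name.
 */
function bookmark_start( int $tag_name_starts_at, bool $is_closing_tag ): int {
    // Openers have "<" before the name, closers have "</".
    return $tag_name_starts_at - ( $is_closing_tag ? 2 : 1 );
}

$html           = '<div> Testing a <b>Bookmark</b>';
$opener_name_at = strpos( $html, 'b>' );  // Tag name of <b>, offset 17.
$closer_name_at = strrpos( $html, 'b>' ); // Tag name of </b>, offset 29.

echo bookmark_start( $opener_name_at, false ), "\n"; // 16
echo bookmark_start( $closer_name_at, true ), "\n";  // 27
```

Both results point at a "<" character in $html, which matches the caret positions in the first and third diagrams.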


azaozz · 2023-02-28T01:01:45Z

Frankly I'm thinking this should be a hard "NO". Don't see the point of having an HTML parser written in PHP. Will (most likely) be large, slow, hard to make, and in need of constant maintenance. I.e. it will have to be a quite substantial, stand-alone PHP project with enough support and resources to be able to survive.

enable these highly requested features:

WP_HTML_Processor: Add set_content_inside_balanced_tags gutenberg#47036

Finding HTML nodes via a CSS selector

Inserting new HTML nodes at a correct location

These sound good, but... Are they really that "essential"? If they really are, a much better, easier, cheaper, and a lot more forward-compatible way of doing all that would be to require something like headless Chromium, see https://github.com/chrome-php/chrome#interacting-with-dom. Imho this would open a lot of possibilities in WP, and perhaps can be implemented as "progressive enhancement" at first considering the complexity and the experimental nature.

adamziel · 2023-03-02T12:57:29Z

Frankly I'm thinking this should be a hard "NO". Don't see the point of having an HTML parser written in PHP. Will (most likely) be large, slow, hard to make, and in need of constant maintenance. I.e. it will have to be a quite substantial, stand-alone PHP project with enough support and resources to be able to survive.

@azaozz the DOM-based approach is a dead-end. I mostly wanted to understand the parsing spec better. I don't like how slow it is and how much memory it consumes either. I'll go ahead and close this PR.

I'm noodling on a simplified version that processes HTML as a stream in adamziel#2. It's really fast and requires very little memory, but stream-processing means it cannot understand non-normative markup in all the same ways as a browser would.

That being said, it can get very close. I'm thinking about supporting markup that's mostly correct, e.g. <ul><li>1<li>2<li>3</ul> or <div>1</span>2</div> while refusing to process anything that's hard to support with a stream parser, e.g. <table><b>ABC</b></table>. I'm not sure where that will go yet.

would be to require something like headless Chromium, see https://github.com/chrome-php/chrome#interacting-with-dom. Imho this would open a lot of possibilities in WP, and perhaps can be implemented as "progressive enhancement" at first considering the complexity and the experimental nature.

Interesting idea! It wouldn't run on most webhosts, though, would it?

azaozz · 2023-03-03T19:08:59Z

That being said, it can get very close.

Yep, very close but no parity. That would mean there will be edge cases. Perhaps few at first, but more and more will get discovered with more usage. So the chances are it will turn into a "good thing but with a long list of exceptions and limitations"...

Interesting idea! It wouldn't run on most webhosts, though, would it?

Hmmmm, not sure. https://github.com/chrome-php/chrome#requirements lists only PHP 7.4+ and a Chromium executable as requirements. Looking at the actual requirements, it uses a few PHP libraries, most notably the Symfony Filesystem component.

Perhaps some of the requirements may need access to PHP functions that are typically disabled on many hosts (like shell_exec()), but it's worth a look/try imho. Other than that, there may be a need for some permissions juggling, but I'm thinking that would work everywhere.

adamziel added 15 commits February 21, 2023 18:13

Explore HTML parsing and Adoption Agency Algorithm

7e7602c

Emit text tokens

0bdd4f6

Merge branch 'html-api-start-bookmarks-before-tag-opener-also-in-tag-…

a24349a

…closers' into wp_html_processor

Consume HTML text nodes as tokens

ff9505b

Implement DOM insertion

afbfdc5

Fix a bug in the adoption agency algorithm

9d31cb7

Correctly cose the p tags

eeea95a

Simplify HTML Processor

ddf2c73

Correct the is_element_in_scope checks

db40a94

Uncomment some test inputs

ea4f392

Document insert_node

66fd636

Simplify ignore_token()

93fea6c

Start exploring a text-based API

fd2ddcf

Doodling more

faf724e

adamziel mentioned this pull request Feb 24, 2023

Explore HTML parsing and Adoption Agency Algorithm adamziel/wordpress-develop#1

Closed

adamziel changed the title ~~Explore HTML parser based on the Tag Processor (Adoption Agency Algorithm)~~ Explore HTML parser based on the Tag Processor Feb 24, 2023

adamziel mentioned this pull request Mar 2, 2023

Wp html processor text based adamziel/wordpress-develop#2

Open

adamziel closed this Mar 2, 2023
