Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore HTML parser based on the Tag Processor #4125

Closed
wants to merge 15 commits into from

Conversation

adamziel
Copy link
Contributor

@adamziel adamziel commented Feb 24, 2023

(This is a continuation of adamziel#1 but opened against wordpress-develop for more visibility. It's exploratory work and not something I'd like to merge in its current shape)

Description

Explores processing HTML as a tree to go beyond WP_HTML_Tag_Processor and enable these highly requested features:

Understanding a document tree is hard

CSS selector, innerHTML, and others all operate on a DOM tree. However, figuring out a DOM tree from HTML markup isn't a straightforward task! Consider:

new DOMParser().parseFromString("<p>Tree <b>structure</p> isn't </b> easy", "text/html").body.innerHTML

"<p>Tree <b>structure</b></p><b> isn't </b> easy"

Let's implement the HTML parsing spec

The WHATWG HTML spec describes in detail how to handle misnested tags, unexpected markup, and other non-normative artifacts commonly seen in HTML markup. One of the fundamental ideas it describes is the adoption agency algorithm used to rewrite the DOM tree.

But let's ditch most of it

HTML spec is huge and contains many features indispensable for web browsers. However, we're not building a browser. We only want to handle non-normative HTML markup fragments.

For example, the parsing spec demands ignoring <tr> tags found outside of a <table>. Here's what happens if we implement that part:

$p = new WP_HTML_Processor('<tr><td><b>Test!</b></td></tr>');
echo $p->get_updated_html();
// <b>Test!</b>

That's not helpful at all!

The spec describes an algorithm for Parsing HTML fragments that takes an HTML string and a context elements. Eventually, we could perhaps use it to treat the markup as if it was nested in a specific tag:

$p = new WP_HTML_Processor('<tr><td><b>Test!</b></td></tr>', '<table>');
echo $p->get_updated_html();
// <tr><td><b>Test!</b></td></tr>

Still, it adds a lot of complexity AND requires developers to be mindful of the $context. I suggest not implementing it at this time.

Specific things I'd like to ditch from Tree Construction

All insertion modes targeting markup outside of document body

Let's skip the following sections and assume we're already in a document body. The parser will be permissive of weird stuff like multiple doctype declarations:

  1. 13.2.6.4.1 The "initial" insertion mode
  2. 13.2.6.4.2 The "before html" insertion mode
  3. 13.2.6.4.3 The "before head" insertion mode
  4. 13.2.6.4.4 The "in head" insertion mode
  5. 13.2.6.4.5 The "in head noscript" insertion mode
  6. 13.2.6.4.6 The "after head" insertion mode
  7. 13.2.6.4.19 The "after body" insertion mode
  8. 13.2.6.4.22 The "after after body" insertion mode

All insertion modes targeting framesets

Come on, who uses framesets?

  1. 13.2.6.4.20 The "in frameset" insertion mode
  2. 13.2.6.4.21 The "after frameset" insertion mode
  3. 13.2.6.4.23 The "after after frameset" insertion mode

The text insertion mode

WP_HTML_Tag_Processor already implements the script nesting rules, so we can safely assume anything not classified as a tag is a text. This insertion mode doesn't offer any extra value.

Tag-specific insertion modes

We're working with HTML fragments and can't assume the context in which it will be displayed. Brownie points – by ditching these insertion modes we don't have to implement foster parenting.

  1. 13.2.6.4.9 The "in table" insertion mode
  2. 13.2.6.4.10 The "in table text" insertion mode
  3. 13.2.6.4.11 The "in caption" insertion mode
  4. 13.2.6.4.12 The "in column group" insertion mode
  5. 13.2.6.4.13 The "in table body" insertion mode
  6. 13.2.6.4.14 The "in row" insertion mode
  7. 13.2.6.4.15 The "in cell" insertion mode
  8. 13.2.6.4.16 The "in select" insertion mode
  9. 13.2.6.4.17 The "in select in table" insertion mode
  10. 13.2.6.4.18 The "in template" insertion mode

Let's only implement parts of the in body insertion mode

That's all we need in order to handle:

  • Misnested formatting elements (<p>1<b>2<i>3</b>4</i>5</p>)
  • Misnested blocks (<b>1<p>2</b>3</p>)
  • Misnested list items (<ul><li>1<li>2</ul>)

Conveniently, this means new nodes are always inserted after the last child of the target node = simpler text-based implementation.

Important findings so far

Object-oriented DOM representation works but is inefficient

I benchmarked it on the HTML parsing spec itself, which is a 12MB HTML document:

I tried parsing the HTML spec page (12MB):

Mem peak usage: 499MB
Time: 30.60s

That's awful but not surprising. This PR builds an actual document tree and uses inefficient operations such as array_splice.

A text-based version similar to WP_HTML_Tag_Processor should be much faster and more memory-efficient. Let's explore one!

Parsing HTML requires a full pass through the HTML document

HTML spec deals with misnested nodes using Adoption Agency Algorithm. In the worst-case scenario, the entire document must be parsed to know even the second node.

Consider this markup:

<b>
 <div>
    <div><!-- 100k tags amounting to 2 MB of normative HTML --></div>
    </b> <!-- suddenly, a rogue </b> -->
  </div>
</b>

The correct DOM would be:

B
DIV
└─ B
      └─ DIV (with 100k tags)

The adoption agency algorithm makes the <div> a direct child of <html> only once we process the misnested </b>.

What if we built an HTML normalizer instead?

Since the entire markup must be processed upfront, this could work just as well:

class WP_HTML_Processor {

     public function __construct( $html, $options ) {
         // Apply HTML parsing rules first, unless explicitly asked not to
         if ( true !== $options['is_normative'] ) {
              $html = WP_HTML_Normalizer::normalize( $html );
         }

         // From now on, we assume normative markup
         $this->html = $html;
     }

     public function next_by_css( $selector );
     public function set_inner_html( $html );

     // ...

cc @dmsnell @ockham

Open questions

  • What can we assume about the context? Even in the in body insertion mode </ul> closers are ignored if there's no matching <ul> opener. I think that's fine, but I'd like to bring it up for discussion.
  • What to do with SVG and MathML foreign elements? If they have no conflicting tag names, then we could perhaps let them be handled by the default in body insertion rules.

cc @ockham @dmsnell

Trac ticket: TODO

…king a tag closer

This commit marks the start of a bookmark one byte before
the tag name start for tag openers, and two bytes before
the tag name for tag closers.

Setting a bookmark on a tag should set its "start" position
before the opening "<", e.g.:

```
<div> Testing a <b>Bookmark</b>
----------------^
```

The current calculation assumes this is always one byte
to the left from $tag_name_starts_at.

However, in tag closers that index points to a solidus
symbol "/":

```
<div> Testing a <b>Bookmark</b>
----------------------------^
```

The bookmark should therefore start two bytes before the tag name:

```
<div> Testing a <b>Bookmark</b>
---------------------------^
```
@adamziel adamziel changed the title Explore HTML parser based on the Tag Processor (Adoption Agency Algorithm) Explore HTML parser based on the Tag Processor Feb 24, 2023
@azaozz
Copy link
Contributor

azaozz commented Feb 28, 2023

Frankly I'm thinking this should be a hard "NO". Don't see the point of having an HTML parser written in PHP. Will (most likely) be large, slow, hard to make, and in need of constant maintenance. I.e will have to be a quite substantial, stand-alone PHP project with enough support and resources to be able to survive.

enable these highly requested features:

These sound good, but... Are they really that "essential"? If they really are, a much better, easier, cheaper, and a lot more forward-compatible way of doing all that would be to require something like headless Chromium, see https://github.com/chrome-php/chrome#interacting-with-dom. Imho this would open a lot of possibilities in WP, and perhaps can be implemented as "progressive enhancement" at first considering the complexity and the experimental nature.

@adamziel
Copy link
Contributor Author

adamziel commented Mar 2, 2023

Frankly I'm thinking this should be a hard "NO". Don't see the point of having an HTML parser written in PHP. Will (most likely) be large, slow, hard to make, and in need of constant maintenance. I.e will have to be a quite substantial, stand-alone PHP project with enough support and resources to be able to survive.

@azaozz the DOM-based approach is a dead-end. I mostly wanted to understand the parsing spec better. I don't like how slow it is and how much memory it consumes either. I'll go ahead and close this PR.

I'm noodling on a simplified version that processes HTML as a stream in adamziel#2 It's really fast and requires very little memory, but stream-processing means it cannot understand non-normative markup in all the same ways as a browser would.

That being said, it can get very close. I'm thinking about supporting markup that's mostly correct, e.g. <ul><li>1<li>2<li>3</ul> or <div>1</span>2</div> while refusing to process anything that's hard to support with a stream parser, e.g. <table><b>ABC</b></table>. I'm not sure where that will go yet.

would be to require something like headless Chromium, see https://github.com/chrome-php/chrome#interacting-with-dom. Imho this would open a lot of possibilities in WP, and perhaps can be implemented as "progressive enhancement" at first considering the complexity and the experimental nature.

Interesting idea! It wouldn't run on most webhosts, though, would it?

@adamziel adamziel closed this Mar 2, 2023
@azaozz
Copy link
Contributor

azaozz commented Mar 3, 2023

That being said, it can get very close.

Yep, very close but no parity. That would mean there will be edge cases. Perhaps few at first, but more and more will get discovered with more usage. So the chances are it will turn into a "good thing but with a long list of exceptions and limitations"...

Interesting idea! It wouldn't run on most webhosts, though, would it?

Hmmmm, not sure. https://github.com/chrome-php/chrome#requirements lists only PHP 7.4+ and a Chromium executable as a requirement. Looking at the actual requirements, it uses a few PHP libraries, most notably the Symphony Filesystem component.

Perhaps some of the requirements may need access to PHP functions that are typically disabled on many hosts (like shell_exec()) but worth a look/try imho. Other than that there may be a need for some permissions juggling, but thinking that would work everywhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants