Explore HTML parser based on the Tag Processor #4125
…king a tag closer

This commit marks the start of a bookmark one byte before the tag name start for tag openers, and two bytes before the tag name for tag closers.

Setting a bookmark on a tag should set its "start" position before the opening "<", e.g.:
```
<div> Testing a <b>Bookmark</b>
----------------^
```
The current calculation assumes this is always one byte to the left from $tag_name_starts_at. However, in tag closers that index points to a solidus symbol "/":
```
<div> Testing a <b>Bookmark</b>
----------------------------^
```
The bookmark should therefore start two bytes before the tag name:
```
<div> Testing a <b>Bookmark</b>
---------------------------^
```
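The offset rule in this commit message can be sketched in a few lines. This is an illustration in Python rather than the PHP in the commit, and `bookmark_start` is a hypothetical helper, not a method of the Tag Processor. It assumes the tag name offset is already known and that the markup is ASCII, so byte and character offsets coincide.

```python
def bookmark_start(tag_name_starts_at: int, is_closer: bool) -> int:
    """Return the offset of the "<" that opens the tag.

    Hypothetical helper: an opener looks like "<b", so "<" sits one byte
    before the tag name. A closer looks like "</b", so "<" sits two bytes
    before the tag name, with the solidus in between.
    """
    return tag_name_starts_at - (2 if is_closer else 1)


html = "<div> Testing a <b>Bookmark</b>"
opener_name_at = html.index("<b>") + 1   # offset of "b" in "<b>"
closer_name_at = html.index("</b>") + 2  # offset of "b" in "</b>"

# Both bookmarks should begin at the "<" character.
assert html[bookmark_start(opener_name_at, is_closer=False)] == "<"
assert html[bookmark_start(closer_name_at, is_closer=True)] == "<"
```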
…closers' into wp_html_processor
Frankly, I'm thinking this should be a hard "NO". I don't see the point of having an HTML parser written in PHP. It will (most likely) be large, slow, hard to make, and in need of constant maintenance. I.e., it will have to be a quite substantial, stand-alone PHP project with enough support and resources to be able to survive.
These sound good, but... Are they really that "essential"? If they really are, a much better, easier, cheaper, and a lot more forward-compatible way of doing all that would be to require something like headless Chromium, see https://github.com/chrome-php/chrome#interacting-with-dom. Imho this would open a lot of possibilities in WP, and perhaps can be implemented as "progressive enhancement" at first considering the complexity and the experimental nature.
@azaozz the DOM-based approach is a dead-end. I mostly wanted to understand the parsing spec better. I don't like how slow it is and how much memory it consumes either. I'll go ahead and close this PR. I'm noodling on a simplified version that processes HTML as a stream in adamziel#2. It's really fast and requires very little memory, but stream-processing means it cannot understand non-normative markup in all the same ways as a browser would. That being said, it can get very close. I'm thinking about supporting markup that's mostly correct, e.g.
Interesting idea! It wouldn't run on most webhosts, though, would it?
Yep, very close but no parity. That would mean there will be edge cases. Perhaps few at first, but more and more will get discovered with more usage. So the chances are it will turn into a "good thing but with a long list of exceptions and limitations"...
Hmmmm, not sure. https://github.com/chrome-php/chrome#requirements lists only PHP 7.4+ and a Chromium executable as requirements. Looking at the actual requirements, it uses a few PHP libraries, most notably the Symfony Filesystem component. Perhaps some of the requirements may need access to PHP functions that are typically disabled on many hosts (like
(This is a continuation of adamziel#1 but opened against wordpress-develop for more visibility. It's exploratory work and not something I'd like to merge in its current shape)
Description
Explores processing HTML as a tree to go beyond WP_HTML_Tag_Processor and enable these highly requested features:
set_content_inside_balanced_tags
Understanding a document tree is hard
CSS selector, innerHTML, and others all operate on a DOM tree. However, figuring out a DOM tree from HTML markup isn't a straightforward task! Consider:
Let's implement the HTML parsing spec
The WHATWG HTML spec describes in detail how to handle misnested tags, unexpected markup, and other non-normative artifacts commonly seen in HTML markup. One of the fundamental ideas it describes is the adoption agency algorithm used to rewrite the DOM tree.
But let's ditch most of it
The HTML spec is huge and contains many features indispensable for web browsers. However, we're not building a browser. We only want to handle non-normative HTML markup fragments.
For example, the parsing spec demands ignoring <tr> tags found outside of a <table>. Here's what happens if we implement that part:
That's not helpful at all!
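To make the problem concrete, here is a rough Python sketch (not the PR's PHP code) of what "ignore table-structure tags outside of a table" does to a fragment. The function name, the regex tokenizer, and the tag list are assumptions made for illustration; a real parser tracks far more state, including the rows and sections it inserts inside a table.

```python
import re

# Tags the "in body" rules ignore when no table is open (subset chosen
# for this illustration).
TABLE_ONLY_TAGS = {"caption", "col", "colgroup", "tbody", "td",
                   "tfoot", "th", "thead", "tr"}

# Naive tag matcher: good enough for a sketch, not a real tokenizer.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def drop_table_tags_outside_table(html: str) -> str:
    """Remove table-structure tags that appear while no <table> is open."""
    out = []
    pos = 0
    table_depth = 0
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # text before this tag
        pos = m.end()
        is_closer = m.group(1) == "/"
        name = m.group(2).lower()
        if name == "table":
            table_depth = max(0, table_depth + (-1 if is_closer else 1))
            out.append(m.group(0))
        elif name in TABLE_ONLY_TAGS and table_depth == 0:
            continue  # tag ignored, only its text content survives
        else:
            out.append(m.group(0))
    out.append(html[pos:])
    return "".join(out)


print(drop_table_tags_outside_table("<tr><td>1</td><td>2</td></tr>"))
```

Under these assumptions the row markup disappears entirely and only the cell text is left, which is rarely what someone processing a table fragment wants.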
The spec describes an algorithm for Parsing HTML fragments that takes an HTML string and a context element. Eventually, we could perhaps use it to treat the markup as if it was nested in a specific tag:
Still, it adds a lot of complexity AND requires developers to be mindful of the $context. I suggest not implementing it at this time.
Specific things I'd like to ditch from Tree Construction
All insertion modes targeting markup outside of document body
Let's skip the following sections and assume we're already in a document body. The parser will be permissive of weird stuff like multiple doctype declarations:
All insertion modes targeting framesets
Come on, who uses framesets?
The text insertion mode
WP_HTML_Tag_Processor already implements the script nesting rules, so we can safely assume anything not classified as a tag is text. This insertion mode doesn't offer any extra value.
Tag-specific insertion modes
We're working with HTML fragments and can't assume the context in which they will be displayed. Brownie points: by ditching these insertion modes we don't have to implement foster parenting.
Let's only implement parts of the in body insertion mode
That's all we need in order to handle:
<p>1<b>2<i>3</b>4</i>5</p>
<b>1<p>2</b>3</p>
<ul><li>1<li>2</ul>
Conveniently, this means new nodes are always inserted after the last child of the target node, which allows for a simpler text-based implementation.
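As a rough illustration of the third case, the sketch below closes an open list item when the next one starts, using only a stack of open tag names and appending to the output. It is written in Python for brevity and is not the PR's code. The function name, the regex tokenizer, and the rule of checking only the top of the stack are simplifying assumptions (the spec walks further up the stack). Closers with no matching opener are dropped.

```python
import re

# Naive tag matcher: good enough for a sketch, not a real tokenizer.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# Elements that never have a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def close_implied_list_items(html: str) -> str:
    """Emit explicit </li> closers where the markup only implies them."""
    out = []
    open_tags = []  # stack of currently open element names
    pos = 0
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # text before this tag
        pos = m.end()
        is_closer = m.group(1) == "/"
        name = m.group(2).lower()
        if not is_closer:
            # A new <li> implicitly ends the <li> that is still open.
            if name == "li" and open_tags and open_tags[-1] == "li":
                out.append("</li>")
                open_tags.pop()
            out.append(m.group(0))
            if name not in VOID_TAGS:
                open_tags.append(name)
        elif name in open_tags:
            # Close everything opened after the matching opener first.
            while open_tags:
                top = open_tags.pop()
                if top == name:
                    break
                out.append(f"</{top}>")
            out.append(m.group(0))
        # A closer without a matching opener is silently dropped.
    out.append(html[pos:])
    return "".join(out)


print(close_implied_list_items("<ul><li>1<li>2</ul>"))
```

Every new piece of output is appended at the end of what has been written so far, which is the property that makes a text-based implementation plausible.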
Important findings so far
Object-oriented DOM representation works but is inefficient
I benchmarked it by parsing the HTML parsing spec page itself, which is a 12MB HTML document:
That's awful but not surprising. This PR builds an actual document tree and uses inefficient operations such as array_splice.
A text-based version similar to WP_HTML_Tag_Processor should be much faster and more memory-efficient. Let's explore one!
Parsing HTML requires a full pass through the HTML document
The HTML spec deals with misnested nodes using the adoption agency algorithm. In the worst-case scenario, the entire document must be parsed to know even the second node.
Consider this markup:
The correct DOM would be:
The adoption agency algorithm makes the <div> a direct child of <html> only once we process the misnested </b>.
What if we built an HTML normalizer instead?
Since the entire markup must be processed upfront, this could work just as well:
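One possible shape for such a normalizer, sketched in Python rather than PHP: read the markup once and emit well-nested markup by closing overlapping formatting elements and reopening them afterwards. This is a deliberately simpler technique than the adoption agency algorithm. It only covers formatting elements that overlap each other, so a case like <b>1<p>2</b>3</p> will not match what a browser produces. The function name, the regex tokenizer, and the tag sets are assumptions made for illustration.

```python
import re

# Naive tag matcher: good enough for a sketch, not a real tokenizer.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
FORMATTING = {"a", "b", "big", "code", "em", "font", "i", "nobr",
              "s", "small", "strike", "strong", "tt", "u"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def normalize_formatting(html: str) -> str:
    """Rewrite overlapping formatting tags into properly nested markup."""
    out = []
    stack = []  # (tag name, original opener markup) of open elements
    pos = 0
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # text before this tag
        pos = m.end()
        is_closer = m.group(1) == "/"
        name = m.group(2).lower()
        if not is_closer:
            out.append(m.group(0))
            if name not in VOID_TAGS:
                stack.append((name, m.group(0)))
            continue
        if name not in [n for n, _ in stack]:
            continue  # closer without a matching opener: drop it
        # Close every element opened after the matching opener, and
        # remember the formatting elements that were cut short.
        cut_short = []
        while stack:
            top_name, top_markup = stack.pop()
            out.append(f"</{top_name}>")
            if top_name == name:
                break
            if top_name in FORMATTING:
                cut_short.append((top_name, top_markup))
        # Reopen them, outermost first, so later text keeps its styling.
        for item in reversed(cut_short):
            stack.append(item)
            out.append(item[1])
    out.append(html[pos:])
    # Close whatever is still open at the end of the input.
    for open_name, _ in reversed(stack):
        out.append(f"</{open_name}>")
    return "".join(out)


print(normalize_formatting("<p>1<b>2<i>3</b>4</i>5</p>"))
```

The output of a pass like this contains no misnested tags, so it could then be handed to a stream-based processor that assumes balanced markup.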
cc @dmsnell @ockham
Open questions
</ul> closers are ignored if there's no matching <ul> opener. I think that's fine, but I'd like to bring it up for discussion.
cc @ockham @dmsnell
Trac ticket: TODO