Parser: Propose new hand-coded parser #8083
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Extending the Parser | ||
|
||
While the editor is working with blocks, it stores them in memory as data structures comprising a few basic properties and attributes. When a post is saved, we serialize these data structures into a specific HTML structure and save the resulting string into the `post_content` property of the post in the WordPress database. When we load that post back into the editor, we have to make the reverse transformation to rebuild those data structures from the serialized HTML. | ||
|
||
The process of loading the serialized HTML into the editor is performed by the _block parser_. The formal specification for this transformation is encoded in the parsing expression grammar (PEG) inside the `@wordpress/block-serialization-spec-parser` package. The editor provides a default parser implementation of this grammar but there may be various reasons for replacing that implementation with a custom implementation. We can inject our own custom parser implementation through the appropriate filter. | ||
|
||
## Server-side parser | ||
|
||
Plugins have access to the parser if they want to process posts in their structured form instead of a plain HTML-as-string representation. | ||
|
||
## Client-side parser | ||
|
||
The editor uses the client-side parser while interactively working in a post. The plain HTML-as-string representation is sent to the browser by the backend and then the editor performs the first parse to initialize itself. | ||
|
||
## Filters | ||
|
||
To replace the server-side parser, use the `block_parser_class` filter. The filter receives the class name of the current parser as a string and returns the class name of the parser to use. That class is expected to expose a `parse` method. | ||
|
||
_Example:_ | ||
|
||
```php | ||
class EmptyParser { | ||
    public function parse( $post_content ) { | ||
        // return an empty document | ||
        return array(); | ||
    } | ||
} | ||
|
||
function my_plugin_select_empty_parser( $prev_parser_class ) { | ||
    return 'EmptyParser'; | ||
} | ||
|
||
add_filter( 'block_parser_class', 'my_plugin_select_empty_parser', 10, 1 ); | ||
``` | ||
|
||
> **Note**: At the present time it's not possible to replace the client-side parser. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
package-lock=false |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
## 1.0.0 | ||
|
||
- Initial release. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,124 @@ | ||
# Block Serialization Default Parser | ||
|
||
This library contains the default block serialization parser implementations for WordPress documents. It provides native PHP and JavaScript parsers that implement the specification from `@wordpress/block-serialization-spec-parser` and that normally operate on the document stored in `post_content`. | ||
|
||
## Installation | ||
|
||
Install the module | ||
|
||
```bash | ||
npm install @wordpress/block-serialization-default-parser --save | ||
``` | ||
|
||
## Usage | ||
|
||
Input post: | ||
```html | ||
<!-- wp:columns {"columns":3} --> | ||
<div class="wp-block-columns has-3-columns"><!-- wp:column --> | ||
<div class="wp-block-column"><!-- wp:paragraph --> | ||
<p>Left</p> | ||
<!-- /wp:paragraph --></div> | ||
<!-- /wp:column --> | ||
|
||
<!-- wp:column --> | ||
<div class="wp-block-column"><!-- wp:paragraph --> | ||
<p><strong>Middle</strong></p> | ||
<!-- /wp:paragraph --></div> | ||
<!-- /wp:column --> | ||
|
||
<!-- wp:column --> | ||
<div class="wp-block-column"></div> | ||
<!-- /wp:column --></div> | ||
<!-- /wp:columns --> | ||
``` | ||
|
||
Parsing code: | ||
```js | ||
import { parse } from '@wordpress/block-serialization-default-parser'; | ||
|
||
// `post` holds the serialized input shown above. | ||
const blocks = parse( post ); | ||
// `blocks` is now deep-equal to the following structure: | ||
[ | ||
    { | ||
        blockName: "core/columns", | ||
        attrs: { | ||
            columns: 3 | ||
        }, | ||
        innerBlocks: [ | ||
            { | ||
                blockName: "core/column", | ||
                attrs: null, | ||
                innerBlocks: [ | ||
                    { | ||
                        blockName: "core/paragraph", | ||
                        attrs: null, | ||
                        innerBlocks: [], | ||
                        innerHTML: "\n<p>Left</p>\n" | ||
                    } | ||
                ], | ||
                innerHTML: '\n<div class="wp-block-column"></div>\n' | ||
            }, | ||
            { | ||
                blockName: "core/column", | ||
                attrs: null, | ||
                innerBlocks: [ | ||
                    { | ||
                        blockName: "core/paragraph", | ||
                        attrs: null, | ||
                        innerBlocks: [], | ||
                        innerHTML: "\n<p><strong>Middle</strong></p>\n" | ||
                    } | ||
                ], | ||
                innerHTML: '\n<div class="wp-block-column"></div>\n' | ||
            }, | ||
            { | ||
                blockName: "core/column", | ||
                attrs: null, | ||
                innerBlocks: [], | ||
                innerHTML: '\n<div class="wp-block-column"></div>\n' | ||
            } | ||
        ], | ||
        innerHTML: '\n<div class="wp-block-columns has-3-columns">\n\n\n\n</div>\n' | ||
    } | ||
]; | ||
``` | ||
Review comment: I suggest a more readable example.
Reply: Here I was just copying the same example from the spec parser, which was pretty minimal, but in 66455b4 I made a bigger example. |
||
|
||
## Theory | ||
|
||
### What is different about this one from the spec-parser? | ||
|
||
This is a recursive-descent parser that scans linearly once through the input document. Instead of recursing directly, it uses a trampoline mechanism to prevent stack overflow. It minimizes data copying and passing by using globals to track state throughout the parse. Between every token (a block comment delimiter) we can instrument the parser and intervene should we want to; for example, we might put a hard limit on how long we spend parsing a document or provide additional debugging diagnostics for a document. | ||
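To illustrate the trampoline idea in isolation, here is a minimal JavaScript sketch. It is not the parser's actual code: `trampoline`, `countDown`, and the `maxSteps` limit are hypothetical names chosen for this example. Each step returns the next step instead of calling it, so a loop rather than the call stack drives the work, and that loop is also a natural place to intervene between steps.

```javascript
// Hypothetical sketch of a trampoline; not taken from the package source.
// Runs steps until one returns a non-function, or until maxSteps is reached.
// Returns true if the work finished, false if the step limit cut it short.
function trampoline( firstStep, maxSteps = Infinity ) {
    let step = firstStep;
    let steps = 0;
    while ( typeof step === 'function' ) {
        if ( steps >= maxSteps ) {
            return false;
        }
        step = step();
        steps += 1;
    }
    return true;
}

// A "recursive" countdown that never grows the call stack:
// each call does one unit of work and hands back the next step.
function countDown( n, onDone ) {
    return function step() {
        if ( n === 0 ) {
            onDone();
            return null;
        }
        n -= 1;
        return step;
    };
}

let finished = false;
const completed = trampoline( countDown( 1000000, () => {
    finished = true;
} ) );

// With a hard limit, the loop stops early instead of running to completion.
const limited = trampoline( countDown( 100, () => {} ), 10 );
```

A directly recursive countdown of the same depth would risk exhausting the stack; here the depth never exceeds one frame.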
|
||
The spec parser is defined via a _Parsing Expression Grammar_ (PEG) which answers many questions inherently that we must answer explicitly in this parser. The goal for this implementation is to match the characteristics of the PEG so that it can be directly swapped out and so that the only changes are better runtime performance and memory usage. | ||
|
||
### How does it work? | ||
|
||
Every serialized Gutenberg document is nominally an HTML document which, in addition to normal HTML, may also contain specially designed HTML comments -- the block comment delimiters -- which separate and isolate the blocks serialized in the document. | ||
|
||
This parser attempts to create a state-machine around the transitions triggered from those delimiters -- the "tokens" of the grammar. Every time we find one we should do exactly one of two things: | ||
|
||
- enter a new block; | ||
- exit out of a block. | ||
|
||
Those actions have different effects depending on the context; for instance, when we exit a block we either need to add it to the output block list _or_ we need to append it as the next entry in the `innerBlocks` list of the parent block below it in the block stack (the place where we track open blocks). The details are documented below. | ||
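The enter/exit behavior can be sketched in a few lines of JavaScript. This is only an illustration under simplifying assumptions: the tokens are taken as already scanned, their shape (`type`, `blockName`) is hypothetical, `innerHTML` is left out, and stray closers or unclosed blocks are simply ignored.

```javascript
// Hypothetical sketch of the block stack; not taken from the package source.
function buildTree( tokens ) {
    const output = []; // finished top-level blocks
    const stack = []; // currently open blocks, innermost last

    for ( const token of tokens ) {
        if ( token.type === 'opener' ) {
            // Enter a new block.
            stack.push( { blockName: token.blockName, innerBlocks: [] } );
            continue;
        }

        // Exit out of a block.
        const block = stack.pop();
        if ( block === undefined ) {
            continue; // stray closer; ignored in this sketch
        }
        if ( stack.length === 0 ) {
            output.push( block ); // no parent: goes to the output list
        } else {
            stack[ stack.length - 1 ].innerBlocks.push( block ); // nest in parent
        }
    }

    return output;
}

const tree = buildTree( [
    { type: 'opener', blockName: 'core/columns' },
    { type: 'opener', blockName: 'core/column' },
    { type: 'opener', blockName: 'core/paragraph' },
    { type: 'closer' },
    { type: 'closer' },
    { type: 'opener', blockName: 'core/column' },
    { type: 'closer' },
    { type: 'closer' },
] );
```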
|
||
The biggest challenge in this parser is correctly accounting for the indices required to construct the `innerHTML` values for each block at every level of nesting depth. We take a simple approach: | ||
|
||
- Start each newly opened block with an empty `innerHTML`. | ||
- Whenever we push a first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts. | ||
- Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts. | ||
- When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts. | ||
- If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`. | ||
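The accounting rules above amount to collecting the stretches of the parent's content that fall outside its inner blocks. The following JavaScript sketch shows one way to express that; `collectInnerHTML` and the `{ start, end }` span shape are hypothetical names for this example, and the real parser tracks these positions incrementally rather than in a single pass like this.

```javascript
// Hypothetical sketch of the innerHTML accounting; not taken from the package source.
// contentStart: index just after the parent's opening delimiter.
// closerStart:  index where the parent's closing delimiter begins.
// innerSpans:   start/end indices of each inner block, in document order.
function collectInnerHTML( input, contentStart, closerStart, innerSpans ) {
    let innerHTML = ''; // every block starts out empty
    let cursor = contentStart;

    for ( const span of innerSpans ) {
        // Content between the previous boundary and this inner block.
        innerHTML += input.slice( cursor, span.start );
        cursor = span.end;
    }

    // Content between the last inner block (or the opener) and the closer.
    return innerHTML + input.slice( cursor, closerStart );
}

const sample = '<!-- wp:a --><div><!-- wp:b -->x<!-- /wp:b -->, <!-- wp:b -->y<!-- /wp:b --></div><!-- /wp:a -->';
const innerOpener = '<!-- wp:b -->';
const innerCloser = '<!-- /wp:b -->';

const firstStart = sample.indexOf( innerOpener );
const firstEnd = sample.indexOf( innerCloser ) + innerCloser.length;
const secondStart = sample.indexOf( innerOpener, firstEnd );
const secondEnd = sample.indexOf( innerCloser, firstEnd ) + innerCloser.length;

const outerHTML = collectInnerHTML(
    sample,
    '<!-- wp:a -->'.length,
    sample.indexOf( '<!-- /wp:a -->' ),
    [
        { start: firstStart, end: firstEnd },
        { start: secondStart, end: secondEnd },
    ]
); // '<div>, </div>'

// With no inner blocks, the whole content between the delimiters is used.
const leaf = innerOpener + 'x' + innerCloser;
const leafHTML = collectInnerHTML( leaf, innerOpener.length, innerOpener.length + 1, [] ); // 'x'
```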
|
||
### I meant, how does it perform? | ||
|
||
This parser operates much faster than the generated parser from the specification. Because we know more about the parsing than the PEG does we can take advantage of several tricks to improve our speed and memory usage: | ||
|
||
- We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing character by character, we can let the RegExp engine (PCRE in PHP) skip over large swaths of the document for us in order to find those tokens. | ||
- Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead. | ||
- Not copying all those strings means that we'll also skip many memory allocations. | ||
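As a rough JavaScript counterpart to the `offset` technique described above, a global RegExp's `lastIndex` property lets `exec()` start searching at a given position without slicing the string. The pattern below is a deliberately simplified stand-in for the real delimiter pattern (for example, it assumes the JSON attributes sit on a single line), and `nextToken` and the token shape are assumptions made for this sketch.

```javascript
// Hypothetical sketch of offset-based tokenizing; not taken from the package source.
// Simplified delimiter pattern: optional closer slash, optional namespace,
// block name, optional single-line JSON attributes, optional void slash.
const DELIMITER = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+(\{.*?\}\s+)?(\/)?-->/g;

// Finds the next delimiter at or after `offset`; returns null if there is none.
function nextToken( input, offset ) {
    DELIMITER.lastIndex = offset; // only a number changes hands, never a substring
    const match = DELIMITER.exec( input );
    if ( match === null ) {
        return null;
    }

    const [ text, closer, namespace, name, attrs, voidMarker ] = match;
    let type = 'opener';
    if ( closer ) {
        type = 'closer';
    } else if ( voidMarker ) {
        type = 'void';
    }

    return {
        type,
        blockName: ( namespace || 'core/' ) + name,
        attrs: attrs ? JSON.parse( attrs ) : null,
        start: match.index,
        length: text.length,
    };
}

const doc = '<p>Intro</p><!-- wp:columns {"columns":3} --><div></div><!-- /wp:columns -->';
const tokens = [];
let offset = 0;
let token = nextToken( doc, offset );
while ( token !== null ) {
    tokens.push( token );
    offset = token.start + token.length; // resume right after this delimiter
    token = nextToken( doc, offset );
}
```

Everything between two consecutive tokens is plain HTML that the engine skips over without the parser inspecting it character by character.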
|
||
Further, tokenizing with a RegExp brings an additional advantage. The parser generated by the PEG provides predictable performance characteristics in exchange for control over tokenization rules -- it doesn't allow us to define RegExp patterns in the rules so as to guard against _e.g._ catastrophic backtracking that would break the PEG guarantees. | ||
|
||
However, since our "token language" of the block comment delimiters is _regular_ and _can_ be trivially matched with RegExp patterns, we can do that here and then something magical happens: we jump out of PHP or JavaScript and into a highly-optimized RegExp engine written in C or C++ on the host system. We thereby leave the virtual machine and its overhead. | ||
|
||
<br/><br/><p align="center"><img src="https://s.w.org/style/images/codeispoetry.png?1" alt="Code is Poetry." /></p> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
{ | ||
"name": "@wordpress/block-serialization-default-parser", | ||
"version": "1.0.0-rc.0", | ||
"description": "Block serialization specification parser for WordPress posts.", | ||
"author": "The WordPress Contributors", | ||
"license": "GPL-2.0-or-later", | ||
"keywords": [ | ||
"wordpress", | ||
"block", | ||
"parser" | ||
], | ||
"homepage": "https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-default-parser/README.md", | ||
"repository": { | ||
"type": "git", | ||
"url": "https://github.com/WordPress/gutenberg.git" | ||
}, | ||
"bugs": { | ||
"url": "https://github.com/WordPress/gutenberg/issues" | ||
}, | ||
"main": "build/index.js", | ||
"module": "build-module/index.js", | ||
"publishConfig": { | ||
"access": "public" | ||
} | ||
Review comment: I think this needs a babel-runtime dependency cc @gziolo ?
Reply: I see the following output in the transpiled code:
// ES5
var _interopRequireDefault = require("@babel/runtime/helpers/interopRequireDefault");
// ESM
import _slicedToArray from "@babel/runtime/helpers/esm/slicedToArray";
So it needs to be there.
Reply: Added in 9463906. |
||
} |
Review comment: I added this conditional load as suggested by @mcsf. I did some tests and this approach seems to work well.
I don't like that we are requiring an external file inside the scope of a function, but in this case it seems to be the best alternative. The included file implements a class, and classes in PHP are global, so the definition works even with the include inside the function.
If someone has a better idea, I'm totally open to changing the approach. If for some reason the commit feels wrong, or we decide to move this into its own PR after landing the main functionality, I'm fine with this commit being discarded.
Reply: Thanks! I have no problem with that.