Parser: Propose new hand-coded parser #8083

Merged
merged 43 commits into from Sep 6, 2018
Changes from 42 commits
bf4c1e1
Parser: Propose new hand-coded PHP parser
dmsnell Jul 20, 2018
bb7ff54
Fix issue with containing the nested innerHTML
dmsnell Jul 20, 2018
a905f58
Also handle newlines as whitespace
dmsnell Jul 20, 2018
1bebf95
Use classes for some static typing
dmsnell Jul 20, 2018
e0256b3
add type hints
dmsnell Jul 20, 2018
070b4f2
remove needless comment
dmsnell Jul 20, 2018
987b6e6
space where space is due
dmsnell Jul 20, 2018
92c110d
meaningless rename
dmsnell Jul 20, 2018
21132d3
remove needless function call
dmsnell Jul 20, 2018
474eab3
harmonize with spec parser
dmsnell Jul 20, 2018
4501e9a
don't forget freeform HTML before blocks
dmsnell Jul 20, 2018
6ed9e50
account for oddity in spec-parser
dmsnell Jul 20, 2018
029feb0
add some polish, fix a thing
dmsnell Jul 21, 2018
5230045
comment it
dmsnell Jul 21, 2018
760ad75
add JS version too
dmsnell Jul 21, 2018
ce42f86
Change `.` to `[^]` because `/s` isn't well supported in JS
dmsnell Jul 23, 2018
3ed3424
Move code into `/packages` directory, prepare for review
dmsnell Aug 24, 2018
a448817
take out names from RegExp pattern to not fail tests
dmsnell Aug 24, 2018
ed917f3
Fix bug in parser: store HTML soup in stack frames while parsing
dmsnell Aug 25, 2018
b440a86
fix whitespace
dmsnell Aug 25, 2018
76c8d50
fix oddity in spec
dmsnell Aug 25, 2018
e9bd804
match styles
dmsnell Aug 26, 2018
1e91266
use class name filter on server-side parser class
dmsnell Aug 26, 2018
e80a6d9
fix whitespace
dmsnell Aug 26, 2018
45d7c7b
Document extensibility
dmsnell Aug 27, 2018
1b7592a
fix typo in example code
dmsnell Aug 27, 2018
c60b95d
Push failing parsing test
mcsf Aug 29, 2018
10a2097
fix lazy/greedy bug in parser regexp
dmsnell Aug 29, 2018
6a232a4
Docs: Fix typos, links, tweak style.
mcsf Aug 29, 2018
cb13b54
update from PR feedback
dmsnell Aug 30, 2018
ce1864f
trim docs
dmsnell Aug 30, 2018
f5b97a6
Load default block parser, replacing PEG-generated one
mcsf Aug 31, 2018
9c72d5e
Expand `?:` shorthand for PHP 5.2 compat
mcsf Sep 1, 2018
a57e448
add fixtures test for default parser
dmsnell Sep 3, 2018
20e6131
spaces to tabs
dmsnell Sep 3, 2018
0bd5e71
could we need no assoc?
dmsnell Sep 3, 2018
08015d7
fill out return array
dmsnell Sep 3, 2018
1004cbe
put that assoc back in there
dmsnell Sep 3, 2018
3dc74fd
isometrize
dmsnell Sep 3, 2018
22f10de
rename and add 0
dmsnell Sep 3, 2018
a41a995
Conditionally include the parser class
jorgefilipecosta Sep 4, 2018
fe98a4a
Add docblocks
dmsnell Sep 5, 2018
9463906
Standardize the package configuration
gziolo Sep 6, 2018
6 changes: 6 additions & 0 deletions docs/extensibility.md
@@ -74,3 +74,9 @@ There are some advanced block features which require opt-in support in the theme
## Autocomplete

Autocompleters within blocks may be extended and overridden. See [autocomplete](../docs/extensibility/autocomplete.md).

## Block Parsing and Serialization

Posts in the editor move through a couple of stages between being stored in `post_content` and appearing in the editor. Because blocks are data structures that live in memory, a parsing step transforms the stored format in the database into blocks, and a serialization step transforms blocks back into that stored format.

Customizing the parser is an advanced topic that you can learn more about in the [Extending the Parser](../docs/extensibility/parser.md) section.
36 changes: 36 additions & 0 deletions docs/extensibility/parser.md
@@ -0,0 +1,36 @@
# Extending the Parser

When the editor is interacting with blocks, these are stored in memory as data structures comprising a few basic properties and attributes. Upon saving a working post we serialize these data structures into a specific HTML structure and save the resultant string into the `post_content` property of the post in the WordPress database. When we load that post back into the editor we have to make the reverse transformation to build those data structures from the serialized format in HTML.

The process of loading the serialized HTML into the editor is performed by the _block parser_. The formal specification for this transformation is encoded in the parsing expression grammar (PEG) inside the `@wordpress/block-serialization-spec-parser` package. The editor provides a default parser implementation of this grammar but there may be various reasons for replacing that implementation with a custom implementation. We can inject our own custom parser implementation through the appropriate filter.

## Server-side parser

Plugins have access to the parser if they want to process posts in their structured form instead of a plain HTML-as-string representation.

## Client-side parser

The editor uses the client-side parser while interactively working in a post. The plain HTML-as-string representation is sent to the browser by the backend and then the editor performs the first parse to initialize itself.

## Filters

To replace the server-side parser, use the `block_parser_class` filter. The filter receives the class name of the current parser as a string and returns the class name of the parser to use. That class is expected to expose a `parse` method.

_Example:_

```php
class EmptyParser {
	public function parse( $post_content ) {
		// return an empty document
		return array();
	}
}

function my_plugin_select_empty_parser( $prev_parser_class ) {
	return 'EmptyParser';
}

add_filter( 'block_parser_class', 'my_plugin_select_empty_parser', 10, 1 );
```

> **Note**: At the present time it's not possible to replace the client-side parser.
6 changes: 6 additions & 0 deletions docs/manifest.json
@@ -287,6 +287,12 @@
"markdown_source": "https://raw.githubusercontent.com/WordPress/gutenberg/master/packages/block-library/README.md",
"parent": "packages"
},
{
"title": "@wordpress/block-serialization-default-parser",
"slug": "packages-block-serialization-default-parser",
"markdown_source": "https://raw.githubusercontent.com/WordPress/gutenberg/master/packages/block-serialization-default-parser/README.md",
"parent": "packages"
},
{
"title": "@wordpress/block-serialization-spec-parser",
"slug": "packages-block-serialization-spec-parser",
16 changes: 14 additions & 2 deletions lib/blocks.php
@@ -66,8 +66,20 @@ function gutenberg_parse_blocks( $content ) {
);
}

- $parser = new Gutenberg_PEG_Parser;
- return $parser->parse( _gutenberg_utf8_split( $content ) );
/**
* Filter to allow plugins to replace the server-side block parser
*
* @since 3.8.0
*
* @param string $parser_class Name of block parser class
*/
$parser_class = apply_filters( 'block_parser_class', 'WP_Block_Parser' );
// Load default block parser for server-side parsing if the default parser class is being used.
if ( 'WP_Block_Parser' === $parser_class ) {
require_once dirname( __FILE__ ) . '/../packages/block-serialization-default-parser/parser.php';
Member

I added this conditional load as suggested by @mcsf. I did some tests and this approach seems to work well.
I don't like the fact that we are requiring an external file in the scope of a function, but in this case it seems the best alternative. The included file implements a class, and classes in PHP are global, so the definition works even with the include inside the function.
If someone has a better idea for this I'm totally open to changing the approach. If for some reason the commit feels wrong, or we decide to have this in its own PR after landing the main functionality, I'm fine with this commit being discarded.

Contributor Author

thanks! I have no problem with that

}
$parser = new $parser_class();
return $parser->parse( $content );
}

/**
9 changes: 8 additions & 1 deletion lib/client-assets.php
@@ -275,6 +275,13 @@ function gutenberg_register_scripts_and_styles() {
filemtime( gutenberg_dir_path() . 'build/dom/index.js' ),
true
);
wp_register_script(
'wp-block-serialization-default-parser',
gutenberg_url( 'build/block-serialization-default-parser/index.js' ),
array(),
filemtime( gutenberg_dir_path() . 'build/block-serialization-default-parser/index.js' ),
true
);
wp_register_script(
'wp-block-serialization-spec-parser',
gutenberg_url( 'build/block-serialization-spec-parser/index.js' ),
@@ -386,7 +393,7 @@ function gutenberg_register_scripts_and_styles() {
array(
'wp-autop',
'wp-blob',
- 'wp-block-serialization-spec-parser',
+ 'wp-block-serialization-default-parser',
'wp-data',
'wp-deprecated',
'wp-dom',
1 change: 0 additions & 1 deletion lib/load.php
@@ -29,7 +29,6 @@
require dirname( __FILE__ ) . '/compat.php';
require dirname( __FILE__ ) . '/plugin-compat.php';
require dirname( __FILE__ ) . '/i18n.php';
- require dirname( __FILE__ ) . '/parser.php';
require dirname( __FILE__ ) . '/register.php';


4 changes: 4 additions & 0 deletions package-lock.json

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions package.json
@@ -20,6 +20,7 @@
"@wordpress/autop": "file:packages/autop",
"@wordpress/blob": "file:packages/blob",
"@wordpress/block-library": "file:packages/block-library",
"@wordpress/block-serialization-default-parser": "file:packages/block-serialization-default-parser",
"@wordpress/block-serialization-spec-parser": "file:packages/block-serialization-spec-parser",
"@wordpress/blocks": "file:packages/blocks",
"@wordpress/components": "file:packages/components",
1 change: 1 addition & 0 deletions packages/block-serialization-default-parser/.npmrc
@@ -0,0 +1 @@
package-lock=false
3 changes: 3 additions & 0 deletions packages/block-serialization-default-parser/CHANGELOG.md
@@ -0,0 +1,3 @@
## 1.0.0

- Initial release.
124 changes: 124 additions & 0 deletions packages/block-serialization-default-parser/README.md
@@ -0,0 +1,124 @@
# Block Serialization Default Parser

This library contains the default block serialization parser implementations for WordPress documents. It provides native PHP and JavaScript parsers that implement the specification from `@wordpress/block-serialization-spec-parser` and that normally operate on the document stored in `post_content`.

## Installation

Install the module:

```bash
npm install @wordpress/block-serialization-default-parser --save
```

## Usage

Input post:
```html
<!-- wp:columns {"columns":3} -->
<div class="wp-block-columns has-3-columns"><!-- wp:column -->
<div class="wp-block-column"><!-- wp:paragraph -->
<p>Left</p>
<!-- /wp:paragraph --></div>
<!-- /wp:column -->

<!-- wp:column -->
<div class="wp-block-column"><!-- wp:paragraph -->
<p><strong>Middle</strong></p>
<!-- /wp:paragraph --></div>
<!-- /wp:column -->

<!-- wp:column -->
<div class="wp-block-column"></div>
<!-- /wp:column --></div>
<!-- /wp:columns -->
```

Parsing code:
```js
import { parse } from '@wordpress/block-serialization-default-parser';

parse( post ) === [
	{
		blockName: "core/columns",
		attrs: {
			columns: 3
		},
		innerBlocks: [
			{
				blockName: "core/column",
				attrs: null,
				innerBlocks: [
					{
						blockName: "core/paragraph",
						attrs: null,
						innerBlocks: [],
						innerHTML: "\n<p>Left</p>\n"
					}
				],
				innerHTML: '\n<div class="wp-block-column"></div>\n'
			},
			{
				blockName: "core/column",
				attrs: null,
				innerBlocks: [
					{
						blockName: "core/paragraph",
						attrs: null,
						innerBlocks: [],
						innerHTML: "\n<p><strong>Middle</strong></p>\n"
					}
				],
				innerHTML: '\n<div class="wp-block-column"></div>\n'
			},
			{
				blockName: "core/column",
				attrs: null,
				innerBlocks: [],
				innerHTML: '\n<div class="wp-block-column"></div>\n'
			}
		],
		innerHTML: '\n<div class="wp-block-columns has-3-columns">\n\n\n\n</div>\n'
	}
];
```
Contributor

I suggest a more readable example. core/more is pretty quirky, since it's wrapping a core magical string <!--more-->. Anything else, really, as long as it's short and sweet.

Contributor Author

here I was just copying the same example from the spec parser which was pretty minimal but in 66455b4 I made a bigger example


## Theory

### What is different about this one from the spec-parser?

This is a recursive-descent parser that makes a single linear scan through the input document. Instead of recursing directly, it uses a trampoline mechanism to prevent stack overflow. It minimizes data copying and argument passing by tracking parse state in globals. Between every pair of tokens (the block comment delimiters) we can instrument the parser and intervene if we want to; for example, we might put a hard limit on how long we spend parsing a document, or emit additional debugging diagnostics.
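
As a rough illustration of the trampoline idea, a driver loop can call a step function repeatedly instead of letting the parser call itself, which also gives a natural place to enforce a limit between tokens. This is only a sketch under assumptions: `runParser`, `makeStepper`, and the `maxSteps` parameter are hypothetical names and are not part of this package.

```js
// Hypothetical sketch: `proceed` handles one token and returns true while
// more input remains. The driver loop stands in for direct recursion, so
// the call stack does not grow with the number of tokens.
function runParser( proceed, maxSteps = Infinity ) {
	let steps = 0;
	// The check between steps is where a hard limit or diagnostics could go.
	while ( steps < maxSteps && proceed() ) {
		steps++;
	}
	return steps;
}

// Toy step function (an assumption for this example) that consumes one
// pending token per call and reports whether it consumed anything.
function makeStepper( tokens ) {
	let index = 0;
	return () => {
		if ( index >= tokens.length ) {
			return false;
		}
		index++;
		return true;
	};
}

const sampleTokens = [ 'open', 'open', 'close', 'close' ];
const allSteps = runParser( makeStepper( sampleTokens ) );
const cappedSteps = runParser( makeStepper( sampleTokens ), 2 );
```

With no limit the loop runs until the stepper reports that no tokens remain; with `maxSteps` set it stops early, which is one way the "hard limit" mentioned above might be expressed.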

The spec parser is defined via a _Parsing Expression Grammar_ (PEG), which inherently answers many questions that we must answer explicitly in this parser. The goal for this implementation is to match the characteristics of the PEG so that it can be directly swapped in and so that the only changes are better runtime performance and memory usage.

### How does it work?

Every serialized Gutenberg document is nominally an HTML document which, in addition to normal HTML, may also contain specially designed HTML comments -- the block comment delimiters -- which separate and isolate the blocks serialized in the document.

This parser attempts to create a state machine around the transitions triggered by those delimiters -- the "tokens" of the grammar. Every time we find one, we should do exactly one of two things:

- enter a new block;
- exit out of a block.

Those actions have different effects depending on the context; for instance, when we exit a block we either need to add it to the output block list _or_ we need to append it as the next `innerBlock` on the parent block below it in the block stack (the place where we track open blocks). The details are documented below.
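
The enter/exit transitions and the block stack can be sketched as follows. This is a simplified illustration, not the package's implementation: the token shape (`type`, `name`) and the function name `buildTree` are assumptions made for the example, and `innerHTML` handling is left out.

```js
// Hypothetical sketch of the two transitions. Tokens are assumed to be
// objects like { type: 'open', name: 'core/column' } or { type: 'close' }.
function buildTree( tokens ) {
	const output = []; // finished top-level blocks
	const stack = []; // currently open blocks, innermost last

	for ( const token of tokens ) {
		if ( token.type === 'open' ) {
			// Enter a new block: push a fresh frame onto the stack.
			stack.push( { blockName: token.name, innerBlocks: [] } );
			continue;
		}

		// Exit a block: pop it and decide where it belongs.
		const block = stack.pop();
		if ( block === undefined ) {
			continue; // stray closer; this sketch simply ignores it
		}
		if ( stack.length === 0 ) {
			output.push( block ); // no parent, so it is a top-level block
		} else {
			stack[ stack.length - 1 ].innerBlocks.push( block ); // next inner block of the parent
		}
	}

	return output;
}

const sampleTree = buildTree( [
	{ type: 'open', name: 'core/columns' },
	{ type: 'open', name: 'core/column' },
	{ type: 'close' },
	{ type: 'open', name: 'core/column' },
	{ type: 'close' },
	{ type: 'close' },
] );
```

Here `sampleTree` holds a single `core/columns` block whose `innerBlocks` list contains the two `core/column` blocks in document order.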

The biggest challenge in this parser is correctly accounting for the indices required to construct the `innerHTML` value for each block at every level of nesting depth. We take a simple approach:

- Start each newly opened block with an empty `innerHTML`.
- Whenever we push a first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts.
- Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts.
- When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts.
- If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`.
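
The index accounting in the list above amounts to concatenating the gaps around the inner blocks. A minimal sketch, assuming the offsets are already known; the function `collectInnerHTML` and its parameters are hypothetical and only illustrate the bookkeeping:

```js
// Hypothetical sketch. `contentStart` is where the parent's content begins
// (just after its opening delimiter), `contentEnd` is where its closing
// delimiter starts, and `innerRanges` lists [ start, end ] offsets of each
// inner block, delimiters included, in document order.
function collectInnerHTML( document, contentStart, contentEnd, innerRanges ) {
	let html = '';
	let cursor = contentStart;

	for ( const [ start, end ] of innerRanges ) {
		// Add the content between the previous position and this inner block.
		html += document.slice( cursor, start );
		cursor = end;
	}

	// Add whatever remains before the closing delimiter. With no inner
	// blocks this is the entire content between the delimiters.
	return html + document.slice( cursor, contentEnd );
}

const nestedDocument = '<!-- wp:a --><div><!-- wp:b -->x<!-- /wp:b --></div><!-- /wp:a -->';
const outerStart = nestedDocument.indexOf( '<div>' );
const outerEnd = nestedDocument.indexOf( '<!-- /wp:a -->' );
const innerStart = nestedDocument.indexOf( '<!-- wp:b -->' );
const innerEnd = nestedDocument.indexOf( '</div>' );

const outerHTML = collectInnerHTML( nestedDocument, outerStart, outerEnd, [ [ innerStart, innerEnd ] ] );
const flatHTML = collectInnerHTML( nestedDocument, outerStart, outerEnd, [] );
```

In this example `outerHTML` is the parent's markup with the inner block cut out, while `flatHTML` shows the no-inner-blocks case where everything between the delimiters is kept.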

### I meant, how does it perform?

This parser operates much faster than the parser generated from the specification. Because we know more about the parsing than the PEG does, we can take advantage of several tricks to improve our speed and memory usage:

- We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing character by character, we can let the PCRE RegExp engine skip over large swaths of the document for us in order to find those tokens.
- Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead.
- Not copying all those strings means that we'll also skip many memory allocations.
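
The same offset-based scanning can be sketched in JavaScript, where setting `lastIndex` on a global RegExp plays the role of the `offset` argument to `preg_match()`. This is a simplified illustration under assumptions: the pattern below is deliberately reduced and is not the package's actual expression, and `nextToken` and `tokenize` are hypothetical names.

```js
// Assumed, simplified delimiter pattern: optional "/" for closers, a block
// name, optional single-line JSON attributes, optional "/" for void blocks.
const DELIMITER = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*(?:\/[a-z][a-z0-9_-]*)?)\s+(?:(\{.*?\})\s+)?(\/)?-->/g;

// Find the next delimiter at or after `offset` without slicing the input.
function nextToken( document, offset ) {
	DELIMITER.lastIndex = offset; // start the search here; no substring copy
	const match = DELIMITER.exec( document );
	if ( match === null ) {
		return null;
	}

	const [ text, closer, name, attrs, isVoid ] = match;
	let type = 'opener';
	if ( closer ) {
		type = 'closer';
	} else if ( isVoid ) {
		type = 'void';
	}

	return {
		type,
		// Names without a namespace are treated as core blocks in this sketch.
		blockName: name.includes( '/' ) ? name : 'core/' + name,
		// JSON.parse throws on malformed attributes; a real parser would guard this.
		attrs: attrs ? JSON.parse( attrs ) : null,
		start: match.index,
		length: text.length,
	};
}

// Walk the document by passing only a numeric position between steps.
function tokenize( document ) {
	const tokens = [];
	let offset = 0;
	let token;
	while ( ( token = nextToken( document, offset ) ) !== null ) {
		tokens.push( token );
		offset = token.start + token.length;
	}
	return tokens;
}

const sampleDocument = '<p>intro</p><!-- wp:columns {"columns":3} --><!-- wp:column -->x<!-- /wp:column --><!-- /wp:columns -->';
const delimiterTokens = tokenize( sampleDocument );
```

Only the numeric `offset` changes from step to step; the document string itself is never copied or sliced while searching for delimiters.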

Further, tokenizing with a RegExp brings an additional advantage. The parser generated by the PEG provides predictable performance characteristics in exchange for control over tokenization rules -- it doesn't allow us to define RegExp patterns in the rules so as to guard against _e.g._ catastrophic backtracking that would break the PEG guarantees.

However, since our "token language" of the block comment delimiters is _regular_ and _can_ be trivially matched with RegExp patterns, we can do that here and then something magical happens: we jump out of PHP or JavaScript and into a highly-optimized RegExp engine written in C or C++ on the host system. We thereby leave the virtual machine and its overhead.

<br/><br/><p align="center"><img src="https://s.w.org/style/images/codeispoetry.png?1" alt="Code is Poetry." /></p>
25 changes: 25 additions & 0 deletions packages/block-serialization-default-parser/package.json
@@ -0,0 +1,25 @@
{
"name": "@wordpress/block-serialization-default-parser",
"version": "1.0.0-rc.0",
"description": "Block serialization specification parser for WordPress posts.",
"author": "The WordPress Contributors",
"license": "GPL-2.0-or-later",
"keywords": [
"wordpress",
"block",
"parser"
],
"homepage": "https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-default-parser/README.md",
"repository": {
"type": "git",
"url": "https://github.com/WordPress/gutenberg.git"
},
"bugs": {
"url": "https://github.com/WordPress/gutenberg/issues"
},
"main": "build/index.js",
"module": "build-module/index.js",
"publishConfig": {
"access": "public"
}
Contributor

I think this needs a babel-runtime dependency cc @gziolo ?

Member

I see the following output in the transpiled code:

// ES5
var _interopRequireDefault = require("@babel/runtime/helpers/interopRequireDefault");

// ESM
import _slicedToArray from "@babel/runtime/helpers/esm/slicedToArray";

So it needs to be there.

Member

Added in 9463906.

}