WordPress · gziolo · Sep 6, 2018 · Jul 20, 2018 · Jul 20, 2018 · Jul 20, 2018
diff --git a/docs/extensibility.md b/docs/extensibility.md
@@ -74,3 +74,9 @@ There are some advanced block features which require opt-in support in the theme
 ## Autocomplete
 
 Autocompleters within blocks may be extended and overridden. See [autocomplete](../docs/extensibility/autocomplete.md).
+
+## Block Parsing and Serialization
+
+Posts in the editor pass through several stages between being stored in `post_content` and appearing in the editor. Because blocks are data structures that live in memory, a parsing step transforms the stored format in the database into blocks, and a serialization step transforms blocks back into that format.
+
+Customizing the parser is an advanced topic that you can learn more about in the [Extending the Parser](../docs/extensibility/parser.md) section.
diff --git a/docs/extensibility/parser.md b/docs/extensibility/parser.md
@@ -0,0 +1,36 @@
+# Extending the Parser
+
+While the editor works with blocks, it holds them in memory as data structures comprising a few basic properties and attributes. When a post is saved, we serialize these data structures into a specific HTML structure and store the resulting string in the `post_content` property of the post in the WordPress database. When the post is loaded back into the editor, we perform the reverse transformation and rebuild those data structures from the serialized HTML.
+
+The process of loading the serialized HTML into the editor is performed by the _block parser_. The formal specification for this transformation is encoded in the parsing expression grammar (PEG) inside the `@wordpress/block-serialization-spec-parser` package. The editor provides a default parser implementation of this grammar but there may be various reasons for replacing that implementation with a custom implementation. We can inject our own custom parser implementation through the appropriate filter.
+
+## Server-side parser
+
+Plugins have access to the parser if they want to process posts in their structured form instead of a plain HTML-as-string representation.
+
+## Client-side parser
+
+The editor uses the client-side parser while interactively working in a post. The plain HTML-as-string representation is sent to the browser by the backend and then the editor performs the first parse to initialize itself.
+
+## Filters
+
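+To replace the server-side parser, use the `block_parser_class` filter. The filter receives the class name of the current parser as a string and returns the class name of the parser to use. That class is instantiated without arguments and must expose a `parse` method that accepts the post content.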
+
+_Example:_
+
+```php
+class EmptyParser {
+    public function parse( $post_content ) {
+        // Return an empty document.
+        return array();
+    }
+}
+
+function my_plugin_select_empty_parser( $prev_parser_class ) {
+    return 'EmptyParser';
+}
+
+add_filter( 'block_parser_class', 'my_plugin_select_empty_parser', 10, 1 );
+```
+
+> **Note**: At the present time it's not possible to replace the client-side parser.
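To illustrate why the structured form mentioned under "Server-side parser" can be easier to work with than an HTML string, here is a minimal JavaScript sketch that walks a parsed block list and tallies block names. The block shape (`blockName`, `attrs`, `innerBlocks`, `innerHTML`) follows the README example in this change; `countBlockNames` and the sample data are hypothetical and are not part of any package.

```js
// Hypothetical helper: recursively tally how often each block name occurs.
function countBlockNames( blocks, counts = {} ) {
	for ( const block of blocks ) {
		counts[ block.blockName ] = ( counts[ block.blockName ] || 0 ) + 1;
		// Descend into nested blocks, sharing the same tally object.
		countBlockNames( block.innerBlocks || [], counts );
	}
	return counts;
}

// Hand-written sample in the shape the parser returns (not real parser output).
const sampleBlocks = [
	{
		blockName: 'core/columns',
		attrs: { columns: 2 },
		innerBlocks: [
			{ blockName: 'core/column', attrs: null, innerBlocks: [], innerHTML: '' },
			{ blockName: 'core/column', attrs: null, innerBlocks: [], innerHTML: '' },
		],
		innerHTML: '',
	},
];

const blockCounts = countBlockNames( sampleBlocks );
```

The same traversal against a plain HTML string would first need the block comment delimiters to be located and paired, which is the work the parser already does.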
diff --git a/docs/manifest.json b/docs/manifest.json
@@ -287,6 +287,12 @@
 		"markdown_source": "https://raw.githubusercontent.com/WordPress/gutenberg/master/packages/block-library/README.md",
 		"parent": "packages"
 	},
+	{
+		"title": "@wordpress/block-serialization-default-parser",
+		"slug": "packages-block-serialization-default-parser",
+		"markdown_source": "https://raw.githubusercontent.com/WordPress/gutenberg/master/packages/block-serialization-default-parser/README.md",
+		"parent": "packages"
+	},
 	{
 		"title": "@wordpress/block-serialization-spec-parser",
 		"slug": "packages-block-serialization-spec-parser",

diff --git a/lib/blocks.php b/lib/blocks.php
@@ -66,8 +66,20 @@ function gutenberg_parse_blocks( $content ) {
 		);
 	}
 
-	$parser = new Gutenberg_PEG_Parser;
-	return $parser->parse( _gutenberg_utf8_split( $content ) );
+	/**
+	 * Filter to allow plugins to replace the server-side block parser
+	 *
+	 * @since 3.8.0
+	 *
+	 * @param string $parser_class Name of block parser class
+	 */
+	$parser_class = apply_filters( 'block_parser_class', 'WP_Block_Parser' );
+	// Load default block parser for server-side parsing if the default parser class is being used.
+	if ( 'WP_Block_Parser' === $parser_class ) {
+		require_once dirname( __FILE__ ) . '/../packages/block-serialization-default-parser/parser.php';
+	}
+	$parser = new $parser_class();
+	return $parser->parse( $content );
 }
 
 /**

diff --git a/lib/client-assets.php b/lib/client-assets.php
@@ -275,6 +275,13 @@ function gutenberg_register_scripts_and_styles() {
 		filemtime( gutenberg_dir_path() . 'build/dom/index.js' ),
 		true
 	);
+	wp_register_script(
+		'wp-block-serialization-default-parser',
+		gutenberg_url( 'build/block-serialization-default-parser/index.js' ),
+		array(),
+		filemtime( gutenberg_dir_path() . 'build/block-serialization-default-parser/index.js' ),
+		true
+	);
 	wp_register_script(
 		'wp-block-serialization-spec-parser',
 		gutenberg_url( 'build/block-serialization-spec-parser/index.js' ),
@@ -386,7 +393,7 @@ function gutenberg_register_scripts_and_styles() {
 		array(
 			'wp-autop',
 			'wp-blob',
-			'wp-block-serialization-spec-parser',
+			'wp-block-serialization-default-parser',
 			'wp-data',
 			'wp-deprecated',
 			'wp-dom',

diff --git a/lib/load.php b/lib/load.php
@@ -29,7 +29,6 @@
 require dirname( __FILE__ ) . '/compat.php';
 require dirname( __FILE__ ) . '/plugin-compat.php';
 require dirname( __FILE__ ) . '/i18n.php';
-require dirname( __FILE__ ) . '/parser.php';
 require dirname( __FILE__ ) . '/register.php';
 
 

diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -20,6 +20,7 @@
 		"@wordpress/autop": "file:packages/autop",
 		"@wordpress/blob": "file:packages/blob",
 		"@wordpress/block-library": "file:packages/block-library",
+		"@wordpress/block-serialization-default-parser": "file:packages/block-serialization-default-parser",
 		"@wordpress/block-serialization-spec-parser": "file:packages/block-serialization-spec-parser",
 		"@wordpress/blocks": "file:packages/blocks",
 		"@wordpress/components": "file:packages/components",

diff --git a/packages/block-serialization-default-parser/.npmrc b/packages/block-serialization-default-parser/.npmrc
@@ -0,0 +1 @@
+package-lock=false
diff --git a/packages/block-serialization-default-parser/CHANGELOG.md b/packages/block-serialization-default-parser/CHANGELOG.md
@@ -0,0 +1,3 @@
+## 1.0.0
+
+-   Initial release.
diff --git a/packages/block-serialization-default-parser/README.md b/packages/block-serialization-default-parser/README.md
@@ -0,0 +1,124 @@
+# Block Serialization Default Parser
+
+This library contains the default block serialization parser implementations for WordPress documents. It provides native PHP and JavaScript parsers that implement the specification from `@wordpress/block-serialization-spec-parser` and that normally operate on the document stored in `post_content`.
+
+## Installation
+
+Install the module:
+
+```bash
+npm install @wordpress/block-serialization-default-parser --save
+```
+
+## Usage
+
+Input post:
+```html
+<!-- wp:columns {"columns":3} -->
+<div class="wp-block-columns has-3-columns"><!-- wp:column -->
+<div class="wp-block-column"><!-- wp:paragraph -->
+<p>Left</p>
+<!-- /wp:paragraph --></div>
+<!-- /wp:column -->
+
+<!-- wp:column -->
+<div class="wp-block-column"><!-- wp:paragraph -->
+<p><strong>Middle</strong></p>
+<!-- /wp:paragraph --></div>
+<!-- /wp:column -->
+
+<!-- wp:column -->
+<div class="wp-block-column"></div>
+<!-- /wp:column --></div>
+<!-- /wp:columns -->
+```
+
+Parsing code:
+```js
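+import { parse } from '@wordpress/block-serialization-default-parser';
+
+// `post` is a string holding the input document above.
+// `===` is shorthand for "deep-equals" here; in JavaScript, comparing arrays with `===` checks identity.
+parse( post ) === [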
+    {
+        blockName: "core/columns",
+        attrs: {
+            columns: 3
+        },
+        innerBlocks: [
+            {
+                blockName: "core/column",
+                attrs: null,
+                innerBlocks: [
+                    {
+                        blockName: "core/paragraph",
+                        attrs: null,
+                        innerBlocks: [],
+                        innerHTML: "\n<p>Left</p>\n"
+                    }
+                ],
+                innerHTML: '\n<div class="wp-block-column"></div>\n'
+            },
+            {
+                blockName: "core/column",
+                attrs: null,
+                innerBlocks: [
+                    {
+                        blockName: "core/paragraph",
+                        attrs: null,
+                        innerBlocks: [],
+                        innerHTML: "\n<p><strong>Middle</strong></p>\n"
+                    }
+                ],
+                innerHTML: '\n<div class="wp-block-column"></div>\n'
+            },
+            {
+                blockName: "core/column",
+                attrs: null,
+                innerBlocks: [],
+                innerHTML: '\n<div class="wp-block-column"></div>\n'
+            }
+        ],
+        innerHTML: '\n<div class="wp-block-columns has-3-columns">\n\n\n\n</div>\n'
+    }
+];
+```
+
+## Theory
+
+### What is different about this one from the spec-parser?
+
+This is a recursive-descent parser that scans linearly once through the input document. Instead of recursing directly, it uses a trampoline mechanism to prevent stack overflow. It tracks parse state in globals, which minimizes copying data and passing it between functions. Between every token (a block comment delimiter) we can instrument the parser and intervene if we want to; for example, we might put a hard limit on how long parsing a document may take, or provide additional debugging diagnostics for a document.
+
+The spec parser is defined via a _Parsing Expression Grammar_ (PEG) which answers many questions inherently that we must answer explicitly in this parser. The goal for this implementation is to match the characteristics of the PEG so that it can be directly swapped out and so that the only changes are better runtime performance and memory usage.
+
+### How does it work?
+
+Every serialized Gutenberg document is nominally an HTML document which, in addition to normal HTML, may also contain specially designed HTML comments -- the block comment delimiters -- which separate and isolate the blocks serialized in the document.
+
+This parser builds a state machine around the transitions triggered by those delimiters -- the "tokens" of the grammar. Each time we find one, we do exactly one of the following:
+
+ - enter a new block;
+ - exit out of a block.
+
+Those actions have different effects depending on the context; for instance, when we exit a block we either need to add it to the output block list _or_ we need to append it to the `innerBlocks` of the parent block below it in the block stack (the place where we track open blocks). The details are documented below.
+
+The biggest challenge in this parser is correctly accounting for the indices required to construct the `innerHTML` value of each block at every level of nesting depth. We take a simple approach:
+
+ - Start each newly opened block with an empty `innerHTML`.
+ - Whenever we push the first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts.
+ - Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts.
+ - When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts.
+ - If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`.
+
+### I meant, how does it perform?
+
+This parser operates much faster than the generated parser from the specification. Because we know more about the parsing than the PEG does we can take advantage of several tricks to improve our speed and memory usage:
+
+ - We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing character by character, we can let the RegExp engine (PCRE in PHP) skip over large swaths of the document to find those tokens.
+ - Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead.
+ - Not copying all those strings means that we'll also skip many memory allocations.
+
+Further, tokenizing with a RegExp brings an additional advantage. The parser generated by the PEG provides predictable performance characteristics in exchange for control over tokenization rules -- it doesn't allow us to define RegExp patterns in the rules, so as to guard against _e.g._ catastrophic backtracking that would break the PEG guarantees.
+
+However, since our "token language" of the block comment delimiters is _regular_ and _can_ be trivially matched with RegExp patterns, we can do that here and then something magical happens: we jump out of PHP or JavaScript and into a highly-optimized RegExp engine written in C or C++ on the host system. We thereby leave the virtual machine and its overhead.
+
+<br/><br/><p align="center"><img src="https://s.w.org/style/images/codeispoetry.png?1" alt="Code is Poetry." /></p>
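The `innerHTML` accounting and the offset-based tokenizing described in the README above can be sketched in a few lines of JavaScript. This is a simplified illustration under stated assumptions, not the package's implementation: `sketchParse` and its delimiter pattern are hypothetical, freeform content outside of blocks is ignored, closer names are not checked against openers, and unclosed blocks are dropped. The regular expression's `lastIndex` plays the role that the `offset` parameter of `preg_match()` plays in PHP.

```js
// Simplified sketch, not the package's implementation.
function sketchParse( input ) {
	// One pattern covers opening, closing, and void block comment delimiters.
	const tokenPattern = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+(\{[\s\S]*?\}\s+)?(\/)?-->/g;
	const output = [];
	const stack = []; // Open blocks, innermost last.

	// Attach a finished block to its parent, or to the output at the top level.
	function attach( block, tokenStart, tokenEnd ) {
		const parent = stack[ stack.length - 1 ];
		if ( ! parent ) {
			output.push( block );
			return;
		}
		// HTML between the previous inner block (or the opener) and this block.
		parent.block.innerHTML += input.slice( parent.prevOffset, tokenStart );
		parent.block.innerBlocks.push( block );
		parent.prevOffset = tokenEnd;
	}

	let match;
	// `exec` resumes from `lastIndex`, so only a position is tracked between tokens.
	while ( ( match = tokenPattern.exec( input ) ) !== null ) {
		const [ token, closer, namespace, name, attrs, isVoid ] = match;
		const start = match.index;
		const end = start + token.length;

		if ( closer ) {
			const frame = stack.pop();
			if ( ! frame ) {
				continue; // Stray closer: skipped in this sketch.
			}
			// HTML between the last inner block (or the opener) and the closer.
			frame.block.innerHTML += input.slice( frame.prevOffset, start );
			attach( frame.block, frame.tokenStart, end );
			continue;
		}

		const block = {
			blockName: ( namespace || 'core/' ) + name,
			attrs: attrs ? JSON.parse( attrs ) : null,
			innerBlocks: [],
			innerHTML: '',
		};

		if ( isVoid ) {
			attach( block, start, end );
		} else {
			stack.push( { block, tokenStart: start, prevOffset: end } );
		}
	}
	return output;
}
```

Calling `sketchParse` on a serialized document returns a nested list in the same shape as the usage example in the README, with each parent's `innerHTML` assembled from the spans between its inner blocks.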
diff --git a/packages/block-serialization-default-parser/package.json b/packages/block-serialization-default-parser/package.json
@@ -0,0 +1,25 @@
+{
+  "name": "@wordpress/block-serialization-default-parser",
+  "version": "1.0.0-rc.0",
+  "description": "Default block serialization parser implementations for WordPress documents.",
+  "author": "The WordPress Contributors",
+  "license": "GPL-2.0-or-later",
+  "keywords": [
+    "wordpress",
+    "block",
+    "parser"
+  ],
+  "homepage": "https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-default-parser/README.md",
+  "repository": {
+    "type": "git",
+    "url": "https://github.com/WordPress/gutenberg.git"
+  },
+  "bugs": {
+    "url": "https://github.com/WordPress/gutenberg/issues"
+  },
+  "main": "build/index.js",
+  "module": "build-module/index.js",
+  "publishConfig": {
+    "access": "public"
+  }
+}