New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser: Propose new hand-coded parser #8083

Merged
merged 43 commits into from Sep 6, 2018

Conversation

@dmsnell
Contributor

dmsnell commented Jul 20, 2018

For some time we've needed a more performant PHP parser for the first
stage of parsing the post_content document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

Updates

  • This now also includes a copy of the parser in JS whose performance is also quite good.
  • The files have been moved into the /packages directory - I still need some help understanding where it all belongs and how to make the package work

This provides a setup fixture for #6831 wherein we are testing alternate
parser implementations - https://comparator-yizlfvqafz.now.sh

Distinctives

  • designed as a basic recursive-descent
  • but doesn't recurse on the call-stack, recurses via trampoline
  • moves linearly through document in one pass
  • relies on RegExp for tokenization

Note I expect us to discover implementation bugs during the initial rollout of this parser. We have run it through our document library and unit tests but real posts are surely getting into more complicated constructions. We can deal with these as they come but we should expect these.

Todo

  • nested blocks include the nested content in their innerHTML
    this needs to go away
  • create test fixture - https://comparator-yizlfvqafz.now.sh
  • figure out where to save this file
  • phpunit tests

Benchmark

For posterity's sake I ran the merged parser through the parser comparator and compared it against the auto-generated spec parser. Here are the results from my laptop

                                    ms                        MB    
                                Spec  Default  Speedup    Spec  Default  Comparison
demo-post.html                    29.58   0.23   130     38.56   16.43     43%
early-adopting-the-future.html   263.83   1.01   262     36.84   17.10     46%
moby-dick-parsed.html           5012.13  11.55   434     75.41   25.18     33%
pygmalian-raw-html.html          330.35   0.24  1366    116.72   16.90     14%
redesigning-chrome-desktop.html  211.42   1.22   173     37.22   16.51     44%
shortcode-shortcomings.html       71.28   0.36   198     34.07   16.98     50%
web-at-maximum-fps.html          161.35   0.87   186     33.12   16.32     49%

The tests were done on my late 2013 rMBP quad core 2.6 GHz laptop. According to the Intel Power Gadget the CPU was running at 3.6 GHz the entire time. Each document was parsed with each parser at least 47 times and the runs were at random and each run was randomly chosen to parse the document between one and five times in a row before returning the results. Runtime and memory use were measured inside a runner script running in Docker as described in the parser comparator.

@dmsnell

This comment has been minimized.

Show comment
Hide comment
@dmsnell

dmsnell Jul 20, 2018

Contributor

I'm pretty sure that the next steps from here involve pondering the data structure of the stack. We have enough working knowledge now to know what we need to track and how we can pop that from the stack to the output.

Done

Contributor

dmsnell commented Jul 20, 2018

I'm pretty sure that the next steps from here involve pondering the data structure of the stack. We have enough working knowledge now to know what we need to track and how we can pop that from the stack to the output.

Done

@dmsnell dmsnell changed the title from Parser: Propose new hand-coded PHP parser to Parser: Propose new hand-coded parser Jul 21, 2018

@mcsf mcsf referenced this pull request Jul 27, 2018

Open

Overview of Short-term Parsing Enhancements #8244

3 of 11 tasks complete
@pento

Noice! Let's get this in sooner rather than later, so we can make inroads on the things depending on having a faster parser. 🙂

I've left some comments, here are a few random notes that have occurred to me, as well:

  • It feels a little weird to be putting the PHP parser on NPM, but we don't really use Packagist at all, sooo... 🤷‍♂️ Let's stick with NPM for now, we can potentially explore doing Packagist/composer things later.
  • phpcs.xml.dist needs to be updated to scan the new PHP code. I mentioned a couple of coding standards issues in the comments, but PHPCS should pick up the rest.
  • Combined with switching the parser in gutenberg_parse_blocks(), phpunit/class-parsing-test.php should be updated to use gutenberg_parse_blocks(), rather than Gutenberg_PEG_Parser.

With this performance improvement, it seems like we could change do_blocks() to parse the content, instead of using the dynamic blocks regex.

@dmsnell dmsnell referenced this pull request Aug 26, 2018

Closed

WIP: Add dynamic block #6170

Show outdated Hide outdated packages/block-serialization-default-parser/package.json
@@ -88,6 +88,7 @@ const gutenbergPackages = [
'autop',
'blob',
'blocks',
'block-serialization-default-parser',
'block-serialization-spec-parser',

This comment has been minimized.

@gziolo

gziolo Aug 27, 2018

Member

Can we stop bundling the other one if we don't use it in Gutenberg anymore?

@gziolo

gziolo Aug 27, 2018

Member

Can we stop bundling the other one if we don't use it in Gutenberg anymore?

This comment has been minimized.

@dmsnell

dmsnell Aug 27, 2018

Contributor

a good question. I don't want to kill the PEG parser since that maintains the spec in a way no hand-written implementation can.

in my comparator PRs I'm trying to move towards a system that will automatically run the implementations against the specification in something like a CI job so that we can have our formal specification without worrying about the implementation diverging (for example, if someone makes a change to the implementation without changing the spec first)

that is, I think we want to keep the spec-parser wherever we need it - mainly I think we want to strip it from the default load of Gutenberg but whether we build it, what do you think?

@dmsnell

dmsnell Aug 27, 2018

Contributor

a good question. I don't want to kill the PEG parser since that maintains the spec in a way no hand-written implementation can.

in my comparator PRs I'm trying to move towards a system that will automatically run the implementations against the specification in something like a CI job so that we can have our formal specification without worrying about the implementation diverging (for example, if someone makes a change to the implementation without changing the spec first)

that is, I think we want to keep the spec-parser wherever we need it - mainly I think we want to strip it from the default load of Gutenberg but whether we build it, what do you think?

This comment has been minimized.

@gziolo

gziolo Aug 27, 2018

Member

The package with transpiled code is going to be there anyway. It's really up to you and how you want to use it. If you are fine with referencing it as a regular npm package then you don't need it. If you want to consume it as part of e2e test or something which requires all Gutenberg build files then you can leave it as is. I just wanted to raise the awareness.

@gziolo

gziolo Aug 27, 2018

Member

The package with transpiled code is going to be there anyway. It's really up to you and how you want to use it. If you are fine with referencing it as a regular npm package then you don't need it. If you want to consume it as part of e2e test or something which requires all Gutenberg build files then you can leave it as is. I just wanted to raise the awareness.

This comment has been minimized.

@dmsnell

dmsnell Aug 27, 2018

Contributor

thanks - this is mainly just out of my expertise at this point. if you are willing to make a decision on it or can tell me what we should do then that would help me out.

it seems like several people want these parser tests to be written with jest and somehow in the normal suite - I don't know what that means here for this decision

@dmsnell

dmsnell Aug 27, 2018

Contributor

thanks - this is mainly just out of my expertise at this point. if you are willing to make a decision on it or can tell me what we should do then that would help me out.

it seems like several people want these parser tests to be written with jest and somehow in the normal suite - I don't know what that means here for this decision

Show outdated Hide outdated lib/client-assets.php
Show outdated Hide outdated lib/blocks.php
@@ -378,6 +378,6 @@ const createParse = ( parseImplementation ) =>
*
* @return {Array} Block list.
*/
export const parseWithGrammar = createParse( grammarParse );
export const parseWithGrammar = createParse( defaultParse );

This comment has been minimized.

@gziolo

gziolo Aug 27, 2018

Member

Should we offer a filter for JS implementation, too?

@gziolo

gziolo Aug 27, 2018

Member

Should we offer a filter for JS implementation, too?

This comment has been minimized.

@dmsnell

dmsnell Aug 27, 2018

Contributor

yes but I wasn't sure if this PR was the right one for it. that is, filtering out the PHP side seemed somewhat straightforward while filtering the JS side seemed more complicated since we have to take into account things like loading the parser bundles and making sure they are available before the editor loads

do you think we need to do it all here in this PR?

@dmsnell

dmsnell Aug 27, 2018

Contributor

yes but I wasn't sure if this PR was the right one for it. that is, filtering out the PHP side seemed somewhat straightforward while filtering the JS side seemed more complicated since we have to take into account things like loading the parser bundles and making sure they are available before the editor loads

do you think we need to do it all here in this PR?

This comment has been minimized.

@gziolo

gziolo Aug 27, 2018

Member

It's totally fine as its own PR, I just wanted to ensure we tackle both PHP and JS side of things.

@gziolo

gziolo Aug 27, 2018

Member

It's totally fine as its own PR, I just wanted to ensure we tackle both PHP and JS side of things.

@gziolo

This comment has been minimized.

Show comment
Hide comment
@gziolo

gziolo Aug 27, 2018

Member

8c7e42c looks great, I left one comment which is a tiny thing that affects only PHP 5.2...

Member

gziolo commented Aug 27, 2018

8c7e42c looks great, I left one comment which is a tiny thing that affects only PHP 5.2...

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Aug 28, 2018

Contributor

I'm getting a tokenization bug while testing with a personal post. Digging…

Contributor

mcsf commented Aug 28, 2018

I'm getting a tokenization bug while testing with a personal post. Digging…

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Aug 29, 2018

Contributor

@dmsnell: I've pushed a failing test for the parser. The gist of it is that I think the tokenizer is too greedy when looking for the end of an attributes group ({"some":"json"}). Thus, a document with two self-closing attribute-equipped blocks, not necessarily consecutive, breaks the parser:

<!-- wp:block {"ref":313} /-->
<!-- wp:block {"ref":482} /-->

This makes the parser throw a syntax error in the JSON.parse call:

SyntaxError: Unexpected token / in JSON at position 19

We should guarantee handling of any bad JSON here, but that's not the real issue. The issue is in the tokenizer, as the following fragment was returned as a match for attrsMatch:

{\"ref\":313} /--><!-- wp:block {\"ref\":482}

Note that, in contrast, the following input is correctly parsed:

<!-- wp:block {"ref":313} -->
<!-- /wp:block -->
<!-- wp:block {"ref":482} /-->

I used the following debugger patch:

diff --git a/packages/block-serialization-default-parser/src/index.js b/packages/block-serialization-default-parser/src/index.js
index 9c1983f22..007edd2b5 100644
--- a/packages/block-serialization-default-parser/src/index.js
+++ b/packages/block-serialization-default-parser/src/index.js
@@ -172,7 +172,7 @@ function nextToken() {
 	const namespace = namespaceMatch || 'core/';
 	const name = namespace + nameMatch;
 	const hasAttrs = !! attrsMatch;
-	const attrs = hasAttrs ? JSON.parse( attrsMatch ) : null;
+	const attrs = hasAttrs ? safeParse( attrsMatch ) : null;
 
 	// This state isn't allowed
 	// This is an error
@@ -192,6 +192,17 @@ function nextToken() {
 	return [ 'block-opener', name, attrs, startedAt, length ];
 }
 
+function safeParse( json ) {
+	let r;
+	try {
+		r = JSON.parse( json );
+	} catch ( e ) {
+		console.error( `Input of length ${ json.length }`, json );
+		throw e;
+	}
+	return r;
+}
+
 function addFreeform( rawLength ) {
 	const length = rawLength ? rawLength : document.length - offset;
Contributor

mcsf commented Aug 29, 2018

@dmsnell: I've pushed a failing test for the parser. The gist of it is that I think the tokenizer is too greedy when looking for the end of an attributes group ({"some":"json"}). Thus, a document with two self-closing attribute-equipped blocks, not necessarily consecutive, breaks the parser:

<!-- wp:block {"ref":313} /-->
<!-- wp:block {"ref":482} /-->

This makes the parser throw a syntax error in the JSON.parse call:

SyntaxError: Unexpected token / in JSON at position 19

We should guarantee handling of any bad JSON here, but that's not the real issue. The issue is in the tokenizer, as the following fragment was returned as a match for attrsMatch:

{\"ref\":313} /--><!-- wp:block {\"ref\":482}

Note that, in contrast, the following input is correctly parsed:

<!-- wp:block {"ref":313} -->
<!-- /wp:block -->
<!-- wp:block {"ref":482} /-->

I used the following debugger patch:

diff --git a/packages/block-serialization-default-parser/src/index.js b/packages/block-serialization-default-parser/src/index.js
index 9c1983f22..007edd2b5 100644
--- a/packages/block-serialization-default-parser/src/index.js
+++ b/packages/block-serialization-default-parser/src/index.js
@@ -172,7 +172,7 @@ function nextToken() {
 	const namespace = namespaceMatch || 'core/';
 	const name = namespace + nameMatch;
 	const hasAttrs = !! attrsMatch;
-	const attrs = hasAttrs ? JSON.parse( attrsMatch ) : null;
+	const attrs = hasAttrs ? safeParse( attrsMatch ) : null;
 
 	// This state isn't allowed
 	// This is an error
@@ -192,6 +192,17 @@ function nextToken() {
 	return [ 'block-opener', name, attrs, startedAt, length ];
 }
 
+function safeParse( json ) {
+	let r;
+	try {
+		r = JSON.parse( json );
+	} catch ( e ) {
+		console.error( `Input of length ${ json.length }`, json );
+		throw e;
+	}
+	return r;
+}
+
 function addFreeform( rawLength ) {
 	const length = rawLength ? rawLength : document.length - offset;
@dmsnell

This comment has been minimized.

Show comment
Hide comment
@dmsnell

dmsnell Aug 29, 2018

Contributor

the tokenizer is too greedy when looking for the end of an attributes group

excellent find @mcsf! you are right - I let in a greedy match when I had no reason to! that's been taken out by the addition of the ? to make the (?!-->). group un-greedy as it should be. I'm embarrassed that I let it in but so glad you found it and added the failing tests!

un-greedy modifier added in 9c85a60

also I rebased the branch

Contributor

dmsnell commented Aug 29, 2018

the tokenizer is too greedy when looking for the end of an attributes group

excellent find @mcsf! you are right - I let in a greedy match when I had no reason to! that's been taken out by the addition of the ? to make the (?!-->). group un-greedy as it should be. I'm embarrassed that I let it in but so glad you found it and added the failing tests!

un-greedy modifier added in 9c85a60

also I rebased the branch

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Aug 29, 2018

Contributor

Thanks for the fixes, @dmsnell! I just pushed 05ee910, a revision of the docs.

Concerning parser.md, my main impression is that part of it could be removed or be made more succinct if we link to the right resources for more on the language and parsing/serialization stack. Otherwise, I feel we're repeating ourselves and diluting our lang resources.

Contributor

mcsf commented Aug 29, 2018

Thanks for the fixes, @dmsnell! I just pushed 05ee910, a revision of the docs.

Concerning parser.md, my main impression is that part of it could be removed or be made more succinct if we link to the right resources for more on the language and parsing/serialization stack. Otherwise, I feel we're repeating ourselves and diluting our lang resources.

@mcsf mcsf added this to the 3.8 milestone Aug 30, 2018

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Aug 30, 2018

Contributor

Slating this for 3.8. We should merge this right after the 3.7 release (today-ish) to make sure any oddities can be spotted in master.

Contributor

mcsf commented Aug 30, 2018

Slating this for 3.8. We should merge this right after the 3.7 release (today-ish) to make sure any oddities can be spotted in master.

@mcsf

Hopefully the last bits of feedback. :)

Show outdated Hide outdated docs/extensibility/parser.md
Show outdated Hide outdated lib/blocks.php
Show outdated Hide outdated lib/client-assets.php
Show outdated Hide outdated packages/block-serialization-default-parser/README.md
Show outdated Hide outdated lib/blocks.php
@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Aug 31, 2018

Contributor

Thanks for the fixes! Pushed a rebase. Now that 3.7 has been released, let's get this merged ASAP. I'll circle back to this PR very soon and hopefully approve it. Other eyes are also welcome.

Contributor

mcsf commented Aug 31, 2018

Thanks for the fixes! Pushed a rebase. Now that 3.7 has been released, let's get this merged ASAP. I'll circle back to this PR very soon and hopefully approve it. Other eyes are also welcome.

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Sep 1, 2018

Contributor

I pushed 299ddf7 in order to load on the new parser class on the server, otherwise calling gutenberg_parse_block fails. This is a bit naïve, though: we probably don't need the parser most of the time, especially if the block_parser_class filter is being used. Can we get away with a require_once fit inside gutenberg_parse_blocks after a check for $parser_class === 'BSDP_Parser'?

Edit: if it's useful, this is what I also use to manually try the parser: https://gist.github.com/mcsf/1f67c86a5e40dbbf630def4625348b7c

Finally, we need phpunit tests for this package. They can mimic what we have in phpunit/class-parsing-test.php for the PEG parser.

Contributor

mcsf commented Sep 1, 2018

I pushed 299ddf7 in order to load on the new parser class on the server, otherwise calling gutenberg_parse_block fails. This is a bit naïve, though: we probably don't need the parser most of the time, especially if the block_parser_class filter is being used. Can we get away with a require_once fit inside gutenberg_parse_blocks after a check for $parser_class === 'BSDP_Parser'?

Edit: if it's useful, this is what I also use to manually try the parser: https://gist.github.com/mcsf/1f67c86a5e40dbbf630def4625348b7c

Finally, we need phpunit tests for this package. They can mimic what we have in phpunit/class-parsing-test.php for the PEG parser.

Show outdated Hide outdated lib/blocks.php
@pento

This comment has been minimized.

Show comment
Hide comment
@pento

pento Sep 3, 2018

Member

Whee, I finally got back to this PR! Nice work @dmsnell, this is looking really good. Apart from the one little docblock fix, the only other change I had was the class name: unless we're going to magically come up with a new parser to replace this one between now and 5.0, let's name it WP_Block_Parser, so it fits into the WP naming scheme, and we can avoid yet another rename when merge comes.

Member

pento commented Sep 3, 2018

Whee, I finally got back to this PR! Nice work @dmsnell, this is looking really good. Apart from the one little docblock fix, the only other change I had was the class name: unless we're going to magically come up with a new parser to replace this one between now and 5.0, let's name it WP_Block_Parser, so it fits into the WP naming scheme, and we can avoid yet another rename when merge comes.

@pento

This comment has been minimized.

Show comment
Hide comment
@pento

pento Sep 3, 2018

Member

Oh, one more thing: the classes, members and methods need docblocks. 🙂

Member

pento commented Sep 3, 2018

Oh, one more thing: the classes, members and methods need docblocks. 🙂

$parser_class = apply_filters( 'block_parser_class', 'WP_Block_Parser' );
// Load default block parser for server-side parsing if the default parser class is being used.
if ( 'WP_Block_Parser' === $parser_class ) {
require_once dirname( __FILE__ ) . '/../packages/block-serialization-default-parser/parser.php';

This comment has been minimized.

@jorgefilipecosta

jorgefilipecosta Sep 4, 2018

Member

I added this conditional load as suggested by @mcsf. I did some tests and this approach seems to work well.
I don't like the fact that we are requiring an external file in the scope of a function but in this case, it seems the best alternative. The file being included implements a class and in PHP they are global so this definition works even with the include inside the function.
If someone has a better idea for this I'm totally open to change the approach. If for some reason the commit feels wrong or we decide to have this in its open PR after landing the main functionality I'm fine with this commit being discarded.

@jorgefilipecosta

jorgefilipecosta Sep 4, 2018

Member

I added this conditional load as suggested by @mcsf. I did some tests and this approach seems to work well.
I don't like the fact that we are requiring an external file in the scope of a function but in this case, it seems the best alternative. The file being included implements a class and in PHP they are global so this definition works even with the include inside the function.
If someone has a better idea for this I'm totally open to change the approach. If for some reason the commit feels wrong or we decide to have this in its open PR after landing the main functionality I'm fine with this commit being discarded.

This comment has been minimized.

@dmsnell

dmsnell Sep 5, 2018

Contributor

thanks! I have no problem with that

@dmsnell

dmsnell Sep 5, 2018

Contributor

thanks! I have no problem with that

@jorgefilipecosta

I did a set of tests with complex nesting scenarios and everything went great. I think the parsers is working fine 👍
@dmsnell let me know if you need any help with adding the documentation to the classes or any other pending task that I'm missing.

dmsnell added some commits Jul 20, 2018

Parser: Propose new hand-coded PHP parser
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
@jorgefilipecosta

Thank you for your effort in getting this PR to the finish line, I think we are ready to merge this 👍
The core of the parser seems to work correctly and if later we find improvements/alternative approaches, we can iterate on other PR's.
Awesome work!

@pento

pento approved these changes Sep 6, 2018

It is time. Click the button.

@gziolo gziolo merged commit 694a19b into master Sep 6, 2018

2 checks passed

codecov/project No report found to compare against
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@gziolo gziolo deleted the parser/rd-trampoline-php branch Sep 6, 2018

@gziolo

This comment has been minimized.

Show comment
Hide comment
@gziolo

gziolo Sep 6, 2018

Member

💥
done
😅

Member

gziolo commented Sep 6, 2018

💥
done
😅

@mtias

This comment has been minimized.

Show comment
Hide comment
@mtias

mtias Sep 6, 2018

Contributor

Thanks for all the work @dmsnell ! Nice to see this coming together.

Contributor

mtias commented Sep 6, 2018

Thanks for all the work @dmsnell ! Nice to see this coming together.

@aduth

This comment has been minimized.

Show comment
Hide comment
@aduth

aduth Sep 6, 2018

Member

This now also includes a copy of the parser in JS whose performance is also quite good.

Do we have numbers for this, as far as performance comparison to the previous parser?

Member

aduth commented Sep 6, 2018

This now also includes a copy of the parser in JS whose performance is also quite good.

Do we have numbers for this, as far as performance comparison to the previous parser?

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Sep 11, 2018

Contributor

Thanks, everyone, for seeing this through!

Contributor

mcsf commented Sep 11, 2018

Thanks, everyone, for seeing this through!

@mcsf

This comment has been minimized.

Show comment
Hide comment
@mcsf

mcsf Sep 11, 2018

Contributor

Concerning the requiring of the PHP implementation, #9791 needs investigating.

Contributor

mcsf commented Sep 11, 2018

Concerning the requiring of the PHP implementation, #9791 needs investigating.

@aduth

This comment has been minimized.

Show comment
Hide comment
@aduth

aduth Sep 17, 2018

Member

Potential regression noted at #9968

Member

aduth commented Sep 17, 2018

Potential regression noted at #9968

dmsnell added a commit that referenced this pull request Sep 18, 2018

Parser (Fix): Output freeform content before void blocks
Resolves #9968

It was noted that a classic block preceding a void block would
disappear in the editor while if that same classic block preceded
the long-form non-void representation of an empty block then things
would load as expected.

This behavior was determined to originate in the new default parser
in #8083 and the bug was that with void blocks we weren't sending
any preceding HTML soup/freeform content into the output list.

In this patch I've duplicated some code from the block-closing
function of the parser to spit out this content when a void block
is at the top-level of the document.

This bug did not appear when void blocks are nested because it's
the parent block that eats HTML soup. In the case of the top-level
void however we were immediately pushing that void block to the
output list and neglecting the freeform HTML.

I've added a few tests to verify and demonstrate this behavior.
Actually, since I wasn't sure what was wrong I wrote the tests first
to try and understand the behaviors and bugs. There are a few tests
that are thus not entirely essential but worthwhile to have in here.

@dmsnell dmsnell referenced this pull request Sep 18, 2018

Merged

Parser (Fix): Output freeform content before void blocks #9984

4 of 4 tasks complete

dmsnell added a commit that referenced this pull request Sep 18, 2018

Parser (Fix): Output freeform content before void blocks (#9984)
* Parser (Fix): Output freeform content before void blocks

Resolves #9968

It was noted that a classic block preceding a void block would
disappear in the editor while if that same classic block preceded
the long-form non-void representation of an empty block then things
would load as expected.

This behavior was determined to originate in the new default parser
in #8083 and the bug was that with void blocks we weren't sending
any preceding HTML soup/freeform content into the output list.

In this patch I've duplicated some code from the block-closing
function of the parser to spit out this content when a void block
is at the top-level of the document.

This bug did not appear when void blocks are nested because it's
the parent block that eats HTML soup. In the case of the top-level
void however we were immediately pushing that void block to the
output list and neglecting the freeform HTML.

I've added a few tests to verify and demonstrate this behavior.
Actually, since I wasn't sure what was wrong I wrote the tests first
to try and understand the behaviors and bugs. There are a few tests
that are thus not entirely essential but worthwhile to have in here.
@mcsf

I hadn't realized this before — as my primary testing interface was the WP API (gist), through which everything is serialized into the same shape — but I now fear that we're not providing a consistent interface with the parser in its current state.

See my inline comments. Consumers of gutenberg_parse_blocks may make mistakes because of these discrepancies, and I fear they may already have: #10041.

cc @dmsnell

if ( isset( $stack_top->leading_html_start ) ) {
$this->output[] = array(
'attrs' => array(),

This comment has been minimized.

@mcsf

mcsf Sep 19, 2018

Contributor

(copy-pasting a comment that I added in the more recent #9984) In this same file I'm seeing conflicting shapes for attrs:

'attrs' => array(), // here
'attrs' => new stdClass(), // in `add_freeform`
@mcsf

mcsf Sep 19, 2018

Contributor

(copy-pasting a comment that I added in the more recent #9984) In this same file I'm seeing conflicting shapes for attrs:

'attrs' => array(), // here
'attrs' => new stdClass(), // in `add_freeform`

This comment has been minimized.

@dmsnell

dmsnell Sep 19, 2018

Contributor

good call - I know there are some lingering inconsistencies too around null vs. {} in the spec grammar. a good follow-up PR that's been on my TODO list

@dmsnell

dmsnell Sep 19, 2018

Contributor

good call - I know there are some lingering inconsistencies too around null vs. {} in the spec grammar. a good follow-up PR that's been on my TODO list

* @since 3.8.0
* @var WP_Block_Parser_Block[]
*/
public $output;

This comment has been minimized.

@mcsf

mcsf Sep 19, 2018

Contributor

I'm concerned about this promise that $output is an array of WP_Block_Parser_Block, since freeform fragments are added as [associative] arrays and not class instances.

@mcsf

mcsf Sep 19, 2018

Contributor

I'm concerned about this promise that $output is an array of WP_Block_Parser_Block, since freeform fragments are added as [associative] arrays and not class instances.

This comment has been minimized.

@dmsnell

dmsnell Sep 19, 2018

Contributor

we can definitely consider wiping the output clean of its classes - I didn't at first because it seemed benign to retain them, but if we sacrifice a little performance we can json_decode( json_encode( $output ) ) and clear it up

@dmsnell

dmsnell Sep 19, 2018

Contributor

we can definitely consider wiping the output clean of its classes - I didn't at first because it seemed benign to retain them, but if we sacrifice a little performance we can json_decode( json_encode( $output ) ) and clear it up

This comment has been minimized.

@mcsf

mcsf Sep 20, 2018

Contributor

@jorgefilipecosta mentioned implementing an ArrayObject interface in our classes so that one can traverse our parser output natively, rather than doing the JSON dance. What do you think?

@mcsf

mcsf Sep 20, 2018

Contributor

@jorgefilipecosta mentioned implementing an ArrayObject interface in our classes so that one can traverse our parser output natively, rather than doing the JSON dance. What do you think?

This comment has been minimized.

@dmsnell

dmsnell Sep 20, 2018

Contributor

That's a good question. It means more divide between the PHP and JS versions of the parser. What's the JSON dance? Wouldn't having ArrayObject be somewhat superfluous?

// this already works with arrays and objects!
$blocks = parse( $document );
$blocks = array_map( $blocks, $my_transformer );

we probably want to fix the bug as a separate thing from adding interfaces. I'm skeptical of the value of the latter if the former is resolved.

@dmsnell

dmsnell Sep 20, 2018

Contributor

That's a good question. It means more divide between the PHP and JS versions of the parser. What's the JSON dance? Wouldn't having ArrayObject be somewhat superfluous?

// this already works with arrays and objects!
$blocks = parse( $document );
$blocks = array_map( $blocks, $my_transformer );

we probably want to fix the bug as a separate thing from adding interfaces. I'm skeptical of the value of the latter if the former is resolved.

This comment has been minimized.

@mcsf

mcsf Sep 20, 2018

Contributor

By JSON dance I meant json_decode( json_encode( $output ) ), sorry for not being clear.

we probably want to fix the bug as a separate thing from adding interfaces

So this is the actual issue: #10047. It's not the traversal (looking at your array_map example) but rather accessing properties of a block, which can either mean accessing properties of an array or of an object.

@mcsf

mcsf Sep 20, 2018

Contributor

By JSON dance I meant json_decode( json_encode( $output ) ), sorry for not being clear.

we probably want to fix the bug as a separate thing from adding interfaces

So this is the actual issue: #10047. It's not the traversal (looking at your array_map example) but rather accessing properties of a block, which can either mean accessing properties of an array or of an object.

This comment has been minimized.

@jorgefilipecosta

jorgefilipecosta Sep 20, 2018

Member

Classes offer some advantages we can publish abstract classes that contain the fields plugins can safely access, and other parsers can extend this general classes. Simple arrays don't offer this guarantees.

But now we have a problem some plugins are dependent on using simple arrays, even if this bug was already caught I'm not sure we can change the API to use classes.

So I think our options are revert back and use arrays, or advance and change our API to use classes. In the second case to be back-compatible with existing implementation accessing using the array syntax, I think our only solution is ArrayObject. It allows us to temporarily return something that behaves like a class for new implementations and an array for old implementations, in this case, we would add the deprecation messages saying we now return objects.

@jorgefilipecosta

jorgefilipecosta Sep 20, 2018

Member

Classes offer some advantages we can publish abstract classes that contain the fields plugins can safely access, and other parsers can extend this general classes. Simple arrays don't offer this guarantees.

But now we have a problem some plugins are dependent on using simple arrays, even if this bug was already caught I'm not sure we can change the API to use classes.

So I think our options are revert back and use arrays, or advance and change our API to use classes. In the second case to be back-compatible with existing implementation accessing using the array syntax, I think our only solution is ArrayObject. It allows us to temporarily return something that behaves like a class for new implementations and an array for old implementations, in this case, we would add the deprecation messages saying we now return objects.

This comment has been minimized.

@dmsnell

dmsnell Sep 20, 2018

Contributor

It's not the traversal (looking at your array_map example) but rather accessing properties of a block, which can either mean accessing properties of an array or of an object.

to me this is just evidence that the work to make all attribute reporting consistent is necessary. some attributes are null, some are objects

By JSON dance I meant json_decode( json_encode( $output ) ), sorry for not being clear.

that would be in the parser and wouldn't have to be manually performed. in fact, the classes are only even there for performance, so we can test the change of sorting everything in plain old objects vs. converting at the end. if it's a degradation then we can simply remove the classes if we want to preserve the simpler interface.

@dmsnell

dmsnell Sep 20, 2018

Contributor

It's not the traversal (looking at your array_map example) but rather accessing properties of a block, which can either mean accessing properties of an array or of an object.

to me this is just evidence that the work to make all attribute reporting consistent is necessary. some attributes are null, some are objects

By JSON dance I meant json_decode( json_encode( $output ) ), sorry for not being clear.

that would be in the parser and wouldn't have to be manually performed. in fact, the classes are only even there for performance, so we can test the change of sorting everything in plain old objects vs. converting at the end. if it's a degradation then we can simply remove the classes if we want to preserve the simpler interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment