Block API: Consider encoding-normalized text as equivalent #11771

aduth · 2018-11-12T17:54:48Z

This pull request seeks to improve the block validation step to allow more leniency for effectively equivalent text encoded in varying forms.

The changes were written so that they may also yield a slight performance benefit over master: a smaller bundle (approximately 18% smaller gzipped for the blocks module) and an early return of equality when no normalization (whitespace or encoding) is needed to determine that two text sequences are equivalent.
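To illustrate the early-return idea, here is a minimal sketch and not the code in this pull request. The helper names (`normalizeWhitespace`, `decodeNumericEntities`, `isEquivalentText`) are hypothetical, and only numeric character references are decoded so that the example runs without a DOM:

```javascript
// Hypothetical helpers for illustration; not the actual validator code.

// Collapse runs of whitespace and trim the ends.
function normalizeWhitespace(text) {
	return text.trim().replace(/\s+/g, ' ');
}

// Decode decimal (&#128517;) and hexadecimal (&#x1F605;) character references.
// Named entities such as &amp; are intentionally left alone in this sketch.
function decodeNumericEntities(text) {
	return text.replace(/&#(x[0-9a-f]+|\d+);/gi, (match, code) => {
		const codePoint =
			code[0].toLowerCase() === 'x'
				? parseInt(code.slice(1), 16)
				: parseInt(code, 10);
		// Leave out-of-range references untouched rather than throwing.
		return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
	});
}

function isEquivalentText(actual, expected) {
	// Fast path: identical strings need no normalization at all.
	if (actual === expected) {
		return true;
	}
	// Slow path: compare the encoding- and whitespace-normalized forms.
	return (
		normalizeWhitespace(decodeNumericEntities(actual)) ===
		normalizeWhitespace(decodeNumericEntities(expected))
	);
}
```

With this sketch, `isEquivalentText( 'This works. &#128517;', 'This works. 😅' )` takes the slow path and returns `true`, while two identical strings return immediately from the strict comparison.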

Implementation notes:

While implementing further text normalization here, it was discovered that the underlying simple-html-tokenizer performs its own entity substitution when encountering text tokens in an HTML string. For the purposes of validation, this was considered redundant, so it was swapped with a stub entity parser in the included changes. Note that this is the change which enables the significant drop in bundle size. Note also, as an aside, that there is a desire to consolidate to a single parser between the blocks parse and the validator parse, so the use of simple-html-tokenizer may or may not persist far into the future.

Testing instructions:

Verify that block invalidation is not triggered by encoding variations.

For example, inserting the following HTML as the contents of a post (in Text Mode, Classic Editor Text tab, or directly in the database) should not be presented as an invalid block when next viewing the Visual Mode of the editor:

<!-- wp:paragraph -->
<p>This works. &#128517;</p>
<!-- /wp:paragraph -->

cc @MarkRH

aduth · 2018-11-12T18:21:43Z

The current test failures are legitimate. Disabling simple-html-tokenizer's built-in entity normalization surfaces more differences, notably in attribute values, which aren't covered by the normalizations now applied to text in isEquivalentTextTokens.

I'll need to think more on how best to address this, because the normalizations aren't strictly the same for text and attributes. Further, we have a few specific handlers on attribute value equivalence (e.g. class, style).

One option may be to switch back to relying on simple-html-tokenizer's EntityParser, but still substituting the implementation to apply our own decodeEntities. This would still fix the issue, and retain the bundle size reduction, but perhaps at some cost of performance; this is arguable though, as it only impacts encoded HTML.
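A rough sketch of that option, under stated assumptions: the tokenizer is assumed to accept an entity parser object exposing a `parse` method that receives the reference body (without the leading `&` and trailing `;`) and returns the decoded string or `undefined`. That contract should be checked against the simple-html-tokenizer version in use. `decodeReference`, the small `NAMED` table, and `decodeText` are hypothetical stand-ins so the example runs on its own, without the library or a DOM:

```javascript
// Hypothetical, self-contained sketch; not the code from this pull request.

// Tiny named-entity table for illustration only.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', nbsp: '\u00a0' };

// Stand-in for a decodeEntities-style helper operating on a reference body.
function decodeReference(body) {
	if (body[0] === '#') {
		const isHex = body[1] === 'x' || body[1] === 'X';
		const digits = body.slice(isHex ? 2 : 1);
		const valid = isHex ? /^[0-9a-f]+$/i.test(digits) : /^\d+$/.test(digits);
		if (!valid) {
			return undefined;
		}
		const codePoint = parseInt(digits, isHex ? 16 : 10);
		return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : undefined;
	}
	return Object.prototype.hasOwnProperty.call(NAMED, body)
		? NAMED[body]
		: undefined;
}

// A stub parser: never substitutes anything, so entities reach validation as-is.
const stubEntityParser = {
	parse() {
		return undefined;
	},
};

// A decoding parser: same assumed interface, but delegates to our own decoder.
const decodingEntityParser = { parse: decodeReference };

// Stand-in for how a tokenizer might consult the parser while reading text.
function decodeText(text, entityParser) {
	return text.replace(/&([#\w]+);/g, (match, body) => {
		const decoded = entityParser.parse(body);
		return decoded === undefined ? match : decoded;
	});
}
```

In this sketch, `decodeText( '&amp;', stubEntityParser )` leaves the text as `&amp;`, whereas the decoding parser turns it into `&`, so text and attribute values are both normalized before any comparison happens.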

aduth · 2018-11-13T22:33:08Z

> One option may be to switch back to relying on simple-html-tokenizer's EntityParser, but still substituting the implementation to apply our own decodeEntities.

This is what I decided to do in the latest commit.

youknowriad

LGTM 👍

Block API: Consider encoding-normalized text as equivalent

c48a0f9

aduth requested a review from pento November 12, 2018 17:54

Block API: Decode entities in validator custom EntityParser

74f099c

mtias added this to the 4.4 milestone Nov 12, 2018

aduth mentioned this pull request Nov 12, 2018

Gutenberg 3.8 - 4.1.1 Paragraph Block Thinks Modified Externally if an Emoji Is Used. #9906

Closed

jasmussen mentioned this pull request Nov 13, 2018

Invalid block warning not clickable #11764

Closed

youknowriad modified the milestones: 4.4, 4.5 Nov 15, 2018

youknowriad approved these changes Nov 15, 2018

youknowriad modified the milestones: 4.5, 4.4 Nov 15, 2018

youknowriad merged commit 1237243 into master Nov 15, 2018

youknowriad deleted the fix/9906-emoji-validation branch November 15, 2018 16:25

vindl added a commit to Automattic/wp-calypso that referenced this pull request Nov 20, 2018

Fix failing unit tests

4deaec3

Related to this change in Gutenberg: WordPress/gutenberg#11771

aduth mentioned this pull request Jan 25, 2019

Fix: Unexpected block validation error with unescaped ampersands. #13406

Closed

gziolo mentioned this pull request Apr 29, 2019

Blocks: Upgrade simple-html-tokenizer dependency #15246

Merged

aduth added the [Feature] Block Validation/Deprecation label Jan 6, 2020

aduth mentioned this pull request Jan 6, 2020

Try adding more permissive block validation modes #19188

Closed
