New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block API: Consider encoding-normalized text as equivalent #11771

Merged
merged 2 commits into from Nov 15, 2018

Conversation

Projects
None yet
3 participants
@aduth
Member

aduth commented Nov 12, 2018

Fixes #9906

This pull request seeks to improve the block validation step to allow more leniency for effectively equivalent text encoded in varying forms.

The changes here were authored in such a way where there may be a slight performance benefit over master, both in a reduction of bundle size (an approximate 18% reduction gzipped on the blocks module) and in optimizing for an early return of equality if normalization (whitespace or encoding) is not necessary to determine equivalence of text sequences.

Implementation notes:

In the process of implementing further text normalization here, it was discovered that the underlying simple-html-tokenizer performs its own entities substitution when encountering text tokens in an HTML string. For the purposes of validation, this was considered to be redundant and was thus swapped with a stub entity parser in the included changes. Note that this is the change which enables the significant drop in bundle size. Note also as an aside that there's desire to consolidate to a single parser between the blocks parse and validator parse, so the use of simple-html-tokenizer may or may not persist far into the future.

Testing instructions:

Verify that block invalidation is not triggered by encoding variations.

For example, inserting the following HTML as the contents of a post (in Text Mode, Classic Editor Text tab, or directly in the database) should not be presented as an invalid block when next viewing the Visual Mode of the editor:

<!-- wp:paragraph -->
<p>This works. &#128517;</p>
<!-- /wp:paragraph -->

cc @MarkRH

@aduth

This comment has been minimized.

Member

aduth commented Nov 12, 2018

The current test failures are legitimate. In disabling simple-html-tokenizer's built-in entity normalization, it surfaces more differences, notably in attributes values, which aren't covered by the normalizations now taking place in isEquivalentTextTokens for text.

I'll need to think more on how best to address this, because the normalizations aren't strictly the same for text and attributes. Further, we have a few specific handlers on attribute value equivalence (e.g. class, style).

One option may be to switch back to relying on simple-html-tokenizer's EntityParser, but still substituting the implementation to apply our own decodeEntities. This would still fix the issue, and retain the bundle size reduction, but perhaps at some cost of performance; this is arguable though, as it only impacts encoded HTML.

@aduth

This comment has been minimized.

Member

aduth commented Nov 13, 2018

One option may be to switch back to relying on simple-html-tokenizer's EntityParser, but still substituting the implementation to apply our own decodeEntities.

This is what I decided to do in the latest commit.

@youknowriad youknowriad modified the milestones: 4.4, 4.5 Nov 15, 2018

@youknowriad

LGTM 👍

@youknowriad youknowriad modified the milestones: 4.5, 4.4 Nov 15, 2018

@youknowriad youknowriad merged commit 1237243 into master Nov 15, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@youknowriad youknowriad deleted the fix/9906-emoji-validation branch Nov 15, 2018

vindl added a commit to Automattic/wp-calypso that referenced this pull request Nov 20, 2018

Fix failing unit tests
Related to this change in Gutenberg:
WordPress/gutenberg#11771

grey-rsi pushed a commit to OnTheGoSystems/gutenberg that referenced this pull request Nov 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment