Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classic + Custom HTML blocks: Convert to Blocks removes valid inline formatting #6102

Closed
ZebulanStanphill opened this issue Apr 10, 2018 · 15 comments · Fixed by #11539
Closed
Assignees
Labels
Backwards Compatibility Issues or PRs that impact backwards compatability [Feature] Block Conversion Block transforms from one block to another [Feature] Blocks Overall functionality of blocks [Feature] Raw Handling Related to the ability to convert content to blocks, paste handling, etc [Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable

Comments

@ZebulanStanphill
Copy link
Member

ZebulanStanphill commented Apr 10, 2018

Issue Overview

The Paragraph block, as well as other textual blocks, allow <abbr>, <b>, <code>, <i>, <kbd>, <mark>, <span>, <time>, and various other inline formatting and semantic tags to be added using the "Edit as HTML" option, and these are considered valid and are not removed on save.

However, when converting a Classic block to standard blocks using the "Convert to Blocks" option, the paragraphs, lists, and blockquotes are stripped of some inline formatting tags, and a few are converted to tags that have a different semantic meaning.

I think inline formatting that is considered valid by the standard blocks should not be removed when converting a Classic block to standard blocks. Otherwise, you are stripping out formatting from the original post when you do not need to.

Steps to Reproduce (for bugs)

  1. Create a post using the Classic Editor or insert a Classic block in the Gutenberg editor.
  2. Insert the following:
<abbr>abbr</abbr> <b>b</b> <br>br <bdi>bdi</bdi> <bdo dir="rtl">bdo</bdo> <cite>cite</cite> <code>code</code> <data value="value">data</data> <dfn>dfn</dfn> <em>em</em> <i>i</i> <kbd>kbd</kbd> <mark>mark</mark> <q>q</q> <ruby>ruby <rb>rb</rb> <rp>rp</rp> <rt>rt</rt> <rtc>rtc</rtc> <rp>rp</rp></ruby> <s>s</s> <samp>samp</samp> <small>small</small> <span style="color:red">span</span> <strong>strong</strong> <sub>sub</sub> <sup>sup</sup> <time datetime="2018">time</time> <u>u</u> <var>var</var> <wbr>wbr
  1. Save the post.
  2. Open the post in the Gutenberg editor.
  3. Convert the Classic block to standard blocks using the Convert to Blocks option.
  4. Notice how some of the inline formatting has been removed, and some of it has been converted to other elements with different semantic meaning, e.g. <b> tags are converted to <strong> tags.

Expected Behavior

Converting a Classic block to standard blocks using the "Convert to Blocks" option should preserve all inline formatting tags that are considered valid by the resulting blocks.

Current Behavior

Converting a Classic block to standard blocks using the "Convert to Blocks" option removes several valid HTML5 tags, and converts some tags to other tags with different semantic meanings, e.g. <b> tags are converted to <strong> tags. The previously given sample input above is transformed into the following:

<p><abbr>abbr</abbr> <strong>b</strong> <br/>br bdi bdo cite <code>code</code> data dfn <em>em</em> <em>i</em> kbd mark q ruby rb rp rt rtc rp s samp small span <strong>strong</strong> <sub>sub</sub> <sup>sup</sup> time u var wbr</p>

Other Notes

If a Classic block contains obsolete HTML tags like <color>, then converting them to <span> tags upon conversion to standard blocks seems like a good idea.

<b> and <i> tags are converted to <strong> and <em> respectively. However, it should be considered that there are valid cases where a <b> or <i> are used semantically as per the HTML5 specification. Additionally, <span> tags containing font-weight:bold or font-style:italic are converted to <strong> or <em> tags respectively, and this seems like a bad idea. Most of the time, when someone makes text bold or italic using a <span> tag, they are doing it for purely stylistic reasons and not semantic reasons, so converting those to <strong> and <em> tags seems like a bad idea.

Related Issues and/or PRs

@danielbachhuber danielbachhuber added [Feature] Block Conversion Block transforms from one block to another [Feature] Blocks Overall functionality of blocks labels Apr 10, 2018
@danielbachhuber danielbachhuber added this to the Merge Proposal: Editor milestone May 18, 2018
@mtias mtias added the [Feature] Raw Handling Related to the ability to convert content to blocks, paste handling, etc label Jul 19, 2018
@danielbachhuber danielbachhuber added the Backwards Compatibility Issues or PRs that impact backwards compatability label Jul 25, 2018
@ZebulanStanphill
Copy link
Member Author

@danielbachhuber I did a full test with all valid HTML5 inline semantic elements and here are the results.

I put the following into a Classic block:

<abbr>abbr</abbr> <b>b</b> <br>br <bdi>bdi</bdi> <bdo dir="rtl">bdo</bdo> <cite>cite</cite> <code>code</code> <data value="value">data</data> <dfn>dfn</dfn> <em>em</em> <i>i</i> <kbd>kbd</kbd> <mark>mark</mark> <q>q</q> <ruby>ruby <rb>rb</rb> <rp>rp</rp> <rt>rt</rt> <rtc>rtc</rtc> <rp>rp</rp></ruby> <s>s</s> <samp>samp</samp> <small>small</small> <span style="color:red">span</span> <strong>strong</strong> <sub>sub</sub> <sup>sup</sup> <time datetime="2018">time</time> <u>u</u> <var>var</var> <wbr>wbr

I then converted it to a Paragraph block. Everything should have been the same except for the added <p> tags. However, some tags were stripped out, and others were converted to different tags:

<p><abbr>abbr</abbr> <strong>b</strong> <br/>br bdi bdo cite <code>code</code> data dfn <em>em</em> <em>i</em> kbd mark q ruby rb rp rt rtc rp s samp small span <strong>strong</strong> <sub>sub</sub> <sup>sup</sup> time u var wbr</p>

And yes, <b>, <i>, <small>, and <u> are all valid HTML tags with semantic meaning.

The current behavior seems to have improved from when I first made the issue, but a lot of tags are still stripped out. Personally, I think this should be resolved before the Try Callout goes out.

@danielbachhuber
Copy link
Member

#6878 and #7604 are both related.

The first is relevant because there's still some amount of the HTML5 spec we need to accommodate for: #6878 (comment)

The second because, in the scenario where we're stripping out HTML, we need to improve the user experience.

I don't think this needs to be resolved prior to the Try Callout though, for two reasons:

  1. The user can undo the conversion process.
  2. This is a much larger architectural conversation that there's not necessarily a "fix" for.

@ZebulanStanphill
Copy link
Member Author

ZebulanStanphill commented Jul 27, 2018

@danielbachhuber Okay, fair enough. 👍

@ZebulanStanphill ZebulanStanphill changed the title Converting Classic block to standard blocks removes valid inline formatting Classic block: Convert to Blocks removes valid inline formatting Aug 4, 2018
@fumikito
Copy link

@ZebulanStanphill @danielbachhuber

F.Y.I.
I tried some fix phrasingContentReducer for ruby tag.

const phrasingContentSchema = {
	strong: {},
	em: {},
	del: {},
	ins: {},
	a: { attributes: [ 'href', 'target', 'rel' ] },
	code: {},
	abbr: { attributes: [ 'title' ] },
	sub: {},
	sup: {},
	br: {},
	ruby: {
		children: {
			rt: {
				children: {
					'#text': {}
				}
			},
			'#text': {}
		}
	},
	'#text': {},
};

Now it works and classic block conversion keeps expected HTML.

2018-09-18 15 56 47

Fundamentally, ruby tag is not supported by current WordPress(filtered by TinyMCE) and I need a plugin to enter ruby. ruby is rarely used but in Japanese.

In my opinion, JS filter hook for phrasingContentReducer will help plugins for i18n or semantic freaks like <q cite="some source">Some quotes</q>, <ins datetime="date-of-insertion">Inserted text</ins>, and so on.

@mtias mtias modified the milestones: Merge: Editor, WordPress 5.0 Oct 3, 2018
@mtias mtias added the [Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable label Oct 3, 2018
@michakrapp
Copy link

I got a similar behaviour with the <address> element.
When converting to blocks link markup inside of it get completely strip off.

Unfortunately in this case it gets attached to a youtube link before them and are not "visible" to the user after the converting.

Test:

  • new post with classic block
  • switch to html mode of classic block
  • add html text shown below
  • switch to visual mode
  • select "convert to blocks"

HTML before: (links exist)

<p>https://www.youtube.com/watch?v=2DkaLmUHYOw&amp;rel=0</p>
<address><a href="http://www.wope.net" target="_blank" mce_href="http://www.wope.net">www.wope.net</a></address>
<address style="text-align: justify;" mce_style="text-align: justify;"><a href="http://www.mediaresourcegroup.de" target="_blank" mce_href="http://www.mediaresourcegroup.de">www.mediaresourcegroup.de</a><p></p>
<p>Foto/Text/Video: Markus Wilmsmann</p>
</address>

HTML after: (links are added to youtube link)

<!-- wp:core-embed/youtube {"url":"https://www.youtube.com/watch?v=2DkaLmUHYOw\u0026rel=0www.wope.netwww.mediaresourcegroup.de"} -->
<figure class="wp-block-embed-youtube wp-block-embed"><div class="wp-block-embed__wrapper">
https://www.youtube.com/watch?v=2DkaLmUHYOw&amp;rel=0www.wope.netwww.mediaresourcegroup.de
</div></figure>
<!-- /wp:core-embed/youtube -->

<!-- wp:paragraph -->
<p>Foto/Text/Video: Markus Wilmsmann</p>
<!-- /wp:paragraph -->

@susanpaigen
Copy link

As an everyday user, preserving inline markup during "convert to blocks" is critical to making the transition to Gutenberg. I manage a college website with around 350 active pages. At least 80 - 90% of them use inline HTML for styling, primarily . If that all gets wiped out, "convert to blocks" is not an option. Having to manually create each block, and then copy and paste each chunk of content manually, will be a very time-consuming project! This is not blog content, with headings, straightforward paragraphs, and a few images. Many of our pages are very information heavy, and text formatting is an important tool for keeping pages scannable and not too intimidating.

@ZebulanStanphill ZebulanStanphill changed the title Classic block: Convert to Blocks removes valid inline formatting Classic + Custom HTML blocks: Convert to Blocks removes valid inline formatting Oct 31, 2018
@aduth
Copy link
Member

aduth commented Nov 5, 2018

This is made challenging by the fact that at the root of this conversion feature is the raw handler, which is also responsible for adapting pasted content from external sources to blocks. The HTML received by these sources (e.g. Google Docs) is littered with attributes and markup which makes sense only in the context of Google Docs, and not as a standalone document. This is a major part of the reason the filtering is as strict as it is.

For example, consider an input document:

https://docs.google.com/document/d/1wwL4H10bDUDAiYVAczLthGCdwxsNZ9YhWpWwO-YId_w/edit?usp=sharing

Whose copied text yields the pasted HTML:

<meta charset='utf-8'><meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-6a3f5726-7fff-f826-dcd7-8eef2659d75a"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://drive.google.com/drive/u/0/folders/1k4bWkN088Hte1mehmPkKZHbois4Zjsar" style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">(View All Agendas)</span></a></p><br /><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Slack Transcript:</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> TBD</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Summary Notes:</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> TBD</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Note Taker:</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> TBD</span></p><br /><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><hr /></p><br /><br /></b>

So while filtering the phrasing content schema could provide some improvement to classic block conversion, it might also inadvertently reintroduce some of the unwanted markup from these other sources.

Perhaps then if we consider the classic conversion to be exempt to filtering, or conversely the paste handling to be exceptionally strict, we could do one of the following:

  • Bypass schema validation for raw handling of classic conversion
  • Allow a 'raw' transform (specifically paragraph) to receive the "purpose" of transform (paste vs. convert) as an argument to vary the schema it produces.

cc @iseulde : Since I'm not as familiar with the raw handling, have these sorts of approaches been considered previously?

@ellatrix
Copy link
Member

ellatrix commented Nov 5, 2018

Sounds like this is a duplicate of #6878. We should close one of these. I'll look into distinguishing paste from block conversion asap. @aduth bypassing the schema validation, or partly, seems fine. We can assume the data has already been validated and is there by intention.

@aduth
Copy link
Member

aduth commented Nov 5, 2018

Let's close in favor of #6878 for technical implementation.

@danielbachhuber
Copy link
Member

Let's keep this open until the issue is fully addressed, and then we can use this issue to verify the implementation is complete.

Here's a prior example of an issue that was closed prematurely #6648 (comment)

@ellatrix
Copy link
Member

ellatrix commented Nov 5, 2018

@danielbachhuber This is a complete duplicate. Resolving one will resolve the other. The other issue you mention seems to contain several action items.

@ellatrix
Copy link
Member

ellatrix commented Nov 5, 2018

I'll close the other one then in favour of this.

@ellatrix ellatrix self-assigned this Nov 5, 2018
@susanpaigen
Copy link

  1. As a front end user, it seems like 'paste' and 'convert to blocks' should be structurally different actions.
  2. However, there also needs to be a way to 'paste' and keep formatting and css, so you can copy formatted chunks of content from one page to another. With the current behavior, the only way I can see to do this would be to make almost every block 'custom html'.
  3. Also need a way to preserve
    markup, including classes.
  4. From a broader perspective - maybe what's needed is a version of 'convert to blocks', either built in or a plugin, that gives the user much more granular control over which type of block each chunk is converted to and what is preserved in the process.

@lkraav
Copy link

lkraav commented Nov 6, 2018

I've also found "the cleanse" helpful in some cases. Would certainly be nice if there was a choice for "retain or clean" markup. But every if statement costs $10K as we all know.. :)

@ellatrix ellatrix mentioned this issue Nov 6, 2018
4 tasks
@ellatrix
Copy link
Member

ellatrix commented Nov 6, 2018

PR: #11539.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backwards Compatibility Issues or PRs that impact backwards compatability [Feature] Block Conversion Block transforms from one block to another [Feature] Blocks Overall functionality of blocks [Feature] Raw Handling Related to the ability to convert content to blocks, paste handling, etc [Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants