Issues with data processor noticed during initial implementation #2305

szymonkups · 2016-09-20T20:57:05Z

Here's a list of issues that I've found during the work on data processor initial implementation. Some of the issues are just differences between HTML rendered by GitHub and marked library.

GitHub is sometimes adding new lines between rendered HTML elements but marked library does not. This is actually not a big deal (I think) since we don't want to have text nodes in our view containing only newline char. For example, this blockquote:

> foo bar

Is rendered as:

<blockquote>
<p>foo bar</p>
</blockquote>

but marked library renders it as:

<blockquote><p>foo bar</p></blockquote>

Every paragraph is proceeded by two newlines. This markdown:

this is paragraph
# this is header

is rendered by GitHub as:

<p>this is paragraph</p>

<h1>this is header</h1>

but marked renders it as:

<p>this is paragraph</p><h1>this is header</h1>

Additional newlines are also added after rendering each   element:

<p>first<br>
second<br>
third</p>

GitHub is adding newline at the end of each rendered code block:

<pre><code>code block
</code></pre>

Where marked renders without it:

<pre><code>code block</code></pre>

To-markdown does not process empty code blocks, so this from the view:

<pre><code class="lang-js"></code></pre>

will be converted to an empty markdown string.

All reference links are correctly converted to <a> elements. But when converting back they are not represented as reference links but direct ones. For example:

Foo [bar][1].
[1]: /url/  "Title"

Is converted to view element:

<p>Foo <a href="/url/" title="Title">bar</a>.</p>

So when it is converted back to markdown it looks like this:

Foo [bar](/url/ "Title").

Marked is trimming white spaces from list items. For example:

*    item1
*    item2

is rendered on GitHub as:

<ul>
<li>   item1</li>
<li>   item2</li>
</ul>

but marked trims white spaces from the beginning:

<ul><li>item1</li><li>item2</li></ul>

The same happens with ordered lists.

Some strange things happens with nested strong and emphasis. For example, this markdown:

*test **test** test*

is rendered by GitHub as:

<p><em>test *</em>test** test*</p>

but marked renders it as:

<p><em>test <strong>test</strong> test</em></p>

The text was updated successfully, but these errors were encountered:

szymonkups · 2016-09-21T07:13:07Z

To-markdown have problems with escaping special characters.
Example markdown:

\_test\_

Is rendered by GitHub and marked as:

<p>_test_</p>

And there is no problem with that, but when converting this HTML back to markdown it is represented as:

_test_

So when another conversion will happen it will be converted to emphasis.

fredck · 2016-09-21T08:40:03Z

It sounds like many of these issues are not really important for us, right? I mean, they're mainly about white-spaces and does not make a difference when rendered.

If I'm right, is it worth tracking them here? Maybe we should at least mark those that we're ok with.

Reinmar · 2016-09-21T10:20:48Z

I agree – whitespaces in HTML can be ignored. White spaces in MD are more tricky cause they can affect how GH renders the content later on.

OTOH, I don't know how well the engine cleans up insignificant whitespaces in loaded HTML so we may have useless text nodes between s or at block boundaries (inside s) which may break stuff. The most tricky ones are those inside s becuase they won't be filtered out by the schema and must be filtered out by a different algorithm. Cause if they end up in the model, they will be rendered as visible spaces (a mix of  s and normal spaces) as soon as we handle rendering of multiple whitespaces.

And I'd write that this is the engine's problem, not MdDP's, but I'm unsure about it. If the DP produces a view (not a DOM), should that view be already normalised? I guess not, the normalisation should happen later, inside the engine. So perhaps we don't have to care about excessive whitespaces in the DPs.

tl;dr: I agree with Fred, but we must remeber that the engine may not be ready yet to handle some things.

fredck · 2016-09-21T13:36:40Z

The DP role is creating valid HTML.  Text (with plain whitespace to be discarded) is definitely valid and the DP should not care about it. In the model, though, for sure the node value must be Text when we load this.

Reinmar · 2016-09-21T13:46:11Z

I reported a ticket for this: https://github.com/ckeditor/ckeditor5-engine/issues/606.

scofalik · 2016-09-28T11:08:19Z

I moved the comment to the issue linked above https://github.com/ckeditor/ckeditor5-engine/issues/606.

fredck · 2016-09-28T12:07:33Z

I don't see how DP must be related to "view". It's all about "data". If for some reason it has been connected to the creation of a ViewDocumentFragment in the process, that decision may have been taken for other reasons, but not to satisfy view requirements.

If DP deals with whitespace sequences that need to be preserved, it has the job to make it the HTML way, for example by replacing such spaces with a non-breaking space character. It should certainly not expect that such spaces would be magically preserved.

All the above is very important because DPs will, many times, be created on top of "(Some Format) to HTML" conversion libraries.

Reinmar · 2016-09-28T12:26:12Z

All the above is very important because DPs will, many times, be created on top of "(Some Format) to HTML" conversion libraries.

If you create such a converter, then you use the HtmlDP for the final HTML->view transformation, as @szymonkups did in https://github.com/ckeditor/ckeditor5-markdown-gfm/blob/71a5d43888ab0202016b7ec3c3274775c2794c63/src/gfmdataprocessor.js#L27.

If every DP was meant to convert formatX<->DOM, then we'd lose a lot of flexibility, because why requiring that a DOM is an intermediary step? There can be DPs converting formatX straight to the view.

At the same time, we realised that only upon rendering to the DOM (or reading back) we know how to handle some specific, DOM-related quirks, like spaces rendering. See @scofalik's comment for more info.

To sum up – how's this working in MdDP's case? An external library converts MD to HTML (string), then we take that string and pass through HtmlDP, which takes care of normalising spaces (the MD->HTML converter can try to format the HTML – e.g. add indentation – and this must be cleared).

fredck · 2016-09-28T12:52:22Z

This gives the taste that all DPs will end up doing the same trick that the MdDP did, even if just for spaces normalization... the thin line between flexibility and over-engineering.

scofalik · 2016-09-28T13:25:09Z

I don't feel we are over-engineering anything. We are clearing whitespace at the only possible and sane place in the code.

Please take a look at discussion here https://github.com/ckeditor/ckeditor5-engine/issues/379#issuecomment-249537890

To quickly sum it up:

Model is meaningful data and it has to be clear. It means that each space character put by user should be saved as normal space character. Keep in mind that by clicking space users intention is to create a space character.
The need of   at the beginning/end of block element or between spaces is seen by us as a quirk (user never intended to insert   it's just because the HTML and browsers we need to change some spaced to nonbreakable ones).
View should also keep meaningful data and no quirks. Sill, the view should be similiar to DOM.

This causes two conclusions:

There should be no whitespaces in view that miraculously disappear when view is rendered to DOM.
View should be the entry point for data processors.

I understand that MD processor works like it works beacuse @szymonkups wanted to use existing library, which is fine. It certainly feels like other processors may follow this way.

Why we don't want to keep trash in view?

It's semantically incorrect and we don't want to keep quirks in view.  s are quirks. If HTML works other way, we would not have  s.
It's easier for algorithms if we have spaces when user clicked space and   when user intentionally wants to insert nonbreakable space (we may introduce shift + space or similar keystroke for this) and we remove invisible spaces. Consider those examples:
- Assume view/DOM foo^. Now you click space and browser inserts  . The view becomes foo ^. Then user types a letter. The browser changes   to . If we keep view same as DOM, it's impossible to have a feature that would intentionally insert and keep  because both scenarios are indistinguishable.
- Assume view/DOM foo {bar} bom. If selected range is deleted some kind of post-processor would have to take care of changing space to  . You can't leave it to renderer or something, because it would treat those spaces as whitespaces and would remove one of them (this happens now). In other words, renderer would not know which whitespaces are for removing and which are to be changed to  .

In other words, it's easier for us to work on clean model and view and keep the data that user input there. The less magic we do is better. Now we know that all spaces in the model and view are important and renderer has to only change some of them to   because browsers works like they work.

fredck · 2016-09-28T13:36:36Z

If every DP was meant to convert formatX<->DOM, then we'd lose a lot of flexibility, because why requiring that a DOM is an intermediary step? There can be DPs converting formatX straight to the view.

The above is what I was referring to, when I talked about flexibility.

It's semantically incorrect and we don't want to keep quirks in view. s are quirks. If HTML works other way, we would not have s.

This doesn't solve the problem that you brought to discussion:

Whitespace meaning "nothing" is a characteristic of HTML. In other language they might have a meaning, so we cannot arbitrary remove any chain of whitespaces at any time, because "they mean nothing".

Anyway, there is nothing to be changed on the approach you guys want to take. Let's just assume that we are eventually complicating things a bit.... but really just a bit ;)

scofalik · 2016-09-28T13:43:13Z

By complicating you mean doing it right? :)

Edit: I am not mocking you, I really don't see how can we do it right in other way... and I don't understand your approach, maybe I am missing something, but AFAICS what you propose would just bring more problems.

fredck · 2016-09-28T13:57:38Z

I was ready for a "Whatever" comment :D

vladikoff · 2017-11-21T19:33:03Z

We encountered several issues if we tried to 'sync' up the same editor data across multiple windows (or could be tabs for other users).

For example the user-entered paragraphs would go missing in mozilla/notes#407 and list styles would get confused due to missing line breaks that signify a "new list": mozilla/notes#421

I had to fork the plugin and make toData store HTML instead of markdown: vladikoff/ckeditor5-markdown-gfm@407840c

Reinmar · 2017-11-22T14:58:45Z

I had to fork the plugin and make toData store HTML instead of markdown: vladikoff/ckeditor5-markdown-gfm@407840c

Doesn't it mean that you switched to storing HTML and need this data processor only to load legacy data?

vladikoff · 2017-11-22T15:02:50Z

@Reinmar A lot of users asked for "Markdown" support so we want to process their markdown input such as ## Cool Heading and **bold** and maybe links in the future. However we don't want to store things as Markdown because that will limit us to only Markdown features (strips away important HTML formatting) and gets us locked down to Markdown's antics.

Reinmar · 2017-11-22T20:31:30Z

But do you mean that those users enter markdown in the visual mode? Like with the autoformatting? Because if that's only about a new content, then autoformatting should be enough. If about loading a content created with the previous editor, then vladikoff/ckeditor5-markdown-gfm@407840c looks fine.

CKEditorBot · 2023-10-02T05:15:45Z

There has been no activity on this issue for the past year. We've marked it as stale and will close it in 30 days. We understand it may be relevant, so if you're interested in the solution, leave a comment or reaction under this issue.

CKEditorBot · 2023-11-02T05:14:22Z

We've closed your issue due to inactivity over the last year. We understand that the issue may still be relevant. If so, feel free to open a new one (and link this issue to it).

mlewand transferred this issue from ckeditor/ckeditor5-markdown-gfm Oct 8, 2019

mlewand added the package:markdown-gfm label Oct 8, 2019

mlewand added this to the unknown milestone Oct 8, 2019

pomek removed this from the unknown milestone Feb 21, 2022

CKEditorBot added the status:stale label Oct 2, 2023

CKEditorBot added the resolution:expired This issue was closed due to lack of feedback. label Nov 2, 2023

CKEditorBot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with data processor noticed during initial implementation #2305

Issues with data processor noticed during initial implementation #2305

szymonkups commented Sep 20, 2016

szymonkups commented Sep 21, 2016

fredck commented Sep 21, 2016

Reinmar commented Sep 21, 2016

fredck commented Sep 21, 2016

Reinmar commented Sep 21, 2016

scofalik commented Sep 28, 2016 •

edited

Loading

fredck commented Sep 28, 2016

Reinmar commented Sep 28, 2016

fredck commented Sep 28, 2016

scofalik commented Sep 28, 2016 •

edited

Loading

fredck commented Sep 28, 2016

scofalik commented Sep 28, 2016 •

edited

Loading

fredck commented Sep 28, 2016

vladikoff commented Nov 21, 2017

Reinmar commented Nov 22, 2017

vladikoff commented Nov 22, 2017

Reinmar commented Nov 22, 2017

CKEditorBot commented Oct 2, 2023

CKEditorBot commented Nov 2, 2023

Issues with data processor noticed during initial implementation #2305

Issues with data processor noticed during initial implementation #2305

Comments

szymonkups commented Sep 20, 2016

szymonkups commented Sep 21, 2016

fredck commented Sep 21, 2016

Reinmar commented Sep 21, 2016

fredck commented Sep 21, 2016

Reinmar commented Sep 21, 2016

scofalik commented Sep 28, 2016 • edited Loading

fredck commented Sep 28, 2016

Reinmar commented Sep 28, 2016

fredck commented Sep 28, 2016

scofalik commented Sep 28, 2016 • edited Loading

fredck commented Sep 28, 2016

scofalik commented Sep 28, 2016 • edited Loading

fredck commented Sep 28, 2016

vladikoff commented Nov 21, 2017

Reinmar commented Nov 22, 2017

vladikoff commented Nov 22, 2017

Reinmar commented Nov 22, 2017

CKEditorBot commented Oct 2, 2023

CKEditorBot commented Nov 2, 2023

scofalik commented Sep 28, 2016 •

edited

Loading

scofalik commented Sep 28, 2016 •

edited

Loading

scofalik commented Sep 28, 2016 •

edited

Loading