Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with data processor noticed during initial implementation #2305

Closed
szymonkups opened this issue Sep 20, 2016 · 19 comments
Closed

Issues with data processor noticed during initial implementation #2305

szymonkups opened this issue Sep 20, 2016 · 19 comments
Labels
package:markdown-gfm resolution:expired This issue was closed due to lack of feedback. status:stale

Comments

@szymonkups
Copy link
Contributor

Here's a list of issues that I've found during the work on data processor initial implementation. Some of the issues are just differences between HTML rendered by GitHub and marked library.


GitHub is sometimes adding new lines between rendered HTML elements but marked library does not. This is actually not a big deal (I think) since we don't want to have text nodes in our view containing only newline char. For example, this blockquote:

> foo bar

Is rendered as:

<blockquote>
<p>foo bar</p>
</blockquote>

but marked library renders it as:

<blockquote><p>foo bar</p></blockquote>

Every paragraph is proceeded by two newlines. This markdown:

this is paragraph
# this is header

is rendered by GitHub as:

<p>this is paragraph</p>

<h1>this is header</h1>

but marked renders it as:

<p>this is paragraph</p><h1>this is header</h1>

Additional newlines are also added after rendering each <br/> element:

<p>first<br>
second<br>
third</p>

GitHub is adding newline at the end of each rendered code block:

<pre><code>code block
</code></pre>

Where marked renders without it:

<pre><code>code block</code></pre>

To-markdown does not process empty code blocks, so this from the view:

<pre><code class="lang-js"></code></pre>

will be converted to an empty markdown string.


All reference links are correctly converted to <a> elements. But when converting back they are not represented as reference links but direct ones. For example:

Foo [bar][1].
[1]: /url/  "Title"

Is converted to view element:

<p>Foo <a href="/url/" title="Title">bar</a>.</p>

So when it is converted back to markdown it looks like this:

Foo [bar](/url/ "Title").

Marked is trimming white spaces from list items. For example:

*    item1
*    item2

is rendered on GitHub as:

<ul>
<li>   item1</li>
<li>   item2</li>
</ul>

but marked trims white spaces from the beginning:

<ul><li>item1</li><li>item2</li></ul>

The same happens with ordered lists.


Some strange things happens with nested strong and emphasis. For example, this markdown:

*test **test** test*

is rendered by GitHub as:

<p><em>test *</em>test** test*</p>

but marked renders it as:

<p><em>test <strong>test</strong> test</em></p>
@szymonkups
Copy link
Contributor Author

To-markdown have problems with escaping special characters.
Example markdown:

\_test\_

Is rendered by GitHub and marked as:

<p>_test_</p>

And there is no problem with that, but when converting this HTML back to markdown it is represented as:

_test_

So when another conversion will happen it will be converted to emphasis.

@fredck
Copy link
Contributor

fredck commented Sep 21, 2016

It sounds like many of these issues are not really important for us, right? I mean, they're mainly about white-spaces and does not make a difference when rendered.

If I'm right, is it worth tracking them here? Maybe we should at least mark those that we're ok with.

@Reinmar
Copy link
Member

Reinmar commented Sep 21, 2016

I agree – whitespaces in HTML can be ignored. White spaces in MD are more tricky cause they can affect how GH renders the content later on.

OTOH, I don't know how well the engine cleans up insignificant whitespaces in loaded HTML so we may have useless text nodes between <p>s or at block boundaries (inside <p>s) which may break stuff. The most tricky ones are those inside <p>s becuase they won't be filtered out by the schema and must be filtered out by a different algorithm. Cause if they end up in the model, they will be rendered as visible spaces (a mix of &nbsp;s and normal spaces) as soon as we handle rendering of multiple whitespaces.

And I'd write that this is the engine's problem, not MdDP's, but I'm unsure about it. If the DP produces a view (not a DOM), should that view be already normalised? I guess not, the normalisation should happen later, inside the engine. So perhaps we don't have to care about excessive whitespaces in the DPs.

tl;dr: I agree with Fred, but we must remeber that the engine may not be ready yet to handle some things.

@fredck
Copy link
Contributor

fredck commented Sep 21, 2016

The DP role is creating valid HTML. <p>          Text</p> (with plain whitespace to be discarded) is definitely valid and the DP should not care about it. In the model, though, for sure the node value must be Text when we load this.

@Reinmar
Copy link
Member

Reinmar commented Sep 21, 2016

I reported a ticket for this: https://github.com/ckeditor/ckeditor5-engine/issues/606.

@scofalik
Copy link
Contributor

scofalik commented Sep 28, 2016

I moved the comment to the issue linked above https://github.com/ckeditor/ckeditor5-engine/issues/606.

@fredck
Copy link
Contributor

fredck commented Sep 28, 2016

I don't see how DP must be related to "view". It's all about "data". If for some reason it has been connected to the creation of a ViewDocumentFragment in the process, that decision may have been taken for other reasons, but not to satisfy view requirements.

If DP deals with whitespace sequences that need to be preserved, it has the job to make it the HTML way, for example by replacing such spaces with a non-breaking space character. It should certainly not expect that such spaces would be magically preserved.

All the above is very important because DPs will, many times, be created on top of "(Some Format) to HTML" conversion libraries.

@Reinmar
Copy link
Member

Reinmar commented Sep 28, 2016

All the above is very important because DPs will, many times, be created on top of "(Some Format) to HTML" conversion libraries.

If you create such a converter, then you use the HtmlDP for the final HTML->view transformation, as @szymonkups did in https://github.com/ckeditor/ckeditor5-markdown-gfm/blob/71a5d43888ab0202016b7ec3c3274775c2794c63/src/gfmdataprocessor.js#L27.

If every DP was meant to convert formatX<->DOM, then we'd lose a lot of flexibility, because why requiring that a DOM is an intermediary step? There can be DPs converting formatX straight to the view.

At the same time, we realised that only upon rendering to the DOM (or reading back) we know how to handle some specific, DOM-related quirks, like spaces rendering. See @scofalik's comment for more info.

To sum up – how's this working in MdDP's case? An external library converts MD to HTML (string), then we take that string and pass through HtmlDP, which takes care of normalising spaces (the MD->HTML converter can try to format the HTML – e.g. add indentation – and this must be cleared).

@fredck
Copy link
Contributor

fredck commented Sep 28, 2016

This gives the taste that all DPs will end up doing the same trick that the MdDP did, even if just for spaces normalization... the thin line between flexibility and over-engineering.

@scofalik
Copy link
Contributor

scofalik commented Sep 28, 2016

I don't feel we are over-engineering anything. We are clearing whitespace at the only possible and sane place in the code.

Please take a look at discussion here https://github.com/ckeditor/ckeditor5-engine/issues/379#issuecomment-249537890

To quickly sum it up:

  1. Model is meaningful data and it has to be clear. It means that each space character put by user should be saved as normal space character. Keep in mind that by clicking space users intention is to create a space character.
  2. The need of &nbsp; at the beginning/end of block element or between spaces is seen by us as a quirk (user never intended to insert &nbsp; it's just because the HTML and browsers we need to change some spaced to nonbreakable ones).
  3. View should also keep meaningful data and no quirks. Sill, the view should be similiar to DOM.

This causes two conclusions:

  1. There should be no whitespaces in view that miraculously disappear when view is rendered to DOM.
  2. View should be the entry point for data processors.

I understand that MD processor works like it works beacuse @szymonkups wanted to use existing library, which is fine. It certainly feels like other processors may follow this way.

Why we don't want to keep trash in view?

  1. It's semantically incorrect and we don't want to keep quirks in view. &nbsp;s are quirks. If HTML works other way, we would not have &nbsp;s.
  2. It's easier for algorithms if we have spaces when user clicked space and &nbsp; when user intentionally wants to insert nonbreakable space (we may introduce shift + space or similar keystroke for this) and we remove invisible spaces. Consider those examples:
    • Assume view/DOM <p>foo^</p>. Now you click space and browser inserts &nbsp;. The view becomes <p>foo&nbsp;^</p>. Then user types a letter. The browser changes &nbsp; to . If we keep view same as DOM, it's impossible to have a feature that would intentionally insert and keep&nbsp; because both scenarios are indistinguishable.
    • Assume view/DOM <p>foo {bar} bom</p>. If selected range is deleted some kind of post-processor would have to take care of changing space to &nbsp;. You can't leave it to renderer or something, because it would treat those spaces as whitespaces and would remove one of them (this happens now). In other words, renderer would not know which whitespaces are for removing and which are to be changed to &nbsp;.

In other words, it's easier for us to work on clean model and view and keep the data that user input there. The less magic we do is better. Now we know that all spaces in the model and view are important and renderer has to only change some of them to &nbsp; because browsers works like they work.

@fredck
Copy link
Contributor

fredck commented Sep 28, 2016

If every DP was meant to convert formatX<->DOM, then we'd lose a lot of flexibility, because why requiring that a DOM is an intermediary step? There can be DPs converting formatX straight to the view.

The above is what I was referring to, when I talked about flexibility.

It's semantically incorrect and we don't want to keep quirks in view.  s are quirks. If HTML works other way, we would not have  s.

This doesn't solve the problem that you brought to discussion:

Whitespace meaning "nothing" is a characteristic of HTML. In other language they might have a meaning, so we cannot arbitrary remove any chain of whitespaces at any time, because "they mean nothing".

Anyway, there is nothing to be changed on the approach you guys want to take. Let's just assume that we are eventually complicating things a bit.... but really just a bit ;)

@scofalik
Copy link
Contributor

scofalik commented Sep 28, 2016

By complicating you mean doing it right? :)

Edit: I am not mocking you, I really don't see how can we do it right in other way... and I don't understand your approach, maybe I am missing something, but AFAICS what you propose would just bring more problems.

@fredck
Copy link
Contributor

fredck commented Sep 28, 2016

I was ready for a "Whatever" comment :D

@vladikoff
Copy link

We encountered several issues if we tried to 'sync' up the same editor data across multiple windows (or could be tabs for other users).

For example the user-entered paragraphs would go missing in mozilla/notes#407 and list styles would get confused due to missing line breaks that signify a "new list": mozilla/notes#421

I had to fork the plugin and make toData store HTML instead of markdown: vladikoff/ckeditor5-markdown-gfm@407840c

@Reinmar
Copy link
Member

Reinmar commented Nov 22, 2017

I had to fork the plugin and make toData store HTML instead of markdown: vladikoff/ckeditor5-markdown-gfm@407840c

Doesn't it mean that you switched to storing HTML and need this data processor only to load legacy data?

@vladikoff
Copy link

@Reinmar A lot of users asked for "Markdown" support so we want to process their markdown input such as ## Cool Heading and **bold** and maybe links in the future. However we don't want to store things as Markdown because that will limit us to only Markdown features (strips away important HTML formatting) and gets us locked down to Markdown's antics.

@Reinmar
Copy link
Member

Reinmar commented Nov 22, 2017

But do you mean that those users enter markdown in the visual mode? Like with the autoformatting? Because if that's only about a new content, then autoformatting should be enough. If about loading a content created with the previous editor, then vladikoff/ckeditor5-markdown-gfm@407840c looks fine.

@mlewand mlewand transferred this issue from ckeditor/ckeditor5-markdown-gfm Oct 8, 2019
@mlewand mlewand added this to the unknown milestone Oct 8, 2019
@pomek pomek removed this from the unknown milestone Feb 21, 2022
@CKEditorBot
Copy link
Collaborator

There has been no activity on this issue for the past year. We've marked it as stale and will close it in 30 days. We understand it may be relevant, so if you're interested in the solution, leave a comment or reaction under this issue.

@CKEditorBot
Copy link
Collaborator

We've closed your issue due to inactivity over the last year. We understand that the issue may still be relevant. If so, feel free to open a new one (and link this issue to it).

@CKEditorBot CKEditorBot added the resolution:expired This issue was closed due to lack of feedback. label Nov 2, 2023
@CKEditorBot CKEditorBot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
package:markdown-gfm resolution:expired This issue was closed due to lack of feedback. status:stale
Projects
None yet
Development

No branches or pull requests

8 participants