
Clarify order of entity/numeric character references vs. delimiter runs #572

Open · wants to merge 1 commit into base: master
Conversation

@mgeier (Contributor) commented Apr 7, 2019

@jgm (Member) commented Apr 7, 2019

Entity and numeric character references are replaced before
deciding whether a [delimiter run] can open and/or close
emphasis.

But they aren't, at least in the reference implementations.

% ./build/src/cmark
i*&#105;i*i
<p>i*ii*i</p>

And I don't think they should be, though that is surely debatable.
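The divergence can be sketched in Python (stdlib only; the helper names and the condensed left-flanking test are mine, simplified from the spec's definition). The sample input is i*&#105;i*i, where &#105; is a numeric reference for the letter i:

```python
import html
import unicodedata

ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def is_punct(ch):
    # The spec's "punctuation": ASCII punctuation plus Unicode P* categories.
    return ch in ASCII_PUNCT or unicodedata.category(ch).startswith("P")

def left_flanking(text, start, end):
    """Simplified left-flanking test for the delimiter run text[start:end]."""
    before = text[start - 1] if start > 0 else " "  # line start counts as whitespace
    after = text[end] if end < len(text) else " "
    return (not after.isspace()) and (
        not is_punct(after) or before.isspace() or is_punct(before)
    )

raw = "i*&#105;i*i"  # &#105; resolves to the letter 'i'

# Entities opaque (cmark's behavior): the first '*' is followed by '&',
# a punctuation character, and preceded by a letter -> not left-flanking,
# so it cannot open emphasis.
print(left_flanking(raw, 1, 2))                 # False

# Entities resolved first (the proposed reading): the '*' is now followed
# by a letter -> left-flanking, so it could open emphasis.
print(left_flanking(html.unescape(raw), 1, 2))  # True
```

This is only a sketch of the flanking decision itself; a full implementation also checks the matching close delimiter.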

@mgeier (Contributor, Author) commented Apr 8, 2019

Yeah, sorry, I should have been more verbose on this one. I was suggesting what the spec should be, regardless of what some implementation does.

And yes, I want to debate, please!

AFAIU, this case was previously undefined by the spec, and different implementations have different behavior.

http://johnmacfarlane.net/babelmark2/?text=i*%26%23105%3Bi*i

pandoc and markdig, among others, have the behavior I'm suggesting.

commonmark.js and MD4C (and cmark which is not in Babelmark) have the behavior you are showing.

I think my suggestion makes more sense. If an author uses a numeric character reference, they may have any number of reasons for doing so, but in the end they want it displayed as the letter, symbol, or number that the reference represents. If it represents a "normal" letter, an author would expect it to behave like a letter and not like a punctuation character. Therefore, the references must be resolved before deciding whether a delimiter run can open and/or close emphasis.

I don't really see any argument for your suggestion other than that it might be easier to implement in some given implementation.

The situation gets more complicated when the numeric character reference represents a "Unicode whitespace". Then, the decision whether a "delimiter run" is "left-flanking" or "right-flanking" should also be made after the references are resolved.

An example would be i*&#32;i*i, which should not be rendered as "emphasis", because the left * is not "left-flanking":

http://johnmacfarlane.net/babelmark2/?text=i*%26%2332%3Bi*i

commonmark.js, MD4C and markdig get this right, but I guess this may be accidental. In this example pandoc gets it wrong.

It looks like markdig is the only library that, IMHO, gets both examples right.
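For reference, this is how a stock entity resolver sees the two inputs under discussion (a Python sketch; `html.unescape` follows the HTML5 entity rules that the spec's entity handling also references):

```python
import html

# &#105; is LATIN SMALL LETTER I: resolving first puts a letter after the
# first '*', which would let it open emphasis.
print(html.unescape("i*&#105;i*i"))  # i*ii*i

# &#32; is SPACE: resolving first puts whitespace after the first '*',
# so it would not be left-flanking and could not open emphasis.
print(html.unescape("i*&#32;i*i"))   # i* i*i
```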

@jgm (Member) commented Apr 8, 2019

pandoc and markdig, among others, have the behavior I'm suggesting.

pandoc isn't implementing commonmark yet for its default markdown parser, so its behavior isn't relevant here.

I agree, though, that this is an issue that should be clarified, one way or the other, in the spec.

I also agree that from a user's point of view, it might be surprising if whether you use &ouml; or ö makes a difference to whether something is treated as emphasis.

Here are three things to be said in favor of the present behavior of the reference implementation:

  1. It might actually be an advantage that you can influence flankingness by deciding to use an entity. There could be cases where you don't want emphasis and you can avoid it by using an entity, or where you do want emphasis and you can get it by using an entity. This would be a good example:

    *&#32;i*
    

    This gives you a way to include a space at the beginning of an emphasized section if you really want to.

  2. Flankingness is meant to mirror our visual impression of the grouping that the emphasis delimiters make. For example, in a*&b;*c* we "see" the grouping of asterisks around c as more significant than the grouping around &b;. This "seeing" is based on our apprehension of the characters on the page, and from this point of view it doesn't matter much if & is part of an entity.

  3. The current behavior is somewhat more efficient to implement, since we don't have to scan back or forward for entities when determining the flankingness of a delimiter run. However, I'm not sure how significant this is; we'd have to try implementing the other approach and measure to know for sure, and we'd have to consider performance for pathological cases like

    "&ClockwiseContourIntegrAK;*&ClockwiseContourIntegrAK;" * 3000
    

I'd be interested in getting feedback from others about this issue. @kivikakk @mity

@jgm added this to the 0.30 milestone Apr 8, 2019
@mity commented Apr 8, 2019

The first thing I'd like to highlight is that IMHO entities should never be usable as an alternative Markdown syntax construction; i.e. &#x2a;foo&#x2a; is like \*foo\* and certainly not *foo*.

All Markdown syntax marks use ASCII and are easy to write without entities. Using an entity automatically means it is a literal character, not a Markdown syntax construction.

Now, for the flankingness:

It might actually be an advantage that you can influence flankingness by deciding to use an entity.

Actually I see that as a poor and counter-intuitive tool for the job. If we need such a tool, there should be some clearer syntax for explicitly enforcing left and/or right flanking.

The current behavior is somewhat more efficient to implement, since we don't have to scan back or forward for entities when determining the flankingness of a delimiter run. However, I'm not sure how significant this is; we'd have to try implementing the other approach and measure to know for sure.

The performance impact would depend on the implementation. I believe that for MD4C it would be negligible, because entities are resolved before emphasis, so I can ask in O(1) whether there is an entity mark just before or after the currently analyzed emphasis delimiter run.

If the implementation needed to scan the input text, backward scanning (for a preceding entity) might be a problem, because it is hard to say whether the initial & is escaped or not (there might be an arbitrarily long run of \ before the &). Ignoring that, the text scanning would be negligible too, because for flankingness you only ever need to examine at most one entity in the left neighborhood and one in the right neighborhood of the delimiter run. So if there is a reasonable limit on entity length (e.g. at most 32 characters between & and ;), we are still O(1) and all that work should still fit in a CPU cache line.
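The bounded backward scan described above could look something like this (a Python sketch under the stated assumptions: a cap of 32 characters between & and ;, and the rule that a & preceded by an odd run of backslashes is escaped; all names and the entity-shape regex are mine):

```python
import re

MAX_ENTITY_LEN = 32  # assumed cap on characters between '&' and ';'

# Rough entity shape: decimal, hex, or named reference (bounds are assumptions).
_ENTITY = re.compile(r"&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]{1,31});")

def entity_ending_at(text, pos):
    """Return the entity whose ';' sits at text[pos - 1], or None.

    Scans back at most MAX_ENTITY_LEN + 2 characters, so the check is O(1)
    per delimiter run; a '&' preceded by an odd run of backslashes is escaped.
    """
    if pos == 0 or text[pos - 1] != ";":
        return None
    lo = max(0, pos - MAX_ENTITY_LEN - 2)
    amp = text.rfind("&", lo, pos - 1)
    if amp == -1:
        return None
    # Count the backslashes immediately before '&'; an odd run escapes it.
    backslashes = 0
    while amp - 1 - backslashes >= 0 and text[amp - 1 - backslashes] == "\\":
        backslashes += 1
    if backslashes % 2 == 1:
        return None
    m = _ENTITY.fullmatch(text, amp, pos)
    return m.group(0) if m else None
```

For example, entity_ending_at("i&#32;", 6) returns "&#32;", while the escaped r"i\&#32;" yields None at the same position.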

But I see the difficulty of this issue elsewhere. For MD4C the issue is what kind of entity it is and what it means for flanking, because the MD4C parser leaves the translation of entities to the renderer. To the parser, anything like &[:alpha:][:alnum:]{1-47}; is a named entity, but the parser does not know anything about the character it represents.

It leaves dealing with them to the renderer, because the renderer knows more about the desired properties of the output (e.g. its encoding). Consider e.g. an ASCII-only renderer which transforms &copy; to (c), or an HTML renderer which outputs entities verbatim and postpones the work to the browser.

Now, if we want the parser to treat them differently for flanking analysis, then it also needs to know the list of valid (named) entities and their properties, to distinguish whitespace (e.g. &nbsp;) versus punctuation (e.g. &bsemi;) versus a letter (e.g. &ouml;).

Knowing the entities and their properties could no longer be just the renderer's business. And it would likely be horrible from a design PoV if both the parser and the renderer had to maintain their own lists of entity names, even if for different purposes...
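If the parser did have to classify entities for flanking, the information it needs is just the class of the resolved character. A Python sketch (names are mine; Python's HTML5 entity table and Unicode categories stand in for the spec's own character classes):

```python
import html
import unicodedata

def classify(entity):
    """Rough three-way split of a resolved entity, as flanking would need it."""
    ch = html.unescape(entity)
    if len(ch) != 1:
        return "other"  # unknown reference, or one resolving to several chars
    if ch.isspace():
        return "whitespace"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "other"

print(classify("&nbsp;"))   # whitespace  (U+00A0)
print(classify("&bsemi;"))  # punctuation (U+204F)
print(classify("&ouml;"))   # other       (a letter)
```

The design cost mity describes is visible here: this classifier only works because Python ships the full HTML5 entity table, exactly the dependency a minimal parser would rather avoid.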

@jgm (Member) commented Apr 8, 2019

The first thing I'd like to highlight is that IMHO entities should never be usable as an alternative Markdown syntax construction; i.e. &#x2a;foo&#x2a; is like \*foo\* and certainly not *foo*.

All Markdown syntax marks use ASCII and are easy to write without entities. Using an entity automatically means it is a literal character, not a Markdown syntax construction.

Yes, this is now explicit in the spec. (It wasn't before 0.29.) However, the wording only covers actual delimiters, so it leaves this issue unresolved.

But I see the difficulty of this issue elsewhere. For MD4C the issue is what kind of entity it is and what it means for flanking, because the MD4C parser leaves the translation of entities to the renderer. To the parser, anything like &[:alpha:][:alnum:]{1-47}; is a named entity, but the parser does not know anything about the character it represents.

That's a very good point. We used to require parsers to translate entities to characters, but we no longer do -- partly because this seems an unreasonable thing to require of a parser that is just going to pass the entity on to HTML. It would mean that parsers have to implement a lookup table for all the valid HTML5 entities. (EDIT: Actually, the current spec does refer to valid HTML5 entities, so parsers do need to keep a table. But that is something I've wanted to get rid of; see the entities branch.)

@kivikakk (Contributor) commented Apr 9, 2019

It might actually be an advantage that you can influence flankingness by deciding to use an entity.

This seems like a bad use case; if there's a design goal to let users influence flankingness, this should not be the solution, per @mity's comment. However:

Flankingness is meant to mirror our visual impression of the grouping that the emphasis delimiters make.

This accords well with the design goal of CommonMark, per §1.1 (Gruber quote):

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible.

In that sense, the example given in point one, *&#32;i*, does intuitively read like it should all be emphasised, per current behaviour.

I'm inclined to suggest entities should be treated as opaque for the purposes of determining flankingness, especially because we refuse to interpret them as Markdown structural elements. It seems that we should either do both, or neither.

@mity commented Apr 9, 2019

I'm inclined to suggest entities should be treated as opaque for the purposes of determining flankingness, especially because we refuse to interpret them as Markdown structural elements. It seems that we should either do both, or neither.

I am not sure whether I fully agree with this. The flanking rules are not so much an answer to "is there a Markdown structural element before or after this delimiter run?" Their purpose is more "is this delimiter run at the beginning of a word, at the end of a word, or inside a word?"

As such it is somewhat exceptional on its own, because a word boundary has no explicit Markdown structural syntax.

Furthermore, there are actually only three things for which an implementation has to really understand Unicode (as distilled from the specification):

  1. For (case-insensitive) matching of a link reference with corresponding link reference definition, Unicode case folding is used.

  2. For detection of word boundaries when processing emphasis and strong emphasis, some classification of Unicode characters (whitespace, punctuation) is used.

  3. For translating HTML entities (e.g. &amp;) and numeric character references (e.g. &#35; or &#xcab;) into their Unicode equivalents.

No other Markdown structural element has to take Unicode into account.
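The three Unicode-dependent operations above map onto small stdlib primitives (a Python sketch; the specific functions are my choice, and a real implementation would use the spec's own character classes and entity list):

```python
import html
import unicodedata

# 1. Case folding, used to match link labels against link reference
#    definitions (e.g. German sharp s folds to "ss" in both cases).
assert "ẞ".casefold() == "ß".casefold() == "ss"

# 2. Whitespace/punctuation classification, used by the flanking rules.
assert unicodedata.category("\u00a0") == "Zs"          # no-break space
assert unicodedata.category("\u00bf").startswith("P")  # inverted question mark

# 3. Translating entities and numeric character references.
assert html.unescape("&amp; &#35; &#xcab;") == "& # \u0cab"
```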

And this discussion is about situations where two of these exceptional cases (points 2 and 3) meet. These situations are already very exceptional from the PoV of the specification, and nothing we do can make them more exceptional than they already are.

EDIT: To clarify/summarize what I wanted to say: there is no precedent and no analogy in the specification to follow.

@Crissov (Contributor) commented Apr 9, 2019

All implementations covered by Babelmark 3 agree on treating entity references similar to backslash escapes:

&gt; no quote

\> no quote
. 
<p>> no quote</p>
<p>> no quote</p>

They widely disagree about flanking behavior; even implementations claiming CommonMark conformance yield different results. (Okay, it's only Markdig.)

_&#32;perhaps emphasis_

_\ perhaps emphasis_

_ no emphasis_
. 
<p><em>perhaps emphasis</em></p>
<p><em>\ perhaps emphasis</em></p>
<p>_ no emphasis_</p>

@mity commented Apr 9, 2019

I spent an hour or two thinking about this issue, and to be honest I am still very much undecided. I see good arguments for both ways.

Maybe a big part of why I am undecided is that I have no idea whether this is really a practical issue or just a hunt for theoretical problems.

I am Czech. In my language we have all those letters with funny diacritic marks like "ěščřžýáíů". We also quite commonly encounter other Latin letters with different diacritics, e.g. in personal or geographic names (Slovak, German, Polish). That makes me in general a fan of i18n support in software. But the truth is I have never used an HTML entity to express any of those letters. It feels very unnatural to me to use escape sequences for letters (and I believe to people around me as well). Even before the widespread adoption of Unicode, we used ISO 8859-2 or Windows code page 1250 (and before them some other strange encodings like Kamenický), depending on the context, rather than entities.

I did often use entities in cases outside the set of "normal" letters, e.g. &copy; or &nbsp;: things which are not part of a normal word. And for those, the current CommonMark specification more or less works.

But the world is a big place, and maybe people speaking a language that is (from my PoV) exotic see it very differently.

So from a practical point of view, I think we should go the harder way if there is a language or culture or group of people who commonly use entities for letters which by no means form a word boundary. That, in my eyes, is the main question. So, is there?

BTW, even then there might be situations where it won't fully work. AFAIK, there is an African language family which uses the character ! for a very specific click sound, so ! may actually be punctuation or a letter (and also a mathematical symbol) depending on the context, locale, or the user's intention. I do not think we could ever fully solve really hard cases like that, so we will have to stop growing the complexity of the specification and implementations somewhere. The question is just "where".

@jgm (Member) commented Apr 9, 2019 via email

@mity commented Apr 9, 2019

Well, yes. But the tab character is ASCII. The specification does not demand recognition of other whitespace characters with code points >= 128 for line indentation, right?

@jgm (Member) commented Apr 10, 2019 via email

@jgm (Member) commented Apr 10, 2019

But you might have non-ASCII characters before the tab char.

I'll concede, though, that this only really matters for getting source position information. In places where you need to compute tab stops for purposes of block parsing (e.g. list items), only ASCII characters can occur before the tab.
