Clarify order of entity/numeric character references vs. delimiter runs #572
Conversation
But they aren't, at least in the reference implementations.
And I don't think they should be, though that is surely debatable.
Yeah, sorry, I should have been more verbose on this one. I was suggesting what the spec should be, regardless of what some implementation does. And yes, I want to debate, please! AFAIU, this case was previously undefined by the spec, and different implementations have different behavior. http://johnmacfarlane.net/babelmark2/?text=i*%26%23105%3Bi*i
I think my suggestion makes more sense. If an author uses a numeric character reference, they may have their reasons for doing so, but in the end they want it displayed as whatever letter, symbol, or number the reference represents. If it represents a "normal" letter, an author would expect it to behave like a letter and not like a punctuation character. Therefore, the references must be resolved before deciding whether a delimiter run can open and/or close emphasis. I don't really see any argument for your suggestion other than that it might be easier to implement in some given implementation.

The situation gets more complicated when the numeric character reference represents a "Unicode whitespace". Then the decision whether a "delimiter run" is "left-flanking" or "right-flanking" should also be made after the references are resolved. An example would be http://johnmacfarlane.net/babelmark2/?text=i*%26%2332%3Bi*i
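To make the suggestion concrete, here is a minimal C sketch (not taken from any implementation discussed in this thread; the function name is made up) of the first half of "resolve first": decoding a numeric character reference such as `&#105;` or `&#x69;` into a code point, which a flanking check could then classify instead of looking at the raw `&`.

```c
#include <stdio.h>

/* Decode a numeric character reference starting at s ("&#105;" or "&#x69;").
 * On success returns the code point and stores the reference's length in
 * *len; returns -1 if s does not start a valid numeric reference.
 * (A real parser would also cap the digit count and map 0 and other
 * invalid code points to U+FFFD; omitted for brevity.) */
static long decode_numeric_ref(const char *s, int *len) {
    long cp = 0;
    int i;
    if (s[0] != '&' || s[1] != '#')
        return -1;
    if (s[2] == 'x' || s[2] == 'X') {
        for (i = 3; ; i++) {
            int c = s[i];
            if (c >= '0' && c <= '9')      cp = cp * 16 + (c - '0');
            else if (c >= 'a' && c <= 'f') cp = cp * 16 + (c - 'a' + 10);
            else if (c >= 'A' && c <= 'F') cp = cp * 16 + (c - 'A' + 10);
            else break;
        }
        if (i == 3) return -1;            /* "&#x" with no hex digits */
    } else {
        for (i = 2; s[i] >= '0' && s[i] <= '9'; i++)
            cp = cp * 10 + (s[i] - '0');
        if (i == 2) return -1;            /* "&#" with no digits */
    }
    if (s[i] != ';') return -1;
    *len = i + 1;
    return cp;
}

int main(void) {
    /* The character "after" the first '*' in "i*&#105;i*i": unresolved it
     * is '&' (ASCII punctuation); resolved it is U+0069 'i', a letter. */
    int len;
    long cp = decode_numeric_ref("&#105;", &len);
    printf("resolved to U+%04lX, reference length %d\n", (unsigned long)cp, len);
    return 0;
}
```

Whether the resulting code point is then treated as a letter, whitespace, or punctuation for flanking purposes is exactly the question this thread is about.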
pandoc isn't implementing commonmark yet for its default markdown parser, so its behavior isn't relevant here. I agree, though, that this is an issue that should be clarified, one way or the other, in the spec. I also agree that from a user's point of view, it might be surprising if it matters whether you write a character literally or as a character reference. Here are three things to be said in favor of the present behavior of the reference implementation:
I'd be interested in getting feedback from others about this issue. @kivikakk @mity
The 1st thing I'd like to highlight is that IMHO entities should never, ever be usable as an alternative way to write a Markdown syntax construct. All Markdown syntax marks are ASCII and easy to write without entities. Using an entity automatically means it is a literal character, not a Markdown syntax construct. Now, for the flankingness:
Actually I see that as a poor/counter-intuitive tool for that. If we need such a tool, there should be some clearer syntax for explicitly enforcing left and/or right flanking.
The performance impact would depend on the implementation. I believe that for MD4C it would be negligible, because entities are recognized before the emphasis pass, so I can easily ask whether the text next to a delimiter run is one. If an implementation has to scan the input text, it might be a problem: when you scan backward (for a preceding entity), it is hard to tell whether what you find really is a valid entity.

But I see the difficulty of this issue elsewhere. For MD4C the issue is "what kind of entity is it, and what does it mean for the flanking", because the MD4C parser leaves the translation of entities to the renderer. For the parser, anything with the syntactic shape of an entity is just marked as a potential entity. It leaves dealing with it to the renderer, because the renderer knows more about the desired properties of the output (e.g. its encoding). Consider e.g. an ASCII-only renderer, which has to transform such a reference into something it can actually output.

Now, if we want the parser to treat entities specially in the flanking analysis, then it also needs to know the list of valid (named) entities and their properties, to distinguish a whitespace from punctuation from an ordinary letter. It could no longer be just the renderer's business to know the entity and its properties. And it would likely be horrible from a design PoV if both the parser and the renderer had to maintain their own lists of entity names, even if for different purposes...
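To illustrate the design cost described above, here is a hypothetical sketch (the names and table contents are my own, not MD4C's API) of what a parser would additionally have to know if named references were to influence flanking: a mapping from entity names to at least a coarse character class.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical character classes the flanking rules distinguish. */
typedef enum { CLASS_WHITESPACE, CLASS_PUNCTUATION, CLASS_OTHER } char_class;

/* To let named references influence flanking, the *parser* would need
 * (a slice of) the HTML5 entity list plus the Unicode category of each
 * referenced code point -- knowledge that could otherwise stay entirely
 * on the renderer's side. */
static const struct { const char *name; char_class cls; } entity_class[] = {
    { "nbsp",   CLASS_WHITESPACE  },  /* U+00A0, category Zs           */
    { "amp",    CLASS_PUNCTUATION },  /* U+0026 '&', ASCII punctuation */
    { "bull",   CLASS_PUNCTUATION },  /* U+2022 '•', category Po       */
    { "eacute", CLASS_OTHER       },  /* U+00E9 'é', a letter          */
};

static char_class classify_named_entity(const char *name) {
    for (size_t i = 0; i < sizeof entity_class / sizeof entity_class[0]; i++)
        if (strcmp(entity_class[i].name, name) == 0)
            return entity_class[i].cls;
    return CLASS_OTHER;  /* unknown names are literal text anyway */
}

int main(void) {
    printf("eacute -> %d\n", (int)classify_named_entity("eacute"));  /* CLASS_OTHER */
    printf("nbsp   -> %d\n", (int)classify_named_entity("nbsp"));    /* CLASS_WHITESPACE */
    return 0;
}
```

In MD4C today that knowledge lives only on the renderer side; duplicating even this coarse version of it in the parser is the maintenance burden described above.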
Yes, this is now explicit in the spec. (It wasn't before 0.29.) However, the wording only covers actual delimiters, so it leaves this issue unresolved.
That's a very good point.
This seems like a bad use case; if there's a design goal to let users influence flankingness, this should not be the solution, per @mity's comment. However:
This accords well with the design goal of CommonMark, per §1.1 (Gruber quote):

> The overriding design goal for Markdown's formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions.
In that sense (cf. the example given in point one above), I'm inclined to suggest entities should be treated as opaque for the purposes of determining flankingness, especially because we refuse to interpret them as Markdown structural elements. It seems that we should either do both, or neither.
I am not sure whether I fully agree with this. The flanking rules are not so much an answer to "is or isn't there a Markdown structural element before or after this delimiter run?" Their purpose is more "is this delimiter run at the beginning of a word, at the end of a word, or inside a word?" As such they are somewhat exceptional on their own, because a word boundary has no explicit Markdown syntax. Furthermore, there are actually only three things where the implementation has to really understand the Unicode (as distilled from the specification):
No other Markdown structural element has to take Unicode into account. And this discussion is about situations where two of these exceptional cases (points 2 and 3) meet each other. So these situations are already very exceptional from the POV of the specification, and nothing we do can make them more exceptional than they already are.

EDIT: To clarify/summarize what I wanted to say: there is no precedent and no analogy in the specification to follow.
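For reference, here is a condensed sketch of the spec's left/right-flanking test, simplified to ASCII character classes (the spec uses the Unicode whitespace and punctuation classes, and treats the beginning/end of line as whitespace). The only thing in dispute in this thread is which characters get passed in as `before`/`after`: the raw source characters (so `&` for a reference) or the characters after references are resolved.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified character classes; 0 stands for the beginning/end of the line,
 * which counts as whitespace for the flanking rules. */
static bool ws(int c)    { return c == 0 || c == ' ' || c == '\t' || c == '\n'; }
static bool punct(int c) { return c != 0 && ispunct(c); }

/* before/after are the characters adjacent to the delimiter run. */
static bool left_flanking(int before, int after) {
    return !ws(after) && (!punct(after) || ws(before) || punct(before));
}

static bool right_flanking(int before, int after) {
    return !ws(before) && (!punct(before) || ws(after) || punct(after));
}

int main(void) {
    /* First '*' in "i*&#105;i*i": in the raw source the character after the
     * run is '&' (punctuation) and the one before is 'i', so the run is not
     * left-flanking; with the reference resolved first it is. */
    printf("raw source: left=%d right=%d\n",
           left_flanking('i', '&'), right_flanking('i', '&'));  /* 0 1 */
    printf("resolved:   left=%d right=%d\n",
           left_flanking('i', 'i'), right_flanking('i', 'i'));  /* 1 1 */
    return 0;
}
```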
All implementations covered by Babelmark 3 agree on treating entity references similarly to backslash escapes:

```
&gt; no quote
\> no quote
.
<p>&gt; no quote</p>
<p>&gt; no quote</p>
```

They widely disagree about flanking behavior; even implementations claiming CM conformance yield different results. (Okay, itʼs only Markdig.)

```
_&#32;perhaps emphasis_
_\ perhaps emphasis_
_ no emphasis_
.
<p><em>perhaps emphasis</em></p>
<p><em>\ perhaps emphasis</em></p>
<p>_ no emphasis_</p>
```
I spent an hour or two thinking about this issue. And to be honest, I am still very indecisive in this case. I see good arguments for both ways. Maybe a big part of why I am indecisive is that I have no idea whether it really is a real issue or just a search for a theoretical problem.

I am Czech. In my language we have all those letters with funny diacritic marks like "ěščřžýáíů". We also quite commonly meet other Latin letters with different diacritics, e.g. in personal or geographic names (Slovak, German, Polish). That makes me in general a fan of I18N support in software. But the truth is I have never used an HTML entity to express any of those letters. It feels very unnatural to me to use escape sequences for letters (and I believe to the people around me as well). Even before Unicode became widespread we used ISO 8859-2 or Windows code page 1250 (and before them some other strange encodings like Kamenický), depending on the context, rather than entities. I did often use entities for characters outside the set of "normal" letters, though.

But the world is a big place, and maybe people speaking a language that is (from my POV) exotic see it very differently. So from a practical point of view, I think we should go the harder way if there is a language or culture or group of people who commonly use entities for letters which by no means form a word boundary. And that is, in my eyes, the main question. So, is there?

BTW, even then there might be situations where it won't fully work. AFAIK, there is some African language family which uses the character | as a regular letter, i.e. a character Markdown will always treat as punctuation.
Martin Mitáš <notifications@github.com> writes:
> Furthermore, there are actually only three things where the implementation has to really understand the Unicode (as distilled from the specification):
Actually there's also
4. Proper tab resolution -- here you can't just operate on bytes, you have to count characters.
Well, yes. But the tab char is ASCII. The specification does not demand recognition of other white-space characters with code points >= 128 for the line indentation, right?
Martin Mitáš <notifications@github.com> writes:
> Well, yes. But the tab char is ASCII. The specification does not demand recognition of other white-space characters with code points >= 128 for the line indentation, right?
The tab char is indeed ASCII. But you might have non-ASCII characters before the tab char. If all you know is that you have `BYTE BYTE BYTE BYTE BYTE TAB`, then you don't know whether the TAB counts for 4 spaces, or 3, or 2, or 1. The five bytes could be five characters, or three, or two...
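A small sketch of this point, assuming UTF-8 input (an illustration, not code from cmark; the helper name is mine): to expand a tab you need the column measured in characters, not bytes, so the parser has to at least count UTF-8 lead bytes.

```c
#include <stdio.h>

/* Count characters (code points) in the UTF-8 bytes s[0..n) by skipping
 * continuation bytes (10xxxxxx). Combining marks still count separately;
 * a full implementation might care about that too. */
static int utf8_count(const unsigned char *s, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

int main(void) {
    const unsigned char line[] = "ěšč\tfoo";   /* "ěšč" is 3 characters, 6 bytes */
    int bytes = 0;
    while (line[bytes] != '\t')
        bytes++;                                /* bytes == 6 */
    int chars = utf8_count(line, bytes);        /* chars == 3 */

    /* With 4-column tab stops (0-based columns), a tab at column c advances
     * 4 - c % 4 columns. Counting bytes gives the wrong answer. */
    printf("tab width counting bytes: %d\n", 4 - bytes % 4);  /* 2 (wrong) */
    printf("tab width counting chars: %d\n", 4 - chars % 4);  /* 1 */
    return 0;
}
```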
I'll concede, though, that this only really matters for getting source position information. In places where you need to compute tab stops for purposes of block parsing (e.g. list items), only ASCII characters can occur before the tab.
See #474 and https://talk.commonmark.org/t/when-exactly-should-numeric-character-references-be-replaced/2121