As noted in this thread, it might be desirable to change what the spec says about entities.
Arguably the spec should not require that entities be replaced (in the parsing phase) by Unicode characters. A replacement will be necessary for some output formats, but there is no reason why an implementation that only targets HTML should do the replacement at all, and even an implementation that targets multiple formats might choose to handle entities in the renderer, or in an intermediate AST filter. Some implementations might also want to preserve entities in the output.
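To make the idea concrete, here is a minimal sketch of what "handle entities in the renderer" could look like. It is not cmark's actual design: the `Text` and `Entity` node types and both render functions are hypothetical names invented for illustration, and Python's `html.unescape` (which uses the HTML5 entity table) merely stands in for whatever lookup a real renderer would use.

```python
import html
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    literal: str  # ordinary character data


@dataclass
class Entity:
    # Hypothetical node: the reference exactly as written, e.g. "&amp;" or "&#228;".
    raw: str


Inline = Union[Text, Entity]


def render_html(nodes: List[Inline]) -> str:
    """HTML target: escape ordinary text, pass references through untouched."""
    out = []
    for node in nodes:
        if isinstance(node, Entity):
            out.append(node.raw)
        else:
            out.append(html.escape(node.literal, quote=False))
    return "".join(out)


def render_plain(nodes: List[Inline]) -> str:
    """Non-HTML target: resolve references to characters at render time."""
    out = []
    for node in nodes:
        if isinstance(node, Entity):
            # Assumption: the HTML5 table is an acceptable stand-in here.
            out.append(html.unescape(node.raw))
        else:
            out.append(node.literal)
    return "".join(out)
```

With this split the parser never needs an entity list; only renderers that cannot emit references verbatim have to resolve them.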
Currently the spec requires replacement for entities in a certain list. It would also simplify things not to have such a list.
There are some experiments along these lines in the entities branch of jgm/cmark.
Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).
Looks good so far!
Hmm. Isn't a "link title" in CommonMark just a fancy way to write the attribute value literal that ends up in the, well,
And given that CommonMark does not even look at, let alone convert or replace, attribute value literals in "HTML tags" anyway (and neither does Markdown in general, IIRC): leaving the "link title" string alone (maybe apart from checking for literal
Following are some thoughts of mine on this matter.
Regarding syntactically recognizing entity and character references, the spec should spell out that references of the "usual" form are recognized.
The following is basically copied from the XML syntax, except for the
Note that XML requires the terminating
The character class Digit simply comprises the ten decimal digits, while the name start character and name character classes differ among versions of HTML, XML, etc.
Using the XML definition restricted to ISO 646 (which is what CommonMark currently, implicitly, but incompletely does—eg, it disallows
Here Letter would just be the basic 52 upper and lower case letters of the ISO 646 repertoire.
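As a rough illustration of purely syntactic recognition, the sketch below uses a regular expression. The grammar in the comment is my assumption, not the spec's wording or the exact productions discussed in this thread: names are a Letter followed by letters or digits, which is narrower than XML's name-character class, and the terminating semicolon is always required.

```python
import re

# Assumed grammar (an illustration only, not taken from the spec):
#   entity reference:        "&" Letter (Letter | Digit)* ";"
#   decimal char reference:  "&#" Digit+ ";"
#   hex char reference:      "&#x" HexDigit+ ";"
# Letter is restricted to the 52 basic letters A-Z and a-z.
REFERENCE = re.compile(
    r"&(?:"
    r"(?P<name>[A-Za-z][A-Za-z0-9]*)"
    r"|#(?P<dec>[0-9]+)"
    r"|#[xX](?P<hex>[0-9A-Fa-f]+)"
    r");"
)


def find_references(text):
    """Return (kind, value) pairs for every syntactically valid reference."""
    found = []
    for match in REFERENCE.finditer(text):
        kind = match.lastgroup  # "name", "dec" or "hex"
        found.append((kind, match.group(kind)))
    return found
```

Note that nothing here consults a list of known entity names: whether `&foo;` has a replacement is left to a later stage.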
In my opinion, a good argument could be made for allowing to omit the terminating
is equivalent to
Or because it allows "joining lines" (exploiting the "lazy continuation line" rule, of course):
is equivalent to
If one defines an entity
is equivalent (after replacement, using
This is what ISO 8879 SGML has always supported (even in "Minimal SGML Documents"), and I tend to find it useful. But it might be too much for authors accustomed to HTML/XML rules …
The insane decision in the HTML5 "syntax" to allow omitting
I agree that the spec should not require (but indeed allow) replacing entity references with (which? whatever?) replacement texts.
And, as I have argued, it seems wise to also forbid replacing numeric character references (at least for the ISO 646 repertoire), to preserve the distinction between eg,
As far as the spec talks about the parsing result in terms of an AST (or—equivalently?—its representation as a CommonMark-DTD-valid XML document instance), some "entity reference" node type would suffice for unreplaced entity references, similar to your
However, it might be useful to include an optional character number just in case that "resolution" of character entity references (in the parser) is desired. The pre-defined XML entities
I find placing the entity name in a
If the parser were allowed to replace entity references with something other than a Unicode character (that is, to really handle general entities, not just character entities), then the replacement text would be inserted directly (without delimiters or its own node) into the regular character data content, that is: into the
And similarly for character references (lumping numeric and hex together, for this distinction is IMO negligible):
One could possibly unite the
Let me remind you that there are more such contexts:
+++ Martin Mitáš [Dec 04 16 13:52 ]:
Let me remind you that there are more such contexts:
* Link title (included for the sake of completeness here)
* Link destination (see Example 308)
* Image ALT string (usually rendered differently from links; also note the difference in handling of nested versus non-nested image)
* Info string in code fence line (see Example 309)
Actually not the Image ALT string (or as we call it the link description), since this is represented in cmark as a list of inlines, and we can just use ENTITY nodes there. The problem really only arises for the other three contexts, where we just have a raw string.