Entities #442

Open
jgm opened this Issue Dec 2, 2016 · 4 comments

jgm commented Dec 2, 2016

As noted in this thread, it might be desirable to change what the spec says about entities.

Arguably the spec should not require that entities be replaced (in the parsing phase) by Unicode characters. Replacement will be necessary for some output formats, but there is no reason why an implementation that only targets HTML should do the replacement at all, and even an implementation that targets multiple formats might choose to handle entities in the renderer or in an intermediate AST filter. Some implementations might also want to preserve entities in the output.

Currently the spec requires replacement for entities in a certain list. It would also simplify things not to have such a list.

jgm Dec 2, 2016

Some experiments along these lines in the entities branch of jgm/cmark.
This creates a CMARK_NODE_ENTITY node type and does conversions in the man and latex renderers only.

Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).
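
As an illustration of doing the conversion in the renderer rather than in the parser, here is a minimal Python sketch. It is not the cmark code: the `(kind, value)` node tuples and the `render` and `entity_to_text` functions are hypothetical, and the lookup uses the `html5` table from Python's standard `html.entities` module as a stand-in for whatever entity table a renderer might carry.

```python
from html.entities import html5  # maps names such as "eacute;" to characters


def entity_to_text(name: str) -> str:
    """Resolve an entity name (without '&' and ';') for a non-HTML renderer.

    Unknown names are passed through as the literal reference.
    """
    replacement = html5.get(name + ";")
    return replacement if replacement is not None else f"&{name};"


def render(nodes, fmt):
    """Render hypothetical (kind, value) nodes; only non-HTML output converts entities."""
    out = []
    for kind, value in nodes:
        if kind == "entity":
            # HTML output keeps the reference as written; other formats resolve it.
            out.append(f"&{value};" if fmt == "html" else entity_to_text(value))
        else:
            out.append(value)
    return "".join(out)


nodes = [("text", "caf"), ("entity", "eacute"), ("text", " "),
         ("entity", "amp"), ("text", " tea")]
print(render(nodes, "html"))  # entity references are left untouched
print(render(nodes, "text"))  # entity references are resolved to characters
```

The parser in this sketch never needs an entity list; only the renderer that actually has to produce characters consults one.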

@jgm jgm closed this Dec 2, 2016

@jgm jgm reopened this Dec 2, 2016

tin-pot Dec 3, 2016

Looks good so far!

> Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).

Hmm. Isn't a "link title" in CommonMark just a fancy way to write the attribute value literal that ends up in the, well, title attribute?

And given that CommonMark does not even look at, let alone convert or replace anything in, attribute value literals in "HTML tags" anyway (nor does Markdown in general, IIRC), leaving the "link title" string alone (maybe apart from checking for literal < and & to guard against XML's touchiness) would simply seem to be consistent and justified behaviour in my view.

Following are some thoughts of mine on this matter.


Lexis

Regarding syntactically recognizing entity and character references, the spec should spell out that references of the "usual" form are recognized.

The following is basically copied from the XML syntax, except for the "&#x" vs "&#X" alternative in hex character reference.

Note that XML requires the terminating ";" character and omits SGML's named character references such as &#SPACE; (the actual ones, not what HTML5 "terminology" calls by that name), which never took off outside SGML.

reference = entity reference
          | character reference ;

entity reference = "&" , name , ";" ;

character reference = numeric character reference
                    | hex character reference ;

numeric character reference = "&#" , number , ";" ;

hex character reference = ( "&#x" | "&#X" ) , hex number , ";" ;

hex number = hex digit , { hex digit } ;

hex digit = Digit | "A".."F" | "a".."f" ;

number = Digit , { Digit } ;

name = name start character , { name character } ;

The character class Digit simply comprises the ten decimal digits, while the name start character and name character classes differ among versions of HTML, XML, etc.

Using the XML definition restricted to ISO 646 (which is what CommonMark currently does, implicitly but incompletely; e.g., it disallows . in tag names) is probably good enough:

name start character = Letter | ":" | "_" ;

name character = name start character 
               | "-" | "." | Digit ;

Here Letter would just be the basic 52 upper and lower case letters of the ISO 646 repertoire.
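
The grammar above translates fairly directly into a regular expression. The following Python sketch is only an illustration under stated assumptions: hex character references are taken to start with "&#x" or "&#X" as in XML, the character classes are the ISO 646 ones just described, and the names `REFERENCE` and `find_references` are made up for this example.

```python
import re

# Character classes restricted to the ISO 646 repertoire.
NAME_START = r"[A-Za-z:_]"
NAME_CHAR = r"[A-Za-z:_\-.0-9]"

# reference = entity reference | numeric character reference | hex character reference
REFERENCE = re.compile(
    rf"&(?:(?P<name>{NAME_START}{NAME_CHAR}*)"
    rf"|#(?P<dec>[0-9]+)"
    rf"|#[xX](?P<hex>[0-9A-Fa-f]+));"
)


def find_references(text):
    """Return ("entity", name) or ("char", code point) for each reference found."""
    refs = []
    for m in REFERENCE.finditer(text):
        if m.group("name") is not None:
            refs.append(("entity", m.group("name")))
        elif m.group("dec") is not None:
            refs.append(("char", int(m.group("dec"))))
        else:
            refs.append(("char", int(m.group("hex"), 16)))
    return refs


print(find_references("a &amp; b &#124; c &#x7C; d & e &1x;"))
```

A bare "&" and "&1x;" are not recognized here, because neither starts with a name start character or a "#".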


In my opinion, a good argument could be made for allowing the terminating ";" to be omitted in certain cases, either because it is convenient, for example

you could&nbsp&ndash always&nbsp&ndash write like this!

is equivalent to

you could&nbsp;&ndash; always&nbsp;&ndash; write like this!

Or because it allows "joining lines" (exploiting the "lazy continuation line" rule, of course):

you could write about hyphen&shy
ation like this.

is equivalent to

you could write about hyphen&shy;ation like this.

If one defines an entity null with an empty replacement text, this provides an actual "line joining" feature:

you could just join two lines to&null
gether like this

is equivalent (after replacement, using <!ENTITY null "">) to

you could just join two lines together like this

This is what ISO 8879 SGML has always supported (even in "Minimal SGML Documents"), and I tend to find it useful. But it might be too much for authors accustomed to HTML/XML rules …
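
To make the examples above concrete, here is a rough Python sketch of the optional-terminator rule as the examples use it. It assumes that a reference is closed either by ";" or by a line end (both consumed), and otherwise simply ends before the first non-name character. The names `OPEN_REFERENCE`, `normalize`, and `expand` are hypothetical, and this is not a full implementation of the SGML rules.

```python
import re

# Name characters as in the ISO 646 grammar; ";" or a line end closes the
# reference and is consumed, any other character just ends the name.
OPEN_REFERENCE = re.compile(r"&([A-Za-z:_][A-Za-z:_\-.0-9]*)(?:;|\r?\n)?")


def normalize(text):
    """Rewrite every reference into the fully terminated "&name;" form."""
    return OPEN_REFERENCE.sub(lambda m: f"&{m.group(1)};", text)


def expand(text, entities):
    """Replace references with their replacement text; unknown names stay as "&name;"."""
    def repl(m):
        name = m.group(1)
        return entities.get(name, f"&{name};")
    return OPEN_REFERENCE.sub(repl, text)


print(normalize("you could&nbsp&ndash always&nbsp&ndash write like this!"))
print(normalize("you could write about hyphen&shy\nation like this."))
print(expand("join two lines to&null\ngether", {"null": ""}))
```

One consequence of such a rule is that any "&" followed by a letter is read as the start of a reference, so text like "AT&T" would need the ampersand escaped.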

The insane decision in the HTML5 "syntax" to allow omitting ";" after some random set of entity names (presumably for compatibility reasons with some existing browsers?) is of course not something one should adopt. I wasn't even aware of that until now! But don't get me started about the HTML5 "syntax" anyway! ;-)


Processing

I agree that the spec should not require (but indeed allow) replacing entity references with (which? whatever?) replacement texts.

And, as I have argued, it seems wise to also forbid replacing numeric character references (at least for the ISO 646 repertoire), to preserve the distinction between, e.g., | (a literal U+007C VERTICAL LINE) and &#124;. This might be essential for further processing in a tool pipeline.
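
A small Python sketch of why that distinction can matter downstream. The scenario is an assumption made for illustration only: a hypothetical later pipeline stage that splits a line into cells on a literal "|". The `tokenize` and `split_cells` functions are made-up names, not part of any CommonMark implementation.

```python
import re

CHAR_REF = re.compile(r"&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));")


def tokenize(text):
    """Split text into ("text", str) and ("charref", code point) tokens."""
    tokens, pos = [], 0
    for m in CHAR_REF.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        code = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
        tokens.append(("charref", code))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def split_cells(tokens):
    """Split on literal '|' in text tokens only; character references never separate cells."""
    cells, current = [], []
    for kind, value in tokens:
        if kind == "charref":
            current.append(chr(value))
        else:
            parts = value.split("|")
            current.append(parts[0])
            for part in parts[1:]:
                cells.append("".join(current).strip())
                current = [part]
    cells.append("".join(current).strip())
    return cells


print(split_cells(tokenize("a &#124; b | c")))
```

Had the parser already replaced &#124; with "|", the later stage could no longer tell the two apart and would split the line into three cells instead of two.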

As far as the spec talks about the parsing result in terms of an AST (or—equivalently?—its representation as a CommonMark-DTD-valid XML document instance), some "entity reference" node type would suffice for unreplaced entity references, similar to your CMARK_NODE_ENTITY node type.

However, it might be useful to include an optional character number in case "resolution" of character entity references (in the parser) is desired. The predefined XML entities lt, gt, amp, quot, and apos would be obvious candidates for this. In DTD parlance, this node could look like

<!ELEMENT EntityRef EMPTY>
<!-- `name` (a NAME) is the entity name,
     `charnum` (a NUMBER) is the optional UCS code point if this 
     was recognized as a character entity reference -->
<!ATTLIST EntityRef
          name      NMTOKEN  #REQUIRED
          charnum   NMTOKEN  #IMPLIED>

I would find it more natural to place the entity name in a NAME-typed attribute and the optional code point in a NUMBER-typed one, but XML only knows NMTOKEN. Of course, this doesn't constrain the structure of the CMARK_NODE_ENTITY node.

If the parser were allowed to replace entity references with something other than a Unicode character (that is, really handle general entities, not just character entities), then the replacement text would be inserted directly (without delimiters or its own node) into the regular character data content, that is, into the CMARK_NODE_TEXT content or, respectively, the content of the <text> element in the XML representation. (This is consistent with ESIS and XML Infoset rules for "replaced" entities.)

And similarly for character references (lumping numeric and hex together, for this distinction is IMO negligible):

<!ELEMENT CharRef EMPTY>
<!-- `charnum` (a NUMBER) is the decimal UCS code point, whether given in the
     source document as a decimal or hexadecimal numeral -->
<!ATTLIST CharRef
          charnum   NMTOKEN  #REQUIRED>

One could possibly unite the CharRef and EntityRef element/node types into one type, but I'm not sure if I'd like that better. (That's basically what I do in an experimental and hacked-up clone of libsoldout where, in the commonmark branch, I live out my obsession with SGML shorthand syntax …)
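
For illustration, the two node types from the DTD fragments above could be modelled like this in Python. This is only a sketch: the dataclasses, `make_entity_ref`, and `to_xml` are hypothetical names, and the only entities given a `charnum` are the five predefined XML ones mentioned above.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import quoteattr


@dataclass(frozen=True)
class EntityRef:
    name: str                      # entity name, e.g. "nbsp"
    charnum: Optional[int] = None  # code point, if resolved by the parser


@dataclass(frozen=True)
class CharRef:
    charnum: int  # code point, regardless of decimal or hex source form


# The five predefined XML entities and their code points.
XML_PREDEFINED = {"lt": 60, "gt": 62, "amp": 38, "quot": 34, "apos": 39}


def make_entity_ref(name):
    """Build an EntityRef, resolving only the predefined XML entities."""
    return EntityRef(name, XML_PREDEFINED.get(name))


def to_xml(node):
    """Serialize a node as an empty element matching the DTD fragments above."""
    if isinstance(node, CharRef):
        return f'<CharRef charnum="{node.charnum}"/>'
    attrs = f"name={quoteattr(node.name)}"
    if node.charnum is not None:
        attrs += f' charnum="{node.charnum}"'
    return f"<EntityRef {attrs}/>"


print(to_xml(make_entity_ref("amp")))
print(to_xml(make_entity_ref("nbsp")))
print(to_xml(CharRef(0x7C)))
```

Keeping `CharRef` separate from `EntityRef` mirrors the two element declarations; merging them would amount to making `name` optional as well.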

mity Dec 4, 2016

> Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).

Let me remind you that there are more such contexts:

  • Link title (included for the sake of completeness here)
  • Link destination (see Example 308)
  • Image ALT string (usually rendered differently from links; also note the difference in handling of nested versus non-nested images)
  • Info string in code fence line (see Example 309)

jgm commented Dec 5, 2016

jgm added a commit that referenced this issue Mar 16, 2017

Changes to entities section.
We no longer use the HTML5 entity list.  Instead, we recognize
any potential character entity of length 1-32 letters.

Entities are carried through unchanged to HTML rather than being
converted to UTF-8.

Entities in URLs are also left unchanged rather than being URL-encoded.

See
https://talk.commonmark.org/t/spec-issues-character-entity-references/2306
and
#442
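
A minimal Python sketch of the recognition rule described in that commit message, assuming "letters" means ASCII letters only (an implementation might well admit digits too). The names `POTENTIAL_ENTITY` and `potential_entities` are made up for this example; this is not the cmark code.

```python
import re

# "&", then 1 to 32 ASCII letters, then ";". No entity list is consulted.
POTENTIAL_ENTITY = re.compile(r"&[A-Za-z]{1,32};")


def potential_entities(text):
    """Return every substring that would be carried through as an entity."""
    return POTENTIAL_ENTITY.findall(text)


print(potential_entities("&amp; &notanentity; &; AT&T"))
```

Under this rule a name like "notanentity" is accepted as a potential entity even though no HTML5 entity has that name, since the parser no longer checks names against a list.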