Named HTML entities with multiple codepoints not parsed correctly #47

Closed
robinst opened this issue Jun 13, 2015 · 3 comments

robinst commented Jun 13, 2015

See the following example: http://spec.commonmark.org/dingus/?text=%26ngE%3B%0A%0A%26gE%3B

`&ngE;` should be rendered as "≧̸" (U+02267 U+00338), but it's actually rendered as "≧" (which is the same as `&gE;`)

It looks like other such named entities are also not handled correctly.

Would probably also be good to add such an entity to the spec so that implementations are checked for this.

jgm commented Jun 13, 2015

I see that in cmark, src/html_unescape.gperf, the same
sequence is given for ngE and gE.

In commonmark.js, lib/html5-entities.js, gE is 8807 and
ngE is the same.

In commonmark.js there is a further problem: the entity
decoding code seems to assume that each entity maps to a
single code point.

I agree that a case like this should be in the spec, too.
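
To illustrate the single-code-point assumption mentioned above, here is a minimal sketch of a decoder whose table maps each entity name to a sequence of code points. It is not the commonmark.js or cmark code; `ENTITY_CODE_POINTS`, `ENTITY_RE`, and `decode_entities` are hypothetical names, and the table holds only the two entities from this issue.

```python
import re

# Hypothetical two-entry table; a real one would cover every HTML5 named entity.
# Each name maps to a tuple of code points rather than to a single integer.
ENTITY_CODE_POINTS = {
    "gE": (0x2267,),
    "ngE": (0x2267, 0x0338),
}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace known named entities; leave unknown ones untouched."""
    def replace(match):
        code_points = ENTITY_CODE_POINTS.get(match.group(1))
        if code_points is None:
            return match.group(0)
        # Join all code points, so multi-code-point entities survive intact.
        return "".join(chr(cp) for cp in code_points)

    return ENTITY_RE.sub(replace, text)


print([f"U+{ord(ch):04X}" for ch in decode_entities("&ngE;")])
# ['U+2267', 'U+0338']
```

A table that stores one integer per name cannot represent `&ngE;` at all, which would explain why it collapsed to the same output as `&gE;`.
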

jgm added a commit that referenced this issue Jun 13, 2015
Removed html5-entities.js.
Added dependency on entities in package.json.

This fixes the problem reported in #47, but I want to
keep that issue open until cmark and the spec are also
fixed.
jgm added a commit to commonmark/cmark that referenced this issue Jun 13, 2015
The old one had many errors.
The new one is derived from the list in the npm entities package.
Since the sequences can now be longer (multi-code-point), we
have bumped the length limit from 4 to 8, which also affects
houdini_html_u.c.

An example of the kind of error that was fixed is given
in commonmark/commonmark.js#47: `&ngE;` should be rendered as "≧̸" (U+02267
U+00338), but it's actually rendered as "≧" (which is the same as
`&gE;`).
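
The commit message does not say what unit the length limit uses. Assuming it counts bytes of the UTF-8 encoded replacement text (an assumption, not something stated in this thread), a quick check shows why a limit of 4 would be too small for the entity in this issue:

```python
# Code points taken from the issue text: U+2267, optionally followed by U+0338.
GE = "\u2267"
NGE = "\u2267\u0338"

# Assumption: the limit applies to the UTF-8 encoded byte length.
for name, value in (("gE", GE), ("ngE", NGE)):
    encoded = value.encode("utf-8")
    print(name, len(encoded), encoded.hex(" "))
# gE 3 e2 89 a7
# ngE 5 e2 89 a7 cc b8
```

Under that assumption, the two-code-point sequence needs 5 bytes, which fits in 8 but not in 4.
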
jgm added a commit to commonmark/commonmark-spec that referenced this issue Jun 13, 2015

jgm commented Jun 13, 2015

OK, everything should be fixed now! Thanks for pointing out the problem.

jgm closed this as completed Jun 13, 2015

robinst commented Jun 15, 2015

Thanks for the quick reaction! :)

talum pushed a commit to github/cmark-gfm that referenced this issue Sep 14, 2021