Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

commonmark renderer hangs on certain unicode inputs #27

Closed
jgm opened this issue Apr 15, 2015 · 2 comments
Closed

commonmark renderer hangs on certain unicode inputs #27

jgm opened this issue Apr 15, 2015 · 2 comments

Comments

@jgm
Copy link
Member

jgm commented Apr 15, 2015

Found with some help from afl. This is a renderer bug, not a parser bug, as it only affects commonmark (and sometimes man) output.

python3 -c 'print("\uffff")' | cmark -t commonmark
@jgm jgm closed this as completed in 520d5a6 Apr 15, 2015
@nwellnhof
Copy link
Contributor

The problem is that utf8proc_iterate is stricter than utf8proc_valid. The former also checks for non-characters (U+FDD0..U+FDEF and the last two code points of every plane). After working with Unicode extensively over the last years, my opinion is that libraries like libcmark shouldn't treat non-characters in a special way. Also see the following questions from the Unicode FAQ:

Q: Are noncharacters prohibited in interchange?

A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of "noncharacter" in the standard has always indicated that noncharacters "should never be interchanged." That led some people to assume that the definition actually meant "shall not be interchanged" and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended. The choice of the word "should" in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.

Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase "and that should never be interchanged" from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.

Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.

(Emphasis mine.)

If it's OK, I'll submit a pull request that removes special handling of non-characters in utf8_iterate which will result in simply passing them through. This should make utf8proc_valid and utf8proc_iterate agree on the validity of code points, solving the root cause of this issue. Checking the return value of utf8proc_iterate is a good idea nevertheless.

@jgm
Copy link
Member Author

jgm commented Apr 16, 2015

+++ Nick Wellnhofer [Apr 16 15 05:13 ]:

If it's OK, I'll submit a pull request that removes special handling of
non-characters in utf8_iterate which will result in simply passing them
through. This should make utf8proc_valid and utf8proc_iterate agree on
the validity of code points, solving the root cause of this issue.
Checking the return value of utf8proc_iterate is a good idea
nevertheless.

Yes, I like this idea! Please go ahead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants