Correctify Unicode-related errors and omissions in "Lexical structure" #304
srutzky referenced this issue in srutzky/csharplang on Jul 19, 2019:
Large portions of the Unicode info here are either missing, incorrect, or misleading. The changes made this time, outlined below, deal mostly with Unicode escape sequences, but also touch on identifiers in general. Details on why these changes are being made, why the ANTLR grammar and text descriptions are being specified in the new way, example code where applicable, and links to Unicode.org as reference / support where applicable, are posted in the following issue: https://github.com/dotnet/csharplang/issues/2672

1. Broke `unicode_escape_sequence` token into two tokens: `unicode_bmp_escape_sequence` and `unicode_supplementary_escape_sequence`. This not only accurately reflects reality, but it also helps distinguish in the ANTLR grammar (not just in text comments) the set of Unicode characters that can be used in character literals and identifiers vs those that are invalid for those two cases.
2. Specify in ANTLR grammar the constrained usage of "\U" that creates BMP characters and is thus equivalent to "\u" where "\u" is valid to use.
3. Specify in ANTLR grammar how to create supplementary characters using both "\u" to specify an actual surrogate pair, and "\U" to specify a Unicode code point / UTF-32 code unit (they are synonymous). Along with the updates I recently made to the ["Escape Sequences" section of the C# Strings documentation](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/#string-escape-sequences) to show usable range and an example, this should prevent any more misinterpreting of "\U".
4. Introduce the concept of "Supplementary Characters". That term has never been used in this document, which I believe has led some to mistakenly use the term "surrogate pairs" (which was mentioned here only once, though not incorrectly) to mean "supplementary characters". Surrogate Pairs is a UTF-16 encoding-specific construct. It does not refer to the set of characters, but instead to how they are physically encoded, and only in UTF-16. Neither UTF-8 nor UTF-32 have surrogate pairs, yet they both certainly support all supplementary characters.
5. Clean-up some inconsistent terminology and formatting:
   1. "code point" vs "value"
   2. U+1234 vs `U+1234`
   3. U+1234 vs 0x1234
6. Provide an accurate and thorough description of BMP range/characters vs Supplementary range/characters, what constitutes a valid surrogate pair, that surrogate pairs are an encoding construct specific to UTF-16, and how all of that relates to `char` vs `string` literals and usage of "`\u`" and "`\U`".
7. Clarify that only BMP characters (direct or via `unicode_bmp_escape_sequence`) can be used in identifiers. Previously, the ANTLR grammar did not prevent supplementary characters from being used, and there was no text description stating that they couldn't be used. However, they cannot be used (at least not in LINQPad 5 or VS 2015 -- related request: dotnet#1742 -- though "gmcs 4.6.2" via Mono does seem to allow them, possibly due to incomplete supplementary character definition here).
8. Added new token for "`hex_digit_except_zero`" to be used in the new "`unicode_supplementary_escape_sequence`" token, else the definition for "`unicode_supplementary_escape_sequence`" would have been rather messy / harder to read.
9. Adjusted definition of "`hex_digit`" to use new "`hex_digit_except_zero`" token. All previous uses of "`hex_digit`" are unaffected.
10. Some English grammatical improvements to "Note:" at beginning of **Character literals** section.
11. Updated ANTLR grammar for **Character literals** to specify only BMP characters can be used. Previously the grammar did not prevent supplementary characters from being used, and it was only noted in the text description.
12. Removed the "A Unicode character escape sequence ... must be in the range U+0000 to U+FFFF." sentence from the **Character literals** section as it is now doubly redundant, given the sentence about "code points above U+FFFF will error", _and_ the ANTLR grammar finally restricting the values to only the allowable range.

---

For more info on this project, please see: [Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#csharp)
partially related to dotnet/roslyn#13474
srutzky changed the title from "[WIP] Correctify Unicode-related errors and omissions in "Lexical structure"" to "Correctify Unicode-related errors and omissions in "Lexical structure"" on Jul 21, 2019
moving to dotnet/csharpstandard
Note that @gafter refers to the associated PR in #155 (I suggest we close that issue as a duplicate, as the discussion is primarily in this issue).
Background
Large portions of the Unicode info in the "Lexical structure" documentation are either missing, incorrect, or misleading. The document should reflect, in technically specific terms, that which is, and has always been, the behavior of C#. None of the changes here reflect any change to the language; they merely bring the specification in line with the language it describes.
The main area of concern is Unicode escape sequences, especially as they relate to supplementary characters. It has been this way since the very beginning (almost 20 years now?). The earliest examples I could find are:
2001-12: ECMA-334: C# Language Specification, 1st edition (section 9.4.1; page 68 out of 498 of PDF / page 54 of document, line 29)
2002-12-02: C# Language Specification | 2.4.1 Unicode character escape sequences (via archive.org)
2002-04-04: C# Programmer's Reference | string (via archive.org)
Has a note which is good in that it shows how to construct a supplementary character via a surrogate pair, but which incorrectly describes the result as an "eight-digit escape code". It technically isn't one, and that wording makes it easier to misinterpret what to do with the 8 hex digits in the incorrectly specified `\UHHHHHHHH`.
It has not helped that there was no example of the "\U" escape sequence anywhere, and the only description was in the "String Escape Sequences" section of the C# Strings documentation, which said "for surrogate pairs." That is not exactly correct, and it is definitely misleading (I recently corrected that page). That wording, coupled with the lack of any examples to guide interpretation, leads one to think that you specify the two surrogate code points, as there are 8 hex digits available and that's what is required for two UTF-16 surrogate code points. This misunderstanding has not stayed within the confines of this document, as it has shown up in:
`\Ud869ded6`.
Cha cha cha changes to this page
A singular definition for Unicode escape sequences is inadequate as the various uses of it require more nuance. Break the `unicode_escape_sequence` token into two tokens: `unicode_bmp_escape_sequence` and `unicode_supplementary_escape_sequence`. This not only accurately reflects reality, but it also helps distinguish in the ANTLR grammar (not just in text comments) the set of Unicode characters that can be used in character literals and identifiers vs those that are invalid for those two cases (i.e. BMP vs Supplementary).
Specify in ANTLR grammar the constrained usage of "\U" that creates BMP characters and is thus equivalent to "\u" where "\u" is valid to use (i.e. `\u` is equivalent to `\U0000`).
Specify in ANTLR grammar how to create supplementary characters using both "\u" to specify an actual surrogate pair, and "\U" to specify a Unicode code point / UTF-32 code unit (they are synonymous). Along with the updates I recently made to the "Escape Sequences" section of the C# Strings documentation to show usable range and an example, this should prevent any more misinterpreting of "\U".
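The relationship between the two forms can be sketched numerically. The following is a minimal illustration in Python, used here only for the arithmetic (it is not C#, and the helper names are hypothetical rather than anything from the specification or the PR). It shows that a "\u" surrogate pair and a "\U" code point name the same supplementary character, and why packing two surrogates into the 8 hex digits of "\U" cannot produce a valid code point.

```python
# Illustration only: helper names are hypothetical, not part of C# or the PR.

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def surrogate_pair_to_code_point(high, low):
    """Combine a UTF-16 surrogate pair into the code point it encodes."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in U+D800..U+DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in U+DC00..U+DFFF")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_point_to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def as_upper_u_escape(code_point):
    """Format a code point as the 8-hex-digit form written after \\U."""
    return "\\U{:08X}".format(code_point)


code_point = surrogate_pair_to_code_point(0xD869, 0xDED6)
print("U+{:04X}".format(code_point))     # U+2A6D6
print(as_upper_u_escape(code_point))     # \U0002A6D6

# Reading the 8 digits as two packed surrogates gives a number far above
# the highest code point, so it cannot be a valid "\U" value.
print(0xD869DED6 > MAX_CODE_POINT)       # True
```

Under these assumptions, the surrogate pair D869 DED6 and the code point U+2A6D6 are the same character, while the packed value D869DED6 is simply out of range.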
Having the sequence start with `\U00` instead of just `\U` might initially strike some as "odd" since all other sequences are a single character (after the back-slash), but the `00` truly is a static portion of the sequence, given that the highest code point is U+10FFFF. Those first two digits can never, ever be any value other than `0`, and the third digit can only ever be either `0` or `1`.
"Appendix C (of The Unicode® Standard, Version 12.0): Relationship to ISO/IEC 10646" (Page 11 of PDF, 942 in the document; emphasis mine)
That wording has been in place since Unicode version 6 came out in 2011. Previously, since at least version 4 came out in 2003 (not in version 3.0, and can't seem to find 3.2 anywhere), section C.2 has stated:
Please note that `\x` can be used interchangeably with `\u` in character and string literals (as shown directly below). However, I did not incorporate this into the updates due to `\x` not being interchangeable with `\u` in identifiers (`\x` doesn't even work in identifiers when specifying `\x41`). Thus, it would have added quite a bit of complication, and for no real gain since `\x` doesn't provide any benefit over `\u`, and would likely just increase confusion. I am only noting this behavior here to make clear that it was a conscious decision to omit it so that hopefully it does not get added in the future with the thought that it was merely an oversight.
Introduce the concept of "Supplementary Characters". That term has never been used in this document, which I believe has led some to mistakenly use the term "surrogate pairs" (which was mentioned here only once, though not incorrectly) to mean "supplementary characters". Surrogate Pairs is a UTF-16 encoding-specific construct. It does not refer to the set of characters, but instead to how they are physically encoded, and only in UTF-16. Neither UTF-8 nor UTF-32 have surrogate pairs, yet they both certainly support all supplementary characters.
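As a rough illustration of that last point, the sketch below encodes one supplementary character in all three encoding forms. It uses Python's built-in codecs purely as a neutral tool, and the choice of U+2A6D6 as the sample character is an assumption made for the example.

```python
# Illustration only: Python's codecs stand in for the three encoding forms.

supplementary = "\U0002A6D6"  # one code point, U+2A6D6, outside the BMP

utf8 = supplementary.encode("utf-8")
utf16 = supplementary.encode("utf-16-be")
utf32 = supplementary.encode("utf-32-be")

# Only the UTF-16 form contains the surrogate values D869 and DED6.
print(utf8.hex())    # f0aa9b96
print(utf16.hex())   # d869ded6
print(utf32.hex())   # 0002a6d6

# Code units per encoding form: four 8-bit, two 16-bit, one 32-bit.
print(len(utf8), len(utf16) // 2, len(utf32) // 4)  # 4 2 1
```

The character is the same in every case; the surrogate pair exists only in the UTF-16 bytes, which is also why a single 16-bit `char` cannot hold it.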
The following was taken from the "Basic Questions" FAQ on Unicode.org:
Clean-up some inconsistent terminology and formatting:
"code point" vs "value"
U+1234 vs `U+1234`
U+1234 vs 0x1234
Provide an accurate and thorough description of BMP range/characters vs Supplementary range/characters, what constitutes a valid surrogate pair, that surrogate pairs are an encoding construct specific to UTF-16, and how all of that relates to `char` vs `string` literals and usage of "`\u`" and "`\U`".
Clarify that only BMP characters (direct or via `unicode_bmp_escape_sequence`) can be used in identifiers. Previously, the ANTLR grammar did not prevent supplementary characters from being used, and there was no text description stating that they couldn't be used. However, they cannot be used (at least not in LINQPad 5 or VS 2015 -- related request: Champion: "Permit surrogate pairs and wide Unicode-escaped code points in identifiers" csharplang#1742 -- though "gmcs 4.6.2" (Mono project) does seem to allow them, possibly due to incomplete supplementary character definition here).
The following example code proves two things about identifiers:
`\u` and `\U` (or `\U00`) escape sequences work
`\u` and `\U` are completely interchangeable (as highlighted by the second set of statements: variable defined using `\U000016EF` yet referenced using `\u16EF`)
The following works when compiled using "gmcs" (Mono project), but gives me an error in LINQPad 5 and VS 2015 (I know, not the latest, but still):
Run the example above on IDEone.com
For the record, while it was never claimed that a hexadecimal escape would work in an identifier, the following two statements each result in a compile error, proving that those escapes indeed do not work in this context:
Add new token for "`hex_digit_except_zero`" to be used in the new "`unicode_supplementary_escape_sequence`" token, else the definition for "`unicode_supplementary_escape_sequence`" would have been rather messy / harder to read.
Adjust definition of "`hex_digit`" to use new "`hex_digit_except_zero`" token. All previous uses of "`hex_digit`" will be unaffected.
Some English grammatical improvements to "Note:" at beginning of Character literals section.
Update ANTLR grammar for Character literals to specify only BMP characters can be used. The current grammar does not prevent supplementary characters from being used (even if the underlying code does), and it's only noted in the text description.
The following was already known, but documenting for completeness:
Consolidate statements regarding acceptable range in the Character literals section. Remove the "A Unicode character escape sequence ... must be in the range U+0000 to U+FFFF." sentence as it will be doubly redundant, given: a) the new sentence about "code points above U+FFFF will error", and b) the ANTLR grammar will finally restrict the values to only the allowable range.
For clarity and to reduce any potential for confusion, state that supplementary characters are allowed in string literals.
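The BMP / supplementary split described in the list above can be approximated with regular expressions. This is only a sketch in Python: the pattern names echo the proposed token names, but the patterns themselves are an assumption derived from the ranges stated here (U+0000 to U+FFFF, and U+10000 to U+10FFFF) and are not the ANTLR grammar from the PR.

```python
import re

# Hypothetical approximations of the proposed tokens, not the PR's grammar.
HEX = r"[0-9A-Fa-f]"
HEX_EXCEPT_ZERO = r"[1-9A-Fa-f]"

# BMP: \uXXXX, or \U0000XXXX (the constrained use of \U).
# Note: a lone surrogate such as \uD869 still matches here, since it is a
# single 16-bit code unit in the BMP range.
UNICODE_BMP_ESCAPE = re.compile(rf"\\u{HEX}{{4}}|\\U0000{HEX}{{4}}")

# Supplementary via \U: fixed "00", then 01..0F or 10, then four hex digits,
# which covers U+010000 through U+10FFFF and nothing else.
UNICODE_SUPPLEMENTARY_ESCAPE = re.compile(
    rf"\\U00(?:0{HEX_EXCEPT_ZERO}|10){HEX}{{4}}"
)

# Supplementary via \u: a high surrogate followed by a low surrogate.
SURROGATE_PAIR_ESCAPE = re.compile(
    rf"\\u[Dd][89ABab]{HEX}{{2}}\\u[Dd][C-Fc-f]{HEX}{{2}}"
)


def classify(escape):
    """Return 'bmp', 'supplementary', or 'invalid' for an escape string."""
    if UNICODE_BMP_ESCAPE.fullmatch(escape):
        return "bmp"
    if (UNICODE_SUPPLEMENTARY_ESCAPE.fullmatch(escape)
            or SURROGATE_PAIR_ESCAPE.fullmatch(escape)):
        return "supplementary"
    return "invalid"


for sample in (r"\u16EF", r"\U000016EF", r"\U0002A6D6",
               r"\uD869\uDED6", r"\U00110000", r"\Ud869ded6"):
    print(sample, classify(sample))
```

In this sketch, the "except zero" class is what keeps the supplementary pattern from also matching `\U0000XXXX`, which is the kind of readability concern the new `hex_digit_except_zero` token is meant to address. Only the "bmp" class would be acceptable in a character literal or identifier, while string literals accept both.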
Next Step
The changes noted above, and implemented in the associated PR, need to find their way into the specification published by ECMA, as that contains the same errors and omissions.
Done?
No, not done. There are still some areas that need correction, but they are out of scope for now, and can be handled separately (and might require additional discussion).
For more info on this project, please see:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)