Correctify Unicode-related errors and omissions in "Lexical structure" #304

srutzky · 2019-07-19T22:38:05Z

Background

Large portions of the Unicode info in the "Lexical structure" documentation are either missing, incorrect, or misleading. The document should reflect, in technically specific terms, that which is, and has always been, the behavior of C#. None of the changes here reflect any change(s) to the language; these changes merely bring the specification in line with the language it describes.

The main area of concern is with Unicode escape sequences, especially related to supplementary characters. And, it has been this way since the very beginning (almost 20 years now?). Earliest examples I could find are:

2001-12: ECMA-334: C# Language Specification, 1st edition (section 9.4.1; page 68 out of 498 of PDF / page 54 of document, line 29)
2002-12-02: C# Language Specification | 2.4.1 Unicode character escape sequences (via archive.org)
2002-04-04: C# Programmer's Reference | string (via archive.org)
Has a note stating:

Eight-digit Unicode escape codes are also recognized: \udddd\udddd.

It is good that this shows how to construct a supplementary character via a surrogate pair, but it is incorrectly described as an "eight-digit escape code", which it technically isn't. That description also makes it easier to misinterpret what to do with the 8 hex digits in the incorrectly specified \UHHHHHHHH.

It has not helped that there was no example of the "\U" escape sequence anywhere. The only description was in the "String Escape Sequences" section of the C# Strings documentation, which said "for surrogate pairs." That is not exactly correct and is definitely misleading (I recently corrected that page). That wording, coupled with the lack of any examples to guide interpretation, leads one to think that you specify the two surrogate code points, since there are 8 hex digits available and that is exactly what two UTF-16 surrogate code points require. This misunderstanding has not stayed within the confines of this document; it has also shown up in:

The C# "Strings" documentation (previously mentioned), which I have recently fixed.
F# documentation, which I have recently fixed
F# source code comments, which I have recently fixed.
An article on CodeProject.com describing various ways to escape things in C#, both referring to supplementary characters as "surrogate pairs", and even giving an example of \Ud869ded6.

Cha cha cha changes to this page

A singular definition for Unicode escape sequences is inadequate as the various uses of it require more nuance. Break unicode_escape_sequence token into two tokens: unicode_bmp_escape_sequence and unicode_supplementary_escape_sequence. This not only accurately reflects reality, but it also helps distinguish in the ANTLR grammar (not just in text comments) the set of Unicode characters that can be used in character literals and identifiers vs those that are invalid for those two cases (i.e. BMP vs Supplementary).
Specify in ANTLR grammar the constrained usage of "\U" that creates BMP characters and is thus equivalent to "\u" where "\u" is valid to use (i.e. \u is equivalent to \U0000 ).
```
// "unicode_bmp_escape_sequence"
// Both lines return: Ᏹ
Console.WriteLine('\u13F1'); 
Console.WriteLine('\U000013F1'); 
```
Specify in ANTLR grammar how to create supplementary characters using both "\u" to specify an actual surrogate pair, and "\U" to specify a Unicode code point / UTF-32 code unit (they are synonymous). Along with the updates I recently made to the "Escape Sequences" section of the C# Strings documentation to show usable range and an example, this should prevent any more misinterpreting of "\U".
```
// "unicode_supplementary_escape_sequence"
// Both lines return: 👽
Console.WriteLine("\uD83D\uDC7D");
Console.WriteLine("\U0001F47D");
```
Having the sequence start with \U00 instead of just \U might initially strike some as "odd", since all other sequences are a single character (after the backslash), but the 00 truly is a static portion of the sequence, given that the highest code point is U+10FFFF. Those first two digits can never, ever be any value other than 0, and the third digit can only ever be either 0 or 1.
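To illustrate the point about the fixed leading digits, here is a minimal sketch that renders code points in the eight-digit form. The class and method names (`EscapeFormat`, `ToLongEscape`) are hypothetical and exist only for this illustration; the only framework feature relied on is standard hexadecimal formatting.

```csharp
using System;

public static class EscapeFormat
{
    // Highest code point in the Unicode codespace.
    public const int MaxCodePoint = 0x10FFFF;

    // Hypothetical helper: renders a code point in the eight-digit "\U" form.
    public static string ToLongEscape(int codePoint)
    {
        if (codePoint < 0 || codePoint > MaxCodePoint)
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        return "\\U" + codePoint.ToString("X8");
    }

    public static void Main()
    {
        Console.WriteLine(ToLongEscape(0x13F1));       // \U000013F1
        Console.WriteLine(ToLongEscape(0x1F47D));      // \U0001F47D
        Console.WriteLine(ToLongEscape(MaxCodePoint)); // \U0010FFFF
    }
}
```

Even the largest valid code point still begins with 00, and its third digit is 1.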

"Appendix C (of The Unicode® Standard, Version 12.0): Relationship to ISO/IEC 10646" (Page 11 of PDF, 942 in the document; emphasis mine)

C.2 Encoding Forms in ISO/IEC 10646
ISO/IEC 10646:2011 has significantly revised its discussion of encoding forms, compared to earlier editions of that standard. The terminology for encoding forms (and encoding schemes) in 10646 now matches exactly the terminology used in the Unicode Standard. Furthermore, 10646 is now described in terms of a codespace U+0000..U+10FFFF, instead of a 31-bit codespace, as in earlier editions. This convergence in codespace description has eliminated any discrepancies in possible interpretation of the numeric values greater than 0x10FFFF.

That wording has been in place since Unicode version 6 came out in 2011. Previously, since at least version 4 came out in 2003 (not in version 3.0, and can't seem to find 3.2 anywhere), section C.2 has stated:

The Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters to 10646 will be constrained to the BMP or the first 14 supplementary planes. ... It also guarantees interoperability with implementations of the Unicode Standard, for which only code positions 0..10FFFF₁₆ are meaningful.

Please note that \x can be used interchangeably with \u in character and string literals (as shown directly below). However, I did not incorporate this into the updates due to \x not being interchangeable with \u in identifiers (\x doesn't even work in identifiers when specifying \x41). Thus, it would have added quite a bit of complication, and for no real gain since \x doesn't provide any benefit over \u, and would likely just increase confusion. I am only noting this behavior here to make clear that it was a conscious decision to omit it so that hopefully it does not get added in the future with the thought that it was merely an oversight.
```
// All three lines return: 👽
Console.WriteLine("\xD83D\xDC7D");
Console.WriteLine("\uD83D\xDC7D");
Console.WriteLine("\xD83D\uDC7D");
```
Introduce the concept of "Supplementary Characters". That term has never been used in this document, which I believe has led some to mistakenly use the term "surrogate pairs" (which was mentioned here only once, though not incorrectly) to mean "supplementary characters". A surrogate pair is a construct specific to the UTF-16 encoding. The term does not refer to a set of characters, but to how those characters are physically encoded, and only in UTF-16. Neither UTF-8 nor UTF-32 has surrogate pairs, yet both certainly support all supplementary characters.

The following was taken from the "Basic Questions" FAQ on Unicode.org:

Q: Are surrogate characters the same as supplementary characters?

A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).
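The distinction between a supplementary code point and its UTF-16 surrogate pair can be sketched as follows. This is only an illustration: `EncodingForms` and `ToSurrogatePair` are hypothetical names, and the arithmetic is the standard UTF-16 mapping rather than anything taken from the specification text.

```csharp
using System;
using System.Text;

public static class EncodingForms
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        int offset = codePoint - 0x10000;              // 20 significant bits
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    public static void Main()
    {
        string alien = char.ConvertFromUtf32(0x1F47D);
        var (high, low) = ToSurrogatePair(0x1F47D);

        // Surrogates appear only in the UTF-16 form.
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}");                         // D83D DC7D
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(alien)));    // F0-9F-91-BD
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(alien))); // 3D-D8-7D-DC
        Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(alien)));   // 7D-F4-01-00
    }
}
```

The UTF-8 and UTF-32 byte sequences encode U+1F47D directly; neither contains a value from the surrogate range.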
Clean up some inconsistent terminology and formatting:
1. "code point" vs "value"
2. U+1234 vs `U+1234`
3. U+1234 vs 0x1234
Provide an accurate and thorough description of BMP range/characters vs Supplementary range/characters, what constitutes a valid surrogate pair, that surrogate pairs are an encoding construct specific to UTF-16, and how all of that relates to char vs string literals and usage of "\u" and "\U".
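As a rough illustration of what constitutes a valid surrogate pair, the sketch below checks the two ranges by hand and compares the result with the framework's own `char.IsSurrogatePair`. The `SurrogateCheck` class and `IsValidPair` method are hypothetical names used only here.

```csharp
using System;

public static class SurrogateCheck
{
    // A valid pair is a high surrogate (U+D800..U+DBFF)
    // immediately followed by a low surrogate (U+DC00..U+DFFF).
    public static bool IsValidPair(char first, char second) =>
        first >= '\uD800' && first <= '\uDBFF' &&
        second >= '\uDC00' && second <= '\uDFFF';

    public static void Main()
    {
        Console.WriteLine(IsValidPair('\uD83D', '\uDC7D'));          // True
        Console.WriteLine(IsValidPair('\uDC7D', '\uD83D'));          // False (reversed order)
        Console.WriteLine(IsValidPair('\u13F1', '\uDC7D'));          // False (BMP character first)
        Console.WriteLine(char.IsSurrogatePair('\uD83D', '\uDC7D')); // True

        // Combining a valid pair yields the supplementary code point.
        Console.WriteLine(char.ConvertToUtf32('\uD83D', '\uDC7D').ToString("X")); // 1F47D
    }
}
```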
Clarify that only BMP characters (direct or via unicode_bmp_escape_sequence) can be used in identifiers. Previously, the ANTLR grammar did not prevent supplementary characters from being used, and there was no text description stating that they couldn't be used. However, they cannot be used, at least not in LINQPad 5 or VS 2015 (related request: Champion: "Permit surrogate pairs and wide Unicode-escaped code points in identifiers" csharplang#1742). "gmcs 4.6.2" (Mono project) does seem to allow them, possibly due to the incomplete supplementary character definition here.

The following example code proves two things about identifiers:
1. both \u and \U (or \U00) escape sequences work
2. \u and \U are completely interchangeable (as highlighted by the second set of statements: variable defined using \U000016EF yet referenced using \u16EF)
```
string \u16EE = "U+16EE == \u16EE";
Console.WriteLine(\u16EE);

string \U000016EF = "U+16EF == \x16EF";
Console.WriteLine(\u16EF);
```
The following works when compiled using "gmcs" (Mono project), but gives me an error in LINQPad 5 and VS 2015 (I know, not the latest, but still):
```
using System;

public class Test
{
    public static void Main()
    {
        string A\U0001D4A2 = "g";
        Console.WriteLine(A\U0001D4A2);
    }
}
```
Run the example above on IDEone.com

For the record, while it was never claimed that a hexadecimal escape would work in an identifier, the following two statements each result in a compile error, proving that those escapes indeed do not work in this context:
```
string \x41 = "does not work";
string \x0041 = "does not work";
```
Add new token for "hex_digit_except_zero" to be used in the new "unicode_supplementary_escape_sequence" token, else the definition for "unicode_supplementary_escape_sequence" would have been rather messy / harder to read.
Adjust definition of "hex_digit" to use new "hex_digit_except_zero" token. All previous uses of "hex_digit" will be unaffected.
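To show why a "non-zero hex digit" building block is useful, here is a sketch of the shape a supplementary "\U" escape has to take, written as a regular expression. This is an assumed reading of the constraint (code points U+10000 through U+10FFFF), not the ANTLR rule from the PR, and `EscapeShape` / `IsSupplementaryEscape` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeShape
{
    // Assumed shape, not the PR's actual grammar rule:
    //   \U00, then either "0" + a non-zero hex digit or "10", then four hex digits.
    // The non-zero digit is what keeps BMP values (\U0000HHHH) out of this form.
    private static readonly Regex Supplementary =
        new Regex(@"^\\U00(0[1-9A-Fa-f]|10)[0-9A-Fa-f]{4}$");

    public static bool IsSupplementaryEscape(string text) => Supplementary.IsMatch(text);

    public static void Main()
    {
        // Verbatim strings keep the backslash literal, so these are escape *texts*.
        Console.WriteLine(IsSupplementaryEscape(@"\U0001F47D")); // True
        Console.WriteLine(IsSupplementaryEscape(@"\U0010FFFF")); // True
        Console.WriteLine(IsSupplementaryEscape(@"\U000013F1")); // False (BMP)
        Console.WriteLine(IsSupplementaryEscape(@"\U00110000")); // False (above U+10FFFF)
    }
}
```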
Some English grammatical improvements to "Note:" at beginning of Character literals section.
Update ANTLR grammar for Character literals to specify only BMP characters can be used. The current grammar does not prevent supplementary characters from being used (even if the underlying code does), and it's only noted in the text description.

The following was already known, but documenting for completeness:
```
// Extraterrestrial Alien
char alien = '👽';
// CS1012: Too many characters in character literal
```
Consolidate statements regarding acceptable range in the Character literals section. Remove the "A Unicode character escape sequence ... must be in the range U+0000 to U+FFFF." sentence as it will be doubly redundant, given: a) the new sentence about "code points above U+FFFF will error", and b) the ANTLR grammar will finally restrict the values to only the allowable range.
For clarity and to reduce any potential for confusion, state that supplementary characters are allowed in string literals.
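The char vs string difference can be sketched like this: a supplementary character in a string literal occupies two UTF-16 code units, so counting code points requires stepping over surrogate pairs. The `CodePoints` class and `Enumerate` method are hypothetical names, and the sketch assumes well-formed input (`char.ConvertToUtf32` throws on an unpaired surrogate).

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Walks a string by code point rather than by UTF-16 code unit.
    // Assumes well-formed input; an unpaired surrogate makes ConvertToUtf32 throw.
    public static List<int> Enumerate(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            result.Add(char.ConvertToUtf32(text, i));
            if (char.IsHighSurrogate(text[i]))
                i++; // the low surrogate was consumed as part of the pair
        }
        return result;
    }

    public static void Main()
    {
        string text = "A\U0001F47D";
        Console.WriteLine(text.Length);           // 3 (UTF-16 code units)
        Console.WriteLine(Enumerate(text).Count); // 2 (code points)
    }
}
```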

Next Step

The changes noted above, and implemented in the associated PR, need to find their way into the specification published by ECMA, as that contains the same errors and omissions.

Done?

No, not done. There are still some areas that need correction, but they are out of scope for now, and can be handled separately (and might require additional discussion).

For more info on this project, please see:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)


Large portions of the Unicode info here are either missing, incorrect, or misleading. The changes made this time, outlined below, deal mostly with Unicode escape sequences, but also touch on identifiers in general. Details on why these changes are being made, why the ANTLR grammar and text descriptions are being specified in the new way, example code where applicable, and links to Unicode.org as reference / support where applicable, are posted in the following issue: https://github.com/dotnet/csharplang/issues/2672

1. Broke `unicode_escape_sequence` token into two tokens: `unicode_bmp_escape_sequence` and `unicode_supplementary_escape_sequence`. This not only accurately reflects reality, but it also helps distinguish in the ANTLR grammar (not just in text comments) the set of Unicode characters that can be used in character literals and identifiers vs those that are invalid for those two cases.
2. Specify in ANTLR grammar the constrained usage of "\U" that creates BMP characters and is thus equivalent to "\u" where "\u" is valid to use.
3. Specify in ANTLR grammar how to create supplementary characters using both "\u" to specify an actual surrogate pair, and "\U" to specify a Unicode code point / UTF-32 code unit (they are synonymous). Along with the updates I recently made to the ["Escape Sequences" section of the C# Strings documentation](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/#string-escape-sequences) to show usable range and an example, this should prevent any more misinterpreting of "\U".
4. Introduce the concept of "Supplementary Characters". That term has never been used in this document, which I believe has led some to mistakenly use the term "surrogate pairs" (which was mentioned here only once, though not incorrectly) to mean "supplementary characters". A surrogate pair is a construct specific to the UTF-16 encoding. The term does not refer to a set of characters, but to how those characters are physically encoded, and only in UTF-16. Neither UTF-8 nor UTF-32 has surrogate pairs, yet both certainly support all supplementary characters.
5. Clean up some inconsistent terminology and formatting:
   1. "code point" vs "value"
   2. U+1234 vs `U+1234`
   3. U+1234 vs 0x1234
6. Provide an accurate and thorough description of BMP range/characters vs Supplementary range/characters, what constitutes a valid surrogate pair, that surrogate pairs are an encoding construct specific to UTF-16, and how all of that relates to `char` vs `string` literals and usage of "`\u`" and "`\U`".
7. Clarify that only BMP characters (direct or via `unicode_bmp_escape_sequence`) can be used in identifiers. Previously, the ANTLR grammar did not prevent supplementary characters from being used, and there was no text description stating that they couldn't be used. However, they cannot be used, at least not in LINQPad 5 or VS 2015 (related request: dotnet#1742). "gmcs 4.6.2" via Mono does seem to allow them, possibly due to the incomplete supplementary character definition here.
8. Added new token for "`hex_digit_except_zero`" to be used in the new "`unicode_supplementary_escape_sequence`" token, else the definition for "`unicode_supplementary_escape_sequence`" would have been rather messy / harder to read.
9. Adjusted definition of "`hex_digit`" to use new "`hex_digit_except_zero`" token. All previous uses of "`hex_digit`" are unaffected.
10. Some English grammatical improvements to "Note:" at beginning of **Character literals** section.
11. Updated ANTLR grammar for **Character literals** to specify only BMP characters can be used. Previously the grammar did not prevent supplementary characters from being used, and it was only noted in the text description.
12. Removed the "A Unicode character escape sequence ... must be in the range U+0000 to U+FFFF." sentence from the **Character literals** section as it is now doubly redundant, given the sentence about "code points above U+FFFF will error", _and_ the ANTLR grammar finally restricting the values to only the allowable range.

---

For more info on this project, please see: [Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#csharp)

ufcpp · 2019-07-20T07:41:37Z

partially related to dotnet/roslyn#13474
It's a known issue that Roslyn does not recognize surrogate pairs in identifiers.

BillWagner · 2021-04-30T19:06:54Z

moving to dotnet/csharpstandard

Note that @gafter refers to the associated PR in #155 (I suggest we close that issue as a duplicate, as the discussion is primarily in this issue).

srutzky changed the title ~~[WIP] Correctify Unicode-related errors and omissions in "Lexical structure"~~ Correctify Unicode-related errors and omissions in "Lexical structure" Jul 21, 2019

BillWagner transferred this issue from dotnet/csharplang Apr 30, 2021

srutzky mentioned this issue Apr 30, 2021

Correctify Identifier definition to conform to Unicode standard in "Lexical structure" #305

Open
