Correctify Unicode-related errors and omissions in "Lexical structure" #2672
Large portions of the Unicode info in the "Lexical structure" documentation are either missing, incorrect, or misleading. The document should reflect, in technically specific terms, what is, and has always been, the behavior of C#. None of the changes here reflect any change to the language; they merely bring the specification in line with the language it describes.
The main area of concern is Unicode escape sequences, especially as they relate to supplementary characters. It has been this way since the very beginning (almost 20 years now?). The earliest examples I could find are:
It has not helped that there was no example of the "\U" escape sequence anywhere. The only description was in the "String Escape Sequences" section of the C# Strings documentation, which said "for surrogate pairs." That wording is not exactly correct and is definitely misleading (I recently corrected that page). Coupled with the lack of any examples to guide interpretation, it leads one to think that you specify the two surrogate code points, since there are 8 hex digits available and that is what two UTF-16 surrogate code points require. This misunderstanding has not stayed within the confines of this document; it has shown up in:
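To make the distinction concrete, here is a minimal sketch of how "\U" behaves in a C# string literal. The class and member names (`UnicodeEscapeDemo`, `ViaLongEscape`, and so on) are hypothetical and exist only for illustration; U+1F600 is simply a convenient supplementary character.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class UnicodeEscapeDemo
{
    // "\U" takes ONE code point, written as exactly 8 hex digits.
    // U+1F600 is a supplementary character (above U+FFFF).
    public const string ViaLongEscape = "\U0001F600";

    // The same character written as two "\u" escapes,
    // one per UTF-16 surrogate code unit.
    public const string ViaSurrogates = "\uD83D\uDE00";

    // Writing the two surrogates inside a single "\U" escape, as in
    // "\UD83DDE00", is expected to be rejected by the compiler, because
    // 0xD83DDE00 is far above the highest code point, U+10FFFF.

    // Both spellings should yield the same two-code-unit string.
    public static bool SameString() => ViaLongEscape == ViaSurrogates;

    // Number of UTF-16 code units the "\U" literal occupies.
    public static int CodeUnitCount() => ViaLongEscape.Length;

    // Decodes the surrogate pair back into the single code point.
    public static int CodePoint() => char.ConvertToUtf32(ViaLongEscape, 0);
}
```

Under these assumptions, `SameString()` is true, `CodeUnitCount()` is 2, and `CodePoint()` is 0x1F600: the compiler produces the surrogate pair from the single code point, rather than expecting the author to supply the pair.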
Cha cha cha changes to this page
The changes noted above, and implemented in the associated PR, also need to find their way into the specification published by ECMA, as it contains the same errors and omissions.
No, not done. There are still some areas that need correction, but they are out of scope for now, and can be handled separately (and might require additional discussion).
For more info on this project, please see: