Fix and improve Unicode escape sequence info (C#) #13162

Merged
merged 1 commit into dotnet:master from srutzky:patch-1 on Jul 1, 2019

@srutzky
Contributor

commented Jun 28, 2019

  1. Remove erroneous note regarding \U being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.

    • Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the \U escape can also be used for BMP characters.
    • Runnable example code showing that a valid code point (U+1F47E) works via \U0001F47E, and its surrogate pair via \UD83DDC7E does not, on IDE One
    • In creating the test noted above, I found a bug in the Mono C# compiler, so I submitted that here:
      "\U" Unicode escape sequence for strings accepts invalid value instead of raising error #15456
    • Runnable example code showing that an invalid code point (U+110000) raises an exception, on IDE One
  2. Correctly indicate that \U is for a 4-byte UTF-32 value (8 hex digits), and \u is for a 2-byte UTF-16 value (4 hex digits).

  3. Show the pattern and an example to be more readable / helpful. Please note that \U00nnnnnn has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion.

  4. Properly format escape sequences as inline code.

  5. Add a warning about using the \x escape with fewer than 4 hex digits. For more info on this, please see:
    Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)
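
To make items 1 through 3 concrete, here is a minimal sketch, assuming a compiler that follows the C# specification. The class and member names are made up for illustration and are not part of the docs change. It shows `\U0001F47E` producing the same string as the explicit surrogate pair written with two `\u` escapes, and `\U` being used for a BMP character. The two rejected forms are left commented out because they should not compile.

```csharp
using System;
using System.Linq;

// Hypothetical demo class; names are for illustration only.
public static class UnicodeEscapeDemo
{
    // U+1F47E written as one UTF-32 escape (two fixed zeros + six hex digits).
    public const string ViaUpperU = "\U0001F47E";

    // The same character written as its two UTF-16 code units (a surrogate pair).
    public const string ViaLowerU = "\uD83D\uDC7E";

    // \U is not limited to supplementary characters: U+00E9 is in the BMP.
    public const string BmpViaUpperU = "\U000000E9";

    // Per the description above, these are not accepted:
    // string surrogatePair = "\UD83DDC7E"; // a surrogate pair is not a code point
    // string outOfRange    = "\U00110000"; // above the last code point, U+10FFFF

    // Formats each UTF-16 code unit of s as four uppercase hex digits.
    public static string CodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        Console.WriteLine(CodeUnits(ViaUpperU));                            // D83D DC7E
        Console.WriteLine(ViaUpperU == ViaLowerU);                          // True
        Console.WriteLine(char.ConvertToUtf32(ViaUpperU, 0).ToString("X")); // 1F47E
        Console.WriteLine(CodeUnits(BmpViaUpperU));                         // 00E9
    }
}
```

Uncommenting either rejected line is a quick way to check how a given compiler behaves; the Mono issue linked above suggests that not every compiler rejects them.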

Fix and improve Unicode escape sequence info
1. Remove erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair raises an exception, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
    * Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
    * Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/deoylQ)
    * In creating the test noted above, I found a bug in the Mono C# compiler, so I submitted that here:  
      ["\U" Unicode escape sequence for strings accepts invalid value instead of raising error #15456](mono/mono#15456)
    * Runnable example code showing that an invalid code point (U+110000) raises an exception, on [IDE One](https://ideone.com/jpVxL4)

2. Correctly indicate that `\U` is for a 4-byte UTF-32 value (8 hex digits), and `\u` is for a 2-byte UTF-16 value (4 hex digits).

3. Show the pattern _and_ an example to be more readable / helpful. Please note that `\U00nnnnnn` has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion.

4. Properly format escape sequences as inline code.

5. Add a warning about using the `\x` escape with fewer than 4 hex digits. For more info on this, please see:
     [Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.wordpress.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#csharp)
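
The `\x` warning in item 5 is easiest to see with a short sketch. The class and member names below are made up for illustration. It relies on `\x` accepting one to four hex digits and consuming as many as it can, so a following character that happens to be a hex digit is absorbed into the escape.

```csharp
using System;

// Hypothetical demo class; names are for illustration only.
public static class HexEscapeDemo
{
    // Two hex digits, nothing after: U+00A1 on its own.
    public const string Short = "\xA1";

    // Intended as U+00A1 followed by 'A', but parsed as the single character U+0A1A.
    public const string Greedy = "\xA1A";

    // Padding to four digits stops the escape before the trailing 'A'.
    public const string Padded = "\x00A1A";

    // \u always takes exactly four digits, so it has no such ambiguity.
    public const string FixedWidth = "\u00A1A";

    public static void Main()
    {
        Console.WriteLine(Short.Length);                 // 1
        Console.WriteLine(Greedy.Length);                // 1
        Console.WriteLine((int)Greedy[0] == 0x0A1A);     // True
        Console.WriteLine(Padded.Length);                // 2
        Console.WriteLine(Padded == FixedWidth);         // True
    }
}
```

Writing all four digits, or using `\u` instead, avoids the surprise.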

@srutzky srutzky requested a review from BillWagner as a code owner Jun 28, 2019

@srutzky srutzky changed the title Fix and improve Unicode escape sequence info Fix and improve Unicode escape sequence info (C#) Jun 28, 2019

@BillWagner
Member

left a comment

Thank you for adding these clarifying comments @srutzky
We appreciate it.

I’ve reviewed the changes, and I’ll :shipit: now.

Thanks again!

@BillWagner BillWagner merged commit 9b6f355 into dotnet:master Jul 1, 2019

7 checks passed

Docs Content Validation Status: Succeeded
OpenPublishing.Build Validation status: passed
OpenPublishing.Build (1 of 3) Waiting for processor completed at 06:42:57 PST
OpenPublishing.Build (2 of 3) Preparing completed at 06:45:23 PST
OpenPublishing.Build (3 of 3) Building completed at 06:46:12 PST
WIP Ready for review
license/cla All CLA requirements met.

@srutzky

Contributor Author

commented Jul 1, 2019

@BillWagner You are welcome.

I forgot to mention that this update has a companion F# update: #13168

@srutzky srutzky deleted the srutzky:patch-1 branch Jul 9, 2019
