Correct and improve Unicode escape sequence info (F#) #13168

Merged
merged 2 commits into from Jun 28, 2019

@srutzky
Contributor

commented Jun 28, 2019

"Literals" page

  1. Remove erroneous note regarding \U being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.

    • Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the \U escape can also be used for BMP characters.
    • Runnable example code showing that a valid code point (U+1F47E) works via \U0001F47E, and its surrogate pair via \UD83DDC7E does not, on IDE One
  2. Show the exact hex value range for \u and \U to be more readable / helpful. This not only reduces confusion (especially for \U), it also removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros).
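As a minimal sketch of the behavior described in the two points above (the value names are illustrative, and the surrogate arithmetic in the comments comes from UTF-16 rather than from this PR), the following F# script compares `\u` and `\U` for a BMP character and for the supplementary character U+1F47E:

```fsharp
// BMP character U+00E9: both escapes produce the same one-unit string.
let bmpShort = "\u00E9"        // 4 hex digits
let bmpLong  = "\U000000E9"    // 8 hex digits, same code point

// Supplementary character U+1F47E, named directly by its code point.
let alien = "\U0001F47E"

// The same character spelled as two UTF-16 code units via \u.
let alienViaSurrogates = "\uD83D\uDC7E"

printfn "%b" (bmpShort = bmpLong)                  // true
printfn "%d" alien.Length                          // 2 (UTF-16 code units)
printfn "%X %X" (int alien.[0]) (int alien.[1])    // D83D DC7E
printfn "%b" (alien = alienViaSurrogates)          // true

// Per this PR, packing the surrogate pair into \U is a compiler error,
// so the line below is left commented out:
// let broken = "\UD83DDC7E"
```

The point of the comparison is that `\U` takes a code point, not a pair of UTF-16 code units, which is why the removed note was misleading.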

"Strings" page

  1. Correctly indicate that \u is for a 2-byte UTF-16 value, and \U is for a 4-byte UTF-32 value.

  2. Show a more accurate pattern for \U to be more readable / helpful. Please note that \U00XXXXXX has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion. Runnable example code showing that a valid code point (U+1F47E) works via \U0001F47E, and its surrogate pair via \UD83DDC7E does not, on IDE One.
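To illustrate why the first two digits of `\U00XXXXXX` are always zero, here is a small sketch. It assumes only that Unicode code points end at U+10FFFF; the helper names `inCodePointRange` and `toSurrogatePair` are hypothetical and not part of F# or .NET, while `Char.ConvertFromUtf32` is the standard .NET method.

```fsharp
open System

// Highest Unicode code point. It needs at most 6 hex digits, so the
// leading two digits of an 8-digit \U escape can only be 00.
let maxCodePoint = 0x10FFFFu

// Hypothetical helper: does this 32-bit value fit the \U00XXXXXX pattern?
let inCodePointRange (value: uint32) = value <= maxCodePoint

// Hypothetical helper: split a supplementary code point (U+10000 and up)
// into its UTF-16 high and low surrogates.
let toSurrogatePair (codePoint: int) =
    let offset = codePoint - 0x10000
    0xD800 + (offset >>> 10), 0xDC00 + (offset &&& 0x3FF)

let hi, lo = toSurrogatePair 0x1F47E
printfn "%X %X" hi lo                                            // D83D DC7E
printfn "%b" (inCodePointRange 0x1F47Eu)                         // true
printfn "%b" (inCodePointRange 0xD83DDC7Eu)                      // false
printfn "%b" (Char.ConvertFromUtf32(0x1F47E) = "\U0001F47E")     // true
```

Reading `D83DDC7E` as a single 32-bit value puts it far outside the code point range, which is consistent with the compiler error reported above. Whether that is the exact check the compiler performs is an assumption.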

FYI: I found an undocumented escape sequence, \xXX, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.
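A short sketch of the `\xXX` behavior described above, assuming the escape maps its two hex digits straight to the code point of the same value (the value names are illustrative):

```fsharp
// U+00E9 written with the two-digit hex escape and with \u.
let viaHex     = "\xE9"
let viaUnicode = "\u00E9"

printfn "%b" (viaHex = viaUnicode)   // true
printfn "%d" (int viaHex.[0])        // 233

// ASCII values work the same way: \x41 is 'A'.
printfn "%s" "\x41"                  // A
```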


For more info on all of this, please see:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)

srutzky added some commits Jun 28, 2019

Correct and improve Unicode escape sequence info
1. Remove erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
    * Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
    * Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/0viKI5)

2. Show the exact hex value range for `\u` and `\U` to be more readable / helpful. This not only reduces confusion (especially for `\U`), it also removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros).

For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)

Correct and improve Unicode escape sequence info
1. Correctly indicate that `\u` is for a 2-byte UTF-16 value, and `\U` is for a 4-byte UTF-32 value.

2. Show a more accurate pattern for `\U` to be more readable / helpful. Please note that `\U00XXXXXX` has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion. Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/0viKI5).

**FYI:** I found an undocumented escape sequence, `\xXX`, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.

For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)

@srutzky srutzky requested a review from cartermp as a code owner Jun 28, 2019

@srutzky srutzky changed the title from "Correct and improve Unicode escape sequence info" to "Correct and improve Unicode escape sequence info (F#)" Jun 28, 2019

@cartermp
Contributor

left a comment

Thank you @srutzky!

@cartermp cartermp merged commit 3c8367a into dotnet:master Jun 28, 2019

7 checks passed

Docs Content Validation Status: Succeeded
OpenPublishing.Build Validation status: passed
OpenPublishing.Build (1 of 3) Waiting for processor completed at 09:34:22 PST
OpenPublishing.Build (2 of 3) Preparing completed at 09:36:56 PST
OpenPublishing.Build (3 of 3) Building completed at 09:37:41 PST
WIP: Ready for review
license/cla: All CLA requirements met.

@cartermp

Contributor

commented Jun 28, 2019

Regarding this:

> I found an undocumented escape sequence, \xXX, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.

We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.

@srutzky

Contributor Author

commented Jun 28, 2019

@cartermp

Re:

> We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.

Yes, I figured if it was something to deal with, then it would be dealt with separately. I just didn't want to add it in now since it could have been something "experimental" and never completed, or something intentionally hidden. I dunno, maybe I am just conditioned by doing most of my work with SQL Server where there is quite a bit of "undocumented" stuff ;-). If nobody knows of a reason why \x shouldn't be documented, then I can submit another update for that early next week...

@srutzky

Contributor Author

commented Jul 1, 2019

I forgot to mention that this update has a companion C# update: #13162

@srutzky srutzky deleted the srutzky:patch-2 branch Jul 9, 2019
