Commit 7c72eb0

srutzky authored and baronfel committed
Fix Supplementary Character / Surrogate Pair info (no code changes) (#7221)
Terminology and info regarding Supplementary Characters / Surrogate Pairs is either incorrect, or at least incomplete (which then leads to incorrect statements and/or code).

1. Introduce the term "Supplementary Character" since that is often what we are dealing with, not "surrogate pairs" since that is an encoding-specific concept (UTF-16 only).
2. Add comment re: Supplementary Character code point range, which helps to explain the `elif high > 0x10 then Invalid` condition (line 173).
3. Fix URI for Unicode Standard PDF, Chapter 3, and specify the name of the section (i.e. "Surrogates") instead of the section number (i.e. 3.8) since the section number was 3.7 but is now 3.8 (line 174).
4. Add comment for definition of a valid "surrogate pair" because why make the reader guess or have to go look it up when it will never change? (line 175)
5. Correct and expand comment with example long Unicode escape sequence (line 64): `"\UDEADBEEF"` is _not_ a valid escape sequence. Usage of the `\U` escape has been misstated from the very beginning, both in this documentation as well as the C# Specification documentation, and the language references for "String" for both F# and C#:
   1. `\U` is used to specify a Unicode code point (or UTF-32 code unit, which maps 1:1 with all Unicode code points, hence they are synonymous), not surrogate pairs. Hence the valid range is `00000000` - `0010FFFF`, hence the first two digits are static `0`s, and the third digit can only ever be a `0` or `1`. This escape sequence can specify either a BMP character or a Supplementary character. Supplementary characters are then encoded as a surrogate pair in UTF-16 only, not in UTF-8 or UTF-32. If you want to specify an actual surrogate pair, then use the `\u` escape, e.g. `\uD83D\uDC7D` == `\U0001F47D`.
   2. Even if you could specify a surrogate pair using `\U`, "DEADBEEF" is not valid. U+DEAD is a valid surrogate, _but_ it's a low surrogate code point and cannot be specified first in the pair (meaning, at best one could use `\UxxxxDEAD`). Also, U+BEEF is _not_ a valid surrogate code point, high or low. Surrogate code points are in the range of U+D800 to U+DFFF.

For more info, please see: https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp
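The encoding claims in the message above can be checked with any standard Unicode codec. The following is a minimal sketch in Python, used here only for illustration and not part of the commit. The helper names `is_valid_code_point` and `is_surrogate` are hypothetical. It shows that U+1F47D becomes the surrogate pair D83D DC7D only in UTF-16, and that neither `DEADBEEF` as a whole nor `BEEF` as a surrogate is valid.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

def is_valid_code_point(value):
    """True if value lies in the Unicode code point range 0..0x10FFFF."""
    return 0 <= value <= MAX_CODE_POINT

def is_surrogate(value):
    """True if value is a surrogate code point (U+D800 to U+DFFF)."""
    return 0xD800 <= value <= 0xDFFF

# The long escape names a code point, here a supplementary character.
alien = "\U0001F47D"

# Only UTF-16 represents it as a surrogate pair.
assert alien.encode("utf-16-be").hex() == "d83ddc7d"
assert alien.encode("utf-32-be").hex() == "0001f47d"
assert alien.encode("utf-8").hex() == "f09f91bd"

# 0xDEADBEEF is far outside the code point range.
assert not is_valid_code_point(0xDEADBEEF)

# DEAD is a (low) surrogate; BEEF is not a surrogate at all.
assert is_surrogate(0xDEAD) and 0xDEAD >= 0xDC00
assert not is_surrogate(0xBEEF)
```

Note that Python strings are sequences of code points, not UTF-16 code units, so the `\uD83D\uDC7D` == `\U0001F47D` equality stated for F# and C# does not carry over to Python string literals.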
1 parent 135a1ae commit 7c72eb0

1 file changed (+5 -2 lines changed)
src/fsharp/lexhelp.fs

Lines changed: 5 additions & 2 deletions
@@ -60,7 +60,8 @@ type lexargs =
       applyLineDirectives: bool
       pathMap: PathMap }
 
-/// possible results of lexing a long unicode escape sequence in a string literal, e.g. "\UDEADBEEF"
+/// possible results of lexing a long Unicode escape sequence in a string literal, e.g. "\U0001F47D",
+/// "\U000000E7", or "\UDEADBEEF" returning SurrogatePair, SingleChar, or Invalid, respectively
 type LongUnicodeLexResult =
     | SurrogatePair of uint16 * uint16
     | SingleChar of uint16
@@ -169,7 +170,9 @@ let unicodeGraphLong (s:string) =
     if high = 0 then SingleChar(uint16 low)
     // invalid encoding
     elif high > 0x10 then Invalid
-    // valid surrogate pair - see http://www.unicode.org/unicode/uni2book/ch03.pdf, section 3.7 *)
+    // valid supplementary character: code points U+10000 to U+10FFFF
+    // valid surrogate pair: see http://www.unicode.org/versions/latest/ch03.pdf , "Surrogates" section
+    // high-surrogate code point (U+D800 to U+DBFF) followed by low-surrogate code point (U+DC00 to U+DFFF)
     else
       let codepoint = high * 0x10000 + low
       let hiSurr = uint16 (0xD800 + ((codepoint - 0x10000) / 0x400))
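
The branch logic visible in this hunk can be mirrored in a few lines. The following is a rough Python sketch, not the F# implementation. The function name `classify_long_escape` and the tuple results are hypothetical stand-ins for `unicodeGraphLong` and the `LongUnicodeLexResult` cases. The low-surrogate formula is the standard UTF-16 one and is an assumption here, because the corresponding F# line lies outside the diff.

```python
def classify_long_escape(digits):
    # digits: the eight hex digits that follow the U in a long escape.
    if len(digits) != 8:
        raise ValueError("expected exactly eight hex digits")
    high = int(digits[:4], 16)  # upper 16 bits of the 32-bit value
    low = int(digits[4:], 16)   # lower 16 bits of the 32-bit value
    if high == 0:
        # BMP range: a single UTF-16 code unit
        return ("SingleChar", low)
    if high > 0x10:
        # above U+10FFFF: not a code point
        return ("Invalid",)
    # supplementary character: U+10000 to U+10FFFF
    codepoint = high * 0x10000 + low
    offset = codepoint - 0x10000
    hi_surr = 0xD800 + offset // 0x400  # same formula as hiSurr in the diff
    lo_surr = 0xDC00 + offset % 0x400   # assumed: standard UTF-16 low surrogate
    return ("SurrogatePair", hi_surr, lo_surr)

# The three examples named in the updated doc comment.
assert classify_long_escape("0001F47D") == ("SurrogatePair", 0xD83D, 0xDC7D)
assert classify_long_escape("000000E7") == ("SingleChar", 0x00E7)
assert classify_long_escape("DEADBEEF") == ("Invalid",)
```

Splitting the value into two 16-bit halves makes the range check cheap: any `high` above `0x10` means the value exceeds U+10FFFF.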
