Tweak Some Unicode-Related Text #1103

RexJaeschke · 2024-05-04T15:58:08Z

Recently, I started writing a formal spec for the V11 feature, UTF-8 string literals. However, as I was reading the Draft V8 Ecma C# spec, it became clear to me that it didn’t say enough and/or wasn’t as clear as it could be w.r.t the (presumed) current intention regarding Unicode support. So, before I go back to writing that V11 feature spec, I created this PR, which contains a set of proposed improvements to the current Ecma spec. Basically, I want a better base on which to add the V11 (and possibly later) extensions. This PR includes the following kinds of edits:

Use grammar rule name instead of the descriptive English equivalent.
Use string instead of "string," as we'll have different kinds of string once UTF-8 support is added.
Use char instead of "character" where that makes it more precise.
Remove duplication of normative text (w.r.t conformance).
Use terms consistently.

I don't expect this to be controversial. The only new normative text has to do with explicitly saying that type string uses UTF-16 encoding.

@jskeet I added you as a reviewer, as I know you have written about some Unicode issues.
@KalleOlaviNiemitalo If you have expertise in this area, I'd appreciate your feedback.

KalleOlaviNiemitalo · 2024-05-04T16:18:19Z

standard/types.md

@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)

 ### 8.2.5 The string type

-The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
+The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.


If the representation (including endianness) of UTF-16 code units in strings were not the same as in char, then memory allocation would be necessary in the conversion from string to ReadOnlySpan<char>, and in fixed (char* p = str).

However, I'm not sure the standard should require implementations to store all strings in UTF-16 all the time. It could allow implementations to use a tighter encoding such as ISO-8859-1 for some strings, provided that the implementation can expand the string to UTF-16 when a ReadOnlySpan<char> or char* is needed. (A variable-length encoding such as UTF-8 could be too costly to implement, because of the string[int] indexer.)

In other words, I'd slightly prefer if the standard said that a string represents a sequence of UTF-16 code units (which do not have to be valid UTF-16) but did not prescribe any in-memory format or endianness for strings.

AFAICT, the endianness of int is currently unspecified rather than implementation-defined. I don't see why the endianness of char (and ushort, which has the same representation) would be more important for implementations to document.

@KalleOlaviNiemitalo Regarding "Instances of the string class represent a sequence of UTF-16 code units, whose endianness is implementation-defined." note the comma. This is intended to say that the ordering of the UTF-16 code units in the (singular) string instance is implementation-defined, rather than the bits in any char in that string (which is how I would read that sentence without the comma. I say that because I'm confused by your mention of endianness with scalar types.

Why would you say that that the ordering of the UTF-16 code units is implementation-defined? What kind of freedom would implementations have there?

For example, in the string "cat", the UTF-16 code units must be (U+0063, U+0061, U+0074). Do you think some implementation would prefer storing those in the opposite order in memory, i.e. { 0x0074, 0x0061, 0x0063 }? As "cat"[2] must still be 't', I expect that would only be harder to implement.

Or, did you mean an implementation might swap the code units in each surrogate pair? I mean, store "👁eye" (U+1F441, U+0065, U+0079, U+0065) as { 0xDC41, 0xD83D, 0x0065, 0x0079, 0x0065 } rather than { 0xD83D, 0xDC41, 0x0065, 0x0079, 0x0065 }. I expect this would make sorting slower.

RexJaeschke added 4 commits May 4, 2024 11:28

tweak Unicode-related text

1ada208

tweak Unicode-related text

31be46d

tweak Unicode-related text

39725dd

tweak Unicode-related text

7f4aa4b

RexJaeschke added the type: clarity While not technically incorrect, the Standard is potentially confusing label May 4, 2024

RexJaeschke requested a review from jskeet May 4, 2024 15:58

RexJaeschke self-assigned this May 4, 2024

KalleOlaviNiemitalo reviewed May 4, 2024

View reviewed changes

TweakUnicodeStuff

c2674d6

RexJaeschke closed this Jun 24, 2024

RexJaeschke deleted the TweakUnicodeStuff branch June 24, 2024 20:29

BillWagner restored the TweakUnicodeStuff branch June 24, 2024 20:36

BillWagner reopened this Jun 24, 2024

RexJaeschke added the meeting: proposal There is an informal proposal in the issue, worth discussing in a meeting label Jun 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tweak Some Unicode-Related Text #1103

Tweak Some Unicode-Related Text #1103

RexJaeschke commented May 4, 2024

KalleOlaviNiemitalo May 4, 2024 •

edited

Loading

KalleOlaviNiemitalo May 4, 2024

RexJaeschke May 24, 2024

KalleOlaviNiemitalo Jun 24, 2024

Tweak Some Unicode-Related Text #1103

Are you sure you want to change the base?

Tweak Some Unicode-Related Text #1103

Conversation

RexJaeschke commented May 4, 2024

KalleOlaviNiemitalo May 4, 2024 • edited Loading

Choose a reason for hiding this comment

KalleOlaviNiemitalo May 4, 2024

Choose a reason for hiding this comment

RexJaeschke May 24, 2024

Choose a reason for hiding this comment

KalleOlaviNiemitalo Jun 24, 2024

Choose a reason for hiding this comment

KalleOlaviNiemitalo May 4, 2024 •

edited

Loading