Tweak Some Unicode-Related Text #1103
base: draft-v8
Conversation
@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)

 ### 8.2.5 The string type

-The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
+The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.
If the representation (including endianness) of UTF-16 code units in strings were not the same as in `char`, then memory allocation would be necessary in the conversion from `string` to `ReadOnlySpan<char>`, and in `fixed (char* p = str)`.
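The two zero-copy views mentioned above can be sketched in C# (my illustration, not part of the comment). Both expose the string's existing UTF-16 code units directly, which is why a different in-memory representation would force an allocating conversion:

```csharp
using System;

class SpanDemo
{
    static void Main()
    {
        string str = "cat";

        // string -> ReadOnlySpan<char>: a read-only view over the string's
        // existing UTF-16 code units; no copy or allocation is performed.
        ReadOnlySpan<char> span = str.AsSpan();
        Console.WriteLine(span[2]); // 't'

        // fixed (char* p = str): pins the string and exposes a raw pointer
        // to the same code units, again without copying.
        unsafe
        {
            fixed (char* p = str)
            {
                Console.WriteLine(p[2]); // 't'
            }
        }
    }
}
```

The `unsafe` block requires compiling with unsafe code enabled (e.g. `<AllowUnsafeBlocks>true</AllowUnsafeBlocks>`).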
However, I'm not sure the standard should require implementations to store all strings in UTF-16 all the time. It could allow implementations to use a tighter encoding such as ISO-8859-1 for some strings, provided that the implementation can expand the string to UTF-16 when a `ReadOnlySpan<char>` or `char*` is needed. (A variable-length encoding such as UTF-8 could be too costly to implement, because of the `string[int]` indexer.)
In other words, I'd slightly prefer if the standard said that a string represents a sequence of UTF-16 code units (which do not have to be valid UTF-16) but did not prescribe any in-memory format or endianness for strings.

AFAICT, the endianness of `int` is currently unspecified rather than implementation-defined. I don't see why the endianness of `char` (and `ushort`, which has the same representation) would be more important for implementations to document.
@KalleOlaviNiemitalo Regarding "Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.", note the comma. This is intended to say that the ordering of the UTF-16 code units in the (singular) string instance is implementation-defined, rather than the bits in any `char` in that string (which is how I would read that sentence without the comma). I say that because I'm confused by your mention of endianness with scalar types.
Why would you say that the ordering of the UTF-16 code units is implementation-defined? What kind of freedom would implementations have there?

For example, in the string "cat", the UTF-16 code units must be (U+0063, U+0061, U+0074). Do you think some implementation would prefer storing those in the opposite order in memory, i.e. { 0x0074, 0x0061, 0x0063 }? As `"cat"[2]` must still be `'t'`, I expect that would only be harder to implement.

Or, did you mean an implementation might swap the code units in each surrogate pair? I mean, store `"👁eye"` (U+1F441, U+0065, U+0079, U+0065) as { 0xDC41, 0xD83D, 0x0065, 0x0079, 0x0065 } rather than { 0xD83D, 0xDC41, 0x0065, 0x0079, 0x0065 }? I expect this would make sorting slower.
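The ordering argument above is directly observable through the indexer, so it is fixed by the language rather than left to the implementation. A small C# check (my illustration, not part of the comment):

```csharp
using System;

class CodeUnitOrder
{
    static void Main()
    {
        // The indexer exposes UTF-16 code units in sequence order.
        Console.WriteLine("cat"[2] == 't'); // True

        // U+1F441 encodes as the surrogate pair D83D DC41,
        // high surrogate first, as required by UTF-16.
        string s = "\U0001F441eye";
        Console.WriteLine(s.Length == 5);                        // True: 2 surrogates + "eye"
        Console.WriteLine(s[0] == '\uD83D' && s[1] == '\uDC41'); // True
    }
}
```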
Recently, I started writing a formal spec for the V11 feature, UTF-8 string literals. However, as I was reading the Draft V8 Ecma C# spec, it became clear to me that it didn't say enough and/or wasn't as clear as it could be w.r.t. the (presumed) current intention regarding Unicode support. So, before I go back to writing that V11 feature spec, I created this PR, which contains a set of proposed improvements to the current Ecma spec. Basically, I want a better base on which to add the V11 (and possibly later) extensions. This PR includes the following kinds of edits:

- `string` instead of "string", as we'll have different kinds of string once UTF-8 support is added.
- `char` instead of "character" where that makes it more precise.

I don't expect this to be controversial. The only new normative text has to do with explicitly saying that type `string` uses UTF-16 encoding.

@jskeet I added you as a reviewer, as I know you have written about some Unicode issues.

@KalleOlaviNiemitalo If you have expertise in this area, I'd appreciate your feedback.