CWG2779 [lex.ccon] What are the types of single code units of character-literal and string-literal? #285
Comments
I think the key point should be that a code unit (at compilation time) should have an unsigned type. In practice, possibly signed Alternative suggested resolution: Change [lex.charset] p8 to
Change [lex.ccon] p3.1 to
Change [lex.string] p10 to
|
What does this wording intend to mean? char c = '@'; The code point of |
|
But the intent of this wording is still unclear. A character-literal without an encoding prefix is specified with type |
For characters that are not in the basic literal character set, how such characters are encoded for ordinary character literals and wide character literals is implementation-defined because the character encoding used is implementation-defined in those cases. Thus, there is no requirement that the value be positive. |
Could you please supply a link to where either GCC or Clang documents this part? I wonder how it places the UTF-8 code units (UTF-8 is commonly known to be the encoding they adopted, and its code unit values are positive) into |
Oh... do you mean that when |
I don't think the "code unit type" is observable, so why would the standard need to specify it? As Tom pointed out, encodings of ordinary and wide character/string literals are implementation-defined. If you feel there are gaps in the documentation for implementation-defined behavior for some compilers, feel free to post bug reports to them. |
Consider this example, which is specified by the standard to have UTF-8 encoding: auto* ptr = u8"牛逼"; The code points of these two characters are
So, what is the type of each code unit value in the sequence |
Since the literal encoding is implementation-defined, yes, that would be a conforming implementation.
In that example,
I don't see why a type is relevant. As Jens stated, these intermediate values are not observable. For the encodings specified by the standard (UTF-8, UTF-16, and UTF-32) the code unit values are guaranteed to be in the range of the element type of the character or string literal. |
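As a rough illustration of the point that code unit values of the standard-specified encodings fit the element type, the sketch below checks the UTF-16 and UTF-32 code units of one character outside the BMP. The helper `element_count` and the variable names are hypothetical and chosen only for this example; U+1F600 is an arbitrary test character, not one taken from the discussion.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of array elements in a string literal,
// including the terminating null element.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair,
// while UTF-32 encodes it as a single code unit.
constexpr std::size_t utf16_elements = element_count(u"\U0001F600");
constexpr std::size_t utf32_elements = element_count(U"\U0001F600");

static_assert(utf16_elements == 3, "two UTF-16 code units plus the null");
static_assert(utf32_elements == 2, "one UTF-32 code unit plus the null");

// Each code unit value is representable in the element type of its literal.
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate fits in char16_t");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate fits in char16_t");
static_assert(U"\U0001F600"[0] == 0x1F600, "code point fits in char32_t");
```

The checks only observe element values of the initialized arrays; nothing in them depends on what type the implementation uses internally for a code unit.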
Per [lex.string] table 12, the string literal What kind of checking do you envision for the array object? Note that we're not doing brace-initialization here, but initializing each element individually, so we don't apply narrowing checks. |
So, why is it necessary to say
We totally could say that the code unit is an integer value of the implementation-defined chosen character type. In other words, is it the intent that the implementation can use code unit of type For example: u8"牛逼"; Can the array object of type |
Yes, I think that's conforming. You couldn't tell the difference even if |
I think that change [lex.charset] p8 to that
This makes sense here. |
Perhaps we don't need to assign types (in the C++ type system) to code units, since mathematical integer values seem sufficient. However, [lex.ccon] p3.1 currently says (emphasis mine):
which possibly implies that the code unit value in such case has the same type as the character literal. |
I disagree. The term "code unit" has more broad applicability; it isn't intended to be used solely for the initialization of character and string literals. The initialized array continues to hold code unit values following the initialization. |
The initialized array just holds the integer value of the code units. Is there any conflict between the integer value of a code unit and its type? It is similar to we can use |
Agreed.
I don't believe so.
Agreed, but there are associated semantics. If an |
I am just suggesting that the type of the code unit is the implementation's choice of character type. Neither its value nor its type is specified by the standard. |
I don't see how additional discussion of an unobservable type would make the specification more clear. I scrolled back up to re-read the discussion so far. This issue started by quoting this wording:
I wonder if there is a misunderstanding here. The intent of "cannot be encoded as a single code unit" has nothing to do with the range of the code unit type; it has to do with whether the encoding specifies that multiple code units are encoded for a given character (e.g., UTF-8 specifies that U+00E9 (é) is encoded as two code units, 0xC3 0xA9). Perhaps the following change would make this more clear.
|
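The reading above, where "single code unit" is a property of the encoding rather than of a type's range, can be sketched with UTF-8 literals. The helper `code_unit_count` and the variable names are hypothetical; the checks are written to compile both where `u8` literals have element type `char` and where they have `char8_t`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null element.
template <typename T, std::size_t N>
constexpr std::size_t code_unit_count(const T (&)[N]) { return N - 1; }

// U+00E9 needs two UTF-8 code units; '@' (U+0040) needs only one.
constexpr std::size_t e_acute_units = code_unit_count(u8"\u00E9");
constexpr std::size_t at_sign_units = code_unit_count(u8"@");

static_assert(e_acute_units == 2, "U+00E9 is a two-code-unit UTF-8 sequence");
static_assert(at_sign_units == 1, "U+0040 is a single UTF-8 code unit");

// The two code units are 0xC3 and 0xA9 when viewed as unsigned values.
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9, "trail byte");
```

The count of two comes from the UTF-8 encoding rules for U+00E9 and would be the same whichever 8-bit type holds the code units.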
I do think whether a character can be encoded in a single code unit is sensitive to the type. For example, |
It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present. I'm not sure what would be helpful though; there is no observable distinction between encoding a (unsigned) sequence of {0xC3 0xA9} UTF-8 code units in a sequence of |
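A minimal sketch of the "no observable distinction" remark, assuming 8-bit bytes and a two's complement representation of `signed char` (both are assumptions of this example, not statements from the thread). The function name `same_object_representation` is hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstring>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Stores the UTF-8 code units {0xC3, 0xA9} once as unsigned values and once
// as signed values, then compares the bytes of the two arrays.
inline bool same_object_representation() {
    const unsigned char as_unsigned[] = {0xC3, 0xA9};
    // The explicit casts avoid a narrowing error in the braced initializer;
    // on two's complement targets they yield -61 and -87.
    const signed char as_signed[] = {static_cast<signed char>(0xC3),
                                     static_cast<signed char>(0xA9)};
    return std::memcmp(as_unsigned, as_signed, sizeof as_unsigned) == 0;
}
```

On such targets the function returns true: the stored bytes are identical, and only the interpretation of each element as a negative or positive value differs.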
Yes, that is what I meant, and it matches what you suggested in the comment above (#285 (comment)). |
I guess the type-related issue might be that we haven't forbidden pathological choices where
which would make a non-null character literal equal to Perhaps it would be better to specify that the range of valid code unit values of ordinary or wide literal encoding is not wider than the range of |
Full name of submitter (unless configured in github; will be published with the issue): Jim X
The original issue is cplusplus/draft#5247
[lex.ccon] p1 says
[lex.charset] p8 only says
So, what is the type of a single code unit for an ordinary character literal, wide character literal, UTF-8 character literal, UTF-16 character literal, or UTF-32 character literal? We didn't clearly specify them. The Type column in the table just specifies what the type of the character literal is. We didn't specify what the type of a single code unit for the corresponding literal is, so we cannot determine the range of representable values of a single code unit; that is, we cannot determine whether a character can be encoded in a single code unit for that character-literal.
Similarly, we didn't specify what the type of a single code unit for a string-literal is. At best, [lex.string] p10 barely implies that the single code unit's type is related to the element type of the string-literal.
Because the string literal object is of type "array of n T" and the array is initialized by the sequence of code units, each element corresponds to a code unit, which may imply that the element type is related to the type of a single code unit in the string-literal. The ordinary string literal is an exception. Assume it uses UTF-8 encoding. The sequence of code unit values will be {195, 149, 0}. Neither 195 nor 149 can be represented in the type char if we assume char behaves like signed char with width 8. If the array object is initialized with this sequence, there are narrowing conversions, which make the program ill-formed.
Anyway, we don't have any wording with a similar implication for character-literal.
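To make the {195, 149, 0} case concrete, here is a small sketch that spells the two code units with hexadecimal escapes, so it does not depend on the source file encoding or on the ordinary literal encoding. The names `as_unsigned` and `text` are hypothetical, and the literal is a stand-in, not the one from the original report. The code compiles on implementations where plain char is signed, which is consistent with the reply in the discussion that each element is initialized individually rather than by brace-initialization.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: view an element of an ordinary string literal
// as an unsigned value.
constexpr unsigned char as_unsigned(char c) {
    return static_cast<unsigned char>(c);
}

// Two code units, 0xC3 and 0x95, followed by the terminating null.
constexpr char text[] = "\xC3\x95";

static_assert(sizeof(text) == 3, "two code units plus the terminating null");
static_assert(as_unsigned(text[0]) == 195, "first code unit");
static_assert(as_unsigned(text[1]) == 149, "second code unit");
static_assert(text[2] == 0, "terminating null");
```

Whether `text[0]` itself compares as 195 or as a negative value depends on the signedness of char; the unsigned view is the same either way.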
Suggested Resolution
Presumably, the unsigned version of the type in the Type column of the table should also be the type of a single code unit for that character-literal, and the unsigned version of the element type of a string-literal should also be the type of a single code unit for that string-literal. We should clearly specify the type of a single code unit for character-literals and string-literals.