CWG2779 [lex.ccon] What are the types of single code units of character-literal and string-literal? #285

Open
xmh0511 opened this issue Mar 28, 2023 · 25 comments

Comments

@xmh0511

xmh0511 commented Mar 28, 2023

Full name of submitter (unless configured in github; will be published with the issue): Jim X

The original issue is cplusplus/draft#5247

[lex.ccon] p1 says

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

[lex.charset] p8 only says

A code unit is an integer value of character type ([basic.fundamental]).

So, what is the type of a single code unit for an ordinary character literal, wide character literal, UTF-8 character literal, UTF-16 character literal, or UTF-32 character literal? We do not clearly specify them. The Type column in the table only specifies the type of the character literal itself.

The kind of a character-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding-prefix and its c-char-sequence as defined by Table 9.

Because we do not specify the type of a single code unit for the corresponding literal, we cannot determine the range of representable values of a single code unit; that is, we cannot determine whether a character can be encoded in a single code unit for that character-literal.

Similarly, we do not specify the type of a single code unit for a string-literal. At best, [lex.string] p10 barely implies that the type of a single code unit is related to the element type of the string-literal.

String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence...

The string literal object has type "array of n T", and the array is initialized by the sequence of code units, so each element corresponds to a code unit. This may imply that the element type is related to the type of a single code unit in the string-literal. The ordinary string literal is an exception; assume it uses UTF-8 encoding:

const char* ptr = "Õ";

The sequence of code unit values will be {195, 149, 0}. Neither 195 nor 149 can be represented in the type char if we assume char is signed with width 8. If the array object is initialized with this sequence, there are narrowing conversions, which make the program ill-formed.

Anyway, we don't have any wording that implies this for character-literal.
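As a rough illustration of the values in question, here is a minimal sketch. It uses a u8 literal so that the encoding is UTF-8 regardless of the implementation's ordinary literal encoding, and it assumes an 8-bit, two's complement char. The helper names `utf8_unit` and `as_signed_char` are made up for this example.

```cpp
#include <cstddef>

// Code unit i of U+00D5 in UTF-8, read back as an unsigned value.
// The u8 prefix guarantees UTF-8; the cast works whether the element
// type is char (before C++20) or char8_t (C++20 and later).
constexpr unsigned utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00D5"[i]);
}

// The same value as seen through a signed char.
// Assumption: CHAR_BIT == 8 and two's complement. The conversion is
// modulo 2^8 since C++20 and implementation-defined before that.
constexpr int as_signed_char(unsigned unit) {
    return static_cast<signed char>(unit);
}

static_assert(sizeof(u8"\u00D5") == 3, "two code units plus the terminator");
static_assert(utf8_unit(0) == 195 && utf8_unit(1) == 149, "0xC3 0x95");
```

Under those assumptions, `as_signed_char(195)` yields -61 and `as_signed_char(149)` yields -107, which are the values a signed 8-bit char element would hold.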

Suggested Resolution

Presumably, the unsigned version of the Type column in the table should also be the type of a single code unit for that character-literal, and the unsigned version of the element type of a string-literal should also be the type of a single code unit for that string-literal. We should clearly specify the type of a single code unit for character-literal and string-literal.

@xmh0511 xmh0511 changed the title [lex.] [lex.ccon] What is a single code unit of character-literal Mar 28, 2023
@xmh0511 xmh0511 changed the title [lex.ccon] What is a single code unit of character-literal [lex.ccon] What is the range of the representable values in a single code unit of character-literal? Mar 28, 2023
@xmh0511 xmh0511 changed the title [lex.ccon] What is the range of the representable values in a single code unit of character-literal? [lex.ccon] What are types of the single code units of character-literal and string-literral? Mar 28, 2023
@xmh0511 xmh0511 changed the title [lex.ccon] What are types of the single code units of character-literal and string-literral? [lex.ccon] What are the types of the single code units of character-literal and string-literral? Mar 28, 2023
@xmh0511 xmh0511 changed the title [lex.ccon] What are the types of the single code units of character-literal and string-literral? [lex.ccon] What are the types of single code units of character-literal and string-literral? Mar 28, 2023
@frederick-vs-ja

frederick-vs-ja commented Mar 29, 2023

I think the key point should be that a code unit (at compilation time) should have an unsigned type.

In practice, possibly signed char and wchar_t are often used to represent UTF code units (at runtime). But the term "code unit" in the standard seems only used for translation, so IMO we don't need to cover such usages.

Alternative suggested resolution:

Change [lex.charset] p8 to

A code unit of a character-literal or a string-literal is an integer value of the unsigned version of the underlying type of the character type ([basic.fundamental]) determined by its kind ([lex.ccon], [lex.string]), except that the character type is always char for a character-literal without encoding prefix. [...]

Change [lex.ccon] p3.1 to

[...] is the code unit value of the specified character as encoded in the literal's associated character encoding, converted to the character-literal's type.

Change [lex.string] p10 to

String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 NULL character, each converted to the string-literal's array element type, in order as follows: [...]

@xmh0511
Author

xmh0511 commented Mar 29, 2023

except that the character type is always char for a character-literal without encoding prefix.

What does this wording intend to mean?

char c = '@';

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

@frederick-vs-ja

except that the character type is always char for a character-literal without encoding prefix.

What does this wording intend to mean?

char c = '@';

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

'@' will be added to the basic literal character set by P2558 in C++26 (see also https://wg21.link/p2558/github). Perhaps no CWG issue is needed for this.

@xmh0511
Author

xmh0511 commented Mar 29, 2023

except that the character type is always char for a character-literal without encoding prefix.

But the intent of this wording is still unclear. A character-literal without an encoding prefix is specified to have type char. The type of the code unit does not change the character-literal's type, I think. Even for a character-literal without an encoding prefix, we can still say the code unit has the unsigned version of that type.

@tahonermann

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

For characters that are not in the basic literal character set, how such characters are encoded for ordinary character literals and wide character literals is implementation-defined because the character encoding used is implementation-defined in those cases. Thus, there is no requirement that the value be positive.

@xmh0511
Author

xmh0511 commented Mar 30, 2023

the character encoding used is implementation-defined in those cases.

Could you please supply a link to where either GCC or Clang documents this part? I wonder how they place UTF-8 code units (UTF-8 is commonly known to be the encoding they adopt, and its code unit values are positive) into char for ordinary literals.

@frederick-vs-ja

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

For characters that are not in the basic literal character set, how such characters are encoded for ordinary character literals and wide character literals is implementation-defined because the character encoding used is implementation-defined in those cases. Thus, there is no requirement that the value be positive.

Oh... do you mean that when char is signed and used for representing UTF-8 code units in character literals, the code unit values are first considered to be converted into [-128, 127] before further processing in translation?

@jensmaurer
Member

I don't think the "code unit type" is observable, so why would the standard need to specify it? As Tom pointed out, encodings of ordinary and wide character/string literals are implementation-defined.

If you feel there are gaps in the documentation for implementation-defined behavior for some compilers, feel free to post bug reports to them.

@xmh0511
Author

xmh0511 commented Mar 30, 2023

I don't think the "code unit type" is observable

Consider this example, which is specified by the standard to have utf-8 encoding:

auto* ptr = u8"牛逼";

The code points of these two characters are \u725B and \u903C, and the sequence of code unit values in UTF-8 encoding is {0xE7, 0x89, 0x9B, 0xE9, 0x80, 0xBC}. We say

String literal objects are initialized with the sequence of code unit values...

So, what is the type of each code unit value in the sequence {0xE7, 0x89, 0x9B, 0xE9, 0x80, 0xBC }? If we didn't specify the type, how do we check the initialization for array object?

@tahonermann

Oh... do you mean that when char is signed and used for representing UTF-8 code units in character literals, the code unit values are first considered to be converted into [-128, 127] before further processing in translation?

Since the literal encoding is implementation-defined, yes, that would be a conforming implementation.

Consider this example, which is specified by the standard to have utf-8 encoding:

In that example, auto will be deduced as const char8_t (which has unsigned char as its underlying type; [basic.fundamental]p9).

So, what is the type of each code unit value in the sequence {0xE7, 0x89, 0x9B, 0xE9, 0x80, 0xBC }? If we didn't specify the type, how do we check the initialization for array object?

I don't see why a type is relevant. As Jens stated, these intermediate values are not observable. For the encodings specified by the standard (UTF-8, UTF-16, and UTF-32) the code unit values are guaranteed to be in the range of the element type of the character or string literal.

@jensmaurer
Member

Per [lex.string] table 12, the string literal u8"牛逼" has type "array of const char8_t", which we initialize with the code unit values you gave. We know the underlying type, so we know how the values end up.

What kind of checking do you envision for the array object? Note that we're not doing brace-initialization here, but initializing each element individually, so we don't apply narrowing checks.

@xmh0511
Author

xmh0511 commented Mar 31, 2023

which we initialize with the code unit values you gave. We know the underlying type, so we know how the values end up.

So why is it necessary to say

A code unit is an integer value of character type ([basic.fundamental]).

We could just as well say that a code unit is an integer value of an implementation-defined character type. In other words, is it the intent that the implementation can use code units of type char to initialize a character object of type char8_t?

For example:

u8"牛逼";

Can the array object of type array of N char8_t be initialized by the sequence (char)0xE7, (char)0x89, (char)0x9B, (char)0xE9, (char)0x80, (char)0xBC? Is it a conforming implementation?

@jensmaurer
Member

Yes, I think that's conforming.

You couldn't tell the difference even if char is signed (because conversion from/to signed/unsigned integer variants is simply a modulo 2^N operation).
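A small sketch of that modulo argument, assuming an 8-bit char (the function name `round_trips_through_signed_char` is invented for this example): every 8-bit code unit value survives a trip through signed char and back to unsigned char, so the intermediate type cannot be observed in the initialized array.

```cpp
// Returns true if every 8-bit code unit value is unchanged after being
// converted to signed char and back to unsigned char.
// Assumption: CHAR_BIT == 8. The signed conversion is modulo 2^8 since
// C++20 and implementation-defined (in practice the same) before that.
bool round_trips_through_signed_char() {
    for (int v = 0; v <= 255; ++v) {
        signed char s = static_cast<signed char>(v);        // e.g. 0xE7 becomes -25
        unsigned char back = static_cast<unsigned char>(s);  // and -25 becomes 0xE7 again
        if (back != v) {
            return false;
        }
    }
    return true;
}
```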

@xmh0511
Author

xmh0511 commented Mar 31, 2023

I think we should change [lex.charset] p8 to:

A code unit is an integer value of the implementation-defined chosen character type.

This makes sense here.

@frederick-vs-ja

frederick-vs-ja commented Mar 31, 2023

Perhaps we don't need to assign types (in the C++ type system) to code units, since mathematical integer values seem sufficient.

However, [lex.ccon] p3.1 currently says (emphasis mine):

A character-literal with a c-char-sequence consisting of a single basic-c-char, simple-escape-sequence, or universal-character-name is the code unit value of the specified character as encoded in the literal's associated character encoding.

which possibly implies that the code unit value in such case has the same type as the character literal.

@tahonermann

I think we should change [lex.charset] p8 to:

A code unit is an integer value of the implementation-defined chosen character type.

This makes sense here.

I disagree. The term "code unit" has broader applicability; it isn't intended to be used solely for the initialization of character and string literals. The initialized array continues to hold code unit values following the initialization.

@xmh0511
Author

xmh0511 commented Apr 1, 2023

The initialized array continues to hold code unit values following the initialization.

The initialized array just holds the integer values of the code units. Is there any conflict between the integer value of a code unit and its type? It is similar to how we can use an int object to hold a value of type char that represents a character in the basic character set. It does not change anything; we just use that object to hold the value, as long as the whole value representation can be held.

@tahonermann

The initialized array just holds the integer value of the code units.

Agreed.

Is there any conflict between the integer value of a code unit and its type?

I don't believe so.

It is similar to how we can use an int object to hold a value of type char that represents a character in the basic character set. It does not change anything; we just use that object to hold the value, as long as the whole value representation can be held.

Agreed, but there are associated semantics. If an int holds a distance, it is important to know if that distance is specified in SI or English units. Likewise, when considering code unit values, it is important to know for which encoding. 0xFF is a valid code unit value for ISO-8859-1 but not for UTF-8. Since the encoding is implementation-defined in the case of ordinary and wide character and string literals, the standard must defer to the implementation with regard to code unit values (or other encoding specific properties).
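To make the 0xFF point concrete, here is a hedged sketch of a check for whether a value can ever appear as a code unit in well-formed UTF-8. The function name `can_appear_in_utf8` is invented for this example, and the check looks at a single code unit in isolation rather than validating a whole sequence.

```cpp
// True if `unit` can occur somewhere in a well-formed UTF-8 sequence:
// 0x00..0x7F are single-unit characters, 0x80..0xBF are continuation
// units, and 0xC2..0xF4 are lead units. 0xC0, 0xC1 and 0xF5..0xFF
// never occur.
constexpr bool can_appear_in_utf8(unsigned unit) {
    return unit <= 0xBF || (unit >= 0xC2 && unit <= 0xF4);
}

static_assert(!can_appear_in_utf8(0xFF), "0xFF is never a UTF-8 code unit");
static_assert(can_appear_in_utf8(0xC3), "0xC3 is a valid lead unit");
```

In ISO-8859-1, by contrast, every value from 0x00 to 0xFF is a code unit, with 0xFF encoding U+00FF.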

@xmh0511
Author

xmh0511 commented Apr 3, 2023

Since the encoding is implementation-defined in the case of ordinary and wide character and string literals, the standard must defer to the implementation with regard to code unit values (or other encoding specific properties).

I just suggest that the type of the code unit is an implementation-chosen character type. Neither its value nor its type is specified by the standard.

@tahonermann

I just suggest that the type of the code unit is an implementation-chosen character type. Neither its value nor its type is specified by the standard.

I don't see how additional discussion of an unobservable type would make the specification more clear.

I scrolled back up to re-read the discussion so far. This issue started by quoting this wording:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

I wonder if there is a misunderstanding here. The intent of "cannot be encoded as a single code unit" has nothing to do with the range of the code unit type; it has to do with whether the encoding specifies that multiple code units are encoded for a given character (e.g., UTF-8 specifies that U+00E9 (é) is encoded as two code units (0xC3 0xA9)). Perhaps the following change would make this more clear.

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or is encoded with multiple code units.
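The "multiple code units" reading can be sketched as a count of code units per character. This is only an illustration: the helper names `utf8_units` and `utf16_units` are hypothetical, and the functions assume their argument is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
// Number of UTF-8 code units used to encode the scalar value cp.
constexpr int utf8_units(unsigned long cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of UTF-16 code units used to encode the scalar value cp.
constexpr int utf16_units(unsigned long cp) {
    return cp < 0x10000 ? 1 : 2;
}

// U+00E9 needs two code units in UTF-8 but only one in UTF-16.
static_assert(utf8_units(0xE9) == 2, "0xC3 0xA9");
static_assert(utf16_units(0xE9) == 1, "0x00E9");
```

Under this reading, a character literal such as u8'é' is non-encodable because `utf8_units(0xE9)` is 2, while u'é' is fine because `utf16_units(0xE9)` is 1, independent of how wide the element type happens to be.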

@xmh0511
Author

xmh0511 commented Apr 3, 2023

I wonder if there is a misunderstanding here. The intent of the "cannot be encoded as a single code unit" has nothing to do with the range of the code unit type;

I do think whether a character can be encoded in a single code unit is sensitive to the type. For example, U+00E9 (é) cannot be encoded as a single code unit in UTF-8, but it can in UTF-16 or UTF-32. That is why we specify that the types of the character literals for those two encodings are char16_t and char32_t: we have to guarantee that the value of the code unit is representable in an object of that type.

@tahonermann

It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present.

I'm not sure what would be helpful though; there is no observable distinction between encoding a (unsigned) sequence of {0xC3 0xA9} UTF-8 code units in a sequence of char when char is an 8-bit signed type and encoding {-0x3D, -0x57} and claiming an implementation-defined encoding that works just like UTF-8 except that code unit values above 0x7F are encoded in two's complement.

@xmh0511
Author

xmh0511 commented Apr 3, 2023

It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present.

Yes, that is what I meant, as well as what you suggested in the above comment (#285 (comment)).

@frederick-vs-ja

It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present.

I guess the type-related issue might be that we haven't forbidden pathological choices where

  • char is 8-bit, but
  • 256 is a valid code unit value in the ordinary literal encoding,

which would make a non-null character literal equal to '\0'.

Perhaps it would be better to specify that the range of valid code unit values of ordinary or wide literal encoding is not wider than the range of char or wchar_t respectively.
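The pathological case can be modelled with a few lines. This is a sketch under the assumption of an 8-bit char, and `store_in_8_bits` is an invented helper that mimics the modulo 2^8 conversion to an 8-bit element rather than any real implementation behaviour.

```cpp
// Models storing a code unit value into an 8-bit element: only the low
// eight bits survive (conversion modulo 2^8).
constexpr unsigned store_in_8_bits(unsigned unit) {
    return unit & 0xFFu;
}

// A hypothetical code unit value of 256 would collapse to 0 and become
// indistinguishable from '\0'.
static_assert(store_in_8_bits(256) == 0, "256 is congruent to 0 modulo 2^8");
// Values that already fit are unchanged.
static_assert(store_in_8_bits(0xE9) == 0xE9, "in-range values are preserved");
```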

@jensmaurer
Member

CWG2779

@jensmaurer jensmaurer changed the title [lex.ccon] What are the types of single code units of character-literal and string-literral? CWG2779 [lex.ccon] What are the types of single code units of character-literal and string-literral? Jul 30, 2023