UTF8String Constants #909

Open
tannergooding opened this Issue Sep 15, 2017 · 72 comments

Member

tannergooding commented Sep 15, 2017

Summary

Provide a general-purpose and safe way of declaring UTF8String constant values.

Motivation

CoreFX and CoreCLR are expected to get support for UTF8Strings somewhere in the near future (work is currently being done in https://github.com/dotnet/corefxlab, to my knowledge).

When this functionality RTMs, it will not be possible to create or declare a UTF8String in C# without first either getting the raw bytes from some external source (such as those returned by a file or network stream) or converting from a UTF16-based string.
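To make the limitation concrete, here is a minimal sketch of what a UTF8 "constant" looks like without language support. The `Greeting` class and its member names are hypothetical; only `Encoding.UTF8` is an existing API. The bytes are produced from a UTF-16 literal when the type is initialized, which is the run-time cost this proposal is trying to avoid.

```csharp
using System;
using System.Text;

public static class Greeting
{
    // Hypothetical example: the UTF-16 literal is transcoded when the type is
    // first initialized, and the result lives in a heap-allocated array.
    public static readonly byte[] HelloUtf8 = Encoding.UTF8.GetBytes("Hello, world");

    // Decodes the bytes back to a UTF-16 string, e.g. for display.
    public static string Decode() => Encoding.UTF8.GetString(HelloUtf8);
}
```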

Detailed Design

The design for this is very similar to the design for #688 in that there are basically two ways this could be supported today: bytearray literals or data declarations. There are many downsides to using bytearray literals, so this proposal only covers data declarations.

Overview (CIL)

This feature is outlined in II.16.3 Embedding data in a PE file.

CIL Grammar

DataDecl ::= [ DataLabel ‘=’ ] DdBody

DdBody ::= DdItem 
         | ‘{’ DdItemList ‘}’

DdItemList ::= DdItem [ ‘,’ DdItemList ]

DdItem ::= ‘&’ ‘(’ Id ‘)’
         | bytearray ‘(’ Bytes ‘)’
         | char ‘*’ ‘(’ QSTRING ‘)’
         | float32 [ ‘(’ Float64 ‘)’ ] [ ‘[’ Int32 ‘]’ ]
         | float64 [ ‘(’ Float64 ‘)’ ] [ ‘[’ Int32 ‘]’ ]
         | int8 [ ‘(’ Int32 ‘)’ ] [‘[’ Int32 ‘]’ ]
         | int16 [ ‘(’ Int32 ‘)’ ] [ ‘[’ Int32 ‘]’ ]
         | int32 [ ‘(’ Int32 ‘)’ ] [‘[’ Int32 ‘]’ ]
         | int64 [ ‘(’ Int64 ‘)’ ] [ ‘[’ Int32 ‘]’ ]

Accessing Data (IL)

Accessing the data is then defined as:

The data stored in a PE File using the .data directive can be accessed through a static variable, either global or a member of a type, declared at a particular position of the data:

FieldDecl ::= FieldAttr* Type Id at DataLabel

The data is then accessed by a program as it would access any other static variable, using instructions such as ldsfld, ldsflda, and so on.

The ability to access data from within the PE File can be subject to platform-specific rules, typically related to section access permissions within the PE File format itself.
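As a rough C#-level analogue of reading embedded data through a static member, the sketch below exposes constant bytes as a `ReadOnlySpan<byte>` property. The `Utf8Constants` class is hypothetical. It assumes a compiler at C# 7.3 or later, where a property of this shape over a constant byte array can be compiled to reference data stored in the assembly instead of allocating an array on each call; on older compilers the same code still works but allocates.

```csharp
using System;

public static class Utf8Constants
{
    // "Hello" spelled out as UTF-8 bytes. Assumption: with C# 7.3 or later the
    // compiler may return a span over data embedded in the assembly here.
    public static ReadOnlySpan<byte> Hello => new byte[] { 0x48, 0x65, 0x6C, 0x6C, 0x6F };
}
```

This is only an illustration of the access pattern; it is not the data-declaration syntax proposed in this issue.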

Overview (C#)

A new keyword should likely be provided: utf8string.

It should behave in a manner similar to the string type:

  • represent a sequence of zero or more UTF8 characters,
  • be an alias for the System.UTF8String type
  • should be considered immutable

Drawbacks

This would be the preferred mechanism, but comes with the caveat that data declarations don't appear to be supported by any of the major languages today, and may not have extensive testing in the runtime proper.

Alternatives

The IL metadata format currently supports declaring bytearray literals. This functions just as other literals do in that the runtime does not actually do anything with the metadata and it is instead only read and consumed by a higher-level compiler (such as the C# compiler).

The issue with this approach is that the data is not considered directly accessible and would still incur runtime cost to initialize the data before having it passed around.

So, while it would allow users to declare UTF8String literals, they would be barely any more performant than what we have today.

Contributor

Unknown6656 commented Sep 15, 2017

I would prefer the keyword wstring to utf8string to keep some similarity to C++ and to save a few characters when typing.

Also:
#184
#789 (https://github.com/dotnet/csharplang/blob/master/meetings/2017/LDM-2017-07-05.md#utf8-strings)

qrli commented Sep 15, 2017

Or, the compiler could transparently convert the string literal to utf8 when the target is System.UTF8String.

System.UTF8String u8str = "blabla";

So we don't need any new syntax.
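A minimal sketch of how such an assignment can already be approximated with a user-defined implicit conversion. `Utf8Text` is a hypothetical stand-in, not the corefxlab Utf8String type. Without compiler support the conversion transcodes at run time, so it gives the syntax but not the constant-data benefit.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for System.UTF8String; not the corefxlab type.
public readonly struct Utf8Text
{
    private readonly byte[] _bytes;

    private Utf8Text(byte[] bytes) => _bytes = bytes;

    // Number of UTF-8 code units (bytes) held by this value.
    public int ByteLength => _bytes?.Length ?? 0;

    // Lets a UTF-16 string (including a literal) be assigned directly.
    // This transcodes at run time on every conversion.
    public static implicit operator Utf8Text(string value) =>
        new Utf8Text(Encoding.UTF8.GetBytes(value));

    public override string ToString() =>
        _bytes is null ? string.Empty : Encoding.UTF8.GetString(_bytes);
}
```

With this in place, `Utf8Text u8str = "blabla";` compiles without any new syntax.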

Contributor

yaakov-h commented Sep 15, 2017

@Unknown6656 isn't wstring in C++ UTF-16, not UTF-8? This would be backwards...

Mafii commented Sep 15, 2017

I would prefer not to join the chaos of C++ in terms of strings. I'd heavily favor something new (e.g. utf8string) over an already known keyword (like wstring) that might be misunderstood, misinterpreted or create confusion.

For confusion, see: https://stackoverflow.com/a/402918/5962841

Member

sharwell commented Sep 15, 2017

I would prefer the keyword wstring to utf8string to keep some similarity to C++ and to save a few characters when typing.

@Unknown6656 std::wstring from C++ is already string in C#. UTF-8 strings in C++ would be declared as std::string and you could initialize them with u8"Literal text".

Contributor

eyalsk commented Sep 15, 2017

Not sure how people feel about it but maybe u8string as opposed to utf8string? slightly less verbose but I wouldn't mind either way, really. :)

In favor of utf8 as opposed to utf8string.

YaakovDavis commented Sep 15, 2017

u8string

Thought about this as well, but decided not to propose, as I prefer clarity over terseness :)

Contributor

eyalsk commented Sep 15, 2017

@YaakovDavis Yeah me too but I'm not sure whether clarity is an issue here, maybe it is. :D

p.s. utf8 can work too.

If the type is System.UTF8String and the proposed keyword is utf8string I can't really see the point of having a keyword for it but I do for utf8 as it's slightly terser and yet quite clear, even though we have System.String and string and there are more examples for this kind of mapping but I think it was done for "completeness" as a common, built-in types and not really because it was needed so yeah.. torn.

Member

tannergooding commented Sep 15, 2017

I just proposed utf8string as it was "simple" and "clear".

I like utf8 as well, but wonder if it might be confused for something else (not a string), or if it might be used as a local name somewhere already (at least it seems more likely/common as an existing local name than utf8string).

Contributor

eyalsk commented Sep 15, 2017

@tannergooding Yeah maybe, it's a minor thing, I wouldn't mind it either way but if I had to choose and utf8 is fine then I'd go with it. :)

YaakovDavis commented Sep 15, 2017

@eyalsk

If the type is System.UTF8String and the proposed keyword is utf8string I can't really see the point of having a keyword for it

Assuming you refer to the name length, the same can be said about System.String & string, or Double & double, yet we still have a keyword.

Contributor

eyalsk commented Sep 15, 2017

@YaakovDavis

If the type is System.UTF8String and the proposed keyword is utf8string I can't really see the point of having a keyword for it

I'll cite myself:

even though we have System.String and string and there are more examples for this kind of mapping but I think it was done for "completeness" as a common, built-in types and not really because it was needed so yeah.. torn.

Another option is string8.

This is anything but clear. :)

Member

tannergooding commented Sep 15, 2017

@YaakovDavis, I had briefly considered string8 as well, but decided against it. Both because it could be potentially ambiguous and because it might be used as the name of a local already.

YaakovDavis commented Sep 15, 2017

@tannergooding
Yeah, I don't like it either :)

Member

tannergooding commented Sep 15, 2017

even though we have System.String and string and there are more examples for this kind of mapping but I think it was done for "completeness" as a common, built-in types and not really because it was needed so yeah.. torn.

@eyalsk, Pushing the shift key slows down my code 😆

Contributor

jnm2 commented Sep 15, 2017

Pushing the shift key slows down my code 😆

😃 There's also that keywords are highlighted differently than struct names.

AtsushiKan commented Sep 15, 2017

I vote for utf8 over utf8string. I'd also vote for no keyword at all over a keyword that's just the same as the BCL type name with the wrong casing. Brevity would be useful for a preferred and common type and, on principle, if the keyword doesn't provide brevity or at least commonality with a prior C-based language, I don't want a keyword at all. I hate with a passion those string and object keywords (and the fact that certain repos try to force them on me with their coding standards). They add absolutely no value and make every program written with them look like a serial violator of the ".NET Type Names use Pascal casing" pattern.

bondsbw commented Sep 15, 2017

@sharwell

you could initialize them with u8"Literal text".

👍 if this is supported, due to var and let.

We might also need @u8"Literal text" and $u8"Literal text".

(or u8@"Literal text" and u8$"Literal text"?)

Contributor

jnm2 commented Sep 15, 2017

So @AtsushiKan hates string, object, char, byte, double, and decimal. 😄

Fair enough. We also have existing primitive types with no keyword: IntPtr and UIntPtr.

Member

tannergooding commented Sep 15, 2017

FYI. @jaredpar.

rikimaru0345 commented Sep 16, 2017

I vote for 1. no keyword if possible
and 2. utf8 if we really need one. / u8 literal prefix.

I really dislike "utf8string"!

ufcpp commented Sep 16, 2017

As far as I can tell from experimenting with the current implementation in corefxlab, Utf8String is not so easy to use because it's a ref (stack-only) struct. So it might be impossible to line it up with the predefined types.

Contributor

jnm2 commented Sep 16, 2017

Why would it be stack only? That's limiting.

ufcpp commented Sep 17, 2017

Member

tannergooding commented Sep 17, 2017

Even if it is stack only, having a way to declare string constants without having to manually copy/initialize it at runtime is beneficial (especially for interop code).

Although I do hope it eventually gets built-in runtime support, like System.String has.

ufcpp commented Sep 17, 2017

@tannergooding
I want utf8 constants too. However I'd like user-defined constants/literals for arbitrary types rather than "predefined" utf8string type. As well as this proposal, many types need constants:

Member

tannergooding commented Sep 17, 2017

Member

tannergooding commented Sep 17, 2017

It's a very similar proposal to this. However, utf8string constants are slightly less problematic due to not needing to worry about endianness conversions.

#688 via data declarations would allow structured constants to be declared, and I would hope it would work with big endian as well via the runtime, but I'm not certain.

markrendle commented Dec 1, 2017

Could just have a compiler switch to specify what type string constants and literals should be. Pretty much any application I build with .NET Core is going to be working with UTF8, so why would I want String literals at all?

Exemplar.csproj:

<PropertyGroup>
  <TargetFramework>netcoreapp2.0</TargetFramework>
  <LangVersion>latest</LangVersion>
  <LiteralStringEncoding>utf8</LiteralStringEncoding>
</PropertyGroup>

Pzixel commented Dec 1, 2017

@markrendle I think the entire .net framework, from the deepest C++ CLR code up to fancy C# code, expects that string is always UTF16, as the spec says. I don't believe we can afford such a change, and it would be the most breaking change ever.

markrendle commented Dec 1, 2017

@Pzixel I'm not suggesting changing the internal representation of the existing String type. The compiler switch, which would default to utf16 to prevent a breaking change, would simply change the type of string literals to Utf8String instead of String. The string keyword would still point to the existing UTF-16 System.String type.

KrzysztofCwalina referenced this issue Apr 24, 2018: Productize Utf8String #2165 (Open, 0 of 4 tasks complete)

ghost commented May 2, 2018

There is also an ongoing discussion about UTF8 character (proposal is to call it rune): dotnet/corefx#24093.

Perhaps the name can be considered in light of:

char is to string as rune is to utf8string?

If compiler flag option is to be entertained without introducing a new language syntax: treat "String" and "literal string" as Utf8String, then the same can be extended to treat "Char" and "literal char" as Rune / Utf8Char?

Pzixel commented May 2, 2018

@kasper3 UTF8 is very different from UTF16. For example, it's quite common to write something like

const string s = "Hello world";
char lastChar = s[s.Length - 1];

You cannot do that with UTF8 because you don't have O(1) access to chars. As a result your experience is much different, because common tools like Substring, IndexOf and so on stop working or work very inefficiently. You can see the API provided by languages with native UTF8 support (link). It differs a lot because all strings are just a bunch of bytes that you have to interpret, and you don't know whether s[100] is a valid codepoint or you're somewhere in the middle of a char.
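A minimal sketch of what "somewhere in the middle of the char" means at the byte level. The `Utf8Scan` helper is hypothetical and assumes well-formed UTF-8 input. Continuation bytes always have the bit pattern 10xxxxxx, so an arbitrary index has to be checked, and finding the last code point takes a short backward scan rather than a single `s[s.Length - 1]`.

```csharp
using System;
using System.Text;

public static class Utf8Scan
{
    // In UTF-8, continuation bytes always look like 10xxxxxx.
    public static bool IsContinuationByte(byte b) => (b & 0xC0) == 0x80;

    // Returns the index at which the last code point starts, or -1 if empty.
    // Assumes the input is well-formed UTF-8.
    public static int IndexOfLastCodePoint(ReadOnlySpan<byte> utf8)
    {
        int i = utf8.Length - 1;
        while (i > 0 && IsContinuationByte(utf8[i]))
        {
            i--;
        }
        return i;
    }
}
```

For example, `"h\u00E9llo\u20AC"` encodes to 9 bytes: index 2 is the second byte of U+00E9, and the final code point starts at index 6, not at index 8.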

ufcpp commented May 2, 2018

@kasper3

from dotnet/corefx#24093 (comment) :

C# keyword       Ugly Long form       Size
ubyte       <=>  System.CodeUnit      8 bit - Assumed Utf8 in absence of encoding param
uchar       <=>  System.CodePoint     32 bit

Contributor

svick commented May 2, 2018

@kasper3 The proposed rune type is not the 8-bit UTF8 code unit, it's the 32-bit Unicode code point (which is equivalent to UTF32 code unit). So it doesn't have anything to do with UTF8 strings.
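A minimal sketch of the three units being distinguished here, using only existing BCL calls. The `CodeUnits` class is a hypothetical wrapper. One code point can take one to four UTF-8 code units and one or two UTF-16 code units, while the code point itself fits in 32 bits.

```csharp
using System;
using System.Text;

public static class CodeUnits
{
    // Number of 8-bit UTF-8 code units needed for the text.
    public static int Utf8Count(string text) => Encoding.UTF8.GetByteCount(text);

    // Number of 16-bit UTF-16 code units; this is what string.Length reports.
    public static int Utf16Count(string text) => text.Length;

    // The 32-bit code point that starts at the given UTF-16 index.
    public static int CodePointAt(string text, int index) => char.ConvertToUtf32(text, index);
}
```

For U+1F600 (written `"\U0001F600"` in C#) this gives 4 UTF-8 code units, 2 UTF-16 code units, and the single code point 0x1F600.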

ghost commented May 2, 2018

@svick, you are right. I read similar-looking proposals and mixed them up. The Utf8Char one is at dotnet/corefxlab#1799.

Member

MichalStrehovsky commented May 24, 2018

.field public static uint8 MyData at _MyData

If we're going to use RVA static fields to hold the data, can we make sure there's a mechanism by which IL rewriters (or AoT compilers) can determine how big the data is so they can copy it? E.g. the .NET Native compiler would only copy a single byte for your example since it has no way of knowing how long the data actually is.

This could either be achieved by giving the field a type with the required size (would work out of the box in most places), or e.g. through a custom attribute that marks this as a UTF8 string (so they can interpret it).
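A minimal sketch of the first option, a field type that carries the size. `DataBlock4` and `DataBlockInfo` are hypothetical names, and the sketch assumes a runtime where `System.Runtime.CompilerServices.Unsafe` is available (for example .NET Core 3.0 or later). The struct has no fields, only a declared size, so a tool that sees a field of this type can tell how many bytes belong to it.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical 4-byte blob type: it has no fields, only a declared size.
[StructLayout(LayoutKind.Explicit, Size = 4)]
public struct DataBlock4
{
}

public static class DataBlockInfo
{
    // Size in bytes that the runtime assigns to the blob type.
    public static int Size => Unsafe.SizeOf<DataBlock4>();
}
```

As far as I know, the C# compiler already emits size-only structs of this kind for array initializers, which is why this option "would work out of the box in most places".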

Member

tannergooding commented May 24, 2018

@MichalStrehovsky, the tooling would need to know that it needs to follow the data constant pointer (indicated by the at _MyData) and that it would need to inspect that data.

Member

MichalStrehovsky commented May 25, 2018

@MichalStrehovsky, the tooling would need to know that it needs to follow the data constant pointer (indicated by the at _MyData) and that it would need to inspect that data.

Right. But the size of the data is not encoded in the format. The field will be an RVA static field (as per II.22.18 of the ECMA-335 spec). Only the offset to the beginning of the data is encoded in the format. An IL rewriter (or AOT compiler) would have no way of knowing how much data to copy. If they assumed uint8 is the size of the data from your sample (because the field's type is uint8), they would cut off the remaining 3 bytes.

Member

davidwrighton commented Jun 2, 2018

How about, instead of attempting to use a data declaration (that will expose substantial problems for tools such as IL rewriters), we continue to use the ldstr instruction and simply store utf8 data in the utf16 string pool? Be aware that the utf16 string pool isn't actually defined to hold utf16 strings; it's designed to hold counted-length streams of data that are 2-byte aligned.

For instance, the code sequence for generating a utf8 string could be as simple as...

ldstr ""
call System.Utf8String System.Runtime.CompilerServices.Utf8StringServices.FromUtf8LiteralString(string)

The encoding of the utf8 string data in this case can be something like
HeaderByte, rest of bytes

Pseudo-code for FromUtf8LiteralString function

Calculate the length of the stored utf8string by performing the following calculation
string sUtf8Encoded = "";
// The conditional must be parenthesized; otherwise it applies to the whole sum.
int utf8Length = sUtf8Encoded.Length * 2 - 2 + (((byte)sUtf8Encoded[0] != 0) ? 1 : 0);
fixed (char* pData = sUtf8Encoded)
{
    // Skip the header byte; the utf8 payload starts right after it.
    byte* pUtf8Data = ((byte*)pData) + 1;
    ReadOnlySpan<byte> utf8Data = new ReadOnlySpan<byte>(pUtf8Data, utf8Length);

    return new Utf8String(utf8Data);
}

The above logic isn't endian safe, but should show the general concept. This logic wouldn't actually run at execution time, as we would just have the equivalent logic to the above be implemented such that it ran at approximately jit time via a jit intrinsic, and have it stuff the decoded utf8 literal into an intern table just like we do with normal utf8 strings.

Alternatively, following the same model, one could use ldstr on an actual utf16 string, and call such a jit intrinsic, but that implies 2 things.

  1. That we wouldn't be able to represent invalid utf8 strings.
  2. That we certainly couldn't make a fairly efficient implementation that doesn't require a jit intrinsic.
@tannergooding

This comment has been minimized.

Show comment
Hide comment
@tannergooding

tannergooding Jun 2, 2018

Member

Be aware that the utf16 string pool isn't actually defined to hold utf16 strings, its designed to hold counted length streams of data that are 2 byte aligned.

@davidwrighton, I don't think the spec agrees with you:
image

tannergooding Jun 2, 2018

Member

How about instead of attempting to use a data declaration (that will expose substantial problems for tools such as IL rewriters

I'm also not sure that this is a "substantial" issue.

Almost every version of the C# compiler has added new features, new code patterns, new attributes to recognize, etc.

Adding support for yet another data type/attribute combination would be nothing new and would easily work without having to modify the runtime spec and without needing to get crazy creative with a bunch of stuff.

It can really be as simple as:

.field public static initonly uint8 MyData at _MyData
    .custom instance void Utf8StringConstantAttribute::.ctor() = (
        01 00 00 00
    )

Pzixel Jun 2, 2018

@davidwrighton

Be aware that the utf16 string pool isn't actually defined to hold utf16 strings, its designed to hold counted length streams of data that are 2 byte aligned.

Okay, here is a counter-example.

This UTF-8 string is not 2 byte aligned thus cannot be stored there. Any UTF-8 char from ASCII range is not 2 byte aligned. QED.

int utf8Length = sUtf8Encoded.Length * 2 - 2 + ((byte)sUtf8Encoded[0] != 0) ? 1 : 0;

A null terminator is a valid character, so why don't you allow it in the string? Because you need some filler character for strings that aren't 2 byte aligned? Well, it won't fly this way.

davidwrighton Jun 5, 2018

Member

@Pzixel I believe I wasn't clear with my encoding, the intent is to allow any number of bytes to be used in the utf8 string, null is certainly valid embedded in the middle.

@tannergooding the encoding is a very important detail. If we don't pick something small and simple to decode, we will be penalizing the startup performance of applications substantially. For instance, the representation as an initonly field and an attribute will imply 6-10 bytes for a Field record in metadata, 6-12 bytes in the CustomAttributeTable, at least 5-6 bytes for your MyData string (which gives a name to a utf8 string of arbitrary length and is likely to turn into some sort of guid or index), and 4-N bytes of UTF8 data. Decoding it at IL time will require fairly complex logic in the JIT and VM that will slow down jitting, or it will require fairly expensive runtime logic at each ldstr equivalent.

While the utf16 string pool seems to indicate that it should only hold UTF16 data, that's actually a "should" relationship. Unlike the strings heap, which is required to contain valid utf8 strings, the User String heap is not documented to contain valid utf16 data, and in fact we have quite a few tests that validate behavior outside of containing valid utf16 data.

There certainly are other approaches, but minimizing the cost of things as basic as strings is generally pretty important as they tend to be ubiquitous.

Pzixel Jun 5, 2018

@davidwrighton

@Pzixel I believe I wasn't clear with my encoding, the intent is to allow any number of bytes to be used in the utf8 string, null is certainly valid embedded in the middle.

Then if I save a single-char string, e.g. "\n", it violates this rule:

its designed to hold counted length streams of data that are 2 byte aligned.

This string is not 2 byte aligned.

jaredpar Jun 5, 2018

Member

@Pzixel

This string is not 2 byte aligned.

This can be worked around by adding a padding byte in the case of odd length strings. @davidwrighton and I sketched out a basic scheme here we feel will work just fine.

tannergooding Jun 5, 2018

Member

While the utf16 string pool seems to indicate that it should only hold UTF16 data, that's actually a should relationship, unlike the strings heap which is required to contain valid utf8 strings, the User String heap is not documented to contain valid utf16 data, and in fact we have quite a few tests that validate behavior outside of containing valid utf16 data.

Do Mono, CoreRT, and the other runtimes also validate and work correctly with this assumption?

Also wondering, how will you differentiate a UTF8 string from a UTF16 string in the heap?

Pzixel Jun 5, 2018

@jaredpar

This can be worked around by adding a padding byte in the case of odd length strings. @davidwrighton and I sketched out a basic scheme here we feel will work just fine.

Yes, you can, but now how could you know if it's "\n" with padding zero byte or "\n\0" string without padding?

jaredpar Jun 5, 2018

Member

@Pzixel

Yes, you can, but now how could you know if it's "\n" with padding zero byte or "\n\0" string without padding?

You can use a prefix encoding scheme. At worst you end up reserving the first 1-2 bytes for tracking whether it's an odd length string. In the case of odd length strings you pay one byte; in even length strings you pay two bytes. @davidwrighton and I thought we could get it down smaller but didn't have time to dig into the details.

@tannergooding

Do Mono, CoreRT, and the other runtimes also validate and work correctly with this assumption?

This doesn't give me much pause. If they don't then we'd need to do the work to get them to support it.
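
To make the "\n" versus "\n\0" question concrete, here is a minimal sketch of one possible prefix scheme. The `Utf8LiteralPacker` class, the `Pack` method, and the exact header values (1 for odd length, 0 for even) are assumptions for illustration only; the thread does not fix a format.

```csharp
using System;

static class Utf8LiteralPacker
{
    // Hypothetical encoder: header byte (1 = odd data length, 0 = even),
    // then the utf8 data, then one zero pad byte when the data length is even.
    public static string Pack(byte[] utf8)
    {
        bool odd = (utf8.Length & 1) == 1;
        byte[] raw = new byte[1 + utf8.Length + (odd ? 0 : 1)];
        raw[0] = (byte)(odd ? 1 : 0);
        Array.Copy(utf8, 0, raw, 1, utf8.Length);

        // Combine byte pairs into 16-bit code units, low byte first.
        char[] units = new char[raw.Length / 2];
        for (int i = 0; i < units.Length; i++)
        {
            units[i] = (char)(raw[2 * i] | (raw[2 * i + 1] << 8));
        }
        return new string(units);
    }
}
```

Under these assumptions, "\n" packs into a single code unit (header plus one data byte), while "\n\0" packs into two code units (header, two data bytes, one pad byte), so the two cases stay distinguishable. The overhead is one byte for odd lengths and two bytes for even lengths, matching the cost described above.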

Pzixel Jun 5, 2018

@jaredpar So it's basically not a string but rather some special struct with paddings/oddness info/... Why then store it in string section? There are sections for raw bytes, so why not just use them?

jaredpar Jun 5, 2018

Member

@Pzixel

So it's basically not a string but rather some special struct with paddings/oddness info/.

Nope. It's still very much a string. It just potentially changes the offset at which the string begins.

Why then store it in string section?

@davidwrighton has laid out a good case for this already. It has lots of advantages for the runtime and supporting languages.

davidwrighton Jun 5, 2018

Member

@tannergooding CoreClr, desktop CLR, Mono, and CoreRT will all accept strings that look like this. While they aren't actually valid UTF-16 strings, they are valid sequences of UTF16 16 bit code units (which is a very low bar, since a utf16 code unit is any 16 bit number), and that's what actually matters to the runtimes. Additionally, I've checked the logic of tools such as ildasm, ilasm, and ilspy, and they all handle these weird cases as well.
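
The "any 16 bit number" point can be checked at the level of in-memory strings with a small sketch. The `CodeUnitDemo` class is a hypothetical name, and this only exercises `System.String` at run time; it says nothing about how metadata readers or IL tools treat the User String heap.

```csharp
using System;

static class CodeUnitDemo
{
    // Returns true when every 16-bit value survives being stored in a System.String.
    public static bool RoundTripsAllCodeUnits()
    {
        for (int value = 0; value <= 0xFFFF; value++)
        {
            // The range includes lone surrogates (0xD800-0xDFFF),
            // which are not valid UTF-16 on their own.
            string s = new string((char)value, 1);
            if (s.Length != 1 || s[0] != (char)value)
            {
                return false;
            }
        }
        return true;
    }
}
```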

mjsabby Jun 17, 2018

Member

The main argument I can think of against the ldstr approach is that it will require a JIT intrinsic to be efficient, hence a new runtime and therefore delay adoption of such a feature. Perf conscious library authors that want to target .NET Core 2.1 which is (or soon will be) LTS may avoid it.

On the other hand, the C# compiler already supports embedding a data declaration and the necessary optimization to not allocate a compile time constant byte array on the heap via dotnet/roslyn#24621 and expose it as a ReadOnlySpan<byte>.

The compromise I'm thinking of is if we can have the C# compiler support,

ReadOnlySpan<byte> myUtf8String = "Foo";

because there is precedent and a cumbersome way to achieve it,

ReadOnlySpan<byte> myUtf8String = new byte[] { 0x46, 0x6F, 0x6F };

This would make the task of defining these data declarations less onerous for the perf-conscious developer who would like to target the existing runtime and still derive benefit from a newer C# compiler.
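
For reference, the cumbersome form that works today can be wrapped as follows. This is a minimal sketch: the `Utf8Constants` class and `MatchesUtf8` helper are hypothetical names, and whether the array allocation is actually elided depends on the compiler version in use.

```csharp
using System;
using System.Text;

static class Utf8Constants
{
    // A constant byte array converted directly to ReadOnlySpan<byte>.
    // Compilers with the optimization mentioned above can emit this as
    // embedded data instead of allocating an array on each access.
    public static ReadOnlySpan<byte> Foo => new byte[] { 0x46, 0x6F, 0x6F };

    // Compares the constant against the utf8 encoding of a regular string.
    public static bool MatchesUtf8(string text) =>
        Foo.SequenceEqual(new ReadOnlySpan<byte>(Encoding.UTF8.GetBytes(text)));
}
```

The hex bytes have to be kept in sync with the intended text by hand, which is the burden a `ReadOnlySpan<byte> myUtf8String = "Foo";` conversion would remove.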

jpierson Aug 22, 2018

I would prefer to have the ability to specify custom literals as a potential compile time feature for better extensibility.

Related proposals:
